nips nips2002 nips2002-46 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Saharon Rosset, Eran Segal
Abstract: Several authors have suggested viewing boosting as a gradient descent search for a good fit in function space. We apply gradient-based boosting methodology to the unsupervised learning problem of density estimation. We show convergence properties of the algorithm and prove that a strength of weak learnability property applies to this problem as well. We illustrate the potential of this approach through experiments with boosting Bayesian networks to learn density models.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: Several authors have suggested viewing boosting as a gradient descent search for a good fit in function space. [sent-5, score-0.828]
2 We apply gradient-based boosting methodology to the unsupervised learning problem of density estimation. [sent-6, score-0.846]
3 We show convergence properties of the algorithm and prove that a strength of weak learnability property applies to this problem as well. [sent-7, score-0.607]
4 We illustrate the potential of this approach through experiments with boosting Bayesian networks to learn density models. [sent-8, score-0.854]
5 Given data $x_1,\dots,x_n$, a basis (or dictionary) of weak learners $\mathcal{F}$, and a loss function $C$, a boosting algorithm sequentially finds models $f_t \in \mathcal{F}$ and constants $\alpha_t$ to minimize $C(\sum_t \alpha_t f_t)$. [sent-10, score-1.447]
6 AdaBoost [6], the original boosting algorithm, was specifically devised for the task of classification, where the data are labelled examples $(x_i, y_i)$ with $y_i \in \{-1,+1\}$. [sent-11, score-0.701]
7 AdaBoost sequentially fits weak learners on re-weighted versions of the data, where the weights are determined according to the performance of the model so far, emphasizing the more “challenging” examples. [sent-12, score-0.659]
8 Its inventors attribute its success to the “boosting” effect which the linear combination of weak learners achieves, when compared to their individual performance. [sent-13, score-0.597]
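To make the re-weighting scheme described above concrete, here is a minimal sketch of standard AdaBoost for classification, not of this paper's density variant; `fit_weak` and the other names are illustrative placeholders rather than anything from the paper.

```python
import numpy as np

def adaboost(X, y, fit_weak, T=50):
    """Minimal AdaBoost sketch: y is a numpy array in {-1, +1} and
    fit_weak(X, y, w) returns a weak classifier h with h(X) in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                 # uniform example weights
    ensemble = []                           # list of (alpha, h) pairs
    for _ in range(T):
        h = fit_weak(X, y, w)
        pred = h(X)
        err = np.sum(w[pred != y])          # weighted training error
        if err >= 0.5:                      # no weak learnability left
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        w *= np.exp(-alpha * y * pred)      # emphasize misclassified examples
        w /= w.sum()
        ensemble.append((alpha, h))
    return lambda Xnew: np.sign(sum(a * h(Xnew) for a, h in ensemble))
```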
9 " © ¢ 8 £ 97¡ It has been shown [8, 13] that AdaBoost can be described as a gradient descent algorithm, where the weights in each step of the algorithm correspond to the gradient of an exponential loss function at the “current” fit. [sent-18, score-0.325]
10 This view of boosting as gradient descent has allowed several authors [7, 13, 21] to suggest “gradient boosting machines” which apply to a wider class of supervised learning problems and loss functions than the original AdaBoost. [sent-20, score-1.586]
11 In this paper we apply gradient boosting methodology to the unsupervised learning problem of density estimation, using the negative log-likelihood loss criterion $C(F) = -\sum_i \log F(x_i)$. [sent-22, score-0.982]
12 The density estimation problem has been studied extensively in many contexts using various parametric and non-parametric approaches [2, 5]. [sent-23, score-0.162]
13 A particular framework which has recently gained much popularity is that of Bayesian networks [11], whose main strength stems from their graphical representation, allowing for highly interpretable models. [sent-24, score-0.203]
14 More recently, researchers have developed methods for learning Bayesian networks from data, including learning in the context of incomplete data. [sent-25, score-0.087]
15 We use Bayesian networks as our choice of weak learners, combining the models using the boosting methodology. [sent-26, score-1.141]
16 We note that several researchers have considered learning weighted mixtures of networks [14], or ensembles of Bayesian networks combined by model averaging [9, 20]. [sent-27, score-0.193]
17 We describe a generic density estimation boosting algorithm, following the approach of [13]. [sent-28, score-0.914]
18 The main idea is to identify, at each boosting iteration, the basis function which gives the largest “local” improvement in the loss at the current fit. [sent-29, score-0.854]
19 Intuitively, this basis function assigns higher probability to instances that received low probability from the current model. [sent-30, score-0.059]
20 A line search is then used to find an appropriate coefficient for the newly selected function, and the function is then added to the current model. [sent-31, score-0.087]
21 We provide a theoretical analysis of our density estimation boosting algorithm, showing an explicit condition which, if satisfied, guarantees that adding a weak learner to the model improves the training set loss. [sent-33, score-1.519]
22 We also prove a “strength of weak learnability” theorem which gives lower bounds on overall training loss improvement as a function of the individual weak learners’ performance on re-weighted versions of the training data. [sent-34, score-1.071]
23 We also show that our theoretical criterion for a weak learner to improve the overall model applies well in practice. [sent-36, score-0.611]
24 2 A density estimation boosting algorithm: At each step $t$ in a boosting algorithm, the model built so far is $F_{t-1} = \sum_{s<t} \alpha_s f_s$. [sent-37, score-1.651]
25 The first-order change in the loss from adding a small increment $\epsilon f$ to the current model is $\epsilon \sum_i \frac{\partial C}{\partial F_{t-1}(x_i)}\, f(x_i)$, which in the case of negative log-likelihood loss can be written as $-\epsilon \sum_i \frac{f(x_i)}{F_{t-1}(x_i)}$. [sent-39, score-0.082]
26 $" G £ ¨ Since is small, we can ignore the second order term and choose the next boosting step to maximize . [sent-41, score-0.776]
27 We are thus finding the first order optimal weak learner, which gives the “steepest descent” in the loss at the current model predictions. [sent-42, score-0.523]
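A small sketch of this first-order selection criterion under the negative log-likelihood loss, using the notation reconstructed above; the helper names and the clipping constant are illustrative assumptions, not code from the paper.

```python
import numpy as np

def gradient_weights(F_current, X):
    """Re-weighting for the next boosting step under negative log-likelihood:
    w_i is proportional to 1 / F_{t-1}(x_i), so poorly explained instances
    are emphasized (illustrative sketch only)."""
    p = np.array([F_current(x) for x in X])
    w = 1.0 / np.maximum(p, 1e-300)
    return w / w.sum()

def first_order_score(f_candidate, F_current, X):
    """First-order improvement score sum_i f(x_i) / F_{t-1}(x_i),
    used to compare candidate weak learners."""
    return sum(f_candidate(x) / max(F_current(x), 1e-300) for x in X)
```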
28 When we consider applying this methodology to density estimation, where the basis $\mathcal{F}$ is composed of probability distributions and the overall model is a probability distribution as well, we cannot simply augment the model, since the sum would no longer be a probability distribution; instead, the new weak learner is mixed into the current model with a coefficient, so that the updated model $F_t = (1-\alpha_t)F_{t-1} + \alpha_t f_t$ remains a distribution. [sent-46, score-0.194]
29 It is easy to see that the first order theory of gradient boosting and the line search solution apply to this formulation as well. [sent-48, score-0.803]
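Assuming the convex-combination update $F_t = (1-\alpha)F_{t-1} + \alpha f_t$ reconstructed above, the line search can be sketched as a simple grid maximization of the training log-likelihood; the paper's exact line-search procedure may differ.

```python
import numpy as np

def line_search_alpha(F_current, f_new, X, grid=None):
    """Pick the mixture coefficient for F_t = (1 - alpha) F_{t-1} + alpha f_t
    by maximizing training log-likelihood over a coarse grid (sketch only)."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 101)
    def loglik(alpha):
        return sum(np.log((1 - alpha) * F_current(x) + alpha * f_new(x) + 1e-300)
                   for x in X)
    return max(grid, key=loglik)
```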
30 If at some stage $t$, the current model $F_t$ cannot be improved by adding any of the weak learners as above, the algorithm terminates, and we have reached a global minimum. [sent-49, score-0.719]
31 This can only happen if the derivative of the loss at the current model, with respect to the coefficient of each weak learner, is non-negative, which gives $\sum_i \frac{f(x_i)}{F_t(x_i)} \le n$ for every $f \in \mathcal{F}$, where $n$ is the number of training instances. [sent-50, score-0.702]
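One way to recover this threshold, assuming the convex-combination update and the negative log-likelihood loss as reconstructed above (a derivation added here for clarity, not quoted from the paper):

```latex
\[
\left.\frac{d}{d\alpha}\left[-\sum_{i=1}^{n}\log\big((1-\alpha)F_t(x_i)+\alpha f(x_i)\big)\right]\right|_{\alpha=0}
= \sum_{i=1}^{n}\frac{F_t(x_i)-f(x_i)}{F_t(x_i)}
= n-\sum_{i=1}^{n}\frac{f(x_i)}{F_t(x_i)} .
\]
```

The derivative is non-negative for every weak learner exactly when $\sum_i f(x_i)/F_t(x_i) \le n$.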
32 Thus, the algorithm terminates if no weak learner $f$ satisfies $\sum_i f(x_i)/F_t(x_i) > n$ (see Section 3 for proof and discussion). [sent-52, score-0.092]
33 The resulting generic gradient boosting algorithm for density estimation can be seen in Fig. 1. [sent-53, score-1.01]
34 Implementation details for this algorithm include the choice of the family of weak learners $\mathcal{F}$ and the method for searching for the “optimal” weak learner at each boosting iteration. [sent-55, score-1.128]
35 We address these details in Section 4. [sent-56, score-0.212]
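Putting the pieces above together, a compact sketch of the generic loop of Figure 1; the initial model, the grid-based line search, and the `fit_weak` interface are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def boost_density(X, fit_weak, T=25, F0=None):
    """Sketch of the generic density-estimation boosting loop (cf. Figure 1).
    fit_weak(X, w) returns a density f (a callable) fitted to weighted data;
    the convex-combination update keeps the model a probability density."""
    n = len(X)
    F = F0 if F0 is not None else (lambda x: 1.0)     # placeholder initial model
    for _ in range(T):
        w = np.array([1.0 / max(F(x), 1e-300) for x in X])
        w /= w.sum()                                  # re-weight the data
        f = fit_weak(X, w)                            # step 2(b): weak density
        if sum(f(x) / max(F(x), 1e-300) for x in X) <= n:
            break                                     # termination condition
        alphas = np.linspace(0.0, 1.0, 101)           # line search for coefficient
        loglik = lambda a: sum(np.log((1 - a) * F(x) + a * f(x) + 1e-300) for x in X)
        alpha = max(alphas, key=loglik)
        F = (lambda Fp, fn, a: (lambda x: (1 - a) * Fp(x) + a * fn(x)))(F, f, alpha)
    return F
```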
36 Figure 1: Boosting density estimation algorithm. 3 Training data performance: The concept of “strength of weak learnability” [6, 18] has been developed in the context of boosting classification models. [sent-68, score-1.327]
37 At each step in the algorithm, the weighted error of the previous model, using the new weights, is exactly $1/2$. [sent-72, score-0.069]
38 Thus, the new weak learner doing “better than random” on the re-weighted data means it can improve the previous weak learner’s performance at the current fit, by achieving weighted classification error better than $1/2$. [sent-73, score-1.072]
39 In fact, it is easy to show that the weak learnability condition, that at least one weak learner attains classification error less than $1/2$ on the re-weighted data, fails to hold only if the current combined model is the optimal solution in the space of linear combinations of weak learners. [sent-74, score-1.607]
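For the standard AdaBoost update sketched earlier, with $\alpha_t = \tfrac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}$ (an assumption carried over from that sketch rather than taken from this paper), the value of exactly $1/2$ can be checked directly: the unnormalized weight mass on the mistakes of $h_t$ equals the mass on its correct predictions,

```latex
\[
\sum_{i:\,h_t(x_i)\ne y_i} w_i\, e^{\alpha_t}
  = \epsilon_t\sqrt{\tfrac{1-\epsilon_t}{\epsilon_t}}
  = \sqrt{\epsilon_t(1-\epsilon_t)}
  = (1-\epsilon_t)\sqrt{\tfrac{\epsilon_t}{1-\epsilon_t}}
  = \sum_{i:\,h_t(x_i)= y_i} w_i\, e^{-\alpha_t},
\]
```

so after renormalization the mistakes of $h_t$ carry exactly half of the total weight.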
40 We now derive a similar formulation for our density estimation boosting algorithm. [sent-75, score-0.886]
41 Theorem 1: If there does not exist a weak learner $f \in \mathcal{F}$ such that $\sum_i \frac{f(x_i)}{F_t(x_i)} > n$, then $F_t$ is the global minimum of the training loss in the domain of normalized linear combinations of $\mathcal{F}$. [sent-79, score-0.639]
42 Intuitively, we can see that a boosting algorithm will not let any observation have exceptionally low probability over time, since that would cause this observation to have overwhelming weight in the next boosting iteration, and hence the next selected weak learner is certain to give it high probability. [sent-88, score-1.495]
43 Thus, after some iterations we can assume that we would actually have a threshold on the weights that is independent of the iteration number, and hence the loss would decrease at least as the sum of squares of the “weak learnability” quantities. [sent-89, score-0.105]
44 4 Boosting Bayesian Networks: We now focus our attention on a specific application of the boosting methodology for density estimation, using Bayesian networks as the weak learners. [sent-90, score-1.286]
45 A Bayesian network is a graphical model that encodes a joint probability distribution over a set of random variables. [sent-91, score-0.054]
46 Recently, there has been much work on developing algorithms for learning Bayesian networks (both network structure and parameters) from data for the task of density estimation and hence they seem appropriate as our choice of weak learners. [sent-93, score-0.656]
47 Another advantage of Bayesian networks in our context is the ability to tune the strength of the weak learners using parameters such as the number of edges and the strength of the prior. [sent-94, score-0.826]
48 We rewrite step 2(b) of the boosting algorithm as: find $f \in \mathcal{F}$ to maximize $\sum_x w(x) f(x)$, where $w(x)$ is the total weight of the training instances taking the value $x$. In this formulation, all possible values of $x$ have weights, some of which may be $0$. [sent-96, score-0.818]
49 As mentioned above, the two main implementation-specific details in the generic density estimation algorithm are the set of weak models $\mathcal{F}$ and the method for searching for the ”optimal” weak model at each boosting iteration. [sent-98, score-1.726]
50 When boosting Bayesian networks, a natural way of limiting the “strength” of the weak learners in $\mathcal{F}$ is to limit the complexity of the network structures in $\mathcal{F}$. [sent-99, score-1.406]
51 This can be done, for instance, by bounding the number of edges in each “weak density estimator” learned during the boosting iterations. [sent-100, score-0.856]
52 The problem of finding an “optimal” weak model at each boosting iteration (step 2(b) of the algorithm) is trickier. [sent-101, score-1.109]
53 We first note that if we only impose an $L_1$ constraint on the norm of $f$ (specifically, the PDF constraint $\sum_x f(x) = 1,\ f(x) \ge 0$), then step 2(b) has a trivial solution, concentrating all the probability at the value of $x$ with the highest “weight” $w(x)$. [sent-102, score-0.067]
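To see why the solution degenerates, note that for any probability distribution $f$ (a short argument added here for clarity):

```latex
\[
\sum_x w(x)\, f(x) \;\le\; \Big(\max_x w(x)\Big)\sum_x f(x) \;=\; \max_x w(x),
\]
```

with equality when $f$ places all of its mass on $\arg\max_x w(x)$.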
54 This phenomenon is not limited to the density estimation case, and would appear in boosting for classification if the set of weak learners had a fixed $L_1$ norm rather than the fixed $L_\infty$ norm implicitly imposed by limiting $\mathcal{F}$ to contain classifiers. [sent-103, score-1.514]
55 This consequence of limiting $\mathcal{F}$ to contain probability distributions is particularly problematic when boosting Bayesian networks, since the trivial solution, which concentrates all probability on a single joint assignment, can be represented with a fully disconnected network. [sent-104, score-0.755]
56 Thus, limiting $\mathcal{F}$ to “simple” structures by itself does not remedy this problem. [sent-105, score-0.054]
57 However, the boosting algorithm does not explicitly require $\mathcal{F}$ to include only probability distributions. [sent-106, score-0.743]
58 This fact points to an interesting correspondence between solutions to norm-constrained linear optimization problems and norm-constrained log optimization problems, and leads us to believe that good solutions to step 2(b) of the boosting algorithm can be approximated by solving a surrogate, log-likelihood-based version of the step instead. [sent-111, score-0.877]
59 Our implementation of the boosting algorithm, therefore, does indeed limit $\mathcal{F}$ to include probability distributions only, in this case those that can be represented by “simple” Bayesian networks. [sent-114, score-0.701]
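A sketch of such a surrogate step: weighted maximum likelihood for a deliberately weak density model, here independent discrete marginals standing in for a “simple” Bayesian network; the data layout and the smoothing constant are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def fit_weak_marginals(X, w, n_values):
    """Surrogate step 2(b): weighted maximum likelihood for a fully
    disconnected model (a product of discrete marginals).  X is an
    (n_samples, n_vars) integer array and w are normalized instance weights."""
    n_samples, n_vars = X.shape
    tables = []
    for j in range(n_vars):
        counts = np.zeros(n_values[j])
        for xij, wi in zip(X[:, j], w):
            counts[xij] += wi                 # weighted counts per value
        counts += 1e-3                        # small prior to avoid zeros
        tables.append(counts / counts.sum())
    def f(x):                                 # product of the marginals
        return float(np.prod([tables[j][x[j]] for j in range(n_vars)]))
    return f
```

Because the surrogate maximizes a weighted log-likelihood rather than the linear score, its solution spreads probability according to the weights instead of collapsing onto a single value.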
60 Note that this use of “surrogate” optimization tasks is not alien to other boosting approaches. [sent-116, score-0.701]
61 Figure 2 (x-axis: BoostingIterations): (a) Comparison of boosting, single Bayesian network and AutoClass performance on the genomic expression dataset. [sent-139, score-0.241]
62 (c) The weak learnability condition is plotted along with training data performance as a reference for the genomic expression dataset. [sent-143, score-0.751]
63 The plot is in log-scale and also includes the threshold $n$, where $n$ is the number of training instances. (d) Same as (c) for the census dataset. [sent-144, score-0.169]
64 For example, AdaBoost calls for optimizing a re-weighted classification problem at each step; decision trees, the most popular boosting weak learners, search for “optimal” solutions using surrogate loss functions, such as the Gini index for CART [3] or information gain for C4.5. [sent-146, score-1.251]
65 5 Experimental Results We evaluated the performance of our algorithms on two distinct datasets: a genomic expression dataset and a US census dataset. [sent-148, score-0.286]
66 In gene expression data, the level of mRNA transcript of every gene in the cell is measured simultaneously, using DNA microarray technology, allowing researchers to detect functionally related genes based on the correlation of their expression profiles across the various experiments. [sent-149, score-0.333]
67 We combined three yeast expression data sets [10, 12, 19] for a total of 550 expression experiments. [sent-150, score-0.195]
68 To test our methods on a set of correlated variables, we selected 56 genes associated with the oxidative phosphorylation pathway in the KEGG database [1]. [sent-151, score-0.079]
69 We discretized the expression measurements of each gene into three levels (down, same, up) as in [15]. [sent-152, score-0.12]
70 We obtained the 1990 US census data set from the UC Irvine data repository (http://kdd. [sent-153, score-0.099]
71 Each of the data sets was randomly partitioned into 5 equally sized sets and our boosting algorithm was learned from each of the 5 possible combinations of 4 partitions. [sent-161, score-0.788]
72 The performance of each boosting model was evaluated by measuring the average log-likelihood achieved on the held-out partition. [sent-162, score-0.738]
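A sketch of this evaluation protocol as described; `train_model` is a placeholder for the boosting (or baseline) learner and the random seed is arbitrary.

```python
import numpy as np

def cross_validated_loglik(X, train_model, n_folds=5, seed=0):
    """Random 5-way partition; train on every choice of 4 parts and report the
    average held-out log-likelihood per instance (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, n_folds)
    scores = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        F = train_model([X[i] for i in train])
        scores.append(np.mean([np.log(max(F(X[i]), 1e-300)) for i in test]))
    return float(np.mean(scores))
```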
73 We compared the performance achieved to that of a single Bayesian network learned using standard techniques (see [11] and references therein). [sent-166, score-0.112]
74 To test whether our boosting approach gains its performance primarily by using an ensemble of Bayesian networks, we also compared the performance to that achieved by an ensemble of Bayesian networks learned using AutoClass [4], varying the number of classes from 2 to 100. [sent-167, score-0.911]
75 We omit the results of AutoClass for the census data, as its average test-instance log-likelihood was not comparable to that of boosting or a single Bayesian network. [sent-171, score-0.857]
76 As can be seen, our boosting algorithm performs significantly better, rendering each instance in the test data substantially more likely than it is using the other approaches, in both the genomic and census datasets. [sent-172, score-0.961]
77 To illustrate the theoretical concepts discussed in Section 3, we recorded the performance of our boosting algorithm on the training set for both data sets. [sent-173, score-0.852]
78 As shown in Section 3, if $\sum_i f(x_i)/F_t(x_i) > n$, then adding $f$ to the model is guaranteed to improve our training set performance. [sent-174, score-0.066]
79 Theorem 2 relates the magnitude of this difference to the amount of improvement in training set performance. [sent-175, score-0.082]
80 Fig. 2(c,d) plots the weak learnability quantity $\sum_i f(x_i)/F_t(x_i)$, the training set log-likelihood, and the threshold $n$ for both data sets on a log scale. [sent-177, score-0.541]
81 As can be seen, the theory matches nicely: the improvement is large when the weak learnability condition is large, and stops entirely once it asymptotes to the threshold $n$. [sent-178, score-0.556]
82 Finally, boosting theory tells us that the effect of boosting is more pronounced for “weaker” weak learners. [sent-179, score-1.787]
83 To that end, we experimented (data not shown) with various strength parameters for the family of weak learners (number of allowed edges in each Bayesian network, strength of the prior). [sent-180, score-0.771]
84 As expected, the overall effect of boosting was much stronger for weaker learners. [sent-181, score-0.743]
85 6 Discussion and future work In this paper we extended the boosting methodology to the domain of density estimation and demonstrated its practical performance on real world datasets. [sent-182, score-0.974]
86 We believe that this direction shows promise and hope that our work will lead to other boosting implementations in density estimation as well as other function estimation domains. [sent-183, score-0.927]
87 Our theoretical results include an exposition of the training data performance of the generic algorithm, proving analogous results to those in the case of boosting for classification. [sent-184, score-0.861]
88 Of particular interest is Theorem 1, implying that the idealized algorithm converges, asymptotically, to the global minimum. [sent-185, score-0.101]
89 This result is interesting, as it implies that the greedy boosting algorithm converges to the exhaustive solution. [sent-186, score-0.743]
90 However, this global minimum is usually not a good solution in terms of test-set performance, as it will tend to overfit (especially if the class of weak learners $\mathcal{F}$ is not very small). [sent-187, score-0.062]
91 Boosting can be described as generating a regularized path to this optimal solution [17], and thus we can assume that points along the path will usually have better generalization performance than the non-regularized optimum. [sent-188, score-0.105]
92 In Section 4 we described the theoretical and practical difficulties in solving the optimization step of the boosting iterations (step 2(b)). [sent-189, score-0.773]
93 However, it will be interesting to formulate other cases where the original problem has non-trivial solutions, for instance by not limiting $\mathcal{F}$ to probability distributions only and using non-density-estimation algorithms to generate the “weak” models. [sent-191, score-0.14]
94 The popularity of Bayesian networks as density estimators stems from their intuitive interpretation as describing causal relations in data. [sent-192, score-0.211]
95 However, when learning the network structure from data, a major issue is assigning confidence to the learned features. [sent-193, score-0.075]
96 A potential use of boosting could be in improving interpretability and reducing instability in structure learning. [sent-194, score-0.701]
97 If the weak models in $\mathcal{F}$ are limited to a small number of edges, we can collect and interpret the “total influence” of edges in the combined model. [sent-195, score-0.448]
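A minimal sketch of this idea, assuming each boosted weak model is summarized by its coefficient and its edge list; the representation is hypothetical and only meant to illustrate the aggregation.

```python
from collections import defaultdict

def edge_influence(models):
    """Sum, for every edge, the coefficients of the weak models containing it.
    models is an iterable of (alpha, edges) pairs, edges = [(parent, child), ...]."""
    influence = defaultdict(float)
    for alpha, edges in models:
        for edge in edges:
            influence[edge] += alpha
    return dict(influence)
```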
98 Being Bayesian about network structure: A Bayesian approach to structure discovery in Bayesian networks. [sent-257, score-0.465]
99 Genomic expression program in the response of yeast cells to environmental changes. [sent-273, score-0.109]
100 Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. [sent-335, score-0.165]
wordName wordTfidf (topN-words)
[('boosting', 0.701), ('weak', 0.385), ('learners', 0.212), ('learner', 0.178), ('autoclass', 0.146), ('bayesian', 0.13), ('learnability', 0.111), ('census', 0.099), ('density', 0.098), ('genomic', 0.091), ('wg', 0.091), ('loss', 0.082), ('xv', 0.073), ('px', 0.073), ('strength', 0.069), ('adaboost', 0.067), ('estimation', 0.064), ('boostingiterations', 0.063), ('expression', 0.059), ('networks', 0.055), ('network', 0.054), ('limiting', 0.054), ('gradient', 0.054), ('classi', 0.051), ('genes', 0.051), ('generic', 0.051), ('yeast', 0.05), ('descent', 0.048), ('stanford', 0.048), ('methodology', 0.047), ('training', 0.045), ('step', 0.045), ('algorithm', 0.042), ('eran', 0.042), ('logweaklearnability', 0.042), ('saharon', 0.042), ('segal', 0.042), ('trainingperformance', 0.042), ('weaklearnability', 0.042), ('friedman', 0.041), ('find', 0.041), ('performance', 0.037), ('improvement', 0.037), ('edges', 0.036), ('kegg', 0.036), ('rosset', 0.036), ('spellman', 0.036), ('surrogate', 0.036), ('cell', 0.035), ('current', 0.034), ('gene', 0.034), ('theorem', 0.034), ('cation', 0.034), ('attaining', 0.033), ('researchers', 0.032), ('popularity', 0.031), ('pro', 0.031), ('maximize', 0.03), ('ensemble', 0.03), ('margin', 0.03), ('achieving', 0.029), ('microarray', 0.029), ('instance', 0.028), ('optimality', 0.028), ('selected', 0.028), ('augment', 0.028), ('terminates', 0.028), ('boosted', 0.028), ('domain', 0.027), ('combined', 0.027), ('theoretical', 0.027), ('stems', 0.027), ('discretized', 0.027), ('annals', 0.026), ('sequentially', 0.025), ('search', 0.025), ('instances', 0.025), ('freund', 0.025), ('global', 0.025), ('lemma', 0.024), ('combinations', 0.024), ('weighted', 0.024), ('ig', 0.024), ('path', 0.023), ('happen', 0.023), ('condition', 0.023), ('formulation', 0.023), ('iteration', 0.023), ('norm', 0.022), ('proof', 0.022), ('solutions', 0.022), ('optimal', 0.022), ('overall', 0.021), ('weaker', 0.021), ('discovery', 0.021), ('achieves', 0.021), ('graphical', 0.021), ('learned', 0.021), ('ph', 0.021), ('adding', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 46 nips-2002-Boosting Density Estimation
Author: Saharon Rosset, Eran Segal
Abstract: Several authors have suggested viewing boosting as a gradient descent search for a good fit in function space. We apply gradient-based boosting methodology to the unsupervised learning problem of density estimation. We show convergence properties of the algorithm and prove that a strength of weak learnability property applies to this problem as well. We illustrate the potential of this approach through experiments with boosting Bayesian networks to learn density models.
2 0.43123102 181 nips-2002-Self Supervised Boosting
Author: Max Welling, Richard S. Zemel, Geoffrey E. Hinton
Abstract: Boosting algorithms and successful applications thereof abound for classification and regression learning problems, but not for unsupervised learning. We propose a sequential approach to adding features to a random field model by training them to improve classification performance between the data and an equal-sized sample of “negative examples” generated from the model’s current estimate of the data density. Training in each boosting round proceeds in three stages: first we sample negative examples from the model’s current Boltzmann distribution. Next, a feature is trained to improve classification performance between data and negative examples. Finally, a coefficient is learned which determines the importance of this feature relative to ones already in the pool. Negative examples only need to be generated once to learn each new feature. The validity of the approach is demonstrated on binary digits and continuous synthetic data.
3 0.30670398 69 nips-2002-Discriminative Learning for Label Sequences via Boosting
Author: Yasemin Altun, Thomas Hofmann, Mark Johnson
Abstract: This paper investigates a boosting approach to discriminative learning of label sequences based on a sequence rank loss function. The proposed method combines many of the advantages of boosting schemes with the efficiency of dynamic programming methods and is attractive both, conceptually and computationally. In addition, we also discuss alternative approaches based on the Hamming loss for label sequences. The sequence boosting algorithm offers an interesting alternative to methods based on HMMs and the more recently proposed Conditional Random Fields. Applications areas for the presented technique range from natural language processing and information extraction to computational biology. We include experiments on named entity recognition and part-of-speech tagging which demonstrate the validity and competitiveness of our approach. 1
4 0.28788644 92 nips-2002-FloatBoost Learning for Classification
Author: Stan Z. Li, Zhenqiu Zhang, Heung-yeung Shum, Hongjiang Zhang
Abstract: AdaBoost [3] minimizes an upper error bound which is an exponential function of the margin on the training set [14]. However, the ultimate goal in applications of pattern classification is always minimum error rate. On the other hand, AdaBoost needs an effective procedure for learning weak classifiers, which by itself is difficult especially for high dimensional data. In this paper, we present a novel procedure, called FloatBoost, for learning a better boosted classifier. FloatBoost uses a backtrack mechanism after each iteration of AdaBoost to remove weak classifiers which cause higher error rates. The resulting float-boosted classifier consists of fewer weak classifiers yet achieves lower error rates than AdaBoost in both training and test. We also propose a statistical model for learning weak classifiers, based on a stagewise approximation of the posterior using an overcomplete set of scalar features. Experimental comparisons of FloatBoost and AdaBoost are provided through a difficult classification problem, face detection, where the goal is to learn from training examples a highly nonlinear classifier to differentiate between face and nonface patterns in a high dimensional space. The results clearly demonstrate the promises made by FloatBoost over AdaBoost.
5 0.20348246 120 nips-2002-Kernel Design Using Boosting
Author: Koby Crammer, Joseph Keshet, Yoram Singer
Abstract: The focus of the paper is the problem of learning kernel operators from empirical data. We cast the kernel design problem as the construction of an accurate kernel from simple (and less accurate) base kernels. We use the boosting paradigm to perform the kernel construction process. To do so, we modify the booster so as to accommodate kernel operators. We also devise an efficient weak-learner for simple kernels that is based on generalized eigen vector decomposition. We demonstrate the effectiveness of our approach on synthetic data and on the USPS dataset. On the USPS dataset, the performance of the Perceptron algorithm with learned kernels is systematically better than a fixed RBF kernel. 1 Introduction and problem Setting The last decade brought voluminous amount of work on the design, analysis and experimentation of kernel machines. Algorithm based on kernels can be used for various machine learning tasks such as classification, regression, ranking, and principle component analysis. The most prominent learning algorithm that employs kernels is the Support Vector Machines (SVM) [1, 2] designed for classification and regression. A key component in a kernel machine is a kernel operator which computes for any pair of instances their inner-product in some abstract vector space. Intuitively and informally, a kernel operator is a means for measuring similarity between instances. Almost all of the work that employed kernel operators concentrated on various machine learning problems that involved a predefined kernel. A typical approach when using kernels is to choose a kernel before learning starts. Examples to popular predefined kernels are the Radial Basis Functions and the polynomial kernels (see for instance [1]). Despite the simplicity required in modifying a learning algorithm to a “kernelized” version, the success of such algorithms is not well understood yet. More recently, special efforts have been devoted to crafting kernels for specific tasks such as text categorization [3] and protein classification problems [4]. Our work attempts to give a computational alternative to predefined kernels by learning kernel operators from data. We start with a few definitions. Let X be an instance space. A kernel is an inner-product operator K : X × X → . An explicit way to describe K is via a mapping φ : X → H from X to an inner-products space H such that K(x, x ) = φ(x)·φ(x ). Given a kernel operator and a finite set of instances S = {xi , yi }m , the kernel i=1 matrix (a.k.a the Gram matrix) is the matrix of all possible inner-products of pairs from S, Ki,j = K(xi , xj ). We therefore refer to the general form of K as the kernel operator and to the application of the kernel operator to a set of pairs of instances as the kernel matrix. The specific setting of kernel design we consider assumes that we have access to a base kernel learner and we are given a target kernel K manifested as a kernel matrix on a set of examples. Upon calling the base kernel learner it returns a kernel operator denote Kj . The goal thereafter is to find a weighted combination of kernels ˆ K(x, x ) = j αj Kj (x, x ) that is similar, in a sense that will be defined shortly, to ˆ the target kernel, K ∼ K . Cristianini et al. [5] in their pioneering work on kernel target alignment employed as the notion of similarity the inner-product between the kernel matrices < K, K >F = m K(xi , xj )K (xi , xj ). 
Given this definition, they defined the i,j=1 kernel-similarity, or alignment, to be the above inner-product normalized by the norm of ˆ ˆ ˆ ˆ ˆ each kernel, A(S, K, K ) = < K, K >F / < K, K >F < K , K >F , where S is, as above, a finite sample of m instances. Put another way, the kernel alignment Cristianini et al. employed is the cosine of the angle between the kernel matrices where each matrix is “flattened” into a vector of dimension m2 . Therefore, this definition implies that the alignment is bounded above by 1 and can attain this value iff the two kernel matrices are identical. Given a (column) vector of m labels y where yi ∈ {−1, +1} is the label of the instance xi , Cristianini et al. used the outer-product of y as the the target kernel, ˆ K = yy T . Therefore, an optimal alignment is achieved if K(xi , xj ) = yi yj . Clearly, if such a kernel is used for classifying instances from X , then the kernel itself suffices to construct an excellent classifier f : X → {−1, +1} by setting, f (x) = sign(y i K(xi , x)) where (xi , yi ) is any instance-label pair. Cristianini et al. then devised a procedure that works with both labelled and unlabelled examples to find a Gram matrix which attains a good alignment with K on the labelled part of the matrix. While this approach can clearly construct powerful kernels, a few problems arise from the notion of kernel alignment they employed. For instance, a kernel operator such that the sign(K(x i , xj )) is equal to yi yj but its magnitude, |K(xi , xj )|, is not necessarily 1, might achieve a poor alignment score while it can constitute a classifier whose empirical loss is zero. Furthermore, the task of finding a good kernel when it is not always possible to find a kernel whose sign on each pair of instances is equal to the products of the labels (termed the soft-margin case in [5, 6]) becomes rather tricky. We thus propose a different approach which attempts to overcome some of the difficulties above. Like Cristianini et al. we assume that we are given a set of labelled instances S = {(xi , yi ) | xi ∈ X , yi ∈ {−1, +1}, i = 1, . . . , m} . We are also given a set of unlabelled m ˜ ˜ examples S = {˜i }i=1 . If such a set is not provided we can simply use the labelled inx ˜ ˜ stances (without the labels themselves) as the set S. The set S is used for constructing the ˆ primitive kernels that are combined to constitute the learned kernel K. The labelled set is used to form the target kernel matrix and its instances are used for evaluating the learned ˆ kernel K. This approach, known as transductive learning, was suggested in [5, 6] for kernel alignment tasks when the distribution of the instances in the test data is different from that of the training data. This setting becomes in particular handy in datasets where the test data was collected in a different scheme than the training data. We next discuss the notion of kernel goodness employed in this paper. This notion builds on the objective function that several variants of boosting algorithms maintain [7, 8]. We therefore first discuss in brief the form of boosting algorithms for kernels. 2 Using Boosting to Combine Kernels Numerous interpretations of AdaBoost and its variants cast the boosting process as a procedure that attempts to minimize, or make small, a continuous bound on the classification error (see for instance [9, 7] and the references therein). A recent work by Collins et al. 
[8] unifies the boosting process for two popular loss functions, the exponential-loss (denoted henceforth as ExpLoss) and logarithmic-loss (denoted as LogLoss) that bound the empir- ˜ ˜ Input: Labelled and unlabelled sets of examples: S = {(xi , yi )}m ; S = {˜i }m x i=1 i=1 Initialize: K ← 0 (all zeros matrix) For t = 1, 2, . . . , T : • Calculate distribution over pairs 1 ≤ i, j ≤ m: Dt (i, j) = exp(−yi yj K(xi , xj )) 1/(1 + exp(−yi yj K(xi , xj ))) ExpLoss LogLoss ˜ • Call base-kernel-learner with (Dt , S, S) and receive Kt • Calculate: + − St = {(i, j) | yi yj Kt (xi , xj ) > 0} ; St = {(i, j) | yi yj Kt (xi , xj ) < 0} + Wt = (i,j)∈S + Dt (i, j)|Kt (xi , xj )| ; Wt− = (i,j)∈S − Dt (i, j)|Kt (xi , xj )| t t 1 2 + Wt − Wt • Set: αt = ln ; K ← K + α t Kt . Return: kernel operator K : X × X → Figure 1: The skeleton of the boosting algorithm for kernels. ical classification error. Given the prediction of a classifier f on an instance x and a label y ∈ {−1, +1} the ExpLoss and the LogLoss are defined as, ExpLoss(f (x), y) = exp(−yf (x)) LogLoss(f (x), y) = log(1 + exp(−yf (x))) . Collins et al. described a single algorithm for the two losses above that can be used within the boosting framework to construct a strong-hypothesis which is a classifier f (x). This classifier is a weighted combination of (possibly very simple) base classifiers. (In the boosting framework, the base classifiers are referred to as weak-hypotheses.) The strongT hypothesis is of the form f (x) = t=1 αt ht (x). Collins et al. discussed a few ways to select the weak-hypotheses ht and to find a good of weights αt . Our starting point in this paper is the first sequential algorithm from [8] that enables the construction or creation of weak-hypotheses on-the-fly. We would like to note however that it is possible to use other variants of boosting to design kernels. In order to use boosting to design kernels we extend the algorithm to operate over pairs of instances. Building on the notion of alignment from [5, 6], we say that the inner-product of x1 and x2 is aligned with the labels y1 and y2 if sign(K(x1 , x2 )) = y1 y2 . Furthermore, we would like to make the magnitude of K(x, x ) to be as large as possible. We therefore use one of the following two alignment losses for a pair of examples (x 1 , y1 ) and (x2 , y2 ), ExpLoss(K(x1 , x2 ), y1 y2 ) = exp(−y1 y2 K(x1 , x2 )) LogLoss(K(x1 , x2 ), y1 y2 ) = log(1 + exp(−y1 y2 K(x1 , x2 ))) . Put another way, we view a pair of instances as a single example and cast the pairs of instances that attain the same label as positively labelled examples while pairs of opposite labels are cast as negatively labelled examples. Clearly, this approach can be applied to both losses. In the boosting process we therefore maintain a distribution over pairs of instances. The weight of each pair reflects how difficult it is to predict whether the labels of the two instances are the same or different. The core boosting algorithm follows similar lines to boosting algorithms for classification algorithm. The pseudo code of the booster is given in Fig. 1. The pseudo-code is an adaptation the to problem of kernel design of the sequentialupdate algorithm from [8]. As with other boosting algorithm, the base-learner, which in our case is charge of returning a good kernel with respect to the current distribution, is left unspecified. We therefore turn our attention to the algorithmic implementation of the base-learning algorithm for kernels. 
3 Learning Base Kernels The base kernel learner is provided with a training set S and a distribution D t over a pairs ˜ of instances from the training set. It is also provided with a set of unlabelled examples S. Without any knowledge of the topology of the space of instances a learning algorithm is likely to fail. Therefore, we assume the existence of an initial inner-product over the input space. We assume for now that this initial inner-product is the standard scalar products over vectors in n . We later discuss a way to relax the assumption on the form of the inner-product. Equipped with an inner-product, we define the family of base kernels to be the possible outer-products Kw = wwT between a vector w ∈ n and itself. Using this definition we get, Kw (xi , xj ) = (xi ·w)(xj ·w) . Input: A distribution Dt . Labelled and unlabelled sets: ˜ ˜ Therefore, the similarity beS = {(xi , yi )}m ; S = {˜i }m . x i=1 i=1 tween two instances xi and Compute : xj is high iff both xi and xj • Calculate: ˜ are similar (w.r.t the standard A ∈ m×m , Ai,r = xi · xr ˜ inner-product) to a third vecm×m B∈ , Bi,j = Dt (i, j)yi yj tor w. Analogously, if both ˜ ˜ K ∈ m×m , Kr,s = xr · xs ˜ ˜ xi and xj seem to be dissim• Find the generalized eigenvector v ∈ m for ilar to the vector w then they the problem AT BAv = λKv which attains are similar to each other. Dethe largest eigenvalue λ spite the restrictive form of • Set: w = ( r vr xr )/ ˜ ˜ r vr xr . the inner-products, this famt ily is still too rich for our setReturn: Kernel operator Kw = ww . ting and we further impose two restrictions on the inner Figure 2: The base kernel learning algorithm. products. First, we assume ˜ that w is restricted to a linear combination of vectors from S. Second, since scaling of the base kernels is performed by the boosted, we constrain the norm of w to be 1. The m ˜ resulting class of kernels is therefore, C = {Kw = wwT | w = r=1 βr xr , w = 1} . ˜ In the boosting process we need to choose a specific base-kernel K w from C. We therefore need to devise a notion of how good a candidate for base kernel is given a labelled set S and a distribution function Dt . In this work we use the simplest version suggested by Collins et al. This version can been viewed as a linear approximation on the loss function. We define the score of a kernel Kw w.r.t to the current distribution Dt to be, Score(Kw ) = Dt (i, j)yi yj Kw (xi , xj ) . (1) i,j The higher the value of the score is, the better Kw fits the training data. Note that if Dt (i, j) = 1/m2 (as is D0 ) then Score(Kw ) is proportional to the alignment since w = 1. Under mild assumptions the score can also provide a lower bound of the loss function. To see that let c be the derivative of the loss function at margin zero, c = Loss (0) . If all the √ training examples xi ∈ S lies in a ball of radius c, we get that Loss(Kw (xi , xj ), yi yj ) ≥ 1 − cKw (xi , xj )yi yj ≥ 0, and therefore, i,j Dt (i, j)Loss(Kw (xi , xj ), yi yj ) ≥ 1 − c Dt (i, j)Kw (xi , xj )yi yj . i,j Using the explicit form of Kw in the Score function (Eq. (1)) we get, Score(Kw ) = i,j D(i, j)yi yj (w·xi )(w·xj ) . Further developing the above equation using the constraint that w = m ˜ r=1 βr xr we get, ˜ Score(Kw ) = βs βr r,s i,j D(i, j)yi yj (xi · xr ) (xj · xs ) . 
˜ ˜ To compute efficiently the base kernel score without an explicit enumeration we exploit the fact that if the initial distribution D0 is symmetric (D0 (i, j) = D0 (j, i)) then all the distributions generated along the run of the boosting process, D t , are also symmetric. We ˜ now define a matrix A ∈ m×m where Ai,r = xi · xr and a symmetric matrix B ∈ m×m ˜ with Bi,j = Dt (i, j)yi yj . Simple algebraic manipulations yield that the score function can be written as the following quadratic form, Score(β) = β T (AT BA)β , where β is m dimensional column vector. Note that since B is symmetric so is A T BA. Finding a ˜ good base kernel is equivalent to finding a vector β which maximizes this quadratic form 2 m ˜ under the norm equality constraint w = ˜ 2 = β T Kβ = 1 where Kr,s = r=1 βr xr xr · xs . Finding the maximum of Score(β) subject to the norm constraint is a well known ˜ ˜ maximization problem known as the generalized eigen vector problem (cf. [10]). Applying simple algebraic manipulations it is easy to show that the matrix AT BA is positive semidefinite. Assuming that the matrix K is invertible, the the vector β which maximizes the quadratic form is proportional the eigenvector of K −1 AT BA which is associated with the m ˜ generalized largest eigenvalue. Denoting this vector by v we get that w ∝ ˜ r=1 vr xr . m ˜ m ˜ Adding the norm constraint we get that w = ( r=1 vr xr )/ ˜ vr xr . The skeleton ˜ r=1 of the algorithm for finding a base kernels is given in Fig. 3. To conclude the description of the kernel learning algorithm we describe how to the extend the algorithm to be employed with general kernel functions. Kernelizing the Kernel: As described above, we assumed that the standard scalarproduct constitutes the template for the class of base-kernels C. However, since the proce˜ dure for choosing a base kernel depends on S and S only through the inner-products matrix A, we can replace the scalar-product itself with a general kernel operator κ : X × X → , where κ(xi , xj ) = φ(xi ) · φ(xj ). Using a general kernel function κ we can not compute however the vector w explicitly. We therefore need to show that the norm of w, and evaluation Kw on any two examples can still be performed efficiently. First note that given the vector v we can compute the norm of w as follows, T w 2 = vr xr ˜ vs xr ˜ r s = vr vs κ(˜r , xs ) . x ˜ r,s Next, given two vectors xi and xj the value of their inner-product is, Kw (xi , xj ) = vr vs κ(xi , xr )κ(xj , xs ) . ˜ ˜ r,s Therefore, although we cannot compute the vector w explicitly we can still compute its norm and evaluate any of the kernels from the class C. 4 Experiments Synthetic data: We generated binary-labelled data using as input space the vectors in 100 . The labels, in {−1, +1}, were picked uniformly at random. Let y designate the label of a particular example. Then, the first two components of each instance were drawn from a two-dimensional normal distribution, N (µ, ∆ ∆−1 ) with the following parameters, µ=y 0.03 0.03 1 ∆= √ 2 1 −1 1 1 = 0.1 0 0 0.01 . That is, the label of each examples determined the mean of the distribution from which the first two components were generated. The rest of the components in the vector (98 8 0.2 6 50 50 100 100 150 150 200 200 4 2 0 0 −2 −4 −6 250 250 −0.2 −8 −0.2 0 0.2 −8 −6 −4 −2 0 2 4 6 8 300 20 40 60 80 100 120 140 160 180 200 300 20 40 60 80 100 120 140 160 180 Figure 3: Results on a toy data set prior to learning a kernel (first and third from left) and after learning (second and fourth). 
For each of the two settings we show the first two components of the training data (left) and the matrix of inner products between the train and the test data (right). altogether) were generated independently using the normal distribution with a zero mean and a standard deviation of 0.05. We generated 100 training and test sets of size 300 and 200 respectively. We used the standard dot-product as the initial kernel operator. On each experiment we first learned a linear classier that separates the classes using the Perceptron [11] algorithm. We ran the algorithm for 10 epochs on the training set. After each epoch we evaluated the performance of the current classifier on the test set. We then used the boosting algorithm for kernels with the LogLoss for 30 rounds to build a kernel for each random training set. After learning the kernel we re-trained a classifier with the Perceptron algorithm and recorded the results. A summary of the online performance is given in Fig. 4. The plot on the left-hand-side of the figure shows the instantaneous error (achieved during the run of the algorithm). Clearly, the Perceptron algorithm with the learned kernel converges much faster than the original kernel. The middle plot shows the test error after each epoch. The plot on the right shows the test error on a noisy test set in which we added a Gaussian noise of zero mean and a standard deviation of 0.03 to the first two features. In all plots, each bar indicates a 95% confidence level. It is clear from the figure that the original kernel is much slower to converge than the learned kernel. Furthermore, though the kernel learning algorithm was not expoed to the test set noise, the learned kernel reflects better the structure of the feature space which makes the learned kernel more robust to noise. Fig. 3 further illustrates the benefits of using a boutique kernel. The first and third plots from the left correspond to results obtained using the original kernel and the second and fourth plots show results using the learned kernel. The left plots show the empirical distribution of the two informative components on the test data. For the learned kernel we took each input vector and projected it onto the two eigenvectors of the learned kernel operator matrix that correspond to the two largest eigenvalues. Note that the distribution after the projection is bimodal and well separated along the first eigen direction (x-axis) and shows rather little deviation along the second eigen direction (y-axis). This indicates that the kernel learning algorithm indeed found the most informative projection for separating the labelled data with large margin. It is worth noting that, in this particular setting, any algorithm which chooses a single feature at a time is prone to failure since both the first and second features are mandatory for correctly classifying the data. The two plots on the right hand side of Fig. 3 use a gray level color-map to designate the value of the inner-product between each pairs instances, one from training set (y-axis) and the other from the test set. The examples were ordered such that the first group consists of the positively labelled instances while the second group consists of the negatively labelled instances. Since most of the features are non-relevant the original inner-products are noisy and do not exhibit any structure. 
In contrast, the inner-products using the learned kernel yields in a 2 × 2 block matrix indicating that the inner-products between instances sharing the same label obtain large positive values. Similarly, for instances of opposite 200 1 12 Regular Kernel Learned Kernel 0.8 17 0.7 16 0.5 0.4 0.3 Test Error % 8 0.6 Regular Kernel Learned Kernel 18 10 Test Error % Averaged Cumulative Error % 19 Regular Kernel Learned Kernel 0.9 6 4 15 14 13 12 0.2 11 2 0.1 10 0 0 10 1 10 2 10 Round 3 10 4 10 0 2 4 6 Epochs 8 10 9 2 4 6 Epochs 8 10 Figure 4: The online training error (left), test error (middle) on clean synthetic data using a standard kernel and a learned kernel. Right: the online test error for the two kernels on a noisy test set. labels the inner products are large and negative. The form of the inner-products matrix of the learned kernel indicates that the learning problem itself becomes much easier. Indeed, the Perceptron algorithm with the standard kernel required around 94 training examples on the average before converging to a hyperplane which perfectly separates the training data while using the Perceptron algorithm with learned kernel required a single example to reach a perfect separation on all 100 random training sets. USPS dataset: The USPS (US Postal Service) dataset is known as a challenging classification problem in which the training set and the test set were collected in a different manner. The USPS contains 7, 291 training examples and 2, 007 test examples. Each example is represented as a 16 × 16 matrix where each entry in the matrix is a pixel that can take values in {0, . . . , 255}. Each example is associated with a label in {0, . . . , 9} which is the digit content of the image. Since the kernel learning algorithm is designed for binary problems, we broke the 10-class problem into 45 binary problems by comparing all pairs of classes. The interesting question of how to learn kernels for multiclass problems is beyond the scopre of this short paper. We thus constraint on the binary error results for the 45 binary problem described above. For the original kernel we chose a RBF kernel with σ = 1 which is the value employed in the experiments reported in [12]. We used the kernelized version of the kernel design algorithm to learn a different kernel operator for each of the binary problems. We then used a variant of the Perceptron [11] and with the original RBF kernel and with the learned kernels. One of the motivations for using the Perceptron is its simplicity which can underscore differences in the kernels. We ran the kernel learning al˜ gorithm with LogLoss and ExpLoss, using bith the training set and the test test as S. Thus, we obtained four different sets of kernels where each set consists of 45 kernels. By examining the training loss, we set the number of rounds of boosting to be 30 for the LogLoss and 50 for the ExpLoss, when using the trainin set. When using the test set, the number of rounds of boosting was set to 100 for both losses. Since the algorithm exhibits slower rate of convergence with the test data, we choose a a higher value without attempting to optimize the actual value. The left plot of Fig. 5 is a scatter plot comparing the test error of each of the binary classifiers when trained with the original RBF a kernel versus the performance achieved on the same binary problem with a learned kernel. The kernels were built ˜ using boosting with the LogLoss and S was the training data. 
In almost all of the 45 binary classification problems, the learned kernels yielded lower error rates when combined with the Perceptron algorithm. The right plot of Fig. 5 compares two learned kernels: the first ˜ was build using the training instances as the templates constituing S while the second used the test instances. Although the differenece between the two versions is not as significant as the difference on the left plot, we still achieve an overall improvement in about 25% of the binary problems by using the test instances. 6 4.5 4 5 Learned Kernel (Test) Learned Kernel (Train) 3.5 4 3 2 3 2.5 2 1.5 1 1 0.5 0 0 1 2 3 Base Kernel 4 5 6 0 0 1 2 3 Learned Kernel (Train) 4 5 Figure 5: Left: a scatter plot comparing the error rate of 45 binary classifiers trained using an RBF kernel (x-axis) and a learned kernel with training instances. Right: a similar scatter plot for a learned kernel only constructed from training instances (x-axis) and test instances. 5 Discussion In this paper we showed how to use the boosting framework to design kernels. Our approach is especially appealing in transductive learning tasks where the test data distribution is different than the the distribution of the training data. For example, in speech recognition tasks the training data is often clean and well recorded while the test data often passes through a noisy channel that distorts the signal. An interesting and challanging question that stem from this research is how to extend the framework to accommodate more complex decision tasks such as multiclass and regression problems. Finally, we would like to note alternative approaches to the kernel design problem has been devised in parallel and independently. See [13, 14] for further details. Acknowledgements: Special thanks to Cyril Goutte and to John Show-Taylor for pointing the connection to the generalized eigen vector problem. Thanks also to the anonymous reviewers for constructive comments. References [1] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998. [2] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000. [3] Huma Lodhi, John Shawe-Taylor, Nello Cristianini, and Christopher J. C. H. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444, 2002. [4] C. Leslie, E. Eskin, and W. Stafford Noble. The spectrum kernel: A string kernel for svm protein classification. In Proceedings of the Pacific Symposium on Biocomputing, 2002. [5] Nello Cristianini, Andre Elisseeff, John Shawe-Taylor, and Jaz Kandla. On kernel target alignment. In Advances in Neural Information Processing Systems 14, 2001. [6] G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. Jordan. Learning the kernel matrix with semi-definite programming. In Proc. of the 19th Intl. Conf. on Machine Learning, 2002. [7] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337–374, April 2000. [8] Michael Collins, Robert E. Schapire, and Yoram Singer. Logistic regression, adaboost and bregman distances. Machine Learning, 47(2/3):253–285, 2002. [9] Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean. Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers. MIT Press, 1999. [10] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1985. [11] F. Rosenblatt. 
The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958. [12] B. Sch¨ lkopf, S. Mika, C.J.C. Burges, P. Knirsch, K. M¨ ller, G. R¨ tsch, and A.J. Smola. Input o u a space vs. feature space in kernel-based methods. IEEE Trans. on NN, 10(5):1000–1017, 1999. [13] O. Bosquet and D.J.L. Herrmann. On the complexity of learning the kernel matrix. NIPS, 2002. [14] C.S. Ong, A.J. Smola, and R.C. Williamson. Superkenels. NIPS, 2002.
6 0.09424638 45 nips-2002-Boosted Dyadic Kernel Discriminants
7 0.083268106 99 nips-2002-Graph-Driven Feature Extraction From Microarray Data Using Diffusion Kernels and Kernel CCA
8 0.08096806 140 nips-2002-Margin Analysis of the LVQ Algorithm
9 0.07436844 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
10 0.069404669 19 nips-2002-Adapting Codes and Embeddings for Polychotomies
11 0.0691102 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture
12 0.069090776 161 nips-2002-PAC-Bayes & Margins
13 0.067518324 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
14 0.066771328 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
15 0.065662764 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization
16 0.065168314 106 nips-2002-Hyperkernels
17 0.064657293 58 nips-2002-Conditional Models on the Ranking Poset
18 0.060244456 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
19 0.055695448 149 nips-2002-Multiclass Learning by Probabilistic Embeddings
20 0.055496871 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction
topicId topicWeight
[(0, -0.215), (1, -0.117), (2, 0.042), (3, -0.012), (4, 0.239), (5, 0.019), (6, -0.061), (7, 0.085), (8, -0.049), (9, -0.211), (10, -0.076), (11, -0.016), (12, -0.044), (13, 0.288), (14, -0.286), (15, 0.378), (16, -0.197), (17, -0.214), (18, 0.074), (19, 0.001), (20, 0.022), (21, 0.172), (22, -0.04), (23, -0.012), (24, -0.04), (25, 0.093), (26, 0.003), (27, -0.002), (28, 0.023), (29, 0.027), (30, -0.024), (31, -0.045), (32, -0.014), (33, 0.101), (34, 0.049), (35, 0.044), (36, 0.0), (37, -0.001), (38, 0.018), (39, -0.044), (40, -0.038), (41, 0.074), (42, 0.031), (43, 0.011), (44, -0.018), (45, -0.009), (46, 0.058), (47, 0.093), (48, 0.005), (49, -0.032)]
simIndex simValue paperId paperTitle
same-paper 1 0.96858251 46 nips-2002-Boosting Density Estimation
Author: Saharon Rosset, Eran Segal
Abstract: Several authors have suggested viewing boosting as a gradient descent search for a good fit in function space. We apply gradient-based boosting methodology to the unsupervised learning problem of density estimation. We show convergence properties of the algorithm and prove that a strength of weak learnability property applies to this problem as well. We illustrate the potential of this approach through experiments with boosting Bayesian networks to learn density models.
2 0.80827451 181 nips-2002-Self Supervised Boosting
Author: Max Welling, Richard S. Zemel, Geoffrey E. Hinton
Abstract: Boosting algorithms and successful applications thereof abound for classification and regression learning problems, but not for unsupervised learning. We propose a sequential approach to adding features to a random field model by training them to improve classification performance between the data and an equal-sized sample of “negative examples” generated from the model’s current estimate of the data density. Training in each boosting round proceeds in three stages: first we sample negative examples from the model’s current Boltzmann distribution. Next, a feature is trained to improve classification performance between data and negative examples. Finally, a coefficient is learned which determines the importance of this feature relative to ones already in the pool. Negative examples only need to be generated once to learn each new feature. The validity of the approach is demonstrated on binary digits and continuous synthetic data.
3 0.68527836 69 nips-2002-Discriminative Learning for Label Sequences via Boosting
Author: Yasemin Altun, Thomas Hofmann, Mark Johnson
Abstract: This paper investigates a boosting approach to discriminative learning of label sequences based on a sequence rank loss function. The proposed method combines many of the advantages of boosting schemes with the efficiency of dynamic programming methods and is attractive both conceptually and computationally. In addition, we also discuss alternative approaches based on the Hamming loss for label sequences. The sequence boosting algorithm offers an interesting alternative to methods based on HMMs and the more recently proposed Conditional Random Fields. Application areas for the presented technique range from natural language processing and information extraction to computational biology. We include experiments on named entity recognition and part-of-speech tagging which demonstrate the validity and competitiveness of our approach.
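A generic illustration of the dynamic-programming ingredient: Viterbi decoding of the best competing label sequence and an exponential rank-loss weight computed from it. This is only one plausible form of such a loss, using a single top competitor rather than the paper's full sequence rank loss; the scores and transition matrix below are random placeholders.

```python
import numpy as np

def viterbi(node_scores, trans):
    """Best label sequence under node scores plus transition scores (dynamic programming)."""
    T, K = node_scores.shape
    delta = node_scores[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + trans + node_scores[t][None, :]   # K x K candidate scores
        back[t] = np.argmax(cand, axis=0)
        delta = np.max(cand, axis=0)
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(np.max(delta))

def sequence_score(node_scores, trans, y):
    s = node_scores[0, y[0]]
    for t in range(1, len(y)):
        s += trans[y[t - 1], y[t]] + node_scores[t, y[t]]
    return float(s)

# One boosting-style weight under an exponential sequence rank loss: the example is
# up-weighted when a competing sequence outscores the true one.
rng = np.random.default_rng(2)
K, T = 3, 6
node_scores = rng.normal(size=(T, K))      # stand-in for the current ensemble's per-position scores
trans = rng.normal(scale=0.1, size=(K, K))
y_true = rng.integers(0, K, size=T).tolist()
y_hat, s_hat = viterbi(node_scores, trans)
s_true = sequence_score(node_scores, trans, y_true)
weight = np.exp(s_hat - s_true) if y_hat != y_true else 0.0   # a second-best decode would be used otherwise
print("best competing sequence:", y_hat, " rank-loss weight:", round(float(weight), 3))
```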
4 0.58164012 92 nips-2002-FloatBoost Learning for Classification
Author: Stan Z. Li, Zhenqiu Zhang, Heung-yeung Shum, Hongjiang Zhang
Abstract: AdaBoost [3] minimizes an upper error bound which is an exponential function of the margin on the training set [14]. However, the ultimate goal in applications of pattern classification is always minimum error rate. On the other hand, AdaBoost needs an effective procedure for learning weak classifiers, which by itself is difficult especially for high dimensional data. In this paper, we present a novel procedure, called FloatBoost, for learning a better boosted classifier. FloatBoost uses a backtrack mechanism after each iteration of AdaBoost to remove weak classifiers which cause higher error rates. The resulting float-boosted classifier consists of fewer weak classifiers yet achieves lower error rates than AdaBoost in both training and test. We also propose a statistical model for learning weak classifiers, based on a stagewise approximation of the posterior using an overcomplete set of scalar features. Experimental comparisons of FloatBoost and AdaBoost are provided through a difficult classification problem, face detection, where the goal is to learn from training examples a highly nonlinear classifier to differentiate between face and nonface patterns in a high dimensional space. The results clearly demonstrate the promises made by FloatBoost over AdaBoost.
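A hedged sketch of the forward-add/backtrack-remove loop the abstract describes, using decision stumps and discrete AdaBoost updates. The backtrack here simply drops any earlier weak classifier whose removal lowers the training error, which simplifies the paper's conditional-exclusion criterion; names and data are illustrative.

```python
import numpy as np

def train_stump(X, y, w):
    """Best single-feature threshold classifier under weights w (the weak learner)."""
    n, d = X.shape
    best = (np.inf, 0, 0.0, 1)                      # (weighted error, feature, threshold, polarity)
    for j in range(d):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = s * np.sign(X[:, j] - t + 1e-12)
                err = np.sum(w[pred != y])
                if err < best[0]:
                    best = (err, j, t, s)
    _, j, t, s = best
    return lambda Z, j=j, t=t, s=s: s * np.sign(Z[:, j] - t + 1e-12)

def ensemble_error(members, X, y):
    F = sum(a * h(X) for a, h in members)
    return float(np.mean(np.sign(F) != y))

def floatboost(X, y, rounds=10):
    members = []
    for _ in range(rounds):
        # forward step: ordinary AdaBoost update
        F = sum((a * h(X) for a, h in members), np.zeros(len(y)))
        w = np.exp(-y * F); w /= w.sum()
        h = train_stump(X, y, w)
        err = np.sum(w[h(X) != y])
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        members.append((alpha, h))
        # backtrack step: drop earlier members whose removal lowers the training error
        improved = True
        while improved and len(members) > 1:
            improved = False
            current = ensemble_error(members, X, y)
            for i in range(len(members) - 1):       # never remove the member just added
                trial = members[:i] + members[i + 1:]
                if ensemble_error(trial, X, y) < current:
                    members, improved = trial, True
                    break
    return members

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=200))
members = floatboost(X, y)
print("training error:", ensemble_error(members, X, y), " members kept:", len(members))
```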
5 0.36198607 120 nips-2002-Kernel Design Using Boosting
Author: Koby Crammer, Joseph Keshet, Yoram Singer
Abstract: The focus of the paper is the problem of learning kernel operators from empirical data. We cast the kernel design problem as the construction of an accurate kernel from simple (and less accurate) base kernels. We use the boosting paradigm to perform the kernel construction process. To do so, we modify the booster so as to accommodate kernel operators. We also devise an efficient weak-learner for simple kernels that is based on generalized eigenvector decomposition. We demonstrate the effectiveness of our approach on synthetic data and on the USPS dataset. On the USPS dataset, the performance of the Perceptron algorithm with learned kernels is systematically better than a fixed RBF kernel.
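A simplified sketch of kernel construction by boosting over pairs of labelled instances, as summarized in the abstract: base kernels are rank-one outer products, pairs are reweighted under an exponential alignment loss, and each base kernel receives a coefficient of the usual 0.5*log(W+/W-) form. The base learner below just picks the best single unlabelled direction instead of solving a generalized eigenvector problem; all names are illustrative.

```python
import numpy as np

def pairwise_boost_kernel(X, y, X_unlab, rounds=20):
    """Build K = sum_t alpha_t K_t from rank-one base kernels by boosting an exponential
    alignment loss exp(-y_i y_j K(x_i, x_j)) over all pairs of labelled points."""
    m = len(y)
    Y = np.outer(y, y)                                    # target sign for every pair
    K = np.zeros((m, m))
    learned = []                                          # (coefficient, direction) pairs
    cand = X_unlab / np.linalg.norm(X_unlab, axis=1, keepdims=True)   # candidate directions
    for _ in range(rounds):
        D = np.exp(np.clip(-Y * K, -50, 50)); D /= D.sum()    # pair weights, clipped for stability
        proj = X @ cand.T                                     # projections of labelled points
        # score of each rank-one kernel: sum_ij D_ij y_i y_j (x_i.w)(x_j.w)
        scores = np.einsum('ij,ik,jk->k', D * Y, proj, proj)
        w = cand[int(np.argmax(scores))]
        Kt = np.outer(X @ w, X @ w)                           # base kernel K_w(x_i, x_j)
        agree = Y * Kt
        Wp = np.sum(D[agree > 0] * np.abs(Kt[agree > 0]))
        Wm = np.sum(D[agree < 0] * np.abs(Kt[agree < 0]))
        alpha = 0.5 * np.log((Wp + 1e-12) / (Wm + 1e-12))
        K += alpha * Kt
        learned.append((alpha, w))
    return learned, K

rng = np.random.default_rng(4)
y = np.repeat([1.0, -1.0], 50)
X = rng.normal(size=(100, 20)); X[:, 0] += 2.0 * y                      # one informative direction
X_unlab = rng.normal(size=(60, 20)); X_unlab[:, 0] += 2.0 * np.sign(rng.normal(size=60))
learned, K = pairwise_boost_kernel(X, y, X_unlab)
print("fraction of correctly signed pairs:", float(np.mean(np.sign(K) == np.outer(y, y))))
```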
6 0.30041426 140 nips-2002-Margin Analysis of the LVQ Algorithm
7 0.29337975 58 nips-2002-Conditional Models on the Ranking Poset
8 0.26647073 45 nips-2002-Boosted Dyadic Kernel Discriminants
9 0.25636142 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture
10 0.23518591 109 nips-2002-Improving a Page Classifier with Anchor Extraction and Link Analysis
11 0.23348975 111 nips-2002-Independent Components Analysis through Product Density Estimation
12 0.22758345 18 nips-2002-Adaptation and Unsupervised Learning
13 0.22486652 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
14 0.22142634 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
15 0.21806714 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction
16 0.21701416 138 nips-2002-Manifold Parzen Windows
17 0.21626839 7 nips-2002-A Hierarchical Bayesian Markovian Model for Motifs in Biopolymer Sequences
18 0.21605785 99 nips-2002-Graph-Driven Feature Extraction From Microarray Data Using Diffusion Kernels and Kernel CCA
19 0.21289451 40 nips-2002-Bayesian Models of Inductive Generalization
20 0.1984033 179 nips-2002-Scaling of Probability-Based Optimization Algorithms
topicId topicWeight
[(11, 0.022), (14, 0.011), (23, 0.019), (38, 0.026), (42, 0.175), (54, 0.131), (55, 0.043), (57, 0.013), (67, 0.025), (68, 0.019), (69, 0.13), (74, 0.08), (87, 0.012), (92, 0.048), (98, 0.166)]
simIndex simValue paperId paperTitle
same-paper 1 0.91544068 46 nips-2002-Boosting Density Estimation
Author: Saharon Rosset, Eran Segal
Abstract: Several authors have suggested viewing boosting as a gradient descent search for a good fit in function space. We apply gradient-based boosting methodology to the unsupervised learning problem of density estimation. We show convergence properties of the algorithm and prove that a strength of weak learnability property applies to this problem as well. We illustrate the potential of this approach through experiments with boosting Bayesian networks to learn density models.
2 0.87266314 79 nips-2002-Evidence Optimization Techniques for Estimating Stimulus-Response Functions
Author: Maneesh Sahani, Jennifer F. Linden
Abstract: An essential step in understanding the function of sensory nervous systems is to characterize as accurately as possible the stimulus-response function (SRF) of the neurons that relay and process sensory information. One increasingly common experimental approach is to present a rapidly varying complex stimulus to the animal while recording the responses of one or more neurons, and then to directly estimate a functional transformation of the input that accounts for the neuronal firing. The estimation techniques usually employed, such as Wiener filtering or other correlation-based estimation of the Wiener or Volterra kernels, are equivalent to maximum likelihood estimation in a Gaussian-output-noise regression model. We explore the use of Bayesian evidence-optimization techniques to condition these estimates. We show that by learning hyperparameters that control the smoothness and sparsity of the transfer function it is possible to improve dramatically the quality of SRF estimates, as measured by their success in predicting responses to novel input.
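A compact sketch of evidence (marginal-likelihood) optimization for a linear-Gaussian receptive-field model, using the standard ARD fixed-point updates so that one precision hyperparameter per weight induces sparsity; the paper's smoothness-inducing priors differ, and the variable names are illustrative.

```python
import numpy as np

def evidence_ard(X, y, iters=100):
    """Evidence optimization for a linear-Gaussian model with one precision hyperparameter
    per weight (ARD); large precisions prune weights tied to irrelevant inputs."""
    n, d = X.shape
    alpha = np.ones(d)             # prior precision of each weight
    beta = 1.0 / np.var(y)         # output-noise precision
    for _ in range(iters):
        Sigma = np.linalg.inv(np.diag(alpha) + beta * X.T @ X)    # posterior covariance
        mu = beta * Sigma @ X.T @ y                               # posterior mean (the SRF estimate)
        gamma = 1.0 - alpha * np.diag(Sigma)                      # effective d.o.f. per weight
        alpha = gamma / (mu ** 2 + 1e-12)                         # MacKay-style fixed-point updates
        beta = (n - gamma.sum()) / (np.sum((y - X @ mu) ** 2) + 1e-12)
    return mu, alpha, beta

rng = np.random.default_rng(5)
n, d = 500, 40
w_true = np.zeros(d); w_true[:5] = rng.normal(size=5)             # only 5 relevant inputs
X = rng.normal(size=(n, d))
y = X @ w_true + 0.5 * rng.normal(size=n)
mu, alpha, beta = evidence_ard(X, y)
ls = np.linalg.lstsq(X, y, rcond=None)[0]
print("MSE(evidence):", round(float(np.mean((mu - w_true) ** 2)), 4),
      " MSE(least squares):", round(float(np.mean((ls - w_true) ** 2)), 4))
```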
3 0.86919647 138 nips-2002-Manifold Parzen Windows
Author: Pascal Vincent, Yoshua Bengio
Abstract: The similarity between objects is a fundamental element of many learning algorithms. Most non-parametric methods take this similarity to be fixed, but much recent work has shown the advantages of learning it, in particular to exploit the local invariances in the data or to capture the possibly non-linear manifold on which most of the data lies. We propose a new non-parametric kernel density estimation method which captures the local structure of an underlying manifold through the leading eigenvectors of regularized local covariance matrices. Experiments in density estimation show significant improvements with respect to Parzen density estimators. The density estimators can also be used within Bayes classifiers, yielding classification rates similar to SVMs and much superior to the Parzen classifier.
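A minimal sketch of the estimator the abstract describes: each training point contributes a Gaussian whose covariance keeps the leading principal directions of its regularized local neighbourhood covariance. The neighbourhood size, retained dimension, and noise level below are arbitrary illustrative choices, not the paper's settings.

```python
import numpy as np

def manifold_parzen(X, k=10, d_keep=1, sigma2=0.05):
    """Per-point Gaussians whose covariances keep the d_keep leading local principal
    directions (estimated from the k nearest neighbours) plus isotropic noise sigma2."""
    n, d = X.shape
    means, covs = [], []
    for i in range(n):
        dists = np.sum((X - X[i]) ** 2, axis=1)
        nbrs = X[np.argsort(dists)[1:k + 1]] - X[i]          # local neighbourhood, centred
        C = nbrs.T @ nbrs / k
        vals, vecs = np.linalg.eigh(C)
        V = vecs[:, -d_keep:]                                # leading local directions
        covs.append(V @ np.diag(vals[-d_keep:]) @ V.T + sigma2 * np.eye(d))
        means.append(X[i])
    def density(z):
        p = 0.0
        for mu, S in zip(means, covs):
            diff = z - mu
            p += np.exp(-0.5 * diff @ np.linalg.solve(S, diff)) / \
                 np.sqrt((2 * np.pi) ** d * np.linalg.det(S))
        return p / n
    return density

rng = np.random.default_rng(6)
t = rng.uniform(0, 2 * np.pi, 300)
X = np.c_[np.cos(t), np.sin(t)] + 0.02 * rng.normal(size=(300, 2))   # noisy circle (1-D manifold)
p = manifold_parzen(X, k=10, d_keep=1)
print("on-manifold density:", round(p(np.array([1.0, 0.0])), 3),
      " off-manifold density:", round(p(np.array([0.0, 0.0])), 5))
```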
4 0.85738117 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
Author: Max Welling, Simon Osindero, Geoffrey E. Hinton
Abstract: We propose a model for natural images in which the probability of an image is proportional to the product of the probabilities of some filter outputs. We encourage the system to find sparse features by using a Studentt distribution to model each filter output. If the t-distribution is used to model the combined outputs of sets of neurally adjacent filters, the system learns a topographic map in which the orientation, spatial frequency and location of the filters change smoothly across the map. Even though maximum likelihood learning is intractable in our model, the product form allows a relatively efficient learning procedure that works well even for highly overcomplete sets of filters. Once the model has been learned it can be used as a prior to derive the “iterated Wiener filter” for the purpose of denoising images.
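A few lines showing the unnormalized log-probability of a product of Student-t experts on filter outputs, which is the model form the abstract describes; the filters here are random placeholders rather than the learned, topographically organized ones.

```python
import numpy as np

def pot_log_prob(x, W, alpha):
    """Unnormalized log-probability of a Product of Student-t experts: each filter output
    W[i] . x is modelled with a heavy-tailed Student-t factor (1 + 0.5 s^2)^(-alpha_i)."""
    s = W @ x                                       # filter outputs
    return -np.sum(alpha * np.log1p(0.5 * s ** 2))

rng = np.random.default_rng(7)
d, n_filters = 8, 16
W = rng.normal(size=(n_filters, d))                 # filters (learned in the paper; random here)
alpha = np.ones(n_filters)                          # per-expert shape parameters
x_sparse = np.zeros(d); x_sparse[0] = 0.3           # input with mostly silent filter responses
x_dense = rng.normal(size=d)
print("log p (unnorm), sparse responses:", round(float(pot_log_prob(x_sparse, W, alpha)), 2))
print("log p (unnorm), dense responses: ", round(float(pot_log_prob(x_dense, W, alpha)), 2))
```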
5 0.85644138 65 nips-2002-Derivative Observations in Gaussian Process Models of Dynamic Systems
Author: E. Solak, R. Murray-smith, W. E. Leithead, D. J. Leith, Carl E. Rasmussen
Abstract: Gaussian processes provide an approach to nonparametric modelling which allows a straightforward combination of function and derivative observations in an empirical model. This is of particular importance in identification of nonlinear dynamic systems from experimental data. 1) It allows us to combine derivative information, and associated uncertainty with normal function observations into the learning and inference process. This derivative information can be in the form of priors specified by an expert or identified from perturbation data close to equilibrium. 2) It allows a seamless fusion of multiple local linear models in a consistent manner, inferring consistent models and ensuring that integrability constraints are met. 3) It improves dramatically the computational efficiency of Gaussian process models for dynamic system identification, by summarising large quantities of near-equilibrium data by a handful of linearisations, reducing the training set size – traditionally a problem for Gaussian process models.
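A 1-D sketch of mixing function and derivative observations in a single Gaussian process, assuming a squared-exponential kernel whose cross-covariances with derivatives are obtained by differentiating the kernel; the paper's multivariate, dynamic-system setting and linearisation summaries are beyond this toy example.

```python
import numpy as np

def se_blocks(xa, xb, s2=1.0, l=1.0):
    """Covariance blocks of a squared-exponential GP between function values (f) and
    derivatives (df/dx) at 1-D inputs xa, xb."""
    r = xa[:, None] - xb[None, :]
    k = s2 * np.exp(-r ** 2 / (2 * l ** 2))
    k_f_df = k * r / l ** 2                 # cov(f(xa), f'(xb))
    k_df_f = -k * r / l ** 2                # cov(f'(xa), f(xb))
    k_df_df = k * (1.0 / l ** 2 - r ** 2 / l ** 4)
    return k, k_f_df, k_df_f, k_df_df

# toy system: f(x) = sin(x); noisy function observations plus noisy derivative observations
rng = np.random.default_rng(8)
xf = np.array([0.5, 2.0, 4.0]);  yf = np.sin(xf) + 0.01 * rng.normal(size=3)
xd = np.array([1.0, 3.0, 5.0]);  yd = np.cos(xd) + 0.01 * rng.normal(size=3)   # observed slopes
xs = np.linspace(0.0, 6.0, 7)

kff = se_blocks(xf, xf)[0]
kfd = se_blocks(xf, xd)[1]
kdf = se_blocks(xd, xf)[2]
kdd = se_blocks(xd, xd)[3]
K = np.block([[kff, kfd], [kdf, kdd]]) + 1e-4 * np.eye(6)     # joint covariance of all observations
ks_f = se_blocks(xs, xf)[0]
ks_d = se_blocks(xs, xd)[1]
K_star = np.hstack([ks_f, ks_d])
mean = K_star @ np.linalg.solve(K, np.concatenate([yf, yd]))
print(np.round(np.c_[xs, mean, np.sin(xs)], 3))               # test inputs, prediction, ground truth
```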
6 0.85370398 181 nips-2002-Self Supervised Boosting
7 0.85138094 41 nips-2002-Bayesian Monte Carlo
8 0.84851158 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
9 0.84831005 110 nips-2002-Incremental Gaussian Processes
10 0.84470737 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
11 0.84278244 3 nips-2002-A Convergent Form of Approximate Policy Iteration
12 0.84139895 159 nips-2002-Optimality of Reinforcement Learning Algorithms with Linear Function Approximation
13 0.83457923 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
14 0.83361417 169 nips-2002-Real-Time Particle Filters
15 0.83284819 22 nips-2002-Adaptive Nonlinear System Identification with Echo State Networks
16 0.8272863 143 nips-2002-Mean Field Approach to a Probabilistic Model in Information Retrieval
17 0.82410699 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
18 0.8216868 115 nips-2002-Informed Projections
19 0.82026494 14 nips-2002-A Probabilistic Approach to Single Channel Blind Signal Separation
20 0.81773293 53 nips-2002-Clustering with the Fisher Score