nips nips2002 nips2002-181 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Max Welling, Richard S. Zemel, Geoffrey E. Hinton
Abstract: Boosting algorithms and successful applications thereof abound for classification and regression learning problems, but not for unsupervised learning. We propose a sequential approach to adding features to a random field model by training them to improve classification performance between the data and an equal-sized sample of “negative examples” generated from the model’s current estimate of the data density. Training in each boosting round proceeds in three stages: first we sample negative examples from the model’s current Boltzmann distribution. Next, a feature is trained to improve classification performance between data and negative examples. Finally, a coefficient is learned which determines the importance of this feature relative to ones already in the pool. Negative examples only need to be generated once to learn each new feature. The validity of the approach is demonstrated on binary digits and continuous synthetic data.
Reference: text
sentIndex sentText sentNum sentScore
1 Hinton, Department of Computer Science, University of Toronto, 10 King’s College Road, Toronto, M5S 3G5, Canada. Abstract: Boosting algorithms and successful applications thereof abound for classification and regression learning problems, but not for unsupervised learning. [sent-3, score-0.12]
2 We propose a sequential approach to adding features to a random field model by training them to improve classification performance between the data and an equal-sized sample of “negative examples” generated from the model’s current estimate of the data density. [sent-4, score-0.399]
3 Training in each boosting round proceeds in three stages: first we sample negative examples from the model’s current Boltzmann distribution. [sent-5, score-1.085]
4 Next, a feature is trained to improve classification performance between data and negative examples. [sent-6, score-0.435]
5 Finally, a coefficient is learned which determines the importance of this feature relative to ones already in the pool. [sent-7, score-0.188]
6 Negative examples only need to be generated once to learn each new feature. [sent-8, score-0.156]
7 1 Introduction. While researchers have developed and successfully applied a myriad of boosting algorithms for classification and regression problems, boosting for density estimation has received relatively scant attention. [sent-10, score-0.969]
8 One can imagine that the initial features, or weak learners, could model the rough outlines of the data density, and more detailed carving of the density landscape could occur on each successive round. [sent-12, score-0.171]
9 Ideally, the algorithm would achieve automatic model selection, determining the requisite number of weak learners on its own. [sent-13, score-0.167]
10 It has proven difficult to formulate an objective for such a system, under which the weights on examples, and the objective for training a weak learner at each round have a natural gradient-descent interpretation as in standard boosting algorithms [10] [7]. [sent-14, score-1.041]
11 A key idea in our algorithm is that unsupervised learning can be converted into supervised learning by using the model’s imperfect current estimate of the data to generate negative examples. [sent-16, score-0.539]
12 We take the idea a step further here by training a weak learner to discriminate between the positive examples from the original data and the negative examples generated by sampling from the current density estimate. [sent-18, score-1.026]
13 This new weak learner minimizes a simple additive logistic loss function [2]. [sent-19, score-0.387]
14 Our algorithm obtains an important advantage over sampling-based, unsupervised methods that learn features in parallel. [sent-20, score-0.187]
15 Parallel-update methods require a new sample after each iteration of parameter changes, in order to reflect the current model’s estimate of the data density. [sent-21, score-0.121]
16 We improve on this by using one sample per boosting round, to fit one weak learner. [sent-22, score-0.631]
17 The justification for this approach comes from the proposal that, for stagewise additive models, boosting can be considered as gradient-descent in function space, so the new learner can simply optimize its inner product with the gradient of the objective in function space [3]. [sent-23, score-0.731]
18 Unlike other attempts at “unsupervised boosting” [9], where at each round a new component distribution is added to a mixture model, our approach will add features in the log-domain and as such learns a product model. [sent-24, score-0.3]
19 In these applications, the features are typically not learned; instead the algorithms greedily select at each round the most informative feature from a large set of pre-enumerated features. [sent-27, score-0.388]
20 We furthermore assume that the energy is additive. [sent-31, score-0.375]
21 (2) $E(x) = \sum_i \alpha_i f_i(x; w_i)$, where $\alpha_i$ are the coefficients of the features and each feature $f_i$ may depend on its own parameters $w_i$. The model described above is very similar to an “additive random field”, otherwise known as a “maximum entropy model”. [sent-33, score-0.318]
22 The key difference is that we allow each feature to be flexible through its dependence on its own parameters. [sent-34, score-0.138]
23 A I Learning in random fields may proceed by performing gradient ascent on the loglikelihood: V (3) Y § ©s 7V ¨ § © r p h0 qi2 g b 7¨ Y© fXV V W XV e c ¦db a0 Y ` & Y XV where is a data-vector and is some arbitrary parameter that we want to learn. [sent-35, score-0.191]
24 This equation makes explicit the main philosophy behind learning in random fields: the energy of states “occupied” by data is lowered (weighted by $1/N$, with $N$ the number of data-vectors) while the energy of all states is raised (weighted by the model probability $p(x)$). [sent-36, score-1.038]
25 Since there are usually an exponential number of states in the system, the second term is often approximated by a sample from $p(x)$. [sent-37, score-0.132]
26 To reduce sampling noise a relatively large sample is necessary and moreover, it must be drawn each time we compute gradients. [sent-38, score-0.145]
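The cost of re-sampling for every gradient can be made concrete on a state space small enough to enumerate. The sketch below is illustrative only: the four-state model, the single-feature energy E(x) = theta * f(x), and all function names are assumptions, not taken from the paper. It computes the log-likelihood gradient exactly and then with the model term replaced by a Monte Carlo average over a sample from p(x).

```python
import math
import random

def boltzmann(energies):
    """Return p(x) proportional to exp(-E(x)) over an enumerable state space."""
    lowest = min(energies)  # shift energies for numerical stability
    unnorm = [math.exp(-(e - lowest)) for e in energies]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def log_likelihood(data, feature, theta):
    """Average log p(x_n) for the toy energy E(x) = theta * f(x)."""
    probs = boltzmann([theta * f for f in feature])
    return sum(math.log(probs[x]) for x in data) / len(data)

def exact_gradient(data, feature, theta):
    """dL/dtheta: minus the data average of f plus the model average of f."""
    probs = boltzmann([theta * f for f in feature])
    data_term = sum(feature[x] for x in data) / len(data)
    model_term = sum(p * f for p, f in zip(probs, feature))
    return -data_term + model_term

def sampled_gradient(data, feature, theta, n_samples, rng):
    """Same gradient, with the model average estimated from a sample of p(x)."""
    probs = boltzmann([theta * f for f in feature])
    states = list(range(len(feature)))
    sample = rng.choices(states, weights=probs, k=n_samples)
    data_term = sum(feature[x] for x in data) / len(data)
    model_term = sum(feature[x] for x in sample) / n_samples
    return -data_term + model_term

feature = [0.0, 1.0, 2.0, 3.0]  # hypothetical feature values f(x) for states 0..3
data = [0, 0, 1, 2]             # hypothetical observed states
theta = 0.3
g_exact = exact_gradient(data, feature, theta)
g_noisy = sampled_gradient(data, feature, theta, 5000, random.Random(0))
```

In a parallel-update scheme the sample inside `sampled_gradient` would have to be redrawn after every change to `theta`, which is the expense the text refers to.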
27 Iterative scaling methods have been developed for models that do not include adaptive feature parameters but instead train only the coefficients [8]. [sent-40, score-0.138]
28 These methods make more efficient use of the samples than gradient ascent, but they only minimize a loose bound on the cost function and their terminal convergence can be slow. [sent-41, score-0.121]
29 1 Finding New Features. In [7], boosting is reinterpreted as functional gradient descent on a loss function. [sent-45, score-0.563]
30 Using the log-likelihood as a negative loss function, this idea can be used to find features for additive random field models. [sent-46, score-0.458]
31 Consider a change in the energy by adding an infinitesimal multiple of a feature. [sent-47, score-0.491]
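32 The optimal feature is then the one that provides the maximal increase in log-likelihood, i.e. [sent-48, score-0.138]
33 the feature that maximizes the second term of the first-order expansion $L[E + \epsilon f] = L[E] + \epsilon \left.\frac{\partial L}{\partial \epsilon}\right|_{\epsilon=0} + O(\epsilon^2)$ (4). Using Eqn. 3 with $\theta = \epsilon$, we rewrite the second term as $\left.\frac{\partial L}{\partial \epsilon}\right|_{\epsilon=0} = -\frac{1}{N}\sum_{n=1}^{N} f(x_n) + \sum_{x} p(x) f(x)$ (5), where $x_n$ are the $N$ data-vectors and $p(x)$ is the current model distribution. [sent-50, score-0.138]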
34 In order to maximize this derivative, the feature should therefore be small at the data and large at all other states. [sent-52, score-0.138]
35 It is however important to realize that the norm of the feature must be bounded, since otherwise [sent-53, score-0.213]
36 the derivative can be made arbitrarily large by simply increasing the length of the feature. Because the total number of possible states of a model is often exponentially large, the second term of Eqn. [sent-54, score-0.164]
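37 5 must be approximated using samples from $p(x)$: $\left.\frac{\partial L}{\partial \epsilon}\right|_{\epsilon=0} \approx -\frac{1}{N}\sum_{n=1}^{N} f(x_n) + \frac{1}{N}\sum_{m=1}^{N} f(\tilde{x}_m)$ (6), where $x_n$ are the data-vectors and $\tilde{x}_m$ are the samples. These samples, or “negative examples”, inform us about the states that are likely under the current model. [sent-55, score-0.146]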
38 Intuitively, because the model is imperfect, we would like to move its density estimate away from these samples and towards the actual data. [sent-56, score-0.121]
39 By labelling the data and the negative examples with opposite class labels, we can map this to a supervised problem where a new feature is a classifier. [sent-57, score-0.569]
40 Since a good classifier is negative at the data and positive at the negative examples (so we can use its sign to discriminate them), adding its output to the total energy will lower the energy at states where there are data and raise it at states where there are negative examples. [sent-58, score-1.911]
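The sampled derivative can be used directly to rank candidate features: a candidate scores well when it is low on the data and high on the negative examples. The sketch below is a minimal illustration under stated assumptions. The one-dimensional data, the three candidates, and the use of tanh to keep each feature bounded are all hypothetical choices, not the authors' weak learner.

```python
import math

def feature_score(feature, data, negatives):
    """Sampled derivative of the log-likelihood when adding `feature` to the energy:
    average feature value on negative examples minus average value on data."""
    on_data = sum(feature(x) for x in data) / len(data)
    on_negatives = sum(feature(x) for x in negatives) / len(negatives)
    return on_negatives - on_data

def bounded(raw):
    """Squash a raw score with tanh so that |f(x)| <= 1 (one way to bound the norm)."""
    return lambda x: math.tanh(raw(x))

# Hypothetical one-dimensional data and negative examples.
data = [-1.2, -0.8, -1.0, -0.9]
negatives = [0.9, 1.1, 1.3, 0.7]

candidates = {
    "rises_to_the_right": bounded(lambda x: 2.0 * x),
    "falls_to_the_right": bounded(lambda x: -2.0 * x),
    "even_bump": bounded(lambda x: x * x - 1.0),
}
scores = {name: feature_score(f, data, negatives) for name, f in candidates.items()}
best = max(scores, key=scores.get)
```

Without the bound, any candidate with a positive score could be scaled up to score arbitrarily well, which is why the norm has to be controlled.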
41 The main difference with supervised boosting is that the negative examples change at every round. [sent-59, score-0.899]
42 2 Weighting the Data. It has been observed [6] that boosting algorithms can outperform classification algorithms that maximize log-likelihood. [sent-61, score-0.421]
43 This has motivated us to use the logistic loss function from the boosting literature for training new features. [sent-62, score-0.604]
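44 (7) $\mathrm{Loss} = \frac{1}{N}\sum_{n \in \mathrm{data}} \log\left(1 + e^{E(x_n)}\right) + \frac{1}{N}\sum_{m \in \mathrm{neg}} \log\left(1 + e^{-E(\tilde{x}_m)}\right)$, where the sums run over data $x_n$ and negative examples $\tilde{x}_m$, respectively. [sent-64, score-0.372]
45 Perturbing the energy in the negative loss function by adding an infinitesimal multiple of a new feature, $E \rightarrow E + \epsilon f$, and computing the derivative w.r.t. $\epsilon$, [sent-65, score-0.775]
46 we derive the following cost function for adding a new feature: (8) $-\frac{1}{N}\sum_{n \in \mathrm{data}} p_n f(x_n) + \frac{1}{N}\sum_{m \in \mathrm{neg}} q_m f(\tilde{x}_m)$, with weights $p_n = 1/(1 + e^{-E(x_n)})$ and $q_m = 1/(1 + e^{E(\tilde{x}_m)})$. The main difference with Eqn. [sent-68, score-0.285]
47 6 is the weights on data and negative examples, which give poorly “classified” examples (data with very high energy and negative examples with very low energy) a stronger vote in changes to the energy surface. [sent-69, score-1.359]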
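The weighting can be sketched numerically. The code below assumes that the logistic loss takes the form log(1 + exp(E)) on data and log(1 + exp(-E)) on negative examples, which matches the verbal description (high-energy data and low-energy negatives count as poorly classified) but is an assumption about the exact form used in the paper. All function names are hypothetical.

```python
import math

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)
    return z / (1.0 + z)

def softplus(t):
    """Numerically stable log(1 + exp(t))."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))

def logistic_loss(data_energies, negative_energies):
    """Assumed loss: penalise high energy on data and low energy on negatives."""
    on_data = sum(softplus(e) for e in data_energies) / len(data_energies)
    on_neg = sum(softplus(-e) for e in negative_energies) / len(negative_energies)
    return on_data + on_neg

def example_weights(data_energies, negative_energies):
    """Weights in [0, 1]: large for high-energy data and for low-energy negatives."""
    data_w = [sigmoid(e) for e in data_energies]
    neg_w = [sigmoid(-e) for e in negative_energies]
    return data_w, neg_w

def weighted_feature_score(f_data, f_neg, data_w, neg_w):
    """Derivative of the negative loss when an infinitesimal multiple of f is added."""
    on_data = sum(w * f for w, f in zip(data_w, f_data)) / len(f_data)
    on_neg = sum(w * f for w, f in zip(neg_w, f_neg)) / len(f_neg)
    return on_neg - on_data

data_w, neg_w = example_weights([3.0, -3.0], [3.0, -3.0])
```

Here `data_w[0]` (a data point sitting at high energy) is close to 1, while `data_w[1]` (a data point already at low energy) is close to 0, so the first point dominates the search for the next feature.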
48 The extra weights (which are bounded between [0,1]) will incur a certain bias w. [sent-70, score-0.112]
49 However, it is expected that the extra effort on “hard cases” will cause the algorithm to converge faster to good density models. [sent-74, score-0.115]
50 7 is a valid cost function only when the negative examples are fixed. [sent-76, score-0.372]
51 The reason is that after a change of the energy surface, the negative examples are no longer a representative sample from the Boltzmann distribution in Eqn. [sent-77, score-0.867]
52 However, as long as we re-sample the negative examples after every change in the energy we may use Eqn. [sent-79, score-0.794]
53 8 as an objective to decide what feature to add to the energy, i.e. [sent-80, score-0.171]
54 we may consider it as the derivative of some (possibly unknown) weighted log-likelihood. By analogy, we can interpret $1/(1 + e^{E(x)})$ as the probability that a certain state $x$ is occupied by a data-vector and consequently $-E(x)$ as the “margin”. [sent-82, score-0.171]
55 Note that the introduction of the weights has given meaning to the “height” of the energy surface, in contrast to the Boltzmann distribution for which only relative energy differences count. [sent-83, score-0.831]
56 In fact, as we will further explain in the next section, the height of the energy will be chosen such that the total weight on data is equal to the total weight on the negative examples. [sent-84, score-0.815]
57 3 Adding the New Feature to the Pool. According to the functional gradient interpretation, the new feature computed as described above represents the infinitesimal change in energy that maximally increases the (weighted) log-likelihood. [sent-86, score-0.642]
58 In fact, we will propose a slightly more general change in energy given by $E(x) \rightarrow E(x) + \alpha f(x) + \beta$ (9). As mentioned in the previous section, the constant $\beta$ will have no effect on the Boltzmann distribution in Eqn. [sent-88, score-0.422]
59 However, it does influence the relative total weight on data versus negative examples. [sent-90, score-0.308]
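60 The derivatives of the loss of Eqn. 7 w.r.t. the coefficient $\alpha$ and the constant $\beta$ are given by $\frac{\partial \mathrm{Loss}}{\partial \alpha} = \frac{1}{N}\left[\sum_{n \in \mathrm{data}} p_n f(x_n) - \sum_{m \in \mathrm{neg}} q_m f(\tilde{x}_m)\right]$ (10) and $\frac{\partial \mathrm{Loss}}{\partial \beta} = \frac{1}{N}\left[\sum_{n \in \mathrm{data}} p_n - \sum_{m \in \mathrm{neg}} q_m\right]$ (11), where $p_n$ and $q_m$ are the weights on data $x_n$ and negative examples $\tilde{x}_m$, evaluated at the updated energy; setting (11) to zero equalizes the total weight on data and negative examples. [sent-95, score-0.308]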
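The balancing role of the constant can be illustrated with a small root-finding sketch. It assumes logistic weights of the form sigmoid(E) on data and sigmoid(-E) on negative examples, and it finds the constant offset by bisection; the function names, the bracketing interval, and the tolerance are hypothetical choices rather than the authors' procedure.

```python
import math

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)
    return z / (1.0 + z)

def total_weights(data_energies, negative_energies, offset):
    """Total weight on data and on negatives after adding `offset` to every energy."""
    on_data = sum(sigmoid(e + offset) for e in data_energies)
    on_neg = sum(sigmoid(-(e + offset)) for e in negative_energies)
    return on_data, on_neg

def balance_offset(data_energies, negative_energies, lo=-50.0, hi=50.0, tol=1e-10):
    """Bisection for the offset at which both total weights are equal.
    The difference (data weight - negative weight) increases monotonically in the offset."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        on_data, on_neg = total_weights(data_energies, negative_energies, mid)
        if on_data > on_neg:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

data_energies = [2.0, 1.5, 3.0]      # hypothetical: data currently sits too high
negative_energies = [2.5, 4.0, 3.5]  # hypothetical energies of negative examples
offset = balance_offset(data_energies, negative_energies)
balanced = total_weights(data_energies, negative_energies, offset)
```

Shifting every energy by the same offset leaves the Boltzmann distribution unchanged, so this step only affects how the next round weights its examples.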
61 Figure 1 (horizontal axis: boosting round): (a – left). [sent-104, score-0.592]
62 Training error (lower curves) and test error (higher curves) for the weighted boosting algorithm (solid curves) and the un-weighted algorithm (dashed curves). [sent-105, score-0.551]
63 To correct for this we include importance weights on the negative examples, which are all equal at the start of the fit and are updated [sent-109, score-0.503]
64 from iteration to iteration. It is well known that in high dimensions the effective sample size of the weighted sample can rapidly become too small to be useful. [sent-111, score-0.21]
65 We therefore monitor the effective sample size, given by the squared sum of the importance weights divided by the sum of their squares, where the sums run over the negative examples only. [sent-112, score-0.445]
66 We can obtain a new set of negative examples from the updated Boltzmann distribution, reset the importance weights to be equal and resume fitting the coefficient. [sent-114, score-0.503]
67 Alternatively, we simply accept the current value of the coefficient and proceed to the next round of boosting. [sent-115, score-0.254]
68 Because we initialize the coefficient at zero in the fitting procedure, the latter approach underestimates the importance of this particular feature, which is not a problem since a similar feature can be added in the next round. [sent-116, score-0.219]
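The monitoring step can be sketched as follows. The code uses the standard importance-sampling correction for a Boltzmann distribution whose energy has changed, together with a common definition of effective sample size (squared sum of weights over sum of squared weights). The function names and the resampling threshold of one half are assumptions for illustration, not values from the paper.

```python
import math

def importance_weights(old_energies, new_energies):
    """Unnormalised weights exp(-(E_new - E_old)) for samples drawn under E_old."""
    deltas = [new - old for old, new in zip(old_energies, new_energies)]
    shift = min(deltas)  # subtract the smallest change to avoid overflow
    return [math.exp(-(d - shift)) for d in deltas]

def effective_sample_size(weights):
    """(sum of weights)^2 / (sum of squared weights); equals n for equal weights."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)

def needs_resampling(weights, min_fraction=0.5):
    """Hypothetical rule: resample once the effective size drops below a fraction of n."""
    return effective_sample_size(weights) < min_fraction * len(weights)

unchanged = importance_weights([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0])
collapsed = importance_weights([0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, -20.0])
```

With `unchanged` the effective sample size is 4, so the existing negative examples are still usable; with `collapsed` a single sample carries nearly all the weight and a fresh sample would be drawn.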
69 Each feature is parametrized by weights and a bias : (12) where the RBM is obtained by setting all . [sent-119, score-0.219]
70 One can sample from the summed energy model using straightforward Gibbs sampling, where every visible unit is sampled given all the others. [sent-120, score-0.525]
71 Alternatively, one can design a much faster mixing Markov chain by introducing hidden variables and sampling all hidden units independently given the visible units and vice versa. [sent-121, score-0.304]
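For the plain binary RBM case, the alternating scheme described above can be sketched in a few lines. This is a generic block Gibbs sampler, not the authors' code: the parameter layout (`weights[i][j]` connecting hidden unit i to visible unit j), the bias names, and the number of sweeps are assumptions made for illustration.

```python
import math
import random

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)
    return z / (1.0 + z)

def sample_hidden(visible, weights, hidden_bias, rng):
    """Sample all hidden units independently given the visible units."""
    hidden = []
    for row, bias in zip(weights, hidden_bias):
        activation = bias + sum(w * v for w, v in zip(row, visible))
        hidden.append(1 if rng.random() < sigmoid(activation) else 0)
    return hidden

def sample_visible(hidden, weights, visible_bias, rng):
    """Sample all visible units independently given the hidden units."""
    visible = []
    for j, bias in enumerate(visible_bias):
        activation = bias + sum(weights[i][j] * h for i, h in enumerate(hidden))
        visible.append(1 if rng.random() < sigmoid(activation) else 0)
    return visible

def block_gibbs(visible, weights, hidden_bias, visible_bias, n_sweeps, rng):
    """Alternate hidden-given-visible and visible-given-hidden updates."""
    for _ in range(n_sweeps):
        hidden = sample_hidden(visible, weights, hidden_bias, rng)
        visible = sample_visible(hidden, weights, visible_bias, rng)
    return visible

rng = random.Random(0)
weights = [[0.5, -0.5, 1.0], [-1.0, 0.3, 0.2]]  # 2 hidden units, 3 visible units
sample = block_gibbs([0, 1, 0], weights, [0.0, 0.0], [0.0, 0.0, 0.0], 20, rng)
```

Each sweep updates a whole layer at once, whereas the single-site chain mentioned first updates one visible unit at a time given all the others.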
72 When we fit a new feature we need to make sure its norm is controlled. [sent-131, score-0.175]
73 We used different sets of negative examples, examples each, to fit and . [sent-140, score-0.372]
74 After a new feature was added, the total energies of all “2”s and “3”s were computed under both models. [sent-141, score-0.235]
75 The energies of the training data (under both models) were used as two-dimensional features to compute a separation boundary using logistic regression, which was subsequently applied to the test data to compute the total misclassification. [sent-142, score-0.299]
76 In Figure 1a we show the total error on both training data and test data as a function of the number of features in the model. [sent-143, score-0.178]
77 The classification error after rounds of boosting for the algorithm ( weighted algorithm is about , and only very gradually increases to about after rounds of boosting. [sent-145, score-0.775]
78 This is good as compared to logistic regression ( ), k-nearest is optimal), while a parallel-trained RBM with hidden neighbors ( units achieves respectively. [sent-146, score-0.151]
79 In Figure 1b we show every feature between rounds and for both digits. [sent-148, score-0.25]
80 5 A Continuous Example: The Dimples Model. For continuous data we propose a different form of feature, which we term a dimple because of its shape in the energy domain. [sent-150, score-0.551]
81 A dimple is a mixture of a narrow Gaussian and a broad Gaussian, with a common mean (16), [sent-151, score-0.333]
82 where the mixing proportions are constant and equal. Each round of the algorithm fits the dimple parameters for a new learner. [sent-153, score-0.236]
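The shape of a dimple in the energy domain can be visualised with a one-dimensional sketch. It follows the verbal description only: an equal-proportion mixture of a narrow and a broad Gaussian with a common mean, with the energy taken as the negative log of that mixture. The exact parametrization in the paper may differ, and the function names and standard deviations below are hypothetical.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def dimple_density(x, mean, narrow_std, broad_std):
    """Equal-proportion mixture of a narrow and a broad Gaussian with a common mean."""
    return 0.5 * gaussian_pdf(x, mean, narrow_std) + 0.5 * gaussian_pdf(x, mean, broad_std)

def dimple_energy(x, mean, narrow_std, broad_std):
    """Energy contribution, taken here as the negative log of the mixture density."""
    return -math.log(dimple_density(x, mean, narrow_std, broad_std))

mean, narrow_std, broad_std = 0.0, 0.2, 3.0
profile = [(x, dimple_energy(x, mean, narrow_std, broad_std)) for x in (-5.0, -0.5, 0.0, 0.5, 5.0)]
```

The energy falls steeply within a few narrow standard deviations of the mean and rises only slowly far away, giving a sharp dent in an otherwise shallow bowl.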
83 A nice property of dimples is that they can reduce the entropy of an existing distribution by placing the dimple in a region that already has low energy, but they can also raise the entropy by putting the dimple in a high energy region [5]. [sent-154, score-1.095]
84 Sampling is again simple if all , since in that case we can use a Gibbs chain which first picks a narrow or broad Gaussian for every feature given the visible variables and then samples the visible variables from the resulting multivariate Gaussian. [sent-155, score-0.497]
85 In the low-dimensional example discussed below we implemented a simple MCMC chain with isotropic, normal proposal density which was initiated at the data-points and run for a fixed number of steps. [sent-162, score-0.172]
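A chain of this kind can be sketched as a random-walk Metropolis sampler. The code below is a minimal illustration, assuming an arbitrary energy function on lists of floats; the proposal width, the number of steps, and the quadratic test energy are hypothetical values, not those used in the experiments.

```python
import math
import random

def metropolis_negatives(energy, data, n_steps, step_std, rng):
    """Run one chain per data point, started at that point, for a fixed number of steps.
    Proposals are isotropic Gaussian perturbations of the current state."""
    negatives = []
    for start in data:
        x = list(start)
        e_x = energy(x)
        for _ in range(n_steps):
            proposal = [xi + rng.gauss(0.0, step_std) for xi in x]
            e_prop = energy(proposal)
            # Accept with probability min(1, exp(E(x) - E(proposal))).
            if e_prop <= e_x or rng.random() < math.exp(e_x - e_prop):
                x, e_x = proposal, e_prop
        negatives.append(x)
    return negatives

def quadratic_energy(x):
    """Hypothetical energy whose Boltzmann distribution is a unit Gaussian."""
    return 0.5 * sum(xi * xi for xi in x)

data = [[2.0, -2.0]] * 200
negatives = metropolis_negatives(quadratic_energy, data, 100, 1.0, random.Random(1))
```

Starting every chain at a data point yields exactly as many negative examples as data-vectors, matching the equal-sized sample the method calls for.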
86 The crosses represent the data and the dots the negative examples generated from the model. [sent-166, score-0.458]
87 The type of dimple we used in the experiment below can adapt a common mean and the inverse-variance of the small Gaussian in each dimension separately. [sent-173, score-0.176]
88 To illustrate the proposed algorithm we fit the dimples model to the two-dimensional data (crosses) shown in Figure 2a-c. [sent-178, score-0.209]
89 The first feature is an isotropic Gaussian with the mean and the variance of the data, while later features were dimples trained in the way described above. [sent-180, score-0.464]
90 Figure 2a also shows the contours of equal energy after rounds of boosting together with examples (dots) from the model. [sent-181, score-1.102]
91 A 3-dimensional plot of the negative energy surface is shown in Figure 2b. [sent-182, score-0.715]
92 The main qualitative difference between the fits in Figures 2a-b (product of dimples) and 2c-d (mixture of Gaussians) is that the first seems to produce smoother energy surfaces, only creating structure where there is structure in the data. [sent-186, score-0.375]
93 This can be understood by recalling that the role of the negative examples is precisely to remove “dips” in the energy surface where there is no data. [sent-187, score-0.821]
94 The philosophy of avoiding structure in the model that is not dictated by the data is consistent with the ideas behind maximum entropy modelling [11] and is thought to improve generalization. [sent-188, score-0.253]
95 6 Discussion. This paper discusses a boosting approach to density estimation, which we formulate as a sequential approach to training additive random field models. [sent-189, score-0.686]
96 The philosophy is to view unsupervised learning as a sequence of classification problems where the aim is to discriminate between data-vectors and negative examples generated from the current model. [sent-190, score-0.664]
97 The sampling step is usually the most time consuming operation, but it is also unavoidable since it informs the algorithm of the states whose energy is too low. [sent-191, score-0.539]
98 The proposed algorithm uses just one sample of negative examples to fit a new feature, which is very economical as compared to most non-sequential algorithms which must generate an entire new sample for every gradient update. [sent-192, score-0.633]
99 Can we improve the accuracy of the model by fitting the feature parameters and the coefficients together? [sent-197, score-0.186]
100 Does re-sampling the negative examples more frequently during learning improve the final model? [sent-198, score-0.42]
wordName wordTfidf (topN-words)
[('boosting', 0.421), ('energy', 0.375), ('negative', 0.216), ('boltzmann', 0.186), ('rbm', 0.183), ('dimple', 0.176), ('dimples', 0.176), ('round', 0.171), ('examples', 0.156), ('feature', 0.138), ('rounds', 0.112), ('cdb', 0.105), ('philosophy', 0.105), ('learner', 0.094), ('weak', 0.089), ('density', 0.082), ('gradient', 0.082), ('weights', 0.081), ('features', 0.079), ('nitesimal', 0.078), ('contrastive', 0.078), ('visible', 0.077), ('unsupervised', 0.075), ('surface', 0.074), ('logistic', 0.074), ('xv', 0.074), ('sample', 0.073), ('coef', 0.072), ('sampling', 0.072), ('additive', 0.07), ('adding', 0.069), ('entropy', 0.068), ('classi', 0.067), ('discriminate', 0.064), ('narrow', 0.064), ('weighted', 0.064), ('cients', 0.064), ('loss', 0.06), ('chain', 0.059), ('supervised', 0.059), ('states', 0.059), ('raise', 0.056), ('responsibilities', 0.056), ('imperfect', 0.056), ('derivative', 0.055), ('elds', 0.054), ('curves', 0.053), ('occupied', 0.052), ('converted', 0.052), ('digits', 0.051), ('total', 0.05), ('plot', 0.05), ('mixture', 0.05), ('importance', 0.05), ('training', 0.049), ('della', 0.049), ('current', 0.048), ('gibbs', 0.048), ('improve', 0.048), ('change', 0.047), ('pietra', 0.047), ('self', 0.047), ('energies', 0.047), ('volume', 0.046), ('tting', 0.045), ('regression', 0.045), ('dots', 0.045), ('learners', 0.045), ('broad', 0.043), ('weight', 0.042), ('gaussians', 0.042), ('ascent', 0.041), ('crosses', 0.041), ('height', 0.04), ('samples', 0.039), ('interpretation', 0.039), ('pool', 0.039), ('contours', 0.038), ('isotropic', 0.038), ('realize', 0.038), ('norm', 0.037), ('toronto', 0.036), ('proceed', 0.035), ('eld', 0.034), ('db', 0.034), ('algorithm', 0.033), ('objective', 0.033), ('random', 0.033), ('cation', 0.033), ('trained', 0.033), ('units', 0.032), ('mixing', 0.032), ('behind', 0.032), ('stanford', 0.032), ('formulate', 0.031), ('hinton', 0.031), ('initialize', 0.031), ('proposal', 0.031), ('fd', 0.031), ('bounded', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999893 181 nips-2002-Self Supervised Boosting
Author: Max Welling, Richard S. Zemel, Geoffrey E. Hinton
Abstract: Boosting algorithms and successful applications thereof abound for classification and regression learning problems, but not for unsupervised learning. We propose a sequential approach to adding features to a random field model by training them to improve classification performance between the data and an equal-sized sample of “negative examples” generated from the model’s current estimate of the data density. Training in each boosting round proceeds in three stages: first we sample negative examples from the model’s current Boltzmann distribution. Next, a feature is trained to improve classification performance between data and negative examples. Finally, a coefficient is learned which determines the importance of this feature relative to ones already in the pool. Negative examples only need to be generated once to learn each new feature. The validity of the approach is demonstrated on binary digits and continuous synthetic data.
2 0.43123102 46 nips-2002-Boosting Density Estimation
Author: Saharon Rosset, Eran Segal
Abstract: Several authors have suggested viewing boosting as a gradient descent search for a good fit in function space. We apply gradient-based boosting methodology to the unsupervised learning problem of density estimation. We show convergence properties of the algorithm and prove that a strength of weak learnability property applies to this problem as well. We illustrate the potential of this approach through experiments with boosting Bayesian networks to learn density models.
3 0.22451919 69 nips-2002-Discriminative Learning for Label Sequences via Boosting
Author: Yasemin Altun, Thomas Hofmann, Mark Johnson
Abstract: This paper investigates a boosting approach to discriminative learning of label sequences based on a sequence rank loss function. The proposed method combines many of the advantages of boosting schemes with the efficiency of dynamic programming methods and is attractive both conceptually and computationally. In addition, we also discuss alternative approaches based on the Hamming loss for label sequences. The sequence boosting algorithm offers an interesting alternative to methods based on HMMs and the more recently proposed Conditional Random Fields. Application areas for the presented technique range from natural language processing and information extraction to computational biology. We include experiments on named entity recognition and part-of-speech tagging which demonstrate the validity and competitiveness of our approach.
4 0.16313949 92 nips-2002-FloatBoost Learning for Classification
Author: Stan Z. Li, Zhenqiu Zhang, Heung-yeung Shum, Hongjiang Zhang
Abstract: AdaBoost [3] minimizes an upper error bound which is an exponential function of the margin on the training set [14]. However, the ultimate goal in applications of pattern classification is always minimum error rate. On the other hand, AdaBoost needs an effective procedure for learning weak classifiers, which by itself is difficult especially for high dimensional data. In this paper, we present a novel procedure, called FloatBoost, for learning a better boosted classifier. FloatBoost uses a backtrack mechanism after each iteration of AdaBoost to remove weak classifiers which cause higher error rates. The resulting float-boosted classifier consists of fewer weak classifiers yet achieves lower error rates than AdaBoost in both training and test. We also propose a statistical model for learning weak classifiers, based on a stagewise approximation of the posterior using an overcomplete set of scalar features. Experimental comparisons of FloatBoost and AdaBoost are provided through a difficult classification problem, face detection, where the goal is to learn from training examples a highly nonlinear classifier to differentiate between face and nonface patterns in a high dimensional space. The results clearly demonstrate the promises made by FloatBoost over AdaBoost.
5 0.16022809 120 nips-2002-Kernel Design Using Boosting
Author: Koby Crammer, Joseph Keshet, Yoram Singer
Abstract: The focus of the paper is the problem of learning kernel operators from empirical data. We cast the kernel design problem as the construction of an accurate kernel from simple (and less accurate) base kernels. We use the boosting paradigm to perform the kernel construction process. To do so, we modify the booster so as to accommodate kernel operators. We also devise an efficient weak-learner for simple kernels that is based on generalized eigen vector decomposition. We demonstrate the effectiveness of our approach on synthetic data and on the USPS dataset. On the USPS dataset, the performance of the Perceptron algorithm with learned kernels is systematically better than a fixed RBF kernel. 1 Introduction and problem Setting The last decade brought voluminous amount of work on the design, analysis and experimentation of kernel machines. Algorithm based on kernels can be used for various machine learning tasks such as classification, regression, ranking, and principle component analysis. The most prominent learning algorithm that employs kernels is the Support Vector Machines (SVM) [1, 2] designed for classification and regression. A key component in a kernel machine is a kernel operator which computes for any pair of instances their inner-product in some abstract vector space. Intuitively and informally, a kernel operator is a means for measuring similarity between instances. Almost all of the work that employed kernel operators concentrated on various machine learning problems that involved a predefined kernel. A typical approach when using kernels is to choose a kernel before learning starts. Examples to popular predefined kernels are the Radial Basis Functions and the polynomial kernels (see for instance [1]). Despite the simplicity required in modifying a learning algorithm to a “kernelized” version, the success of such algorithms is not well understood yet. 
More recently, special efforts have been devoted to crafting kernels for specific tasks such as text categorization [3] and protein classification problems [4]. Our work attempts to give a computational alternative to predefined kernels by learning kernel operators from data. We start with a few definitions. Let X be an instance space. A kernel is an inner-product operator K : X × X → . An explicit way to describe K is via a mapping φ : X → H from X to an inner-products space H such that K(x, x ) = φ(x)·φ(x ). Given a kernel operator and a finite set of instances S = {xi , yi }m , the kernel i=1 matrix (a.k.a the Gram matrix) is the matrix of all possible inner-products of pairs from S, Ki,j = K(xi , xj ). We therefore refer to the general form of K as the kernel operator and to the application of the kernel operator to a set of pairs of instances as the kernel matrix. The specific setting of kernel design we consider assumes that we have access to a base kernel learner and we are given a target kernel K manifested as a kernel matrix on a set of examples. Upon calling the base kernel learner it returns a kernel operator denote Kj . The goal thereafter is to find a weighted combination of kernels ˆ K(x, x ) = j αj Kj (x, x ) that is similar, in a sense that will be defined shortly, to ˆ the target kernel, K ∼ K . Cristianini et al. [5] in their pioneering work on kernel target alignment employed as the notion of similarity the inner-product between the kernel matrices < K, K >F = m K(xi , xj )K (xi , xj ). Given this definition, they defined the i,j=1 kernel-similarity, or alignment, to be the above inner-product normalized by the norm of ˆ ˆ ˆ ˆ ˆ each kernel, A(S, K, K ) = < K, K >F / < K, K >F < K , K >F , where S is, as above, a finite sample of m instances. Put another way, the kernel alignment Cristianini et al. employed is the cosine of the angle between the kernel matrices where each matrix is “flattened” into a vector of dimension m2 . 
Therefore, this definition implies that the alignment is bounded above by 1 and can attain this value iff the two kernel matrices are identical. Given a (column) vector of m labels y where yi ∈ {−1, +1} is the label of the instance xi , Cristianini et al. used the outer-product of y as the the target kernel, ˆ K = yy T . Therefore, an optimal alignment is achieved if K(xi , xj ) = yi yj . Clearly, if such a kernel is used for classifying instances from X , then the kernel itself suffices to construct an excellent classifier f : X → {−1, +1} by setting, f (x) = sign(y i K(xi , x)) where (xi , yi ) is any instance-label pair. Cristianini et al. then devised a procedure that works with both labelled and unlabelled examples to find a Gram matrix which attains a good alignment with K on the labelled part of the matrix. While this approach can clearly construct powerful kernels, a few problems arise from the notion of kernel alignment they employed. For instance, a kernel operator such that the sign(K(x i , xj )) is equal to yi yj but its magnitude, |K(xi , xj )|, is not necessarily 1, might achieve a poor alignment score while it can constitute a classifier whose empirical loss is zero. Furthermore, the task of finding a good kernel when it is not always possible to find a kernel whose sign on each pair of instances is equal to the products of the labels (termed the soft-margin case in [5, 6]) becomes rather tricky. We thus propose a different approach which attempts to overcome some of the difficulties above. Like Cristianini et al. we assume that we are given a set of labelled instances S = {(xi , yi ) | xi ∈ X , yi ∈ {−1, +1}, i = 1, . . . , m} . We are also given a set of unlabelled m ˜ ˜ examples S = {˜i }i=1 . If such a set is not provided we can simply use the labelled inx ˜ ˜ stances (without the labels themselves) as the set S. The set S is used for constructing the ˆ primitive kernels that are combined to constitute the learned kernel K. 
The labelled set is used to form the target kernel matrix and its instances are used for evaluating the learned ˆ kernel K. This approach, known as transductive learning, was suggested in [5, 6] for kernel alignment tasks when the distribution of the instances in the test data is different from that of the training data. This setting becomes in particular handy in datasets where the test data was collected in a different scheme than the training data. We next discuss the notion of kernel goodness employed in this paper. This notion builds on the objective function that several variants of boosting algorithms maintain [7, 8]. We therefore first discuss in brief the form of boosting algorithms for kernels. 2 Using Boosting to Combine Kernels Numerous interpretations of AdaBoost and its variants cast the boosting process as a procedure that attempts to minimize, or make small, a continuous bound on the classification error (see for instance [9, 7] and the references therein). A recent work by Collins et al. [8] unifies the boosting process for two popular loss functions, the exponential-loss (denoted henceforth as ExpLoss) and logarithmic-loss (denoted as LogLoss) that bound the empir- ˜ ˜ Input: Labelled and unlabelled sets of examples: S = {(xi , yi )}m ; S = {˜i }m x i=1 i=1 Initialize: K ← 0 (all zeros matrix) For t = 1, 2, . . . , T : • Calculate distribution over pairs 1 ≤ i, j ≤ m: Dt (i, j) = exp(−yi yj K(xi , xj )) 1/(1 + exp(−yi yj K(xi , xj ))) ExpLoss LogLoss ˜ • Call base-kernel-learner with (Dt , S, S) and receive Kt • Calculate: + − St = {(i, j) | yi yj Kt (xi , xj ) > 0} ; St = {(i, j) | yi yj Kt (xi , xj ) < 0} + Wt = (i,j)∈S + Dt (i, j)|Kt (xi , xj )| ; Wt− = (i,j)∈S − Dt (i, j)|Kt (xi , xj )| t t 1 2 + Wt − Wt • Set: αt = ln ; K ← K + α t Kt . Return: kernel operator K : X × X → Figure 1: The skeleton of the boosting algorithm for kernels. ical classification error. 
Given the prediction of a classifier $f$ on an instance $x$ and a label $y \in \{-1,+1\}$, the ExpLoss and the LogLoss are defined as

$\mathrm{ExpLoss}(f(x),y) = \exp(-y f(x))$
$\mathrm{LogLoss}(f(x),y) = \log(1 + \exp(-y f(x)))$ .

Collins et al. described a single algorithm for the two losses above that can be used within the boosting framework to construct a strong-hypothesis which is a classifier $f(x)$. This classifier is a weighted combination of (possibly very simple) base classifiers. (In the boosting framework, the base classifiers are referred to as weak-hypotheses.) The strong-hypothesis is of the form $f(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$. Collins et al. discussed a few ways to select the weak-hypotheses $h_t$ and to find a good set of weights $\alpha_t$. Our starting point in this paper is the first sequential algorithm from [8], which enables the construction of weak-hypotheses on-the-fly. We would like to note however that it is possible to use other variants of boosting to design kernels.

In order to use boosting to design kernels we extend the algorithm to operate over pairs of instances. Building on the notion of alignment from [5, 6], we say that the inner-product of $x_1$ and $x_2$ is aligned with the labels $y_1$ and $y_2$ if $\mathrm{sign}(K(x_1,x_2)) = y_1 y_2$. Furthermore, we would like the magnitude of $K(x,x')$ to be as large as possible. We therefore use one of the following two alignment losses for a pair of examples $(x_1,y_1)$ and $(x_2,y_2)$,

$\mathrm{ExpLoss}(K(x_1,x_2), y_1 y_2) = \exp(-y_1 y_2 K(x_1,x_2))$
$\mathrm{LogLoss}(K(x_1,x_2), y_1 y_2) = \log(1 + \exp(-y_1 y_2 K(x_1,x_2)))$ .

Put another way, we view a pair of instances as a single example and cast the pairs of instances that share the same label as positively labelled examples, while pairs of opposite labels are cast as negatively labelled examples. Clearly, this approach can be applied to both losses. In the boosting process we therefore maintain a distribution over pairs of instances.
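The two alignment losses for a pair of examples can be written down directly. The sketch below is only meant to make the definitions concrete; the function names are chosen for this example, and the kernel value is passed in as a plain number rather than computed from instances.

```python
import math

def exp_loss(k_value, y1, y2):
    """ExpLoss of a kernel value K(x1, x2) against the pair label y1 * y2."""
    return math.exp(-y1 * y2 * k_value)

def log_loss(k_value, y1, y2):
    """LogLoss of a kernel value K(x1, x2) against the pair label y1 * y2."""
    return math.log(1.0 + math.exp(-y1 * y2 * k_value))

# A pair with equal labels acts as a positive example: a large positive
# kernel value gives a small loss, a negative one gives a large loss.
same_label_good = exp_loss(3.0, +1, +1)
same_label_bad = exp_loss(-3.0, +1, +1)

# A kernel value with the right sign but small magnitude is still penalised
# more than one with the right sign and large magnitude.
small_margin = log_loss(0.1, +1, -1)
wrong_sign = log_loss(0.1, +1, +1)
```

Both losses decrease monotonically in $y_1 y_2 K(x_1,x_2)$, which is why they reward a correct sign as well as a large magnitude.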
The weight of each pair reflects how difficult it is to predict whether the labels of the two instances are the same or different. The core boosting algorithm follows similar lines to boosting algorithms for classification. The pseudo-code of the booster is given in Fig. 1. The pseudo-code is an adaptation to the problem of kernel design of the sequential-update algorithm from [8]. As with other boosting algorithms, the base-learner, which in our case is in charge of returning a good kernel with respect to the current distribution, is left unspecified. We therefore turn our attention to the algorithmic implementation of the base-learning algorithm for kernels.

3 Learning Base Kernels

The base kernel learner is provided with a training set $S$ and a distribution $D_t$ over pairs of instances from the training set. It is also provided with a set of unlabelled examples $\tilde{S}$. Without any knowledge of the topology of the space of instances a learning algorithm is likely to fail. Therefore, we assume the existence of an initial inner-product over the input space. We assume for now that this initial inner-product is the standard scalar product over vectors in $\mathbb{R}^n$. We later discuss a way to relax the assumption on the form of the inner-product. Equipped with an inner-product, we define the family of base kernels to be the possible outer-products $K_w = w w^T$ between a vector $w \in \mathbb{R}^n$ and itself. Using this definition we get $K_w(x_i,x_j) = (x_i \cdot w)(x_j \cdot w)$. Therefore, the similarity between two instances $x_i$ and $x_j$ is high iff both $x_i$ and $x_j$ are similar (w.r.t. the standard inner-product) to a third vector $w$. Analogously, if both $x_i$ and $x_j$ seem to be dissimilar to the vector $w$ then they are similar to each other. Despite the restrictive form of the inner-products, this family is still too rich for our setting and we further impose two restrictions on the inner products. First, we assume that $w$ is restricted to a linear combination of vectors from $\tilde{S}$. Second, since scaling of the base kernels is performed by the booster, we constrain the norm of $w$ to be 1. The resulting class of kernels is therefore $C = \{K_w = w w^T \mid w = \sum_{r=1}^{\tilde{m}} \beta_r \tilde{x}_r,\ \|w\| = 1\}$.

Input: A distribution $D_t$. Labelled and unlabelled sets: $S = \{(x_i,y_i)\}_{i=1}^{m}$ ; $\tilde{S} = \{\tilde{x}_i\}_{i=1}^{\tilde{m}}$.
Compute:
• Calculate:
  $A \in \mathbb{R}^{m \times \tilde{m}}$, $A_{i,r} = x_i \cdot \tilde{x}_r$
  $B \in \mathbb{R}^{m \times m}$, $B_{i,j} = D_t(i,j)\, y_i y_j$
  $K \in \mathbb{R}^{\tilde{m} \times \tilde{m}}$, $K_{r,s} = \tilde{x}_r \cdot \tilde{x}_s$
• Find the generalized eigenvector $v \in \mathbb{R}^{\tilde{m}}$ for the problem $A^T B A v = \lambda K v$ which attains the largest eigenvalue $\lambda$
• Set: $w = (\sum_r v_r \tilde{x}_r) / \|\sum_r v_r \tilde{x}_r\|$.
Return: Kernel operator $K_w = w w^T$.

Figure 2: The base kernel learning algorithm.

In the boosting process we need to choose a specific base-kernel $K_w$ from $C$. We therefore need to devise a notion of how good a candidate base kernel is, given a labelled set $S$ and a distribution function $D_t$. In this work we use the simplest version suggested by Collins et al. This version can be viewed as a linear approximation of the loss function. We define the score of a kernel $K_w$ w.r.t. the current distribution $D_t$ to be

$\mathrm{Score}(K_w) = \sum_{i,j} D_t(i,j)\, y_i y_j K_w(x_i,x_j)$ . (1)

The higher the value of the score is, the better $K_w$ fits the training data. Note that if $D_t(i,j) = 1/m^2$ (as is $D_0$) then $\mathrm{Score}(K_w)$ is proportional to the alignment since $\|w\| = 1$. Under mild assumptions the score can also provide a lower bound of the loss function. To see that, let $c$ be the derivative of the loss function at margin zero, $c = \mathrm{Loss}'(0)$. If all the training examples $x_i \in S$ lie in a ball of radius $\sqrt{c}$, we get that $\mathrm{Loss}(K_w(x_i,x_j), y_i y_j) \ge 1 - c K_w(x_i,x_j) y_i y_j \ge 0$, and therefore

$\sum_{i,j} D_t(i,j)\, \mathrm{Loss}(K_w(x_i,x_j), y_i y_j) \ge 1 - c \sum_{i,j} D_t(i,j)\, K_w(x_i,x_j)\, y_i y_j$ .

Using the explicit form of $K_w$ in the Score function (Eq.
(1)) we get $\mathrm{Score}(K_w) = \sum_{i,j} D(i,j)\, y_i y_j (w \cdot x_i)(w \cdot x_j)$. Further developing the above equation using the constraint that $w = \sum_{r=1}^{\tilde{m}} \beta_r \tilde{x}_r$ we get

$\mathrm{Score}(K_w) = \sum_{r,s} \beta_s \beta_r \sum_{i,j} D(i,j)\, y_i y_j (x_i \cdot \tilde{x}_r)(x_j \cdot \tilde{x}_s)$ .

To compute the base kernel score efficiently, without an explicit enumeration, we exploit the fact that if the initial distribution $D_0$ is symmetric ($D_0(i,j) = D_0(j,i)$) then all the distributions $D_t$ generated along the run of the boosting process are also symmetric. We now define a matrix $A \in \mathbb{R}^{m \times \tilde{m}}$ where $A_{i,r} = x_i \cdot \tilde{x}_r$ and a symmetric matrix $B \in \mathbb{R}^{m \times m}$ with $B_{i,j} = D_t(i,j)\, y_i y_j$. Simple algebraic manipulations yield that the score function can be written as the following quadratic form,

$\mathrm{Score}(\beta) = \beta^T (A^T B A) \beta$ ,

where $\beta$ is an $\tilde{m}$-dimensional column vector. Note that since $B$ is symmetric so is $A^T B A$. Finding a good base kernel is equivalent to finding a vector $\beta$ which maximizes this quadratic form under the norm equality constraint $\|w\|^2 = \|\sum_{r=1}^{\tilde{m}} \beta_r \tilde{x}_r\|^2 = \beta^T K \beta = 1$, where $K_{r,s} = \tilde{x}_r \cdot \tilde{x}_s$. Finding the maximum of $\mathrm{Score}(\beta)$ subject to the norm constraint is a well-known maximization problem, the generalized eigenvector problem (cf. [10]). Applying simple algebraic manipulations it is easy to show that the matrix $A^T B A$ is positive semidefinite. Assuming that the matrix $K$ is invertible, the vector $\beta$ which maximizes the quadratic form is proportional to the eigenvector of $K^{-1} A^T B A$ which is associated with the largest generalized eigenvalue. Denoting this vector by $v$ we get that $w \propto \sum_{r=1}^{\tilde{m}} v_r \tilde{x}_r$. Adding the norm constraint we get that $w = (\sum_{r=1}^{\tilde{m}} v_r \tilde{x}_r) / \|\sum_{r=1}^{\tilde{m}} v_r \tilde{x}_r\|$. The skeleton of the algorithm for finding a base kernel is given in Fig. 2. To conclude the description of the kernel learning algorithm we describe how to extend the algorithm to be employed with general kernel functions.
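The base kernel learner derived above can be sketched with NumPy as follows. This is an illustrative reading of the procedure rather than the authors' implementation: the function name `learn_base_kernel`, the small `ridge` term added to $K$ so that a Cholesky factor exists, and the reduction of the generalized eigenproblem to an ordinary symmetric one by whitening are all choices made for this example.

```python
import numpy as np

def learn_base_kernel(X, y, D, X_tilde, ridge=1e-8):
    """Return (w, v) maximising beta^T (A^T B A) beta subject to beta^T K beta = 1.

    X       : (m, n) labelled instances
    y       : (m,) labels in {-1, +1}
    D       : (m, m) symmetric weights over pairs
    X_tilde : (m_tilde, n) unlabelled template instances
    ridge   : assumption, keeps K numerically positive definite
    """
    A = X @ X_tilde.T                       # A[i, r] = x_i . x~_r
    B = D * np.outer(y, y)                  # B[i, j] = D(i, j) y_i y_j
    K = X_tilde @ X_tilde.T                 # K[r, s] = x~_r . x~_s
    M = A.T @ B @ A
    M = 0.5 * (M + M.T)                     # remove round-off asymmetry

    # Whitening: with K = L L^T and beta = L^{-T} u, the constraint becomes
    # u^T u = 1 and the objective becomes u^T (L^{-1} M L^{-T}) u.
    L = np.linalg.cholesky(K + ridge * np.eye(K.shape[0]))
    L_inv = np.linalg.inv(L)
    C = L_inv @ M @ L_inv.T
    eigvals, eigvecs = np.linalg.eigh(C)    # eigenvalues in ascending order
    u = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue
    v = L_inv.T @ u                         # back to the beta coordinates

    w = X_tilde.T @ v                       # w = sum_r v_r x~_r
    w /= np.linalg.norm(w)                  # enforce ||w|| = 1
    return w, v

# Toy usage with a uniform initial distribution over pairs.
X = np.array([[1.0, 0.2, 0.0], [0.8, -0.1, 0.3],
              [-1.0, 0.1, 0.2], [-0.9, -0.3, -0.1]])
y = np.array([1.0, 1.0, -1.0, -1.0])
D = np.full((4, 4), 1.0 / 16.0)
X_tilde = np.array([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]])
w, v = learn_base_kernel(X, y, D, X_tilde)
```

When the templates span the whole input space, as in this toy usage, the returned $w$ should coincide (up to sign) with the leading eigenvector of $X^T B X$, which gives a simple sanity check. For large $\tilde{m}$ a dedicated generalized symmetric eigensolver would likely be preferable to the explicit inverse used here.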
Kernelizing the Kernel: As described above, we assumed that the standard scalar-product constitutes the template for the class of base-kernels $C$. However, since the procedure for choosing a base kernel depends on $S$ and $\tilde{S}$ only through the inner-products matrix $A$, we can replace the scalar-product itself with a general kernel operator $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, where $\kappa(x_i,x_j) = \phi(x_i) \cdot \phi(x_j)$. Using a general kernel function $\kappa$ we cannot, however, compute the vector $w$ explicitly. We therefore need to show that the norm of $w$ and the evaluation of $K_w$ on any two examples can still be performed efficiently. First note that given the vector $v$ we can compute the norm of $w$ as follows,

$\|w\|^2 = \left(\sum_r v_r \tilde{x}_r\right)^T \left(\sum_s v_s \tilde{x}_s\right) = \sum_{r,s} v_r v_s\, \kappa(\tilde{x}_r, \tilde{x}_s)$ .

Next, given two vectors $x_i$ and $x_j$ the value of their inner-product is

$K_w(x_i,x_j) = \sum_{r,s} v_r v_s\, \kappa(x_i, \tilde{x}_r)\, \kappa(x_j, \tilde{x}_s)$ .

Therefore, although we cannot compute the vector $w$ explicitly we can still compute its norm and evaluate any of the kernels from the class $C$.

4 Experiments

Synthetic data: We generated binary-labelled data using as input space the vectors in $\mathbb{R}^{100}$. The labels, in $\{-1,+1\}$, were picked uniformly at random. Let $y$ designate the label of a particular example. Then, the first two components of each instance were drawn from a two-dimensional normal distribution, $N(\mu, \Delta \Sigma \Delta^{-1})$, with the following parameters,

$\mu = y \begin{pmatrix} 0.03 \\ 0.03 \end{pmatrix}$, $\quad \Delta = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}$, $\quad \Sigma = \begin{pmatrix} 0.1 & 0 \\ 0 & 0.01 \end{pmatrix}$ .

That is, the label of each example determined the mean of the distribution from which the first two components were generated. The rest of the components in the vector (98 altogether) were generated independently using the normal distribution with a zero mean and a standard deviation of 0.05.

Figure 3: Results on a toy data set prior to learning a kernel (first and third from left) and after learning (second and fourth). For each of the two settings we show the first two components of the training data (left) and the matrix of inner products between the train and the test data (right).

We generated 100 training and test sets of size 300 and 200 respectively. We used the standard dot-product as the initial kernel operator. In each experiment we first learned a linear classifier that separates the classes using the Perceptron [11] algorithm. We ran the algorithm for 10 epochs on the training set. After each epoch we evaluated the performance of the current classifier on the test set. We then used the boosting algorithm for kernels with the LogLoss for 30 rounds to build a kernel for each random training set. After learning the kernel we re-trained a classifier with the Perceptron algorithm and recorded the results. A summary of the online performance is given in Fig. 4. The plot on the left-hand side of the figure shows the instantaneous error (achieved during the run of the algorithm). Clearly, the Perceptron algorithm with the learned kernel converges much faster than with the original kernel. The middle plot shows the test error after each epoch. The plot on the right shows the test error on a noisy test set in which we added Gaussian noise of zero mean and a standard deviation of 0.03 to the first two features. In all plots, each bar indicates a 95% confidence level. It is clear from the figure that the original kernel is much slower to converge than the learned kernel. Furthermore, though the kernel learning algorithm was not exposed to the test set noise, the learned kernel better reflects the structure of the feature space, which makes the learned kernel more robust to noise. Fig. 3 further illustrates the benefits of using a boutique kernel.
The first and third plots from the left correspond to results obtained using the original kernel and the second and fourth plots show results using the learned kernel. The left plots show the empirical distribution of the two informative components on the test data. For the learned kernel we took each input vector and projected it onto the two eigenvectors of the learned kernel operator matrix that correspond to the two largest eigenvalues. Note that the distribution after the projection is bimodal and well separated along the first eigen direction (x-axis) and shows rather little deviation along the second eigen direction (y-axis). This indicates that the kernel learning algorithm indeed found the most informative projection for separating the labelled data with large margin. It is worth noting that, in this particular setting, any algorithm which chooses a single feature at a time is prone to failure since both the first and second features are mandatory for correctly classifying the data. The two plots on the right-hand side of Fig. 3 use a gray level color-map to designate the value of the inner-product between each pair of instances, one from the training set (y-axis) and the other from the test set. The examples were ordered such that the first group consists of the positively labelled instances while the second group consists of the negatively labelled instances. Since most of the features are non-relevant the original inner-products are noisy and do not exhibit any structure. In contrast, the inner-products using the learned kernel yield a 2 × 2 block matrix, indicating that the inner-products between instances sharing the same label obtain large positive values.
Similarly, for instances of opposite labels the inner products are large and negative. The form of the inner-products matrix of the learned kernel indicates that the learning problem itself becomes much easier. Indeed, the Perceptron algorithm with the standard kernel required around 94 training examples on average before converging to a hyperplane which perfectly separates the training data, while the Perceptron algorithm with the learned kernel required a single example to reach a perfect separation on all 100 random training sets.

Figure 4: The online training error (left) and test error (middle) on clean synthetic data using a standard kernel and a learned kernel. Right: the online test error for the two kernels on a noisy test set. (Plot legends: Regular Kernel, Learned Kernel. Axes: Averaged Cumulative Error % against Round on the left; Test Error % against Epochs in the middle and on the right.)

USPS dataset: The USPS (US Postal Service) dataset is known as a challenging classification problem in which the training set and the test set were collected in a different manner. The USPS dataset contains 7,291 training examples and 2,007 test examples. Each example is represented as a 16 × 16 matrix where each entry in the matrix is a pixel that can take values in {0, . . . , 255}. Each example is associated with a label in {0, . . . , 9} which is the digit content of the image. Since the kernel learning algorithm is designed for binary problems, we broke the 10-class problem into 45 binary problems by comparing all pairs of classes. The interesting question of how to learn kernels for multiclass problems is beyond the scope of this short paper. We thus concentrate on the binary error results for the 45 binary problems described above.
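The reduction of the 10-class problem to 45 binary problems (all pairs of classes) can be sketched as below. The helper name `one_vs_one_problems` and the convention that the first class of each pair receives the label +1 are assumptions for this example; the paper does not state which class is mapped to which sign.

```python
from itertools import combinations

def one_vs_one_problems(labels, classes=range(10)):
    """Split a multiclass label list into one binary problem per class pair.

    Returns a dict mapping (a, b) to (indices, binary_labels), where the
    indices select the examples of class a or b, and the binary label is
    +1 for class a and -1 for class b (an arbitrary convention).
    """
    problems = {}
    for a, b in combinations(classes, 2):
        idx = [i for i, c in enumerate(labels) if c in (a, b)]
        signs = [+1 if labels[i] == a else -1 for i in idx]
        problems[(a, b)] = (idx, signs)
    return problems

# Toy usage on five digit labels.
toy_labels = [0, 1, 2, 1, 0]
toy_problems = one_vs_one_problems(toy_labels)
```

With ten classes there are 10 · 9 / 2 = 45 unordered pairs, which matches the number of binary problems used in the experiments.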
For the original kernel we chose an RBF kernel with σ = 1, which is the value employed in the experiments reported in [12]. We used the kernelized version of the kernel design algorithm to learn a different kernel operator for each of the binary problems. We then used a variant of the Perceptron [11] both with the original RBF kernel and with the learned kernels. One of the motivations for using the Perceptron is its simplicity, which can underscore differences in the kernels. We ran the kernel learning algorithm with LogLoss and ExpLoss, using both the training set and the test set as $\tilde{S}$. Thus, we obtained four different sets of kernels where each set consists of 45 kernels. By examining the training loss, we set the number of rounds of boosting to be 30 for the LogLoss and 50 for the ExpLoss when using the training set. When using the test set, the number of rounds of boosting was set to 100 for both losses. Since the algorithm exhibits a slower rate of convergence with the test data, we chose a higher value without attempting to optimize the actual value. The left plot of Fig. 5 is a scatter plot comparing the test error of each of the binary classifiers when trained with the original RBF kernel versus the performance achieved on the same binary problem with a learned kernel. The kernels were built using boosting with the LogLoss and $\tilde{S}$ was the training data. In almost all of the 45 binary classification problems, the learned kernels yielded lower error rates when combined with the Perceptron algorithm. The right plot of Fig. 5 compares two learned kernels: the first was built using the training instances as the templates constituting $\tilde{S}$ while the second used the test instances. Although the difference between the two versions is not as significant as the difference on the left plot, we still achieve an overall improvement in about 25% of the binary problems by using the test instances.
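The kernelized base kernel used in these experiments never forms $w$ explicitly; it only needs kernel evaluations against the templates in $\tilde{S}$. The sketch below illustrates this under two assumptions that the text does not fix: the RBF kernel is parametrised as $\exp(-\|x - x'\|^2 / (2\sigma^2))$, and the coefficient vector `v` is not pre-normalised, so the result is divided by $\|w\|^2 = \sum_{r,s} v_r v_s \kappa(\tilde{x}_r, \tilde{x}_s)$. The function names are hypothetical.

```python
import numpy as np

def rbf_kernel(X1, X2, sigma=1.0):
    """Gram matrix of exp(-||x - x'||^2 / (2 sigma^2)); parametrisation assumed."""
    sq = (np.sum(X1 ** 2, axis=1)[:, None]
          + np.sum(X2 ** 2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def kernelized_base_kernel(X1, X2, X_tilde, v, kappa=rbf_kernel):
    """Evaluate K_w between rows of X1 and rows of X2 without forming w.

    K_w(x_i, x_j) = (sum_r v_r kappa(x_i, x~_r)) (sum_s v_s kappa(x_j, x~_s)) / ||w||^2
    with ||w||^2 = sum_{r,s} v_r v_s kappa(x~_r, x~_s).
    """
    norm_sq = v @ kappa(X_tilde, X_tilde) @ v
    p1 = kappa(X1, X_tilde) @ v             # projections of X1 onto w (unnormalised)
    p2 = kappa(X2, X_tilde) @ v             # projections of X2 onto w (unnormalised)
    return np.outer(p1, p2) / norm_sq

# Toy usage with three templates and an arbitrary coefficient vector.
X_tilde = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
v = np.array([0.5, -1.0, 2.0])
X_train = np.array([[0.2, 0.1], [0.9, 0.8]])
X_test = np.array([[0.5, 0.5]])
K_train_test = kernelized_base_kernel(X_train, X_test, X_tilde, v)
```

Passing a plain dot product as `kappa` recovers the explicit form $K_w(x_i,x_j) = (x_i \cdot w)(x_j \cdot w)$ with unit-norm $w$, which is a convenient way to check the kernelized code against the linear case.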
Figure 5: Left: a scatter plot comparing the error rate of 45 binary classifiers trained using an RBF kernel (x-axis, Base Kernel) and a learned kernel built with training instances (y-axis, Learned Kernel (Train)). Right: a similar scatter plot for a learned kernel constructed only from training instances (x-axis, Learned Kernel (Train)) and one constructed from test instances (y-axis, Learned Kernel (Test)).

5 Discussion

In this paper we showed how to use the boosting framework to design kernels. Our approach is especially appealing in transductive learning tasks where the test data distribution is different from the distribution of the training data. For example, in speech recognition tasks the training data is often clean and well recorded while the test data often passes through a noisy channel that distorts the signal. An interesting and challenging question that stems from this research is how to extend the framework to accommodate more complex decision tasks such as multiclass and regression problems. Finally, we would like to note that alternative approaches to the kernel design problem have been devised in parallel and independently. See [13, 14] for further details.

Acknowledgements: Special thanks to Cyril Goutte and to John Shawe-Taylor for pointing out the connection to the generalized eigenvector problem. Thanks also to the anonymous reviewers for constructive comments.

References
[1] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.
[2] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.
[3] Huma Lodhi, John Shawe-Taylor, Nello Cristianini, and Christopher J. C. H. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444, 2002.
[4] C. Leslie, E. Eskin, and W. Stafford Noble. The spectrum kernel: A string kernel for SVM protein classification. In Proceedings of the Pacific Symposium on Biocomputing, 2002.
[5] Nello Cristianini, Andre Elisseeff, John Shawe-Taylor, and Jaz Kandola. On kernel target alignment. In Advances in Neural Information Processing Systems 14, 2001.
[6] G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. Jordan. Learning the kernel matrix with semi-definite programming. In Proc. of the 19th Intl. Conf. on Machine Learning, 2002.
[7] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337–374, April 2000.
[8] Michael Collins, Robert E. Schapire, and Yoram Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 47(2/3):253–285, 2002.
[9] Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean. Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers. MIT Press, 1999.
[10] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1985.
[11] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958.
[12] B. Schölkopf, S. Mika, C.J.C. Burges, P. Knirsch, K.-R. Müller, G. Rätsch, and A.J. Smola. Input space vs. feature space in kernel-based methods. IEEE Trans. on NN, 10(5):1000–1017, 1999.
[13] O. Bousquet and D.J.L. Herrmann. On the complexity of learning the kernel matrix. NIPS, 2002.
[14] C.S. Ong, A.J. Smola, and R.C. Williamson. Superkernels. NIPS, 2002.
6 0.1460779 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
7 0.13609096 90 nips-2002-Feature Selection in Mixture-Based Clustering
8 0.1353564 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture
9 0.12531757 94 nips-2002-Fractional Belief Propagation
10 0.11093513 174 nips-2002-Regularized Greedy Importance Sampling
11 0.10926788 189 nips-2002-Stable Fixed Points of Loopy Belief Propagation Are Local Minima of the Bethe Free Energy
12 0.10470629 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
13 0.10289155 32 nips-2002-Approximate Inference and Protein-Folding
14 0.10175968 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
15 0.097189642 10 nips-2002-A Model for Learning Variance Components of Natural Images
16 0.093649164 73 nips-2002-Dynamic Bayesian Networks with Deterministic Latent Tables
17 0.092577547 105 nips-2002-How to Combine Color and Shape Information for 3D Object Recognition: Kernels do the Trick
18 0.090161927 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
19 0.084248126 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
20 0.081369825 89 nips-2002-Feature Selection by Maximum Marginal Diversity
topicId topicWeight
[(0, -0.294), (1, -0.108), (2, -0.004), (3, 0.072), (4, 0.182), (5, 0.05), (6, -0.094), (7, 0.134), (8, -0.064), (9, -0.175), (10, -0.091), (11, 0.032), (12, -0.042), (13, 0.296), (14, -0.191), (15, 0.3), (16, -0.246), (17, -0.169), (18, 0.015), (19, 0.022), (20, 0.038), (21, 0.108), (22, 0.003), (23, -0.029), (24, -0.072), (25, -0.006), (26, 0.087), (27, 0.022), (28, 0.003), (29, 0.012), (30, -0.037), (31, -0.006), (32, -0.042), (33, -0.006), (34, 0.021), (35, 0.075), (36, -0.034), (37, 0.056), (38, 0.066), (39, -0.03), (40, -0.073), (41, 0.049), (42, 0.025), (43, -0.027), (44, -0.041), (45, -0.018), (46, -0.027), (47, -0.001), (48, -0.014), (49, 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.95151734 181 nips-2002-Self Supervised Boosting
Author: Max Welling, Richard S. Zemel, Geoffrey E. Hinton
Abstract: Boosting algorithms and successful applications thereof abound for classification and regression learning problems, but not for unsupervised learning. We propose a sequential approach to adding features to a random field model by training them to improve classification performance between the data and an equal-sized sample of “negative examples” generated from the model’s current estimate of the data density. Training in each boosting round proceeds in three stages: first we sample negative examples from the model’s current Boltzmann distribution. Next, a feature is trained to improve classification performance between data and negative examples. Finally, a coefficient is learned which determines the importance of this feature relative to ones already in the pool. Negative examples only need to be generated once to learn each new feature. The validity of the approach is demonstrated on binary digits and continuous synthetic data.
2 0.91522717 46 nips-2002-Boosting Density Estimation
Author: Saharon Rosset, Eran Segal
Abstract: Several authors have suggested viewing boosting as a gradient descent search for a good fit in function space. We apply gradient-based boosting methodology to the unsupervised learning problem of density estimation. We show convergence properties of the algorithm and prove that a strength of weak learnability property applies to this problem as well. We illustrate the potential of this approach through experiments with boosting Bayesian networks to learn density models.
3 0.65444189 69 nips-2002-Discriminative Learning for Label Sequences via Boosting
Author: Yasemin Altun, Thomas Hofmann, Mark Johnson
Abstract: This paper investigates a boosting approach to discriminative learning of label sequences based on a sequence rank loss function. The proposed method combines many of the advantages of boosting schemes with the efficiency of dynamic programming methods and is attractive both conceptually and computationally. In addition, we also discuss alternative approaches based on the Hamming loss for label sequences. The sequence boosting algorithm offers an interesting alternative to methods based on HMMs and the more recently proposed Conditional Random Fields. Application areas for the presented technique range from natural language processing and information extraction to computational biology. We include experiments on named entity recognition and part-of-speech tagging which demonstrate the validity and competitiveness of our approach.
4 0.58528686 92 nips-2002-FloatBoost Learning for Classification
Author: Stan Z. Li, Zhenqiu Zhang, Heung-yeung Shum, Hongjiang Zhang
Abstract: AdaBoost [3] minimizes an upper error bound which is an exponential function of the margin on the training set [14]. However, the ultimate goal in applications of pattern classification is always minimum error rate. On the other hand, AdaBoost needs an effective procedure for learning weak classifiers, which by itself is difficult especially for high dimensional data. In this paper, we present a novel procedure, called FloatBoost, for learning a better boosted classifier. FloatBoost uses a backtrack mechanism after each iteration of AdaBoost to remove weak classifiers which cause higher error rates. The resulting float-boosted classifier consists of fewer weak classifiers yet achieves lower error rates than AdaBoost in both training and test. We also propose a statistical model for learning weak classifiers, based on a stagewise approximation of the posterior using an overcomplete set of scalar features. Experimental comparisons of FloatBoost and AdaBoost are provided through a difficult classification problem, face detection, where the goal is to learn from training examples a highly nonlinear classifier to differentiate between face and nonface patterns in a high dimensional space. The results clearly demonstrate the promises made by FloatBoost over AdaBoost.
5 0.40493539 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
Author: Max Welling, Simon Osindero, Geoffrey E. Hinton
Abstract: We propose a model for natural images in which the probability of an image is proportional to the product of the probabilities of some filter outputs. We encourage the system to find sparse features by using a Student-t distribution to model each filter output. If the t-distribution is used to model the combined outputs of sets of neurally adjacent filters, the system learns a topographic map in which the orientation, spatial frequency and location of the filters change smoothly across the map. Even though maximum likelihood learning is intractable in our model, the product form allows a relatively efficient learning procedure that works well even for highly overcomplete sets of filters. Once the model has been learned it can be used as a prior to derive the “iterated Wiener filter” for the purpose of denoising images.
6 0.38350403 174 nips-2002-Regularized Greedy Importance Sampling
7 0.36802849 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture
8 0.36315137 179 nips-2002-Scaling of Probability-Based Optimization Algorithms
9 0.36239636 120 nips-2002-Kernel Design Using Boosting
10 0.35299385 111 nips-2002-Independent Components Analysis through Product Density Estimation
11 0.34415746 90 nips-2002-Feature Selection in Mixture-Based Clustering
12 0.34393704 32 nips-2002-Approximate Inference and Protein-Folding
13 0.3344008 189 nips-2002-Stable Fixed Points of Loopy Belief Propagation Are Local Minima of the Bethe Free Energy
14 0.32461223 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
15 0.32126659 138 nips-2002-Manifold Parzen Windows
16 0.31209195 110 nips-2002-Incremental Gaussian Processes
17 0.30912775 94 nips-2002-Fractional Belief Propagation
18 0.307078 89 nips-2002-Feature Selection by Maximum Marginal Diversity
19 0.30551627 41 nips-2002-Bayesian Monte Carlo
20 0.30480084 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
topicId topicWeight
[(11, 0.027), (23, 0.016), (42, 0.521), (54, 0.108), (55, 0.025), (68, 0.02), (74, 0.078), (92, 0.027), (98, 0.105)]
simIndex simValue paperId paperTitle
1 0.97323245 115 nips-2002-Informed Projections
Author: David Tax
Abstract: Low rank approximation techniques are widespread in pattern recognition research — they include Latent Semantic Analysis (LSA), Probabilistic LSA, Principal Components Analysis (PCA), the Generative Aspect Model, and many forms of bibliometric analysis. All make use of a low-dimensional manifold onto which data are projected. Such techniques are generally “unsupervised,” which allows them to model data in the absence of labels or categories. With many practical problems, however, some prior knowledge is available in the form of context. In this paper, I describe a principled approach to incorporating such information, and demonstrate its application to PCA-based approximations of several data sets.
2 0.96296936 22 nips-2002-Adaptive Nonlinear System Identification with Echo State Networks
Author: Herbert Jaeger
Abstract: Echo state networks (ESN) are a novel approach to recurrent neural network training. An ESN consists of a large, fixed, recurrent
same-paper 3 0.95210242 181 nips-2002-Self Supervised Boosting
Author: Max Welling, Richard S. Zemel, Geoffrey E. Hinton
Abstract: Boosting algorithms and successful applications thereof abound for classification and regression learning problems, but not for unsupervised learning. We propose a sequential approach to adding features to a random field model by training them to improve classification performance between the data and an equal-sized sample of “negative examples” generated from the model’s current estimate of the data density. Training in each boosting round proceeds in three stages: first we sample negative examples from the model’s current Boltzmann distribution. Next, a feature is trained to improve classification performance between data and negative examples. Finally, a coefficient is learned which determines the importance of this feature relative to ones already in the pool. Negative examples only need to be generated once to learn each new feature. The validity of the approach is demonstrated on binary digits and continuous synthetic data.
4 0.93454343 197 nips-2002-The Stability of Kernel Principal Components Analysis and its Relation to the Process Eigenspectrum
Author: Christopher Williams, John S. Shawe-taylor
Abstract: In this paper we analyze the relationships between the eigenvalues of the $m \times m$ Gram matrix $K$ for a kernel $k(\cdot,\cdot)$ corresponding to a sample $x_1, \ldots, x_m$ drawn from a density $p(x)$ and the eigenvalues of the corresponding continuous eigenproblem. We bound the differences between the two spectra and provide a performance bound on kernel PCA.
5 0.84993315 138 nips-2002-Manifold Parzen Windows
Author: Pascal Vincent, Yoshua Bengio
Abstract: The similarity between objects is a fundamental element of many learning algorithms. Most non-parametric methods take this similarity to be fixed, but much recent work has shown the advantages of learning it, in particular to exploit the local invariances in the data or to capture the possibly non-linear manifold on which most of the data lies. We propose a new non-parametric kernel density estimation method which captures the local structure of an underlying manifold through the leading eigenvectors of regularized local covariance matrices. Experiments in density estimation show significant improvements with respect to Parzen density estimators. The density estimators can also be used within Bayes classifiers, yielding classification rates similar to SVMs and much superior to the Parzen classifier.
6 0.69641638 159 nips-2002-Optimality of Reinforcement Learning Algorithms with Linear Function Approximation
7 0.68883955 46 nips-2002-Boosting Density Estimation
8 0.67748988 143 nips-2002-Mean Field Approach to a Probabilistic Model in Information Retrieval
9 0.65954202 65 nips-2002-Derivative Observations in Gaussian Process Models of Dynamic Systems
10 0.6567443 169 nips-2002-Real-Time Particle Filters
11 0.64204454 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
12 0.6379469 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
13 0.63650143 3 nips-2002-A Convergent Form of Approximate Policy Iteration
14 0.62801856 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
15 0.61738181 41 nips-2002-Bayesian Monte Carlo
16 0.61525905 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs
17 0.61145186 100 nips-2002-Half-Lives of EigenFlows for Spectral Clustering
18 0.60979986 61 nips-2002-Convergent Combinations of Reinforcement Learning with Linear Function Approximation
19 0.6070587 96 nips-2002-Generalized² Linear² Models
20 0.6056928 69 nips-2002-Discriminative Learning for Label Sequences via Boosting