nips nips2002 nips2002-149 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ofer Dekel, Yoram Singer
Abstract: We describe a new algorithmic framework for learning multiclass categorization problems. In this framework a multiclass predictor is composed of a pair of embeddings that map both instances and labels into a common space. In this space each instance is assigned the label it is nearest to. We outline and analyze an algorithm, termed Bunching, for learning the pair of embeddings from labeled data. A key construction in the analysis of the algorithm is the notion of probabilistic output codes, a generalization of error correcting output codes (ECOC). Furthermore, the method of multiclass categorization using ECOC is shown to be an instance of Bunching. We demonstrate the advantage of Bunching over ECOC by comparing their performance on numerous categorization problems.
Reference: text
sentIndex sentText sentNum sentScore
1 il Abstract We describe a new algorithmic framework for learning multiclass categorization problems. [sent-4, score-0.304]
2 In this framework a multiclass predictor is composed of a pair of embeddings that map both instances and labels into a common space. [sent-5, score-0.565]
3 In this space each instance is assigned the label it is nearest to. [sent-6, score-0.199]
4 We outline and analyze an algorithm, termed Bunching, for learning the pair of embeddings from labeled data. [sent-7, score-0.23]
5 A key construction in the analysis of the algorithm is the notion of probabilistic output codes, a generalization of error correcting output codes (ECOC). [sent-8, score-0.438]
6 Furthermore, the method of multiclass categorization using ECOC is shown to be an instance of Bunching. [sent-9, score-0.358]
7 1 Introduction The focus of this paper is supervised learning from multiclass data. [sent-11, score-0.259]
8 In multiclass problems the goal is to learn a classifier that accurately assigns labels to instances where the set of labels is of finite cardinality and contains more than two elements. [sent-12, score-0.424]
9 Many machine learning applications employ a multiclass categorization stage. [sent-13, score-0.328]
10 Dietterich and Bakiri [6] proposed a technique based on error correcting output coding (ECOC) as a means of reducing a multiclass classification problem to several binary classification problems and then solving each binary problem individually to obtain a multiclass classifier. [sent-15, score-0.855]
11 In the above papers, as well as in most previous work on ECOC, learning the set of binary classifiers and selecting a particular error correcting code are done independently. [sent-18, score-0.494]
12 An exception is a method based on continuous relaxation of the code [3] in which the code matrix is post-processed once based on the learned binary classifiers. [sent-19, score-0.712]
13 On one hand it offers great flexibility and modularity, on the other hand, the resulting binary learning problems might be unnatural and therefore potentially difficult. [sent-21, score-0.174]
14 The approach we take perceives the set of binary classifiers as an embedding of the instance space and the code matrix as an embedding of the label set into a common space. [sent-23, score-0.781]
15 In this common space each instance is assigned the label from which it’s divergence is smallest. [sent-24, score-0.241]
16 We then describe an algorithm that constructs the label and instance embeddings such that the resulting classifier achieves a small empirical error. [sent-26, score-0.436]
17 One step improves the embedding of the instance space into the common space while keeping the embedding of the label set fixed. [sent-29, score-0.38]
18 The second step complements the first by updating the label embedding while keeping the instance embedding fixed. [sent-31, score-0.403]
19 In the next section we give a formal description of the multiclass learning problem and of our classification setting. [sent-35, score-0.259]
20 3 we give an alternative view of ECOC which naturally leads to the definition of probabilistic output codes presented in Sec. [sent-37, score-0.252]
21 2 Problem Setting Let X be a domain of instance encodings from m and let Y be a set of r labels that can be assigned to each instance from X . [sent-45, score-0.22]
22 Given a training set of instance-label pairs S = (xj , yj )n such that each xj is in X and each yj is in Y, we are faced with the problem of j=1 learning a classification function that predicts the labels of instances from X . [sent-46, score-0.56]
23 This problem is often referred to as multiclass learning. [sent-47, score-0.236]
24 In other multiclass problem settings it is common to encode the set Y as a prefix of the integers {1, . [sent-48, score-0.236]
25 The classification functions we study in this paper are composed of a pair of embeddings from the spaces X and Y into a common space Z, and a measure of divergence between vectors in Z. [sent-53, score-0.29]
26 That is, given an instance x ∈ X , we embed it into Z along with all of the label vectors in Y and predict the label that x is closest to in Z. [sent-54, score-0.362]
27 The logistic transformation σ : s → (0, 1)s is defined ∀k = 1, . [sent-56, score-0.205]
28 , s σk (ω) = (1 + e−ωk )−1 The entropy of a multivariate Bernoulli random variable with parameter p ∈ [0, 1] s is s H[p] = − [pk log(pk ) + (1 − pk ) log(1 − pk )] . [sent-59, score-0.262]
29 k=1 The Kullback-Leibler (KL) divergence between a pair of multivariate Bernoulli random variables with respective parameters p, q ∈ [0, 1]s is s pk 1 − pk D[p q] = pk log + (1 − pk ) log . [sent-60, score-0.58]
30 Given any two linear mappings T : m → s and C : r → s , where T is given as a matrix in s×m and C as a matrix in s×r , instances from X are embedded into Z by σ(T x) and labels from Y are embedded into Z by σ(Cy). [sent-62, score-0.294]
31 is our means of classifying new instances: given a new instance we predict its label to be y if ˆ y = argmin (x, y|C, T ) . [sent-68, score-0.258]
32 We name the loss over the entire training set S the empirical loss and use the notation L(S|C, T ) = (x, y|C, T ) . [sent-71, score-0.23]
33 (4) (x,y)∈S Our goal is to learn a good multiclass prediction function by finding a pair (C, T ) that attains a small empirical loss. [sent-72, score-0.376]
34 As we show in the sequel, the rationale behind this choice of empirical loss lies in the fact that it bounds the (discrete) empirical classification error attained by the classification function. [sent-73, score-0.324]
35 3 An Alternative View of Error Correcting Output Codes The technique of ECOC uses error correcting codes to reduce an r-class classification problem to multiple binary problems. [sent-74, score-0.344]
36 Each binary problem is then learned independently via an external binary learning algorithm and the learned binary classifiers are combined into one r-class classifier. [sent-75, score-0.356]
37 We begin by giving a brief overview of ECOC for the case where the binary learning algorithm used is a logistic regressor. [sent-76, score-0.277]
38 A binary output code C is a matrix in {0, 1}s×r where each of C’s columns is an s-bit code word that corresponds to a label in Y. [sent-77, score-0.925]
39 Therefore, the code word corresponding to the label y is simply the product of the matrix C and the vector y, Cy. [sent-79, score-0.495]
40 The distance ρ of a code C is defined as the minimal Hamming distance between any two code words, formally s ρ(C) = min Ck,i (1 − Ck,j ) + Ck,j (1 − Ck,i ) . [sent-80, score-0.62]
41 , the set of labels in Y which are mapped according to C k to the binary label 0) and the labels for which Ck · y = 1. [sent-86, score-0.365]
42 Thus, each Ck induces a binary classification problem from the original multiclass problem. [sent-87, score-0.347]
43 Formally, we construct for each k a binarylabeled sample Sk = {(xj , Ck · yj )}n and for each Sk we learn a binary classification j=1 function Tk : X → using a logistic regression algorithm. [sent-88, score-0.407]
44 That is, for each original instance xj and induced binary label Ck · yj we posit a logistic model that estimates the conditional probability that Ck · yj equals 1 given xj , P r[Ck · yj = 1| xj ; Tk ] = σ(Tk · xj ) . [sent-89, score-1.377]
45 (5) Given a predefined code matrix C the learning task at hand is to find T k that maximizes the log-likelihood of the labelling given in Sk , n Tk = argmax m Tk ∈ log(P r[Ck · yj | xj ; Tk ]) . [sent-90, score-0.604]
46 (6) as follows n Tk = argmin m Tk ∈ D[Ck · yj σ(Tk · xj )] . [sent-94, score-0.321]
47 j=1 In words, a good set of binary predictors is found by minimizing the sample-averaged KLdivergence between the binary vectors induced by C and the logistic estimates induced by T1 , . [sent-95, score-0.464]
48 For any instance x ∈ X , σ(T x) is a vector of probability estimates k=1 that the label of x is 1 for each of the s induced binary problems. [sent-100, score-0.339]
49 We can summarize the learning task defined by the code C as the task of finding a matrix T such that n T = argmin s×m T∈ D[Cyj σ(T xj )] . [sent-101, score-0.51]
50 j=1 Given a code matrix C and a transformation T we classify a new instance as follows, y = argmin D[Cy ˆ σ(T x)] . [sent-102, score-0.517]
51 (7) y∈Y A classification error occurs if the predicted label y is different from the correct label y. [sent-103, score-0.271]
52 [1] it is straightforward to show that the empirical classification error (ˆ = y) is bounded above by the empirical KL-divergence between the y correct code word Cy and the estimated probabilities σ(T x) divided by the code distance, |{ˆj = y y j }n | j=1 ≤ n j=1 D[Cyj ρ(C) σ(T xj )] . [sent-106, score-0.836]
53 4 Probabilistic Output Codes We now describe a relaxation of binary output codes by defining the notion of probabilistic output codes. [sent-110, score-0.446]
54 We give a bound on the empirical error attained by a classifier that uses probabilistic output codes which generalizes the bound in Eq. [sent-111, score-0.457]
55 The rationale for our construction is that the discrete nature of ECOC can potentially induce difficult binary classification problems. [sent-113, score-0.2]
56 In contrast, probabilistic codes induce real-valued problems that may be easier to learn. [sent-114, score-0.224]
57 Analogous to discrete codes, A probabilistic output code C is a matrix in s×r used in conjunction with the logistic transformation to produce a set of r probability vectors that correspond to the r labels in Y. [sent-115, score-0.751]
58 Namely, C maps each label y ∈ Y to the probabilistic code word σ(Cy) ∈ [0, 1]s . [sent-116, score-0.495]
59 As before, we assume that Y is the set of r standard unit vectors in {0, 1}r and therefore each probabilistic code word is the image of one of C’s columns under the logistic transformation. [sent-117, score-0.557]
60 The natural extension of code distance to probabilistic codes is achieved by replacing Hamming distance with expected Hamming distance. [sent-118, score-0.578]
61 , s} we view the k’th component of the code word that corresponds to y as a Bernoulli random variable with parameter p = σ k (Cy) then the expected Hamming distance between the code word for classes i and j is, s σk (Cyi )(1 − σk (Cyj )) + σk (Cyj )(1 − σk (Cyi )) . [sent-122, score-0.677]
62 k=1 Analogous to discrete codes we define the distance ρ of a code C as the minimum expected Hamming distance between all pairs of code words in C, that is, s ρ(C) = min i=j σk (Cyi )(1 − σk (Cyj )) + σk (Cyj )(1 − σk (Cyi )) . [sent-123, score-0.752]
63 k=1 Put another way, we have relaxed the definition of code words from deterministic vectors to multivariate Bernoulli random variables. [sent-124, score-0.324]
64 When C’s entries are all ±∞ then the logistic transformation of C’s entries defines a deterministic code and the two definitions of ρ coincide. [sent-126, score-0.464]
65 Given a probabilistic code matrix C ∈ s×r and a transformation T ∈ s×m we associate a loss (x, y|C, T ) with each instance-label pair (x, y) using Eq. [sent-127, score-0.558]
66 We classify new instances by finding the label y that attains the smallest loss as defined in Eq. [sent-130, score-0.303]
67 2 that employs embeddings except that instead of viewing C and T as abstract embeddings C is interpreted as a probabilistic output code and the rows of T are viewed as binary classifiers. [sent-133, score-0.836]
68 We now give a theorem that builds on our construction of probabilistic output codes and relates the classification rule from Eq. [sent-137, score-0.315]
69 Let C ∈ s×r be a probabilistic output code with distance ρ(C) and let T ∈ s×m be a transformation matrix. [sent-143, score-0.492]
70 Given a sample S = {(xj , yj )}n of instance-label pairs where xj ∈ X and yj ∈ Y, denote by L the loss i=j on S with respect to C and T as given by Eq. [sent-144, score-0.498]
71 (4) and denote by y j the predicted label of ˆ xj according to the classification rule given in Eq. [sent-145, score-0.231]
72 n |{ˆj = yj }j=1 | ≤ y 5 The Learning Problem We now discuss how our formalism of probabilistic output codes via embeddings and the accompanying Thm. [sent-149, score-0.578]
73 1 implies that the empirical error over S can be reduced by minimizing the empirical loss over S while maintaining a large distance ρ(C). [sent-152, score-0.289]
74 A naive modification of C so as to minimize the loss may result in a probabilistic code whose distance is undesirably small. [sent-153, score-0.453]
75 Therefore, we assume that we are initially provided with a fixed reference matrix C0 ∈ s×r that is known to have a large code distance. [sent-154, score-0.319]
76 Rather than requiring that C attain a fixed distance to C 0 we add a penalty proportional to the distance between C and C0 to the loss defined in Eq. [sent-156, score-0.21]
77 The regularization factor we employ is the KL-divergence between the images of C and C0 under the logistic transformation, n D[σ(Cyj ) R(S|C, C0 ) = σ(C0 yj )] . [sent-160, score-0.352]
78 The algorithm is provided with initial matrices C0 and T0 , where C0 is assumed to have a large code distance ρ. [sent-173, score-0.339]
79 Since the regularization factor R is independent of T we can restrict our description and analysis of the I MPROVE -T step to considering only the loss term L of the objective function O. [sent-190, score-0.196]
80 The next theorem states that updating T by the I MPROVE -T step decreases the loss or otherwise T remains unchanged and is globally optimal with respect to C. [sent-193, score-0.204]
81 In the I MPROVE -C step we fix the current matrix T and find a code matrix C that globally minimizes the objective function. [sent-199, score-0.491]
82 According to the discussion above, the matrix C defines an embedding of the label vectors from Y into Z and the images of this embedding constitute the classification rule. [sent-200, score-0.375]
83 For each y ∈ Y denote its image under C and the logistic transformation by py = σ(Cy) and let Sy be the subset of S that is labeled y. [sent-201, score-0.327]
84 (x,y)∈Sy We can therefore find for each y ∈ Y the vector py that minimizes O(Sy ) independently and then reconstruct the code matrix C that achieves these values. [sent-203, score-0.441]
85 7 Experiments 70 random one−vs−rest 60 50 Relative performance % To assess the merits of Bunching we compared it to a standard ECOCbased algorithm on numerous multiclass problems. [sent-210, score-0.261]
86 For the ECOCbased algorithm we used a logistic regressor as the binary learning algorithm, trained using the parallel update described in [2]. [sent-211, score-0.277]
87 The two approaches share the same form of classifiers (logistic regressors) and differ solely in the coding matrix they employ: while ECOC uses a fixed code matrix Bunching adapts its code matrix during the learning process. [sent-212, score-0.721]
88 40 30 20 10 0 −10 glass isolet letter mnist satimage soybean vowel Figure 3: The relative performance of Bunching We selected the following multiclass compared to ECOC on various datasets. [sent-213, score-0.46]
89 We conducted the experiments for two families of code matrices. [sent-224, score-0.259]
90 The first family corresponds to the one-vs-rest approach in which each class is trained against the rest of the classes and the corresponding code is a matrix whose logistic transformation is simply the identity matrix. [sent-225, score-0.524]
91 The second family is the set of random code matrices with r log2 r rows where r is the number of different labels. [sent-226, score-0.288]
92 These matrices are used as C 0 for Bunching and as the fixed code for ECOC. [sent-227, score-0.288]
93 The improvement is more significant when using random code matrices. [sent-233, score-0.259]
94 This can be explained by the fact that random code matrices tend to induce unnatural and rather difficult binary partitions of the set of labels. [sent-234, score-0.471]
95 Since Bunching modifies the code matrix C along its run, it can relax difficult binary problems. [sent-235, score-0.43]
96 This suggests that Bunching can improve the classification accuracy in problems where, for instance, the one-vs-rest approach fails to give good results or when there is a need to add error correction properties to the code matrix. [sent-236, score-0.286]
97 8 A Brief Discussion In this paper we described a framework for solving multiclass problems via pairs of embeddings. [sent-237, score-0.236]
98 First, the probabilistic embeddings can be replaced with non-negative embeddings by replacing the logistic transformation with the exponential function. [sent-240, score-0.636]
99 Reducing multiclass to binary: A unifying approach for margin classifiers. [sent-254, score-0.236]
100 On the learnability and design of output codes for multiclass problems. [sent-263, score-0.428]
wordName wordTfidf (topN-words)
[('bunching', 0.341), ('mprove', 0.321), ('ecoc', 0.287), ('sy', 0.281), ('code', 0.259), ('multiclass', 0.236), ('tk', 0.21), ('embeddings', 0.173), ('classi', 0.16), ('cy', 0.16), ('ck', 0.154), ('yj', 0.153), ('logistic', 0.143), ('cyj', 0.14), ('codes', 0.132), ('label', 0.122), ('py', 0.122), ('binary', 0.111), ('xj', 0.109), ('pk', 0.106), ('loss', 0.083), ('cyi', 0.08), ('instance', 0.077), ('embedding', 0.076), ('correcting', 0.074), ('labels', 0.066), ('cation', 0.064), ('empirical', 0.064), ('transformation', 0.062), ('tt', 0.061), ('output', 0.06), ('probabilistic', 0.06), ('matrix', 0.06), ('argmin', 0.059), ('hamming', 0.059), ('instances', 0.056), ('word', 0.054), ('attained', 0.054), ('objective', 0.052), ('distance', 0.051), ('allwein', 0.048), ('bernoulli', 0.047), ('categorization', 0.045), ('divergence', 0.042), ('attains', 0.042), ('vectors', 0.041), ('ecocbased', 0.04), ('kldivergence', 0.04), ('satimage', 0.04), ('unnatural', 0.04), ('th', 0.039), ('theorem', 0.038), ('sk', 0.035), ('alternates', 0.035), ('isolet', 0.035), ('soybean', 0.035), ('pair', 0.034), ('bregman', 0.033), ('glass', 0.033), ('induce', 0.032), ('rationale', 0.032), ('regularization', 0.032), ('globally', 0.031), ('er', 0.031), ('bound', 0.03), ('de', 0.03), ('schapire', 0.03), ('ers', 0.03), ('vowel', 0.03), ('matrices', 0.029), ('induced', 0.029), ('step', 0.029), ('log', 0.028), ('qk', 0.028), ('eb', 0.028), ('tx', 0.028), ('della', 0.028), ('error', 0.027), ('pietra', 0.027), ('mnist', 0.027), ('yoram', 0.027), ('embedded', 0.026), ('nes', 0.026), ('entropy', 0.026), ('nding', 0.026), ('merits', 0.025), ('replacing', 0.025), ('penalty', 0.025), ('ned', 0.025), ('construction', 0.025), ('saddle', 0.024), ('sequel', 0.024), ('multivariate', 0.024), ('employ', 0.024), ('ct', 0.024), ('letter', 0.024), ('prede', 0.024), ('learning', 0.023), ('updating', 0.023), ('relaxation', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 149 nips-2002-Multiclass Learning by Probabilistic Embeddings
Author: Ofer Dekel, Yoram Singer
Abstract: We describe a new algorithmic framework for learning multiclass categorization problems. In this framework a multiclass predictor is composed of a pair of embeddings that map both instances and labels into a common space. In this space each instance is assigned the label it is nearest to. We outline and analyze an algorithm, termed Bunching, for learning the pair of embeddings from labeled data. A key construction in the analysis of the algorithm is the notion of probabilistic output codes, a generalization of error correcting output codes (ECOC). Furthermore, the method of multiclass categorization using ECOC is shown to be an instance of Bunching. We demonstrate the advantage of Bunching over ECOC by comparing their performance on numerous categorization problems.
2 0.36089295 19 nips-2002-Adapting Codes and Embeddings for Polychotomies
Author: Gunnar Rätsch, Sebastian Mika, Alex J. Smola
Abstract: In this paper we consider formulations of multi-class problems based on a generalized notion of a margin and using output coding. This includes, but is not restricted to, standard multi-class SVM formulations. Differently from many previous approaches we learn the code as well as the embedding function. We illustrate how this can lead to a formulation that allows for solving a wider range of problems with for instance many classes or even “missing classes”. To keep our optimization problems tractable we propose an algorithm capable of solving them using twoclass classifiers, similar in spirit to Boosting.
3 0.23606341 59 nips-2002-Constraint Classification for Multiclass Classification and Ranking
Author: Sariel Har-Peled, Dan Roth, Dav Zimak
Abstract: The constraint classification framework captures many flavors of multiclass classification including winner-take-all multiclass classification, multilabel classification and ranking. We present a meta-algorithm for learning in this framework that learns via a single linear classifier in high dimension. We discuss distribution independent as well as margin-based generalization bounds and present empirical and theoretical evidence showing that constraint classification benefits over existing methods of multiclass classification.
4 0.16514319 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
Author: Sepp Hochreiter, Klaus Obermayer
Abstract: We investigate the problem of learning a classification task for datasets which are described by matrices. Rows and columns of these matrices correspond to objects, where row and column objects may belong to different sets, and the entries in the matrix express the relationships between them. We interpret the matrix elements as being produced by an unknown kernel which operates on object pairs and we show that - under mild assumptions - these kernels correspond to dot products in some (unknown) feature space. Minimizing a bound for the generalization error of a linear classifier which has been obtained using covering numbers we derive an objective function for model selection according to the principle of structural risk minimization. The new objective function has the advantage that it allows the analysis of matrices which are not positive definite, and not even symmetric or square. We then consider the case that row objects are interpreted as features. We suggest an additional constraint, which imposes sparseness on the row objects and show, that the method can then be used for feature selection. Finally, we apply this method to data obtained from DNA microarrays, where “column” objects correspond to samples, “row” objects correspond to genes and matrix elements correspond to expression levels. Benchmarks are conducted using standard one-gene classification and support vector machines and K-nearest neighbors after standard feature selection. Our new method extracts a sparse set of genes and provides superior classification results. 1
5 0.15443234 120 nips-2002-Kernel Design Using Boosting
Author: Koby Crammer, Joseph Keshet, Yoram Singer
Abstract: The focus of the paper is the problem of learning kernel operators from empirical data. We cast the kernel design problem as the construction of an accurate kernel from simple (and less accurate) base kernels. We use the boosting paradigm to perform the kernel construction process. To do so, we modify the booster so as to accommodate kernel operators. We also devise an efficient weak-learner for simple kernels that is based on generalized eigen vector decomposition. We demonstrate the effectiveness of our approach on synthetic data and on the USPS dataset. On the USPS dataset, the performance of the Perceptron algorithm with learned kernels is systematically better than a fixed RBF kernel. 1 Introduction and problem Setting The last decade brought voluminous amount of work on the design, analysis and experimentation of kernel machines. Algorithm based on kernels can be used for various machine learning tasks such as classification, regression, ranking, and principle component analysis. The most prominent learning algorithm that employs kernels is the Support Vector Machines (SVM) [1, 2] designed for classification and regression. A key component in a kernel machine is a kernel operator which computes for any pair of instances their inner-product in some abstract vector space. Intuitively and informally, a kernel operator is a means for measuring similarity between instances. Almost all of the work that employed kernel operators concentrated on various machine learning problems that involved a predefined kernel. A typical approach when using kernels is to choose a kernel before learning starts. Examples to popular predefined kernels are the Radial Basis Functions and the polynomial kernels (see for instance [1]). Despite the simplicity required in modifying a learning algorithm to a “kernelized” version, the success of such algorithms is not well understood yet. More recently, special efforts have been devoted to crafting kernels for specific tasks such as text categorization [3] and protein classification problems [4]. Our work attempts to give a computational alternative to predefined kernels by learning kernel operators from data. We start with a few definitions. Let X be an instance space. A kernel is an inner-product operator K : X × X → . An explicit way to describe K is via a mapping φ : X → H from X to an inner-products space H such that K(x, x ) = φ(x)·φ(x ). Given a kernel operator and a finite set of instances S = {xi , yi }m , the kernel i=1 matrix (a.k.a the Gram matrix) is the matrix of all possible inner-products of pairs from S, Ki,j = K(xi , xj ). We therefore refer to the general form of K as the kernel operator and to the application of the kernel operator to a set of pairs of instances as the kernel matrix. The specific setting of kernel design we consider assumes that we have access to a base kernel learner and we are given a target kernel K manifested as a kernel matrix on a set of examples. Upon calling the base kernel learner it returns a kernel operator denote Kj . The goal thereafter is to find a weighted combination of kernels ˆ K(x, x ) = j αj Kj (x, x ) that is similar, in a sense that will be defined shortly, to ˆ the target kernel, K ∼ K . Cristianini et al. [5] in their pioneering work on kernel target alignment employed as the notion of similarity the inner-product between the kernel matrices < K, K >F = m K(xi , xj )K (xi , xj ). Given this definition, they defined the i,j=1 kernel-similarity, or alignment, to be the above inner-product normalized by the norm of ˆ ˆ ˆ ˆ ˆ each kernel, A(S, K, K ) = < K, K >F / < K, K >F < K , K >F , where S is, as above, a finite sample of m instances. Put another way, the kernel alignment Cristianini et al. employed is the cosine of the angle between the kernel matrices where each matrix is “flattened” into a vector of dimension m2 . Therefore, this definition implies that the alignment is bounded above by 1 and can attain this value iff the two kernel matrices are identical. Given a (column) vector of m labels y where yi ∈ {−1, +1} is the label of the instance xi , Cristianini et al. used the outer-product of y as the the target kernel, ˆ K = yy T . Therefore, an optimal alignment is achieved if K(xi , xj ) = yi yj . Clearly, if such a kernel is used for classifying instances from X , then the kernel itself suffices to construct an excellent classifier f : X → {−1, +1} by setting, f (x) = sign(y i K(xi , x)) where (xi , yi ) is any instance-label pair. Cristianini et al. then devised a procedure that works with both labelled and unlabelled examples to find a Gram matrix which attains a good alignment with K on the labelled part of the matrix. While this approach can clearly construct powerful kernels, a few problems arise from the notion of kernel alignment they employed. For instance, a kernel operator such that the sign(K(x i , xj )) is equal to yi yj but its magnitude, |K(xi , xj )|, is not necessarily 1, might achieve a poor alignment score while it can constitute a classifier whose empirical loss is zero. Furthermore, the task of finding a good kernel when it is not always possible to find a kernel whose sign on each pair of instances is equal to the products of the labels (termed the soft-margin case in [5, 6]) becomes rather tricky. We thus propose a different approach which attempts to overcome some of the difficulties above. Like Cristianini et al. we assume that we are given a set of labelled instances S = {(xi , yi ) | xi ∈ X , yi ∈ {−1, +1}, i = 1, . . . , m} . We are also given a set of unlabelled m ˜ ˜ examples S = {˜i }i=1 . If such a set is not provided we can simply use the labelled inx ˜ ˜ stances (without the labels themselves) as the set S. The set S is used for constructing the ˆ primitive kernels that are combined to constitute the learned kernel K. The labelled set is used to form the target kernel matrix and its instances are used for evaluating the learned ˆ kernel K. This approach, known as transductive learning, was suggested in [5, 6] for kernel alignment tasks when the distribution of the instances in the test data is different from that of the training data. This setting becomes in particular handy in datasets where the test data was collected in a different scheme than the training data. We next discuss the notion of kernel goodness employed in this paper. This notion builds on the objective function that several variants of boosting algorithms maintain [7, 8]. We therefore first discuss in brief the form of boosting algorithms for kernels. 2 Using Boosting to Combine Kernels Numerous interpretations of AdaBoost and its variants cast the boosting process as a procedure that attempts to minimize, or make small, a continuous bound on the classification error (see for instance [9, 7] and the references therein). A recent work by Collins et al. [8] unifies the boosting process for two popular loss functions, the exponential-loss (denoted henceforth as ExpLoss) and logarithmic-loss (denoted as LogLoss) that bound the empir- ˜ ˜ Input: Labelled and unlabelled sets of examples: S = {(xi , yi )}m ; S = {˜i }m x i=1 i=1 Initialize: K ← 0 (all zeros matrix) For t = 1, 2, . . . , T : • Calculate distribution over pairs 1 ≤ i, j ≤ m: Dt (i, j) = exp(−yi yj K(xi , xj )) 1/(1 + exp(−yi yj K(xi , xj ))) ExpLoss LogLoss ˜ • Call base-kernel-learner with (Dt , S, S) and receive Kt • Calculate: + − St = {(i, j) | yi yj Kt (xi , xj ) > 0} ; St = {(i, j) | yi yj Kt (xi , xj ) < 0} + Wt = (i,j)∈S + Dt (i, j)|Kt (xi , xj )| ; Wt− = (i,j)∈S − Dt (i, j)|Kt (xi , xj )| t t 1 2 + Wt − Wt • Set: αt = ln ; K ← K + α t Kt . Return: kernel operator K : X × X → Figure 1: The skeleton of the boosting algorithm for kernels. ical classification error. Given the prediction of a classifier f on an instance x and a label y ∈ {−1, +1} the ExpLoss and the LogLoss are defined as, ExpLoss(f (x), y) = exp(−yf (x)) LogLoss(f (x), y) = log(1 + exp(−yf (x))) . Collins et al. described a single algorithm for the two losses above that can be used within the boosting framework to construct a strong-hypothesis which is a classifier f (x). This classifier is a weighted combination of (possibly very simple) base classifiers. (In the boosting framework, the base classifiers are referred to as weak-hypotheses.) The strongT hypothesis is of the form f (x) = t=1 αt ht (x). Collins et al. discussed a few ways to select the weak-hypotheses ht and to find a good of weights αt . Our starting point in this paper is the first sequential algorithm from [8] that enables the construction or creation of weak-hypotheses on-the-fly. We would like to note however that it is possible to use other variants of boosting to design kernels. In order to use boosting to design kernels we extend the algorithm to operate over pairs of instances. Building on the notion of alignment from [5, 6], we say that the inner-product of x1 and x2 is aligned with the labels y1 and y2 if sign(K(x1 , x2 )) = y1 y2 . Furthermore, we would like to make the magnitude of K(x, x ) to be as large as possible. We therefore use one of the following two alignment losses for a pair of examples (x 1 , y1 ) and (x2 , y2 ), ExpLoss(K(x1 , x2 ), y1 y2 ) = exp(−y1 y2 K(x1 , x2 )) LogLoss(K(x1 , x2 ), y1 y2 ) = log(1 + exp(−y1 y2 K(x1 , x2 ))) . Put another way, we view a pair of instances as a single example and cast the pairs of instances that attain the same label as positively labelled examples while pairs of opposite labels are cast as negatively labelled examples. Clearly, this approach can be applied to both losses. In the boosting process we therefore maintain a distribution over pairs of instances. The weight of each pair reflects how difficult it is to predict whether the labels of the two instances are the same or different. The core boosting algorithm follows similar lines to boosting algorithms for classification algorithm. The pseudo code of the booster is given in Fig. 1. The pseudo-code is an adaptation the to problem of kernel design of the sequentialupdate algorithm from [8]. As with other boosting algorithm, the base-learner, which in our case is charge of returning a good kernel with respect to the current distribution, is left unspecified. We therefore turn our attention to the algorithmic implementation of the base-learning algorithm for kernels. 3 Learning Base Kernels The base kernel learner is provided with a training set S and a distribution D t over a pairs ˜ of instances from the training set. It is also provided with a set of unlabelled examples S. Without any knowledge of the topology of the space of instances a learning algorithm is likely to fail. Therefore, we assume the existence of an initial inner-product over the input space. We assume for now that this initial inner-product is the standard scalar products over vectors in n . We later discuss a way to relax the assumption on the form of the inner-product. Equipped with an inner-product, we define the family of base kernels to be the possible outer-products Kw = wwT between a vector w ∈ n and itself. Using this definition we get, Kw (xi , xj ) = (xi ·w)(xj ·w) . Input: A distribution Dt . Labelled and unlabelled sets: ˜ ˜ Therefore, the similarity beS = {(xi , yi )}m ; S = {˜i }m . x i=1 i=1 tween two instances xi and Compute : xj is high iff both xi and xj • Calculate: ˜ are similar (w.r.t the standard A ∈ m×m , Ai,r = xi · xr ˜ inner-product) to a third vecm×m B∈ , Bi,j = Dt (i, j)yi yj tor w. Analogously, if both ˜ ˜ K ∈ m×m , Kr,s = xr · xs ˜ ˜ xi and xj seem to be dissim• Find the generalized eigenvector v ∈ m for ilar to the vector w then they the problem AT BAv = λKv which attains are similar to each other. Dethe largest eigenvalue λ spite the restrictive form of • Set: w = ( r vr xr )/ ˜ ˜ r vr xr . the inner-products, this famt ily is still too rich for our setReturn: Kernel operator Kw = ww . ting and we further impose two restrictions on the inner Figure 2: The base kernel learning algorithm. products. First, we assume ˜ that w is restricted to a linear combination of vectors from S. Second, since scaling of the base kernels is performed by the boosted, we constrain the norm of w to be 1. The m ˜ resulting class of kernels is therefore, C = {Kw = wwT | w = r=1 βr xr , w = 1} . ˜ In the boosting process we need to choose a specific base-kernel K w from C. We therefore need to devise a notion of how good a candidate for base kernel is given a labelled set S and a distribution function Dt . In this work we use the simplest version suggested by Collins et al. This version can been viewed as a linear approximation on the loss function. We define the score of a kernel Kw w.r.t to the current distribution Dt to be, Score(Kw ) = Dt (i, j)yi yj Kw (xi , xj ) . (1) i,j The higher the value of the score is, the better Kw fits the training data. Note that if Dt (i, j) = 1/m2 (as is D0 ) then Score(Kw ) is proportional to the alignment since w = 1. Under mild assumptions the score can also provide a lower bound of the loss function. To see that let c be the derivative of the loss function at margin zero, c = Loss (0) . If all the √ training examples xi ∈ S lies in a ball of radius c, we get that Loss(Kw (xi , xj ), yi yj ) ≥ 1 − cKw (xi , xj )yi yj ≥ 0, and therefore, i,j Dt (i, j)Loss(Kw (xi , xj ), yi yj ) ≥ 1 − c Dt (i, j)Kw (xi , xj )yi yj . i,j Using the explicit form of Kw in the Score function (Eq. (1)) we get, Score(Kw ) = i,j D(i, j)yi yj (w·xi )(w·xj ) . Further developing the above equation using the constraint that w = m ˜ r=1 βr xr we get, ˜ Score(Kw ) = βs βr r,s i,j D(i, j)yi yj (xi · xr ) (xj · xs ) . ˜ ˜ To compute efficiently the base kernel score without an explicit enumeration we exploit the fact that if the initial distribution D0 is symmetric (D0 (i, j) = D0 (j, i)) then all the distributions generated along the run of the boosting process, D t , are also symmetric. We ˜ now define a matrix A ∈ m×m where Ai,r = xi · xr and a symmetric matrix B ∈ m×m ˜ with Bi,j = Dt (i, j)yi yj . Simple algebraic manipulations yield that the score function can be written as the following quadratic form, Score(β) = β T (AT BA)β , where β is m dimensional column vector. Note that since B is symmetric so is A T BA. Finding a ˜ good base kernel is equivalent to finding a vector β which maximizes this quadratic form 2 m ˜ under the norm equality constraint w = ˜ 2 = β T Kβ = 1 where Kr,s = r=1 βr xr xr · xs . Finding the maximum of Score(β) subject to the norm constraint is a well known ˜ ˜ maximization problem known as the generalized eigen vector problem (cf. [10]). Applying simple algebraic manipulations it is easy to show that the matrix AT BA is positive semidefinite. Assuming that the matrix K is invertible, the the vector β which maximizes the quadratic form is proportional the eigenvector of K −1 AT BA which is associated with the m ˜ generalized largest eigenvalue. Denoting this vector by v we get that w ∝ ˜ r=1 vr xr . m ˜ m ˜ Adding the norm constraint we get that w = ( r=1 vr xr )/ ˜ vr xr . The skeleton ˜ r=1 of the algorithm for finding a base kernels is given in Fig. 3. To conclude the description of the kernel learning algorithm we describe how to the extend the algorithm to be employed with general kernel functions. Kernelizing the Kernel: As described above, we assumed that the standard scalarproduct constitutes the template for the class of base-kernels C. However, since the proce˜ dure for choosing a base kernel depends on S and S only through the inner-products matrix A, we can replace the scalar-product itself with a general kernel operator κ : X × X → , where κ(xi , xj ) = φ(xi ) · φ(xj ). Using a general kernel function κ we can not compute however the vector w explicitly. We therefore need to show that the norm of w, and evaluation Kw on any two examples can still be performed efficiently. First note that given the vector v we can compute the norm of w as follows, T w 2 = vr xr ˜ vs xr ˜ r s = vr vs κ(˜r , xs ) . x ˜ r,s Next, given two vectors xi and xj the value of their inner-product is, Kw (xi , xj ) = vr vs κ(xi , xr )κ(xj , xs ) . ˜ ˜ r,s Therefore, although we cannot compute the vector w explicitly we can still compute its norm and evaluate any of the kernels from the class C. 4 Experiments Synthetic data: We generated binary-labelled data using as input space the vectors in 100 . The labels, in {−1, +1}, were picked uniformly at random. Let y designate the label of a particular example. Then, the first two components of each instance were drawn from a two-dimensional normal distribution, N (µ, ∆ ∆−1 ) with the following parameters, µ=y 0.03 0.03 1 ∆= √ 2 1 −1 1 1 = 0.1 0 0 0.01 . That is, the label of each examples determined the mean of the distribution from which the first two components were generated. The rest of the components in the vector (98 8 0.2 6 50 50 100 100 150 150 200 200 4 2 0 0 −2 −4 −6 250 250 −0.2 −8 −0.2 0 0.2 −8 −6 −4 −2 0 2 4 6 8 300 20 40 60 80 100 120 140 160 180 200 300 20 40 60 80 100 120 140 160 180 Figure 3: Results on a toy data set prior to learning a kernel (first and third from left) and after learning (second and fourth). For each of the two settings we show the first two components of the training data (left) and the matrix of inner products between the train and the test data (right). altogether) were generated independently using the normal distribution with a zero mean and a standard deviation of 0.05. We generated 100 training and test sets of size 300 and 200 respectively. We used the standard dot-product as the initial kernel operator. On each experiment we first learned a linear classier that separates the classes using the Perceptron [11] algorithm. We ran the algorithm for 10 epochs on the training set. After each epoch we evaluated the performance of the current classifier on the test set. We then used the boosting algorithm for kernels with the LogLoss for 30 rounds to build a kernel for each random training set. After learning the kernel we re-trained a classifier with the Perceptron algorithm and recorded the results. A summary of the online performance is given in Fig. 4. The plot on the left-hand-side of the figure shows the instantaneous error (achieved during the run of the algorithm). Clearly, the Perceptron algorithm with the learned kernel converges much faster than the original kernel. The middle plot shows the test error after each epoch. The plot on the right shows the test error on a noisy test set in which we added a Gaussian noise of zero mean and a standard deviation of 0.03 to the first two features. In all plots, each bar indicates a 95% confidence level. It is clear from the figure that the original kernel is much slower to converge than the learned kernel. Furthermore, though the kernel learning algorithm was not expoed to the test set noise, the learned kernel reflects better the structure of the feature space which makes the learned kernel more robust to noise. Fig. 3 further illustrates the benefits of using a boutique kernel. The first and third plots from the left correspond to results obtained using the original kernel and the second and fourth plots show results using the learned kernel. The left plots show the empirical distribution of the two informative components on the test data. For the learned kernel we took each input vector and projected it onto the two eigenvectors of the learned kernel operator matrix that correspond to the two largest eigenvalues. Note that the distribution after the projection is bimodal and well separated along the first eigen direction (x-axis) and shows rather little deviation along the second eigen direction (y-axis). This indicates that the kernel learning algorithm indeed found the most informative projection for separating the labelled data with large margin. It is worth noting that, in this particular setting, any algorithm which chooses a single feature at a time is prone to failure since both the first and second features are mandatory for correctly classifying the data. The two plots on the right hand side of Fig. 3 use a gray level color-map to designate the value of the inner-product between each pairs instances, one from training set (y-axis) and the other from the test set. The examples were ordered such that the first group consists of the positively labelled instances while the second group consists of the negatively labelled instances. Since most of the features are non-relevant the original inner-products are noisy and do not exhibit any structure. In contrast, the inner-products using the learned kernel yields in a 2 × 2 block matrix indicating that the inner-products between instances sharing the same label obtain large positive values. Similarly, for instances of opposite 200 1 12 Regular Kernel Learned Kernel 0.8 17 0.7 16 0.5 0.4 0.3 Test Error % 8 0.6 Regular Kernel Learned Kernel 18 10 Test Error % Averaged Cumulative Error % 19 Regular Kernel Learned Kernel 0.9 6 4 15 14 13 12 0.2 11 2 0.1 10 0 0 10 1 10 2 10 Round 3 10 4 10 0 2 4 6 Epochs 8 10 9 2 4 6 Epochs 8 10 Figure 4: The online training error (left), test error (middle) on clean synthetic data using a standard kernel and a learned kernel. Right: the online test error for the two kernels on a noisy test set. labels the inner products are large and negative. The form of the inner-products matrix of the learned kernel indicates that the learning problem itself becomes much easier. Indeed, the Perceptron algorithm with the standard kernel required around 94 training examples on the average before converging to a hyperplane which perfectly separates the training data while using the Perceptron algorithm with learned kernel required a single example to reach a perfect separation on all 100 random training sets. USPS dataset: The USPS (US Postal Service) dataset is known as a challenging classification problem in which the training set and the test set were collected in a different manner. The USPS contains 7, 291 training examples and 2, 007 test examples. Each example is represented as a 16 × 16 matrix where each entry in the matrix is a pixel that can take values in {0, . . . , 255}. Each example is associated with a label in {0, . . . , 9} which is the digit content of the image. Since the kernel learning algorithm is designed for binary problems, we broke the 10-class problem into 45 binary problems by comparing all pairs of classes. The interesting question of how to learn kernels for multiclass problems is beyond the scopre of this short paper. We thus constraint on the binary error results for the 45 binary problem described above. For the original kernel we chose a RBF kernel with σ = 1 which is the value employed in the experiments reported in [12]. We used the kernelized version of the kernel design algorithm to learn a different kernel operator for each of the binary problems. We then used a variant of the Perceptron [11] and with the original RBF kernel and with the learned kernels. One of the motivations for using the Perceptron is its simplicity which can underscore differences in the kernels. We ran the kernel learning al˜ gorithm with LogLoss and ExpLoss, using bith the training set and the test test as S. Thus, we obtained four different sets of kernels where each set consists of 45 kernels. By examining the training loss, we set the number of rounds of boosting to be 30 for the LogLoss and 50 for the ExpLoss, when using the trainin set. When using the test set, the number of rounds of boosting was set to 100 for both losses. Since the algorithm exhibits slower rate of convergence with the test data, we choose a a higher value without attempting to optimize the actual value. The left plot of Fig. 5 is a scatter plot comparing the test error of each of the binary classifiers when trained with the original RBF a kernel versus the performance achieved on the same binary problem with a learned kernel. The kernels were built ˜ using boosting with the LogLoss and S was the training data. In almost all of the 45 binary classification problems, the learned kernels yielded lower error rates when combined with the Perceptron algorithm. The right plot of Fig. 5 compares two learned kernels: the first ˜ was build using the training instances as the templates constituing S while the second used the test instances. Although the differenece between the two versions is not as significant as the difference on the left plot, we still achieve an overall improvement in about 25% of the binary problems by using the test instances. 6 4.5 4 5 Learned Kernel (Test) Learned Kernel (Train) 3.5 4 3 2 3 2.5 2 1.5 1 1 0.5 0 0 1 2 3 Base Kernel 4 5 6 0 0 1 2 3 Learned Kernel (Train) 4 5 Figure 5: Left: a scatter plot comparing the error rate of 45 binary classifiers trained using an RBF kernel (x-axis) and a learned kernel with training instances. Right: a similar scatter plot for a learned kernel only constructed from training instances (x-axis) and test instances. 5 Discussion In this paper we showed how to use the boosting framework to design kernels. Our approach is especially appealing in transductive learning tasks where the test data distribution is different than the the distribution of the training data. For example, in speech recognition tasks the training data is often clean and well recorded while the test data often passes through a noisy channel that distorts the signal. An interesting and challanging question that stem from this research is how to extend the framework to accommodate more complex decision tasks such as multiclass and regression problems. Finally, we would like to note alternative approaches to the kernel design problem has been devised in parallel and independently. See [13, 14] for further details. Acknowledgements: Special thanks to Cyril Goutte and to John Show-Taylor for pointing the connection to the generalized eigen vector problem. Thanks also to the anonymous reviewers for constructive comments. References [1] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998. [2] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000. [3] Huma Lodhi, John Shawe-Taylor, Nello Cristianini, and Christopher J. C. H. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444, 2002. [4] C. Leslie, E. Eskin, and W. Stafford Noble. The spectrum kernel: A string kernel for svm protein classification. In Proceedings of the Pacific Symposium on Biocomputing, 2002. [5] Nello Cristianini, Andre Elisseeff, John Shawe-Taylor, and Jaz Kandla. On kernel target alignment. In Advances in Neural Information Processing Systems 14, 2001. [6] G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. Jordan. Learning the kernel matrix with semi-definite programming. In Proc. of the 19th Intl. Conf. on Machine Learning, 2002. [7] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337–374, April 2000. [8] Michael Collins, Robert E. Schapire, and Yoram Singer. Logistic regression, adaboost and bregman distances. Machine Learning, 47(2/3):253–285, 2002. [9] Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean. Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers. MIT Press, 1999. [10] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1985. [11] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958. [12] B. Sch¨ lkopf, S. Mika, C.J.C. Burges, P. Knirsch, K. M¨ ller, G. R¨ tsch, and A.J. Smola. Input o u a space vs. feature space in kernel-based methods. IEEE Trans. on NN, 10(5):1000–1017, 1999. [13] O. Bosquet and D.J.L. Herrmann. On the complexity of learning the kernel matrix. NIPS, 2002. [14] C.S. Ong, A.J. Smola, and R.C. Williamson. Superkenels. NIPS, 2002.
6 0.14202766 58 nips-2002-Conditional Models on the Ranking Poset
7 0.13979478 135 nips-2002-Learning with Multiple Labels
8 0.12528697 69 nips-2002-Discriminative Learning for Label Sequences via Boosting
9 0.10423885 45 nips-2002-Boosted Dyadic Kernel Discriminants
10 0.10022896 67 nips-2002-Discriminative Binaural Sound Localization
11 0.087420307 92 nips-2002-FloatBoost Learning for Classification
12 0.086998537 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
13 0.084791929 140 nips-2002-Margin Analysis of the LVQ Algorithm
14 0.07728482 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
15 0.076626919 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
16 0.075736299 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
17 0.072388127 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines
18 0.070478722 181 nips-2002-Self Supervised Boosting
19 0.069489382 77 nips-2002-Effective Dimension and Generalization of Kernel Learning
20 0.068131849 72 nips-2002-Dyadic Classification Trees via Structural Risk Minimization
topicId topicWeight
[(0, -0.229), (1, -0.147), (2, 0.086), (3, -0.037), (4, 0.26), (5, -0.006), (6, -0.045), (7, -0.134), (8, -0.036), (9, -0.056), (10, -0.073), (11, -0.032), (12, -0.081), (13, -0.149), (14, -0.021), (15, 0.154), (16, 0.072), (17, 0.146), (18, -0.062), (19, -0.066), (20, 0.059), (21, -0.12), (22, -0.151), (23, 0.143), (24, 0.177), (25, -0.041), (26, 0.02), (27, 0.038), (28, -0.088), (29, 0.154), (30, -0.045), (31, 0.002), (32, 0.177), (33, -0.09), (34, -0.13), (35, -0.045), (36, 0.015), (37, -0.043), (38, 0.008), (39, 0.07), (40, -0.081), (41, -0.01), (42, -0.019), (43, -0.008), (44, 0.053), (45, -0.031), (46, 0.073), (47, -0.05), (48, 0.003), (49, 0.051)]
simIndex simValue paperId paperTitle
same-paper 1 0.95266843 149 nips-2002-Multiclass Learning by Probabilistic Embeddings
Author: Ofer Dekel, Yoram Singer
Abstract: We describe a new algorithmic framework for learning multiclass categorization problems. In this framework a multiclass predictor is composed of a pair of embeddings that map both instances and labels into a common space. In this space each instance is assigned the label it is nearest to. We outline and analyze an algorithm, termed Bunching, for learning the pair of embeddings from labeled data. A key construction in the analysis of the algorithm is the notion of probabilistic output codes, a generalization of error correcting output codes (ECOC). Furthermore, the method of multiclass categorization using ECOC is shown to be an instance of Bunching. We demonstrate the advantage of Bunching over ECOC by comparing their performance on numerous categorization problems.
2 0.82712746 19 nips-2002-Adapting Codes and Embeddings for Polychotomies
Author: Gunnar Rätsch, Sebastian Mika, Alex J. Smola
Abstract: In this paper we consider formulations of multi-class problems based on a generalized notion of a margin and using output coding. This includes, but is not restricted to, standard multi-class SVM formulations. Differently from many previous approaches we learn the code as well as the embedding function. We illustrate how this can lead to a formulation that allows for solving a wider range of problems with for instance many classes or even “missing classes”. To keep our optimization problems tractable we propose an algorithm capable of solving them using twoclass classifiers, similar in spirit to Boosting.
3 0.71419787 58 nips-2002-Conditional Models on the Ranking Poset
Author: Guy Lebanon, John D. Lafferty
Abstract: A distance-based conditional model on the ranking poset is presented for use in classification and ranking. The model is an extension of the Mallows model, and generalizes the classifier combination methods used by several ensemble learning algorithms, including error correcting output codes, discrete AdaBoost, logistic regression and cranking. The algebraic structure of the ranking poset leads to a simple Bayesian interpretation of the conditional model and its special cases. In addition to a unifying view, the framework suggests a probabilistic interpretation for error correcting output codes and an extension beyond the binary coding scheme.
4 0.6716302 59 nips-2002-Constraint Classification for Multiclass Classification and Ranking
Author: Sariel Har-Peled, Dan Roth, Dav Zimak
Abstract: The constraint classification framework captures many flavors of multiclass classification including winner-take-all multiclass classification, multilabel classification and ranking. We present a meta-algorithm for learning in this framework that learns via a single linear classifier in high dimension. We discuss distribution independent as well as margin-based generalization bounds and present empirical and theoretical evidence showing that constraint classification benefits over existing methods of multiclass classification.
5 0.53884912 42 nips-2002-Bias-Optimal Incremental Problem Solving
Author: Jürgen Schmidhuber
Abstract: Given is a problem sequence and a probability distribution (the bias) on programs computing solution candidates. We present an optimally fast way of incrementally solving each task in the sequence. Bias shifts are computed by program prefixes that modify the distribution on their suffixes by reusing successful code for previous tasks (stored in non-modifiable memory). No tested program gets more runtime than its probability times the total search time. In illustrative experiments, ours becomes the first general system to learn a universal solver for arbitrary disk Towers of Hanoi tasks (minimal solution size ). It demonstrates the advantages of incremental learning by profiting from previously solved, simpler tasks involving samples of a simple context free language. ¦ ¤ ¢ §¥£¡ 1 Brief Introduction to Optimal Universal Search Consider an asymptotically optimal method for tasks with quickly verifiable solutions: ¦ ¦ © £ £¨ © © © © ¦ ¦ ¦ Method 1.1 (L SEARCH ) View the -th binary string as a potential program for a universal Turing machine. Given some problem, for all do: every steps on average execute (if possible) one instruction of the -th program candidate, until one of the programs has computed a solution. ! © © © ¢
6 0.48506108 67 nips-2002-Discriminative Binaural Sound Localization
7 0.44016072 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
8 0.43482733 135 nips-2002-Learning with Multiple Labels
9 0.42154935 109 nips-2002-Improving a Page Classifier with Anchor Extraction and Link Analysis
10 0.39834175 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
11 0.39669788 45 nips-2002-Boosted Dyadic Kernel Discriminants
12 0.39233708 69 nips-2002-Discriminative Learning for Label Sequences via Boosting
13 0.39160362 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
14 0.37665892 108 nips-2002-Improving Transfer Rates in Brain Computer Interfacing: A Case Study
15 0.36509097 6 nips-2002-A Formulation for Minimax Probability Machine Regression
16 0.35737535 92 nips-2002-FloatBoost Learning for Classification
17 0.33241111 140 nips-2002-Margin Analysis of the LVQ Algorithm
18 0.33157122 120 nips-2002-Kernel Design Using Boosting
19 0.32104564 162 nips-2002-Parametric Mixture Models for Multi-Labeled Text
20 0.31769055 55 nips-2002-Combining Features for BCI
topicId topicWeight
[(3, 0.036), (11, 0.015), (23, 0.012), (42, 0.075), (52, 0.249), (54, 0.15), (55, 0.07), (57, 0.02), (68, 0.028), (74, 0.092), (92, 0.024), (98, 0.144)]
simIndex simValue paperId paperTitle
1 0.92556828 58 nips-2002-Conditional Models on the Ranking Poset
Author: Guy Lebanon, John D. Lafferty
Abstract: A distance-based conditional model on the ranking poset is presented for use in classification and ranking. The model is an extension of the Mallows model, and generalizes the classifier combination methods used by several ensemble learning algorithms, including error correcting output codes, discrete AdaBoost, logistic regression and cranking. The algebraic structure of the ranking poset leads to a simple Bayesian interpretation of the conditional model and its special cases. In addition to a unifying view, the framework suggests a probabilistic interpretation for error correcting output codes and an extension beyond the binary coding scheme.
same-paper 2 0.83958745 149 nips-2002-Multiclass Learning by Probabilistic Embeddings
Author: Ofer Dekel, Yoram Singer
Abstract: We describe a new algorithmic framework for learning multiclass categorization problems. In this framework a multiclass predictor is composed of a pair of embeddings that map both instances and labels into a common space. In this space each instance is assigned the label it is nearest to. We outline and analyze an algorithm, termed Bunching, for learning the pair of embeddings from labeled data. A key construction in the analysis of the algorithm is the notion of probabilistic output codes, a generalization of error correcting output codes (ECOC). Furthermore, the method of multiclass categorization using ECOC is shown to be an instance of Bunching. We demonstrate the advantage of Bunching over ECOC by comparing their performance on numerous categorization problems.
3 0.79811692 63 nips-2002-Critical Lines in Symmetry of Mixture Models and its Application to Component Splitting
Author: Kenji Fukumizu, Shotaro Akaho, Shun-ichi Amari
Abstract: We show the existence of critical points as lines for the likelihood function of mixture-type models. They are given by embedding of a critical point for models with less components. A sufficient condition that the critical line gives local maxima or saddle points is also derived. Based on this fact, a component-split method is proposed for a mixture of Gaussian components, and its effectiveness is verified through experiments. 1
4 0.78081793 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
Author: Yves Grandvalet, Stéphane Canu
Abstract: This paper introduces an algorithm for the automatic relevance determination of input variables in kernelized Support Vector Machines. Relevance is measured by scale factors defining the input space metric, and feature selection is performed by assigning zero weights to irrelevant variables. The metric is automatically tuned by the minimization of the standard SVM empirical risk, where scale factors are added to the usual set of parameters defining the classifier. Feature selection is achieved by constraints encouraging the sparsity of scale factors. The resulting algorithm compares favorably to state-of-the-art feature selection procedures and demonstrates its effectiveness on a demanding facial expression recognition problem.
5 0.69825453 10 nips-2002-A Model for Learning Variance Components of Natural Images
Author: Yan Karklin, Michael S. Lewicki
Abstract: We present a hierarchical Bayesian model for learning efficient codes of higher-order structure in natural images. The model, a non-linear generalization of independent component analysis, replaces the standard assumption of independence for the joint distribution of coefficients with a distribution that is adapted to the variance structure of the coefficients of an efficient image basis. This offers a novel description of higherorder image structure and provides a way to learn coarse-coded, sparsedistributed representations of abstract image properties such as object location, scale, and texture.
6 0.68884701 2 nips-2002-A Bilinear Model for Sparse Coding
7 0.68797195 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
8 0.68745661 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
9 0.68743098 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
10 0.68608779 53 nips-2002-Clustering with the Fisher Score
11 0.68599045 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
12 0.68508083 193 nips-2002-Temporal Coherence, Natural Image Sequences, and the Visual Cortex
13 0.68443197 141 nips-2002-Maximally Informative Dimensions: Analyzing Neural Responses to Natural Signals
14 0.68372238 46 nips-2002-Boosting Density Estimation
15 0.68208158 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
16 0.68195593 65 nips-2002-Derivative Observations in Gaussian Process Models of Dynamic Systems
17 0.68165761 11 nips-2002-A Model for Real-Time Computation in Generic Neural Microcircuits
18 0.68144858 135 nips-2002-Learning with Multiple Labels
19 0.68107891 110 nips-2002-Incremental Gaussian Processes
20 0.68059987 124 nips-2002-Learning Graphical Models with Mercer Kernels