nips nips2002 nips2002-69 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yasemin Altun, Thomas Hofmann, Mark Johnson
Abstract: This paper investigates a boosting approach to discriminative learning of label sequences based on a sequence rank loss function. The proposed method combines many of the advantages of boosting schemes with the efficiency of dynamic programming methods and is attractive both, conceptually and computationally. In addition, we also discuss alternative approaches based on the Hamming loss for label sequences. The sequence boosting algorithm offers an interesting alternative to methods based on HMMs and the more recently proposed Conditional Random Fields. Applications areas for the presented technique range from natural language processing and information extraction to computational biology. We include experiments on named entity recognition and part-of-speech tagging which demonstrate the validity and competitiveness of our approach. 1
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract This paper investigates a boosting approach to discriminative learning of label sequences based on a sequence rank loss function. [sent-4, score-1.108]
2 The proposed method combines many of the advantages of boosting schemes with the efficiency of dynamic programming methods and is attractive both, conceptually and computationally. [sent-5, score-0.456]
3 In addition, we also discuss alternative approaches based on the Hamming loss for label sequences. [sent-6, score-0.442]
4 The sequence boosting algorithm offers an interesting alternative to methods based on HMMs and the more recently proposed Conditional Random Fields. [sent-7, score-0.403]
5 Applications areas for the presented technique range from natural language processing and information extraction to computational biology. [sent-8, score-0.106]
6 We include experiments on named entity recognition and part-of-speech tagging which demonstrate the validity and competitiveness of our approach. [sent-9, score-0.368]
7 1 Introduction The problem of annotating or segmenting observation sequences arises in many applications across a variety of scientific disciplines, most prominently in natural language processing, speech recognition, and computational biology. [sent-10, score-0.342]
8 Up to now, the predominant formalism for modeling and predicting label sequences has been based on Hidden Markov Models (HMMs) and variations thereof. [sent-12, score-0.423]
9 First, generative probabilistic models are typically trained using maximum likelihood estimation (MLE) for a joint sampling model of observation and label sequences. [sent-14, score-0.286]
10 Secondly, efficient inference and learning in this setting often requires to make questionable conditional independence assumptions. [sent-16, score-0.137]
11 More precisely, in the case of HMMs, it is assumed that the Markov blanket of the hidden label variable at time step t consists of the previous and next labels as well as the t-th observation. [sent-17, score-0.235]
12 This implies that all dependencies on past and future observations are mediated through neighboring labels. [sent-18, score-0.1]
13 In this paper, we investigate the use of discriminative learning methods for learning label sequences. [sent-19, score-0.31]
14 This line of research continues previous approaches for learning conditional models , namely Conditional Random Fields (CRFs) [6], and discriminative re-ranking [1, 2] . [sent-20, score-0.174]
15 CRFs have two main advantages compared to HMMs: They are trained discriminatively by maximizing a conditional (or pseudo-) likelihood criterion and they are more flexible in modeling additional dependencies such as direct dependencies of the t-th label on past or future observations. [sent-21, score-0.438]
16 First of all, and this is the main emphasis of this paper, an exponential loss function such as the one used in boosting algorithms [9,4] may be preferable to the logarithmic loss function used in CRFs. [sent-23, score-0.876]
17 In particular we will present a boosting algorithm that has the additional advantage of performing implicit feature selection, typically resulting in very sparse models. [sent-24, score-0.371]
18 This is important for model regularization as well as for reasons of efficiency in high dimensional feature spaces. [sent-25, score-0.094]
19 Secondly, we will also discuss the use of loss functions that explicitly minimize the zer%ne loss on labels, i. [sent-26, score-0.414]
20 the Hamming loss, as an alternative to loss functions based on ranking or predicting entire label sequences. [sent-28, score-0.52]
21 2 Additive Models and Exponential Families Formally, learning label sequences is a generalization of the standard supervised classification problem. [sent-29, score-0.474]
22 a mapping from observation sequences X = (X1,X2, . [sent-32, score-0.239]
23 The availability of a training set of labeled sequences X == {(Xi, yi) : i = 1, . [sent-45, score-0.231]
24 k(X , Y ; t) (1) k Here fk denotes a (discrete) feature in the language of maximum entropy modeling, or a weak learner in the language of boosting. [sent-51, score-0.392]
25 In the context of label se, quences fk will typically be either of the form f~1)(Xt+s,Yt) (with S E {-l , O l}) or f~2) (Yt-1, Yt). [sent-52, score-0.344]
26 The first type of features will model dependencies between the observation sequence X and the t-th label in the sequence, while the second type will model inter-label dependencies between neighboring label variables. [sent-53, score-0.851]
27 For ease of presentation, we will assume that all features are binary, i. [sent-54, score-0.09]
28 A typical way of defining a set of weak learners is as follows: fk(1) ( Xt+s , Yt ) J(Yt, y(k))Xdxt+s) (2) (3) J(Yt ,y(k))J(Yt-1 ,y(k)) . [sent-57, score-0.134]
29 fk(2) ( Yt-1, Yt ) where J denotes the Kronecker-J and Xk is a binary feature function that extracts a feature from an observation pattern; y(k) and y(k) refer to the label values for which the weak learner becomes "active". [sent-58, score-0.505]
30 There is a natural way to associate a conditional probability distribution over label sequences Y with an additive model Fo by defining an exponential family for every fixed observation sequence X == Po(YIX) exp~:(~; Y)], Zo(X) (4) == Lexp[Fo(X,Y)]. [sent-59, score-0.844]
31 y This distribution is in exponential normal form and the parameters B are also called natural or canonical parameters. [sent-60, score-0.097]
32 By performing the sum over the sequence index t, we can see that the corresponding sufficient statistics are given by Sk(X, Y) == 2: t h(X, Y; t). [sent-61, score-0.161]
33 These sufficient statistics simply count the number of times the feature fk has been "active" along the labeled sequence (X, Y). [sent-62, score-0.326]
34 The gradient can be readily computed as (6) where expectations are taken w. [sent-69, score-0.14]
35 The stationary equations then simply state that uniformly averaged over the training data, the observed sufficient statistics should match their conditional expectations. [sent-73, score-0.172]
36 Computationally, the evaluation of S(Xi, yi) is straightforward counting, while summing over all sequences Y to compute E [S(X, Y)IX = Xi] can be performed using dynamic programming, since the dependency structure between labels is a simple chain. [sent-74, score-0.232]
37 4 Ranking Loss Functions for Label Sequences As an alternative to logarithmic loss functions, we propose to minimize an upper bound on the ranking loss [9] adapted to label sequences. [sent-75, score-0.875]
38 The ranking loss of a discriminant function Fo w. [sent-76, score-0.319]
39 a set of training sequences is defined as 1{rnk(B;X) = L L i == 8(Fo(Xi,Y) _FO(Xi,yi)), 8(x) {~ ~~~:r~~e (7) Y;iY; which is simply the sum of the number of label sequences that are ranked higher than or equal to the true label sequence over all training sequences. [sent-79, score-0.934]
40 It is straightforward to see (based on a term by term comparison) that an upper bound on the rank loss is given by the following exponential loss function 1{exp(B; X) == L L i exp [FO(Xi, Y) - FO(Xi, yi)] = Y#Y' L i [Po 0 (~iIXi) -1]. [sent-80, score-0.717]
41 (8) Interestingly this simply leads to a loss function that uses the inverse conditional probability of the true label sequence, if we define this probability via the exponential form in Eq. [sent-81, score-0.687]
42 Notice that compared to [1], we include all sequences and not just the top N list generated by some external mechanism. [sent-83, score-0.188]
43 As we will show shortly, an explicit summation is possible because of the availability of dynamic programming formulation to compute sums over all sequences efficiently. [sent-84, score-0.379]
44 In order to derive gradient equations for the exponential loss we can simply make use of the elementary facts \1 eP(()) 1 \1 eP(()) \le(-logP(()))=- P(()) , and\le p (())=- P(())2 \le( -logP(())) P(()) (9) Then it is easy to see that (10) The only difference between Eq. [sent-85, score-0.403]
45 (10) is the non-uniform weighting of different sequences by their inverse probability, hence putting more emphasis on training label sequences that receive a small overall (conditional) probability. [sent-87, score-0.611]
46 5 Boosting Algorithm for Label Sequences As an alternative to a simple gradient method, we now turn to the derivation of a boosting algorithm, following the boosting formulation presented in [9]. [sent-88, score-0.774]
47 Let us introduce a relative weight (or distribution) D(i , Y) for each label sequence Y w. [sent-89, score-0.323]
48 L i Ly D(i , Y) = 1, exp [Fe (Xi, Y) - Fe (Xi, yi)] Lj, LY, #Yj exp [Fe(Xj , Y') - Fe (Xj, y j)]' D(i, Y) . [sent-94, score-0.216]
49 (12) shows how we can split D(i, Y) into a relative weight for each training instance, given by D(i) , and a relative weight of each sequence, given by the re-normalized conditional probability Pe(YIX i ). [sent-98, score-0.099]
50 We define a boosting algorithm which in each round aims at minimizing the partition function or weight normalization constant Zk w. [sent-100, score-0.432]
51 a weak learner fk and a corresponding optimal parameter increment L,()k Zk(L,()k) == "D(i)" ~ • = P~~IXli) . [sent-103, score-0.216]
52 ) exp [L,()k(Sk(X i , Y)-Sk(Xi, yi))](13) e Y¡X¡ ~ . [sent-104, score-0.108]
53 1- Y#Y' ~ ( ~ D(i)P(bIXi; k)) exp [bL,()k], where Pe(bIXi; k) = LY EY (b; X i) Pe(YIX i )/( l Y 1- y i 1\ (Sk(Xi,Y) - Sk(Xi,yi)) = b}. [sent-105, score-0.108]
54 tractable if the number of features is small, with accumulators [6] for every feature seems (14) - Pe(yi IXi)) and Y(b; Xi) == {Y : This minimization problem is only since a dynamic programming run to be required in order to compute the probabilities Po(bIXi; k), i. [sent-106, score-0.249]
55 the probability for the k-th feature to be active exactly b times, conditioned on the observation sequence Xi. [sent-108, score-0.195]
56 In cases, where this is intractable (and we assume this will be the case in most applications), one can instead minimize an upper bound on every Zk' The general idea is to exploit the convexity of the exponential function and to bound (15) which is valid for every x E [xmin; xmax]. [sent-109, score-0.247]
57 m tk tk + (1- rik)eMkU,&aX;), where tk (17) i rik == " ~ (18) 7fi(Y) u'[kax - Uik(:) uI? [sent-119, score-0.424]
58 ax _ u mm tk y:;tyi tk By taking the second derivative w. [sent-120, score-0.314]
59 If one is willing to accept a looser bound, one can instead work with the interval [uk'in; uk'ax] which is the union of the intervals [u'[kin; u'[kax] for every training sequence i and obtain the upper bound Zk(L. [sent-126, score-0.186]
60 The final boosting procedure picks at every round the feature for which the upper bound on Zk is minimal and then performs an update of Bk +- Bk + L. [sent-130, score-0.502]
61 k has been selected, since the upper bound approximation may underestimate the optimal step sizes. [sent-134, score-0.098]
62 It is important to see that the quantities involved (rik and rk, respectively) are simple expectations of sufficient statistics that can be computed for all features simultaneously with a single dynamic programming run per sequence. [sent-135, score-0.307]
63 6 Hamming Loss for Label Sequences In many applications one is primarily interested in the label-by-labelloss or Hamming loss [9]. [sent-136, score-0.207]
64 Here we investigate how to train models by minimizing an upper bound on the Hamming loss. [sent-137, score-0.133]
65 The following logarithmic loss aims at maximizing the log-probability for each individual label and is given by F1og(B;X) == - LL)og P o(y1I Xi ) = - LLlog L PO(YIX i ). [sent-138, score-0.492]
66 (22) v:Yt = Y; Again, focusing on gradient descent methods, the gradient is given by As can be seen, the expected sufficient statistics are now compared not to their empirical values, but to their expected values, conditioned on a given label value (and not the entire sequence Vi). [sent-139, score-0.632]
67 In order to evaluate these expectations, one can perform dynamic programming using the algorithm described in [5], which has (independently of our work) focused on the use of Hamming loss functions in the context of CRFs. [sent-140, score-0.31]
68 Y; Similar to the log-loss case, one can define an exponential loss function that corresponds to a margin-like quantity at every single label. [sent-142, score-0.353]
69 We propose minimizing the following loss function ~~~ L . [sent-143, score-0.242]
70 t 2, exp [F'(X;, Y) -log Y'~": exp [Fo(X" V')] ]<24) LR ( iIXi'B) - l l:vexp [FO(Xi,y)] = l:v ' t exp [FO(Xi, Y)] Yt=y i . [sent-144, score-0.324]
71 t 2, 0 Yt (25) , As a motivation, we point out that for the case of sequences of length 1, this will reduce to the standard multi-class exponential loss. [sent-145, score-0.285]
72 Effectively in this model, the prediction of a label Yt will mimic the probabilistic marginalization, i. [sent-146, score-0.235]
73 = argmax y FO(Xi, Y; t), FO(Xi, Y; t) = log l:v:Yt=Y exp [FO(Xi, Y)]. [sent-148, score-0.108]
74 y; Similar to the log-loss case, the gradient is given by _ " E [S(X , Y)IX = Xi ,Yt = yn ~ E [S(Xi, Y)IX = Xi] (26) it' Po(y:IX') Again, we see the same differences between the log-loss and the exponential loss, but this time for individual labels. [sent-149, score-0.196]
75 Labels for which the marginal probability Po (yf IXi) is small are accentuated in the exponential loss. [sent-150, score-0.097]
76 We have not been able to derive a boosting formulation for this loss function, mainly because it cannot be written as a sum of exponential terms. [sent-152, score-0.664]
77 We have thus resorted to conjugate gradient descent methods for minimizing Fexp in our experiments. [sent-153, score-0.26]
78 Each word is tagged with the type of the name as well as its position in the name phrase (i. [sent-156, score-0.168]
79 We used simple binary features to ask questions about the word being tagged, as well as the previous tag (i. [sent-161, score-0.264]
80 We also used features to ask detailed questions (i. [sent-165, score-0.126]
81 We ran experiments comparing the different loss functions optimized with the conjugate gradient method and the boosting algorithm. [sent-172, score-0.709]
82 We designed three sets of features: HMM features (=31), 31 and detailed features of the current word (= 32), and 32 and detailed features of the neighboring words (=33). [sent-173, score-0.391]
83 The results summarized in Table 1 demonstrate the competitiveness of the Feature Objective proposed loss functions with respect to log exp boost Set 1{log. [sent-174, score-0.406]
84 Consequently, the boosting algorithm needs Table 1: Test error of the Spanish corto include almost all weak learners in pus for named entity recognition. [sent-192, score-0.625]
85 When there are more detailed features , the boosting algorithm is competitive with the conjugate gradient method, but has the advantage of generating sparser models. [sent-194, score-0.592]
86 The conjugate gradient method uses all of the available features, whereas boosting uses only about 10% of the features. [sent-195, score-0.502]
87 2 Part of Speech Tagging We used the Penn TreeBank corpus for t he part-of-speech tagging experiments. [sent-197, score-0.127]
88 The features were similar to the feature sets Sl and S2 described above in the context of NER. [sent-198, score-0.146]
89 It can be seen that the test errors obtained by different loss functions lie within a relatively small range. [sent-200, score-0.207]
90 3 Feature Set Sl S2 Objective log exp boost 1{ :F 1{ :F 4. [sent-203, score-0.144]
91 09 - Table 2: Test error of the Penn TreeBank corpus for POS General Comments Even with t he tighter bound in the boosting formulation , the same features are selected many times, because of the conservative estimate of the step size for parameter updates. [sent-213, score-0.542]
92 We expect to speed up the convergence of the boosting algorithm by using a more sophisticated line search mechanism to compute the optimal step length, a conjecture that will be addressed in future work. [sent-214, score-0.315]
93 Although we did not use real-valued features in our experiments, we observed that including real-valued features in a conjugate gradient formulation is a challenge, whereas it is very natural to have such features in a boosting algorithm. [sent-215, score-0.817]
94 We noticed in our experiments that defining a distribution over the training instances using the inverse conditional probability creates problems in the boosting formulation for data sets that are highly unbalanced in terms of the length of the training sequences. [sent-216, score-0.509]
95 The conjugate gradient optimization, on the other hand, did not appear to suffer from this problem. [sent-218, score-0.187]
96 8 Conclusion and Future Work This paper makes two contributions to the problem of learning label sequences. [sent-219, score-0.235]
97 First, we have presented an efficient algorithm for discriminative learning of label sequences that combines boosting with dynamic programming. [sent-220, score-0.895]
98 Our future work will investigate the performance (in both accuracy and computational expenses) of the different loss functions in different conditions (e. [sent-223, score-0.207]
99 Ranking algorithms for named- entity extraction: Boosting and the voted perceptron. [sent-234, score-0.143]
100 Conditional random fields: Probabilistic models for segmenting and labeling sequence data. [sent-260, score-0.131]
wordName wordTfidf (topN-words)
[('fo', 0.328), ('boosting', 0.315), ('label', 0.235), ('loss', 0.207), ('yt', 0.2), ('uik', 0.191), ('sequences', 0.188), ('yix', 0.164), ('po', 0.163), ('pe', 0.145), ('entity', 0.143), ('zk', 0.13), ('tk', 0.114), ('xi', 0.109), ('crfs', 0.109), ('fk', 0.109), ('exp', 0.108), ('gradient', 0.099), ('conditional', 0.099), ('exponential', 0.097), ('hamming', 0.096), ('ixi', 0.095), ('features', 0.09), ('sequence', 0.088), ('conjugate', 0.088), ('tagging', 0.087), ('yi', 0.086), ('fe', 0.084), ('named', 0.083), ('bixi', 0.082), ('kax', 0.082), ('rik', 0.082), ('tyi', 0.082), ('hmms', 0.08), ('sk', 0.08), ('ranking', 0.078), ('discriminative', 0.075), ('word', 0.073), ('sufficient', 0.073), ('ui', 0.066), ('tag', 0.065), ('fields', 0.064), ('ix', 0.062), ('rk', 0.061), ('language', 0.06), ('programming', 0.059), ('learner', 0.058), ('ly', 0.057), ('tagged', 0.057), ('sl', 0.057), ('feature', 0.056), ('competitiveness', 0.055), ('pos', 0.055), ('yiixi', 0.055), ('hmm', 0.052), ('dependencies', 0.052), ('bound', 0.052), ('classification', 0.051), ('observation', 0.051), ('logarithmic', 0.05), ('defining', 0.05), ('weak', 0.049), ('define', 0.049), ('ax', 0.048), ('neighboring', 0.048), ('ld', 0.048), ('penn', 0.048), ('spanish', 0.048), ('treebank', 0.048), ('el', 0.047), ('upper', 0.046), ('extraction', 0.046), ('formulation', 0.045), ('dynamic', 0.044), ('benefits', 0.043), ('kin', 0.043), ('availability', 0.043), ('mle', 0.043), ('segmenting', 0.043), ('expectations', 0.041), ('bk', 0.04), ('corpus', 0.04), ('le', 0.038), ('ner', 0.038), ('phrase', 0.038), ('linguistics', 0.038), ('efficiency', 0.038), ('mm', 0.038), ('descent', 0.038), ('secondly', 0.038), ('efficient', 0.038), ('additive', 0.036), ('boost', 0.036), ('lj', 0.036), ('questions', 0.036), ('minimizing', 0.035), ('learners', 0.035), ('discriminant', 0.034), ('round', 0.033), ('names', 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 69 nips-2002-Discriminative Learning for Label Sequences via Boosting
Author: Yasemin Altun, Thomas Hofmann, Mark Johnson
Abstract: This paper investigates a boosting approach to discriminative learning of label sequences based on a sequence rank loss function. The proposed method combines many of the advantages of boosting schemes with the efficiency of dynamic programming methods and is attractive both, conceptually and computationally. In addition, we also discuss alternative approaches based on the Hamming loss for label sequences. The sequence boosting algorithm offers an interesting alternative to methods based on HMMs and the more recently proposed Conditional Random Fields. Applications areas for the presented technique range from natural language processing and information extraction to computational biology. We include experiments on named entity recognition and part-of-speech tagging which demonstrate the validity and competitiveness of our approach. 1
2 0.30670398 46 nips-2002-Boosting Density Estimation
Author: Saharon Rosset, Eran Segal
Abstract: Several authors have suggested viewing boosting as a gradient descent search for a good fit in function space. We apply gradient-based boosting methodology to the unsupervised learning problem of density estimation. We show convergence properties of the algorithm and prove that a strength of weak learnability property applies to this problem as well. We illustrate the potential of this approach through experiments with boosting Bayesian networks to learn density models.
3 0.22451919 181 nips-2002-Self Supervised Boosting
Author: Max Welling, Richard S. Zemel, Geoffrey E. Hinton
Abstract: Boosting algorithms and successful applications thereof abound for classification and regression learning problems, but not for unsupervised learning. We propose a sequential approach to adding features to a random field model by training them to improve classification performance between the data and an equal-sized sample of “negative examples” generated from the model’s current estimate of the data density. Training in each boosting round proceeds in three stages: first we sample negative examples from the model’s current Boltzmann distribution. Next, a feature is trained to improve classification performance between data and negative examples. Finally, a coefficient is learned which determines the importance of this feature relative to ones already in the pool. Negative examples only need to be generated once to learn each new feature. The validity of the approach is demonstrated on binary digits and continuous synthetic data.
4 0.18724039 135 nips-2002-Learning with Multiple Labels
Author: Rong Jin, Zoubin Ghahramani
Abstract: In this paper, we study a special kind of learning problem in which each training instance is given a set of (or distribution over) candidate class labels and only one of the candidate labels is the correct one. Such a problem can occur, e.g., in an information retrieval setting where a set of words is associated with an image, or if classes labels are organized hierarchically. We propose a novel discriminative approach for handling the ambiguity of class labels in the training examples. The experiments with the proposed approach over five different UCI datasets show that our approach is able to find the correct label among the set of candidate labels and actually achieve performance close to the case when each training instance is given a single correct label. In contrast, naIve methods degrade rapidly as more ambiguity is introduced into the labels. 1
5 0.15753573 120 nips-2002-Kernel Design Using Boosting
Author: Koby Crammer, Joseph Keshet, Yoram Singer
Abstract: The focus of the paper is the problem of learning kernel operators from empirical data. We cast the kernel design problem as the construction of an accurate kernel from simple (and less accurate) base kernels. We use the boosting paradigm to perform the kernel construction process. To do so, we modify the booster so as to accommodate kernel operators. We also devise an efficient weak-learner for simple kernels that is based on generalized eigen vector decomposition. We demonstrate the effectiveness of our approach on synthetic data and on the USPS dataset. On the USPS dataset, the performance of the Perceptron algorithm with learned kernels is systematically better than a fixed RBF kernel. 1 Introduction and problem Setting The last decade brought voluminous amount of work on the design, analysis and experimentation of kernel machines. Algorithm based on kernels can be used for various machine learning tasks such as classification, regression, ranking, and principle component analysis. The most prominent learning algorithm that employs kernels is the Support Vector Machines (SVM) [1, 2] designed for classification and regression. A key component in a kernel machine is a kernel operator which computes for any pair of instances their inner-product in some abstract vector space. Intuitively and informally, a kernel operator is a means for measuring similarity between instances. Almost all of the work that employed kernel operators concentrated on various machine learning problems that involved a predefined kernel. A typical approach when using kernels is to choose a kernel before learning starts. Examples to popular predefined kernels are the Radial Basis Functions and the polynomial kernels (see for instance [1]). Despite the simplicity required in modifying a learning algorithm to a “kernelized” version, the success of such algorithms is not well understood yet. More recently, special efforts have been devoted to crafting kernels for specific tasks such as text categorization [3] and protein classification problems [4]. Our work attempts to give a computational alternative to predefined kernels by learning kernel operators from data. We start with a few definitions. Let X be an instance space. A kernel is an inner-product operator K : X × X → . An explicit way to describe K is via a mapping φ : X → H from X to an inner-products space H such that K(x, x ) = φ(x)·φ(x ). Given a kernel operator and a finite set of instances S = {xi , yi }m , the kernel i=1 matrix (a.k.a the Gram matrix) is the matrix of all possible inner-products of pairs from S, Ki,j = K(xi , xj ). We therefore refer to the general form of K as the kernel operator and to the application of the kernel operator to a set of pairs of instances as the kernel matrix. The specific setting of kernel design we consider assumes that we have access to a base kernel learner and we are given a target kernel K manifested as a kernel matrix on a set of examples. Upon calling the base kernel learner it returns a kernel operator denote Kj . The goal thereafter is to find a weighted combination of kernels ˆ K(x, x ) = j αj Kj (x, x ) that is similar, in a sense that will be defined shortly, to ˆ the target kernel, K ∼ K . Cristianini et al. [5] in their pioneering work on kernel target alignment employed as the notion of similarity the inner-product between the kernel matrices < K, K >F = m K(xi , xj )K (xi , xj ). Given this definition, they defined the i,j=1 kernel-similarity, or alignment, to be the above inner-product normalized by the norm of ˆ ˆ ˆ ˆ ˆ each kernel, A(S, K, K ) = < K, K >F / < K, K >F < K , K >F , where S is, as above, a finite sample of m instances. Put another way, the kernel alignment Cristianini et al. employed is the cosine of the angle between the kernel matrices where each matrix is “flattened” into a vector of dimension m2 . Therefore, this definition implies that the alignment is bounded above by 1 and can attain this value iff the two kernel matrices are identical. Given a (column) vector of m labels y where yi ∈ {−1, +1} is the label of the instance xi , Cristianini et al. used the outer-product of y as the the target kernel, ˆ K = yy T . Therefore, an optimal alignment is achieved if K(xi , xj ) = yi yj . Clearly, if such a kernel is used for classifying instances from X , then the kernel itself suffices to construct an excellent classifier f : X → {−1, +1} by setting, f (x) = sign(y i K(xi , x)) where (xi , yi ) is any instance-label pair. Cristianini et al. then devised a procedure that works with both labelled and unlabelled examples to find a Gram matrix which attains a good alignment with K on the labelled part of the matrix. While this approach can clearly construct powerful kernels, a few problems arise from the notion of kernel alignment they employed. For instance, a kernel operator such that the sign(K(x i , xj )) is equal to yi yj but its magnitude, |K(xi , xj )|, is not necessarily 1, might achieve a poor alignment score while it can constitute a classifier whose empirical loss is zero. Furthermore, the task of finding a good kernel when it is not always possible to find a kernel whose sign on each pair of instances is equal to the products of the labels (termed the soft-margin case in [5, 6]) becomes rather tricky. We thus propose a different approach which attempts to overcome some of the difficulties above. Like Cristianini et al. we assume that we are given a set of labelled instances S = {(xi , yi ) | xi ∈ X , yi ∈ {−1, +1}, i = 1, . . . , m} . We are also given a set of unlabelled m ˜ ˜ examples S = {˜i }i=1 . If such a set is not provided we can simply use the labelled inx ˜ ˜ stances (without the labels themselves) as the set S. The set S is used for constructing the ˆ primitive kernels that are combined to constitute the learned kernel K. The labelled set is used to form the target kernel matrix and its instances are used for evaluating the learned ˆ kernel K. This approach, known as transductive learning, was suggested in [5, 6] for kernel alignment tasks when the distribution of the instances in the test data is different from that of the training data. This setting becomes in particular handy in datasets where the test data was collected in a different scheme than the training data. We next discuss the notion of kernel goodness employed in this paper. This notion builds on the objective function that several variants of boosting algorithms maintain [7, 8]. We therefore first discuss in brief the form of boosting algorithms for kernels. 2 Using Boosting to Combine Kernels Numerous interpretations of AdaBoost and its variants cast the boosting process as a procedure that attempts to minimize, or make small, a continuous bound on the classification error (see for instance [9, 7] and the references therein). A recent work by Collins et al. [8] unifies the boosting process for two popular loss functions, the exponential-loss (denoted henceforth as ExpLoss) and logarithmic-loss (denoted as LogLoss) that bound the empir- ˜ ˜ Input: Labelled and unlabelled sets of examples: S = {(xi , yi )}m ; S = {˜i }m x i=1 i=1 Initialize: K ← 0 (all zeros matrix) For t = 1, 2, . . . , T : • Calculate distribution over pairs 1 ≤ i, j ≤ m: Dt (i, j) = exp(−yi yj K(xi , xj )) 1/(1 + exp(−yi yj K(xi , xj ))) ExpLoss LogLoss ˜ • Call base-kernel-learner with (Dt , S, S) and receive Kt • Calculate: + − St = {(i, j) | yi yj Kt (xi , xj ) > 0} ; St = {(i, j) | yi yj Kt (xi , xj ) < 0} + Wt = (i,j)∈S + Dt (i, j)|Kt (xi , xj )| ; Wt− = (i,j)∈S − Dt (i, j)|Kt (xi , xj )| t t 1 2 + Wt − Wt • Set: αt = ln ; K ← K + α t Kt . Return: kernel operator K : X × X → Figure 1: The skeleton of the boosting algorithm for kernels. ical classification error. Given the prediction of a classifier f on an instance x and a label y ∈ {−1, +1} the ExpLoss and the LogLoss are defined as, ExpLoss(f (x), y) = exp(−yf (x)) LogLoss(f (x), y) = log(1 + exp(−yf (x))) . Collins et al. described a single algorithm for the two losses above that can be used within the boosting framework to construct a strong-hypothesis which is a classifier f (x). This classifier is a weighted combination of (possibly very simple) base classifiers. (In the boosting framework, the base classifiers are referred to as weak-hypotheses.) The strongT hypothesis is of the form f (x) = t=1 αt ht (x). Collins et al. discussed a few ways to select the weak-hypotheses ht and to find a good of weights αt . Our starting point in this paper is the first sequential algorithm from [8] that enables the construction or creation of weak-hypotheses on-the-fly. We would like to note however that it is possible to use other variants of boosting to design kernels. In order to use boosting to design kernels we extend the algorithm to operate over pairs of instances. Building on the notion of alignment from [5, 6], we say that the inner-product of x1 and x2 is aligned with the labels y1 and y2 if sign(K(x1 , x2 )) = y1 y2 . Furthermore, we would like to make the magnitude of K(x, x ) to be as large as possible. We therefore use one of the following two alignment losses for a pair of examples (x 1 , y1 ) and (x2 , y2 ), ExpLoss(K(x1 , x2 ), y1 y2 ) = exp(−y1 y2 K(x1 , x2 )) LogLoss(K(x1 , x2 ), y1 y2 ) = log(1 + exp(−y1 y2 K(x1 , x2 ))) . Put another way, we view a pair of instances as a single example and cast the pairs of instances that attain the same label as positively labelled examples while pairs of opposite labels are cast as negatively labelled examples. Clearly, this approach can be applied to both losses. In the boosting process we therefore maintain a distribution over pairs of instances. The weight of each pair reflects how difficult it is to predict whether the labels of the two instances are the same or different. The core boosting algorithm follows similar lines to boosting algorithms for classification algorithm. The pseudo code of the booster is given in Fig. 1. The pseudo-code is an adaptation the to problem of kernel design of the sequentialupdate algorithm from [8]. As with other boosting algorithm, the base-learner, which in our case is charge of returning a good kernel with respect to the current distribution, is left unspecified. We therefore turn our attention to the algorithmic implementation of the base-learning algorithm for kernels. 3 Learning Base Kernels The base kernel learner is provided with a training set S and a distribution D t over a pairs ˜ of instances from the training set. It is also provided with a set of unlabelled examples S. Without any knowledge of the topology of the space of instances a learning algorithm is likely to fail. Therefore, we assume the existence of an initial inner-product over the input space. We assume for now that this initial inner-product is the standard scalar products over vectors in n . We later discuss a way to relax the assumption on the form of the inner-product. Equipped with an inner-product, we define the family of base kernels to be the possible outer-products Kw = wwT between a vector w ∈ n and itself. Using this definition we get, Kw (xi , xj ) = (xi ·w)(xj ·w) . Input: A distribution Dt . Labelled and unlabelled sets: ˜ ˜ Therefore, the similarity beS = {(xi , yi )}m ; S = {˜i }m . x i=1 i=1 tween two instances xi and Compute : xj is high iff both xi and xj • Calculate: ˜ are similar (w.r.t the standard A ∈ m×m , Ai,r = xi · xr ˜ inner-product) to a third vecm×m B∈ , Bi,j = Dt (i, j)yi yj tor w. Analogously, if both ˜ ˜ K ∈ m×m , Kr,s = xr · xs ˜ ˜ xi and xj seem to be dissim• Find the generalized eigenvector v ∈ m for ilar to the vector w then they the problem AT BAv = λKv which attains are similar to each other. Dethe largest eigenvalue λ spite the restrictive form of • Set: w = ( r vr xr )/ ˜ ˜ r vr xr . the inner-products, this famt ily is still too rich for our setReturn: Kernel operator Kw = ww . ting and we further impose two restrictions on the inner Figure 2: The base kernel learning algorithm. products. First, we assume ˜ that w is restricted to a linear combination of vectors from S. Second, since scaling of the base kernels is performed by the boosted, we constrain the norm of w to be 1. The m ˜ resulting class of kernels is therefore, C = {Kw = wwT | w = r=1 βr xr , w = 1} . ˜ In the boosting process we need to choose a specific base-kernel K w from C. We therefore need to devise a notion of how good a candidate for base kernel is given a labelled set S and a distribution function Dt . In this work we use the simplest version suggested by Collins et al. This version can been viewed as a linear approximation on the loss function. We define the score of a kernel Kw w.r.t to the current distribution Dt to be, Score(Kw ) = Dt (i, j)yi yj Kw (xi , xj ) . (1) i,j The higher the value of the score is, the better Kw fits the training data. Note that if Dt (i, j) = 1/m2 (as is D0 ) then Score(Kw ) is proportional to the alignment since w = 1. Under mild assumptions the score can also provide a lower bound of the loss function. To see that let c be the derivative of the loss function at margin zero, c = Loss (0) . If all the √ training examples xi ∈ S lies in a ball of radius c, we get that Loss(Kw (xi , xj ), yi yj ) ≥ 1 − cKw (xi , xj )yi yj ≥ 0, and therefore, i,j Dt (i, j)Loss(Kw (xi , xj ), yi yj ) ≥ 1 − c Dt (i, j)Kw (xi , xj )yi yj . i,j Using the explicit form of Kw in the Score function (Eq. (1)) we get, Score(Kw ) = i,j D(i, j)yi yj (w·xi )(w·xj ) . Further developing the above equation using the constraint that w = m ˜ r=1 βr xr we get, ˜ Score(Kw ) = βs βr r,s i,j D(i, j)yi yj (xi · xr ) (xj · xs ) . ˜ ˜ To compute efficiently the base kernel score without an explicit enumeration we exploit the fact that if the initial distribution D0 is symmetric (D0 (i, j) = D0 (j, i)) then all the distributions generated along the run of the boosting process, D t , are also symmetric. We ˜ now define a matrix A ∈ m×m where Ai,r = xi · xr and a symmetric matrix B ∈ m×m ˜ with Bi,j = Dt (i, j)yi yj . Simple algebraic manipulations yield that the score function can be written as the following quadratic form, Score(β) = β T (AT BA)β , where β is m dimensional column vector. Note that since B is symmetric so is A T BA. Finding a ˜ good base kernel is equivalent to finding a vector β which maximizes this quadratic form 2 m ˜ under the norm equality constraint w = ˜ 2 = β T Kβ = 1 where Kr,s = r=1 βr xr xr · xs . Finding the maximum of Score(β) subject to the norm constraint is a well known ˜ ˜ maximization problem known as the generalized eigen vector problem (cf. [10]). Applying simple algebraic manipulations it is easy to show that the matrix AT BA is positive semidefinite. Assuming that the matrix K is invertible, the the vector β which maximizes the quadratic form is proportional the eigenvector of K −1 AT BA which is associated with the m ˜ generalized largest eigenvalue. Denoting this vector by v we get that w ∝ ˜ r=1 vr xr . m ˜ m ˜ Adding the norm constraint we get that w = ( r=1 vr xr )/ ˜ vr xr . The skeleton ˜ r=1 of the algorithm for finding a base kernels is given in Fig. 3. To conclude the description of the kernel learning algorithm we describe how to the extend the algorithm to be employed with general kernel functions. Kernelizing the Kernel: As described above, we assumed that the standard scalarproduct constitutes the template for the class of base-kernels C. However, since the proce˜ dure for choosing a base kernel depends on S and S only through the inner-products matrix A, we can replace the scalar-product itself with a general kernel operator κ : X × X → , where κ(xi , xj ) = φ(xi ) · φ(xj ). Using a general kernel function κ we can not compute however the vector w explicitly. We therefore need to show that the norm of w, and evaluation Kw on any two examples can still be performed efficiently. First note that given the vector v we can compute the norm of w as follows, T w 2 = vr xr ˜ vs xr ˜ r s = vr vs κ(˜r , xs ) . x ˜ r,s Next, given two vectors xi and xj the value of their inner-product is, Kw (xi , xj ) = vr vs κ(xi , xr )κ(xj , xs ) . ˜ ˜ r,s Therefore, although we cannot compute the vector w explicitly we can still compute its norm and evaluate any of the kernels from the class C. 4 Experiments Synthetic data: We generated binary-labelled data using as input space the vectors in 100 . The labels, in {−1, +1}, were picked uniformly at random. Let y designate the label of a particular example. Then, the first two components of each instance were drawn from a two-dimensional normal distribution, N (µ, ∆ ∆−1 ) with the following parameters, µ=y 0.03 0.03 1 ∆= √ 2 1 −1 1 1 = 0.1 0 0 0.01 . That is, the label of each examples determined the mean of the distribution from which the first two components were generated. The rest of the components in the vector (98 8 0.2 6 50 50 100 100 150 150 200 200 4 2 0 0 −2 −4 −6 250 250 −0.2 −8 −0.2 0 0.2 −8 −6 −4 −2 0 2 4 6 8 300 20 40 60 80 100 120 140 160 180 200 300 20 40 60 80 100 120 140 160 180 Figure 3: Results on a toy data set prior to learning a kernel (first and third from left) and after learning (second and fourth). For each of the two settings we show the first two components of the training data (left) and the matrix of inner products between the train and the test data (right). altogether) were generated independently using the normal distribution with a zero mean and a standard deviation of 0.05. We generated 100 training and test sets of size 300 and 200 respectively. We used the standard dot-product as the initial kernel operator. On each experiment we first learned a linear classier that separates the classes using the Perceptron [11] algorithm. We ran the algorithm for 10 epochs on the training set. After each epoch we evaluated the performance of the current classifier on the test set. We then used the boosting algorithm for kernels with the LogLoss for 30 rounds to build a kernel for each random training set. After learning the kernel we re-trained a classifier with the Perceptron algorithm and recorded the results. A summary of the online performance is given in Fig. 4. The plot on the left-hand-side of the figure shows the instantaneous error (achieved during the run of the algorithm). Clearly, the Perceptron algorithm with the learned kernel converges much faster than the original kernel. The middle plot shows the test error after each epoch. The plot on the right shows the test error on a noisy test set in which we added a Gaussian noise of zero mean and a standard deviation of 0.03 to the first two features. In all plots, each bar indicates a 95% confidence level. It is clear from the figure that the original kernel is much slower to converge than the learned kernel. Furthermore, though the kernel learning algorithm was not expoed to the test set noise, the learned kernel reflects better the structure of the feature space which makes the learned kernel more robust to noise. Fig. 3 further illustrates the benefits of using a boutique kernel. The first and third plots from the left correspond to results obtained using the original kernel and the second and fourth plots show results using the learned kernel. The left plots show the empirical distribution of the two informative components on the test data. For the learned kernel we took each input vector and projected it onto the two eigenvectors of the learned kernel operator matrix that correspond to the two largest eigenvalues. Note that the distribution after the projection is bimodal and well separated along the first eigen direction (x-axis) and shows rather little deviation along the second eigen direction (y-axis). This indicates that the kernel learning algorithm indeed found the most informative projection for separating the labelled data with large margin. It is worth noting that, in this particular setting, any algorithm which chooses a single feature at a time is prone to failure since both the first and second features are mandatory for correctly classifying the data. The two plots on the right hand side of Fig. 3 use a gray level color-map to designate the value of the inner-product between each pairs instances, one from training set (y-axis) and the other from the test set. The examples were ordered such that the first group consists of the positively labelled instances while the second group consists of the negatively labelled instances. Since most of the features are non-relevant the original inner-products are noisy and do not exhibit any structure. In contrast, the inner-products using the learned kernel yields in a 2 × 2 block matrix indicating that the inner-products between instances sharing the same label obtain large positive values. Similarly, for instances of opposite 200 1 12 Regular Kernel Learned Kernel 0.8 17 0.7 16 0.5 0.4 0.3 Test Error % 8 0.6 Regular Kernel Learned Kernel 18 10 Test Error % Averaged Cumulative Error % 19 Regular Kernel Learned Kernel 0.9 6 4 15 14 13 12 0.2 11 2 0.1 10 0 0 10 1 10 2 10 Round 3 10 4 10 0 2 4 6 Epochs 8 10 9 2 4 6 Epochs 8 10 Figure 4: The online training error (left), test error (middle) on clean synthetic data using a standard kernel and a learned kernel. Right: the online test error for the two kernels on a noisy test set. labels the inner products are large and negative. The form of the inner-products matrix of the learned kernel indicates that the learning problem itself becomes much easier. Indeed, the Perceptron algorithm with the standard kernel required around 94 training examples on the average before converging to a hyperplane which perfectly separates the training data while using the Perceptron algorithm with learned kernel required a single example to reach a perfect separation on all 100 random training sets. USPS dataset: The USPS (US Postal Service) dataset is known as a challenging classification problem in which the training set and the test set were collected in a different manner. The USPS contains 7, 291 training examples and 2, 007 test examples. Each example is represented as a 16 × 16 matrix where each entry in the matrix is a pixel that can take values in {0, . . . , 255}. Each example is associated with a label in {0, . . . , 9} which is the digit content of the image. Since the kernel learning algorithm is designed for binary problems, we broke the 10-class problem into 45 binary problems by comparing all pairs of classes. The interesting question of how to learn kernels for multiclass problems is beyond the scopre of this short paper. We thus constraint on the binary error results for the 45 binary problem described above. For the original kernel we chose a RBF kernel with σ = 1 which is the value employed in the experiments reported in [12]. We used the kernelized version of the kernel design algorithm to learn a different kernel operator for each of the binary problems. We then used a variant of the Perceptron [11] and with the original RBF kernel and with the learned kernels. One of the motivations for using the Perceptron is its simplicity which can underscore differences in the kernels. We ran the kernel learning al˜ gorithm with LogLoss and ExpLoss, using bith the training set and the test test as S. Thus, we obtained four different sets of kernels where each set consists of 45 kernels. By examining the training loss, we set the number of rounds of boosting to be 30 for the LogLoss and 50 for the ExpLoss, when using the trainin set. When using the test set, the number of rounds of boosting was set to 100 for both losses. Since the algorithm exhibits slower rate of convergence with the test data, we choose a a higher value without attempting to optimize the actual value. The left plot of Fig. 5 is a scatter plot comparing the test error of each of the binary classifiers when trained with the original RBF a kernel versus the performance achieved on the same binary problem with a learned kernel. The kernels were built ˜ using boosting with the LogLoss and S was the training data. In almost all of the 45 binary classification problems, the learned kernels yielded lower error rates when combined with the Perceptron algorithm. The right plot of Fig. 5 compares two learned kernels: the first ˜ was build using the training instances as the templates constituing S while the second used the test instances. Although the differenece between the two versions is not as significant as the difference on the left plot, we still achieve an overall improvement in about 25% of the binary problems by using the test instances. 6 4.5 4 5 Learned Kernel (Test) Learned Kernel (Train) 3.5 4 3 2 3 2.5 2 1.5 1 1 0.5 0 0 1 2 3 Base Kernel 4 5 6 0 0 1 2 3 Learned Kernel (Train) 4 5 Figure 5: Left: a scatter plot comparing the error rate of 45 binary classifiers trained using an RBF kernel (x-axis) and a learned kernel with training instances. Right: a similar scatter plot for a learned kernel only constructed from training instances (x-axis) and test instances. 5 Discussion In this paper we showed how to use the boosting framework to design kernels. Our approach is especially appealing in transductive learning tasks where the test data distribution is different than the the distribution of the training data. For example, in speech recognition tasks the training data is often clean and well recorded while the test data often passes through a noisy channel that distorts the signal. An interesting and challanging question that stem from this research is how to extend the framework to accommodate more complex decision tasks such as multiclass and regression problems. Finally, we would like to note alternative approaches to the kernel design problem has been devised in parallel and independently. See [13, 14] for further details. Acknowledgements: Special thanks to Cyril Goutte and to John Show-Taylor for pointing the connection to the generalized eigen vector problem. Thanks also to the anonymous reviewers for constructive comments. References [1] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998. [2] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000. [3] Huma Lodhi, John Shawe-Taylor, Nello Cristianini, and Christopher J. C. H. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444, 2002. [4] C. Leslie, E. Eskin, and W. Stafford Noble. The spectrum kernel: A string kernel for svm protein classification. In Proceedings of the Pacific Symposium on Biocomputing, 2002. [5] Nello Cristianini, Andre Elisseeff, John Shawe-Taylor, and Jaz Kandla. On kernel target alignment. In Advances in Neural Information Processing Systems 14, 2001. [6] G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. Jordan. Learning the kernel matrix with semi-definite programming. In Proc. of the 19th Intl. Conf. on Machine Learning, 2002. [7] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337–374, April 2000. [8] Michael Collins, Robert E. Schapire, and Yoram Singer. Logistic regression, adaboost and bregman distances. Machine Learning, 47(2/3):253–285, 2002. [9] Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean. Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers. MIT Press, 1999. [10] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1985. [11] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958. [12] B. Sch¨ lkopf, S. Mika, C.J.C. Burges, P. Knirsch, K. M¨ ller, G. R¨ tsch, and A.J. Smola. Input o u a space vs. feature space in kernel-based methods. IEEE Trans. on NN, 10(5):1000–1017, 1999. [13] O. Bosquet and D.J.L. Herrmann. On the complexity of learning the kernel matrix. NIPS, 2002. [14] C.S. Ong, A.J. Smola, and R.C. Williamson. Superkenels. NIPS, 2002.
6 0.12579711 58 nips-2002-Conditional Models on the Ranking Poset
7 0.12528697 149 nips-2002-Multiclass Learning by Probabilistic Embeddings
8 0.11284489 140 nips-2002-Margin Analysis of the LVQ Algorithm
9 0.10285936 96 nips-2002-Generalized² Linear² Models
10 0.099851221 192 nips-2002-Support Vector Machines for Multiple-Instance Learning
11 0.095783032 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition
12 0.090930529 25 nips-2002-An Asynchronous Hidden Markov Model for Audio-Visual Speech Recognition
13 0.089847088 92 nips-2002-FloatBoost Learning for Classification
14 0.088938788 98 nips-2002-Going Metric: Denoising Pairwise Data
15 0.085191056 73 nips-2002-Dynamic Bayesian Networks with Deterministic Latent Tables
16 0.082951725 145 nips-2002-Mismatch String Kernels for SVM Protein Classification
17 0.08216957 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
18 0.080310903 119 nips-2002-Kernel Dependency Estimation
19 0.076262362 90 nips-2002-Feature Selection in Mixture-Based Clustering
20 0.074725166 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
topicId topicWeight
[(0, -0.241), (1, -0.129), (2, 0.03), (3, -0.015), (4, 0.13), (5, 0.046), (6, -0.04), (7, 0.059), (8, -0.002), (9, -0.216), (10, -0.041), (11, -0.079), (12, -0.124), (13, 0.143), (14, -0.369), (15, 0.262), (16, -0.041), (17, -0.03), (18, 0.086), (19, 0.084), (20, 0.065), (21, 0.048), (22, -0.054), (23, -0.013), (24, 0.073), (25, -0.039), (26, -0.016), (27, -0.07), (28, -0.008), (29, -0.025), (30, -0.001), (31, 0.012), (32, 0.027), (33, -0.03), (34, -0.06), (35, -0.021), (36, 0.003), (37, -0.104), (38, -0.086), (39, 0.073), (40, 0.044), (41, -0.052), (42, 0.067), (43, 0.038), (44, 0.055), (45, -0.049), (46, 0.044), (47, 0.077), (48, -0.004), (49, 0.001)]
simIndex simValue paperId paperTitle
same-paper 1 0.96494794 69 nips-2002-Discriminative Learning for Label Sequences via Boosting
Author: Yasemin Altun, Thomas Hofmann, Mark Johnson
Abstract: This paper investigates a boosting approach to discriminative learning of label sequences based on a sequence rank loss function. The proposed method combines many of the advantages of boosting schemes with the efficiency of dynamic programming methods and is attractive both, conceptually and computationally. In addition, we also discuss alternative approaches based on the Hamming loss for label sequences. The sequence boosting algorithm offers an interesting alternative to methods based on HMMs and the more recently proposed Conditional Random Fields. Applications areas for the presented technique range from natural language processing and information extraction to computational biology. We include experiments on named entity recognition and part-of-speech tagging which demonstrate the validity and competitiveness of our approach. 1
2 0.75603694 46 nips-2002-Boosting Density Estimation
Author: Saharon Rosset, Eran Segal
Abstract: Several authors have suggested viewing boosting as a gradient descent search for a good fit in function space. We apply gradient-based boosting methodology to the unsupervised learning problem of density estimation. We show convergence properties of the algorithm and prove that a strength of weak learnability property applies to this problem as well. We illustrate the potential of this approach through experiments with boosting Bayesian networks to learn density models.
3 0.63718146 181 nips-2002-Self Supervised Boosting
Author: Max Welling, Richard S. Zemel, Geoffrey E. Hinton
Abstract: Boosting algorithms and successful applications thereof abound for classification and regression learning problems, but not for unsupervised learning. We propose a sequential approach to adding features to a random field model by training them to improve classification performance between the data and an equal-sized sample of “negative examples” generated from the model’s current estimate of the data density. Training in each boosting round proceeds in three stages: first we sample negative examples from the model’s current Boltzmann distribution. Next, a feature is trained to improve classification performance between data and negative examples. Finally, a coefficient is learned which determines the importance of this feature relative to ones already in the pool. Negative examples only need to be generated once to learn each new feature. The validity of the approach is demonstrated on binary digits and continuous synthetic data.
4 0.55183613 135 nips-2002-Learning with Multiple Labels
Author: Rong Jin, Zoubin Ghahramani
Abstract: In this paper, we study a special kind of learning problem in which each training instance is given a set of (or distribution over) candidate class labels and only one of the candidate labels is the correct one. Such a problem can occur, e.g., in an information retrieval setting where a set of words is associated with an image, or if classes labels are organized hierarchically. We propose a novel discriminative approach for handling the ambiguity of class labels in the training examples. The experiments with the proposed approach over five different UCI datasets show that our approach is able to find the correct label among the set of candidate labels and actually achieve performance close to the case when each training instance is given a single correct label. In contrast, naIve methods degrade rapidly as more ambiguity is introduced into the labels. 1
5 0.44397232 192 nips-2002-Support Vector Machines for Multiple-Instance Learning
Author: Stuart Andrews, Ioannis Tsochantaridis, Thomas Hofmann
Abstract: This paper presents two new formulations of multiple-instance learning as a maximum margin problem. The proposed extensions of the Support Vector Machine (SVM) learning approach lead to mixed integer quadratic programs that can be solved heuristically. Our generalization of SVMs makes a state-of-the-art classification technique, including non-linear classification via kernels, available to an area that up to now has been largely dominated by special purpose methods. We present experimental results on a pharmaceutical data set and on applications in automated image indexing and document categorization. 1
6 0.41238788 58 nips-2002-Conditional Models on the Ranking Poset
7 0.40242791 96 nips-2002-Generalized² Linear² Models
8 0.40047851 149 nips-2002-Multiclass Learning by Probabilistic Embeddings
9 0.3977716 7 nips-2002-A Hierarchical Bayesian Markovian Model for Motifs in Biopolymer Sequences
10 0.39749664 120 nips-2002-Kernel Design Using Boosting
11 0.3930082 92 nips-2002-FloatBoost Learning for Classification
12 0.36904588 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition
13 0.36778474 25 nips-2002-An Asynchronous Hidden Markov Model for Audio-Visual Speech Recognition
14 0.35182491 140 nips-2002-Margin Analysis of the LVQ Algorithm
15 0.34107316 194 nips-2002-The Decision List Machine
16 0.3115924 119 nips-2002-Kernel Dependency Estimation
17 0.29775986 90 nips-2002-Feature Selection in Mixture-Based Clustering
18 0.28796202 6 nips-2002-A Formulation for Minimax Probability Machine Regression
19 0.28538853 22 nips-2002-Adaptive Nonlinear System Identification with Echo State Networks
20 0.28332987 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization
topicId topicWeight
[(3, 0.019), (23, 0.012), (42, 0.109), (54, 0.112), (55, 0.042), (57, 0.015), (67, 0.018), (68, 0.033), (74, 0.106), (79, 0.289), (92, 0.038), (98, 0.108)]
simIndex simValue paperId paperTitle
1 0.94581318 146 nips-2002-Modeling Midazolam's Effect on the Hippocampus and Recognition Memory
Author: Kenneth J. Malmberg, René Zeelenberg, Richard M. Shiffrin
Abstract: The benz.odiaze:pine '~1idazolam causes dense,but temporary ~ anterograde amnesia, similar to that produced by- hippocampal damage~Does the action of M'idazola:m on the hippocanlpus cause less storage, or less accurate storage, .of information in episodic. long-term menlory?- \rVe used a sinlple variant of theREJv1. JD.odel [18] to fit data collected. by IIirsbnla.n~Fisher, .IIenthorn,Arndt} and Passa.nnante [9] on the effects of Midazola.m, study time~ and normative \vQrd.. frequenc:y on both yes-no and remember-k.novv recognition m.emory. That a: simple strength. 'model fit well \\tas cont.rary to the expectations of 'flirshman et aLMore important,within the Bayesian based R.EM modeling frame\vork, the data were consistentw'ith the view that Midazolam causes less accurate storage~ rather than less storage, of infornlation in episodic mcm.ory..
2 0.86887753 164 nips-2002-Prediction of Protein Topologies Using Generalized IOHMMs and RNNs
Author: Gianluca Pollastri, Pierre Baldi, Alessandro Vullo, Paolo Frasconi
Abstract: We develop and test new machine learning methods for the prediction of topological representations of protein structures in the form of coarse- or fine-grained contact or distance maps that are translation and rotation invariant. The methods are based on generalized input-output hidden Markov models (GIOHMMs) and generalized recursive neural networks (GRNNs). The methods are used to predict topology directly in the fine-grained case and, in the coarsegrained case, indirectly by first learning how to score candidate graphs and then using the scoring function to search the space of possible configurations. Computer simulations show that the predictors achieve state-of-the-art performance. 1 Introduction: Protein Topology Prediction Predicting the 3D structure of protein chains from the linear sequence of amino acids is a fundamental open problem in computational molecular biology [1]. Any approach to the problem must deal with the basic fact that protein structures are translation and rotation invariant. To address this invariance, we have proposed a machine learning approach to protein structure prediction [4] based on the prediction of topological representations of proteins, in the form of contact or distance maps. The contact or distance map is a 2D representation of neighborhood relationships consisting of an adjacency matrix at some distance cutoff (typically in the range of 6 to 12 ˚), or a matrix of pairwise Euclidean distances. Fine-grained maps A are derived at the amino acid or even atomic level. Coarse maps are obtained by looking at secondary structure elements, such as helices, and the distance between their centers of gravity or, as in the simulations below, the minimal distances between their Cα atoms. Reasonable methods for reconstructing 3D coordinates from contact/distance maps have been developed in the NMR literature and elsewhere Oi B Hi F Hi Ii Figure 1: Bayesian network for bidirectional IOHMMs consisting of input units, output units, and both forward and backward Markov chains of hidden states. [14] using distance geometry and stochastic optimization techniques. Thus the main focus here is on the more difficult task of contact map prediction. Various algorithms for the prediction of contact maps have been developed, in particular using feedforward neural networks [6]. The best contact map predictor in the literature and at the last CASP prediction experiment reports an average precision [True Positives/(True Positives + False Positives)] of 21% for distant contacts, i.e. with a linear distance of 8 amino acid or more [6] for fine-grained amino acid maps. While this result is encouraging and well above chance level by a factor greater than 6, it is still far from providing sufficient accuracy for reliable 3D structure prediction. A key issue in this area is the amount of noise that can be tolerated in a contact map prediction without compromising the 3D-reconstruction step. While systematic tests in this area have not yet been published, preliminary results appear to indicate that recovery of as little as half of the distant contacts may suffice for proper reconstruction, at least for proteins up to 150 amino acid long (Rita Casadio and Piero Fariselli, private communication and oral presentation during CASP4 [10]). It is important to realize that the input to a fine-grained contact map predictor need not be confined to the sequence of amino acids only, but may also include evolutionary information in the form of profiles derived by multiple alignment of homologue proteins, or structural feature information, such as secondary structure (alpha helices, beta strands, and coils), or solvent accessibility (surface/buried), derived by specialized predictors [12, 13]. In our approach, we use different GIOHMM and GRNN strategies to predict both structural features and contact maps. 2 GIOHMM Architectures Loosely speaking, GIOHMMs are Bayesian networks with input, hidden, and output units that can be used to process complex data structures such as sequences, images, trees, chemical compounds and so forth, built on work in, for instance, [5, 3, 7, 2, 11]. In general, the connectivity of the graphs associated with the hidden units matches the structure of the data being processed. Often multiple copies of the same hidden graph, but with different edge orientations, are used in the hidden layers to allow direct propagation of information in all relevant directions. Output Plane NE NW 4 Hidden Planes SW SE Input Plane Figure 2: 2D GIOHMM Bayesian network for processing two-dimensional objects such as contact maps, with nodes regularly arranged in one input plane, one output plane, and four hidden planes. In each hidden plane, nodes are arranged on a square lattice, and all edges are oriented towards the corresponding cardinal corner. Additional directed edges run vertically in column from the input plane to each hidden plane, and from each hidden plane to the output plane. To illustrate the general idea, a first example of GIOHMM is provided by the bidirectional IOHMMs (Figure 1) introduced in [2] to process sequences and predict protein structural features, such as secondary structure. Unlike standard HMMs or IOHMMS used, for instance in speech recognition, this architecture is based on two hidden markov chains running in opposite directions to leverage the fact that biological sequences are spatial objects rather than temporal sequences. Bidirectional IOHMMs have been used to derive a suite of structural feature predictors [12, 13, 4] available through http://promoter.ics.uci.edu/BRNN-PRED/. These predictors have accuracy rates in the 75-80% range on a per amino acid basis. 2.1 Direct Prediction of Topology To predict contact maps, we use a 2D generalization of the previous 1D Bayesian network. The basic version of this architecture (Figures 2) contains 6 layers of units: input, output, and four hidden layers, one for each cardinal corner. Within each column indexed by i and j, connections run from the input to the four hidden units, and from the four hidden units to the output unit. In addition, the hidden units in each hidden layer are arranged on a square or triangular lattice, with all the edges oriented towards the corresponding cardinal corner. Thus the parameters of this two-dimensional GIOHMMs, in the square lattice case, are the conditional probability distributions: NE NW SW SE P (Oi |Ii,j , Hi,j , Hi,j , Hi,j , Hi,j, ) NE NE NE P (Hi,j |Ii,j , Hi−1,j , Hi,j−1 ) N NW NW P (Hi,jW |Ii,j , Hi+1,j , Hi,j−1 ) SW SW SW P (Hi,j |Ii,j , Hi+1,j , Hi,j+1 ) SE SE SE P (Hi,j |Ii,j , Hi−1,j , Hi,j+1 ) (1) In a contact map prediction at the amino acid level, for instance, the (i, j) output represents the probability of whether amino acids i and j are in contact or not. This prediction depends directly on the (i, j) input and the four-hidden units in the same column, associated with omni-directional contextual propagation in the hidden planes. In the simulations reported below, we use a more elaborated input consisting of a 20 × 20 probability matrix over amino acid pairs derived from a multiple alignment of the given protein sequence and its homologues, as well as the structural features of the corresponding amino acids, including their secondary structure classification and their relative exposure to the solvent, derived from our corresponding predictors. It should be clear how GIOHMM ideas can be generalized to other data structures and problems in many ways. In the case of 3D data, for instance, a standard GIOHMM would have an input cube, an output cube, and up to 8 cubes of hidden units, one for each corner with connections inside each hidden cube oriented towards the corresponding corner. In the case of data with an underlying tree structure, the hidden layers would correspond to copies of the same tree with different orientations and so forth. Thus a fundamental advantage of GIOHMMs is that they can process a wide range of data structures of variable sizes and dimensions. 2.2 Indirect Prediction of Topology Although GIOHMMs allow flexible integration of contextual information over ranges that often exceed what can be achieved, for instance, with fixed-input neural networks, the models described above still suffer from the fact that the connections remain local and therefore long-ranged propagation of information during learning remains difficult. Introduction of large numbers of long-ranged connections is computationally intractable but in principle not necessary since the number of contacts in proteins is known to grow linearly with the length of the protein, and hence connectivity is inherently sparse. The difficulty of course is that the location of the long-ranged contacts is not known. To address this problem, we have developed also a complementary GIOHMM approach described in Figure 3 where a candidate graph structure is proposed in the hidden layers of the GIOHMM, with the two different orientations naturally associated with a protein sequence. Thus the hidden graphs change with each protein. In principle the output ought to be a single unit (Figure 3b) which directly computes a global score for the candidate structure presented in the hidden layer. In order to cope with long-ranged dependencies, however, it is preferable to compute a set of local scores (Figure 3c), one for each vertex, and combine the local scores into a global score by averaging. More specifically, consider a true topology represented by the undirected contact graph G∗ = (V, E ∗ ), and a candidate undirected prediction graph G = (V, E). A global measure of how well E approximates E ∗ is provided by the informationretrieval F1 score defined by the normalized edge-overlap F1 = 2|E ∩ E ∗ |/(|E| + |E ∗ |) = 2P R/(P + R), where P = |E ∩ E ∗ |/|E| is the precision (or specificity) and R = |E ∩ E ∗ |/|E ∗ | is the recall (or sensitivity) measure. Obviously, 0 ≤ F1 ≤ 1 and F1 = 1 if and only if E = E ∗ . The scoring function F1 has the property of being monotone in the sense that if |E| = |E | then F1 (E) < F1 (E ) if and only if |E ∩ E ∗ | < |E ∩ E ∗ |. Furthermore, if E = E ∪ {e} where e is an edge in E ∗ but not in E, then F1 (E ) > F1 (E). Monotonicity is important to guide the search in the space of possible topologies. It is easy to check that a simple search algorithm based on F1 takes on the order of O(|V |3 ) steps to find E ∗ , basically by trying all possible edges one after the other. The problem then is to learn F1 , or rather a good approximation to F1 . To approximate F1 , we first consider a similar local measure Fv by considering the O I(v) I(v) F B H (v) H (v) (a) I(v) F B H (v) H (v) (b) O(v) (c) Figure 3: Indirect prediction of contact maps. (a) target contact graph to be predicted. (b) GIOHMM with two hidden layers: the two hidden layers correspond to two copies of the same candidate graph oriented in opposite directions from one end of the protein to the other end. The single output O is the global score of how well the candidate graph approximates the true contact map. (c) Similar to (b) but with a local score O(v) at each vertex. The local scores can be averaged to produce a global score. In (b) and (c) I(v) represents the input for vertex v, and H F (v) and H B (v) are the corresponding hidden variables. ∗ ∗ set Ev of edges adjacent to vertex v and Fv = 2|Ev ∩ Ev |/(|Ev | + |Ev |) with the ¯ global average F = v Fv /|V |. If n and n∗ are the average degrees of G and G∗ , it can be shown that: F1 = 1 |V | v 2|Ev ∩ E ∗ | n + n∗ and 1 ¯ F = |V | v 2|Ev ∩ E ∗ | n + v + n∗ + ∗ v (2) where n + v (resp. n∗ + ∗ ) is the degree of v in G (resp. in G∗ ). In particular, if G v ¯ ¯ and G∗ are regular graphs, then F1 (E) = F (E) so that F is a good approximation to F1 . In the contact map regime where the number of contacts grows linearly with the length of the sequence, we should have in general |E| ≈ |E ∗ | ≈ (1 + α)|V | so that each node on average has n = n∗ = 2(1 + α) edges. The value of α depends of course on the neighborhood cutoff. As in reinforcement learning, to learn the scoring function one is faced with the problem of generating good training sets in a high dimensional space, where the states are the topologies (graphs), and the policies are algorithms for adding a single edge to a given graph. In the simulations we adopt several different strategies including static and dynamic generation. Within dynamic generation we use three exploration strategies: random exploration (successor graph chosen at random), pure exploitation (successor graph maximizes the current scoring function), and semi-uniform exploitation to find a balance between exploration and exploitation [with probability (resp. 1 − ) we choose random exploration (resp. pure exploitation)]. 3 GRNN Architectures Inference and learning in the protein GIOHMMs we have described is computationally intensive due to the large number of undirected loops they contain. This problem can be addressed using a neural network reparameterization assuming that: (a) all the nodes in the graphs are associated with a deterministic vector (note that in the case of the output nodes this vector can represent a probability distribution so that the overall model remains probabilistic); (b) each vector is a deterministic function of its parents; (c) each function is parameterized using a neural network (or some other class of approximators); and (d) weight-sharing or stationarity is used between similar neural networks in the model. For example, in the 2D GIOHMM contact map predictor, we can use a total of 5 neural networks to recursively compute the four hidden states and the output in each column in the form: NW NE SW SE Oij = NO (Iij , Hi,j , Hi,j , Hi,j , Hi,j ) NE NE NE Hi,j = NN E (Ii,j , Hi−1,j , Hi,j−1 ) N NW NW Hi,jW = NN W (Ii,j , Hi+1,j , Hi,j−1 ) SW SW SW Hi,j = NSW (Ii,j , Hi+1,j , Hi,j+1 ) SE SE SE Hi,j = NSE (Ii,j , Hi−1,j , Hi,j+1 ) (3) N In the NE plane, for instance, the boundary conditions are set to Hij E = 0 for i = 0 N or j = 0. The activity vector associated with the hidden unit Hij E depends on the NE NE local input Iij , and the activity vectors of the units Hi−1,j and Hi,j−1 . Activity in NE plane can be propagated row by row, West to East, and from the first row to the last (from South to North), or column by column South to North, and from the first column to the last. These GRNN architectures can be trained by gradient descent by unfolding the structures in space, leveraging the acyclic nature of the underlying GIOHMMs. 4 Data Many data sets are available or can be constructed for training and testing purposes, as described in the references. The data sets used in the present simulations are extracted from the publicly available Protein Data Bank (PDB) and then redundancy reduced, or from the non-homologous subset of PDB Select (ftp://ftp.emblheidelberg.de/pub/databases/). In addition, we typically exclude structures with poor resolution (less than 2.5-3 ˚), sequences containing less than 30 amino acids, A and structures containing multiple sequences or sequences with chain breaks. For coarse contact maps, we use the DSSP program [9] (CMBI version) to assign secondary structures and we remove also sequences for which DSSP crashes. The results we report for fine-grained contact maps are derived using 424 proteins with lengths in the 30-200 range for training and an additional non-homologous set of 48 proteins in the same length range for testing. For the coarse contact map, we use a set of 587 proteins of length less than 300. Because the average length of a secondary structure element is slightly above 7, the size of a coarse map is roughly 2% the size of the corresponding amino acid map. 5 Simulation Results and Conclusions We have trained several 2D GIOHMM/GRNN models on the direct prediction of fine-grained contact maps. Training of a single model typically takes on the order of a week on a fast workstation. A sample of validation results is reported in Table 1 for four different distance cutoffs. Overall percentages of correctly predicted contacts Table 1: Direct prediction of amino acid contact maps. Column 1: four distance cutoffs. Column 2, 3, and 4: overall percentages of amino acids correctly classified as contacts, non-contacts, and in total. Column 5: Precision percentage for distant contacts (|i − j| ≥ 8) with a threshold of 0.5. Single model results except for last line corresponding to an ensemble of 5 models. Cutoff 6˚ A 8˚ A 10 ˚ A 12 ˚ A 12 ˚ A Contact .714 .638 .512 .433 .445 Non-Contact .998 .998 .993 .987 .990 Total .985 .970 .931 .878 .883 Precision (P) .594 .670 .557 .549 .717 and non-contacts at all linear distances, as well as precision results for distant contacts (|i − j| ≥ 8) are reported for a single GIOHMM/GRNN model. The model has k = 14 hidden units in the hidden and output layers of the four hidden networks, as well as in the hidden layer of the output network. In the last row, we also report as an example the results obtained at 12˚ by an ensemble of 5 networks A with k = 11, 12, 13, 14 and 15. Note that precision for distant contacts exceeds all previously reported results and is well above 50%. For the prediction of coarse-grained contact maps, we use the indirect GIOHMM/GRNN strategy and compare different exploration/exploitation strategies: random exploration, pure exploitation, and their convex combination (semiuniform exploitation). In the semi-uniform case we set the probability of random uniform exploration to = 0.4. In addition, we also try a fourth hybrid strategy in which the search proceeds greedily (i.e. the best successor is chosen at each step, as in pure exploitation), but the network is trained by randomly sub-sampling the successors of the current state. Eight numerical features encode the input label of each node: one-hot encoding of secondary structure classes; normalized linear distances from the N to C terminus; average, maximum and minimum hydrophobic character of the segment (based on the Kyte-Doolittle scale with a moving window of length 7). A sample of results obtained with 5-fold cross-validation is shown in Table 2. Hidden state vectors have dimension k = 5 with no hidden layers. For each strategy we measure performances by means of several indices: micro and macroaveraged precision (mP , M P ), recall (mR, M R) and F1 measure (mF1 , M F1 ). Micro-averages are derived based on each pair of secondary structure elements in each protein, whereas macro-averages are obtained on a per-protein basis, by first computing precision and recall for each protein, and then averaging over the set of all proteins. In addition, we also measure the micro and macro averages for specificity in the sense of percentage of correct prediction for non-contacts (mP (nc), M P (nc)). Note the tradeoffs between precision and recall across the training methods, the hybrid method achieving the best F 1 results. Table 2: Indirect prediction of coarse contact maps with dynamic sampling. Strategy Random exploration Semi-uniform Pure exploitation Hybrid mP .715 .454 .431 .417 mP (nc) .769 .787 .806 .834 mR .418 .631 .726 .790 mF1 .518 .526 .539 .546 MP .767 .507 .481 .474 M P (nc) .709 .767 .793 .821 MR .469 .702 .787 .843 M F1 .574 .588 .596 .607 We have presented two approaches, based on a very general IOHMM/RNN framework, that achieve state-of-the-art performance in the prediction of proteins contact maps at fine and coarse-grained levels of resolution. In principle both methods can be applied to both resolution levels, although the indirect prediction is computationally too demanding for fine-grained prediction of large proteins. Several extensions are currently under development, including the integration of these methods into complete 3D structure predictors. While these systems require long training periods, once trained they can rapidly sift through large proteomic data sets. Acknowledgments The work of PB and GP is supported by a Laurel Wilkening Faculty Innovation award and awards from NIH, BREP, Sun Microsystems, and the California Institute for Telecommunications and Information Technology. The work of PF and AV is partially supported by a MURST grant. References [1] D. Baker and A. Sali. Protein structure prediction and structural genomics. Science, 294:93–96, 2001. [2] P. Baldi and S. Brunak and P. Frasconi and G. Soda and G. Pollastri. Exploiting the past and the future in protein secondary structure prediction. Bioinformatics, 15(11):937–946, 1999. [3] P. Baldi and Y. Chauvin. Hybrid modeling, HMM/NN architectures, and protein applications. Neural Computation, 8(7):1541–1565, 1996. [4] P. Baldi and G. Pollastri. Machine learning structural and functional proteomics. IEEE Intelligent Systems. Special Issue on Intelligent Systems in Biology, 17(2), 2002. [5] Y. Bengio and P. Frasconi. Input-output HMM’s for sequence processing. IEEE Trans. on Neural Networks, 7:1231–1249, 1996. [6] P. Fariselli, O. Olmea, A. Valencia, and R. Casadio. Prediction of contact maps with neural networks and correlated mutations. Protein Engineering, 14:835–843, 2001. [7] P. Frasconi, M. Gori, and A. Sperduti. A general framework for adaptive processing of data structures. IEEE Trans. on Neural Networks, 9:768–786, 1998. [8] Z. Ghahramani and M. I. Jordan. Factorial hidden Markov models Machine Learning, 29:245–273, 1997. [9] W. Kabsch and C. Sander. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22:2577–2637, 1983. [10] A. M. Lesk, L. Lo Conte, and T. J. P. Hubbard. Assessment of novel fold targets in CASP4: predictions of three-dimensional structures, secondary structures, and interresidue contacts. Proteins, 45, S5:98–118, 2001. [11] G. Pollastri and P. Baldi. Predition of contact maps by GIOHMMs and recurrent neural networks using lateral propagation from all four cardinal corners. Proceedings of 2002 ISMB (Intelligent Systems for Molecular Biology) Conference. Bioinformatics, 18, S1:62–70, 2002. [12] G. Pollastri, D. Przybylski, B. Rost, and P. Baldi. Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins, 47:228–235, 2002. [13] G. Pollastri, P. Baldi, P. Fariselli, and R. Casadio. Prediction of coordination number and relative solvent accessibility in proteins. Proteins, 47:142–153, 2002. [14] M. Vendruscolo, E. Kussell, and E. Domany. Recovery of protein structure from contact maps. Folding and Design, 2:295–306, 1997.
same-paper 3 0.81059229 69 nips-2002-Discriminative Learning for Label Sequences via Boosting
Author: Yasemin Altun, Thomas Hofmann, Mark Johnson
Abstract: This paper investigates a boosting approach to discriminative learning of label sequences based on a sequence rank loss function. The proposed method combines many of the advantages of boosting schemes with the efficiency of dynamic programming methods and is attractive both, conceptually and computationally. In addition, we also discuss alternative approaches based on the Hamming loss for label sequences. The sequence boosting algorithm offers an interesting alternative to methods based on HMMs and the more recently proposed Conditional Random Fields. Applications areas for the presented technique range from natural language processing and information extraction to computational biology. We include experiments on named entity recognition and part-of-speech tagging which demonstrate the validity and competitiveness of our approach. 1
4 0.69539446 45 nips-2002-Boosted Dyadic Kernel Discriminants
Author: Baback Moghaddam, Gregory Shakhnarovich
Abstract: We introduce a novel learning algorithm for binary classification with hyperplane discriminants based on pairs of training points from opposite classes (dyadic hypercuts). This algorithm is further extended to nonlinear discriminants using kernel functions satisfying Mercer’s conditions. An ensemble of simple dyadic hypercuts is learned incrementally by means of a confidence-rated version of AdaBoost, which provides a sound strategy for searching through the finite set of hypercut hypotheses. In experiments with real-world datasets from the UCI repository, the generalization performance of the hypercut classifiers was found to be comparable to that of SVMs and k-NN classifiers. Furthermore, the computational cost of classification (at run time) was found to be similar to, or better than, that of SVM. Similarly to SVMs, boosted dyadic kernel discriminants tend to maximize the margin (via AdaBoost). In contrast to SVMs, however, we offer an on-line and incremental learning machine for building kernel discriminants whose complexity (number of kernel evaluations) can be directly controlled (traded off for accuracy). 1
5 0.608298 53 nips-2002-Clustering with the Fisher Score
Author: Koji Tsuda, Motoaki Kawanabe, Klaus-Robert Müller
Abstract: Recently the Fisher score (or the Fisher kernel) is increasingly used as a feature extractor for classification problems. The Fisher score is a vector of parameter derivatives of loglikelihood of a probabilistic model. This paper gives a theoretical analysis about how class information is preserved in the space of the Fisher score, which turns out that the Fisher score consists of a few important dimensions with class information and many nuisance dimensions. When we perform clustering with the Fisher score, K-Means type methods are obviously inappropriate because they make use of all dimensions. So we will develop a novel but simple clustering algorithm specialized for the Fisher score, which can exploit important dimensions. This algorithm is successfully tested in experiments with artificial data and real data (amino acid sequences).
6 0.59793431 32 nips-2002-Approximate Inference and Protein-Folding
7 0.59710628 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
8 0.5925228 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
9 0.58965808 3 nips-2002-A Convergent Form of Approximate Policy Iteration
10 0.58701599 46 nips-2002-Boosting Density Estimation
11 0.58557093 124 nips-2002-Learning Graphical Models with Mercer Kernels
12 0.58543128 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
13 0.58315867 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
14 0.5830012 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
15 0.58249122 169 nips-2002-Real-Time Particle Filters
16 0.58231437 74 nips-2002-Dynamic Structure Super-Resolution
17 0.58143097 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction
18 0.58098918 10 nips-2002-A Model for Learning Variance Components of Natural Images
19 0.58094859 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture
20 0.57983619 2 nips-2002-A Bilinear Model for Sparse Coding