nips nips2002 nips2002-145 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Eleazar Eskin, Jason Weston, William S. Noble, Christina S. Leslie
Abstract: We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the protein classification problem. These kernels measure sequence similarity based on shared occurrences of k-length subsequences, counted with up to m mismatches, and do not rely on any generative model for the positive training sequences. We compute the kernels efficiently using a mismatch tree data structure and report experiments on a benchmark SCOP dataset, where we show that the mismatch kernel used with an SVM classifier performs as well as the Fisher kernel, the most successful method for remote homology detection, while achieving considerable computational savings.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the protein classification problem. [sent-9, score-0.852]
2 These kernels measure sequence similarity based on shared occurrences of k-length subsequences, counted with up to m mismatches, and do not rely on any generative model for the positive training sequences. [sent-10, score-0.296]
3 1 Introduction A fundamental problem in computational biology is the classification of proteins into functional and structural classes based on homology (evolutionary similarity) of protein sequence data. [sent-12, score-0.752]
4 Known methods for protein classification and homology detection include pairwise sequence alignment [1, 2, 3], profiles for protein families [4], consensus patterns using motifs [5, 6] and profile hidden Markov models [7, 8, 9]. [sent-13, score-1.317]
5 We are most interested in discriminative methods, where protein sequences are seen as a set of labeled examples — positive if they are in the protein family or superfamily and negative otherwise — and we train a classifier to distinguish between the two classes. [sent-14, score-0.986]
6 We focus on the more difficult problem of remote homology detection, where we want our classifier to detect (as positives) test sequences that are only remotely related to the positive training sequences. [sent-15, score-0.647]
7 One of the most successful discriminative techniques for protein classification – and the best performing method for remote homology detection – is the Fisher-SVM [10, 11] approach of Jaakkola et al. [sent-16, score-0.961]
8 In this method, one first builds a profile hidden Markov model [Footnote] Formerly William Noble Grundy: see http://www. [sent-17, score-0.043]
9 html (HMM) for the positive training sequences, defining a log likelihood function $\log P(x|\theta)$ for any protein sequence $x$. [sent-21, score-0.46]
10 If $\theta_0$ is the maximum likelihood estimate for the model parameters, then the gradient vector $\nabla_\theta \log P(x|\theta)|_{\theta=\theta_0}$ assigns to each (positive or negative) training sequence $x$ an explicit vector of features called Fisher scores. [sent-22, score-0.103]
11 This feature mapping defines a kernel function, called the Fisher kernel, that can then be used to train a support vector machine (SVM) [12, 13] classifier. [sent-23, score-0.398]
12 One of the strengths of the Fisher-SVM approach is that it combines the rich biological information encoded in a hidden Markov model with the discriminative power of the SVM algorithm. [sent-24, score-0.146]
13 In this paper, we present a new string kernel, called the mismatch kernel, for use with an SVM for remote homology detection. [sent-27, score-0.916]
14 The (k,m)-mismatch kernel is based on a feature map to a vector space indexed by all possible subsequences of amino acids of a fixed length k; each instance of a fixed k-length subsequence in an input sequence contributes to all feature coordinates differing from it by at most m mismatches. [sent-28, score-0.842]
15 Thus, the mismatch kernel adds the biologically important idea of mismatching to the computationally simpler spectrum kernel presented in [14]. [sent-29, score-1.029]
16 In the current work, we also describe how to compute the new kernel efficiently using a mismatch tree data structure; for values of (k,m) useful in this application, the kernel is fast enough to use on real datasets and is considerably less expensive than the Fisher kernel. [sent-30, score-1.046]
17 We report results from a benchmark dataset on the SCOP database [15] assembled by Jaakkola et al. [sent-31, score-0.085]
18 [10] and show that the mismatch kernel used with an SVM classifier achieves performance equal to the Fisher-SVM method while outperforming all other methods tested. [sent-32, score-0.617]
19 Finally, we note that the mismatch kernel does not depend on any generative model and could potentially be used in other sequence-based classification problems. [sent-33, score-0.648]
20 2 Spectrum and Mismatch String Kernels The basis for our approach to protein classification is to represent protein sequences as vectors in a high-dimensional feature space via a string-based feature map. [sent-34, score-0.975]
21 We then train a support vector machine (SVM), a large-margin linear classifier, on the feature vectors representing our training sequences. [sent-35, score-0.193]
22 Since SVMs are a kernel-based learning algorithm, we do not calculate the feature vectors explicitly but instead compute their pairwise inner products using a mismatch string kernel, which we define in this section. [sent-36, score-0.636]
23 2.1 Feature Maps for Strings The (k,m)-mismatch kernel is based on a feature map from the space of all finite sequences from an alphabet $\mathcal{A}$ of size $|\mathcal{A}| = l$ to the $l^k$-dimensional vector space indexed by the set of k-length subsequences ("k-mers") from $\mathcal{A}$. [sent-38, score-0.627]
24 (For protein sequences, $\mathcal{A}$ is the alphabet of amino acids, $l = 20$.) [sent-39, score-0.383]
25 For a fixed k-mer $\alpha = a_1 a_2 \ldots a_k$, with each $a_i$ a character in $\mathcal{A}$, the (k,m)-neighborhood generated by $\alpha$ is the set of all k-length sequences $\beta$ from $\mathcal{A}$ that differ from $\alpha$ by at most m mismatches. [sent-40, score-0.13]
26 Thus, a k-mer contributes weight to all the coordinates in its mismatch neighborhood. [sent-43, score-0.407]
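The feature map described above can be sketched by brute force. This is only an illustrative sketch, not the authors' implementation: the function names (`neighborhood`, `mismatch_features`, `mismatch_kernel`) and the `AMINO_ACIDS` string are assumptions introduced here, and each k-mer instance is assumed to add weight 1 to every coordinate in its mismatch neighborhood.

```python
from collections import Counter
from itertools import combinations, product

# Assumed one-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def neighborhood(kmer, m, alphabet):
    """All strings of len(kmer) over `alphabet` within m mismatches of `kmer`."""
    result = set()
    for d in range(m + 1):
        # Choose d positions to overwrite; overwriting with the same
        # letter is harmless because the set removes duplicates.
        for positions in combinations(range(len(kmer)), d):
            for letters in product(alphabet, repeat=d):
                chars = list(kmer)
                for pos, ch in zip(positions, letters):
                    chars[pos] = ch
                result.add("".join(chars))
    return result

def mismatch_features(seq, k, m, alphabet):
    """Sparse feature vector: each k-mer instance adds 1 to every neighbor."""
    feats = Counter()
    for i in range(len(seq) - k + 1):
        for beta in neighborhood(seq[i:i + k], m, alphabet):
            feats[beta] += 1
    return feats

def mismatch_kernel(x, y, k, m, alphabet):
    """Inner product of the two sparse feature vectors."""
    fx = mismatch_features(x, k, m, alphabet)
    fy = mismatch_features(y, k, m, alphabet)
    return sum(v * fy[b] for b, v in fx.items())
```

Setting m = 0 reduces this to counting exact k-mer matches. The explicit enumeration is only practical for small k and m.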
27 2.2 Fisher Scores and the Spectrum Kernel While we define the spectrum and mismatch feature maps without any reference to a generative model for the positive class of sequences, there is some similarity between the k-spectrum feature map and the Fisher scores associated to an order k-1 Markov chain model. [sent-47, score-0.919]
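28 $P(x|\theta) = P(x_1 \ldots x_{k-1}|\theta) \prod_{j=k}^{n} P(x_j | x_{j-k+1} \ldots x_{j-1}, \theta)$ for a string $x = x_1 x_2 \ldots x_n$, with parameters $\theta^{t|s_1 \ldots s_{k-1}} = P(x_j = t | x_{j-k+1} \ldots x_{j-1} = s_1 \ldots s_{k-1}, \theta)$ for characters $t, s_1, \ldots, s_{k-1}$ in alphabet $\mathcal{A}$. [sent-49, score-0.17]
29 Denote by $\theta_0$ the maximum likelihood estimate for $\theta$ on the positive training set. [sent-50, score-0.075]
30 Then the Fisher scores are given by $\frac{\partial}{\partial \theta^{t|s_1 \ldots s_{k-1}}} \log P(x|\theta) \big|_{\theta=\theta_0} = \frac{n_{s_1 \ldots s_{k-1} t}}{\theta_0^{t|s_1 \ldots s_{k-1}}} - n_{s_1 \ldots s_{k-1}}$, where $n_{s_1 \ldots s_{k-1} t}$ is the number of instances of the k-mer $s_1 \ldots s_{k-1} t$ in $x$, and $n_{s_1 \ldots s_{k-1}}$ is the number of instances of the (k-1)-mer $s_1 \ldots s_{k-1}$. [sent-53, score-0.324]
31 Thus the Fisher score captures the degree to which the k-mer $s_1 \ldots s_{k-1} t$ is over- or under-represented relative to the positive model. [sent-54, score-0.08]
32 For the k-spectrum kernel, the corresponding feature coordinate looks similar but simply uses the unweighted count: $\Phi_{s_1 \ldots s_{k-1} t}(x) = n_{s_1 \ldots s_{k-1} t}$. [sent-55, score-0.092]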
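The contrast between the two coordinates can be sketched as follows. This is a hedged illustration: `kmer_counts`, `spectrum_kernel`, and `fisher_score` are hypothetical names, and the Fisher score is assumed to have the form (k-mer count) / theta0 minus (context count), where the context count is taken here to be the number of k-mers in the sequence that begin with the same k-1 characters. `theta0` stands for the estimated transition probability of the last character given the preceding k-1 characters.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Unweighted k-spectrum feature vector: counts of each k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(x, y, k):
    """Inner product of the k-mer count vectors of x and y."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    return sum(n * cy[a] for a, n in cx.items())

def fisher_score(seq, kmer, theta0):
    """Assumed form n_st / theta0 - n_s for kmer = s + t.

    n_st counts instances of the full k-mer; n_s counts k-mers in seq
    whose first k-1 characters equal the context s (an assumption made
    so that n_s is the sum of n_st over all final characters t).
    """
    counts = kmer_counts(seq, len(kmer))
    n_st = counts[kmer]
    n_s = sum(n for a, n in counts.items() if a[:-1] == kmer[:-1])
    return n_st / theta0 - n_s
```

Under this assumed form the score is zero when the k-mer occurs exactly as often as the model predicts, positive when it is over-represented, and negative when it is under-represented, whereas the spectrum coordinate is just the raw count.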
33 3 Efficient Computation of the Mismatch Kernel Unlike the Fisher vectors used in [10], our feature vectors are sparse vectors in a very high dimensional feature space. [sent-57, score-0.301]
34 Thus, instead of calculating and storing the feature vectors, we directly and efficiently compute the kernel matrix for use with an SVM classifier. [sent-58, score-0.423]
35 The (k,m)-mismatch tree is a rooted tree of depth k where each internal node has $|\mathcal{A}| = l$ branches and each branch is labeled with a symbol from $\mathcal{A}$. [sent-62, score-0.296]
36 A leaf node represents a fixed k-mer in our feature space, obtained by concatenating the branch symbols along the path from root to leaf, and an internal node represents the prefix for those k-mer features which are its descendants in the tree. [sent-63, score-0.496]
37 We use a depth-first search of this tree to store, at each node that we visit, a set of pointers to all instances of the current prefix pattern that occur, with up to m mismatches, in the sample data. [sent-64, score-0.446]
38 Thus at each node of depth d, we maintain pointers to all substrings from the sample data set whose d-length prefixes are within m mismatches from the d-length prefix represented by the path down from the root. [sent-65, score-0.368]
39 Note that the set of valid substrings at a node is a subset of the set of valid substrings of its parent. [sent-66, score-0.261]
40 When we encounter a node with an empty list of pointers (no valid occurrences of the current prefix), we do not need to search below it in the tree. [sent-67, score-0.189]
41 When we reach a leaf node, we sum the contributions of all instances occurring in each source sequence to obtain feature values corresponding to the current k-mer, and we update the kernel matrix entry for each pair of source sequences $x$ and $y$ having non-zero feature values. [sent-68, score-0.849]
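The traversal described above can be sketched with a recursive depth-first search that never materializes the tree. This is a minimal sketch under stated assumptions, not the authors' code: `mismatch_tree_kernel` is a hypothetical name, instances are tracked as (sequence index, start offset, mismatches so far) tuples in place of pointers, and each surviving instance is assumed to contribute weight 1 at a leaf.

```python
from collections import defaultdict

def mismatch_tree_kernel(seqs, k, m, alphabet):
    """Unnormalized (k,m)-mismatch kernel matrix via depth-first traversal."""
    n = len(seqs)
    K = [[0] * n for _ in range(n)]
    # At the root every k-mer instance is valid with zero mismatches.
    root = [(s, i, 0) for s, seq in enumerate(seqs)
            for i in range(len(seq) - k + 1)]

    def descend(instances, depth):
        if not instances:
            return  # empty pointer list: prune the whole subtree
        if depth == k:
            # Leaf: per-sequence feature value for the current k-mer.
            counts = defaultdict(int)
            for s, _, _ in instances:
                counts[s] += 1
            # Update every pair of sequences with non-zero feature values.
            for a, ca in counts.items():
                for b, cb in counts.items():
                    K[a][b] += ca * cb
            return
        for symbol in alphabet:
            survivors = []
            for s, i, mis in instances:
                mis2 = mis + (seqs[s][i + depth] != symbol)
                if mis2 <= m:  # still within m mismatches of the prefix
                    survivors.append((s, i, mis2))
            descend(survivors, depth + 1)

    descend(root, 0)
    return K
```

Because the survivors at a child are filtered from the parent's list, the set of valid instances at a node is always a subset of its parent's, which is what makes pruning on an empty list safe.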
42 The number of mismatches for each instance is also indicated. [sent-70, score-0.135]
43 The number of k-mers within m mismatches of any given fixed k-mer is $\sum_{i=0}^{m} \binom{k}{i} (l-1)^i = O(k^m l^m)$. [sent-75, score-0.135]
44 Thus the effective number of k-mer instances that we need to traverse grows as $O(N k^m l^m)$, where $N$ is the total length of the sample data. [sent-76, score-0.124]
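As a quick arithmetic check on the growth rate, the neighborhood size can be computed directly: a string at Hamming distance exactly i from a fixed k-mer is obtained by choosing i of the k positions and one of the l-1 other letters at each. The helper name `neighborhood_size` is an assumption introduced for this sketch.

```python
from math import comb

def neighborhood_size(k, m, l):
    """Number of k-mers over an l-letter alphabet within m mismatches."""
    return sum(comb(k, i) * (l - 1) ** i for i in range(m + 1))

# For amino acid sequences (l = 20): k = 5, m = 1 gives 1 + 5 * 19 = 96.
print(neighborhood_size(5, 1, 20))
```

So with one mismatch allowed, each 5-mer instance touches 96 coordinates out of 20^5 = 3,200,000, which is why the feature vectors remain sparse.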
45 At a leaf node, if exactly $p$ input sequences contain valid instances of the current k-mer, one performs $p^2$ updates to the kernel matrix. [sent-77, score-0.601]
46 For $M$ sequences each of length $n$ (total length $N = nM$), the worst case for the kernel computation occurs when the $M$ feature vectors are all equal and have the maximal number of non-zero entries, giving worst case overall running time $O(M^2 n k^m l^m)$. [sent-78, score-0.628]
47 For the application we discuss here, small values of m are most useful, and the kernel calculations are quite inexpensive. [sent-79, score-0.273]
48 In practice, one usually wants to use a normalized feature map, so one would also need to compute the norm of the vector $\Phi_{(k,m)}(x)$, with complexity $O(n k^m l^m)$ for a sequence of length $n$. [sent-82, score-0.244]
49 Simple normalization schemes, like dividing by sequence length, can also be used. [sent-83, score-0.074]
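Normalizing the feature map to unit length is equivalent to rescaling the kernel matrix by its diagonal, K(x, y) / sqrt(K(x, x) K(y, y)). A minimal sketch, assuming the unnormalized Gram matrix is already available as a list of lists; `normalize_kernel_matrix` is a hypothetical helper name.

```python
import math

def normalize_kernel_matrix(K):
    """Return K'[i][j] = K[i][j] / sqrt(K[i][i] * K[j][j]).

    Rows and columns whose diagonal entry is zero (a sequence with no
    features, for example one shorter than k) are mapped to 0.0.
    """
    n = len(K)
    norms = [math.sqrt(K[i][i]) for i in range(n)]
    return [[K[i][j] / (norms[i] * norms[j]) if norms[i] and norms[j] else 0.0
             for j in range(n)]
            for i in range(n)]
```

After this rescaling every sequence with at least one feature has self-similarity 1, so long sequences no longer dominate simply by containing more k-mers.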
50 In these experiments, remote homology is simulated by holding out all members of a target SCOP family from a given superfamily. [sent-87, score-0.509]
51 Positive training examples are chosen from the remaining families in the same superfamily, and negative test and training examples are chosen from disjoint sets of folds outside the target family’s fold. [sent-88, score-0.151]
52 The held-out family members serve as positive test examples. [sent-89, score-0.113]
53 used the SAM-T98 algorithm to pull in domain homologs from the non-redundant protein database and added these sequences as positive examples in the experiments. [sent-91, score-0.624]
54 Because the test sets are designed for remote homology detection, we use small values of k. [sent-96, score-0.442]
55 These include the original experimental results from Jaakkola et al. [sent-104, score-0.032]
56 We also test PSI-BLAST [3], an alignment-based method widely used in the biological community, on the same data using the methodology described in [14]. [sent-106, score-0.036]
57 Figure 2 illustrates the mismatch-SVM method’s performance relative to three existing homology detection methods as measured by ROC scores. [sent-107, score-0.378]
58 The figure includes results for SCOP families, and each series corresponds to one homology detection method. [sent-108, score-0.378]
59 Figure 3 shows a family-by-family comparison of performance of the (5,1)-mismatch-SVM and Fisher-SVM using ROC scores in plot (A) and ROC-50 scores in plot (B). [sent-113, score-0.492]
60 Figure 4 shows the improvement provided by including mismatches in the SVM kernel. [sent-115, score-0.135]
61 The figures plot ROC scores (plot [Footnote 1] The ROC-50 score is the area under the graph of the number of true positives as a function of false positives, up to the first 50 false positives, scaled so that both axes range from 0 to 1. [sent-116, score-0.399]
62 This score is sometimes preferred in the computational biology community, motivated by the idea that a biologist might be willing to sift through about 50 false positives. [sent-117, score-0.165]
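The ROC-50 definition in the footnote can be sketched as a truncated area computation. This is an illustrative sketch, not the evaluation code used in the experiments: `roc_n_score` is a hypothetical name, ties between scores are broken by input order, and when fewer than n negatives exist the cutoff is assumed to fall at the last negative.

```python
def roc_n_score(scores, labels, n=50):
    """Area under the true-positive vs. false-positive curve up to the
    first n false positives, scaled so that both axes range from 0 to 1."""
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    total_pos = sum(1 for _, y in ranked if y)
    total_neg = len(ranked) - total_pos
    n = min(n, total_neg)
    if total_pos == 0 or n == 0:
        return 0.0
    tp = fp = area = 0
    for _, y in ranked:
        if y:
            tp += 1
        else:
            fp += 1
            area += tp  # one unit-width column of height tp
            if fp == n:
                break
    return area / (n * total_pos)
```

With n at least the number of negatives this coincides with the ordinary ROC score; smaller n concentrates the score on the top of the ranking.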
63 [Figure 2 plot: x-axis "ROC", y-axis "Number of families"; series: (5,1)-Mismatch-SVM ROC, Fisher-SVM ROC, SAM-T98, PSI-BLAST] [sent-118, score-0.093]
64 Figure 2: Comparison of four homology detection methods. [sent-128, score-0.378]
65 The graph plots the total number of families for which a given method exceeds an ROC score threshold. [sent-129, score-0.127]
66 (A)) and ROC-50 scores (plot (B)) for two string kernel SVM methods: using the (k,m) = (5,1) mismatch kernel, and using the (no mismatch) spectrum kernel, the best-performing choice with m = 0. [sent-130, score-0.993]
67 Almost all of the families perform better with mismatching than without, showing that mismatching gives significantly better generalization performance. [sent-131, score-0.219]
68 5 Discussion We have presented a class of string kernels that measure sequence similarity without requiring alignment or depending upon a generative model, and we have given an efficient method for computing these kernels. [sent-132, score-0.416]
69 For the remote homology detection problem, our discriminative approach — combining support vector machines with the mismatch kernel — performs as well in the SCOP experiments as the most successful known method. [sent-133, score-1.235]
70 A practical protein classification system would involve fast multi-class prediction – potentially involving thousands of binary classifiers – on massive test sets. [sent-134, score-0.311]
71 In such applications, computational efficiency of the kernel function becomes an important issue. [sent-135, score-0.273]
72 Chris Watkins [20] and David Haussler [21] have recently defined a set of kernel functions over strings, and one of these string kernels has been implemented for a text classification problem [22]. [sent-136, score-0.485]
73 However, the cost of computing each kernel entry is $O(n^2)$ in the length $n$ of the input sequences. [sent-137, score-0.32]
74 The (k,m)-mismatch kernel is relatively inexpensive to compute for values of m that are practical in applications, allows computation of multiple kernel values in one pass, and significantly improves performance over the previously presented (mismatch-free) spectrum kernel. [sent-140, score-0.653]
75 Many family-based remote homology detection algorithms incorporate a method for selecting probable domain homologs from unannotated protein sequence databases for additional training data. [sent-142, score-0.822]
76 In these experiments, we used the domain homologs that were identified by SAM-T98 (an iterative HMM-based algorithm) as part of the Fisher-SVM method and included in the datasets; these homologs may be more useful to the Fisher kernel than to the mismatch kernel. [sent-143, score-0.785]
77 We plan to extend our method by investigating semi-supervised techniques for selecting unannotated sequences for use with the mismatch-SVM. [sent-144, score-0.172]
78 The coordinates of each point in the plot are the ROC scores (plot (A)) or ROC-50 scores (plot (B)) for one SCOP family, obtained using the mismatch-SVM with k = 5, m = 1 (x-axis) and Fisher-SVM (y-axis). [sent-174, score-0.445]
79 Figure 4: Family-by-family comparison of mismatch-SVM and spectrum-SVM. The coordinates of each point in the plot are the ROC scores (plot (A)) or ROC-50 scores (plot (B)) for one SCOP family, obtained using the mismatch-SVM with k = 5, m = 1 (x-axis) and spectrum-SVM (y-axis). [sent-205, score-0.416]
80 Many interesting variations on the mismatch kernel can be explored using the framework presented here. [sent-207, score-0.617]
81 For example, explicit k-mer feature selection can be implemented during calculation of the kernel matrix, based on a criterion enforced at each leaf or internal node. [sent-208, score-0.442]
82 Potentially, a good feature selection criterion could improve performance in certain applications while decreasing kernel computation time. [sent-209, score-0.365]
83 In biological applications, it is also natural to consider weighting each k-mer instance contribution to a feature coordinate by evolutionary substitution probabilities. [sent-210, score-0.155]
84 Finally, one could use linear combinations of kernels to capture similarity of different length k-mers. [sent-211, score-0.163]
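One way to read the last suggestion is as a weighted sum of Gram matrices computed for different values of k. The sketch below is an assumption about how such a combination might be formed, not something specified in the paper: `combine_kernels` is a hypothetical helper, and the weights are required to be non-negative so that the sum of valid kernels remains a valid kernel.

```python
def combine_kernels(matrices, weights):
    """Entrywise weighted sum of equally sized kernel matrices."""
    if len(matrices) != len(weights):
        raise ValueError("need exactly one weight per kernel matrix")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    n = len(matrices[0])
    return [[sum(w * K[i][j] for K, w in zip(matrices, weights))
             for j in range(n)]
            for i in range(n)]
```

In practice the individual matrices would usually be normalized first so that no single k dominates the sum.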
85 We believe that further experimentation with mismatch string kernels could be fruitful for remote protein homology detection and other biological sequence classification problems. [sent-212, score-1.528]
86 We thank Nir Friedman for pointing out the connection with Fisher scores for Markov chain models. [sent-215, score-0.17]
87 Computer alignment of sequences, chapter Phylogenetic Analysis of DNA Sequences. [sent-221, score-0.065]
88 Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. [sent-246, score-0.364]
89 The PRINTS protein fingerprint database in its fifth year. [sent-265, score-0.364]
90 Hidden markov models in computational biology: Applications to protein modeling. [sent-273, score-0.358]
91 Using the fisher kernel method to detect remote protein homologies. [sent-299, score-0.757]
92 The spectrum kernel: A string kernel for SVM protein classification. [sent-317, score-0.79]
93 SCOP: A structural classification of proteins database for the investigation of sequences and structures. [sent-326, score-0.183]
94 Spelling approximate or repeated motifs using a suffix tree. [sent-330, score-0.042]
95 An algorithm for finding signals of unknown length in DNA sequences. [sent-336, score-0.047]
96 Embedding strategies for effective use of information from multiple sequence alignments. [sent-343, score-0.074]
wordName wordTfidf (topN-words)
[('mismatch', 0.344), ('protein', 0.311), ('kernel', 0.273), ('homology', 0.269), ('ea', 0.209), ('roc', 0.207), ('scop', 0.19), ('remote', 0.173), ('scores', 0.17), ('fisher', 0.163), ('hf', 0.148), ('mismatches', 0.135), ('sequences', 0.13), ('string', 0.13), ('classi', 0.11), ('detection', 0.109), ('svm', 0.103), ('biology', 0.098), ('families', 0.093), ('feature', 0.092), ('tree', 0.089), ('jaakkola', 0.089), ('node', 0.087), ('bf', 0.085), ('homologs', 0.084), ('molecular', 0.083), ('kernels', 0.082), ('leaf', 0.077), ('acids', 0.077), ('instances', 0.077), ('plot', 0.076), ('spectrum', 0.076), ('sequence', 0.074), ('rp', 0.073), ('traversal', 0.073), ('uvs', 0.073), ('pre', 0.069), ('discriminative', 0.067), ('alignment', 0.065), ('mismatching', 0.063), ('pro', 0.06), ('pointers', 0.058), ('noble', 0.058), ('subsequences', 0.058), ('positives', 0.053), ('database', 0.053), ('superfamily', 0.049), ('vus', 0.049), ('length', 0.047), ('markov', 0.047), ('nucleic', 0.046), ('positive', 0.046), ('le', 0.045), ('path', 0.045), ('valid', 0.044), ('er', 0.043), ('hidden', 0.043), ('substrings', 0.043), ('motifs', 0.042), ('unannotated', 0.042), ('diekhans', 0.042), ('alphabet', 0.04), ('family', 0.039), ('cation', 0.039), ('vectors', 0.039), ('altschul', 0.039), ('pnas', 0.039), ('eskin', 0.039), ('intelligent', 0.037), ('biological', 0.036), ('datasets', 0.036), ('score', 0.034), ('similarity', 0.034), ('occurring', 0.034), ('columbia', 0.034), ('contributes', 0.034), ('leslie', 0.034), ('map', 0.034), ('train', 0.033), ('false', 0.033), ('amino', 0.032), ('et', 0.032), ('ef', 0.032), ('generative', 0.031), ('compute', 0.031), ('chris', 0.031), ('branch', 0.031), ('expanding', 0.031), ('weston', 0.03), ('coordinates', 0.029), ('training', 0.029), ('dna', 0.029), ('hmms', 0.029), ('strings', 0.029), ('zhang', 0.028), ('miller', 0.028), ('members', 0.028), ('william', 0.028), ('evolutionary', 0.027), ('storing', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999923 145 nips-2002-Mismatch String Kernels for SVM Protein Classification
Author: Eleazar Eskin, Jason Weston, William S. Noble, Christina S. Leslie
Abstract: We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the protein classification problem. These kernels measure sequence similarity based on shared occurrences of k-length subsequences, counted with up to m mismatches, and do not rely on any generative model for the positive training sequences. We compute the kernels efficiently using a mismatch tree data structure and report experiments on a benchmark SCOP dataset, where we show that the mismatch kernel used with an SVM classifier performs as well as the Fisher kernel, the most successful method for remote homology detection, while achieving considerable computational savings.
2 0.24088037 191 nips-2002-String Kernels, Fisher Kernels and Finite State Automata
Author: Craig Saunders, Alexei Vinokourov, John S. Shawe-taylor
Abstract: In this paper we show how the generation of documents can be thought of as a k-stage Markov process, which leads to a Fisher kernel from which the n-gram and string kernels can be re-constructed. The Fisher kernel view gives a more flexible insight into the string kernel and suggests how it can be parametrised in a way that reflects the statistics of the training corpus. Furthermore, the probabilistic modelling approach suggests extending the Markov process to consider sub-sequences of varying length, rather than the standard fixed-length approach used in the string kernel. We give a procedure for determining which sub-sequences are informative features and hence generate a Finite State Machine model, which can again be used to obtain a Fisher kernel. By adjusting the parametrisation we can also influence the weighting received by the features. In this way we are able to obtain a logarithmic weighting in a Fisher kernel. Finally, experiments are reported comparing the different kernels using the standard Bag of Words kernel as a baseline.
3 0.22111402 120 nips-2002-Kernel Design Using Boosting
Author: Koby Crammer, Joseph Keshet, Yoram Singer
Abstract: The focus of the paper is the problem of learning kernel operators from empirical data. We cast the kernel design problem as the construction of an accurate kernel from simple (and less accurate) base kernels. We use the boosting paradigm to perform the kernel construction process. To do so, we modify the booster so as to accommodate kernel operators. We also devise an efficient weak-learner for simple kernels that is based on generalized eigen vector decomposition. We demonstrate the effectiveness of our approach on synthetic data and on the USPS dataset. On the USPS dataset, the performance of the Perceptron algorithm with learned kernels is systematically better than a fixed RBF kernel. 1 Introduction and problem Setting The last decade brought voluminous amount of work on the design, analysis and experimentation of kernel machines. Algorithm based on kernels can be used for various machine learning tasks such as classification, regression, ranking, and principle component analysis. The most prominent learning algorithm that employs kernels is the Support Vector Machines (SVM) [1, 2] designed for classification and regression. A key component in a kernel machine is a kernel operator which computes for any pair of instances their inner-product in some abstract vector space. Intuitively and informally, a kernel operator is a means for measuring similarity between instances. Almost all of the work that employed kernel operators concentrated on various machine learning problems that involved a predefined kernel. A typical approach when using kernels is to choose a kernel before learning starts. Examples to popular predefined kernels are the Radial Basis Functions and the polynomial kernels (see for instance [1]). Despite the simplicity required in modifying a learning algorithm to a “kernelized” version, the success of such algorithms is not well understood yet. 
More recently, special efforts have been devoted to crafting kernels for specific tasks such as text categorization [3] and protein classification problems [4]. Our work attempts to give a computational alternative to predefined kernels by learning kernel operators from data. We start with a few definitions. Let $\mathcal{X}$ be an instance space. A kernel is an inner-product operator $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. An explicit way to describe $K$ is via a mapping $\phi : \mathcal{X} \to \mathcal{H}$ from $\mathcal{X}$ to an inner-product space $\mathcal{H}$ such that $K(x, x') = \phi(x) \cdot \phi(x')$. Given a kernel operator and a finite set of instances $S = \{(x_i, y_i)\}_{i=1}^{m}$, the kernel matrix (a.k.a. the Gram matrix) is the matrix of all possible inner-products of pairs from $S$, $K_{i,j} = K(x_i, x_j)$. We therefore refer to the general form of $K$ as the kernel operator and to the application of the kernel operator to a set of pairs of instances as the kernel matrix. The specific setting of kernel design we consider assumes that we have access to a base kernel learner and we are given a target kernel $K^\star$ manifested as a kernel matrix on a set of examples. Upon calling the base kernel learner it returns a kernel operator denoted $K_j$. The goal thereafter is to find a weighted combination of kernels $\hat{K}(x, x') = \sum_j \alpha_j K_j(x, x')$ that is similar, in a sense that will be defined shortly, to the target kernel, $\hat{K} \sim K^\star$. Cristianini et al. [5] in their pioneering work on kernel target alignment employed as the notion of similarity the inner-product between the kernel matrices $\langle K, K^\star \rangle_F = \sum_{i,j=1}^{m} K(x_i, x_j) K^\star(x_i, x_j)$. Given this definition, they defined the kernel-similarity, or alignment, to be the above inner-product normalized by the norm of each kernel, $A(S, \hat{K}, K^\star) = \langle \hat{K}, K^\star \rangle_F / \sqrt{\langle \hat{K}, \hat{K} \rangle_F \langle K^\star, K^\star \rangle_F}$, where $S$ is, as above, a finite sample of $m$ instances. Put another way, the kernel alignment Cristianini et al. employed is the cosine of the angle between the kernel matrices where each matrix is “flattened” into a vector of dimension $m^2$.
Therefore, this definition implies that the alignment is bounded above by 1 and can attain this value iff the two kernel matrices are identical. Given a (column) vector of $m$ labels $y$ where $y_i \in \{-1, +1\}$ is the label of the instance $x_i$, Cristianini et al. used the outer-product of $y$ as the target kernel, $K^\star = y y^T$. Therefore, an optimal alignment is achieved if $\hat{K}(x_i, x_j) = y_i y_j$. Clearly, if such a kernel is used for classifying instances from $\mathcal{X}$, then the kernel itself suffices to construct an excellent classifier $f : \mathcal{X} \to \{-1, +1\}$ by setting $f(x) = \mathrm{sign}(y_i K(x_i, x))$ where $(x_i, y_i)$ is any instance-label pair. Cristianini et al. then devised a procedure that works with both labelled and unlabelled examples to find a Gram matrix which attains a good alignment with $K^\star$ on the labelled part of the matrix. While this approach can clearly construct powerful kernels, a few problems arise from the notion of kernel alignment they employed. For instance, a kernel operator such that $\mathrm{sign}(K(x_i, x_j))$ is equal to $y_i y_j$ but whose magnitude, $|K(x_i, x_j)|$, is not necessarily 1, might achieve a poor alignment score while it can constitute a classifier whose empirical loss is zero. Furthermore, the task of finding a good kernel when it is not always possible to find a kernel whose sign on each pair of instances is equal to the product of the labels (termed the soft-margin case in [5, 6]) becomes rather tricky. We thus propose a different approach which attempts to overcome some of the difficulties above. Like Cristianini et al. we assume that we are given a set of labelled instances $S = \{(x_i, y_i) \mid x_i \in \mathcal{X}, y_i \in \{-1, +1\}, i = 1, \ldots, m\}$. We are also given a set of unlabelled examples $\tilde{S} = \{\tilde{x}_i\}_{i=1}^{\tilde{m}}$. If such a set is not provided we can simply use the labelled instances (without the labels themselves) as the set $\tilde{S}$. The set $\tilde{S}$ is used for constructing the primitive kernels that are combined to constitute the learned kernel $\hat{K}$.
The labelled set is used to form the target kernel matrix and its instances are used for evaluating the learned kernel $\hat{K}$. This approach, known as transductive learning, was suggested in [5, 6] for kernel alignment tasks when the distribution of the instances in the test data is different from that of the training data. This setting becomes in particular handy in datasets where the test data was collected in a different scheme than the training data. We next discuss the notion of kernel goodness employed in this paper. This notion builds on the objective function that several variants of boosting algorithms maintain [7, 8]. We therefore first discuss in brief the form of boosting algorithms for kernels. 2 Using Boosting to Combine Kernels Numerous interpretations of AdaBoost and its variants cast the boosting process as a procedure that attempts to minimize, or make small, a continuous bound on the classification error (see for instance [9, 7] and the references therein). A recent work by Collins et al. [8] unifies the boosting process for two popular loss functions, the exponential-loss (denoted henceforth as ExpLoss) and logarithmic-loss (denoted as LogLoss) that bound the empirical classification error.
Input: Labelled and unlabelled sets of examples: $S = \{(x_i, y_i)\}_{i=1}^{m}$ ; $\tilde{S} = \{\tilde{x}_i\}_{i=1}^{\tilde{m}}$
Initialize: $K \leftarrow 0$ (all zeros matrix)
For $t = 1, 2, \ldots, T$:
• Calculate distribution over pairs $1 \le i, j \le m$: $D_t(i, j) = \exp(-y_i y_j K(x_i, x_j))$ (ExpLoss) or $D_t(i, j) = 1/(1 + \exp(-y_i y_j K(x_i, x_j)))$ (LogLoss)
• Call base-kernel-learner with $(D_t, S, \tilde{S})$ and receive $K_t$
• Calculate: $S_t^+ = \{(i, j) \mid y_i y_j K_t(x_i, x_j) > 0\}$ ; $S_t^- = \{(i, j) \mid y_i y_j K_t(x_i, x_j) < 0\}$ ; $W_t^+ = \sum_{(i,j) \in S_t^+} D_t(i, j) |K_t(x_i, x_j)|$ ; $W_t^- = \sum_{(i,j) \in S_t^-} D_t(i, j) |K_t(x_i, x_j)|$
• Set: $\alpha_t = \frac{1}{2} \ln \frac{W_t^+}{W_t^-}$ ; $K \leftarrow K + \alpha_t K_t$.
Return: kernel operator $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$
Figure 1: The skeleton of the boosting algorithm for kernels.
Given the prediction of a classifier $f$ on an instance $x$ and a label $y \in \{-1, +1\}$ the ExpLoss and the LogLoss are defined as $\mathrm{ExpLoss}(f(x), y) = \exp(-y f(x))$ and $\mathrm{LogLoss}(f(x), y) = \log(1 + \exp(-y f(x)))$. Collins et al. described a single algorithm for the two losses above that can be used within the boosting framework to construct a strong-hypothesis which is a classifier $f(x)$. This classifier is a weighted combination of (possibly very simple) base classifiers. (In the boosting framework, the base classifiers are referred to as weak-hypotheses.) The strong-hypothesis is of the form $f(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$. Collins et al. discussed a few ways to select the weak-hypotheses $h_t$ and to find a good set of weights $\alpha_t$. Our starting point in this paper is the first sequential algorithm from [8] that enables the construction or creation of weak-hypotheses on-the-fly. We would like to note however that it is possible to use other variants of boosting to design kernels. In order to use boosting to design kernels we extend the algorithm to operate over pairs of instances. Building on the notion of alignment from [5, 6], we say that the inner-product of $x_1$ and $x_2$ is aligned with the labels $y_1$ and $y_2$ if $\mathrm{sign}(K(x_1, x_2)) = y_1 y_2$. Furthermore, we would like to make the magnitude of $K(x, x')$ as large as possible. We therefore use one of the following two alignment losses for a pair of examples $(x_1, y_1)$ and $(x_2, y_2)$: $\mathrm{ExpLoss}(K(x_1, x_2), y_1 y_2) = \exp(-y_1 y_2 K(x_1, x_2))$ and $\mathrm{LogLoss}(K(x_1, x_2), y_1 y_2) = \log(1 + \exp(-y_1 y_2 K(x_1, x_2)))$. Put another way, we view a pair of instances as a single example and cast the pairs of instances that attain the same label as positively labelled examples while pairs of opposite labels are cast as negatively labelled examples. Clearly, this approach can be applied to both losses. In the boosting process we therefore maintain a distribution over pairs of instances.
The weight of each pair reflects how difficult it is to predict whether the labels of the two instances are the same or different. The core boosting algorithm follows similar lines to boosting algorithms for classification. The pseudo-code of the booster is given in Fig. 1. The pseudo-code is an adaptation to the problem of kernel design of the sequential-update algorithm from [8]. As with other boosting algorithms, the base-learner, which in our case is in charge of returning a good kernel with respect to the current distribution, is left unspecified. We therefore turn our attention to the algorithmic implementation of the base-learning algorithm for kernels.

3 Learning Base Kernels

The base kernel learner is provided with a training set S and a distribution $D_t$ over pairs of instances from the training set. It is also provided with a set of unlabelled examples $\tilde{S}$. Without any knowledge of the topology of the space of instances a learning algorithm is likely to fail. Therefore, we assume the existence of an initial inner-product over the input space. We assume for now that this initial inner-product is the standard scalar product over vectors in $\mathbb{R}^n$. We later discuss a way to relax the assumption on the form of the inner-product. Equipped with an inner-product, we define the family of base kernels to be the possible outer-products $K_w = w w^T$ between a vector $w \in \mathbb{R}^n$ and itself. Using this definition we get $K_w(x_i, x_j) = (x_i \cdot w)(x_j \cdot w)$. Therefore, the similarity between two instances $x_i$ and $x_j$ is high iff both $x_i$ and $x_j$ are similar (w.r.t. the standard inner-product) to a third vector w. Analogously, if both $x_i$ and $x_j$ seem to be dissimilar to the vector w then they are similar to each other. Despite the restrictive form of the inner-products, this family is still too rich for our setting and we further impose two restrictions on the inner products. First, we assume that w is restricted to a linear combination of vectors from $\tilde{S}$. Second, since scaling of the base kernels is performed by the booster, we constrain the norm of w to be 1. The resulting class of kernels is therefore $C = \{K_w = w w^T \mid w = \sum_{r=1}^{\tilde{m}} \beta_r \tilde{x}_r, \ \|w\| = 1\}$.

Input: A distribution $D_t$. Labelled and unlabelled sets: $S = \{(x_i, y_i)\}_{i=1}^{m}$ ; $\tilde{S} = \{\tilde{x}_i\}_{i=1}^{\tilde{m}}$.
Compute:
• Calculate:
  $A \in \mathbb{R}^{m \times \tilde{m}}$, $A_{i,r} = x_i \cdot \tilde{x}_r$
  $B \in \mathbb{R}^{m \times m}$, $B_{i,j} = D_t(i, j) y_i y_j$
  $\tilde{K} \in \mathbb{R}^{\tilde{m} \times \tilde{m}}$, $\tilde{K}_{r,s} = \tilde{x}_r \cdot \tilde{x}_s$
• Find the generalized eigenvector $v \in \mathbb{R}^{\tilde{m}}$ for the problem $A^T B A v = \lambda \tilde{K} v$ which attains the largest eigenvalue $\lambda$
• Set: $w = (\sum_r v_r \tilde{x}_r) / \|\sum_r v_r \tilde{x}_r\|$.
Return: Kernel operator $K_w = w w^T$.
Figure 2: The base kernel learning algorithm.

In the boosting process we need to choose a specific base-kernel $K_w$ from C. We therefore need to devise a notion of how good a candidate for base kernel is given a labelled set S and a distribution function $D_t$. In this work we use the simplest version suggested by Collins et al. This version can be viewed as a linear approximation of the loss function. We define the score of a kernel $K_w$ w.r.t. the current distribution $D_t$ to be

$\mathrm{Score}(K_w) = \sum_{i,j} D_t(i, j) y_i y_j K_w(x_i, x_j)$. (1)

The higher the value of the score is, the better $K_w$ fits the training data. Note that if $D_t(i, j) = 1/m^2$ (as is $D_0$) then $\mathrm{Score}(K_w)$ is proportional to the alignment since $\|w\| = 1$. Under mild assumptions the score can also provide a lower bound of the loss function. To see this, let c be the derivative of the loss function at margin zero, $c = \mathrm{Loss}'(0)$. If all the training examples $x_i \in S$ lie in a ball of radius $\sqrt{c}$, we get that $\mathrm{Loss}(K_w(x_i, x_j), y_i y_j) \ge 1 - c K_w(x_i, x_j) y_i y_j \ge 0$, and therefore

$\sum_{i,j} D_t(i, j) \mathrm{Loss}(K_w(x_i, x_j), y_i y_j) \ge 1 - c \sum_{i,j} D_t(i, j) K_w(x_i, x_j) y_i y_j$.

Using the explicit form of $K_w$ in the Score function (Eq.
(1)) we get

$\mathrm{Score}(K_w) = \sum_{i,j} D_t(i, j) y_i y_j (w \cdot x_i)(w \cdot x_j)$.

Further developing the above equation using the constraint that $w = \sum_{r=1}^{\tilde{m}} \beta_r \tilde{x}_r$ we get

$\mathrm{Score}(K_w) = \sum_{r,s} \beta_s \beta_r \sum_{i,j} D_t(i, j) y_i y_j (x_i \cdot \tilde{x}_r)(x_j \cdot \tilde{x}_s)$.

To compute the base kernel score efficiently, without an explicit enumeration, we exploit the fact that if the initial distribution $D_0$ is symmetric ($D_0(i, j) = D_0(j, i)$) then all the distributions $D_t$ generated along the run of the boosting process are also symmetric. We now define a matrix $A \in \mathbb{R}^{m \times \tilde{m}}$ where $A_{i,r} = x_i \cdot \tilde{x}_r$ and a symmetric matrix $B \in \mathbb{R}^{m \times m}$ with $B_{i,j} = D_t(i, j) y_i y_j$. Simple algebraic manipulations yield that the score function can be written as the following quadratic form,

$\mathrm{Score}(\beta) = \beta^T (A^T B A) \beta$,

where $\beta$ is an $\tilde{m}$-dimensional column vector. Note that since B is symmetric so is $A^T B A$. Finding a good base kernel is equivalent to finding a vector $\beta$ which maximizes this quadratic form under the norm equality constraint $\|w\|^2 = \|\sum_{r=1}^{\tilde{m}} \beta_r \tilde{x}_r\|^2 = \beta^T \tilde{K} \beta = 1$ where $\tilde{K}_{r,s} = \tilde{x}_r \cdot \tilde{x}_s$. Finding the maximum of $\mathrm{Score}(\beta)$ subject to the norm constraint is a well known maximization problem known as the generalized eigenvector problem (cf. [10]). Applying simple algebraic manipulations it is easy to show that the matrix $A^T B A$ is positive semidefinite. Assuming that the matrix $\tilde{K}$ is invertible, the vector $\beta$ which maximizes the quadratic form is proportional to the eigenvector of $\tilde{K}^{-1} A^T B A$ which is associated with the largest generalized eigenvalue. Denoting this vector by v we get that $w \propto \sum_{r=1}^{\tilde{m}} v_r \tilde{x}_r$. Adding the norm constraint we get that $w = (\sum_{r=1}^{\tilde{m}} v_r \tilde{x}_r) / \|\sum_{r=1}^{\tilde{m}} v_r \tilde{x}_r\|$. The skeleton of the algorithm for finding a base kernel is given in Fig. 2. To conclude the description of the kernel learning algorithm we describe how to extend the algorithm to be employed with general kernel functions.
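For the linear case described so far, the base-kernel learner can be sketched with NumPy alone. This is a minimal sketch under stated assumptions rather than the authors' implementation: the generalized eigenproblem is reduced to an ordinary symmetric one through a Cholesky factor of the Gram matrix of the unlabelled set, and a small `ridge` term is added to that Gram matrix to keep it invertible. The function names, the ridge and the explicit symmetrization steps are all choices made for this example.

```python
import numpy as np

def learn_base_kernel(D, X, y, X_unl, ridge=1e-8):
    """Return a unit vector w in the span of the rows of X_unl.

    w approximately maximizes sum_ij D[i,j] y_i y_j (x_i.w)(x_j.w).
    The ridge term is an assumption of this sketch.
    """
    y = np.asarray(y, dtype=float)
    A = X @ X_unl.T                          # A[i, r] = x_i . x~_r
    B = D * np.outer(y, y)                   # B[i, j] = D(i, j) y_i y_j
    B = 0.5 * (B + B.T)                      # guard against asymmetry
    K_unl = X_unl @ X_unl.T + ridge * np.eye(len(X_unl))
    M = A.T @ B @ A
    # Reduce M v = lambda K_unl v to a symmetric problem via K_unl = L L^T.
    L = np.linalg.cholesky(K_unl)
    C = np.linalg.solve(L, np.linalg.solve(L, M).T)   # L^-1 M L^-T
    C = 0.5 * (C + C.T)
    _, vecs = np.linalg.eigh(C)              # eigenvalues in ascending order
    v = np.linalg.solve(L.T, vecs[:, -1])    # v = L^-T u, top eigenvector
    w = X_unl.T @ v                          # w = sum_r v_r x~_r
    return w / np.linalg.norm(w)

def kernel_score(w, D, X, y):
    """Score of Eq. (1) for the base kernel K_w(x_i, x_j) = (x_i.w)(x_j.w)."""
    p = X @ w
    return float(np.sum(D * np.outer(y, y) * np.outer(p, p)))
```

The base kernel on the training pairs is then `np.outer(X @ w, X @ w)`. When a generalized symmetric eigensolver is available, it could replace the Cholesky reduction.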
Kernelizing the Kernel: As described above, we assumed that the standard scalar-product constitutes the template for the class of base-kernels C. However, since the procedure for choosing a base kernel depends on S and $\tilde{S}$ only through the inner-products matrix A, we can replace the scalar-product itself with a general kernel operator $\kappa : X \times X \to \mathbb{R}$, where $\kappa(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$. With a general kernel function $\kappa$, however, we cannot compute the vector w explicitly. We therefore need to show that the norm of w, and the evaluation of $K_w$ on any two examples, can still be computed efficiently. First note that given the vector v we can compute the norm of w as follows,

$\|w\|^2 = \left(\sum_r v_r \tilde{x}_r\right)^T \left(\sum_s v_s \tilde{x}_s\right) = \sum_{r,s} v_r v_s \kappa(\tilde{x}_r, \tilde{x}_s)$.

Next, given two vectors $x_i$ and $x_j$ the value of their inner-product is

$K_w(x_i, x_j) = \sum_{r,s} v_r v_s \kappa(x_i, \tilde{x}_r) \kappa(x_j, \tilde{x}_s)$.

Therefore, although we cannot compute the vector w explicitly we can still compute its norm and evaluate any of the kernels from the class C.

4 Experiments

Synthetic data: We generated binary-labelled data using as input space the vectors in $\mathbb{R}^{100}$. The labels, in {−1, +1}, were picked uniformly at random. Let y designate the label of a particular example. Then, the first two components of each instance were drawn from a two-dimensional normal distribution, $N(\mu, \Delta \Sigma \Delta^{-1})$, with the following parameters,

$\mu = y \begin{pmatrix} 0.03 \\ 0.03 \end{pmatrix}$, $\Delta = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}$, $\Sigma = \begin{pmatrix} 0.1 & 0 \\ 0 & 0.01 \end{pmatrix}$.

That is, the label of each example determined the mean of the distribution from which the first two components were generated. The rest of the components in the vector (98 altogether) were generated independently using the normal distribution with a zero mean and a standard deviation of 0.05.

Figure 3: Results on a toy data set prior to learning a kernel (first and third from left) and after learning (second and fourth). For each of the two settings we show the first two components of the training data (left) and the matrix of inner products between the train and the test data (right).

We generated 100 training and test sets of size 300 and 200 respectively. We used the standard dot-product as the initial kernel operator. In each experiment we first learned a linear classifier that separates the classes using the Perceptron [11] algorithm. We ran the algorithm for 10 epochs on the training set. After each epoch we evaluated the performance of the current classifier on the test set. We then used the boosting algorithm for kernels with the LogLoss for 30 rounds to build a kernel for each random training set. After learning the kernel we re-trained a classifier with the Perceptron algorithm and recorded the results. A summary of the online performance is given in Fig. 4. The plot on the left-hand-side of the figure shows the instantaneous error (achieved during the run of the algorithm). Clearly, the Perceptron algorithm with the learned kernel converges much faster than with the original kernel. The middle plot shows the test error after each epoch. The plot on the right shows the test error on a noisy test set in which we added Gaussian noise of zero mean and a standard deviation of 0.03 to the first two features. In all plots, each bar indicates a 95% confidence level. It is clear from the figure that the original kernel is much slower to converge than the learned kernel. Furthermore, though the kernel learning algorithm was not exposed to the test set noise, the learned kernel better reflects the structure of the feature space, which makes the learned kernel more robust to noise. Fig. 3 further illustrates the benefits of using a boutique kernel.
The first and third plots from the left correspond to results obtained using the original kernel and the second and fourth plots show results using the learned kernel. The left plots show the empirical distribution of the two informative components on the test data. For the learned kernel we took each input vector and projected it onto the two eigenvectors of the learned kernel operator matrix that correspond to the two largest eigenvalues. Note that the distribution after the projection is bimodal and well separated along the first eigen direction (x-axis) and shows rather little deviation along the second eigen direction (y-axis). This indicates that the kernel learning algorithm indeed found the most informative projection for separating the labelled data with large margin. It is worth noting that, in this particular setting, any algorithm which chooses a single feature at a time is prone to failure since both the first and second features are mandatory for correctly classifying the data. The two plots on the right hand side of Fig. 3 use a gray level color-map to designate the value of the inner-product between each pair of instances, one from the training set (y-axis) and the other from the test set. The examples were ordered such that the first group consists of the positively labelled instances while the second group consists of the negatively labelled instances. Since most of the features are non-relevant the original inner-products are noisy and do not exhibit any structure. In contrast, the inner-products using the learned kernel yield a 2 × 2 block matrix, indicating that the inner-products between instances sharing the same label obtain large positive values.
Similarly, for instances of opposite labels the inner products are large and negative. The form of the inner-products matrix of the learned kernel indicates that the learning problem itself becomes much easier. Indeed, the Perceptron algorithm with the standard kernel required around 94 training examples on average before converging to a hyperplane which perfectly separates the training data, while the Perceptron algorithm with the learned kernel required a single example to reach a perfect separation on all 100 random training sets.

Figure 4: The online training error (left) and test error (middle) on clean synthetic data using a standard kernel and a learned kernel. Right: the online test error for the two kernels on a noisy test set. (Left panel: averaged cumulative error % versus round; middle and right panels: test error % versus epochs; curves: Regular Kernel and Learned Kernel.)

USPS dataset: The USPS (US Postal Service) dataset is known as a challenging classification problem in which the training set and the test set were collected in a different manner. The USPS dataset contains 7,291 training examples and 2,007 test examples. Each example is represented as a 16 × 16 matrix where each entry in the matrix is a pixel that can take values in {0, . . . , 255}. Each example is associated with a label in {0, . . . , 9} which is the digit content of the image. Since the kernel learning algorithm is designed for binary problems, we broke the 10-class problem into 45 binary problems by comparing all pairs of classes. The interesting question of how to learn kernels for multiclass problems is beyond the scope of this short paper. We therefore restrict ourselves to the binary error results for the 45 binary problems described above.
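The all-pairs decomposition described above is easy to reproduce. The sketch below assumes the data is held in a NumPy array `X` with one row per example and an integer label vector; the function name and the convention that the first class of each pair is mapped to +1 are choices made for this example.

```python
from itertools import combinations

import numpy as np

def one_vs_one_problems(X, labels, classes=range(10)):
    """Yield ((a, b), X_ab, y_ab) for every unordered pair of classes.

    Examples of class a receive the label +1 and examples of class b
    receive -1 (a convention chosen for this sketch).
    """
    labels = np.asarray(labels)
    for a, b in combinations(classes, 2):
        mask = (labels == a) | (labels == b)
        y_ab = np.where(labels[mask] == a, 1, -1)
        yield (a, b), X[mask], y_ab
```

With ten classes the generator yields 10 · 9 / 2 = 45 binary problems, matching the count given in the text.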
For the original kernel we chose an RBF kernel with σ = 1, which is the value employed in the experiments reported in [12]. We used the kernelized version of the kernel design algorithm to learn a different kernel operator for each of the binary problems. We then used a variant of the Perceptron [11], both with the original RBF kernel and with the learned kernels. One of the motivations for using the Perceptron is its simplicity, which can underscore differences in the kernels. We ran the kernel learning algorithm with LogLoss and ExpLoss, using both the training set and the test set as $\tilde{S}$. Thus, we obtained four different sets of kernels where each set consists of 45 kernels. By examining the training loss, we set the number of rounds of boosting to be 30 for the LogLoss and 50 for the ExpLoss, when using the training set. When using the test set, the number of rounds of boosting was set to 100 for both losses. Since the algorithm exhibits a slower rate of convergence with the test data, we chose a higher value without attempting to optimize the actual value. The left plot of Fig. 5 is a scatter plot comparing the test error of each of the binary classifiers when trained with the original RBF kernel versus the performance achieved on the same binary problem with a learned kernel. The kernels were built using boosting with the LogLoss, and $\tilde{S}$ was the training data. In almost all of the 45 binary classification problems, the learned kernels yielded lower error rates when combined with the Perceptron algorithm. The right plot of Fig. 5 compares two learned kernels: the first was built using the training instances as the templates constituting $\tilde{S}$ while the second used the test instances. Although the difference between the two versions is not as significant as the difference on the left plot, we still achieve an overall improvement in about 25% of the binary problems by using the test instances.
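To make the evaluation protocol concrete, here is a minimal sketch of a Perceptron run on a precomputed kernel matrix, together with an RBF kernel. Two assumptions should be kept in mind: the text only says that a variant of the Perceptron was used, so the plain kernel Perceptron below is a stand-in, and the RBF parameterization exp(−‖x − z‖² / (2σ²)) is one common convention that the text does not spell out.

```python
import numpy as np

def rbf_kernel(X1, X2, sigma=1.0):
    """RBF kernel matrix, assuming the form exp(-||x - z||^2 / (2 sigma^2))."""
    sq = (np.sum(X1 ** 2, axis=1)[:, None]
          + np.sum(X2 ** 2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def kernel_perceptron(K, y, epochs=10):
    """Plain kernel Perceptron on a precomputed train-by-train kernel K."""
    y = np.asarray(y, dtype=float)
    alpha = np.zeros(len(y))                 # number of mistakes per example
    for _ in range(epochs):
        mistakes = 0
        for i in range(len(y)):
            score = np.dot(alpha * y, K[:, i])
            if y[i] * score <= 0:            # mistake (or zero margin)
                alpha[i] += 1.0
                mistakes += 1
        if mistakes == 0:                    # training set is separated
            break
    return alpha

def predict(alpha, y_train, K_test_train):
    """Predict labels from a test-by-train kernel matrix."""
    scores = K_test_train @ (alpha * np.asarray(y_train, dtype=float))
    return np.where(scores >= 0, 1, -1)
```

Swapping the RBF matrix for a learned kernel matrix leaves the Perceptron code unchanged, which is what makes it a convenient probe for comparing kernels.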
Figure 5: Left: a scatter plot comparing the error rate of 45 binary classifiers trained using an RBF kernel (x-axis) and a learned kernel with training instances. Right: a similar scatter plot for a learned kernel only constructed from training instances (x-axis) and test instances.

5 Discussion

In this paper we showed how to use the boosting framework to design kernels. Our approach is especially appealing in transductive learning tasks where the test data distribution is different from the distribution of the training data. For example, in speech recognition tasks the training data is often clean and well recorded while the test data often passes through a noisy channel that distorts the signal. An interesting and challenging question that stems from this research is how to extend the framework to accommodate more complex decision tasks such as multiclass and regression problems. Finally, we would like to note that alternative approaches to the kernel design problem have been devised in parallel and independently. See [13, 14] for further details.

Acknowledgements: Special thanks to Cyril Goutte and to John Shawe-Taylor for pointing out the connection to the generalized eigenvector problem. Thanks also to the anonymous reviewers for constructive comments.

References
[1] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.
[2] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.
[3] Huma Lodhi, John Shawe-Taylor, Nello Cristianini, and Christopher J. C. H. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444, 2002.
[4] C. Leslie, E. Eskin, and W. Stafford Noble. The spectrum kernel: A string kernel for SVM protein classification. In Proceedings of the Pacific Symposium on Biocomputing, 2002.
[5] Nello Cristianini, Andre Elisseeff, John Shawe-Taylor, and Jaz Kandola. On kernel target alignment. In Advances in Neural Information Processing Systems 14, 2001.
[6] G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. Jordan. Learning the kernel matrix with semi-definite programming. In Proc. of the 19th Intl. Conf. on Machine Learning, 2002.
[7] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337–374, April 2000.
[8] Michael Collins, Robert E. Schapire, and Yoram Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 47(2/3):253–285, 2002.
[9] Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean. Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers. MIT Press, 1999.
[10] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1985.
[11] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958.
[12] B. Schölkopf, S. Mika, C.J.C. Burges, P. Knirsch, K. Müller, G. Rätsch, and A.J. Smola. Input space vs. feature space in kernel-based methods. IEEE Trans. on NN, 10(5):1000–1017, 1999.
[13] O. Bousquet and D.J.L. Herrmann. On the complexity of learning the kernel matrix. NIPS, 2002.
[14] C.S. Ong, A.J. Smola, and R.C. Williamson. Superkernels. NIPS, 2002.
4 0.20763263 98 nips-2002-Going Metric: Denoising Pairwise Data
Author: Volker Roth, Julian Laub, Klaus-Robert Müller, Joachim M. Buhmann
Abstract: Pairwise data in empirical sciences typically violate metricity, either due to noise or due to fallible estimates, and therefore are hard to analyze by conventional machine learning technology. In this paper we therefore study ways to work around this problem. First, we present an alternative embedding to multi-dimensional scaling (MDS) that allows us to apply a variety of classical machine learning and signal processing algorithms. The class of pairwise grouping algorithms which share the shift-invariance property is statistically invariant under this embedding procedure, leading to identical assignments of objects to clusters. Based on this new vectorial representation, denoising methods are applied in a second step. Both steps provide a theoretically well controlled setup to translate from pairwise data to the respective denoised metric representation. We demonstrate the practical usefulness of our theoretical reasoning by discovering structure in protein sequence data bases, visibly improving performance upon existing automatic methods.
5 0.19336699 156 nips-2002-On the Complexity of Learning the Kernel Matrix
Author: Olivier Bousquet, Daniel Herrmann
Abstract: We investigate data-based procedures for selecting the kernel when learning with Support Vector Machines. We provide generalization error bounds by estimating the Rademacher complexities of the corresponding function classes. In particular we obtain a complexity bound for function classes induced by kernels with given eigenvectors, i.e., we allow the spectrum to vary and keep the eigenvectors fixed. This bound is only a logarithmic factor bigger than the complexity of the function class induced by a single kernel. However, optimizing the margin over such classes leads to overfitting. We thus propose a suitable way of constraining the class. We use an efficient algorithm to solve the resulting optimization problem, present preliminary experimental results, and compare them to an alignment-based approach.
6 0.18117872 99 nips-2002-Graph-Driven Feature Extraction From Microarray Data Using Diffusion Kernels and Kernel CCA
7 0.16402854 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
8 0.16193798 85 nips-2002-Fast Kernels for String and Tree Matching
9 0.16160123 119 nips-2002-Kernel Dependency Estimation
10 0.15237203 164 nips-2002-Prediction of Protein Topologies Using Generalized IOHMMs and RNNs
11 0.14815173 53 nips-2002-Clustering with the Fisher Score
12 0.13530904 187 nips-2002-Spikernels: Embedding Spiking Neurons in Inner-Product Spaces
13 0.13355307 113 nips-2002-Information Diffusion Kernels
14 0.13240947 106 nips-2002-Hyperkernels
15 0.131008 125 nips-2002-Learning Semantic Similarity
16 0.12941515 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
17 0.11923984 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
18 0.11610119 196 nips-2002-The RA Scanner: Prediction of Rheumatoid Joint Inflammation Based on Laser Imaging
19 0.11469825 32 nips-2002-Approximate Inference and Protein-Folding
20 0.10704565 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine
topicId topicWeight
[(0, -0.275), (1, -0.186), (2, 0.187), (3, -0.15), (4, -0.124), (5, -0.02), (6, 0.169), (7, 0.088), (8, 0.035), (9, 0.004), (10, 0.003), (11, 0.221), (12, -0.069), (13, 0.189), (14, -0.065), (15, -0.075), (16, 0.049), (17, 0.192), (18, -0.032), (19, -0.025), (20, -0.002), (21, -0.108), (22, -0.142), (23, 0.025), (24, -0.096), (25, 0.024), (26, 0.069), (27, 0.06), (28, 0.074), (29, -0.091), (30, 0.115), (31, 0.057), (32, -0.109), (33, -0.069), (34, -0.016), (35, 0.022), (36, -0.011), (37, 0.041), (38, 0.017), (39, 0.037), (40, 0.01), (41, 0.014), (42, -0.051), (43, 0.014), (44, 0.031), (45, 0.004), (46, 0.093), (47, -0.01), (48, 0.076), (49, 0.005)]
simIndex simValue paperId paperTitle
same-paper 1 0.94963294 145 nips-2002-Mismatch String Kernels for SVM Protein Classification
Author: Eleazar Eskin, Jason Weston, William S. Noble, Christina S. Leslie
Abstract: We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the protein classification problem. These kernels measure sequence similarity based on shared occurrences of k-length subsequences, counted with up to m mismatches, and do not rely on any generative model for the positive training sequences. We compute the kernels efficiently using a mismatch tree data structure and report experiments on a benchmark SCOP dataset, where we show that the mismatch kernel used with an SVM classifier performs as well as the Fisher kernel, the most successful method for remote homology detection, while achieving considerable computational savings.
2 0.71256357 191 nips-2002-String Kernels, Fisher Kernels and Finite State Automata
Author: Craig Saunders, Alexei Vinokourov, John S. Shawe-taylor
Abstract: In this paper we show how the generation of documents can be thought of as a k-stage Markov process, which leads to a Fisher kernel from which the n-gram and string kernels can be re-constructed. The Fisher kernel view gives a more flexible insight into the string kernel and suggests how it can be parametrised in a way that reflects the statistics of the training corpus. Furthermore, the probabilistic modelling approach suggests extending the Markov process to consider sub-sequences of varying length, rather than the standard fixed-length approach used in the string kernel. We give a procedure for determining which sub-sequences are informative features and hence generate a Finite State Machine model, which can again be used to obtain a Fisher kernel. By adjusting the parametrisation we can also influence the weighting received by the features. In this way we are able to obtain a logarithmic weighting in a Fisher kernel. Finally, experiments are reported comparing the different kernels using the standard Bag of Words kernel as a baseline.
3 0.61832601 85 nips-2002-Fast Kernels for String and Tree Matching
Author: Alex J. Smola, S.v.n. Vishwanathan
Abstract: In this paper we present a new algorithm suitable for matching discrete objects such as strings and trees in linear time, thus obviating dynamic programming with quadratic time complexity. Furthermore, prediction cost in many cases can be reduced to linear cost in the length of the sequence to be classified, regardless of the number of support vectors. This improvement on the currently available algorithms makes string kernels a viable alternative for the practitioner.
4 0.59217966 164 nips-2002-Prediction of Protein Topologies Using Generalized IOHMMs and RNNs
Author: Gianluca Pollastri, Pierre Baldi, Alessandro Vullo, Paolo Frasconi
Abstract: We develop and test new machine learning methods for the prediction of topological representations of protein structures in the form of coarse- or fine-grained contact or distance maps that are translation and rotation invariant. The methods are based on generalized input-output hidden Markov models (GIOHMMs) and generalized recursive neural networks (GRNNs). The methods are used to predict topology directly in the fine-grained case and, in the coarse-grained case, indirectly by first learning how to score candidate graphs and then using the scoring function to search the space of possible configurations. Computer simulations show that the predictors achieve state-of-the-art performance.

1 Introduction: Protein Topology Prediction

Predicting the 3D structure of protein chains from the linear sequence of amino acids is a fundamental open problem in computational molecular biology [1]. Any approach to the problem must deal with the basic fact that protein structures are translation and rotation invariant. To address this invariance, we have proposed a machine learning approach to protein structure prediction [4] based on the prediction of topological representations of proteins, in the form of contact or distance maps. The contact or distance map is a 2D representation of neighborhood relationships consisting of an adjacency matrix at some distance cutoff (typically in the range of 6 to 12 Å), or a matrix of pairwise Euclidean distances. Fine-grained maps are derived at the amino acid or even atomic level. Coarse maps are obtained by looking at secondary structure elements, such as helices, and the distance between their centers of gravity or, as in the simulations below, the minimal distances between their Cα atoms.
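The two map types just described can be illustrated with a short NumPy sketch. It assumes one 3D coordinate per residue (for instance the Cα position) and uses an 8 Å cutoff, which lies inside the 6 to 12 Å range mentioned above; the function names and the choice of a non-strict inequality are assumptions made for this example.

```python
import numpy as np

def distance_map(coords):
    """Matrix of pairwise Euclidean distances between 3D points."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

def contact_map(coords, cutoff=8.0):
    """Binary adjacency matrix: 1 where two points lie within the cutoff."""
    return (distance_map(coords) <= cutoff).astype(int)
```

Because both maps depend only on pairwise distances, rotating or translating the whole structure leaves them unchanged, which is the invariance the text relies on.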
Reasonable methods for reconstructing 3D coordinates from contact/distance maps have been developed in the NMR literature and elsewhere [14] using distance geometry and stochastic optimization techniques. Thus the main focus here is on the more difficult task of contact map prediction.

Figure 1: Bayesian network for bidirectional IOHMMs consisting of input units, output units, and both forward and backward Markov chains of hidden states.

Various algorithms for the prediction of contact maps have been developed, in particular using feedforward neural networks [6]. The best contact map predictor in the literature and at the last CASP prediction experiment reports an average precision [True Positives/(True Positives + False Positives)] of 21% for distant contacts, i.e. with a linear distance of 8 amino acids or more [6] for fine-grained amino acid maps. While this result is encouraging and well above chance level by a factor greater than 6, it is still far from providing sufficient accuracy for reliable 3D structure prediction. A key issue in this area is the amount of noise that can be tolerated in a contact map prediction without compromising the 3D-reconstruction step. While systematic tests in this area have not yet been published, preliminary results appear to indicate that recovery of as little as half of the distant contacts may suffice for proper reconstruction, at least for proteins up to 150 amino acids long (Rita Casadio and Piero Fariselli, private communication and oral presentation during CASP4 [10]).
It is important to realize that the input to a fine-grained contact map predictor need not be confined to the sequence of amino acids only, but may also include evolutionary information in the form of profiles derived by multiple alignment of homologue proteins, or structural feature information, such as secondary structure (alpha helices, beta strands, and coils), or solvent accessibility (surface/buried), derived by specialized predictors [12, 13]. In our approach, we use different GIOHMM and GRNN strategies to predict both structural features and contact maps.

2 GIOHMM Architectures

Loosely speaking, GIOHMMs are Bayesian networks with input, hidden, and output units that can be used to process complex data structures such as sequences, images, trees, chemical compounds and so forth, built on work in, for instance, [5, 3, 7, 2, 11]. In general, the connectivity of the graphs associated with the hidden units matches the structure of the data being processed. Often multiple copies of the same hidden graph, but with different edge orientations, are used in the hidden layers to allow direct propagation of information in all relevant directions.

Figure 2: 2D GIOHMM Bayesian network for processing two-dimensional objects such as contact maps, with nodes regularly arranged in one input plane, one output plane, and four hidden planes (NE, NW, SW, SE). In each hidden plane, nodes are arranged on a square lattice, and all edges are oriented towards the corresponding cardinal corner. Additional directed edges run vertically in columns from the input plane to each hidden plane, and from each hidden plane to the output plane.

To illustrate the general idea, a first example of GIOHMM is provided by the bidirectional IOHMMs (Figure 1) introduced in [2] to process sequences and predict protein structural features, such as secondary structure.
Unlike standard HMMs or IOHMMs used, for instance, in speech recognition, this architecture is based on two hidden Markov chains running in opposite directions to leverage the fact that biological sequences are spatial objects rather than temporal sequences. Bidirectional IOHMMs have been used to derive a suite of structural feature predictors [12, 13, 4] available through http://promoter.ics.uci.edu/BRNN-PRED/. These predictors have accuracy rates in the 75-80% range on a per amino acid basis.

2.1 Direct Prediction of Topology

To predict contact maps, we use a 2D generalization of the previous 1D Bayesian network. The basic version of this architecture (Figure 2) contains 6 layers of units: input, output, and four hidden layers, one for each cardinal corner. Within each column indexed by i and j, connections run from the input to the four hidden units, and from the four hidden units to the output unit. In addition, the hidden units in each hidden layer are arranged on a square or triangular lattice, with all the edges oriented towards the corresponding cardinal corner. Thus the parameters of this two-dimensional GIOHMM, in the square lattice case, are the conditional probability distributions:

$P(O_{i,j} \mid I_{i,j}, H^{NE}_{i,j}, H^{NW}_{i,j}, H^{SW}_{i,j}, H^{SE}_{i,j})$
$P(H^{NE}_{i,j} \mid I_{i,j}, H^{NE}_{i-1,j}, H^{NE}_{i,j-1})$
$P(H^{NW}_{i,j} \mid I_{i,j}, H^{NW}_{i+1,j}, H^{NW}_{i,j-1})$
$P(H^{SW}_{i,j} \mid I_{i,j}, H^{SW}_{i+1,j}, H^{SW}_{i,j+1})$
$P(H^{SE}_{i,j} \mid I_{i,j}, H^{SE}_{i-1,j}, H^{SE}_{i,j+1})$ (1)

In a contact map prediction at the amino acid level, for instance, the (i, j) output represents the probability of whether amino acids i and j are in contact or not. This prediction depends directly on the (i, j) input and the four hidden units in the same column, associated with omni-directional contextual propagation in the hidden planes.
In the simulations reported below, we use a more elaborate input consisting of a 20 × 20 probability matrix over amino acid pairs derived from a multiple alignment of the given protein sequence and its homologues, as well as the structural features of the corresponding amino acids, including their secondary structure classification and their relative exposure to the solvent, derived from our corresponding predictors.

It should be clear how GIOHMM ideas can be generalized to other data structures and problems in many ways. In the case of 3D data, for instance, a standard GIOHMM would have an input cube, an output cube, and up to 8 cubes of hidden units, one for each corner, with connections inside each hidden cube oriented towards the corresponding corner. In the case of data with an underlying tree structure, the hidden layers would correspond to copies of the same tree with different orientations, and so forth. Thus a fundamental advantage of GIOHMMs is that they can process a wide range of data structures of variable sizes and dimensions.

2.2 Indirect Prediction of Topology

Although GIOHMMs allow flexible integration of contextual information over ranges that often exceed what can be achieved, for instance, with fixed-input neural networks, the models described above still suffer from the fact that the connections remain local, and therefore long-ranged propagation of information during learning remains difficult. Introducing large numbers of long-ranged connections is computationally intractable but in principle not necessary, since the number of contacts in proteins is known to grow linearly with the length of the protein, and hence connectivity is inherently sparse. The difficulty, of course, is that the location of the long-ranged contacts is not known.
To address this problem, we have also developed a complementary GIOHMM approach, described in Figure 3, where a candidate graph structure is proposed in the hidden layers of the GIOHMM, with the two different orientations naturally associated with a protein sequence. Thus the hidden graphs change with each protein. In principle the output ought to be a single unit (Figure 3b) which directly computes a global score for the candidate structure presented in the hidden layer. In order to cope with long-ranged dependencies, however, it is preferable to compute a set of local scores (Figure 3c), one for each vertex, and combine the local scores into a global score by averaging.

Figure 3: Indirect prediction of contact maps. (a) Target contact graph to be predicted. (b) GIOHMM with two hidden layers: the two hidden layers correspond to two copies of the same candidate graph oriented in opposite directions from one end of the protein to the other end. The single output $O$ is the global score of how well the candidate graph approximates the true contact map. (c) Similar to (b) but with a local score $O(v)$ at each vertex. The local scores can be averaged to produce a global score. In (b) and (c), $I(v)$ represents the input for vertex $v$, and $H^F(v)$ and $H^B(v)$ are the corresponding hidden variables.

More specifically, consider a true topology represented by the undirected contact graph $G^* = (V, E^*)$, and a candidate undirected prediction graph $G = (V, E)$. A global measure of how well $E$ approximates $E^*$ is provided by the information-retrieval $F_1$ score, defined by the normalized edge overlap $F_1 = 2|E \cap E^*|/(|E| + |E^*|) = 2PR/(P + R)$, where $P = |E \cap E^*|/|E|$ is the precision (or specificity) and $R = |E \cap E^*|/|E^*|$ is the recall (or sensitivity) measure. Obviously, $0 \le F_1 \le 1$, and $F_1 = 1$ if and only if $E = E^*$. The scoring function $F_1$ is monotone in the sense that if $|E| = |E'|$ then $F_1(E) < F_1(E')$ if and only if $|E \cap E^*| < |E' \cap E^*|$. Furthermore, if $E' = E \cup \{e\}$ where $e$ is an edge in $E^*$ but not in $E$, then $F_1(E') > F_1(E)$. Monotonicity is important to guide the search in the space of possible topologies. It is easy to check that a simple search algorithm based on $F_1$ takes on the order of $O(|V|^3)$ steps to find $E^*$, basically by trying all possible edges one after the other. The problem then is to learn $F_1$, or rather a good approximation to $F_1$.

To approximate $F_1$, we first consider a similar local measure $F_v$ by considering the set $E_v$ of edges adjacent to vertex $v$ and $F_v = 2|E_v \cap E_v^*|/(|E_v| + |E_v^*|)$, with the global average $\bar{F} = \sum_v F_v/|V|$. If $n$ and $n^*$ are the average degrees of $G$ and $G^*$, it can be shown that:

$$
F_1 = \frac{1}{|V|} \sum_v \frac{2|E_v \cap E_v^*|}{n + n^*}
\qquad \text{and} \qquad
\bar{F} = \frac{1}{|V|} \sum_v \frac{2|E_v \cap E_v^*|}{n + \epsilon_v + n^* + \epsilon_v^*}
\tag{2}
$$

where $n + \epsilon_v$ (resp. $n^* + \epsilon_v^*$) is the degree of $v$ in $G$ (resp. in $G^*$). In particular, if $G$ and $G^*$ are regular graphs, then $F_1(E) = \bar{F}(E)$, so that $\bar{F}$ is a good approximation to $F_1$. In the contact map regime where the number of contacts grows linearly with the length of the sequence, we should have in general $|E| \approx |E^*| \approx (1 + \alpha)|V|$, so that each node on average has $n = n^* = 2(1 + \alpha)$ edges. The value of $\alpha$ depends of course on the neighborhood cutoff.

As in reinforcement learning, to learn the scoring function one is faced with the problem of generating good training sets in a high dimensional space, where the states are the topologies (graphs), and the policies are algorithms for adding a single edge to a given graph. In the simulations we adopt several different strategies including static and dynamic generation. Within dynamic generation we use three exploration strategies: random exploration (successor graph chosen at random), pure exploitation (successor graph maximizes the current scoring function), and semi-uniform exploitation to find a balance between exploration and exploitation [with probability $\epsilon$ (resp.
$1 - \epsilon$) we choose random exploration (resp. pure exploitation)].

3 GRNN Architectures

Inference and learning in the protein GIOHMMs we have described is computationally intensive due to the large number of undirected loops they contain. This problem can be addressed using a neural network reparameterization assuming that: (a) all the nodes in the graphs are associated with a deterministic vector (note that in the case of the output nodes this vector can represent a probability distribution so that the overall model remains probabilistic); (b) each vector is a deterministic function of its parents; (c) each function is parameterized using a neural network (or some other class of approximators); and (d) weight-sharing or stationarity is used between similar neural networks in the model. For example, in the 2D GIOHMM contact map predictor, we can use a total of 5 neural networks to recursively compute the four hidden states and the output in each column in the form:

$$
\begin{aligned}
O_{i,j} &= \mathcal{N}_{O}(I_{i,j}, H^{NW}_{i,j}, H^{NE}_{i,j}, H^{SW}_{i,j}, H^{SE}_{i,j}) \\
H^{NE}_{i,j} &= \mathcal{N}_{NE}(I_{i,j}, H^{NE}_{i-1,j}, H^{NE}_{i,j-1}) \\
H^{NW}_{i,j} &= \mathcal{N}_{NW}(I_{i,j}, H^{NW}_{i+1,j}, H^{NW}_{i,j-1}) \\
H^{SW}_{i,j} &= \mathcal{N}_{SW}(I_{i,j}, H^{SW}_{i+1,j}, H^{SW}_{i,j+1}) \\
H^{SE}_{i,j} &= \mathcal{N}_{SE}(I_{i,j}, H^{SE}_{i-1,j}, H^{SE}_{i,j+1})
\end{aligned}
\tag{3}
$$

In the NE plane, for instance, the boundary conditions are set to $H^{NE}_{i,j} = 0$ for $i = 0$ or $j = 0$. The activity vector associated with the hidden unit $H^{NE}_{i,j}$ depends on the local input $I_{i,j}$ and the activity vectors of the units $H^{NE}_{i-1,j}$ and $H^{NE}_{i,j-1}$. Activity in the NE plane can be propagated row by row, West to East, and from the first row to the last (from South to North), or column by column, South to North, and from the first column to the last. These GRNN architectures can be trained by gradient descent by unfolding the structures in space, leveraging the acyclic nature of the underlying GIOHMMs.

4 Data

Many data sets are available or can be constructed for training and testing purposes, as described in the references.
The data sets used in the present simulations are extracted from the publicly available Protein Data Bank (PDB) and then redundancy reduced, or from the non-homologous subset of PDB Select (ftp://ftp.emblheidelberg.de/pub/databases/). In addition, we typically exclude structures with poor resolution (worse than 2.5-3 Å), sequences containing fewer than 30 amino acids, and structures containing multiple sequences or sequences with chain breaks. For coarse contact maps, we use the DSSP program [9] (CMBI version) to assign secondary structures, and we also remove sequences for which DSSP crashes. The results we report for fine-grained contact maps are derived using 424 proteins with lengths in the 30-200 range for training and an additional non-homologous set of 48 proteins in the same length range for testing. For the coarse contact map, we use a set of 587 proteins of length less than 300. Because the average length of a secondary structure element is slightly above 7, the size of a coarse map is roughly 2% of the size of the corresponding amino acid map (about 1/7² of the entries).

5 Simulation Results and Conclusions

We have trained several 2D GIOHMM/GRNN models on the direct prediction of fine-grained contact maps. Training of a single model typically takes on the order of a week on a fast workstation. A sample of validation results is reported in Table 1 for four different distance cutoffs. Overall percentages of correctly predicted contacts and non-contacts at all linear distances, as well as precision results for distant contacts (|i − j| ≥ 8), are reported for a single GIOHMM/GRNN model.

Table 1: Direct prediction of amino acid contact maps. Column 1: four distance cutoffs. Columns 2, 3, and 4: overall percentages of amino acids correctly classified as contacts, non-contacts, and in total. Column 5: precision percentage for distant contacts (|i − j| ≥ 8) with a threshold of 0.5. Single model results except for the last line, which corresponds to an ensemble of 5 models.

Cutoff            Contact   Non-Contact   Total   Precision (P)
6 Å               .714      .998          .985    .594
8 Å               .638      .998          .970    .670
10 Å              .512      .993          .931    .557
12 Å              .433      .987          .878    .549
12 Å (ensemble)   .445      .990          .883    .717

The model has k = 14 hidden units in the hidden and output layers of the four hidden networks, as well as in the hidden layer of the output network. In the last row, we also report as an example the results obtained at 12 Å by an ensemble of 5 networks with k = 11, 12, 13, 14 and 15. Note that precision for distant contacts exceeds all previously reported results and is well above 50%.

For the prediction of coarse-grained contact maps, we use the indirect GIOHMM/GRNN strategy and compare different exploration/exploitation strategies: random exploration, pure exploitation, and their convex combination (semi-uniform exploitation). In the semi-uniform case we set the probability of random uniform exploration to ε = 0.4. In addition, we also try a fourth hybrid strategy in which the search proceeds greedily (i.e. the best successor is chosen at each step, as in pure exploitation), but the network is trained by randomly sub-sampling the successors of the current state. Eight numerical features encode the input label of each node: one-hot encoding of secondary structure classes; normalized linear distances from the N to C terminus; average, maximum and minimum hydrophobic character of the segment (based on the Kyte-Doolittle scale with a moving window of length 7). A sample of results obtained with 5-fold cross-validation is shown in Table 2. Hidden state vectors have dimension k = 5 with no hidden layers. For each strategy we measure performance by means of several indices: micro- and macro-averaged precision (mP, MP), recall (mR, MR) and F1 measure (mF1, MF1).
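The indices named above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' evaluation code: each contact map is represented as a set of (i, j) pairs, the helper names `prf`, `micro_prf`, and `macro_prf` are hypothetical, ratios with an empty denominator are set to 0 by convention, and the macro-averaged F1 is assumed to be the mean of the per-protein F1 values. Micro-averaging pools all pairs across proteins before forming the ratios, while macro-averaging forms the ratios per protein and then averages them.

```python
def prf(pred, true):
    """Precision, recall and F1 of one predicted edge set against the true one."""
    overlap = len(pred & true)
    p = overlap / len(pred) if pred else 0.0
    r = overlap / len(true) if true else 0.0
    total = len(pred) + len(true)
    # Normalized edge overlap 2|E n E*| / (|E| + |E*|), equal to 2PR / (P + R).
    f1 = 2 * overlap / total if total else 0.0
    return p, r, f1


def micro_prf(pairs):
    """Pool the counts over all (pred, true) pairs, then compute the ratios once."""
    overlap = sum(len(pred & true) for pred, true in pairs)
    n_pred = sum(len(pred) for pred, _ in pairs)
    n_true = sum(len(true) for _, true in pairs)
    p = overlap / n_pred if n_pred else 0.0
    r = overlap / n_true if n_true else 0.0
    total = n_pred + n_true
    f1 = 2 * overlap / total if total else 0.0
    return p, r, f1


def macro_prf(pairs):
    """Compute the ratios per protein, then average them over proteins."""
    scores = [prf(pred, true) for pred, true in pairs]
    n = len(scores)
    return tuple(sum(s[k] for s in scores) / n for k in range(3))


if __name__ == "__main__":
    # Two toy proteins with hypothetical predicted and true contacts.
    protein_a = ({(0, 1), (1, 2)}, {(0, 1)})
    protein_b = ({(0, 2)}, {(0, 2), (1, 3), (2, 4)})
    print("micro:", micro_prf([protein_a, protein_b]))
    print("macro:", macro_prf([protein_a, protein_b]))
```

On the toy data the two averages differ because the proteins have different numbers of predicted and true contacts, which is why both are worth reporting.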
Micro-averages are derived based on each pair of secondary structure elements in each protein, whereas macro-averages are obtained on a per-protein basis, by first computing precision and recall for each protein, and then averaging over the set of all proteins. In addition, we also measure the micro and macro averages for specificity in the sense of percentage of correct prediction for non-contacts (mP(nc), MP(nc)). Note the tradeoffs between precision and recall across the training methods, with the hybrid method achieving the best F1 results.

Table 2: Indirect prediction of coarse contact maps with dynamic sampling.

Strategy             mP     mP(nc)   mR     mF1    MP     MP(nc)   MR     MF1
Random exploration   .715   .769     .418   .518   .767   .709     .469   .574
Semi-uniform         .454   .787     .631   .526   .507   .767     .702   .588
Pure exploitation    .431   .806     .726   .539   .481   .793     .787   .596
Hybrid               .417   .834     .790   .546   .474   .821     .843   .607

We have presented two approaches, based on a very general IOHMM/RNN framework, that achieve state-of-the-art performance in the prediction of protein contact maps at fine and coarse-grained levels of resolution. In principle both methods can be applied to both resolution levels, although the indirect prediction is computationally too demanding for fine-grained prediction of large proteins. Several extensions are currently under development, including the integration of these methods into complete 3D structure predictors. While these systems require long training periods, once trained they can rapidly sift through large proteomic data sets.

Acknowledgments

The work of PB and GP is supported by a Laurel Wilkening Faculty Innovation award and awards from NIH, BREP, Sun Microsystems, and the California Institute for Telecommunications and Information Technology. The work of PF and AV is partially supported by a MURST grant.

References

[1] D. Baker and A. Sali. Protein structure prediction and structural genomics. Science, 294:93–96, 2001.
[2] P. Baldi, S. Brunak, P. Frasconi, G. Soda, and G. Pollastri. Exploiting the past and the future in protein secondary structure prediction. Bioinformatics, 15(11):937–946, 1999.
[3] P. Baldi and Y. Chauvin. Hybrid modeling, HMM/NN architectures, and protein applications. Neural Computation, 8(7):1541–1565, 1996.
[4] P. Baldi and G. Pollastri. Machine learning structural and functional proteomics. IEEE Intelligent Systems, Special Issue on Intelligent Systems in Biology, 17(2), 2002.
[5] Y. Bengio and P. Frasconi. Input-output HMM's for sequence processing. IEEE Trans. on Neural Networks, 7:1231–1249, 1996.
[6] P. Fariselli, O. Olmea, A. Valencia, and R. Casadio. Prediction of contact maps with neural networks and correlated mutations. Protein Engineering, 14:835–843, 2001.
[7] P. Frasconi, M. Gori, and A. Sperduti. A general framework for adaptive processing of data structures. IEEE Trans. on Neural Networks, 9:768–786, 1998.
[8] Z. Ghahramani and M. I. Jordan. Factorial hidden Markov models. Machine Learning, 29:245–273, 1997.
[9] W. Kabsch and C. Sander. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22:2577–2637, 1983.
[10] A. M. Lesk, L. Lo Conte, and T. J. P. Hubbard. Assessment of novel fold targets in CASP4: predictions of three-dimensional structures, secondary structures, and interresidue contacts. Proteins, 45, S5:98–118, 2001.
[11] G. Pollastri and P. Baldi. Prediction of contact maps by GIOHMMs and recurrent neural networks using lateral propagation from all four cardinal corners. Proceedings of 2002 ISMB (Intelligent Systems for Molecular Biology) Conference. Bioinformatics, 18, S1:62–70, 2002.
[12] G. Pollastri, D. Przybylski, B. Rost, and P. Baldi. Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins, 47:228–235, 2002.
[13] G. Pollastri, P. Baldi, P. Fariselli, and R. Casadio. Prediction of coordination number and relative solvent accessibility in proteins. Proteins, 47:142–153, 2002.
[14] M. Vendruscolo, E. Kussell, and E. Domany. Recovery of protein structure from contact maps. Folding and Design, 2:295–306, 1997.
5 0.58360386 99 nips-2002-Graph-Driven Feature Extraction From Microarray Data Using Diffusion Kernels and Kernel CCA
Author: Jean-philippe Vert, Minoru Kanehisa
Abstract: We present an algorithm to extract features from high-dimensional gene expression profiles, based on the knowledge of a graph which links together genes known to participate in successive reactions in metabolic pathways. Motivated by the intuition that biologically relevant features are likely to exhibit smoothness with respect to the graph topology, the algorithm involves encoding the graph and the set of expression profiles into kernel functions, and performing a generalized form of canonical correlation analysis in the corresponding reproducing kernel Hilbert spaces. Function prediction experiments for the genes of the yeast S. cerevisiae validate this approach by showing a consistent increase in performance when a state-of-the-art classifier uses the vector of features instead of the original expression profile to predict the functional class of a gene.
6 0.54341465 167 nips-2002-Rational Kernels
7 0.54194039 119 nips-2002-Kernel Dependency Estimation
8 0.54117787 98 nips-2002-Going Metric: Denoising Pairwise Data
9 0.53904641 120 nips-2002-Kernel Design Using Boosting
10 0.52117461 113 nips-2002-Information Diffusion Kernels
11 0.50798506 106 nips-2002-Hyperkernels
12 0.50059634 53 nips-2002-Clustering with the Fisher Score
13 0.49438384 156 nips-2002-On the Complexity of Learning the Kernel Matrix
14 0.47363445 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
15 0.45002905 187 nips-2002-Spikernels: Embedding Spiking Neurons in Inner-Product Spaces
16 0.44295385 32 nips-2002-Approximate Inference and Protein-Folding
17 0.44217464 125 nips-2002-Learning Semantic Similarity
18 0.43038645 67 nips-2002-Discriminative Binaural Sound Localization
19 0.42930657 196 nips-2002-The RA Scanner: Prediction of Rheumatoid Joint Inflammation Based on Laser Imaging
20 0.42768705 62 nips-2002-Coulomb Classifiers: Generalizing Support Vector Machines via an Analogy to Electrostatic Systems
topicId topicWeight
[(3, 0.011), (11, 0.032), (23, 0.014), (42, 0.058), (51, 0.022), (54, 0.168), (55, 0.018), (57, 0.013), (67, 0.013), (68, 0.034), (72, 0.273), (74, 0.096), (79, 0.018), (92, 0.038), (98, 0.115)]
simIndex simValue paperId paperTitle
1 0.85088289 178 nips-2002-Robust Novelty Detection with Single-Class MPM
Author: Laurent E. Ghaoui, Michael I. Jordan, Gert R. Lanckriet
Abstract: In this paper we consider the problem of novelty detection, presenting an algorithm that aims to find a minimal region in input space containing a fraction α of the probability mass underlying a data set. This algorithm- the
same-paper 2 0.84167159 145 nips-2002-Mismatch String Kernels for SVM Protein Classification
Author: Eleazar Eskin, Jason Weston, William S. Noble, Christina S. Leslie
Abstract: We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the protein classification problem. These kernels measure sequence similarity based on shared occurrences of k-length subsequences, counted with up to m mismatches, and do not rely on any generative model for the positive training sequences. We compute the kernels efficiently using a mismatch tree data structure and report experiments on a benchmark SCOP dataset, where we show that the mismatch kernel used with an SVM classifier performs as well as the Fisher kernel, the most successful method for remote homology detection, while achieving considerable computational savings.
3 0.68522769 2 nips-2002-A Bilinear Model for Sparse Coding
Author: David B. Grimes, Rajesh P. Rao
Abstract: Recent algorithms for sparse coding and independent component analysis (ICA) have demonstrated how localized features can be learned from natural images. However, these approaches do not take image transformations into account. As a result, they produce image codes that are redundant because the same feature is learned at multiple locations. We describe an algorithm for sparse coding based on a bilinear generative model of images. By explicitly modeling the interaction between image features and their transformations, the bilinear approach helps reduce redundancy in the image code and provides a basis for transformation-invariant vision. We present results demonstrating bilinear sparse coding of natural images. We also explore an extension of the model that can capture spatial relationships between the independent features of an object, thereby providing a new framework for parts-based object recognition.
4 0.64741343 124 nips-2002-Learning Graphical Models with Mercer Kernels
Author: Francis R. Bach, Michael I. Jordan
Abstract: We present a class of algorithms for learning the structure of graphical models from data. The algorithms are based on a measure known as the kernel generalized variance (KGV), which essentially allows us to treat all variables on an equal footing as Gaussians in a feature space obtained from Mercer kernels. Thus we are able to learn hybrid graphs involving discrete and continuous variables of arbitrary type. We explore the computational properties of our approach, showing how to use the kernel trick to compute the relevant statistics in linear time. We illustrate our framework with experiments involving discrete and continuous data.
5 0.6466648 53 nips-2002-Clustering with the Fisher Score
Author: Koji Tsuda, Motoaki Kawanabe, Klaus-Robert Müller
Abstract: Recently the Fisher score (or the Fisher kernel) is increasingly used as a feature extractor for classification problems. The Fisher score is a vector of parameter derivatives of loglikelihood of a probabilistic model. This paper gives a theoretical analysis about how class information is preserved in the space of the Fisher score, which turns out that the Fisher score consists of a few important dimensions with class information and many nuisance dimensions. When we perform clustering with the Fisher score, K-Means type methods are obviously inappropriate because they make use of all dimensions. So we will develop a novel but simple clustering algorithm specialized for the Fisher score, which can exploit important dimensions. This algorithm is successfully tested in experiments with artificial data and real data (amino acid sequences).
6 0.64600158 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
7 0.64479548 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
8 0.64469272 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
9 0.64190125 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
10 0.64167553 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
11 0.63911223 27 nips-2002-An Impossibility Theorem for Clustering
12 0.63662589 45 nips-2002-Boosted Dyadic Kernel Discriminants
13 0.63598013 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
14 0.63563395 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
15 0.63488197 3 nips-2002-A Convergent Form of Approximate Policy Iteration
16 0.63426954 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
17 0.63282758 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition
18 0.63274509 113 nips-2002-Information Diffusion Kernels
19 0.63269031 63 nips-2002-Critical Lines in Symmetry of Mixture Models and its Application to Component Splitting
20 0.63246274 119 nips-2002-Kernel Dependency Estimation