nips nips2002 nips2002-145 nips2002-145-reference knowledge-graph by maker-knowledge-mining

145 nips-2002-Mismatch String Kernels for SVM Protein Classification

Source: pdf

Author: Eleazar Eskin, Jason Weston, William S. Noble, Christina S. Leslie

Abstract: We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the protein classiﬁcation problem. These kernels measure sequence similarity based on shared occurrences of -length subsequences, counted with up to mismatches, and do not rely on any generative model for the positive training sequences. We compute the kernels efﬁciently using a mismatch tree data structure and report experiments on a benchmark SCOP dataset, where we show that the mismatch kernel used with an SVM classiﬁer performs as well as the Fisher kernel, the most successful method for remote homology detection, while achieving considerable computational savings. ¡ ¢

reference text

[1] M. S. Waterman, J. Joyce, and M. Eggert. Computer alignment of sequences, chapter Phylogenetic Analysis of DNA Sequences. Oxford, 1991.

[2] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. A basic local alignment search tool. Journal of Molecular Biology, 215:403–410, 1990.

[3] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25:3389–3402, 1997.

[4] Michael Gribskov, Andrew D. McLachlan, and David Eisenberg. Proﬁle analysis: Detection of distantly related proteins. PNAS, pages 4355–4358, 1987.

[5] A. Bairoch. The PROSITE database, its status in 1995. Nucleic Acids Research, 24:189–196, 1995.

[6] T. K. Attwood, M. E. Beck, D. R. Flower, P. Scordis, and J. N Selley. The PRINTS protein ﬁngerprint database in its ﬁfth year. Nucleic Acids Research, 26(1):304–308, 1998.

[7] A. Krogh, M. Brown, I. Mian, K. Sjolander, and D. Haussler. Hidden markov models in computational biology: Applications to protein modeling. Journal of Molecular Biology, 235:1501– 1531, 1994.

[8] S. R. Eddy. Multiple alignment using hidden markov models. In Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology, pages 114–120. AAAI Press, 1995.

[9] P. Baldi, Y. Chauvin, T. Hunkapiller, and M. A. McClure. Hidden markov models of biological primary sequence information. PNAS, 91(3):1059–1063, 1994.

[10] T. Jaakkola, M. Diekhans, and D. Haussler. A discriminative framework for detecting remote protein homologies. Journal of Computational Biology, 2000.

[11] T. Jaakkola, M. Diekhans, and D. Haussler. Using the ﬁsher kernel method to detect remote protein homologies. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, pages 149–158. AAAI Press, 1999.

[12] V. N. Vapnik. Statistical Learning Theory. Springer, 1998.

[13] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge, 2000.

[14] C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classiﬁcation. Proceedings of the Paciﬁc Biocomputing Symposium, 2002.

[15] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia. SCOP: A structural classiﬁcation of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247:536–540, 1995.

[16] M. Sagot. Spelling approximate or repeated motifs using a sufﬁx tree. Lecture Notes in Computer Science, 1380:111–127, 1998.

[17] G. Pavesi, G. Mauri, and G. Pesole. An algorithm for ﬁnding signals of unknown length in DNA sequences. Bioinformatics, 17:S207–S214, July 2001. Proceedings of the Ninth International Conference on Intelligent Systems for Molecular Biology.

[18] S. Henikoff and J. G. Henikoff. Embedding strategies for effective use of information from multiple sequence alignments. Protein Science, 6(3):698–705, 1997.

[19] S. L. Salzberg. On comparing classiﬁers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, 1:371–328, 1997.

[20] C. Watkins. Dynamic alignment kernels. Technical report, UL Royal Holloway, 1999.

[21] D. Haussler. Convolution kernels on discrete structure. Technical report, UC Santa Cruz, 1999.

[22] Huma Lodhi, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. Text classiﬁcation using string kernels. Preprint.