
Semi-supervised Protein Classification Using Cluster Kernels (NIPS 2003, paper 173)

Source: pdf

Author: Jason Weston, Dengyong Zhou, André Elisseeff, William S. Noble, Christina S. Leslie

Abstract: A key issue in supervised protein classification is the representation of input sequences of amino acids. Recent work using string kernels for protein data has achieved state-of-the-art classification performance. However, such representations are based only on labeled data (examples with known 3D structures, organized into structural classes), while in practice, unlabeled data is far more plentiful. In this work, we develop simple and scalable cluster kernel techniques for incorporating unlabeled data into the representation of protein sequences. We show that our methods greatly improve the classification performance of string kernels and outperform standard approaches for using unlabeled data, such as adding close homologs of the positive examples to the training data. We achieve equal or superior performance to previously presented cluster kernel methods while achieving far greater computational efficiency.
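The abstract does not spell out how a cluster kernel is computed, so the following is only a minimal sketch of one way unlabeled data could enter a sequence representation: averaging a base string kernel over each sequence's neighborhood of similar sequences. It rests on several assumptions. A plain k-mer spectrum kernel stands in for the string kernels mentioned above. The neighbor sets are assumed to come from some sequence-similarity search over unlabeled data. All names (`spectrum_kernel`, `neighborhood_kernel`, `neighbors`) and the toy sequences are hypothetical.

```python
from collections import Counter


def spectrum_features(seq, k=2):
    """Count every length-k substring (k-mer) of a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k=2):
    """Inner product of the k-mer count vectors of two sequences."""
    fa, fb = spectrum_features(a, k), spectrum_features(b, k)
    # Counter returns 0 for k-mers that are absent from fb.
    return sum(count * fb[kmer] for kmer, count in fa.items())


def neighborhood_kernel(x, y, neighbors, k=2):
    """Average the base kernel over all pairs drawn from the
    neighborhoods of x and y (each neighborhood includes the
    sequence itself in this sketch)."""
    nx, ny = neighbors[x], neighbors[y]
    total = sum(spectrum_kernel(a, b, k) for a in nx for b in ny)
    return total / (len(nx) * len(ny))


# Toy data (assumed): "ACDC" plays the role of an unlabeled
# sequence found to be similar to the labeled sequence "ACDA".
neighbors = {
    "ACDA": ["ACDA", "ACDC"],
    "CDAC": ["CDAC"],
}
value = neighborhood_kernel("ACDA", "CDAC", neighbors)
```

Because the averaged kernel is an inner product of averaged feature vectors, it remains a valid kernel and can be passed to a standard SVM in place of the base kernel.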

reference text

[1] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. A basic local alignment search tool. Journal of Molecular Biology, 215:403–410, 1990.

[2] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195–197, 1981.

[3] A. Krogh, M. Brown, I. Mian, K. Sjolander, and D. Haussler. Hidden Markov models in computational biology: Applications to protein modeling. Journal of Molecular Biology, 235:1501–1531, 1994.

[4] J. Park, K. Karplus, C. Barrett, R. Hughey, D. Haussler, T. Hubbard, and C. Chothia. Sequence comparisons using multiple sequences detect twice as many remote homologues as pairwise methods. Journal of Molecular Biology, 284(4):1201–1210, 1998.

[5] T. Jaakkola, M. Diekhans, and D. Haussler. A discriminative framework for detecting remote protein homologies. Journal of Computational Biology, 2000.

[6] C. Leslie, E. Eskin, J. Weston, and W. S. Noble. Mismatch string kernels for SVM protein classification. Neural Information Processing Systems 15, 2002.

[7] C. Liao and W. S. Noble. Combining pairwise sequence similarity and support vector machines for remote protein homology detection. Proceedings of RECOMB, 2002.

[8] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25:3389–3402, 1997.

[9] X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical report, CMU, 2002.

[10] O. Chapelle, J. Weston, and B. Schoelkopf. Cluster kernels for semi-supervised learning. Neural Information Processing Systems 15, 2002.

[11] M. Szummer and T. Jaakkola. Partially labeled classification with Markov random walks. Neural Information Processing Systems 14, 2001.

[12] M. Seeger. Learning with labeled and unlabeled data. Technical report, University of Edinburgh, 2001.

[13] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. Neural Information Processing Systems 14, 2001.

[14] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of ICML, 1999.

[15] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia. SCOP: A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247:536–540, 1995.