nips nips2003 nips2003-173 nips2003-173-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jason Weston, Dengyong Zhou, André Elisseeff, William S. Noble, Christina S. Leslie
Abstract: A key issue in supervised protein classification is the representation of input sequences of amino acids. Recent work using string kernels for protein data has achieved state-of-the-art classification performance. However, such representations are based only on labeled data (examples with known 3D structures, organized into structural classes), while in practice, unlabeled data is far more plentiful. In this work, we develop simple and scalable cluster kernel techniques for incorporating unlabeled data into the representation of protein sequences. We show that our methods greatly improve the classification performance of string kernels and outperform standard approaches for using unlabeled data, such as adding close homologs of the positive examples to the training data. We achieve equal or superior performance to previously presented cluster kernel methods while achieving far greater computational efficiency.
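As a rough illustration of the idea in the abstract, the sketch below shows one way unlabeled sequences could be folded into a string-kernel representation: a base kernel is averaged over a neighborhood of similar sequences around each input. This is a minimal sketch under stated assumptions, not necessarily the authors' construction. A plain k-mer spectrum kernel stands in for the string kernels the abstract mentions, and the `neighbors` mapping, the function names, and the toy sequences are all hypothetical.

```python
from collections import Counter
from itertools import product


def spectrum_features(seq, k=3):
    """Count every contiguous k-mer in seq (a simple string-kernel feature map)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k=3):
    """Inner product of the k-mer count vectors of sequences a and b."""
    fa, fb = spectrum_features(a, k), spectrum_features(b, k)
    # Counter returns 0 for k-mers that do not occur in b.
    return sum(count * fb[kmer] for kmer, count in fa.items())


def neighborhood_kernel(x, y, neighbors, k=3):
    """Average the base kernel over all pairs drawn from the two neighborhoods.

    `neighbors` (hypothetical input) maps a sequence to a list of similar
    sequences, e.g. hits found by searching a pool of unlabeled data.
    Each sequence is always treated as a member of its own neighborhood.
    """
    nx = [x] + [s for s in neighbors.get(x, []) if s != x]
    ny = [y] + [s for s in neighbors.get(y, []) if s != y]
    total = sum(spectrum_kernel(a, b, k) for a, b in product(nx, ny))
    return total / (len(nx) * len(ny))


if __name__ == "__main__":
    # Toy data: the two labeled sequences share no 3-mer directly, but an
    # unlabeled neighbor of the first one overlaps with the second.
    neighbors = {"ABCD": ["BCDE"]}
    print(spectrum_kernel("ABCD", "CDEF"))
    print(neighborhood_kernel("ABCD", "CDEF", neighbors))
```

In this toy case the base kernel sees no similarity between the two sequences, while the neighborhood average does, because the unlabeled neighbor bridges them. How neighborhoods are actually obtained and which base kernel is used are left open here.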
[1] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. A basic local alignment search tool. Journal of Molecular Biology, 215:403–410, 1990.
[2] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195–197, 1981.
[3] A. Krogh, M. Brown, I. Mian, K. Sjolander, and D. Haussler. Hidden Markov models in computational biology: Applications to protein modeling. Journal of Molecular Biology, 235:1501–1531, 1994.
[4] J. Park, K. Karplus, C. Barrett, R. Hughey, D. Haussler, T. Hubbard, and C. Chothia. Sequence comparisons using multiple sequences detect twice as many remote homologues as pairwise methods. Journal of Molecular Biology, 284(4):1201–1210, 1998.
[5] T. Jaakkola, M. Diekhans, and D. Haussler. A discriminative framework for detecting remote protein homologies. Journal of Computational Biology, 2000.
[6] C. Leslie, E. Eskin, J. Weston, and W. S. Noble. Mismatch string kernels for SVM protein classification. Neural Information Processing Systems 15, 2002.
[7] C. Liao and W. S. Noble. Combining pairwise sequence similarity and support vector machines for remote protein homology detection. Proceedings of RECOMB, 2002.
[8] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25:3389–3402, 1997.
[9] X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical report, CMU, 2002.
[10] O. Chapelle, J. Weston, and B. Schoelkopf. Cluster kernels for semi-supervised learning. Neural Information Processing Systems 15, 2002.
[11] M. Szummer and T. Jaakkola. Partially labeled classification with Markov random walks. Neural Information Processing Systems 14, 2001.
[12] M. Seeger. Learning with labeled and unlabeled data. Technical report, University of Edinburgh, 2001.
[13] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. Neural Information Processing Systems 14, 2001.
[14] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of ICML, 1999.
[15] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia. SCOP: A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247:536–540, 1995.