jmlr jmlr2007 jmlr2007-57 jmlr2007-57-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Iain Melvin, Eugene Ie, Jason Weston, William Stafford Noble, Christina Leslie
Abstract: Predicting a protein’s structural class from its amino acid sequence is a fundamental problem in computational biology. Recent machine learning work in this domain has focused on developing new input space representations for protein sequences, that is, string kernels, some of which give state-of-the-art performance for the binary prediction task of discriminating between one class and all the others. However, the underlying protein classification problem is in fact a huge multi-class problem, with over 1000 protein folds and even more structural subcategories organized into a hierarchy. To handle this challenging many-class problem while taking advantage of progress on the binary problem, we introduce an adaptive code approach in the output space of one-vs-the-rest prediction scores. Specifically, we use a ranking perceptron algorithm to learn a weighting of binary classifiers that improves multi-class prediction with respect to a fixed set of output codes. We use a cross-validation set-up to generate output vectors for training, and we define codes that capture information about the protein structural hierarchy. Our code weighting approach significantly improves on the standard one-vs-all method for two difficult multi-class protein classification problems: remote homology detection and fold recognition. Our algorithm also outperforms a previous code learning approach due to Crammer and Singer, trained here using a perceptron, when the dimension of the code vectors is high and the number of classes is large. Finally, we compare against PSI-BLAST, one of the most widely used methods in protein sequence analysis, and find that our method strongly outperforms it on every structure classification problem that we consider. Supplementary data and source code are available at http://www.cs
Note: The first two authors contributed equally to this work.
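The abstract describes the core idea: embed each sequence in the output space of one-vs-the-rest prediction scores, then learn a weighting of those scores with a ranking perceptron so that the weighted score vector ranks the true class's output code above the codes of all other classes. The following is a minimal sketch of that idea under stated assumptions; the function and parameter names (learn_code_weights, predict, n_epochs, margin, lr) and the exact update rule are illustrative and are not taken from the authors' released code.

```python
import numpy as np

# Sketch of an adaptive-code ranking perceptron in the spirit of the abstract.
# F: (n_examples, d) one-vs-the-rest score vectors (the output-space embedding),
# y: (n_examples,) true class indices,
# C: (n_classes, d) fixed output codes (e.g., encoding fold/superfamily labels).
# All names and the precise update are assumptions made for illustration.

def learn_code_weights(F, y, C, n_epochs=10, margin=1.0, lr=1.0):
    w = np.ones(F.shape[1])                  # elementwise weights on the code dimensions
    for _ in range(n_epochs):
        for f, yi in zip(F, y):
            scores = C @ (w * f)             # <w * f(x), C_k> for every class k
            rival = scores.copy()
            rival[yi] = -np.inf
            k = int(np.argmax(rival))        # highest-scoring incorrect class
            if scores[yi] - scores[k] < margin:
                # Ranking-perceptron update: push the true code up, the rival code down.
                w += lr * f * (C[yi] - C[k])
    return w

def predict(F, C, w):
    # Assign each example to the class whose code best matches its weighted scores.
    return np.argmax((w * F) @ C.T, axis=1)
```

In this sketch the codes C stay fixed while only the weighting w is adapted, which mirrors the paper's framing of learning a weighting of binary classifiers with respect to a fixed set of output codes.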
References:
Erin L. Allwein, Robert E. Schapire, and Yoram Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. In Proceedings of the 17th International Conference on Machine Learning, pages 9–16. Morgan Kaufmann, San Francisco, CA, 2000.
Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. A basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, 1990.
Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25:3389–3402, 1997.
Zafer Barutcuoglu, Robert E. Schapire, and Olga G. Troyanskaya. Hierarchical multi-label prediction of gene function. Bioinformatics, 22(7):830–836, 2006.
Asa Ben-Hur and Douglas Brutlag. Remote homology detection: a motif based approach. Proceedings of the Eleventh International Conference on Intelligent Systems for Molecular Biology, 19 suppl 1:i26–i33, 2003.
Léon Bottou, Yann LeCun, and Yoshua Bengio. Global training of document processing systems using graph transformer networks. In Proc. of Computer Vision and Pattern Recognition, pages 490–494, Puerto Rico, 1997. IEEE.
Steven E. Brenner, Patrice Koehl, and Michael Levitt. The ASTRAL compendium for sequence and structure analysis. Nucleic Acids Research, 28:254–256, 2000.
Nicolò Cesa-Bianchi, Claudio Gentile, and Luca Zaniboni. Incremental algorithms for hierarchical classification. Journal of Machine Learning Research, 7:31–54, 2006.
Michael Collins. Discriminative reranking for natural language parsing. In Proceedings of the 17th International Conference on Machine Learning, pages 175–182. Morgan Kaufmann, San Francisco, CA, 2000.
Michael Collins and Nigel Duffy. New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 263–270, 2002.
Koby Crammer and Yoram Singer. On the learnability and design of output codes for multiclass problems. In Computational Learning Theory, pages 35–46, 2000.
Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2002. ISSN 1533-7928.
Ofer Dekel, Joseph Keshet, and Yoram Singer. Large margin hierarchical classification. In Proceedings of the 21st International Conference on Machine Learning, 2004.
Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263–286, 1995.
Yoav Freund and Robert E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277–296, 1999.
Mark Girolami and Simon Rogers. Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Computation, 18(8):1790–1817, 2006.
Chih-Wei Hsu and Chih-Jen Lin. A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13(2):415–425, 2002.
Eugene Ie, Jason Weston, William Stafford Noble, and Christina Leslie. Multi-class protein fold recognition using adaptive codes. Proceedings of the 22nd International Conference on Machine Learning, 2005.
Tommi Jaakkola, Mark Diekhans, and David Haussler. A discriminative framework for detecting remote protein homologies. Journal of Computational Biology, 7(1–2):95–114, 2000.
Anders Krogh, Michael Brown, I. Saira Mian, Kimmen Sjölander, and David Haussler. Hidden Markov models in computational biology: Applications to protein modeling. Journal of Molecular Biology, 235:1501–1531, 1994.
Rui Kuang, Eugene Ie, Ke Wang, Kai Wang, Mahira Siddiqi, Yoav Freund, and Christina Leslie. Profile kernels for detecting remote protein homologs and discriminative motifs. Journal of Bioinformatics and Computational Biology, 2005. To appear.
Yann LeCun and Fu Jie Huang. Loss functions for discriminative training of energy-based models. In Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, 2005.
Christina Leslie, Eleazar Eskin, and William S. Noble. The spectrum kernel: A string kernel for SVM protein classification. Proceedings of the Pacific Biocomputing Symposium, pages 564–575, 2002a.
Christina Leslie, Eleazar Eskin, Jason Weston, and William S. Noble. Mismatch string kernels for SVM protein classification. Advances in Neural Information Processing Systems 15, pages 1441–1448, 2002b.
Christina Leslie, Eleazar Eskin, Adiel Cohen, Jason Weston, and William S. Noble. Mismatch string kernels for discriminative protein classification. Bioinformatics, 20(4):467–476, 2004.
Li Liao and William S. Noble. Combining pairwise sequence similarity and support vector machines for remote protein homology detection. Proceedings of the 6th Annual International Conference on Research in Computational Molecular Biology, pages 225–232, 2002.
Hsuan-Tien Lin, Chih-Jen Lin, and Ruby C. Weng. A note on Platt’s probabilistic outputs for support vector machines. Technical report, Department of Computer Science and Information Engineering, National Taiwan University, 2003.
Alexey G. Murzin, Steven E. Brenner, Tim Hubbard, and Cyrus Chothia. SCOP: A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247(4):536–540, 1995.
Jong Park, Kevin Karplus, Christian Barrett, Richard Hughey, David Haussler, Tim Hubbard, and Cyrus Chothia. Sequence comparisons using multiple sequences detect twice as many remote homologues as pairwise methods. Journal of Molecular Biology, 284(4):1201–1210, 1998.
John Platt. Probabilities for support vector machines. Advances in Large Margin Classifiers, pages 61–74, 1999.
Huzefa Rangwala and George Karypis. Profile-based direct kernels for remote homology detection and fold recognition. Bioinformatics, 21(23):4239–4247, 2005.
Gunnar Rätsch, Alexander J. Smola, and Sebastian Mika. Adapting codes and embeddings for polychotomies. Advances in Neural Information Processing Systems, 15:513–520, 2002.
Ryan Rifkin and Aldebaro Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, 5:101–141, 2004. ISSN 1533-7928.
Frank Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958.
Hiroto Saigo, Jean-Philippe Vert, Nobuhisa Ueda, and Tatsuya Akutsu. Protein homology detection using string alignment kernels. Bioinformatics, 20(11):1682–1689, 2004.
Temple Smith and Michael Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Altun. Support vector learning for interdependent and structured output spaces. Proceedings of the 21st International Conference on Machine Learning, pages 823–830, 2004.
Vladimir N. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, 1998.
Jason Weston and Chris Watkins. Support vector machines for multiclass pattern recognition. In Proceedings of the 7th European Symposium On Artificial Neural Networks, 1999.
Jason Weston, Christina Leslie, Eugene Ie, Dengyong Zhou, Andre Elisseeff, and William S. Noble. Semi-supervised protein classification using cluster kernels. Bioinformatics, 21(15):3241–3247, 2005.