jmlr jmlr2007 jmlr2007-57 jmlr2007-57-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Iain Melvin, Eugene Ie, Jason Weston, William Stafford Noble, Christina Leslie
Abstract: Predicting a protein’s structural class from its amino acid sequence is a fundamental problem in computational biology. Recent machine learning work in this domain has focused on developing new input space representations for protein sequences, that is, string kernels, some of which give state-of-the-art performance for the binary prediction task of discriminating between one class and all the others. However, the underlying protein classification problem is in fact a huge multi-class problem, with over 1000 protein folds and even more structural subcategories organized into a hierarchy. To handle this challenging many-class problem while taking advantage of progress on the binary problem, we introduce an adaptive code approach in the output space of one-vs-the-rest prediction scores. Specifically, we use a ranking perceptron algorithm to learn a weighting of binary classifiers that improves multi-class prediction with respect to a fixed set of output codes. We use a cross-validation set-up to generate output vectors for training, and we define codes that capture information about the protein structural hierarchy. Our code weighting approach significantly improves on the standard one-vs-all method for two difficult multi-class protein classification problems: remote homology detection and fold recognition. Our algorithm also outperforms a previous code learning approach due to Crammer and Singer, trained here using a perceptron, when the dimension of the code vectors is high and the number of classes is large. Finally, we compare against PSI-BLAST, one of the most widely used methods in protein sequence analysis, and find that our method strongly outperforms it on every structure classification problem that we consider. Supplementary data and source code are available at http://www.cs
Note: The first two authors contributed equally to this work.
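The abstract describes the core idea: embed each sequence in the output space of one-vs-the-rest prediction scores, then learn a weighting of those scores with a ranking perceptron so that the weighted score vector ranks the true class's output code above the codes of all other classes. The following is a minimal sketch of that idea under stated assumptions; the function and parameter names (learn_code_weights, predict, n_epochs, margin, lr) and the exact update rule are illustrative and are not taken from the authors' released code.

```python
import numpy as np

# Sketch of an adaptive-code ranking perceptron in the spirit of the abstract.
# F: (n_examples, d) one-vs-the-rest score vectors (the output-space embedding),
# y: (n_examples,) true class indices,
# C: (n_classes, d) fixed output codes (e.g., encoding fold/superfamily labels).
# All names and the precise update are assumptions made for illustration.

def learn_code_weights(F, y, C, n_epochs=10, margin=1.0, lr=1.0):
    w = np.ones(F.shape[1])                  # elementwise weights on the code dimensions
    for _ in range(n_epochs):
        for f, yi in zip(F, y):
            scores = C @ (w * f)             # <w * f(x), C_k> for every class k
            rival = scores.copy()
            rival[yi] = -np.inf
            k = int(np.argmax(rival))        # highest-scoring incorrect class
            if scores[yi] - scores[k] < margin:
                # Ranking-perceptron update: push the true code up, the rival code down.
                w += lr * f * (C[yi] - C[k])
    return w

def predict(F, C, w):
    # Assign each example to the class whose code best matches its weighted scores.
    return np.argmax((w * F) @ C.T, axis=1)
```

In this sketch the codes C stay fixed while only the weighting w is adapted, which mirrors the paper's framing of learning a weighting of binary classifiers with respect to a fixed set of output codes.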
References:
Erin L. Allwein, Robert E. Schapire, and Yoram Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. In Proceedings of the 17th International Conference on Machine Learning, pages 9–16. Morgan Kaufmann, San Francisco, CA, 2000.
Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. A basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, 1990.
Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25:3389–3402, 1997.
Zafer Barutcuoglu, Robert E. Schapire, and Olga G. Troyanskaya. Hierarchical multi-label prediction of gene function. Bioinformatics, 22(7):830–836, 2006.
Asa Ben-Hur and Douglas Brutlag. Remote homology detection: a motif based approach. Proceedings of the Eleventh International Conference on Intelligent Systems for Molecular Biology, 19 suppl 1:i26–i33, 2003.
Léon Bottou, Yann LeCun, and Yoshua Bengio. Global training of document processing systems using graph transformer networks. In Proc. of Computer Vision and Pattern Recognition, pages 490–494, Puerto Rico, 1997. IEEE.
Steven E. Brenner, Patrice Koehl, and Michael Levitt. The ASTRAL compendium for sequence and structure analysis. Nucleic Acids Research, 28:254–256, 2000.
Nicolò Cesa-Bianchi, Claudio Gentile, and Luca Zaniboni. Incremental algorithms for hierarchical classification. Journal of Machine Learning Research, 7:31–54, 2006.
Michael Collins. Discriminative reranking for natural language parsing. In Proceedings of the 17th International Conference on Machine Learning, pages 175–182. Morgan Kaufmann, San Francisco, CA, 2000.
Michael Collins and Nigel Duffy. New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 263–270, 2002.
Koby Crammer and Yoram Singer. On the learnability and design of output codes for multiclass problems. In Computational Learning Theory, pages 35–46, 2000.
Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2002. ISSN 1533-7928.
Ofer Dekel, Joseph Keshet, and Yoram Singer. Large margin hierarchical classification. In Proceedings of the 21st International Conference on Machine Learning, 2004.
Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263–286, 1995.
Yoav Freund and Robert E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277–296, 1999.
Mark Girolami and Simon Rogers. Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Computation, 18(8):1790–1817, 2006.
Chih-Wei Hsu and Chih-Jen Lin. A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13(2):415–425, 2002.
Eugene Ie, Jason Weston, William Stafford Noble, and Christina Leslie. Multi-class protein fold recognition using adaptive codes. Proceedings of the 22nd International Conference on Machine Learning, 2005.
Tommi Jaakkola, Mark Diekhans, and David Haussler. A discriminative framework for detecting remote protein homologies. Journal of Computational Biology, 7(1–2):95–114, 2000.
Anders Krogh, Michael Brown, I. Saira Mian, Kimmen Sjölander, and David Haussler. Hidden Markov models in computational biology: Applications to protein modeling. Journal of Molecular Biology, 235:1501–1531, 1994.
Rui Kuang, Eugene Ie, Ke Wang, Kai Wang, Mahira Siddiqi, Yoav Freund, and Christina Leslie. Profile kernels for detecting remote protein homologs and discriminative motifs. Journal of Bioinformatics and Computational Biology, 2005. To appear.
Yann LeCun and Fu Jie Huang. Loss functions for discriminative training of energy-based models. In Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, 2005.
Christina Leslie, Eleazar Eskin, and William S. Noble. The spectrum kernel: A string kernel for SVM protein classification. Proceedings of the Pacific Biocomputing Symposium, pages 564–575, 2002a.
Christina Leslie, Eleazar Eskin, Jason Weston, and William S. Noble. Mismatch string kernels for SVM protein classification. Advances in Neural Information Processing Systems 15, pages 1441–1448, 2002b.
Christina Leslie, Eleazar Eskin, Adiel Cohen, Jason Weston, and William S. Noble. Mismatch string kernels for discriminative protein classification. Bioinformatics, 20(4):467–476, 2004.
Li Liao and William S. Noble. Combining pairwise sequence similarity and support vector machines for remote protein homology detection. Proceedings of the 6th Annual International Conference on Research in Computational Molecular Biology, pages 225–232, 2002.
Hsuan-Tien Lin, Chih-Jen Lin, and Ruby C. Weng. A note on Platt’s probabilistic outputs for support vector machines. Technical report, Department of Computer Science and Information Engineering, National Taiwan University, 2003.
Alexey G. Murzin, Steven E. Brenner, Tim Hubbard, and Cyrus Chothia. SCOP: A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247(4):536–540, 1995.
Jong Park, Kevin Karplus, Christian Barrett, Richard Hughey, David Haussler, Tim Hubbard, and Cyrus Chothia. Sequence comparisons using multiple sequences detect twice as many remote homologues as pairwise methods. Journal of Molecular Biology, 284(4):1201–1210, 1998.
John Platt. Probabilities for support vector machines. Advances in Large Margin Classifiers, pages 61–74, 1999.
Huzefa Rangwala and George Karypis. Profile-based direct kernels for remote homology detection and fold recognition. Bioinformatics, 21(23):4239–4247, 2005.
Gunnar Rätsch, Alexander J. Smola, and Sebastian Mika. Adapting codes and embeddings for polychotomies. Advances in Neural Information Processing Systems, 15:513–520, 2002.
Ryan Rifkin and Aldebaro Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, 5:101–141, 2004. ISSN 1533-7928.
Frank Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958.
Hiroto Saigo, Jean-Philippe Vert, Nobuhisa Ueda, and Tatsuya Akutsu. Protein homology detection using string alignment kernels. Bioinformatics, 20(11):1682–1689, 2004.
Temple Smith and Michael Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Altun. Support vector learning for interdependent and structured output spaces. Proceedings of the 21st International Conference on Machine Learning, pages 823–830, 2004.
Vladimir N. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, 1998.
Jason Weston and Chris Watkins. Support vector machines for multiclass pattern recognition. In Proceedings of the 7th European Symposium On Artificial Neural Networks, 1999.
Jason Weston, Christina Leslie, Eugene Ie, Dengyong Zhou, Andre Elisseeff, and William S. Noble. Semi-supervised protein classification using cluster kernels. Bioinformatics, 21(15):3241–3247, 2005.