54 nips-2006-Comparative Gene Prediction using Conditional Random Fields

Author: Jade P. Vinson, David Decaprio, Matthew D. Pearson, Stacey Luoma, James E. Galagan

Abstract: Computational gene prediction using generative models has reached a plateau, with several groups converging on a generalized hidden Markov model (GHMM) that incorporates phylogenetic models of nucleotide sequence evolution. Further improvements in gene calling accuracy are likely to come through new methods that incorporate additional data, both comparative and species-specific. Conditional Random Fields (CRFs), which directly model the conditional probability P(y|x) of a vector of hidden states conditioned on a set of observations, provide a unified framework for combining probabilistic and non-probabilistic information and have been shown to outperform HMMs on sequence labeling tasks in natural language processing. We describe the use of CRFs for comparative gene prediction. We implement a model that encapsulates both a phylogenetic GHMM (our baseline comparative model) and additional non-probabilistic features. We tested our model on the genome sequence of the fungal human pathogen Cryptococcus neoformans. Our baseline comparative model displays accuracy comparable to the best available gene prediction tool for this organism. Moreover, we show that discriminative training and the incorporation of non-probabilistic evidence significantly improve performance. Our software implementation, Conrad, is freely available with an open source license at http://www.broad.mit.edu/annotation/conrad/.
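To make the phrase "directly model the conditional probability P(y|x)" concrete, the following is a minimal sketch of a generic linear-chain CRF, in which P(y|x) = exp(score(x, y)) / Z(x). It is not Conrad's model or code: the three-label state set, the feature weights, and all function names are hypothetical and chosen only for illustration, and the real system combines far richer features than the per-base and transition weights used here.

```python
import itertools
import math

# Hypothetical label set; a real gene model would use many more states.
STATES = ("intergenic", "exon", "intron")

# Hypothetical feature weights (unlisted features have weight 0).
W_EMIT = {("intergenic", "A"): 0.3, ("exon", "C"): 0.8,
          ("exon", "G"): 1.2, ("intron", "T"): 0.5}
W_TRANS = {("intergenic", "exon"): -0.5, ("exon", "exon"): 1.0,
           ("exon", "intron"): -0.2}


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def score(x, y, w_emit, w_trans):
    """Unnormalized log-score: sum of feature weights along one labeling."""
    total = 0.0
    for i, (base, label) in enumerate(zip(x, y)):
        total += w_emit.get((label, base), 0.0)
        if i > 0:
            total += w_trans.get((y[i - 1], label), 0.0)
    return total


def log_partition(x, w_emit, w_trans):
    """log Z(x), summing over all labelings with the forward algorithm."""
    alpha = {s: w_emit.get((s, x[0]), 0.0) for s in STATES}
    for base in x[1:]:
        alpha = {
            s: w_emit.get((s, base), 0.0) + log_sum_exp(
                [alpha[p] + w_trans.get((p, s), 0.0) for p in STATES])
            for s in STATES
        }
    return log_sum_exp(list(alpha.values()))


def conditional_prob(x, y, w_emit, w_trans):
    """P(y|x) = exp(score(x, y) - log Z(x))."""
    return math.exp(score(x, y, w_emit, w_trans)
                    - log_partition(x, w_emit, w_trans))


def total_probability(x, w_emit, w_trans):
    """Sanity check: P(y|x) summed over every labeling should be 1."""
    return sum(conditional_prob(x, y, w_emit, w_trans)
               for y in itertools.product(STATES, repeat=len(x)))
```

Because the normalizer Z(x) depends only on the observed sequence, the weights need not come from a probability model. That is what lets a CRF mix probabilistic scores (such as log-likelihoods from a GHMM) with arbitrary non-probabilistic evidence in a single weighted sum.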
