nips nips2004 nips2004-52 nips2004-52-reference knowledge-graph by maker-knowledge-mining

52 nips-2004-Discrete profile alignment via constrained information bottleneck

Source: pdf

Author: Sean O'rourke, Gal Chechik, Robin Friedman, Eleazar Eskin

Abstract: Amino acid proﬁles, which capture position-speciﬁc mutation probabilities, are a richer encoding of biological sequences than the individual sequences themselves. However, proﬁle comparisons are much more computationally expensive than discrete symbol comparisons, making proﬁles impractical for many large datasets. Furthermore, because they are such a rich representation, proﬁles can be diﬃcult to visualize. To overcome these problems, we propose a discretization for proﬁles using an expanded alphabet representing not just individual amino acids, but common proﬁles. By using an extension of information bottleneck (IB) incorporating constraints and priors on the class distributions, we ﬁnd an informationally optimal alphabet. This discretization yields a concise, informative textual representation for proﬁle sequences. Also alignments between these sequences, while nearly as accurate as the full proﬁleproﬁle alignments, can be computed almost as quickly as those between individual or consensus sequences. A full pairwise alignment of SwissProt would take years using proﬁles, but less than 3 days using a discrete IB encoding, illustrating how discrete encoding can expand the range of sequence problems to which proﬁle information can be applied. 1

reference text

[1] Nir Friedman, Ori Mosenzon, Noam Slonim, and Naftali Tishby. Multivariate information bottleneck. In Uncertainty in Artiﬁcial Intelligence: Proceedings of the Seventeenth Conference (UAI-2001), pages 152–161, San Francisco, CA, 2001. Morgan Kaufmann Publishers.

[2] Crooks GE, Hon G, Chandonia JM, and Brenner SE. WebLogo: a sequence logo generator. Genome Research, in press, 2004.

[3] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia. SCOP: a structural classiﬁcation of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247:536–40, 1995.

[4] N.D. Rawlings, D.P. Tolle, and A.J. Barrett. MEROPS: the peptidase database. Nucleic Acids Res, 32 Database issue:D160–4, 2004.

[5] B. Rost and C. Sander. Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Bio., 232:584–99, 1993.

[6] Altschul SF, Gish W, Miller W, Myers EW, and Lipman DJ. Basic local alignment search tool. J Mol Biol, 215(3):403–10, October 1990.

[7] Noam Slonim. The Information Bottleneck: Theory and Applications. PhD thesis, Hebrew University, Jerusalem, Israel, 2002.

[8] T. F. Smith and M. S. Waterman. Identiﬁcation of common molecular subsequences. Journal of Molecular Biology, 147:195–197, 1981.

[9] Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. In Proc. of the 37-th Annual Allerton Conference on Communication, Control and Computing, pages 368–77, 1999.

[10] Golan Yona and Michael Levitt. Within the twilight zone: A sensitive proﬁleproﬁle comparison tool based on information theory. Journal of Molecular Biology, 315:1257–75, 2002.