nips nips2004 nips2004-52 nips2004-52-reference knowledge-graph by maker-knowledge-mining

52 nips-2004-Discrete profile alignment via constrained information bottleneck


Source: pdf

Author: Sean O'rourke, Gal Chechik, Robin Friedman, Eleazar Eskin

Abstract: Amino acid profiles, which capture position-specific mutation probabilities, are a richer encoding of biological sequences than the individual sequences themselves. However, profile comparisons are much more computationally expensive than discrete symbol comparisons, making profiles impractical for many large datasets. Furthermore, because they are such a rich representation, profiles can be difficult to visualize. To overcome these problems, we propose a discretization for profiles using an expanded alphabet representing not just individual amino acids, but common profiles. By using an extension of information bottleneck (IB) incorporating constraints and priors on the class distributions, we find an informationally optimal alphabet. This discretization yields a concise, informative textual representation for profile sequences. Also alignments between these sequences, while nearly as accurate as the full profileprofile alignments, can be computed almost as quickly as those between individual or consensus sequences. A full pairwise alignment of SwissProt would take years using profiles, but less than 3 days using a discrete IB encoding, illustrating how discrete encoding can expand the range of sequence problems to which profile information can be applied. 1


reference text

[1] Nir Friedman, Ori Mosenzon, Noam Slonim, and Naftali Tishby. Multivariate information bottleneck. In Uncertainty in Artificial Intelligence: Proceedings of the Seventeenth Conference (UAI-2001), pages 152–161, San Francisco, CA, 2001. Morgan Kaufmann Publishers.

[2] Crooks GE, Hon G, Chandonia JM, and Brenner SE. WebLogo: a sequence logo generator. Genome Research, in press, 2004.

[3] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247:536–40, 1995.

[4] N.D. Rawlings, D.P. Tolle, and A.J. Barrett. MEROPS: the peptidase database. Nucleic Acids Res, 32 Database issue:D160–4, 2004.

[5] B. Rost and C. Sander. Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Bio., 232:584–99, 1993.

[6] Altschul SF, Gish W, Miller W, Myers EW, and Lipman DJ. Basic local alignment search tool. J Mol Biol, 215(3):403–10, October 1990.

[7] Noam Slonim. The Information Bottleneck: Theory and Applications. PhD thesis, Hebrew University, Jerusalem, Israel, 2002.

[8] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195–197, 1981.

[9] Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. In Proc. of the 37-th Annual Allerton Conference on Communication, Control and Computing, pages 368–77, 1999.

[10] Golan Yona and Michael Levitt. Within the twilight zone: A sensitive profileprofile comparison tool based on information theory. Journal of Molecular Biology, 315:1257–75, 2002.