51 nips-2009-Clustering sequence sets for motif discovery

Source: pdf

Author: Jong K. Kim, Seungjin Choi

Abstract: Most existing methods for DNA motif discovery consider only a single set of sequences to find an over-represented motif. In contrast, we consider multiple sets of sequences and group the sets associated with the same motif into a cluster, assuming that each set involves a single motif. Clustering sets of sequences yields clusters of coherent motifs, improving the signal-to-noise ratio or enabling us to identify multiple motifs. We present a probabilistic model for DNA motif discovery that identifies multiple motifs by searching for patterns shared across multiple sets of sequences. Our model infers cluster-indicating latent variables and learns motifs simultaneously, and these two tasks interact with each other. We show that our model can handle various motif discovery problems, depending on how the multiple sets of sequences are constructed. Experiments on three different problems for discovering DNA motifs highlight the model's useful behavior and confirm substantial gains over existing methods that consider only a single set of sequences.
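The idea of grouping sequence sets that share a motif can be illustrated with a toy sketch. This is not the probabilistic model described in the abstract: it swaps in a deterministic heuristic that assigns each set to the k-mer found in the largest number of its sequences. The function names (`top_kmer`, `cluster_sets`), the motif width `k=6`, and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def top_kmer(seqs, k):
    """Return the k-mer present in the most sequences of one set.

    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter()
    for s in seqs:
        # Count each k-mer once per sequence, so a single repetitive
        # sequence cannot dominate the set.
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return min(counts, key=lambda m: (-counts[m], m))


def cluster_sets(sequence_sets, k):
    """Group set names by their most widely shared k-mer (a crude motif proxy)."""
    clusters = defaultdict(list)
    for name, seqs in sequence_sets.items():
        clusters[top_kmer(seqs, k)].append(name)
    return dict(clusters)


# Hypothetical toy data: two sets share one planted pattern, one set has another.
toy_sets = {
    "set_a": ["TTGACGTCAA", "CCGACGTCGG", "AGACGTCTTT"],
    "set_b": ["GGGACGTCAT", "TGACGTCCCA"],
    "set_c": ["AATATAAACC", "GGTATAAAGT", "CTATAAATGC"],
}
clusters = cluster_sets(toy_sets, k=6)
print(clusters)
```

On this toy input, `set_a` and `set_b` end up in the same cluster because they share the planted pattern, while `set_c` forms its own. In the model the abstract describes, cluster assignments and motifs are instead inferred jointly, with motifs represented probabilistically, not as exact k-mers.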

reference text

[1] G. D. Stormo. DNA binding sites: representation and discovery. Bioinformatics, 16:16–23, 2000.

[2] W. W. Wasserman and A. Sandelin. Applied bioinformatics for the identification of regulatory elements. Nature Reviews Genetics, 5:276–287, 2004.

[3] E. Segal, Y. Barash, I. Simon, N. Friedman, and D. Koller. From promoter sequence to expression: a probabilistic framework. In Proceedings of the International Conference on Research in Computational Molecular Biology, pages 263–272, 2002.

[4] G. Badis, M. F. Berger, A. A. Philippakis, S. Talukder, A. R. Gehrke, S. A. Jaeger, E. T. Chan, G. Metzler, A. Vedenko, X. Chen, H. Kuznetsov, C. F. Wang, D. Coburn, D. E. Newburger, Q. Morris, T. R. Hughes, and M. L. Bulyk. Diversity and complexity in DNA recognition by transcription factors. Science, 324:1720–1723, 2009.

[5] T. L. Bailey and C. Elkan. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In Proceedings of the International Conference on Intelligent Systems for Molecular Biology, 1994.

[6] T. L. Bailey and C. Elkan. The value of prior knowledge in discovering motifs with MEME. In Proceedings of the International Conference on Intelligent Systems for Molecular Biology, 1995.

[7] C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, and J. C. Wootton. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262:208–214, 1993.

[8] J. S. Liu, A. F. Neuwald, and C. E. Lawrence. Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. Journal of the American Statistical Association, 90:1156–1170, 1995.

[9] S. T. Jensen and J. S. Liu. Bayesian clustering of transcription factor binding motifs. Journal of the American Statistical Association, 103:188–200, 2008.

[10] C. T. Harbison, D. B. Gordon, T. I. Lee, N. J. Rinaldi, K. D. Macisaac, T. W. Danford, N. M. Hannett, J. B. Tagne, D. B. Reynolds, J. Yoo, E. G. Jennings, J. Zeitlinger, D. K. Pokholok, M. Kellis, P. A. Rolfe, K. T. Takusagawa, E. S. Lander, D. K. Gifford, E. Fraenkel, and R. A. Young. Transcriptional regulatory code of a eukaryotic genome. Nature, 431:99–104, 2004.

[11] R. Gordan, L. Narlikar, and A. J. Hartemink. A fast, alignment-free, conservation-based method for transcription factor binding site discovery. In Proceedings of the International Conference on Research in Computational Molecular Biology, pages 98–111, 2008.

[12] A. Siepel, G. Bejerano, J. S. Pedersen, A. S. Hinrichs, M. Hou, K. Rosenbloom, H. Clawson, J. Spieth, L. W. Hillier, S. Richards, G. M. Weinstock, R. K. Wilson, R. A. Gibbs, W. J. Kent, W. Miller, and D. Haussler. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Research, 15:1034–1050, 2005.

[13] D. S. Johnson, A. Mortazavi, R. M. Myers, and B. Wold. Genome-wide mapping of in vivo protein-DNA interactions. Science, 316:1497–1502, 2007.

[14] L. Narlikar, R. Gordan, and A. J. Hartemink. Nucleosome occupancy information improves de novo motif discovery. In Proceedings of the International Conference on Research in Computational Molecular Biology, pages 107–121, 2007.

[15] S. Kim, Z. Wang, and M. Dalkilic. iGibbs: improving Gibbs motif sampler for proteins by sequence clustering and iterative pattern sampling. Proteins, 66:671–681, 2007.

[16] F. P. Roth, J. D. Hughes, P. W. Estep, and G. M. Church. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nature Biotechnology, 16:939–945, 1998.

[17] X. S. Liu, D. L. Brutlag, and J. S. Liu. An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nature Biotechnology, 20:835–839, 2002.

[18] E. Eden, D. Lipson, S. Yogev, and Z. Yakhini. Discovering motifs in ranked lists of DNA sequences. PLoS Computational Biology, 3:e39, 2007.

[19] T. Wang and G. D. Stormo. Combining phylogenetic data with co-regulated genes to identify regulatory motifs. Bioinformatics, 19:2369–2380, 2003.

[20] S. Sinha, M. Blanchette, and M. Tompa. PhyME: a probabilistic algorithm for finding motifs in sets of orthologous sequences. BMC Bioinformatics, 5:170, 2004.

[21] R. Siddharthan, E. D. Siggia, and E. van Nimwegen. PhyloGibbs: a Gibbs sampling motif finder that incorporates phylogeny. PLoS Computational Biology, 1:e67, 2005.