jmlr jmlr2009 jmlr2009-45 jmlr2009-45-reference knowledge-graph by maker-knowledge-mining

45 jmlr-2009-Learning Approximate Sequential Patterns for Classification

Source: pdf

Author: Zeeshan Syed, Piotr Indyk, John Guttag

Abstract: In this paper, we present an automated approach to discover patterns that can distinguish between sequences belonging to different labeled groups. Our method searches for approximately conserved motifs that occur with varying statistical properties in positive and negative training examples. We propose a two-step process to discover such patterns. Using locality sensitive hashing (LSH), we ﬁrst estimate the frequency of all subsequences and their approximate matches within a given Hamming radius in labeled examples. The discriminative ability of each pattern is then assessed from the estimated frequencies by concordance and rank sum testing. The use of LSH to identify approximate matches for each candidate pattern helps reduce the runtime of our method. Space requirements are reduced by decomposing the search problem into an iterative method that uses a single LSH table in memory. We propose two further optimizations to the search for discriminative patterns. Clustering with redundancy based on a 2-approximate solution of the k-center problem decreases the number of overlapping approximate groups while providing exhaustive coverage of the search space. Sequential statistical methods allow the search process to use data from only as many training examples as are needed to assess signiﬁcance. We evaluated our algorithm on data sets from different applications to discover sequential patterns for classiﬁcation. On nucleotide sequences from the Drosophila genome compared with random background sequences, our method was able to discover approximate binding sites that were preserved upstream of genes. We observed a similar result in experiments on ChIP-on-chip data. For cardiovascular data from patients admitted with acute coronary syndromes, our pattern discovery approach identiﬁed approximately conserved sequences of morphology variations that were predictive of future death in a test population. Our data showed that the use of LSH, clustering, and sequential statistic

reference text

Timothy L. Bailey and Charles Elkan. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In International Conference on Intelligent Systems in Molecular Biology, pages 28–36, 1994. Yosef Barash, Gill Bejerano, and Nir Friedman. A simple hyper-geometric approach for discovering putative transcription factor binding sites. Algorithms in Bioinformatics, pages 278–293, 2001. Seraﬁm Batzoglou, Lior Pachter, Jill P. Mesirov, Bonnie Berger, and Eric S. Lander. Human and mouse gene structure: comparative analysis and its application to exon prediction. Genome Research, pages 950–958, 2000. Martin Bland and Douglas G. Altman. Multiple signiﬁcance tests: the bonferroni method. British Medical Journal, page 170, 1995. Alvis Brazma, Inge Jonassen, Ingvar Eidhammer, and David Gilber. Approaches to the automatic discovery of patterns in biosequences. Journal of Computational Biology, pages 279–305, 1998. Jeremy Buhler. Efﬁcient large-scale sequence comparison by locality-sensitive hashing. Bioinformatics, pages 419–428, 2001. Jeremy Buhler and Martin Tompa. Finding motifs using random projections. Journal of Computational Biology, pages 225–242, 2002. Moises Burset and Roderic Guigo. Evaluation of gene structure prediction programs. Genomics, pages 353–367, 1996. Stuart Daw, Charles E. Finney, and Eugene R. Tracy. A review of symbolic analysis of experimental data. Review of Scientiﬁc Instruments, pages 915–930, 2003. 1933 S YED , I NDYK AND G UTTAG Arthur L. Delcher, Simon Kasif, Robert D. Fleischmann, Jeremy Peterson, Owen White, and Steven L. Salzberg. Alignment of whole genomes. Nucleic Acids Research, pages 2369–2376, 1999. Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classiﬁcation. Wiley-Interscience, 2000. Eleazar Eskin and Pavel Pevzner. Finding composite regulatory patterns in dna sequences. Bioinformatics, pages 354–363, 2002. Jean D. Gibbons. Nonparametric Statistical Inference. Marcel Dekker, 1985. William N. Grundy, Timothy L. Bailey, Charles P. Elkan, and Michael E. Baker. Meta-meme: motifbased hidden markov models of protein families. Computer Applications in the Biosciences, pages 397–406, 1997. Jaiwei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kauffman, 2005. James A. Hanley and Barbara J. McNeil. The meaning and use of the area under a receiver operating characteristic (roc) curve. Radiology, pages 29–36, 1982. Christopher T. Harbison, Benjamin Gordon, Tong I. Lee, Nicola J. Rinaldi, Kenzie D. Macisaac, Timothy W. Danford, Nancy M. Hannett, Jean-Bosco Tange, David B. Reynoalds, Jane Yoo, Ezra G. Jennings, Julia Zeitlingen, Dmitry K. Pokholok, Manolis Kellis, Alex Rolfe, Ken T. Takusagawa, Eric S. Lander, David K. Gifford, Ernest Fraenkel, and Richard A. Young. Transcriptional regulatory code of a eukaryotic genome. Nature, pages 99–104, 2004. Taher H. Haveliwala, Aristides Gionis, and Piotr Indyk. Scalable techniques for clustering the web. In WebDB Workshop, 2000. Dorit S. Hochbaum and David B. Shmoys. A best possible heuristic for the k-center problem. Mathematics of Operations Research, pages 180–184, 1985. Myles Hollander and Douglas A. Wolfe. Nonparametric Statistical Methods. John Wiley and Sons, 1999. Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Symposium on Theory of Computing, pages 604–613, 1998. Tommi Jaakkola, Mark Diekhans, and David Haussler. Using the ﬁsher kernel method to detect remote protein homologies. In Seventh International Conference on Intelligent Systems for Molecular Biology, pages 149–158, 1999. Manolis Kellis, Nick Patterson, Bruce Birren, Bonnie Berger, and Eric S. Lander. Methods in comparative genomics: genome correspondence, gene identiﬁcation and regulatory motif discovery. Journal of Computational Biology, pages 319–355, 2004. Anders Krogh. Hidden markov models for labeled sequences. In IEEE International Conference on Pattern Recognition, pages 140–144, 1994. 1934 L EARNING A PPROXIMATE S EQUENTIAL PATTERNS FOR C LASSIFICATION Charles E. Lawrence, Stephen F. Altschul, Mark S. Boguski, Jun S. Liu, Andrew F. Neuwald, and John C. Wootton. Detecting subtle sequence signals: a gibbs sampling strategy for multiple alignment. Science, pages 208–214, 1993. Erich L. Lehmann. Nonparametrics: Statistical Methods Based on Ranks. McGraw-Hill, 1975. Christina Leslie, Eleazar Eskin, Jason Weston, and William S. Noble. Mismatch string kernels for svm protein classiﬁcation. Advances in Neural Information Processing Systems, 2003. Henry Leung and Francis Chin. Finding motifs from all sequences with and without binding sites. Bioinformatics, pages 2217–2223, 2006. Jiuyong Li, Ada W. Fu, Hongxing He, Jie Chen, Huidong Jin, Damien McAullay, Graham Williams, Ross Sparks, and Chris Kelman. Mining risk patterns in medical data. In ACM Conference on Knowledge Discovery in Data, pages 770–775, 2005. Nan Li and Martin Tompa. Analysis of computational approaches for motif discovery. Algorithms for Molecular Biology, page 8, 2006. Jessica Lin, Eamonn Keogh, Stefano Lonardi, and Bill Chiu. A symbolic representation of time series, with implications for streaming algorithms. In ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2003. Shirley Liu, Douglas L Brutlag, and Jun S. Liu. Bioprospector: discovering conserved dna motifs in upstream regulatory regions of co-expressed genes. In Paciﬁc Symposium on Biocomputing, pages 127–138, 2001. Qin Lv, William Josephson, Zhe Wang, Moses Charikar, and Kai Li. Multi-probe lsh: efﬁcient indexing for high-dimensional similarity search. In Very Large Data Bases, pages 950–961, 2007. Bamshad Mobasher, Namit Jain, Eui-Hong Han, and Jaideep Srivasta. Web mining: pattern discovery from world wide web transactions. Department of Computer Science, University of Minnesota Technical Report TR-96-050, 1996. Giulio Pavesi, Giancarlo Mauri, and Graziano Pesole. An algorithm for ﬁnding signals of unknown length in dna sequences. Bioinformatics, pages 207–214, 2001. Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000. Pavel A. Pevzner and Sing-Hoi Sze. Combinatorial approaches to ﬁnding subtle signal in dna sequences. In International Conference on Intelligent Systems for Molecular Biology, pages 269–278, 2000. Ravi M. Phatarfod and Aidan Sudbury. A simple sequential wilcoxon test. Australian Journal of Statistics, pages 93–106, 1988. Lawrence R. Rabiner, Aaron E. Rosenberg, and Stephen E. Levinson. Considerations in dynamic time warping algorithms for discrete word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, pages 575–582, 1978. 1935 S YED , I NDYK AND G UTTAG Emma Redhead and Timothy L. Bailey. Discriminative motif discovery in dna and protein sequences using the deme algorithm. BMC Bioinformatics, page 385, 2007. Marion R. Reynolds. A sequential signed-rank test for symmetry. The Annals of Statistics, pages 382–400, 1975. Michael J. Shaw, Chandrasekar Subramaniam, Gek W. Tan, and Michael E. Welge. Knowledge management and data mining for marketing. Decision Support Systems, pages 127–137, 2001. Rahul Siddharthan, Eric D. Siggia, and Erik van Nimwegen. Phylogibbs: a gibbs sampling motif ﬁnder that incorporates phylogeny. PLoS Computational Biology, pages 534–556, 2005. Saurabh Sinha and Martin Tompa. Ymf: a program for discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Research, pages 3586–3588, 2003. Zeeshan Syed, John V. Guttag, and Collin M. Stultz. Clustering and symbolic analysis in cardiovascular signals: discovery and visualization of medically relevant patterns in long-term data using limited prior knowledge. EURASIP Journal on Advances in Signal Processing, 2007. Zeeshan Syed, Benjamin M. Scirica, Collin M. Stultz, and John V. Guttag. Risk-stratiﬁcation following acute coronary syndromes using a novel electrocardiographic technique to measure variability in morphology. In Computers in Cardiology, 2008. Saeed Tavazoie, Jason D. Hughes, Michael J. Campbell, Raymond J. Cho, and George M. Church. Systematic determination of genetic network architecture. Nature Genetics, pages 281–285, 1999. Martin Tompa, Nan Li, Timothy L. Bailey, George M. Church, Bart D. Moor, Eleazer Eskin, Alexander V. Favorov, Martin C. Frith, Yutao Fu, James Kent, Vsevolod J. Makeev, Andrei A. Mironov, William S. Noble, Giulio Pavesi, Graziano Pesole, Mireille Regnier, Nicolas Simonis, Saurabh Sinha, Gert Thijs, Jacques V. Helden, Mathias Vandenbogaert, Zhiping Weng, Christopher Workman, Chun Ye, and Zhou Zhu. Assessing computational tools for the discovery of transcription factor binding sites. Nature Biotechnology, pages 137–144, 2005. Jason T.L. Wang, Bruce A. Shapiro, and Dennis E. Shasha. Pattern Discovery in Biomolecular Data: Tools, Techniques, and Applications. Oxford University Press, 1999. Frank Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, pages 80–83, 1945. Edgar Wingender, Peter Dietze, Holger Karas, and Rainer Knuppel. Transfac: a database on transcription factors and their dna binding sites. Nucleic Acids Research, pages 238–241, 1996. 1936