jmlr jmlr2008 jmlr2008-55 jmlr2008-55-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Konrad Rieck, Pavel Laskov
Abstract: Efficient and expressive comparison of sequences is an essential procedure for learning with sequential data. In this article we propose a generic framework for computation of similarity measures for sequences, covering various kernel, distance and non-metric similarity functions. The basis for comparison is embedding of sequences using a formal language, such as a set of natural words, k-grams or all contiguous subsequences. As realizations of the framework we provide linear-time algorithms of different complexity and capabilities using sorted arrays, tries and suffix trees as underlying data structures. Experiments on data sets from bioinformatics, text processing and computer security illustrate the efficiency of the proposed algorithms—enabling peak performances of up to 106 pairwise comparisons per second. The utility of distances and non-metric similarity measures for sequences as alternatives to string kernels is demonstrated in applications of text categorization, network intrusion detection and transcription site recognition in DNA. Keywords: string kernels, string distances, learning with sequential data
M. Abouelhoda, S. Kurtz, and E. Ohlebusch. Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms, 2(1):53–86, 2004. 43 R IECK AND L ASKOV M. Anderberg. Cluster Analysis for Applications. Academic Press, Inc., New York, NY, USA, 1973. B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of COLT, pages 144–152. ACM Press, 1992. W. B. Cavnar and J. M. Trenkle. N-gram-based text categorization. In Proceedings of SDAIR, pages 161–175, Las Vegas, NV, USA., Apr. 1994. O. Chapelle, P. Haffner, and V. Vapnik. SVMs for histogram-based image classification. IEEE Transaction on Neural Networks, 10(5):1055–1064, 1999. V. Cherkassky, S. Xuhui, F. Mulier, and V. Vapnik. Model complexity control for regression using vc generalization bounds. IEEE Transactions on Neural Networks, 10(5):1075–1089, 1999. M. Collins and N. Duffy. Convolution kernel for natural language. In Advances in Neural Information Proccessing Systems, pages 625–632, 2002. M. Damashek. Gauging similarity with n-grams: Language-independent categorization of text. Science, 267(5199):843–848, 1995. H. Drucker, D. Wu, and V. Vapnik. Support vector machines for spam categorization. IEEE Transactions on Neural Networks, 10(5):1048–1054, 1999. R. Duda, P.E.Hart, and D.G.Stork. Pattern Classification. John Wiley & Sons, second edition, 2001. E. Fredkin. Trie memory. Communications of ACM, 3(9):490–499, 1960. T. G¨ rtner, J. Lloyd, and P. Flach. Kernels and distances for structured data. Machine Learning, 57 a (3):205–232, 2004. T. Graepel, R. Herbrich, P. Bollmann-Sdorra, and K. Obermayer. Classification on pairwise proximity data. In Advances in Neural Information Processing Systems, volume 11, 1999. D. Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997. B. Haasdonk. Feature space interpretation of svms with indefinite kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(4):482–492, 2005. R. W. Hamming. Error-detecting and error-correcting codes. Bell System Technical Journal, 29(2): 147–160, 1950. S. Harmeling, A. Ziehe, M. Kawanabe, and K.-R. M¨ ller. Kernel-based nonlinear blind source u separation. Neural Computation, 15:1089–1124, 2003. S. Harmeling, G. Dornhege, D. Tax, F. C. Meinecke, and K.-R. M¨ ller. From outliers to prototypes: u ordering data. Neurocomputing, 69(13–15):1608–1618, 2006. J. Hopcroft and J. Motwani, R. Ullmann. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, 2 edition, 2001. T. Jaakkola, M. Diekhans, and D. Haussler. A discriminative framework for detecting remote protein homologies. Journal of Computational Biology, 7:95–114, 2000. 44 S IMILARITY M EASURES FOR S EQUENTIAL DATA D. Jacobs, D. Weinshall, and Y. Gdalyahu. Classification with nonmetric distances: Image retrieval and class representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22 (6):583–600, 2000. T. Joachims. Learning to Classify Text using Support Vector Machines. Kluwer, 2002. T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of ECML, pages 137 – 142. Springer, 1998. T. Kasai, H. Ariumar, and A. Setsuo. Efficient substring traversal with suffix arrays. Technical report, 185, Department of Informatics, Kyushu University, 2001a. T. Kasai, G. Lee, H. Arimura, S. Arikawa, and K. Park. Linear-time longest-common-prefix computation in suffix arrays and its applications. In Combinatorial Pattern Matching (CPM), 12th Annual Symposium, pages 181–192, 2001b. D. Knuth. The Art of Computer Programming, volume 3. Addison-Wesley, 1973. J. Laub and K.-R. M¨ ller. Feature discovery in non-metric pairwise data. Journal of Machine u Learning, 5(Jul):801–818, July 2004. J. Laub, V. Roth, J. Buhmann, and K.-R. M¨ ller. On the information and representation of nonu euclidean pairwise data. Pattern Recognition, 39(10):1815–1826, Oct. 2006. C. Leslie and R. Kuang. Fast string kernels using inexact matching for protein sequences. Journal of Machine Learning Research, 5:1435–1455, 2004. C. Leslie, E. Eskin, and W. Noble. The spectrum kernel: A string kernel for SVM protein classification. In Proc. Pacific Symp. Biocomputing, pages 564–575, 2002. C. Leslie, E. Eskin, A. Cohen, J. Weston, and W. Noble. Mismatch string kernel for discriminative protein classification. Bioinformatics, 1(1):1–10, 2003. V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Doklady Akademii Nauk SSSR, 163(4):845–848, 1966. D. Lewis. Reuters-21578 text categorization test collection. AT&T; Labs Research, 1997. L. Liao and W. Noble. Combining pairwise sequence similiarity and support vector machines for detecting remote protein evolutionary and structural relationships. Journal of Computational Biology, 10:857–868, 2003. R. Lippmann, J. Haines, D. Fried, J. Korba, and K. Das. The 1999 DARPA off-line intrusion detection evaluation. Computer Networks, 34(4):579–595, 2000. H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444, 2002. U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948, 1993. 45 R IECK AND L ASKOV M. Maniscalco and S. Puglisi. An efficient, versatile approach to suffix sorting. Journal of Experimental Algorithmics, 12, Article No. 1.2, 2007. G. Manzini and P. Ferragina. Engineering a lightweight suffix array construction algorithm. Algorithmica, 40:33–50, 2004. W. Masek and M. Patterson. A faster algorithm for computing string edit distance. Journal of Computer and System sciences, 20(1):18–31, 1980. E. M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23 (2):262–272, 1976. J. McHugh. Testing intrusion detection systems: a critique of the 1998 and 1999 DARPA intrusion detection system evaluations as performed by Lincoln Laboratory. ACM Transactions on Information Systems Security, 3(4):262–294, 2000. V. Metsis, G. Androutsopoulos, and G. Paliouras. Spam filtering with naive bayes - which naive bayes? In Proc. of the 3rd Conference on Email and Anti-Spam (CEAS), 2006. K.-R. M¨ ller, S. Mika, G. R¨ tsch, K. Tsuda, and B. Sch¨ lkopf. An introduction to kernel-based u a o learning algorithms. IEEE Neural Networks, 12(2):181–201, May 2001. S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarties in the amino acid sequence of two proteins. Journal of Molecular Biology, 48:443–453, 1970. F. Odone, A. Barla, and A. Verri. Building kernels from binary strings for image matching. IEEE Transactions on Image Processing, 14(2):169–180, 2005. C. O’Donovan, M. Martin, A. Gattiker, E. Gasteiger, A. Bairoch, and R. Apweiler. High-quality protein knowledge resource: SWISS-PROT and TrEMBL. Briefings in Bioinformatics, 3(3): 275–284, 2002. C. Ong, X. Mary, S. Canu, and A. Smola. Learning with non-positive kernels. In R. Greiner and D. Schuurmans, editors, Proceedings of ICML, pages 639–646. ACM Press, 2004. G. R¨ tsch and S. Sonnenburg. Accurate splice site prediction for caenorhabditis elegans. In Kernel a Methods in Computational Biology, pages 277–298. MIT Press, 2004. G. R¨ tsch, S. Sonnenburg, J. Srinivasan, H. Witte, R. Sommer, K.-R. M¨ ller, and B. Sch¨ lkopf. Ima u o proving the c. elegans genome annotation using machine learning. PLoS Computational Biology, 3:e20, 2007. K. Rieck and P. Laskov. Language models for detection of unknown attacks in network traffic. Journal in Computer Virology, 2(4):243–256, 2007. K. Rieck, P. Laskov, and K.-R. M¨ ller. Efficient algorithms for similarity measures over sequential u data: A look beyond kernels. In Pattern Recognition, Proc. of 28th DAGM Symposium, LNCS, pages 374–383, Sept. 2006. 46 S IMILARITY M EASURES FOR S EQUENTIAL DATA K. Rieck, P. Laskov, and S. Sonnenburg. Computation of similarity measures for sequential data using generalized suffix trees. In Advances in Neural Information Processing Systems 19, pages 1177–1184, Cambridge, MA, 2007. MIT Press. V. Roth, J. Laub, M. Kawanabe, and J. Buhmann. Optimal cluster preserving embedding of nonmetric proximity data. IEEE Trans. PAMI, 25:1540–1551, Dec. 2003. J. Rousu and J. Shawe-Taylor. Efficient computation of gapped substring kernels for large alphabets. Journal of Machine Leaning Research, 6:1323–1344, 2005. G. Salton. Mathematics and information retrieval. Journal of Documentation, 35(1):1–29, 1979. G. Salton, A. Wong, and C. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975. D. Sankoff and J. Kruskal. Time wraps, String edits, and Macromulecules: The Theory and Practice of Sequence Comparison. Addision-Wesley Publishing Co., 1983. B. Sch¨ lkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002. o B. Sch¨ lkopf, P. Simard, A. Smola, and V. Vapnik. Prior knowledge in support vector kernels. In o Advances in Neural Information Processing Systems, volume 10, pages 640–646, Cambridge, MA, 1998a. MIT Press. B. Sch¨ lkopf, A. Smola, and K.-R. M¨ ller. Nonlinear component analysis as a kernel eigenvalue o u problem. Neural Computation, 10:1299–1319, 1998b. J. Shawe-Taylor and N. Cristianini. Kernel methods for pattern analysis. Cambridge University Press, 2004. T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195–197, 1981. R. Sokal and P. Sneath. Principles of Numerical Taxonomy. W.H. Freeman and Company, San Francisco, CA, USA, 1963. S. Sonnenburg, A. Zien, and G. R¨ tsch. ARTS: Accurate Recognition of Transcription Starts in a Human. Bioinformatics, 22(14):e472–e480, 2006. S. Sonnenburg, G. R¨ tsch, and K. Rieck. Large scale learning with string kernels. In Large Scale a Kernel Machines. MIT Press, 2007. C. Y. Suen. N-gram statistics for natural language understanding and text processing. IEEE Trans. Pattern Analysis and Machine Intelligence, 1(2):164–172, Apr. 1979. M. Swain and D. Ballard. Color indexing. International Journal of Computer Vision, 7(1), 1991. C. Teo and S. Vishwanathan. Fast and space efficient string kernels using suffix arrays. In Proceedings of ICML, pages 939–936. ACM Press, 2006. K. Tsuda, M. Kawanabe, G. R¨ tsch, S. Sonnenburg, and K. M¨ ller. A new discriminative kernel a u from probabilistic models. Neural Computation, 14:2397–2414, 2002. 47 R IECK AND L ASKOV E. Ukkonen. Online construction of suffix trees. Algorithmica, 14(3):249–260, 1995. V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995. J.-P. Vert, H. Saigo, and T. Akutsu. Local alignment kernels for biological sequences. In Kernel methods in Computational Biology, pages 131–154. MIT Press, 2004. S. Vishwanathan and A. Smola. Fast kernels for string and tree matching. In K. Tsuda, B. Sch¨ lkopf, o and J. Vert, editors, Kernels and Bioinformatics, pages 113–130. MIT Press, 2004. U. von Luxburg and O. Bousquet. Distance-based classification with Lipschitz functions. Journal of Machine Learning Research, 5:669–695, 2004. C. Watkins. Dynamic alignment kernels. In A. Smola, P. Bartlett, B. Sch¨ lkopf, and D. Schuurmans, o editors, Advances in Large Margin Classifiers, pages 39–50, Cambridge, MA, 2000. MIT Press. A. Webb. Statistical Pattern Recognition. John Wiley and Sons Ltd., 2002. P. Weiner. Linear pattern matching algorithms. In Proc. 14th Annual Symposium on Switching and Automata Theory, pages 1–11, 1973. A. Zien, G. R¨ tsch, S. Mika, B. Sch¨ lkopf, T. Lengauer, and K.-R. M¨ ller. Engineering Support a o u Vector Machine Kernels That Recognize Translation Initiation Sites. BioInformatics, 16(9):799– 807, Sept. 2000. 48