jmlr jmlr2008 jmlr2008-55 jmlr2008-55-reference knowledge-graph by maker-knowledge-mining

55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data

Source: pdf

Author: Konrad Rieck, Pavel Laskov

Abstract: Efﬁcient and expressive comparison of sequences is an essential procedure for learning with sequential data. In this article we propose a generic framework for computation of similarity measures for sequences, covering various kernel, distance and non-metric similarity functions. The basis for comparison is embedding of sequences using a formal language, such as a set of natural words, k-grams or all contiguous subsequences. As realizations of the framework we provide linear-time algorithms of different complexity and capabilities using sorted arrays, tries and sufﬁx trees as underlying data structures. Experiments on data sets from bioinformatics, text processing and computer security illustrate the efﬁciency of the proposed algorithms—enabling peak performances of up to 106 pairwise comparisons per second. The utility of distances and non-metric similarity measures for sequences as alternatives to string kernels is demonstrated in applications of text categorization, network intrusion detection and transcription site recognition in DNA. Keywords: string kernels, string distances, learning with sequential data

reference text

M. Abouelhoda, S. Kurtz, and E. Ohlebusch. Replacing sufﬁx trees with enhanced sufﬁx arrays. Journal of Discrete Algorithms, 2(1):53–86, 2004. 43 R IECK AND L ASKOV M. Anderberg. Cluster Analysis for Applications. Academic Press, Inc., New York, NY, USA, 1973. B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classiﬁers. In D. Haussler, editor, Proceedings of COLT, pages 144–152. ACM Press, 1992. W. B. Cavnar and J. M. Trenkle. N-gram-based text categorization. In Proceedings of SDAIR, pages 161–175, Las Vegas, NV, USA., Apr. 1994. O. Chapelle, P. Haffner, and V. Vapnik. SVMs for histogram-based image classiﬁcation. IEEE Transaction on Neural Networks, 10(5):1055–1064, 1999. V. Cherkassky, S. Xuhui, F. Mulier, and V. Vapnik. Model complexity control for regression using vc generalization bounds. IEEE Transactions on Neural Networks, 10(5):1075–1089, 1999. M. Collins and N. Duffy. Convolution kernel for natural language. In Advances in Neural Information Proccessing Systems, pages 625–632, 2002. M. Damashek. Gauging similarity with n-grams: Language-independent categorization of text. Science, 267(5199):843–848, 1995. H. Drucker, D. Wu, and V. Vapnik. Support vector machines for spam categorization. IEEE Transactions on Neural Networks, 10(5):1048–1054, 1999. R. Duda, P.E.Hart, and D.G.Stork. Pattern Classiﬁcation. John Wiley & Sons, second edition, 2001. E. Fredkin. Trie memory. Communications of ACM, 3(9):490–499, 1960. T. G¨ rtner, J. Lloyd, and P. Flach. Kernels and distances for structured data. Machine Learning, 57 a (3):205–232, 2004. T. Graepel, R. Herbrich, P. Bollmann-Sdorra, and K. Obermayer. Classiﬁcation on pairwise proximity data. In Advances in Neural Information Processing Systems, volume 11, 1999. D. Gusﬁeld. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997. B. Haasdonk. Feature space interpretation of svms with indeﬁnite kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(4):482–492, 2005. R. W. Hamming. Error-detecting and error-correcting codes. Bell System Technical Journal, 29(2): 147–160, 1950. S. Harmeling, A. Ziehe, M. Kawanabe, and K.-R. M¨ ller. Kernel-based nonlinear blind source u separation. Neural Computation, 15:1089–1124, 2003. S. Harmeling, G. Dornhege, D. Tax, F. C. Meinecke, and K.-R. M¨ ller. From outliers to prototypes: u ordering data. Neurocomputing, 69(13–15):1608–1618, 2006. J. Hopcroft and J. Motwani, R. Ullmann. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, 2 edition, 2001. T. Jaakkola, M. Diekhans, and D. Haussler. A discriminative framework for detecting remote protein homologies. Journal of Computational Biology, 7:95–114, 2000. 44 S IMILARITY M EASURES FOR S EQUENTIAL DATA D. Jacobs, D. Weinshall, and Y. Gdalyahu. Classiﬁcation with nonmetric distances: Image retrieval and class representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22 (6):583–600, 2000. T. Joachims. Learning to Classify Text using Support Vector Machines. Kluwer, 2002. T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of ECML, pages 137 – 142. Springer, 1998. T. Kasai, H. Ariumar, and A. Setsuo. Efﬁcient substring traversal with sufﬁx arrays. Technical report, 185, Department of Informatics, Kyushu University, 2001a. T. Kasai, G. Lee, H. Arimura, S. Arikawa, and K. Park. Linear-time longest-common-preﬁx computation in sufﬁx arrays and its applications. In Combinatorial Pattern Matching (CPM), 12th Annual Symposium, pages 181–192, 2001b. D. Knuth. The Art of Computer Programming, volume 3. Addison-Wesley, 1973. J. Laub and K.-R. M¨ ller. Feature discovery in non-metric pairwise data. Journal of Machine u Learning, 5(Jul):801–818, July 2004. J. Laub, V. Roth, J. Buhmann, and K.-R. M¨ ller. On the information and representation of nonu euclidean pairwise data. Pattern Recognition, 39(10):1815–1826, Oct. 2006. C. Leslie and R. Kuang. Fast string kernels using inexact matching for protein sequences. Journal of Machine Learning Research, 5:1435–1455, 2004. C. Leslie, E. Eskin, and W. Noble. The spectrum kernel: A string kernel for SVM protein classiﬁcation. In Proc. Paciﬁc Symp. Biocomputing, pages 564–575, 2002. C. Leslie, E. Eskin, A. Cohen, J. Weston, and W. Noble. Mismatch string kernel for discriminative protein classiﬁcation. Bioinformatics, 1(1):1–10, 2003. V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Doklady Akademii Nauk SSSR, 163(4):845–848, 1966. D. Lewis. Reuters-21578 text categorization test collection. AT&T; Labs Research, 1997. L. Liao and W. Noble. Combining pairwise sequence similiarity and support vector machines for detecting remote protein evolutionary and structural relationships. Journal of Computational Biology, 10:857–868, 2003. R. Lippmann, J. Haines, D. Fried, J. Korba, and K. Das. The 1999 DARPA off-line intrusion detection evaluation. Computer Networks, 34(4):579–595, 2000. H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classiﬁcation using string kernels. Journal of Machine Learning Research, 2:419–444, 2002. U. Manber and G. Myers. Sufﬁx arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948, 1993. 45 R IECK AND L ASKOV M. Maniscalco and S. Puglisi. An efﬁcient, versatile approach to sufﬁx sorting. Journal of Experimental Algorithmics, 12, Article No. 1.2, 2007. G. Manzini and P. Ferragina. Engineering a lightweight sufﬁx array construction algorithm. Algorithmica, 40:33–50, 2004. W. Masek and M. Patterson. A faster algorithm for computing string edit distance. Journal of Computer and System sciences, 20(1):18–31, 1980. E. M. McCreight. A space-economical sufﬁx tree construction algorithm. Journal of the ACM, 23 (2):262–272, 1976. J. McHugh. Testing intrusion detection systems: a critique of the 1998 and 1999 DARPA intrusion detection system evaluations as performed by Lincoln Laboratory. ACM Transactions on Information Systems Security, 3(4):262–294, 2000. V. Metsis, G. Androutsopoulos, and G. Paliouras. Spam ﬁltering with naive bayes - which naive bayes? In Proc. of the 3rd Conference on Email and Anti-Spam (CEAS), 2006. K.-R. M¨ ller, S. Mika, G. R¨ tsch, K. Tsuda, and B. Sch¨ lkopf. An introduction to kernel-based u a o learning algorithms. IEEE Neural Networks, 12(2):181–201, May 2001. S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarties in the amino acid sequence of two proteins. Journal of Molecular Biology, 48:443–453, 1970. F. Odone, A. Barla, and A. Verri. Building kernels from binary strings for image matching. IEEE Transactions on Image Processing, 14(2):169–180, 2005. C. O’Donovan, M. Martin, A. Gattiker, E. Gasteiger, A. Bairoch, and R. Apweiler. High-quality protein knowledge resource: SWISS-PROT and TrEMBL. Brieﬁngs in Bioinformatics, 3(3): 275–284, 2002. C. Ong, X. Mary, S. Canu, and A. Smola. Learning with non-positive kernels. In R. Greiner and D. Schuurmans, editors, Proceedings of ICML, pages 639–646. ACM Press, 2004. G. R¨ tsch and S. Sonnenburg. Accurate splice site prediction for caenorhabditis elegans. In Kernel a Methods in Computational Biology, pages 277–298. MIT Press, 2004. G. R¨ tsch, S. Sonnenburg, J. Srinivasan, H. Witte, R. Sommer, K.-R. M¨ ller, and B. Sch¨ lkopf. Ima u o proving the c. elegans genome annotation using machine learning. PLoS Computational Biology, 3:e20, 2007. K. Rieck and P. Laskov. Language models for detection of unknown attacks in network trafﬁc. Journal in Computer Virology, 2(4):243–256, 2007. K. Rieck, P. Laskov, and K.-R. M¨ ller. Efﬁcient algorithms for similarity measures over sequential u data: A look beyond kernels. In Pattern Recognition, Proc. of 28th DAGM Symposium, LNCS, pages 374–383, Sept. 2006. 46 S IMILARITY M EASURES FOR S EQUENTIAL DATA K. Rieck, P. Laskov, and S. Sonnenburg. Computation of similarity measures for sequential data using generalized sufﬁx trees. In Advances in Neural Information Processing Systems 19, pages 1177–1184, Cambridge, MA, 2007. MIT Press. V. Roth, J. Laub, M. Kawanabe, and J. Buhmann. Optimal cluster preserving embedding of nonmetric proximity data. IEEE Trans. PAMI, 25:1540–1551, Dec. 2003. J. Rousu and J. Shawe-Taylor. Efﬁcient computation of gapped substring kernels for large alphabets. Journal of Machine Leaning Research, 6:1323–1344, 2005. G. Salton. Mathematics and information retrieval. Journal of Documentation, 35(1):1–29, 1979. G. Salton, A. Wong, and C. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975. D. Sankoff and J. Kruskal. Time wraps, String edits, and Macromulecules: The Theory and Practice of Sequence Comparison. Addision-Wesley Publishing Co., 1983. B. Sch¨ lkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002. o B. Sch¨ lkopf, P. Simard, A. Smola, and V. Vapnik. Prior knowledge in support vector kernels. In o Advances in Neural Information Processing Systems, volume 10, pages 640–646, Cambridge, MA, 1998a. MIT Press. B. Sch¨ lkopf, A. Smola, and K.-R. M¨ ller. Nonlinear component analysis as a kernel eigenvalue o u problem. Neural Computation, 10:1299–1319, 1998b. J. Shawe-Taylor and N. Cristianini. Kernel methods for pattern analysis. Cambridge University Press, 2004. T. F. Smith and M. S. Waterman. Identiﬁcation of common molecular subsequences. Journal of Molecular Biology, 147:195–197, 1981. R. Sokal and P. Sneath. Principles of Numerical Taxonomy. W.H. Freeman and Company, San Francisco, CA, USA, 1963. S. Sonnenburg, A. Zien, and G. R¨ tsch. ARTS: Accurate Recognition of Transcription Starts in a Human. Bioinformatics, 22(14):e472–e480, 2006. S. Sonnenburg, G. R¨ tsch, and K. Rieck. Large scale learning with string kernels. In Large Scale a Kernel Machines. MIT Press, 2007. C. Y. Suen. N-gram statistics for natural language understanding and text processing. IEEE Trans. Pattern Analysis and Machine Intelligence, 1(2):164–172, Apr. 1979. M. Swain and D. Ballard. Color indexing. International Journal of Computer Vision, 7(1), 1991. C. Teo and S. Vishwanathan. Fast and space efﬁcient string kernels using sufﬁx arrays. In Proceedings of ICML, pages 939–936. ACM Press, 2006. K. Tsuda, M. Kawanabe, G. R¨ tsch, S. Sonnenburg, and K. M¨ ller. A new discriminative kernel a u from probabilistic models. Neural Computation, 14:2397–2414, 2002. 47 R IECK AND L ASKOV E. Ukkonen. Online construction of sufﬁx trees. Algorithmica, 14(3):249–260, 1995. V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995. J.-P. Vert, H. Saigo, and T. Akutsu. Local alignment kernels for biological sequences. In Kernel methods in Computational Biology, pages 131–154. MIT Press, 2004. S. Vishwanathan and A. Smola. Fast kernels for string and tree matching. In K. Tsuda, B. Sch¨ lkopf, o and J. Vert, editors, Kernels and Bioinformatics, pages 113–130. MIT Press, 2004. U. von Luxburg and O. Bousquet. Distance-based classiﬁcation with Lipschitz functions. Journal of Machine Learning Research, 5:669–695, 2004. C. Watkins. Dynamic alignment kernels. In A. Smola, P. Bartlett, B. Sch¨ lkopf, and D. Schuurmans, o editors, Advances in Large Margin Classiﬁers, pages 39–50, Cambridge, MA, 2000. MIT Press. A. Webb. Statistical Pattern Recognition. John Wiley and Sons Ltd., 2002. P. Weiner. Linear pattern matching algorithms. In Proc. 14th Annual Symposium on Switching and Automata Theory, pages 1–11, 1973. A. Zien, G. R¨ tsch, S. Mika, B. Sch¨ lkopf, T. Lengauer, and K.-R. M¨ ller. Engineering Support a o u Vector Machine Kernels That Recognize Translation Initiation Sites. BioInformatics, 16(9):799– 807, Sept. 2000. 48