jmlr jmlr2005 jmlr2005-28 jmlr2005-28-reference knowledge-graph by maker-knowledge-mining

28 jmlr-2005-Efficient Computation of Gapped Substring Kernels on Large Alphabets


Source: pdf

Author: Juho Rousu, John Shawe-Taylor

Abstract: We present a sparse dynamic programming algorithm that, given two strings s and t, a gap penalty λ, and an integer p, computes the value of the gap-weighted length-p subsequences kernel. The algorithm works in time O(p|M| log |t|), where M = {(i, j)|si = t j } is the set of matches of characters in the two sequences. The algorithm is easily adapted to handle bounded length subsequences and different gap-penalty schemes, including penalizing by the total length of gaps and the number of gaps as well as incorporating character-specific match/gap penalties. The new algorithm is empirically evaluated against a full dynamic programming approach and a trie-based algorithm both on synthetic and newswire article data. Based on the experiments, the full dynamic programming approach is the fastest on short strings, and on long strings if the alphabet is small. On large alphabets, the new sparse dynamic programming algorithm is the most efficient. On medium-sized alphabets the trie-based approach is best if the maximum number of allowed gaps is strongly restricted. Keywords: kernel methods, string kernels, text categorization, sparse dynamic programming


reference text