jmlr jmlr2009 jmlr2009-45 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Zeeshan Syed, Piotr Indyk, John Guttag
Abstract: In this paper, we present an automated approach to discover patterns that can distinguish between sequences belonging to different labeled groups. Our method searches for approximately conserved motifs that occur with varying statistical properties in positive and negative training examples. We propose a two-step process to discover such patterns. Using locality sensitive hashing (LSH), we first estimate the frequency of all subsequences and their approximate matches within a given Hamming radius in labeled examples. The discriminative ability of each pattern is then assessed from the estimated frequencies by concordance and rank sum testing. The use of LSH to identify approximate matches for each candidate pattern helps reduce the runtime of our method. Space requirements are reduced by decomposing the search problem into an iterative method that uses a single LSH table in memory. We propose two further optimizations to the search for discriminative patterns. Clustering with redundancy based on a 2-approximate solution of the k-center problem decreases the number of overlapping approximate groups while providing exhaustive coverage of the search space. Sequential statistical methods allow the search process to use data from only as many training examples as are needed to assess significance. We evaluated our algorithm on data sets from different applications to discover sequential patterns for classification. On nucleotide sequences from the Drosophila genome compared with random background sequences, our method was able to discover approximate binding sites that were preserved upstream of genes. We observed a similar result in experiments on ChIP-on-chip data. For cardiovascular data from patients admitted with acute coronary syndromes, our pattern discovery approach identified approximately conserved sequences of morphology variations that were predictive of future death in a test population. Our data showed that the use of LSH, clustering, and sequential statistics significantly reduced the runtime of the search process.
Reference: text
sentIndex sentText sentNum sentScore
1 Using locality sensitive hashing (LSH), we first estimate the frequency of all subsequences and their approximate matches within a given Hamming radius in labeled examples. [sent-10, score-0.573]
2 The discriminative ability of each pattern is then assessed from the estimated frequencies by concordance and rank sum testing. [sent-11, score-0.332]
3 We evaluated our algorithm on data sets from different applications to discover sequential patterns for classification. [sent-17, score-0.398]
4 On nucleotide sequences from the Drosophila genome compared with random background sequences, our method was able to discover approximate binding sites that were preserved upstream of genes. [sent-18, score-0.551]
5 For cardiovascular data from patients admitted with acute coronary syndromes, our pattern discovery approach identified approximately conserved sequences of morphology variations that were predictive of future death in a test population. [sent-20, score-0.661]
6 Keywords: pattern discovery, motif discovery, locality sensitive hashing, classification. [sent-23, score-0.49]
7 More recently, there has been an increased interest in applying techniques for discovering patterns to sequences corresponding to genomic data (Wang et al. [sent-34, score-0.341]
8 This process means that discriminative patterns in negatively labeled sequences are not explored for classification. [sent-53, score-0.372]
9 Some existing methods (…, 2000) enumerate all exact patterns across both positive and negative examples to identify sequences that can discriminate between these two cases, but become computationally intractable when allowing subsequences to have an approximate form. [sent-56, score-0.689]
10 We describe a locality sensitive hashing (LSH) based algorithm to efficiently estimate the frequencies of all approximately conserved subsequences within a certain Hamming radius in both positive and negative examples. [sent-57, score-0.58]
11 In this way, our method unifies the broad areas of existing work in sequential pattern detection for classification by proposing a way to discover patterns that are both approximate and derived using the additional information available in negative instances. [sent-59, score-0.553]
12 We also explore the idea of using clustering as part of pattern discovery to reduce approximate subsequences with significantly overlapping Hamming radii to a small number. [sent-65, score-0.688]
13 In addition to LSH and clustering, we also draw upon sequential statistical methods to make the search for interesting patterns more efficient. [sent-71, score-0.374]
14 The process of identifying patterns with discriminative ability makes use of concordance and rank sum testing. [sent-72, score-0.387]
15 In many cases, the goodness of approximate patterns can be assessed without using data from all training sequences. [sent-73, score-0.361]
16 The runtime and space requirements of the pattern discovery process can be reduced by using sequential statistical methods that allow the search process for patterns to terminate after using data from only as many training examples as are needed to assess significance. [sent-75, score-0.676]
17 We address a similar goal to earlier work on using hypergeometric significance testing to discover patterns that are enriched in a positive set relative to a negative set (Barash et al. [sent-76, score-0.328]
18 Our algorithm to find approximate discriminative patterns is also related to previous work on the use of profile hidden Markov models (Krogh, 1994; Jaakkola et al. [sent-82, score-0.34]
19 This work focuses on efficiently calculating a kernel based on the mismatch tree data structure (Eskin and Pevzner, 2002), which quantifies how different two sequences are based on the approximate occurrence of the fixed L-length subsequences within them. [sent-89, score-0.438]
20 In contrast to the mismatch kernel, which focuses on quantifying the difference between two sequences and does not report the frequency of individual approximate [Figure 1: Overview of the pattern discovery process.] [sent-92, score-0.39]
21 subsequences in the data, our algorithm focuses on identifying the specific approximate patterns with discriminative value. [sent-93, score-0.636]
22 This approach can be integrated more easily with the use of sequential statistics, that is, since the frequencies of each approximate pattern are retained during analysis, this information can be used to determine if patterns are good or bad discriminators without analyzing all the available data. [sent-94, score-0.508]
23 We evaluated our algorithm on data sets from different applications to discover sequential patterns for classification. [sent-95, score-0.398]
24 On nucleotide sequences from the Drosophila genome, our method was able to discover binding sites for genes that are preserved across the genome and do not occur in random background sequences. [sent-96, score-0.512]
25 On symbolized electrocardiographic time-series from patients with acute coronary syndromes, our pattern discovery approach identified approximately conserved sequences of morphology variations that were predictive of future death in a test population. [sent-97, score-0.627]
26 Section 3 describes a locality sensitive hashing scheme to find approximate matches to all subsequences in the data set. [sent-100, score-0.543]
27 Section 4 proposes the use of clustering to reduce the number of approximate patterns analyzed during pattern discovery. [sent-101, score-0.474]
28 Overview. The process of discovering discriminative patterns of a specified length L from positive and negative sequences is carried out in two stages: frequency estimation and pattern ranking. [sent-107, score-0.576]
29 To allow for approximate patterns, each unique subsequence Wi is then matched to all other subsequences at a Hamming distance of at most d from Wi. [sent-121, score-0.631]
30 In Section 3, we describe an LSH-based solution that allows for efficient discovery of the subsequences DWi matching Wi . [sent-123, score-0.435]
31 We also present a clustering approach in Section 4 to reduce overlapping approximate patterns for which frequencies are estimated to a smaller number with less redundancy for subsequent analysis. [sent-124, score-0.492]
32 Pattern Ranking. The goal of the search process is to identify approximately matching subsequences that can discriminate between positive and negative training examples. [sent-126, score-0.416]
33 However, since the indices used by the locality sensitive hash function LSH(S) may not span the entire subsequences Sx and Sy , an exact match in Equation 1 may be obtained if Sx and Sy match approximately. [sent-158, score-0.504]
34 Practically, LSH is implemented by creating a hash table using the LSH(S) values for all subsequences as the keys. [sent-159, score-0.39]
35 Following this, the bucket to which the query is mapped is searched for all original subsequences with a Hamming distance of at most d. [sent-162, score-0.334]
36 To find the nearest neighbors for all M subsequences in the data set, each member of the set can be passed through the entire LSH data structure comprising T hash tables for matches. [sent-178, score-0.39]
37 The subsequences $D_{W_i}$ matching $W_i$ are found as: $D_{W_i} = \bigcup_{t=1}^{T} \{W_j \mid \mathrm{LSH}_t(W_j) = \mathrm{LSH}_t(W_i)\}$. [sent-192, score-0.324]
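To make the table-building, bucket-search, and union steps above concrete, here is a minimal Python sketch of Hamming-distance LSH; it is a rough illustration only, not the paper's implementation. The function name find_approximate_matches and the parameters T (number of tables), k (number of sampled positions), and seed are our own assumptions, and the sketch keeps all T tables in memory at once, whereas the paper iterates with a single LSH table to reduce space.

```python
import random
from collections import defaultdict

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def find_approximate_matches(subseqs, L, d, T=20, k=12, seed=0):
    """For each length-L subsequence, collect all subsequences within
    Hamming distance at most d, using T locality sensitive hash tables.
    Each table hashes on a random subset of k of the L positions, so
    subsequences that match approximately collide with high probability."""
    rng = random.Random(seed)
    matches = {w: {w} for w in subseqs}            # D_{W_i}, seeded with W_i itself
    for _ in range(T):
        idx = rng.sample(range(L), k)              # positions used by LSH_t
        table = defaultdict(list)
        for w in subseqs:
            table[tuple(w[i] for i in idx)].append(w)
        for bucket in table.values():              # bucket members are candidates
            for i, wi in enumerate(bucket):
                for wj in bucket[i + 1:]:
                    if hamming(wi, wj) <= d:       # discard false positives
                        matches[wi].add(wj)
                        matches[wj].add(wi)
    return matches
```

Taking the union of bucket mates over all T tables, after filtering out collisions that are not true Hamming neighbors, yields exactly the set $D_{W_i}$ in the equation above.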
38 Clustering Subsequences. The runtime of the pattern discovery process as described so far is dominated by the approximate matching of all subsequences. [sent-194, score-0.339]
39 This process is associated with considerable redundancy, as matches are sought individually for subsequences that are similar to each other. [sent-196, score-0.346]
40 The overlap between approximate patterns increases the computational needs of the pattern discovery process and also makes it more challenging to interpret the results as good patterns may appear many times in the output. [sent-197, score-0.743]
41 The focus of this clustering is to group together the original subsequences falling into the same hash bucket during the first LSH iteration. [sent-200, score-0.513]
42 This reduces the runtime of the search by reducing the number of times subsequences have to be passed through the LSH tables to find true and false positives. [sent-203, score-0.435]
43 It also reduces the memory requirements of the search by reducing the number of subsequences for which we need to maintain state about approximate matches. [sent-204, score-0.387]
44 This approach can be considered as being identical to choosing a set of subsequences during the first LSH iteration, and finding their approximate matches by multiple LSH iterations. [sent-213, score-0.385]
45 The LSH iterations that follow find approximate matches to the subsequences in Φ. [sent-220, score-0.385]
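The representative set Φ can be sketched with the classic greedy farthest-point traversal (Gonzalez, 1985), a 2-approximation to the k-center problem. This is a minimal illustration under Hamming distance, reusing the hamming helper from the sketch above; the function name is ours, and the paper's actual clustering additionally exploits the first-iteration LSH buckets and deliberately retains some redundancy between clusters.

```python
def k_center_representatives(subseqs, k):
    """Greedy farthest-point traversal: pick an arbitrary first center,
    then repeatedly promote the point farthest from all chosen centers.
    Yields a 2-approximate k-center solution under Hamming distance."""
    points = list(subseqs)
    centers = [points[0]]
    dist = [hamming(p, centers[0]) for p in points]   # distance to nearest center
    while len(centers) < min(k, len(points)):
        far = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[far])
        dist = [min(dist[i], hamming(p, points[far]))
                for i, p in enumerate(points)]
    return centers
```

Only the representatives in Φ then need to be carried through the later LSH iterations, which is what reduces both the number of table lookups and the state kept in memory.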
46 It is important to note that while clustering reduces a large number of overlapping approximate patterns to a much smaller group, the clusters formed during this process may still overlap. [sent-221, score-0.445]
47 Pattern Ranking. Given the frequencies $g_i^+$ and $g_i^-$ of an approximate pattern corresponding to all subsequences within a Hamming distance d of the subsequence $W_i$, a score can be assigned to the pattern by using concordance statistics and rank sum testing. [sent-227, score-0.725]
48 In our work, we reduce the original approximate patterns into a small group with some overlap to span the search space. [sent-230, score-0.334]
49 Rank Sum Testing. An alternate approach to assess the goodness of patterns is to make use of rank sum testing (Wilcoxon, 1945; Lehmann, 1975). [sent-240, score-0.322]
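Both ranking scores are standard and can be sketched directly, assuming $g_i^+$ and $g_i^-$ are the per-sequence occurrence counts of one candidate pattern in the positive and negative training sequences; the helper name c_statistic and the toy counts below are our own, and SciPy's ranksums implements the Wilcoxon rank sum test.

```python
import numpy as np
from scipy.stats import ranksums

def c_statistic(pos_freqs, neg_freqs):
    """Concordance (C-statistic): the probability that a randomly chosen
    positive example contains the pattern more often than a randomly
    chosen negative example, counting ties as one half."""
    pos = np.asarray(pos_freqs, dtype=float)[:, None]
    neg = np.asarray(neg_freqs, dtype=float)[None, :]
    return (pos > neg).mean() + 0.5 * (pos == neg).mean()

# Hypothetical frequencies of one pattern across training sequences.
g_pos = [4, 7, 5, 6, 3]      # counts in each positive example
g_neg = [1, 0, 2, 1, 0, 1]   # counts in each negative example

print("C-statistic:", c_statistic(g_pos, g_neg))   # 1.0: perfect separation
stat, p = ranksums(g_pos, g_neg)                   # Wilcoxon rank sum test
print("rank sum p-value:", p)
```

A C-statistic near 0.5 indicates no discriminative value, while values near 0 or 1 indicate that the pattern's frequency separates the two classes well.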
50 For example, it may be possible to recognize patterns with high discriminative ability without the need to analyze all positive and negative examples. [sent-254, score-0.341]
51 This helps identify candidate approximate patterns that occur in the data set and collectively span the search space. [sent-256, score-0.334]
52 Traditional formulations of sequential significance testing remove good patterns from analysis when the test statistic is first found to lie above or below a given threshold. [sent-284, score-0.358]
53 Since there may be a large number of such patterns with low discriminative value, we make use of a modified approach to reject poor hypotheses while retaining patterns that may have value in classification. [sent-286, score-0.544]
54 Given that the typical goal of pattern discovery is to return the best patterns found during the search process, this technique naturally addresses the problem statement. [sent-288, score-0.513]
55 Given the test statistics in Equations 5 and 9, we remove all patterns that have a test statistic $U_n < \lambda U_{\max}$, where $U_{\max}$ is the maximum test statistic for any pattern and λ is a user-specified fraction (e.g., …). [sent-289, score-0.386]
56 If we declare a pattern to be significant for some probability of the null hypothesis less than θ, then the overall false positive rate for the experiment, assuming independence of patterns, is given by $FP = 1 - (1 - \theta)^M$. [sent-297, score-0.383]
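A small sketch of both ideas follows; λ, θ, and M are the symbols from the text, while the function names and specific numbers are illustrative assumptions only.

```python
def prune_patterns(test_stats, lam=0.5):
    """Keep only patterns whose current test statistic U_n satisfies
    U_n >= lam * U_max, discarding likely-poor discriminators early
    while retaining every pattern that may still rank near the top."""
    u_max = max(test_stats.values())
    return {p: u for p, u in test_stats.items() if u >= lam * u_max}

def experiment_false_positive_rate(theta, M):
    """Overall false positive rate when M independent patterns are each
    declared significant at per-pattern level theta."""
    return 1.0 - (1.0 - theta) ** M

# With, say, M = 10,000 candidate patterns tested at theta = 0.001,
# some false discovery is almost certain without correction:
print(experiment_false_positive_rate(0.001, 10_000))   # ~0.99995
```

This is why the per-pattern significance threshold must be set far more conservatively as the number of candidate patterns grows.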
57 The focus of this project was to compare 13 motif discovery tools on nucleotide sequences from the human, mouse, fly and yeast genomes to see if they could successfully identify known regulatory elements such as binding sites for genes. [sent-309, score-0.858]
58 Our method predicted binding sites in the original data corresponding to all subsequences in the group DWi with maximum discriminative value. [sent-319, score-0.556]
59 The pattern discovery process attempted to find approximate subsequences of lengths 8, 12 and 16. [sent-321, score-0.553]
60 We compared our method to the 13 motif discovery tools evaluated on the same data set using the nPC and nCC summary statistics (Li and Tompa, 2006; Tompa et al., 2005). [sent-327, score-0.411]
61 In addition to comparing our method with results from the 13 motif discovery algorithms evaluated in the Assessment of Computational Motif Discovery Tools project, we also explored the runtime and accuracy of two variations of our algorithm. [sent-333, score-0.502]
62 The first variation we examined did not make use of the clustering process described in Section 4 to reduce overlapping approximate patterns to a smaller group. [sent-335, score-0.417]
63 Regulatory Motifs in ChIP-on-chip Yeast Data. We evaluated the ability of our method to discover yeast transcription factor binding motifs in ChIP-on-chip data (Harbison et al., 2004). [sent-342, score-0.379]
64 The “gold standard” motifs for this data set were generated by applying a suite of six motif-finding algorithms to intergenic regions from Saccharomyces cerevisiae and clustering the results to arrive at a consensus motif for each transcription factor. [sent-344, score-0.556]
65 When no motif was found computationally for the intergenic regions, a literature-based consensus motif was used. [sent-345, score-0.576]
66 These were the cases where the six motif-finding algorithms of Harbison et al. (2004) failed to find a significant motif and the reported motif had to be based on a consensus obtained from the literature. [sent-347, score-0.576]
67 This approach was analogous to earlier work on ChIP-on-chip data to evaluate motif discovery methods (Siddharthan et al. [sent-350, score-0.38]
68 This approach was also chosen to be consistent with earlier studies to evaluate motif discovery tools (Redhead and Bailey, 2007). [sent-354, score-0.38]
69 We applied our pattern discovery algorithm to discover specific sequences of beat-to-beat changes that had predictive value. [sent-359, score-0.334]
70 Using the training data (i.e., 15 positive examples and 750 negative examples), we used our pattern discovery algorithm to learn sequences of morphology changes in the ECG signal that were predictive of death. [sent-370, score-0.437]
71 Regulatory Motifs in Drosophila Genome. Table 1 presents the results of the 13 different motif discovery algorithms evaluated by the Assessment of Computational Motif Discovery Tools project. [sent-385, score-0.411]
72 Table 1: Performance of 13 different motif discovery methods on the Drosophila genome (nPC = performance coefficient, nCC = correlation coefficient) in the Assessment of Computational Motif Discovery Tools project (Tompa et al., 2005). [sent-409, score-0.456]
73 1927 S YED , I NDYK AND G UTTAG Our LSHCS algorithm compared favorably with the other motif discovery algorithms evaluated on the same data by experts. [sent-429, score-0.411]
74 The L = 16, d = 4 case also had a higher correlation coefficient than the motif discovery algorithms previously reported, although the performance coefficient for this choice of parameters was exceeded by SeSiMCMC. [sent-433, score-0.38]
75 Since these results hold on a specific genome, we caution against interpreting the results as a more general comparison of LSHCS with nucleotide motif discovery methods. [sent-435, score-0.498]
76 In particular, we observe that the nucleotide motif discovery methods in Table 1 were tested blindly on a single pre-determined choice of parameters. [sent-436, score-0.498]
77 For large choices of the maximum allowed Hamming radius, the neighborhood for each pattern is more extensive and storing this information for all overlapping patterns imposes significant space requirements. [sent-445, score-0.4]
78 In contrast to LSHCS, the NoClust variation explores all overlapping patterns while ExhSearch uses an exhaustive search to find nearest neighbors (i.e., …). [sent-448, score-0.345]
79 The running time increased significantly both as all overlapping approximate patterns were studied, and when an exhaustive search was used to replace LSH. [sent-461, score-0.384]
80 Regulatory Motifs in ChIP-on-chip Yeast Data. The results of applying the LSHCS algorithm to the ChIP-on-chip data set for the 21 transcription factors for which the six motif-finding algorithms failed to find a significant motif are shown in Table 7. [sent-464, score-0.345]
81 Table 8: Statistical significance of approximate patterns found on a training set of 765 post-NSTEACS patients (15 deaths over a 90-day follow-up period) when evaluated on a test population of 250 patients (10 deaths). [sent-601, score-0.432]
82 In 6 of the 21 experiments conducted, the discriminative motif with the lowest p-value found by the LSHCS algorithm corresponded to the consensus motif in the literature, but was not found by any of the six motif-finding algorithms evaluated earlier. [sent-602, score-0.665]
83 LSHCS, as well as the other six motif-finding algorithms, failed to find the discriminative motif in the remaining 15 cases although motifs are described in the literature. [sent-603, score-0.415]
84 Predictive Morphology Variations in Electrocardiogram Signals. Our pattern discovery method returned 2 approximate patterns that were assessed to have discriminative value in the training set (i.e., …). [sent-605, score-0.599]
85 The results of testing both patterns on previously unseen data from 250 patients (with 10 deaths over a 90-day follow-up period) are shown in Table 8. [sent-614, score-0.316]
86 Both patterns found by our approximate pattern discovery algorithm showed statistical significance in predicting death according to the C-statistic and rank sum criteria. [sent-615, score-0.577]
87 In the absence of sequential statistics, NoSeqStats had to retain ranking information for all patterns until the training data set was completely analyzed. [sent-621, score-0.322]
88 Summary and Discussion. This paper presents an approach to efficiently learn patterns in labeled sequences that can be used for classification. [sent-624, score-0.314]
89 We define patterns as groups of approximately matching subsequences, that is, all variations of a subsequence within a given Hamming radius. [sent-625, score-0.358]
90 Our method represents a fully automated approach for unsupervised pattern discovery, where the goodness of patterns is assessed by measuring differences in their frequency of occurrence in positive and negative training examples. [sent-626, score-0.499]
91 First, we include patterns from both positive and negative training sets in our search for discriminative activity. [sent-629, score-0.393]
92 Second, we employ an iterative LSH method that is able to efficiently find groups of matching subsequences for subsequent analysis as candidate patterns. [sent-642, score-0.324]
93 Third, we explore the idea of clustering together subsequences, so that the number of candidate patterns can be reduced. [sent-643, score-0.328]
94 This abstraction is intended to decrease the redundancy associated with evaluating a large number of approximate patterns with significant overlap. [sent-644, score-0.317]
95 On nucleotide sequences from the Drosophila genome, our method was able to discover approximate binding sites that were preserved upstream of genes. [sent-649, score-0.475]
96 Approaches to the automatic discovery of patterns in biosequences. [sent-674, score-0.354]
97 Methods in comparative genomics: genome correspondence, gene identification and regulatory motif discovery. [sent-761, score-0.4]
98 Discriminative motif discovery in DNA and protein sequences using the DEME algorithm. [sent-828, score-0.451]
99 YMF: a program for discovery of novel transcription factor binding sites by statistical overrepresentation. [sent-845, score-0.389]
100 Clustering and symbolic analysis in cardiovascular signals: discovery and visualization of medically relevant patterns in long-term data using limited prior knowledge. [sent-850, score-0.388]
wordName wordTfidf (topN-words)
[('lsh', 0.524), ('subsequences', 0.296), ('motif', 0.269), ('patterns', 0.243), ('lshcs', 0.235), ('hamming', 0.152), ('npc', 0.118), ('nucleotide', 0.118), ('discovery', 0.111), ('drosophila', 0.108), ('exhsearch', 0.108), ('morphology', 0.108), ('ndyk', 0.108), ('uttag', 0.108), ('yed', 0.108), ('binding', 0.107), ('pattern', 0.107), ('ncc', 0.1), ('equential', 0.099), ('noclust', 0.099), ('sites', 0.095), ('hash', 0.094), ('tompa', 0.09), ('motifs', 0.088), ('clustering', 0.085), ('pproximate', 0.084), ('sx', 0.081), ('sequential', 0.079), ('transcription', 0.076), ('genome', 0.076), ('locality', 0.072), ('sequences', 0.071), ('dwi', 0.063), ('sy', 0.062), ('discriminative', 0.058), ('regulatory', 0.055), ('runtime', 0.054), ('ecg', 0.054), ('syed', 0.054), ('lassification', 0.052), ('search', 0.052), ('matches', 0.05), ('subsequence', 0.05), ('overlapping', 0.05), ('conserved', 0.046), ('patients', 0.046), ('discover', 0.045), ('bailey', 0.045), ('concordance', 0.045), ('noseqstats', 0.045), ('syndromes', 0.045), ('hashing', 0.044), ('sensitive', 0.042), ('assessed', 0.041), ('rank', 0.041), ('negative', 0.04), ('frequencies', 0.04), ('approximate', 0.039), ('bucket', 0.038), ('coronary', 0.038), ('consensus', 0.038), ('goodness', 0.038), ('variations', 0.037), ('death', 0.036), ('kellis', 0.036), ('nfn', 0.036), ('nfp', 0.036), ('zeeshan', 0.036), ('statistic', 0.036), ('indyk', 0.035), ('redundancy', 0.035), ('signals', 0.035), ('cardiovascular', 0.034), ('false', 0.033), ('mismatch', 0.032), ('yeast', 0.032), ('evaluated', 0.031), ('piotr', 0.031), ('hi', 0.03), ('terminate', 0.03), ('biology', 0.03), ('frequency', 0.03), ('timothy', 0.029), ('nt', 0.029), ('wi', 0.029), ('matching', 0.028), ('clusters', 0.028), ('eskin', 0.028), ('discovering', 0.027), ('admitted', 0.027), ('buhler', 0.027), ('deaths', 0.027), ('electrocardiographic', 0.027), ('guttag', 0.027), ('harbison', 0.027), ('lsht', 0.027), ('manolis', 0.027), ('pavesi', 0.027), ('pevzner', 0.027), ('phatarfod', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999994 45 jmlr-2009-Learning Approximate Sequential Patterns for Classification
2 0.091332383 38 jmlr-2009-Hash Kernels for Structured Data
Author: Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alex Smola, S.V.N. Vishwanathan
Abstract: We propose hashing to facilitate efficient kernels. This generalizes previous work using sampling and we show a principled way to compute the kernel matrix for data streams and sparse feature spaces. Moreover, we give deviation bounds from the exact kernel matrix. This has applications to estimation on strings and graphs. Keywords: hashing, stream, string kernel, graphlet kernel, multiclass classification
3 0.062822402 92 jmlr-2009-Supervised Descriptive Rule Discovery: A Unifying Survey of Contrast Set, Emerging Pattern and Subgroup Mining
Author: Petra Kralj Novak, Nada Lavrač, Geoffrey I. Webb
Abstract: This paper gives a survey of contrast set mining (CSM), emerging pattern mining (EPM), and subgroup discovery (SD) in a unifying framework named supervised descriptive rule discovery. While all these research areas aim at discovering patterns in the form of rules induced from labeled data, they use different terminology and task definitions, claim to have different goals, claim to use different rule learning heuristics, and use different means for selecting subsets of induced patterns. This paper contributes a novel understanding of these subareas of data mining by presenting a unified terminology, by explaining the apparent differences between the learning tasks as variants of a unique supervised descriptive rule discovery task and by exploring the apparent differences between the approaches. It also shows that various rule learning heuristics used in CSM, EPM and SD algorithms all aim at optimizing a trade off between rule coverage and precision. The commonalities (and differences) between the approaches are showcased on a selection of best known variants of CSM, EPM and SD algorithms. The paper also provides a critical survey of existing supervised descriptive rule discovery visualization methods. Keywords: descriptive rules, rule learning, contrast set mining, emerging patterns, subgroup discovery
4 0.04815736 90 jmlr-2009-Structure Spaces
Author: Brijnesh J. Jain, Klaus Obermayer
Abstract: Finite structures such as point patterns, strings, trees, and graphs occur as ”natural” representations of structured data in different application areas of machine learning. We develop the theory of structure spaces and derive geometrical and analytical concepts such as the angle between structures and the derivative of functions on structures. In particular, we show that the gradient of a differentiable structural function is a well-defined structure pointing in the direction of steepest ascent. Exploiting the properties of structure spaces, it will turn out that a number of problems in structural pattern recognition such as central clustering or learning in structured output spaces can be formulated as optimization problems with cost functions that are locally Lipschitz. Hence, methods from nonsmooth analysis are applicable to optimize those cost functions. Keywords: graphs, graph matching, learning in structured domains, nonsmooth optimization
Author: Stijn Goedertier, David Martens, Jan Vanthienen, Bart Baesens
Abstract: Process discovery is the automated construction of structured process models from information system event logs. Such event logs often contain positive examples only. Without negative examples, it is a challenge to strike the right balance between recall and specificity, and to deal with problems such as expressiveness, noise, incomplete event logs, or the inclusion of prior knowledge. In this paper, we present a configurable technique that deals with these challenges by representing process discovery as a multi-relational classification problem on event logs supplemented with Artificially Generated Negative Events (AGNEs). This problem formulation allows using learning algorithms and evaluation techniques that are well-know in the machine learning community. Moreover, it allows users to have a declarative control over the inductive bias and language bias. Keywords: graph pattern discovery, inductive logic programming, Petri net, process discovery, positive data only
6 0.040240329 43 jmlr-2009-Java-ML: A Machine Learning Library (Machine Learning Open Source Software Paper)
7 0.038222998 34 jmlr-2009-Fast ApproximatekNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection
8 0.03324296 23 jmlr-2009-Discriminative Learning Under Covariate Shift
9 0.033116113 49 jmlr-2009-Learning Permutations with Exponential Weights
10 0.032419361 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory
11 0.031091781 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
12 0.030231699 59 jmlr-2009-Nearest Neighbor Clustering: A Baseline Method for Consistent Clustering with Arbitrary Objective Functions
13 0.029704096 48 jmlr-2009-Learning Nondeterministic Classifiers
14 0.028998192 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications
15 0.028995324 35 jmlr-2009-Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination (Special Topic on Model Selection)
16 0.027989618 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures
17 0.02632831 50 jmlr-2009-Learning When Concepts Abound
18 0.025162755 39 jmlr-2009-Hybrid MPI OpenMP Parallel Linear Support Vector Machine Training
19 0.024792001 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
20 0.024390191 3 jmlr-2009-A Parameter-Free Classification Method for Large Scale Learning
topicId topicWeight
[(0, 0.131), (1, -0.065), (2, 0.092), (3, -0.013), (4, 0.026), (5, -0.102), (6, 0.034), (7, 0.046), (8, -0.048), (9, 0.066), (10, -0.005), (11, -0.125), (12, 0.047), (13, 0.072), (14, -0.046), (15, 0.12), (16, 0.084), (17, -0.057), (18, -0.089), (19, -0.093), (20, -0.305), (21, -0.038), (22, 0.06), (23, -0.068), (24, -0.077), (25, 0.164), (26, -0.013), (27, -0.129), (28, 0.109), (29, 0.229), (30, -0.072), (31, 0.038), (32, 0.101), (33, 0.058), (34, -0.026), (35, 0.274), (36, 0.252), (37, 0.193), (38, 0.047), (39, -0.047), (40, -0.13), (41, 0.19), (42, 0.032), (43, 0.064), (44, -0.067), (45, 0.176), (46, -0.19), (47, -0.141), (48, 0.193), (49, -0.045)]
simIndex simValue paperId paperTitle
same-paper 1 0.96306014 45 jmlr-2009-Learning Approximate Sequential Patterns for Classification
2 0.41946393 38 jmlr-2009-Hash Kernels for Structured Data
5 0.17525059 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
Author: Yihua Chen, Eric K. Garcia, Maya R. Gupta, Ali Rahimi, Luca Cazzanti
Abstract: This paper reviews and extends the field of similarity-based classification, presenting new analyses, algorithms, data sets, and a comprehensive set of experimental results for a rich collection of classification problems. Specifically, the generalizability of using similarities as features is analyzed, design goals and methods for weighting nearest-neighbors for similarity-based learning are proposed, and different methods for consistently converting similarities into kernels are compared. Experiments on eight real data sets compare eight approaches and their variants to similarity-based learning. Keywords: similarity, dissimilarity, similarity-based learning, indefinite kernels
6 0.15521333 43 jmlr-2009-Java-ML: A Machine Learning Library (Machine Learning Open Source Software Paper)
8 0.1519976 48 jmlr-2009-Learning Nondeterministic Classifiers
9 0.15014762 90 jmlr-2009-Structure Spaces
10 0.1459872 88 jmlr-2009-Stable and Efficient Gaussian Process Calculations
11 0.14065038 59 jmlr-2009-Nearest Neighbor Clustering: A Baseline Method for Consistent Clustering with Arbitrary Objective Functions
12 0.13842742 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions
13 0.13297367 2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization
14 0.12543134 23 jmlr-2009-Discriminative Learning Under Covariate Shift
15 0.12486327 34 jmlr-2009-Fast ApproximatekNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection
16 0.12241106 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability
17 0.11656613 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory
18 0.11505965 72 jmlr-2009-Polynomial-Delay Enumeration of Monotonic Graph Classes
19 0.11368539 99 jmlr-2009-Using Local Dependencies within Batches to Improve Large Margin Classifiers
20 0.10577141 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks
topicId topicWeight
[(8, 0.019), (9, 0.013), (11, 0.023), (19, 0.012), (26, 0.028), (38, 0.032), (47, 0.02), (52, 0.046), (55, 0.033), (58, 0.036), (66, 0.083), (68, 0.018), (89, 0.451), (90, 0.075), (96, 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.70996779 45 jmlr-2009-Learning Approximate Sequential Patterns for Classification
2 0.28030172 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions
Author: Lisa Hellerstein, Bernard Rosell, Eric Bach, Soumya Ray, David Page
Abstract: A Boolean function f is correlation immune if each input variable is independent of the output, under the uniform distribution on inputs. For example, the parity function is correlation immune. We consider the problem of identifying relevant variables of a correlation immune function, in the presence of irrelevant variables. We address this problem in two different contexts. First, we analyze Skewing, a heuristic method that was developed to improve the ability of greedy decision tree algorithms to identify relevant variables of correlation immune Boolean functions, given examples drawn from the uniform distribution (Page and Ray, 2003). We present theoretical results revealing both the capabilities and limitations of skewing. Second, we explore the problem of identifying relevant variables in the Product Distribution Choice (PDC) learning model, a model in which the learner can choose product distributions and obtain examples from them. We prove a lemma establishing a property of Boolean functions that may be of independent interest. Using this lemma, we give two new algorithms for finding relevant variables of correlation immune functions in the PDC model. Keywords: correlation immune functions, skewing, relevant variables, Boolean functions, product distributions c 2009 Lisa Hellerstein, Bernard Rosell, Eric Bach, Soumya Ray and David Page. H ELLERSTEIN , ROSELL , BACH , R AY AND PAGE
3 0.28025624 38 jmlr-2009-Hash Kernels for Structured Data
Author: Halbert White, Karim Chalak
Abstract: Judea Pearl’s Causal Model is a rich framework that provides deep insight into the nature of causal relations. As yet, however, the Pearl Causal Model (PCM) has had a lesser impact on economics or econometrics than on other disciplines. This may be due in part to the fact that the PCM is not as well suited to analyzing structures that exhibit features of central interest to economists and econometricians: optimization, equilibrium, and learning. We offer the settable systems framework as an extension of the PCM that permits causal discourse in systems embodying optimization, equilibrium, and learning. Because these are common features of physical, natural, or social systems, our framework may prove generally useful for machine learning. Important features distinguishing the settable system framework from the PCM are its countable dimensionality and the use of partitioning and partition-specific response functions to accommodate the behavior of optimizing and interacting agents and to eliminate the requirement of a unique fixed point for the system. Refinements of the PCM include the settable systems treatment of attributes, the causal role of exogenous variables, and the dual role of variables as causes and responses. A series of closely related machine learning examples and examples from game theory and machine learning with feedback demonstrates some limitations of the PCM and motivates the distinguishing features of settable systems. Keywords: equations causal models, game theory, machine learning, recursive estimation, simultaneous
5 0.27676097 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost
Author: Cynthia Rudin, Robert E. Schapire
Abstract: We study boosting algorithms for learning to rank. We give a general margin-based bound for ranking based on covering numbers for the hypothesis space. Our bound suggests that algorithms that maximize the ranking margin will generalize well. We then describe a new algorithm, smooth margin ranking, that precisely converges to a maximum ranking-margin solution. The algorithm is a modification of RankBoost, analogous to “approximate coordinate ascent boosting.” Finally, we prove that AdaBoost and RankBoost are equally good for the problems of bipartite ranking and classification in terms of their asymptotic behavior on the training set. Under natural conditions, AdaBoost achieves an area under the ROC curve that is equally as good as RankBoost’s; furthermore, RankBoost, when given a specific intercept, achieves a misclassification error that is as good as AdaBoost’s. This may help to explain the empirical observations made by Cortes and Mohri, and Caruana and Niculescu-Mizil, about the excellent performance of AdaBoost as a bipartite ranking algorithm, as measured by the area under the ROC curve. Keywords: ranking, RankBoost, generalization bounds, AdaBoost, area under the ROC curve
6 0.27607813 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
7 0.27565813 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models
8 0.27538386 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
9 0.27453256 70 jmlr-2009-Particle Swarm Model Selection (Special Topic on Model Selection)
11 0.27268648 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
12 0.27220652 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
13 0.27137652 18 jmlr-2009-Consistency and Localizability
14 0.27030531 80 jmlr-2009-Reproducing Kernel Banach Spaces for Machine Learning
15 0.26912344 48 jmlr-2009-Learning Nondeterministic Classifiers
16 0.26834494 99 jmlr-2009-Using Local Dependencies within Batches to Improve Large Margin Classifiers
17 0.26804769 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications
18 0.26739767 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures
19 0.26729032 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs
20 0.26721343 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models