jmlr jmlr2009 jmlr2009-38 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alex Smola, S.V.N. Vishwanathan
Abstract: We propose hashing to facilitate efficient kernels. This generalizes previous work using sampling and we show a principled way to compute the kernel matrix for data streams and sparse feature spaces. Moreover, we give deviation bounds from the exact kernel matrix. This has applications to estimation on strings and graphs. Keywords: hashing, stream, string kernel, graphlet kernel, multiclass classification
Reference: text
sentIndex sentText sentNum sentScore
1 AU; Department of Statistics, Purdue University, IN, USA. Editor: Soeren Sonnenburg, Vojtech Franc, Elad Yom-Tov and Michele Sebag. Abstract: We propose hashing to facilitate efficient kernels. [sent-17, score-0.255]
2 This generalizes previous work using sampling and we show a principled way to compute the kernel matrix for data streams and sparse feature spaces. [sent-18, score-0.309]
3 Keywords: hashing, stream, string kernel, graphlet kernel, multiclass classification [sent-21, score-0.174]
4 1. Introduction. In recent years, a number of methods have been proposed to deal with the fact that kernel methods have slow runtime performance if the number of kernel functions used in the expansion is large. [sent-22, score-0.305]
5 Our work is inspired by the count-min sketch of Cormode and Muthukrishnan (2004), which uses hash functions as a computationally efficient means of randomization. [sent-47, score-0.697]
6 As an application, we demonstrate computational benefits over suffix array string kernels in the case of document analysis, and we discuss a kernel between graphs which only becomes computationally feasible by means of a compressed representation. [sent-49, score-0.351]
7 4 Outline. We begin with a description of previous work in Section 2 and propose hash kernels in Section 3, which are suitable for data with simple structure such as strings. [sent-51, score-0.728]
8 We then propose a graphlet kernel, which generalizes hash kernels to data with general structure (graphs), and discuss an MCMC sampler in Section 5. [sent-53, score-1.006]
9 Previous Work and Applications. Recently, much attention has been paid to efficient algorithms based on randomization or hashing in the machine learning community. [sent-56, score-0.283]
10 For matrices of size m × m one obtains bounds of type O(R^2 ε^{-2} log m) by combining Hoeffding’s theorem with a union bound argument over the O(m^2) different elements of the kernel matrix. [sent-69, score-0.167]
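As a hedged illustration of where bounds of this form come from (assuming each kernel entry is estimated by an average of d independent terms bounded in absolute value by R; the constants are illustrative rather than the paper's):

```latex
\Pr\Bigl[\,\bigl|\hat{k}(x_i,x_j)-k(x_i,x_j)\bigr| > \varepsilon\,\Bigr]
  \le 2\exp\!\Bigl(-\tfrac{d\varepsilon^{2}}{2R^{2}}\Bigr)
  \quad\text{(Hoeffding)},
\qquad
2m^{2}\exp\!\Bigl(-\tfrac{d\varepsilon^{2}}{2R^{2}}\Bigr)\le\delta
\;\Longrightarrow\;
d = O\!\bigl(R^{2}\varepsilon^{-2}\log m\bigr).
```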
11 (2002) use a setting identical to (1) as the basis for comparisons between strings and graphs by defining a random walk as the feature extractor. [sent-72, score-0.224]
12 • The empirical kernel map of Schölkopf (1997) uses C = X and employs some kernel function κ to define φ_c(x) = κ(c, x). [sent-77, score-0.28]
13 For instance, for the pair-HMM kernel most strings c are unlikely ancestors of x and x′ , hence P(x|c) and P(x′ |c) will be negligible for most c. [sent-87, score-0.214]
14 Achlioptas (2003) proposes a random projection approach that uses symmetric random variables to project the original features onto a lower-dimensional feature space. [sent-94, score-0.208]
15 Let h : I → {1, . . . , n} be a hash function and assume that there exists a distribution over h such that the hash functions are pairwise independent. [sent-108, score-0.66]
16 Assume that we draw d hash functions h_i at random and let S ∈ R^{n×d} be a sketch matrix. [sent-109, score-0.774]
17 The idea is that by storing counts of s according to several hash functions we can reduce the probability of collision with another particularly large symbol. [sent-113, score-0.845]
18 Even worse, for the hashing to work, one needs to take the minimum over a set of inner product candidates. [sent-117, score-0.255]
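To make the count-min construction concrete, here is a minimal Python sketch (illustrative, not the implementation of Cormode and Muthukrishnan); salting a single md5 digest with the row index stands in for the d pairwise-independent hash functions, and n = 1024, d = 4 are arbitrary choices:

```python
import hashlib


class CountMinSketch:
    """Minimal count-min style counter: d hashed rows of n counters each."""

    def __init__(self, n=1024, d=4):
        self.n, self.d = n, d
        self.counts = [[0] * n for _ in range(d)]

    def _bin(self, i, key):
        # Salt the key with the row index i to emulate d distinct hash functions.
        digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
        return int(digest, 16) % self.n

    def update(self, key, value=1):
        # Add the observed count to one cell in every row.
        for i in range(self.d):
            self.counts[i][self._bin(i, key)] += value

    def query(self, key):
        # Collisions only inflate cells, so the minimum over the d candidate
        # cells is the tightest available estimate of the true count.
        return min(self.counts[i][self._bin(i, key)] for i in range(self.d))
```

The query step is exactly the minimum over candidate cells referred to above: since collisions can only inflate a cell, the smallest of the d candidates is the tightest estimate.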
19 5 Random Feature Mixing. Ganchev and Dredze (2008) provide empirical evidence that using hashing can eliminate alphabet storage and reduce the number of parameters without severely impacting model performance. [sent-119, score-0.282]
20 Langford et al. (2007) released the Vowpal Wabbit fast online learning software, which uses a hash representation similar to the one discussed here. [sent-121, score-0.685]
21 (2009) propose a hash kernel to deal with the issue of computational efficiency via a very simple algorithm: high-dimensional vectors are compressed by adding up all coordinates which have the same hash value, so one only needs to perform as many calculations as there are nonzero terms in the vector. [sent-124, score-1.504]
22 The hash kernel can jointly hash both labels and features, so the memory footprint is essentially independent of the number of classes used. [sent-125, score-1.489]
23 Let h : I → {1, . . . , n} be a hash function drawn from a distribution of pairwise independent hash functions. [sent-132, score-1.32]
24 In this case we define the hash kernel as follows: k̄(x, x′) = ⟨φ̄(x), φ̄(x′)⟩ with φ̄_j(x) = ∑_{i ∈ I : h(i) = j} φ_i(x)  (3). We are accumulating all coordinates i of φ(x) for which h(i) generates the same value j into coordinate φ̄_j(x). [sent-134, score-0.8]
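As an illustration of (3), the following minimal Python sketch builds the compressed map φ̄ and evaluates the hash kernel as an inner product (illustrative only; md5 stands in for a pairwise-independent hash function, and n = 2^18 bins and the function names are our choices, not the paper's):

```python
import hashlib
from collections import defaultdict


def hashed_phi(phi, n=2 ** 18):
    """Compress a sparse feature map {feature: value} into n hash buckets.

    All coordinates i with h(i) = j are added up into bucket j, as in (3).
    """
    out = defaultdict(float)
    for feature, value in phi.items():
        j = int(hashlib.md5(str(feature).encode()).hexdigest(), 16) % n
        out[j] += value
    return out


def hash_kernel(phi_x, phi_xp, n=2 ** 18):
    """Evaluate the hash kernel as the inner product of the hashed maps."""
    a, b = hashed_phi(phi_x, n), hashed_phi(phi_xp, n)
    return sum(v * b.get(j, 0.0) for j, v in a.items())
```

For example, hash_kernel({'dog': 1.0, 'barks': 2.0}, {'dog': 1.0}) evaluates to 1.0 unless 'dog' and 'barks' happen to collide.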
25 Our claim is that hashing preserves information as well as randomized projections with significantly less computation. [sent-135, score-0.299]
26 Before providing an analysis let us discuss two key applications: efficient hashing of kernels on strings and cases where the number of classes is very high, such as categorization in an ontology. [sent-136, score-0.426]
27 3 Multiclass Classification can sometimes lead to a very high dimensional feature vector even if the underlying feature map x → φ(x) is itself acceptable. [sent-155, score-0.17]
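In the multiclass setting, jointly hashing the (feature, label) pair into one shared weight vector keeps the memory footprint independent of the number of classes, as noted earlier. A minimal illustrative sketch (the md5 stand-in and function names are ours, not the paper's):

```python
import hashlib


def joint_bucket(feature, label, n):
    """Hash the (feature, label) pair into one shared weight vector of size n."""
    digest = hashlib.md5(f"{feature}|{label}".encode()).hexdigest()
    return int(digest, 16) % n


def class_score(w, phi, label, n):
    """<w, phi(x, y)>, where phi(x, y) hashes feature-label pairs jointly."""
    return sum(v * w[joint_bucket(f, label, n)] for f, v in phi.items())


def predict(w, phi, labels, n):
    """Predict the class with the largest hashed joint score."""
    return max(labels, key=lambda y: class_score(w, phi, y, n))
```

Here w is a single flat array of length n; per-class weight vectors are never materialized.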
28 Analysis. We show that the penalty we incur from using hashing to compress the number of coordinates grows only logarithmically with the number of objects and the number of classes. [sent-169, score-0.255]
29 While we are unable to obtain the excellent O(ε^{-1}) rates offered by the count-min sketch, our approach retains the inner product property, thus making hashing accessible to linear estimation. [sent-170, score-0.255]
30 Whenever needed we will write φ^h(x) and k^h(x, x′) to make the dependence on the hash function h explicit. [sent-173, score-0.66]
31 Consequently k(x, x′) is a biased estimator of the kernel matrix, with the bias decreasing in inverse proportion to the number of hash bins. [sent-176, score-0.8]
32 As can be seen, the variance decreases as O(n^{-1}) in n, the size of the range of the hash function. [sent-180, score-0.66]
33 2 Information Loss. One of the key fears about using hashing in machine learning is that hash collisions harm performance. [sent-183, score-0.941]
34 For example, the well-known birthday paradox shows that if the hash function maps into a space of size n, then with O(n^{1/2}) features a collision is likely. [sent-184, score-0.859]
35 This can be implemented with a hash function by (for example) appending the string i ∈ {1, . [sent-189, score-0.699]
36 , c} to feature f and then computing the hash of f ◦ i for the i-th duplicate. [sent-192, score-0.745]
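A minimal sketch of this c-fold duplication (illustrative; the ":" separator and md5 stand in for the concatenation f ∘ i and the hash function h, and the bucket count n is arbitrary):

```python
import hashlib
from collections import defaultdict


def hashed_phi_with_duplicates(phi, n=2 ** 18, c=2):
    """Insert every feature c times, once per hashed duplicate of f.

    The i-th copy hashes the string f + ":" + str(i); the information in f
    is lost only if all c duplicates collide with other features.
    """
    out = defaultdict(float)
    for feature, value in phi.items():
        for i in range(1, c + 1):
            key = f"{feature}:{i}"
            j = int(hashlib.md5(key.encode()).hexdigest(), 16) % n
            out[j] += value
    return out
```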
37 The essential observation is that the information in a feature is only lost if all duplicates of the feature collide with other features. [sent-193, score-0.274]
38 Proof. The proof is essentially a counting argument that takes into account the fact that we are dealing with a hash function rather than a random variable. [sent-202, score-0.66]
39 Note that we do not care about a collision of two duplicates of the same feature, because the feature value is preserved. [sent-208, score-0.297]
40 The probability that all duplicates 1 ≤ i ≤ c collide with another feature is bounded by (lc/n)^c + 1 − (1 − c/n)^c. [sent-209, score-0.189]
41 Taking a union bound over all l features, we get that the probability that any feature has all of its duplicates collide is bounded by l[1 − (1 − c/n)^c + (lc/n)^c]. [sent-220, score-0.189]
42 Theorem 3 Assume that the probability of deviation between the hash kernel and its expected value is bounded by an exponential inequality via P[|k^h(x, x′) − E_h k^h(x, x′)| > ε] ≤ c exp(−c′ε^2 n) for some constants c, c′ depending on the size of the hash and the kernel used. [sent-223, score-1.6]
43 In this case the error ε arising from ensuring the above inequality, with probability at least 1 − δ, for m observations and M classes (for a joint feature map φ(x, y)) is bounded by ε ≤ √[(2 log(m + 1) + 2 log(M + 1) − log δ + log c − 2 log 2)/(n c′)]. [sent-224, score-0.166]
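A hedged reconstruction of where a bound of this exact form comes from: requiring the exponential inequality of Theorem 3 to hold simultaneously over N hypotheses, with N = (m+1)^2 (M+1)^2 / 4 (our reading of the union bound, chosen because it reproduces the constants above), and solving for ε gives

```latex
c\,N\exp\!\bigl(-c'\varepsilon^{2}n\bigr)=\delta
\;\Longrightarrow\;
\varepsilon
  =\sqrt{\frac{\log c+\log N-\log\delta}{c'\,n}}
  =\sqrt{\frac{2\log(m+1)+2\log(M+1)-\log\delta+\log c-2\log 2}{n\,c'}}.
```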
44 4 Generalization Bound. The hash kernel approximates the original kernel while saving a great deal of storage and computation. [sent-229, score-0.967]
45 An interesting question is whether the generalization bound on the hash kernel will be much worse than the bound obtained on the original kernel. [sent-230, score-0.8]
46 Theorem 4 (Generalization Bound) For a binary-class SVM, let K̄ = ⟨φ̄(x), φ̄(x′)⟩ and K = ⟨φ(x), φ(x′)⟩ be the hash kernel matrix and the original kernel matrix, respectively. [sent-231, score-0.94]
47 The approximation quality depends on both the features and the hash collisions. [sent-250, score-0.745]
48 From the definition of the hash kernel (see (3)), feature entries with the same hash value are added up and mapped to a new feature entry indexed by that hash value. [sent-251, score-2.29]
49 The higher the collision rate of the hash, the more entries are added up. [sent-252, score-0.816]
50 In this case, the hash kernel gives a good approximation of the original kernel, so b is reasonably small. [sent-255, score-0.8]
51 Moreover, Theorem 4 shows that the generalization bounds on the hash kernel and the original kernel differ only by O((mγ)^{-1}). [sent-262, score-0.94]
52 Let us take a closer look at the hash kernel binary SVM, which can be formalized as min_w λ||w||^2 + ∑_{i=1}^{m} max{0, 1 − y_i ⟨w, φ(x_i, y_i)⟩}. [sent-272, score-0.8]
53 This shows that, in theory, the hash kernel SVM has nearly the same generalization bound as the original SVM. [sent-278, score-0.395]
54 In the following we show that sampling and hashing can be used to make the analysis of larger subgraphs tractable in practice. [sent-284, score-0.385]
55 Before discussing hashing we briefly discuss averages of dependent random variables: Definition 6 (Bernoulli Mixing) Denote by Z a stochastic process indexed by t ∈ Z with probability measure P and let Σ_n be the σ-algebra on Z_t with t ∈ Z\{1, . [sent-314, score-0.255]
56 This is where a hash map on data streams comes in handy. [sent-352, score-0.688]
57 Finally, we combine the convergence bounds from Theorem 8 with the guarantees available for hash kernels to obtain the approximate graph kernel. [sent-354, score-0.756]
58 Note that the two randomizations have very different purposes: the sampling over graphlets is done as a way to approximate the extraction of features whereas the compression via hashing is carried out to ensure that the representation is computationally efficient. [sent-355, score-0.411]
59 Experiments. To test the efficacy of our approach we applied hashing to the following problems: first, we used it for classification on the Reuters RCV1 data set, as it has a relatively large feature dimensionality. [sent-357, score-0.34]
60 The third experiment, Biochemistry and Bioinformatics Graph Classification, uses our hashing scheme, which makes comparing all possible subgraph pairs tractable, to compare graphs (Vishwanathan et al. [sent-359, score-0.29]
61 Apart from the efficacy of the hashing operation itself, the gain in speed is also due to a multi-core implementation: the hash kernel uses 4 cores to access the disc for online hash feature generation. [sent-384, score-1.165]
62 We apply our hash kernels and random projection (Achlioptas, 2003) to the SGD linear SVM. [sent-389, score-0.728]
63 Instead we compare hash kernel with existing graph kernels: random walk kernel, shortest path kernel and graphlet kernel (see Borgwardt et al. [sent-392, score-1.267]
64 However, it suffices to compute TF and IDF approximately as follows: using hash features, we no longer need to build the bag of words. [sent-403, score-0.66]
65 Every word produces a hash key which is the dimension index of the word. [sent-404, score-0.698]
66 The frequency is recorded in the dimension index of its hash key. [sent-405, score-0.698]
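A minimal sketch of this hashed term-frequency accumulation (illustrative, not the paper's pipeline); no vocabulary is ever built, words are mapped straight to hash buckets, and per-bucket document frequencies give an approximate IDF:

```python
import hashlib
from collections import Counter, defaultdict


def hashed_tf_df(documents, n=2 ** 18):
    """Record term frequencies directly under hashed word indices."""
    tf_per_doc = []
    df = Counter()  # per-bucket document frequency, for an approximate IDF
    for doc in documents:
        counts = defaultdict(int)
        for word in doc.lower().split():
            j = int(hashlib.md5(word.encode()).hexdigest(), 16) % n
            counts[j] += 1
        tf_per_doc.append(counts)
        df.update(counts.keys())
    return tf_per_doc, df
```

An approximate IDF for bucket j is then log(len(documents) / df[j]).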
67 For the hash kernel there is no preprocessing step, so there are no original/new feature files. [sent-438, score-0.885]
68 Features for the hash kernel are built up online by accessing the strings on disc. [sent-439, score-0.864]
69 Note that the TrainTest time for random projection increases as the new feature dimension increases, whereas for the hash kernel the TrainTest time is almost independent of the feature dimensionality. [sent-441, score-1.008]
70 Table 4: Influence of the new dimension on collision rates (reported for the training and test sets combined) and error rates on Reuters (RCV1). [sent-454, score-0.194]
71 We compare the hash kernel with Leon Bottou’s Stochastic Gradient Descent SVM2 (BSGD), Vowpal Wabbit (Langford et al. [sent-460, score-0.8]
72 Table 5: Misclassification and memory footprint of hashing and baseline methods on DMOZ. [sent-496, score-0.284]
73 Table 2 shows that the test errors for hash kernel, BSGD and VW are competitive. [sent-526, score-0.66]
74 In Table 3 we compare the hash kernel to RP with different feature dimensions. [sent-527, score-0.885]
75 However, the error of the hash kernel is always much smaller (by about 10%) than that of RP given the same new dimension. [sent-529, score-0.8]
76 For example, when the feature dimension is 2^10, the compressed new feature file size is already 5. [sent-532, score-0.252]
77 The hash kernel is much more efficient than RP in terms of speed, since computing a hash feature requires only O(d_nz) hashing operations, where d_nz is the number of non-zero entries. [sent-534, score-1.174]
78 Computing an RP feature requires O(dn) operations, where d is the original feature dimension and n is the new feature dimension. [sent-535, score-0.293]
79 For example, the hash kernel (including Pre and TrainTest) with a 2^10 feature size is over 100 times faster than RP. [sent-539, score-0.885]
80 As can be seen in Table 4, when the feature dimension decreases, the collision rate and the error rate increase. [sent-541, score-0.279]
81 Note that the TrainTest time for random projections increases as the new feature dimension increases, whereas for the hash kernel the TrainTest time is almost independent of the feature dimensionality. [sent-583, score-1.052]
82 However, by hashing data and labels jointly we are able to obtain an efficient joint representation which makes the implementation computationally feasible. [sent-592, score-0.255]
83 As can be seen in Table 5, joint hashing of features and labels is very attractive in terms of memory usage, and in many cases it is necessary to make large multiclass categorization computationally feasible at all (the naive online SVM ran out of memory). [sent-593, score-0.384]
84 In particular, hashing only the features produces worse results than jointly hashing labels and features. [sent-594, score-0.553]
85 This is likely due to the increased collision rate: we need to use a smaller feature dimension to store the class dependent weight vectors explicitly. [sent-595, score-0.303]
86 Next we compare the hash kernel with K Nearest Neighbor (KNN) and Kmeans. [sent-600, score-0.8]
87 The accuracy plot in Figure 2 shows that on both Dmoz L2 and L3, KNN and Kmeans with various sample sizes achieve test accuracies 20% to 30% below the upper bound accuracy achieved by the hash kernel. [sent-608, score-0.66]
88 We also compare the hash kernel with RP at various feature dimensionalities on Dmoz. [sent-611, score-0.885]
89 It uses the same 4-core implementation as the hash kernel does. [sent-613, score-0.8]
90 It can be seen that the hash kernel is not only much faster but also has a much smaller error than RP given the same feature dimension. [sent-615, score-0.885]
91 Note that both the hash kernel and RP reduce the error as the feature dimension increases. [sent-616, score-0.885]
92 However, RP cannot achieve an error competitive with that of the hash kernel in Table 5, simply because with a large feature dimension RP is too slow: the estimated run time for RP with dimension 2^19 on Dmoz L3 is 2000 days. [sent-617, score-1.086]
93 Figure 3 shows that the hash kernel does very well on the first one hundred topics. [sent-620, score-0.8]
94 RW: random walk kernel, SP: shortest path kernel, GKS: graphlet kernel sampling 8497 graphlets, GK: graphlet kernel enumerating all graphlets exhaustively, HK: hash kernel, HKF: hash kernel with feature selection. [sent-665, score-2.2]
95 Table 9: No feature selection vs. feature selection for the hash kernel. [sent-679, score-0.918]
96 We compare the proposed hash kernel (with/without feature selection) with random walk kernel, shortest path kernel and graphlet kernel on the benchmark data sets. [sent-682, score-1.324]
97 From Table 8 we can see that the hash kernel, even without feature selection, still significantly outperforms the other three kernels in terms of classification accuracy on all three benchmark data sets. [sent-683, score-0.813]
98 Because the dimensionality of the canonical isomorph representation is quite high and many features are extremely sparse, a feature selection step was taken that removed features suspected of being non-informative. [sent-684, score-0.205]
99 As can be seen in Table 9, feature selection for the hash kernel can further improve the test accuracy and the area under the ROC curve. [sent-687, score-0.885]
100 Discussion. In this paper we showed that hashing is a computationally attractive technique which allows one to efficiently approximate kernels in very high dimensional settings by means of a sparse projection into a lower dimensional space. [sent-689, score-0.323]
wordName wordTfidf (topN-words)
[('hash', 0.66), ('hashing', 0.255), ('rp', 0.234), ('collision', 0.156), ('kernel', 0.14), ('ash', 0.125), ('dmoz', 0.125), ('etterson', 0.125), ('ishwanathan', 0.125), ('ror', 0.125), ('traintest', 0.125), ('mola', 0.107), ('angford', 0.107), ('tructured', 0.107), ('bsgd', 0.103), ('graphlet', 0.103), ('knn', 0.089), ('ernels', 0.088), ('feature', 0.085), ('si', 0.079), ('vw', 0.077), ('hi', 0.077), ('hk', 0.077), ('subgraphs', 0.074), ('strings', 0.074), ('shi', 0.069), ('hlf', 0.068), ('kernels', 0.068), ('kmeans', 0.061), ('rahimi', 0.061), ('vishwanathan', 0.059), ('achlioptas', 0.057), ('cormode', 0.057), ('graphlets', 0.057), ('sampling', 0.056), ('duplicates', 0.056), ('recht', 0.052), ('collide', 0.048), ('kontorovich', 0.048), ('mem', 0.048), ('vowpal', 0.048), ('wabbit', 0.048), ('articles', 0.046), ('dd', 0.046), ('muthukrishnan', 0.046), ('nauty', 0.046), ('ptc', 0.046), ('compressed', 0.044), ('projections', 0.044), ('pre', 0.044), ('features', 0.043), ('string', 0.039), ('reuters', 0.039), ('dimension', 0.038), ('sketch', 0.037), ('tr', 0.036), ('langford', 0.036), ('graphs', 0.035), ('mcmc', 0.034), ('debnath', 0.034), ('dimensionalities', 0.034), ('dnz', 0.034), ('isomorph', 0.034), ('mutag', 0.034), ('petterson', 0.034), ('pmc', 0.034), ('qinfeng', 0.034), ('topic', 0.034), ('multiclass', 0.032), ('preprocessing', 0.031), ('na', 0.03), ('borgwardt', 0.03), ('walk', 0.03), ('isomorphism', 0.03), ('categorization', 0.029), ('counts', 0.029), ('dror', 0.029), ('footprint', 0.029), ('idf', 0.029), ('graph', 0.028), ('streams', 0.028), ('randomization', 0.028), ('duplicate', 0.028), ('dim', 0.027), ('log', 0.027), ('storage', 0.027), ('stream', 0.026), ('df', 0.026), ('topics', 0.026), ('collisions', 0.026), ('shortest', 0.026), ('online', 0.025), ('document', 0.025), ('expansion', 0.025), ('store', 0.024), ('australia', 0.024), ('misclassi', 0.024), ('tf', 0.024), ('gideon', 0.024), ('sgd', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999997 38 jmlr-2009-Hash Kernels for Structured Data
Author: Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alex Smola, S.V.N. Vishwanathan
Abstract: We propose hashing to facilitate efficient kernels. This generalizes previous work using sampling and we show a principled way to compute the kernel matrix for data streams and sparse feature spaces. Moreover, we give deviation bounds from the exact kernel matrix. This has applications to estimation on strings and graphs. Keywords: hashing, stream, string kernel, graphlet kernel, multiclass classification
2 0.13239813 34 jmlr-2009-Fast Approximate kNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection
Author: Jie Chen, Haw-ren Fang, Yousef Saad
Abstract: Nearest neighbor graphs are widely used in data mining and machine learning. A brute-force method to compute the exact kNN graph takes Θ(dn^2) time for n data points in the d dimensional Euclidean space. We propose two divide and conquer methods for computing an approximate kNN graph in Θ(dn^t) time for high dimensional data (large d). The exponent t ∈ (1, 2) is an increasing function of an internal parameter α which governs the size of the common region in the divide step. Experiments show that a high quality graph can usually be obtained with small overlaps, that is, for small values of t. A few of the practical details of the algorithms are as follows. First, the divide step uses an inexpensive Lanczos procedure to perform recursive spectral bisection. After each conquer step, an additional refinement step is performed to improve the accuracy of the graph. Finally, a hash table is used to avoid repeating distance calculations during the divide and conquer process. The combination of these techniques is shown to yield quite effective algorithms for building kNN graphs. Keywords: nearest neighbors graph, high dimensional data, divide and conquer, Lanczos algorithm, spectral method
3 0.094309926 78 jmlr-2009-Refinement of Reproducing Kernels
Author: Yuesheng Xu, Haizhang Zhang
Abstract: We continue our recent study on constructing a refinement kernel for a given kernel so that the reproducing kernel Hilbert space associated with the refinement kernel contains that with the original kernel as a subspace. To motivate this study, we first develop a refinement kernel method for learning, which gives an efficient algorithm for updating a learning predictor. Several characterizations of refinement kernels are then presented. It is shown that a nontrivial refinement kernel for a given kernel always exists if the input space has an infinite cardinal number. Refinement kernels for translation invariant kernels and Hilbert-Schmidt kernels are investigated. Various concrete examples are provided. Keywords: reproducing kernels, reproducing kernel Hilbert spaces, learning with kernels, refinement kernels, translation invariant kernels, Hilbert-Schmidt kernels
4 0.091332383 45 jmlr-2009-Learning Approximate Sequential Patterns for Classification
Author: Zeeshan Syed, Piotr Indyk, John Guttag
Abstract: In this paper, we present an automated approach to discover patterns that can distinguish between sequences belonging to different labeled groups. Our method searches for approximately conserved motifs that occur with varying statistical properties in positive and negative training examples. We propose a two-step process to discover such patterns. Using locality sensitive hashing (LSH), we first estimate the frequency of all subsequences and their approximate matches within a given Hamming radius in labeled examples. The discriminative ability of each pattern is then assessed from the estimated frequencies by concordance and rank sum testing. The use of LSH to identify approximate matches for each candidate pattern helps reduce the runtime of our method. Space requirements are reduced by decomposing the search problem into an iterative method that uses a single LSH table in memory. We propose two further optimizations to the search for discriminative patterns. Clustering with redundancy based on a 2-approximate solution of the k-center problem decreases the number of overlapping approximate groups while providing exhaustive coverage of the search space. Sequential statistical methods allow the search process to use data from only as many training examples as are needed to assess significance. We evaluated our algorithm on data sets from different applications to discover sequential patterns for classification. On nucleotide sequences from the Drosophila genome compared with random background sequences, our method was able to discover approximate binding sites that were preserved upstream of genes. We observed a similar result in experiments on ChIP-on-chip data. For cardiovascular data from patients admitted with acute coronary syndromes, our pattern discovery approach identified approximately conserved sequences of morphology variations that were predictive of future death in a test population. Our data showed that the use of LSH, clustering, and sequential statistic
Author: Leonid (Aryeh) Kontorovich, Boaz Nadler
Abstract: We propose a novel framework for supervised learning of discrete concepts. Since the 1970’s, the standard computational primitive has been to find the most consistent hypothesis in a given complexity class. In contrast, in this paper we propose a new basic operation: for each pair of input instances, count how many concepts of bounded complexity contain both of them. Our approach maps instances to a Hilbert space, whose metric is induced by a universal kernel coinciding with our computational primitive, and identifies concepts with half-spaces. We prove that all concepts are linearly separable under this mapping. Hence, given a labeled sample and an oracle for evaluating the universal kernel, we can efficiently compute a linear classifier (via SVM, for example) and use margin bounds to control its generalization error. Even though exact evaluation of the universal kernel may be infeasible, in various natural situations it is efficiently approximable. Though our approach is general, our main application is to regular languages. Our approach presents a substantial departure from current learning paradigms and in particular yields a novel method for learning this fundamental concept class. Unlike existing techniques, we make no structural assumptions on the corresponding unknown automata, the string distribution or the completeness of the training set. Instead, given a labeled sample our algorithm outputs a classifier with guaranteed distribution-free generalization bounds; to our knowledge, the proposed framework is the only one capable of achieving the latter. Along the way, we touch upon several fundamental questions in complexity, automata, and machine learning. Keywords: grammar induction, regular language, finite state automaton, maximum margin hyperplane, kernel approximation
6 0.074663304 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
7 0.061814684 29 jmlr-2009-Estimating Labels from Label Proportions
8 0.061098706 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
9 0.060609855 23 jmlr-2009-Discriminative Learning Under Covariate Shift
10 0.056112591 50 jmlr-2009-Learning When Concepts Abound
11 0.05382913 72 jmlr-2009-Polynomial-Delay Enumeration of Monotonic Graph Classes
12 0.050488599 61 jmlr-2009-Nonextensive Information Theoretic Kernels on Measures
13 0.044867903 2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization
14 0.041875739 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
15 0.037214711 26 jmlr-2009-Dlib-ml: A Machine Learning Toolkit (Machine Learning Open Source Software Paper)
16 0.035877861 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models
17 0.035782579 90 jmlr-2009-Structure Spaces
18 0.035545442 68 jmlr-2009-Online Learning with Samples Drawn from Non-identical Distributions
19 0.035342157 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
20 0.034464356 43 jmlr-2009-Java-ML: A Machine Learning Library (Machine Learning Open Source Software Paper)
topicId topicWeight
[(0, 0.205), (1, -0.029), (2, 0.131), (3, -0.077), (4, -0.05), (5, -0.093), (6, 0.093), (7, -0.074), (8, -0.161), (9, 0.105), (10, -0.144), (11, -0.127), (12, 0.073), (13, -0.1), (14, -0.096), (15, -0.003), (16, 0.091), (17, -0.189), (18, -0.061), (19, 0.054), (20, -0.241), (21, -0.189), (22, 0.188), (23, -0.186), (24, -0.001), (25, 0.062), (26, 0.166), (27, -0.011), (28, 0.077), (29, 0.098), (30, 0.033), (31, 0.07), (32, 0.236), (33, 0.055), (34, -0.079), (35, 0.061), (36, 0.027), (37, 0.057), (38, -0.009), (39, 0.014), (40, 0.081), (41, 0.067), (42, -0.113), (43, 0.022), (44, 0.006), (45, -0.027), (46, 0.044), (47, 0.024), (48, 0.168), (49, 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.92958164 38 jmlr-2009-Hash Kernels for Structured Data
Author: Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alex Smola, S.V.N. Vishwanathan
Abstract: We propose hashing to facilitate efficient kernels. This generalizes previous work using sampling and we show a principled way to compute the kernel matrix for data streams and sparse feature spaces. Moreover, we give deviation bounds from the exact kernel matrix. This has applications to estimation on strings and graphs. Keywords: hashing, stream, string kernel, graphlet kernel, multiclass classification
2 0.5039677 45 jmlr-2009-Learning Approximate Sequential Patterns for Classification
Author: Zeeshan Syed, Piotr Indyk, John Guttag
Abstract: In this paper, we present an automated approach to discover patterns that can distinguish between sequences belonging to different labeled groups. Our method searches for approximately conserved motifs that occur with varying statistical properties in positive and negative training examples. We propose a two-step process to discover such patterns. Using locality sensitive hashing (LSH), we first estimate the frequency of all subsequences and their approximate matches within a given Hamming radius in labeled examples. The discriminative ability of each pattern is then assessed from the estimated frequencies by concordance and rank sum testing. The use of LSH to identify approximate matches for each candidate pattern helps reduce the runtime of our method. Space requirements are reduced by decomposing the search problem into an iterative method that uses a single LSH table in memory. We propose two further optimizations to the search for discriminative patterns. Clustering with redundancy based on a 2-approximate solution of the k-center problem decreases the number of overlapping approximate groups while providing exhaustive coverage of the search space. Sequential statistical methods allow the search process to use data from only as many training examples as are needed to assess significance. We evaluated our algorithm on data sets from different applications to discover sequential patterns for classification. On nucleotide sequences from the Drosophila genome compared with random background sequences, our method was able to discover approximate binding sites that were preserved upstream of genes. We observed a similar result in experiments on ChIP-on-chip data. For cardiovascular data from patients admitted with acute coronary syndromes, our pattern discovery approach identified approximately conserved sequences of morphology variations that were predictive of future death in a test population. Our data showed that the use of LSH, clustering, and sequential statistic
3 0.4079468 78 jmlr-2009-Refinement of Reproducing Kernels
Author: Yuesheng Xu, Haizhang Zhang
Abstract: We continue our recent study on constructing a refinement kernel for a given kernel so that the reproducing kernel Hilbert space associated with the refinement kernel contains that with the original kernel as a subspace. To motivate this study, we first develop a refinement kernel method for learning, which gives an efficient algorithm for updating a learning predictor. Several characterizations of refinement kernels are then presented. It is shown that a nontrivial refinement kernel for a given kernel always exists if the input space has an infinite cardinal number. Refinement kernels for translation invariant kernels and Hilbert-Schmidt kernels are investigated. Various concrete examples are provided. Keywords: reproducing kernels, reproducing kernel Hilbert spaces, learning with kernels, refinement kernels, translation invariant kernels, Hilbert-Schmidt kernels
4 0.40410927 34 jmlr-2009-Fast Approximate kNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection
Author: Jie Chen, Haw-ren Fang, Yousef Saad
Abstract: Nearest neighbor graphs are widely used in data mining and machine learning. A brute-force method to compute the exact kNN graph takes Θ(dn^2) time for n data points in the d dimensional Euclidean space. We propose two divide and conquer methods for computing an approximate kNN graph in Θ(dn^t) time for high dimensional data (large d). The exponent t ∈ (1, 2) is an increasing function of an internal parameter α which governs the size of the common region in the divide step. Experiments show that a high quality graph can usually be obtained with small overlaps, that is, for small values of t. A few of the practical details of the algorithms are as follows. First, the divide step uses an inexpensive Lanczos procedure to perform recursive spectral bisection. After each conquer step, an additional refinement step is performed to improve the accuracy of the graph. Finally, a hash table is used to avoid repeating distance calculations during the divide and conquer process. The combination of these techniques is shown to yield quite effective algorithms for building kNN graphs. Keywords: nearest neighbors graph, high dimensional data, divide and conquer, Lanczos algorithm, spectral method
Author: Leonid (Aryeh) Kontorovich, Boaz Nadler
Abstract: We propose a novel framework for supervised learning of discrete concepts. Since the 1970’s, the standard computational primitive has been to find the most consistent hypothesis in a given complexity class. In contrast, in this paper we propose a new basic operation: for each pair of input instances, count how many concepts of bounded complexity contain both of them. Our approach maps instances to a Hilbert space, whose metric is induced by a universal kernel coinciding with our computational primitive, and identifies concepts with half-spaces. We prove that all concepts are linearly separable under this mapping. Hence, given a labeled sample and an oracle for evaluating the universal kernel, we can efficiently compute a linear classifier (via SVM, for example) and use margin bounds to control its generalization error. Even though exact evaluation of the universal kernel may be infeasible, in various natural situations it is efficiently approximable. Though our approach is general, our main application is to regular languages. Our approach presents a substantial departure from current learning paradigms and in particular yields a novel method for learning this fundamental concept class. Unlike existing techniques, we make no structural assumptions on the corresponding unknown automata, the string distribution or the completeness of the training set. Instead, given a labeled sample our algorithm outputs a classifier with guaranteed distribution-free generalization bounds; to our knowledge, the proposed framework is the only one capable of achieving the latter. Along the way, we touch upon several fundamental questions in complexity, automata, and machine learning. Keywords: grammar induction, regular language, finite state automaton, maximum margin hyperplane, kernel approximation
6 0.35699713 61 jmlr-2009-Nonextensive Information Theoretic Kernels on Measures
7 0.33171484 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
8 0.29219589 50 jmlr-2009-Learning When Concepts Abound
9 0.2535463 29 jmlr-2009-Estimating Labels from Label Proportions
10 0.24768133 72 jmlr-2009-Polynomial-Delay Enumeration of Monotonic Graph Classes
11 0.24229059 23 jmlr-2009-Discriminative Learning Under Covariate Shift
12 0.22469956 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
13 0.2167131 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
14 0.20349205 26 jmlr-2009-Dlib-ml: A Machine Learning Toolkit (Machine Learning Open Source Software Paper)
15 0.19562894 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
16 0.18258189 8 jmlr-2009-An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems
17 0.17852473 69 jmlr-2009-Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization
19 0.17310444 80 jmlr-2009-Reproducing Kernel Banach Spaces for Machine Learning
20 0.16986714 2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization
topicId topicWeight
[(8, 0.014), (9, 0.358), (11, 0.022), (19, 0.035), (21, 0.01), (26, 0.015), (38, 0.067), (47, 0.012), (52, 0.089), (55, 0.045), (58, 0.031), (66, 0.109), (68, 0.032), (90, 0.048), (96, 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.73022616 38 jmlr-2009-Hash Kernels for Structured Data
Author: Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alex Smola, S.V.N. Vishwanathan
Abstract: We propose hashing to facilitate efficient kernels. This generalizes previous work using sampling and we show a principled way to compute the kernel matrix for data streams and sparse feature spaces. Moreover, we give deviation bounds from the exact kernel matrix. This has applications to estimation on strings and graphs. Keywords: hashing, stream, string kernel, graphlet kernel, multiclass classification
2 0.41582412 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
Author: John Langford, Lihong Li, Tong Zhang
Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of onlinelearning algorithms with convex loss functions. This method has several essential properties: 1. The degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. 2. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. 3. The approach works well empirically. We apply the approach to several data sets and find for data sets with large numbers of features, substantial sparsity is discoverable. Keywords: truncated gradient, stochastic gradient descent, online learning, sparsity, regularization, Lasso
3 0.40690818 29 jmlr-2009-Estimating Labels from Label Proportions
Author: Novi Quadrianto, Alex J. Smola, Tibério S. Caetano, Quoc V. Le
Abstract: Consider the following problem: given sets of unlabeled observations, each set with known label proportions, predict the labels of another set of observations, possibly with known label proportions. This problem occurs in areas like e-commerce, politics, spam filtering and improper content detection. We present consistent estimators which can reconstruct the correct labels with high probability in a uniform convergence sense. Experiments show that our method works well in practice. Keywords: unsupervised learning, Gaussian processes, classification and prediction, probabilistic models, missing variables
4 0.40568396 48 jmlr-2009-Learning Nondeterministic Classifiers
Author: Juan José del Coz, Jorge Díez, Antonio Bahamonde
Abstract: Nondeterministic classifiers are defined as those allowed to predict more than one class for some entries from an input space. Given that the true class should be included in predictions and the number of classes predicted should be as small as possible, these kind of classifiers can be considered as Information Retrieval (IR) procedures. In this paper, we propose a family of IR loss functions to measure the performance of nondeterministic learners. After discussing such measures, we derive an algorithm for learning optimal nondeterministic hypotheses. Given an entry from the input space, the algorithm requires the posterior probabilities to compute the subset of classes with the lowest expected loss. From a general point of view, nondeterministic classifiers provide an improvement in the proportion of predictions that include the true class compared to their deterministic counterparts; the price to be paid for this increase is usually a tiny proportion of predictions with more than one class. The paper includes an extensive experimental study using three deterministic learners to estimate posterior probabilities: a multiclass Support Vector Machine (SVM), a Logistic Regression, and a Naïve Bayes. The data sets considered comprise both UCI multi-class learning tasks and microarray expressions of different kinds of cancer. We successfully compare nondeterministic classifiers with other alternative approaches. Additionally, we shall see how the quality of posterior probabilities (measured by the Brier score) determines the goodness of nondeterministic predictions. Keywords: nondeterministic, multiclassification, reject option, multi-label classification, posterior probabilities
5 0.40300992 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
Author: Huan Xu, Constantine Caramanis, Shie Mannor
Abstract: We consider regularized support vector machines (SVMs) and show that they are precisely equivalent to a new robust optimization formulation. We show that this equivalence of robust optimization and regularization has implications for both algorithms, and analysis. In terms of algorithms, the equivalence suggests more general SVM-like algorithms for classification that explicitly build in protection to noise, and at the same time control overfitting. On the analysis front, the equivalence of robustness and regularization provides a robust optimization interpretation for the success of regularized SVMs. We use this new robustness interpretation of SVMs to give a new proof of consistency of (kernelized) SVMs, thus establishing robustness as the reason regularized SVMs generalize well. Keywords: robustness, regularization, generalization, kernel, support vector machine
6 0.39992148 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
7 0.3980777 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
8 0.39582559 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures
9 0.39544204 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
10 0.39255956 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
11 0.39224887 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models
13 0.38633007 3 jmlr-2009-A Parameter-Free Classification Method for Large Scale Learning
14 0.38471404 58 jmlr-2009-NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM
15 0.38372755 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications
16 0.38274792 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost
17 0.38271552 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions
18 0.38096401 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
19 0.38052481 10 jmlr-2009-Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification
20 0.38046837 70 jmlr-2009-Particle Swarm Model Selection (Special Topic on Model Selection)