jmlr jmlr2010 jmlr2010-15 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Konrad Rieck, Tammo Krueger, Ulf Brefeld, Klaus-Robert Müller
Abstract: Convolution kernels for trees provide simple means for learning with tree-structured data. The computation time of tree kernels is quadratic in the size of the trees, since all pairs of nodes need to be compared. Thus, large parse trees, obtained from HTML documents or structured network data, render convolution kernels inapplicable. In this article, we propose an effective approximation technique for parse tree kernels. The approximate tree kernels (ATKs) limit kernel computation to a sparse subset of relevant subtrees and discard redundant structures, such that training and testing of kernel-based learning methods are significantly accelerated. We devise linear programming approaches for identifying such subsets for supervised and unsupervised learning tasks, respectively. Empirically, the approximate tree kernels attain run-time improvements up to three orders of magnitude while preserving the predictive accuracy of regular tree kernels. For unsupervised tasks, the approximate tree kernels even lead to more accurate predictions by identifying relevant dimensions in feature space. Keywords: tree kernels, approximation, kernel methods, convolution kernels
Reference: text
sentIndex sentText sentNum sentScore
1 The computation time of tree kernels is quadratic in the size of the trees, since all pairs of nodes need to be compared. [sent-11, score-0.734]
2 Thus, large parse trees, obtained from HTML documents or structured network data, render convolution kernels inapplicable. [sent-12, score-0.727]
3 The approximate tree kernels (ATKs) limit kernel computation to a sparse subset of relevant subtrees and discard redundant structures, such that training and testing of kernel-based learning methods are significantly accelerated. [sent-14, score-1.14]
4 Empirically, the approximate tree kernels attain run-time improvements up to three orders of magnitude while preserving the predictive accuracy of regular tree kernels. [sent-16, score-1.124]
5 For unsupervised tasks, the approximate tree kernels even lead to more accurate predictions by identifying relevant dimensions in feature space. [sent-17, score-0.746]
6 Keywords: tree kernels, approximation, kernel methods, convolution kernels. [sent-18, score-0.93]
7 A prominent example of such a convolution is the parse tree kernel proposed by Collins and Duffy (2002), which determines the similarity of trees by counting shared subtrees. [sent-32, score-1.286]
8 The computation of parse tree kernels, however, is inherently quadratic in the number of tree nodes, as it builds on dynamic programming to compute the contribution of shared subtrees. [sent-33, score-1.127]
9 Figure 1(a) shows an exemplary parse tree of natural language text. [sent-35, score-0.737]
10 For example, the computation of a parse tree kernel for two HTML documents comprising 10,000 nodes each requires about 1 gigabyte of memory and takes over 100 seconds on a recent computer system. [sent-37, score-1.049]
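A rough back-of-the-envelope check makes this figure plausible (assuming roughly 8 bytes per table entry; the exact constant depends on the implementation):

\[
|X|\cdot|Z| = 10^{4}\times 10^{4} = 10^{8}\ \text{table entries},\qquad
10^{8}\times 8\ \text{bytes} \approx 0.8\ \text{GB},
\]

that is, on the order of 1 gigabyte for a single kernel evaluation.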
11 Given that kernel computations are performed millions of times in large-scale learning, it is evident that regular tree kernels are an inappropriate choice in many learning tasks. [sent-38, score-0.87]
12 Figure 1: Parse trees for natural language text (a) and the HTTP network protocol (b). [sent-41, score-0.956]
13 The limitation of regular tree kernels becomes apparent when considering learning tasks based on formal grammars, such as web spam detection. [sent-42, score-1.013]
14 Hence, a promising approach for web spam detection is the analysis of structure in web pages using parse trees of HTML documents. [sent-48, score-1.177]
15 To alleviate the limitations of regular tree kernels, we propose approximate tree kernels (ATKs), which approximate the kernel computation and thereby allow for efficient learning with arbitrarily sized trees in supervised and unsupervised settings. [sent-65, score-1.675]
16 The efficacy gained by approximate tree kernels rests on a two-stage process: A sparse set of relevant subtrees rooted at appropriate grammar symbols is determined from a small sample of trees, prior to subsequent training and testing processes. [sent-66, score-1.268]
17 Experiments conducted on question classification, web spam detection and network intrusion detection demonstrate the expressiveness and efficiency of our novel approximate kernels. [sent-70, score-1.156]
18 Throughout all our experiments, approximate tree kernels are significantly faster than regular convolution kernels. [sent-71, score-0.889]
19 Furthermore, the approximate tree kernels not only consistently yield the same predictive performance as regular tree kernels, but even outperform their exact counterparts in some tasks by identifying informative dimensions in feature space. [sent-73, score-1.158]
20 We introduce regular parse tree kernels in Section 2 and present our main contribution, the approximate tree kernels, in Section 3. [sent-75, score-1.439]
21 A tree X is called a parse tree of G if X is derived by assembling productions p ∈ P such that every node x ∈ X is labeled with a symbol ℓ(x) ∈ S. [sent-80, score-1.151]
22 Given two parse trees X and Z, their parse tree kernel is given by k(X, Z) = ∑_{x∈X} ∑_{z∈Z} c(x, z)  (1), where the counting function c recursively determines the number of shared subtrees rooted in the tree nodes x and z. [sent-88, score-2.224]
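To make the recursion concrete, the following minimal Python sketch (not the authors' implementation; a node object carrying its grammar production and an ordered list of child nodes is assumed) spells out Equation (1) with the usual Collins-Duffy counting function, including a depth-decay parameter λ as used in the experiments below:

    def count_shared_subtrees(x, z, lam=1.0):
        """Counting function c(x, z): decayed number of shared subtrees rooted at x and z."""
        if x.production != z.production:      # different productions share no subtree
            return 0.0
        if not x.children:                    # matching preterminal/leaf production
            return lam
        c = lam
        for cx, cz in zip(x.children, z.children):
            c *= 1.0 + count_shared_subtrees(cx, cz, lam)
        return c

    def parse_tree_kernel(X, Z, lam=1.0):
        """k(X, Z) = sum over all node pairs (x, z) of c(x, z), cf. Equation (1)."""
        return sum(count_shared_subtrees(x, z, lam)
                   for x in X.nodes() for z in Z.nodes())

Evaluating c naively like this recomputes shared subproblems; the dynamic-programming table discussed in the implementation part below avoids that.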
23 Several extensions and refinements of the parse tree kernel have been proposed in the literature to increase its expressiveness for specific learning tasks. [sent-93, score-0.928]
24 Moreover, Moschitti and Zanzotto (2007) extend the parse tree kernel to operate on pairs of trees for deriving relations between sentences in tasks such as textual entailment recognition. [sent-97, score-1.097]
25 Selecting discriminative subtrees for tree kernels was first studied by Suzuki et al. [sent-99, score-0.885]
26 A feature selection procedure based on statistical tests is embedded in the dynamic programming, such that relevant substructures are identified during computation of the parse tree kernel (see also Suzuki and Isozaki, 2005). [sent-101, score-0.973]
27 While this procedure significantly improves the expressiveness of the corresponding tree kernels, the entanglement of feature selection and dynamic programming unfortunately prevents any improvement of run-time and memory requirements over regular tree kernels. [sent-102, score-1.07]
28 First steps toward run-time improvement of tree kernels have been devised by Moschitti (2006a), who introduces an alternative algorithm limiting computation to node pairs with matching grammar symbols. [sent-103, score-0.826]
29 Approximate Tree Kernels. The previous section argues that the computation of regular tree kernels using dynamic programming is infeasible for large tree structures. [sent-108, score-0.746]
30 Our approximation of tree kernels is based on the observation that trees often contain redundant parts that are not only irrelevant for the learning task but also slow down the kernel computation unnecessarily. [sent-110, score-1.034]
31 The number of generic subtrees contained in a single parse tree is exponential in the number of nodes and thus intractable for large tree structures. [sent-115, score-1.349]
32 For the task at hand, the selection function ω needs to be adapted to the tree data before the resulting approximate tree kernel can be applied together with learning algorithms. [sent-121, score-1.04]
33 Note that the exact parse tree kernel in Equation (1) is obtained as a special case of Equation (3) if ω(s) = 1 for all symbols s ∈ S. [sent-122, score-1.074]
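One way to read this definition in code (a sketch under the same assumptions as above; the binary selection function ω over grammar symbols is represented as a set of selected symbols, and a node attribute .symbol is assumed):

    def approximate_tree_kernel(X, Z, selected_symbols, lam=1.0):
        """Approximate tree kernel: only node pairs rooted at selected grammar symbols
        contribute; selecting all symbols recovers the exact kernel of Equation (1)."""
        return sum(count_shared_subtrees(x, z, lam)
                   for x in X.nodes() if x.symbol in selected_symbols
                   for z in Z.nodes() if z.symbol in selected_symbols)

Since c(x, z) vanishes whenever the productions of x and z differ, restricting both sums to selected symbols is, for a binary ω, equivalent to weighting each node pair by ω of the shared symbol.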
34 Proposition 2: The approximate tree kernel in Definition 1 is a kernel function. [sent-124, score-0.818]
35 Proposition 2 allows us to view approximate tree kernels as feature selection techniques. [sent-128, score-0.746]
36 Additionally, we note that the approximate tree kernel realizes a run-time speed-up by a factor of qω, which depends on the number of selected symbols in ω and the number of subtrees rooted at these symbols. [sent-131, score-1.168]
37 If the selection function ω discards redundant and irrelevant subtrees from the kernel computation, the approximate kernel can not only be computed faster but also preserves the discriminative expressiveness of the regular tree kernel. [sent-139, score-1.295]
38 Note that we exclude diagonal elements, that is, k̂ω(Xi, Xi), from Equation (7), as the large self-similarity induced by the parse tree kernel impacts numerical stability on large tree structures (see Collins and Duffy, 2002). [sent-172, score-1.225]
39 Optimizing Equation (7) directly will inevitably reproduce the regular convolution kernel as all subtrees contribute positively to the kernel computation. [sent-173, score-0.794]
40 Using the average comparison frequency f, we bound the expected run-time of the approximate kernel by ρ ∈ (0, 1], the ratio of expected node comparisons with respect to the exact parse tree kernel. [sent-190, score-1.063]
41 The optimal selection function ω∗ gives rise to a tree kernel that approximates the regular kernel as closely as possible, while on average considering only a fraction ρ of node pairs for computing the similarities. [sent-194, score-0.943]
42 Depending on the learning task at hand, these constraints are exchangeable, such that approximate tree kernels in supervised settings may also be restricted to the ratio ρ of expected node comparisons and the unsupervised formulation can be alternatively conditioned on a fixed number of symbols N. [sent-198, score-1.038]
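As an illustration of how such a constrained selection stage can be set up, the following sketch solves a linear-programming relaxation of the symbol selection (a schematic formulation, not the authors' exact optimization problem; the per-symbol relevance scores and expected comparison frequencies are assumed to be precomputed from a small sample of trees):

    import numpy as np
    from scipy.optimize import linprog

    def select_symbols(scores, comparison_freq, rho):
        """Schematic LP relaxation of the selection stage.

        scores[s]          -- precomputed relevance score of grammar symbol s (assumed given)
        comparison_freq[s] -- expected fraction of node comparisons involving s (assumed given)
        rho                -- budget on the expected ratio of node comparisons

        maximize   sum_s scores[s] * omega[s]
        subject to sum_s comparison_freq[s] * omega[s] <= rho,   0 <= omega[s] <= 1
        """
        n = len(scores)
        res = linprog(c=-np.asarray(scores, dtype=float),               # linprog minimizes
                      A_ub=np.asarray(comparison_freq, dtype=float)[None, :],
                      b_ub=[rho],
                      bounds=[(0.0, 1.0)] * n,
                      method="highs")
        return {s for s, w in enumerate(res.x) if w > 0.5}              # round to a binary selection

For a supervised task, the budget constraint could equally be replaced by a bound on the number of selected symbols N, mirroring the exchangeable constraints described above.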
43 4 Implementation. A standard technique for computing tree kernels is dynamic programming, where a table of |X| × |Z| elements is used to store the counts of subtrees during recursive evaluations. [sent-215, score-0.874]
44 The computation of the tree kernel is then carried out by looping over the node pairs and counting the number of shared subtrees rooted at each pair. [sent-239, score-1.037]
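A minimal sketch of this scheme, memoizing the same counting function as in the earlier sketch over the |X| × |Z| node pairs (node and tree interfaces are the same assumptions as above):

    def parse_tree_kernel_dp(X, Z, lam=1.0):
        """Compute k(X, Z) with a |X| x |Z| table of memoized subtree counts."""
        xs, zs = list(X.nodes()), list(Z.nodes())
        ix = {id(x): i for i, x in enumerate(xs)}   # node -> row index
        iz = {id(z): j for j, z in enumerate(zs)}   # node -> column index
        table = {}                                  # (i, j) -> c(x_i, z_j)

        def c(x, z):
            key = (ix[id(x)], iz[id(z)])
            if key not in table:
                if x.production != z.production:
                    val = 0.0
                elif not x.children:
                    val = lam
                else:
                    val = lam
                    for cx, cz in zip(x.children, z.children):
                        val *= 1.0 + c(cx, cz)
                table[key] = val
            return table[key]

        return sum(c(x, z) for x in xs for z in zs)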
45 While the standard implementation of the parse tree kernel (e. [sent-243, score-0.864]
46 , 2004), the efficiency of our approximate tree kernels is rooted in decoupling the selection of symbols from later application of the learned kernel function. [sent-261, score-1.186]
47 This procedure ensures that the approximate tree kernels are first determined over equally weighted subtrees, hence allowing for an unbiased optimization in the selection phase. [sent-275, score-0.746]
48 Experiments on Artificial Data. Before studying the expressiveness and performance of approximate tree kernels in real-world applications, we aim at gaining insights into the approximation process. [sent-279, score-0.761]
49 The parse tree kernel (PTK) leads to perfect discrimination between the two classes, yet the approximate tree kernel (ATK) performs equally well, irrespective of the number of selected symbols. [sent-308, score-1.522]
50 While for the exact tree kernel the optimal λ is 10−2 , the approximate kernel yields best results with λ = 10−3 , thus putting emphasis on shallow subtrees and the discriminative production rules rooted at C and D. [sent-311, score-1.272]
51 Although the differences are marginal, comparing the spectra allows for viewing the approximate kernel as a denoised variant of the regular parse tree kernel. [sent-322, score-1.011]
52 Second, we sample the parse trees such that training, validation, and testing sets contain 99% positive and 1% negative instances each, thus matching the anomaly detection scenario. [sent-333, score-0.793]
53 Figure 5(a) shows the detection performance for the parse tree kernel and the approximate tree kernel for varying values of ρ. [sent-336, score-1.69]
54 The parse tree kernel reaches an AUC value of 57%. [sent-337, score-0.864]
55 Moreover, for the approximate kernel, shallow subtrees are sufficient for the detection of anomalies, which is indicated by an optimal λ = 10−3, whereas for the exact kernel subtrees of all depths need to be considered due to an optimal λ = 1. [sent-340, score-1.167]
56 The high detection performance can be explained by considering a kernel PCA of the two tree kernels in Figure 5(b). [sent-341, score-1.0]
57 The redundant production rules introduce irrelevant and noisy dimensions into the feature space induced by the parse tree kernel. [sent-342, score-0.752]
58 In all experiments we employ the exact parse tree kernel and state-of-the-art implementations as baseline methods. [sent-361, score-0.898]
59 Each question is transformed to a respective parse tree using the MEI Parser (Charniak, 1999). [sent-366, score-0.712]
60 From the top 20 sites of both classes we sample 5,000 parse trees covering 974 web spam documents and 4,026 normal HTML pages. [sent-376, score-0.878]
61 For question classification, the largest tree comprises 113 nodes, while several parse trees in the web spam and intrusion detection data consist of more than 5,000 nodes. [sent-405, score-1.692]
62 5 and conduct the following experimental procedure: parse trees are randomly drawn from each data set and split into training, validation and test partitions consisting of 1,000 trees each. [sent-407, score-0.717]
63 1 Results for Question Classification. We first study the expressiveness of the approximate tree kernel and the exact parse tree kernel for the question classification task. [sent-414, score-1.628]
64 We thus vary the number of selected symbols in Optimization Problem 1 and report on the achieved classification performance for the approximate tree kernel for varying N and the exact tree kernel in Figure 7(a). [sent-415, score-1.417]
65 However, the curve saturates to the performance of the regular parse tree kernel when 7 or more symbols are selected. [sent-417, score-0.93]
66 Figure 6: Tree sizes for question classification, web spam detection and intrusion detection (HTTP and FTP). [sent-452, score-1.207]
67 Figure 7(b) shows the eigenspectra of the parse tree kernel and its approximate variant with the above 7 selected symbols. [sent-460, score-1.001]
68 Approximate tree kernels identify a simplified representation that proves robust against label noise and leads to the same classification rate as regular parse tree kernels. [sent-471, score-1.38]
69 2 Results for Web Spam Detection. We now study approximate tree kernels for web spam detection. [sent-473, score-1.028]
70 Unfortunately, training SVMs using the exact parse tree kernel proves intractable for many large trees in the data due to their excessive memory requirements. [sent-474, score-1.167]
71 The approximation is consistently on par with the regular parse tree kernel for four or more selected labels, as the differences in this interval are not significant. [sent-478, score-0.958]
72 As a consequence, the optimal λ = 10−1 for the approximate kernel effectively captures discriminative features reflecting web spam templates rooted at the HTML and BODY tags. [sent-483, score-0.745]
73 As for the question classification task, the exact and approximate tree kernels share the same expressiveness. [sent-485, score-0.767]
74 Figure 9: Classification performance (a) and kernel PCA plot for N = 2 (b) for the web spam detection task. [sent-492, score-0.937]
75 Moreover, the selection of symbols for web spam detection proves robust against label noise. [sent-495, score-0.774]
76 3 Results for Intrusion Detection. In this section, we study the expressiveness of approximate kernels for unsupervised intrusion detection. [sent-499, score-0.701]
77 The resulting approximate tree kernels are then employed together with a one-class SVM for the detection of attacks in HTTP and FTP parse trees. [sent-501, score-1.251]
78 We again exclude trees comprising more than 1,500 nodes due to prohibitive memory requirements for the exact tree kernel. [sent-503, score-0.79]
79 Clearly, the approximate tree kernel performs identically to its exact counterpart if the ratio of node comparisons ρ equals 100%. [sent-505, score-0.748]
80 If the ratio of node comparisons is restricted to only a fraction, the approximate kernel significantly outperforms the exact parse tree kernel and leads to a superior detection rate. [sent-530, score-1.388]
81 The optimal depth parameter λ for the approximate kernel is 10−2 for HTTP and 10−2 for FTP, while the exact tree kernel requires λ = 10−1 in the optimal setting. [sent-533, score-0.852]
82 That is, the approximation is not only more accurate than the exact parse tree kernel but also concisely represented. [sent-543, score-0.898]
83 In this section we compare the runtime and memory requirements of the approximate tree kernel with state-of-the-art implementations of the exact tree kernels. [sent-546, score-1.133]
84 Table 1: Selection stage (on 250 parse trees) prior to the training and testing phase. Question classification: 17s ±0; web spam detection: 144s ±28; intrusion detection (HTTP): 43s ±7; intrusion detection (FTP): 31s ±2. [sent-550, score-1.381]
85 4 lists the training and testing times using the approximate tree kernel (ATK) and a fast implementation for the exact tree kernel (PTK2) devised by Moschitti (2006a). [sent-554, score-1.213]
86 4 is limited because parse trees containing more than 1,500 nodes have been excluded from the experiment and the true performance gain induced by approximate tree kernels cannot be estimated. [sent-563, score-1.299]
87 As baselines, we include a standard implementation of the parse tree kernel (PTK1) detailed by Shawe-Taylor and Cristianini (2004) and the improved variant (PTK2) proposed by Moschitti (2006a). [sent-567, score-0.864]
88 the learning tasks of web spam and intrusion detection (HTTP), where results for question classification and FTP are analogous. [sent-579, score-0.815]
89 By contrast, the approximate tree kernel computes similarities between trees up to three orders of magnitude faster and yields a worst-case computation time of less than 40 ms for the web spam detection task and less than 20 ms for the intrusion detection task. [sent-586, score-1.866]
90 The worst-case analysis shows that the exact tree kernel scales quadratically in the number of nodes whereas the approximate tree kernel is computed in sub-quadratic time in the size of the trees. [sent-587, score-1.299]
91 Figure 14 reports on average and worst-case memory requirements for the web spam detection and intrusion detection tasks. [sent-589, score-1.083]
92 In all figures, the curves of the approximate kernel are significantly below the variants of the parse tree kernel. [sent-592, score-0.945]
93 The allocated memory for the regular tree kernel exceeds 1 gigabyte in both learning tasks, which is clearly prohibitive for a single kernel computation. [sent-593, score-0.871]
94 For the worst-case estimation, the memory consumption of the exact kernel scales quadratically in the number of tree nodes while the approximate tree kernel scales sub-quadratically due to the sparse selection of symbols. [sent-595, score-1.416]
95 Conclusions. Learning with large trees renders regular parse tree kernels inapplicable due to quadratic run-time and memory requirements. [sent-597, score-1.266]
96 In the selection stage, the computation of tree kernels is narrowed to a sparse subset of subtrees rooted in appropriate grammar symbols. [sent-600, score-1.06]
97 We evaluate the approximate tree kernels with SVMs as underlying learning algorithms for question classification, web spam detection and intrusion detection. [sent-604, score-1.352]
98 In all experiments, the approximate tree kernels not only replicate the predictive performance of exact kernels but also provide concise representations by operating on only 2–10% of the available grammar symbols. [sent-605, score-1.079]
99 Here, a kernel PCA shows that approximate tree kernels effectively identify relevant dimensions in feature space and discard redundant and noisy subspaces from the kernel computation. [sent-610, score-1.102]
100 Moreover, the devised approximate tree kernels build on the concept of convolution over local kernel functions. [sent-615, score-1.011]
wordName wordTfidf (topN-words)
[('tree', 0.361), ('parse', 0.315), ('kernels', 0.255), ('intrusion', 0.252), ('subtrees', 0.226), ('atk', 0.212), ('trees', 0.201), ('spam', 0.197), ('detection', 0.196), ('kernel', 0.188), ('symbols', 0.176), ('ftp', 0.138), ('web', 0.134), ('convolution', 0.126), ('ieck', 0.12), ('moschitti', 0.12), ('refeld', 0.12), ('rueger', 0.12), ('html', 0.118), ('ptk', 0.102), ('uller', 0.102), ('grammar', 0.093), ('nodes', 0.086), ('rieck', 0.083), ('approximate', 0.081), ('pproximate', 0.078), ('rooted', 0.076), ('ernels', 0.073), ('pca', 0.069), ('memory', 0.068), ('counting', 0.066), ('regular', 0.066), ('traf', 0.065), ('expressiveness', 0.064), ('ree', 0.06), ('protocol', 0.059), ('node', 0.059), ('nx', 0.057), ('anomaly', 0.055), ('symbol', 0.055), ('unsupervised', 0.049), ('selection', 0.049), ('duffy', 0.047), ('id', 0.047), ('production', 0.047), ('nz', 0.044), ('security', 0.044), ('discriminative', 0.043), ('attacks', 0.043), ('requirements', 0.04), ('stage', 0.04), ('laskov', 0.039), ('eskin', 0.037), ('kruegel', 0.037), ('proccessing', 0.037), ('question', 0.036), ('exact', 0.034), ('cristianini', 0.033), ('ws', 0.033), ('language', 0.033), ('supervised', 0.032), ('dynamic', 0.032), ('pairs', 0.032), ('fraunhofer', 0.032), ('realizes', 0.032), ('documents', 0.031), ('collins', 0.031), ('suzuki', 0.031), ('roc', 0.03), ('ms', 0.03), ('shared', 0.029), ('redundant', 0.029), ('programming', 0.029), ('selected', 0.028), ('substructures', 0.028), ('castillo', 0.028), ('eigenspectra', 0.028), ('exemplary', 0.028), ('krueger', 0.028), ('ount', 0.028), ('paxson', 0.028), ('rfc', 0.028), ('ssel', 0.028), ('syntactical', 0.028), ('tammo', 0.028), ('shallow', 0.028), ('kb', 0.026), ('templates', 0.026), ('subtree', 0.026), ('matching', 0.026), ('auc', 0.026), ('comparisons', 0.025), ('tags', 0.025), ('hash', 0.024), ('header', 0.024), ('classi', 0.023), ('brefeld', 0.023), ('plot', 0.023), ('label', 0.022), ('svm', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 15 jmlr-2010-Approximate Tree Kernels
Author: Konrad Rieck, Tammo Krueger, Ulf Brefeld, Klaus-Robert Müller
Abstract: Convolution kernels for trees provide simple means for learning with tree-structured data. The computation time of tree kernels is quadratic in the size of the trees, since all pairs of nodes need to be compared. Thus, large parse trees, obtained from HTML documents or structured network data, render convolution kernels inapplicable. In this article, we propose an effective approximation technique for parse tree kernels. The approximate tree kernels (ATKs) limit kernel computation to a sparse subset of relevant subtrees and discard redundant structures, such that training and testing of kernel-based learning methods are significantly accelerated. We devise linear programming approaches for identifying such subsets for supervised and unsupervised learning tasks, respectively. Empirically, the approximate tree kernels attain run-time improvements up to three orders of magnitude while preserving the predictive accuracy of regular tree kernels. For unsupervised tasks, the approximate tree kernels even lead to more accurate predictions by identifying relevant dimensions in feature space. Keywords: tree kernels, approximation, kernel methods, convolution kernels
2 0.16100487 53 jmlr-2010-Inducing Tree-Substitution Grammars
Author: Trevor Cohn, Phil Blunsom, Sharon Goldwater
Abstract: Inducing a grammar from text has proven to be a notoriously challenging learning task despite decades of research. The primary reason for its difficulty is that in order to induce plausible grammars, the underlying model must be capable of representing the intricacies of language while also ensuring that it can be readily learned from data. The majority of existing work on grammar induction has favoured model simplicity (and thus learnability) over representational capacity by using context free grammars and first order dependency grammars, which are not sufficiently expressive to model many common linguistic constructions. We propose a novel compromise by inferring a probabilistic tree substitution grammar, a formalism which allows for arbitrarily large tree fragments and thereby better represent complex linguistic structures. To limit the model’s complexity we employ a Bayesian non-parametric prior which biases the model towards a sparse grammar with shallow productions. We demonstrate the model’s efficacy on supervised phrase-structure parsing, where we induce a latent segmentation of the training treebank, and on unsupervised dependency grammar induction. In both cases the model uncovers interesting latent linguistic structures while producing competitive results. Keywords: grammar induction, tree substitution grammar, Bayesian non-parametrics, Pitman-Yor process, Chinese restaurant process
3 0.16057082 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
Author: Kamaledin Ghiasi-Shirazi, Reza Safabakhsh, Mostafa Shamsi
Abstract: Appropriate selection of the kernel function, which implicitly defines the feature space of an algorithm, has a crucial role in the success of kernel methods. In this paper, we consider the problem of optimizing a kernel function over the class of translation invariant kernels for the task of binary classification. The learning capacity of this class is invariant with respect to rotation and scaling of the features and it encompasses the set of radial kernels. We show that how translation invariant kernel functions can be embedded in a nested set of sub-classes and consider the kernel learning problem over one of these sub-classes. This allows the choice of an appropriate sub-class based on the problem at hand. We use the criterion proposed by Lanckriet et al. (2004) to obtain a functional formulation for the problem. It will be proven that the optimal kernel is a finite mixture of cosine functions. The kernel learning problem is then formulated as a semi-infinite programming (SIP) problem which is solved by a sequence of quadratically constrained quadratic programming (QCQP) sub-problems. Using the fact that the cosine kernel is of rank two, we propose a formulation of a QCQP sub-problem which does not require the kernel matrices to be loaded into memory, making the method applicable to large-scale problems. We also address the issue of including other classes of kernels, such as individual kernels and isotropic Gaussian kernels, in the learning process. Another interesting feature of the proposed method is that the optimal classifier has an expansion in terms of the number of cosine kernels, instead of support vectors, leading to a remarkable speedup at run-time. As a by-product, we also generalize the kernel trick to complex-valued kernel functions. Our experiments on artificial and real-world benchmark data sets, including the USPS and the MNIST digit recognition data sets, show the usefulness of the proposed method. Keywords: kernel learning, translation invariant k
4 0.14217377 44 jmlr-2010-Graph Kernels
Author: S.V.N. Vishwanathan, Nicol N. Schraudolph, Risi Kondor, Karsten M. Borgwardt
Abstract: We present a unified framework to study graph kernels, special cases of which include the random walk (Gärtner et al., 2003; Borgwardt et al., 2005) and marginalized (Kashima et al., 2003, 2004; Mahé et al., 2004) graph kernels. Through reduction to a Sylvester equation we improve the time complexity of kernel computation between unlabeled graphs with n vertices from O(n^6) to O(n^3). We find a spectral decomposition approach even more efficient when computing entire kernel matrices. For labeled graphs we develop conjugate gradient and fixed-point methods that take O(dn^3) time per iteration, where d is the size of the label set. By extending the necessary linear algebra to Reproducing Kernel Hilbert Spaces (RKHS) we obtain the same result for d-dimensional edge kernels, and O(n^4) in the infinite-dimensional case; on sparse graphs these algorithms only take O(n^2) time per iteration in all cases. Experiments on graphs from bioinformatics and other application domains show that these techniques can speed up computation of the kernel by an order of magnitude or more. We also show that certain rational kernels (Cortes et al., 2002, 2003, 2004) when specialized to graphs reduce to our random walk graph kernel. Finally, we relate our framework to R-convolution kernels (Haussler, 1999) and provide a kernel that is close to the optimal assignment kernel of Fröhlich et al. (2006) yet provably positive semi-definite. Keywords: linear algebra in RKHS, Sylvester equations, spectral decomposition, bioinformatics, rational kernels, transducers, semirings, random walks
5 0.12592031 7 jmlr-2010-A Streaming Parallel Decision Tree Algorithm
Author: Yael Ben-Haim, Elad Tom-Tov
Abstract: We propose a new algorithm for building decision tree classifiers. The algorithm is executed in a distributed environment and is especially designed for classifying large data sets and streaming data. It is empirically shown to be as accurate as a standard decision tree classifier, while being scalable for processing of streaming data on multiple processors. These findings are supported by a rigorous analysis of the algorithm’s accuracy. The essence of the algorithm is to quickly construct histograms at the processors, which compress the data to a fixed amount of memory. A master processor uses this information to find near-optimal split points to terminal tree nodes. Our analysis shows that guarantees on the local accuracy of split points imply guarantees on the overall tree accuracy. Keywords: decision tree classifiers, distributed computing, streaming data, scalability
6 0.099723503 47 jmlr-2010-Hilbert Space Embeddings and Metrics on Probability Measures
7 0.092645153 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
8 0.092620395 29 jmlr-2010-Covariance in Unsupervised Learning of Probabilistic Grammars
9 0.073719546 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning
10 0.07182309 40 jmlr-2010-Fast and Scalable Local Kernel Machines
11 0.068602674 11 jmlr-2010-An Investigation of Missing Data Methods for Classification Trees Applied to Binary Response Data
12 0.066273026 83 jmlr-2010-On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation
13 0.064598463 69 jmlr-2010-Lp-Nested Symmetric Distributions
14 0.0633756 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
15 0.06250304 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models
16 0.060809158 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
17 0.058494598 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers
18 0.056670092 90 jmlr-2010-Permutation Tests for Studying Classifier Performance
19 0.053648368 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data
20 0.051425599 110 jmlr-2010-The SHOGUN Machine Learning Toolbox
topicId topicWeight
[(0, -0.229), (1, 0.044), (2, -0.216), (3, 0.212), (4, 0.064), (5, -0.08), (6, -0.304), (7, -0.148), (8, -0.091), (9, -0.04), (10, 0.057), (11, -0.029), (12, 0.012), (13, -0.043), (14, -0.131), (15, 0.075), (16, 0.109), (17, -0.014), (18, -0.032), (19, -0.024), (20, -0.063), (21, -0.113), (22, -0.043), (23, -0.095), (24, 0.091), (25, 0.09), (26, 0.049), (27, 0.008), (28, -0.13), (29, -0.038), (30, -0.053), (31, -0.115), (32, -0.043), (33, 0.066), (34, -0.089), (35, 0.038), (36, -0.011), (37, -0.06), (38, -0.126), (39, -0.008), (40, -0.037), (41, -0.109), (42, 0.035), (43, 0.031), (44, -0.142), (45, -0.038), (46, -0.049), (47, -0.034), (48, -0.03), (49, -0.066)]
simIndex simValue paperId paperTitle
same-paper 1 0.97360528 15 jmlr-2010-Approximate Tree Kernels
Author: Konrad Rieck, Tammo Krueger, Ulf Brefeld, Klaus-Robert Müller
Abstract: Convolution kernels for trees provide simple means for learning with tree-structured data. The computation time of tree kernels is quadratic in the size of the trees, since all pairs of nodes need to be compared. Thus, large parse trees, obtained from HTML documents or structured network data, render convolution kernels inapplicable. In this article, we propose an effective approximation technique for parse tree kernels. The approximate tree kernels (ATKs) limit kernel computation to a sparse subset of relevant subtrees and discard redundant structures, such that training and testing of kernel-based learning methods are significantly accelerated. We devise linear programming approaches for identifying such subsets for supervised and unsupervised learning tasks, respectively. Empirically, the approximate tree kernels attain run-time improvements up to three orders of magnitude while preserving the predictive accuracy of regular tree kernels. For unsupervised tasks, the approximate tree kernels even lead to more accurate predictions by identifying relevant dimensions in feature space. Keywords: tree kernels, approximation, kernel methods, convolution kernels
2 0.61480957 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
Author: Kamaledin Ghiasi-Shirazi, Reza Safabakhsh, Mostafa Shamsi
Abstract: Appropriate selection of the kernel function, which implicitly defines the feature space of an algorithm, has a crucial role in the success of kernel methods. In this paper, we consider the problem of optimizing a kernel function over the class of translation invariant kernels for the task of binary classification. The learning capacity of this class is invariant with respect to rotation and scaling of the features and it encompasses the set of radial kernels. We show that how translation invariant kernel functions can be embedded in a nested set of sub-classes and consider the kernel learning problem over one of these sub-classes. This allows the choice of an appropriate sub-class based on the problem at hand. We use the criterion proposed by Lanckriet et al. (2004) to obtain a functional formulation for the problem. It will be proven that the optimal kernel is a finite mixture of cosine functions. The kernel learning problem is then formulated as a semi-infinite programming (SIP) problem which is solved by a sequence of quadratically constrained quadratic programming (QCQP) sub-problems. Using the fact that the cosine kernel is of rank two, we propose a formulation of a QCQP sub-problem which does not require the kernel matrices to be loaded into memory, making the method applicable to large-scale problems. We also address the issue of including other classes of kernels, such as individual kernels and isotropic Gaussian kernels, in the learning process. Another interesting feature of the proposed method is that the optimal classifier has an expansion in terms of the number of cosine kernels, instead of support vectors, leading to a remarkable speedup at run-time. As a by-product, we also generalize the kernel trick to complex-valued kernel functions. Our experiments on artificial and real-world benchmark data sets, including the USPS and the MNIST digit recognition data sets, show the usefulness of the proposed method. Keywords: kernel learning, translation invariant k
3 0.59040636 7 jmlr-2010-A Streaming Parallel Decision Tree Algorithm
Author: Yael Ben-Haim, Elad Tom-Tov
Abstract: We propose a new algorithm for building decision tree classifiers. The algorithm is executed in a distributed environment and is especially designed for classifying large data sets and streaming data. It is empirically shown to be as accurate as a standard decision tree classifier, while being scalable for processing of streaming data on multiple processors. These findings are supported by a rigorous analysis of the algorithm’s accuracy. The essence of the algorithm is to quickly construct histograms at the processors, which compress the data to a fixed amount of memory. A master processor uses this information to find near-optimal split points to terminal tree nodes. Our analysis shows that guarantees on the local accuracy of split points imply guarantees on the overall tree accuracy. Keywords: decision tree classifiers, distributed computing, streaming data, scalability
4 0.55137569 53 jmlr-2010-Inducing Tree-Substitution Grammars
Author: Trevor Cohn, Phil Blunsom, Sharon Goldwater
Abstract: Inducing a grammar from text has proven to be a notoriously challenging learning task despite decades of research. The primary reason for its difficulty is that in order to induce plausible grammars, the underlying model must be capable of representing the intricacies of language while also ensuring that it can be readily learned from data. The majority of existing work on grammar induction has favoured model simplicity (and thus learnability) over representational capacity by using context free grammars and first order dependency grammars, which are not sufficiently expressive to model many common linguistic constructions. We propose a novel compromise by inferring a probabilistic tree substitution grammar, a formalism which allows for arbitrarily large tree fragments and thereby better represent complex linguistic structures. To limit the model’s complexity we employ a Bayesian non-parametric prior which biases the model towards a sparse grammar with shallow productions. We demonstrate the model’s efficacy on supervised phrase-structure parsing, where we induce a latent segmentation of the training treebank, and on unsupervised dependency grammar induction. In both cases the model uncovers interesting latent linguistic structures while producing competitive results. Keywords: grammar induction, tree substitution grammar, Bayesian non-parametrics, Pitman-Yor process, Chinese restaurant process
5 0.50105119 44 jmlr-2010-Graph Kernels
Author: S.V.N. Vishwanathan, Nicol N. Schraudolph, Risi Kondor, Karsten M. Borgwardt
Abstract: We present a unified framework to study graph kernels, special cases of which include the random walk (Gärtner et al., 2003; Borgwardt et al., 2005) and marginalized (Kashima et al., 2003, 2004; Mahé et al., 2004) graph kernels. Through reduction to a Sylvester equation we improve the time complexity of kernel computation between unlabeled graphs with n vertices from O(n^6) to O(n^3). We find a spectral decomposition approach even more efficient when computing entire kernel matrices. For labeled graphs we develop conjugate gradient and fixed-point methods that take O(dn^3) time per iteration, where d is the size of the label set. By extending the necessary linear algebra to Reproducing Kernel Hilbert Spaces (RKHS) we obtain the same result for d-dimensional edge kernels, and O(n^4) in the infinite-dimensional case; on sparse graphs these algorithms only take O(n^2) time per iteration in all cases. Experiments on graphs from bioinformatics and other application domains show that these techniques can speed up computation of the kernel by an order of magnitude or more. We also show that certain rational kernels (Cortes et al., 2002, 2003, 2004) when specialized to graphs reduce to our random walk graph kernel. Finally, we relate our framework to R-convolution kernels (Haussler, 1999) and provide a kernel that is close to the optimal assignment kernel of Fröhlich et al. (2006) yet provably positive semi-definite. Keywords: linear algebra in RKHS, Sylvester equations, spectral decomposition, bioinformatics, rational kernels, transducers, semirings, random walks
6 0.44983247 47 jmlr-2010-Hilbert Space Embeddings and Metrics on Probability Measures
7 0.35255045 110 jmlr-2010-The SHOGUN Machine Learning Toolbox
8 0.32849741 69 jmlr-2010-Lp-Nested Symmetric Distributions
9 0.32692465 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
10 0.31725699 11 jmlr-2010-An Investigation of Missing Data Methods for Classification Trees Applied to Binary Response Data
11 0.31639719 29 jmlr-2010-Covariance in Unsupervised Learning of Probabilistic Grammars
12 0.28592256 40 jmlr-2010-Fast and Scalable Local Kernel Machines
13 0.28484106 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning
14 0.27655432 83 jmlr-2010-On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation
15 0.27631631 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
16 0.27060986 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers
17 0.25698179 115 jmlr-2010-Using Contextual Representations to Efficiently Learn Context-Free Languages
18 0.24900983 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems
19 0.24681228 63 jmlr-2010-Learning Instance-Specific Predictive Models
20 0.21251778 50 jmlr-2010-Image Denoising with Kernels Based on Natural Image Relations
topicId topicWeight
[(3, 0.01), (4, 0.053), (8, 0.015), (21, 0.027), (22, 0.019), (32, 0.04), (33, 0.014), (36, 0.029), (37, 0.066), (58, 0.337), (71, 0.015), (75, 0.16), (81, 0.021), (85, 0.057), (91, 0.015), (96, 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.75769144 15 jmlr-2010-Approximate Tree Kernels
Author: Konrad Rieck, Tammo Krueger, Ulf Brefeld, Klaus-Robert Müller
Abstract: Convolution kernels for trees provide simple means for learning with tree-structured data. The computation time of tree kernels is quadratic in the size of the trees, since all pairs of nodes need to be compared. Thus, large parse trees, obtained from HTML documents or structured network data, render convolution kernels inapplicable. In this article, we propose an effective approximation technique for parse tree kernels. The approximate tree kernels (ATKs) limit kernel computation to a sparse subset of relevant subtrees and discard redundant structures, such that training and testing of kernel-based learning methods are significantly accelerated. We devise linear programming approaches for identifying such subsets for supervised and unsupervised learning tasks, respectively. Empirically, the approximate tree kernels attain run-time improvements up to three orders of magnitude while preserving the predictive accuracy of regular tree kernels. For unsupervised tasks, the approximate tree kernels even lead to more accurate predictions by identifying relevant dimensions in feature space. Keywords: tree kernels, approximation, kernel methods, convolution kernels
2 0.72609192 58 jmlr-2010-Kronecker Graphs: An Approach to Modeling Networks
Author: Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, Christos Faloutsos, Zoubin Ghahramani
Abstract: How can we generate realistic networks? In addition, how can we do so with a mathematically tractable model that allows for rigorous analysis of network properties? Real networks exhibit a long list of surprising properties: Heavy tails for the in- and out-degree distribution, heavy tails for the eigenvalues and eigenvectors, small diameters, and densification and shrinking diameters over time. Current network models and generators either fail to match several of the above properties, are complicated to analyze mathematically, or both. Here we propose a generative model for networks that is both mathematically tractable and can generate networks that have all the above mentioned structural properties. Our main idea here is to use a non-standard matrix operation, the Kronecker product, to generate graphs which we refer to as “Kronecker graphs”. First, we show that Kronecker graphs naturally obey common network properties. In fact, we rigorously prove that they do so. We also provide empirical evidence showing that Kronecker graphs can effectively model the structure of real networks. We then present K RON F IT, a fast and scalable algorithm for fitting the Kronecker graph generation model to large real networks. A naive approach to fitting would take super-exponential time. In contrast, K RON F IT takes linear time, by exploiting the structure of Kronecker matrix multiplication and by using statistical simulation techniques. Experiments on a wide range of large real and synthetic networks show that K RON F IT finds accurate parameters that very well mimic the properties of target networks. In fact, using just c 2010 Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, Christos Faloutsos and Zoubin Ghahramani. L ESKOVEC , C HAKRABARTI , K LEINBERG , FALOUTSOS AND G HAHRAMANI four parameters we can accurately model several aspects of global network structure. Once fitted, the model parameters can be used to gain insights about the network structure, and the resulting synt
3 0.48923585 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
Author: Ming Yuan
Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity
4 0.48289695 62 jmlr-2010-Learning Gradients: Predictive Models that Infer Geometry and Statistical Dependence
Author: Qiang Wu, Justin Guinney, Mauro Maggioni, Sayan Mukherjee
Abstract: The problems of dimension reduction and inference of statistical dependence are addressed by the modeling framework of learning gradients. The models we propose hold for Euclidean spaces as well as the manifold setting. The central quantity in this approach is an estimate of the gradient of the regression or classification function. Two quadratic forms are constructed from gradient estimates: the gradient outer product and gradient based diffusion maps. The first quantity can be used for supervised dimension reduction on manifolds as well as inference of a graphical model encoding dependencies that are predictive of a response variable. The second quantity can be used for nonlinear projections that incorporate both the geometric structure of the manifold as well as variation of the response variable on the manifold. We relate the gradient outer product to standard statistical quantities such as covariances and provide a simple and precise comparison of a variety of supervised dimensionality reduction methods. We provide rates of convergence for both inference of informative directions as well as inference of a graphical model of variable dependencies. Keywords: gradient estimates, manifold learning, graphical models, inverse regression, dimension reduction, gradient diffusion maps
5 0.48253003 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
Author: Shiliang Sun, John Shawe-Taylor
Abstract: In this paper, we propose a general framework for sparse semi-supervised learning, which concerns using a small portion of unlabeled data and a few labeled data to represent target functions and thus has the merit of accelerating function evaluations when predicting the output of a new example. This framework makes use of Fenchel-Legendre conjugates to rewrite a convex insensitive loss involving a regularization with unlabeled data, and is applicable to a family of semi-supervised learning methods such as multi-view co-regularized least squares and single-view Laplacian support vector machines (SVMs). As an instantiation of this framework, we propose sparse multi-view SVMs which use a squared ε-insensitive loss. The resultant optimization is an inf-sup problem and the optimal solutions have arguably saddle-point properties. We present a globally optimal iterative algorithm to optimize the problem. We give the margin bound on the generalization error of the sparse multi-view SVMs, and derive the empirical Rademacher complexity for the induced function class. Experiments on artificial and real-world data show their effectiveness. We further give a sequential training approach to show their possibility and potential for uses in large-scale problems and provide encouraging experimental results indicating the efficacy of the margin bound and empirical Rademacher complexity on characterizing the roles of unlabeled data for semi-supervised learning. Keywords: semi-supervised learning, Fenchel-Legendre conjugate, representer theorem, multiview regularization, support vector machine, statistical learning theory
6 0.48238945 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
7 0.47583762 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
8 0.47542396 69 jmlr-2010-Lp-Nested Symmetric Distributions
9 0.47489354 63 jmlr-2010-Learning Instance-Specific Predictive Models
10 0.47281313 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking
11 0.47278103 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
12 0.4717913 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing
13 0.47013167 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
14 0.47007936 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
15 0.47002533 43 jmlr-2010-Generalized Power Method for Sparse Principal Component Analysis
16 0.46937108 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
17 0.46859699 66 jmlr-2010-Linear Algorithms for Online Multitask Classification
18 0.46800807 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
19 0.4666231 109 jmlr-2010-Stochastic Composite Likelihood
20 0.4653877 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning