jmlr jmlr2007 jmlr2007-43 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Niels Landwehr, Kristian Kersting, Luc De Raedt
Abstract: A novel relational learning approach that tightly integrates the naïve Bayes learning scheme with the inductive logic programming rule-learner FOIL is presented. In contrast to previous combinations that have employed naïve Bayes only for post-processing the rule sets, the presented approach employs the naïve Bayes criterion to guide its search directly. The proposed technique is implemented in the N FOIL and T FOIL systems, which employ standard naïve Bayes and tree augmented naïve Bayes models respectively. We show that these integrated approaches to probabilistic model and rule learning outperform post-processing approaches. They also yield significantly more accurate models than simple rule learning and are competitive with more sophisticated ILP systems. Keywords: rule learning, naïve Bayes, statistical relational learning, inductive logic programming
Reference: text
sentIndex sentText sentNum sentScore
1 Keywords: rule learning, naïve Bayes, statistical relational learning, inductive logic programming 1. [sent-13, score-0.249]
2 Introduction The study of learning schemes that lie at the intersection of probabilistic and logic or relational learning has received a lot of attention recently (De Raedt and Kersting, 2003; Getoor and Taskar, 2007). [sent-14, score-0.193]
3 , 2001) or Stochastic Logic Programs (Muggleton, 1996)), we start from an inductive logic programming system and extend it with a probabilistic model. [sent-16, score-0.219]
4 More specifically, we start with the simplest approaches from both domains, the inductive logic programming system FOIL (Quinlan, 1990) and naïve Bayes, and integrate them in the N FOIL system. [sent-17, score-0.186]
5 In relational learning or inductive logic programming, one typically induces a set of rules (so-called clauses). [sent-24, score-0.237]
6 A straightforward but powerful idea to integrate these two approaches is to interpret the clauses or rules as propositional “features” over which a joint probability distribution can be defined. [sent-27, score-0.255]
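As a rough illustration of this idea, the boolean feature construction could be sketched as follows (the `entails` test stands in for logical entailment given the background knowledge; it is an assumed helper, not an interface from the paper):

```python
def propositionalize(example, clauses, background, entails):
    """Map a relational example to a boolean feature vector: feature j is true
    iff the body of clause j succeeds on the example given the background knowledge."""
    return [bool(entails(background, clause, example)) for clause in clauses]
```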
7 First, the features or clauses are generated (for example using an existing inductive logic programming system such as ILP-R (Pompe and Kononenko, 1995)) and then the probability estimates for the naïve Bayes are determined. [sent-33, score-0.405]
8 This actually corresponds to a static propositionalization approach, where the propositionalized problem is learned using naïve Bayes. [sent-34, score-0.246]
9 The advantage of such a dynamic propositionalization is that the criterion according to which the features are generated is that of naïve Bayes. [sent-36, score-0.234]
10 Our intention in this paper is to investigate how dynamic propositionalization compares to static propositionalization approaches that use naïve Bayes only to post-process the rule set. [sent-37, score-0.449]
11 More precisely, we will investigate: (Q1) Is there a gain in predictive accuracy of a dynamic propositionalization approach over its ILP baseline? [sent-38, score-0.236]
12 (Q2) If so, is the gain of dynamic propositionalization over its baseline larger than the gain of static propositionalization approaches? [sent-39, score-0.449]
13 (Q4) Is a dynamic propositionalization based on a simple rule learner competitive with advanced ILP approaches? [sent-41, score-0.22]
14 (Q5) Relational naïve Bayes methods such as 1BC2 (Flach and Lachiche, 2004) essentially employ all clauses within a given language bias as features in a probabilistic model and thus perform static propositionalization. [sent-42, score-0.305]
15 Does dynamic propositionalization employ fewer features and perform better than these approaches? [sent-43, score-0.234]
16 This contrasts with static propositionalization approaches, in which the criteria employed for feature generation and classification are different. [sent-48, score-0.229]
17 The search heuristic is based on class conditional likelihood and clauses are combined with naïve Bayes. [sent-50, score-0.249]
18 In the experimental evaluation, the implementation of Structural Logistic Regression only searches for clauses with up to two literals, where the first literal is fixed in advance. [sent-56, score-0.205]
19 Furthermore, no experimental comparison to static propositionalization approaches is provided. [sent-57, score-0.229]
20 Section 2 introduces some basic concepts from inductive logic programming and the simple first-order rule learning algorithm FOIL. [sent-65, score-0.186]
21 In Section 6 we experimentally evaluate the proposed methods on benchmark data sets from four different domains and specifically investigate how they compare to static propositionalization approaches. [sent-69, score-0.229]
22 FOIL: First Order Inductive Learning The problem that we tackle in this paper is a probabilistic formulation of the traditional inductive logic programming (ILP) problem. [sent-72, score-0.219]
23 Facts are definite clauses with an empty body, for example, atom(189, d189 1, c, 22, −0. [sent-92, score-0.205]
24 A hypothesis is said to entail an example given the background knowledge if the example is a logical consequence of the definite clauses in the hypothesis and background knowledge. [sent-96, score-0.364]
25 H⊂L Traditional approaches to inductive logic programming (Muggleton and De Raedt, 1994) tackle a concept-learning problem, in which there are typically two classes, and the goal is to find a complete and consistent concept-description. [sent-105, score-0.186]
26 This setting is incorporated in many well-known inductive logic programming systems such as FOIL (Quinlan, 1990), GOLEM (Muggleton and Feng, 1990), PROGOL (Muggleton, 1995) and TILDE (Blockeel and De Raedt, 1997). [sent-110, score-0.186]
27 It repeatedly searches for clauses that score well with respect to the data set and the current hypothesis and adds them to the hypothesis. [sent-114, score-0.302]
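For orientation, the generic covering loop described here can be sketched as below; the helper names (`search_best_clause`, `score`) and the clause limit are assumptions for illustration, not the systems' actual interfaces:

```python
def covering_loop(examples, background, score, search_best_clause, max_clauses=25):
    """Generic FOIL-style loop: repeatedly search for the clause that scores best
    with respect to the data and the current hypothesis, and add it."""
    hypothesis = []
    current_score = score(examples, hypothesis, background)
    while len(hypothesis) < max_clauses:
        clause = search_best_clause(examples, hypothesis, background)
        new_score = score(examples, hypothesis + [clause], background)
        if new_score <= current_score:  # no improvement: stop adding clauses
            break
        hypothesis.append(clause)
        current_score = new_score
    return hypothesis
```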
28 A clause c1 θ-subsumes a clause c2 if and only if there is a substitution θ such that c1 θ ⊆ c2 . [sent-119, score-0.268]
29 The most general clause is p(X1 , · · · , Xn ) ← where p/n is the predicate being learned and the Xi are different variables. [sent-130, score-0.172]
30 This is typically realized by either adding a new literal l to the clause yielding h ← b1 , · · · , bn , l or by applying a substitution θ yielding hθ ← b1 θ, · · · , bn θ. [sent-132, score-0.164]
31 The original FOIL algorithm uses information gain based on the number of positive and negative tuples bound to clauses as the scoring function (Quinlan, 1990), and a stopping criterion based on the minimum description length principle (Rissanen, 1978). [sent-136, score-0.257]
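For concreteness, the information-gain heuristic of the original FOIL can be written roughly as follows; the exact tuple bookkeeping in FOIL differs slightly, so treat the counts as positive/negative bindings before and after a refinement (an assumption for illustration):

```python
import math

def foil_gain(p0, n0, p1, n1):
    """FOIL information gain of refining a clause.
    p0, n0: positive/negative tuples bound before adding the literal.
    p1, n1: positive/negative tuples bound after adding the literal."""
    if p1 == 0:
        return 0.0
    info_before = -math.log2(p0 / (p0 + n0))
    info_after = -math.log2(p1 / (p1 + n1))
    return p1 * (info_before - info_after)
```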
32 M FOIL, a variant of FOIL that was particularly designed to cope with noise in the training data, uses the m-estimate for scoring clauses and a significance-based stopping criterion (Lavrač and Džeroski, 1994). [sent-137, score-0.257]
33 This will be realized by modifying the covers and score functions in the inductive logic programming setting. [sent-140, score-0.291]
34 All other elements, such as the examples, background theory and language of clauses L will—in principle—be untouched. [sent-141, score-0.235]
35 However, the set of clauses defining a hypothesis is augmented with a set of parameters that quantify the probabilistic model. [sent-142, score-0.313]
36 For ease of exposition, we will first discuss this for the case that the clauses are combined using naïve Bayes, that is, the N FOIL system. [sent-143, score-0.205]
37 The key idea for realizing this is that we interpret the clauses in H together with the example e as queries or features. [sent-152, score-0.205]
38 More formally, let H contain a set of clauses defining the predicate p. [sent-153, score-0.226]
39 Then for each clause c of the form p(X1 , · · · , Xn ) ← b1 , · · · , bn we view the query qc =← b1 , · · · , bn as a boolean feature or attribute. [sent-154, score-0.215]
40 We now define P(e | H, B) = Pλ(pθ | q1 θ, · · · , qk θ) = (Pλ(q1 θ, · · · , qk θ | pθ) · Pλ(pθ)) / Pλ(q1 θ, · · · , qk θ) where HC = {q1 , . [sent-160, score-0.261]
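A minimal sketch of how such a posterior over the boolean query outcomes can be computed; the parameter layout is an assumption for illustration:

```python
def naive_bayes_posterior(query_values, prior_pos, cond_probs):
    """Posterior P(positive | q1, ..., qk) under a naive Bayes model.
    query_values: truth values of the clause bodies q_j on example e.
    prior_pos: P(class = positive).
    cond_probs: list of (P(q_j = true | pos), P(q_j = true | neg)) pairs."""
    pos, neg = prior_pos, 1.0 - prior_pos
    for q, (p_pos, p_neg) in zip(query_values, cond_probs):
        pos *= p_pos if q else (1.0 - p_pos)
        neg *= p_neg if q else (1.0 - p_neg)
    return pos / (pos + neg)  # normalization plays the role of dividing by P(q1, ..., qk)
```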
41 To evaluate a set of clauses HC in the generic FOIL algorithm, the scoring function has to be changed accordingly, to score(E, HC , B) = P(E | H, B) where HC has been augmented with an optimal probabilistic model Hλ . [sent-202, score-0.31]
42 , 2004) or an ILP technique such as bottom clauses to generate candidate rules for selection (Davis et al. [sent-213, score-0.228]
43 More formally, the model space under consideration is H = {(HC , Hλ ) | HC ⊆ L , Hλ ∈ MC } where MC is the space of all naïve Bayes models over the clauses in HC . [sent-231, score-0.205]
44 The optimization problem is to find H∗ = arg max_H P(E | H, B) = arg max_{(HC , Hλ)} P(E | (HC , Hλ ), B), a hypothesis that jointly optimizes the clause set (structure) and probabilistic model (parameters such as the Pλ (qi |p) appearing in Example 3). [sent-232, score-0.225]
45 1 Adapting FOIL Like FOIL, N FOIL searches a set of clauses greedily, and a single clause in a general-to-specific manner using a refinement operator. [sent-241, score-0.339]
46 Because the final model in FOIL is the disjunction of the learned clauses (where every clause covers a certain subset of examples), it holds that 1. [sent-243, score-0.394]
47 (Non-recursive) clauses already learned do not need to be considered when scoring additional clauses: score(E, H ∪ {c′}, B) = score(E, {c′}, B). [sent-246, score-0.249]
48 An additional clause c′ has to be scored together with the current model: score(E, HC ∪ {c′}, B) = P(E | H′, B) where H′ has clauses HC ∪ {c′} and optimal parameters Hλ′. [sent-252, score-0.339]
49 This is replaced in N FOIL by stopping if the change in score when adding a clause falls below a certain threshold. [sent-255, score-0.226]
50 2 Parameter Estimation: An Approximation To evaluate a set of clauses HC = {q1 , . [sent-260, score-0.205]
51 , qk }: Hλ∗ = arg max_{Hλ} P(E | H, B) = arg max_λ ∏_{θ∈E} Pλ(pθ | q1 θ, · · · , qk θ) = arg max_λ ∏_{θ∈E} (∏_j Pλ(q_j θ | pθ) · Pλ(pθ)) / Pλ(q1 θ, · · · , qk θ) . [sent-266, score-0.202]
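The naive Bayes parameters themselves can be estimated by simple counting over the propositionalized examples; a sketch compatible with the posterior computation above, with Laplace smoothing as an assumed (not paper-specified) correction:

```python
def estimate_parameters(feature_vectors, labels, alpha=1.0):
    """Estimate P(class) and P(q_j = true | class) by counting, with Laplace smoothing.
    feature_vectors: one boolean list per example; labels: booleans (True = positive)."""
    n, n_pos = len(labels), sum(labels)
    prior_pos = (n_pos + alpha) / (n + 2 * alpha)
    cond_probs = []
    for j in range(len(feature_vectors[0])):
        t_pos = sum(1 for x, y in zip(feature_vectors, labels) if y and x[j])
        t_neg = sum(1 for x, y in zip(feature_vectors, labels) if not y and x[j])
        cond_probs.append(((t_pos + alpha) / (n_pos + 2 * alpha),
                           (t_neg + alpha) / ((n - n_pos) + 2 * alpha)))
    return prior_pos, cond_probs
```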
52 , qk θ) = ∏ Pλ (pθ|q1 θ, · · · , qk θ) · Pλ (q1 θ, . [sent-274, score-0.174]
53 However, when evaluating a set of clauses HC = {q1 , . [sent-354, score-0.205]
54 , qk } in T FOIL, for every clause qi ∈ HC we have to decide on a q j = q pa(i) ∈ HC for which a dependency q pa(i) → qi is added. [sent-357, score-0.295]
55 Rather than re-learning the TAN graph from scratch every time a new candidate clause q i is scored, the subgraph over the existing hypothesis H = {q1 , . [sent-360, score-0.164]
56 , qi−1 } is kept fixed, and all existing clauses q j ∈ H are considered as possible parents for the clause qi . [sent-363, score-0.376]
57 For every candidate clause c′, the best parent pa(c′) is identified and the c′ with highest score is added to H with the dependency pa(c′) → c′. [sent-367, score-0.201]
58 The computational complexity of scoring the (k + 1)th clause qk+1 in T FOIL is O(C · k · n), as all k existing clauses are considered as possible parents. [sent-369, score-0.366]
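A rough sketch of this incremental parent selection for a new candidate clause (helper names are assumptions; T FOIL's actual scoring refits the parameters for each choice):

```python
def best_parent_for_candidate(examples, hypothesis, candidate, background, score_with_parent):
    """Keep the TAN structure over existing clauses fixed and only pick a parent for
    the candidate: try each of the k existing clauses and keep the best-scoring one."""
    best_parent, best_score = None, float("-inf")
    for parent in hypothesis:  # k existing clauses -> O(k) parent choices
        s = score_with_parent(examples, hypothesis, candidate, parent, background)
        if s > best_score:
            best_parent, best_score = parent, s
    return best_parent, best_score
```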
59 The problem is to classify compounds as mutagenic or not given their chemical structure described in terms of atoms, bonds, atom charge, and information about atom and bond types that have been generated by the molecular modeling package QUANTA. [sent-381, score-0.335]
60 Additionally, we provided a relation linked(A1 , A2 , E, BT ) in the background knowledge that represents that there is a bond of type BT from atom A1 to atom A2 and A2 is of element E. [sent-412, score-0.272]
61 During the search for a clause, the algorithm also keeps a set C ∗ of the k best (partial) clauses found so far. [sent-425, score-0.224]
62 The search for additional clauses is stopped if the change in score between two successive iterations is less than 0. [sent-427, score-0.291]
63 A hypothesis is limited to contain at most 25 clauses and a clause to contain at most 10 literals. [sent-429, score-0.369]
64 1 It is based on the concept of bottom clauses, which are maximally specific clauses covering a certain example. [sent-439, score-0.205]
65 The theory is then built from clauses that contain a subset of the literals found in a bottom clause. [sent-440, score-0.231]
66 The maximum number of literals in a clause was set to 10 instead of the default 4. [sent-443, score-0.175]
67 We have therefore investigated a static propositionalization approach: frequent clauses were extracted from the relational data sets and then used as features in the propositional MACCENT system. [sent-451, score-0.538]
68 , 1998), as WARMR patterns have been shown to be effective propositionalization techniques on similar benchmarks in inductive logic programming (Srinivasan et al. [sent-453, score-0.362]
69 Comparing the results for N FOIL/T FOIL and M FOIL, the experiments clearly show a gain for dynamic propositionalization over the baseline ILP algorithm, giving an affirmative answer to Question Q1. [sent-551, score-0.22]
70 This shows that relatively simple dynamic propositionalization approaches are competitive with more advanced ILP systems, giving an affirmative answer to Question Q4. [sent-560, score-0.22]
71 In clause voting, the threshold that is varied is the number of clauses that have to cover an example before it is classified as positive (the default threshold being one). [sent-629, score-0.354]
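For illustration, sweeping this vote threshold to obtain ROC points could look like the following sketch (the `covers` predicate is an assumed helper):

```python
def clause_voting_roc(examples, labels, clauses, covers):
    """Classify an example as positive if at least t clauses cover it, and record
    (false positive rate, true positive rate) for every threshold t."""
    votes = [sum(covers(c, e) for c in clauses) for e in examples]
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for t in range(len(clauses) + 2):  # t = 0 accepts everything, t = k+1 nothing
        tp = sum(1 for v, y in zip(votes, labels) if y and v >= t)
        fp = sum(1 for v, y in zip(votes, labels) if not y and v >= t)
        points.append((fp / n_neg, tp / n_pos))
    return points
```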
72 Note that in some cases dynamic propositionalization achieves higher AUC scores than ILP systems even though it achieves lower accuracy (e. [sent-631, score-0.236]
73 At the level of induced clauses, it indicates that clauses induced by the ILP systems are specific in the sense that positive examples are typically covered by only one clause—if the decision threshold in clause voting is set to more than one, the true positive rate drops rapidly. [sent-638, score-0.339]
74 In contrast, ranking performance for the dynamic propositionalization systems and 1BC2 is more stable, meaning that these systems also offer good classification performance under varying misclassification costs (or, equivalently, class skew). [sent-639, score-0.22]
75 This indicates that dynamic propositionalization approaches can make use of more diverse rule sets, which are helpful in ranking examples by providing additional information but would produce too many false-positive classifications if interpreted as a disjunctive hypothesis. [sent-640, score-0.242]
76 To investigate Question Q2, that is, to compare dynamic and static propositionalization approaches, two additional experiments were performed: 1. [sent-760, score-0.273]
77 Learning a naïve Bayes or tree augmented naïve Bayes model over a set of clauses using a two-step approach: First, a set of rules is learned using M FOIL or A LEPH, and afterwards a (tree augmented) naïve Bayes model is built using these rules. [sent-761, score-0.317]
78 Question Q2 is whether these rules are less useful when combined with (tree augmented) naïve Bayes than the rules constructed by a dynamic propositionalization approach (N FOIL/T FOIL). [sent-763, score-0.266]
79 However, AUC scores of the static propositionalization approaches are on average much lower than those of dynamic propositionalization approaches. [sent-817, score-0.449]
80 Thus, we can answer Q2 as follows: A dynamic propositionalization approach that selects rules based on the criterion of the probabilistic model performs better than static propositionalization approaches that post-process a rule set using a probabilistic learner. [sent-824, score-0.538]
81 This is because dynamic propositionalization can make use of different/additional rules that are not considered by traditional ILP systems as they would produce too many false-positive classifications. [sent-825, score-0.243]
82 It remains to answer Question Q5, that is, to compare dynamic propositionalization approaches to relational naïve Bayes methods such as 1BC2 in terms of accuracy and the number of features they employ. [sent-826, score-0.313]
83 In the experiments, 1BC2 uses an order of magnitude more clauses than N FOIL. [sent-833, score-0.205]
84 More precisely, 1BC2 uses more than 400 (in some cases even more than 1000) clauses whereas N FOIL is limited to using 25 clauses. [sent-834, score-0.205]
85 We furthermore investigated whether the proposed dynamic propositionalization approach sometimes constructs too many clauses and overfits the training data, as the stopping criterion is based on the training set score. [sent-836, score-0.45]
86 4 Table 6: Cross-validated predictive accuracy results and average number of clauses in the final model for N FOIL and N FOIL/pruning. [sent-892, score-0.221]
87 The procedure cross-validate-accuracy(H,E,B) cross-validates the naïve Bayes model on the training data for a fixed set H of clauses and returns an accuracy estimate. [sent-898, score-0.221]
88 The algorithm greedily drops clauses from H as long as this does not decrease the cross-validated accuracy estimate. [sent-899, score-0.221]
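A sketch of that greedy post-pruning step; the signature of `cross_validate_accuracy` is an assumption, standing in for the procedure described above:

```python
def prune_clauses(clauses, examples, background, cross_validate_accuracy):
    """Greedily drop clauses as long as the cross-validated accuracy estimate of the
    naive Bayes model over the remaining clauses does not decrease."""
    current = list(clauses)
    current_acc = cross_validate_accuracy(current, examples, background)
    improved = True
    while improved and len(current) > 1:
        improved = False
        for c in list(current):
            candidate = [x for x in current if x is not c]
            acc = cross_validate_accuracy(candidate, examples, background)
            if acc >= current_acc:  # dropping c does not hurt the estimate
                current, current_acc = candidate, acc
                improved = True
                break
    return current
```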
89 The dynamic propositionalization approaches N FOIL and T FOIL yield more accurate models than simple ILP rule learning and static propositionalization approaches, and also compare favorably to the first order naïve Bayes system 1BC2 and one of the most advanced ILP systems, namely Aleph. [sent-910, score-0.449]
90 Related Work The approaches that combine statistical learning with inductive logic programming techniques for addressing classification can be divided into three categories. [sent-912, score-0.186]
91 A first class of techniques are static propositionalization approaches. [sent-913, score-0.229]
92 However, an initial step beyond static propositionalization has been taken in the work by Pompe and Kononenko (1997), where rules generated by an ILP system are post-processed by splitting and merging clauses in order to find a rule set that satisfies the naïve Bayes assumption. [sent-921, score-0.457]
93 , 2003; Dehaspe, 1997) indeed tightly integrates the inductive logic programming step with the statistical learning step in a dynamic propositionalization approach. [sent-930, score-0.423]
94 SAYU uses a “wrapper” approach where (partial) clauses generated by the refinement search of an ILP system are proposed as features to a (tree augmented) naïve Bayes, and incorporated if they improve performance. [sent-937, score-0.238]
95 The probabilistic model is trained to maximize the likelihood on the training data, while clause selection is based on the area under the precision-recall curve of the model on a separate tuning set. [sent-940, score-0.192]
96 Conclusions We have introduced the N FOIL and T FOIL systems, two dynamic propositionalization approaches that combine the simplest techniques from ILP and probabilistic learning. [sent-949, score-0.253]
97 In an experimental study on several benchmark data sets, the proposed approaches were compared to the ILP systems M FOIL and A LEPH, static propositionalization approaches and the first-order naïve Bayes system 1BC2. [sent-950, score-0.229]
98 Experimental results show that dynamic propositionalization is superior to static propositionalization and simple rule learning. [sent-951, score-0.449]
99 Moreover, our experiments indicate that the superior performance of dynamic propositionalization approaches is due to the fact that they can make use of more diverse rule sets than ILP systems. [sent-953, score-0.22]
100 Further exploring dynamic propositionalization as a way of combining (possibly more powerful) statistical learners with ILP search techniques is an interesting direction for future work. [sent-955, score-0.239]
wordName wordTfidf (topN-words)
[('foil', 0.765), ('na', 0.219), ('clauses', 0.205), ('alzheimer', 0.192), ('propositionalization', 0.176), ('leph', 0.14), ('ilp', 0.137), ('clause', 0.134), ('mutagenesis', 0.127), ('hc', 0.126), ('logic', 0.097), ('ve', 0.095), ('atom', 0.088), ('qk', 0.087), ('dsstox', 0.083), ('bayes', 0.073), ('aedt', 0.068), ('andwehr', 0.068), ('ersting', 0.068), ('ntegrating', 0.068), ('score', 0.067), ('bond', 0.066), ('relational', 0.063), ('tan', 0.061), ('inductive', 0.054), ('static', 0.053), ('mutagenic', 0.053), ('luc', 0.047), ('davis', 0.047), ('acetyl', 0.047), ('amine', 0.047), ('pompe', 0.047), ('toxic', 0.047), ('augmented', 0.045), ('il', 0.045), ('dynamic', 0.044), ('roc', 0.043), ('diterpene', 0.042), ('logical', 0.039), ('covers', 0.038), ('qi', 0.037), ('eroski', 0.036), ('sayu', 0.036), ('lachiche', 0.035), ('programming', 0.035), ('kersting', 0.035), ('flach', 0.034), ('probabilistic', 0.033), ('raedt', 0.032), ('lavra', 0.031), ('kononenko', 0.031), ('muggleton', 0.031), ('qc', 0.031), ('hypothesis', 0.03), ('background', 0.03), ('tree', 0.027), ('propositional', 0.027), ('scoring', 0.027), ('landwehr', 0.026), ('literals', 0.026), ('kristian', 0.026), ('tf', 0.025), ('stopping', 0.025), ('likelihood', 0.025), ('rankings', 0.024), ('srinivasan', 0.024), ('compounds', 0.024), ('folds', 0.024), ('rules', 0.023), ('beam', 0.022), ('disjunctive', 0.022), ('nb', 0.021), ('predicate', 0.021), ('query', 0.02), ('dehaspe', 0.02), ('search', 0.019), ('goldszmidt', 0.018), ('lz', 0.018), ('tightly', 0.017), ('getoor', 0.017), ('learned', 0.017), ('accuracy', 0.016), ('popescul', 0.016), ('freiburg', 0.016), ('chemical', 0.016), ('auc', 0.016), ('ashvin', 0.016), ('grossman', 0.016), ('maccent', 0.016), ('mutagenicity', 0.016), ('niels', 0.016), ('rmatively', 0.016), ('saso', 0.016), ('default', 0.015), ('bn', 0.015), ('pa', 0.015), ('nement', 0.015), ('features', 0.014), ('arg', 0.014), ('memory', 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999928 43 jmlr-2007-Integrating Naïve Bayes and FOIL
Author: Niels Landwehr, Kristian Kersting, Luc De Raedt
Abstract: A novel relational learning approach that tightly integrates the naïve Bayes learning scheme with the inductive logic programming rule-learner FOIL is presented. In contrast to previous combinations that have employed naïve Bayes only for post-processing the rule sets, the presented approach employs the naïve Bayes criterion to guide its search directly. The proposed technique is implemented in the N FOIL and T FOIL systems, which employ standard naïve Bayes and tree augmented naïve Bayes models respectively. We show that these integrated approaches to probabilistic model and rule learning outperform post-processing approaches. They also yield significantly more accurate models than simple rule learning and are competitive with more sophisticated ILP systems. Keywords: rule learning, naïve Bayes, statistical relational learning, inductive logic programming
2 0.16446698 47 jmlr-2007-Learning Horn Expressions with LOGAN-H
Author: Marta Arias, Roni Khardon, Jérôme Maloberti
Abstract: The paper introduces L OG A N -H —a system for learning first-order function-free Horn expressions from interpretations. The system is based on an algorithm that learns by asking questions and that was proved correct in previous work. The current paper shows how the algorithm can be implemented in a practical system, and introduces a new algorithm based on it that avoids interaction and learns from examples only. The L OG A N -H system implements these algorithms and adds several facilities and optimizations that allow efficient applications in a wide range of problems. As one of the important ingredients, the system includes several fast procedures for solving the subsumption problem, an NP-complete problem that needs to be solved many times during the learning process. We describe qualitative and quantitative experiments in several domains. The experiments demonstrate that the system can deal with varied problems, large amounts of data, and that it achieves good classification accuracy. Keywords: inductive logic programming, subsumption, bottom-up learning, learning with queries
3 0.047905304 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study
Author: Sofus A. Macskassy, Foster Provost
Abstract: paper1 This is about classifying entities that are interlinked with entities for which the class is known. After surveying prior work, we present NetKit, a modular toolkit for classification in networked data, and a case-study of its application to networked data used in prior machine learning research. NetKit is based on a node-centric framework in which classifiers comprise a local classifier, a relational classifier, and a collective inference procedure. Various existing node-centric relational learning algorithms can be instantiated with appropriate choices for these components, and new combinations of components realize new algorithms. The case study focuses on univariate network classification, for which the only information used is the structure of class linkage in the network (i.e., only links and some class labels). To our knowledge, no work previously has evaluated systematically the power of class-linkage alone for classification in machine learning benchmark data sets. The results demonstrate that very simple network-classification models perform quite well—well enough that they should be used regularly as baseline classifiers for studies of learning with networked data. The simplest method (which performs remarkably well) highlights the close correspondence between several existing methods introduced for different purposes—that is, Gaussian-field classifiers, Hopfield networks, and relational-neighbor classifiers. The case study also shows that there are two sets of techniques that are preferable in different situations, namely when few versus many labels are known initially. We also demonstrate that link selection plays an important role similar to traditional feature selection. Keywords: relational learning, network learning, collective inference, collective classification, networked data, probabilistic relational models, network analysis, network data
4 0.044493161 72 jmlr-2007-Relational Dependency Networks
Author: Jennifer Neville, David Jensen
Abstract: Recent work on graphical models for relational data has demonstrated significant improvements in classification and inference when models represent the dependencies among instances. Despite its use in conventional statistical models, the assumption of instance independence is contradicted by most relational data sets. For example, in citation data there are dependencies among the topics of a paper’s references, and in genomic data there are dependencies among the functions of interacting proteins. In this paper, we present relational dependency networks (RDNs), graphical models that are capable of expressing and reasoning with such dependencies in a relational setting. We discuss RDNs in the context of relational Bayes networks and relational Markov networks and outline the relative strengths of RDNs—namely, the ability to represent cyclic dependencies, simple methods for parameter estimation, and efficient structure learning techniques. The strengths of RDNs are due to the use of pseudolikelihood learning techniques, which estimate an efficient approximation of the full joint distribution. We present learned RDNs for a number of real-world data sets and evaluate the models in a prediction context, showing that RDNs identify and exploit cyclic relational dependencies to achieve significant performance gains over conventional conditional models. In addition, we use synthetic data to explore model performance under various relational data characteristics, showing that RDN learning and inference techniques are accurate over a wide range of conditions. Keywords: relational learning, probabilistic relational models, knowledge discovery, graphical models, dependency networks, pseudolikelihood estimation
5 0.043047115 90 jmlr-2007-Value Regularization and Fenchel Duality
Author: Ryan M. Rifkin, Ross A. Lippert
Abstract: Regularization is an approach to function learning that balances fit and smoothness. In practice, we search for a function f with a finite representation f = ∑i ci φi (·). In most treatments, the ci are the primary objects of study. We consider value regularization, constructing optimization problems in which the predicted values at the training points are the primary variables, and therefore the central objects of study. Although this is a simple change, it has profound consequences. From convex conjugacy and the theory of Fenchel duality, we derive separate optimality conditions for the regularization and loss portions of the learning problem; this technique yields clean and short derivations of standard algorithms. This framework is ideally suited to studying many other phenomena at the intersection of learning theory and optimization. We obtain a value-based variant of the representer theorem, which underscores the transductive nature of regularization in reproducing kernel Hilbert spaces. We unify and extend previous results on learning kernel functions, with very simple proofs. We analyze the use of unregularized bias terms in optimization problems, and low-rank approximations to kernel matrices, obtaining new results in these areas. In summary, the combination of value regularization and Fenchel duality are valuable tools for studying the optimization problems in machine learning. Keywords: kernel machines, duality, optimization, convex analysis, kernel learning
6 0.031730138 1 jmlr-2007-"Ideal Parent" Structure Learning for Continuous Variable Bayesian Networks
7 0.031013843 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
8 0.029977854 70 jmlr-2007-Ranking the Best Instances
9 0.027502988 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction inl1Using Cauchy Random Projections
10 0.026386166 22 jmlr-2007-Compression-Based Averaging of Selective Naive Bayes Classifiers (Special Topic on Model Selection)
11 0.025081214 48 jmlr-2007-Learning in Environments with Unknown Dynamics: Towards more Robust Concept Learners
12 0.020860069 29 jmlr-2007-Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts
13 0.02046296 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification
14 0.0204542 11 jmlr-2007-Anytime Learning of Decision Trees
15 0.01871529 27 jmlr-2007-Distances between Data Sets Based on Summary Statistics
16 0.018543595 28 jmlr-2007-Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data
17 0.01780856 79 jmlr-2007-Structure and Majority Classes in Decision Tree Learning
18 0.017790683 57 jmlr-2007-Multi-class Protein Classification Using Adaptive Codes
19 0.016615385 54 jmlr-2007-Measuring Differentiability: Unmasking Pseudonymous Authors
20 0.016426511 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation
topicId topicWeight
[(0, 0.109), (1, 0.108), (2, -0.024), (3, 0.044), (4, 0.026), (5, 0.126), (6, 0.012), (7, -0.05), (8, 0.013), (9, -0.249), (10, 0.092), (11, 0.213), (12, 0.113), (13, 0.056), (14, -0.022), (15, 0.297), (16, -0.082), (17, -0.398), (18, 0.234), (19, 0.071), (20, 0.067), (21, 0.099), (22, -0.11), (23, 0.002), (24, 0.052), (25, 0.027), (26, -0.001), (27, -0.069), (28, 0.002), (29, 0.045), (30, -0.074), (31, -0.064), (32, 0.038), (33, -0.041), (34, -0.038), (35, 0.017), (36, 0.045), (37, 0.168), (38, -0.183), (39, 0.066), (40, 0.091), (41, -0.026), (42, -0.011), (43, -0.079), (44, -0.056), (45, 0.018), (46, -0.013), (47, -0.002), (48, 0.03), (49, 0.004)]
simIndex simValue paperId paperTitle
same-paper 1 0.97143573 43 jmlr-2007-Integrating Naïve Bayes and FOIL
Author: Niels Landwehr, Kristian Kersting, Luc De Raedt
Abstract: A novel relational learning approach that tightly integrates the naïve Bayes learning scheme with the inductive logic programming rule-learner FOIL is presented. In contrast to previous combinations that have employed naïve Bayes only for post-processing the rule sets, the presented approach employs the naïve Bayes criterion to guide its search directly. The proposed technique is implemented in the N FOIL and T FOIL systems, which employ standard naïve Bayes and tree augmented naïve Bayes models respectively. We show that these integrated approaches to probabilistic model and rule learning outperform post-processing approaches. They also yield significantly more accurate models than simple rule learning and are competitive with more sophisticated ILP systems. Keywords: rule learning, naïve Bayes, statistical relational learning, inductive logic programming
2 0.80770862 47 jmlr-2007-Learning Horn Expressions with LOGAN-H
Author: Marta Arias, Roni Khardon, Jérôme Maloberti
Abstract: The paper introduces L OG A N -H —a system for learning first-order function-free Horn expressions from interpretations. The system is based on an algorithm that learns by asking questions and that was proved correct in previous work. The current paper shows how the algorithm can be implemented in a practical system, and introduces a new algorithm based on it that avoids interaction and learns from examples only. The L OG A N -H system implements these algorithms and adds several facilities and optimizations that allow efficient applications in a wide range of problems. As one of the important ingredients, the system includes several fast procedures for solving the subsumption problem, an NP-complete problem that needs to be solved many times during the learning process. We describe qualitative and quantitative experiments in several domains. The experiments demonstrate that the system can deal with varied problems, large amounts of data, and that it achieves good classification accuracy. Keywords: inductive logic programming, subsumption, bottom-up learning, learning with queries
3 0.14946163 72 jmlr-2007-Relational Dependency Networks
Author: Jennifer Neville, David Jensen
Abstract: Recent work on graphical models for relational data has demonstrated significant improvements in classification and inference when models represent the dependencies among instances. Despite its use in conventional statistical models, the assumption of instance independence is contradicted by most relational data sets. For example, in citation data there are dependencies among the topics of a paper’s references, and in genomic data there are dependencies among the functions of interacting proteins. In this paper, we present relational dependency networks (RDNs), graphical models that are capable of expressing and reasoning with such dependencies in a relational setting. We discuss RDNs in the context of relational Bayes networks and relational Markov networks and outline the relative strengths of RDNs—namely, the ability to represent cyclic dependencies, simple methods for parameter estimation, and efficient structure learning techniques. The strengths of RDNs are due to the use of pseudolikelihood learning techniques, which estimate an efficient approximation of the full joint distribution. We present learned RDNs for a number of real-world data sets and evaluate the models in a prediction context, showing that RDNs identify and exploit cyclic relational dependencies to achieve significant performance gains over conventional conditional models. In addition, we use synthetic data to explore model performance under various relational data characteristics, showing that RDN learning and inference techniques are accurate over a wide range of conditions. Keywords: relational learning, probabilistic relational models, knowledge discovery, graphical models, dependency networks, pseudolikelihood estimation
4 0.14440937 90 jmlr-2007-Value Regularization and Fenchel Duality
Author: Ryan M. Rifkin, Ross A. Lippert
Abstract: Regularization is an approach to function learning that balances fit and smoothness. In practice, we search for a function f with a finite representation f = ∑i ci φi (·). In most treatments, the ci are the primary objects of study. We consider value regularization, constructing optimization problems in which the predicted values at the training points are the primary variables, and therefore the central objects of study. Although this is a simple change, it has profound consequences. From convex conjugacy and the theory of Fenchel duality, we derive separate optimality conditions for the regularization and loss portions of the learning problem; this technique yields clean and short derivations of standard algorithms. This framework is ideally suited to studying many other phenomena at the intersection of learning theory and optimization. We obtain a value-based variant of the representer theorem, which underscores the transductive nature of regularization in reproducing kernel Hilbert spaces. We unify and extend previous results on learning kernel functions, with very simple proofs. We analyze the use of unregularized bias terms in optimization problems, and low-rank approximations to kernel matrices, obtaining new results in these areas. In summary, the combination of value regularization and Fenchel duality are valuable tools for studying the optimization problems in machine learning. Keywords: kernel machines, duality, optimization, convex analysis, kernel learning
5 0.13904026 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
Author: Roland Nilsson, José M. Peña, Johan Björkegren, Jesper Tegnér
Abstract: We analyze two different feature selection problems: finding a minimal feature set optimal for classification (MINIMAL - OPTIMAL) vs. finding all features relevant to the target variable (ALL RELEVANT ). The latter problem is motivated by recent applications within bioinformatics, particularly gene expression analysis. For both problems, we identify classes of data distributions for which there exist consistent, polynomial-time algorithms. We also prove that ALL - RELEVANT is much harder than MINIMAL - OPTIMAL and propose two consistent, polynomial-time algorithms. We argue that the distribution classes considered are reasonable in many practical cases, so that our results simplify feature selection in a wide range of machine learning tasks. Keywords: learning theory, relevance, classification, Markov blanket, bioinformatics
6 0.13032526 1 jmlr-2007-"Ideal Parent" Structure Learning for Continuous Variable Bayesian Networks
7 0.12962219 70 jmlr-2007-Ranking the Best Instances
8 0.12367729 22 jmlr-2007-Compression-Based Averaging of Selective Naive Bayes Classifiers (Special Topic on Model Selection)
9 0.11406903 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study
10 0.1086935 11 jmlr-2007-Anytime Learning of Decision Trees
11 0.092201814 29 jmlr-2007-Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts
12 0.089571439 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification
13 0.089085706 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction inl1Using Cauchy Random Projections
14 0.082802139 84 jmlr-2007-The Pyramid Match Kernel: Efficient Learning with Sets of Features
15 0.082630143 48 jmlr-2007-Learning in Environments with Unknown Dynamics: Towards more Robust Concept Learners
16 0.078882568 54 jmlr-2007-Measuring Differentiability: Unmasking Pseudonymous Authors
17 0.078120582 55 jmlr-2007-Minimax Regret Classifier for Imprecise Class Distributions
18 0.077396199 79 jmlr-2007-Structure and Majority Classes in Decision Tree Learning
19 0.072979324 28 jmlr-2007-Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data
20 0.068448208 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data
topicId topicWeight
[(4, 0.011), (8, 0.01), (10, 0.021), (12, 0.017), (15, 0.016), (18, 0.019), (28, 0.079), (40, 0.035), (45, 0.031), (48, 0.021), (60, 0.507), (80, 0.021), (85, 0.031), (98, 0.07)]
simIndex simValue paperId paperTitle
Author: Nikolas List, Hans Ulrich Simon
Abstract: We present a general decomposition algorithm that is uniformly applicable to every (suitably normalized) instance of Convex Quadratic Optimization and efficiently approaches an optimal solution. The number of iterations required to be within ε of optimality grows linearly with 1/ε and quadratically with the number m of variables. The working set selection can be performed in polynomial time. If we restrict our considerations to instances of Convex Quadratic Optimization with at most k0 equality constraints for some fixed constant k0 plus some so-called box-constraints (conditions that hold for most variants of SVM-optimization), the working set is found in linear time. Our analysis builds on a generalization of the concept of rate certifying pairs that was introduced by Hush and Scovel. In order to extend their results to arbitrary instances of Convex Quadratic Optimization, we introduce the general notion of a rate certifying q-set. We improve on the results by Hush and Scovel (2003) in several ways. First our result holds for Convex Quadratic Optimization whereas the results by Hush and Scovel are specialized to SVM-optimization. Second, we achieve a higher rate of convergence even for the special case of SVM-optimization (despite the generality of our approach). Third, our analysis is technically simpler. We prove furthermore that the strategy for working set selection which is based on rate certifying sets coincides with a strategy which is based on a so-called “sparse witness of sub-optimality”. Viewed from this perspective, our main result improves on convergence results by List and Simon (2004) and Simon (2004) by providing convergence rates (and by holding under more general conditions). Keywords: convex quadratic optimization, decomposition algorithms, support vector machines
2 0.88177747 9 jmlr-2007-AdaBoost is Consistent
Author: Peter L. Bartlett, Mikhail Traskin
Abstract: The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that provided AdaBoost is stopped after n1−ε iterations—for sample size n and ε ∈ (0, 1)—the sequence of risks of the classifiers it produces approaches the Bayes risk. Keywords: boosting, adaboost, consistency
same-paper 3 0.87876719 43 jmlr-2007-Integrating Naïve Bayes and FOIL
Author: Niels Landwehr, Kristian Kersting, Luc De Raedt
Abstract: A novel relational learning approach that tightly integrates the naïve Bayes learning scheme with the inductive logic programming rule-learner FOIL is presented. In contrast to previous combinations that have employed naïve Bayes only for post-processing the rule sets, the presented approach employs the naïve Bayes criterion to guide its search directly. The proposed technique is implemented in the N FOIL and T FOIL systems, which employ standard naïve Bayes and tree augmented naïve Bayes models respectively. We show that these integrated approaches to probabilistic model and rule learning outperform post-processing approaches. They also yield significantly more accurate models than simple rule learning and are competitive with more sophisticated ILP systems. Keywords: rule learning, naïve Bayes, statistical relational learning, inductive logic programming
4 0.86835432 73 jmlr-2007-Revised Loss Bounds for the Set Covering Machine and Sample-Compression Loss Bounds for Imbalanced Data
Author: Zakria Hussain, François Laviolette, Mario Marchand, John Shawe-Taylor, Spencer Charles Brubaker, Matthew D. Mullin
Abstract: Marchand and Shawe-Taylor (2002) have proposed a loss bound for the set covering machine that has the property to depend on the observed fraction of positive examples and on what the classifier achieves on the positive training examples. We show that this loss bound is incorrect. We then propose a loss bound, valid for any sample-compression learning algorithm (including the set covering machine), that depends on the observed fraction of positive examples and on what the classifier achieves on them. We also compare numerically the loss bound proposed in this paper with the incorrect bound, the original SCM bound and a recently proposed loss bound of Marchand and Sokolova (2005) (which does not depend on the observed fraction of positive examples) and show that the latter loss bounds can be substantially larger than the new bound in the presence of imbalanced misclassifications. Keywords: set covering machines, sample-compression, loss bounds
5 0.56177008 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
Author: Roland Nilsson, José M. Peña, Johan Björkegren, Jesper Tegnér
Abstract: We analyze two different feature selection problems: finding a minimal feature set optimal for classification (MINIMAL - OPTIMAL) vs. finding all features relevant to the target variable (ALL RELEVANT ). The latter problem is motivated by recent applications within bioinformatics, particularly gene expression analysis. For both problems, we identify classes of data distributions for which there exist consistent, polynomial-time algorithms. We also prove that ALL - RELEVANT is much harder than MINIMAL - OPTIMAL and propose two consistent, polynomial-time algorithms. We argue that the distribution classes considered are reasonable in many practical cases, so that our results simplify feature selection in a wide range of machine learning tasks. Keywords: learning theory, relevance, classification, Markov blanket, bioinformatics
6 0.55939412 71 jmlr-2007-Refinable Kernels
8 0.52782393 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
9 0.52697808 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
10 0.52127409 78 jmlr-2007-Statistical Consistency of Kernel Canonical Correlation Analysis
12 0.50672269 14 jmlr-2007-Behavioral Shaping for Geometric Concepts
13 0.4430204 44 jmlr-2007-Large Margin Semi-supervised Learning
14 0.43788317 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction inl1Using Cauchy Random Projections
15 0.43482739 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss
16 0.43139365 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
17 0.42232496 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression
18 0.42179441 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
19 0.42000586 65 jmlr-2007-PAC-Bayes Risk Bounds for Stochastic Averages and Majority Votes of Sample-Compressed Classifiers
20 0.41840827 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption