jmlr jmlr2007 jmlr2007-43 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Niels Landwehr, Kristian Kersting, Luc De Raedt
Abstract: A novel relational learning approach that tightly integrates the naïve Bayes learning scheme with the inductive logic programming rule-learner FOIL is presented. In contrast to previous combinations that have employed naïve Bayes only for post-processing the rule sets, the presented approach employs the naïve Bayes criterion to guide its search directly. The proposed technique is implemented in the N FOIL and T FOIL systems, which employ standard naïve Bayes and tree augmented naïve Bayes models respectively. We show that these integrated approaches to probabilistic model and rule learning outperform post-processing approaches. They also yield significantly more accurate models than simple rule learning and are competitive with more sophisticated ILP systems. Keywords: rule learning, naïve Bayes, statistical relational learning, inductive logic programming
Reference: text
sentIndex sentText sentNum sentScore
1 Keywords: rule learning, naïve Bayes, statistical relational learning, inductive logic programming 1. [sent-13, score-0.249]
2 Introduction The study of learning schemes that lie at the intersection of probabilistic and logic or relational learning has received a lot of attention recently (De Raedt and Kersting, 2003; Getoor and Taskar, 2007). [sent-14, score-0.193]
3 , 2001) or Stochastic Logic Programs (Muggleton, 1996)), we start from an inductive logic programming system and extend it with a probabilistic model. [sent-16, score-0.219]
4 More specifically, we start with the simplest approaches from both domains, the inductive logic programming system FOIL (Quinlan, 1990) and naïve Bayes, and integrate them in the N FOIL system. [sent-17, score-0.186]
5 In relational learning or inductive logic programming, one typically induces a set of rules (so-called clauses). [sent-24, score-0.237]
6 A straightforward but powerful idea to integrate these two approaches is to interpret the clauses or rules as propositional “features” over which a joint probability distribution can be defined. [sent-27, score-0.255]
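As a rough illustration of this idea, the boolean feature construction could be sketched as follows (the `entails` test stands in for logical entailment given the background knowledge; it is an assumed helper, not an interface from the paper):

```python
def propositionalize(example, clauses, background, entails):
    """Map a relational example to a boolean feature vector: feature j is true
    iff the body of clause j succeeds on the example given the background knowledge."""
    return [bool(entails(background, clause, example)) for clause in clauses]
```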
7 First, the features or clauses are generated (for example using an existing inductive logic programming system such as ILP-R (Pompe and Kononenko, 1995)) and then the probability estimates for the naïve Bayes are determined. [sent-33, score-0.405]
8 This actually corresponds to a static propositionalization approach, where the propositionalized problem is learned using naïve Bayes. [sent-34, score-0.246]
9 The advantage of such a dynamic propositionalization is that the criterion according to which the features are generated is that of naïve Bayes. [sent-36, score-0.234]
10 Our intention in this paper is to investigate how dynamic propositionalization compares to static propositionalization approaches that use naïve Bayes only to post-process the rule set. [sent-37, score-0.449]
11 More precisely, we will investigate: (Q1) Is there a gain in predictive accuracy of a dynamic propositionalization approach over its ILP baseline? [sent-38, score-0.236]
12 (Q2) If so, is the gain of dynamic propositionalization over its baseline larger than the gain of static propositionalization approaches? [sent-39, score-0.449]
13 (Q4) Is a dynamic propositionalization based on a simple rule learner competitive with advanced ILP approaches? [sent-41, score-0.22]
14 (Q5) Relational naïve Bayes methods such as 1BC2 (Flach and Lachiche, 2004) essentially employ all clauses within a given language bias as features in a probabilistic model and thus perform static propositionalization. [sent-42, score-0.305]
15 Does dynamic propositionalization employ fewer features and perform better than these approaches? [sent-43, score-0.234]
16 This contrasts with static propositionalization approaches, in which the criteria employed for feature generation and classification are different. [sent-48, score-0.229]
17 The search heuristic is based on class conditional likelihood and clauses are combined with naïve Bayes. [sent-50, score-0.249]
18 In the experimental evaluation, the implementation of Structural Logistic Regression only searches for clauses with up to two literals, where the first literal is fixed in advance. [sent-56, score-0.205]
19 Furthermore, no experimental comparison to static propositionalization approaches is provided. [sent-57, score-0.229]
20 Section 2 introduces some basic concepts from inductive logic programming and the simple first-order rule learning algorithm FOIL. [sent-65, score-0.186]
21 In Section 6 we experimentally evaluate the proposed methods on benchmark data sets from four different domains and specifically investigate how they compare to static propositionalization approaches. [sent-69, score-0.229]
22 FOIL: First Order Inductive Learning The problem that we tackle in this paper is a probabilistic formulation of the traditional inductive logic programming (ILP) problem. [sent-72, score-0.219]
23 Facts are definite clauses with an empty body, for example, atom(189, d189 1, c, 22, −0. [sent-92, score-0.205]
24 A hypothesis is said to entail an example given the background knowledge if the example is a logical consequence of the definite clauses in the hypothesis and background knowledge. [sent-96, score-0.364]
25 H⊂L Traditional approaches to inductive logic programming (Muggleton and De Raedt, 1994) tackle a concept-learning problem, in which there are typically two classes, and the goal is to find a complete and consistent concept-description. [sent-105, score-0.186]
26 This setting is incorporated in many well-known inductive logic programming systems such as FOIL (Quinlan, 1990), GOLEM (Muggleton and Feng, 1990), PROGOL (Muggleton, 1995) and TILDE (Blockeel and De Raedt, 1997). [sent-110, score-0.186]
27 It repeatedly searches for clauses that score well with respect to the data set and the current hypothesis and adds them to the hypothesis. [sent-114, score-0.302]
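For orientation, the generic covering loop described here can be sketched as below; the helper names (`search_best_clause`, `score`) and the clause limit are assumptions for illustration, not the systems' actual interfaces:

```python
def covering_loop(examples, background, score, search_best_clause, max_clauses=25):
    """Generic FOIL-style loop: repeatedly search for the clause that scores best
    with respect to the data and the current hypothesis, and add it."""
    hypothesis = []
    current_score = score(examples, hypothesis, background)
    while len(hypothesis) < max_clauses:
        clause = search_best_clause(examples, hypothesis, background)
        new_score = score(examples, hypothesis + [clause], background)
        if new_score <= current_score:  # no improvement: stop adding clauses
            break
        hypothesis.append(clause)
        current_score = new_score
    return hypothesis
```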
28 A clause c1 θ-subsumes a clause c2 if and only if there is a substitution θ such that c1 θ ⊆ c2 . [sent-119, score-0.268]
29 The most general clause is p(X1 , · · · , Xn ) ← where p/n is the predicate being learned and the Xi are different variables. [sent-130, score-0.172]
30 This is typically realized by either adding a new literal l to the clause yielding h ← b1 , · · · , bn , l or by applying a substitution θ yielding hθ ← b1 θ, · · · , bn θ. [sent-132, score-0.164]
31 The original FOIL algorithm uses information gain based on the number of positive and negative tuples bound to clauses as the scoring function (Quinlan, 1990), and a stopping criterion based on the minimum description length principle (Rissanen, 1978). [sent-136, score-0.257]
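For concreteness, the information-gain heuristic of the original FOIL can be written roughly as follows; the exact tuple bookkeeping in FOIL differs slightly, so treat the counts as positive/negative bindings before and after a refinement (an assumption for illustration):

```python
import math

def foil_gain(p0, n0, p1, n1):
    """FOIL information gain of refining a clause.
    p0, n0: positive/negative tuples bound before adding the literal.
    p1, n1: positive/negative tuples bound after adding the literal."""
    if p1 == 0:
        return 0.0
    info_before = -math.log2(p0 / (p0 + n0))
    info_after = -math.log2(p1 / (p1 + n1))
    return p1 * (info_before - info_after)
```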
32 M FOIL, a variant of FOIL that was particularly designed to cope with noise in the training data, uses the m-estimate for scoring clauses and a significance-based stopping criterion (Lavrač and Džeroski, 1994). [sent-137, score-0.257]
33 This will be realized by modifying the covers and score functions in the inductive logic programming setting. [sent-140, score-0.291]
34 All other elements, such as the examples, background theory and language of clauses L will—in principle—be untouched. [sent-141, score-0.235]
35 However, the set of clauses defining a hypothesis is augmented with a set of parameters that quantify the probabilistic model. [sent-142, score-0.313]
36 For ease of exposition, we will first discuss this for the case that the clauses are combined using naïve Bayes, that is, the N FOIL system. [sent-143, score-0.205]
37 The key idea for realizing this is that we interpret the clauses in H together with the example e as queries or features. [sent-152, score-0.205]
38 More formally, let H contain a set of clauses defining the predicate p. [sent-153, score-0.226]
39 Then for each clause c of the form p(X1 , · · · , Xn ) ← b1 , · · · , bn we view the query qc =← b1 , · · · , bn as a boolean feature or attribute. [sent-154, score-0.215]
40 We now define P(e | H, B) = Pλ(pθ | q1 θ, · · · , qk θ) = (Pλ(q1 θ, · · · , qk θ | pθ) · Pλ(pθ)) / Pλ(q1 θ, · · · , qk θ) where HC = {q1 , . [sent-160, score-0.261]
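A minimal sketch of how such a posterior over the boolean query outcomes can be computed; the parameter layout is an assumption for illustration:

```python
def naive_bayes_posterior(query_values, prior_pos, cond_probs):
    """Posterior P(positive | q1, ..., qk) under a naive Bayes model.
    query_values: truth values of the clause bodies q_j on example e.
    prior_pos: P(class = positive).
    cond_probs: list of (P(q_j = true | pos), P(q_j = true | neg)) pairs."""
    pos, neg = prior_pos, 1.0 - prior_pos
    for q, (p_pos, p_neg) in zip(query_values, cond_probs):
        pos *= p_pos if q else (1.0 - p_pos)
        neg *= p_neg if q else (1.0 - p_neg)
    return pos / (pos + neg)  # normalization plays the role of dividing by P(q1, ..., qk)
```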
41 To evaluate a set of clauses HC in the generic FOIL algorithm, the scoring function has to be changed accordingly, to score(E, HC , B) = P(E | H, B) where HC has been augmented with an optimal probabilistic model Hλ . [sent-202, score-0.31]
42 , 2004) or an ILP technique such as bottom clauses to generate candidate rules for selection (Davis et al. [sent-213, score-0.228]
43 More formally, the model space under consideration is H = {(HC , Hλ ) | HC ⊆ L , Hλ ∈ MC } where MC is the space of all naïve Bayes models over the clauses in HC . [sent-231, score-0.205]
44 The optimization problem is to find H∗ = arg max_H P(E | H, B) = arg max_{(HC , Hλ)} P(E | (HC , Hλ ), B), a hypothesis that jointly optimizes the clause set (structure) and probabilistic model (parameters such as the Pλ (qi |p) appearing in Example 3). [sent-232, score-0.225]
45 1 Adapting FOIL Like FOIL, N FOIL searches a set of clauses greedily, and a single clause in a general-to-specific manner using a refinement operator. [sent-241, score-0.339]
46 Because the final model in FOIL is the disjunction of the learned clauses (where every clause covers a certain subset of examples), it holds that 1. [sent-243, score-0.394]
47 (Non-recursive) clauses already learned do not need to be considered when scoring additional clauses: score(E, H ∪ {c′}, B) = score(E, {c′}, B). [sent-246, score-0.249]
48 An additional clause c′ has to be scored together with the current model: score(E, HC ∪ {c′}, B) = P(E | H′, B) where H′ has clauses HC ∪ {c′} and optimal parameters Hλ′. [sent-252, score-0.339]
49 This is replaced in N FOIL by stopping if the change in score when adding a clause falls below a certain threshold. [sent-255, score-0.226]
50 2 Parameter Estimation: An Approximation To evaluate a set of clauses HC = {q1 , . [sent-260, score-0.205]
51 , qk }: Hλ∗ = arg max_{Hλ} P(E | H, B) = arg max_λ ∏_{θ∈E} Pλ(pθ | q1 θ, · · · , qk θ) = arg max_λ ∏_{θ∈E} (∏_j Pλ(q_j θ | pθ) · Pλ(pθ)) / Pλ(q1 θ, · · · , qk θ) . [sent-266, score-0.202]
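The naive Bayes parameters themselves can be estimated by simple counting over the propositionalized examples; a sketch compatible with the posterior computation above, with Laplace smoothing as an assumed (not paper-specified) correction:

```python
def estimate_parameters(feature_vectors, labels, alpha=1.0):
    """Estimate P(class) and P(q_j = true | class) by counting, with Laplace smoothing.
    feature_vectors: one boolean list per example; labels: booleans (True = positive)."""
    n, n_pos = len(labels), sum(labels)
    prior_pos = (n_pos + alpha) / (n + 2 * alpha)
    cond_probs = []
    for j in range(len(feature_vectors[0])):
        t_pos = sum(1 for x, y in zip(feature_vectors, labels) if y and x[j])
        t_neg = sum(1 for x, y in zip(feature_vectors, labels) if not y and x[j])
        cond_probs.append(((t_pos + alpha) / (n_pos + 2 * alpha),
                           (t_neg + alpha) / ((n - n_pos) + 2 * alpha)))
    return prior_pos, cond_probs
```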
52 , qk θ) = ∏ Pλ (pθ|q1 θ, · · · , qk θ) · Pλ (q1 θ, . [sent-274, score-0.174]
53 However, when evaluating a set of clauses HC = {q1 , . [sent-354, score-0.205]
54 , qk } in T FOIL, for every clause qi ∈ HC we have to decide on a q j = q pa(i) ∈ HC for which a dependency q pa(i) → qi is added. [sent-357, score-0.295]
55 Rather than re-learning the TAN graph from scratch every time a new candidate clause q i is scored, the subgraph over the existing hypothesis H = {q1 , . [sent-360, score-0.164]
56 , qi−1 } is kept fixed, and all existing clauses q j ∈ H are considered as possible parents for the clause qi . [sent-363, score-0.376]
57 For every candidate clause c′, the best parent pa(c′) is identified and the c′ with highest score is added to H with the dependency pa(c′) → c′. [sent-367, score-0.201]
58 The computational complexity of scoring the (k + 1)th clause qk+1 in T FOIL is O(C · k · n), as all k existing clauses are considered as possible parents. [sent-369, score-0.366]
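A rough sketch of this incremental parent selection for a new candidate clause (helper names are assumptions; T FOIL's actual scoring refits the parameters for each choice):

```python
def best_parent_for_candidate(examples, hypothesis, candidate, background, score_with_parent):
    """Keep the TAN structure over existing clauses fixed and only pick a parent for
    the candidate: try each of the k existing clauses and keep the best-scoring one."""
    best_parent, best_score = None, float("-inf")
    for parent in hypothesis:  # k existing clauses -> O(k) parent choices
        s = score_with_parent(examples, hypothesis, candidate, parent, background)
        if s > best_score:
            best_parent, best_score = parent, s
    return best_parent, best_score
```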
59 The problem is to classify compounds as mutagenic or not given their chemical structure described in terms of atoms, bonds, atom charge, and information about atom and bond types that have been generated by the molecular modeling package QUANTA. [sent-381, score-0.335]
60 Additionally, we provided a relation linked(A1 , A2 , E, BT ) in the background knowledge that represents that there is a bond of type BT from atom A1 to atom A2 and A2 is of element E. [sent-412, score-0.272]
61 During the search for a clause, the algorithm also keeps a set C ∗ of the k best (partial) clauses found so far. [sent-425, score-0.224]
62 The search for additional clauses is stopped if the change in score between two successive iterations is less than 0. [sent-427, score-0.291]
63 A hypothesis is limited to contain at most 25 clauses and a clause to contain at most 10 literals. [sent-429, score-0.369]
64 1 It is based on the concept of bottom clauses, which are maximally specific clauses covering a certain example. [sent-439, score-0.205]
65 The theory is then built from clauses that contain a subset of the literals found in a bottom clause. [sent-440, score-0.231]
66 The maximum number of literals in a clause was set to 10 instead of the default 4. [sent-443, score-0.175]
67 We have therefore investigated a static propositionalization approach: frequent clauses were extracted from the relational data sets and then used as features in the propositional MACCENT system. [sent-451, score-0.538]
68 , 1998), as WARMR patterns have been shown to be effective propositionalization techniques on similar benchmarks in inductive logic programming (Srinivasan et al. [sent-453, score-0.362]
69 Comparing the results for N FOIL/T FOIL and M FOIL, the experiments clearly show a gain for dynamic propositionalization over the baseline ILP algorithm, giving an affirmative answer to Question Q1. [sent-551, score-0.22]
70 This shows that relatively simple dynamic propositionalization approaches are competitive with more advanced ILP systems, giving an affirmative answer to Question Q4. [sent-560, score-0.22]
71 In clause voting, the threshold that is varied is the number of clauses that have to cover an example before it is classified as positive (the default threshold being one). [sent-629, score-0.354]
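For illustration, sweeping this vote threshold to obtain ROC points could look like the following sketch (the `covers` predicate is an assumed helper):

```python
def clause_voting_roc(examples, labels, clauses, covers):
    """Classify an example as positive if at least t clauses cover it, and record
    (false positive rate, true positive rate) for every threshold t."""
    votes = [sum(covers(c, e) for c in clauses) for e in examples]
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for t in range(len(clauses) + 2):  # t = 0 accepts everything, t = k+1 nothing
        tp = sum(1 for v, y in zip(votes, labels) if y and v >= t)
        fp = sum(1 for v, y in zip(votes, labels) if not y and v >= t)
        points.append((fp / n_neg, tp / n_pos))
    return points
```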
72 Note that in some cases dynamic propositionalization achieves higher AUC scores than ILP systems even though it achieves lower accuracy (e. [sent-631, score-0.236]
73 At the level of induced clauses, it indicates that clauses induced by the ILP systems are specific in the sense that positive examples are typically covered by only one clause—if the decision threshold in clause voting is set to more than one, the true positive rate drops rapidly. [sent-638, score-0.339]
74 In contrast, ranking performance for the dynamic propositionalization systems and 1BC2 is more stable, meaning that these systems also offer good classification performance under varying misclassification costs (or, equivalently, class skew). [sent-639, score-0.22]
75 This indicates that dynamic propositionalization approaches can make use of more diverse rule sets, which are helpful in ranking examples by providing additional information but would produce too many false-positive classifications if interpreted as a disjunctive hypothesis. [sent-640, score-0.242]
76 To investigate Question Q2, that is, to compare dynamic and static propositionalization approaches, two additional experiments were performed: 1. [sent-760, score-0.273]
77 Learning a naïve Bayes or tree augmented naïve Bayes model over a set of clauses using a two-step approach: First, a set of rules is learned using M FOIL or A LEPH, and afterwards a (tree augmented) naïve Bayes model is built using these rules. [sent-761, score-0.317]
78 Question Q2 is whether these rules are less useful when combined with (tree augmented) naïve Bayes than the rules constructed by a dynamic propositionalization approach (N FOIL/T FOIL). [sent-763, score-0.266]
79 However, AUC scores of the static propositionalization approaches are on average much lower than those of dynamic propositionalization approaches. [sent-817, score-0.449]
80 Thus, we can answer Q2 as follows: A dynamic propositionalization approach that selects rules based on the criterion of the probabilistic model performs better than static propositionalization approaches that post-process a rule set using a probabilistic learner. [sent-824, score-0.538]
81 This is because dynamic propositionalization can make use of different/additional rules that are not considered by traditional ILP systems as they would produce too many false-positive classifications. [sent-825, score-0.243]
82 It remains to answer Question Q5, that is, to compare dynamic propositionalization approaches to relational naïve Bayes methods such as 1BC2 in terms of accuracy and the number of features they employ. [sent-826, score-0.313]
83 In the experiments, 1BC2 uses an order of magnitude more clauses than N FOIL. [sent-833, score-0.205]
84 More precisely, 1BC2 uses more than 400 (in some cases even more than 1000) clauses whereas N FOIL is limited to using 25 clauses. [sent-834, score-0.205]
85 We furthermore investigated whether the proposed dynamic propositionalization approach sometimes constructs too many clauses and overfits the training data, as the stopping criterion is based on the training set score. [sent-836, score-0.45]
86 4 Table 6: Cross-validated predictive accuracy results and average number of clauses in the final model for N FOIL and N FOIL/pruning. [sent-892, score-0.221]
87 The procedure cross-validate-accuracy(H,E,B) cross-validates the naïve Bayes model on the training data for a fixed set H of clauses and returns an accuracy estimate. [sent-898, score-0.221]
88 The algorithm greedily drops clauses from H as long as this does not decrease the cross-validated accuracy estimate. [sent-899, score-0.221]
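A sketch of that greedy post-pruning step; the signature of `cross_validate_accuracy` is an assumption, standing in for the procedure described above:

```python
def prune_clauses(clauses, examples, background, cross_validate_accuracy):
    """Greedily drop clauses as long as the cross-validated accuracy estimate of the
    naive Bayes model over the remaining clauses does not decrease."""
    current = list(clauses)
    current_acc = cross_validate_accuracy(current, examples, background)
    improved = True
    while improved and len(current) > 1:
        improved = False
        for c in list(current):
            candidate = [x for x in current if x is not c]
            acc = cross_validate_accuracy(candidate, examples, background)
            if acc >= current_acc:  # dropping c does not hurt the estimate
                current, current_acc = candidate, acc
                improved = True
                break
    return current
```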
89 The dynamic propositionalization approaches N FOIL and T FOIL yield more accurate models than simple ILP rule learning and static propositionalization approaches, and also compare favorably to the first order naïve Bayes system 1BC2 and one of the most advanced ILP systems, namely Aleph. [sent-910, score-0.449]
90 Related Work The approaches that combine statistical learning with inductive logic programming techniques for addressing classification can be divided into three categories. [sent-912, score-0.186]
91 A first class of techniques are static propositionalization approaches. [sent-913, score-0.229]
92 However, an initial step beyond static propositionalization has been taken in the work by Pompe and Kononenko (1997), where rules generated by an ILP system are post-processed by splitting and merging clauses in order to find a rule set that satisfies the naïve Bayes assumption. [sent-921, score-0.457]
93 , 2003; Dehaspe, 1997) indeed tightly integrates the inductive logic programming step with the statistical learning step in a dynamic propositionalization approach. [sent-930, score-0.423]
94 SAYU uses a “wrapper” approach where (partial) clauses generated by the refinement search of an ILP system are proposed as features to a (tree augmented) naïve Bayes, and incorporated if they improve performance. [sent-937, score-0.238]
95 The probabilistic model is trained to maximize the likelihood on the training data, while clause selection is based on the area under the precision-recall curve of the model on a separate tuning set. [sent-940, score-0.192]
96 Conclusions We have introduced the N FOIL and T FOIL systems, two dynamic propositionalization approaches that combine the simplest techniques from ILP and probabilistic learning. [sent-949, score-0.253]
97 In an experimental study on several benchmark data sets, the proposed approaches were compared to the ILP systems M FOIL and A LEPH, static propositionalization approaches and the first-order naïve Bayes system 1BC2. [sent-950, score-0.229]
98 Experimental results show that dynamic propositionalization is superior to static propositionalization and simple rule learning. [sent-951, score-0.449]
99 Moreover, our experiments indicate that the superior performance of dynamic propositionalization approaches is due to the fact that they can make use of more diverse rule sets than ILP systems. [sent-953, score-0.22]
100 Further exploring dynamic propositionalization as a way of combining (possibly more powerful) statistical learners with ILP search techniques is an interesting direction for future work. [sent-955, score-0.239]
wordName wordTfidf (topN-words)
[('foil', 0.765), ('na', 0.219), ('clauses', 0.205), ('alzheimer', 0.192), ('propositionalization', 0.176), ('leph', 0.14), ('ilp', 0.137), ('clause', 0.134), ('mutagenesis', 0.127), ('hc', 0.126), ('logic', 0.097), ('ve', 0.095), ('atom', 0.088), ('qk', 0.087), ('dsstox', 0.083), ('bayes', 0.073), ('aedt', 0.068), ('andwehr', 0.068), ('ersting', 0.068), ('ntegrating', 0.068), ('score', 0.067), ('bond', 0.066), ('relational', 0.063), ('tan', 0.061), ('inductive', 0.054), ('static', 0.053), ('mutagenic', 0.053), ('luc', 0.047), ('davis', 0.047), ('acetyl', 0.047), ('amine', 0.047), ('pompe', 0.047), ('toxic', 0.047), ('augmented', 0.045), ('il', 0.045), ('dynamic', 0.044), ('roc', 0.043), ('diterpene', 0.042), ('logical', 0.039), ('covers', 0.038), ('qi', 0.037), ('eroski', 0.036), ('sayu', 0.036), ('lachiche', 0.035), ('programming', 0.035), ('kersting', 0.035), ('flach', 0.034), ('probabilistic', 0.033), ('raedt', 0.032), ('lavra', 0.031), ('kononenko', 0.031), ('muggleton', 0.031), ('qc', 0.031), ('hypothesis', 0.03), ('background', 0.03), ('tree', 0.027), ('propositional', 0.027), ('scoring', 0.027), ('landwehr', 0.026), ('literals', 0.026), ('kristian', 0.026), ('tf', 0.025), ('stopping', 0.025), ('likelihood', 0.025), ('rankings', 0.024), ('srinivasan', 0.024), ('compounds', 0.024), ('folds', 0.024), ('rules', 0.023), ('beam', 0.022), ('disjunctive', 0.022), ('nb', 0.021), ('predicate', 0.021), ('query', 0.02), ('dehaspe', 0.02), ('search', 0.019), ('goldszmidt', 0.018), ('lz', 0.018), ('tightly', 0.017), ('getoor', 0.017), ('learned', 0.017), ('accuracy', 0.016), ('popescul', 0.016), ('freiburg', 0.016), ('chemical', 0.016), ('auc', 0.016), ('ashvin', 0.016), ('grossman', 0.016), ('maccent', 0.016), ('mutagenicity', 0.016), ('niels', 0.016), ('rmatively', 0.016), ('saso', 0.016), ('default', 0.015), ('bn', 0.015), ('pa', 0.015), ('nement', 0.015), ('features', 0.014), ('arg', 0.014), ('memory', 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999928 43 jmlr-2007-Integrating Naïve Bayes and FOIL
Author: Niels Landwehr, Kristian Kersting, Luc De Raedt
Abstract: A novel relational learning approach that tightly integrates the naïve Bayes learning scheme with the inductive logic programming rule-learner FOIL is presented. In contrast to previous combinations that have employed naïve Bayes only for post-processing the rule sets, the presented approach employs the naïve Bayes criterion to guide its search directly. The proposed technique is implemented in the N FOIL and T FOIL systems, which employ standard naïve Bayes and tree augmented naïve Bayes models respectively. We show that these integrated approaches to probabilistic model and rule learning outperform post-processing approaches. They also yield significantly more accurate models than simple rule learning and are competitive with more sophisticated ILP systems. Keywords: rule learning, naïve Bayes, statistical relational learning, inductive logic programming
2 0.16446698 47 jmlr-2007-Learning Horn Expressions with LOGAN-H
Author: Marta Arias, Roni Khardon, Jérôme Maloberti
Abstract: The paper introduces L OG A N -H —a system for learning first-order function-free Horn expressions from interpretations. The system is based on an algorithm that learns by asking questions and that was proved correct in previous work. The current paper shows how the algorithm can be implemented in a practical system, and introduces a new algorithm based on it that avoids interaction and learns from examples only. The L OG A N -H system implements these algorithms and adds several facilities and optimizations that allow efficient applications in a wide range of problems. As one of the important ingredients, the system includes several fast procedures for solving the subsumption problem, an NP-complete problem that needs to be solved many times during the learning process. We describe qualitative and quantitative experiments in several domains. The experiments demonstrate that the system can deal with varied problems, large amounts of data, and that it achieves good classification accuracy. Keywords: inductive logic programming, subsumption, bottom-up learning, learning with queries
3 0.047905304 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study
Author: Sofus A. Macskassy, Foster Provost
Abstract: paper1 This is about classifying entities that are interlinked with entities for which the class is known. After surveying prior work, we present NetKit, a modular toolkit for classification in networked data, and a case-study of its application to networked data used in prior machine learning research. NetKit is based on a node-centric framework in which classifiers comprise a local classifier, a relational classifier, and a collective inference procedure. Various existing node-centric relational learning algorithms can be instantiated with appropriate choices for these components, and new combinations of components realize new algorithms. The case study focuses on univariate network classification, for which the only information used is the structure of class linkage in the network (i.e., only links and some class labels). To our knowledge, no work previously has evaluated systematically the power of class-linkage alone for classification in machine learning benchmark data sets. The results demonstrate that very simple network-classification models perform quite well—well enough that they should be used regularly as baseline classifiers for studies of learning with networked data. The simplest method (which performs remarkably well) highlights the close correspondence between several existing methods introduced for different purposes—that is, Gaussian-field classifiers, Hopfield networks, and relational-neighbor classifiers. The case study also shows that there are two sets of techniques that are preferable in different situations, namely when few versus many labels are known initially. We also demonstrate that link selection plays an important role similar to traditional feature selection. Keywords: relational learning, network learning, collective inference, collective classification, networked data, probabilistic relational models, network analysis, network data
4 0.044493161 72 jmlr-2007-Relational Dependency Networks
Author: Jennifer Neville, David Jensen
Abstract: Recent work on graphical models for relational data has demonstrated significant improvements in classification and inference when models represent the dependencies among instances. Despite its use in conventional statistical models, the assumption of instance independence is contradicted by most relational data sets. For example, in citation data there are dependencies among the topics of a paper’s references, and in genomic data there are dependencies among the functions of interacting proteins. In this paper, we present relational dependency networks (RDNs), graphical models that are capable of expressing and reasoning with such dependencies in a relational setting. We discuss RDNs in the context of relational Bayes networks and relational Markov networks and outline the relative strengths of RDNs—namely, the ability to represent cyclic dependencies, simple methods for parameter estimation, and efficient structure learning techniques. The strengths of RDNs are due to the use of pseudolikelihood learning techniques, which estimate an efficient approximation of the full joint distribution. We present learned RDNs for a number of real-world data sets and evaluate the models in a prediction context, showing that RDNs identify and exploit cyclic relational dependencies to achieve significant performance gains over conventional conditional models. In addition, we use synthetic data to explore model performance under various relational data characteristics, showing that RDN learning and inference techniques are accurate over a wide range of conditions. Keywords: relational learning, probabilistic relational models, knowledge discovery, graphical models, dependency networks, pseudolikelihood estimation
5 0.043047115 90 jmlr-2007-Value Regularization and Fenchel Duality
Author: Ryan M. Rifkin, Ross A. Lippert
Abstract: Regularization is an approach to function learning that balances fit and smoothness. In practice, we search for a function f with a finite representation f = ∑i ci φi (·). In most treatments, the ci are the primary objects of study. We consider value regularization, constructing optimization problems in which the predicted values at the training points are the primary variables, and therefore the central objects of study. Although this is a simple change, it has profound consequences. From convex conjugacy and the theory of Fenchel duality, we derive separate optimality conditions for the regularization and loss portions of the learning problem; this technique yields clean and short derivations of standard algorithms. This framework is ideally suited to studying many other phenomena at the intersection of learning theory and optimization. We obtain a value-based variant of the representer theorem, which underscores the transductive nature of regularization in reproducing kernel Hilbert spaces. We unify and extend previous results on learning kernel functions, with very simple proofs. We analyze the use of unregularized bias terms in optimization problems, and low-rank approximations to kernel matrices, obtaining new results in these areas. In summary, the combination of value regularization and Fenchel duality are valuable tools for studying the optimization problems in machine learning. Keywords: kernel machines, duality, optimization, convex analysis, kernel learning
6 0.031730138 1 jmlr-2007-"Ideal Parent" Structure Learning for Continuous Variable Bayesian Networks
7 0.031013843 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
8 0.029977854 70 jmlr-2007-Ranking the Best Instances
9 0.027502988 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction inl1Using Cauchy Random Projections
10 0.026386166 22 jmlr-2007-Compression-Based Averaging of Selective Naive Bayes Classifiers (Special Topic on Model Selection)
11 0.025081214 48 jmlr-2007-Learning in Environments with Unknown Dynamics: Towards more Robust Concept Learners
12 0.020860069 29 jmlr-2007-Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts
13 0.02046296 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification
14 0.0204542 11 jmlr-2007-Anytime Learning of Decision Trees
15 0.01871529 27 jmlr-2007-Distances between Data Sets Based on Summary Statistics
16 0.018543595 28 jmlr-2007-Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data
17 0.01780856 79 jmlr-2007-Structure and Majority Classes in Decision Tree Learning
18 0.017790683 57 jmlr-2007-Multi-class Protein Classification Using Adaptive Codes
19 0.016615385 54 jmlr-2007-Measuring Differentiability: Unmasking Pseudonymous Authors
20 0.016426511 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation
topicId topicWeight
[(0, 0.109), (1, 0.108), (2, -0.024), (3, 0.044), (4, 0.026), (5, 0.126), (6, 0.012), (7, -0.05), (8, 0.013), (9, -0.249), (10, 0.092), (11, 0.213), (12, 0.113), (13, 0.056), (14, -0.022), (15, 0.297), (16, -0.082), (17, -0.398), (18, 0.234), (19, 0.071), (20, 0.067), (21, 0.099), (22, -0.11), (23, 0.002), (24, 0.052), (25, 0.027), (26, -0.001), (27, -0.069), (28, 0.002), (29, 0.045), (30, -0.074), (31, -0.064), (32, 0.038), (33, -0.041), (34, -0.038), (35, 0.017), (36, 0.045), (37, 0.168), (38, -0.183), (39, 0.066), (40, 0.091), (41, -0.026), (42, -0.011), (43, -0.079), (44, -0.056), (45, 0.018), (46, -0.013), (47, -0.002), (48, 0.03), (49, 0.004)]
simIndex simValue paperId paperTitle
same-paper 1 0.97143573 43 jmlr-2007-Integrating Naïve Bayes and FOIL
Author: Niels Landwehr, Kristian Kersting, Luc De Raedt
Abstract: A novel relational learning approach that tightly integrates the naïve Bayes learning scheme with the inductive logic programming rule-learner FOIL is presented. In contrast to previous combinations that have employed naïve Bayes only for post-processing the rule sets, the presented approach employs the naïve Bayes criterion to guide its search directly. The proposed technique is implemented in the N FOIL and T FOIL systems, which employ standard naïve Bayes and tree augmented naïve Bayes models respectively. We show that these integrated approaches to probabilistic model and rule learning outperform post-processing approaches. They also yield significantly more accurate models than simple rule learning and are competitive with more sophisticated ILP systems. Keywords: rule learning, naïve Bayes, statistical relational learning, inductive logic programming
2 0.80770862 47 jmlr-2007-Learning Horn Expressions with LOGAN-H
Author: Marta Arias, Roni Khardon, Jérôme Maloberti
Abstract: The paper introduces L OG A N -H —a system for learning first-order function-free Horn expressions from interpretations. The system is based on an algorithm that learns by asking questions and that was proved correct in previous work. The current paper shows how the algorithm can be implemented in a practical system, and introduces a new algorithm based on it that avoids interaction and learns from examples only. The L OG A N -H system implements these algorithms and adds several facilities and optimizations that allow efficient applications in a wide range of problems. As one of the important ingredients, the system includes several fast procedures for solving the subsumption problem, an NP-complete problem that needs to be solved many times during the learning process. We describe qualitative and quantitative experiments in several domains. The experiments demonstrate that the system can deal with varied problems, large amounts of data, and that it achieves good classification accuracy. Keywords: inductive logic programming, subsumption, bottom-up learning, learning with queries
3 0.14946163 72 jmlr-2007-Relational Dependency Networks
Author: Jennifer Neville, David Jensen
Abstract: Recent work on graphical models for relational data has demonstrated significant improvements in classification and inference when models represent the dependencies among instances. Despite its use in conventional statistical models, the assumption of instance independence is contradicted by most relational data sets. For example, in citation data there are dependencies among the topics of a paper’s references, and in genomic data there are dependencies among the functions of interacting proteins. In this paper, we present relational dependency networks (RDNs), graphical models that are capable of expressing and reasoning with such dependencies in a relational setting. We discuss RDNs in the context of relational Bayes networks and relational Markov networks and outline the relative strengths of RDNs—namely, the ability to represent cyclic dependencies, simple methods for parameter estimation, and efficient structure learning techniques. The strengths of RDNs are due to the use of pseudolikelihood learning techniques, which estimate an efficient approximation of the full joint distribution. We present learned RDNs for a number of real-world data sets and evaluate the models in a prediction context, showing that RDNs identify and exploit cyclic relational dependencies to achieve significant performance gains over conventional conditional models. In addition, we use synthetic data to explore model performance under various relational data characteristics, showing that RDN learning and inference techniques are accurate over a wide range of conditions. Keywords: relational learning, probabilistic relational models, knowledge discovery, graphical models, dependency networks, pseudolikelihood estimation
4 0.14440937 90 jmlr-2007-Value Regularization and Fenchel Duality
Author: Ryan M. Rifkin, Ross A. Lippert
Abstract: Regularization is an approach to function learning that balances fit and smoothness. In practice, we search for a function f with a finite representation f = ∑i ci φi (·). In most treatments, the ci are the primary objects of study. We consider value regularization, constructing optimization problems in which the predicted values at the training points are the primary variables, and therefore the central objects of study. Although this is a simple change, it has profound consequences. From convex conjugacy and the theory of Fenchel duality, we derive separate optimality conditions for the regularization and loss portions of the learning problem; this technique yields clean and short derivations of standard algorithms. This framework is ideally suited to studying many other phenomena at the intersection of learning theory and optimization. We obtain a value-based variant of the representer theorem, which underscores the transductive nature of regularization in reproducing kernel Hilbert spaces. We unify and extend previous results on learning kernel functions, with very simple proofs. We analyze the use of unregularized bias terms in optimization problems, and low-rank approximations to kernel matrices, obtaining new results in these areas. In summary, the combination of value regularization and Fenchel duality are valuable tools for studying the optimization problems in machine learning. Keywords: kernel machines, duality, optimization, convex analysis, kernel learning
5 0.13904026 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
Author: Roland Nilsson, José M. Peña, Johan Björkegren, Jesper Tegnér
Abstract: We analyze two different feature selection problems: finding a minimal feature set optimal for classification (MINIMAL - OPTIMAL) vs. finding all features relevant to the target variable (ALL RELEVANT ). The latter problem is motivated by recent applications within bioinformatics, particularly gene expression analysis. For both problems, we identify classes of data distributions for which there exist consistent, polynomial-time algorithms. We also prove that ALL - RELEVANT is much harder than MINIMAL - OPTIMAL and propose two consistent, polynomial-time algorithms. We argue that the distribution classes considered are reasonable in many practical cases, so that our results simplify feature selection in a wide range of machine learning tasks. Keywords: learning theory, relevance, classification, Markov blanket, bioinformatics
6 0.13032526 1 jmlr-2007-"Ideal Parent" Structure Learning for Continuous Variable Bayesian Networks
7 0.12962219 70 jmlr-2007-Ranking the Best Instances
8 0.12367729 22 jmlr-2007-Compression-Based Averaging of Selective Naive Bayes Classifiers (Special Topic on Model Selection)
9 0.11406903 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study
10 0.1086935 11 jmlr-2007-Anytime Learning of Decision Trees
11 0.092201814 29 jmlr-2007-Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts
12 0.089571439 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification
13 0.089085706 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction inl1Using Cauchy Random Projections
14 0.082802139 84 jmlr-2007-The Pyramid Match Kernel: Efficient Learning with Sets of Features
15 0.082630143 48 jmlr-2007-Learning in Environments with Unknown Dynamics: Towards more Robust Concept Learners
16 0.078882568 54 jmlr-2007-Measuring Differentiability: Unmasking Pseudonymous Authors
17 0.078120582 55 jmlr-2007-Minimax Regret Classifier for Imprecise Class Distributions
18 0.077396199 79 jmlr-2007-Structure and Majority Classes in Decision Tree Learning
19 0.072979324 28 jmlr-2007-Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data
20 0.068448208 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data
topicId topicWeight
[(4, 0.011), (8, 0.01), (10, 0.021), (12, 0.017), (15, 0.016), (18, 0.019), (28, 0.079), (40, 0.035), (45, 0.031), (48, 0.021), (60, 0.507), (80, 0.021), (85, 0.031), (98, 0.07)]
simIndex simValue paperId paperTitle
Author: Nikolas List, Hans Ulrich Simon
Abstract: We present a general decomposition algorithm that is uniformly applicable to every (suitably normalized) instance of Convex Quadratic Optimization and efficiently approaches an optimal solution. The number of iterations required to be within ε of optimality grows linearly with 1/ε and quadratically with the number m of variables. The working set selection can be performed in polynomial time. If we restrict our considerations to instances of Convex Quadratic Optimization with at most k0 equality constraints for some fixed constant k0 plus some so-called box-constraints (conditions that hold for most variants of SVM-optimization), the working set is found in linear time. Our analysis builds on a generalization of the concept of rate certifying pairs that was introduced by Hush and Scovel. In order to extend their results to arbitrary instances of Convex Quadratic Optimization, we introduce the general notion of a rate certifying q-set. We improve on the results by Hush and Scovel (2003) in several ways. First our result holds for Convex Quadratic Optimization whereas the results by Hush and Scovel are specialized to SVM-optimization. Second, we achieve a higher rate of convergence even for the special case of SVM-optimization (despite the generality of our approach). Third, our analysis is technically simpler. We prove furthermore that the strategy for working set selection which is based on rate certifying sets coincides with a strategy which is based on a so-called “sparse witness of sub-optimality”. Viewed from this perspective, our main result improves on convergence results by List and Simon (2004) and Simon (2004) by providing convergence rates (and by holding under more general conditions). Keywords: convex quadratic optimization, decomposition algorithms, support vector machines
2 0.88177747 9 jmlr-2007-AdaBoost is Consistent
Author: Peter L. Bartlett, Mikhail Traskin
Abstract: The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that provided AdaBoost is stopped after n1−ε iterations—for sample size n and ε ∈ (0, 1)—the sequence of risks of the classifiers it produces approaches the Bayes risk. Keywords: boosting, adaboost, consistency
same-paper 3 0.87876719 43 jmlr-2007-Integrating Naïve Bayes and FOIL
Author: Niels Landwehr, Kristian Kersting, Luc De Raedt
Abstract: A novel relational learning approach that tightly integrates the naïve Bayes learning scheme with the inductive logic programming rule-learner FOIL is presented. In contrast to previous combinations that have employed naïve Bayes only for post-processing the rule sets, the presented approach employs the naïve Bayes criterion to guide its search directly. The proposed technique is implemented in the N FOIL and T FOIL systems, which employ standard naïve Bayes and tree augmented naïve Bayes models respectively. We show that these integrated approaches to probabilistic model and rule learning outperform post-processing approaches. They also yield significantly more accurate models than simple rule learning and are competitive with more sophisticated ILP systems. Keywords: rule learning, naïve Bayes, statistical relational learning, inductive logic programming
4 0.86835432 73 jmlr-2007-Revised Loss Bounds for the Set Covering Machine and Sample-Compression Loss Bounds for Imbalanced Data
Author: Zakria Hussain, François Laviolette, Mario Marchand, John Shawe-Taylor, Spencer Charles Brubaker, Matthew D. Mullin
Abstract: Marchand and Shawe-Taylor (2002) have proposed a loss bound for the set covering machine that has the property to depend on the observed fraction of positive examples and on what the classifier achieves on the positive training examples. We show that this loss bound is incorrect. We then propose a loss bound, valid for any sample-compression learning algorithm (including the set covering machine), that depends on the observed fraction of positive examples and on what the classifier achieves on them. We also compare numerically the loss bound proposed in this paper with the incorrect bound, the original SCM bound and a recently proposed loss bound of Marchand and Sokolova (2005) (which does not depend on the observed fraction of positive examples) and show that the latter loss bounds can be substantially larger than the new bound in the presence of imbalanced misclassifications. Keywords: set covering machines, sample-compression, loss bounds
5 0.56177008 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
Author: Roland Nilsson, José M. Peña, Johan Björkegren, Jesper Tegnér
Abstract: We analyze two different feature selection problems: finding a minimal feature set optimal for classification (MINIMAL - OPTIMAL) vs. finding all features relevant to the target variable (ALL RELEVANT ). The latter problem is motivated by recent applications within bioinformatics, particularly gene expression analysis. For both problems, we identify classes of data distributions for which there exist consistent, polynomial-time algorithms. We also prove that ALL - RELEVANT is much harder than MINIMAL - OPTIMAL and propose two consistent, polynomial-time algorithms. We argue that the distribution classes considered are reasonable in many practical cases, so that our results simplify feature selection in a wide range of machine learning tasks. Keywords: learning theory, relevance, classification, Markov blanket, bioinformatics
6 0.55939412 71 jmlr-2007-Refinable Kernels
8 0.52782393 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
9 0.52697808 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
10 0.52127409 78 jmlr-2007-Statistical Consistency of Kernel Canonical Correlation Analysis
12 0.50672269 14 jmlr-2007-Behavioral Shaping for Geometric Concepts
13 0.4430204 44 jmlr-2007-Large Margin Semi-supervised Learning
14 0.43788317 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction inl1Using Cauchy Random Projections
15 0.43482739 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss
16 0.43139365 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
17 0.42232496 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression
18 0.42179441 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
19 0.42000586 65 jmlr-2007-PAC-Bayes Risk Bounds for Stochastic Averages and Majority Votes of Sample-Compressed Classifiers
20 0.41840827 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption