jmlr jmlr2010 jmlr2010-63 knowledge-graph by maker-knowledge-mining

63 jmlr-2010-Learning Instance-Specific Predictive Models


Source: pdf

Author: Shyam Visweswaran, Gregory F. Cooper

Abstract: This paper introduces a Bayesian algorithm for constructing predictive models from data that are optimized to predict a target variable well for a particular instance. This algorithm learns Markov blanket models, carries out Bayesian model averaging over a set of models to predict a target variable of the instance at hand, and employs an instance-specific heuristic to locate a set of suitable models to average over. We call this method the instance-specific Markov blanket (ISMB) algorithm. The ISMB algorithm was evaluated on 21 UCI data sets using five different performance measures and its performance was compared to that of several commonly used predictive algorithms, including naïve Bayes, C4.5 decision tree, logistic regression, neural networks, k-Nearest Neighbor, Lazy Bayesian Rules, and AdaBoost. Over all the data sets, the ISMB algorithm performed better on average on all performance measures against all the comparison algorithms. Keywords: instance-specific, Bayesian network, Markov blanket, Bayesian model averaging

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15260, USA. Editor: Max Chickering. Abstract: This paper introduces a Bayesian algorithm for constructing predictive models from data that are optimized to predict a target variable well for a particular instance. [sent-4, score-0.208]

2 This algorithm learns Markov blanket models, carries out Bayesian model averaging over a set of models to predict a target variable of the instance at hand, and employs an instance-specific heuristic to locate a set of suitable models to average over. [sent-5, score-0.569]

3 Introduction Prediction is a central problem in machine learning that involves inducing a model from a set of training instances that is then applied to future instances to predict a target variable of interest. [sent-11, score-0.323]

4 A third approach, applicable to model averaging where a set of models is collectively used for prediction, is to identify a set of models that are most relevant to prediction for the instance at hand. [sent-26, score-0.317]

5 A coherent approach to dealing with the uncertainty in model selection is Bayesian model averaging (BMA) (Hoeting et al. [sent-38, score-0.233]

6 A pragmatic approach is to average over a few good models, termed selective Bayesian model averaging, which serves to approximate the prediction obtained from averaging over all models. [sent-42, score-0.251]

7 Specifically, the instance-specific method uses the features of the current instance to inform the BN learning algorithm to selectively average over models that differ considerably in their predictions for the target variable of the instance at hand. [sent-46, score-0.248]

8 In the bottom panel, there is an extra arc from instance to model, because the structure and parameters of the model are influenced by the features of the instance at hand. [sent-58, score-0.246]

9 A test instance t is one in which the unknown value of the target variable Z t is to be predicted from the known values of the predictors Xt and the known values of < Xi , Z i > of a set of training instances. [sent-84, score-0.241]

10 The Bayesian model combination technique is called model averaging where the combined prediction is the weighted average of the individual predictions of the models with the model posterior probabilities comprising the weights. [sent-103, score-0.347]

11 Expression 4 for the instance-specific model, however, selects the model that will have the greatest utility for the specific instance Xt = xt . [sent-120, score-0.238]

12 For predicting Z t given instance Xt = xt , application of the model selected using Expression 1 can never have an expected utility greater than the application of the model selected using Expression 4. [sent-121, score-0.261]

13 Such averaging is primarily useful when no single model in the model space under consideration has a dominant posterior probability. [sent-129, score-0.234]

14 However, since the number of models in practically useful model spaces is enormous, exact BMA, where the averaging is done over the entire model space, is usually not feasible. [sent-130, score-0.247]

15 Instance-specific methods, on the other hand, learn an explicit model or models from the training instances that are then applied to the test instance. [sent-142, score-0.204]

16 The similarity measure evaluates the similarity between the test instance and the training instances and selects the appropriate training instances and their relative weights in response to the test instance (Zhang et al. [sent-148, score-0.36]

17 To predict the target variable in the test instance, the values of the target variable in the selected training instances are combined in some simple fashion, such as a majority vote, a simple numerical average, or a polynomial fit. [sent-151, score-0.353]

18 When a test instance is encountered, the training instance that is most similar to the test instance is located and its target value is returned as the prediction (Cover and Hart, 1967). [sent-153, score-0.345]

19 For a test instance, this method selects the k most similar training instances and either averages or takes a majority vote of their target values. [sent-155, score-0.253]
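
As a point of reference, below is a minimal sketch of that k-nearest-neighbor style prediction; the Euclidean distance and all names are illustrative assumptions, not taken from the paper.

```python
from collections import Counter
import math

def knn_predict(train_x, train_y, test_x, k=3):
    """Predict the target of test_x by majority vote over its k nearest training instances."""
    # Euclidean distance between the test instance and each training instance.
    dists = [math.dist(x, test_x) for x in train_x]
    # Indices of the k closest training instances.
    nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
    # Majority vote over their target values.
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy usage: two classes in a 2-D feature space.
X = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.1), (0.9, 1.0)]
y = ["F", "F", "T", "T"]
print(knn_predict(X, y, (0.95, 1.05), k=3))  # expected "T"
```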

20 A further extension is locally weighted regression that selects instances similar to the test instance, weights them according to their similarity, and performs regression to predict the target (Atkeson et al. [sent-158, score-0.217]

21 The structure of the local naive Bayes classifier consists of the target variable as the parent of all other variables that do not appear in the antecedent, and the parameters of the classifier are estimated from those training instances that satisfy the antecedent. [sent-177, score-0.256]

22 LBR is an example of an instance-specific method that uses feature information available in the test instance to direct the search for a suitable model in the model space. [sent-181, score-0.234]

23 The minimal Markov blanket of a node Xi , which is sometimes called its Markov boundary, consists of the parents, children, and children’s parents of Xi . [sent-244, score-0.269]

24 In this paper, we refer to the minimal Markov blanket as the Markov blanket (MB). [sent-245, score-0.296]
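
A small sketch of extracting this Markov blanket from a DAG stored as a parent dictionary; the data structure and names are assumed for illustration and are not the paper's representation.

```python
def markov_blanket(parents, target):
    """Return the Markov blanket of `target`: its parents, children, and children's other parents.

    `parents` maps each node to the set of its parents in the DAG.
    """
    mb = set(parents.get(target, set()))                  # parents of the target
    children = {v for v, ps in parents.items() if target in ps}
    mb |= children                                        # children of the target
    for child in children:
        mb |= parents[child] - {target}                   # spouses: other parents of each child
    return mb

# Toy DAG: X1 -> Z, X2 -> Z, Z -> X3, X4 -> X3
dag = {"Z": {"X1", "X2"}, "X3": {"Z", "X4"}, "X1": set(), "X2": set(), "X4": set()}
print(markov_blanket(dag, "Z"))  # {'X1', 'X2', 'X3', 'X4'}
```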

25 The Markov blanket of the node X6 (shown stippled) comprises the set of parents, children and spouses of the node and is indicated by the shaded nodes. [sent-253, score-0.31]

26 In particular, when interest centers on the distribution of a specific target node, as is the case in classification, the structure and parameters of only the MB of the target node need be learned. [sent-257, score-0.253]

27 This algorithm has two stages: a growing phase that adds potential predictor variables to the MB and a shrinking phase that removes the false positives that were added in the first phase. [sent-270, score-0.347]

28 (2006) later developed the Min-Max Markov Blanket algorithm (MMMB) that first identifies the direct parents and children of the target and then parents of the children using conditional independence tests. [sent-272, score-0.272]

29 The Instance-Specific Markov Blanket (ISMB) Algorithm The goal of the instance-specific Markov blanket (ISMB) algorithm is to predict well a discrete target variable of interest. [sent-280, score-0.278]

30 Such Bayes optimal predictions involve averaging over all models in the model space which is usually computationally intractable. [sent-282, score-0.2]

31 One approach, termed selective model averaging, has been to approximate the Bayes optimal prediction by averaging over a subset of the possible models and has been shown to improve predictive performance (Hoeting et al. [sent-283, score-0.329]

32 The ISMB algorithm performs selective model averaging and uses a novel heuristic search method to select the models over which averaging is done. [sent-286, score-0.436]

33 The ISMB algorithm modifies the typical BN structure learning algorithm to learn only MBs of the target node of interest, by using a set of operators that generate only the MB structures of the target variable. [sent-296, score-0.305]

34 Instance-Specific Bayesian Model Averaging: The objective of the ISMB algorithm is to derive the posterior distribution P(Z^t | x^t, D) for the target variable Z^t in the instance at hand, given the values of the other variables X^t = x^t and the training data D. [sent-302, score-0.442]

35 The ideal computation of the posterior distribution P(Z^t | x^t, D) by BMA is as follows:

P(Z^t \mid x^t, D) \;=\; \sum_{G \in M} P(Z^t \mid x^t, G, D)\, P(G \mid D), \qquad (5)

where the sum is taken over all MB structures G in the model space M. [sent-303, score-0.242]

36 Next, the parameterized MB is used to compute the distribution over the target variable Z^t of the instance at hand given the values x^t of the remaining variables in the MB by applying standard BN inference (Neapolitan, 2003). [sent-314, score-0.273]

37 Also, as previously described, N_{ijk} is the number of instances in the data where node i has value k and the parents of i have the state denoted by j, and N_{ij} = \sum_k N_{ijk}. [sent-330, score-0.287]
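
A minimal sketch of how the N_{ijk} and N_{ij} counts could be tallied from a discrete data set; the tabular representation and function name are assumptions made for illustration.

```python
from collections import defaultdict

def count_nijk(data, node, parent_nodes):
    """Tally N_ijk: instances where `node` takes value k and its parents take joint state j.

    `data` is a list of dicts mapping variable names to discrete values.
    Returns (N_ijk, N_ij) keyed by parent state j and child value k.
    """
    n_ijk = defaultdict(lambda: defaultdict(int))
    n_ij = defaultdict(int)
    for row in data:
        j = tuple(row[p] for p in parent_nodes)  # joint parent state
        k = row[node]                            # child value
        n_ijk[j][k] += 1
        n_ij[j] += 1                             # N_ij = sum_k N_ijk
    return n_ijk, n_ij

# Toy example: node X3 with parents Z and X4.
rows = [{"Z": "T", "X4": "F", "X3": "T"},
        {"Z": "T", "X4": "F", "X3": "F"},
        {"Z": "F", "X4": "F", "X3": "F"}]
nijk, nij = count_nijk(rows, "X3", ["Z", "X4"])
print(dict(nij))
```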

38 Hence, complete model averaging given by Equation 5 is approximated with selective model averaging, and heuristic search (described in the next section) is used to sample the model space. [sent-339, score-0.377]

39 For a set R of MB structures that have been chosen from the model space by heuristic search, selective model averaging estimates P(Z^t | x^t, D) as:

P(Z^t \mid x^t, D) \;\approx\; \sum_{G \in R} P(Z^t \mid x^t, G, D)\, \frac{P(G \mid D)}{\sum_{G' \in R} P(G' \mid D)}. \qquad (13)

[sent-340, score-0.324]
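
A minimal sketch of Equation 13 as a computation: each model's predictive distribution is weighted by its posterior, normalized over the selected set R. The log-space normalization is an implementation detail assumed here for numerical stability, not something the paper specifies.

```python
import numpy as np

def selective_bma(per_model_preds, log_model_posteriors):
    """Approximate P(Z^t | x^t, D) by averaging over a selected set R of MB structures.

    per_model_preds: (|R|, n_values) array; row G holds P(Z^t | x^t, G, D)
    log_model_posteriors: (|R|,) array of log P(G | D) (up to a shared constant)
    """
    logw = np.asarray(log_model_posteriors, dtype=float)
    w = np.exp(logw - logw.max())
    w /= w.sum()                          # posterior weights normalized over R
    return w @ np.asarray(per_model_preds)

preds = np.array([[0.8, 0.2], [0.55, 0.45], [0.3, 0.7]])
print(selective_bma(preds, [-10.0, -11.5, -14.0]))
```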

40 The ISMB algorithm performs selective model averaging and seeks to locate a good set of models over which averaging is carried out. [sent-342, score-0.378]

41 The first phase (phase 1) ignores the evidence xt from the instance at hand, while searching for MB structures that best fit the training data. [sent-345, score-0.397]

42 The second phase (phase 2) continues to add to the set of MB structures obtained from phase 1, but now searches for MB structures that have the greatest impact on the prediction of Z t for the instance at hand. [sent-346, score-0.485]

43 At each iteration of the search, successor models are generated from the current best model; the best of the successor models is added to R only if it is better than the current best model; and the remaining successor models are discarded. [sent-349, score-0.329]

44 Since no backtracking is performed, the phase 1 search terminates in a local maximum. [sent-350, score-0.21]
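
A schematic sketch of that phase 1 greedy hill-climbing loop, assuming helper callables `successors` and `phase1_score` exist; the names and interfaces are placeholders rather than the paper's procedure.

```python
def phase1_hill_climb(initial_mb, successors, phase1_score):
    """Greedy hill-climbing over MB structures: keep a successor only if it improves the score."""
    current, current_score = initial_mb, phase1_score(initial_mb)
    R = [current]                               # MB structures retained for model averaging
    while True:
        candidates = successors(current)        # apply all legal arc operators
        if not candidates:
            break
        best = max(candidates, key=phase1_score)
        best_score = phase1_score(best)
        if best_score <= current_score:
            break                               # no improving successor: local maximum reached
        current, current_score = best, best_score
        R.append(current)                       # best successor kept; the rest are discarded
    return R
```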

45 At each iteration of the search, successor models are generated from the current best model and added to Q; after an iteration the best model from Q is added to R even if this model is not better than the current best model in R. [sent-355, score-0.282]

46 With respect to a MB, the nodes can be categorized into five groups: (1) the target node, (2) parent nodes of the target, (3) child nodes of the target, (4) spousal nodes, which are parents of the children, and (5) other nodes, which are not part of the current MB. [sent-366, score-0.28]

47 In phase 1 the candidate MB structures are scored with the Bayesian score (phase 1 score) shown in Equation 11. [sent-372, score-0.389]

48 Deletion of arc Z → X5 leads to removal of the arc X4 → X5 since X5 is no longer a part of the Markov blanket of Z. [sent-378, score-0.272]

49 The purpose of this phase is to identify a set of MB structures that are highly probable, given data D. [sent-382, score-0.204]

50 Phase 2 differs from phase 1 in two aspects: it uses best-first search and it employs a different scoring function for evaluating candidate MB structures. [sent-386, score-0.21]

51 At the beginning of phase 2, R contains the MB structures that were generated in phase 1. [sent-387, score-0.356]

52 Successors to the MB structures in R are generated, scored with the phase 2 score (described in detail below) and added to the priority queue Q. [sent-388, score-0.468]

53 At each iteration of the search, the highest scoring MB structure in Q is removed from Q and added to R; all operations leading to legal MB structures are applied to it; the successor structures are scored with the phase 2 score; and the scored structures are added to Q. [sent-389, score-0.467]

54 Phase 2 search terminates when no MB structure in Q has a score higher than some small value ε or when a period of time t has elapsed, where ε and t are user specified parameters. [sent-390, score-0.243]
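
A schematic sketch of that phase 2 best-first loop using Python's heapq as the priority queue; `successors`, `phase2_score`, and the capacity handling are illustrative assumptions rather than the paper's exact implementation.

```python
import heapq
import time

def phase2_best_first(R, successors, phase2_score, eps=0.001, time_limit=60.0, capacity=1000):
    """Best-first search: repeatedly expand the highest-scoring candidate MB structure in Q."""
    Q = []  # max-heap emulated by pushing negated scores; id() breaks ties
    for mb in R:
        for cand in successors(mb):
            heapq.heappush(Q, (-phase2_score(R, cand), id(cand), cand))
    start = time.time()
    # Stop when no candidate scores above eps or the time budget is exhausted.
    while Q and -Q[0][0] > eps and time.time() - start < time_limit:
        _, _, best = heapq.heappop(Q)
        R.append(best)                           # added to R even if not better than R's best
        for cand in successors(best):
            heapq.heappush(Q, (-phase2_score(R, cand), id(cand), cand))
        Q = heapq.nsmallest(capacity, Q)         # keep only the w best candidates
        heapq.heapify(Q)
    return R
```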

55 In phase 2, the model score is computed as follows. [sent-391, score-0.349]

56 Each successor MB structure G^* to be added to Q is scored based on how much it changes the current estimate of P(Z^t | x^t, D); this is obtained by model averaging over the MB structures in R. [sent-392, score-0.336]

57 Thus, the phase 2 score for a candidate MB structure G^* is given by:

f(R, G^*) = \mathrm{KL}(p \,\|\, q) \equiv \sum_{x} p(x) \log \frac{p(x)}{q(x)},

where

p(x) = \sum_{G \in R} P(Z^t \mid x^t, G, D)\, \frac{P(G \mid D)}{\sum_{G' \in R} P(G' \mid D)} \quad \text{and} \quad q(x) = \sum_{G \in R \cup \{G^*\}} P(Z^t \mid x^t, G, D)\, \frac{P(G \mid D)}{\sum_{G' \in R \cup \{G^*\}} P(G' \mid D)}.

[sent-396, score-0.337]
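
A small sketch of that KL-divergence score computed from the model-averaged predictive distributions with and without the candidate structure; the function names and input format are assumptions.

```python
import numpy as np

def averaged_prediction(preds, log_posts):
    """Posterior-weighted average of per-model predictive distributions (as in Equation 13)."""
    w = np.exp(np.asarray(log_posts, dtype=float) - np.max(log_posts))
    w /= w.sum()
    return w @ np.asarray(preds)

def phase2_score(preds_R, log_posts_R, pred_cand, log_post_cand):
    """KL(p || q): how much the averaged estimate changes when candidate G* is added to R."""
    p = averaged_prediction(preds_R, log_posts_R)
    q = averaged_prediction(np.vstack([preds_R, pred_cand]),
                            np.append(log_posts_R, log_post_cand))
    return float(np.sum(p * np.log(p / q)))

preds_R = np.array([[0.7, 0.3], [0.6, 0.4]])
print(phase2_score(preds_R, [-5.0, -5.5], np.array([[0.1, 0.9]]), -6.0))
```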

58 Phase 1 uses greedy hill-climbing search while phase 2 uses best-first search. [sent-401, score-0.21]

59 For d iterations of the search, the number of MB structures generated and scored with the phase 1 score is O(bd). [sent-407, score-0.389]

60 Note that both phases of the search require successor MB structures to be scored with the phase 1 score. [sent-408, score-0.375]

61 Since the phase 1 score decomposes over the MB nodes, to compute it for a newly generated MB structure only those MB nodes whose parent nodes have changed need be evaluated. [sent-409, score-0.442]

62 Computing the phase 1 score for a MB node entails estimating the parameters for that node and calculating the marginal likelihood from those parameters. [sent-411, score-0.426]
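
For concreteness, the per-node marginal likelihood under a Dirichlet-multinomial model takes the standard Bayesian-Dirichlet form sketched below; the specific hyperparameters used in the paper's Equation 11 are not reproduced here, so this is a generic sketch rather than the exact score.

```latex
% Generic Bayesian-Dirichlet marginal likelihood contribution of node i,
% with q_i parent states, r_i values, counts N_{ijk}, and Dirichlet hyperparameters \alpha_{ijk}.
P(D_i \mid G) \;=\; \prod_{j=1}^{q_i}
  \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})}
  \prod_{k=1}^{r_i}
  \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})},
\qquad
\alpha_{ij} = \sum_{k} \alpha_{ijk}, \quad N_{ij} = \sum_{k} N_{ijk}.
```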

63 The phase 2 score computes the effect of a candidate MB structure on the model averaged estimate of the distribution of the target variable. [sent-413, score-0.462]

64 This requires doing inference for the target node in a MB that contains all measured variables which takes O(n) since at most n nodes influence the target distribution and hence at most n sets of parameters need be retrieved. [sent-414, score-0.256]

65 Computing both phase 1 and phase 2 scores for a MB structure therefore takes O(mn) time. [sent-415, score-0.339]

66 Space Complexity: The ISMB algorithm searches in the space of MB structures using greedy hill-climbing search for phase 1 and best-first search with a priority queue of capacity w for phase 2. [sent-421, score-0.551]

67 The ISMB algorithm performs selective model averaging to estimate the distribution of the target variable of the instance at hand as described in Section 5. [sent-481, score-0.382]

68 It chooses the MB structure that has the highest posterior probability from those found by the ISMB algorithm in the two-phase search, and uses that single model to estimate the distribution of the target variable of the instance at hand. [sent-483, score-0.266]

69 Comparing the ISMB algorithm to the ISMB-MS algorithm measures the effect of approximating selective model averaging by using model selection. [sent-484, score-0.295]

70 In phase 2, the NISMB algorithm accumulates the same number of MB models as the ISMB algorithm except that the models are identified on the basis of the non-instance-specific phase 1 score. [sent-491, score-0.384]

71 The training set simulates a low occurrence of A = T (only five out of 69 instances have A = T), and the test set consists of three instances of A = T which are not present in the training set. [sent-499, score-0.203]

72 The training set contains a total of 69 instances and the test set a total of three instances as shown; the test instances are not present in the training set. [sent-501, score-0.294]

73 The following algorithms were used in the experiments: (1) a complete model averaged version of the ISMB algorithm where model averaging is carried out over all 3567 possible MB structures, (2) the ISMB algorithm, (3) the ISMB-MS algorithm, and (4) the NISMB algorithm. [sent-503, score-0.207]

74 • Phase 2: The model score for phase 2 is computed using Equation 14, which is based on KL-divergence. [sent-505, score-0.349]

75 Phase 2 search terminates when no MB structure in Q has a phase 2 score higher than ε = 0. [sent-507, score-0.395]

76 • The predicted distribution for the target variable Z of the test instance is computed using Equation 13; for each MB structure the parameters are estimated using Equation 6. [sent-510, score-0.223]

77 Though both methods average over the same number of models, the ISMB algorithm uses the instance-specific phase 2 score to choose phase 2 models while the NISMB algorithm uses the non-instance-specific phase 1 score to choose both phase 1 and phase 2 models. [sent-517, score-1.1]

78 The phase 2 models chosen by the ISMB algorithm are potentially different for each test instance in contrast to the NISMB algorithm which selects the same models irrespective of the test instance. [sent-518, score-0.369]

79 These results, while limited in scope, provide support that the instance-specific search for models may be able to choose models that better approximate the distribution of the target variable of the instance at hand. [sent-519, score-0.295]

80 A second curve plots the model score as the logarithmic posterior probability of the model given the data; this score measures the relative contribution of the model to the final estimate of P(Z t = T|xt , D). [sent-540, score-0.525]

81 In the first two test instances the final estimates of P(Z t = T|xt , D) obtained from the instance-specific and non-instance-specific model averaging respectively are very close; both the ISMB and the NISMB algorithms predicted the value of Z correctly as T. [sent-544, score-0.251]

82 The performance on all the evaluation measures peaked at values of 800 or 1600, and beyond 1600 no further improvement was observed. [Table residue: columns give the number of models selected in phase 1, phase 2, and phases 1 and 2 combined for each data set.] [sent-608, score-0.497]

83 Both algorithms select the same models in phase 1 but potentially different models in phase 2. [sent-673, score-0.384]

84 The essence of the instance-specific method lies in the model score used in phase 2 of the search. [sent-925, score-0.349]

85 This score is sensitive to both the posterior probability of the model and the predicted distribution for the outcome variable of the instance at hand. [sent-926, score-0.303]

86 Typically, methods that evaluate models with a score employ a score that is sensitive only to the fit of the model to the training data and not to the prediction of the outcome variable. [sent-927, score-0.439]

87 BMA had better performance than Bayesian model selection, and within model averaging, instance-specific BMA had better performance than non-instance-specific BMA, though the improvement is not as large as that of model averaging over model selection. [sent-1153, score-0.301]

88 The improved performance by ISMB may arise from not only the model averaging but also from the variable selection that is performed implicitly by the Markov blanket models. [sent-1154, score-0.362]

89 As one example, in a domain where complete BMA is tractable and model averaging is carried out over all models in the model space, a search heuristic that selects a subset of models such as the one used by the instance-specific method is superfluous. [sent-1160, score-0.369]

90 Thus, the ISMB algorithm is useful for selective model averaging where it identifies a potentially relevant set of models that is predictive of the instance at hand. [sent-1162, score-0.354]

91 Improvements in the phase 1 search may make the phase 2 search relatively less contributory to the overall performance. [sent-1388, score-0.42]

92 To explore this issue, a number of search strategies that augment local greedy search and that have been successfully applied to learning BN structures can be tried, such as best-first search (Neapolitan, 2003), simulated annealing (Heckerman et al. [sent-1391, score-0.226]

93 Investigating the use of such alternative search methods in phase 1 is an interesting open problem. [sent-1398, score-0.21]

94 However, the variance of selective BMA over models that are chosen randomly is likely to be much larger than the variance of selective BMA over models chosen by the instance-specific method, which is constrained to prefer models that are a good fit to the training data. [sent-1625, score-0.276]

95 The computation of the phase 2 score (see Equation 14) requires a dissimilarity metric to compare the predictive distributions of the target variable in candidate MB structures. [sent-1628, score-0.446]

96 Hiton: A novel markov blanket algorithm for optimal variable selection. [sent-2064, score-0.227]

97 Local causal and markov blanket induction for causal discovery and feature selection for classification part i: Algorithms and empirical evaluation. [sent-2073, score-0.225]

98 Local causal and markov blanket induction for causal discovery and feature selection for classification part ii: Analysis and extensions. [sent-2082, score-0.225]

99 Bayesian model averaging of bayesian network classifiers over multiple node-orders: application to sparse datasets. [sent-2235, score-0.232]

100 Evaluation of the performance of the markov blanket bayesian classifier algorithm. [sent-2259, score-0.271]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ismb', 0.589), ('mb', 0.428), ('nismb', 0.209), ('bma', 0.181), ('phase', 0.152), ('score', 0.15), ('blanket', 0.148), ('lbr', 0.143), ('isweswaran', 0.125), ('nstance', 0.125), ('ooper', 0.125), ('xt', 0.116), ('averaging', 0.113), ('pecific', 0.096), ('redictive', 0.096), ('bestmodel', 0.09), ('bn', 0.088), ('target', 0.078), ('bayesian', 0.072), ('odels', 0.068), ('selective', 0.065), ('node', 0.062), ('arc', 0.062), ('nb', 0.061), ('instances', 0.06), ('parents', 0.059), ('cal', 0.059), ('search', 0.058), ('visweswaran', 0.056), ('successor', 0.054), ('jk', 0.053), ('knn', 0.053), ('structures', 0.052), ('markov', 0.051), ('instance', 0.051), ('cleveland', 0.049), ('ismbms', 0.049), ('postoperative', 0.049), ('lymphography', 0.048), ('model', 0.047), ('queue', 0.044), ('lazy', 0.043), ('earning', 0.043), ('predictor', 0.043), ('auc', 0.042), ('lazydt', 0.042), ('corral', 0.042), ('crx', 0.042), ('uci', 0.041), ('ab', 0.04), ('models', 0.04), ('lr', 0.04), ('cooper', 0.039), ('predictive', 0.038), ('children', 0.038), ('nodes', 0.038), ('tsamardinos', 0.037), ('successors', 0.036), ('mbs', 0.036), ('nn', 0.036), ('structure', 0.035), ('bestscore', 0.035), ('nave', 0.035), ('priority', 0.035), ('hepatitis', 0.035), ('zoo', 0.035), ('scored', 0.035), ('pima', 0.035), ('sonar', 0.035), ('logarithmic', 0.034), ('vote', 0.034), ('heart', 0.033), ('vehicle', 0.032), ('test', 0.031), ('wine', 0.03), ('parent', 0.029), ('diabetes', 0.029), ('aliferis', 0.029), ('iris', 0.029), ('misclassi', 0.029), ('variable', 0.028), ('antecedent', 0.028), ('bse', 0.028), ('bns', 0.028), ('predictors', 0.027), ('posterior', 0.027), ('dt', 0.027), ('neapolitan', 0.027), ('prediction', 0.026), ('glass', 0.026), ('australian', 0.026), ('training', 0.026), ('arcs', 0.026), ('ni', 0.026), ('selection', 0.026), ('selects', 0.024), ('predict', 0.024), ('phases', 0.024), ('hoeting', 0.024), ('measures', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000008 63 jmlr-2010-Learning Instance-Specific Predictive Models

Author: Shyam Visweswaran, Gregory F. Cooper

Abstract: This paper introduces a Bayesian algorithm for constructing predictive models from data that are optimized to predict a target variable well for a particular instance. This algorithm learns Markov blanket models, carries out Bayesian model averaging over a set of models to predict a target variable of the instance at hand, and employs an instance-specific heuristic to locate a set of suitable models to average over. We call this method the instance-specific Markov blanket (ISMB) algorithm. The ISMB algorithm was evaluated on 21 UCI data sets using five different performance measures and its performance was compared to that of several commonly used predictive algorithms, including naïve Bayes, C4.5 decision tree, logistic regression, neural networks, k-Nearest Neighbor, Lazy Bayesian Rules, and AdaBoost. Over all the data sets, the ISMB algorithm performed better on average on all performance measures against all the comparison algorithms. Keywords: instance-specific, Bayesian network, Markov blanket, Bayesian model averaging

2 0.13723412 67 jmlr-2010-Local Causal and Markov Blanket Induction for Causal Discovery and Feature Selection for Classification Part I: Algorithms and Empirical Evaluation

Author: Constantin F. Aliferis, Alexander Statnikov, Ioannis Tsamardinos, Subramani Mani, Xenofon D. Koutsoukos

Abstract: We present an algorithmic framework for learning local causal structure around target variables of interest in the form of direct causes/effects and Markov blankets applicable to very large data sets with relatively small samples. The selected feature sets can be used for causal discovery and classification. The framework (Generalized Local Learning, or GLL) can be instantiated in numerous ways, giving rise to both existing state-of-the-art as well as novel algorithms. The resulting algorithms are sound under well-defined sufficient conditions. In a first set of experiments we evaluate several algorithms derived from this framework in terms of predictivity and feature set parsimony and compare to other local causal discovery methods and to state-of-the-art non-causal feature selection methods using real data. A second set of experimental evaluations compares the algorithms in terms of ability to induce local causal neighborhoods using simulated and resimulated data and examines the relation of predictivity with causal induction performance. Our experiments demonstrate, consistently with causal feature selection theory, that local causal feature selection methods (under broad assumptions encompassing appropriate family of distributions, types of classifiers, and loss functions) exhibit strong feature set parsimony, high predictivity and local causal interpretability. Although non-causal feature selection methods are often used in practice to shed light on causal relationships, we find that they cannot be interpreted causally even when they achieve excellent predictivity. Therefore we conclude that only local causal techniques should be used when insight into causal structure is sought. In a companion paper we examine in depth the behavior of GLL algorithms, provide extensions, and show

3 0.12325779 68 jmlr-2010-Local Causal and Markov Blanket Induction for Causal Discovery and Feature Selection for Classification Part II: Analysis and Extensions

Author: Constantin F. Aliferis, Alexander Statnikov, Ioannis Tsamardinos, Subramani Mani, Xenofon D. Koutsoukos

Abstract: In part I of this work we introduced and evaluated the Generalized Local Learning (GLL) framework for producing local causal and Markov blanket induction algorithms. In the present second part we analyze the behavior of GLL algorithms and provide extensions to the core methods. Specifically, we investigate the empirical convergence of GLL to the true local neighborhood as a function of sample size. Moreover, we study how predictivity improves with increasing sample size. Then we investigate how sensitive are the algorithms to multiple statistical testing, especially in the presence of many irrelevant features. Next we discuss the role of the algorithm parameters and also show that Markov blanket and causal graph concepts can be used to understand deviations from optimality of state-of-the-art non-causal algorithms. The present paper also introduces the following extensions to the core GLL framework: parallel and distributed versions of GLL algorithms, versions with false discovery rate control, strategies for constructing novel heuristics for specific domains, and divide-and-conquer local-to-global learning (LGL) strategies. We test the generality of the LGL approach by deriving a novel LGL-based algorithm that compares favorably to the state-of-the-art global learning algorithms. In addition, we investigate the use of non-causal feature selection methods to facilitate global learning. Open problems and future research paths related to local and local-to-global causal learning are discussed. Keywords: local causal discovery, Markov blanket induction, feature selection, classification, causal structure learning, learning of Bayesian networks

4 0.076801687 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers

Author: Franz Pernkopf, Jeff A. Bilmes

Abstract: We introduce a simple order-based greedy heuristic for learning discriminative structure within generative Bayesian network classifiers. We propose two methods for establishing an order of N features. They are based on the conditional mutual information and classification rate (i.e., risk), respectively. Given an ordering, we can find a discriminative structure with O(N^{k+1}) score evaluations (where constant k is the tree-width of the sub-graph over the attributes). We present results on 25 data sets from the UCI repository, for phonetic classification using the TIMIT database, for a visual surface inspection task, and for two handwritten digit recognition tasks. We provide classification performance for both discriminative and generative parameter learning on both discriminatively and generatively structured networks. The discriminative structure found by our new procedures significantly outperforms generatively produced structures, and achieves a classification accuracy on par with the best discriminative (greedy) Bayesian network learning approach, but does so with a factor of ∼10-40 speedup. We also show that the advantages of generative discriminatively structured Bayesian network classifiers still hold in the case of missing features, a case where generative classifiers have an advantage over discriminative classifiers. Keywords: Bayesian networks, classification, discriminative learning, structure learning, graphical model, missing feature

5 0.063196093 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide

Author: Isabelle Guyon, Amir Saffari, Gideon Dror, Gavin Cawley

Abstract: The principle of parsimony also known as “Ockham’s razor” has inspired many theories of model selection. Yet such theories, all making arguments in favor of parsimony, are based on very different premises and have developed distinct methodologies to derive algorithms. We have organized challenges and edited a special issue of JMLR and several conference proceedings around the theme of model selection. In this editorial, we revisit the problem of avoiding overfitting in light of the latest results. We note the remarkable convergence of theories as different as Bayesian theory, Minimum Description Length, bias/variance tradeoff, Structural Risk Minimization, and regularization, in some approaches. We also present new and interesting examples of the complementarity of theories leading to hybrid algorithms, neither frequentist, nor Bayesian, or perhaps both frequentist and Bayesian! Keywords: model selection, ensemble methods, multilevel inference, multilevel optimization, performance prediction, bias-variance tradeoff, Bayesian priors, structural risk minimization, guaranteed risk minimization, over-fitting, regularization, minimum description length

6 0.061234072 7 jmlr-2010-A Streaming Parallel Decision Tree Algorithm

7 0.055957209 88 jmlr-2010-Optimal Search on Clustered Structural Constraint for Learning Bayesian Network Structure

8 0.054009311 56 jmlr-2010-Introduction to Causal Inference

9 0.053581756 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks

10 0.050561387 90 jmlr-2010-Permutation Tests for Studying Classifier Performance

11 0.045367859 81 jmlr-2010-On Finding Predictors for Arbitrary Families of Processes

12 0.04392533 9 jmlr-2010-An Efficient Explanation of Individual Classifications using Game Theory

13 0.042131051 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models

14 0.040839747 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks

15 0.039628148 51 jmlr-2010-Importance Sampling for Continuous Time Bayesian Networks

16 0.039168611 85 jmlr-2010-On the Foundations of Noise-free Selective Classification

17 0.038449705 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression

18 0.037256163 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning

19 0.03699404 40 jmlr-2010-Fast and Scalable Local Kernel Machines

20 0.036956005 62 jmlr-2010-Learning Gradients: Predictive Models that Infer Geometry and Statistical Dependence


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.189), (1, 0.211), (2, 0.038), (3, -0.008), (4, 0.029), (5, 0.024), (6, -0.011), (7, 0.037), (8, -0.028), (9, 0.056), (10, 0.054), (11, 0.109), (12, 0.002), (13, -0.013), (14, -0.032), (15, -0.043), (16, 0.036), (17, -0.009), (18, -0.038), (19, 0.001), (20, 0.025), (21, 0.05), (22, 0.06), (23, -0.016), (24, -0.065), (25, -0.032), (26, 0.157), (27, -0.353), (28, -0.084), (29, -0.126), (30, 0.054), (31, -0.029), (32, 0.004), (33, 0.03), (34, -0.153), (35, -0.046), (36, -0.133), (37, -0.007), (38, -0.028), (39, 0.043), (40, 0.148), (41, -0.082), (42, -0.014), (43, -0.04), (44, 0.01), (45, 0.036), (46, -0.052), (47, 0.051), (48, 0.008), (49, -0.046)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.89954209 63 jmlr-2010-Learning Instance-Specific Predictive Models

Author: Shyam Visweswaran, Gregory F. Cooper

Abstract: This paper introduces a Bayesian algorithm for constructing predictive models from data that are optimized to predict a target variable well for a particular instance. This algorithm learns Markov blanket models, carries out Bayesian model averaging over a set of models to predict a target variable of the instance at hand, and employs an instance-specific heuristic to locate a set of suitable models to average over. We call this method the instance-specific Markov blanket (ISMB) algorithm. The ISMB algorithm was evaluated on 21 UCI data sets using five different performance measures and its performance was compared to that of several commonly used predictive algorithms, including naïve Bayes, C4.5 decision tree, logistic regression, neural networks, k-Nearest Neighbor, Lazy Bayesian Rules, and AdaBoost. Over all the data sets, the ISMB algorithm performed better on average on all performance measures against all the comparison algorithms. Keywords: instance-specific, Bayesian network, Markov blanket, Bayesian model averaging

2 0.55889302 68 jmlr-2010-Local Causal and Markov Blanket Induction for Causal Discovery and Feature Selection for Classification Part II: Analysis and Extensions

Author: Constantin F. Aliferis, Alexander Statnikov, Ioannis Tsamardinos, Subramani Mani, Xenofon D. Koutsoukos

Abstract: In part I of this work we introduced and evaluated the Generalized Local Learning (GLL) framework for producing local causal and Markov blanket induction algorithms. In the present second part we analyze the behavior of GLL algorithms and provide extensions to the core methods. Specifically, we investigate the empirical convergence of GLL to the true local neighborhood as a function of sample size. Moreover, we study how predictivity improves with increasing sample size. Then we investigate how sensitive are the algorithms to multiple statistical testing, especially in the presence of many irrelevant features. Next we discuss the role of the algorithm parameters and also show that Markov blanket and causal graph concepts can be used to understand deviations from optimality of state-of-the-art non-causal algorithms. The present paper also introduces the following extensions to the core GLL framework: parallel and distributed versions of GLL algorithms, versions with false discovery rate control, strategies for constructing novel heuristics for specific domains, and divide-and-conquer local-to-global learning (LGL) strategies. We test the generality of the LGL approach by deriving a novel LGL-based algorithm that compares favorably to the state-of-the-art global learning algorithms. In addition, we investigate the use of non-causal feature selection methods to facilitate global learning. Open problems and future research paths related to local and local-to-global causal learning are discussed. Keywords: local causal discovery, Markov blanket induction, feature selection, classification, causal structure learning, learning of Bayesian networks

3 0.50464374 67 jmlr-2010-Local Causal and Markov Blanket Induction for Causal Discovery and Feature Selection for Classification Part I: Algorithms and Empirical Evaluation

Author: Constantin F. Aliferis, Alexander Statnikov, Ioannis Tsamardinos, Subramani Mani, Xenofon D. Koutsoukos

Abstract: We present an algorithmic framework for learning local causal structure around target variables of interest in the form of direct causes/effects and Markov blankets applicable to very large data sets with relatively small samples. The selected feature sets can be used for causal discovery and classification. The framework (Generalized Local Learning, or GLL) can be instantiated in numerous ways, giving rise to both existing state-of-the-art as well as novel algorithms. The resulting algorithms are sound under well-defined sufficient conditions. In a first set of experiments we evaluate several algorithms derived from this framework in terms of predictivity and feature set parsimony and compare to other local causal discovery methods and to state-of-the-art non-causal feature selection methods using real data. A second set of experimental evaluations compares the algorithms in terms of ability to induce local causal neighborhoods using simulated and resimulated data and examines the relation of predictivity with causal induction performance. Our experiments demonstrate, consistently with causal feature selection theory, that local causal feature selection methods (under broad assumptions encompassing appropriate family of distributions, types of classifiers, and loss functions) exhibit strong feature set parsimony, high predictivity and local causal interpretability. Although non-causal feature selection methods are often used in practice to shed light on causal relationships, we find that they cannot be interpreted causally even when they achieve excellent predictivity. Therefore we conclude that only local causal techniques should be used when insight into causal structure is sought. In a companion paper we examine in depth the behavior of GLL algorithms, provide extensions, and show

4 0.47815606 88 jmlr-2010-Optimal Search on Clustered Structural Constraint for Learning Bayesian Network Structure

Author: Kaname Kojima, Eric Perrier, Seiya Imoto, Satoru Miyano

Abstract: We study the problem of learning an optimal Bayesian network in a constrained search space; skeletons are compelled to be subgraphs of a given undirected graph called the super-structure. The previously derived constrained optimal search (COS) remains limited even for sparse superstructures. To extend its feasibility, we propose to divide the super-structure into several clusters and perform an optimal search on each of them. Further, to ensure acyclicity, we introduce the concept of ancestral constraints (ACs) and derive an optimal algorithm satisfying a given set of ACs. Finally, we theoretically derive the necessary and sufficient sets of ACs to be considered for finding an optimal constrained graph. Empirical evaluations demonstrate that our algorithm can learn optimal Bayesian networks for some graphs containing several hundreds of vertices, and even for super-structures having a high average degree (up to four), which is a drastic improvement in feasibility over the previous optimal algorithm. Learnt networks are shown to largely outperform state-of-the-art heuristic algorithms both in terms of score and structural hamming distance. Keywords: Bayesian networks, structure learning, constrained optimal search

5 0.45877004 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers

Author: Franz Pernkopf, Jeff A. Bilmes

Abstract: We introduce a simple order-based greedy heuristic for learning discriminative structure within generative Bayesian network classifiers. We propose two methods for establishing an order of N features. They are based on the conditional mutual information and classification rate (i.e., risk), respectively. Given an ordering, we can find a discriminative structure with O(N^{k+1}) score evaluations (where constant k is the tree-width of the sub-graph over the attributes). We present results on 25 data sets from the UCI repository, for phonetic classification using the TIMIT database, for a visual surface inspection task, and for two handwritten digit recognition tasks. We provide classification performance for both discriminative and generative parameter learning on both discriminatively and generatively structured networks. The discriminative structure found by our new procedures significantly outperforms generatively produced structures, and achieves a classification accuracy on par with the best discriminative (greedy) Bayesian network learning approach, but does so with a factor of ∼10-40 speedup. We also show that the advantages of generative discriminatively structured Bayesian network classifiers still hold in the case of missing features, a case where generative classifiers have an advantage over discriminative classifiers. Keywords: Bayesian networks, classification, discriminative learning, structure learning, graphical model, missing feature

6 0.42134452 7 jmlr-2010-A Streaming Parallel Decision Tree Algorithm

7 0.34140682 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks

8 0.30923888 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide

9 0.29429919 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines

10 0.26421496 24 jmlr-2010-Collective Inference for Extraction MRFs Coupled with Symmetric Clique Potentials

11 0.24103811 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning

12 0.22411571 15 jmlr-2010-Approximate Tree Kernels

13 0.22173604 81 jmlr-2010-On Finding Predictors for Arbitrary Families of Processes

14 0.21960938 4 jmlr-2010-A Generalized Path Integral Control Approach to Reinforcement Learning

15 0.2190672 9 jmlr-2010-An Efficient Explanation of Individual Classifications using Game Theory

16 0.21611445 108 jmlr-2010-Stochastic Complexity and Generalization Error of a Restricted Boltzmann Machine in Bayesian Estimation

17 0.2112197 77 jmlr-2010-Model-based Boosting 2.0

18 0.21083358 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data

19 0.21042186 53 jmlr-2010-Inducing Tree-Substitution Grammars

20 0.20957626 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.01), (3, 0.011), (4, 0.022), (8, 0.044), (12, 0.012), (15, 0.025), (18, 0.01), (21, 0.013), (24, 0.011), (32, 0.052), (33, 0.035), (36, 0.067), (37, 0.046), (38, 0.012), (44, 0.274), (75, 0.161), (81, 0.015), (85, 0.075), (96, 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.73286968 63 jmlr-2010-Learning Instance-Specific Predictive Models

Author: Shyam Visweswaran, Gregory F. Cooper

Abstract: This paper introduces a Bayesian algorithm for constructing predictive models from data that are optimized to predict a target variable well for a particular instance. This algorithm learns Markov blanket models, carries out Bayesian model averaging over a set of models to predict a target variable of the instance at hand, and employs an instance-specific heuristic to locate a set of suitable models to average over. We call this method the instance-specific Markov blanket (ISMB) algorithm. The ISMB algorithm was evaluated on 21 UCI data sets using five different performance measures and its performance was compared to that of several commonly used predictive algorithms, including naïve Bayes, C4.5 decision tree, logistic regression, neural networks, k-Nearest Neighbor, Lazy Bayesian Rules, and AdaBoost. Over all the data sets, the ISMB algorithm performed better on average on all performance measures against all the comparison algorithms. Keywords: instance-specific, Bayesian network, Markov blanket, Bayesian model averaging

2 0.57513028 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond

Author: Yevgeny Seldin, Naftali Tishby

Abstract: We derive PAC-Bayesian generalization bounds for supervised and unsupervised learning models based on clustering, such as co-clustering, matrix tri-factorization, graphical models, graph clustering, and pairwise clustering. We begin with the analysis of co-clustering, which is a widely used approach to the analysis of data matrices. We distinguish among two tasks in matrix data analysis: discriminative prediction of the missing entries in data matrices and estimation of the joint probability distribution of row and column variables in co-occurrence matrices. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two tasks. The analysis yields regularization terms that were absent in the previous formulations of co-clustering. The bounds suggest that the expected performance of co-clustering is governed by a trade-off between its empirical performance and the mutual information preserved by the cluster variables on row and column IDs. We derive an iterative projection algorithm for finding a local optimum of this trade-off for discriminative prediction tasks. This algorithm achieved state-of-the-art performance in the MovieLens collaborative filtering task. Our co-clustering model can also be seen as matrix tri-factorization and the results provide generalization bounds, regularization terms, and new algorithms for this form of matrix factorization. The analysis of co-clustering is extended to tree-shaped graphical models, which can be used to analyze high dimensional tensors. According to the bounds, the generalization abilities of tree-shaped graphical models depend on a trade-off between their empirical data fit and the mutual information that is propagated up the tree levels. We also formulate weighted graph clustering as a prediction problem: given a subset of edge weights we analyze the ability of graph clustering to predict the remaining edge weights. The analysis of co-clustering easily

3 0.57152021 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming

Author: Ming Yuan

Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity

4 0.57126725 67 jmlr-2010-Local Causal and Markov Blanket Induction for Causal Discovery and Feature Selection for Classification Part I: Algorithms and Empirical Evaluation

Author: Constantin F. Aliferis, Alexander Statnikov, Ioannis Tsamardinos, Subramani Mani, Xenofon D. Koutsoukos

Abstract: We present an algorithmic framework for learning local causal structure around target variables of interest in the form of direct causes/effects and Markov blankets applicable to very large data sets with relatively small samples. The selected feature sets can be used for causal discovery and classification. The framework (Generalized Local Learning, or GLL) can be instantiated in numerous ways, giving rise to both existing state-of-the-art as well as novel algorithms. The resulting algorithms are sound under well-defined sufficient conditions. In a first set of experiments we evaluate several algorithms derived from this framework in terms of predictivity and feature set parsimony and compare to other local causal discovery methods and to state-of-the-art non-causal feature selection methods using real data. A second set of experimental evaluations compares the algorithms in terms of ability to induce local causal neighborhoods using simulated and resimulated data and examines the relation of predictivity with causal induction performance. Our experiments demonstrate, consistently with causal feature selection theory, that local causal feature selection methods (under broad assumptions encompassing appropriate family of distributions, types of classifiers, and loss functions) exhibit strong feature set parsimony, high predictivity and local causal interpretability. Although non-causal feature selection methods are often used in practice to shed light on causal relationships, we find that they cannot be interpreted causally even when they achieve excellent predictivity. Therefore we conclude that only local causal techniques should be used when insight into causal structure is sought. In a companion paper we examine in depth the behavior of GLL algorithms, provide extensions, and show

5 0.56994623 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization

Author: Pannagadatta K. Shivaswamy, Tony Jebara

Abstract: Leading classification methods such as support vector machines (SVMs) and their counterparts achieve strong generalization performance by maximizing the margin of separation between data classes. While the maximum margin approach has achieved promising performance, this article identifies its sensitivity to affine transformations of the data and to directions with large data spread. Maximum margin solutions may be misled by the spread of data and preferentially separate classes along large spread directions. This article corrects these weaknesses by measuring margin not in the absolute sense but rather only relative to the spread of data in any projection direction. Maximum relative margin corresponds to a data-dependent regularization on the classification function while maximum absolute margin corresponds to an ℓ2 norm constraint on the classification function. Interestingly, the proposed improvements only require simple extensions to existing maximum margin formulations and preserve the computational efficiency of SVMs. Through the maximization of relative margin, surprising performance gains are achieved on real-world problems such as digit, text classification and on several other benchmark data sets. In addition, risk bounds are derived for the new formulation based on Rademacher averages. Keywords: support vector machines, kernel methods, large margin, Rademacher complexity

6 0.56841642 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers

7 0.56830114 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing

8 0.56636 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes

9 0.56631792 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels

10 0.56554317 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions

11 0.56384337 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking

12 0.56236541 51 jmlr-2010-Importance Sampling for Continuous Time Bayesian Networks

13 0.56019753 56 jmlr-2010-Introduction to Causal Inference

14 0.55865055 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data

15 0.55835181 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding

16 0.55827892 109 jmlr-2010-Stochastic Composite Likelihood

17 0.55779386 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values

18 0.5574742 69 jmlr-2010-Lp-Nested Symmetric Distributions

19 0.55582744 66 jmlr-2010-Linear Algorithms for Online Multitask Classification

20 0.55559134 9 jmlr-2010-An Efficient Explanation of Individual Classifications using Game Theory