jmlr jmlr2010 jmlr2010-23 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Chunping Wang, Xuejun Liao, Lawrence Carin, David B. Dunson
Abstract: A non-parametric hierarchical Bayesian framework is developed for designing a classifier, based on a mixture of simple (linear) classifiers. Each simple classifier is termed a local “expert”, and the number of experts and their construction are manifested via a Dirichlet process formulation. The simple form of the “experts” allows analytical handling of incomplete data. The model is extended to allow simultaneous design of classifiers on multiple data sets, termed multi-task learning, with this also performed non-parametrically via the Dirichlet process. Fast inference is performed using variational Bayesian (VB) analysis, and example results are presented for several data sets. We also perform inference via Gibbs sampling, to which we compare the VB results. Keywords: classification, incomplete data, expert, Dirichlet process, variational Bayesian, multitask learning
Reference: text
sentIndex sentText sentNum sentScore
1 Unlike in semi-supervised learning (Ando and Zhang, 2005) where missing labels (responses) must be addressed, features (inputs) are partially missing in the aforementioned incomplete-data problems. [sent-21, score-0.527]
2 Since most data analysis procedures (for example, regression and classification) are designed for complete data, and cannot be directly applied to incomplete data, the appropriate handling of missing data is challenging. [sent-22, score-0.335]
3 More importantly, even if the incomplete-data problem is avoided by ignoring data with missing features in the training phase, it remains unavoidable at the test stage, since test data cannot simply be discarded because a portion of the features is missing. [sent-29, score-0.36]
4 For single imputation, the main concern is that the uncertainty of the missing features is ignored by imputing fixed values. [sent-30, score-0.284]
5 Rubin (1976) developed a theoretical framework for incomplete-data problems, in which the widely cited terminology for missingness patterns was first defined. [sent-31, score-0.243]
6 It was proven (Rubin, 1976) that ignoring the missing mechanism is appropriate under the missing at random (MAR) assumption, meaning that the missing mechanism is conditionally independent of the missing features given the observed data. [sent-32, score-1.013]
7 In this paper, we address missing features under the MAR assumption. [sent-35, score-0.284]
8 Previous work in this setting may be placed into two groups, depending on whether the missing data are handled before algorithm learning or within the algorithm. [sent-36, score-0.243]
9 For the former, an extra step is required to estimate p(xm |xo ), conditional distributions of missing values given observed ones, with this step distinct from the main inference algorithm. [sent-37, score-0.243]
10 With a mild Gaussian mixture model (GMM) assumption for the joint distribution of observed and missing data, Williams et al. (2007) managed to analytically integrate out missing values over p(xm|xo) and performed essentially infinite imputations. [sent-42, score-0.280] [sent-43, score-0.243]
12 The other class of methods explicitly addresses missing values during the model-learning procedure. [sent-49, score-0.243]
13 , 2008) show that this procedure is comparable to several single-imputation methods when values are missing at random. [sent-53, score-0.243]
14 The max-margin method of Chechik et al. (2008) handles the missing features inside the procedure of learning a support vector machine (SVM), without constraining the distribution of the missing features to any specific class. [sent-55, score-0.568]
15 The main concern is that this method can only handle missing features in the training data; however, in many applications one cannot control whether missing values occur in the training or test data. [sent-56, score-0.597]
16 A widely employed approach for handling missing values within the algorithm involves maximum likelihood (ML) estimation via expectation maximization (EM) (Dempster et al., 1977). [sent-57, score-0.273]
17 Along with the other latent variables (for example, mixture component indicators), the missing features are also integrated out in the E-step so that the likelihood is maximized with respect to model parameters in the M-step. [sent-61, score-0.321]
18 The main difficulty is that the integral in the E-step is analytically tractable only when an assumption is made on the distribution of the missing features. [sent-62, score-0.243]
19 In Liao et al. (2007) the authors proposed a quadratically gated mixture of experts (QGME), where the GMM is used to form the gating network, statistically partitioning the feature space into quadratic subregions. [sent-67, score-0.263]
20 In this manner model selection is avoided, and the uncertainty in the number of experts is captured in the posterior density function. [sent-77, score-0.191]
21 In Shahbaba and Neal (2009) another form of infinite mixture of experts was proposed, where experts are specified by a multinomial logit (MNL) model (also called softmax) and the gating network is a Gaussian mixture model with independent covariates. [sent-89, score-0.341]
22 The independence assumption of covariates in Shahbaba and Neal (2009) leads to efficient computation but is not appealing for handling missing features. [sent-98, score-0.301]
23 With Gaussian process experts (Meeds and Osindero, 2006), inference for the missing features is not analytical under fast inference algorithms such as variational Bayes (Beal, 2003) and EM, and the computation could be prohibitive for large data sets. [sent-99, score-0.461]
24 The iQGME seeks a balance among ease of inference, computational burden, and the ability to handle missing features. [sent-100, score-0.273]
25 The problem of missing data in classifier design is addressed by extending the QGME (Liao et al., 2007) to a fully Bayesian setting, with the number of local experts inferred automatically via a DP prior. [sent-114, score-0.243] [sent-115, score-0.185]
27 Infinite Quadratically Gated Mixture of Experts. In this section, we first provide a brief review of the quadratically gated mixture of experts (QGME) (Liao et al., 2007). [sent-126, score-0.21]
28 Latent variables ti are introduced as “soft labels” associated with yi , as in probit models (Albert and Chib, 1993), where yi = 1 if ti > 0 and yi = −1 if ti ≤ 0. [sent-131, score-0.195]
29 We begin with the finite quadratically gated mixture of experts (QGME) (Liao et al., 2007). [sent-132, score-0.21]
30 The (P + 1) × K matrix W has columns wh, where each wh contains the weights of a local linear classifier, and xb denotes a feature vector augmented with an intercept, that is, xb = [xT, 1]T. [sent-134, score-0.395]
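The QGME model equations are not reproduced in this excerpt, but the prediction rule described here (Gaussian-mixture gating over local linear probit classifiers applied to the intercept-augmented vector xb) can be sketched as below. This is an illustrative sketch only: the function name, the parameter values, and the use of scipy are assumptions, not the authors' code.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def qgme_style_predict(x, means, covs, mix_weights, W):
    """Mixture-of-linear-experts prediction with GMM gating (illustrative sketch).

    x:           (P,) feature vector
    means, covs: per-expert Gaussian gating parameters (K components)
    mix_weights: (K,) mixing proportions
    W:           (P+1, K) matrix whose columns wh are local linear classifiers
    """
    K = len(mix_weights)
    xb = np.append(x, 1.0)  # intercept-augmented feature vector xb = [x; 1]
    # Gating: posterior responsibility of each expert for this x.
    dens = np.array([mix_weights[h] * multivariate_normal.pdf(x, means[h], covs[h])
                     for h in range(K)])
    resp = dens / dens.sum()
    # Each expert is a probit classifier on xb; average over experts.
    p_pos = np.array([norm.cdf(W[:, h] @ xb) for h in range(K)])
    return float(resp @ p_pos)  # P(y = 1 | x)

# Tiny example with K = 2 experts in P = 2 dimensions.
means = [np.array([-1.0, 0.0]), np.array([1.5, 0.5])]
covs = [np.eye(2), np.eye(2)]
mix_weights = np.array([0.5, 0.5])
W = np.array([[1.0, -0.5],
              [0.5, 1.0],
              [0.0, 0.2]])  # (P+1) x K
print(qgme_style_predict(np.array([0.3, -0.2]), means, covs, mix_weights, W))
```

The Gaussian gating densities are what induce the statistical, quadratic partition of the feature space mentioned above.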
31 Since learning the correct model requires model selection, and since in many applications no such fixed “correct” model may exist, in the work reported here we infer the full posterior for a QGME model with the number of experts driven by the data. [sent-147, score-0.191]
32 Incomplete Data Problem. In the above discussion it was assumed that all components of the feature vectors were available (no missing data). [sent-193, score-0.268]
33 Each xi has its own observed set oi and missing set mi , which may be different for each i. [sent-196, score-0.243]
34 For an arbitrary missing pattern, R can be defined as a missing-data indicator matrix, with Rip = 1 if xip is observed and Rip = 0 if xip is missing. [sent-198, score-0.542]
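For concreteness, the indicator matrix R can be formed directly from a data matrix in which missing entries are coded as NaN; the variable names in this minimal sketch are illustrative, not taken from the paper.

```python
import numpy as np

# Toy data matrix with missing entries coded as NaN.
X = np.array([[1.0, 2.0, np.nan],
              [np.nan, 0.5, 3.0],
              [4.0, np.nan, np.nan]])

# R_ip = 1 if x_ip is observed, 0 if x_ip is missing.
R = (~np.isnan(X)).astype(int)

# Per-instance observed and missing index sets o_i and m_i.
observed_sets = [np.flatnonzero(R[i]) for i in range(X.shape[0])]
missing_sets = [np.flatnonzero(1 - R[i]) for i in range(X.shape[0])]
print(R)
print(observed_sets)
print(missing_sets)
```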
35 We use ξ to denote parameters characterizing the distribution of R, which is usually called the missing mechanism. [sent-199, score-0.243]
36 In the classification context, the joint distribution of the class labels, observed features and the missingness R may be obtained by integrating out the missing features xm: p(y, xo, R|θ, ξ) = ∫ p(y, x|θ) p(R|x, ξ) dxm. [sent-200, score-0.465]
37 If the missing mechanism is conditionally independent of missing values xm given the observed data, that is, p(R|x, ξ) = p(R|xo , ξ), the missing data are defined to be missing at random (MAR) (Rubin, 1976). [sent-202, score-0.972]
38 Consequently, (10) reduces to p(y, xo, R|θ, ξ) = p(R|xo, ξ) ∫ p(y, x|θ) dxm = p(R|xo, ξ) p(y, xo|θ). [sent-203, score-0.204]
39 As long as the prior factorizes as p(θ, ξ) = p(θ)p(ξ), the posterior p(θ, ξ|y, xo, R) ∝ p(y, xo, R|θ, ξ)p(θ, ξ) = p(R|xo, ξ)p(ξ)p(y, xo|θ)p(θ) is also factorizable. [sent-205, score-0.386]
40 As an important special case of MAR, missing completely at random (MCAR) occurs if we can further assume that p(R|x, ξ) = p(R|ξ), which means the distribution of missingness is independent of observed values xo as well. [sent-207, score-0.383]
41 When the missing mechanism depends on the missing values xm, the data are said to be missing not at random (MNAR). [sent-208, score-0.729]
42 As a result, the data are typically assumed to be either MCAR or MAR in the literature, unless significant correlations between the missing values and the distribution of the missingness are suspected. [sent-215, score-0.281]
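To make the MCAR/MAR/MNAR distinction concrete, the sketch below generates a missingness mask for a single feature under each of the three mechanisms; the logistic form and the coefficient values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)  # feature that is always observed
x2 = rng.normal(size=n)  # feature subject to missingness

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# MCAR: missingness independent of all data values.
m_mcar = rng.random(n) < 0.25
# MAR: missingness depends only on the observed feature x1.
m_mar = rng.random(n) < sigmoid(1.5 * x1 - 1.0)
# MNAR: missingness depends on the value that would be missing (x2 itself).
m_mnar = rng.random(n) < sigmoid(1.5 * x2 - 1.0)

x2_mar = np.where(m_mar, np.nan, x2)  # apply the MAR mask, for example
print(m_mcar.mean(), m_mar.mean(), m_mnar.mean())
```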
43 Naturally, the missing features can be regarded as hidden variables to be inferred, and the graphical representation of the iQGME with incomplete data remains the same as in Figure 1, except that the node representing the features is now only partially observed. [sent-219, score-0.434]
44 As elaborated below, the important but mild assumption that the features are distributed as a GMM enables us to analytically infer the variational distributions associated with the missing values within the variational Bayesian inference procedure. [sent-220, score-0.362]
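The reason the GMM assumption keeps the treatment of missing values analytical is that, within any single Gaussian component, the conditional distribution of the missing features given the observed ones is again Gaussian with closed-form mean and covariance. A minimal sketch of that standard conditioning step follows; function and variable names are illustrative.

```python
import numpy as np

def conditional_gaussian(mu, Sigma, x_obs, obs_idx, mis_idx):
    """p(x_m | x_o) for one Gaussian component: returns (mu_cond, Sigma_cond)."""
    mu_o, mu_m = mu[obs_idx], mu[mis_idx]
    S_oo = Sigma[np.ix_(obs_idx, obs_idx)]
    S_mo = Sigma[np.ix_(mis_idx, obs_idx)]
    S_mm = Sigma[np.ix_(mis_idx, mis_idx)]
    K = S_mo @ np.linalg.inv(S_oo)          # regression coefficients
    mu_cond = mu_m + K @ (x_obs - mu_o)
    Sigma_cond = S_mm - K @ S_mo.T
    return mu_cond, Sigma_cond

mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 2.0, 0.3],
                  [0.2, 0.3, 1.5]])
# Suppose features 0 and 2 are observed and feature 1 is missing.
mu_c, Sig_c = conditional_gaussian(mu, Sigma, x_obs=np.array([0.4, -0.8]),
                                   obs_idx=[0, 2], mis_idx=[1])
print(mu_c, Sig_c)
```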
45 As in two-step approaches (Williams et al., 2007), estimating the distribution of the missing values first and learning the classifier in a second step gives the flexibility of selecting the classifier for the second step. [sent-222, score-0.243]
46 However, (12) suggests that the classifier and the data distribution are coupled whenever part of the data is missing and thus has to be integrated out. [sent-223, score-0.243]
47 Therefore, a joint estimation of the missing features and the classifiers (searching in the space of (θ1, θ2)) is more desirable than a two-step process (searching in the space of θ1 for the distribution of the data, and then in the space of θ2 for the classifier). [sent-224, score-0.284]
48 Extension to Multi-Task Learning. Assume we have J data sets, with the jth represented as Dj = {(xji, yji) : i = 1, . . . , nj}. [sent-226, score-0.43]
49 With the task-dependent iQGME defined in (8), we consider all J tasks jointly: (xji, tji) ∼ NP(xji | µji, Λji^{-1}) N(tji | wji^T xbji, 1), with (µji, Λji, wji) iid ∼ Gj, Gj ∼ DP(αG0), G0 ∼ DP(βH). [sent-247, score-2.491]
50 In this form of borrowing information, experts with associated means and precision matrices are shared across tasks as distinct atoms. [sent-248, score-0.192]
51 We explicitly write the stick-breaking representations for Gj and G0, with zji and cjh introduced as the indicators for each data point and each distinct atom of Gj, respectively. [sent-250, score-0.253]
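A truncated stick-breaking sketch of this two-level construction (G0 drawn from DP(βH), and each task-level Gj drawn from DP(αG0), so that all tasks re-use the same global atoms) may help make the sharing of experts across tasks concrete. The truncation levels, scalar base measure, and concentration values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def stick_breaking(concentration, size):
    """Truncated stick-breaking weights for a Dirichlet process."""
    betas = rng.beta(1.0, concentration, size=size)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    return betas * remaining

S, N = 30, 15            # truncation levels for G0 and for each G_j
alpha, beta = 1.0, 1.0
J = 3                    # number of tasks

# Upper level: G0 = sum_s pi_s * delta(theta_s), with theta_s drawn from H (here H = N(0, 1)).
pi = stick_breaking(beta, S)
theta = rng.normal(size=S)   # global atoms shared by all tasks

# Lower level: each G_j places its own stick-breaking weights on atoms drawn from G0,
# so the same global atoms can be selected by several tasks.
for j in range(J):
    v = stick_breaking(alpha, N)                          # task-level weights
    atom_idx = rng.choice(S, size=N, p=pi / pi.sum())     # atoms of G_j drawn from G0
    print(f"task {j}: heaviest atom value {theta[atom_idx[np.argmax(v)]]:.3f}, "
          f"global atoms used: {sorted(set(atom_idx))[:5]}")
```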
52 However, for the incomplete-data situation, the integral over the missing values is tractable only when the two terms in the integrand are both normal. [sent-253, score-0.243]
53 For complete data the integral of missing features is absent, so we take advantage of the full variational posteriors for prediction. [sent-255, score-0.364]
54 With incomplete data, since the missing pattern is unique to each data point, the time and memory complexity grow with the number of data points, that is, O(nNP^3) and O(nNP^2), respectively. [sent-260, score-0.305]
55 The mixture of Gaussian process experts (Meeds and Osindero, 2006) requires O(NP^3 + n^3/N) computations for each MCMC iteration if the N experts equally divide the data, and the memory complexity is O(NP^2 + n^2/N). [sent-261, score-0.313]
56 Although the proposed model requires more computation per MCMC iteration than the latter, it handles missing values naturally, and much more efficiently than the former. [sent-265, score-0.243]
57 Figure 4: Synthetic three-Gaussian single-task data: (a) prior and posterior beliefs on the number of dominant components; (b) prior and posterior beliefs on α. [sent-349, score-0.191]
58 As an example, we plot the broad common prior imposed for local experts in Figure 5(c) and the peaked variational posteriors for three dominant experts in Figure 5(d). [sent-359, score-0.414]
59 Figure 5: Synthetic three-Gaussian single-task data: (a) prediction in feature space using the full posteriors; (b) prediction in feature space using the posterior means; (c) a common broad prior on local experts; (d) variational posteriors on local experts. [sent-397, score-0.21]
60 Since the SVM and RVM algorithms are not directly applicable to problems with missing features, we use two methods to impute the missing values before applying them. [sent-413, score-0.486]
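The two imputation schemes are not spelled out in this excerpt; as plausible stand-ins, the sketch below implements unconditional (per-feature mean) imputation and a conditional-mean imputation under a single fitted Gaussian. These are illustrative assumptions, not necessarily the exact schemes used in the experiments.

```python
import numpy as np

def impute_unconditional_mean(X):
    """Replace each missing entry by the per-feature mean of the observed values."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

def impute_conditional_mean(X):
    """Replace missing entries by their conditional mean under a single fitted Gaussian."""
    mu = np.nanmean(X, axis=0)
    # Crude covariance estimate from complete cases (assumes some complete rows exist).
    complete = X[~np.isnan(X).any(axis=1)]
    Sigma = np.cov(complete, rowvar=False) + 1e-6 * np.eye(X.shape[1])
    X = X.copy()
    for i in range(X.shape[0]):
        mis = np.flatnonzero(np.isnan(X[i]))
        obs = np.flatnonzero(~np.isnan(X[i]))
        if mis.size and obs.size:
            S_oo = Sigma[np.ix_(obs, obs)]
            S_mo = Sigma[np.ix_(mis, obs)]
            X[i, mis] = mu[mis] + S_mo @ np.linalg.solve(S_oo, X[i, obs] - mu[obs])
    return X
```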
61 • Classifiers handling missing values: the finite QGME inferred by expectation-maximization (EM) (Liao et al., 2007). [sent-415, score-0.32]
62 In order to simulate the missing at random setting, we randomly remove a fraction of feature values according to a uniform distribution, and assume the rest are observed. [sent-421, score-0.326]
63 Any instance with all feature values missing is deleted. [sent-422, score-0.268]
64 Note that the random pattern of missing features and the random partition of training and test subsets are independent of each other. [sent-424, score-0.319]
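The masking protocol described above can be reproduced in a few lines; this sketch (names illustrative) removes a uniformly random fraction of feature values, deletes any instance whose features are all missing, and draws the train/test partition with an independent seed so that the missingness pattern and the partition remain independent.

```python
import numpy as np

def mask_uniformly(X, frac_missing, seed=0):
    """Remove a random fraction of feature values, uniformly over all entries."""
    rng = np.random.default_rng(seed)
    X_miss = X.astype(float).copy()
    X_miss[rng.random(X.shape) < frac_missing] = np.nan
    # Delete any instance with all feature values missing.
    keep = ~np.isnan(X_miss).all(axis=1)
    return X_miss[keep], keep

X = np.arange(20, dtype=float).reshape(5, 4)
X_miss, keep = mask_uniformly(X, frac_missing=0.25, seed=1)

# Independent seed for the training/test partition.
rng_split = np.random.default_rng(2)
perm = rng_split.permutation(X_miss.shape[0])
train_idx, test_idx = perm[: len(perm) // 2], perm[len(perm) // 2:]
print(X_miss, train_idx, test_idx)
```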
65 Given a portion of missing values, each curve is a function of the fraction of data used in training. [sent-430, score-0.333]
66 We observe that with different fractions of missing values and training samples, the values for K which achieve the best performance may be different; as K goes to a large number (e. [sent-498, score-0.278]
67 Although our main purpose is classification, one may also be interested in how well the algorithm estimates the missing values while pursuing that purpose. [sent-549, score-0.243]
68 In Figure 10, we show the ratio of correctly estimated missing values for the Ionosphere data set with 25% of feature values missing, where two criteria are considered: the true value lies within one standard deviation (red circles) or two standard deviations (blue squares) of the posterior mean. [sent-550, score-0.321]
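The criterion used in Figure 10 can be computed directly from the posterior over each missing value; a minimal sketch with made-up array values (names illustrative):

```python
import numpy as np

def coverage_ratio(true_vals, post_means, post_stds, k=1.0):
    """Fraction of missing values whose true value lies within k posterior standard deviations."""
    return float(np.mean(np.abs(true_vals - post_means) <= k * post_stds))

true_vals = np.array([0.2, -1.1, 0.7, 2.3])
post_means = np.array([0.0, -1.0, 1.0, 1.0])
post_stds = np.array([0.5, 0.3, 0.4, 0.6])
print(coverage_ratio(true_vals, post_means, post_stds, k=1.0),
      coverage_ratio(true_vals, post_means, post_stds, k=2.0))
```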
69 This figure suggests that the algorithm estimates most of the missing values within a reasonable range of the true values. [sent-551, score-0.243]
70 Here we take the Ionosphere data with 25% features missing as an example to compare these two inference techniques, as shown in Figure 11. [sent-600, score-0.284]
71 (a) Prior and inferred posterior on the number of clusters for one trial given 10% of the samples for training; the number of clusters for the cases when (b) 25% and (c) 50% of the features are missing. [sent-636, score-0.251]
72 Figure 10: Ratio of missing values whose true values lie within one standard deviation (red circles) or two standard deviations (blue squares) of the posterior means, for the Ionosphere data set with 25% of feature values missing. [sent-653, score-0.321]
73 Figure 11: Comparison between VB- and MCMC-inferred iQGME on the Ionosphere data with 25% of features missing, in terms of (a) performance, (b) time consumed for each iteration, and (c) number of iterations. [sent-679, score-0.331]
74 This is a real sensing problem for which missing data occur naturally. [sent-685, score-0.243]
75 Figure 12 shows the missing patterns for this data set. [sent-692, score-0.243]
76 A high-dimensional data set with natural missing values is considered in this subsection. [sent-702, score-0.243]
77 Figure 13: Mean performance over 100 random training/test partitions for each training fraction on the unexploded ordnance data set, in terms of (a) area under the ROC curve, and (b) classification accuracy. [sent-725, score-0.195]
78 This is a data set for which missing data are a natural consequence of the sensing modality. [sent-726, score-0.243]
79 The missing pattern of feature values is shown in Figure 14(a), where black indicates observed (this missingness is a natural consequence of the sensing device). [sent-730, score-0.306]
80 From Figure 14(b), our method provides improvement by handling missing values analytically within the model-inference procedure and by performing dimensionality reduction jointly with learning the local classifiers. [sent-738, score-0.273]
81 As in Xue et al. (2007), we also find that two big clusters correspond to two different vegetation conditions of the landmine fields (tasks 1-10 and tasks 11-19). [sent-821, score-0.21]
82 As in the experiments above on benchmark data sets, we perform ten independent random trials for each setting of missing fraction and training size. [sent-828, score-0.336]
83 To the best of our knowledge, there exists no previous work in the literature on multi-task learning with missing data. [sent-829, score-0.243]
84 We therefore take the multi-task algorithm of Xue et al. (2007) and the Structure-MTL (Ando and Zhang, 2005), with missing values imputed, as baseline algorithms. [sent-831, score-0.243]
85 These observations underscore the advantage of handling missing data in a principled manner and at the same time learning multiple tasks simultaneously. [sent-839, score-0.327]
86 , 25%) of the feature values are missing and training data are rich (e. [sent-843, score-0.303]
87 As the fraction of missing values becomes larger, tasks appear more different from each other in terms of the usage of the higher-level items. [sent-846, score-0.355]
88 Considering that the missing pattern for each task is unique, it is probable that tasks look quite different from each other after a large fraction of feature values are missing. [sent-847, score-0.423]
89 Figure 18: Average AUC over the 19 landmine-detection tasks for the cases when (a) 25%, (b) 50%, and (c) 75% of the features are missing. [sent-879, score-0.207]
90 As shown in Figure 21, the iQGME-MTL performs significantly better than the baselines on this data set for all of the missing fractions and training fractions under consideration. [sent-897, score-0.278]
91 Figure 21: Average AUC over the eight handwritten-letter classification tasks for the cases when (a) none, (b) 25%, (c) 50%, and (d) 75% of the features are missing. [sent-937, score-0.251]
92 The Dirichlet process is employed to allow the model to automatically infer the proper number of experts and their characteristics; in fact, since a Bayesian formulation is employed, a full posterior distribution is manifested over the properties of the local experts, including their number. [sent-942, score-0.227]
93 Secondly, the classifier is endowed with the ability to naturally address missing data, without the need for an imputation step. [sent-943, score-0.273]
94 Finally, the whole framework has been placed within a multi-task learning context, allowing one to jointly infer classifiers for multiple data sets with missing data. [sent-944, score-0.243]
95 q(ti | µti), with µti = Σ_{h=1}^N ρih ⟨wh⟩^T x̂b_{i,h}, where x̂b_{i,h} = [xi^{oi}; m̂_{i,h}^{mi|oi}; 1].
96 q(wh | µ^w_h, Σ^w_h), with ⟨wh⟩ = µ^w_h and ⟨wh wh^T⟩ = Σ^w_h + µ^w_h (µ^w_h)^T.
97 Σ^w_h = [ Σ_{i=1}^n ρih (Ω̂_{i,h} + x̂b_{i,h} (x̂b_{i,h})^T) + diag(⟨λ⟩) ]^{-1}, and µ^w_h = Σ^w_h [ Σ_{i=1}^n ρih x̂b_{i,h} ⟨ti⟩ + diag(⟨λ⟩)φ ].
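As reconstructed above, the update for each expert's weight posterior has the form of a responsibility-weighted Bayesian linear regression on the soft labels. The sketch below implements that computation in plain numpy, assuming the imputed feature moments, responsibilities, and soft-label means have already been obtained; all names are illustrative, and this is only a sketch of the update's structure, not the authors' implementation.

```python
import numpy as np

def update_expert_weights(xb_hat, Omega_hat, resp, t_mean, lam, phi):
    """Variational update for one expert's weight posterior (illustrative sketch).

    xb_hat:    (n, P+1) imputed, intercept-augmented feature vectors
    Omega_hat: (n, P+1, P+1) covariance of the imputed part (zeros where observed)
    resp:      (n,) responsibilities rho_ih for this expert
    t_mean:    (n,) posterior means of the soft labels <t_i>
    lam, phi:  (P+1,) prior precision and prior mean parameters for the weights
    """
    precision = np.diag(lam).astype(float)   # diag(<lambda>) term
    rhs = np.diag(lam) @ phi                 # diag(<lambda>) phi term
    for i in range(xb_hat.shape[0]):
        precision += resp[i] * (Omega_hat[i] + np.outer(xb_hat[i], xb_hat[i]))
        rhs += resp[i] * xb_hat[i] * t_mean[i]
    Sigma_w = np.linalg.inv(precision)       # Sigma_h^w
    mu_w = Sigma_w @ rhs                     # mu_h^w
    return mu_w, Sigma_w
```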
98 q(tji | µtji), with µtji = Σ_{s=1}^S Eσjis ⟨ws⟩^T x̂b_ji, where Eσjis = Σ_{h=1}^N ρjih σ̃jhs.
99 q(xji^{mji} | m̂ji^{mji|oji}, Σji^{mji|oji}), with x̂b_ji = [xji^{oji}; m̂ji^{mji|oji}; 1].
100 x̂ji = [xji^{oji}; m̂ji^{mji|oji}], Ω̂ji = [[0, 0], [0, Σji^{mji|oji}]], and ⟨xji xji^T⟩ = x̂ji x̂ji^T + Ω̂ji.
On the other hand, when the missing data only occur in the training phase and the testing set does not have missing values, or when the missingness is not related to and carries no information about the response variable, the existence of missing values is a nuisance. Its only effect is to obscure the underlying DGP and thus would most likely reduce a tree’s performance. In this case, simulations show probabilistic split to be the dominantly best method. However, we don’t see this dominance later in results based on real data sets. More discussion of this point will follow in Section 4. 3.2.3 M ISSING R ATE E FFECT There are two ways of defining the missing rate: the percentage of predictor values that are missing from the data set (the value-wise missing rate, termed here MR1 ), and the percentage of observations that contain missing values (the case-wise missing rate, termed here MR2 ). If there is only one predictor, as is the case with 2×2 tables, then the two definitions are the same. We have seen earlier in the theoretical analyses that the missing rate has a clear impact on the performance of the missing data methods. In the simulations, there is also evidence of a relationship between relative performance and missing rate, whichever definition is used to define the missing rate. 144 A N I NVESTIGATION OF M ISSING DATA M ETHODS FOR C LASSIFICATION T REES 100 500 2000 10000 100 500 2000 80 100 P T D M C 60 40 20 80 60 40 P D C T M 0 P D M C T Winning Pct of each method P D C T M C P D T M S S S P T C M D P T D C M 0 D C P T M S S S 20 60 40 20 S Winning pct of each method 80 S S 0 Winning pct of each method Winning pct / MGP: MXY / MR1>0.35 100 Winning Pct / MGP: MXY / 0.2 <0.3 100 Winning Pct / MGP: MXY / MR1<0.15 10000 100 500 T D P C M 2000 10000 Mean RelAcc / MGP: MXY / MR1<0.15 Mean RelAcc / MGP: MXY / 0.2
3 0.10364495 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
Author: Alexander Ilin, Tapani Raiko
Abstract: Principal component analysis (PCA) is a classical data analysis technique that finds linear transformations of data that retain the maximal amount of variance. We study a case where some of the data values are missing, and show that this problem has many features which are usually associated with nonlinear models, such as overfitting and bad locally optimal solutions. A probabilistic formulation of PCA provides a good foundation for handling missing values, and we provide formulas for doing that. In case of high dimensional and very sparse data, overfitting becomes a severe problem and traditional algorithms for PCA are very slow. We introduce a novel fast algorithm and extend it to variational Bayesian learning. Different versions of PCA are compared in artificial experiments, demonstrating the effects of regularization and modeling of posterior variance. The scalability of the proposed algorithm is demonstrated by applying it to the Netflix problem. Keywords: principal component analysis, missing values, overfitting, regularization, variational Bayes
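The abstract above describes a probabilistic, regularized treatment of PCA with missing values. As a rough illustration of the underlying alternate-and-refill idea only (not the authors' algorithm), here is a naive sketch in Python with an arbitrary rank and made-up toy data; it has no regularization, so on very sparse data it would show exactly the overfitting the paper warns about.

```python
import numpy as np

def pca_impute(X, rank=2, n_iter=50):
    """Naive EM-style imputation: alternate a low-rank SVD fit and a refill.

    X is a 2-D array with np.nan marking missing entries. This is only a
    sketch of the iterative idea; it has no regularization.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X_filled = np.where(missing, col_means, X)   # start from column means
    for _ in range(n_iter):
        mu = X_filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(X_filled - mu, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank] + mu
        X_filled[missing] = low_rank[missing]    # refill only the missing cells
    return X_filled

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 8))
    X = true + 0.1 * rng.normal(size=true.shape)
    X[rng.random(X.shape) < 0.2] = np.nan        # 20% missing completely at random
    X_hat = pca_impute(X, rank=2)
    rmse = np.sqrt(np.mean((X_hat - true)[np.isnan(X)] ** 2))
    print("RMSE on missing cells:", rmse)
```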
4 0.070103578 14 jmlr-2010-Approximate Riemannian Conjugate Gradient Learning for Fixed-Form Variational Bayes
Author: Antti Honkela, Tapani Raiko, Mikael Kuusela, Matti Tornio, Juha Karhunen
Abstract: Variational Bayesian (VB) methods are typically only applied to models in the conjugate-exponential family using the variational Bayesian expectation maximisation (VB EM) algorithm or one of its variants. In this paper we present an efficient algorithm for applying VB to more general models. The method is based on specifying the functional form of the approximation, such as multivariate Gaussian. The parameters of the approximation are optimised using a conjugate gradient algorithm that utilises the Riemannian geometry of the space of the approximations. This leads to a very efficient algorithm for suitably structured approximations. It is shown empirically that the proposed method is comparable or superior in efficiency to the VB EM in a case where both are applicable. We also apply the algorithm to learning a nonlinear state-space model and a nonlinear factor analysis model for which the VB EM is not applicable. For these models, the proposed algorithm outperforms alternative gradient-based methods by a significant margin. Keywords: variational inference, approximate Riemannian conjugate gradient, fixed-form approximation, Gaussian approximation
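To make the "fixed-form approximation optimized by a gradient method" idea concrete, here is a toy sketch that fits a Gaussian q(theta) to a one-parameter conjugate model by plain (Euclidean) conjugate gradient; it deliberately omits the Riemannian geometry that is the paper's contribution, and the model, data and priors below are made up for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Toy model: theta ~ N(0, 1), x_i | theta ~ N(theta, sigma2) with sigma2 known.
# Fixed-form approximation q(theta) = N(m, s^2); optimise (m, log s) by CG.
rng = np.random.default_rng(1)
sigma2 = 0.5 ** 2
x = rng.normal(1.3, np.sqrt(sigma2), size=20)

def neg_elbo(params):
    m, log_s = params
    s2 = np.exp(2.0 * log_s)
    exp_loglik = np.sum(-0.5 * np.log(2 * np.pi * sigma2)
                        - ((x - m) ** 2 + s2) / (2 * sigma2))
    exp_logprior = -0.5 * np.log(2 * np.pi) - (m ** 2 + s2) / 2.0
    entropy = 0.5 * np.log(2 * np.pi * np.e * s2)
    return -(exp_loglik + exp_logprior + entropy)

res = minimize(neg_elbo, x0=np.array([0.0, 0.0]), method="CG")
m_hat, s_hat = res.x[0], np.exp(res.x[1])

# For this conjugate model the exact posterior is Gaussian, so VB should recover it.
post_prec = 1.0 + len(x) / sigma2
print("VB:    mean %.4f  sd %.4f" % (m_hat, s_hat))
print("Exact: mean %.4f  sd %.4f" % (x.sum() / sigma2 / post_prec, post_prec ** -0.5))
```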
5 0.063324913 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines
Author: Pedro A. Forero, Alfonso Cano, Georgios B. Giannakis
Abstract: This paper develops algorithms to train support vector machines when training data are distributed across different nodes, and their communication to a centralized processing unit is prohibited due to, for example, communication complexity, scalability, or privacy reasons. To accomplish this goal, the centralized linear SVM problem is cast as a set of decentralized convex optimization subproblems (one per node) with consensus constraints on the wanted classifier parameters. Using the alternating direction method of multipliers, fully distributed training algorithms are obtained without exchanging training data among nodes. Different from existing incremental approaches, the overhead associated with inter-node communications is fixed and solely dependent on the network topology rather than the size of the training sets available per node. Important generalizations to train nonlinear SVMs in a distributed fashion are also developed along with sequential variants capable of online processing. Simulated tests illustrate the performance of the novel algorithms.1 Keywords: support vector machine, distributed optimization, distributed data mining, distributed learning, sensor networks
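A minimal sketch of the consensus formulation, under simplifying assumptions: the toy below keeps each data shard on its own "node", uses a smooth squared hinge so the per-node subproblem can be handed to a generic optimizer, and runs the updates serially. The paper's algorithms use the standard hinge loss, dedicated updates, kernel extensions and real inter-node message passing; none of that is reproduced here.

```python
import numpy as np
from scipy.optimize import minimize

# Toy global-variable-consensus ADMM for a linear SVM-style classifier.
# Each "node" holds a private shard of the data; only (w_j, u_j) statistics
# are exchanged, never the raw shards.
rng = np.random.default_rng(2)
n, d, J = 300, 3, 5                       # samples, features (incl. bias column), nodes
X = np.hstack([rng.normal(size=(n, 2)), np.ones((n, 1))])
w_true = np.array([1.5, -2.0, 0.3])
y = np.sign(X @ w_true + 0.3 * rng.normal(size=n))
shards = np.array_split(np.arange(n), J)

C, rho = 1.0, 1.0

def local_obj(w, idx, z, u):
    margins = y[idx] * (X[idx] @ w)
    loss = np.sum(np.maximum(0.0, 1.0 - margins) ** 2)   # smooth squared hinge
    return C * loss + 0.5 * rho * np.sum((w - z + u) ** 2)

z = np.zeros(d)
W = np.zeros((J, d))
U = np.zeros((J, d))
for it in range(30):
    for j, idx in enumerate(shards):                      # parallel in principle
        W[j] = minimize(local_obj, W[j], args=(idx, z, U[j]), method="BFGS").x
    z = rho * (W + U).sum(axis=0) / (1.0 + J * rho)       # keeps the ||z||^2/2 penalty
    U += W - z                                            # scaled dual updates

acc = np.mean(np.sign(X @ z) == y)
print("consensus weights:", np.round(z, 3), " training accuracy: %.3f" % acc)
```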
6 0.058114819 38 jmlr-2010-Expectation Truncation and the Benefits of Preselection In Training Generative Models
7 0.053774919 61 jmlr-2010-Learning From Crowds
8 0.048422825 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
9 0.047061592 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
10 0.045893718 29 jmlr-2010-Covariance in Unsupervised Learning of Probabilistic Grammars
11 0.044226944 22 jmlr-2010-Classification Using Geometric Level Sets
12 0.044066731 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models
13 0.042995628 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers
14 0.041617408 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks
15 0.041307967 40 jmlr-2010-Fast and Scalable Local Kernel Machines
16 0.041012179 90 jmlr-2010-Permutation Tests for Studying Classifier Performance
17 0.040399287 108 jmlr-2010-Stochastic Complexity and Generalization Error of a Restricted Boltzmann Machine in Bayesian Estimation
18 0.038143869 53 jmlr-2010-Inducing Tree-Substitution Grammars
19 0.03782244 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data
20 0.037174698 41 jmlr-2010-Gaussian Processes for Machine Learning (GPML) Toolbox
topicId topicWeight
[(0, -0.176), (1, 0.03), (2, -0.098), (3, 0.06), (4, 0.025), (5, 0.064), (6, 0.101), (7, 0.001), (8, -0.1), (9, -0.09), (10, -0.014), (11, 0.303), (12, -0.056), (13, 0.013), (14, -0.157), (15, 0.236), (16, 0.068), (17, -0.217), (18, 0.089), (19, 0.187), (20, -0.002), (21, -0.038), (22, 0.013), (23, 0.027), (24, -0.037), (25, 0.038), (26, -0.057), (27, -0.03), (28, 0.145), (29, -0.001), (30, 0.032), (31, 0.186), (32, -0.006), (33, -0.065), (34, -0.006), (35, -0.011), (36, 0.079), (37, -0.018), (38, 0.125), (39, -0.002), (40, -0.027), (41, -0.018), (42, -0.01), (43, -0.069), (44, 0.076), (45, -0.047), (46, -0.031), (47, -0.063), (48, 0.058), (49, 0.074)]
simIndex simValue paperId paperTitle
same-paper 1 0.93209147 23 jmlr-2010-Classification with Incomplete Data Using Dirichlet Process Priors
Author: Chunping Wang, Xuejun Liao, Lawrence Carin, David B. Dunson
Abstract: A non-parametric hierarchical Bayesian framework is developed for designing a classifier, based on a mixture of simple (linear) classifiers. Each simple classifier is termed a local “expert”, and the number of experts and their construction are manifested via a Dirichlet process formulation. The simple form of the “experts” allows analytical handling of incomplete data. The model is extended to allow simultaneous design of classifiers on multiple data sets, termed multi-task learning, with this also performed non-parametrically via the Dirichlet process. Fast inference is performed using variational Bayesian (VB) analysis, and example results are presented for several data sets. We also perform inference via Gibbs sampling, to which we compare the VB results. Keywords: classification, incomplete data, expert, Dirichlet process, variational Bayesian, multitask learning
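As a toy illustration of how a Dirichlet process prior induces a random, effectively finite number of "experts", the stick-breaking construction commonly used to formulate DP mixtures can be sampled directly. This is only a prior draw, not the paper's variational or Gibbs inference, and the truncation level and concentration values below are arbitrary.

```python
import numpy as np

def stick_breaking(alpha, truncation, rng):
    """Draw truncated stick-breaking weights pi_k for a DP(alpha) prior."""
    v = rng.beta(1.0, alpha, size=truncation)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])
    return v * remaining

rng = np.random.default_rng(3)
for alpha in (0.5, 2.0, 10.0):
    pi = stick_breaking(alpha, truncation=50, rng=rng)
    # Components needed to cover 99% of the mass: small alpha -> few experts.
    k99 = int(np.searchsorted(np.cumsum(pi), 0.99)) + 1
    print("alpha=%4.1f  components for 99%% of the mass: %d" % (alpha, k99))
```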
2 0.77594471 11 jmlr-2010-An Investigation of Missing Data Methods for Classification Trees Applied to Binary Response Data
Author: Yufeng Ding, Jeffrey S. Simonoff
Abstract: There are many different methods used by classification tree algorithms when missing data occur in the predictors, but few studies have been done comparing their appropriateness and performance. This paper provides both analytic and Monte Carlo evidence regarding the effectiveness of six popular missing data methods for classification trees applied to binary response data. We show that in the context of classification trees, the relationship between the missingness and the dependent variable, as well as the existence or non-existence of missing values in the testing data, are the most helpful criteria to distinguish different missing data methods. In particular, separate class is clearly the best method to use when the testing set has missing values and the missingness is related to the response variable. A real data set related to modeling bankruptcy of a firm is then analyzed. The paper concludes with discussion of adaptation of these results to logistic regression, and other potential generalizations. Keywords: classification tree, missing data, separate class, RPART, C4.5, CART 1. Classification Trees and the Problem of Missing Data Classification trees are a supervised learning method appropriate for data where the response variable is categorical. The simple methodology behind classification trees is to recursively split data based upon the predictors that best distinguish the response variable classes. There are, of course, many subtleties, such as the choice of criterion function used to pick the best split variable, stopping rules, pruning rules, and so on. In this study, we mostly rely on the built-in features of the tree algorithms C 4.5 and RPART to implement tree methods. Details about classification trees can be found in various references, for example, Breiman, Friedman, Olshen, and Stone (1998) and Quinlan (1993). Classification trees are computationally efficient, can handle mixed variables (continuous and discrete) easily and the rules generated by them are relatively easy to interpret and understand. Classification trees are highly flexible, and naturally uncover interaction effects among the independent variables. Classification trees are also popular because they can easily be incorporated into learning ensembles or larger learning systems as base learners. c 2010 Yufeng Ding and Jeffrey S. Simonoff. D ING AND S IMONOFF Like most statistics or machine learning methods, “base form” classification trees are designed assuming that data are complete. That is, all of the values in the data matrix, with the rows being the observations (instances) and the columns being the variables (attributes), are observed. However, missing data (meaning that some of the values in the data matrix are not observed) is a very common problem, and for this reason classification trees have to, and do, have ways of dealing with missing data in the predictors. (In supervised learning, an observation with missing response value has no information about the underlying relationship, and must be omitted. There is, however, research in the field of semi-supervised learning methods that tries to handle the situation where the response value is missing, for example, Wang and Shen 2007.) Although there are many different ways of dealing with missing data in classification trees, there are relatively few studies in the literature about the appropriateness and performance of these missing data methods. 
Moreover, most of these studies limited their coverage to the simplest missing data scenario, namely, missing completely at random (MCAR), while our study shows that the missing data generating process is one of the two crucial criteria in determining the best missing data method. The other crucial criterion is whether or not the testing set is complete. The following two subsections describe in more detail these two criteria.

1.1 Different Types of Missing Data Generating Process

Data originate according to the data generating process (DGP) under which the data matrix is “generated” according to the probabilistic relationships between the variables. We can think of the missingness itself as a random variable, realized as the matrix of the missingness indicator Im. Im is generated according to the missingness generating process (MGP), which governs the relationship between Im and the variables in the data matrix. Im has the same dimension as the original data matrix, with each entry equal to 0 if the corresponding original data value is observed and 1 if the corresponding original data value is not observed (missing). Note that an Im value not only can be related to its corresponding original data value, but can also be related to other variables of the same observation. Depending on the relationship between Im and the original data, Rubin (1976) and Little and Rubin (2002) categorize the missingness into three different types. If Im is dependent upon the missing values (the unobserved original data values), then the missingness pattern is called “not missing at random” (NMAR). Otherwise, the missingness pattern is called “missing at random” (MAR). As a special case of MAR, when the missingness is also not dependent on the observed values (that is, is independent of all data values), the missingness pattern is called “missing completely at random” (MCAR). The definition of MCAR is rather restrictive, which makes MCAR unlikely in reality. For example, in the bankruptcy data discussed later in the paper, there is evidence that after the Enron scandal in 2001, when both government and the public became more wary about financial reporting misconduct, missingness of values in financial statement data was related to the well-being of the company, and thus to other values in the data. This makes intuitive sense because when scrutinized, a company is more likely to have trouble reporting its financial data if there were problems. Thus, focusing on the MCAR case is a major limitation that will be avoided in this paper. In fact, this paper shows that the categorization of MCAR, MAR and NMAR itself is not appropriate for the missing data problem in classification trees, as well as in another supervised learning context (at least with respect to prediction), although it has been shown to be helpful with likelihood-based or Bayesian analysis.

         Missingness is related to:
  #   Missing values   Observed predictors   Response variable   LR     Three-Letter
  1   No               No                    No                  MCAR   −−−
  2   No               Yes                   No                  MAR    −X−
  3   Yes              No                    No                  NMAR   M−−
  4   Yes              Yes                   No                  NMAR   MX−
  5   No               No                    Yes                 MAR    −−Y
  6   No               Yes                   Yes                 MAR    −XY
  7   Yes              No                    Yes                 NMAR   M−Y
  8   Yes              Yes                   Yes                 NMAR   MXY

Table 1: Eight missingness patterns investigated in this study and their correspondence to the categorization MCAR, MAR and NMAR defined by Rubin (1976) and Little and Rubin (2002) (the LR column). The column Three-Letter shows the notation that is used in this paper.
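A small sketch of how missingness indicators with the kinds of dependence listed in Table 1 could be generated with logistic models; the variable names, coefficients and baseline rate below are arbitrary choices for illustration, not the settings used in the paper's simulations.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10_000
x1 = rng.normal(size=n)                       # predictor that will get missing values
x2 = rng.normal(size=n)                       # always-observed predictor
y = (rng.random(n) < 1 / (1 + np.exp(-(x1 + x2)))).astype(int)

def missing_prob(pattern):
    """Logistic missingness model for x1; letters follow the M/X/Y notation."""
    eta = -1.0                                 # arbitrary baseline missing rate
    if "M" in pattern:                         # depends on the value that goes missing
        eta = eta + 1.0 * x1
    if "X" in pattern:                         # depends on an observed predictor
        eta = eta + 1.0 * x2
    if "Y" in pattern:                         # depends on the response
        eta = eta + 1.0 * y
    return 1 / (1 + np.exp(-eta))

for pattern in ["---", "-X-", "M--", "--Y", "MXY"]:
    m = rng.random(n) < missing_prob(pattern)
    print("%s  missing rate: %.2f   P(y=1 | missing): %.2f   P(y=1 | observed): %.2f"
          % (pattern, m.mean(), y[m].mean(), y[~m].mean()))
```

Only the patterns that include Y produce a gap between P(y=1 | missing) and P(y=1 | observed), which is the dependence the paper identifies as decisive for choosing a missing data method.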
In this paper, we investigate eight different missingness patterns, depending on the relationship between the missingness and three types of variables, the observed predictors, the unobserved predictors (the missing values) and the response variable. The relationship is conditional upon other factors, for example, missingness is not dependent upon the missing values means that the missingness is conditionally independent of the missing values given the observed predictors and/or the response variable. Table 1 shows their correspondence with the MCAR/MAR/NMAR categorization as well as the three-letter notation we use in this paper. The three letters indicate if the missingness is conditionally dependent on the missing values (M), on other predictors (X) and on the response variable (Y), respectively. As will be shown, the dependence of the missingness on the response variable (the letter Y) is the one that affects the choice of best missingness data method. Later in the paper, some derived notations are also used. For example, ∗X∗ means the union of −X−, −XY, MX− and MXY, that is, the missingness is dependent upon the observed predictors, and it may or may not be related to the missing values and/or the response variable. 1.2 Scenarios Where the Testing Data May or May Not Be Complete There are essentially two stages of applying classification trees, the training phase where the historical data (training set) are used to construct the tree, and the testing phase where the tree is put into use and applied to testing data. Similar to most other studies, this study deals with the scenario where missing data occur in the training set, but the testing set may or may not have missing values. One basic assumption is, of course, that the DGP (as well as MGP if the testing set also contains missing values) is the same for both the training set and the testing set. While it would probably typically be the case that the testing data would also have missing values (generated by the same process that generated them in the training set), it should be noted that in certain circumstances a testing set without missing values could be expected. For example, consider a problem involving prediction of bankruptcy from various financial ratios. If the training set comes from a publicly available database, there could be missing values corresponding to information that was not supplied by various companies. If the goal is to use these publicly available data to try 133 D ING AND S IMONOFF to predict bankruptcy from ratios from one’s own company, it would be expected that all of the necessary information for prediction would be available, and thus the test set would be complete. This study shows that when the missingness is dependent upon the response variable and the test set has missing values, separate class is the best missing data method to use. In other situations, the choice is not as clear, but some insights on effective choices are provided. The rest of paper provides detailed theoretical and empirical analysis and is organized as follows. Section 2 gives a brief introduction to the previous research on this topic. This is followed by discussion of the design of this study and findings in Section 3. The generality of the results are then tested on real data sets in Section 4. A brief extension of the results to logistic regression is presented in Section 5. We conclude with discussion of these results and future work in Section 6. 2. 
Previous Research There have been several studies of missing data and classification trees in the literature. Liu, White, Thompson, and Bramer (1997) gave a general description of the problem, but did not discuss solutions. Saar-Tsechansky and Provost (2007) discussed various missing data methods in classification trees and proposed a cost-sensitive approach to the missing data problem for the scenario when missing data occur only at the testing phase, which is different from the problem studied here (where missing values occur in the training phase). Kim and Yates (2003) conducted a simulation study of seven popular missing value methods but did not find any dominant method. Feelders (1999) compared the performance of surrogate split and imputation and found the imputation methods to work better. (These methods, and the methods described below, are described more fully in the next section.) Batista and Monard (2003) compared four different missing data methods, and found that 10 nearest neighbor imputation outperformed other methods in most cases. In the context of cost sensitive classification trees, Zhang, Qin, Ling, and Sheng (2005) studied four different missing data methods based on their performances on five data sets with artificially generated random missing values. They concluded that the internal node method (the decision rules for the observations with the next split variable missing will be made at the (internal) node) is better than the other three methods examined. Fujikawa and Ho (2002) compared several imputation methods based on preliminary clustering algorithms to probabilistic split on simulations based on several real data sets and found comparable performance. A weakness of all of the above studies is that they focused only on the restrictive MCAR situation. Other studies examined both MAR and NMAR missingness. Kalousis and Hilario (2000) used simulations from real data sets to examine the properties of seven algorithms: two rule inducers, a nearest neighbor method, two decision tree inducers, a naive Bayes inducer, and linear discriminant analysis. They found that the naive Bayes method was by far most resilient to missing data, in the sense that its properties changed the least when the missing rate was increased (note that this resilience is related to, but not the same as, its overall predictive performance). They also found that the deleterious effects of missing data are more serious if a given amount of missing values are spread over several variables, rather than concentrated in a few. Twala (2009) used computer simulations based on real data sets to compare the properties of different missing value methods, including using complete cases, single imputation of missing values, likelihood-based multiple imputation (where missing values are imputed several times, and the results of fitting trees to the different generated data sets are combined), probabilistic split, and surrogate split. He studied MAR, MCAR, and NMAR missingness generating processes, although 134 A N I NVESTIGATION OF M ISSING DATA M ETHODS FOR C LASSIFICATION T REES dependence of missingness on the response variable was not examined. Multiple imputation was found to be most effective, with probabilistic split also performing reasonably well, although little difference was found between methods when the proportion of missing values was low. 
As would be expected, MCAR missingness caused the least problems for methods, while NMAR missingness caused the most, and as was also found by Kalousis and Hilario (2000), missingness spread over several predictors is more serious than if it is concentrated in only one. Twala, Jones, and Hand (2008) proposed a method closely related to creating a separate class for missing values, and found that its performance was competitive with that of likelihood-based multiple imputation. The study described in the next section extends these previous studies in several ways. First, theoretical analyses are provided for simple situations that help explain observed empirical performance. We then extend these analyses to more complex situations and data sets (including large ones) using Monte Carlo simulations based on generated and real data sets. The importance of whether missing is dependent on the response variable, which has been ignored in previous studies on classification trees yet turns out to be of crucial importance, is a fundamental aspect of these results. The generality of the conclusions is finally tested using real data sets and application to logistic regression. 3. The Effectiveness of Missing Data Methods The recursive nature of classification trees makes them almost impossible to analyze analytically in the general case beyond 2×2 tables (where there is only one binary predictor and a binary response variable). On the other hand, trees built on 2×2 tables, which can be thought of as “stumps” with a binary split, can be considered as degenerate classification trees, with a classification tree being built (recursively) as a hierarchy of these degenerate trees. Therefore, analyzing 2×2 tables can result in important insights for more general cases. We then build on the 2×2 analyses using Monte Carlo simulation, where factors that might have impact on performance are incrementally added, in order to see the effect of each factor. The factors include variation in both the data generating process (DGP) and the missing data generating process (MGP), the number and type of predictors in the data, the number of predictors that contain missing values, and the number of observations with missing data. This study examines six different missing data methods: probabilistic split, complete case method, grand mode/mean imputation, separate class, surrogate split, and complete variable method. Probabilistic split is the default method of C 4.5 (Quinlan, 1993). In the training phase, observations with values observed on the split variable are split first. The ones with missing values are then put into each of the child nodes with a weight given as the proportion of non-missing instances in the child. In the testing phase, an observation with a missing value on a split variable will be associated with all of the children using probabilities, which are the weights recorded in the training phase. The complete case method deletes all observations that contain missing values in any of the predictors in the training phase. If the testing set also contains missing values, the complete case method is not applicable and thus some other method has to be used. In the simulations, we use C 4.5 to realize the complete case method. In the training phase, we manually delete all of the observations with missing values and then run C 4.5 on the pre-processed remaining complete data. In the testing phase, the default missing data method, probabilistic split, is used. 
Grand mode imputation imputes the missing value with the grand mode of that variable if it is categorical. Grand mean is used if the variable is continuous. The separate class method treats the missing values as a new class 135 D ING AND S IMONOFF (category) of the predictor. This is trivial to apply when the original variable is categorical, where we can create a new category called “missing”. To apply the separate class method to a numerical variable, we give all of the missing values a single extremely large value that is obviously outside of the original data range. This creates the needed separation between the nonmissing values and the missing values, implying that any split that involves the variable with missing values will put all of the missing observations into the same branch of the tree. Surrogate split is the default method of CART (realized using RPART in this study; Breiman et al. 1998 and Therneau and Atkinson 1997). It finds and uses a surrogate variable (or several surrogates in order) within a node if the variable for the next split contains missing values. In the testing phase, if a split variable contains missing values, the surrogate variables in the training phase are used instead. The complete variable method simply deletes all variables that contain missing values. Before we start presenting results, we define a performance measure that is appropriate for measuring the impact of missing data. Accuracy, calculated as the percentage of correctly classified observations, is often used to measure the performance of classification trees. Since it can be affected by both the data structure (some data are intrinsically easier to classify than others) and by the missing data, this is not necessarily a good summary of the impact of missing data. In this study, we define a measure called relative accuracy (RelAcc), calculated as RelAcc = Accuracy with missing data . Accuracy with original full data This can be thought of as a standardized accuracy, as RelAcc measures the accuracy achievable with missing values relative to that achievable with the original full data. 3.1 Analytical Results In the following consistency theorems, the data are assumed to reflect the DGP exactly, and therefore the training set and the testing set are exactly the same. Several of the theorems are for 2×2 tables, and in those cases stopping and pruning rules are not relevant, since the only question is whether or not the one possible split is made. The proofs are thus dependent on the underlying parameters of the DGP and MGP, rather than on data randomly generated from them. It is important to recognize that these results are only designed to be illustrative of the results found in the much more realistic simulation analyses to follow. Proofs of all of the results are given in the appendix. Before presenting the theorems, we define some terms to avoid possible confusion. First, a partition of the data refers to the grouping of the observations defined by the classification tree’s splitting rules. Note that it is possible for two different trees on the same data set to define the same partition. For example, suppose that there are only two binary explanatory variables, X1 and X2 , and one tree splits on X1 then X2 while another tree splits on X2 then X1 . In this case, these two trees have different structures, but they can lead to the same partition of the data. 
Secondly, the set of rules defined by a classification tree consists of the rules defined by the tree leaves on each of the groups (the partition) of the data. 3.1.1 W HEN THE T EST S ET IS F ULLY O BSERVED W ITH N O M ISSING VALUES We start with Theorems 1 to 3 that apply to the complete case method. Theorems 4 and 5 apply to probabilistic split and mode imputation, respectively. Proofs of the theorems can be found in the appendix. 136 A N I NVESTIGATION OF M ISSING DATA M ETHODS FOR C LASSIFICATION T REES Theorem 1 Complete Case Method: If the MGP is conditionally independent of Y given X, then the tree built on the data containing missing values using the complete case method gives the same set of rules as the tree built on the original full data set. Theorem 2 Complete Case Method: If the partition of the data defined by the tree built on the incomplete data is not changed from the one defined by the tree built on the original full data, the loss in accuracy when the testing set is complete is bounded above by PM , where PM is the missing rate, defined as the percentage of observations that contain missing values. Theorem 3 Complete Case Method: If the partition of the data defined by the tree built on the incomplete data is not changed from the one defined by the tree built on the original full data, the relative accuracy when the testing set is complete is bounded below by RelAccmin = 1 − PM , 1 + PM where PM is the missing rate. Notice that the tree structure itself could change as long as it gives the same final partition of the data. There are similar results in regression analyses as in Theorem 1. In regression analyses, when the missingness is independent of the response variable, by using only the complete observations, the parameter estimators are all unbiased (Allison, 2001). This implies that in theory, when the missingness is independent of the response variable, using complete cases only is not a bad approach on average. However, in practice, as will be seen later, deleting observations with missing values can cause severe loss in information, and thus has generally poor performance. Theorem 4 Probabilistic Split: In a 2×2 data table, if the MGP is independent of either Y or X, given the other variable, then the following results hold for probabilistic split. 1. If X is not informative in terms of classification, that is, the majority classes of Y for different X values are the same, then probabilistic split will give the same rule as the one that would be obtained from the original full data; 2. If probabilistic split shows that X is informative in terms of classification, that is, the majority classes of Y for different X values are different, then it finds the same rule as the one that would be obtained from the original full data; 3. The absolute accuracy when the testing set is complete is bounded below by 0.5. Since the original full data accuracy is at most 1, the relative accuracy is also bounded below by 0.5. Theorem 5 Mode Imputation: If the MGP is independent of Y , given X, then the same results hold for mode imputation as for probabilistic split under the conditions of Theorem 4. Theorems 1, 2 and 3 (for the complete case method) are true for general data sets. Theorems 4 and 5 are for 2×2 tables only but they imply that probabilistic split and mode imputation have advantages over the complete case method, which can have very poor performance (as will be shown in Figure 1). 
137 D ING AND S IMONOFF Moreover, with 2×2 tables, the complete variable method will always have a higher than 0.5 accuracy since by ignoring the only predictor, we will always classify all of the data to the overall majority class and achieve at least 0.5 accuracy, and thus at least 0.5 relative accuracy. Together with Theorems 4 and 5, as well as the evidence to be shown in Figure 1, this is an indication that classification trees tend not to be hurt much by missing values, since trees built on 2 × 2 tables can be considered as degenerate classification trees and more complex trees are composites of these degenerate trees. The performance of a classification tree is the average (weighted by the number of observations at each leaf) over the degenerate trees at the leaf level, and, as will be seen later in the simulations, can often be quite good. Surrogate split is not applicable to 2×2 tables because there are no other predictors. For 2×2 table problems with a complete testing set, separate class is essentially the same as the complete case method, because as long as the data are split according to the predictor (and it is very likely that this will be so), the separate class method builds separate rules for the observations with missing values; when the testing set is complete, the rules that are used in the testing phase are exactly the ones built on the complete observations. When there is more than one predictor, however, the creation of the “separate class” will save the observations with missing values from being deleted and affect the tree building process. It will very likely lead to a change in the tree structure. This, as will be seen, tends to have a favorable impact on the performance accuracy. Figure 1 illustrates the lower bound calculated in Theorem 3. The illustration is achieved by Monte Carlo simulation of 2×2 tables. A 2×2 table with missing values has only eight cells, that is, eight different value combinations of the binary variables X, Y and M, where M is the missingness indicator such that M = 0 if X is observed and M = 1 if X is missing. There is one constraint, that the sum of the eight cell probabilities must equal one. Therefore, this table is determined by seven parameters. In the simulation, for each 2 × 2 table, the following seven parameters (probabilities) are randomly and independently generated from a uniform distribution between (0, 1): (1)P(X = 1), (2)P(Y = 1|X = 0), (3)P(Y = 1|X = 1), (4)P(M = 1|X = 0,Y = 0), (5)P(M = 1|X = 0,Y = 1), (6)P(M = 1|X = 1,Y = 0) and (7)P(M = 1|X = 1,Y = 1). Here we assume the data tables reflect the true underlying DGP and MGP without random variation, and thus the expected performance of the classification trees can be derived using the parameters. In this simulation, sets of the seven parameters are generated (but no data sets are generated using these parameters) repeatedly, and the relative accuracy of each missing data method on each parameter set is determined. One million sets of parameters are generated for each missingness pattern. In Figure 1, the plot on the left is a scatter plot of relative accuracy versus missing rate for each Monte Carlo replication for the complete case method when the MGP depends on the response variable. The lower bound is clearly shown. We can see that when the missing rate is high, the lower bound can reduce to almost zero (implying that not only relative accuracy, but accuracy itself, can approach zero). This perhaps somewhat counterintuitive result can occur in the following way. 
Imagine the extreme case where almost all cases are positive and (virtually) all of the positive cases have missing predictor value at the training phase; in this situation the resultant rule will be to classify everything as negative. When this rule is applied to a complete testing set with almost all positive cases, the accuracy will be almost zero. The graph on the right is the quantile version of the scatter plot on the left. The lines shown in the quantile plot are the theoretical lower bound, the 10th, 20th, 30th, 40th and 50th percentile lines from the lowest to the highest. Higher percentile lines are the same as the 50th percentile (median) line, which is already the horizontal line at RelAcc = 1. The percentile lines are constructed by connecting the corresponding percentiles in a moving window of data from the left to the right. Due to space limitations, we do not show quantile plots of other missing data methods and/or under different scenarios, but in all of the other plots, the quantile lines are all higher (that is, the quantile plot in Figure 1 shows the worst case scenario). The plots show that the missing data problem, when the missing rate is not too high, may not be as serious as we might have thought. For example, when 40% of the observations contain missing data, 80% of the time the expected relative accuracy is higher than 90%, and 90% of the time the expected relative accuracy is higher than 80%.

Figure 1: Scatter plot and the corresponding quantile plot of the complete testing set RelAcc vs. missing rate of the complete case method when the MGP is dependent on the response variable. Recall that “∗∗Y” means the MGP is conditionally dependent on the response variable but no restriction on the relationship between the MGP and other variables, missing or observed, is assumed. Each point in the scatter plot represents the result on one of the simulated data tables.

3.1.2 When the Test Set Has Missing Values

Theorem 6 (Separate Class): In 2×2 data tables, if missing values occur in both the training set and the testing set, then the separate class method achieves the best possible performance.

In the Monte Carlo simulation of the 2×2 tables, the head-to-head comparison between the separate class method and other missing data methods confirmed the uniform dominance of the separate class when the test set also contains missing values, regardless of whether the MGP is dependent on the response variable or not. However, as shown in Figure 2, when the MGP is independent of the response variable, separate class never performs better than the performance on the original full data, indicated by relative accuracies less than one. This means that separate class is not gaining from the missingness. On the other hand, when the MGP is dependent on the response variable, a fairly large percentage of the time the relative accuracy of the separate class method is larger than one (the quantiles shown are from the 10th to the 90th percentile in increments of 10 percent). This means that trees based on the separate class method can improve on predictive performance compared to the situation where there are no missing data. Our simulations show that other methods can also gain from the missingness when the MGP is dependent on the response variable, but not as frequently as the separate class method, and the gains are in general not as large.
We follow up on this behavior in more detail in the next section, but the simple explanation is that since missingness depends on the response variable, the tree algorithm can use the presence of missing data in an observation to improve prediction of the response for that observation. Duda, Hart, and Stork (2001) and Hand (1997) briefly mentioned this possibility in the classification context, but did not give any supporting evidence. Theorem 6 makes a fairly strong statement in the simple situation, and it will be seen to be strongly indicative of the results in more general cases.

Figure 2: Scatter plot of the separate class method with incomplete testing set. Each point in the scatter plot represents the result on one of the simulated data tables.

3.2 Monte Carlo Simulations of General Data Sets

In this section extensions of the simulations in the last section are summarized.

3.2.1 An Overview of the Simulation

The following simulations are carried out.

1. 2×2 tables, missing values occur in the only predictor.
2. Up to seven binary predictors, missing values occur in only one predictor.
3. Eight binary predictors, missing values occur in two of them.
4. Twelve binary predictors, missing values occur in six of them.
5. Eight continuous predictors, missing values occur in two of them.
6. Twelve continuous predictors, missing values occur in six of them.

Two different scenarios of each of the last four simulations listed above were performed. In the first scenario, the six complete predictors are all independent of the missing ones, while in the second scenario three of the six complete predictors are related to the missing ones. Therefore, ten simulations were done in total. In each of the simulations, 5000 sets of DGPs are simulated in order to cover a wide range of different-structured data sets so that a generalizable inference from the simulation is possible. For each DGP, eight different MGPs are simulated to cover different types of missingness patterns. For each data set, the variables are generated sequentially in the order of the predictors, the response and the missingness. The probabilities associated with the binary response variable and the binary missingness variable are generated using conditional logit functions. The predictors may or may not be correlated with each other. Details about the implementation of the simulations can be found in Ding and Simonoff (2008). For each set of DGP/MGP, several different sample sizes are simulated to see any possible learning curve effect, since it was shown by Perlich, Provost, and Simonoff (2003) that sample size is an important factor in the effectiveness of classification trees.

Figure 3: A summary of the tree performance on the simulated original full data (histograms of in-sample and out-of-sample accuracy and AUC).

Figure 3 shows the distribution of the tree performance on the simulated original full data, as measured by accuracy and area under the ROC curve (AUC). As we can see, there is broad coverage of the entire range of strength of the underlying relationship.
Also, as expected, the out-of-sample performance (on the test set) is generally worse than the in-sample performance (on the training set). When the in-sample AUC is close to 0.5, a tree is likely to not split and, as a result, any missing data method will not actually be applied, resulting in equivalent performance over all of them. To make the comparisons more meaningful, we exclude the cases where the in-sample AUC is below 0.7. Lower thresholds for exclusion (0.55 and 0.6) yield very similar results. Of the six missing data methods covered by this study, five of them, namely, complete case method, probabilistic split, separate class, imputation and complete variable method, are realized using C4.5. These methods are always comparable. However, surrogate split is carried out using RPART, which makes it less comparable to the other methods because of differences between RPART and C4.5 other than the missing data methods. To remedy this problem, we tuned the RPART parameters (primarily the parameter “cp”) so that it gives balanced results compared to C4.5 when applied to the original full data (i.e., each has a similar probability of outperforming the other), and special attention is given when comparing RPART with other methods. The out-of-sample performances of each pair of missing data methods were compared based on both t-tests and nonparametric tests; each difference discussed in the following sections was strongly statistically significant.

Figure 4: A summary of the order of six missing data methods when tested on a new complete testing set (one panel per missingness pattern, with sample size from 100 to 10000 on the horizontal axis and the winning percentage of each method on the vertical axis). The Y axis is the percentage of times each method is the best (including being tied with other methods; therefore the percentages do not sum up to one).

3.2.2 The Two Factors that Determine the Performance of Different Missing Data Methods

The simulations make clear that the dependence relationship between the missingness and the response variable is the most informative factor in differentiating different missing data methods, and thus is most helpful in determining the appropriateness of the methods. This can be clearly seen in Figures 4 and 5 (these figures refer to the case with twelve continuous predictors, six of which are subject to missing values, but results for other situations were broadly similar).
The left column in the pictures shows the results when the missingness is independent of the response variable and the right column shows the results when the missingness is dependent on the response variable. We can see that there are clear differences between the two columns, but within each column there is essentially no difference. This also says the categorization of MCAR/MAR/NMAR (which is based upon the dependence relationship between the missingness and missing values, and does not distinguish the dependence of the missingness on other Xs and on Y) is not helpful in this context.

Figure 5: A summary of the order of six missing data methods when tested on a new incomplete testing set (one panel per missingness pattern, with sample size from 100 to 10000 on the horizontal axis and the winning percentage of each method on the vertical axis). The Y axis is the percentage of times each method is the best (including being tied with other methods).

Figure 6: Plot of the case-wise missing rate MR2 versus the value-wise missing rate MR1 in the simulations using the 36 real data sets.

Comparison of the right columns of Figures 4 and 5 shows that whether or not there are missing values in the testing set is the second important criterion in differentiating between the methods. The separate class method is strongly dominant when the testing set contains missing values and the missingness is related to the response variable. The reason for this is that when missing data exist in both the training phase and the testing phase, they become part of the data and the MGP becomes an essential part of the DGP. This, of course, requires the assumption that the MGP (as well as the DGP) is the same in both the training phase and the testing phase. Under this scenario, if the missingness is related to the response variable, then there is information about the response variable in the missingness, which should be helpful when making predictions. Separate class, by taking the missingness directly as an “observed” variable, uses the information in the missingness about the response variable most effectively and thus is the best method to use. As a matter of fact, as can be seen in the bottom rows of Figures 7 and 8 (which give average relative accuracies separated by missing rate), the average relative accuracy of separate class under this situation is larger than one, indicating, on average, a better performance than with the original full data.
On the other hand, when the missing data occur only in the training phase and the testing set has no missing values, or when the missingness is unrelated to and carries no information about the response variable, the presence of missing values is a nuisance: its only effect is to obscure the underlying DGP, and it will therefore most likely reduce a tree's performance. In this case, the simulations show probabilistic split to be the clearly dominant method. However, we do not see this dominance later in the results based on real data sets; more discussion of this point follows in Section 4.
3.2.3 Missing Rate Effect
There are two ways of defining the missing rate: the percentage of predictor values that are missing from the data set (the value-wise missing rate, termed here MR1), and the percentage of observations that contain missing values (the case-wise missing rate, termed here MR2). If there is only one predictor, as is the case with 2×2 tables, the two definitions coincide. We saw earlier in the theoretical analyses that the missing rate has a clear impact on the performance of the missing data methods. In the simulations, there is also evidence of a relationship between relative performance and missing rate, whichever definition is used.
[Figure panels omitted: winning percentage and mean relative accuracy of each method versus sample size under the MXY missingness generating process, separated by value-wise missing rate (MR1 < 0.15, 0.2 < MR1 < 0.3, MR1 > 0.35).]
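As a worked illustration of the two missing rate definitions, the following sketch computes MR1 and MR2 for a pandas data frame of predictors in which NaN marks a missing value; the function name and the toy example are illustrative assumptions, not part of the study.

```python
# Worked illustration of the two missing rate definitions: MR1 is the share of
# missing predictor cells, MR2 is the share of observations with any missing cell.
import numpy as np
import pandas as pd

def missing_rates(X: pd.DataFrame):
    mr1 = X.isna().to_numpy().mean()          # value-wise: fraction of missing cells
    mr2 = X.isna().any(axis=1).mean()         # case-wise: fraction of rows with any NaN
    return mr1, mr2

# Tiny example: 2 of 6 cells missing (MR1 = 1/3), 2 of 3 rows affected (MR2 = 2/3).
X = pd.DataFrame({"x1": [1.0, np.nan, 3.0], "x2": [np.nan, 5.0, 6.0]})
print(missing_rates(X))   # (0.333..., 0.666...)
```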
3 0.68580973 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
Author: Alexander Ilin, Tapani Raiko
Abstract: Principal component analysis (PCA) is a classical data analysis technique that finds linear transformations of data that retain the maximal amount of variance. We study a case where some of the data values are missing, and show that this problem has many features which are usually associated with nonlinear models, such as overfitting and bad locally optimal solutions. A probabilistic formulation of PCA provides a good foundation for handling missing values, and we provide formulas for doing that. In case of high dimensional and very sparse data, overfitting becomes a severe problem and traditional algorithms for PCA are very slow. We introduce a novel fast algorithm and extend it to variational Bayesian learning. Different versions of PCA are compared in artificial experiments, demonstrating the effects of regularization and modeling of posterior variance. The scalability of the proposed algorithm is demonstrated by applying it to the Netflix problem. Keywords: principal component analysis, missing values, overfitting, regularization, variational Bayes
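For a concrete starting point on the problem this abstract describes, the sketch below implements a plain alternating least-squares estimate of PCA from incompletely observed data. It is a generic mean-plus-low-rank baseline, not the fast algorithm or the variational Bayesian extension proposed in the paper; the function name, the ridge term, and the fixed iteration count are assumptions.

```python
# Generic alternating least-squares sketch for PCA with missing entries.
# X is an n-by-d array with np.nan marking missing values; k is the number of
# components. Not the paper's algorithm; a simple baseline for the same problem.
import numpy as np

def pca_with_missing(X, k, n_iters=100, ridge=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    obs = ~np.isnan(X)                         # mask of observed entries
    mu = np.nanmean(X, axis=0)                 # column means from observed data
    Xc = np.where(obs, X - mu, 0.0)            # centered, missing set to 0
    W = rng.standard_normal((d, k)) * 0.01     # loadings
    Z = np.zeros((n, k))                       # scores
    for _ in range(n_iters):
        # Update each row's scores using only its observed coordinates.
        for i in range(n):
            Wo = W[obs[i]]
            A = Wo.T @ Wo + ridge * np.eye(k)
            Z[i] = np.linalg.solve(A, Wo.T @ Xc[i, obs[i]])
        # Update each column's loadings using only rows observed in that column.
        for j in range(d):
            Zo = Z[obs[:, j]]
            A = Zo.T @ Zo + ridge * np.eye(k)
            W[j] = np.linalg.solve(A, Zo.T @ Xc[obs[:, j], j])
    reconstruction = Z @ W.T + mu              # fills in the missing entries
    return W, Z, reconstruction
```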
4 0.37017599 14 jmlr-2010-Approximate Riemannian Conjugate Gradient Learning for Fixed-Form Variational Bayes
Author: Antti Honkela, Tapani Raiko, Mikael Kuusela, Matti Tornio, Juha Karhunen
Abstract: Variational Bayesian (VB) methods are typically only applied to models in the conjugate-exponential family using the variational Bayesian expectation maximisation (VB EM) algorithm or one of its variants. In this paper we present an efficient algorithm for applying VB to more general models. The method is based on specifying the functional form of the approximation, such as multivariate Gaussian. The parameters of the approximation are optimised using a conjugate gradient algorithm that utilises the Riemannian geometry of the space of the approximations. This leads to a very efficient algorithm for suitably structured approximations. It is shown empirically that the proposed method is comparable or superior in efficiency to the VB EM in a case where both are applicable. We also apply the algorithm to learning a nonlinear state-space model and a nonlinear factor analysis model for which the VB EM is not applicable. For these models, the proposed algorithm outperforms alternative gradient-based methods by a significant margin. Keywords: variational inference, approximate Riemannian conjugate gradient, fixed-form approximation, Gaussian approximation
5 0.30621764 38 jmlr-2010-Expectation Truncation and the Benefits of Preselection In Training Generative Models
Author: Jörg Lücke, Julian Eggert
Abstract: We show how a preselection of hidden variables can be used to efficiently train generative models with binary hidden variables. The approach is based on Expectation Maximization (EM) and uses an efficiently computable approximation to the sufficient statistics of a given model. The computational cost to compute the sufficient statistics is strongly reduced by selecting, for each data point, the relevant hidden causes. The approximation is applicable to a wide range of generative models and provides an interpretation of the benefits of preselection in terms of a variational EM approximation. To empirically show that the method maximizes the data likelihood, it is applied to different types of generative models including: a version of non-negative matrix factorization (NMF), a model for non-linear component extraction (MCA), and a linear generative model similar to sparse coding. The derived algorithms are applied to both artificial and realistic data, and are compared to other models in the literature. We find that the training scheme can reduce computational costs by orders of magnitude and allows for a reliable extraction of hidden causes. Keywords: maximum likelihood, deterministic approximations, variational EM, generative models, component extraction, multiple-cause models
6 0.30604544 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines
7 0.29358068 61 jmlr-2010-Learning From Crowds
8 0.26853371 66 jmlr-2010-Linear Algorithms for Online Multitask Classification
9 0.26259637 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks
10 0.26108146 40 jmlr-2010-Fast and Scalable Local Kernel Machines
11 0.25681072 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers
12 0.24912798 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
13 0.22456104 37 jmlr-2010-Evolving Static Representations for Task Transfer
14 0.22375712 101 jmlr-2010-Second-Order Bilinear Discriminant Analysis
15 0.21900795 22 jmlr-2010-Classification Using Geometric Level Sets
16 0.21519706 29 jmlr-2010-Covariance in Unsupervised Learning of Probabilistic Grammars
17 0.21504274 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
18 0.21256796 90 jmlr-2010-Permutation Tests for Studying Classifier Performance
19 0.20535083 63 jmlr-2010-Learning Instance-Specific Predictive Models
20 0.20229749 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems
topicId topicWeight
[(4, 0.021), (6, 0.013), (8, 0.019), (20, 0.4), (21, 0.014), (22, 0.014), (32, 0.064), (36, 0.038), (37, 0.045), (75, 0.144), (81, 0.021), (85, 0.081), (96, 0.014), (97, 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.67225158 23 jmlr-2010-Classification with Incomplete Data Using Dirichlet Process Priors
Author: Chunping Wang, Xuejun Liao, Lawrence Carin, David B. Dunson
Abstract: A non-parametric hierarchical Bayesian framework is developed for designing a classifier, based on a mixture of simple (linear) classifiers. Each simple classifier is termed a local “expert”, and the number of experts and their construction are manifested via a Dirichlet process formulation. The simple form of the “experts” allows analytical handling of incomplete data. The model is extended to allow simultaneous design of classifiers on multiple data sets, termed multi-task learning, with this also performed non-parametrically via the Dirichlet process. Fast inference is performed using variational Bayesian (VB) analysis, and example results are presented for several data sets. We also perform inference via Gibbs sampling, to which we compare the VB results. Keywords: classification, incomplete data, expert, Dirichlet process, variational Bayesian, multitask learning
2 0.41480738 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
Author: Pannagadatta K. Shivaswamy, Tony Jebara
Abstract: Leading classification methods such as support vector machines (SVMs) and their counterparts achieve strong generalization performance by maximizing the margin of separation between data classes. While the maximum margin approach has achieved promising performance, this article identifies its sensitivity to affine transformations of the data and to directions with large data spread. Maximum margin solutions may be misled by the spread of data and preferentially separate classes along large spread directions. This article corrects these weaknesses by measuring margin not in the absolute sense but rather only relative to the spread of data in any projection direction. Maximum relative margin corresponds to a data-dependent regularization on the classification function while maximum absolute margin corresponds to an ℓ2 norm constraint on the classification function. Interestingly, the proposed improvements only require simple extensions to existing maximum margin formulations and preserve the computational efficiency of SVMs. Through the maximization of relative margin, surprising performance gains are achieved on real-world problems such as digit, text classification and on several other benchmark data sets. In addition, risk bounds are derived for the new formulation based on Rademacher averages. Keywords: support vector machines, kernel methods, large margin, Rademacher complexity
3 0.41323158 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
Author: Pinar Donmez, Guy Lebanon, Krishnakumar Balasubramanian
Abstract: Estimating the error rates of classifiers or regression models is a fundamental task in machine learning which has thus far been studied exclusively using supervised learning techniques. We propose a novel unsupervised framework for estimating these error rates using only unlabeled data and mild assumptions. We prove consistency results for the framework and demonstrate its practical applicability on both synthetic and real world data. Keywords: classification and regression, maximum likelihood, latent variable models
4 0.41053259 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
Author: Alexander Ilin, Tapani Raiko
Abstract: Principal component analysis (PCA) is a classical data analysis technique that finds linear transformations of data that retain the maximal amount of variance. We study a case where some of the data values are missing, and show that this problem has many features which are usually associated with nonlinear models, such as overfitting and bad locally optimal solutions. A probabilistic formulation of PCA provides a good foundation for handling missing values, and we provide formulas for doing that. In case of high dimensional and very sparse data, overfitting becomes a severe problem and traditional algorithms for PCA are very slow. We introduce a novel fast algorithm and extend it to variational Bayesian learning. Different versions of PCA are compared in artificial experiments, demonstrating the effects of regularization and modeling of posterior variance. The scalability of the proposed algorithm is demonstrated by applying it to the Netflix problem. Keywords: principal component analysis, missing values, overfitting, regularization, variational Bayes
5 0.40877637 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
Author: Yevgeny Seldin, Naftali Tishby
Abstract: We derive PAC-Bayesian generalization bounds for supervised and unsupervised learning models based on clustering, such as co-clustering, matrix tri-factorization, graphical models, graph clustering, and pairwise clustering. We begin with the analysis of co-clustering, which is a widely used approach to the analysis of data matrices. We distinguish among two tasks in matrix data analysis: discriminative prediction of the missing entries in data matrices and estimation of the joint probability distribution of row and column variables in co-occurrence matrices. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two tasks. The analysis yields regularization terms that were absent in the previous formulations of co-clustering. The bounds suggest that the expected performance of co-clustering is governed by a trade-off between its empirical performance and the mutual information preserved by the cluster variables on row and column IDs. We derive an iterative projection algorithm for finding a local optimum of this trade-off for discriminative prediction tasks. This algorithm achieved state-of-the-art performance in the MovieLens collaborative filtering task. Our co-clustering model can also be seen as matrix tri-factorization and the results provide generalization bounds, regularization terms, and new algorithms for this form of matrix factorization. The analysis of co-clustering is extended to tree-shaped graphical models, which can be used to analyze high dimensional tensors. According to the bounds, the generalization abilities of tree-shaped graphical models depend on a trade-off between their empirical data fit and the mutual information that is propagated up the tree levels. We also formulate weighted graph clustering as a prediction problem: given a subset of edge weights we analyze the ability of graph clustering to predict the remaining edge weights. The analysis of co-clustering easily
6 0.40839159 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking
7 0.40817949 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
8 0.40745232 63 jmlr-2010-Learning Instance-Specific Predictive Models
9 0.40693223 9 jmlr-2010-An Efficient Explanation of Individual Classifications using Game Theory
10 0.40628648 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
11 0.40626472 11 jmlr-2010-An Investigation of Missing Data Methods for Classification Trees Applied to Binary Response Data
12 0.40608951 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
13 0.4037396 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
14 0.40323231 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
15 0.40239224 102 jmlr-2010-Semi-Supervised Novelty Detection
16 0.40223444 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing
18 0.40160412 109 jmlr-2010-Stochastic Composite Likelihood
19 0.4014667 22 jmlr-2010-Classification Using Geometric Level Sets
20 0.40074867 66 jmlr-2010-Linear Algorithms for Online Multitask Classification