jmlr jmlr2007 jmlr2007-39 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Maytal Saar-Tsechansky, Foster Provost
Abstract: Much work has studied the effect of different treatments of missing values on model induction, but little work has analyzed treatments for the common case of missing values at prediction time. This paper first compares several different methods—predictive value imputation, the distribution-based imputation used by C4.5, and using reduced models—for applying classification trees to instances with missing values (and also shows evidence that the results generalize to bagged trees and to logistic regression). The results show that for the two most popular treatments, each is preferable under different conditions. Strikingly, the reduced-models approach, seldom mentioned or used, consistently outperforms the other two methods, sometimes by a large margin. The lack of attention to reduced modeling may be due in part to its (perceived) expense in terms of computation or storage. Therefore, we then introduce and evaluate alternative, hybrid approaches that allow users to balance between more accurate but computationally expensive reduced modeling and the other, less accurate but less computationally expensive treatments. The results show that the hybrid methods can scale gracefully to the amount of investment in computation/storage, and that they outperform imputation even for small investments. Keywords: missing data, classification, classification trees, decision trees, imputation
Reference: text
sentIndex sentText sentNum sentScore
1 New York University, 44 West 4th Street, New York, NY 10012, USA. Editor: Rich Caruana. Abstract: Much work has studied the effect of different treatments of missing values on model induction, but little work has analyzed treatments for the common case of missing values at prediction time. [sent-5, score-0.911]
2 This paper first compares several different methods—predictive value imputation, the distribution-based imputation used by C4.5, and using reduced models—for applying classification trees to instances with missing values. [sent-6, score-0.841]
3 The results show that the hybrid methods can scale gracefully to the amount of investment in computation/storage, and that they outperform imputation even for small investments. [sent-12, score-0.863]
4 Keywords: missing data, classification, classification trees, decision trees, imputation 1. [sent-13, score-1.139]
5 The most common approaches for dealing with missing features involve imputation (Hastie et al. [sent-31, score-1.165]
6 The main idea of imputation is that if an important feature is missing for a particular instance, it can be estimated from the data that are present. [sent-33, score-1.162]
7 There are two main families of imputation approaches: (predictive) value imputation and distribution-based imputation. [sent-34, score-1.682]
8 Value imputation estimates a value to be used by the model in place of the missing feature. [sent-35, score-1.139]
9 Distribution-based imputation estimates the conditional distribution of the missing value, and predictions will be based on this estimated distribution. [sent-36, score-1.139]
10 Value imputation is more common in the statistics community; distribution-based imputation is the basis for the most popular treatment used by the (non-Bayesian) machine learning community, as exemplified by C4.5. [sent-37, score-1.728]
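To make the contrast concrete, here is a small, hedged sketch; the toy `model` and both predictor functions are illustrative stand-ins, not the paper's implementations. Value imputation classifies once with a single filled-in value, while distribution-based imputation combines classifications over the estimated distribution of the missing attribute.

```python
# Toy contrast between (predictive) value imputation and
# distribution-based imputation, for a classifier that tests feature A.
# All names and the trivial model are illustrative, not from the paper.

def model(a):
    """A trivial classifier that branches on feature A."""
    return "pos" if a == 1 else "neg"

def value_imputation_predict(estimated_a):
    # Fill in a single best-guess value for A, then classify once.
    return model(estimated_a)

def distribution_based_predict(p_a):
    # p_a: estimated distribution over A's values, e.g. {0: 0.4, 1: 0.6}.
    # Classify under each possible value and weight the resulting labels,
    # in the spirit of C4.5-style DBI.
    scores = {}
    for a_val, weight in p_a.items():
        label = model(a_val)
        scores[label] = scores.get(label, 0.0) + weight
    return max(scores, key=scores.get)

print(value_imputation_predict(1))
print(distribution_based_predict({0: 0.4, 1: 0.6}))
```

The distribution-based prediction can differ from the value-imputed one whenever the estimated distribution puts most of its mass on values other than the single imputed one.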
11 An alternative to imputation is to construct models that employ only those features that will be known for a particular test case—so imputation is not necessary. [sent-39, score-1.743]
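A minimal sketch of this reduced-model idea, using a deliberately trivial nearest-mean "learner" (the learner, feature names, and data are all illustrative assumptions): when feature A is missing for a test case, a model is induced from only the remaining known features, so no imputation is needed.

```python
# Reduced-feature ("reduced model") prediction sketch: train on exactly
# the features known for this test case. Nearest-class-mean is used only
# to keep the example self-contained.

def train(rows, labels, features):
    """'Train' by storing per-class means over the given feature subset."""
    means = {}
    for label in set(labels):
        subset = [[r[f] for f in features]
                  for r, l in zip(rows, labels) if l == label]
        means[label] = [sum(col) / len(col) for col in zip(*subset)]
    return features, means

def predict(model, row):
    features, means = model
    x = [row[f] for f in features]
    def dist(m):
        return sum((a - b) ** 2 for a, b in zip(x, m))
    return min(means, key=lambda lbl: dist(means[lbl]))

train_rows = [{"A": 0.0, "B": 0.1}, {"A": 1.0, "B": 0.9},
              {"A": 0.1, "B": 0.0}, {"A": 0.9, "B": 1.0}]
train_labels = ["neg", "pos", "neg", "pos"]

# A test case arrives with A missing: train on B alone instead of imputing A.
test_case = {"B": 0.95}
reduced = train(train_rows, train_labels, features=["B"])
print(predict(reduced, test_case))
```

In practice the reduced model would be a real learner (e.g., a classification tree) trained on the feature subset, either lazily at prediction time or stored in advance per missing-feature pattern.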
12 The empirical evaluation clearly shows the inferiority of the two common imputation treatments, highlighting the underappreciated reduced-model method. [sent-47, score-0.841]
13 Neither of the two imputation techniques dominates cleanly, and each provides considerable advantage over the other for some domains. [sent-49, score-0.841]
14 The follow-up discussion examines the conditions under which the two imputation methods perform better or worse. [sent-50, score-0.841]
15 , 1997a); nonetheless, it is a general and commonly assumed scenario that should be understood before moving to other analyses, especially since most imputation methods rely on MCAR for their validity (Hastie et al. [sent-59, score-0.841]
16 As introduced above, imputation is a class of methods by which an estimation of the missing value or of its distribution is used to generate predictions from a given model. [sent-78, score-1.139]
17 Various imputation treatments for missing values in historical/training data are available that may also be deployed at prediction time. [sent-80, score-1.309]
18 However, some treatments such as multiple imputation (Rubin, 1987) are particularly suitable to induction. [sent-81, score-0.986]
19 In particular, multiple imputation (or repeated imputation) is a Monte Carlo approach that generates multiple simulated versions of a data set that each are analyzed and the results are combined to generate inference. [sent-82, score-0.841]
20 For this paper, we consider imputation techniques that can be applied to individual test cases during inference. [sent-83, score-0.854]
21 This treatment is fundamentally different from value imputation because it combines the classifications across the distribution of an attribute’s possible values, rather than merely making the classification based on its most likely value. [sent-105, score-0.887]
22 Unique-value imputation is preferable when the following two conditions hold: the fact that a value is missing depends on the value of the class variable, and this dependence is present both in the training and in the application/test data (Ding and Simonoff, 2006). [sent-110, score-1.139]
23 As we discuss in detail below, one could achieve a balance of storage and computation with a hybrid method, whereby reduced-feature models are stored for the most important patterns; lazy learning or imputation could be applied for less-important patterns. [sent-129, score-0.946]
24 If more than one feature is missing for the test case, the imputation of A is (recursively) a problem of prediction with missing values. [sent-132, score-1.498]
25 Short of abandoning straightforward imputation, one possibility is to take a reduced-model approach for imputation itself, which begs the question: why not simply use a direct reduced-model approach? [sent-133, score-0.841]
26 Another approach is to build one predictive imputation model for each attribute, using all the other features, and then use an alternative imputation method (such as mean or mode value imputation, or C4.5-style DBI) when the imputation models themselves encounter missing values. [sent-134, score-1.742]
27 We are aware of neither theoretical nor empirical support for an advantage of predictive imputation over reduced modeling in terms of prediction accuracy. [sent-150, score-1.075]
28 For value imputation we estimate missing categorical features using a J48 tree, and continuous values using Weka’s linear regression. [sent-165, score-1.165]
29 As discussed above, for value imputation with multiple missing values we use mean/mode imputation for the additional missing values. [sent-166, score-2.278]
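The mean/mode fallback used for those additional missing values can be sketched as follows (pure Python; the column layout and data are assumptions for illustration only):

```python
# Mean/mode imputation sketch: numeric columns are filled with the
# training mean, categorical columns with the training mode.

def column_fill_value(values):
    """Mean for numeric columns, mode for categorical ones."""
    if all(isinstance(v, (int, float)) for v in values):
        return sum(values) / len(values)
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return max(counts, key=counts.get)

def impute(case, training_columns):
    filled = dict(case)
    for name, values in training_columns.items():
        if name not in filled:
            filled[name] = column_fill_value(values)
    return filled

columns = {"age": [20.0, 30.0, 40.0], "color": ["red", "red", "blue"]}
print(impute({"age": 25.0}, columns))   # fills color with its mode, "red"
```

This is the simplest possible value-imputation baseline; the predictive imputation studied in the paper replaces the per-column constants with learned models.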
30 Table 2 shows the differences in the relative improvements obtained with each imputation treatment from those obtained with reduced modeling. [sent-181, score-0.972]
31 A large negative value indicates that an imputation treatment resulted in a larger drop in accuracy than that exhibited by reduced modeling. [sent-182, score-0.994]
32 The reduced-model approach results in better performance compared to distribution-based imputation in 13 out of 15 data sets, and is better than value imputation in 14 data sets (both significant with p < 0. [sent-184, score-1.682]
33 Not only does a reduced-feature model almost always result in statistically significantly more accurate predictions, but the improvement over the imputation methods often was substantial. [sent-186, score-0.841]
34 For example, for the Downsize data set, prediction with reduced models results in less than 1% decrease in accuracy, while value imputation and distribution-based imputation exhibit drops of 10. [sent-187, score-1.802]
35 32%, respectively; the drop in accuracy resulting from imputation is more than 9 times that obtained with a reduced model. [sent-189, score-0.948]
36 Comparing predictive value imputation with C4.5's distribution-based imputation (DBI), each has a stark advantage over the other in some domains. [sent-203, score-0.841]
37 The different imputation treatments differ in how they take advantage of statistical dependencies between features. [sent-205, score-0.986]
38 It is easy to develop a notion of the exact type of statistical dependency under which predictive value imputation should work, and we can formalize this notion by defining feature imputability as the fundamental ability to estimate one feature using others. [sent-206, score-1.134]
39 Assume also, for the moment, that both the primary modeling and the imputation modeling have no intrinsic error—in the latter case, all existing feature 5. [sent-216, score-1.016]
40 Table 2: Differences in relative improvement (from Figure 1) between each imputation treatment and reduced-feature modeling. [sent-254, score-0.899]
41 Predictive value imputation simply fills in the correct value and has no effect whatsoever on the bias and variance of the model induction. [sent-256, score-0.841]
42 Now given a test case with A missing, predictive value imputation can use the (perfect) feature imputability directly: B can be used to infer A, and this enables the use of the learned model to predict perfectly. [sent-259, score-1.124]
43 Since there is no feature imputability, A cannot be inferred using B and the imputation model should predict the mode (A = 2). [sent-286, score-0.864]
44 The bars represent the differences in the entries in Table 2, between predictive value imputation and C4.5's DBI. [sent-307, score-0.901]
45 A bar above the horizontal line indicates that value imputation performed better; a bar below the line indicates that DBI performed better. [sent-309, score-0.841]
46 The relative performances follow the above argument closely, with value imputation generally preferable for high feature imputability, and C4.5's DBI generally preferable for low feature imputability. [sent-310, score-0.876]
47 Reduced modeling is a lower-dimensional learning problem than the (complete) modeling to which imputation methods are applied; it will tend to have lower variance and thereby may exhibit lower 6. [sent-315, score-0.993]
48 For categorical features we measured the classification accuracy of the imputation model; for numeric features we computed the correlation coefficient of the regression. [sent-317, score-0.927]
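For the numeric case, that correlation-based imputability measure might be computed roughly as below. This is a plain Pearson correlation between a feature and a candidate predictor, which for a single-predictor linear regression matches the regression's correlation coefficient in magnitude; the data are illustrative.

```python
# Sketch of the numeric-feature imputability measure: how well can one
# feature be recovered from another? Measured here via Pearson's r.

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Feature B perfectly determines A: high imputability ...
A = [1.0, 2.0, 3.0, 4.0]
B = [2.0, 4.0, 6.0, 8.0]
print(round(pearson_r(A, B), 3))   # 1.0: A fully recoverable from B

# ... whereas an unrelated feature gives low imputability.
C = [5.0, 1.0, 4.0, 2.0]
print(round(pearson_r(A, C), 3))
```

For categorical features the analogous measure, as the text says, is the held-out classification accuracy of the imputation model itself.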
49 In contrast, imputation takes on quite an ambitious task. [sent-326, score-0.841]
50 From the same training data, it must build an accurate base classifier and build accurate imputation models for any possible missing values. [sent-327, score-1.161]
51 One can argue that imputation tries implicitly to approximate the full-joint distribution, similar to a graphical model such as a dependency network (Heckerman et al. [sent-328, score-0.841]
52 There are many opportunities for the introduction of error, and the errors will be compounded as imputation models are composed. [sent-330, score-0.863]
53 Reduced-feature modeling may have additional advantages over value imputation when the imputation is imperfect, as just discussed. [sent-335, score-1.758]
54 Reduced modeling is likely to be better than the imputation methods, because of its reduced variance as described above. [sent-339, score-0.99]
55 Then, if A is missing at prediction time, no imputation technique will help us do better than merely guessing that the example belongs to the most common class (as with DBI) or guessing that the missing value is the most common one (as in PVI). [sent-342, score-1.462]
56 Value imputation does very well only for the domains with the highest feature imputability (for the highest-imputability domains, the accuracies of imputation and reduced modeling are statistically indistinguishable). [sent-349, score-2.063]
57 Table 3 shows the differences in the relative improvements of each imputation treatment from those obtained with reduced models. [sent-360, score-0.972]
58 For bagged trees, reduced modeling is better than predictive imputation in 12 out of 15 data sets, and it performs better than distribution-based imputation in 14 out of 15 data sets (according to the sign test, these differences are significant at p < 0. [sent-361, score-1.931]
59 These results indicate that for bagging, a reduced model’s relative advantage with respect to predictive imputation is comparable to its relative advantage when a single model is used. [sent-367, score-0.998]
60 Table 3: Relative difference in prediction accuracy for bagged decision trees between imputation treatments and reduced-feature modeling. [sent-412, score-1.121]
61 Figure 8: Relative differences in accuracies for a logistic regression model when predictive value imputation and reduced modeling are employed, as compared to when all values are known. [sent-417, score-1.025]
62 Since C4.5-style distribution-based imputation is not applicable to logistic regression, we compare predictive value imputation to the reduced-model approach. [sent-419, score-1.828]
63 Figure 8 shows the difference in accuracy when predictive value imputation and reduced models are used. [sent-420, score-1.03]
64 Table 4 shows the differences in the relative improvements of the predictive imputation treatment from those obtained with reduced models. [sent-421, score-1.032]
65 For logistic regression, reduced modeling results in higher accuracy than predictive imputation in all 15 data sets (statistically significant with p < 0. [sent-422, score-1.097]
66 Table 4: Relative difference in prediction accuracy for logistic regression between imputation and reduced-feature modeling. [sent-457, score-0.913]
67 Furthermore, predictive value imputation and distribution-based imputation each outperforms the other substantially on at least one data set—so one should not choose between them arbitrarily. [sent-467, score-1.742]
68 Figure 9: Relative percentage-point differences in predictive accuracy obtained with distribution-based imputation and predictive value imputation treatments compared to that obtained with reduced-feature models. [sent-477, score-1.981]
69 Table 6: Relative percentage-point difference in prediction accuracy between imputation treatments and reduced-feature modeling. [sent-488, score-1.045]
70 Figure 10 shows the accuracies of reduced-feature modeling and predictive value imputation as the number of missing features increases, from 1 feature up to when only a single feature is left. [sent-491, score-1.369]
71 We see a typical pattern: the imputation methods have steeper decreases in accuracy as the number of missing values increases. [sent-495, score-1.173]
72 First, as we mentioned earlier when more than one value is missing, the imputation models themselves face a missing-at-prediction-time problem, which must be addressed by a different technique. [sent-498, score-0.863]
73 This is a fundamental limitation to predictive value imputation as it is used in practice. [sent-499, score-0.901]
74 Second, predictive value imputation might do worse than reduced modeling, if the inductive bias of the resultant imputation model is “worse” than that of the reduced model. [sent-501, score-1.888]
75 For example, perhaps our classification-tree modeling does a much better job with numeric variables than the linear regression we use for imputation of real-valued features. [sent-502, score-0.917]
76 Using C4.5 for both the base model and the imputation model, we see the same patterns of results as with the other data sets. [sent-506, score-0.863]
77 If no corresponding model has been stored, the hybrid would call on a fall-back technique: either incurring the expense of prediction-time reduced modeling, or invoking an imputation method (and possibly incurring reduced accuracy). [sent-516, score-1.009]
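A hedged sketch of this likelihood-ordered hybrid follows; `train_reduced` and `fallback_predict` are placeholder stand-ins for a real reduced-model learner and a real imputation-based predictor, and the pattern likelihoods are invented for illustration.

```python
# Hybrid sketch: store reduced models only for the k most likely
# missing-feature patterns; fall back to imputation for the rest.

def train_reduced(pattern):
    # Placeholder for inducing a reduced-feature model for this pattern.
    return ("reduced-model", pattern)

def fallback_predict(case):
    # Placeholder for a distribution-based or value-imputation prediction.
    return "fallback"

def build_hybrid(pattern_likelihoods, k):
    """Keep reduced models for the top-k most likely missing patterns."""
    top = sorted(pattern_likelihoods,
                 key=pattern_likelihoods.get, reverse=True)[:k]
    return {p: train_reduced(p) for p in top}

def hybrid_predict(store, missing_pattern, case):
    model = store.get(missing_pattern)
    if model is None:
        return fallback_predict(case)   # pattern not stored: fall back
    return "reduced"                    # would apply the stored model here

likelihoods = {("A",): 0.6, ("B",): 0.3, ("A", "B"): 0.1}
store = build_hybrid(likelihoods, k=2)
print(hybrid_predict(store, ("A",), case={}))      # stored pattern
print(hybrid_predict(store, ("A", "B"), case={}))  # falls back
```

The value of k is exactly the storage/accuracy knob the text describes: larger k means more patterns served by the more accurate reduced models.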
78 The horizontal, dashed line in Figure 11 shows the performance of pure predictive value imputation for the CalHouse data set. [sent-530, score-0.901]
79 Given enough space to store k models, the hybrid induces and stores reduced models for the top-k most likely missing-feature patterns, and uses distribution-based imputation for the rest. [sent-534, score-0.958]
80 When multiple values are missing, ReFE ensemble members rely on imputation for the additional missing values. [sent-561, score-1.155]
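One reading of the ReFE idea can be sketched as follows, under loudly-stated assumptions: each ensemble member is trained with one feature withheld, so a case missing that feature can be scored by members that never used it. The member models, the voting rule, and the handling of unusable members here are illustrative only, not the paper's exact procedure.

```python
# Assumption-laden ReFE-style sketch: members trained without one
# feature each; a case with that feature missing uses only the
# members that never relied on it.

from collections import Counter

def make_member(excluded, predict_fn):
    return {"excluded": excluded, "predict": predict_fn}

def refe_predict(members, case, missing):
    votes = []
    for m in members:
        if missing <= {m["excluded"]}:  # member never used the missing feature
            votes.append(m["predict"](case))
        # members needing the missing feature would fall back to imputation
    return Counter(votes).most_common(1)[0][0]

members = [
    make_member("A", lambda c: "pos" if c.get("B", 0) > 0.5 else "neg"),
    make_member("B", lambda c: "pos" if c.get("A", 0) > 0.5 else "neg"),
]
# Feature A is missing: only the member trained without A votes.
print(refe_predict(members, {"B": 0.9}, missing={"A"}))
```

When several values are missing at once, few or no members avoid them all, which is where (per the text) ReFE members fall back on imputation for the extra missing values.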
81 We see that ReFE consistently improves over both a single model with imputation (positive entries in the ReFE column) and over bagging with imputation. [sent-575, score-0.923]
82 The magnitudes of ReFE’s improvements vary widely, but on average they split the difference between bagging with imputation and the full-blown reduced modeling. [sent-578, score-0.996]
83 Table 7: Relative improvements in accuracy for bagging with imputation and ReFE, as compared to a single model with imputation. [sent-638, score-0.957]
84 Bold entries show the cases where ReFE improves both over using a single model with imputation and over bagging with imputation. [sent-639, score-0.923]
85 The advantage exhibited by ReFE over bagging with imputation is maintained. [sent-646, score-0.923]
86 ReFE results in higher accuracy than bagging with imputation for all 15 data sets (statistically significant at p < 0. [sent-648, score-0.957]
87 Bold entries show the cases where ReFE improves both over using a single model with imputation and over bagging with imputation. [sent-706, score-0.923]
88 Figure 14 (panels: Credit, Move; x-axis: number of missing features): Performance of treatments for missing values for large ensemble models as the number of missing values increases. [sent-712, score-1.103]
89 Related Work Although value imputation and distribution-based imputation are common in practical applications of classification models, there is surprisingly little theoretical or empirical work analyzing the strategies. [sent-726, score-1.682]
90 The paper discusses instances of the three strategies we explore here: value imputation (simple default-value imputation in their paper), distribution-based prediction, and a reduced-feature “classifier lattice” of models for all possible patterns of missing values. [sent-730, score-2.04]
91 The study explores two forms of imputation similar to those explored here, and classification by simply using the first tree node for which the feature is missing (treating it as a leaf); the study does not consider reduced-feature models. [sent-742, score-1.191]
92 Our study revealed the opposite pattern—predictive value imputation often is superior to C4.5's DBI. [sent-746, score-0.841]
93 More importantly, however, we show that the dominance of one form of imputation versus another depends on the statistical dependencies (and lack thereof) between the features: value imputation is likely to outperform C4.5's DBI when feature imputability is high, and vice versa. [sent-748, score-1.682]
94 Predictive value imputation was implemented by imputing either the mode value or a prediction using a decision tree classifier. [sent-755, score-0.895]
95 They note that because of the large proportion of missing values, ID3 with various imputation techniques performed poorly. [sent-761, score-1.139]
96 Neither paper considers value imputation as an alternative, nor do they explore the domain characteristics that enable the different missing-value treatments to succeed or fail. [sent-778, score-0.986]
97 For example, our follow-up analysis shows that with high feature imputability, predictive value imputation can perform just as well as lazy (reduced-feature) modeling, but reduced modeling is considerably more robust to lower levels of imputability. [sent-779, score-1.107]
98 However, our arguments supporting the results are not limited to a particular imputation model. [sent-818, score-0.841]
99 In the case of multiple missing values, we have not analyzed the degree to which imputation would improve if a reduced-modeling approach were taken for the imputation itself, rather than using simple value imputation. [sent-819, score-1.98]
100 If run-time computation simply is not an option, then storage could be allocated to the most advantageous reduced models, and an imputation technique used otherwise. [sent-847, score-0.941]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 39 jmlr-2007-Handling Missing Values when Applying Classification Models
Author: Maytal Saar-Tsechansky, Foster Provost
Abstract: Much work has studied the effect of different treatments of missing values on model induction, but little work has analyzed treatments for the common case of missing values at prediction time. This paper first compares several different methods—predictive value imputation, the distributionbased imputation used by C4.5, and using reduced models—for applying classification trees to instances with missing values (and also shows evidence that the results generalize to bagged trees and to logistic regression). The results show that for the two most popular treatments, each is preferable under different conditions. Strikingly the reduced-models approach, seldom mentioned or used, consistently outperforms the other two methods, sometimes by a large margin. The lack of attention to reduced modeling may be due in part to its (perceived) expense in terms of computation or storage. Therefore, we then introduce and evaluate alternative, hybrid approaches that allow users to balance between more accurate but computationally expensive reduced modeling and the other, less accurate but less computationally expensive treatments. The results show that the hybrid methods can scale gracefully to the amount of investment in computation/storage, and that they outperform imputation even for small investments. Keywords: missing data, classification, classification trees, decision trees, imputation
2 0.043586999 11 jmlr-2007-Anytime Learning of Decision Trees
Author: Saher Esmeir, Shaul Markovitch
Abstract: The majority of existing algorithms for learning decision trees are greedy—a tree is induced topdown, making locally optimal decisions at each node. In most cases, however, the constructed tree is not globally optimal. Even the few non-greedy learners cannot learn good trees when the concept is difficult. Furthermore, they require a fixed amount of time and are not able to generate a better tree if additional time is available. We introduce a framework for anytime induction of decision trees that overcomes these problems by trading computation speed for better tree quality. Our proposed family of algorithms employs a novel strategy for evaluating candidate splits. A biased sampling of the space of consistent trees rooted at an attribute is used to estimate the size of the minimal tree under that attribute, and an attribute with the smallest expected tree is selected. We present two types of anytime induction algorithms: a contract algorithm that determines the sample size on the basis of a pre-given allocation of time, and an interruptible algorithm that starts with a greedy tree and continuously improves subtrees by additional sampling. Experimental results indicate that, for several hard concepts, our proposed approach exhibits good anytime behavior and yields significantly better decision trees when more time is available. Keywords: anytime algorithms, decision tree induction, lookahead, hard concepts, resource-bounded reasoning
3 0.03663433 59 jmlr-2007-Nonlinear Boosting Projections for Ensemble Construction
Author: Nicolás García-Pedrajas, César García-Osorio, Colin Fyfe
Abstract: In this paper we propose a novel approach for ensemble construction based on the use of nonlinear projections to achieve both accuracy and diversity of individual classifiers. The proposed approach combines the philosophy of boosting, putting more effort on difficult instances, with the basis of the random subspace method. Our main contribution is that instead of using a random subspace, we construct a projection taking into account the instances which have posed most difficulties to previous classifiers. In this way, consecutive nonlinear projections are created by a neural network trained using only incorrectly classified instances. The feature subspace induced by the hidden layer of this network is used as the input space to a new classifier. The method is compared with bagging and boosting techniques, showing an improved performance on a large set of 44 problems from the UCI Machine Learning Repository. An additional study showed that the proposed approach is less sensitive to noise in the data than boosting methods. Keywords: classifier ensembles, boosting, neural networks, nonlinear projections
4 0.033943918 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study
Author: Sofus A. Macskassy, Foster Provost
Abstract: paper1 This is about classifying entities that are interlinked with entities for which the class is known. After surveying prior work, we present NetKit, a modular toolkit for classification in networked data, and a case-study of its application to networked data used in prior machine learning research. NetKit is based on a node-centric framework in which classifiers comprise a local classifier, a relational classifier, and a collective inference procedure. Various existing node-centric relational learning algorithms can be instantiated with appropriate choices for these components, and new combinations of components realize new algorithms. The case study focuses on univariate network classification, for which the only information used is the structure of class linkage in the network (i.e., only links and some class labels). To our knowledge, no work previously has evaluated systematically the power of class-linkage alone for classification in machine learning benchmark data sets. The results demonstrate that very simple network-classification models perform quite well—well enough that they should be used regularly as baseline classifiers for studies of learning with networked data. The simplest method (which performs remarkably well) highlights the close correspondence between several existing methods introduced for different purposes—that is, Gaussian-field classifiers, Hopfield networks, and relational-neighbor classifiers. The case study also shows that there are two sets of techniques that are preferable in different situations, namely when few versus many labels are known initially. We also demonstrate that link selection plays an important role similar to traditional feature selection. Keywords: relational learning, network learning, collective inference, collective classification, networked data, probabilistic relational models, network analysis, network data
5 0.025374979 79 jmlr-2007-Structure and Majority Classes in Decision Tree Learning
Author: Ray J. Hickey
Abstract: To provide good classification accuracy on unseen examples, a decision tree, learned by an algorithm such as ID3, must have sufficient structure and also identify the correct majority class in each of its leaves. If there are inadequacies in respect of either of these, the tree will have a percentage classification rate below that of the maximum possible for the domain, namely (100 − Bayes error rate). An error decomposition is introduced which enables the relative contributions of deficiencies in structure and in incorrect determination of majority class to be isolated and quantified. A sub-decomposition of majority class error permits separation of the sampling error at the leaves from the possible bias introduced by the attribute selection method of the induction algorithm. It is shown that sampling error can extend to 25% when there are more than two classes. Decompositions are obtained from experiments on several data sets. For ID3, the effect of selection bias is shown to vary from being statistically non-significant to being quite substantial, with the latter appearing to be associated with a simple underlying model. Keywords: decision tree learning, error decomposition, majority classes, sampling error, attribute selection bias
6 0.023268107 52 jmlr-2007-Margin Trees for High-dimensional Classification
7 0.022077322 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models
8 0.021370154 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
9 0.020341506 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition
10 0.020125501 29 jmlr-2007-Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts
11 0.019670557 1 jmlr-2007-"Ideal Parent" Structure Learning for Continuous Variable Bayesian Networks
12 0.019179342 48 jmlr-2007-Learning in Environments with Unknown Dynamics: Towards more Robust Concept Learners
13 0.018750448 72 jmlr-2007-Relational Dependency Networks
14 0.017419143 40 jmlr-2007-Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization
15 0.016352478 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation
16 0.015322648 4 jmlr-2007-A New Probabilistic Approach in Rank Regression with Optimal Bayesian Partitioning (Special Topic on Model Selection)
17 0.014948433 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
18 0.012584996 56 jmlr-2007-Multi-Task Learning for Classification with Dirichlet Process Priors
19 0.01249539 28 jmlr-2007-Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data
20 0.012243835 42 jmlr-2007-Infinitely Imbalanced Logistic Regression
simIndex simValue paperId paperTitle
same-paper 1 0.96952069 39 jmlr-2007-Handling Missing Values when Applying Classification Models
Author: Maytal Saar-Tsechansky, Foster Provost
Abstract: Much work has studied the effect of different treatments of missing values on model induction, but little work has analyzed treatments for the common case of missing values at prediction time. This paper first compares several different methods—predictive value imputation, the distribution-based imputation used by C4.5, and using reduced models—for applying classification trees to instances with missing values (and also shows evidence that the results generalize to bagged trees and to logistic regression). The results show that for the two most popular treatments, each is preferable under different conditions. Strikingly, the reduced-models approach, seldom mentioned or used, consistently outperforms the other two methods, sometimes by a large margin. The lack of attention to reduced modeling may be due in part to its (perceived) expense in terms of computation or storage. Therefore, we then introduce and evaluate alternative, hybrid approaches that allow users to balance between more accurate but computationally expensive reduced modeling and the other, less accurate but less computationally expensive treatments. The results show that the hybrid methods can scale gracefully to the amount of investment in computation/storage, and that they outperform imputation even for small investments. Keywords: missing data, classification, classification trees, decision trees, imputation
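The reduced-models treatment contrasted with imputation in this abstract can be illustrated with a minimal sketch: fit one classifier per pattern of observed attributes, so prediction never requires a value for a missing attribute. A toy nearest-class-mean classifier stands in here for the paper's classification trees; the function names are illustrative, not the authors' implementation.

```python
def class_means(X, y, feats):
    """Per-class means restricted to the observed feature subset."""
    rows = {}
    for xi, yi in zip(X, y):
        rows.setdefault(yi, []).append([xi[f] for f in feats])
    return {c: [sum(col) / len(col) for col in zip(*vals)]
            for c, vals in rows.items()}

def predict_reduced(X_train, y_train, x_partial):
    """'Reduced model' prediction: train only on the attributes that are
    actually observed in x_partial (None marks a missing value)."""
    feats = [i for i, v in enumerate(x_partial) if v is not None]
    means = class_means(X_train, y_train, feats)   # model for this pattern
    obs = [x_partial[f] for f in feats]
    return min(means, key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(obs, means[c])))
```

Predictive value imputation would instead estimate the missing entries first and apply a single full model; the reduced-models sketch sidesteps that estimate entirely.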
2 0.21333784 11 jmlr-2007-Anytime Learning of Decision Trees
Author: Saher Esmeir, Shaul Markovitch
Abstract: The majority of existing algorithms for learning decision trees are greedy—a tree is induced top-down, making locally optimal decisions at each node. In most cases, however, the constructed tree is not globally optimal. Even the few non-greedy learners cannot learn good trees when the concept is difficult. Furthermore, they require a fixed amount of time and are not able to generate a better tree if additional time is available. We introduce a framework for anytime induction of decision trees that overcomes these problems by trading computation speed for better tree quality. Our proposed family of algorithms employs a novel strategy for evaluating candidate splits. A biased sampling of the space of consistent trees rooted at an attribute is used to estimate the size of the minimal tree under that attribute, and an attribute with the smallest expected tree is selected. We present two types of anytime induction algorithms: a contract algorithm that determines the sample size on the basis of a pre-given allocation of time, and an interruptible algorithm that starts with a greedy tree and continuously improves subtrees by additional sampling. Experimental results indicate that, for several hard concepts, our proposed approach exhibits good anytime behavior and yields significantly better decision trees when more time is available. Keywords: anytime algorithms, decision tree induction, lookahead, hard concepts, resource-bounded reasoning
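The interruptible anytime pattern this abstract describes can be sketched generically: start from a greedy solution and keep sampling candidate improvements while budget remains, so a best-so-far answer is always available. In this sketch, `neighbor` and `cost` are illustrative stand-ins for the paper's subtree resampling and tree size; this is not the authors' algorithm.

```python
import random

def interruptible_search(initial, neighbor, cost, budget, seed=0):
    """Interruptible anytime loop: repeatedly sample a candidate near the
    current best and keep it only if it strictly improves the cost."""
    rng = random.Random(seed)
    best = initial
    for _ in range(budget):          # one unit of work per iteration
        cand = neighbor(best, rng)
        if cost(cand) < cost(best):  # keep strict improvements only
            best = cand
    return best
```

A contract variant would instead pick the sample size up front from a pre-given time allocation, then run exactly that many evaluations.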
3 0.18880309 59 jmlr-2007-Nonlinear Boosting Projections for Ensemble Construction
Author: Nicolás García-Pedrajas, César García-Osorio, Colin Fyfe
Abstract: In this paper we propose a novel approach for ensemble construction based on the use of nonlinear projections to achieve both accuracy and diversity of individual classifiers. The proposed approach combines the philosophy of boosting, putting more effort on difficult instances, with the basis of the random subspace method. Our main contribution is that instead of using a random subspace, we construct a projection taking into account the instances which have posed most difficulties to previous classifiers. In this way, consecutive nonlinear projections are created by a neural network trained using only incorrectly classified instances. The feature subspace induced by the hidden layer of this network is used as the input space to a new classifier. The method is compared with bagging and boosting techniques, showing an improved performance on a large set of 44 problems from the UCI Machine Learning Repository. An additional study showed that the proposed approach is less sensitive to noise in the data than boosting methods. Keywords: classifier ensembles, boosting, neural networks, nonlinear projections
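The core step above—training a small network only on previously misclassified instances and using its hidden layer as a new feature space—can be sketched as follows. This is a minimal NumPy illustration with a one-hidden-layer net and plain gradient descent; the hyperparameters and function name are assumptions, not the authors' setup.

```python
import numpy as np

def boosted_projection(X, y, wrong_idx, hidden=3, steps=200, lr=0.1, seed=0):
    """Train a tiny tanh-hidden-layer net ONLY on the misclassified
    instances (wrong_idx), then project ALL instances through its
    hidden layer to obtain the new nonlinear feature space."""
    rng = np.random.default_rng(seed)
    Xw, yw = X[wrong_idx], y[wrong_idx]
    W1 = rng.normal(scale=0.5, size=(X.shape[1], hidden))
    w2 = rng.normal(scale=0.5, size=hidden)
    for _ in range(steps):                 # gradient descent on MSE loss
        h = np.tanh(Xw @ W1)
        err = h @ w2 - yw
        w2 -= lr * h.T @ err / len(yw)
        W1 -= lr * Xw.T @ (np.outer(err, w2) * (1 - h ** 2)) / len(yw)
    return np.tanh(X @ W1)                 # hidden layer = projected features
```

The projected features would then be fed to the next classifier in the ensemble, in place of the random subspace used by the random subspace method.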
4 0.10270997 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study
Author: Sofus A. Macskassy, Foster Provost
Abstract: This paper is about classifying entities that are interlinked with entities for which the class is known. After surveying prior work, we present NetKit, a modular toolkit for classification in networked data, and a case study of its application to networked data used in prior machine learning research. NetKit is based on a node-centric framework in which classifiers comprise a local classifier, a relational classifier, and a collective inference procedure. Various existing node-centric relational learning algorithms can be instantiated with appropriate choices for these components, and new combinations of components realize new algorithms. The case study focuses on univariate network classification, for which the only information used is the structure of class linkage in the network (i.e., only links and some class labels). To our knowledge, no work previously has evaluated systematically the power of class-linkage alone for classification in machine learning benchmark data sets. The results demonstrate that very simple network-classification models perform quite well—well enough that they should be used regularly as baseline classifiers for studies of learning with networked data. The simplest method (which performs remarkably well) highlights the close correspondence between several existing methods introduced for different purposes—that is, Gaussian-field classifiers, Hopfield networks, and relational-neighbor classifiers. The case study also shows that there are two sets of techniques that are preferable in different situations, namely when few versus many labels are known initially. We also demonstrate that link selection plays an important role similar to traditional feature selection. Keywords: relational learning, network learning, collective inference, collective classification, networked data, probabilistic relational models, network analysis, network data
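The relational-neighbor idea highlighted above—the simplest method in the case study—amounts to a weighted vote over a node's already-labeled neighbors. A minimal sketch, assuming a simple adjacency-list representation (the data structures and function name are illustrative):

```python
def wvrn(labels, links, node):
    """Weighted-vote relational-neighbor sketch: score each class by the
    link-weighted share of labeled neighbors carrying that class.
    labels: node -> class (or None if unknown); links: node -> [(nbr, w)]."""
    votes = {}
    for nbr, w in links[node]:
        if labels.get(nbr) is not None:        # only labeled neighbors vote
            votes[labels[nbr]] = votes.get(labels[nbr], 0.0) + w
    total = sum(votes.values())
    return {c: v / total for c, v in votes.items()} if total else {}
```

Collective inference would then iterate such local estimates over the whole network until the unlabeled nodes' class distributions stabilize.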
5 0.10172616 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models
Author: Tapani Raiko, Harri Valpola, Markus Harva, Juha Karhunen
Abstract: We introduce standardised building blocks designed to be used with variational Bayesian learning. The blocks include Gaussian variables, summation, multiplication, nonlinearity, and delay. A large variety of latent variable models can be constructed from these blocks, including nonlinear and variance models, which are lacking from most existing variational systems. The introduced blocks are designed to fit together and to yield efficient update rules. Practical implementation of various models is easy thanks to an associated software package which derives the learning formulas automatically once a specific model structure has been fixed. Variational Bayesian learning provides a cost function which is used both for updating the variables of the model and for optimising the model structure. All the computations can be carried out locally, resulting in linear computational complexity. We present experimental results on several structures, including a new hierarchical nonlinear model for variances and means. The test results demonstrate the good performance and usefulness of the introduced method. Keywords: latent variable models, variational Bayesian learning, graphical models, building blocks, Bayesian modelling, local computation
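The variational Bayesian cost function referred to above can be stated in its standard general form (the textbook definition of the variational free energy over an approximating posterior $q(\theta)$, not the package's derived per-block update rules):

```latex
% Variational free energy: the cost minimized in variational Bayesian
% learning; minimizing it over q tightens the bound on -\ln p(X).
\mathcal{C}(q) \;=\; \mathbb{E}_{q(\theta)}\!\left[\ln \frac{q(\theta)}{p(X,\theta)}\right]
\;=\; D_{\mathrm{KL}}\!\bigl(q(\theta)\,\|\,p(\theta \mid X)\bigr) \;-\; \ln p(X)
```

Because the same quantity scores both the variables and the model structure, it serves the dual role the abstract describes.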
6 0.1015745 84 jmlr-2007-The Pyramid Match Kernel: Efficient Learning with Sets of Features
7 0.099511325 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition
9 0.082771994 40 jmlr-2007-Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization
10 0.082482956 53 jmlr-2007-Maximum Entropy Density Estimation with Generalized Regularization and an Application to Species Distribution Modeling
11 0.07752046 22 jmlr-2007-Compression-Based Averaging of Selective Naive Bayes Classifiers (Special Topic on Model Selection)
12 0.075290747 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
13 0.074826673 29 jmlr-2007-Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts
14 0.073157758 1 jmlr-2007-"Ideal Parent" Structure Learning for Continuous Variable Bayesian Networks
15 0.072501078 30 jmlr-2007-Dynamics and Generalization Ability of LVQ Algorithms
16 0.071887836 52 jmlr-2007-Margin Trees for High-dimensional Classification
17 0.068027787 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm
18 0.066434719 28 jmlr-2007-Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data
19 0.065909706 72 jmlr-2007-Relational Dependency Networks
20 0.062640369 56 jmlr-2007-Multi-Task Learning for Classification with Dirichlet Process Priors
topicId topicWeight
[(8, 0.032), (10, 0.025), (12, 0.025), (15, 0.039), (18, 0.022), (22, 0.012), (28, 0.042), (40, 0.032), (43, 0.382), (45, 0.021), (48, 0.044), (60, 0.018), (63, 0.02), (80, 0.057), (85, 0.052), (98, 0.063)]
simIndex simValue paperId paperTitle
same-paper 1 0.68199825 39 jmlr-2007-Handling Missing Values when Applying Classification Models
Author: Maytal Saar-Tsechansky, Foster Provost
Abstract: Much work has studied the effect of different treatments of missing values on model induction, but little work has analyzed treatments for the common case of missing values at prediction time. This paper first compares several different methods—predictive value imputation, the distribution-based imputation used by C4.5, and using reduced models—for applying classification trees to instances with missing values (and also shows evidence that the results generalize to bagged trees and to logistic regression). The results show that for the two most popular treatments, each is preferable under different conditions. Strikingly, the reduced-models approach, seldom mentioned or used, consistently outperforms the other two methods, sometimes by a large margin. The lack of attention to reduced modeling may be due in part to its (perceived) expense in terms of computation or storage. Therefore, we then introduce and evaluate alternative, hybrid approaches that allow users to balance between more accurate but computationally expensive reduced modeling and the other, less accurate but less computationally expensive treatments. The results show that the hybrid methods can scale gracefully to the amount of investment in computation/storage, and that they outperform imputation even for small investments. Keywords: missing data, classification, classification trees, decision trees, imputation
2 0.29416591 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study
Author: Sofus A. Macskassy, Foster Provost
Abstract: This paper is about classifying entities that are interlinked with entities for which the class is known. After surveying prior work, we present NetKit, a modular toolkit for classification in networked data, and a case study of its application to networked data used in prior machine learning research. NetKit is based on a node-centric framework in which classifiers comprise a local classifier, a relational classifier, and a collective inference procedure. Various existing node-centric relational learning algorithms can be instantiated with appropriate choices for these components, and new combinations of components realize new algorithms. The case study focuses on univariate network classification, for which the only information used is the structure of class linkage in the network (i.e., only links and some class labels). To our knowledge, no work previously has evaluated systematically the power of class-linkage alone for classification in machine learning benchmark data sets. The results demonstrate that very simple network-classification models perform quite well—well enough that they should be used regularly as baseline classifiers for studies of learning with networked data. The simplest method (which performs remarkably well) highlights the close correspondence between several existing methods introduced for different purposes—that is, Gaussian-field classifiers, Hopfield networks, and relational-neighbor classifiers. The case study also shows that there are two sets of techniques that are preferable in different situations, namely when few versus many labels are known initially. We also demonstrate that link selection plays an important role similar to traditional feature selection. Keywords: relational learning, network learning, collective inference, collective classification, networked data, probabilistic relational models, network analysis, network data
3 0.28936136 11 jmlr-2007-Anytime Learning of Decision Trees
Author: Saher Esmeir, Shaul Markovitch
Abstract: The majority of existing algorithms for learning decision trees are greedy—a tree is induced top-down, making locally optimal decisions at each node. In most cases, however, the constructed tree is not globally optimal. Even the few non-greedy learners cannot learn good trees when the concept is difficult. Furthermore, they require a fixed amount of time and are not able to generate a better tree if additional time is available. We introduce a framework for anytime induction of decision trees that overcomes these problems by trading computation speed for better tree quality. Our proposed family of algorithms employs a novel strategy for evaluating candidate splits. A biased sampling of the space of consistent trees rooted at an attribute is used to estimate the size of the minimal tree under that attribute, and an attribute with the smallest expected tree is selected. We present two types of anytime induction algorithms: a contract algorithm that determines the sample size on the basis of a pre-given allocation of time, and an interruptible algorithm that starts with a greedy tree and continuously improves subtrees by additional sampling. Experimental results indicate that, for several hard concepts, our proposed approach exhibits good anytime behavior and yields significantly better decision trees when more time is available. Keywords: anytime algorithms, decision tree induction, lookahead, hard concepts, resource-bounded reasoning
4 0.28760731 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
Author: Wei Pan, Xiaotong Shen
Abstract: Variable selection in clustering analysis is both challenging and important. In the context of model-based clustering analysis with a common diagonal covariance matrix, which is especially suitable for “high dimension, low sample size” settings, we propose a penalized likelihood approach with an L1 penalty function, automatically realizing variable selection via thresholding and delivering a sparse solution. We derive an EM algorithm to fit our proposed model, and propose a modified BIC as a model selection criterion to choose the number of components and the penalization parameter. A simulation study and an application to gene function prediction with gene expression profiles demonstrate the utility of our method. Keywords: BIC, EM, mixture model, penalized likelihood, soft-thresholding, shrinkage
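The soft-thresholding that realizes the variable selection above can be written as a one-line operator. This is the standard form of the L1 shrinkage step, shown for illustration rather than as the authors' specific EM update:

```python
def soft_threshold(x, lam):
    """L1 soft-thresholding: shrink x toward zero by lam, and map values
    within [-lam, lam] exactly to zero (this is what yields sparsity)."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0
```

Applied to a cluster-mean estimate, a coordinate thresholded to zero in every component marks that variable as non-informative for clustering.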
5 0.28601313 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation
Author: Guy Lebanon, Yi Mao, Joshua Dillon
Abstract: The popular bag of words assumption represents a document as a histogram of word occurrences. While computationally efficient, such a representation is unable to maintain any sequential information. We present an effective sequential document representation that goes beyond the bag of words representation and its n-gram extensions. This representation uses local smoothing to embed documents as smooth curves in the multinomial simplex thereby preserving valuable sequential information. In contrast to bag of words or n-grams, the new representation is able to robustly capture medium and long range sequential trends in the document. We discuss the representation and its geometric properties and demonstrate its applicability for various text processing tasks. Keywords: text processing, local smoothing
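The local smoothing described above can be sketched as a position-dependent word histogram: each occurrence is weighted by a kernel of its distance from the query position, then normalized so the result lies in the multinomial simplex. A minimal sketch with a Gaussian kernel; the function name and parameters are illustrative, not the paper's exact construction.

```python
import math

def local_histogram(doc, t, sigma=2.0):
    """Locally weighted word histogram at position t: weight each token
    by a Gaussian kernel of |i - t|, then normalize to sum to one."""
    weights = {}
    for i, w in enumerate(doc):
        k = math.exp(-((i - t) ** 2) / (2.0 * sigma ** 2))
        weights[w] = weights.get(w, 0.0) + k
    z = sum(weights.values())
    return {w: v / z for w, v in weights.items()}
```

Sweeping t over the document traces the smooth curve in the simplex that the representation uses, whereas a plain bag of words collapses all positions into one histogram.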
6 0.28451392 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data
7 0.28406316 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
8 0.27653682 72 jmlr-2007-Relational Dependency Networks
9 0.27293983 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition
10 0.27263218 59 jmlr-2007-Nonlinear Boosting Projections for Ensemble Construction
11 0.27147666 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
12 0.27114812 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models
13 0.26734519 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm
14 0.26719812 26 jmlr-2007-Dimensionality Reduction of Multimodal Labeled Data by Local Fisher Discriminant Analysis
15 0.26509482 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
16 0.26474011 28 jmlr-2007-Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data
17 0.26359531 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
18 0.26191127 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis
19 0.2618534 88 jmlr-2007-Unlabeled Compression Schemes for Maximum Classes