jmlr jmlr2007 jmlr2007-52 knowledge-graph by maker-knowledge-mining

52 jmlr-2007-Margin Trees for High-dimensional Classification


Source: pdf

Author: Robert Tibshirani, Trevor Hastie

Abstract: We propose a method for the classification of more than two classes, from high-dimensional features. Our approach is to build a binary decision tree in a top-down manner, using the optimal margin classifier at each split. We implement an exact greedy algorithm for this task, and compare its performance to less greedy procedures based on clustering of the matrix of pairwise margins. We compare the performance of the “margin tree” to the closely related “all-pairs” (one versus one) support vector machine, and nearest centroids on a number of cancer microarray data sets. We also develop a simple method for feature selection. We find that the margin tree has accuracy that is competitive with other methods and offers additional interpretability in its putative grouping of the classes. Keywords: maximum margin classifier, support vector machine, decision tree, CART

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Our approach is to build a binary decision tree in a top-down manner, using the optimal margin classifier at each split. [sent-6, score-0.707]

2 We find that the margin tree has accuracy that is competitive with other methods and offers additional interpretability in its putative grouping of the classes. [sent-10, score-0.752]

3 Keywords: maximum margin classifier, support vector machine, decision tree, CART. [sent-11, score-0.398]

4 When the number of classes K is greater than two, maximum margin classifiers do not generalize easily. [sent-14, score-0.469]

5 These latter proposals have a nice generalization of the maximum margin property of the two class support vector classifier. [sent-18, score-0.398]

6 In this paper we propose a tree-based maximum margin classifier. [sent-23, score-0.398]

7 We then focus just on classes 2 and 3, and their maximum margin classifier is shown in the bottom left. [sent-30, score-0.531]

8 We employ strategies like this for larger numbers of classes, producing a binary decision tree with a maximum margin classifier at each junction in the tree. [sent-32, score-0.766]

9 In Section 2 we give details of the margin tree classifier. [sent-33, score-0.707]

10 Section 3 shows the application of the margin tree to a number of cancer microarray data sets. [sent-34, score-0.929]

11 For construction of the tree, all of the classifiers in the margin tree use all of the features (genes). [sent-35, score-0.707]

12 Having settled on the partition, we use the maximum margin classifier between the two groups of classes, for future predictions. [sent-60, score-0.45]
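The construction and prediction scheme described in sentences 8–12 can be summarized in a short outline. This is a hedged sketch only: best_partition and two_group_svm are hypothetical helpers standing in for the split-selection strategies and the maximal margin classifier discussed further down this list, and the sign convention of the decision function (positive side = G1) is an assumption, not something stated in the paper.

```python
# Hedged outline of the margin tree idea: split the classes recursively with a
# maximal margin classifier at each junction, then predict by traversing the tree.
# best_partition() and two_group_svm() are hypothetical helpers, not the authors' code.
import numpy as np

def build_margin_tree(X, y, classes):
    if len(classes) == 1:
        return {"leaf": classes[0]}
    G1, G2 = best_partition(X, y, classes)      # greedy, single or complete linkage
    clf = two_group_svm(X, y, G1, G2)           # maximal margin classifier: G1 vs G2
    in1, in2 = np.isin(y, list(G1)), np.isin(y, list(G2))
    return {"clf": clf,
            "left": build_margin_tree(X[in1], y[in1], G1),
            "right": build_margin_tree(X[in2], y[in2], G2)}

def predict(tree, x):
    while "leaf" not in tree:
        side = tree["clf"].decision_function(x.reshape(1, -1))[0]
        tree = tree["left"] if side > 0 else tree["right"]
    return tree["leaf"]
```

A call might look like tree = build_margin_tree(X, y, list(np.unique(y))), after which predict(tree, x_new) returns a class label by traversing the tree from the root.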

13 Let M( j, k) be the maximum margin between classes j and k. [sent-61, score-0.469]

14 Also, let G1, G2 be groups of classes, and let M(G1, G2) denote the maximum margin between the groups. [sent-62, score-0.45]

15 That is, M(G1, G2) is the maximum margin between two hyper-classes: all classes in G1 and all classes in G2. [sent-63, score-0.54]
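As a concrete, hedged illustration of the pairwise margin matrix M(j, k): the sketch below fits a linear SVM with a very large cost parameter for each pair of classes (in line with the "large value for the cost parameter" description later in this list) and records the geometric margin 2/||w||. The function name and the value C = 1e6 are assumptions for illustration, not the authors' code.

```python
# Minimal sketch of the pairwise margin matrix M(j, k); not the authors' code.
# A very large C approximates the hard (maximal) margin classifier, and the
# geometric margin of a separating linear SVM is 2 / ||w||.
import numpy as np
from sklearn.svm import SVC

def pairwise_margin_matrix(X, y):
    classes = np.unique(y)                       # class order is np.unique(y)
    K = len(classes)
    M = np.zeros((K, K))
    for a in range(K):
        for b in range(a + 1, K):
            mask = np.isin(y, [classes[a], classes[b]])
            svm = SVC(kernel="linear", C=1e6).fit(X[mask], y[mask])
            w = svm.coef_.ravel()
            M[a, b] = M[b, a] = 2.0 / np.linalg.norm(w)
    return M
```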

16 (b) Single linkage: Find the partition P yielding the largest margin M0 so that min M(j1, j2) ≤ M0 for j1, j2 ∈ Gk, k = 1, 2 and min M(j1, j2) ≥ M0 for j1 ∈ G1, j2 ∈ G2. [sent-66, score-0.452]

17 (c) Complete linkage: Find the partition P yielding the largest margin M0 so that max M(j1, j2) ≤ M0 for j1, j2 ∈ Gk, k = 1, 2 and max M(j1, j2) ≥ M0 for j1 ∈ G1, j2 ∈ G2. [sent-67, score-0.452]

18 The greedy method finds the partition that maximizes the resulting margin over all possible partitions. [sent-68, score-0.572]

19 Figure 1: Simple illustration of a margin tree (panel label: "Margin tree"; horizontal axis: x1). [sent-77, score-0.707]

20 The largest margin is between class 1 and (2,3), with the optimal classifier shown on the top right. [sent-79, score-0.447]

21 These top-down splits are summarized in the margin tree in the bottom right. [sent-81, score-0.856]

22 Single linkage clustering successively merges groups based on the minimum distance between any pair of items, one from each group. [sent-83, score-0.559]

23 Complete linkage clustering does the same, but using the maximum distance. [sent-84, score-0.532]

24 Now having built a clustering tree bottom-up, we can interpret each split in the tree in a top-down manner, and that is how criteria (b) and (c) above were derived. [sent-85, score-0.81]

25 In particular it is easy to see that the single and complete linkage problems are solved by single and complete linkage agglomerative clustering, respectively, applied to the margin matrix M(j1, j2). [sent-86, score-1.449]

26 Note that we are applying single or complete linkage clustering to the classes of objects C j , while one usually applies clustering to individual objects. [sent-87, score-0.717]
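Given the margin matrix, criteria (b) and (c) amount to off-the-shelf agglomerative clustering of the classes, with the margins playing the role of distances. A minimal SciPy sketch follows; it assumes the pairwise_margin_matrix sketch above, and nothing here is taken from the authors' code.

```python
# Hedged sketch: single and complete linkage clustering of the classes,
# applied to the margin matrix rather than to individual observations.
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, cut_tree

M = pairwise_margin_matrix(X, y)          # margin matrix from the sketch above (assumed)
condensed = squareform(M, checks=False)   # linkage() expects a condensed distance matrix
single_tree = linkage(condensed, method="single")
complete_tree = linkage(condensed, method="complete")

# Cutting the complete linkage tree into two groups gives a candidate
# top-level partition of the classes; merge heights play the role of margins.
top_partition = cut_tree(complete_tree, n_clusters=2).ravel()
```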

27 Both the single linkage and complete linkage methods take into account both the between and within partition distances. [sent-90, score-1.044]

28 We will also see in the next section that the complete linkage method can be viewed as an approximation to the greedy search. [sent-91, score-0.684]

29 Figure 2 shows a toy example that illustrates the difference between the greedy and complete linkage algorithms. [sent-92, score-0.684]

30 The complete linkage algorithm in the bottom right panel instead groups 1, 2, and 3 together and 4, 5, and 6 together. [sent-96, score-0.743]

31 The complete linkage tree is more balanced and hence may be more useful biologically. [sent-97, score-0.872]

32 In the experiments in this paper we find that: • All three methods produce about the same test set accuracy, and about the same as the all-pairs maximum margin classifier. [sent-98, score-0.423]

33 • The complete linkage approach gives more balanced trees that may be more interpretable than those from the other two methods; the single linkage and greedy methods tend to produce long stringy trees that usually split off one class at a time at each branch. [sent-99, score-1.298]

34 The complete linkage method is also considerably faster to compute than the greedy method. [sent-100, score-0.684]

35 Thus the complete linkage margin tree emerges as our method of choice. [sent-101, score-1.245]

36 It requires computation of K(K − 1)/2 support vector classifiers, one for each pair of classes, for the complete linkage clustering, and then, for the final tree, one computation of a support vector classifier for each node in the tree (at most K and typically ≈ log2(K) classifiers). [sent-102, score-1.051]
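As a quick worked check of this count (for the 14-class data set used later; the numbers below are illustrative only, not taken from the paper):

```latex
\binom{K}{2}\Big|_{K=14} \;=\; \frac{14\cdot 13}{2} \;=\; 91 \ \text{pairwise classifiers for the clustering},
\qquad
\text{at most } K=14 \ \text{node classifiers in the final tree}, \quad \log_2(14)\approx 3.8 .
```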

37 M(G1, G2) ≤ min { M(j, k) : j ∈ G1, k ∈ G2 }.  (1) That is, the margin between two groups of classes is less than or equal to the smallest margin between any pair of classes, one chosen from each group. [sent-105, score-0.869]

38 Rather than enumerate all possible partitions (and their associated maximum margin classifiers), we can speed up the computation by constructing the complete linkage clustering tree, and collapsing all nodes at height M0. [sent-107, score-1.281]

39 We know that all classes in any collapsed node must be on the same side of the decision plane, since each class in the node has margin at most M0 with the other classes in that node. [sent-108, score-0.558]

40 Figure 2: A toy example illustrating the difference between the greedy and complete linkage algorithms. [sent-128, score-0.684]

41 The complete linkage algorithm (bottom right panel) instead groups 1, 2, and 3 together, and 4, 5, and 6 together. [sent-132, score-0.59]

42 For example, the margin between classes 1 and 2 is 0. [sent-133, score-0.444]

43 The height in each plot is the margin corresponding to each join. [sent-136, score-0.583]

44 Construct the complete linkage clustering tree based on the margin matrix M(j1, j2). [sent-140, score-1.299]

45 Starting with all classes at the top of the tree, find the partition of each individual class versus the rest, and also the partition that produces two classes in the complete linkage tree (that is, make a horizontal cut in the tree to produce two classes). [sent-142, score-1.534]

46 Let M0 be the largest margin achieved amongst all of these competitors. [sent-143, score-0.426]

47 Cut the complete linkage tree at height M0 , and collapse all nodes at that height. [sent-145, score-1.11]

48 Consider all partitions of all classes that keep the collapsed nodes intact, and choose the one that gives maximal margin M̂. [sent-147, score-0.585]

49 We then apply this procedure in a top-down recursive manner, until the entire margin tree is grown. [sent-149, score-0.732]

50 This algorithm is exact in that it finds the best split at each node in the top-down tree building process. [sent-150, score-0.503]

51 This is because the best greedy split must be among the candidates considered in step 4, since as mentioned above, all classes in a collapsed node must be on the same side of the decision plane. [sent-151, score-0.419]

52 Note that if approximation (1) is an equality, then the complete linkage tree is itself the greedy margin classifier solution. [sent-153, score-1.391]
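A hedged sketch of the split search in steps 4–5: enumerate only the partitions that keep the collapsed nodes intact, and keep the one with the largest hyper-class margin. The nodes structure and the group_margin helper are illustrative assumptions, not the authors' implementation.

```python
# Illustrative split search over partitions that keep collapsed nodes intact.
# nodes: list of lists of class labels (the complete linkage tree collapsed at
# height M0); group_margin(G1, G2): margin of the maximal margin classifier
# between hyper-classes G1 and G2 (assumed helper).
from itertools import combinations

def best_split(nodes, group_margin):
    best_partition, best_margin = None, float("-inf")
    n = len(nodes)
    for r in range(1, n):                         # size of the left-hand side
        for left in combinations(range(n), r):
            G1 = [c for i in left for c in nodes[i]]
            G2 = [c for i in range(n) if i not in left for c in nodes[i]]
            m = group_margin(G1, G2)              # each partition is visited twice
            if m > best_margin:                   # (once per orientation); harmless
                best_partition, best_margin = (G1, G2), m
    return best_partition, best_margin
```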

53 We cut the complete linkage tree to produce two nodes (5,4,6) and (1,2,3). [sent-156, score-0.932]

54 We compute the achieved margin for this split and also the margin for partitions (1) vs. [sent-157, score-0.914]

55 We find that the largest margin corresponds to (1) vs. [sent-160, score-0.399]

56 (3,4,5,6), (3) vs. (2,4,5,6), etc., as well as the complete linkage split (2,3) vs. (4,5,6). [sent-164, score-0.626]

57 The largest margin is achieved by the latter, so we make that split and continue the process. [sent-165, score-0.514]

58 The length of each (non-terminal) arm corresponds to the margin that is achieved by the classifier at that split. [sent-171, score-0.4]

59 We note that the greedy and single linkage margin trees are “stringy”, with each partition separating off just one class in most cases. [sent-179, score-1.359]

60 The complete linkage tree is more balanced, producing some potentially useful subgroupings of the cancer classes. [sent-180, score-1.011]

61 In this example, full enumeration of the partitions at each node would have required computation of 16,382 two class maximum margin classifiers. [sent-181, score-0.505]

62 In general the cost savings can vary, depending on the height M0 of the initial cut in the complete linkage tree. [sent-183, score-0.78]

63 The largest margin achieved is about 49,000, corresponding to the split between class CNS and Colorectal and so on. [sent-186, score-0.514]

64 This is larger than the margin between Leukemia and the rest at the top of the greedy tree. [sent-188, score-0.567]

65 This shows that the greediness of the exact algorithm can hurt its overall performance in finding large margin splits. [sent-189, score-0.4]

66 Table 1: Number of disagreements on the test set, for different margin tree-building methods. [sent-194, score-0.451]
                 SVM(All pairs)   MT(Greedy)   MT(Single)   MT(Complete)
SVM(All pairs)          0             10           11            10
MT(Greedy)             10              0            7             2
MT(Single)             11              7            0             9
MT(Complete)           10              2            9             0

67 Figure 5 shows the test errors at each node of the complete linkage tree, for the 14 tumor data set. [sent-198, score-0.724]

68 We see that for problems involving more than 4 or 5 classes, the one-versus-one support vector classifier and the margin tree methods sometimes offer an advantage over nearest centroids. [sent-208, score-0.766]

69 The margin tree methods are all very similar to each other and the one-versus-one support vector classifier. [sent-209, score-0.707]

70 Data sets used (source column truncated; only the first entry is recoverable): [sent-210, score-0.636]
Name                           # Classes   # Samples   # Features   Source
Brain                              5           22         5597      Pomeroy et al.
Lymphoma                           3           62         4026
Small round blue cell tumors       4           63         2308
Stanford                          14          261         4718
9 tumors                           9           60         5726
11 tumors                         11          174       12,533
14 tumors                         14          198       16,063

71 Error-rate table layout: rows are the data sets (Brain, Lymphoma, SRBCT, Stanford, 9 tumors, 11 tumors, 14 tumors); the first method column is Nearest centroids (values truncated). [sent-218, score-0.548]

72 SVM (OVO) is the support vector machine, using the one-versus-one approach; each pairwise classifier uses a large value for the cost parameter, to yield the maximal margin classifier; MT are the margin tree methods, with different tree-building strategies. [sent-289, score-1.08]

73 Feature Selection: The classifiers at each junction of the margin tree use all of the features (genes). [sent-291, score-0.741]

74 Then to form a reduced classifier, we simply set to zero the first nk coefficients at split k in the margin tree. [sent-296, score-0.496]

75 For example we might be able to use fewer genes near the top of the tree, where the margins between the classes are largest. [sent-300, score-0.431]

76 We compute reduced classifiers at each tree split, for a range of values of nk , and for each, the proportion of the full margin achieved by the classifier. [sent-302, score-0.81]

77 Then we use a common value α for the margin proportion throughout the tree. [sent-303, score-0.414]
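A minimal sketch of this hard-thresholding idea, under the hedged reading that the smallest coefficients in absolute value are zeroed so that n_k features remain at a split; the projection-gap margin used here is an illustrative proxy for the paper's margin proportion, not its exact definition.

```python
# Hedged sketch of feature reduction at one split: zero all but the n_k
# largest-|w| coefficients, then measure what fraction of the full margin
# survives, assuming the two groups X1, X2 are linearly separable along w.
import numpy as np

def hard_threshold(w, n_k):
    w_red = np.zeros_like(w)
    keep = np.argsort(np.abs(w))[-n_k:]          # indices of the n_k largest coefficients
    w_red[keep] = w[keep]
    return w_red

def margin_proportion(w_full, n_k, X1, X2):
    """Fraction of the full margin retained by the reduced weight vector."""
    def gap(w):
        p1, p2 = X1 @ w, X2 @ w                  # project both groups onto w
        return (p1.min() - p2.max()) / np.linalg.norm(w)
    return gap(hard_threshold(w_full, n_k)) / gap(w_full)
```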

78 The plot shows the test error as the margin proportion α is varied. [sent-307, score-0.439]

79 The average number of genes at each of the splits is also shown. Figure 6: 14 tumor data set: Test errors for reduced numbers of genes (x-axis: mean number of genes; y-axis: error). [sent-308, score-0.579]

80 With no feature selection the margin tree (left panel) achieves 4/61 errors, the same as the one-versus-one support vector machine. [sent-314, score-0.707]

81 The margin proportion is shown at the top of the plot. [sent-316, score-0.462]

82 The right panel shows the number of genes used as a function of the height of the split in the tree, for margin proportion 0. [sent-317, score-1.039]

83 For reasons of speed and interpretability, we do not recompute the maximum margin classifier in the subspace of the remaining features (we do however recompute the classifier cutpoint, equal to the midpoint between the classes). [sent-321, score-0.452]

84 For the top and bottom splits in the tree of Figure 7, Figure 8 shows the margins achieved by the maximum margin classifier (black points) and the approximation (blue points) as the number of genes is reduced. [sent-323, score-1.268]

85 The approximation gives margins remarkably close to the optimal margin until the number of genes drops below 100. [sent-324, score-0.685]

86 Figure 7: Results for the 11 tumor data set (middle panel x-axis: mean number of genes; right panel x-axis: height in tree). [sent-339, score-0.645]

87 The left panel shows the margin tree using complete linkage; the test errors from hard-thresholding are shown in the middle, with the margin proportion α indicated along the top of the plot. [sent-340, score-1.736]

88 For the tree using α = 0.6, the right panel shows the resulting number of genes at each split in the tree, as a function of the height of that split. [sent-341, score-0.625]

89 Results on Real Data Sets: Table 4 shows the results of applying the margin tree classifier (complete linkage) with feature selection, on the data sets described earlier. [sent-343, score-0.707]

90 We see that (a) hard thresholding generally improves upon the error rate of the full margin tree; (b) margin trees outperform nearest shrunken centroids on the whole, but not in every case. [sent-348, score-1.074]

91 In some cases, the number of genes used has dropped substantially; to get a smaller number of genes one could look more closely at the cross-validation curve, to check how quickly it was rising. [sent-349, score-0.472]

92 Top panel refers to the top split in the margin tree; bottom panel refers to the bottom split. [sent-359, score-0.815]

93 In particular, the best classifier with say 50 genes might have only a few genes in common with the best classifier with 100 genes. [sent-361, score-0.472]

94 Table layout: rows are the data sets (Brain, Lymphoma, SRBCT, Stanford, 9 tumors, 11 tumors, 14 tumors); columns under Nearest shrunken centroids are CV errors, Test errors, and # genes used (values truncated). [sent-364, score-0.938]

95 Table 4: CV and test error rates for nearest shrunken centroids and margin trees with feature selection by simple hard thresholding. [sent-446, score-0.683]

96 The construction of the margin tree could also be done with other kernels, using the support vector machine framework. [sent-452, score-0.707]

97 The greedy algorithm and linkage algorithms will work without change. [sent-453, score-0.599]

98 The nodes of the clustering tree will be impure, that is, they will contain observations from more than one class. [sent-458, score-0.416]

99 However the maximum margin classifier used here seems more natural and the overall performance of the margin tree is better. [sent-475, score-1.105]

100 Diagnosis of multiple cancer types by shrunken centroids of gene expression. [sent-718, score-0.406]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('linkage', 0.453), ('margin', 0.373), ('tree', 0.334), ('genes', 0.236), ('height', 0.21), ('tumors', 0.15), ('greedy', 0.146), ('ibshirani', 0.141), ('cancer', 0.139), ('mt', 0.122), ('astie', 0.12), ('centroids', 0.098), ('rees', 0.094), ('panel', 0.091), ('shrunken', 0.09), ('cns', 0.088), ('collerectal', 0.088), ('ovary', 0.088), ('pancreas', 0.088), ('split', 0.088), ('splits', 0.087), ('argin', 0.086), ('er', 0.085), ('complete', 0.085), ('microarray', 0.083), ('gene', 0.079), ('margins', 0.076), ('tumor', 0.075), ('bladder', 0.071), ('leuk', 0.071), ('lung', 0.071), ('lymp', 0.071), ('lymphoma', 0.071), ('melanoma', 0.071), ('renal', 0.071), ('uterus', 0.071), ('vmeso', 0.071), ('classes', 0.071), ('bottom', 0.062), ('tibshirani', 0.061), ('stanford', 0.061), ('rfe', 0.06), ('ramaswamy', 0.06), ('collapsed', 0.06), ('nearest', 0.059), ('centroid', 0.057), ('clustering', 0.054), ('node', 0.054), ('pros', 0.054), ('partition', 0.053), ('disagreements', 0.053), ('classi', 0.053), ('partitions', 0.053), ('groups', 0.052), ('top', 0.048), ('interpretability', 0.045), ('thresholding', 0.043), ('breast', 0.041), ('proportion', 0.041), ('angelo', 0.04), ('trees', 0.038), ('blue', 0.036), ('cv', 0.036), ('alizadeh', 0.035), ('khan', 0.035), ('munagala', 0.035), ('ovo', 0.035), ('srbct', 0.035), ('statnikov', 0.035), ('staunton', 0.035), ('stringy', 0.035), ('succession', 0.035), ('nk', 0.035), ('junction', 0.034), ('errors', 0.032), ('cut', 0.032), ('hastie', 0.031), ('tamayo', 0.03), ('tenfold', 0.03), ('investigator', 0.03), ('institutes', 0.03), ('pomeroy', 0.03), ('vural', 0.03), ('ers', 0.029), ('loh', 0.029), ('nodes', 0.028), ('achieved', 0.027), ('cart', 0.027), ('brain', 0.027), ('recompute', 0.027), ('su', 0.027), ('recomputed', 0.027), ('exact', 0.027), ('splitting', 0.026), ('elimination', 0.026), ('trevor', 0.026), ('largest', 0.026), ('test', 0.025), ('maximum', 0.025), ('recursive', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999934 52 jmlr-2007-Margin Trees for High-dimensional Classification

Author: Robert Tibshirani, Trevor Hastie

Abstract: We propose a method for the classification of more than two classes, from high-dimensional features. Our approach is to build a binary decision tree in a top-down manner, using the optimal margin classifier at each split. We implement an exact greedy algorithm for this task, and compare its performance to less greedy procedures based on clustering of the matrix of pairwise margins. We compare the performance of the “margin tree” to the closely related “all-pairs” (one versus one) support vector machine, and nearest centroids on a number of cancer microarray data sets. We also develop a simple method for feature selection. We find that the margin tree has accuracy that is competitive with other methods and offers additional interpretability in its putative grouping of the classes. Keywords: maximum margin classifier, support vector machine, decision tree, CART

2 0.14632705 11 jmlr-2007-Anytime Learning of Decision Trees

Author: Saher Esmeir, Shaul Markovitch

Abstract: The majority of existing algorithms for learning decision trees are greedy—a tree is induced topdown, making locally optimal decisions at each node. In most cases, however, the constructed tree is not globally optimal. Even the few non-greedy learners cannot learn good trees when the concept is difficult. Furthermore, they require a fixed amount of time and are not able to generate a better tree if additional time is available. We introduce a framework for anytime induction of decision trees that overcomes these problems by trading computation speed for better tree quality. Our proposed family of algorithms employs a novel strategy for evaluating candidate splits. A biased sampling of the space of consistent trees rooted at an attribute is used to estimate the size of the minimal tree under that attribute, and an attribute with the smallest expected tree is selected. We present two types of anytime induction algorithms: a contract algorithm that determines the sample size on the basis of a pre-given allocation of time, and an interruptible algorithm that starts with a greedy tree and continuously improves subtrees by additional sampling. Experimental results indicate that, for several hard concepts, our proposed approach exhibits good anytime behavior and yields significantly better decision trees when more time is available. Keywords: anytime algorithms, decision tree induction, lookahead, hard concepts, resource-bounded reasoning

3 0.12118074 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification

Author: Jia Li, Surajit Ray, Bruce G. Lindsay

Abstract: A new clustering approach based on mode identification is developed by applying new optimization techniques to a nonparametric density estimator. A cluster is formed by those sample points that ascend to the same local maximum (mode) of the density function. The path from a point to its associated mode is efficiently solved by an EM-style algorithm, namely, the Modal EM (MEM). This method is then extended for hierarchical clustering by recursively locating modes of kernel density estimators with increasing bandwidths. Without model fitting, the mode-based clustering yields a density description for every cluster, a major advantage of mixture-model-based clustering. Moreover, it ensures that every cluster corresponds to a bump of the density. The issue of diagnosing clustering results is also investigated. Specifically, a pairwise separability measure for clusters is defined using the ridgeline between the density bumps of two clusters. The ridgeline is solved for by the Ridgeline EM (REM) algorithm, an extension of MEM. Based upon this new measure, a cluster merging procedure is created to enforce strong separation. Experiments on simulated and real data demonstrate that the mode-based clustering approach tends to combine the strengths of linkage and mixture-model-based clustering. In addition, the approach is robust in high dimensions and when clusters deviate substantially from Gaussian distributions. Both of these cases pose difficulty for parametric mixture modeling. A C package on the new algorithms is developed for public access at http://www.stat.psu.edu/∼jiali/hmac. Keywords: modal clustering, mode-based clustering, mixture modeling, modal EM, ridgeline EM, nonparametric density

4 0.1190431 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection

Author: Wei Pan, Xiaotong Shen

Abstract: Variable selection in clustering analysis is both challenging and important. In the context of modelbased clustering analysis with a common diagonal covariance matrix, which is especially suitable for “high dimension, low sample size” settings, we propose a penalized likelihood approach with an L1 penalty function, automatically realizing variable selection via thresholding and delivering a sparse solution. We derive an EM algorithm to fit our proposed model, and propose a modified BIC as a model selection criterion to choose the number of components and the penalization parameter. A simulation study and an application to gene function prediction with gene expression profiles demonstrate the utility of our method. Keywords: BIC, EM, mixture model, penalized likelihood, soft-thresholding, shrinkage

5 0.096212409 79 jmlr-2007-Structure and Majority Classes in Decision Tree Learning

Author: Ray J. Hickey

Abstract: To provide good classification accuracy on unseen examples, a decision tree, learned by an algorithm such as ID3, must have sufficient structure and also identify the correct majority class in each of its leaves. If there are inadequacies in respect of either of these, the tree will have a percentage classification rate below that of the maximum possible for the domain, namely (100 Bayes error rate). An error decomposition is introduced which enables the relative contributions of deficiencies in structure and in incorrect determination of majority class to be isolated and quantified. A sub-decomposition of majority class error permits separation of the sampling error at the leaves from the possible bias introduced by the attribute selection method of the induction algorithm. It is shown that sampling error can extend to 25% when there are more than two classes. Decompositions are obtained from experiments on several data sets. For ID3, the effect of selection bias is shown to vary from being statistically non-significant to being quite substantial, with the latter appearing to be associated with a simple underlying model. Keywords: decision tree learning, error decomposition, majority classes, sampling error, attribute selection bias 1

6 0.091651231 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers     (Special Topic on Model Selection)

7 0.071455494 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm

8 0.067738406 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method

9 0.052895959 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition

10 0.050299682 48 jmlr-2007-Learning in Environments with Unknown Dynamics: Towards more Robust Concept Learners

11 0.048490785 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation

12 0.044432424 1 jmlr-2007-"Ideal Parent" Structure Learning for Continuous Variable Bayesian Networks

13 0.04270748 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time

14 0.04102546 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression

15 0.03815116 57 jmlr-2007-Multi-class Protein Classification Using Adaptive Codes

16 0.037864912 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study

17 0.036957815 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning

18 0.036432568 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression

19 0.036158357 59 jmlr-2007-Nonlinear Boosting Projections for Ensemble Construction

20 0.030356176 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction inl1Using Cauchy Random Projections


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.206), (1, 0.209), (2, -0.014), (3, 0.166), (4, 0.274), (5, -0.075), (6, 0.053), (7, 0.167), (8, -0.104), (9, 0.261), (10, 0.107), (11, 0.054), (12, 0.172), (13, 0.071), (14, 0.107), (15, -0.143), (16, -0.021), (17, -0.08), (18, 0.027), (19, -0.193), (20, 0.02), (21, -0.033), (22, 0.057), (23, -0.151), (24, -0.026), (25, 0.007), (26, 0.071), (27, -0.102), (28, -0.044), (29, -0.052), (30, -0.041), (31, -0.044), (32, 0.004), (33, 0.016), (34, -0.021), (35, -0.015), (36, -0.084), (37, 0.004), (38, -0.011), (39, -0.025), (40, -0.001), (41, -0.06), (42, -0.055), (43, 0.009), (44, 0.0), (45, -0.024), (46, 0.074), (47, -0.023), (48, 0.043), (49, 0.048)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97474754 52 jmlr-2007-Margin Trees for High-dimensional Classification

Author: Robert Tibshirani, Trevor Hastie

Abstract: We propose a method for the classification of more than two classes, from high-dimensional features. Our approach is to build a binary decision tree in a top-down manner, using the optimal margin classifier at each split. We implement an exact greedy algorithm for this task, and compare its performance to less greedy procedures based on clustering of the matrix of pairwise margins. We compare the performance of the “margin tree” to the closely related “all-pairs” (one versus one) support vector machine, and nearest centroids on a number of cancer microarray data sets. We also develop a simple method for feature selection. We find that the margin tree has accuracy that is competitive with other methods and offers additional interpretability in its putative grouping of the classes. Keywords: maximum margin classifier, support vector machine, decision tree, CART

2 0.68749684 11 jmlr-2007-Anytime Learning of Decision Trees

Author: Saher Esmeir, Shaul Markovitch

Abstract: The majority of existing algorithms for learning decision trees are greedy—a tree is induced topdown, making locally optimal decisions at each node. In most cases, however, the constructed tree is not globally optimal. Even the few non-greedy learners cannot learn good trees when the concept is difficult. Furthermore, they require a fixed amount of time and are not able to generate a better tree if additional time is available. We introduce a framework for anytime induction of decision trees that overcomes these problems by trading computation speed for better tree quality. Our proposed family of algorithms employs a novel strategy for evaluating candidate splits. A biased sampling of the space of consistent trees rooted at an attribute is used to estimate the size of the minimal tree under that attribute, and an attribute with the smallest expected tree is selected. We present two types of anytime induction algorithms: a contract algorithm that determines the sample size on the basis of a pre-given allocation of time, and an interruptible algorithm that starts with a greedy tree and continuously improves subtrees by additional sampling. Experimental results indicate that, for several hard concepts, our proposed approach exhibits good anytime behavior and yields significantly better decision trees when more time is available. Keywords: anytime algorithms, decision tree induction, lookahead, hard concepts, resource-bounded reasoning

3 0.61675137 79 jmlr-2007-Structure and Majority Classes in Decision Tree Learning

Author: Ray J. Hickey

Abstract: To provide good classification accuracy on unseen examples, a decision tree, learned by an algorithm such as ID3, must have sufficient structure and also identify the correct majority class in each of its leaves. If there are inadequacies in respect of either of these, the tree will have a percentage classification rate below that of the maximum possible for the domain, namely (100 Bayes error rate). An error decomposition is introduced which enables the relative contributions of deficiencies in structure and in incorrect determination of majority class to be isolated and quantified. A sub-decomposition of majority class error permits separation of the sampling error at the leaves from the possible bias introduced by the attribute selection method of the induction algorithm. It is shown that sampling error can extend to 25% when there are more than two classes. Decompositions are obtained from experiments on several data sets. For ID3, the effect of selection bias is shown to vary from being statistically non-significant to being quite substantial, with the latter appearing to be associated with a simple underlying model. Keywords: decision tree learning, error decomposition, majority classes, sampling error, attribute selection bias 1

4 0.51436782 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection

Author: Wei Pan, Xiaotong Shen

Abstract: Variable selection in clustering analysis is both challenging and important. In the context of modelbased clustering analysis with a common diagonal covariance matrix, which is especially suitable for “high dimension, low sample size” settings, we propose a penalized likelihood approach with an L1 penalty function, automatically realizing variable selection via thresholding and delivering a sparse solution. We derive an EM algorithm to fit our proposed model, and propose a modified BIC as a model selection criterion to choose the number of components and the penalization parameter. A simulation study and an application to gene function prediction with gene expression profiles demonstrate the utility of our method. Keywords: BIC, EM, mixture model, penalized likelihood, soft-thresholding, shrinkage

5 0.42346925 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers     (Special Topic on Model Selection)

Author: Yann Guermeur

Abstract: In the context of discriminant analysis, Vapnik’s statistical learning theory has mainly been developed in three directions: the computation of dichotomies with binary-valued functions, the computation of dichotomies with real-valued functions, and the computation of polytomies with functions taking their values in finite sets, typically the set of categories itself. The case of classes of vectorvalued functions used to compute polytomies has seldom been considered independently, which is unsatisfactory, for three main reasons. First, this case encompasses the other ones. Second, it cannot be treated appropriately through a na¨ve extension of the results devoted to the computation of ı dichotomies. Third, most of the classification problems met in practice involve multiple categories. In this paper, a VC theory of large margin multi-category classifiers is introduced. Central in this theory are generalized VC dimensions called the γ-Ψ-dimensions. First, a uniform convergence bound on the risk of the classifiers of interest is derived. The capacity measure involved in this bound is a covering number. This covering number can be upper bounded in terms of the γ-Ψdimensions thanks to generalizations of Sauer’s lemma, as is illustrated in the specific case of the scale-sensitive Natarajan dimension. A bound on this latter dimension is then computed for the class of functions on which multi-class SVMs are based. This makes it possible to apply the structural risk minimization inductive principle to those machines. Keywords: multi-class discriminant analysis, large margin classifiers, uniform strong laws of large numbers, generalized VC dimensions, multi-class SVMs, structural risk minimization inductive principle, model selection

6 0.41168967 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm

7 0.40050623 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification

8 0.36907655 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method

9 0.24849127 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation

10 0.23726058 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition

11 0.23471414 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time

12 0.23331307 57 jmlr-2007-Multi-class Protein Classification Using Adaptive Codes

13 0.22959483 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression

14 0.19533984 48 jmlr-2007-Learning in Environments with Unknown Dynamics: Towards more Robust Concept Learners

15 0.18341888 59 jmlr-2007-Nonlinear Boosting Projections for Ensemble Construction

16 0.17212066 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning

17 0.17077509 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study

18 0.16598807 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression

19 0.14893618 40 jmlr-2007-Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization

20 0.14810176 22 jmlr-2007-Compression-Based Averaging of Selective Naive Bayes Classifiers     (Special Topic on Model Selection)


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.015), (12, 0.013), (15, 0.555), (28, 0.057), (40, 0.043), (45, 0.013), (48, 0.028), (60, 0.013), (80, 0.029), (85, 0.03), (98, 0.065), (99, 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.93827826 40 jmlr-2007-Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization

Author: Evgeniy Gabrilovich, Shaul Markovitch

Abstract: Most existing methods for text categorization employ induction algorithms that use the words appearing in the training documents as features. While they perform well in many categorization tasks, these methods are inherently limited when faced with more complicated tasks where external knowledge is essential. Recently, there have been efforts to augment these basic features with external knowledge, including semi-supervised learning and transfer learning. In this work, we present a new framework for automatic acquisition of world knowledge and methods for incorporating it into the text categorization process. Our approach enhances machine learning algorithms with features generated from domain-specific and common-sense knowledge. This knowledge is represented by ontologies that contain hundreds of thousands of concepts, further enriched through controlled Web crawling. Prior to text categorization, a feature generator analyzes the documents and maps them onto appropriate ontology concepts that augment the bag of words used in simple supervised learning. Feature generation is accomplished through contextual analysis of document text, thus implicitly performing word sense disambiguation. Coupled with the ability to generalize concepts using the ontology, this approach addresses two significant problems in natural language processing—synonymy and polysemy. Categorizing documents with the aid of knowledge-based features leverages information that cannot be deduced from the training documents alone. We applied our methodology using the Open Directory Project, the largest existing Web directory built by over 70,000 human editors. Experimental results over a range of data sets confirm improved performance compared to the bag of words document representation. Keywords: feature generation, text classification, background knowledge

same-paper 2 0.88080478 52 jmlr-2007-Margin Trees for High-dimensional Classification

Author: Robert Tibshirani, Trevor Hastie

Abstract: We propose a method for the classification of more than two classes, from high-dimensional features. Our approach is to build a binary decision tree in a top-down manner, using the optimal margin classifier at each split. We implement an exact greedy algorithm for this task, and compare its performance to less greedy procedures based on clustering of the matrix of pairwise margins. We compare the performance of the “margin tree” to the closely related “all-pairs” (one versus one) support vector machine, and nearest centroids on a number of cancer microarray data sets. We also develop a simple method for feature selection. We find that the margin tree has accuracy that is competitive with other methods and offers additional interpretability in its putative grouping of the classes. Keywords: maximum margin classifier, support vector machine, decision tree, CART

3 0.37727407 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation

Author: Guy Lebanon, Yi Mao, Joshua Dillon

Abstract: The popular bag of words assumption represents a document as a histogram of word occurrences. While computationally efficient, such a representation is unable to maintain any sequential information. We present an effective sequential document representation that goes beyond the bag of words representation and its n-gram extensions. This representation uses local smoothing to embed documents as smooth curves in the multinomial simplex thereby preserving valuable sequential information. In contrast to bag of words or n-grams, the new representation is able to robustly capture medium and long range sequential trends in the document. We discuss the representation and its geometric properties and demonstrate its applicability for various text processing tasks. Keywords: text processing, local smoothing

4 0.35133967 88 jmlr-2007-Unlabeled Compression Schemes for Maximum Classes

Author: Dima Kuzmin, Manfred K. Warmuth

Abstract: n Maximum concept classes of VC dimension d over n domain points have size ≤d , and this is an upper bound on the size of any concept class of VC dimension d over n points. We give a compression scheme for any maximum class that represents each concept by a subset of up to d unlabeled domain points and has the property that for any sample of a concept in the class, the representative of exactly one of the concepts consistent with the sample is a subset of the domain of the sample. This allows us to compress any sample of a concept in the class to a subset of up to d unlabeled sample points such that this subset represents a concept consistent with the entire original sample. Unlike the previously known compression scheme for maximum classes (Floyd and Warmuth, 1995) which compresses to labeled subsets of the sample of size equal d, our new scheme is tight in the sense that the number of possible unlabeled compression sets of size at most d equals the number of concepts in the class. Keywords: compression schemes, VC dimension, maximum classes, one-inclusion graph

5 0.34615403 39 jmlr-2007-Handling Missing Values when Applying Classification Models

Author: Maytal Saar-Tsechansky, Foster Provost

Abstract: Much work has studied the effect of different treatments of missing values on model induction, but little work has analyzed treatments for the common case of missing values at prediction time. This paper first compares several different methods—predictive value imputation, the distributionbased imputation used by C4.5, and using reduced models—for applying classification trees to instances with missing values (and also shows evidence that the results generalize to bagged trees and to logistic regression). The results show that for the two most popular treatments, each is preferable under different conditions. Strikingly the reduced-models approach, seldom mentioned or used, consistently outperforms the other two methods, sometimes by a large margin. The lack of attention to reduced modeling may be due in part to its (perceived) expense in terms of computation or storage. Therefore, we then introduce and evaluate alternative, hybrid approaches that allow users to balance between more accurate but computationally expensive reduced modeling and the other, less accurate but less computationally expensive treatments. The results show that the hybrid methods can scale gracefully to the amount of investment in computation/storage, and that they outperform imputation even for small investments. Keywords: missing data, classification, classification trees, decision trees, imputation

6 0.34034082 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data

7 0.33928877 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time

8 0.33216786 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection

9 0.32761577 84 jmlr-2007-The Pyramid Match Kernel: Efficient Learning with Sets of Features

10 0.32731187 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition

11 0.31095713 28 jmlr-2007-Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data

12 0.30538368 72 jmlr-2007-Relational Dependency Networks

13 0.30404821 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study

14 0.29063189 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm

15 0.28903013 55 jmlr-2007-Minimax Regret Classifier for Imprecise Class Distributions

16 0.28510284 29 jmlr-2007-Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts

17 0.27735889 44 jmlr-2007-Large Margin Semi-supervised Learning

18 0.27368027 59 jmlr-2007-Nonlinear Boosting Projections for Ensemble Construction

19 0.272412 25 jmlr-2007-Covariate Shift Adaptation by Importance Weighted Cross Validation

20 0.26944259 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression