jmlr jmlr2007 jmlr2007-66 knowledge-graph by maker-knowledge-mining

66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection


Source: pdf

Author: Wei Pan, Xiaotong Shen

Abstract: Variable selection in clustering analysis is both challenging and important. In the context of modelbased clustering analysis with a common diagonal covariance matrix, which is especially suitable for “high dimension, low sample size” settings, we propose a penalized likelihood approach with an L1 penalty function, automatically realizing variable selection via thresholding and delivering a sparse solution. We derive an EM algorithm to fit our proposed model, and propose a modified BIC as a model selection criterion to choose the number of components and the penalization parameter. A simulation study and an application to gene function prediction with gene expression profiles demonstrate the utility of our method. Keywords: BIC, EM, mixture model, penalized likelihood, soft-thresholding, shrinkage

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 School of Statistics, University of Minnesota, Minneapolis, MN 55455, USA. Editor: Bin Yu. Abstract: Variable selection in clustering analysis is both challenging and important. [sent-5, score-0.422]

2 A simulation study and an application to gene function prediction with gene expression profiles demonstrate the utility of our method. [sent-8, score-0.609]

3 Keywords: BIC, EM, mixture model, penalized likelihood, soft-thresholding, shrinkage. [sent-9, score-0.558]

4 In this context, some of the attributes x_{jp} of x_j may not be relevant: use of such attributes only introduces noise, and may impede uncovering the clustering structure of interest. [sent-18, score-0.83]

5 Due to the lack of statistical models in many existing clustering algorithms, it is difficult to implement principled variable selection, though some promising methods have been proposed, based largely on heuristics (Friedman and Meulman, 2004; Mangasarian and Wild, 2004). [sent-20, score-0.382]

6 In contrast, model-based clustering (McLachlan and Peel, 2002; Fraley and Raftery, 2002) assumes that data come from a finite mixture model with each component corresponding to a cluster; with such a statistical model, statistical inference, including variable selection, can be carried out. [sent-21, score-0.449]
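For reference, the finite Normal mixture underlying this model-based formulation can be written out as below; this is a standard formulation reconstructed from the abstract's description (Normal components sharing a common diagonal covariance), not an excerpt from the paper:

```latex
% Finite Normal mixture with a common diagonal covariance matrix
\[
f(x_j;\Theta)=\sum_{k=1}^{K}\pi_k\, f_k(x_j;\theta_k),
\qquad
f_k(x_j;\theta_k)=\prod_{p=1}^{P}
\frac{1}{\sqrt{2\pi\sigma_p^{2}}}
\exp\!\Big(-\frac{(x_{jp}-\mu_{kp})^{2}}{2\sigma_p^{2}}\Big),
\qquad
\pi_k\ge 0,\ \ \sum_{k=1}^{K}\pi_k=1.
\]
```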

7 With high-dimensional data, as an alternative to variable selection, one may apply dimension reduction techniques, such as principal component analysis, prior to clustering (Ghosh and Chinnaiyan, 2002; Liu et al. [sent-25, score-0.382]

8 In light of the success of penalized regression with variable selection (Tibshirani, 1996; Fan and Li, 2001), we conjecture that penalization may also be viable for variable selection in clustering, and hence we propose an approach through penalized model-based clustering. [sent-38, score-1.278]

9 Note that, although there is an extensive body of literature on penalized likelihood methods, most of it focuses on classification and regression; in particular, we are not aware of any existing work on penalized likelihood designed specifically for multivariate clustering. [sent-41, score-0.988]

10 Recent advances in high-throughput biotechnologies, such as microarrays, have generated a large amount of high-dimensional data, and have led to routine use of clustering analyses, for example, in gene function discovery (Eisen et al. [sent-42, score-0.654]

11 In this article, in addition to the promising application of cancer subtype discovery, we also apply model-based clustering to the task of gene function discovery (Li and Hong, 2001; Ghosh and Chinnaiyan, 2002). [sent-49, score-0.737]

12 It has become popular to cluster gene expression profiles to discover unknown gene functions. [sent-51, score-0.705]

13 In the remainder of this article, we first briefly review standard model-based clustering, and then introduce a general framework for penalized model-based clustering. [sent-52, score-0.45]

14 We derive an EM algorithm to compute the maximum penalized likelihood estimates for the model; a modified BIC is used to determine the number of components and the value of the penalization parameter in penalized model-based clustering. [sent-54, score-1.058]

15 Methods: We first give a brief review of model-based clustering with a finite Normal mixture model; then we introduce our penalized model-based clustering, including an EM algorithm and a modified BIC for model selection. [sent-58, score-0.853]

16 Penalized Model-based Clustering: With the same motivation as in penalized regression, we propose a penalized model-based clustering approach. [sent-78, score-1.236]

17 (2003) proposed a penalized likelihood approach to dealing with the degeneracy: by penalizing small variance components σ_k, one circumvents the problem. [sent-84, score-0.494]

18 Specifically, we regularize log L(Θ) to yield a penalized log-likelihood: log L_P(Θ) = Σ_{j=1}^{n} log[ Σ_{k=1}^{K} π_k f_k(x_j; θ_k) ] − h_λ(Θ), where h_λ(·) is a penalty function with penalization parameter λ. [sent-86, score-0.762]
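A minimal numerical sketch of this penalized log-likelihood in Python, assuming Normal components with a common diagonal covariance and the L1 penalty on the cluster means, h_λ(Θ) = λ Σ_{k,p} |μ_{kp}|, which is the form suggested by the abstract and by the "Penalizing Mean Parameters" sentences below; the function and variable names are illustrative, not the authors' code:

```python
import numpy as np
from scipy.stats import norm

def penalized_loglik(X, pi, mu, sigma, lam):
    """Observed-data penalized log-likelihood log L_P(Theta).

    X     : (n, P) data matrix (assumed standardized)
    pi    : (K,)   mixing proportions
    mu    : (K, P) cluster means
    sigma : (P,)   common diagonal standard deviations
    lam   : penalization parameter lambda
    """
    # log f_k(x_j) for every observation j and component k: shape (n, K)
    log_fk = np.stack([
        norm.logpdf(X, loc=mu[k], scale=sigma).sum(axis=1)
        for k in range(len(pi))
    ], axis=1)
    # log sum_k pi_k f_k(x_j), computed stably via log-sum-exp
    log_mix = np.logaddexp.reduce(np.log(pi) + log_fk, axis=1)
    # h_lambda(Theta) = lambda * sum_{k,p} |mu_{kp}|  (L1 penalty on the means)
    return log_mix.sum() - lam * np.abs(mu).sum()
```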

19 Correspondingly, the penalized log-likelihood for the complete data is log L_{c,P}(Θ) = Σ_{k=1}^{K} Σ_{j=1}^{n} z_{kj} [log π_k + log f_k(x_j; θ_k)] − h_λ(Θ). [sent-88, score-0.635]

20 We propose using such penalized model-based clustering as a general way to regularize parameter estimates. [sent-89, score-0.786]

21 There is a well-known connection between penalized likelihood and Bayesian modeling (Hastie et al., 2001). [sent-93, score-0.494]

22 The negative of the penalty function, −h_λ(Θ), can be regarded as proportional to the log density of the prior distribution for the parameters Θ, and the penalized log-likelihood is proportional to the log posterior density. [sent-94, score-0.58]
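To make this connection concrete for the L1 penalty on the cluster means, the implied prior is an independent Laplace (double-exponential) prior on each mean; this is a standard observation sketched here, not a quotation from the paper:

```latex
\[
h_\lambda(\Theta)=\lambda\sum_{k=1}^{K}\sum_{p=1}^{P}|\mu_{kp}|
\;\Longleftrightarrow\;
p(\Theta)\propto\prod_{k=1}^{K}\prod_{p=1}^{P}\exp\!\big(-\lambda\,|\mu_{kp}|\big),
\qquad
\log L(\Theta)-h_\lambda(\Theta)=\log\big(\text{posterior density}\big)+\text{const}.
\]
```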

23 Penalizing Mean Parameters: Now we propose a specific implementation of penalized model-based clustering to realize variable selection. [sent-98, score-0.874]

24 We assume throughout this article that, prior to clustering analysis, we have standardized the data so that each attribute has sample mean 0 and sample variance 1. [sent-103, score-0.409]
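A small sketch of the standardization step described above (each attribute centered to sample mean 0 and scaled to sample variance 1); the names are illustrative:

```python
import numpy as np

def standardize(X):
    """Center each attribute (column) to mean 0 and scale to sample variance 1."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
```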

25 Next we derive an EM algorithm for the above penalized model-based clustering; in particular, it is confirmed that the L1 penalty yields a thresholding rule with the desired sparsity property. [sent-110, score-0.483]

26 The derivation closely follows that for standard model-based clustering (McLachlan and Peel, 2002) and the general methodology for penalized likelihood (Green, 1990). [sent-111, score-0.83]
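For orientation, the two EM steps take roughly the following form, reconstructed from the standard mixture-model EM plus the soft-thresholding that an L1 penalty on the means induces; the paper's exact expressions (e.g., for any variance updates) may differ in details:

```latex
% E-step: posterior probability that observation j belongs to component k
\[
\tau_{kj}=\frac{\pi_k f_k(x_j;\theta_k)}{\sum_{l=1}^{K}\pi_l f_l(x_j;\theta_l)},
\qquad
\hat\pi_k=\frac{1}{n}\sum_{j=1}^{n}\tau_{kj}.
\]
% M-step for the means: soft-thresholding of the unpenalized weighted average
\[
\tilde\mu_{kp}=\frac{\sum_{j=1}^{n}\tau_{kj}\,x_{jp}}{\sum_{j=1}^{n}\tau_{kj}},
\qquad
\hat\mu_{kp}=\operatorname{sign}(\tilde\mu_{kp})
\Big(|\tilde\mu_{kp}|-\frac{\lambda\,\sigma_p^{2}}{\sum_{j=1}^{n}\tau_{kj}}\Big)_{+}.
\]
```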

27 The above iteration is repeated until convergence, resulting in the maximum penalized likelihood estimate (MPLE) Θ̂. [sent-130, score-0.45]
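A compact runnable sketch of such an EM iteration, under the same assumptions as the formulas above (L1 penalty on the means, common diagonal covariance held fixed at the identity, which matches standardized data); it is an illustration, not the authors' implementation:

```python
import numpy as np
from scipy.stats import norm

def penalized_em(X, K, lam, n_iter=100, seed=0):
    """EM for an L1-penalized Normal mixture with a common diagonal covariance.

    Returns mixing proportions pi (K,), sparse cluster means mu (K, P), and
    responsibilities tau (n, K). The covariance is kept at the identity here;
    the paper may also update it.
    """
    rng = np.random.default_rng(seed)
    n, P = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(n, size=K, replace=False)].copy()  # initialize at K random points
    sigma = np.ones(P)

    for _ in range(n_iter):
        # E-step: responsibilities tau_{jk} proportional to pi_k f_k(x_j)
        log_fk = np.stack([norm.logpdf(X, mu[k], sigma).sum(axis=1)
                           for k in range(K)], axis=1)
        log_post = np.log(pi) + log_fk
        log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
        tau = np.exp(log_post)                            # (n, K)

        # M-step: mixing proportions and soft-thresholded means
        nk = tau.sum(axis=0)                              # effective cluster sizes
        pi = nk / n
        mu_tilde = (tau.T @ X) / nk[:, None]              # unpenalized weighted means
        thresh = lam * sigma**2 / nk[:, None]             # per-cluster, per-attribute threshold
        mu = np.sign(mu_tilde) * np.maximum(np.abs(mu_tilde) - thresh, 0.0)

    return pi, mu, tau
```

For example, pi, mu, tau = penalized_em(standardize(X), K=2, lam=5.0); attributes whose estimated means are zero in every cluster are exactly those excluded from the clustering.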

28 For penalized model-based clustering, in addition to K, we also have to choose an appropriate value of the penalization parameter λ; a model selection criterion has to account for the adaptive choice of λ. [sent-145, score-0.65]

29 One difficulty in using the above BIC criterion is that it is not always clear what d is in a penalized model. [sent-146, score-0.45]
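The modified criterion described in these sentences can be sketched as follows, with the effective number of parameters counting only the non-zero mean estimates (consistent with sentence 85 below); the exact bookkeeping of the remaining free parameters is the paper's and is not reproduced here:

```latex
\[
\mathrm{BIC}(K,\lambda) = -2\,\log L_P(\hat\Theta) + \log(n)\, d_e,
\qquad
d_e = \#\{(k,p):\hat\mu_{kp}\neq 0\} + (\text{other free parameters}),
\]
```

and the pair (K, λ) minimizing BIC is selected.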

30 Results: We first present results for simulated data, then consider clustering samples and clustering genes for two microarray data sets. [sent-153, score-0.959]

31 Table 1 summarizes the means and standard errors of BIC for the standard clustering using all 1000 attributes, and those of BIC and penalization parameter λ of the selected models in penalized clustering. [sent-168, score-0.933]

32 Twenty out of a hundred times, standard model-based clustering incorrectly selected K = 1, failing to discover the existence of the two clusters of interest. [sent-170, score-0.471]

33 Because standard clustering used all the attributes, and the 850 noise attributes far outnumbered the 150 informative ones, it was expected that it might choose K = 1. [sent-172, score-0.336]

34 In contrast, with an appropriate variable selection, penalized clustering more frequently chose the model with the correct number (K = 2) of clusters. [Fragment of Table 1 spilled here: K = 1, 2, 3; Standard BIC 92923 (2), 92834 (12), 96282 (13); Penalized BIC 88393 (0); remaining entries and λ truncated.] [sent-173, score-0.867]

35 Table 1: Mean BIC (with standard errors in parentheses) with various numbers (K) of clusters in the standard and penalized clustering for 100 simulated data sets. [sent-179, score-0.949]

36 Table 2: Frequencies of the selected numbers (K) of clusters in the standard and penalized clustering from 100 simulated data sets with P = 1000. [sent-193, score-0.982]

37 The corresponding means (with standard errors in parentheses) of BIC, λ, the number of the first 150 informative attributes excluded (#Zero1), and the number of the last 850 noise attributes excluded (#Zero0) are also included. [sent-194, score-0.626]

38 Table 3: Frequencies of the selected numbers (K) of clusters in the standard and penalized clustering from 100 simulated data sets with P = 10. [sent-203, score-0.982]

39 Importantly, the penalized approach can automatically select attributes: out of the total 850 noise attributes, on average, 833 attributes were correctly identified and not used in the final clustering; on the other hand, only one out of 150 informative attributes was not used. [sent-206, score-1.008]

40 Penalized clustering gave perfect assignments for K = 2: the 15 and 85 observations from two distributions/classes were correctly assigned to clusters 1 and 2 respectively. [sent-207, score-0.53]

41 Table 3 gives the results for standard and penalized clustering. [sent-220, score-0.45]

42 Again, in the presence of the eight noise attributes, standard clustering always chose K = 1; in contrast, penalized clustering with automatic variable selection tended to choose K > 1, and the two informative attributes were most often retained. [sent-221, score-1.659]

43 However, penalized clustering did not work as well as in the previous set-up with P = 1000: it chose K = 3 most often while keeping most of the noise attributes; the reason could be that with fewer informative attributes, this was a more difficult problem than that of the previous section. [sent-222, score-0.933]

44 This highlights a unique point that in variable selection for clustering, unlike in regression, a correct model for the data based on a subset of attributes (e. [sent-228, score-0.355]

45 In summary, we concluded that subset selection was not suitable at all for variable selection in clustering, whereas penalized clustering was much more effective. [sent-233, score-1.004]

46 Golub et al. (1999) studied the discovery of two subtypes of human acute leukemia, acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL), using microarray gene expression data. [sent-236, score-0.561]

47 We applied model-based clustering to their data with 38 patients, among which 27 were ALL while the other 11 were AML patients; each patient was treated as an observation while the genes were treated as attributes. [sent-239, score-0.478]

48 The standard clustering chose K = 3 while the penalized one selected K = 7 (Table 5). [sent-244, score-0.854]

49 The penalized method performed better than the standard clustering: the former incorrectly assigned five while the latter misclassified seven AML samples into the clusters with the ALL samples as the majority. [sent-246, score-0.58]

50 Gene Function Discovery Using Gene Expression Profiles: Because many genes still have unknown functions, a biologically important subject is to computationally predict gene functions using, for example, gene expression profiles (Brown et al. [sent-250, score-0.372]

51 Due to incomplete knowledge, it is equally important to discover new gene functions; there is functional heterogeneity within most functional categories, and there are probably many more uncharacterized gene functions. [sent-253, score-0.532]

52 Clustering gene expression profiles has become a popular approach for both gene function prediction and discovery (Eisen et al. [sent-254, score-0.661]

53 Here, we considered gene function discovery using gene expression data for the yeast S. cerevisiae. [sent-260, score-0.661]

54 Specifically, we used a gene expression data set containing 300 microarray experiments with gene deletions and drug treatments (Hughes et al. [sent-261, score-0.693]

55 Variable selection was highly relevant here: first, as shown in the simulation, incorrectly using noise attributes might degrade the performance of clustering, obscuring some interesting clustering structures; second, it was also biologically important to identify which microarray experiments (i.e., [sent-263, score-0.809]

56 attributes) were informative in clustering, linking putative functions of gene clusters to biological perturbations underlying the microarray experiments. [sent-265, score-0.519]

57 One difficulty in evaluating the performance of a clustering algorithm for real data is how to choose an appropriate criterion. [sent-266, score-0.336]

58 Although our interest was in clustering for gene function discovery, for the purpose of evaluation we treated the problem as supervised learning: each gene's known function served as its response variable; gene functions were downloaded from the MIPS database (Mewes et al. [sent-267, score-1.18]

59 For illustration, we only considered two gene functions, cytoplasm and mitochondrion, with 100 genes in each class as training data; we used the other 406 and 212 genes in the two functional classes as test data. [sent-269, score-0.55]

60 We first used the training data without their class labels: we clustered the 200 gene expression profiles into, say K, clusters. [sent-270, score-0.343]
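A sketch of the evaluation protocol described in these sentences: cluster the training profiles without labels, label each cluster by the majority training class, then classify test genes through their most probable cluster. The helper names and the use of posterior membership probabilities for assignment are illustrative assumptions:

```python
import numpy as np

def evaluate_clustering(tau_train, y_train, tau_test, y_test):
    """Hard-classification accuracy of a fitted clustering model.

    tau_* : (n, K) posterior cluster membership probabilities
    y_*   : (n,)   known functional class labels as integers (0, 1, ...),
            used only for evaluation
    """
    hard_train = tau_train.argmax(axis=1)
    # Majority class label for each cluster, based on the training genes
    cluster_label = {k: np.bincount(y_train[hard_train == k]).argmax()
                     for k in np.unique(hard_train)}
    # Assign each test gene to its most probable cluster, then to that
    # cluster's majority class
    hard_test = tau_test.argmax(axis=1)
    y_pred = np.array([cluster_label.get(k, -1) for k in hard_test])
    return np.mean(y_pred == y_test)
```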

61 Using the original data: both standard clustering and penalized clustering selected K = 7 by BIC (Table 7), with their predictive performances for the test data given in Table 8. [sent-280, score-1.155]

62 For penalized clustering, some MPLEs of the cluster-specific mean parameters were exactly zero; their numbers ranged from 36 to 126 in the seven clusters. [sent-282, score-0.45]

63 This example showed that our penalized clustering performed as well as the standard clustering method for data with few or no non-informative attributes. [sent-284, score-1.122]

64 We also applied the nearest shrunken centroids (NSC) classifier (Tibshirani et al., 2003), random forests (RF) (Breiman, 2001), and support vector machines (SVM) (Vapnik, 1998), thus treating the problem as supervised learning; the NSC was specifically developed for classification with gene expression data, while the RF and SVM are two state-of-the-art machine learning tools. [sent-286, score-0.372]

65 Table 8: Predictions of standard clustering and penalized clustering over a separate test data set for Hughes’ gene expression data. [sent-300, score-1.465]

66 Table 9: Predictions over a separate test data set based on the nearest shrunken centroid without shrinkage (called NC) and with shrinkage (NSC), random forests (RF) and support vector machine (SVM) for Hughes’ gene expression data. [sent-306, score-0.5]

67 Note that it is in general unfair to compare the predictive performance of a clustering method against that of a classification or supervised learning method; our purpose here was to use the modern classifiers as benchmarks. [sent-307, score-0.336]

68 For the NSC, with the selected ∆, only six attributes remained in the final model; however, using all the attributes gave a slightly higher accuracy (Table 9). [sent-309, score-0.506]

69 It was interesting to note that the NSC failed to perform better than either clustering method with hard classification. [sent-310, score-0.336]

70 There was some similarity between these two: if we regarded each cluster as a single class, then clustering with hard classification worked in a similar manner to the NSC. [sent-311, score-0.432]

71 Nevertheless, a difference between the two was that the NSC assumed only one cluster for each class, whereas clustering with hard classification allowed observations of the same class to go to different clusters. [sent-312, score-0.469]

72 Using the data with added noise: it seemed that there were few or no non-informative attributes in the gene expression data for gene function prediction. [sent-316, score-0.832]

73 To mimic other real applications, where a large number of microarray experiments are available but only a fraction of them are informative, we added 700 noise attributes to the gene expression data. [sent-317, score-0.695]
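A sketch of how such noise attributes could be appended, assuming independent standard Normal noise columns (matching the standardized scale of the real attributes); the paper's exact noise-generating scheme is not reproduced here:

```python
import numpy as np

def add_noise_attributes(X, n_noise=700, seed=0):
    """Append n_noise columns of independent N(0, 1) noise to X."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((X.shape[0], n_noise))
    return np.hstack([X, noise])
```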

74 By BIC, standard clustering selected K = 2 whereas penalized clustering chose K = 6. [sent-320, score-1.19]

75 With K = 2, standard clustering gave quite bad results for the test data with an accuracy of only about 50% (Table 11), while it performed much better with K = 6 (results not shown); in contrast, penalized clustering gave much higher accuracy rates. [sent-321, score-1.176]

76 Note that, with many noise attributes, in agreement with the previous simulation study, standard clustering probably underestimated the true number of clusters of interest by selecting K = 2. [sent-322, score-0.483]

77 Table 11: Predictions of standard clustering and penalized clustering over a separate test data set for Hughes’ gene expression data with added noise. [sent-341, score-1.465]

78 Table 12: Predictions for a separate test data set based on several classifiers for Hughes’ gene expression data with added noise. [sent-347, score-0.343]

79 We are not aware of any other penalized likelihood approaches to multivariate model-based clustering, the problem studied in this article. [sent-355, score-0.494]

80 First, clustering without variable selection may fail to uncover interesting structures underlying the data. [sent-359, score-0.468]

81 Second, best subset selection is not only computationally infeasible for clustering high-dimensional data, but may also fail in small problems. [sent-360, score-0.422]

82 (2006) extended the penalized likelihood approach discussed here to semi-supervised learning, so that variable selection is accomplished simultaneously with model fitting. [sent-372, score-0.626]

83 It is unclear how to realize automatic variable selection for other more general covariance matrices in penalized model-based clustering. [sent-384, score-0.671]

84 More work is needed to explore how to realize variable selection with such a choice in penalized model-based clustering. [sent-386, score-0.624]

85 We propose counting only the non-zero components of the maximum penalized likelihood estimate when calculating the effective number of parameters in BIC. [sent-391, score-0.494]
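A one-line sketch of this counting rule for the mean parameters; other parameter counts (e.g., mixing proportions and variances) would be added as in the BIC sketch given earlier, and the names are illustrative:

```python
import numpy as np

def effective_num_mean_params(mu_hat, tol=0.0):
    """Count non-zero cluster-mean MPLE components for the modified BIC."""
    return int(np.sum(np.abs(mu_hat) > tol))
```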

86 WP thanks Benhuai Xie, Guanghua Xiao and Peng Wei for assistance with the gene expression data. [sent-394, score-0.343]

87 Class discovery and classification of tumor samples using mixture modeling of gene expression data. [sent-400, score-0.495]

88 Knowledge-based analysis of microarray gene expression data using support vector machines. [sent-425, score-0.427]

89 Variable selection via nonconcave penalized likelihood and its Oracle properties. [sent-468, score-0.58]

90 Mixture modeling of gene expression data from microarray experiments. [sent-508, score-0.427]

91 Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. [sent-531, score-0.395]

92 On use of the EM for penalized likelihood estimation. [sent-536, score-0.494]

93 Variable selection in clustering via Dirichlet process mixture models. [sent-622, score-0.489]

94 Bayesian clustering with variable and transformation selection (with discussion). [sent-640, score-0.468]

95 A mixture model-based approach to the clustering of microarray expression data. [sent-655, score-0.564]

96 Semi-supervised learning via penalized mixture model with application to microarray sample classification. [sent-698, score-0.601]

97 Discussion of “Bayesian clustering with variable and transformation selection” by Liu et al. [sent-703, score-0.382]

98 Gene function prediction by a combined analysis of gene expression data and protein-protein interaction data. [sent-776, score-0.343]

99 Model-based clustering and data transformations for gene expression data. [sent-794, score-0.679]

100 Transitive functional annotation by shortest-path analysis of gene expression data. [sent-802, score-0.343]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('penalized', 0.45), ('bic', 0.353), ('clustering', 0.336), ('gene', 0.266), ('attributes', 0.223), ('nsc', 0.192), ('pan', 0.166), ('genes', 0.142), ('kp', 0.133), ('enalized', 0.124), ('penalization', 0.114), ('mclachlan', 0.105), ('hughes', 0.105), ('hen', 0.104), ('clusters', 0.102), ('cluster', 0.096), ('raftery', 0.093), ('fraley', 0.093), ('pred', 0.093), ('selection', 0.086), ('odel', 0.086), ('microarray', 0.084), ('fk', 0.081), ('lustering', 0.079), ('expression', 0.077), ('aml', 0.073), ('rf', 0.073), ('freq', 0.07), ('informative', 0.067), ('mixture', 0.067), ('simulated', 0.061), ('subtype', 0.055), ('yeung', 0.055), ('penalty', 0.055), ('discovery', 0.052), ('peel', 0.052), ('jp', 0.048), ('covariance', 0.047), ('realizing', 0.046), ('shrunken', 0.046), ('hoff', 0.046), ('attribute', 0.046), ('variable', 0.046), ('bioinformatics', 0.046), ('noise', 0.045), ('likelihood', 0.044), ('zk', 0.042), ('realize', 0.042), ('shrinkage', 0.041), ('chinnaiyan', 0.041), ('ciuperca', 0.041), ('differentially', 0.041), ('subtypes', 0.041), ('tadesse', 0.041), ('xiao', 0.041), ('em', 0.04), ('pro', 0.04), ('qp', 0.039), ('les', 0.038), ('shen', 0.038), ('observations', 0.037), ('golub', 0.035), ('biologically', 0.035), ('modelbased', 0.035), ('mple', 0.035), ('chose', 0.035), ('excluded', 0.034), ('selected', 0.033), ('genome', 0.033), ('thresholding', 0.033), ('tumor', 0.033), ('bayesian', 0.032), ('log', 0.031), ('lda', 0.031), ('acute', 0.031), ('eisen', 0.031), ('mle', 0.031), ('fan', 0.031), ('efron', 0.03), ('li', 0.03), ('diagonal', 0.03), ('tibshirani', 0.029), ('forests', 0.029), ('centroids', 0.029), ('zou', 0.029), ('cancer', 0.028), ('assigned', 0.028), ('frequencies', 0.028), ('acad', 0.027), ('alexandridis', 0.027), ('jasa', 0.027), ('jasra', 0.027), ('mewes', 0.027), ('mips', 0.027), ('mples', 0.027), ('sci', 0.027), ('stoughton', 0.027), ('wp', 0.027), ('article', 0.027), ('gave', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection

Author: Wei Pan, Xiaotong Shen

Abstract: Variable selection in clustering analysis is both challenging and important. In the context of modelbased clustering analysis with a common diagonal covariance matrix, which is especially suitable for “high dimension, low sample size” settings, we propose a penalized likelihood approach with an L1 penalty function, automatically realizing variable selection via thresholding and delivering a sparse solution. We derive an EM algorithm to fit our proposed model, and propose a modified BIC as a model selection criterion to choose the number of components and the penalization parameter. A simulation study and an application to gene function prediction with gene expression profiles demonstrate the utility of our method. Keywords: BIC, EM, mixture model, penalized likelihood, soft-thresholding, shrinkage

2 0.20398107 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification

Author: Jia Li, Surajit Ray, Bruce G. Lindsay

Abstract: A new clustering approach based on mode identification is developed by applying new optimization techniques to a nonparametric density estimator. A cluster is formed by those sample points that ascend to the same local maximum (mode) of the density function. The path from a point to its associated mode is efficiently solved by an EM-style algorithm, namely, the Modal EM (MEM). This method is then extended for hierarchical clustering by recursively locating modes of kernel density estimators with increasing bandwidths. Without model fitting, the mode-based clustering yields a density description for every cluster, a major advantage of mixture-model-based clustering. Moreover, it ensures that every cluster corresponds to a bump of the density. The issue of diagnosing clustering results is also investigated. Specifically, a pairwise separability measure for clusters is defined using the ridgeline between the density bumps of two clusters. The ridgeline is solved for by the Ridgeline EM (REM) algorithm, an extension of MEM. Based upon this new measure, a cluster merging procedure is created to enforce strong separation. Experiments on simulated and real data demonstrate that the mode-based clustering approach tends to combine the strengths of linkage and mixture-model-based clustering. In addition, the approach is robust in high dimensions and when clusters deviate substantially from Gaussian distributions. Both of these cases pose difficulty for parametric mixture modeling. A C package on the new algorithms is developed for public access at http://www.stat.psu.edu/∼jiali/hmac. Keywords: modal clustering, mode-based clustering, mixture modeling, modal EM, ridgeline EM, nonparametric density

3 0.1190431 52 jmlr-2007-Margin Trees for High-dimensional Classification

Author: Robert Tibshirani, Trevor Hastie

Abstract: We propose a method for the classification of more than two classes, from high-dimensional features. Our approach is to build a binary decision tree in a top-down manner, using the optimal margin classifier at each split. We implement an exact greedy algorithm for this task, and compare its performance to less greedy procedures based on clustering of the matrix of pairwise margins. We compare the performance of the “margin tree” to the closely related “all-pairs” (one versus one) support vector machine, and nearest centroids on a number of cancer microarray data sets. We also develop a simple method for feature selection. We find that the margin tree has accuracy that is competitive with other methods and offers additional interpretability in its putative grouping of the classes. Keywords: maximum margin classifier, support vector machine, decision tree, CART

4 0.11420219 8 jmlr-2007-A Unified Continuous Optimization Framework for Center-Based Clustering Methods

Author: Marc Teboulle

Abstract: Center-based partitioning clustering algorithms rely on minimizing an appropriately formulated objective function, and different formulations suggest different possible algorithms. In this paper, we start with the standard nonconvex and nonsmooth formulation of the partitioning clustering problem. We demonstrate that within this elementary formulation, convex analysis tools and optimization theory provide a unifying language and framework to design, analyze and extend hard and soft center-based clustering algorithms, through a generic algorithm which retains the computational simplicity of the popular k-means scheme. We show that several well known and more recent center-based clustering algorithms, which have been derived either heuristically, or/and have emerged from intuitive analogies in physics, statistical techniques and information theoretic perspectives can be recovered as special cases of the proposed analysis and we streamline their relationships. Keywords: clustering, k-means algorithm, convex analysis, support and asymptotic functions, distance-like functions, Bregman and Csiszar divergences, nonlinear means, nonsmooth optimization, smoothing algorithms, fixed point methods, deterministic annealing, expectation maximization, information theory and entropy methods

5 0.076994173 1 jmlr-2007-"Ideal Parent" Structure Learning for Continuous Variable Bayesian Networks

Author: Gal Elidan, Iftach Nachman, Nir Friedman

Abstract: Bayesian networks in general, and continuous variable networks in particular, have become increasingly popular in recent years, largely due to advances in methods that facilitate automatic learning from data. Yet, despite these advances, the key task of learning the structure of such models remains a computationally intensive procedure, which limits most applications to parameter learning. This problem is even more acute when learning networks in the presence of missing values or hidden variables, a scenario that is part of many real-life problems. In this work we present a general method for speeding structure search for continuous variable networks with common parametric distributions. We efficiently evaluate the approximate merit of candidate structure modifications and apply time consuming (exact) computations only to the most promising ones, thereby achieving significant improvement in the running time of the search algorithm. Our method also naturally and efficiently facilitates the addition of useful new hidden variables into the network structure, a task that is typically considered both conceptually difficult and computationally prohibitive. We demonstrate our method on synthetic and real-life data sets, both for learning structure on fully and partially observable data, and for introducing new hidden variables during structure search. Keywords: Bayesian networks, structure learning, continuous variables, hidden variables

6 0.064426176 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians

7 0.062732413 3 jmlr-2007-A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation

8 0.050655816 72 jmlr-2007-Relational Dependency Networks

9 0.048248105 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method

10 0.047170181 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time

11 0.045527663 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition

12 0.043409385 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption

13 0.043355145 79 jmlr-2007-Structure and Majority Classes in Decision Tree Learning

14 0.042457614 11 jmlr-2007-Anytime Learning of Decision Trees

15 0.04160466 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression

16 0.035871625 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning

17 0.035367437 68 jmlr-2007-Preventing Over-Fitting during Model Selection via Bayesian Regularisation of the Hyper-Parameters     (Special Topic on Model Selection)

18 0.034752786 56 jmlr-2007-Multi-Task Learning for Classification with Dirichlet Process Priors

19 0.034647718 42 jmlr-2007-Infinitely Imbalanced Logistic Regression

20 0.033349268 18 jmlr-2007-Characterizing the Function Space for Bayesian Kernel Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.224), (1, 0.164), (2, -0.005), (3, 0.187), (4, 0.211), (5, -0.382), (6, 0.084), (7, 0.146), (8, 0.042), (9, 0.028), (10, 0.009), (11, -0.045), (12, 0.148), (13, 0.08), (14, -0.095), (15, -0.038), (16, 0.033), (17, -0.035), (18, -0.025), (19, -0.023), (20, -0.068), (21, -0.049), (22, 0.053), (23, -0.043), (24, 0.078), (25, -0.033), (26, -0.014), (27, 0.013), (28, 0.096), (29, 0.136), (30, -0.147), (31, 0.044), (32, 0.025), (33, -0.002), (34, -0.037), (35, -0.107), (36, -0.047), (37, 0.038), (38, 0.068), (39, 0.03), (40, 0.021), (41, 0.017), (42, 0.035), (43, -0.016), (44, 0.014), (45, -0.023), (46, -0.072), (47, -0.033), (48, 0.035), (49, -0.089)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97888523 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection

Author: Wei Pan, Xiaotong Shen

Abstract: Variable selection in clustering analysis is both challenging and important. In the context of modelbased clustering analysis with a common diagonal covariance matrix, which is especially suitable for “high dimension, low sample size” settings, we propose a penalized likelihood approach with an L1 penalty function, automatically realizing variable selection via thresholding and delivering a sparse solution. We derive an EM algorithm to fit our proposed model, and propose a modified BIC as a model selection criterion to choose the number of components and the penalization parameter. A simulation study and an application to gene function prediction with gene expression profiles demonstrate the utility of our method. Keywords: BIC, EM, mixture model, penalized likelihood, soft-thresholding, shrinkage

2 0.8621667 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification

Author: Jia Li, Surajit Ray, Bruce G. Lindsay

Abstract: A new clustering approach based on mode identification is developed by applying new optimization techniques to a nonparametric density estimator. A cluster is formed by those sample points that ascend to the same local maximum (mode) of the density function. The path from a point to its associated mode is efficiently solved by an EM-style algorithm, namely, the Modal EM (MEM). This method is then extended for hierarchical clustering by recursively locating modes of kernel density estimators with increasing bandwidths. Without model fitting, the mode-based clustering yields a density description for every cluster, a major advantage of mixture-model-based clustering. Moreover, it ensures that every cluster corresponds to a bump of the density. The issue of diagnosing clustering results is also investigated. Specifically, a pairwise separability measure for clusters is defined using the ridgeline between the density bumps of two clusters. The ridgeline is solved for by the Ridgeline EM (REM) algorithm, an extension of MEM. Based upon this new measure, a cluster merging procedure is created to enforce strong separation. Experiments on simulated and real data demonstrate that the mode-based clustering approach tends to combine the strengths of linkage and mixture-model-based clustering. In addition, the approach is robust in high dimensions and when clusters deviate substantially from Gaussian distributions. Both of these cases pose difficulty for parametric mixture modeling. A C package on the new algorithms is developed for public access at http://www.stat.psu.edu/∼jiali/hmac. Keywords: modal clustering, mode-based clustering, mixture modeling, modal EM, ridgeline EM, nonparametric density

3 0.5010137 8 jmlr-2007-A Unified Continuous Optimization Framework for Center-Based Clustering Methods

Author: Marc Teboulle

Abstract: Center-based partitioning clustering algorithms rely on minimizing an appropriately formulated objective function, and different formulations suggest different possible algorithms. In this paper, we start with the standard nonconvex and nonsmooth formulation of the partitioning clustering problem. We demonstrate that within this elementary formulation, convex analysis tools and optimization theory provide a unifying language and framework to design, analyze and extend hard and soft center-based clustering algorithms, through a generic algorithm which retains the computational simplicity of the popular k-means scheme. We show that several well known and more recent center-based clustering algorithms, which have been derived either heuristically, or/and have emerged from intuitive analogies in physics, statistical techniques and information theoretic perspectives can be recovered as special cases of the proposed analysis and we streamline their relationships. Keywords: clustering, k-means algorithm, convex analysis, support and asymptotic functions, distance-like functions, Bregman and Csiszar divergences, nonlinear means, nonsmooth optimization, smoothing algorithms, fixed point methods, deterministic annealing, expectation maximization, information theory and entropy methods

4 0.48287174 52 jmlr-2007-Margin Trees for High-dimensional Classification

Author: Robert Tibshirani, Trevor Hastie

Abstract: We propose a method for the classification of more than two classes, from high-dimensional features. Our approach is to build a binary decision tree in a top-down manner, using the optimal margin classifier at each split. We implement an exact greedy algorithm for this task, and compare its performance to less greedy procedures based on clustering of the matrix of pairwise margins. We compare the performance of the “margin tree” to the closely related “all-pairs” (one versus one) support vector machine, and nearest centroids on a number of cancer microarray data sets. We also develop a simple method for feature selection. We find that the margin tree has accuracy that is competitive with other methods and offers additional interpretability in its putative grouping of the classes. Keywords: maximum margin classifier, support vector machine, decision tree, CART

5 0.37633306 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption

Author: Philippe Rigollet

Abstract: We consider semi-supervised classification when part of the available data is unlabeled. These unlabeled data can be useful for the classification problem when we make an assumption relating the behavior of the regression function to that of the marginal distribution. Seeger (2000) proposed the well-known cluster assumption as a reasonable one. We propose a mathematical formulation of this assumption and a method based on density level sets estimation that takes advantage of it to achieve fast rates of convergence both in the number of unlabeled examples and the number of labeled examples. Keywords: semi-supervised learning, statistical learning theory, classification, cluster assumption, generalization bounds

6 0.36122188 3 jmlr-2007-A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation

7 0.32615876 1 jmlr-2007-"Ideal Parent" Structure Learning for Continuous Variable Bayesian Networks

8 0.31106758 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians

9 0.26099187 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression

10 0.23304828 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition

11 0.22352766 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method

12 0.22199593 72 jmlr-2007-Relational Dependency Networks

13 0.21379019 68 jmlr-2007-Preventing Over-Fitting during Model Selection via Bayesian Regularisation of the Hyper-Parameters     (Special Topic on Model Selection)

14 0.19985583 77 jmlr-2007-Stagewise Lasso

15 0.19803286 11 jmlr-2007-Anytime Learning of Decision Trees

16 0.19631886 22 jmlr-2007-Compression-Based Averaging of Selective Naive Bayes Classifiers     (Special Topic on Model Selection)

17 0.19235016 56 jmlr-2007-Multi-Task Learning for Classification with Dirichlet Process Priors

18 0.17874174 79 jmlr-2007-Structure and Majority Classes in Decision Tree Learning

19 0.17102931 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time

20 0.16812858 15 jmlr-2007-Bilinear Discriminant Component Analysis


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.019), (8, 0.05), (10, 0.019), (12, 0.028), (15, 0.066), (22, 0.011), (24, 0.324), (28, 0.063), (40, 0.04), (45, 0.043), (48, 0.03), (60, 0.044), (80, 0.045), (85, 0.041), (98, 0.069)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.69775581 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection

Author: Wei Pan, Xiaotong Shen

Abstract: Variable selection in clustering analysis is both challenging and important. In the context of modelbased clustering analysis with a common diagonal covariance matrix, which is especially suitable for “high dimension, low sample size” settings, we propose a penalized likelihood approach with an L1 penalty function, automatically realizing variable selection via thresholding and delivering a sparse solution. We derive an EM algorithm to fit our proposed model, and propose a modified BIC as a model selection criterion to choose the number of components and the penalization parameter. A simulation study and an application to gene function prediction with gene expression profiles demonstrate the utility of our method. Keywords: BIC, EM, mixture model, penalized likelihood, soft-thresholding, shrinkage

2 0.37054276 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification

Author: Jia Li, Surajit Ray, Bruce G. Lindsay

Abstract: A new clustering approach based on mode identification is developed by applying new optimization techniques to a nonparametric density estimator. A cluster is formed by those sample points that ascend to the same local maximum (mode) of the density function. The path from a point to its associated mode is efficiently solved by an EM-style algorithm, namely, the Modal EM (MEM). This method is then extended for hierarchical clustering by recursively locating modes of kernel density estimators with increasing bandwidths. Without model fitting, the mode-based clustering yields a density description for every cluster, a major advantage of mixture-model-based clustering. Moreover, it ensures that every cluster corresponds to a bump of the density. The issue of diagnosing clustering results is also investigated. Specifically, a pairwise separability measure for clusters is defined using the ridgeline between the density bumps of two clusters. The ridgeline is solved for by the Ridgeline EM (REM) algorithm, an extension of MEM. Based upon this new measure, a cluster merging procedure is created to enforce strong separation. Experiments on simulated and real data demonstrate that the mode-based clustering approach tends to combine the strengths of linkage and mixture-model-based clustering. In addition, the approach is robust in high dimensions and when clusters deviate substantially from Gaussian distributions. Both of these cases pose difficulty for parametric mixture modeling. A C package on the new algorithms is developed for public access at http://www.stat.psu.edu/∼jiali/hmac. Keywords: modal clustering, mode-based clustering, mixture modeling, modal EM, ridgeline EM, nonparametric density

3 0.36781731 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data

Author: Amir Globerson, Gal Chechik, Fernando Pereira, Naftali Tishby

Abstract: Embedding algorithms search for a low dimensional continuous representation of data, but most algorithms only handle objects of a single type for which pairwise distances are specified. This paper describes a method for embedding objects of different types, such as images and text, into a single common Euclidean space, based on their co-occurrence statistics. The joint distributions are modeled as exponentials of Euclidean distances in the low-dimensional embedding space, which links the problem to convex optimization over positive semidefinite matrices. The local structure of the embedding corresponds to the statistical correlations via random walks in the Euclidean space. We quantify the performance of our method on two text data sets, and show that it consistently and significantly outperforms standard methods of statistical correspondence modeling, such as multidimensional scaling, IsoMap and correspondence analysis. Keywords: embedding algorithms, manifold learning, exponential families, multidimensional scaling, matrix factorization, semidefinite programming

4 0.36610013 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time

Author: Roland Nilsson, José M. Peña, Johan Björkegren, Jesper Tegnér

Abstract: We analyze two different feature selection problems: finding a minimal feature set optimal for classification (MINIMAL - OPTIMAL) vs. finding all features relevant to the target variable (ALL RELEVANT ). The latter problem is motivated by recent applications within bioinformatics, particularly gene expression analysis. For both problems, we identify classes of data distributions for which there exist consistent, polynomial-time algorithms. We also prove that ALL - RELEVANT is much harder than MINIMAL - OPTIMAL and propose two consistent, polynomial-time algorithms. We argue that the distribution classes considered are reasonable in many practical cases, so that our results simplify feature selection in a wide range of machine learning tasks. Keywords: learning theory, relevance, classification, Markov blanket, bioinformatics

5 0.3641336 39 jmlr-2007-Handling Missing Values when Applying Classification Models

Author: Maytal Saar-Tsechansky, Foster Provost

Abstract: Much work has studied the effect of different treatments of missing values on model induction, but little work has analyzed treatments for the common case of missing values at prediction time. This paper first compares several different methods—predictive value imputation, the distributionbased imputation used by C4.5, and using reduced models—for applying classification trees to instances with missing values (and also shows evidence that the results generalize to bagged trees and to logistic regression). The results show that for the two most popular treatments, each is preferable under different conditions. Strikingly the reduced-models approach, seldom mentioned or used, consistently outperforms the other two methods, sometimes by a large margin. The lack of attention to reduced modeling may be due in part to its (perceived) expense in terms of computation or storage. Therefore, we then introduce and evaluate alternative, hybrid approaches that allow users to balance between more accurate but computationally expensive reduced modeling and the other, less accurate but less computationally expensive treatments. The results show that the hybrid methods can scale gracefully to the amount of investment in computation/storage, and that they outperform imputation even for small investments. Keywords: missing data, classification, classification trees, decision trees, imputation

6 0.36296487 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation

7 0.35424846 53 jmlr-2007-Maximum Entropy Density Estimation with Generalized Regularization and an Application to Species Distribution Modeling

8 0.3509475 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition

9 0.34962147 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification

10 0.34828952 88 jmlr-2007-Unlabeled Compression Schemes for Maximum Classes

11 0.34813955 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning

12 0.34774223 52 jmlr-2007-Margin Trees for High-dimensional Classification

13 0.34705696 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method

14 0.34526169 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models

15 0.34441811 72 jmlr-2007-Relational Dependency Networks

16 0.34340894 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study

17 0.34212744 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm

18 0.34102798 47 jmlr-2007-Learning Horn Expressions with LOGAN-H

19 0.34035286 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR

20 0.33832932 68 jmlr-2007-Preventing Over-Fitting during Model Selection via Bayesian Regularisation of the Hyper-Parameters     (Special Topic on Model Selection)