nips nips2003 nips2003-79 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Darya Chudova, Christopher Hart, Eric Mjolsness, Padhraic Smyth
Abstract: We propose a functional mixture model for simultaneous clustering and alignment of sets of curves measured on a discrete time grid. The model is specifically tailored to gene expression time course data. Each functional cluster center is a nonlinear combination of solutions of a simple linear differential equation that describes the change of individual mRNA levels when the synthesis and decay rates are constant. The mixture of continuous time parametric functional forms allows one to (a) account for the heterogeneity in the observed profiles, (b) align the profiles in time by estimating real-valued time shifts, (c) capture the synthesis and decay of mRNA in the course of an experiment, and (d) regularize noisy profiles by enforcing smoothness in the mean curves. We derive an EM algorithm for estimating the parameters of the model, and apply the proposed approach to the set of cycling genes in yeast. The experiments show consistent improvement in predictive power and within-cluster variance compared to regular Gaussian mixtures.
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract We propose a functional mixture model for simultaneous clustering and alignment of sets of curves measured on a discrete time grid. [sent-7, score-1.088]
2 The model is specifically tailored to gene expression time course data. [sent-8, score-0.65]
3 Each functional cluster center is a nonlinear combination of solutions of a simple linear differential equation that describes the change of individual mRNA levels when the synthesis and decay rates are constant. [sent-9, score-0.811]
4 We derive an EM algorithm for estimating the parameters of the model, and apply the proposed approach to the set of cycling genes in yeast. [sent-11, score-0.163]
5 The experiments show consistent improvement in predictive power and within cluster variance compared to regular Gaussian mixtures. [sent-12, score-0.196]
6 Each curve typically consists of a sequence of measurements as a function of discrete time or some other independent variable. [sent-14, score-0.329]
7 Examples of such data include trajectory tracks of individuals or objects (Gaffney and Smyth, 2003) and biomedical measurements of response to drug therapies (James and Sugar, 2003). [sent-15, score-0.074]
8 In some cases, the curve data is measured on regular grids and the curves have the same lengths. [sent-16, score-0.404]
9 Curve data analysis is typically referred to as “functional data analysis” in the statistical literature (Ramsay and Silverman, 1997), where the observed measurements are treated as samples from an assumed underlying continuous-time process. [sent-19, score-0.074]
10 Clustering in this context can be performed using mixtures of continuous functions such as splines (James and Sugar, 2003) and polynomial regression models (DeSarbo and Cron, 1988; Gaffney and Smyth, 2003). [sent-20, score-0.261]
11 In this paper we focus on the specific problem of analyzing gene expression time course data and extend the functional mixture modelling approach to (a) cluster the data using plausible biological models for the expression dynamics, and (b) align the expression profiles along the time axis. [sent-21, score-1.84]
12 Large scale gene expression profiling measures the relative abundance of tens of thousands of mRNA molecules in the cell simultaneously. [sent-22, score-0.603]
13 The goal of clustering in this context is to discover groups of genes with similar dynamics and find sets of genes that participate in the same regulatory mechanism. [sent-23, score-0.508]
14 For the most part, clustering approaches to gene expression data treat the observed curves as elements of the corresponding vector-space. [sent-24, score-0.837]
15 A variety of vector-based clustering algorithms have been successfully applied, ranging from hierarchical clustering (Eisen et al. [sent-25, score-0.291]
16 There has been some previous work on continuous time models in this context; e.g., mixtures of splines (Bar-Joseph et al., 2002) were applied to clustering and alignment of the cell-cycle regulated genes in yeast, and good interpolation properties were demonstrated. [sent-29, score-0.191]
18 However, such spline models are “black boxes” that can approximate virtually any temporal behavior — they do not take the specifics of gene regulation mechanisms into account. [sent-33, score-0.423]
19 In contrast, in this paper we propose specific functional forms that are targeted at short time courses, in which fairly simple reaction kinetics can describe the possible dynamics. [sent-34, score-0.477]
20 Alignment: Individual genes within clusters of co-regulated genes can exhibit variations in the time of the onset of their characteristic behaviors or in their initial concentrations. [sent-35, score-0.401]
21 Such differences can significantly increase within-cluster variability and produce incorrect cluster assignments. [sent-36, score-0.153]
22 We address this problem by explicitly modelling the unknown real-valued time shifts between different genes. [sent-37, score-0.189]
23 The high noise levels of observed gene expression data imply the need for robust estimation of mean behavior. [sent-39, score-0.606]
24 Functional models (such as those that we propose here) naturally impose smoothness in the learned mean curves, providing implicit regularization for such data. [sent-40, score-0.1]
25 1 Model Description Generative Model We describe a generative model that allows one to simulate heterogeneous sets of curves from a mixture of functional curve models. [sent-44, score-0.831]
26 Each generated curve Yi is a series of observations at a discrete set of values Xi of an independent variable. [sent-45, score-0.158]
27 In many applications, and for gene expression measurements in particular, the independent variable X is time. [sent-46, score-0.587]
28 We adopt the same general approach to functional curve clustering that is used in regression mixture models (DeSarbo and Cron, 1988), random effects regression mixtures (Gaffney and Smyth, 2003) and mixtures of spline models (James and Sugar, 2003). [sent-47, score-1.079]
29 The clusters are defined by their mean curves, parametrized by a set of parameters µk: fk(x) = f(x, µk), and a noise model that describes the deviation from the mean functional form (described below in Section 2). [sent-49, score-0.751]
30 In contrast to standard Gaussian mixtures, the functional mixture is defined in continuous time, allowing evaluation of the mean curves on a continuum of “off-grid” time points. [sent-51, score-0.871]
31 This allows us to extend the functional mixture models described above by incorporating real-valued alignment of observed curves along the time axis. [sent-52, score-0.969]
32 In particular, the precise time grid Xi of observation i is assumed unknown and is allowed to vary from curve to curve. [sent-53, score-0.307]
33 This is common in practice when the measurement process cannot be synchronized from curve to curve. [sent-54, score-0.168]
34 For simplicity we assume (unknown) linear shifts of the curves along the time axis. [sent-55, score-0.388]
35 We fix the basic time grid X, but generate each curve on its own grid (X + φi ) with a curve-specific time offset φi . [sent-56, score-0.525]
36 We treat the offset corresponding to curve Yi as an additional real-valued latent variable in the model. [sent-57, score-0.155]
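For concreteness, the following is a minimal simulation sketch of the generative process just described (an illustration under stated assumptions, not the authors' code). It assumes additive Gaussian noise around the cluster mean curve and a Gaussian distribution of curve-specific offsets; mean_curve stands for any continuous-time mean function f(x, µ), such as the parametric form introduced in the next section.

```python
import numpy as np

def simulate_curves(n_curves, X, cluster_params, cluster_probs,
                    noise_std, shift_std, mean_curve, rng=None):
    """Draw curves from a functional mixture with per-curve time shifts.

    mean_curve(x, mu) evaluates a cluster's mean function at the (possibly
    off-grid) times x; cluster_params[k] holds the parameters mu_k of cluster k.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    curves, labels, shifts = [], [], []
    for _ in range(n_curves):
        k = rng.choice(len(cluster_probs), p=cluster_probs)  # cluster membership Z_i
        phi = rng.normal(0.0, shift_std)                     # real-valued time offset phi_i
        f = mean_curve(X + phi, cluster_params[k])           # mean curve on the shifted grid
        y = f + rng.normal(0.0, noise_std, size=X.shape)     # additive observation noise
        curves.append(y)
        labels.append(k)
        shifts.append(phi)
    return np.array(curves), np.array(labels), np.array(shifts)
```

Every curve shares the nominal grid X, but its mean is evaluated at X + φi, which is where the continuous-time parametrization of the mean curves becomes essential.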
37 2 Functional form of the mean curves The generative model described above uses a generic functional form f (x, µ) for the mean curves. [sent-59, score-0.626]
38 In this section, we introduce a parametric representation of f (x, µ) that is specifically tailored to short gene expression time courses. [sent-60, score-0.667]
39 The mRNA concentrations {v1, …, vN} measured in gene expression experiments can be modeled via a system of differential equations with the following structure (see Gibson and Mjolsness): [sent-64, score-0.513]
40 dvi/dt = σ g1,i(v1, …, vN) − ε vi g2,i(v1, …, vN) (4) The first term on the right hand side is responsible for the synthesis of vi with maximal rate σ, and the second term represents decay with maximal fractional rate ε. [sent-72, score-0.279]
41 Rather than fitting this full system directly, we make a few simplifying assumptions about the equation and use it as a motivation for the parametric functional form that we propose below. [sent-74, score-0.402]
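As a reminder of where the exponential building blocks come from (a standard derivation, not reproduced from the paper): if the regulatory inputs are held fixed so that the synthesis rate σ and the fractional decay rate ε are constant, Equation (4) reduces to a linear ODE with an explicit solution,

```latex
\frac{dv}{dt} = \sigma - \varepsilon v
\qquad\Longrightarrow\qquad
v(t) = \frac{\sigma}{\varepsilon}
     + \Bigl(v(0) - \frac{\sigma}{\varepsilon}\Bigr)\, e^{-\varepsilon t},
```

so each constant-rate regime relaxes exponentially from its initial level toward the steady state σ/ε. The parametric mean curves proposed below combine two such regimes with a smooth switch between them.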
42 The nonlinear sigmoid transfer function Φ(x, δ, ψ) allows us to model switching between the two regimes at x = ψ with slope δ: Φ(x, δ, ψ) = (1 + e−δ(x−ψ))−1. The random effects on the time grid allow us to time-shift each curve individually by replacing x with (x + φi) in Equation (6). [sent-80, score-0.56]
43 There are other biologically plausible transformations of the curves in a cluster that we do not pursue in this paper, such as allowing ψ to vary with each curve, or representing minor differences in the regulatory functions g1,i and g2,i which affect the timing of their transitions. [sent-81, score-0.397]
44 When learning these models from data, we restrict the class of functions in Equation (6) to those with non-negative initial conditions and non-negative synthesis and decay rates, and we enforce continuity of the exponents at the switching point: fa(ψ, µ1) = fa(ψ, µ2). [sent-82, score-0.51]
45 Finally, given that the log-normal noise model is well-suited to gene expression data (Yeung et al., 2001), we use the logarithm of the functional forms proposed in Equation (6) as a general class of functions that describe the mean behavior within the clusters. [sent-83, score-0.554]
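Equation (6) itself is not reproduced in this extract, so the sketch below is only one plausible reading of the description above: two constant-rate exponential regimes (each a solution of the linear ODE above) blended by the sigmoid Φ(x, δ, ψ), made continuous at the switching point ψ, and returned on a log scale to match the log-normal noise model. The parameter names mirror µ = {ν, σ1, ε1, σ2, ε2, δ, ψ}, but the exact combination used in the paper may differ.

```python
import numpy as np

def regime(x, v0, syn, dec):
    """Constant-rate solution of dv/dt = syn - dec*v, started at level v0."""
    return syn / dec + (v0 - syn / dec) * np.exp(-dec * x)

def sigmoid(x, delta, psi):
    """Switching function Phi(x, delta, psi) = 1 / (1 + exp(-delta*(x - psi)))."""
    return 1.0 / (1.0 + np.exp(-delta * (x - psi)))

def mean_curve(x, mu):
    """Hypothetical reading of Eq. (6): sigmoid blend of two exponential regimes,
    continuous at the switching point psi, returned on a log scale.

    mu is a dict with keys nu, syn1, dec1, syn2, dec2, delta, psi; the rates and
    the initial level nu are assumed strictly positive.
    """
    x = np.asarray(x, dtype=float)
    f1 = regime(x, mu["nu"], mu["syn1"], mu["dec1"])             # first regime from level nu
    v_psi = regime(mu["psi"], mu["nu"], mu["syn1"], mu["dec1"])  # level reached at the switch
    f2 = regime(x - mu["psi"], v_psi, mu["syn2"], mu["dec2"])    # second regime, continuous at psi
    w = sigmoid(x, mu["delta"], mu["psi"])
    return np.log((1.0 - w) * f1 + w * f2)                       # log for the log-normal noise model
```

A cluster k then corresponds to one parameter set µk, and a time-shifted curve is evaluated simply as mean_curve(X + φi, µk).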
47 We estimate the model with an EM algorithm that treats as hidden variables the cluster membership Z and time offsets φ for each observed curve. [sent-94, score-0.322]
48 Each cluster is characterized by the parameters of the mean curves, noise variance, cluster probability and time shift distribution: Θk = {µk , Ck , P (k), P (φ|k)}. [sent-95, score-0.537]
49 • In the E-step, we find the posterior distribution of the cluster membership Zi and the time shift φi for each curve Yi , given current cluster parameters Θ; • In the M-step, we maximize the expected log-likelihood with respect to the posterior distribution of Z and φ by adjusting Θ. [sent-96, score-0.635]
50 Since the time shifts φ are real-valued, the E-step requires evaluation of the posterior distribution over a continuous domain of φ. [sent-97, score-0.241]
51 To speed up the EM algorithm, we initialize the coefficients of the mean functional forms by approximating the mean vectors obtained using a standard vector-based Gaussian mixture model on the same data. [sent-106, score-0.65]
52 This typically produces a useful set of initial parameter values which are then optimized by running the full EM algorithm for a functional mixture model with alignment. [sent-107, score-0.465]
53 We use the EM algorithm in its maximum a posteriori (MAP) formulation, using a zero-mean Gaussian prior distribution on the curve-specific time shifts. [sent-108, score-0.097]
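The E-step below is a schematic sketch rather than the authors' implementation: it approximates the posterior over the continuous shift φ by evaluating it on a discrete grid of candidate shifts, uses the zero-mean Gaussian prior on φ from the MAP formulation, and assumes isotropic Gaussian noise around the (log-scale) mean curves, reusing the mean_curve sketch above.

```python
import numpy as np

def e_step(Y, X, clusters, phi_grid, shift_std):
    """Joint posterior over (cluster k, shift phi) for each curve, on a shift grid.

    Y: (n_curves, n_times) observed log-scale curves; X: (n_times,) nominal grid.
    clusters: list of dicts with keys "prob", "mu", "noise_var".
    Returns resp[i, k, j] = p(Z_i = k, phi_i = phi_grid[j] | Y_i, Theta).
    """
    Y = np.asarray(Y, dtype=float)
    X = np.asarray(X, dtype=float)
    phi_grid = np.asarray(phi_grid, dtype=float)
    n, K, J = Y.shape[0], len(clusters), len(phi_grid)
    log_post = np.empty((n, K, J))
    log_prior_phi = -0.5 * (phi_grid / shift_std) ** 2        # zero-mean Gaussian prior on shifts (MAP)
    for k, c in enumerate(clusters):
        for j, phi in enumerate(phi_grid):
            f = mean_curve(X + phi, c["mu"])                  # cluster mean on the shifted grid
            sq = ((Y - f) ** 2).sum(axis=1)
            log_lik = (-0.5 * sq / c["noise_var"]
                       - 0.5 * X.size * np.log(2.0 * np.pi * c["noise_var"]))
            log_post[:, k, j] = np.log(c["prob"]) + log_prior_phi[j] + log_lik
    log_post -= log_post.max(axis=(1, 2), keepdims=True)      # stabilize before exponentiating
    resp = np.exp(log_post)
    return resp / resp.sum(axis=(1, 2), keepdims=True)
```

The M-step would then re-estimate each Θk = {µk, Ck, P(k), P(φ|k)} from these responsibilities, for example re-fitting µk by responsibility-weighted nonlinear least squares.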
54 Figure 1 shows examples of mean curves (Equation (6)) that were learned from actual gene expression data. [sent-111, score-0.77]
55 Each functional form has 7 free parameters: µ = {ν, σ1, ε1, σ2, ε2, δ, ψ}. [sent-112, score-0.311]
56 Note that, as with many time-course gene expression data sets, having so few points presents an obvious problem for parameter estimation directly from a single curve. [sent-113, score-0.513]
57 However, the curve-specific time shifts in effect provide a finer sampling grid that helps to recover the parameters from observed data, in addition to the “pooling” effect of learning common functional forms for groups of curves. [sent-114, score-0.277]
58 Figure 2: Cross-validated conditional logP scores (left) and cross-validated interpolation mean-squared error (MSE) (right), as a function of the number of mixture components, for the first cell cycle of the Cho et al. data. [sent-130, score-0.546]
60 The right-hand side of Figure 1 shows a scatter plot of the switching parameters for 5 clusters estimated from 10 different cross-validation runs. [sent-133, score-0.185]
61 The 5 clusters exhibit different dynamics (as indicated by the spread in parameter space) and the algorithm finds qualitatively similar parameter estimates for each cluster across the 10 different runs. [sent-134, score-0.245]
62 1 Experimental Evaluation Gene expression data We illustrate our approach using the immediate responses of yeast Saccharomyces cerevisiae when released from cell cycle arrest, using the raw data reported by Cho et al (1998). [sent-136, score-0.445]
63 Briefly, the CDC28 TS mutants were released from the cell cycle arrest by temperature shift. [sent-137, score-0.253]
64 Cells were harvested and RNA was collected every 10 min for 170 min, spanning two cell cycles. [sent-138, score-0.09]
65 The RNA was then analyzed using Affymetrix gene chip arrays. [sent-139, score-0.347]
66 From these data we select the 416 genes that are reported to be actively regulated throughout the cell cycle, and further require expression for 30 continuous minutes above an Affymetrix absolute level of 100 (a total of 385 genes pass these criteria). [sent-140, score-0.625]
67 We normalize each gene expression vector by its median expression value throughout the time course to reduce the influence of probe-specific intensity biases. [sent-141, score-0.882]
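A minimal version of this normalization step might look as follows (the gene-filtering criteria above are assumed to have been applied already). Here raw is a hypothetical genes-by-timepoints array of absolute expression levels, and applying the log to the data, rather than only to the mean curves, is an added assumption made to match the log-scale model.

```python
import numpy as np

def median_normalize(raw, eps=1e-12):
    """Divide each gene's time course by its median intensity, then take logs.

    raw: (n_genes, n_times) array of absolute expression levels. The per-gene
    median division reduces probe-specific intensity biases; eps guards against
    zeros in noisy low-intensity probes.
    """
    raw = np.asarray(raw, dtype=float)
    med = np.median(raw, axis=1, keepdims=True)
    return np.log(raw / (med + eps) + eps)
```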
68 2 Experimental results In order to study the immediate cellular response we analyze only the first 8 time points of this data set. [sent-143, score-0.097]
69 We evaluate the cross-validated out-of-sample performance of the proposed functional mixture model. [sent-144, score-0.465]
70 A conventional Gaussian mixture model applied to observations on the discrete time grid is used for baseline comparison. [sent-145, score-0.375]
71 It is not at all clear a priori that functional mixture models with a highly constrained parametric set of mean curves should outperform Gaussian mixtures, which impose no parametric assumptions and are free to approximate any observations on the discrete grid. [sent-146, score-1.108]
72 While one can expect that mixtures of splines (Bar-Joseph et al. [sent-147, score-0.208]
73 There are two main reasons to use the proposed restricted class of functional forms: (1) to be able to interpret the resulting mean curves in terms of the synthesis / decay rates at each of the regimes as well as the switching times; and (2) to naturally incorporate alignment by real-valued shifts along the time axis. [sent-149, score-0.311]
74 Figure 3: Cross-validated one-step-ahead prediction MSE (left) and cross-validated interpolation MSE (right) for the first cell cycle of the Cho et al. data. [sent-204, score-0.314]
76 In Figures 2 and 3, we present 5-fold cross-validated out-of-sample scores, as a function of the number of clusters, for both the functional mixture model and the baseline Gaussian mixture model. [sent-207, score-0.619]
77 The conditional logP score (Figure 2, left panel) estimates the average probability assigned to a single measurement at time points 6, 7, 8 within an unseen curve, given the first five measurements of the same curve. [sent-208, score-0.217]
78 The conditioning on the first few time points allows us to demonstrate the power of models with random effects since estimation of alignment based on partial curves improves the probability of the remainder of the curve. [sent-210, score-0.541]
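One way such a conditional score can be computed under a fitted model of this kind (the paper does not spell out its exact estimator in this extract): condition on the first five measurements to obtain a posterior over cluster and shift, then average the predictive log-density of each held-out measurement under that posterior, reusing the e_step and mean_curve sketches above.

```python
import numpy as np

def conditional_logp(y, X, clusters, phi_grid, shift_std,
                     n_cond=5, held_out=(5, 6, 7)):
    """Average log p(y_t | y_1..y_{n_cond}) over held-out time indices
    (0-based indices 5, 6, 7 correspond to time points 6, 7, 8)."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    post = e_step(y[None, :n_cond], X[:n_cond], clusters, phi_grid, shift_std)[0]
    scores = []
    for t in held_out:
        # predictive density: posterior-weighted mixture over clusters and shifts
        dens = 0.0
        for k, c in enumerate(clusters):
            var = c["noise_var"]
            for j, phi in enumerate(phi_grid):
                m = mean_curve(X[t] + phi, c["mu"])           # predicted mean at time t
                dens += (post[k, j] * np.exp(-0.5 * (y[t] - m) ** 2 / var)
                         / np.sqrt(2.0 * np.pi * var))
        scores.append(np.log(dens))
    return float(np.mean(scores))
```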
79 The interpolation error in Figure 2 (right panel) shows the accuracy of recovering missing measurements. [sent-211, score-0.112]
80 To evaluate the interpolation error, we trained the models on the full training curves, and then assumed a single measurement was missing from the test curve (at time points 2 through 7). [sent-213, score-0.419]
81 The model was then used to make a prediction at the time point of the missing measurement, and the interpolation error was averaged for all time points and test curves. [sent-214, score-0.306]
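Schematically, and again reusing the sketches above rather than the authors' code, the interpolated value at a missing time point can be taken as the posterior-weighted mean-curve prediction, with the posterior computed from the remaining points of the test curve.

```python
import numpy as np

def interpolate_point(y, X, t_miss, clusters, phi_grid, shift_std):
    """Posterior-mean prediction of the measurement at index t_miss,
    conditioning on all other time points of the test curve."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    keep = np.array([t for t in range(len(X)) if t != t_miss])
    post = e_step(y[None, keep], X[keep], clusters, phi_grid, shift_std)[0]
    pred = 0.0
    for k, c in enumerate(clusters):
        for j, phi in enumerate(phi_grid):
            pred += post[k, j] * mean_curve(X[t_miss] + phi, c["mu"])
    return float(pred)
```

Squaring the difference between this prediction and the held-out value, and averaging over time points and test curves, gives the interpolation MSE reported in the figures.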
82 The right panel of Figure 3 contains a detailed view of these results: each subplot shows the mean error in recovering values at a particular time point. [sent-215, score-0.199]
83 While some time points are harder to approximate than the others (in particular, T = 2, 3), the functional mixture models provide better interpolation properties overall. [sent-216, score-0.716]
84 Difficulties in approximating at T = 2, 3 can be attributed to the large changes in the intensities at these time points, and possibly indicate the limitations of the functional forms chosen as candidate mean curves. [sent-217, score-0.535]
85 Again, we trained the models on the full curves, and then used the models to make predictions for test curves at time T given all measurements up to T − 1 (T = 6, 7, 8). [sent-219, score-0.454]
86 Figures 2 and 3 demonstrate a consistent improvement in the out-of-sample performance of the functional mixtures. [sent-220, score-0.311]
87 The improvements seen in these plots result from integrating alignment along the time axis into the clustering framework. [sent-221, score-0.388]
88 We found that the functional mixture model without alignment does not result in better out-of-sample performance than discrete-time Gaussian mixtures. [sent-222, score-0.631]
89 In the experiments presented in this paper we used a Gaussian prior distribution on the time-shift parameter to softly constrain the shifts to lie roughly within 1. [sent-224, score-0.134]
90 The discrete grid alignment approaches that we proposed earlier in Chudova et al (2003) can successfully align curves if one assumes offsets on the scale of multiple time grid points. [sent-226, score-0.854]
91 Also worth noting is the fact that continuous time mixtures can align curves sampled on non-uniform time grids (such non-uniform sampling in time is relatively common in gene expression time course data). [sent-228, score-1.439]
92 5 Conclusions We presented a probabilistic framework for joint clustering and alignment of gene expression time course data using continuous time cluster models. [sent-229, score-1.243]
93 These models allow (1) real-valued off-grid alignment of unequally spaced measurements, (2) off-grid interpolation, and (3) regularization by enforcing smoothness implied by the functional cluster forms. [sent-230, score-0.723]
94 We have demonstrated that a mixture of simple parametric functions with nonlinear transition between two exponential regimes can model a broad class of gene expression profiles in a single cell cycle of yeast. [sent-231, score-0.996]
95 Cross-validated performance scores show the advantages of continuous time models over standard Gaussian mixtures. [sent-232, score-0.233]
96 Possible extensions include adding additional curve-specific parameters, incorporating other alignment methods, and introducing periodic functional forms for multi-cycle data. [sent-233, score-0.546]
97 A new approach to analyzing gene expression time series data. [sent-240, score-0.61]
98 A genome-wide transcriptional analysis of the mitotic cell cycle. [sent-263, score-0.09]
99 Mixture models for translation-invariant clustering of sets of multi-dimensional curves. [sent-270, score-0.167]
100 Model-based clustering and data transformations for gene expression data. [sent-336, score-0.638]
wordName wordTfidf (topN-words)
[('gene', 0.347), ('functional', 0.311), ('mse', 0.237), ('curves', 0.199), ('alignment', 0.166), ('expression', 0.166), ('mixture', 0.154), ('mrna', 0.154), ('cluster', 0.153), ('smyth', 0.132), ('synthesis', 0.13), ('genes', 0.129), ('gaffney', 0.128), ('clustering', 0.125), ('curve', 0.122), ('irvine', 0.122), ('mm', 0.12), ('decay', 0.115), ('interpolation', 0.112), ('regimes', 0.111), ('minutes', 0.109), ('mixtures', 0.106), ('switching', 0.105), ('mjolsness', 0.102), ('sugar', 0.102), ('align', 0.101), ('time', 0.097), ('shifts', 0.092), ('cell', 0.09), ('cho', 0.089), ('grid', 0.088), ('ck', 0.087), ('chudova', 0.077), ('cron', 0.077), ('desarbo', 0.077), ('measurements', 0.074), ('cycle', 0.071), ('forms', 0.069), ('logp', 0.067), ('exponents', 0.067), ('intensity', 0.066), ('james', 0.062), ('yeung', 0.061), ('splines', 0.061), ('pro', 0.06), ('vn', 0.06), ('les', 0.059), ('mean', 0.058), ('parametric', 0.057), ('continuous', 0.052), ('affymetrix', 0.051), ('arrest', 0.051), ('eisen', 0.051), ('gibson', 0.051), ('lemay', 0.051), ('mestl', 0.051), ('ramsay', 0.051), ('yi', 0.051), ('enforcing', 0.051), ('em', 0.048), ('measurement', 0.046), ('dynamics', 0.046), ('clusters', 0.046), ('fk', 0.045), ('regulatory', 0.045), ('regulated', 0.045), ('heterogeneous', 0.045), ('silverman', 0.045), ('panel', 0.044), ('regular', 0.043), ('models', 0.042), ('scores', 0.042), ('shift', 0.042), ('et', 0.041), ('regularize', 0.041), ('courses', 0.041), ('released', 0.041), ('rna', 0.041), ('course', 0.04), ('grids', 0.04), ('gaussian', 0.038), ('offsets', 0.038), ('effects', 0.037), ('components', 0.036), ('discrete', 0.036), ('yeast', 0.036), ('glass', 0.036), ('production', 0.036), ('ninth', 0.036), ('california', 0.035), ('levels', 0.035), ('parameters', 0.034), ('vi', 0.034), ('groups', 0.034), ('ner', 0.034), ('membership', 0.034), ('spline', 0.034), ('equation', 0.034), ('offset', 0.033), ('rates', 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 79 nips-2003-Gene Expression Clustering with Functional Mixture Models
Author: Darya Chudova, Christopher Hart, Eric Mjolsness, Padhraic Smyth
Abstract: We propose a functional mixture model for simultaneous clustering and alignment of sets of curves measured on a discrete time grid. The model is specifically tailored to gene expression time course data. Each functional cluster center is a nonlinear combination of solutions of a simple linear differential equation that describes the change of individual mRNA levels when the synthesis and decay rates are constant. The mixture of continuous time parametric functional forms allows one to (a) account for the heterogeneity in the observed profiles, (b) align the profiles in time by estimating real-valued time shifts, (c) capture the synthesis and decay of mRNA in the course of an experiment, and (d) regularize noisy profiles by enforcing smoothness in the mean curves. We derive an EM algorithm for estimating the parameters of the model, and apply the proposed approach to the set of cycling genes in yeast. The experiments show consistent improvement in predictive power and within cluster variance compared to regular Gaussian mixtures. 1
2 0.23980555 86 nips-2003-ICA-based Clustering of Genes from Microarray Expression Data
Author: Su-in Lee, Serafim Batzoglou
Abstract: We propose an unsupervised methodology using independent component analysis (ICA) to cluster genes from DNA microarray data. Based on an ICA mixture model of genomic expression patterns, linear and nonlinear ICA finds components that are specific to certain biological processes. Genes that exhibit significant up-regulation or down-regulation within each component are grouped into clusters. We test the statistical significance of enrichment of gene annotations within each cluster. ICA-based clustering outperformed other leading methods in constructing functionally coherent clusters on various datasets. This result supports our model of genomic expression data as composite effect of independent biological processes. Comparison of clustering performance among various ICA algorithms including a kernel-based nonlinear ICA algorithm shows that nonlinear ICA performed the best for small datasets and natural-gradient maximization-likelihood worked well for all the datasets. 1 In trod u ction Microarray technology has enabled genome-wide expression profiling, promising to provide insight into underlying biological mechanism involved in gene regulation. To aid such discoveries, mathematical tools that are versatile enough to capture the underlying biology and simple enough to be applied efficiently on large datasets are needed. Analysis tools based on novel data mining techniques have been proposed [1]-[6]. When applying mathematical models and tools to microarray analysis, clustering genes that have the similar biological properties is an important step for three reasons: reduction of data complexity, prediction of gene function, and evaluation of the analysis approach by measuring the statistical significance of biological coherence of gene clusters. Independent component analysis (ICA) linearly decomposes each of N vectors into M common component vectors (N≥M) so that each component is statistically as independent from the others as possible. One of the main applications of ICA is blind source separation (BSS) that aims to separate source signals from their mixtures. There have been a few attempts to apply ICA to the microarray expression data to extract meaningful signals each corresponding to independent biological process [5]-[6]. In this paper, we provide the first evidence that ICA is a superior mathematical model and clustering tool for microarray analysis, compared to the most widely used methods namely PCA and k-means clustering. We also introduce the application of nonlinear ICA to microarray analysis, and show that it outperforms linear ICA on some datasets. We apply ICA to microarray data to decompose the input data into statistically independent components. Then, genes are clustered in an unsupervised fashion into non-mutually exclusive clusters. Each independent component is assigned a putative biological meaning based on functional annotations of genes that are predominant within the component. We systematically evaluate the clustering performance of several ICA algorithms on four expression datasets and show that ICA-based clustering is superior to other leading methods that have been applied to analyze the same datasets. We also proposed a kernel based nonlinear ICA algorithm for dealing with more realistic mixture model. Among the different linear ICA algorithms including six linear and one nonlinear ICA algorithm, the natural-gradient maximum-likelihood estimation method (NMLE) [7]-[8] performs well in all the datasets. 
Kernel-based nonlinear ICA method worked better for three small datasets. 2 M a t h e m at i c a l m o de l o f g e n om e - wi d e e x p r e s s i on Several distinct biological processes take place simultaneously inside a cell; each biological process has its own expression program to up-regulate or down-regulate the level of expression of specific sets of genes. We model a genome-wide expression pattern in a given condition (measured by a microarray assay) as a mixture of signals generated by statistically independent biological processes with different activation levels. We design two kinds of models for genomic expression pattern: a linear and nonlinear mixture model. Suppose that a cell is governed by M independent biological processes S = (s1, …, sM)T, each of which is a vector of K gene expression levels, and that we measure the levels of expression of all genes in N conditions, resulting in a microarray expression matrix X = (x1,…,xN)T. The expression level at each different condition j can be expressed as linear combinations of the M biological processes: xj=aj1s1+…+ajMsM. We can express this idea concisely in matrix notation as follows. X = AS , x1 a11 L a1M s1 M = M M M x N a N 1 L a NM s M (1) More generally, we can express X = (x1,…,xN)T as a post-nonlinear mixture of the underlying independent processes as follows, where f(.) is a nonlinear mapping from N to N dimensional space. X = f ( AS ), a11 L a1M s1 x1 M = f M M M a xN N 1 L a NM s M (2) 3 I n d e p e n d e n t c o m po n e n t a n a l y s i s In the models described above, since we assume that the underlying biological processes are independent, we suggest that vectors S=(s1,…,sM) are statistically independent and so ICA can recover S from the observed microarray data X. For linear ICA, we apply natural-gradient maximum estimation (NMLE) method which was proposed in [7] and was made more efficient by using natural gradient method in [8]. We also apply nonlinear ICA using reproducible kernel Hilbert spaces (RKHS) based on [9], as follows: 1. We map the N dimensional input data xi to Ф(xi) in the feature space by using the kernel trick. The feature space is defined by the relationship Ф(xi)TФ(xj)=k(xi,, xj). That is, inner product of mapped data is determined to by a kernel function k(.,.) in the input space; we used a Gaussian radial basis function (RBF) kernel (k(x,y)=exp(-|x-y|2)) and a polynomial kernel of degree 2 (k(x,y)=(xTy+1)2). To perform mapping, we found orthonormal bases of the feature space by randomly sampling L input data v={v1,…,vL} 1000 times and choosing one set minimizing the condition number of Φv=(Φ(v1),…,Φ(vL)). Then, a set of orthonormal bases of the feature space is determined by the selected L images of input data in v as Ξ = Φv(Φv TΦv )-1/2. We map all input data x1,…,xK, each corresponding to a gene, to Ψ(x1),…,Ψ(xK) in the feature space with basis Ξ, as follows: k (v1 , v1 ) K k (v1 , v L ) M M k (v L , v1 ) L k (v L , v L ) Ψ(xi)=(ΦvTΦv )-1/2Φv TΦv (xi) = −1 / 2 k (v1 , xi ) ∈ ℜ L (1≤ i≤K) (3) M k ( v L , xi ) 2. We linearly decompose the mapped data Ψ=[Ψ(x1),.,Ψ(xK)] ∈RL×K into statistically independent components using NMLE. 4 Proposed approach The microarray dataset we are given is in matrix form where each element xij corresponds to the level of expression of the jth gene in the ith experimental condition. Missing values are imputed by KNNImpute [10], an algorithm based on k nearest neighbors that is widely used in microarray analysis. 
Given the expression matrix X of N experiment by K genes, we perform the following steps. 1. Apply ICA to decompose X into independent components y1, …,yM as in Equations (1) and (2). Prior to applying ICA, remove any rows that make the expression matrix X singular. After ICA, each component denoted by yi is a vector comprising K loads gene expression levels, i.e., yi = (yi1, ...,yiK). We chose to let the number of components M to be maximized, which is equal the number of microarray experiments N because the maximum for N in our datasets was 250, which is smaller than the number of biological processes we hypothesize to act within a cell. 2. For each component, cluster genes according to their relative loads yij/mean(yi). Based on our ICA model, each component is a putative genomic expression program of an independent biological process. Thus, our hypothesis is that genes showing relatively high or low expression level within the component are the most important for the process. We create two clusters for each component: one cluster containing genes with expression level higher than a threshold, and one cluster containing genes with expression level lower than a threshold. Cluster i,1 = {gene j | y ij > mean( y i ) + c × std( y i )} Cluster i,2 = {gene j | y ij < mean( y i ) – c × std( y i )} (4) Here, mean(yi) is the average, std(yi) is the standard deviation of yi; and c is an adjustable coefficient. The value of the coefficient c was varied from 1.0 to 2.0 and the result for c=1.25 was presented in this paper. The results for other values of c are similar, and are presented on the website www.stanford.edu/~silee/ICA/. 3. For each cluster, measure the enrichment of each cluster with genes of known functional annotations. Using the Gene Ontology (GO) [11] and KEGG [12] gene annotation databases, we calculate the p-value for each cluster with every gene annotation, which is the probability that the cluster contains the observed number of genes with the annotation by chance assuming the hypergeometric distribution (details in [4]). For each gene annotation, the minimum p-value that is smaller than 10-7 obtained from any cluster was collected. If no p-value smaller than 10-7 is found, we consider the gene annotation not to be detected by the approach. As a result, we can assign biological meaning to each cluster and the corresponding independent component and we can evaluate the clustering performance by comparing the collected minimum p-value for each gene annotation with that from other clustering approach. 5 P e r f o r m a n c e e v a l uat i o n We tested the ICA-based clustering to four expression datasets (D1—D4) described in Table 1. Table 1: The four datasets used in our analysis ARRAY TYPE DESCRIPTION # OF GENES (K) # OF EXPS (N) D1 Spotted 4579 22 D2 Oligonucl eotide Spotted Oligonucl eotide Budding yeast during cell cycle and CLB2/CLN3 overactive strain [13] Budding yeast during cell cycle [14] 6616 17 C. elegans in various conditions [3] Normal human tissue including 19 kinds of tissues [15] 17817 7070 553 59 D3 D4 For D1 and D4, we compared the biological coherence of ICA components with that of PCA applied in the same datasets in [1] and [2], respectively. For D2 and D3, we compared with k-means clustering and the topomap method, applied in the same datasets in [4] and [3], respectively. We applied nonlinear ICA to D1, D2 and D4. Dataset D3 is very large and makes the nonlinear algorithm unstable. 
D1 was preprocessed to contain log-ratios xij=log2(Rij/Gij) between red and green intensities. In [1], principal components, referred to as eigenarrays, were hypothesized to be genomic expression programs of distinct biological processes. We compared the biological coherence of independent components with that of principal components found by [1]. Comparison was done in two ways: (1) For each component, we grouped genes within top x% of significant up-regulation and down-regulation (as measured by the load of the gene in the component) into two clusters with x adjusted from 5% to 45%. For each value of x, statistical significance was measured for clusters from independent components and compared with that from principal components based on the minimum p-value for each gene annotation, as described in Section 4. We made a scatter plot to compare the negative log of the collected best p-values for each gene annotation when x is fixed to be 15%, shown in Figure 1 (a) (2) Same as before, except we did not fix the value of x; instead, we collected the minimum p-value from each method for each GO and KEGG gene annotation category and compared the collected p-values (Figure 1 (b)). For both cases, in the majority of the gene annotation categories ICA produced significantly lower p-values than PCA did, especially for gene annotation for which both ICA and PCA showed high significance. Figure 1. Comparison of linear ICA (NMLE) to PCA on dataset D1 (a) when x is fixed to be 15%; (b) when x is not fixed. (c) Three independent components of dataset D4. Each gene is mapped to a point based on the value assigned to the gene in three independent components, which are enriched with liver- (red), Muscle- (orange) and vulva-specific (green) genes, respectively. The expression levels of genes in D4 were normalized across the 59 experiments, and the logarithms of the resulting values were taken. Experiments 57, 58, and 59 were removed because they made the expression matrix nearly singular. In [2], a clustering approach based on PCA and subsequent visual inspection was applied to an earlier version of this dataset, containing 50 of the 59 samples. After we performed ICA, the most significant independent components were enriched for liver-specific, muscle-specific and vulva-specific genes with p-value of 10-133, 10-124 and 100-117, respectively. In the ICA liver cluster, 198 genes were liver specific (out of a total of 244), as compared with the 23 liver-specific genes identified in [2] using PCA. The ICA muscle cluster of 235 genes contains 199 muscle specific genes compared to 19 muscle-specific genes identified in [2]. We generated a 3-dimensional scatter plot of the load expression levels of all genes annotated in [15] on these significant ICA components in Figure 1 (c). We can see that the liver-specific, muscle-specific and vulva-specific genes are strongly biased to lie on the x-, y-, and z- axis, respectively. We applied nonlinear ICA on this dataset and the first four most significant clusters from nonlinear ICA with Gaussian RBF kernel were muscle-specific, liver-specific, vulva-specific and brain-specific with p-value of 10-158, 10-127, 10-112 and 10-70, respectively, showing considerable improvement over the linear ICA clusters. For D2, variance-normalization was applied to the 3000 most variant genes as in [4]. The 17th experiment, which made the expression matrix close to singular, was removed. 
We measured the statistical significance of clusters as described in Section 4 and compared the smallest p-value of each gene annotation from our approach to that from k-means clustering applied to the same dataset [4]. We made a scatter plot for comparing the negative log of the smallest p-value (y-axis) from ICA clusters with that from k-means clustering (x-axis). The coefficient c is varied from 1.0 to 2.0 and the superiority of ICA-based clustering to k-means clustering does not change. In many practical settings, estimation of the best c is not needed; we can adjust c to get a desired size of the cluster unless our focus is to blindly find the size of clusters. Figure 2 (a) (b) (c) shows for c=1.25 a comparison of the performance of linear ICA (NMLE), nonlinear ICA with Gaussian RBF kernel (NICA gauss), and k-means clustering (k-means). For D3, first we removed experiments that contained more than 7000 missing values, because ICA does not perform properly when the dataset contains many missing values. The 250 remaining experiments were used, containing expression levels for 17817 genes preprocessed to be log-ratios xij=log2(Rij/Gij) between red and green intensities. We compared the biological coherence of clusters by our approach with that of topomap-based approach applied to the same dataset in [3]. The result when c=1.25 is plotted in the Figure 2 (d). We observe that the two methods perform very similarly, with most categories having roughly the same p-value in ICA and in the topomap clusters. The topomap clustering approach performs slightly better in a larger fraction of the categories. Still, we consider this performance a confirmation that ICA is a widely applicable method that requires minimal training: in this case the missing values and high diversity of the data make clustering especially challenging, while the topomap approach was specifically designed and manually trained for this dataset as described in [3]. Finally, we compared different ICA algorithms in terms of clustering performance. We tested six linear ICA methods: Natural Gradient Maximum Likelihood Estimation (NMLE) [7][8], Joint Approximate Diagonalization of Eigenmatrices [16], Fast Fixed Point ICA with three different measures of non-Gaussianity [17], and Extended Information Maximization (Infomax) [18]. We also tested two kernels for nonlinear ICA: Gaussian RBF kernel, and polynomial kernel (NICA ploy). For each dataset, we compared the biological coherence of clusters generated by each method. Among the six linear ICA algorithms, NMLE was the best in all datasets. Among both linear and nonlinear methods, the Gaussian kernel nonlinear ICA method was the best in Datasets D1, D2 and D4, the polynomial kernel nonlinear ICA method was best in Dataset D4, and NMLE was best in the large datasets (D3 and D4). In Figure 3, we compare the NMLE method with three other ICA methods for the dataset D2. Overall, the NMLE algorithm consistently performed well in all datasets. The nonlinear ICA algorithms performed best in the small datasets, but were unstable in the two largest datasets. More comparison results are demonstrated in the website www.stanford.edu/~silee/ICA/. Figure 2: Comparison of (a) linear ICA (NMLE) with k-means clustering, (b) nonlinear ICA with Gaussian RBF kernel to linear ICA (NMLE), and (c) nonlinear ICA with Gaussian RBF kernel to k-means clustering on the dataset D2. (d) Comparison of linear ICA (NMLE) to topomap-based approach on the dataset D3. 
Figure 3: Comparison of linear ICA (NMLE) to (a) Extended Infomax ICA algorithm, (b) Fast ICA with symmetric orthogonalization and tanh nonlinearity and (c) Nonlinear ICA with polynomial kernel of degree 2 on the Dataset (B). 6 D i s c u s s i on ICA is a powerful statistical method for separating mixed independent signals. We proposed applying ICA to decompose microarray data into independent gene expression patterns of underlying biological processes, and to group genes into clusters that are mutually non-exclusive with statistically significant functional coherence. Our clustering method outperformed several leading methods on a variety of datasets, with the added advantage that it requires setting only one parameter, namely the fraction c of standard deviations beyond which a gene is considered to be associated with a component’s cluster. We observed that performance was not very sensitive to that parameter, suggesting that ICA is robust enough to be used for clustering with little human intervention. The empirical performance of ICA in our tests supports the hypothesis that statistical independence is a good criterion for separating mixed biological signals in microarray data. The Extended Infomax ICA algorithm proposed in [18] can automatically determine whether the distribution of each source signal is super-Gaussian or sub-Gaussian. Interestingly, the application of Extended Infomax ICA to all the expression datasets uncovered no source signal with sub-Gaussian distribution. A likely explanation is that global gene expression profiles are mixtures of super-Gaussian sources rather than of sub-Gaussian sources. This finding is consistent with the following intuition: underlying biological processes are super-Gaussian, because they affect sharply the relevant genes, typically a small fraction of all genes, and leave the majority of genes relatively unaffected. Acknowledgments We thank Te-Won Lee for helpful feedback. We thank Relly Brandman, Chuong Do, and Yueyi Liu for edits to the manuscript. References [1] Alter O, Brown PO, Botstein D. Proc. Natl. Acad. Sci. USA 97(18):10101-10106, 2000. [2] Misra J, Schmitt W, et al. Genome Research 12:1112-1120, 2002. [3] Kim SK, Lund J, et al. Science 293:2087-2092, 2001. [4] Tavazoie S, Hughes JD, et al. Nature Genetics 22(3):281-285, 1999. [5] Hori G, Inoue M, et al. Proc. 3rd Int. Workshop on Independent Component Analysis and Blind Signal Separation, Helsinki, Finland, pp. 151-155, 2000. [6] Liebermeister W. Bioinformatics 18(1):51-60, 2002. [7] Bell AJ. and Sejnowski TJ. Neural Computation, 7:1129-1159, 1995. [8] Amari S, Cichocki A, et al. In Advances in Neural Information Processing Systems 8, pp. 757-763. Cambridge, MA: MIT Press, 1996. [9] Harmeling S, Ziehe A, et al. In Advances in Neural Information Processing Systems 8, pp. 757-763. Cambridge, MA: MIT Press, . [10] Troyanskaya O., Cantor M, et al. Bioinformatics 17:520-525, 2001. [11] The Gene Ontology Consortium. Genome Research 11:1425-1433, 2001. [12] Kanehisa M., Goto S. In Current Topics in Computational Molecular Biology, pp. 301–315. MIT-Press, Cambridge, MA, 2002. [13] Spellman PT, Sherlock G, et al. Mol. Biol. Cell 9:3273-3297, 1998. [14] Cho RJ, Campell MJ, et al. Molecular Cell 2:65-73, 1998. [15] Hsiao L, Dangond F, et al. Physiol. Genomics 7:97-104, 2001. [16] Cardoso JF, Neural Computation 11(1):157-192, 1999. [17] Hyvarinen A. IEEE Transactions on Neural Network 10(3):626–634, 1999. [18] Lee TW, Girolami M, et al. Neural Computation 11:417–441, 1999.
3 0.15487573 73 nips-2003-Feature Selection in Clustering Problems
Author: Volker Roth, Tilman Lange
Abstract: A novel approach to combining clustering and feature selection is presented. It implements a wrapper strategy for feature selection, in the sense that the features are directly selected by optimizing the discriminative power of the used partitioning algorithm. On the technical side, we present an efficient optimization algorithm with guaranteed local convergence property. The only free parameter of this method is selected by a resampling-based stability analysis. Experiments with real-world datasets demonstrate that our method is able to infer both meaningful partitions and meaningful subsets of features. 1
4 0.10165331 82 nips-2003-Geometric Clustering Using the Information Bottleneck Method
Author: Susanne Still, William Bialek, Léon Bottou
Abstract: We argue that K–means and deterministic annealing algorithms for geometric clustering can be derived from the more general Information Bottleneck approach. If we cluster the identities of data points to preserve information about their location, the set of optimal solutions is massively degenerate. But if we treat the equations that define the optimal solution as an iterative algorithm, then a set of “smooth” initial conditions selects solutions with the desired geometrical properties. In addition to conceptual unification, we argue that this approach can be more efficient and robust than classic algorithms. 1
5 0.10084067 87 nips-2003-Identifying Structure across Pre-partitioned Data
Author: Zvika Marx, Ido Dagan, Eli Shamir
Abstract: We propose an information-theoretic clustering approach that incorporates a pre-known partition of the data, aiming to identify common clusters that cut across the given partition. In the standard clustering setting the formation of clusters is guided by a single source of feature information. The newly utilized pre-partition factor introduces an additional bias that counterbalances the impact of the features whenever they become correlated with this known partition. The resulting algorithmic framework was applied successfully to synthetic data, as well as to identifying text-based cross-religion correspondences. 1 In t ro d u c t i o n The standard task of feature-based data clustering deals with a single set of elements that are characterized by a unified set of features. The goal of the clustering task is to identify implicit constructs, or themes, within the clustered set, grouping together elements that are characterized similarly by the features. In recent years there has been growing interest in more complex clustering settings, in which additional information is incorporated [1], [2]. Several such extensions ([3]-[5]) are based on the information bottleneck (IB) framework [6], which facilitates coherent information-theoretic representation of different information types. In a recent line of research we have investigated the cross-dataset clustering task [7], [8]. In this setting, some inherent a-priori partition of the clustered data to distinct subsets is given. The clustering goal it to identify corresponding (analogous) structures that cut across the different subsets, while ignoring internal structures that characterize individual subsets. To accomplish this task, those features that commonly characterize elements across the different subsets guide the clustering process, while within-subset regularities are neutralized. In [7], we presented a distance-based hard clustering algorithm for the coupledclustering problem, in which the clustered data is pre-partitioned to two subsets. In [8], our setting, generalized to pre-partitions of any number of subsets, was addressed by a heuristic extension of the probabilistic IB algorithm, yielding improved empirical results. Specifically, the algorithm in [8] was based on a modification of the IB stable-point equation, which amplified the impact of features characterizing a formed cluster across all, or most, subsets. This paper describes an information-theoretic framework that motivates and extends the algorithm proposed in [8]. The given pre-partitioning is represented via a probability distribution variable, which may represent “soft” pre-partitioning of the data, versus the strictly disjoint subsets assumed in the earlier cross-dataset framework. Further, we present a new functional that captures the cross-partition motivation. From the new functional, we derive a stable-point equation underlying our algorithmic framework in conjunction with the corresponding IB equation. Our algorithm was tested empirically on synthetic data and on a real-world textbased task that aimed to identify corresponding themes across distinct religions. We have cross-clustered five sets of keywords that were extracted from topical corpora of texts about Buddhism, Christianity, Hinduism, Islam and Judaism. 
In distinction from standard clustering results, our algorithm reveals themes that are common to all religions, such as sacred writings, festivals, narratives and myths and theological principles, and avoids topical clusters that correspond to individual religions (for example, ‘Christmas’ and ‘Easter’ are clustered together with ‘Ramadan’ rather than with ‘Church’). Finally, we have paid specific attention to the framework of clustering with side information [4]. While this approach was presented for a somewhat different mindset, it might be used directly to address clustering across pre-partitioned data. We compare the technical details of the two approaches and demonstrate empirically that clustering with side information does not seem appropriate for the kind of cross-partition tasks that we explored. 2 Th e In fo rmat i o n B ot t len eck M et h od Probabilistic (“soft”) data clustering outputs, for each element x of the set being clustered and each cluster c, an assignment probability p(c|x). The IB method [6] interprets probabilistic clustering as lossy data compression. The given data is represented by a random variable X ranging over the clustered elements. X is compressed through another random variable C, ranging over the clusters. Every element x is characterized by conditional probability distribution p(Y|x), where Y is a third random variable taking the members y of a given set of features as values. The IB method formalizes the clustering task as minimizing the IB functional: L(IB) = I(C; X) − β I(C; Y) . (1) As known from information theory (Ch. 13 of [9]), minimizing the mutual information I(C; X) optimizes distorted compression rate. A complementary bias to maximize I(C; Y) is interpreted in [6] as articulating the level of relevance of Y to the obtained clustering, inferred from the level by which C can predict Y. β is a free parameter counterbalancing the two biases. It is shown in [6] that p(c|x) values that minimize L(IB) satisfy the following equation: p(c|x) = 1 p (c )e −β DKL [ p ( Y |x )|| p (Y |c ) ] , z( β , x) (2) where DKL stands for the Kullback-Leibler (KL) divergence, or relative entropy, between two distributions and z(β ,x) is a normalization function over C. Eq. (2) implies that, optimally, x is assigned to c in proportion to their KL distance in a feature distribution space, where the distribution p(Y|c) takes the role of a Start at time t = 0 and iterate the following update-steps, till convergence: IB1: initialize p t (c|x) randomly or arbitrarily −β DKL [ p (Y | x )|| pt −1 (Y |c ) ] pt (c|x) ∝ IB2: pt (c) = IB3: pt (y|c) = pt −1 (c ) e ∑ x (t = 0) (t > 0) p t (c | x ) p ( x ) 1 ∑ pt ( c | x) p ( y | x ) p ( x) p t (c ) x Figure 1: The Information Bottleneck iterative algorithm (with fixed β and |C|). representative, or centroid, of c. The feature variable Y is hence utilized as the (exclusive) means to guide clustering, beyond the random nature of compression. Figure 1 presents the IB iterative algorithm for a fixed value of β . The IB1 update step follows Eq. (2). The other two steps, which are derived from the IB functional as well, estimate the p(c) and p(y|c) values required for the next iteration. The algorithm converges to a local minimum of the IB functional. The IB setting, particularly the derivation of steps IB1 and IB3 of the algorithm, assumes that Y and C are independent given X, that is: I(C; Y|X) = ∑x p(x) I(C|x; Y|x) = 0. 
The balancing parameter β affects the number of distinct clusters being formed in a manner that resembles (inverse) temperature in physical systems. The higher β is (i.e., the stronger the bias to construct C that predicts Y well), more distinct clusters are required for encoding the data. For each |C| = 2, 3, …, there is a minimal β value, enabling the formation of |C| distinct clusters. Setting β to be smaller than this critical value corresponding to the current |C| would result in two or more clusters that are identical to one another. Based on this, the iterative algorithm is applied repeatedly within a gradual cooling-like (deterministic annealing) scheme: starting with random initialization of the p0 (c|x)'s, generate two clusters with the critical β value, found empirically, for |C| = 2. Then, use a perturbation on the obtained two-cluster configuration to initialize the p0(c|x)'s for a larger set of clusters and execute additional runs of the algorithm to identify the critical β value for the larger |C|. And so on: each output configuration is used as a basis for a more granular one. The final outcome is a “soft hierarchy” of probabilistic clusters. 3 Cro ss- p a rt i t i o n Clu st eri n g Cross-partition (CP) clustering introduces a factor – a pre-given partition of the clustered data – additional to what considered in a standard clustering setting. For representing this factor we introduce the pre-partitioning variable W, ranging over all parts w of the pre-given partition. Every data element x is associated with W through a given probability distribution p(W|x). Our goal is to cluster the data, so that the clusters C would not be correlated with W. We notice that Y, which is intended to direct the formation of clusters, might be a-priori correlated with W, so the formed clusters might end up being correlated with W as well. Our method aims at eliminating this aspect of Y. 3.1 I n f or ma t i o n D e f oc us i n g As noted, some of the information conveyed by Y characterizes structures correlated with W, while the other part of the information characterizes the target cross-W structures. We are interested in detecting the latter while filtering out the former. However, there is no direct a-priori separation between the two parts of the Ymediated information. Our strategy in tackling this difficulty is: we follow in general Y's directions, as the IB method does, while avoiding Y's impact whenever it entails undesired inter-dependencies of C and W. Our strategy implies conflicting biases with regard to the mutual information I(C,Y): it should be maximized in order to form meaningful clusters, but be minimized as well in the specific context where Y entails C–W dependencies. Accordingly, we propose a computational procedure directed by two distinct cost-terms in tandem. The first one is the IB functional (Eq. 1), introducing the bias to maximize I(C,Y). With this bias alone, Y might dictate (or “explain”, in retrospect) substantial C–W dependencies, implying a low I(C;W|Y) value. 1 Hence, the guideline of preventing Y from accounting for C–W dependencies is realized through an opposing bias of maximizing I(C;W|Y) = ∑y p(y) I(C|y; W|y). The second cost term – the Information Defocusing (ID) functional – consequently counterbalances minimization of I(C,Y) against the new bias: L(ID) = I(C; Y) − η I(C;W|Y) , (3) where η is a free parameter articulating the tradeoff between the biases. 
The ID functional captures our goal of reducing the impact of Y selectively: “defocusing” a specific aspect of the information Y conveys: the information correlated with W. In a like manner to the stable-point equation of the IB functional (Eq. 2), we derive the following stable-point equation for the ID functional: η p ( w) 1 p ( c )∏ w p ( y | c, w) η +1 , p(c|y) = z (η , y ) (4) where z(η,y) is a normalization function over C. The derivation relies on an additional assumption, I(C;W) = 0, imposing the intended independence between C and W (the detailed derivation will be described elsewhere). The intuitive interpretation of Eq. (4) is as follows: a feature y is to be associated with a cluster c in proportion to a weighted, though flattened, geometric mean of the “W-projected centroids” p(y|c,w), priored by p(c). 2 This scheme overweighs y's that contribute to c evenly across W. Thus, clusters satisfying Eq. (4) are situated around centroids biased towards evenly contributing features. The higher η is, heavier emphasis is put on suppressing disagreements between the w's. For η → ∞ a plain weighted geometric-mean scheme is obtained. The inclusion of a step derived from Eq. (4) in our algorithm (see below) facilitates convergence on a configuration with centroids dominated by features that are evenly distributed across W. 3.2 T h e Cr os s - p a r t i t i on C l us t e r i n g A l g or i t h m Our proposed cross partition (CP) clustering algorithm (Fig. 2) seeks a clustering configuration that optimizes simultaneously both the IB and ID functionals, 1 Notice that “Z explaining well the dependencies between A and B” is equivalent with “A and B sharing little information in common given Z”, i.e. low I(A;B|Z) . Complete conditional independence is exemplified in the IB framework, assuming I(C;Y|X) = 0. 2 Eq. (4) resembles our suggestion in [8] to compute a geometric average over the subsets; in the current paper this scheme is analytically derived from the ID functional. Start at time t = 0 and iterate the following update-steps, till convergence: CP1: Initialize p t (c|x) randomly or arbitrarily −β DKL [ p (Y | x )|| pt −1 (Y |c ) ] pt (c|x) ∝ CP2: pt (c) = CP3: p*t (y|c,w) = CP4: (t = 0) p t −1 (c ) e ∑ x (t > 0) p t (c | x ) p ( x ) 1 ∑ pt ( c | x ) p ( y | x ) p ( w | x ) p ( x ) p t ( c ) p ( w) x Initialize p*t (c) randomly or arbitrarily (t = 0) p*t (c) (t > 0) = ∑ y p *t −1 (c | y ) p ( y ) η CP5: p*t (c|y) ∝ p *t (c)∏w p *t ( y | c, w) η +1 CP6: pt (y|c) = p ( w) p *t (c | y ) p ( y ) p *t (c ) Figure 2: The cross-partition clustering iterative algorithm (with fixed β, η, and |C|). thus obtaining clusters that cut across the pre-given partition W. To this end, the algorithm interleaves an iterative computation of the stable-point equations, and the additional estimated parameters, for both functionals. Steps CP1, CP2 and CP6 correspond to the computations related to the IB functional, while steps CP3, CP4 and CP5, which compute a separate set of parameters (denoted by an asterisk), correspond to the ID functional. Figure 3 summarizes the roles of the two functionals in the dynamics of the CP algorithm. The two components of the iterative cycle are tied together in steps CP3 and CP6, in which parameters from one set are used as input to compute a parameter of other set. The derivation of step CP3 relies on an additional assumption, namely that C, Y and W are jointly independent given X. 
This joint-independence assumption, which extends to W the underlying assumption of the IB setting that C and Y are independent given X, still entails the IB stable-point equation. At convergence, the stable-point equations for both the IB and ID functionals are satisfied, each by its own set of parameters (in steps CP1 and CP5). The deterministic annealing scheme, which gradually increases β over repeated runs (see Sec. 2), is applied to the CP algorithm as well, with η held fixed. For a given target number of clusters |C|, the algorithm empirically converges for a wide range of η values.³

³ High η values tend to dictate centroids with features that are unevenly distributed across W, resulting in shrinkage of some of the clusters. Further analysis will be provided in future work.

Figure 3: The interplay of the IB and the ID functionals in the CP algorithm. The IB functional trades off minimizing I(C;X) against maximizing I(C;Y) (weighted by β), while the ID functional trades off minimizing I(C;Y) against maximizing I(C;W|Y) (weighted by η); the underlying assumptions are I(C;Y;W|X) = 0 and I(C;W) = 0.

4 Experimental Results

Our synthetic setting consisted of 75 virtual elements, evenly pre-partitioned into three 25-element parts denoted X1, X2 and X3 (in our formalism, for each clustered element x, p(w|x) = 1 holds for either w = 1, 2, or 3). On top of this pre-partition, we partitioned the data twice, obtaining two (exhaustive) clustering configurations:
1. Target cross-W clustering: five clusters, each with representatives from all Xw's;
2. Masking within-w clustering: six clusters, each consisting of roughly half the elements of either X1, X2 or X3, with no representatives from the other Xw's.
Each cluster, in both configurations, was characterized by a designated subset of features. Masking clusters were designed to be more salient than target clusters: they had more designated features (60 vs. 48 per cluster, i.e., 360 vs. 240 in total) and their elements shared higher feature–element (virtual) co-occurrence counts with those designated features (900 vs. 450 per element–feature pair). Noise (a random positive integer < 200) was added to all counts associating elements with their designated features (for both within-w and cross-W clusters), as well as to roughly a quarter of the zero counts associating elements with the rest of the features.

The plain IB method consistently produced configurations strongly correlated with the masking clustering, while the CP algorithm revealed the target configuration. We obtained (see Table 1A) almost perfect results in configurations of nearly equal-sized cross-W clusters, and somewhat less perfect reconstruction in configurations of diverging sizes (6, 9, 15, 21 and 24). Performance was measured, relative to the optimal target–output cluster match, as the proportion of elements correctly assigned, where the assignment of an element x follows its highest p(c|x). The reported results were averaged over 200 runs. They were obtained for the optimal η, which was found to be higher in the diverging-sizes task.

In the text-based task, the clustered elements – keywords – were automatically extracted from five distinct corpora addressing five religions: introductory web pages, online magazines, encyclopedic entries etc., all downloaded from the Internet. The clustered keyword set X was consequently pre-partitioned into disjoint subsets {Xw}w∈W, one for each religion⁴ (|Xw| ≈ 200 for each w). We conducted experiments involving religion pairs as well as all five religions simultaneously. We took the features Y to be a set of words that commonly occur within all five corpora (|Y| ≈ 7000).
x–y co-occurrences were recorded within a ±5-word sliding window truncated by sentence boundaries. η was fixed to a value (1.0) enabling the formation of 20 clusters in all settings. The obtained clusters revealed interesting cross-religion themes (see Sec. 1). For instance, in the cluster (one of nine) capturing the theme of sacred festivals, the three highest-p(c|x) members within each religion were Full-moon, Ceremony, Celebration (Buddhism); Easter, Sunday, Christmas (Christianity); Puja, Ceremony, Festival (Hinduism); Id-al-Fitr, Friday, Ramadan (Islam); and Sukkoth, Shavuot, Rosh-Hodesh (Judaism). The closest cluster produced by the plain IB method was poorer by far, including the Islamic Ramadan and Id and the Jewish Passover, Rosh-Hashanah and Sabbath (which our method ranked high too), but no single related term from the other religions.

Table 1: Average correct-assignment proportion scores for the synthetic task (A) and Jaccard-coefficient scores for the religion keyword classification task (B).

A. Synthetic Data          IB      CP
equal-size clusters        .305    .985
non-equal clusters         .292    .827

B. Religion Data (cross-expert agreement on religion pairs: .462±.232)
                           IB           Coupled Clustering [7]   CP
religion pairs             .200±.100    .220±.138                .407±.144
all five (one case)        .104         –                        .167

⁴ A keyword x that appeared in the corpora of different religions was considered a distinct element for each religion, so the Xw were kept disjoint.

Our external evaluation standards were cross-religion keyword classes constructed manually by experts in comparative religion studies. One such expert classification involved all five religions, and eight classifications addressed religions in pairs. Each of the eight religion-pair classifications was contributed by two independent experts using the same keywords, so we could also assess the agreement between experts. As an overlap measure we employed the Jaccard coefficient: the number of element pairs co-assigned by both one of the evaluated clusters and one of the expert classes, divided by the number of pairs co-assigned by either our clusters or the expert (or both). We did not assume that the number of expert classes is known in advance (as done in the synthetic experiments), so the results were averaged over all configurations of the 2–16 cluster hierarchy for each experiment. The results shown in Table 1B – a clear improvement relative to plain IB and to the distance-based coupled clustering [7] – are nevertheless persistent when the number of clusters is taken to be equal to the number of classes, or when only the best score in the hierarchy is considered. The level of cross-expert agreement indicates that our results are reasonably close to the scores expected in such a subjective task.

5 Comparison to Related Work

The information bottleneck framework has served as the basis for several approaches that represent additional information in their clustering setting. The multivariate information bottleneck (MIB) adapts the IB framework to networks of multiple variables [3]. However, all variables in such networks are either compressed (like X) or predicted (like Y). The incorporation of an empirical variable to be masked or defocused, in the sense of our W, is not possible. Including such variables in the MIB framework might be explored in future work. Particularly relevant to our work is the IB-based method for extracting relevant constructs with side information [4].
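As a brief aside, the pair-counting Jaccard coefficient used for Table 1B is easy to state in code. The sketch below is our own illustration (a hypothetical helper, not from the paper); it assumes hard labels, i.e., each keyword is assigned to its highest-p(c|x) cluster, and it does not address the soft hierarchy or unassigned elements.

```python
from itertools import combinations

def jaccard_coefficient(cluster_labels, expert_labels):
    """Pairs co-assigned by both labelings, over pairs co-assigned by at least one."""
    both = either = 0
    for i, j in combinations(range(len(cluster_labels)), 2):
        same_cluster = cluster_labels[i] == cluster_labels[j]
        same_class = expert_labels[i] == expert_labels[j]
        both += same_cluster and same_class
        either += same_cluster or same_class
    return both / either if either else 0.0

# toy example with hypothetical labels
print(jaccard_coefficient([0, 0, 1, 1, 2], [0, 0, 1, 2, 2]))   # 0.333...
```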
The side-information approach [4] addresses settings in which two different types of features are distinguished explicitly: relevant versus irrelevant ones, denoted by Y+ and Y−. Both types of features are incorporated within a single functional to be minimized: L(IB-side-info) = I(C;X) − β ( I(C;Y+) − γ I(C;Y−) ), which directly drives the clustering to de-correlate C and Y−. Formally, our setting can be mapped to the side-information setting by regarding the pre-partition W simply as the additional set of irrelevant features, giving symmetric (and opposite) roles to W and Y. However, it seems that this view does not properly address the desired cross-partition setting. In our setting, it is assumed that clustering should be guided in general by Y, while W should only neutralize particular information within Y that would otherwise yield the undesired correlation between C and W (as described in Section 3.1). For that reason, the defocusing functional ties the three variables together by conditioning the de-correlation of C and W on Y, while its underlying assumption ensures the global de-correlation. Indeed, our method was found empirically superior on the cross-dataset task. The side-information IB method (the iterative algorithm with the best-scoring γ) achieves a correct-assignment proportion of 0.52 in both synthetic tasks, whereas our method scored 0.99 and 0.83 (see Table 1A); in the religion-pair keyword classification task, its Jaccard coefficient improved by 20% relative to plain IB (compared to our 100% improvement, see Table 1B).

6 Conclusions

This paper addressed the problem of clustering a pre-partitioned dataset, aiming to detect new internal structures that are not correlated with the pre-given partition but rather cut across its components. The proposed framework extends the cross-dataset clustering algorithm [8], providing better formal grounding and representing any pre-given (soft) partition of the dataset. Supported by empirical evidence, we suggest that our framework is better suited for the cross-partition task than applying the side-information framework [4], which was originally developed to address a somewhat different setting. We also demonstrate a substantial empirical advantage over the distance-based coupled-clustering algorithm [7].

As an applied real-world goal, the algorithm successfully detects cross-religion commonalities. This goal exemplifies the more general notion of detecting analogies across different systems, which is a somewhat vague and non-consensual task and therefore especially challenging for a computational framework. Our approach can be viewed as an initial step towards principled identification of "hidden" commonalities between substantially different real-world systems, while suppressing the vast majority of attributes that are irrelevant to the analogy. Further research may study the role of defocusing in supervised learning, where some pre-given partitions might mask the role of underlying discriminative features. Additionally, it would be interesting to explore relationships to other disciplines, e.g., network information theory ([9], Ch. 14), which provided motivation for the side-information approach. Finally, both frameworks (ours and side-information) suggest the importance of dealing wisely with information that should not dictate the clustering output directly.

Acknowledgments

We thank Yuval Krymolowski for helpful discussions and Tiina Mahlamäki, Eitan Reich and William Shepard for contributing the religion keyword classifications.
References
[1] Hofmann, T. (2001) Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1):177-196.
[2] Wagstaff, K., Cardie, C., Rogers, S. & Schroedl, S. (2001) Constrained K-means clustering with background knowledge. The 18th International Conference on Machine Learning (ICML-2001), pp. 577-584.
[3] Friedman, N., Mosenzon, O., Slonim, N. & Tishby, N. (2002) Multivariate information bottleneck. The 17th Conference on Uncertainty in Artificial Intelligence (UAI-17), pp. 152-161.
[4] Chechik, G. & Tishby, N. (2002) Extracting relevant structures with side information. Advances in Neural Information Processing Systems 15 (NIPS'02).
[5] Globerson, A., Chechik, G. & Tishby, N. (2003) Sufficient dimensionality reduction. Journal of Machine Learning Research, 3:1307-1331.
[6] Tishby, N., Pereira, F. C. & Bialek, W. (1999) The information bottleneck method. The 37th Annual Allerton Conference on Communication, Control, and Computing, pp. 368-379.
[7] Marx, Z., Dagan, I., Buhmann, J. M. & Shamir, E. (2002) Coupled clustering: A method for detecting structural correspondence. Journal of Machine Learning Research, 3:747-780.
[8] Dagan, I., Marx, Z. & Shamir, E. (2002) Cross-dataset clustering: Revealing corresponding themes across multiple corpora. Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pp. 15-21.
[9] Cover, T. M. & Thomas, J. A. (1991) Elements of Information Theory. John Wiley & Sons, Inc., New York.
6 0.09913335 46 nips-2003-Clustering with the Connectivity Kernel
7 0.098366171 1 nips-2003-1-norm Support Vector Machines
8 0.096542202 111 nips-2003-Learning the k in k-means
9 0.0899956 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data
10 0.089785717 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms
11 0.084379703 85 nips-2003-Human and Ideal Observers for Detecting Image Curves
12 0.083801933 138 nips-2003-Non-linear CCA and PCA by Alignment of Local Models
13 0.083712578 102 nips-2003-Large Scale Online Learning
14 0.082850926 177 nips-2003-Simplicial Mixtures of Markov Chains: Distributed Modelling of Dynamic User Profiles
15 0.081954561 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
16 0.080633096 24 nips-2003-An Iterative Improvement Procedure for Hierarchical Clustering
17 0.08027304 104 nips-2003-Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks
18 0.079828456 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms
19 0.079466678 141 nips-2003-Nonstationary Covariance Functions for Gaussian Process Regression
20 0.07721635 81 nips-2003-Geometric Analysis of Constrained Curves
simIndex simValue paperId paperTitle
same-paper 1 0.96684122 79 nips-2003-Gene Expression Clustering with Functional Mixture Models
Author: Darya Chudova, Christopher Hart, Eric Mjolsness, Padhraic Smyth
2 0.81821716 86 nips-2003-ICA-based Clustering of Genes from Microarray Expression Data
Author: Su-in Lee, Serafim Batzoglou
Abstract: We propose an unsupervised methodology using independent component analysis (ICA) to cluster genes from DNA microarray data. Based on an ICA mixture model of genomic expression patterns, linear and nonlinear ICA finds components that are specific to certain biological processes. Genes that exhibit significant up-regulation or down-regulation within each component are grouped into clusters. We test the statistical significance of enrichment of gene annotations within each cluster. ICA-based clustering outperformed other leading methods in constructing functionally coherent clusters on various datasets. This result supports our model of genomic expression data as composite effect of independent biological processes. Comparison of clustering performance among various ICA algorithms including a kernel-based nonlinear ICA algorithm shows that nonlinear ICA performed the best for small datasets and natural-gradient maximization-likelihood worked well for all the datasets. 1 In trod u ction Microarray technology has enabled genome-wide expression profiling, promising to provide insight into underlying biological mechanism involved in gene regulation. To aid such discoveries, mathematical tools that are versatile enough to capture the underlying biology and simple enough to be applied efficiently on large datasets are needed. Analysis tools based on novel data mining techniques have been proposed [1]-[6]. When applying mathematical models and tools to microarray analysis, clustering genes that have the similar biological properties is an important step for three reasons: reduction of data complexity, prediction of gene function, and evaluation of the analysis approach by measuring the statistical significance of biological coherence of gene clusters. Independent component analysis (ICA) linearly decomposes each of N vectors into M common component vectors (N≥M) so that each component is statistically as independent from the others as possible. One of the main applications of ICA is blind source separation (BSS) that aims to separate source signals from their mixtures. There have been a few attempts to apply ICA to the microarray expression data to extract meaningful signals each corresponding to independent biological process [5]-[6]. In this paper, we provide the first evidence that ICA is a superior mathematical model and clustering tool for microarray analysis, compared to the most widely used methods namely PCA and k-means clustering. We also introduce the application of nonlinear ICA to microarray analysis, and show that it outperforms linear ICA on some datasets. We apply ICA to microarray data to decompose the input data into statistically independent components. Then, genes are clustered in an unsupervised fashion into non-mutually exclusive clusters. Each independent component is assigned a putative biological meaning based on functional annotations of genes that are predominant within the component. We systematically evaluate the clustering performance of several ICA algorithms on four expression datasets and show that ICA-based clustering is superior to other leading methods that have been applied to analyze the same datasets. We also proposed a kernel based nonlinear ICA algorithm for dealing with more realistic mixture model. Among the different linear ICA algorithms including six linear and one nonlinear ICA algorithm, the natural-gradient maximum-likelihood estimation method (NMLE) [7]-[8] performs well in all the datasets. 
Kernel-based nonlinear ICA method worked better for three small datasets. 2 M a t h e m at i c a l m o de l o f g e n om e - wi d e e x p r e s s i on Several distinct biological processes take place simultaneously inside a cell; each biological process has its own expression program to up-regulate or down-regulate the level of expression of specific sets of genes. We model a genome-wide expression pattern in a given condition (measured by a microarray assay) as a mixture of signals generated by statistically independent biological processes with different activation levels. We design two kinds of models for genomic expression pattern: a linear and nonlinear mixture model. Suppose that a cell is governed by M independent biological processes S = (s1, …, sM)T, each of which is a vector of K gene expression levels, and that we measure the levels of expression of all genes in N conditions, resulting in a microarray expression matrix X = (x1,…,xN)T. The expression level at each different condition j can be expressed as linear combinations of the M biological processes: xj=aj1s1+…+ajMsM. We can express this idea concisely in matrix notation as follows. X = AS , x1 a11 L a1M s1 M = M M M x N a N 1 L a NM s M (1) More generally, we can express X = (x1,…,xN)T as a post-nonlinear mixture of the underlying independent processes as follows, where f(.) is a nonlinear mapping from N to N dimensional space. X = f ( AS ), a11 L a1M s1 x1 M = f M M M a xN N 1 L a NM s M (2) 3 I n d e p e n d e n t c o m po n e n t a n a l y s i s In the models described above, since we assume that the underlying biological processes are independent, we suggest that vectors S=(s1,…,sM) are statistically independent and so ICA can recover S from the observed microarray data X. For linear ICA, we apply natural-gradient maximum estimation (NMLE) method which was proposed in [7] and was made more efficient by using natural gradient method in [8]. We also apply nonlinear ICA using reproducible kernel Hilbert spaces (RKHS) based on [9], as follows: 1. We map the N dimensional input data xi to Ф(xi) in the feature space by using the kernel trick. The feature space is defined by the relationship Ф(xi)TФ(xj)=k(xi,, xj). That is, inner product of mapped data is determined to by a kernel function k(.,.) in the input space; we used a Gaussian radial basis function (RBF) kernel (k(x,y)=exp(-|x-y|2)) and a polynomial kernel of degree 2 (k(x,y)=(xTy+1)2). To perform mapping, we found orthonormal bases of the feature space by randomly sampling L input data v={v1,…,vL} 1000 times and choosing one set minimizing the condition number of Φv=(Φ(v1),…,Φ(vL)). Then, a set of orthonormal bases of the feature space is determined by the selected L images of input data in v as Ξ = Φv(Φv TΦv )-1/2. We map all input data x1,…,xK, each corresponding to a gene, to Ψ(x1),…,Ψ(xK) in the feature space with basis Ξ, as follows: k (v1 , v1 ) K k (v1 , v L ) M M k (v L , v1 ) L k (v L , v L ) Ψ(xi)=(ΦvTΦv )-1/2Φv TΦv (xi) = −1 / 2 k (v1 , xi ) ∈ ℜ L (1≤ i≤K) (3) M k ( v L , xi ) 2. We linearly decompose the mapped data Ψ=[Ψ(x1),.,Ψ(xK)] ∈RL×K into statistically independent components using NMLE. 4 Proposed approach The microarray dataset we are given is in matrix form where each element xij corresponds to the level of expression of the jth gene in the ith experimental condition. Missing values are imputed by KNNImpute [10], an algorithm based on k nearest neighbors that is widely used in microarray analysis. 
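To make the feature-space mapping of Eq. (3) concrete, here is a short NumPy/SciPy sketch (our own illustration): it uses a Gaussian RBF kernel with an arbitrarily chosen width, and it omits the condition-number-based selection of the basis points described above.

```python
import numpy as np
from scipy.linalg import sqrtm
from scipy.spatial.distance import cdist

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian RBF kernel k(x, y) = exp(-gamma * ||x - y||^2) between rows of A and B."""
    return np.exp(-gamma * cdist(A, B, 'sqeuclidean'))

def feature_space_map(X, V, gamma=1.0):
    """Map each row of X to the L-dimensional coordinates Psi(x) = (K_vv)^(-1/2) k(V, x),
    where V holds the L sampled basis points (a sketch of Eq. (3) above)."""
    K_vv = rbf_kernel(V, V, gamma)                 # (L, L) Gram matrix of the basis
    W = np.linalg.inv(sqrtm(K_vv)).real            # inverse matrix square root
    return (W @ rbf_kernel(V, X, gamma)).T         # one L-dimensional row per input point

# toy usage: 200 points in 10 dimensions, basis of L = 20 randomly sampled points
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
V = X[rng.choice(200, size=20, replace=False)]
print(feature_space_map(X, V).shape)               # (200, 20)
```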
Given the expression matrix X of N experiment by K genes, we perform the following steps. 1. Apply ICA to decompose X into independent components y1, …,yM as in Equations (1) and (2). Prior to applying ICA, remove any rows that make the expression matrix X singular. After ICA, each component denoted by yi is a vector comprising K loads gene expression levels, i.e., yi = (yi1, ...,yiK). We chose to let the number of components M to be maximized, which is equal the number of microarray experiments N because the maximum for N in our datasets was 250, which is smaller than the number of biological processes we hypothesize to act within a cell. 2. For each component, cluster genes according to their relative loads yij/mean(yi). Based on our ICA model, each component is a putative genomic expression program of an independent biological process. Thus, our hypothesis is that genes showing relatively high or low expression level within the component are the most important for the process. We create two clusters for each component: one cluster containing genes with expression level higher than a threshold, and one cluster containing genes with expression level lower than a threshold. Cluster i,1 = {gene j | y ij > mean( y i ) + c × std( y i )} Cluster i,2 = {gene j | y ij < mean( y i ) – c × std( y i )} (4) Here, mean(yi) is the average, std(yi) is the standard deviation of yi; and c is an adjustable coefficient. The value of the coefficient c was varied from 1.0 to 2.0 and the result for c=1.25 was presented in this paper. The results for other values of c are similar, and are presented on the website www.stanford.edu/~silee/ICA/. 3. For each cluster, measure the enrichment of each cluster with genes of known functional annotations. Using the Gene Ontology (GO) [11] and KEGG [12] gene annotation databases, we calculate the p-value for each cluster with every gene annotation, which is the probability that the cluster contains the observed number of genes with the annotation by chance assuming the hypergeometric distribution (details in [4]). For each gene annotation, the minimum p-value that is smaller than 10-7 obtained from any cluster was collected. If no p-value smaller than 10-7 is found, we consider the gene annotation not to be detected by the approach. As a result, we can assign biological meaning to each cluster and the corresponding independent component and we can evaluate the clustering performance by comparing the collected minimum p-value for each gene annotation with that from other clustering approach. 5 P e r f o r m a n c e e v a l uat i o n We tested the ICA-based clustering to four expression datasets (D1—D4) described in Table 1. Table 1: The four datasets used in our analysis ARRAY TYPE DESCRIPTION # OF GENES (K) # OF EXPS (N) D1 Spotted 4579 22 D2 Oligonucl eotide Spotted Oligonucl eotide Budding yeast during cell cycle and CLB2/CLN3 overactive strain [13] Budding yeast during cell cycle [14] 6616 17 C. elegans in various conditions [3] Normal human tissue including 19 kinds of tissues [15] 17817 7070 553 59 D3 D4 For D1 and D4, we compared the biological coherence of ICA components with that of PCA applied in the same datasets in [1] and [2], respectively. For D2 and D3, we compared with k-means clustering and the topomap method, applied in the same datasets in [4] and [3], respectively. We applied nonlinear ICA to D1, D2 and D4. Dataset D3 is very large and makes the nonlinear algorithm unstable. 
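Before turning to the results, the following is a compact sketch of steps 1–3 above. It is our own illustration: scikit-learn's FastICA stands in for the NMLE algorithm actually used (an assumption for demonstration only), the annotation database is replaced by an arbitrary index set, and all names are hypothetical.

```python
import numpy as np
from scipy.stats import hypergeom
from sklearn.decomposition import FastICA

def ica_clusters(X, c=1.25, random_state=0):
    """Decompose the (experiments x genes) matrix X into independent components and,
    for each component, form an up- and a down-regulated gene cluster by thresholding
    the loads at mean +/- c * std, as in Eq. (4) above."""
    # sources: one independent expression pattern per component, over the K genes
    S = FastICA(random_state=random_state, max_iter=1000).fit_transform(X.T).T
    clusters = []
    for y in S:
        mu, sd = y.mean(), y.std()
        clusters.append(np.where(y > mu + c * sd)[0])   # up-regulated genes
        clusters.append(np.where(y < mu - c * sd)[0])   # down-regulated genes
    return clusters

def enrichment_pvalue(cluster, annotated, n_genes):
    """Hypergeometric p-value for seeing at least this many annotated genes in the cluster."""
    k = len(set(cluster) & set(annotated))
    return hypergeom.sf(k - 1, n_genes, len(annotated), len(cluster))

# toy usage with a random matrix standing in for real expression data
rng = np.random.default_rng(0)
X = rng.normal(size=(17, 500))                    # 17 experiments, 500 genes
clusters = ica_clusters(X)
print(len(clusters), enrichment_pvalue(clusters[0], annotated=range(50), n_genes=500))
```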
D1 was preprocessed to contain log-ratios xij=log2(Rij/Gij) between red and green intensities. In [1], principal components, referred to as eigenarrays, were hypothesized to be genomic expression programs of distinct biological processes. We compared the biological coherence of independent components with that of principal components found by [1]. Comparison was done in two ways: (1) For each component, we grouped genes within top x% of significant up-regulation and down-regulation (as measured by the load of the gene in the component) into two clusters with x adjusted from 5% to 45%. For each value of x, statistical significance was measured for clusters from independent components and compared with that from principal components based on the minimum p-value for each gene annotation, as described in Section 4. We made a scatter plot to compare the negative log of the collected best p-values for each gene annotation when x is fixed to be 15%, shown in Figure 1 (a) (2) Same as before, except we did not fix the value of x; instead, we collected the minimum p-value from each method for each GO and KEGG gene annotation category and compared the collected p-values (Figure 1 (b)). For both cases, in the majority of the gene annotation categories ICA produced significantly lower p-values than PCA did, especially for gene annotation for which both ICA and PCA showed high significance. Figure 1. Comparison of linear ICA (NMLE) to PCA on dataset D1 (a) when x is fixed to be 15%; (b) when x is not fixed. (c) Three independent components of dataset D4. Each gene is mapped to a point based on the value assigned to the gene in three independent components, which are enriched with liver- (red), Muscle- (orange) and vulva-specific (green) genes, respectively. The expression levels of genes in D4 were normalized across the 59 experiments, and the logarithms of the resulting values were taken. Experiments 57, 58, and 59 were removed because they made the expression matrix nearly singular. In [2], a clustering approach based on PCA and subsequent visual inspection was applied to an earlier version of this dataset, containing 50 of the 59 samples. After we performed ICA, the most significant independent components were enriched for liver-specific, muscle-specific and vulva-specific genes with p-value of 10-133, 10-124 and 100-117, respectively. In the ICA liver cluster, 198 genes were liver specific (out of a total of 244), as compared with the 23 liver-specific genes identified in [2] using PCA. The ICA muscle cluster of 235 genes contains 199 muscle specific genes compared to 19 muscle-specific genes identified in [2]. We generated a 3-dimensional scatter plot of the load expression levels of all genes annotated in [15] on these significant ICA components in Figure 1 (c). We can see that the liver-specific, muscle-specific and vulva-specific genes are strongly biased to lie on the x-, y-, and z- axis, respectively. We applied nonlinear ICA on this dataset and the first four most significant clusters from nonlinear ICA with Gaussian RBF kernel were muscle-specific, liver-specific, vulva-specific and brain-specific with p-value of 10-158, 10-127, 10-112 and 10-70, respectively, showing considerable improvement over the linear ICA clusters. For D2, variance-normalization was applied to the 3000 most variant genes as in [4]. The 17th experiment, which made the expression matrix close to singular, was removed. 
We measured the statistical significance of clusters as described in Section 4 and compared the smallest p-value of each gene annotation from our approach to that from k-means clustering applied to the same dataset [4]. We made a scatter plot for comparing the negative log of the smallest p-value (y-axis) from ICA clusters with that from k-means clustering (x-axis). The coefficient c is varied from 1.0 to 2.0 and the superiority of ICA-based clustering to k-means clustering does not change. In many practical settings, estimation of the best c is not needed; we can adjust c to get a desired size of the cluster unless our focus is to blindly find the size of clusters. Figure 2 (a) (b) (c) shows for c=1.25 a comparison of the performance of linear ICA (NMLE), nonlinear ICA with Gaussian RBF kernel (NICA gauss), and k-means clustering (k-means). For D3, first we removed experiments that contained more than 7000 missing values, because ICA does not perform properly when the dataset contains many missing values. The 250 remaining experiments were used, containing expression levels for 17817 genes preprocessed to be log-ratios xij=log2(Rij/Gij) between red and green intensities. We compared the biological coherence of clusters by our approach with that of topomap-based approach applied to the same dataset in [3]. The result when c=1.25 is plotted in the Figure 2 (d). We observe that the two methods perform very similarly, with most categories having roughly the same p-value in ICA and in the topomap clusters. The topomap clustering approach performs slightly better in a larger fraction of the categories. Still, we consider this performance a confirmation that ICA is a widely applicable method that requires minimal training: in this case the missing values and high diversity of the data make clustering especially challenging, while the topomap approach was specifically designed and manually trained for this dataset as described in [3]. Finally, we compared different ICA algorithms in terms of clustering performance. We tested six linear ICA methods: Natural Gradient Maximum Likelihood Estimation (NMLE) [7][8], Joint Approximate Diagonalization of Eigenmatrices [16], Fast Fixed Point ICA with three different measures of non-Gaussianity [17], and Extended Information Maximization (Infomax) [18]. We also tested two kernels for nonlinear ICA: Gaussian RBF kernel, and polynomial kernel (NICA ploy). For each dataset, we compared the biological coherence of clusters generated by each method. Among the six linear ICA algorithms, NMLE was the best in all datasets. Among both linear and nonlinear methods, the Gaussian kernel nonlinear ICA method was the best in Datasets D1, D2 and D4, the polynomial kernel nonlinear ICA method was best in Dataset D4, and NMLE was best in the large datasets (D3 and D4). In Figure 3, we compare the NMLE method with three other ICA methods for the dataset D2. Overall, the NMLE algorithm consistently performed well in all datasets. The nonlinear ICA algorithms performed best in the small datasets, but were unstable in the two largest datasets. More comparison results are demonstrated in the website www.stanford.edu/~silee/ICA/. Figure 2: Comparison of (a) linear ICA (NMLE) with k-means clustering, (b) nonlinear ICA with Gaussian RBF kernel to linear ICA (NMLE), and (c) nonlinear ICA with Gaussian RBF kernel to k-means clustering on the dataset D2. (d) Comparison of linear ICA (NMLE) to topomap-based approach on the dataset D3. 
Figure 3: Comparison of linear ICA (NMLE) to (a) Extended Infomax ICA algorithm, (b) Fast ICA with symmetric orthogonalization and tanh nonlinearity and (c) Nonlinear ICA with polynomial kernel of degree 2 on the Dataset (B). 6 D i s c u s s i on ICA is a powerful statistical method for separating mixed independent signals. We proposed applying ICA to decompose microarray data into independent gene expression patterns of underlying biological processes, and to group genes into clusters that are mutually non-exclusive with statistically significant functional coherence. Our clustering method outperformed several leading methods on a variety of datasets, with the added advantage that it requires setting only one parameter, namely the fraction c of standard deviations beyond which a gene is considered to be associated with a component’s cluster. We observed that performance was not very sensitive to that parameter, suggesting that ICA is robust enough to be used for clustering with little human intervention. The empirical performance of ICA in our tests supports the hypothesis that statistical independence is a good criterion for separating mixed biological signals in microarray data. The Extended Infomax ICA algorithm proposed in [18] can automatically determine whether the distribution of each source signal is super-Gaussian or sub-Gaussian. Interestingly, the application of Extended Infomax ICA to all the expression datasets uncovered no source signal with sub-Gaussian distribution. A likely explanation is that global gene expression profiles are mixtures of super-Gaussian sources rather than of sub-Gaussian sources. This finding is consistent with the following intuition: underlying biological processes are super-Gaussian, because they affect sharply the relevant genes, typically a small fraction of all genes, and leave the majority of genes relatively unaffected. Acknowledgments We thank Te-Won Lee for helpful feedback. We thank Relly Brandman, Chuong Do, and Yueyi Liu for edits to the manuscript. References [1] Alter O, Brown PO, Botstein D. Proc. Natl. Acad. Sci. USA 97(18):10101-10106, 2000. [2] Misra J, Schmitt W, et al. Genome Research 12:1112-1120, 2002. [3] Kim SK, Lund J, et al. Science 293:2087-2092, 2001. [4] Tavazoie S, Hughes JD, et al. Nature Genetics 22(3):281-285, 1999. [5] Hori G, Inoue M, et al. Proc. 3rd Int. Workshop on Independent Component Analysis and Blind Signal Separation, Helsinki, Finland, pp. 151-155, 2000. [6] Liebermeister W. Bioinformatics 18(1):51-60, 2002. [7] Bell AJ. and Sejnowski TJ. Neural Computation, 7:1129-1159, 1995. [8] Amari S, Cichocki A, et al. In Advances in Neural Information Processing Systems 8, pp. 757-763. Cambridge, MA: MIT Press, 1996. [9] Harmeling S, Ziehe A, et al. In Advances in Neural Information Processing Systems 8, pp. 757-763. Cambridge, MA: MIT Press, . [10] Troyanskaya O., Cantor M, et al. Bioinformatics 17:520-525, 2001. [11] The Gene Ontology Consortium. Genome Research 11:1425-1433, 2001. [12] Kanehisa M., Goto S. In Current Topics in Computational Molecular Biology, pp. 301–315. MIT-Press, Cambridge, MA, 2002. [13] Spellman PT, Sherlock G, et al. Mol. Biol. Cell 9:3273-3297, 1998. [14] Cho RJ, Campell MJ, et al. Molecular Cell 2:65-73, 1998. [15] Hsiao L, Dangond F, et al. Physiol. Genomics 7:97-104, 2001. [16] Cardoso JF, Neural Computation 11(1):157-192, 1999. [17] Hyvarinen A. IEEE Transactions on Neural Network 10(3):626–634, 1999. [18] Lee TW, Girolami M, et al. Neural Computation 11:417–441, 1999.
Author: Claudio Gentile
Abstract: New feature selection algorithms for linear threshold functions are described which combine backward elimination with an adaptive regularization method. This makes them particularly suitable to the classification of microarray expression data, where the goal is to obtain accurate rules depending on few genes only. Our algorithms are fast and easy to implement, since they center on an incremental (large margin) algorithm which allows us to avoid linear, quadratic or higher-order programming methods. We report on preliminary experiments with five known DNA microarray datasets. These experiments suggest that multiplicative large margin algorithms tend to outperform additive algorithms (such as SVM) on feature selection tasks. 1
4 0.55443543 73 nips-2003-Feature Selection in Clustering Problems
Author: Volker Roth, Tilman Lange
Abstract: A novel approach to combining clustering and feature selection is presented. It implements a wrapper strategy for feature selection, in the sense that the features are directly selected by optimizing the discriminative power of the used partitioning algorithm. On the technical side, we present an efficient optimization algorithm with guaranteed local convergence property. The only free parameter of this method is selected by a resampling-based stability analysis. Experiments with real-world datasets demonstrate that our method is able to infer both meaningful partitions and meaningful subsets of features. 1
5 0.45839214 111 nips-2003-Learning the k in k-means
Author: Greg Hamerly, Charles Elkan
Abstract: When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. In this paper we present an improved algorithm for learning k while clustering. The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian. Two key advantages are that the hypothesis test does not limit the covariance of the data and does not compute a full covariance matrix. Additionally, G-means only requires one intuitive parameter, the standard statistical significance level α. We present results from experiments showing that the algorithm works well, and better than a recent method based on the BIC penalty for model complexity. In these experiments, we show that the BIC is ineffective as a scoring function, since it does not penalize strongly enough the model’s complexity. 1 Introduction and related work Clustering algorithms are useful tools for data mining, compression, probability density estimation, and many other important tasks. However, most clustering algorithms require the user to specify the number of clusters (called k), and it is not always clear what is the best value for k. Figure 1 shows examples where k has been improperly chosen. Choosing k is often an ad hoc decision based on prior knowledge, assumptions, and practical experience. Choosing k is made more difficult when the data has many dimensions, even when clusters are well-separated. Center-based clustering algorithms (in particular k-means and Gaussian expectationmaximization) usually assume that each cluster adheres to a unimodal distribution, such as Gaussian. With these methods, only one center should be used to model each subset of data that follows a unimodal distribution. If multiple centers are used to describe data drawn from one mode, the centers are a needlessly complex description of the data, and in fact the multiple centers capture the truth about the subset less well than one center. In this paper we present a simple algorithm called G-means that discovers an appropriate k using a statistical test for deciding whether to split a k-means center into two centers. We describe examples and present experimental results that show that the new algorithm 0.9 4 0.8 3 0.7 2 0.6 1 0.5 0 0.4 −1 0.3 −2 −3 0.2 0.1 −0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 −4 −3 −2 −1 0 1 2 3 Figure 1: Two clusterings where k was improperly chosen. Dark crosses are k-means centers. On the left, there are too few centers; five should be used. On the right, too many centers are used; one center is sufficient for representing the data. In general, one center should be used to represent one Gaussian cluster. is successful. This technique is useful and applicable for many clustering algorithms other than k-means, but here we consider only the k-means algorithm for simplicity. Several algorithms have been proposed previously to determine k automatically. Like our method, most previous methods are wrappers around k-means or some other clustering algorithm for fixed k. Wrapper methods use splitting and/or merging rules for centers to increase or decrease k as the algorithm proceeds. Pelleg and Moore [14] proposed a regularization framework for learning k, which they call X-means. 
The algorithm searches over many values of k and scores each clustering model using the so-called Bayesian Information Criterion [10]: BIC(C|X) = L(X|C) − p log n 2 where L(X|C) is the log-likelihood of the dataset X according to model C, p = k(d + 1) is the number of parameters in the model C with dimensionality d and k cluster centers, and n is the number of points in the dataset. X-means chooses the model with the best BIC score on the data. Aside from the BIC, other scoring functions are also available. Bischof et al. [1] use a minimum description length (MDL) framework, where the description length is a measure of how well the data are fit by the model. Their algorithm starts with a large value for k and removes centers (reduces k) whenever that choice reduces the description length. Between steps of reducing k, they use the k-means algorithm to optimize the model fit to the data. With hierarchical clustering algorithms, other methods may be employed to determine the best number of clusters. One is to build a merging tree (“dendrogram”) of the data based on a cluster distance metric, and search for areas of the tree that are stable with respect to inter- and intra-cluster distances [9, Section 5.1]. This method of estimating k is best applied with domain-specific knowledge and human intuition. 2 The Gaussian-means (G-means) algorithm The G-means algorithm starts with a small number of k-means centers, and grows the number of centers. Each iteration of the algorithm splits into two those centers whose data appear not to come from a Gaussian distribution. Between each round of splitting, we run k-means on the entire dataset and all the centers to refine the current solution. We can initialize with just k = 1, or we can choose some larger value of k if we have some prior knowledge about the range of k. G-means repeatedly makes decisions based on a statistical test for the data assigned to each center. If the data currently assigned to a k-means center appear to be Gaussian, then we want to represent that data with only one center. However, if the same data do not appear Algorithm 1 G-means(X, α) 1: Let C be the initial set of centers (usually C ← {¯}). x 2: C ← kmeans(C, X). 3: Let {xi |class(xi ) = j} be the set of datapoints assigned to center cj . 4: Use a statistical test to detect if each {xi |class(xi ) = j} follow a Gaussian distribution (at confidence level α). 5: If the data look Gaussian, keep cj . Otherwise replace cj with two centers. 6: Repeat from step 2 until no more centers are added. to be Gaussian, then we want to use multiple centers to model the data properly. The algorithm will run k-means multiple times (up to k times when finding k centers), so the time complexity is at most O(k) times that of k-means. The k-means algorithm implicitly assumes that the datapoints in each cluster are spherically distributed around the center. Less restrictively, the Gaussian expectation-maximization algorithm assumes that the datapoints in each cluster have a multidimensional Gaussian distribution with a covariance matrix that may or may not be fixed, or shared. The Gaussian distribution test that we present below are valid for either covariance matrix assumption. The test also accounts for the number of datapoints n tested by incorporating n in the calculation of the critical value of the test (see Equation 2). This prevents the G-means algorithm from making bad decisions about clusters with few datapoints. 
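A minimal sketch of the per-center split decision may help fix ideas. It uses the corrected Anderson-Darling statistic defined in Section 2.1 below, with scikit-learn's KMeans standing in for the deterministic child-center initialization the paper describes; this is our own illustration, not the authors' code, and the critical value is the one quoted in Section 2.2 for alpha = 0.0001.

```python
import numpy as np
from scipy.stats import norm
from sklearn.cluster import KMeans

def anderson_darling_gaussian(x):
    """A^2* of Eqs. (1)-(2) below for a 1-d sample, with mean and variance estimated."""
    x = np.sort((x - x.mean()) / x.std())
    n = len(x)
    z = norm.cdf(x)
    i = np.arange(1, n + 1)
    a2 = -np.sum((2 * i - 1) * (np.log(z) + np.log(1 - z[::-1]))) / n - n
    return a2 * (1 + 4.0 / n - 25.0 / n ** 2)

def should_split(points, critical=1.8692, random_state=0):
    """Fit two child centers, project the points onto the line joining them,
    and test the 1-d projection for normality."""
    km = KMeans(n_clusters=2, n_init=10, random_state=random_state).fit(points)
    v = km.cluster_centers_[0] - km.cluster_centers_[1]
    projected = points @ v / (v @ v)
    return anderson_darling_gaussian(projected) > critical

# toy usage: one Gaussian blob (should not split) vs. two well-separated blobs (should split)
rng = np.random.default_rng(0)
one = rng.normal(size=(500, 2))
two = np.vstack([one, rng.normal(loc=6.0, size=(500, 2))])
print(should_split(one), should_split(two))   # expected: False, True
```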
2.1 Testing clusters for Gaussian fit To specify the G-means algorithm fully we need a test to detect whether the data assigned to a center are sampled from a Gaussian. The alternative hypotheses are • H0 : The data around the center are sampled from a Gaussian. • H1 : The data around the center are not sampled from a Gaussian. If we accept the null hypothesis H0 , then we believe that the one center is sufficient to model its data, and we should not split the cluster into two sub-clusters. If we reject H0 and accept H1 , then we want to split the cluster. The test we use is based on the Anderson-Darling statistic. This one-dimensional test has been shown empirically to be the most powerful normality test that is based on the empirical cumulative distribution function (ECDF). Given a list of values xi that have been converted to mean 0 and variance 1, let x(i) be the ith ordered value. Let zi = F (x(i) ), where F is the N (0, 1) cumulative distribution function. Then the statistic is A2 (Z) = − 1 n n (2i − 1) [log(zi ) + log(1 − zn+1−i )] − n (1) i=1 Stephens [17] showed that for the case where µ and σ are estimated from the data (as in clustering), we must correct the statistic according to A2 (Z) ∗ = A2 (Z)(1 + 4/n − 25/(n2 )) (2) Given a subset of data X in d dimensions that belongs to center c, the hypothesis test proceeds as follows: 1. Choose a significance level α for the test. 2. Initialize two centers, called “children” of c. See the text for good ways to do this. 3. Run k-means on these two centers in X. This can be run to completion, or to some early stopping point if desired. Let c1 , c2 be the child centers chosen by k-means. 4. Let v = c1 − c2 be a d-dimensional vector that connects the two centers. This is the direction that k-means believes to be important for clustering. Then project X onto v: xi = xi , v /||v||2 . X is a 1-dimensional representation of the data projected onto v. Transform X so that it has mean 0 and variance 1. 5. Let zi = F (x(i) ). If A2 (Z) is in the range of non-critical values at confidence ∗ level α, then accept H0 , keep the original center, and discard {c1 , c2 }. Otherwise, reject H0 and keep {c1 , c2 } in place of the original center. A primary contribution of this work is simplifying the test for Gaussian fit by projecting the data to one dimension where the test is simple to apply. The authors of [5] also use this approach for online dimensionality reduction during clustering. The one-dimensional representation of the data allows us to consider only the data along the direction that kmeans has found to be important for separating the data. This is related to the problem of projection pursuit [7], where here k-means searches for a direction in which the data appears non-Gaussian. We must choose the significance level of the test, α, which is the desired probability of making a Type I error (i.e. incorrectly rejecting H0 ). It is appropriate to use a Bonferroni adjustment to reduce the chance of making Type I errors over multiple tests. For example, if we want a 0.01 chance of making a Type I error in 100 tests, we should apply a Bonferroni adjustment to make each test use α = 0.01/100 = 0.0001. To find k final centers the G-means algorithm makes k statistical tests, so the Bonferroni correction does not need to be extreme. In our tests, we always use α = 0.0001. We consider two ways to initialize the two child centers. Both approaches initialize with c ± m, where c is a center and m is chosen. 
The first method chooses m as a random d-dimensional vector such that ||m|| is small compared to the distortion of the data. A second method finds the main principal component s of the data (having eigenvalue λ), and chooses m = s 2λ/π. This deterministic method places the two centers in their expected locations under H0 . The principal component calculations require O(nd2 + d3 ) time and O(d2 ) space, but since we only want the main principal component, we can use fast methods like the power method, which takes time that is at most linear in the ratio of the two largest eigenvalues [4]. In this paper we use principal-component-based splitting. 2.2 An example Figure 2 shows a run of the G-means algorithm on a synthetic dataset with two true clusters and 1000 points, using α = 0.0001. The critical value for the Anderson-Darling test is 1.8692 for this confidence level. Starting with one center, after one iteration of G-means, we have 2 centers and the A2 statistic is 38.103. This is much larger than the critical value, ∗ so we reject H0 and accept this split. On the next iteration, we split each new center and repeat the statistical test. The A2 values for the two splits are 0.386 and 0.496, both of ∗ which are well below the critical value. Therefore we accept H0 for both tests, and discard these splits. Thus G-means gives a final answer of k = 2. 2.3 Statistical power Figure 3 shows the power of the Anderson-Darling test, as compared to the BIC. Lower is better for both plots. We run 1000 tests for each data point plotted for both plots. In the left 14 14 14 13 13 13 12 12 12 11 11 11 10 10 10 9 9 9 8 8 8 7 7 7 6 6 6 5 5 4 4 0 2 4 6 8 10 12 5 4 0 2 4 6 8 10 12 0 2 4 6 8 10 12 Figure 2: An example of running G-means for three iterations on a 2-dimensional dataset with two true clusters and 1000 points. Starting with one center (left plot), G-means splits into two centers (middle). The test for normality is significant, so G-means rejects H0 and keeps the split. After splitting each center again (right), the test values are not significant, so G-means accepts H0 for both tests and does not accept these splits. The middle plot is the G-means answer. See the text for further details. 1 1 G-means X-means 0.8 P(Type II error) P(Type I error) 0.8 G-means X-means 0.6 0.4 0.2 0.6 0.4 0.2 0 0 0 30 60 90 120 150 number of datapoints 180 210 0 30 60 90 120 150 number of datapoints 180 210 Figure 3: A comparison of the power of the Anderson-Darling test versus the BIC. For the AD test we fix the significance level (α = 0.0001), while the BIC’s significance level depends on n. The left plot shows the probability of incorrectly splitting (Type I error) one true 2-d cluster that is 5% elliptical. The right plot shows the probability of incorrectly not splitting two true clusters separated by 5σ (Type II error). Both plots are functions of n. Both plots show that the BIC overfits (splits clusters) when n is small. plot, for each test we generate n datapoints from a single true Gaussian distribution, and then plot the frequency with which BIC and G-means will choose k = 2 rather than k = 1 (i.e. commit a Type I error). BIC tends to overfit by choosing too many centers when the data is not strictly spherical, while G-means does not. This is consistent with the tests of real-world data in the next section. While G-means commits more Type II errors when n is small, this prevents it from overfitting the data. The BIC can be considered a likelihood ratio test, but with a significance level that cannot be fixed. 
The significance level instead varies depending on n and ∆k (the change in the number of model parameters between two models). As n or ∆k decrease, the significance level increases (the BIC becomes weaker as a statistical test) [10]. Figure 3 shows this effect for varying n. In [11] the authors show that penalty-based methods require problemspecific tuning and don’t generalize as well as other methods, such as cross validation. 3 Experiments Table 1 shows the results from running G-means and X-means on many large synthetic. On synthetic datasets with spherically distributed clusters, G-means and X-means do equally Table 1: Results for many synthetic datasets. We report distortion relative to the optimum distortion for the correct clustering (closer to one is better), and time is reported relative to k-means run with the correct k. For BIC, larger values are better, but it is clear that finding the correct clustering does not always coincide with finding a larger BIC. Items with a star are where X-means always chose the largest number of centers we allowed. dataset synthetic k=5 synthetic k=20 synthetic k=80 synthetic k=5 synthetic k=20 synthetic k=80 synthetic k=5 synthetic k=20 synthetic k=80 d 2 k found 9.1± 9.9 18.1± 3.2 20.1± 0.6 70.5±11.6 80.0± 0.2 171.7±23.7 5.0± 0.0 *20.0± 0.0 20.0± 0.1 *80.0± 0.0 80.2± 0.5 229.2±36.8 5.0± 0.0 *20.0± 0.0 20.0± 0.0 *80.0± 0.0 80.0± 0.0 171.5±10.9 method G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means 2 2 8 8 8 32 32 32 BIC(×104 ) -0.19±2.70 0.70±0.93 0.21±0.18 14.83±3.50 1.84±0.12 40.16±6.59 -0.74±0.16 -2.28±0.20 -0.18±0.17 14.36±0.21 1.45±0.20 52.28±9.26 -3.36±0.21 -27.92±0.22 -2.73±0.22 -11.13±0.23 -1.10±0.16 11.78±2.74 distortion(× optimal) 0.89± 0.23 0.37± 0.12 0.99± 0.01 9.45±28.02 1.00± 0.01 48.49±70.04 1.00± 0.00 0.47± 0.03 0.99± 0.00 0.47± 0.01 0.99± 0.00 0.57± 0.06 1.00± 0.00 0.76± 0.00 1.00± 0.00 0.76± 0.01 1.00± 0.00 0.84± 0.01 7 7 6 6 5 5 4 4 3 3 2 2 1 time(× k-means) 13.2 2.8 2.1 1.2 2.2 1.8 4.6 11.0 2.6 4.0 2.9 6.5 4.4 29.9 2.3 21.2 2.8 53.3 1 0 0 2 4 6 8 10 12 0 0 2 4 6 8 10 12 Figure 4: 2-d synthetic dataset with 5 true clusters. On the left, G-means correctly chooses 5 centers and deals well with non-spherical data. On the right, the BIC causes X-means to overfit the data, choosing 20 unevenly distributed clusters. well at finding the correct k and maximizing the BIC statistic, so we don’t show these results here. Most real-world data is not spherical, however. The synthetic datasets used here each have 5000 datapoints in d = 2/8/32 dimensions. The true ks are 5, 20, and 80. For each synthetic dataset type, we generate 30 datasets with the true center means chosen uniformly randomly from the unit hypercube, and choosing σ so that no two clusters are closer than 3σ apart. Each cluster is also given a transformation to make it non-spherical, by multiplying the data by a randomly chosen scaling and rotation matrix. We run G-means starting with one center. We allow X-means to search between 2 and 4k centers (where here k is the true number of clusters). The G-means algorithm clearly does better at finding the correct k on non-spherical data. Its results are closer to the true distortions and the correct ks. The BIC statistic that X-means uses has been formulated to maximize the likelihood for spherically-distributed data. Thus it overestimates the number of true clusters in non-spherical data. 
This is especially evident when the number of points per cluster is small, as in datasets with 80 true clusters. 1 2 2 3 3 4 4 Digit 0 1 Digit 0 5 5 6 6 7 7 8 8 9 9 5 10 15 20 25 30 Cluster 10 20 30 40 50 60 Cluster Figure 5: NIST and Pendigits datasets: correspondence between each digit (row) and each cluster (column) found by G-means. G-means did not have the labels, yet it found meaningful clusters corresponding with the labels. Because of this overestimation, X-means often hits our limit of 4k centers. Figure 4 shows an example of overfitting on a dataset with 5 true clusters. X-means chooses k = 20 while G-means finds all 5 true cluster centers. Also of note is that X-means does not distribute centers evenly among clusters; some clusters receive one center, but others receive many. G-means runs faster than X-means for 8 and 32 dimensions, which we expect, since the kd-tree structures which make X-means fast in low dimensions take time exponential in d, making them slow for more than 8 to 12 dimensions. All our code is written in Matlab; X-means is written in C. 3.1 Discovering true clusters in labeled data We tested these algorithms on two real-world datasets for handwritten digit recognition: the NIST dataset [12] and the Pendigits dataset [2]. The goal is to cluster the data without knowledge of the labels and measure how well the clustering captures the true labels. Both datasets have 10 true classes (digits 0-9). NIST has 60000 training examples and 784 dimensions (28×28 pixels). We use 6000 randomly chosen examples and we reduce the dimension to 50 by random projection (following [3]). The Pendigits dataset has 7984 examples and 16 dimensions; we did not change the data in any way. We cluster each dataset with G-means and X-means, and measure performance by comparing the cluster labels Lc with the true labels Lt . We define the partition quality (PQ) as kt kc kt 2 2 pq = i=1 j=1 p(i, j) i=1 p(i) where kt is the true number of classes, and kc is the number of clusters found by the algorithm. This metric is maximized when Lc induces the same partition of the data as Lt ; in other words, when all points in each cluster have the same true label, and the estimated k is the true k. The p(i, j) term is the frequency-based probability that a datapoint will be labeled i by Lt and j by Lc . This quality is normalized by the sum of true probabilities, squared. This statistic is related to the Rand statistic for comparing partitions [8]. For the NIST dataset, G-means finds 31 clusters in 30 seconds with a PQ score of 0.177. X-means finds 715 clusters in 4149 seconds, and 369 of these clusters contain only one point, indicating an overestimation problem with the BIC. X-means receives a PQ score of 0.024. For the Pendigits dataset, G-means finds 69 clusters in 30 seconds, with a PQ score of 0.196; X-means finds 235 clusters in 287 seconds, with a PQ score of 0.057. Figure 5 shows Hinton diagrams of the G-means clusterings of both datasets, showing that G-means succeeds at identifying the true clusters concisely, without aid of the labels. The confusions between different digits in the NIST dataset (seen in the off-diagonal elements) are common for other researchers using more sophisticated techniques, see [3]. 4 Discussion and conclusions We have introduced the new G-means algorithm for learning k based on a statistical test for determining whether datapoints are a random sample from a Gaussian distribution with arbitrary dimension and covariance matrix. 
4 Discussion and conclusions

We have introduced the new G-means algorithm for learning k, based on a statistical test for determining whether datapoints are a random sample from a Gaussian distribution with arbitrary dimension and covariance matrix. The splitting step uses dimension reduction and a powerful test for Gaussian fitness. G-means uses this statistical test as a wrapper around k-means to discover the number of clusters automatically (a rough sketch of this wrapper loop is given at the end of this section). The only parameter supplied to the algorithm is the significance level of the statistical test, which can easily be set in a standard way. The G-means algorithm takes time and space linear in the number of datapoints and the dimension (plus the cost of the splitting heuristic and test), since k-means is itself linear in time and space. Empirically, the G-means algorithm works well at finding the correct number of clusters and the locations of genuine cluster centers, and we have shown that it works well in moderately high dimensions.

Clustering in high dimensions has been an open problem for many years. Recent research has shown that it may be preferable to use dimensionality reduction techniques before clustering, and then to use a low-dimensional clustering algorithm such as k-means, rather than clustering in the high dimension directly. In [3] the author shows that a simple, inexpensive linear projection preserves many properties of the data (such as cluster distances) while making it easier to find the clusters. Thus there is a need for good-quality, fast clustering algorithms for low-dimensional data, and our work is a step in this direction. Additionally, recent image segmentation algorithms such as normalized cut [16, 13] are based on eigenvector computations on distance matrices. These “spectral” clustering algorithms still use k-means as a post-processing step to find the actual segmentation, and they require k to be specified. Thus we expect G-means to be useful in combination with spectral clustering.
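To make the wrapper structure described above concrete, here is a rough Python sketch of one such split-and-test loop. It is not the authors' implementation: scikit-learn's k-means and SciPy's Anderson–Darling normality test stand in for the k-means runs and the EDF-based Gaussian fitness test (cf. [17]), and the initialization, restart, and stopping details are simplified assumptions.

```python
import numpy as np
from scipy.stats import anderson
from sklearn.cluster import KMeans

def looks_gaussian(X, alpha_index=2):
    """Test whether the points in X plausibly come from one Gaussian: split them
    into two with k-means, project onto the line joining the child centers, and
    run a 1-d normality test on the projection. Anderson-Darling (via SciPy) is
    a stand-in here; alpha_index=2 selects SciPy's 5% critical value."""
    if len(X) < 8:                      # too few points to test; accept the center
        return True
    km = KMeans(n_clusters=2, n_init=5).fit(X)
    c0, c1 = km.cluster_centers_
    v = c1 - c0
    if np.allclose(v, 0.0):
        return True
    proj = X @ v / (v @ v)              # 1-d projection onto the split direction
    proj = (proj - proj.mean()) / (proj.std() + 1e-12)
    result = anderson(proj, dist='norm')
    return result.statistic <= result.critical_values[alpha_index]

def g_means_sketch(X, k_max=50):
    """Rough sketch of the wrapper loop: keep increasing k while any cluster's
    points fail the Gaussian test. Stopping rules and restarts are simplified."""
    k = 1
    while k < k_max:
        km = KMeans(n_clusters=k, n_init=5).fit(X)
        needs_split = [not looks_gaussian(X[km.labels_ == j]) for j in range(k)]
        if not any(needs_split):
            return km
        k += sum(needs_split)
    return KMeans(n_clusters=k_max, n_init=5).fit(X)
```

A fuller implementation would presumably reuse the child centers found during each test to initialize the next k-means run; the sketch simply refits from scratch for clarity.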
References

[1] Horst Bischof, Aleš Leonardis, and Alexander Selb. MDL principle for robust vector quantisation. Pattern Analysis and Applications, 2:59–72, 1999.
[2] C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
[3] Sanjoy Dasgupta. Experiments with random projection. In Uncertainty in Artificial Intelligence: Proceedings of the Sixteenth Conference (UAI-2000), pages 143–151, San Francisco, CA, 2000. Morgan Kaufmann Publishers.
[4] Gianna M. Del Corso. Estimating an eigenvector by the power method with a random start. SIAM Journal on Matrix Analysis and Applications, 18(4):913–937, 1997.
[5] Chris Ding, Xiaofeng He, Hongyuan Zha, and Horst Simon. Adaptive dimension reduction for clustering high dimensional data. In Proceedings of the 2nd IEEE International Conference on Data Mining, 2002.
[6] Fredrik Farnstrom, James Lewis, and Charles Elkan. Scalability for clustering algorithms revisited. SIGKDD Explorations, 2(1):51–57, 2000.
[7] Peter J. Huber. Projection pursuit. Annals of Statistics, 13(2):435–475, June 1985.
[8] L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2:193–218, 1985.
[9] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264–323, 1999.
[10] Robert E. Kass and Larry Wasserman. A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90(431):928–934, 1995.
[11] Michael J. Kearns, Yishay Mansour, Andrew Y. Ng, and Dana Ron. An experimental and theoretical comparison of model selection methods. In Computational Learning Theory (COLT), pages 21–30, 1995.
[12] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[13] Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: analysis and an algorithm. In Advances in Neural Information Processing Systems 14, 2002.
[14] Dan Pelleg and Andrew Moore. X-means: extending k-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conference on Machine Learning, pages 727–734. Morgan Kaufmann, San Francisco, CA, 2000.
[15] Peter Sand and Andrew Moore. Repairing faulty mixture models using density estimation. In Proceedings of the 18th International Conference on Machine Learning. Morgan Kaufmann, San Francisco, CA, 2001.
[16] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
[17] M. A. Stephens. EDF statistics for goodness of fit and some comparisons. Journal of the American Statistical Association, 69(347):730–737, September 1974.
6 0.4387562 85 nips-2003-Human and Ideal Observers for Detecting Image Curves
7 0.41788286 83 nips-2003-Hierarchical Topic Models and the Nested Chinese Restaurant Process
8 0.3991119 1 nips-2003-1-norm Support Vector Machines
9 0.38031471 87 nips-2003-Identifying Structure across Pre-partitioned Data
10 0.37649214 131 nips-2003-Modeling User Rating Profiles For Collaborative Filtering
11 0.36147514 46 nips-2003-Clustering with the Connectivity Kernel
12 0.35122022 138 nips-2003-Non-linear CCA and PCA by Alignment of Local Models
13 0.34578419 82 nips-2003-Geometric Clustering Using the Information Bottleneck Method
14 0.330863 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data
15 0.32381773 24 nips-2003-An Iterative Improvement Procedure for Hierarchical Clustering
16 0.31885582 177 nips-2003-Simplicial Mixtures of Markov Chains: Distributed Modelling of Dynamic User Profiles
17 0.31883398 102 nips-2003-Large Scale Online Learning
18 0.31395695 169 nips-2003-Sample Propagation
19 0.31389213 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model
20 0.3114008 58 nips-2003-Efficient Multiscale Sampling from Products of Gaussian Mixtures
topicId topicWeight
[(0, 0.039), (11, 0.027), (29, 0.023), (30, 0.018), (35, 0.082), (53, 0.121), (63, 0.015), (69, 0.014), (71, 0.066), (76, 0.067), (85, 0.064), (90, 0.248), (91, 0.13)]
simIndex simValue paperId paperTitle
same-paper 1 0.84042978 79 nips-2003-Gene Expression Clustering with Functional Mixture Models
Author: Darya Chudova, Christopher Hart, Eric Mjolsness, Padhraic Smyth
Abstract: We propose a functional mixture model for simultaneous clustering and alignment of sets of curves measured on a discrete time grid. The model is specifically tailored to gene expression time course data. Each functional cluster center is a nonlinear combination of solutions of a simple linear differential equation that describes the change of individual mRNA levels when the synthesis and decay rates are constant. The mixture of continuous time parametric functional forms allows one to (a) account for the heterogeneity in the observed profiles, (b) align the profiles in time by estimating real-valued time shifts, (c) capture the synthesis and decay of mRNA in the course of an experiment, and (d) regularize noisy profiles by enforcing smoothness in the mean curves. We derive an EM algorithm for estimating the parameters of the model, and apply the proposed approach to the set of cycling genes in yeast. The experiments show consistent improvement in predictive power and within cluster variance compared to regular Gaussian mixtures. 1
2 0.78767568 93 nips-2003-Information Dynamics and Emergent Computation in Recurrent Circuits of Spiking Neurons
Author: Thomas Natschläger, Wolfgang Maass
Abstract: We employ an efficient method using Bayesian and linear classifiers for analyzing the dynamics of information in high-dimensional states of generic cortical microcircuit models. It is shown that such recurrent circuits of spiking neurons have an inherent capability to carry out rapid computations on complex spike patterns, merging information contained in the order of spike arrival with previously acquired context information. 1
3 0.66165268 107 nips-2003-Learning Spectral Clustering
Author: Francis R. Bach, Michael I. Jordan
Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive a new cost function for spectral clustering based on a measure of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing this cost function with respect to the partition leads to a new spectral clustering algorithm. Minimizing with respect to the similarity matrix leads to an algorithm for learning the similarity matrix. We develop a tractable approximation of our cost function that is based on the power method of computing eigenvectors. 1
4 0.6588518 73 nips-2003-Feature Selection in Clustering Problems
Author: Volker Roth, Tilman Lange
Abstract: A novel approach to combining clustering and feature selection is presented. It implements a wrapper strategy for feature selection, in the sense that the features are directly selected by optimizing the discriminative power of the used partitioning algorithm. On the technical side, we present an efficient optimization algorithm with guaranteed local convergence property. The only free parameter of this method is selected by a resampling-based stability analysis. Experiments with real-world datasets demonstrate that our method is able to infer both meaningful partitions and meaningful subsets of features. 1
5 0.65625489 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms
Author: Claudio Gentile
Abstract: New feature selection algorithms for linear threshold functions are described which combine backward elimination with an adaptive regularization method. This makes them particularly suitable to the classification of microarray expression data, where the goal is to obtain accurate rules depending on few genes only. Our algorithms are fast and easy to implement, since they center on an incremental (large margin) algorithm which allows us to avoid linear, quadratic or higher-order programming methods. We report on preliminary experiments with five known DNA microarray datasets. These experiments suggest that multiplicative large margin algorithms tend to outperform additive algorithms (such as SVM) on feature selection tasks. 1
6 0.65343004 78 nips-2003-Gaussian Processes in Reinforcement Learning
7 0.65307373 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model
8 0.65136564 113 nips-2003-Learning with Local and Global Consistency
9 0.65124744 81 nips-2003-Geometric Analysis of Constrained Curves
10 0.65101385 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
11 0.65028864 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
12 0.64748996 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
13 0.64739454 126 nips-2003-Measure Based Regularization
14 0.6473031 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data
15 0.64669681 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems
16 0.64573252 30 nips-2003-Approximability of Probability Distributions
17 0.64526182 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning
18 0.64298099 112 nips-2003-Learning to Find Pre-Images
19 0.64243418 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions
20 0.64092922 143 nips-2003-On the Dynamics of Boosting