Title: ICA-based Clustering of Genes from Microarray Expression Data
Author: Su-in Lee, Serafim Batzoglou
Abstract: We propose an unsupervised methodology using independent component analysis (ICA) to cluster genes from DNA microarray data. Based on an ICA mixture model of genomic expression patterns, linear and nonlinear ICA find components that are specific to certain biological processes. Genes that exhibit significant up-regulation or down-regulation within each component are grouped into clusters. We test the statistical significance of the enrichment of gene annotations within each cluster. ICA-based clustering outperformed other leading methods in constructing functionally coherent clusters on various datasets. This result supports our model of genomic expression data as the composite effect of independent biological processes. A comparison of clustering performance among various ICA algorithms, including a kernel-based nonlinear ICA algorithm, shows that nonlinear ICA performed best on small datasets and that natural-gradient maximum-likelihood estimation worked well on all the datasets.

1 Introduction

Microarray technology has enabled genome-wide expression profiling, promising to provide insight into the underlying biological mechanisms involved in gene regulation. To aid such discoveries, mathematical tools are needed that are versatile enough to capture the underlying biology and simple enough to be applied efficiently to large datasets. Analysis tools based on novel data-mining techniques have been proposed [1]-[6]. When applying mathematical models and tools to microarray analysis, clustering genes that have similar biological properties is an important step for three reasons: reduction of data complexity, prediction of gene function, and evaluation of the analysis approach by measuring the statistical significance of the biological coherence of gene clusters.

Independent component analysis (ICA) linearly decomposes each of N vectors into M common component vectors (N≥M) so that each component is statistically as independent from the others as possible. One of the main applications of ICA is blind source separation (BSS), which aims to separate source signals from their mixtures. There have been a few attempts to apply ICA to microarray expression data to extract meaningful signals, each corresponding to an independent biological process [5]-[6]. In this paper, we provide the first evidence that ICA is a superior mathematical model and clustering tool for microarray analysis, compared to the most widely used methods, namely PCA and k-means clustering. We also introduce the application of nonlinear ICA to microarray analysis, and show that it outperforms linear ICA on some datasets.

We apply ICA to microarray data to decompose the input data into statistically independent components. Genes are then clustered in an unsupervised fashion into non-mutually exclusive clusters. Each independent component is assigned a putative biological meaning based on the functional annotations of genes that are predominant within the component. We systematically evaluate the clustering performance of several ICA algorithms on four expression datasets and show that ICA-based clustering is superior to other leading methods that have been applied to analyze the same datasets. We also propose a kernel-based nonlinear ICA algorithm for dealing with a more realistic mixture model. Among the seven ICA algorithms tested, six linear and one nonlinear, the natural-gradient maximum-likelihood estimation method (NMLE) [7]-[8] performs well on all the datasets.
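To make the BSS setting concrete, the following sketch, assuming Python with NumPy and scikit-learn, recovers two synthetic super-Gaussian sources from four linear mixtures. FastICA, one of the fixed-point algorithms compared in Section 5, stands in here for NMLE, which scikit-learn does not provide; the data are synthetic, not from the paper.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)

# Two synthetic super-Gaussian "biological processes" over K = 1000 genes
K = 1000
S = rng.laplace(size=(2, K))      # sources (M x K), sparse and heavy-tailed
A = rng.normal(size=(4, 2))       # mixing matrix (N experiments x M processes)
X = A @ S                         # observed expression matrix (N x K), Eq. (1)

# Recover M = 2 statistically independent components from the N = 4 mixtures;
# FastICA expects samples (here: genes) as rows, so we transpose
ica = FastICA(n_components=2, random_state=0)
Y = ica.fit_transform(X.T).T      # estimated sources, up to scale and order
```

Up to permutation and sign/scale, the rows of Y match the rows of S; this ambiguity is inherent to ICA, and it is why the clustering step in Section 4 thresholds loads relative to each component's own mean and standard deviation rather than using absolute values.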
Kernel-based nonlinear ICA worked better on three small datasets.

2 Mathematical model of genome-wide expression

Several distinct biological processes take place simultaneously inside a cell; each biological process has its own expression program that up-regulates or down-regulates the level of expression of specific sets of genes. We model a genome-wide expression pattern in a given condition (measured by a microarray assay) as a mixture of signals generated by statistically independent biological processes with different activation levels. We design two kinds of models for genomic expression patterns: a linear and a nonlinear mixture model.

Suppose that a cell is governed by M independent biological processes S = (s1, …, sM)T, each of which is a vector of K gene expression levels, and that we measure the levels of expression of all genes in N conditions, resulting in a microarray expression matrix X = (x1,…,xN)T. The expression level at each condition j can be expressed as a linear combination of the M biological processes: xj = aj1 s1 + … + ajM sM. We can express this idea concisely in matrix notation as follows:

$$
X = AS, \qquad
\begin{pmatrix} x_1 \\ \vdots \\ x_N \end{pmatrix}
=
\begin{pmatrix} a_{11} & \cdots & a_{1M} \\ \vdots & & \vdots \\ a_{N1} & \cdots & a_{NM} \end{pmatrix}
\begin{pmatrix} s_1 \\ \vdots \\ s_M \end{pmatrix}
\tag{1}
$$

More generally, we can express X = (x1,…,xN)T as a post-nonlinear mixture of the underlying independent processes, where f(.) is a nonlinear mapping from N-dimensional to N-dimensional space:

$$
X = f(AS), \qquad
\begin{pmatrix} x_1 \\ \vdots \\ x_N \end{pmatrix}
=
f\!\left(
\begin{pmatrix} a_{11} & \cdots & a_{1M} \\ \vdots & & \vdots \\ a_{N1} & \cdots & a_{NM} \end{pmatrix}
\begin{pmatrix} s_1 \\ \vdots \\ s_M \end{pmatrix}
\right)
\tag{2}
$$

3 Independent component analysis

In the models described above, since we assume that the underlying biological processes are independent, we suggest that the vectors S = (s1,…,sM) are statistically independent, and so ICA can recover S from the observed microarray data X. For linear ICA, we apply the natural-gradient maximum-likelihood estimation (NMLE) method, which was proposed in [7] and made more efficient by the natural-gradient method in [8]. We also apply nonlinear ICA using reproducing kernel Hilbert spaces (RKHS), based on [9], as follows:

1. We map the N-dimensional input data xi to Ф(xi) in the feature space by using the kernel trick. The feature space is defined by the relationship Ф(xi)TФ(xj) = k(xi, xj); that is, the inner product of mapped data is determined by a kernel function k(.,.) in the input space. We used a Gaussian radial basis function (RBF) kernel, k(x,y) = exp(−|x−y|2), and a polynomial kernel of degree 2, k(x,y) = (xTy+1)2. To perform the mapping, we found orthonormal bases of the feature space by randomly sampling L input data points v = {v1,…,vL} 1000 times and choosing the set minimizing the condition number of Φv = (Φ(v1),…,Φ(vL)). A set of orthonormal bases of the feature space is then determined by the selected L images of the input data in v as Ξ = Φv(ΦvTΦv)−1/2. We map all input data x1,…,xK, each corresponding to a gene, to Ψ(x1),…,Ψ(xK) in the feature space with basis Ξ, as follows:

$$
\Psi(x_i) = (\Phi_v^T \Phi_v)^{-1/2}\, \Phi_v^T \Phi(x_i)
=
\begin{pmatrix} k(v_1,v_1) & \cdots & k(v_1,v_L) \\ \vdots & & \vdots \\ k(v_L,v_1) & \cdots & k(v_L,v_L) \end{pmatrix}^{-1/2}
\begin{pmatrix} k(v_1,x_i) \\ \vdots \\ k(v_L,x_i) \end{pmatrix}
\in \mathbb{R}^L \quad (1 \le i \le K)
\tag{3}
$$

2. We linearly decompose the mapped data Ψ = [Ψ(x1), …, Ψ(xK)] ∈ ℝL×K into statistically independent components using NMLE.

4 Proposed approach

The microarray dataset we are given is in matrix form, where each element xij corresponds to the level of expression of the jth gene in the ith experimental condition. Missing values are imputed by KNNImpute [10], an algorithm based on k nearest neighbors that is widely used in microarray analysis.
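The mapping of Equation (3) can be sketched as follows. This is a minimal sketch, assuming Python with NumPy and the unit-width Gaussian RBF kernel above; it uses the condition number of Kvv = ΦvTΦv as a stand-in for that of Φv (the two are minimized together, since cond(Kvv) = cond(Φv)^2), and the function names are illustrative, not from the paper.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # k(x, y) = exp(-gamma * |x - y|^2); the paper's kernel has unit width
    d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-gamma * d2)

def kernel_feature_map(X, L=50, n_trials=1000, seed=0):
    """Map rows of X (K genes x N conditions) into an L-dim feature space
    spanned by orthonormalized images of L sampled points, as in Eq. (3)."""
    rng = np.random.default_rng(seed)
    K = X.shape[0]
    best_idx, best_cond = None, np.inf
    for _ in range(n_trials):         # pick the sample minimizing cond(K_vv)
        idx = rng.choice(K, size=L, replace=False)
        c = np.linalg.cond(rbf_kernel(X[idx], X[idx]))
        if c < best_cond:
            best_cond, best_idx = c, idx
    V = X[best_idx]
    Kvv = rbf_kernel(V, V)
    w, U = np.linalg.eigh(Kvv)        # K_vv^{-1/2} via eigendecomposition
    inv_sqrt = U @ np.diag(1.0 / np.sqrt(np.clip(w, 1e-12, None))) @ U.T
    return inv_sqrt @ rbf_kernel(V, X)  # L x K matrix [Psi(x_1) ... Psi(x_K)]
```

The returned L×K matrix plays the role of Ψ, which is then decomposed by a linear ICA algorithm as in step 2.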
Given the expression matrix X of N experiments by K genes, we perform the following steps.

1. Apply ICA to decompose X into independent components y1, …, yM as in Equations (1) and (2). Prior to applying ICA, remove any rows that make the expression matrix X singular. After ICA, each component yi is a vector of K loads (gene expression levels), yi = (yi1, ..., yiK). We chose the number of components M to be maximal, i.e., equal to the number of microarray experiments N, because the maximum N in our datasets is 250, which is smaller than the number of biological processes we hypothesize to act within a cell.

2. For each component, cluster genes according to their relative loads yij/mean(yi). Based on our ICA model, each component is a putative genomic expression program of an independent biological process. Thus, our hypothesis is that genes showing relatively high or low expression levels within the component are the most important for the process. We create two clusters for each component: one containing genes with expression level higher than a threshold, and one containing genes with expression level lower than a threshold:

$$
\text{Cluster}_{i,1} = \{\, \text{gene } j \mid y_{ij} > \text{mean}(y_i) + c \times \text{std}(y_i) \,\}, \qquad
\text{Cluster}_{i,2} = \{\, \text{gene } j \mid y_{ij} < \text{mean}(y_i) - c \times \text{std}(y_i) \,\}
\tag{4}
$$

Here, mean(yi) is the average and std(yi) the standard deviation of yi, and c is an adjustable coefficient. The value of c was varied from 1.0 to 2.0, and the results for c = 1.25 are presented in this paper. The results for other values of c are similar, and are presented on the website www.stanford.edu/~silee/ICA/.

3. For each cluster, measure its enrichment with genes of known functional annotations. Using the Gene Ontology (GO) [11] and KEGG [12] gene annotation databases, we calculate a p-value for each cluster with every gene annotation: the probability that the cluster contains the observed number of genes with the annotation by chance, assuming the hypergeometric distribution (details in [4]). For each gene annotation, the minimum p-value smaller than 10^-7 obtained from any cluster was collected; if no p-value smaller than 10^-7 is found, we consider the gene annotation not detected by the approach. As a result, we can assign a biological meaning to each cluster and its corresponding independent component, and we can evaluate clustering performance by comparing the collected minimum p-value for each gene annotation with that from other clustering approaches.
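Steps 2 and 3 admit a compact sketch. This is a minimal version, assuming Python with NumPy and SciPy, where `Y` is the M×K matrix of component loads from step 1 and each annotation is given as a set of gene indices; the function names are illustrative, not from the paper.

```python
import numpy as np
from scipy.stats import hypergeom

def component_clusters(Y, c=1.25):
    """Two clusters per independent component (Eq. 4): genes whose load lies
    more than c standard deviations above or below the component mean."""
    clusters = []
    for y in Y:                                   # Y: M components x K loads
        m, s = y.mean(), y.std()
        clusters.append(np.where(y > m + c * s)[0])   # up-regulated genes
        clusters.append(np.where(y < m - c * s)[0])   # down-regulated genes
    return clusters

def enrichment_pvalue(cluster, annotated, K):
    """P(X >= overlap) under the hypergeometric null: draw |cluster| genes
    from K total, of which |annotated| carry the annotation (as in [4])."""
    overlap = len(set(cluster) & set(annotated))
    return hypergeom.sf(overlap - 1, K, len(annotated), len(cluster))
```

`hypergeom.sf(overlap - 1, ...)` gives the upper-tail probability P(X >= overlap), matching the enrichment test described above; an annotation would then be kept only if its best p-value across clusters falls below 10^-7.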
5 Performance evaluation

We applied ICA-based clustering to the four expression datasets (D1-D4) described in Table 1.

Table 1: The four datasets used in our analysis

      Array type       Description                                                            # of genes (K)   # of exps (N)
D1    Spotted          Budding yeast during cell cycle and CLB2/CLN3 overactive strain [13]   4579             22
D2    Oligonucleotide  Budding yeast during cell cycle [14]                                   6616             17
D3    Spotted          C. elegans in various conditions [3]                                   17817            553
D4    Oligonucleotide  Normal human tissue including 19 kinds of tissues [15]                 7070             59

For D1 and D4, we compared the biological coherence of ICA components with that of PCA applied to the same datasets in [1] and [2], respectively. For D2 and D3, we compared with k-means clustering and the topomap method, applied to the same datasets in [4] and [3], respectively. We applied nonlinear ICA to D1, D2 and D4; dataset D3 is very large and makes the nonlinear algorithm unstable.

D1 was preprocessed to contain log-ratios xij = log2(Rij/Gij) between red and green intensities. In [1], principal components, referred to as eigenarrays, were hypothesized to be genomic expression programs of distinct biological processes. We compared the biological coherence of independent components with that of the principal components found in [1]. The comparison was done in two ways. (1) For each component, we grouped the genes within the top x% of significant up-regulation and down-regulation (as measured by the load of the gene in the component) into two clusters, with x adjusted from 5% to 45%. For each value of x, statistical significance was measured for clusters from independent components and compared with that from principal components based on the minimum p-value for each gene annotation, as described in Section 4. We made a scatter plot comparing the negative log of the collected best p-values for each gene annotation when x is fixed at 15%, shown in Figure 1(a). (2) Same as before, except that we did not fix the value of x; instead, we collected the minimum p-value from each method for each GO and KEGG gene annotation category and compared the collected p-values (Figure 1(b)). In both cases, in the majority of the gene annotation categories ICA produced significantly lower p-values than PCA did, especially for gene annotations for which both ICA and PCA showed high significance.

Figure 1: Comparison of linear ICA (NMLE) to PCA on dataset D1 (a) when x is fixed at 15%; (b) when x is not fixed. (c) Three independent components of dataset D4: each gene is mapped to a point based on the value assigned to the gene in three independent components, which are enriched with liver-specific (red), muscle-specific (orange) and vulva-specific (green) genes, respectively.

The expression levels of genes in D4 were normalized across the 59 experiments, and the logarithms of the resulting values were taken. Experiments 57, 58, and 59 were removed because they made the expression matrix nearly singular. In [2], a clustering approach based on PCA and subsequent visual inspection was applied to an earlier version of this dataset, containing 50 of the 59 samples. After we performed ICA, the most significant independent components were enriched for liver-specific, muscle-specific and vulva-specific genes, with p-values of 10^-133, 10^-124 and 10^-117, respectively. In the ICA liver cluster, 198 genes were liver-specific (out of a total of 244), compared with the 23 liver-specific genes identified in [2] using PCA. The ICA muscle cluster of 235 genes contains 199 muscle-specific genes, compared to the 19 muscle-specific genes identified in [2]. We generated a 3-dimensional scatter plot of the loads of all genes annotated in [15] on these significant ICA components (Figure 1(c)); the liver-specific, muscle-specific and vulva-specific genes are strongly biased to lie on the x-, y-, and z-axes, respectively. We also applied nonlinear ICA to this dataset; the four most significant clusters from nonlinear ICA with the Gaussian RBF kernel were muscle-specific, liver-specific, vulva-specific and brain-specific, with p-values of 10^-158, 10^-127, 10^-112 and 10^-70, respectively, showing considerable improvement over the linear ICA clusters.

For D2, variance-normalization was applied to the 3000 most variant genes, as in [4]. The 17th experiment, which made the expression matrix close to singular, was removed.
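A minimal sketch of this kind of preprocessing follows, assuming Python with NumPy and scikit-learn. KNNImputer stands in for KNNImpute [10], and the neighbor count, condition-number threshold, and greedy row-removal rule are illustrative assumptions; the paper states only which experiments were removed, not the rule used to find them.

```python
import numpy as np
from sklearn.impute import KNNImputer

def log_ratio_matrix(R, G, n_neighbors=10):
    """Build x_ij = log2(R_ij / G_ij) for spotted arrays (D1, D3); impute
    missing entries from similar genes (stand-in for KNNImpute [10])."""
    with np.errstate(divide="ignore", invalid="ignore"):
        X = np.log2(R / G)
    X[~np.isfinite(X)] = np.nan
    # transpose so each gene is a sample, imputed from its nearest genes
    return KNNImputer(n_neighbors=n_neighbors).fit_transform(X.T).T

def drop_near_singular_rows(X, cond_thresh=1e8):
    """Greedily remove experiments (rows) while X is nearly singular;
    the threshold and greedy rule are assumptions, not from the paper."""
    while np.linalg.cond(X) > cond_thresh and X.shape[0] > 2:
        conds = [np.linalg.cond(np.delete(X, i, axis=0))
                 for i in range(X.shape[0])]
        X = np.delete(X, int(np.argmin(conds)), axis=0)
    return X
```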
We measured the statistical significance of clusters as described in Section 4 and compared the smallest p-value of each gene annotation from our approach to that from k-means clustering applied to the same dataset [4]. We made a scatter plot comparing the negative log of the smallest p-value from ICA clusters (y-axis) with that from k-means clustering (x-axis). The coefficient c was varied from 1.0 to 2.0, and the superiority of ICA-based clustering over k-means clustering does not change. In many practical settings, estimation of the best c is not needed: unless the goal is to determine cluster sizes blindly, we can simply adjust c to obtain a desired cluster size. Figure 2(a)-(c) shows, for c = 1.25, a comparison of the performance of linear ICA (NMLE), nonlinear ICA with the Gaussian RBF kernel (NICA gauss), and k-means clustering.

For D3, we first removed the experiments that contained more than 7000 missing values, because ICA does not perform properly when the dataset contains many missing values. The 250 remaining experiments were used, containing expression levels for 17817 genes preprocessed to log-ratios xij = log2(Rij/Gij) between red and green intensities. We compared the biological coherence of the clusters from our approach with that of the topomap-based approach applied to the same dataset in [3]. The result for c = 1.25 is plotted in Figure 2(d). The two methods perform very similarly, with most categories having roughly the same p-value in the ICA and topomap clusters; the topomap clustering approach performs slightly better in a larger fraction of the categories. Still, we consider this performance a confirmation that ICA is a widely applicable method that requires minimal training: in this case the missing values and high diversity of the data make clustering especially challenging, while the topomap approach was specifically designed and manually trained for this dataset, as described in [3].

Finally, we compared different ICA algorithms in terms of clustering performance. We tested six linear ICA methods (counting the three fast fixed-point variants separately): natural-gradient maximum-likelihood estimation (NMLE) [7][8], Joint Approximate Diagonalization of Eigenmatrices [16], fast fixed-point ICA with three different measures of non-Gaussianity [17], and Extended Information Maximization (Infomax) [18]. We also tested two kernels for nonlinear ICA: the Gaussian RBF kernel (NICA gauss) and the polynomial kernel (NICA poly). For each dataset, we compared the biological coherence of the clusters generated by each method. Among the six linear ICA algorithms, NMLE was the best on all datasets. Among both linear and nonlinear methods, the Gaussian-kernel nonlinear ICA method was best on datasets D1, D2 and D4, the polynomial-kernel nonlinear ICA method was best on dataset D4, and NMLE was best on the large datasets (D3 and D4). In Figure 3, we compare the NMLE method with three other ICA methods on dataset D2. Overall, the NMLE algorithm consistently performed well on all datasets. The nonlinear ICA algorithms performed best on the small datasets, but were unstable on the two largest datasets. More comparison results are presented on the website www.stanford.edu/~silee/ICA/.

Figure 2: Comparison of (a) linear ICA (NMLE) with k-means clustering, (b) nonlinear ICA with Gaussian RBF kernel with linear ICA (NMLE), and (c) nonlinear ICA with Gaussian RBF kernel with k-means clustering, on dataset D2. (d) Comparison of linear ICA (NMLE) with the topomap-based approach on dataset D3.
Figure 3: Comparison of linear ICA (NMLE) to (a) the Extended Infomax ICA algorithm, (b) FastICA with symmetric orthogonalization and tanh nonlinearity, and (c) nonlinear ICA with a polynomial kernel of degree 2, on dataset D2.

6 Discussion

ICA is a powerful statistical method for separating mixed independent signals. We proposed applying ICA to decompose microarray data into independent gene expression patterns of underlying biological processes, and to group genes into clusters that are mutually non-exclusive, with statistically significant functional coherence. Our clustering method outperformed several leading methods on a variety of datasets, with the added advantage that it requires setting only one parameter, namely the fraction c of standard deviations beyond which a gene is considered to be associated with a component's cluster. We observed that performance was not very sensitive to that parameter, suggesting that ICA is robust enough to be used for clustering with little human intervention. The empirical performance of ICA in our tests supports the hypothesis that statistical independence is a good criterion for separating mixed biological signals in microarray data.

The Extended Infomax ICA algorithm proposed in [18] can automatically determine whether the distribution of each source signal is super-Gaussian or sub-Gaussian. Interestingly, applying Extended Infomax ICA to all the expression datasets uncovered no source signal with a sub-Gaussian distribution. A likely explanation is that global gene expression profiles are mixtures of super-Gaussian sources rather than of sub-Gaussian sources. This finding is consistent with the following intuition: underlying biological processes are super-Gaussian because they sharply affect the relevant genes, typically a small fraction of all genes, and leave the majority of genes relatively unaffected.

Acknowledgments

We thank Te-Won Lee for helpful feedback. We thank Relly Brandman, Chuong Do, and Yueyi Liu for edits to the manuscript.

References

[1] Alter O, Brown PO, Botstein D. Proc. Natl. Acad. Sci. USA 97(18):10101-10106, 2000.
[2] Misra J, Schmitt W, et al. Genome Research 12:1112-1120, 2002.
[3] Kim SK, Lund J, et al. Science 293:2087-2092, 2001.
[4] Tavazoie S, Hughes JD, et al. Nature Genetics 22(3):281-285, 1999.
[5] Hori G, Inoue M, et al. Proc. 3rd Int. Workshop on Independent Component Analysis and Blind Signal Separation, Helsinki, Finland, pp. 151-155, 2000.
[6] Liebermeister W. Bioinformatics 18(1):51-60, 2002.
[7] Bell AJ, Sejnowski TJ. Neural Computation 7:1129-1159, 1995.
[8] Amari S, Cichocki A, et al. In Advances in Neural Information Processing Systems 8, pp. 757-763. Cambridge, MA: MIT Press, 1996.
[9] Harmeling S, Ziehe A, et al. In Advances in Neural Information Processing Systems 14. Cambridge, MA: MIT Press, 2002.
[10] Troyanskaya O, Cantor M, et al. Bioinformatics 17:520-525, 2001.
[11] The Gene Ontology Consortium. Genome Research 11:1425-1433, 2001.
[12] Kanehisa M, Goto S. In Current Topics in Computational Molecular Biology, pp. 301-315. Cambridge, MA: MIT Press, 2002.
[13] Spellman PT, Sherlock G, et al. Mol. Biol. Cell 9:3273-3297, 1998.
[14] Cho RJ, Campbell MJ, et al. Molecular Cell 2:65-73, 1998.
[15] Hsiao L, Dangond F, et al. Physiol. Genomics 7:97-104, 2001.
[16] Cardoso JF. Neural Computation 11(1):157-192, 1999.
[17] Hyvarinen A. IEEE Transactions on Neural Networks 10(3):626-634, 1999.
[18] Lee TW, Girolami M, et al. Neural Computation 11:417-441, 1999.
Reference: text
sentIndex sentText sentNum sentScore
1 edu * Abstract We propose an unsupervised methodology using independent component analysis (ICA) to cluster genes from DNA microarray data. [sent-4, score-0.738]
2 Based on an ICA mixture model of genomic expression patterns, linear and nonlinear ICA finds components that are specific to certain biological processes. [sent-5, score-0.565]
3 Genes that exhibit significant up-regulation or down-regulation within each component are grouped into clusters. [sent-6, score-0.137]
4 We test the statistical significance of enrichment of gene annotations within each cluster. [sent-7, score-0.417]
5 ICA-based clustering outperformed other leading methods in constructing functionally coherent clusters on various datasets. [sent-8, score-0.218]
6 This result supports our model of genomic expression data as composite effect of independent biological processes. [sent-9, score-0.411]
7 Comparison of clustering performance among various ICA algorithms including a kernel-based nonlinear ICA algorithm shows that nonlinear ICA performed the best for small datasets and natural-gradient maximization-likelihood worked well for all the datasets. [sent-10, score-0.413]
8 1 In trod u ction Microarray technology has enabled genome-wide expression profiling, promising to provide insight into underlying biological mechanism involved in gene regulation. [sent-11, score-0.579]
9 To aid such discoveries, mathematical tools that are versatile enough to capture the underlying biology and simple enough to be applied efficiently on large datasets are needed. [sent-12, score-0.159]
10 Independent component analysis (ICA) linearly decomposes each of N vectors into M common component vectors (N≥M) so that each component is statistically as independent from the others as possible. [sent-15, score-0.251]
11 One of the main applications of ICA is blind source separation (BSS) that aims to separate source signals from their mixtures. [sent-16, score-0.08]
12 There have been a few attempts to apply ICA to the microarray expression data to extract meaningful signals each corresponding to independent biological process [5]-[6]. [sent-17, score-0.539]
13 In this paper, we provide the first evidence that ICA is a superior mathematical model and clustering tool for microarray analysis, compared to the most widely used methods namely PCA and k-means clustering. [sent-18, score-0.357]
14 We also introduce the application of nonlinear ICA to microarray analysis, and show that it outperforms linear ICA on some datasets. [sent-19, score-0.312]
15 We apply ICA to microarray data to decompose the input data into statistically independent components. [sent-20, score-0.34]
16 Then, genes are clustered in an unsupervised fashion into non-mutually exclusive clusters. [sent-21, score-0.354]
17 Each independent component is assigned a putative biological meaning based on functional annotations of genes that are predominant within the component. [sent-22, score-0.635]
18 We systematically evaluate the clustering performance of several ICA algorithms on four expression datasets and show that ICA-based clustering is superior to other leading methods that have been applied to analyze the same datasets. [sent-23, score-0.497]
19 We also proposed a kernel based nonlinear ICA algorithm for dealing with more realistic mixture model. [sent-24, score-0.169]
20 Among the different linear ICA algorithms including six linear and one nonlinear ICA algorithm, the natural-gradient maximum-likelihood estimation method (NMLE) [7]-[8] performs well in all the datasets. [sent-25, score-0.156]
21 Kernel-based nonlinear ICA method worked better for three small datasets. [sent-26, score-0.118]
22 2 M a t h e m at i c a l m o de l o f g e n om e - wi d e e x p r e s s i on Several distinct biological processes take place simultaneously inside a cell; each biological process has its own expression program to up-regulate or down-regulate the level of expression of specific sets of genes. [sent-27, score-0.645]
23 We model a genome-wide expression pattern in a given condition (measured by a microarray assay) as a mixture of signals generated by statistically independent biological processes with different activation levels. [sent-28, score-0.657]
24 We design two kinds of models for genomic expression pattern: a linear and nonlinear mixture model. [sent-29, score-0.357]
25 Suppose that a cell is governed by M independent biological processes S = (s1, …, sM)T, each of which is a vector of K gene expression levels, and that we measure the levels of expression of all genes in N conditions, resulting in a microarray expression matrix X = (x1,…,xN)T. [sent-30, score-1.564]
26 The expression level at each different condition j can be expressed as linear combinations of the M biological processes: xj=aj1s1+…+ajMsM. [sent-31, score-0.311]
27 X = AS , x1 a11 L a1M s1 M = M M M x N a N 1 L a NM s M (1) More generally, we can express X = (x1,…,xN)T as a post-nonlinear mixture of the underlying independent processes as follows, where f(. [sent-33, score-0.142]
28 ) is a nonlinear mapping from N to N dimensional space. [sent-34, score-0.098]
29 We also apply nonlinear ICA using reproducible kernel Hilbert spaces (RKHS) based on [9], as follows: 1. [sent-37, score-0.143]
30 We map the N dimensional input data xi to Ф(xi) in the feature space by using the kernel trick. [sent-38, score-0.063]
31 That is, inner product of mapped data is determined to by a kernel function k(. [sent-40, score-0.068]
32 ) in the input space; we used a Gaussian radial basis function (RBF) kernel (k(x,y)=exp(-|x-y|2)) and a polynomial kernel of degree 2 (k(x,y)=(xTy+1)2). [sent-42, score-0.113]
33 To perform mapping, we found orthonormal bases of the feature space by randomly sampling L input data v={v1,…,vL} 1000 times and choosing one set minimizing the condition number of Φv=(Φ(v1),…,Φ(vL)). [sent-43, score-0.041]
34 4 Proposed approach The microarray dataset we are given is in matrix form where each element xij corresponds to the level of expression of the jth gene in the ith experimental condition. [sent-48, score-0.745]
35 Missing values are imputed by KNNImpute [10], an algorithm based on k nearest neighbors that is widely used in microarray analysis. [sent-49, score-0.212]
36 Given the expression matrix X of N experiment by K genes, we perform the following steps. [sent-50, score-0.145]
37 Apply ICA to decompose X into independent components y1, …,yM as in Equations (1) and (2). [sent-52, score-0.139]
38 Prior to applying ICA, remove any rows that make the expression matrix X singular. [sent-53, score-0.145]
39 After ICA, each component denoted by yi is a vector comprising K loads gene expression levels, i. [sent-54, score-0.542]
40 We chose to let the number of components M to be maximized, which is equal the number of microarray experiments N because the maximum for N in our datasets was 250, which is smaller than the number of biological processes we hypothesize to act within a cell. [sent-60, score-0.484]
41 For each component, cluster genes according to their relative loads yij/mean(yi). [sent-62, score-0.476]
42 Based on our ICA model, each component is a putative genomic expression program of an independent biological process. [sent-63, score-0.469]
43 Thus, our hypothesis is that genes showing relatively high or low expression level within the component are the most important for the process. [sent-64, score-0.569]
44 We create two clusters for each component: one cluster containing genes with expression level higher than a threshold, and one cluster containing genes with expression level lower than a threshold. [sent-65, score-1.308]
45 The value of the coefficient c was varied from 1. [sent-67, score-0.043]
46 For each cluster, measure the enrichment of each cluster with genes of known functional annotations. [sent-75, score-0.476]
47 For each gene annotation, the minimum p-value that is smaller than 10-7 obtained from any cluster was collected. [sent-77, score-0.373]
48 If no p-value smaller than 10-7 is found, we consider the gene annotation not to be detected by the approach. [sent-78, score-0.452]
49 As a result, we can assign biological meaning to each cluster and the corresponding independent component and we can evaluate the clustering performance by comparing the collected minimum p-value for each gene annotation with that from other clustering approach. [sent-79, score-1.035]
50 5 P e r f o r m a n c e e v a l uat i o n We tested the ICA-based clustering to four expression datasets (D1—D4) described in Table 1. [sent-80, score-0.342]
51 Table 1: The four datasets used in our analysis ARRAY TYPE DESCRIPTION # OF GENES (K) # OF EXPS (N) D1 Spotted 4579 22 D2 Oligonucl eotide Spotted Oligonucl eotide Budding yeast during cell cycle and CLB2/CLN3 overactive strain [13] Budding yeast during cell cycle [14] 6616 17 C. [sent-81, score-0.268]
52 elegans in various conditions [3] Normal human tissue including 19 kinds of tissues [15] 17817 7070 553 59 D3 D4 For D1 and D4, we compared the biological coherence of ICA components with that of PCA applied in the same datasets in [1] and [2], respectively. [sent-82, score-0.349]
53 For D2 and D3, we compared with k-means clustering and the topomap method, applied in the same datasets in [4] and [3], respectively. [sent-83, score-0.309]
54 Dataset D3 is very large and makes the nonlinear algorithm unstable. [sent-85, score-0.098]
55 D1 was preprocessed to contain log-ratios xij=log2(Rij/Gij) between red and green intensities. [sent-86, score-0.08]
56 In [1], principal components, referred to as eigenarrays, were hypothesized to be genomic expression programs of distinct biological processes. [sent-87, score-0.357]
57 We compared the biological coherence of independent components with that of principal components found by [1]. [sent-88, score-0.366]
58 Comparison was done in two ways: (1) For each component, we grouped genes within top x% of significant up-regulation and down-regulation (as measured by the load of the gene in the component) into two clusters with x adjusted from 5% to 45%. [sent-89, score-0.815]
59 For each value of x, statistical significance was measured for clusters from independent components and compared with that from principal components based on the minimum p-value for each gene annotation, as described in Section 4. [sent-90, score-0.596]
60 For both cases, in the majority of the gene annotation categories ICA produced significantly lower p-values than PCA did, especially for gene annotation for which both ICA and PCA showed high significance. [sent-92, score-0.921]
61 Comparison of linear ICA (NMLE) to PCA on dataset D1 (a) when x is fixed to be 15%; (b) when x is not fixed. [sent-94, score-0.109]
62 Each gene is mapped to a point based on the value assigned to the gene in three independent components, which are enriched with liver- (red), Muscle- (orange) and vulva-specific (green) genes, respectively. [sent-96, score-0.673]
63 The expression levels of genes in D4 were normalized across the 59 experiments, and the logarithms of the resulting values were taken. [sent-97, score-0.534]
64 Experiments 57, 58, and 59 were removed because they made the expression matrix nearly singular. [sent-98, score-0.145]
65 In [2], a clustering approach based on PCA and subsequent visual inspection was applied to an earlier version of this dataset, containing 50 of the 59 samples. [sent-99, score-0.156]
66 After we performed ICA, the most significant independent components were enriched for liver-specific, muscle-specific and vulva-specific genes with p-value of 10-133, 10-124 and 100-117, respectively. [sent-100, score-0.536]
67 In the ICA liver cluster, 198 genes were liver specific (out of a total of 244), as compared with the 23 liver-specific genes identified in [2] using PCA. [sent-101, score-0.864]
68 The ICA muscle cluster of 235 genes contains 199 muscle specific genes compared to 19 muscle-specific genes identified in [2]. [sent-102, score-1.283]
69 We generated a 3-dimensional scatter plot of the load expression levels of all genes annotated in [15] on these significant ICA components in Figure 1 (c). [sent-103, score-0.69]
70 We can see that the liver-specific, muscle-specific and vulva-specific genes are strongly biased to lie on the x-, y-, and z- axis, respectively. [sent-104, score-0.354]
71 For D2, variance-normalization was applied to the 3000 most variant genes as in [4]. [sent-106, score-0.372]
72 The 17th experiment, which made the expression matrix close to singular, was removed. [sent-107, score-0.145]
73 We measured the statistical significance of clusters as described in Section 4 and compared the smallest p-value of each gene annotation from our approach to that from k-means clustering applied to the same dataset [4]. [sent-108, score-0.813]
74 We made a scatter plot for comparing the negative log of the smallest p-value (y-axis) from ICA clusters with that from k-means clustering (x-axis). [sent-109, score-0.209]
75 0 and the superiority of ICA-based clustering to k-means clustering does not change. [sent-112, score-0.242]
76 In many practical settings, estimation of the best c is not needed; we can adjust c to get a desired size of the cluster unless our focus is to blindly find the size of clusters. [sent-113, score-0.087]
77 25 a comparison of the performance of linear ICA (NMLE), nonlinear ICA with Gaussian RBF kernel (NICA gauss), and k-means clustering (k-means). [sent-115, score-0.303]
78 For D3, first we removed experiments that contained more than 7000 missing values, because ICA does not perform properly when the dataset contains many missing values. [sent-116, score-0.118]
79 The 250 remaining experiments were used, containing expression levels for 17817 genes preprocessed to be log-ratios xij=log2(Rij/Gij) between red and green intensities. [sent-117, score-0.631]
80 We compared the biological coherence of clusters by our approach with that of topomap-based approach applied to the same dataset in [3]. [sent-118, score-0.358]
81 We observe that the two methods perform very similarly, with most categories having roughly the same p-value in ICA and in the topomap clusters. [sent-121, score-0.087]
82 The topomap clustering approach performs slightly better in a larger fraction of the categories. [sent-122, score-0.208]
83 Finally, we compared different ICA algorithms in terms of clustering performance. [sent-124, score-0.145]
84 We also tested two kernels for nonlinear ICA: Gaussian RBF kernel, and polynomial kernel (NICA ploy). [sent-126, score-0.166]
85 For each dataset, we compared the biological coherence of clusters generated by each method. [sent-127, score-0.274]
86 Among both linear and nonlinear methods, the Gaussian kernel nonlinear ICA method was the best in Datasets D1, D2 and D4, the polynomial kernel nonlinear ICA method was best in Dataset D4, and NMLE was best in the large datasets (D3 and D4). [sent-129, score-0.502]
87 In Figure 3, we compare the NMLE method with three other ICA methods for the dataset D2. [sent-130, score-0.066]
88 The nonlinear ICA algorithms performed best in the small datasets, but were unstable in the two largest datasets. [sent-132, score-0.098]
89 More comparison results are demonstrated in the website www. [sent-133, score-0.046]
90 Figure 2: Comparison of (a) linear ICA (NMLE) with k-means clustering, (b) nonlinear ICA with Gaussian RBF kernel to linear ICA (NMLE), and (c) nonlinear ICA with Gaussian RBF kernel to k-means clustering on the dataset D2. [sent-136, score-0.511]
91 (d) Comparison of linear ICA (NMLE) to topomap-based approach on the dataset D3. [sent-137, score-0.085]
92 Figure 3: Comparison of linear ICA (NMLE) to (a) Extended Infomax ICA algorithm, (b) Fast ICA with symmetric orthogonalization and tanh nonlinearity and (c) Nonlinear ICA with polynomial kernel of degree 2 on the Dataset (B). [sent-138, score-0.087]
93 6 D i s c u s s i on ICA is a powerful statistical method for separating mixed independent signals. [sent-139, score-0.07]
94 We proposed applying ICA to decompose microarray data into independent gene expression patterns of underlying biological processes, and to group genes into clusters that are mutually non-exclusive with statistically significant functional coherence. [sent-140, score-1.396]
95 Our clustering method outperformed several leading methods on a variety of datasets, with the added advantage that it requires setting only one parameter, namely the fraction c of standard deviations beyond which a gene is considered to be associated with a component’s cluster. [sent-141, score-0.459]
96 We observed that performance was not very sensitive to that parameter, suggesting that ICA is robust enough to be used for clustering with little human intervention. [sent-142, score-0.121]
97 The empirical performance of ICA in our tests supports the hypothesis that statistical independence is a good criterion for separating mixed biological signals in microarray data. [sent-143, score-0.378]
98 Interestingly, the application of Extended Infomax ICA to all the expression datasets uncovered no source signal with sub-Gaussian distribution. [sent-145, score-0.242]
99 A likely explanation is that global gene expression profiles are mixtures of super-Gaussian sources rather than of sub-Gaussian sources. [sent-146, score-0.431]
100 This finding is consistent with the following intuition: underlying biological processes are super-Gaussian, because they affect sharply the relevant genes, typically a small fraction of all genes, and leave the majority of genes relatively unaffected. [sent-147, score-0.562]
wordName wordTfidf (topN-words)
[('ica', 0.685), ('genes', 0.354), ('gene', 0.286), ('nmle', 0.228), ('microarray', 0.195), ('annotation', 0.166), ('expression', 0.145), ('biological', 0.127), ('clustering', 0.121), ('nonlinear', 0.098), ('cluster', 0.087), ('datasets', 0.076), ('significance', 0.07), ('topomap', 0.07), ('genomic', 0.069), ('dataset', 0.066), ('clusters', 0.062), ('significant', 0.061), ('coherence', 0.061), ('infomax', 0.061), ('independent', 0.052), ('component', 0.05), ('statistically', 0.049), ('kernel', 0.045), ('decompose', 0.044), ('rbf', 0.044), ('components', 0.043), ('processes', 0.043), ('pca', 0.039), ('specific', 0.038), ('cell', 0.037), ('levels', 0.035), ('budding', 0.035), ('enrichment', 0.035), ('eotide', 0.035), ('kegg', 0.035), ('liver', 0.035), ('loads', 0.035), ('nica', 0.035), ('oligonucl', 0.035), ('ontology', 0.035), ('serafim', 0.035), ('spotted', 0.035), ('std', 0.035), ('et', 0.034), ('xij', 0.033), ('green', 0.028), ('genome', 0.028), ('preprocessed', 0.028), ('mixture', 0.026), ('missing', 0.026), ('scatter', 0.026), ('yi', 0.026), ('grouped', 0.026), ('load', 0.026), ('annotations', 0.026), ('coefficient', 0.026), ('enriched', 0.026), ('putative', 0.026), ('website', 0.026), ('collected', 0.025), ('tools', 0.025), ('identified', 0.024), ('yeast', 0.024), ('muscle', 0.024), ('red', 0.024), ('compared', 0.024), ('fixed', 0.024), ('polynomial', 0.023), ('po', 0.023), ('mapped', 0.023), ('bases', 0.022), ('source', 0.021), ('underlying', 0.021), ('lee', 0.021), ('nm', 0.021), ('worked', 0.02), ('bioinformatics', 0.02), ('comparison', 0.02), ('six', 0.02), ('level', 0.02), ('signals', 0.02), ('linear', 0.019), ('biology', 0.019), ('outperformed', 0.019), ('orthonormal', 0.019), ('molecular', 0.018), ('blind', 0.018), ('applied', 0.018), ('xi', 0.018), ('supports', 0.018), ('xk', 0.018), ('separating', 0.018), ('containing', 0.017), ('fraction', 0.017), ('varied', 0.017), ('categories', 0.017), ('widely', 0.017), ('leading', 0.016), ('principal', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999934 86 nips-2003-ICA-based Clustering of Genes from Microarray Expression Data
Author: Su-in Lee, Serafim Batzoglou
Abstract: We propose an unsupervised methodology using independent component analysis (ICA) to cluster genes from DNA microarray data. Based on an ICA mixture model of genomic expression patterns, linear and nonlinear ICA finds components that are specific to certain biological processes. Genes that exhibit significant up-regulation or down-regulation within each component are grouped into clusters. We test the statistical significance of enrichment of gene annotations within each cluster. ICA-based clustering outperformed other leading methods in constructing functionally coherent clusters on various datasets. This result supports our model of genomic expression data as composite effect of independent biological processes. Comparison of clustering performance among various ICA algorithms including a kernel-based nonlinear ICA algorithm shows that nonlinear ICA performed the best for small datasets and natural-gradient maximization-likelihood worked well for all the datasets. 1 In trod u ction Microarray technology has enabled genome-wide expression profiling, promising to provide insight into underlying biological mechanism involved in gene regulation. To aid such discoveries, mathematical tools that are versatile enough to capture the underlying biology and simple enough to be applied efficiently on large datasets are needed. Analysis tools based on novel data mining techniques have been proposed [1]-[6]. When applying mathematical models and tools to microarray analysis, clustering genes that have the similar biological properties is an important step for three reasons: reduction of data complexity, prediction of gene function, and evaluation of the analysis approach by measuring the statistical significance of biological coherence of gene clusters. Independent component analysis (ICA) linearly decomposes each of N vectors into M common component vectors (N≥M) so that each component is statistically as independent from the others as possible. One of the main applications of ICA is blind source separation (BSS) that aims to separate source signals from their mixtures. There have been a few attempts to apply ICA to the microarray expression data to extract meaningful signals each corresponding to independent biological process [5]-[6]. In this paper, we provide the first evidence that ICA is a superior mathematical model and clustering tool for microarray analysis, compared to the most widely used methods namely PCA and k-means clustering. We also introduce the application of nonlinear ICA to microarray analysis, and show that it outperforms linear ICA on some datasets. We apply ICA to microarray data to decompose the input data into statistically independent components. Then, genes are clustered in an unsupervised fashion into non-mutually exclusive clusters. Each independent component is assigned a putative biological meaning based on functional annotations of genes that are predominant within the component. We systematically evaluate the clustering performance of several ICA algorithms on four expression datasets and show that ICA-based clustering is superior to other leading methods that have been applied to analyze the same datasets. We also proposed a kernel based nonlinear ICA algorithm for dealing with more realistic mixture model. Among the different linear ICA algorithms including six linear and one nonlinear ICA algorithm, the natural-gradient maximum-likelihood estimation method (NMLE) [7]-[8] performs well in all the datasets. 
Kernel-based nonlinear ICA method worked better for three small datasets. 2 M a t h e m at i c a l m o de l o f g e n om e - wi d e e x p r e s s i on Several distinct biological processes take place simultaneously inside a cell; each biological process has its own expression program to up-regulate or down-regulate the level of expression of specific sets of genes. We model a genome-wide expression pattern in a given condition (measured by a microarray assay) as a mixture of signals generated by statistically independent biological processes with different activation levels. We design two kinds of models for genomic expression pattern: a linear and nonlinear mixture model. Suppose that a cell is governed by M independent biological processes S = (s1, …, sM)T, each of which is a vector of K gene expression levels, and that we measure the levels of expression of all genes in N conditions, resulting in a microarray expression matrix X = (x1,…,xN)T. The expression level at each different condition j can be expressed as linear combinations of the M biological processes: xj=aj1s1+…+ajMsM. We can express this idea concisely in matrix notation as follows. X = AS , x1 a11 L a1M s1 M = M M M x N a N 1 L a NM s M (1) More generally, we can express X = (x1,…,xN)T as a post-nonlinear mixture of the underlying independent processes as follows, where f(.) is a nonlinear mapping from N to N dimensional space. X = f ( AS ), a11 L a1M s1 x1 M = f M M M a xN N 1 L a NM s M (2) 3 I n d e p e n d e n t c o m po n e n t a n a l y s i s In the models described above, since we assume that the underlying biological processes are independent, we suggest that vectors S=(s1,…,sM) are statistically independent and so ICA can recover S from the observed microarray data X. For linear ICA, we apply natural-gradient maximum estimation (NMLE) method which was proposed in [7] and was made more efficient by using natural gradient method in [8]. We also apply nonlinear ICA using reproducible kernel Hilbert spaces (RKHS) based on [9], as follows: 1. We map the N dimensional input data xi to Ф(xi) in the feature space by using the kernel trick. The feature space is defined by the relationship Ф(xi)TФ(xj)=k(xi,, xj). That is, inner product of mapped data is determined to by a kernel function k(.,.) in the input space; we used a Gaussian radial basis function (RBF) kernel (k(x,y)=exp(-|x-y|2)) and a polynomial kernel of degree 2 (k(x,y)=(xTy+1)2). To perform mapping, we found orthonormal bases of the feature space by randomly sampling L input data v={v1,…,vL} 1000 times and choosing one set minimizing the condition number of Φv=(Φ(v1),…,Φ(vL)). Then, a set of orthonormal bases of the feature space is determined by the selected L images of input data in v as Ξ = Φv(Φv TΦv )-1/2. We map all input data x1,…,xK, each corresponding to a gene, to Ψ(x1),…,Ψ(xK) in the feature space with basis Ξ, as follows: k (v1 , v1 ) K k (v1 , v L ) M M k (v L , v1 ) L k (v L , v L ) Ψ(xi)=(ΦvTΦv )-1/2Φv TΦv (xi) = −1 / 2 k (v1 , xi ) ∈ ℜ L (1≤ i≤K) (3) M k ( v L , xi ) 2. We linearly decompose the mapped data Ψ=[Ψ(x1),.,Ψ(xK)] ∈RL×K into statistically independent components using NMLE. 4 Proposed approach The microarray dataset we are given is in matrix form where each element xij corresponds to the level of expression of the jth gene in the ith experimental condition. Missing values are imputed by KNNImpute [10], an algorithm based on k nearest neighbors that is widely used in microarray analysis. 
Given the expression matrix X of N experiment by K genes, we perform the following steps. 1. Apply ICA to decompose X into independent components y1, …,yM as in Equations (1) and (2). Prior to applying ICA, remove any rows that make the expression matrix X singular. After ICA, each component denoted by yi is a vector comprising K loads gene expression levels, i.e., yi = (yi1, ...,yiK). We chose to let the number of components M to be maximized, which is equal the number of microarray experiments N because the maximum for N in our datasets was 250, which is smaller than the number of biological processes we hypothesize to act within a cell. 2. For each component, cluster genes according to their relative loads yij/mean(yi). Based on our ICA model, each component is a putative genomic expression program of an independent biological process. Thus, our hypothesis is that genes showing relatively high or low expression level within the component are the most important for the process. We create two clusters for each component: one cluster containing genes with expression level higher than a threshold, and one cluster containing genes with expression level lower than a threshold. Cluster i,1 = {gene j | y ij > mean( y i ) + c × std( y i )} Cluster i,2 = {gene j | y ij < mean( y i ) – c × std( y i )} (4) Here, mean(yi) is the average, std(yi) is the standard deviation of yi; and c is an adjustable coefficient. The value of the coefficient c was varied from 1.0 to 2.0 and the result for c=1.25 was presented in this paper. The results for other values of c are similar, and are presented on the website www.stanford.edu/~silee/ICA/. 3. For each cluster, measure the enrichment of each cluster with genes of known functional annotations. Using the Gene Ontology (GO) [11] and KEGG [12] gene annotation databases, we calculate the p-value for each cluster with every gene annotation, which is the probability that the cluster contains the observed number of genes with the annotation by chance assuming the hypergeometric distribution (details in [4]). For each gene annotation, the minimum p-value that is smaller than 10-7 obtained from any cluster was collected. If no p-value smaller than 10-7 is found, we consider the gene annotation not to be detected by the approach. As a result, we can assign biological meaning to each cluster and the corresponding independent component and we can evaluate the clustering performance by comparing the collected minimum p-value for each gene annotation with that from other clustering approach. 5 P e r f o r m a n c e e v a l uat i o n We tested the ICA-based clustering to four expression datasets (D1—D4) described in Table 1. Table 1: The four datasets used in our analysis ARRAY TYPE DESCRIPTION # OF GENES (K) # OF EXPS (N) D1 Spotted 4579 22 D2 Oligonucl eotide Spotted Oligonucl eotide Budding yeast during cell cycle and CLB2/CLN3 overactive strain [13] Budding yeast during cell cycle [14] 6616 17 C. elegans in various conditions [3] Normal human tissue including 19 kinds of tissues [15] 17817 7070 553 59 D3 D4 For D1 and D4, we compared the biological coherence of ICA components with that of PCA applied in the same datasets in [1] and [2], respectively. For D2 and D3, we compared with k-means clustering and the topomap method, applied in the same datasets in [4] and [3], respectively. We applied nonlinear ICA to D1, D2 and D4. Dataset D3 is very large and makes the nonlinear algorithm unstable. 
D1 was preprocessed to contain log-ratios xij=log2(Rij/Gij) between red and green intensities. In [1], principal components, referred to as eigenarrays, were hypothesized to be genomic expression programs of distinct biological processes. We compared the biological coherence of independent components with that of principal components found by [1]. Comparison was done in two ways: (1) For each component, we grouped genes within top x% of significant up-regulation and down-regulation (as measured by the load of the gene in the component) into two clusters with x adjusted from 5% to 45%. For each value of x, statistical significance was measured for clusters from independent components and compared with that from principal components based on the minimum p-value for each gene annotation, as described in Section 4. We made a scatter plot to compare the negative log of the collected best p-values for each gene annotation when x is fixed to be 15%, shown in Figure 1 (a) (2) Same as before, except we did not fix the value of x; instead, we collected the minimum p-value from each method for each GO and KEGG gene annotation category and compared the collected p-values (Figure 1 (b)). For both cases, in the majority of the gene annotation categories ICA produced significantly lower p-values than PCA did, especially for gene annotation for which both ICA and PCA showed high significance. Figure 1. Comparison of linear ICA (NMLE) to PCA on dataset D1 (a) when x is fixed to be 15%; (b) when x is not fixed. (c) Three independent components of dataset D4. Each gene is mapped to a point based on the value assigned to the gene in three independent components, which are enriched with liver- (red), Muscle- (orange) and vulva-specific (green) genes, respectively. The expression levels of genes in D4 were normalized across the 59 experiments, and the logarithms of the resulting values were taken. Experiments 57, 58, and 59 were removed because they made the expression matrix nearly singular. In [2], a clustering approach based on PCA and subsequent visual inspection was applied to an earlier version of this dataset, containing 50 of the 59 samples. After we performed ICA, the most significant independent components were enriched for liver-specific, muscle-specific and vulva-specific genes with p-value of 10-133, 10-124 and 100-117, respectively. In the ICA liver cluster, 198 genes were liver specific (out of a total of 244), as compared with the 23 liver-specific genes identified in [2] using PCA. The ICA muscle cluster of 235 genes contains 199 muscle specific genes compared to 19 muscle-specific genes identified in [2]. We generated a 3-dimensional scatter plot of the load expression levels of all genes annotated in [15] on these significant ICA components in Figure 1 (c). We can see that the liver-specific, muscle-specific and vulva-specific genes are strongly biased to lie on the x-, y-, and z- axis, respectively. We applied nonlinear ICA on this dataset and the first four most significant clusters from nonlinear ICA with Gaussian RBF kernel were muscle-specific, liver-specific, vulva-specific and brain-specific with p-value of 10-158, 10-127, 10-112 and 10-70, respectively, showing considerable improvement over the linear ICA clusters. For D2, variance-normalization was applied to the 3000 most variant genes as in [4]. The 17th experiment, which made the expression matrix close to singular, was removed. 
We measured the statistical significance of clusters as described in Section 4 and compared the smallest p-value of each gene annotation from our approach to that from k-means clustering applied to the same dataset [4]. We made a scatter plot for comparing the negative log of the smallest p-value (y-axis) from ICA clusters with that from k-means clustering (x-axis). The coefficient c is varied from 1.0 to 2.0 and the superiority of ICA-based clustering to k-means clustering does not change. In many practical settings, estimation of the best c is not needed; we can adjust c to get a desired size of the cluster unless our focus is to blindly find the size of clusters. Figure 2 (a) (b) (c) shows for c=1.25 a comparison of the performance of linear ICA (NMLE), nonlinear ICA with Gaussian RBF kernel (NICA gauss), and k-means clustering (k-means). For D3, first we removed experiments that contained more than 7000 missing values, because ICA does not perform properly when the dataset contains many missing values. The 250 remaining experiments were used, containing expression levels for 17817 genes preprocessed to be log-ratios xij=log2(Rij/Gij) between red and green intensities. We compared the biological coherence of clusters by our approach with that of topomap-based approach applied to the same dataset in [3]. The result when c=1.25 is plotted in the Figure 2 (d). We observe that the two methods perform very similarly, with most categories having roughly the same p-value in ICA and in the topomap clusters. The topomap clustering approach performs slightly better in a larger fraction of the categories. Still, we consider this performance a confirmation that ICA is a widely applicable method that requires minimal training: in this case the missing values and high diversity of the data make clustering especially challenging, while the topomap approach was specifically designed and manually trained for this dataset as described in [3]. Finally, we compared different ICA algorithms in terms of clustering performance. We tested six linear ICA methods: Natural Gradient Maximum Likelihood Estimation (NMLE) [7][8], Joint Approximate Diagonalization of Eigenmatrices [16], Fast Fixed Point ICA with three different measures of non-Gaussianity [17], and Extended Information Maximization (Infomax) [18]. We also tested two kernels for nonlinear ICA: Gaussian RBF kernel, and polynomial kernel (NICA ploy). For each dataset, we compared the biological coherence of clusters generated by each method. Among the six linear ICA algorithms, NMLE was the best in all datasets. Among both linear and nonlinear methods, the Gaussian kernel nonlinear ICA method was the best in Datasets D1, D2 and D4, the polynomial kernel nonlinear ICA method was best in Dataset D4, and NMLE was best in the large datasets (D3 and D4). In Figure 3, we compare the NMLE method with three other ICA methods for the dataset D2. Overall, the NMLE algorithm consistently performed well in all datasets. The nonlinear ICA algorithms performed best in the small datasets, but were unstable in the two largest datasets. More comparison results are demonstrated in the website www.stanford.edu/~silee/ICA/. Figure 2: Comparison of (a) linear ICA (NMLE) with k-means clustering, (b) nonlinear ICA with Gaussian RBF kernel to linear ICA (NMLE), and (c) nonlinear ICA with Gaussian RBF kernel to k-means clustering on the dataset D2. (d) Comparison of linear ICA (NMLE) to topomap-based approach on the dataset D3. 
Figure 3: Comparison of linear ICA (NMLE) to (a) the Extended Infomax ICA algorithm, (b) FastICA with symmetric orthogonalization and a tanh nonlinearity, and (c) nonlinear ICA with a polynomial kernel of degree 2, on dataset D2.

6 Discussion

ICA is a powerful statistical method for separating mixed independent signals. We proposed applying ICA to decompose microarray data into independent gene expression patterns of underlying biological processes, and to group genes into mutually non-exclusive clusters with statistically significant functional coherence. Our clustering method outperformed several leading methods on a variety of datasets, with the added advantage that it requires setting only one parameter: the fraction c of standard deviations beyond which a gene is considered to be associated with a component's cluster. We observed that performance was not very sensitive to this parameter, suggesting that ICA is robust enough to be used for clustering with little human intervention. The empirical performance of ICA in our tests supports the hypothesis that statistical independence is a good criterion for separating mixed biological signals in microarray data.

The Extended Infomax ICA algorithm proposed in [18] can automatically determine whether the distribution of each source signal is super-Gaussian or sub-Gaussian. Interestingly, applying Extended Infomax ICA to all the expression datasets uncovered no source signal with a sub-Gaussian distribution. A likely explanation is that global gene expression profiles are mixtures of super-Gaussian sources rather than of sub-Gaussian sources. This finding is consistent with the following intuition: underlying biological processes are super-Gaussian because they sharply affect the relevant genes, which are typically a small fraction of all genes, while leaving the majority of genes relatively unaffected.

Acknowledgments

We thank Te-Won Lee for helpful feedback. We thank Relly Brandman, Chuong Do, and Yueyi Liu for edits to the manuscript.

References

[1] Alter O, Brown PO, Botstein D. Proc. Natl. Acad. Sci. USA 97(18):10101-10106, 2000.
[2] Misra J, Schmitt W, et al. Genome Research 12:1112-1120, 2002.
[3] Kim SK, Lund J, et al. Science 293:2087-2092, 2001.
[4] Tavazoie S, Hughes JD, et al. Nature Genetics 22(3):281-285, 1999.
[5] Hori G, Inoue M, et al. Proc. 3rd Int. Workshop on Independent Component Analysis and Blind Signal Separation, Helsinki, Finland, pp. 151-155, 2000.
[6] Liebermeister W. Bioinformatics 18(1):51-60, 2002.
[7] Bell AJ and Sejnowski TJ. Neural Computation 7:1129-1159, 1995.
[8] Amari S, Cichocki A, et al. In Advances in Neural Information Processing Systems 8, pp. 757-763. Cambridge, MA: MIT Press, 1996.
[9] Harmeling S, Ziehe A, et al. In Advances in Neural Information Processing Systems 15. Cambridge, MA: MIT Press, 2003.
[10] Troyanskaya O, Cantor M, et al. Bioinformatics 17:520-525, 2001.
[11] The Gene Ontology Consortium. Genome Research 11:1425-1433, 2001.
[12] Kanehisa M, Goto S. In Current Topics in Computational Molecular Biology, pp. 301-315. Cambridge, MA: MIT Press, 2002.
[13] Spellman PT, Sherlock G, et al. Mol. Biol. Cell 9:3273-3297, 1998.
[14] Cho RJ, Campbell MJ, et al. Molecular Cell 2:65-73, 1998.
[15] Hsiao L, Dangond F, et al. Physiol. Genomics 7:97-104, 2001.
[16] Cardoso JF. Neural Computation 11(1):157-192, 1999.
[17] Hyvarinen A. IEEE Transactions on Neural Networks 10(3):626-634, 1999.
[18] Lee TW, Girolami M, et al. Neural Computation 11:417-441, 1999.