jmlr jmlr2007 jmlr2007-3 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Arindam Banerjee, Inderjit Dhillon, Joydeep Ghosh, Srujana Merugu, Dharmendra S. Modha
Abstract: Co-clustering, or simultaneous clustering of rows and columns of a two-dimensional data matrix, is rapidly becoming a powerful data analysis technique. Co-clustering has enjoyed wide success in varied application domains such as text clustering, gene-microarray analysis, natural language processing and image, speech and video analysis. In this paper, we introduce a partitional co-clustering formulation that is driven by the search for a good matrix approximation: every co-clustering is associated with an approximation of the original data matrix and the quality of co-clustering is determined by the approximation error. We allow the approximation error to be measured using a large class of loss functions called Bregman divergences that include squared Euclidean distance and KL-divergence as special cases. In addition, we permit multiple structurally different co-clustering schemes that preserve various linear statistics of the original data matrix. To accomplish the above tasks, we introduce a new minimum Bregman information (MBI) principle that simultaneously generalizes the maximum entropy and standard least squares principles, and leads to a matrix approximation that is optimal among all generalized additive models in a certain natural parameter space. Analysis based on this principle yields an elegant meta algorithm, special cases of which include most previously known alternate minimization based clustering algorithms such as kmeans and co-clustering algorithms such as information theoretic (Dhillon et al., 2003b) and minimum sum-squared residue co-clustering (Cho et al., 2004). To demonstrate the generality and flexibility of our co-clustering framework, we provide examples and empirical evidence on a variety of problem domains and also describe novel co-clustering applications such as missing value prediction and compression of categorical data matrices.
Reference: text
sentIndex sentText sentNum sentScore
1 Simultaneous grouping of rows and columns into row and column clusters is more informative and digestible. [sent-47, score-0.274]
2 A row (or column) clustering can be thought of as dimensionality reduction along the rows (or columns). [sent-50, score-0.297]
3 Since coclustering incorporates row clustering information into column clustering and vice versa, one can think of it as a “statistical regularization” technique that can yield better quality clusters even if one is primarily interested in a single-sided clustering. [sent-52, score-0.576]
4 Single-sided, geometric clustering algorithms such as kmeans and its variants have computation time proportional to mnk per iteration, where m is the number of rows, n is the number of columns and k is the number of row clusters. [sent-58, score-0.297]
5 Since the number of row and column clusters is usually much smaller than the original number of rows and columns, co-clustering can lead to a substantial reduction in running time (see, for example, Dhillon et al., 2003). [sent-60, score-0.331]
6 In the co-clustering formulation of Dhillon et al. (2003), all the rows and columns are partitioned into disjoint row and column clusters, respectively. [sent-64, score-0.361]
7 For example, in the work of Cheng and Church (2000), the row and column averages of each co-cluster are preserved and the discrepancy between the original and the compressed representation is measured in terms of the sum of element-wise squared deviation. [sent-70, score-0.297]
8 Information-theoretic co-clustering (Dhillon et al., 2003b), which is applicable to data matrices representing joint probability distributions, preserves a different set of summary statistics, that is, the row and column averages and the co-cluster averages. [sent-72, score-0.354]
9 Information-theoretic co-clustering provides a principled approach for simultaneously clustering the rows and columns of the joint probability distribution p(X, Y). [sent-77, score-0.303]
10 Let the row clusters be denoted by $\{\hat{x}_g\}$, $[g]_1^k$, and the column clusters by $\{\hat{y}_h\}$, $[h]_1^l$. [sent-79, score-0.324]
11 Let $\hat{X}$ and $\hat{Y}$ denote the clustered random variables induced by $X$ and $Y$ that range over the set of row and column clusters, respectively. [sent-80, score-0.274]
12 If only row and column marginals are to be preserved, that is, (5) holds, then the product distribution $p(X)p(Y)$ has maximum entropy (see Cover and Thomas, 1991, Problem 5, Chapter 11). [sent-98, score-0.258]
13 This formulation serves the dual purpose of (i) obtaining row and column clusterings that optimize a well-defined global objective function, and (ii) providing a new class of desirable matrix approximations. [sent-113, score-0.309]
14 Matrices are denoted using upper case bold letters, for example, $Z$, whereas the corresponding lower case letters $z_{uv}$ denote the matrix elements. [sent-157, score-0.302]
15 When $w$ is uniform and the support of $Z$, that is, $\mathcal{Z}$, consists of the elements of a matrix $Z \in \mathbb{R}^{m \times n}$, that is, $\mathcal{Z} = \{z_{uv}, [u]_1^m, [v]_1^n\}$, then $E[Z] = \frac{1}{mn}\sum_{u=1}^{m}\sum_{v=1}^{n} z_{uv} \equiv \bar{z}$. [sent-190, score-0.339]
16 The Bregman information in this case is given by $I_\phi(Z) = \frac{1}{mn}\sum_{u=1}^{m}\sum_{v=1}^{n}(z_{uv} - \bar{z})^2 = \frac{1}{mn}\sum_{u=1}^{m}\sum_{v=1}^{n} z_{uv}^2 - \bar{z}^2 = \frac{1}{mn}\|Z\|_F^2 + \text{constant}$, that is, a linear function of the squared Frobenius norm of $Z$. [sent-191, score-0.4]
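To make this identity concrete, the following sketch (an illustration only; the toy matrix and the uniform-weight assumption are mine, not from the paper) computes the Bregman information under squared loss and checks that it differs from the normalized squared Frobenius norm only by the constant $\bar{z}^2$:

```python
import numpy as np

# Toy data matrix Z with uniform weights w_uv = 1/(mn) (an assumption of this sketch).
Z = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])
m, n = Z.shape
zbar = Z.mean()

# Bregman information under squared loss phi(z) = z^2: I_phi(Z) = E[(Z - E[Z])^2].
I_phi = np.mean((Z - zbar) ** 2)

# The same quantity via the squared Frobenius norm: (1/mn) * ||Z||_F^2 - zbar^2.
I_phi_frob = np.linalg.norm(Z, 'fro') ** 2 / (m * n) - zbar ** 2

assert np.isclose(I_phi, I_phi_frob)
```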
17 Then, $p(X,Y)$ can be written in the form of the matrix $Z = [z_{uv}]$, $[u]_1^m$, $[v]_1^n$, where $z_{uv} = p(x_u, y_v)$ is a deterministic function of $u$ and $v$. [sent-216, score-0.302]
18 The underlying probability measure is given by $w$, that is, $p(Z(U,V) = z_{uv}) = w_{uv}$. [sent-236, score-0.371]
19 The summary statistics may be properties of the co-clusters themselves, such as co-cluster marginals as in (4), and/or some other important statistics of the data, such as row and column marginals as in (5). [sent-247, score-0.267]
20 Using Lemma 4, the original Bregman clustering problem in (6) can be posed as one of finding the optimal co-clustering $(\rho^*, \gamma^*)$ defined as follows: $(\rho^*, \gamma^*) = \operatorname{argmin}_{(\rho,\gamma)} E[d_\phi(Z, \hat{Z})] = \operatorname{argmin}_{(\rho,\gamma)} [I_\phi(Z) - I_\phi(\hat{Z})] = \operatorname{argmax}_{(\rho,\gamma)} I_\phi(\hat{Z})$ (12), since $I_\phi(Z)$ is a constant. [sent-321, score-0.28]
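The decomposition $E[d_\phi(Z,\hat{Z})] = I_\phi(Z) - I_\phi(\hat{Z})$ can be checked numerically for the squared-loss case, where $\hat{Z}$ is the matrix of co-cluster means. The sketch below assumes uniform weights and uses hypothetical cluster labels of my choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.random((6, 8))
rho = np.array([0, 0, 1, 1, 2, 2])           # hypothetical row cluster labels, k = 3
gamma = np.array([0, 0, 1, 1, 2, 2, 3, 3])   # hypothetical column cluster labels, l = 4

# Block-average approximation: each entry replaced by its co-cluster mean.
Zhat = np.empty_like(Z)
for g in range(3):
    for h in range(4):
        block = np.ix_(rho == g, gamma == h)
        Zhat[block] = Z[block].mean()

zbar = Z.mean()
bregman_info = lambda A: np.mean((A - zbar) ** 2)   # I_phi under squared loss

lhs = np.mean((Z - Zhat) ** 2)                      # E[d_phi(Z, Zhat)]
rhs = bregman_info(Z) - bregman_info(Zhat)          # I_phi(Z) - I_phi(Zhat)
assert np.isclose(lhs, rhs)
```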
21 (Block Average Co-clustering Algorithm) In this section, we present an algorithm for block average co-clustering based on a useful decomposition of the objective function (12), which gives better insight into how to update either the row clustering ρ or the column clustering γ. [sent-325, score-0.523]
22 (A Useful Decomposition) From Theorem 1, it follows that for a given co-clustering $(\rho, \gamma)$, the approximation $\hat{Z}$ that achieves the minimum Bregman information is given by $\hat{z}_{uv} = E[Z|\hat{u}, \hat{v}]$, where $\hat{u} = \rho(u)$, $\hat{v} = \gamma(v)$. [sent-328, score-0.272]
23 Hence, the optimal approximation $\hat{Z}$ corresponding to $(\rho, \gamma)$ is given by $\hat{z}_{uv} = \mu_{\hat{u}\hat{v}} = \mu_{\rho(u)\gamma(v)}$. [sent-330, score-0.272]
24 With this closed form for $\hat{Z}$, we have $E[d_\phi(Z, \hat{Z})] = \sum_{u,v} w_{uv}\, d_\phi(z_{uv}, \mu_{\rho(u)\gamma(v)}) = \sum_{g=1}^{k}\sum_{h=1}^{l}\sum_{u:\rho(u)=g}\sum_{v:\gamma(v)=h} w_{uv}\, d_\phi(z_{uv}, \mu_{gh})$ (14). [sent-331, score-0.262]
25 Note that (14) decomposes the objective function in terms of the row cluster assignment ρ(u) of each row u and the column cluster assignment γ(v) of each column v. [sent-332, score-0.685]
26 (Updating Row and Column Clusters) Since the decomposition (14) is additive over all the rows (or columns), we can update the current cluster assignment of each row (or column) in order to decrease the objective function. [sent-335, score-0.367]
27 Assuming ρ(u) = g, we can express the objective function (14) as the sum of row contributions of the form $J_u(g) = \sum_{h=1}^{l}\sum_{v:\gamma(v)=h} w_{uv}\, d_\phi(z_{uv}, \mu_{gh})$. [sent-337, score-0.286]
28 Hence, the best possible choice for the new row cluster assignment $\rho^{\mathrm{new}}(u)$ is to pick the value of g that has the minimum cost, that is, $\rho^{\mathrm{new}}(u) = \operatorname{argmin}_{g} J_u(g) = \operatorname{argmin}_{g} \sum_{h=1}^{l}\sum_{v:\gamma(v)=h} w_{uv}\, d_\phi(z_{uv}, \mu_{gh})$ (16). [sent-340, score-0.542]
29 Since the terms corresponding to each row are additive in (14), the row assignment update in (16) can be applied simultaneously to all rows to get the new row assignments $\rho^{\mathrm{new}}(u)$, $[u]_1^m$. [sent-341, score-0.59]
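Putting the row update, the symmetric column update, and the co-cluster mean (MBI) step together gives the block-average special case of the meta-algorithm. The sketch below is a minimal illustration for squared Euclidean distance with uniform weights; the function name, random initialization, and empty-block fallback are my choices, and the paper's general algorithm covers other Bregman divergences and bases:

```python
import numpy as np

def block_average_coclustering(Z, k, l, n_iter=20, seed=0):
    """Block-average co-clustering for squared Euclidean distance and uniform
    weights: alternate between the co-cluster mean (MBI) step, the row update
    (16), and the symmetric column update. Each step cannot increase (14)."""
    rng = np.random.default_rng(seed)
    m, n = Z.shape
    rho = rng.integers(0, k, size=m)       # initial row clustering (random)
    gamma = rng.integers(0, l, size=n)     # initial column clustering (random)

    for _ in range(n_iter):
        # Co-cluster means mu[g, h]; empty co-clusters fall back to the global mean.
        mu = np.full((k, l), Z.mean())
        for g in range(k):
            for h in range(l):
                block = Z[np.ix_(rho == g, gamma == h)]
                if block.size:
                    mu[g, h] = block.mean()

        # Row update: cost[u, g] = sum_v (z_uv - mu[g, gamma(v)])^2.
        row_cost = ((Z[:, None, :] - mu[None, :, gamma]) ** 2).sum(axis=2)
        rho = row_cost.argmin(axis=1)

        # Column update: cost[v, h] = sum_u (z_uv - mu[rho(u), h])^2.
        col_cost = ((Z.T[:, None, :] - mu.T[None, :, rho]) ** 2).sum(axis=2)
        gamma = col_cost.argmin(axis=1)

    return rho, gamma, mu

rho, gamma, mu = block_average_coclustering(np.random.rand(60, 40), k=3, l=4)
```

Each of the three updates in the loop can only decrease the objective in (14), so the procedure converges to a locally optimal co-clustering.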
30 The new row assignments effectively change the current approximation matrix $\hat{Z}$ to a new matrix $\tilde{Z}_{\rho_1\gamma_0}$, which is just a row-permuted version of $\hat{Z}$ that achieves a lower cost, that is, $E[d_\phi(Z, \tilde{Z}_{\rho_1\gamma_0})] \le E[d_\phi(Z, \hat{Z})]$. [sent-342, score-0.322]
31 Note that the current approximation can possibly be further improved by another round of row clustering updates to get an approximation $\tilde{Z}_{\rho_2\gamma_1}$, where the subscript in ρ (or γ) denotes the number of times the row (column) cluster assignment has been updated. [sent-345, score-0.543]
32 Once all row and column assignments have been updated, the new approximation matrix $\tilde{Z}$ needs to be recomputed to correspond to the new co-clustering $(\rho^{\mathrm{new}}, \gamma^{\mathrm{new}})$. [sent-348, score-0.352]
33 At every iteration, either the row clustering ρ or the column clustering γ is updated in order to decrease the objective function value in (12). [sent-368, score-0.463]
34 Let $R \in \{0, 1\}^{m \times k}$ and $C \in \{0, 1\}^{n \times l}$ denote the row and column cluster membership matrices, that is, $r_{ug} = 1$ if $g = \rho(u)$ and $0$ otherwise, and $c_{vh} = 1$ if $h = \gamma(v)$ and $0$ otherwise. [sent-378, score-0.276]
35 Therefore, since the approximation can be expressed as the product $RMC^T$, the co-clustering problem essentially reduces to finding a row assignment matrix $R$ and a column assignment matrix $C$ such that the approximation error $d_{\Phi_w}(Z, \hat{Z})$ is minimized, where $\hat{Z} = RMC^T$. [sent-381, score-0.49]
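For the block-average case, the factorized form can be checked directly; the small sketch below (names and toy assignments are mine) builds R, C, and the co-cluster mean matrix M, and reconstructs $\hat{Z} = RMC^T$:

```python
import numpy as np

Z = np.arange(30, dtype=float).reshape(5, 6)
rho = np.array([0, 1, 1, 0, 2])           # hypothetical row assignments, k = 3
gamma = np.array([0, 0, 1, 1, 2, 2])      # hypothetical column assignments, l = 3
k, l = 3, 3

R = np.eye(k)[rho]                        # R[u, g] = 1 iff rho(u) = g, shape (m, k)
C = np.eye(l)[gamma]                      # C[v, h] = 1 iff gamma(v) = h, shape (n, l)

# Co-cluster means M[g, h] under uniform weights: block sums divided by block sizes.
counts = R.T @ np.ones_like(Z) @ C
M = (R.T @ Z @ C) / counts

Zhat = R @ M @ C.T                        # z_hat[u, v] = M[rho(u), gamma(v)]
assert np.isclose(Zhat[0, 2], Z[np.ix_(rho == 0, gamma == 1)].mean())
```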
36 The matrix approximation obtained in the general case need not be expressible as a matrix factorization in terms of the cluster membership matrices R and C. [sent-411, score-0.274]
37 Similar to Algorithm 1, we obtain an iterative algorithm for the general case where we alternately optimize over the row cluster assignments, column cluster assignments and the MBI solution. [sent-426, score-0.362]
38 Theorem 3 Given the random variable $Z$, there are only six distinct co-clustering bases that approximate $Z$ using conditional expectations of $Z$ given combinations of the row and column random variables $\{U, V, \hat{U}, \hat{V}\}$. [sent-466, score-0.371]
39 For example, the conditional expectation is taken over all the columns and all the rows in the row cluster determined by $\hat{U}$ in the case of $E[Z|\hat{U}]$. [sent-473, score-0.271]
40 For example, in Figure 1(a), the expectations along the row clusters ($E[Z|\hat{U}]$) and the column clusters ($E[Z|\hat{V}]$) are the statistics used for reconstructing the original $Z$. [sent-475, score-0.371]
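Under uniform weights these conditional expectations reduce to plain averages over the corresponding rows, columns, row clusters, column clusters, or co-clusters. A small sketch with hypothetical cluster labels of my choosing:

```python
import numpy as np

Z = np.arange(48, dtype=float).reshape(6, 8)
rho = np.array([0, 0, 1, 1, 2, 2])           # row clustering, k = 3 (hypothetical)
gamma = np.array([0, 0, 1, 1, 2, 2, 3, 3])   # column clustering, l = 4 (hypothetical)

E_Z_given_U = Z.mean(axis=1)                 # E[Z|U]: one value per row
E_Z_given_V = Z.mean(axis=0)                 # E[Z|V]: one value per column
E_Z_given_Uhat = np.array([Z[rho == g].mean() for g in range(3)])         # E[Z|Uhat]
E_Z_given_Vhat = np.array([Z[:, gamma == h].mean() for h in range(4)])    # E[Z|Vhat]
E_Z_given_UhatVhat = np.array([[Z[np.ix_(rho == g, gamma == h)].mean()
                                for h in range(4)] for g in range(3)])    # E[Z|Uhat,Vhat]
```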
41 (Optimality of the MBI Solution) We now present an analysis of the optimality of the MBI solution as the “best” reconstruction of the original matrix given the row and column clustering and the summary statistics corresponding to any of the co-clustering bases. [sent-533, score-0.489]
42 Throughout this section, let us suppose that the underlying measure $w$, the Bregman divergence $d_\phi$, the data matrix $Z$, the number of row clusters $k$, the number of column clusters $l$, and the co-clustering basis $\mathcal{C}$ are specified and fixed. [sent-612, score-0.5]
43 Step 2B: Hold the column clustering $\gamma_t$ fixed, and find a better row clustering, say, $\rho_{t+1}$. [sent-620, score-0.332]
44 Step 2C: Hold the row clustering $\rho_{t+1}$ fixed, and find a better column clustering, say, $\gamma_{t+1}$. [sent-623, score-0.332]
45 Further, since the number of possible row (or column) clusterings is exponential in the number of rows (or columns), it is also essential to have an efficient means for determining the best row (or column) clustering for a fixed choice of the column (or row) clustering and the MBI solution. [sent-628, score-0.629]
46 Fortunately for the co-clustering problem, the expected distortion measure that quantifies the quality of a row (or column) clustering admits a separability property that allows independent optimal updates of the cluster assignments of every row (or column). [sent-629, score-0.489]
47 (A Separability Property) We begin by considering the quality of a candidate row (or column) clustering ρ in Step 2B (or Step 2C) for a fixed choice of column (or row) clustering and MBI solution parameters. [sent-632, score-0.44]
48 Since our objective is to obtain an accurate reconstruction of the original matrix, a natural choice is to consider the expected Bregman distortion between the original $Z$ and a reconstruction $\tilde{Z}$ based on the row (or column) clustering ρ while keeping everything else fixed. [sent-633, score-0.398]
49 The quality of a candidate row (or column) clustering can now be quantified in terms of the accuracy of the corresponding $\tilde{Z}$, where the other two arguments, that is, the column (or row) clustering and the Lagrange multipliers, are fixed. [sent-638, score-0.502]
50 Although $\tilde{Z}$ looks complex, the fact that $\nabla\phi$ is a one-to-one invertible function ensures that each element $\tilde{z}_{uv}$ of the matrix corresponding to $\tilde{Z}$ depends only on $(u, \rho(u), v, \gamma(v))$ for a given $\Lambda$. [sent-641, score-0.302]
51 Hence, for any given $\Lambda$, there exists a function ξ such that the point-wise distortion $d_\phi(z_{uv}, \tilde{z}_{uv})$ can be expressed as $\xi(u, \rho(u), v, \gamma(v))$, that is, it depends only on the corresponding row/column and cluster assignments. [sent-642, score-0.323]
52 The expected distortion can therefore be written as the sum of contributions from the rows (or columns), where each row (or column) contribution depends only on the row and its current cluster assignment. [sent-649, score-0.465]
53 Further, for a fixed Λ and γ, since the total approximation error is the sum over the approximation errors due to each row (or column) and its cluster assignment, greedy cluster assignments of the individual rows result in a globally optimal row clustering ρ for the given Λ and γ. [sent-652, score-0.631]
54 First, we will demonstrate how to update row clustering (or column clustering) with respect to a fixed column clustering (or row clustering) and a fixed set of Lagrange multipliers. [sent-660, score-0.687]
55 Updating the row clustering while keeping the column clustering and the Lagrange multipliers fixed leads to a new value for the Bregman co-clustering objective function. [sent-664, score-0.611]
56 Now making use of the separability property in Lemma 8, we can efficiently optimize the contribution of each row assignment to the overall objective function to obtain the following row cluster update step. [sent-665, score-0.417]
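A sketch of such a row cluster update for a generic Bregman divergence, with the column clustering and the current co-cluster representatives held fixed, is given below; the divergence functions are the standard squared-loss and generalized I-divergence forms, while the helper names and the representative-based form of the approximation are my simplifying assumptions:

```python
import numpy as np

def squared_loss(z, mu):
    return (z - mu) ** 2

def i_divergence(z, mu):
    # Generalized I-divergence d_phi(z, mu) = z log(z/mu) - z + mu, for z, mu > 0.
    return z * np.log(z / mu) - z + mu

def update_row_clusters(Z, W, mu, gamma, d_phi):
    """For each row u, pick the cluster g minimizing sum_v w_uv d_phi(z_uv, mu[g, gamma(v)]).
    Separability lets every row be updated independently of the others."""
    k = mu.shape[0]
    costs = np.stack(
        [(W * d_phi(Z, mu[g][gamma][None, :])).sum(axis=1) for g in range(k)],
        axis=1)                              # shape (m, k)
    return costs.argmin(axis=1)              # new row clustering rho
```

The column update is symmetric: transpose Z and W, swap the roles of ρ and γ, and use the transposed representative matrix.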
57 A similar argument applies to Step 2C, where we seek to update the column clustering while keeping the row clustering fixed. [sent-669, score-0.463]
58 Proof. From Lemmas 9, 10, and 11, it follows that updating the row clustering ρ, the column clustering γ and the Lagrange multipliers Λ one at a time decreases the objective function of the Bregman co-clustering problem. [sent-694, score-0.525]
59 Hence, it is straightforward to obtain the row and column cluster update steps and implement these Bregman co-clustering algorithms. [sent-699, score-0.299]
60 However, it is often difficult to obtain good document clusters by directly clustering the matrix rows due to the inherent sparsity and high dimensionality. [sent-777, score-0.277]
61 In the generative model, we used 5 row clusters and 5 column clusters. [sent-824, score-0.274]
62 The quality of the co-clustering (that is, the row and column clustering), as indicated by the normalized mutual information with true labels, is better when the Bregman divergence used in the co-clustering algorithm matches that of the generative model. [sent-832, score-0.294]
63 From the table, it is clear that for relatively simple matrices such as M1 and M2, reasonably low-parameter bases such as C1 or C2 suffice, whereas for more complex matrices such as M6, high-parameter co-clustering bases such as C6 are necessary. [sent-941, score-0.362]
64 Since several comparisons of specific co-clustering schemes to alternative text clustering approaches have already been reported, we focus on the relative performance of the different co-clustering bases introduced in this paper. [sent-949, score-0.266]
65 For each basis, the number of parameters varies as a function of the number of row and column clusters that the co-clustering algorithm uses. [sent-1003, score-0.274]
66 Table 5: Mean absolute error (MAE) for reconstructing MovieLens data (all values) using co-clustering methods based on squared Euclidean distance and I-divergence and co-clustering basis C5. [sent-1027, score-0.265]
67 To determine the appropriate divergence and co-clustering basis for this data, we performed experiments using both squared Euclidean distance and I-divergence and various co-clustering bases with varying numbers of row and column clusters. [sent-1033, score-0.459]
68 The number of row and column clusters for co-clustering (based on squared Euclidean distance and basis C5) and the rank of SVD and NNMF were set to 5, and the number of neighbors in the correlation method was set to 50. [sent-1061, score-0.367]
69 The co-clustering objective function in this case is given by $J(\rho, \gamma) = \sum_{u=1}^{m}\sum_{v=1}^{n} \mathrm{KL}(z_{uv} \| \hat{z}_{uv})$, where $Z = [z_{uv}]$ is the original matrix, $\hat{Z} = [\hat{z}_{uv}]$ is the MBI solution based on the co-clustering, and the elements $z_{uv}$ and $\hat{z}_{uv}$ belong to the r-simplex. [sent-1104, score-0.593]
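A direct way to evaluate this objective, assuming the matrix entries are stored as an m x n x r array of distributions on the r-simplex (the array layout and the zero-handling guard below are my assumptions):

```python
import numpy as np

def kl_coclustering_objective(Z, Zhat, eps=1e-12):
    """J(rho, gamma) = sum_{u,v} KL(z_uv || zhat_uv), where Z and Zhat have shape
    (m, n, r) and each z_uv, zhat_uv lies on the r-simplex. Zero entries of Z
    contribute nothing; eps guards the logarithm."""
    terms = np.where(Z > 0,
                     Z * np.log(np.maximum(Z, eps) / np.maximum(Zhat, eps)),
                     0.0)
    return float(terms.sum())
```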
70 For encoding the co-clustering, we employ a naive scheme that involves specifying the row and column clusters for each row and column respectively. [sent-1117, score-0.518]
71 Since there are k row clusters and l column clusters, the total number of bits required is given by $m \log_2 k + n \log_2 l$, as shown in the second column of Table 7. [sent-1118, score-0.366]
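This naive encoding cost is straightforward to compute; for instance (the sizes below are arbitrary, not taken from the paper's experiments):

```python
import math

def naive_coclustering_bits(m, n, k, l):
    # Each of the m rows stores a row-cluster id (log2 k bits); each of the n
    # columns stores a column-cluster id (log2 l bits).
    return m * math.log2(k) + n * math.log2(l)

print(naive_coclustering_bits(m=1000, n=500, k=8, l=4))   # 3000 + 1000 = 4000.0 bits
```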
72 From the table, we observe that with an optimal choice of row and column clusters, one can obtain an efficient lossless compression of a matrix consisting of finite categorical values. [sent-1131, score-0.308]
73 Taking partial derivatives with respect to $\lambda_{\hat{u}\hat{v}}$, we obtain $\frac{1}{w_{\hat{u}\hat{v}}} \sum_{u:\rho(u)=\hat{u},\, v:\gamma(v)=\hat{v}} w_{uv}(z_{uv} - z'_{uv}) = 0 \;\; \forall \hat{u}, \hat{v}$ (30), that is, $E[Z|\hat{u}, \hat{v}] = E[Z'|\hat{u}, \hat{v}]$ for all $[\hat{u}]_1^k$ and $[\hat{v}]_1^l$. [sent-1227, score-0.502]
74 Now, setting the partial derivatives of (29) with respect to $z'_{uv}$ equal to 0, we get $w_{uv}\,\nabla\phi(z'_{uv}) - w_{uv}\,\nabla\phi(\bar{z}) + \lambda_{\hat{u}\hat{v}}\,\frac{w_{uv}}{w_{\hat{u}\hat{v}}} = 0$, where $\hat{u} = \rho(u)$ and $\hat{v} = \gamma(v)$. [sent-1228, score-0.764]
75 Since $w_{uv} \in \mathbb{R}_+$ and $\bar{z} = E[Z] = E[Z']$, the optimal solution $Z' = Z_A$ has the form $z^A_{uv} = (\nabla\phi)^{-1}\!\left(\nabla\phi(E[Z]) - \frac{\lambda^*_{\hat{u}\hat{v}}}{w_{\hat{u}\hat{v}}}\right)$ (31), $\hat{u} = \rho(u)$, $\hat{v} = \gamma(v)$, where $\lambda^*_{\hat{u}\hat{v}}$ corresponds to the optimal Lagrange multiplier. [sent-1229, score-0.547]
76 Hence, $E[\langle \hat{Z} - Z_A, \nabla\phi(Z_A)\rangle] = E[\langle \hat{Z} - Z_A, \sum_{r=1}^{s} g_r(E[Z|G_r])\rangle] = \sum_{r=1}^{s} E[\langle \hat{Z} - Z_A, g_r(E[Z|G_r])\rangle] = \sum_{r=1}^{s} E_{G_r}[\langle E[\hat{Z}|G_r] - E[Z_A|G_r], g_r(E[Z|G_r])\rangle] = 0$, since $E[\hat{Z}|G_r] = E[Z_A|G_r]$, $\forall G_r \in \mathcal{C}$. [sent-1275, score-0.438]
77 Of these three steps, the last two involve conceptually straightforward comparisons to determine the optimal row and column assignments at each stage whereas the first step usually involves non-linear optimization and can be computationally expensive. [sent-1292, score-0.258]
78 Let $R \in \{0, 1\}^{m \times k}$ and $C \in \{0, 1\}^{n \times l}$ denote the row and column cluster membership matrices, that is, $r_{ug} = 1$ if $g = \rho(u)$ and $0$ otherwise, and $c_{vh} = 1$ if $h = \gamma(v)$ and $0$ otherwise. [sent-1310, score-0.276]
79 In other words, the entry $r_{ug} = 1$ iff row u belongs to row cluster g, and the entry $c_{vh} = 1$ iff column v belongs to column cluster h. [sent-1311, score-0.552]
80 (d) Right multiplication by $E_n^T$: replication of a given (column) vector along all the columns. For example, the conditional expectation $E[Z|\hat{U}, \hat{V}]$ involves partitioning along $(\hat{U}, \hat{V})$, that is, both row and column clusters. [sent-1320, score-0.307]
81 Since there are k row clusters and l column clusters, there are kl partitions (or co-clusters) and a conditional expectation value corresponding to each of these partitions. [sent-1321, score-0.302]
82 To obtain these expectation values, we need to aggregate the rows into the row clusters as well as the columns into column clusters. [sent-1322, score-0.361]
83 The operation $W \otimes Z$ has the effect of attenuating each element $z_{uv}$ by its corresponding weight $w_{uv}$. [sent-1325, score-0.371]
84 Left multiplication by $R^T$ aggregates the weighted matrix along rows in the same row cluster across each column, and then right multiplication by $C$ aggregates this reduced matrix, consisting of row cluster sums, along columns in the same column cluster. [sent-1327, score-0.634]
85 To obtain an $m \times n$ full matrix $Z^f_{\hat{U},\hat{V}}$ such that $z^f_{uv} = E[Z|\rho(u), \gamma(v)]$, we need to replicate the co-cluster values along the rows and columns corresponding to the row and column clusters, respectively. [sent-1330, score-0.663]
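The aggregation and replication steps can be written exactly as these matrix products; the sketch below (uniform weights and toy labels are my assumptions) computes $E[Z|\hat{U},\hat{V}]$ from $R^T(W \otimes Z)C$ normalized by $R^T W C$, and then replicates it back to a full m x n matrix:

```python
import numpy as np

m, n, k, l = 6, 8, 3, 4
Z = np.arange(m * n, dtype=float).reshape(m, n)
W = np.full((m, n), 1.0 / (m * n))            # uniform weight measure (assumption)
rho = np.array([0, 0, 1, 1, 2, 2])
gamma = np.array([0, 0, 1, 1, 2, 2, 3, 3])
R = np.eye(k)[rho]                            # row cluster membership matrix
C = np.eye(l)[gamma]                          # column cluster membership matrix

# Aggregation: R^T (W * Z) C sums the weighted entries in each co-cluster, and
# R^T W C gives the co-cluster weights, so their ratio is E[Z | Uhat, Vhat].
cocluster_mean = (R.T @ (W * Z) @ C) / (R.T @ W @ C)

# Replication: left-multiplying by R and right-multiplying by C^T spreads each
# co-cluster value to all of its rows and columns, giving z^f_uv = E[Z|rho(u), gamma(v)].
Z_full = R @ cocluster_mean @ C.T
assert np.isclose(Z_full[0, 0], Z[np.ix_(rho == 0, gamma == 0)].mean())
```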
86 Given the parameters of the MBI solution and a fixed column clustering determined by C, we want to find, for each row, the row cluster assignment that leads to the best approximation to the original matrix. [sent-1346, score-0.471]
87 In other words, we are searching for a row cluster membership matrix $R$ that results in the most accurate reconstruction $\tilde{Z}$. [sent-1347, score-0.298]
88 Assuming the matrix $Z_{\mathrm{rowRed}}$ is computed a priori, the row clustering only requires $O(mkl)$ operations as opposed to $O(mkn)$, since for each row we only compare the reduced rows (of size $1 \times l$) in $Z_{\mathrm{rowRed}}$ with the $k$ possible row cluster representatives. [sent-1352, score-0.543]
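For the squared-Euclidean, block-average case with uniform weights, this reduced-matrix trick can be sketched as follows; the derivation of the g-dependent part of the cost and all names are mine, not the paper's exact tables:

```python
import numpy as np

def fast_row_update(Z, mu, gamma):
    """O(mkl) row update for squared Euclidean distance with uniform weights.
    Precompute per-row sums over column clusters (an m x l reduced matrix), then
    compare each reduced row against the k row-cluster representatives."""
    k, l = mu.shape
    C = np.eye(l)[gamma]                     # (n, l) column membership
    S = Z @ C                                # S[u, h] = sum of row u over column cluster h
    sizes = C.sum(axis=0)                    # column cluster sizes n_h
    # Expanding sum_h sum_{v in h} (z_uv - mu[g,h])^2 and dropping the g-independent
    # term sum_v z_uv^2 leaves cost[u, g] = sum_h (n_h * mu[g,h]^2 - 2 * mu[g,h] * S[u,h]).
    cost = (mu ** 2 * sizes[None, :]).sum(axis=1)[None, :] - 2.0 * (S @ mu.T)
    return cost.argmin(axis=1)
```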
89 The column cluster assignment step is similar to the previous row cluster assignment step and involves finding a column cluster membership matrix $C$ that results in the most accurate reconstruction $\tilde{Z} = R\,\bar{Z}_{\hat{U},\hat{V}}\,C^T$. [sent-1355, score-0.696]
90 Though we do not explicitly compute it, the MBI matrix $\hat{Z}$ (shown in Table 9) can be expressed in terms of the row clustering R, the column clustering C and these conditional expectations for any co-clustering basis. [sent-1367, score-0.577]
91 To obtain the row cluster assignment step, we observe that the reconstructed matrix $\tilde{Z}$, which has the same form as $\hat{Z}$, can be split into two additive terms, of which only one depends on the candidate row clustering. [sent-1370, score-0.484]
92 The optimal row assignments can, therefore, be determined by finding the nearest row (among the k possible ones) in $\tilde{Z}_{\mathrm{rowVar}}$ for each of the m rows in $Z_{\mathrm{row}}$. [sent-1374, score-0.355]
93 The above row assignment step can be readily instantiated for any specified co-clustering basis by choosing the appropriate matrices $\tilde{Z}_{\mathrm{rowConst}}$ and $\tilde{Z}_{\mathrm{rowVar}}$ from Table 10. [sent-1375, score-0.297]
94 As a result of this optimization, the row clustering step involves comparisons between smaller matrices and requires only O(mkl) operations. [sent-1381, score-0.306]
95 Using Table 8, one can exactly compute each of the relevant conditional expectations, which completely determine the MBI matrix $\hat{Z}$ (shown in Table 12) for a given row clustering R and column clustering C. [sent-1392, score-0.53]
96 To obtain the row assignment steps for I-divergence, we make use of the fact that the reconstructed matrix $\tilde{Z}$ can be decomposed as the Hadamard product of two terms, of which only one depends on the candidate row or column clustering. [sent-1396, score-0.499]
97 To obtain the row cluster assignment step, we first reconstruct $\tilde{Z}$ for a candidate row clustering R using the Lagrange multipliers $\Lambda_{\hat{U}}$ and $\Lambda_{\hat{V}}$ computed in the previous step. [sent-1431, score-0.301]
98 More specifically, the reconstruction $\tilde{Z}$ is given by $\tilde{Z} = E_{mn} \oslash (\bar{z}\,E_{mn} - R\,\Lambda_{\hat{U}}\,E_n^T - E_m\,\Lambda_{\hat{V}}\,C^T)$ (34), that is, $\tilde{z}_{uv} = 1/(\bar{z} - \lambda_{\rho(u)} - \lambda_{\gamma(v)})$, where $\oslash$ denotes element-wise division. [sent-1432, score-0.292]
99 Hence, the row assignment step is given by minimizing over g the cost $\sum_{v=1}^{n} w_{uv}\,(-z_{uv}\lambda_g + \log(\bar{z} - \lambda_g - \lambda_{\gamma(v)}))$, $[u]_1^m$. [sent-1435, score-0.318]
100 The column assignment step can be similarly obtained by substituting the appropriate reconstructed matrix $\tilde{Z}$ into the column update cost function and optimizing the part that depends on the column clustering, that is, $\sum_{u=1}^{m} w_{uv}\,(-z_{uv}\lambda_h + \log(\bar{z} - \lambda_{\rho(u)} - \lambda_h))$, $[v]_1^n$. [sent-1438, score-0.573]
wordName wordTfidf (topN-words)
[('bregman', 0.556), ('za', 0.332), ('mbi', 0.298), ('zuv', 0.24), ('sec', 0.196), ('gr', 0.146), ('sa', 0.132), ('row', 0.132), ('wuv', 0.131), ('regman', 0.127), ('banerjee', 0.12), ('erugu', 0.117), ('hillon', 0.117), ('odha', 0.117), ('sb', 0.111), ('clustering', 0.108), ('hosh', 0.099), ('atrix', 0.096), ('column', 0.092), ('zv', 0.087), ('argmin', 0.086), ('coclustering', 0.086), ('pproximation', 0.086), ('ct', 0.081), ('divergences', 0.081), ('lagrange', 0.076), ('bases', 0.072), ('zrowvar', 0.072), ('divergence', 0.07), ('dhillon', 0.067), ('matrices', 0.066), ('multipliers', 0.062), ('matrix', 0.062), ('rows', 0.057), ('assignment', 0.055), ('zu', 0.054), ('reconstruction', 0.052), ('cluster', 0.052), ('cho', 0.051), ('clusters', 0.05), ('squared', 0.049), ('wgr', 0.048), ('em', 0.047), ('expectations', 0.047), ('uv', 0.045), ('rz', 0.045), ('basis', 0.044), ('euclidean', 0.044), ('summary', 0.043), ('zb', 0.043), ('zrowconst', 0.041), ('rt', 0.04), ('ratings', 0.038), ('partitional', 0.038), ('block', 0.037), ('mn', 0.037), ('zrowred', 0.034), ('zrowv', 0.034), ('ev', 0.034), ('wc', 0.034), ('assignments', 0.034), ('entropy', 0.034), ('approximation', 0.032), ('distortion', 0.031), ('zcolvar', 0.031), ('lemma', 0.03), ('columns', 0.03), ('cheng', 0.029), ('red', 0.028), ('conditional', 0.028), ('bbac', 0.027), ('euc', 0.027), ('idiv', 0.027), ('kmeans', 0.027), ('rzu', 0.027), ('zcolconst', 0.027), ('zg', 0.027), ('freitag', 0.026), ('church', 0.026), ('reconstructed', 0.026), ('multiplication', 0.025), ('additive', 0.025), ('della', 0.024), ('preserved', 0.024), ('movielens', 0.024), ('recommender', 0.024), ('rohwer', 0.024), ('zcolred', 0.024), ('zet', 0.024), ('zrow', 0.024), ('update', 0.023), ('preserve', 0.023), ('objective', 0.023), ('projection', 0.023), ('pietra', 0.023), ('categorical', 0.022), ('preserves', 0.021), ('log', 0.021), ('emn', 0.021), ('encoding', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000013 3 jmlr-2007-A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation
Author: Arindam Banerjee, Inderjit Dhillon, Joydeep Ghosh, Srujana Merugu, Dharmendra S. Modha
Abstract: Co-clustering, or simultaneous clustering of rows and columns of a two-dimensional data matrix, is rapidly becoming a powerful data analysis technique. Co-clustering has enjoyed wide success in varied application domains such as text clustering, gene-microarray analysis, natural language processing and image, speech and video analysis. In this paper, we introduce a partitional co-clustering formulation that is driven by the search for a good matrix approximation: every co-clustering is associated with an approximation of the original data matrix and the quality of co-clustering is determined by the approximation error. We allow the approximation error to be measured using a large class of loss functions called Bregman divergences that include squared Euclidean distance and KL-divergence as special cases. In addition, we permit multiple structurally different co-clustering schemes that preserve various linear statistics of the original data matrix. To accomplish the above tasks, we introduce a new minimum Bregman information (MBI) principle that simultaneously generalizes the maximum entropy and standard least squares principles, and leads to a matrix approximation that is optimal among all generalized additive models in a certain natural parameter space. Analysis based on this principle yields an elegant meta algorithm, special cases of which include most previously known alternate minimization based clustering algorithms such as kmeans and co-clustering algorithms such as information theoretic (Dhillon et al., 2003b) and minimum sum-squared residue co-clustering (Cho et al., 2004). To demonstrate the generality and flexibility of our co-clustering framework, we provide examples and empirical evidence on a variety of problem domains and also describe novel co-clustering applications such as missing value prediction and compression of categorical data matrices.
2 0.15806973 8 jmlr-2007-A Unified Continuous Optimization Framework for Center-Based Clustering Methods
Author: Marc Teboulle
Abstract: Center-based partitioning clustering algorithms rely on minimizing an appropriately formulated objective function, and different formulations suggest different possible algorithms. In this paper, we start with the standard nonconvex and nonsmooth formulation of the partitioning clustering problem. We demonstrate that within this elementary formulation, convex analysis tools and optimization theory provide a unifying language and framework to design, analyze and extend hard and soft center-based clustering algorithms, through a generic algorithm which retains the computational simplicity of the popular k-means scheme. We show that several well known and more recent center-based clustering algorithms, which have been derived either heuristically, or/and have emerged from intuitive analogies in physics, statistical techniques and information theoretic perspectives can be recovered as special cases of the proposed analysis and we streamline their relationships. Keywords: clustering, k-means algorithm, convex analysis, support and asymptotic functions, distance-like functions, Bregman and Csiszar divergences, nonlinear means, nonsmooth optimization, smoothing algorithms, fixed point methods, deterministic annealing, expectation maximization, information theory and entropy methods
3 0.080860965 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
Author: Jia Li, Surajit Ray, Bruce G. Lindsay
Abstract: A new clustering approach based on mode identification is developed by applying new optimization techniques to a nonparametric density estimator. A cluster is formed by those sample points that ascend to the same local maximum (mode) of the density function. The path from a point to its associated mode is efficiently solved by an EM-style algorithm, namely, the Modal EM (MEM). This method is then extended for hierarchical clustering by recursively locating modes of kernel density estimators with increasing bandwidths. Without model fitting, the mode-based clustering yields a density description for every cluster, a major advantage of mixture-model-based clustering. Moreover, it ensures that every cluster corresponds to a bump of the density. The issue of diagnosing clustering results is also investigated. Specifically, a pairwise separability measure for clusters is defined using the ridgeline between the density bumps of two clusters. The ridgeline is solved for by the Ridgeline EM (REM) algorithm, an extension of MEM. Based upon this new measure, a cluster merging procedure is created to enforce strong separation. Experiments on simulated and real data demonstrate that the mode-based clustering approach tends to combine the strengths of linkage and mixture-model-based clustering. In addition, the approach is robust in high dimensions and when clusters deviate substantially from Gaussian distributions. Both of these cases pose difficulty for parametric mixture modeling. A C package on the new algorithms is developed for public access at http://www.stat.psu.edu/∼jiali/hmac. Keywords: modal clustering, mode-based clustering, mixture modeling, modal EM, ridgeline EM, nonparametric density
4 0.076864265 2 jmlr-2007-A Complete Characterization of a Family of Solutions to a Generalized Fisher Criterion
Author: Marco Loog
Abstract: Recently, Ye (2005) suggested yet another optimization criterion for discriminant analysis and proposed a characterization of the family of solutions to this objective. The characterization, however, merely describes a part of the full solution set, that is, it is not complete and therefore not at all a characterization. This correspondence first gives the correct characterization and afterwards compares it to Ye's. Keywords: linear discriminant analysis, Fisher criterion, small sample, characterization
5 0.068685189 13 jmlr-2007-Bayesian Quadratic Discriminant Analysis
Author: Santosh Srivastava, Maya R. Gupta, Béla A. Frigyik
Abstract: Quadratic discriminant analysis is a common tool for classification, but estimation of the Gaussian parameters can be ill-posed. This paper contains theoretical and algorithmic contributions to Bayesian estimation for quadratic discriminant analysis. A distribution-based Bayesian classifier is derived using information geometry. Using a calculus of variations approach to define a functional Bregman divergence for distributions, it is shown that the Bayesian distribution-based classifier that minimizes the expected Bregman divergence of each class conditional distribution also minimizes the expected misclassification cost. A series approximation is used to relate regularized discriminant analysis to Bayesian discriminant analysis. A new Bayesian quadratic discriminant analysis classifier is proposed where the prior is defined using a coarse estimate of the covariance based on the training data; this classifier is termed BDA7. Results on benchmark data sets and simulations show that BDA7 performance is competitive with, and in some cases significantly better than, regularized quadratic discriminant analysis and the cross-validated Bayesian quadratic discriminant analysis classifier Quadratic Bayes. Keywords: quadratic discriminant analysis, regularized quadratic discriminant analysis, Bregman divergence, data-dependent prior, eigenvalue decomposition, Wishart, functional analysis
6 0.062732413 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
7 0.049366195 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians
9 0.044053197 27 jmlr-2007-Distances between Data Sets Based on Summary Statistics
11 0.036991276 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes
12 0.035233635 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification
13 0.034879308 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data
14 0.032117121 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis
15 0.030468332 61 jmlr-2007-On the Consistency of Multiclass Classification Methods (Special Topic on the Conference on Learning Theory 2005)
17 0.027639784 90 jmlr-2007-Value Regularization and Fenchel Duality
18 0.026820464 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels
19 0.026238114 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models
20 0.025718298 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
topicId topicWeight
[(0, 0.185), (1, 0.04), (2, -0.015), (3, 0.084), (4, -0.077), (5, -0.33), (6, -0.026), (7, 0.193), (8, 0.053), (9, -0.055), (10, -0.047), (11, 0.066), (12, -0.06), (13, 0.009), (14, -0.233), (15, 0.095), (16, 0.015), (17, 0.043), (18, 0.041), (19, 0.177), (20, -0.029), (21, -0.167), (22, -0.116), (23, 0.148), (24, -0.212), (25, 0.036), (26, 0.019), (27, -0.059), (28, -0.183), (29, -0.076), (30, 0.166), (31, 0.064), (32, -0.016), (33, -0.051), (34, -0.026), (35, -0.015), (36, -0.04), (37, -0.045), (38, -0.023), (39, -0.036), (40, -0.07), (41, 0.013), (42, 0.009), (43, -0.014), (44, -0.003), (45, 0.009), (46, 0.065), (47, 0.101), (48, -0.037), (49, 0.09)]
simIndex simValue paperId paperTitle
same-paper 1 0.9569661 3 jmlr-2007-A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation
Author: Arindam Banerjee, Inderjit Dhillon, Joydeep Ghosh, Srujana Merugu, Dharmendra S. Modha
Abstract: Co-clustering, or simultaneous clustering of rows and columns of a two-dimensional data matrix, is rapidly becoming a powerful data analysis technique. Co-clustering has enjoyed wide success in varied application domains such as text clustering, gene-microarray analysis, natural language processing and image, speech and video analysis. In this paper, we introduce a partitional co-clustering formulation that is driven by the search for a good matrix approximation: every co-clustering is associated with an approximation of the original data matrix and the quality of co-clustering is determined by the approximation error. We allow the approximation error to be measured using a large class of loss functions called Bregman divergences that include squared Euclidean distance and KL-divergence as special cases. In addition, we permit multiple structurally different co-clustering schemes that preserve various linear statistics of the original data matrix. To accomplish the above tasks, we introduce a new minimum Bregman information (MBI) principle that simultaneously generalizes the maximum entropy and standard least squares principles, and leads to a matrix approximation that is optimal among all generalized additive models in a certain natural parameter space. Analysis based on this principle yields an elegant meta algorithm, special cases of which include most previously known alternate minimization based clustering algorithms such as kmeans and co-clustering algorithms such as information theoretic (Dhillon et al., 2003b) and minimum sum-squared residue co-clustering (Cho et al., 2004). To demonstrate the generality and flexibility of our co-clustering framework, we provide examples and empirical evidence on a variety of problem domains and also describe novel co-clustering applications such as missing value prediction and compression of categorical data matrices.
2 0.71634382 8 jmlr-2007-A Unified Continuous Optimization Framework for Center-Based Clustering Methods
Author: Marc Teboulle
Abstract: Center-based partitioning clustering algorithms rely on minimizing an appropriately formulated objective function, and different formulations suggest different possible algorithms. In this paper, we start with the standard nonconvex and nonsmooth formulation of the partitioning clustering problem. We demonstrate that within this elementary formulation, convex analysis tools and optimization theory provide a unifying language and framework to design, analyze and extend hard and soft center-based clustering algorithms, through a generic algorithm which retains the computational simplicity of the popular k-means scheme. We show that several well known and more recent center-based clustering algorithms, which have been derived either heuristically, or/and have emerged from intuitive analogies in physics, statistical techniques and information theoretic perspectives can be recovered as special cases of the proposed analysis and we streamline their relationships. Keywords: clustering, k-means algorithm, convex analysis, support and asymptotic functions, distance-like functions, Bregman and Csiszar divergences, nonlinear means, nonsmooth optimization, smoothing algorithms, fixed point methods, deterministic annealing, expectation maximization, information theory and entropy methods
3 0.47079501 13 jmlr-2007-Bayesian Quadratic Discriminant Analysis
Author: Santosh Srivastava, Maya R. Gupta, Béla A. Frigyik
Abstract: Quadratic discriminant analysis is a common tool for classification, but estimation of the Gaussian parameters can be ill-posed. This paper contains theoretical and algorithmic contributions to Bayesian estimation for quadratic discriminant analysis. A distribution-based Bayesian classifier is derived using information geometry. Using a calculus of variations approach to define a functional Bregman divergence for distributions, it is shown that the Bayesian distribution-based classifier that minimizes the expected Bregman divergence of each class conditional distribution also minimizes the expected misclassification cost. A series approximation is used to relate regularized discriminant analysis to Bayesian discriminant analysis. A new Bayesian quadratic discriminant analysis classifier is proposed where the prior is defined using a coarse estimate of the covariance based on the training data; this classifier is termed BDA7. Results on benchmark data sets and simulations show that BDA7 performance is competitive with, and in some cases significantly better than, regularized quadratic discriminant analysis and the cross-validated Bayesian quadratic discriminant analysis classifier Quadratic Bayes. Keywords: quadratic discriminant analysis, regularized quadratic discriminant analysis, Bregman divergence, data-dependent prior, eigenvalue decomposition, Wishart, functional analysis
4 0.29305416 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
Author: Wei Pan, Xiaotong Shen
Abstract: Variable selection in clustering analysis is both challenging and important. In the context of modelbased clustering analysis with a common diagonal covariance matrix, which is especially suitable for “high dimension, low sample size” settings, we propose a penalized likelihood approach with an L1 penalty function, automatically realizing variable selection via thresholding and delivering a sparse solution. We derive an EM algorithm to fit our proposed model, and propose a modified BIC as a model selection criterion to choose the number of components and the penalization parameter. A simulation study and an application to gene function prediction with gene expression profiles demonstrate the utility of our method. Keywords: BIC, EM, mixture model, penalized likelihood, soft-thresholding, shrinkage
5 0.29265684 2 jmlr-2007-A Complete Characterization of a Family of Solutions to a Generalized Fisher Criterion
Author: Marco Loog
Abstract: Recently, Ye (2005) suggested yet another optimization criterion for discriminant analysis and proposed a characterization of the family of solutions to this objective. The characterization, however, merely describes a part of the full solution set, that is, it is not complete and therefore not at all a characterization. This correspondence first gives the correct characterization and afterwards compares it to Ye's. Keywords: linear discriminant analysis, Fisher criterion, small sample, characterization
6 0.27350047 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
7 0.24242353 27 jmlr-2007-Distances between Data Sets Based on Summary Statistics
8 0.21606377 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels
10 0.18004404 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes
11 0.17545976 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data
12 0.16214795 90 jmlr-2007-Value Regularization and Fenchel Duality
13 0.15208459 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification
14 0.15153991 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians
16 0.14467479 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis
17 0.14037308 30 jmlr-2007-Dynamics and Generalization Ability of LVQ Algorithms
18 0.13672262 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation
19 0.13311887 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression
20 0.13067549 88 jmlr-2007-Unlabeled Compression Schemes for Maximum Classes
topicId topicWeight
[(4, 0.042), (8, 0.034), (10, 0.039), (12, 0.031), (15, 0.023), (26, 0.422), (28, 0.038), (40, 0.038), (45, 0.017), (48, 0.03), (60, 0.034), (85, 0.045), (98, 0.097)]
simIndex simValue paperId paperTitle
same-paper 1 0.68409747 3 jmlr-2007-A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation
2 0.31038859 13 jmlr-2007-Bayesian Quadratic Discriminant Analysis
Author: Santosh Srivastava, Maya R. Gupta, Béla A. Frigyik
Abstract: Quadratic discriminant analysis is a common tool for classification, but estimation of the Gaussian parameters can be ill-posed. This paper contains theoretical and algorithmic contributions to Bayesian estimation for quadratic discriminant analysis. A distribution-based Bayesian classifier is derived using information geometry. Using a calculus of variations approach to define a functional Bregman divergence for distributions, it is shown that the Bayesian distribution-based classifier that minimizes the expected Bregman divergence of each class conditional distribution also minimizes the expected misclassification cost. A series approximation is used to relate regularized discriminant analysis to Bayesian discriminant analysis. A new Bayesian quadratic discriminant analysis classifier is proposed where the prior is defined using a coarse estimate of the covariance based on the training data; this classifier is termed BDA7. Results on benchmark data sets and simulations show that BDA7 performance is competitive with, and in some cases significantly better than, regularized quadratic discriminant analysis and the cross-validated Bayesian quadratic discriminant analysis classifier Quadratic Bayes. Keywords: quadratic discriminant analysis, regularized quadratic discriminant analysis, Bregman divergence, data-dependent prior, eigenvalue decomposition, Wishart, functional analysis
3 0.30286449 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
Author: Jia Li, Surajit Ray, Bruce G. Lindsay
Abstract: A new clustering approach based on mode identification is developed by applying new optimization techniques to a nonparametric density estimator. A cluster is formed by those sample points that ascend to the same local maximum (mode) of the density function. The path from a point to its associated mode is efficiently solved by an EM-style algorithm, namely, the Modal EM (MEM). This method is then extended for hierarchical clustering by recursively locating modes of kernel density estimators with increasing bandwidths. Without model fitting, the mode-based clustering yields a density description for every cluster, a major advantage of mixture-model-based clustering. Moreover, it ensures that every cluster corresponds to a bump of the density. The issue of diagnosing clustering results is also investigated. Specifically, a pairwise separability measure for clusters is defined using the ridgeline between the density bumps of two clusters. The ridgeline is solved for by the Ridgeline EM (REM) algorithm, an extension of MEM. Based upon this new measure, a cluster merging procedure is created to enforce strong separation. Experiments on simulated and real data demonstrate that the mode-based clustering approach tends to combine the strengths of linkage and mixture-model-based clustering. In addition, the approach is robust in high dimensions and when clusters deviate substantially from Gaussian distributions. Both of these cases pose difficulty for parametric mixture modeling. A C package on the new algorithms is developed for public access at http://www.stat.psu.edu/∼jiali/hmac. Keywords: modal clustering, mode-based clustering, mixture modeling, modal EM, ridgeline EM, nonparametric density
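The mode-ascent idea described in this abstract can be illustrated in a few lines of NumPy. The sketch below is not the authors' MEM/HMAC implementation; the bandwidth, tolerance, and mode-merging rule are arbitrary assumptions. It uses the fixed-point update that Modal EM yields for an equal-weight Gaussian kernel mixture (which coincides with a Gaussian mean-shift step) and labels points that ascend to the same mode as one cluster.

```python
import numpy as np

def modal_clustering(X, bandwidth=0.5, tol=1e-6, max_iter=500):
    """Cluster points by ascending a Gaussian kernel density estimate to its modes.

    Each point is iterated with the fixed-point update
        x <- sum_i w_i(x) X_i,  with  w_i(x) proportional to exp(-||x - X_i||^2 / (2 h^2)),
    and points whose iterations converge to the same mode share a cluster label.
    """
    n, _ = X.shape
    h2 = bandwidth ** 2
    modes = []
    labels = np.empty(n, dtype=int)

    for i in range(n):
        x = X[i].copy()
        for _ in range(max_iter):
            w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2.0 * h2))
            w /= w.sum()
            x_new = w @ X
            if np.linalg.norm(x_new - x) < tol:
                break
            x = x_new
        # Merge with an existing mode if close enough, otherwise register a new one.
        for k, m in enumerate(modes):
            if np.linalg.norm(x - m) < 1e-3:
                labels[i] = k
                break
        else:
            modes.append(x)
            labels[i] = len(modes) - 1
    return labels, np.array(modes)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
                   rng.normal(3.0, 0.3, size=(50, 2))])
    labels, modes = modal_clustering(X, bandwidth=0.5)
    print("number of clusters found:", len(modes))
```

Increasing the bandwidth merges modes, which mirrors the hierarchical variant sketched in the abstract; the ridgeline-based separability diagnostics are not reproduced here.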
4 0.3010115 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation
Author: Guy Lebanon, Yi Mao, Joshua Dillon
Abstract: The popular bag of words assumption represents a document as a histogram of word occurrences. While computationally efficient, such a representation is unable to maintain any sequential information. We present an effective sequential document representation that goes beyond the bag of words representation and its n-gram extensions. This representation uses local smoothing to embed documents as smooth curves in the multinomial simplex thereby preserving valuable sequential information. In contrast to bag of words or n-grams, the new representation is able to robustly capture medium and long range sequential trends in the document. We discuss the representation and its geometric properties and demonstrate its applicability for various text processing tasks. Keywords: text processing, local smoothing
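A rough sketch of the locally smoothed representation described here: for each anchor position mu in [0, 1], the words of a document are weighted by a kernel centered at mu and renormalized, so the document becomes a short curve of points in the multinomial simplex. This is only an illustration of the idea, not the authors' code; the Gaussian weighting, the smoothing constant, and the anchor grid are assumptions.

```python
import numpy as np

def lowbow_curve(tokens, vocab, anchors, kernel_width=0.1, smoothing=1e-3):
    """Represent a document as a curve of locally smoothed word histograms.

    For each anchor position mu in [0, 1], words are weighted by a Gaussian kernel
    over their normalized positions in the document, giving one point in the simplex.
    """
    word_index = {w: j for j, w in enumerate(vocab)}
    positions = np.linspace(0.0, 1.0, num=len(tokens))
    curve = np.zeros((len(anchors), len(vocab)))
    for a, mu in enumerate(anchors):
        weights = np.exp(-0.5 * ((positions - mu) / kernel_width) ** 2)
        hist = np.full(len(vocab), smoothing)          # small prior keeps the point interior
        for w, omega in zip(tokens, weights):
            hist[word_index[w]] += omega
        curve[a] = hist / hist.sum()                   # each row lies in the simplex
    return curve

if __name__ == "__main__":
    doc = "the cat sat on the mat the dog sat on the log".split()
    vocab = sorted(set(doc))
    anchors = np.linspace(0.0, 1.0, num=5)
    curve = lowbow_curve(doc, vocab, anchors)
    print(curve.shape, curve.sum(axis=1))              # (5, |V|), rows sum to 1
```

Unlike a single bag-of-words histogram, nearby anchors produce similar but not identical histograms, so sequential trends in the document are retained along the curve.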
5 0.30062008 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
Author: Rie Johnson, Tong Zhang
Abstract: This paper investigates the effect of Laplacian normalization in graph-based semi-supervised learning. To this end, we consider multi-class transductive learning on graphs with Laplacian regularization. Generalization bounds are derived using geometric properties of the graph. Specifically, by introducing a definition of graph cut from learning theory, we obtain generalization bounds that depend on the Laplacian regularizer. We then use this analysis to better understand the role of graph Laplacian matrix normalization. Under assumptions that the cut is small, we derive near-optimal normalization factors by approximately minimizing the generalization bounds. The analysis reveals the limitations of the standard degree-based normalization method in that the resulting normalization factors can vary significantly within each connected component with the same class label, which may cause inferior generalization performance. Our theory also suggests a remedy that does not suffer from this problem. Experiments confirm the superiority of the normalization scheme motivated by learning theory on artificial and real-world data sets. Keywords: transductive learning, graph learning, Laplacian regularization, normalization of graph Laplacian
7 0.29707599 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data
9 0.29388607 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels
10 0.29287297 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis
11 0.29237673 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
12 0.29053682 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition
14 0.29020107 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
15 0.28907663 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models
16 0.28902262 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
17 0.28873232 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes
19 0.28821683 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
20 0.2857101 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds