nips nips2005 nips2005-13 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Rong Jin, Feng Kang, Chris H. Ding
Abstract: Spectral clustering has enjoyed success in both data clustering and semi-supervised learning. However, most spectral clustering algorithms cannot handle multi-class clustering problems directly. Additional strategies are needed to extend spectral clustering algorithms to multi-class clustering problems. Furthermore, most spectral clustering algorithms employ hard cluster membership, which makes them likely to be trapped in local optima. In this paper, we present a new spectral clustering algorithm, named “Soft Cut”. It improves the normalized cut algorithm by introducing soft membership, and can be efficiently computed using a bound optimization algorithm. Our experiments with a variety of datasets have shown the promising performance of the proposed clustering algorithm. 1
Reference: text
sentIndex sentText sentNum sentScore
1 However, most spectral clustering algorithms cannot handle multi-class clustering problems directly. [sent-2, score-1.049]
2 Additional strategies are needed to extend spectral clustering algorithms to multi-class clustering problems. [sent-3, score-1.073]
3 Furthermore, most spectral clustering algorithms employ hard cluster membership, which makes them likely to be trapped in local optima. [sent-4, score-0.797]
4 In this paper, we present a new spectral clustering algorithm, named “Soft Cut”. [sent-5, score-0.642]
5 It improves the normalized cut algorithm by introducing soft membership, and can be efficiently computed using a bound optimization algorithm. [sent-6, score-0.958]
6 Our experiments with a variety of datasets have shown the promising performance of the proposed clustering algorithm. [sent-7, score-0.523]
7 1 Introduction Data clustering has been an active research area with a long history. [sent-8, score-0.429]
8 Well-known clustering methods include the K-means method (Hartigan & Wong, 1994), the Gaussian Mixture Model (Redner & Walker, 1984), Probabilistic Latent Semantic Indexing (PLSI) (Hofmann, 1999), and Latent Dirichlet Allocation (LDA) (Blei et al., 2003). [sent-9, score-0.429] [sent-10, score-0.037]
10 Recently, spectral clustering methods (Shi & Malik, 2000; Ng et al., 2001; Bach & Jordan, 2004) have attracted more and more attention given their promising performance in data clustering and simplicity in implementation. [sent-12, score-0.635] [sent-15, score-0.429]
12 They treat the data clustering problem as a graph partitioning problem. [sent-16, score-0.501]
13 In its simplest form, a minimum cut algorithm is used to minimize the weights (or similarities) assigned to the removed edges. [sent-17, score-0.599]
14 To avoid unbalanced clustering results, different objectives have been proposed, including the ratio cut (Hagen & Kahng, 1991), normalized cut (Shi & Malik, 2000) and min-max cut (Ding et al., 2001). [sent-18, score-2.316]
15 To reduce the computational complexity, most spectral clustering algorithms use the relaxation approach, which maps discrete cluster memberships into continuous real numbers. [sent-20, score-0.861]
16 As a result, it is difficult to directly apply current spectral clustering algorithms to multi-class clustering problems. [sent-21, score-1.074]
17 Several strategies (e.g., Ng et al., 2001; Yu & Shi, 2003) have been used to extend spectral clustering algorithms to multi-class clustering problems. [sent-23, score-1.073]
18 One common approach is to first construct a low-dimension space for data representation using the smallest eigenvectors of a graph Laplacian that is constructed based on the pairwise similarity of the data. [sent-24, score-0.159]
19 Then, a standard clustering algorithm, such as the K-means method, is applied to cluster data points in the low-dimension space. [sent-25, score-0.552]
20 Too few eigenvectors lead to an insufficient representation of the data, while too many bring a significant amount of noise into the data representation. [sent-27, score-0.229]
21 Although it has been shown (Ng et al., 2001) that the number of required eigenvectors is generally equal to the number of clusters, the analysis is valid only when data points of different clusters are well separated. [sent-30, score-0.195]
22 As will be shown later, when data points are not well separated, the optimal number of eigenvectors can be different from the number of clusters. [sent-31, score-0.121]
23 Another problem with the existing spectral clustering algorithms is that they are based on binary cluster membership and therefore are unable to express the uncertainty in data clustering. [sent-32, score-0.843]
24 Compared to hard cluster membership, probabilistic membership is advantageous in that it is less likely to be trapped in local minima. [sent-33, score-0.374]
25 One example is the Bayesian clustering method (Redner & Walker, 1984), which is usually more robust than the K-means method because of its soft cluster memberships. [sent-34, score-0.703]
26 It is also advantageous to use probabilistic memberships when the cluster memberships are intermediate results that will be used by other processes, for example selective sampling in active learning (Jin & Si, 2004). [sent-35, score-0.336]
27 In this paper, we present a new spectral clustering algorithm, named “Soft Cut”, that explicitly addresses the above two problems. [sent-36, score-0.642]
28 It extends the normalized cut algorithm by introducing probabilistic membership of data points. [sent-37, score-0.927]
29 By encoding membership of multiple clusters into a set of probabilities, the proposed clustering algorithm can be applied directly to multi-class clustering problems. [sent-38, score-1.131]
30 Our empirical studies with a variety of datasets have shown that the soft cut algorithm can substantially outperform the normalized cut algorithm for multi-class clustering. [sent-39, score-1.57]
31 2 Related Work The key idea of spectral clustering is to convert a clustering problem into a graph partitioning problem. [sent-45, score-1.122]
32 Then, a clustering problem can be formulated as the minimum cut problem (1), where q = (q1 , . . . , qn ) is a vector of binary memberships and each qi is either −1 or +1. [sent-49, score-0.986] [sent-54, score-0.181]
34 Usually, a relaxation approach (Chung, 1997) is used to replace the vector q ∈ {−1, 1}^n with a vector q̂ ∈ R^n under the constraint Σ_{i=1}^{n} q̂_i^2 = n. [sent-58, score-0.162]
35 As a result of the relaxation, the approximate solution q̂ to (1) is the second smallest eigenvector of the Laplacian L. [sent-59, score-0.077]
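As a concrete illustration of this relaxation, the following minimal sketch (not code from the paper; it assumes a symmetric weight matrix W, uses the unnormalized Laplacian L = D - W, and rounds the relaxed solution by thresholding at zero) computes the Fiedler vector and recovers a two-way cut:

```python
import numpy as np

def fiedler_bisection(W):
    """Relaxed minimum cut: take the second-smallest eigenvector of L = D - W
    (the Fiedler vector) and round its entries back to {-1, +1}."""
    d = W.sum(axis=1)                    # node degrees
    L = np.diag(d) - W                   # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)          # eigenvectors sorted by ascending eigenvalue
    fiedler = vecs[:, 1]                 # second-smallest eigenvector
    return np.where(fiedler >= 0, 1, -1)

# toy graph: two triangles joined by a single weak edge
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1
print(fiedler_bisection(W))              # separates {0, 1, 2} from {3, 4, 5} (up to sign)
```

Thresholding at zero is only one of several rounding heuristics; the point is that the relaxed solution comes from a single eigenvector computation.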
36 One problem with the minimum cut approach is that it does not take into account the size of clusters, which can lead to clusters of unbalanced sizes. [sent-60, score-0.655]
37 To resolve this problem, several different criteria have been proposed, including the ratio cut (Hagen & Kahng, 1991), normalized cut (Shi & Malik, 2000) and min-max cut (Ding et al., 2001). [sent-61, score-1.863]
38 For example, in the normalized cut algorithm, the following objective is used:
$$J_n(\mathbf{q}) = \frac{C_{+,-}(\mathbf{q})}{D_{+}(\mathbf{q})} + \frac{C_{+,-}(\mathbf{q})}{D_{-}(\mathbf{q})} \qquad (2)$$
where $C_{+,-}(\mathbf{q}) = \sum_{i,j=1}^{n} w_{i,j}\,\delta(q_i,+)\,\delta(q_j,-)$ and $D_{\pm} = \sum_{i=1}^{n} \delta(q_i,\pm)\sum_{j=1}^{n} w_{i,j}$. [sent-63, score-0.749]
39 The cluster volume D± is used as the denominator to avoid clusters of too small size. [sent-66, score-0.074]
40 Similar to the minimum cut approach, a relaxation approach is used to convert the problem in (2) into an eigenvector problem. [sent-67, score-0.666]
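For the relaxed normalized cut the eigenvector problem becomes a generalized one; the sketch below shows the standard construction (it is not taken from the paper) in which Lv = λDv is solved and the second-smallest generalized eigenvector is rounded:

```python
import numpy as np
from scipy.linalg import eigh

def relaxed_normalized_cut(W):
    """Standard relaxation of the normalized cut objective (2): solve the generalized
    symmetric eigenproblem L v = lambda D v and round the second-smallest eigenvector
    at zero. Illustrative only."""
    d = W.sum(axis=1)
    D = np.diag(d)
    L = D - W
    _, vecs = eigh(L, D)        # generalized eigenvectors, ascending eigenvalues
    v = vecs[:, 1]              # skip the trivial constant solution
    return np.where(v >= 0, 1, -1)
```

The only difference from the plain minimum cut sketch above is the degree matrix D on the right-hand side of the eigenproblem, which is what discourages unbalanced clusters.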
41 For multi-class clustering, we can extend the objective in (2) into the following form:
$$J_{norm\text{-}mc}(\mathbf{q}) = \sum_{z=1}^{K}\sum_{z'\neq z}\frac{C_{z,z'}(\mathbf{q})}{D_{z}(\mathbf{q})} \qquad (3)$$
where $K$ is the number of clusters and the vector $\mathbf{q} \in \{1, 2, \ldots, K\}^{n}$ holds the cluster memberships. [sent-68, score-0.186]
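To make the quantities C and D concrete, the sketch below (an illustration, not the paper's code) evaluates the multi-class objective (3) for a hard labelling q ∈ {0, …, K−1}^n and a symmetric weight matrix W:

```python
import numpy as np

def norm_mc_objective(W, q, K):
    """J_norm_mc(q) in (3): for each cluster z, the weight cut between z and all other
    clusters, normalized by the total weight D_z incident to cluster z."""
    J = 0.0
    for z in range(K):
        in_z = (q == z)
        D_z = W[in_z, :].sum()                   # sum_{i in z} sum_j w_ij
        C_z_out = W[np.ix_(in_z, ~in_z)].sum()   # sum over z' != z of C_{z,z'}
        if D_z > 0:
            J += C_z_out / D_z
    return J

# two triangles joined by a weak edge, labelled as two clusters
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1
print(norm_mc_objective(W, np.array([0, 0, 0, 1, 1, 1]), K=2))   # small for a good partition
```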
42 In particular, a simple relaxation method cannot be applied directly here. [sent-73, score-0.06]
43 In the past, several heuristic approaches (Shi & Malik, 2000; Ng et al., 2001; Yu & Shi, 2003) have been used to extend spectral clustering algorithms to multi-class clustering problems. [sent-74, score-0.037]
44 One common strategy is to first obtain the K smallest (excluding the one with zero eigenvalue) eigenvectors of Laplacian L, and project data points onto the low-dimension space that is spanned by the K eigenvectors. [sent-76, score-0.148]
45 Then, a standard clustering algorithm, such as the K-means method, is applied to cluster data points in this low-dimension space. [sent-77, score-0.552]
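A minimal sketch of this standard baseline (assuming a symmetric weight matrix W, the unnormalized Laplacian, and scikit-learn's KMeans; the cited approaches differ in details such as which Laplacian they use):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_kmeans(W, K, n_eig=None):
    """Baseline multi-class spectral clustering: embed each point with the n_eig smallest
    Laplacian eigenvectors (skipping the trivial constant one), then run K-means."""
    if n_eig is None:
        n_eig = K                          # the common default discussed in the text
    d = W.sum(axis=1)
    L = np.diag(d) - W
    _, vecs = np.linalg.eigh(L)            # eigenvectors in ascending eigenvalue order
    embedding = vecs[:, 1:1 + n_eig]       # rows = data points in the low-dimension space
    return KMeans(n_clusters=K, n_init=10).fit_predict(embedding)
```

The n_eig argument is the knob examined in Experiment (II) below: nothing forces it to equal K, and the experiments report that values larger than K are often better.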
46 In contrast to these approaches, the proposed spectral clustering algorithm deals with the multi-class clustering problem directly. [sent-78, score-1.105]
47 It simultaneously estimates the probabilities of each data point belonging to the different clusters. [sent-79, score-0.097]
48 With probabilistic cluster memberships, the proposed algorithm is less likely to be trapped in local minima and is therefore more robust than existing spectral clustering algorithms. [sent-80, score-0.925]
49 3 Spectral Clustering with Soft Membership In this section, we describe a new spectral clustering algorithm, named “Soft Cut”, which extends the normalized cut algorithm by introducing probabilistic cluster membership. [sent-81, score-1.55]
50 In the following, we present a formal description of the soft cut algorithm, followed by an efficient procedure for solving the resulting optimization problem. [sent-82, score-0.756]
51 Thus, the objective function for multi-class clustering in (3) can be rewritten as:
$$J_{n\text{-}mc}(\mathbf{q}) = \sum_{z=1}^{K}\sum_{z'\neq z}\frac{C_{z,z'}(\mathbf{q})}{D_{z}(\mathbf{q})} = K - \sum_{z=1}^{K}\frac{C_{z,z}(\mathbf{q})}{D_{z}(\mathbf{q})} \qquad (4)$$
Let $J'_{n\text{-}mc} = \sum_{z=1}^{K} C_{z,z}(\mathbf{q})/D_{z}(\mathbf{q})$. Thus, instead of minimizing $J_{n\text{-}mc}$, we can maximize $J'_{n\text{-}mc}$. [sent-85, score-0.695] [sent-86, score-0.208]
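The identity used in (4) follows from the fact that, for every cluster z, the within-cluster and between-cluster weights sum to the cluster volume; a one-line derivation (using the natural multi-class extension of the C and D defined for (2)):

```latex
% Since \sum_{z'} C_{z,z'}(\mathbf{q}) = D_z(\mathbf{q}) for every cluster z,
\sum_{z'\neq z}\frac{C_{z,z'}(\mathbf{q})}{D_z(\mathbf{q})}
   = \frac{D_z(\mathbf{q}) - C_{z,z}(\mathbf{q})}{D_z(\mathbf{q})}
   = 1 - \frac{C_{z,z}(\mathbf{q})}{D_z(\mathbf{q})},
\qquad\text{so}\qquad
J_{n\text{-}mc}(\mathbf{q}) = K - \sum_{z=1}^{K}\frac{C_{z,z}(\mathbf{q})}{D_z(\mathbf{q})}
   = K - J'_{n\text{-}mc}(\mathbf{q}).
```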
53 To extend the above objective function to a probabilistic framework, we introduce the probabilistic cluster membership. [sent-87, score-0.286]
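The paper's exact probabilistic objective (equations (5) and (6)) is not shown here, so the sketch below is only an illustration of the idea and not the paper's formulation: it replaces the hard indicator δ(qi, z) by a membership probability q_{z,i} (columns of Q sum to one), which turns J' into a smooth function of Q.

```python
import numpy as np

def soft_within_cluster_objective(W, Q):
    """Soft analogue of J': sum_z (q_z^T W q_z) / (q_z^T d), where row z of Q holds the
    probabilities of every point belonging to cluster z. With hard 0/1 rows this reduces
    exactly to sum_z C_{z,z} / D_z. Illustrative only, not necessarily the paper's J_prob."""
    d = W.sum(axis=1)
    return sum((q_z @ W @ q_z) / (q_z @ d) for q_z in Q)

# hard memberships recover the objective in (4); softened ones interpolate smoothly
W = np.ones((4, 4)) - np.eye(4)
Q_hard = np.array([[1.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 1.0]])
Q_soft = np.array([[0.9, 0.8, 0.2, 0.1],
                   [0.1, 0.2, 0.8, 0.9]])
print(soft_within_cluster_objective(W, Q_hard), soft_within_cluster_objective(W, Q_soft))
```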
54 3.2 Optimization Procedure In this subsection, we present a bound optimization algorithm (Salakhutdinov & Roweis, 2003) for efficiently finding the solution to (6). [sent-97, score-0.118]
55 In each iteration, a concave lower bound is first constructed for the objective function based on the solution obtained from the previous iteration. [sent-99, score-0.142]
56 Let Q′ = [q′_{i,j}]_{K×n} be the probabilities obtained in the previous iteration, and Q = [q_{i,j}]_{K×n} be the probabilities for the current iteration. [sent-102, score-0.046]
57 Define $\Delta(Q, Q') = \log\bigl(J_{prob}(Q)/J_{prob}(Q')\bigr)$, which is the logarithm of the ratio of the objective functions between two consecutive iterations. [sent-103, score-0.152]
58 It can be acquired from (14). Since the above objective function is concave, we can apply a standard numerical procedure, such as Newton's method, to efficiently find the value of λi. [sent-111, score-0.11]
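The specific updates depend on equations (7)-(14), which are not shown here, but the bound-optimization mechanics are easy to illustrate. The toy below (unrelated to clustering, shown only to make the pattern concrete) minimizes sum_i |x - a_i| by repeatedly minimizing a quadratic upper bound that touches the objective at the current iterate; this is the mirror image of maximizing a concave lower bound, which is what the soft cut algorithm does:

```python
import numpy as np

def mm_median(a, iters=100):
    """Bound optimization (majorize-minimize) on f(x) = sum_i |x - a_i|: at the current
    iterate x_t, |x - a_i| <= (x - a_i)^2 / (2 |x_t - a_i|) + |x_t - a_i| / 2, and
    minimizing this quadratic bound gives a weighted mean. Each step cannot increase f."""
    a = np.asarray(a, dtype=float)
    x = float(a.mean())
    for _ in range(iters):
        w = 1.0 / np.maximum(np.abs(x - a), 1e-12)   # weights defined by the current bound
        x = float((w * a).sum() / w.sum())           # exact minimizer of the bound
    return x

a = [1.0, 2.0, 3.0, 10.0, 11.0]
print(mm_median(a), float(np.median(a)))             # both approach 3.0, the L1 minimizer
```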
59 4 Experiment In this section, we focus on examining the effectiveness of the proposed soft cut algorithm for multi-class clustering. [sent-112, score-0.811]
60 How effective is the proposed algorithm for data clustering? [sent-114, score-0.078]
61 We compare the proposed soft cut algorithm to the normalized cut algorithm with various numbers of eigenvectors. [sent-115, score-1.519]
62 How robust is the proposed algorithm for data clustering? [sent-117, score-0.099]
63 We evaluate the robustness of clustering algorithms by examining their variance across multiple trials. [sent-118, score-0.451]
64 4.1 Experiment Design Datasets In order to extensively examine the effectiveness of the proposed soft cut algorithm, a variety of datasets are used in this experiment. [sent-120, score-0.827]
65 They are: • Text documents that are extracted from the 20 Newsgroups collection to form two five-class datasets, named “M5” and “L5”. [sent-121, score-0.071]
66 Table 1: Datasets Description
Dataset     Description                #Class   #Instance   #Features
M5          Text documents             5        500         1000
L5          Text documents             5        500         1000
Pendigit    Pen-based handwriting      10       2000        16
Ribosome    Ribosome rDNA sequences    8        1907        27617
• Pendigit, which comes from the UCI data repository. [sent-123, score-0.054]
67 The Ribosome dataset contains annotated rRNA sequences of ribosomes for 2000 different bacteria that belong to 10 different phyla. [sent-130, score-0.104]
68 Evaluation metrics To evaluate the performance of different clustering algorithms, two different metrics are used: • Clustering accuracy. [sent-134, score-0.471]
69 For the datasets that have no more than five classes, clustering accuracy is used as the evaluation metric. [sent-135, score-0.543]
70 To compute clustering accuracy, each automatically generated cluster is first aligned with a true class. [sent-136, score-0.531]
71 The classification accuracy based on the alignment is then computed, and the clustering accuracy is defined as the maximum classification accuracy among all possible alignments. [sent-137, score-0.547]
72 • Normalized mutual information. For the datasets that have more than five classes, due to the expensive computation involved in finding the optimal alignment, we use the normalized mutual information (Banerjee et al.) as the evaluation metric. [sent-139, score-0.272]
73 If Tu and Tl denote the cluster labels and the true class labels assigned to the data points, the normalized mutual information “nmi” is defined as nmi = 2 I(Tu, Tl) / (H(Tu) + H(Tl)), where I(Tu, Tl) stands for the mutual information between the cluster labels Tu and the true class labels Tl, and H(·) denotes entropy. [sent-141, score-0.886]
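Both metrics are straightforward to compute from two label vectors; the sketch below (an implementation of the definitions above, not the paper's code) uses a brute-force search over all K! alignments for the accuracy (which is exactly why nmi is preferred for more than five classes) and the nmi formula as given:

```python
import numpy as np
from itertools import permutations

def clustering_accuracy(true_labels, cluster_labels, K):
    """Best classification accuracy over all alignments of cluster ids to class ids."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    best = 0.0
    for perm in permutations(range(K)):
        mapped = np.array([perm[c] for c in cluster_labels])
        best = max(best, float(np.mean(mapped == true_labels)))
    return best

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def nmi(Tu, Tl):
    """Normalized mutual information: 2 I(Tu, Tl) / (H(Tu) + H(Tl))."""
    Tu, Tl = np.asarray(Tu), np.asarray(Tl)
    mi = 0.0
    for u in np.unique(Tu):
        for v in np.unique(Tl):
            p_uv = float(np.mean((Tu == u) & (Tl == v)))
            if p_uv > 0:
                mi += p_uv * np.log(p_uv / (np.mean(Tu == u) * np.mean(Tl == v)))
    return 2.0 * mi / (entropy(Tu) + entropy(Tl))

print(clustering_accuracy([0, 0, 1, 1, 2], [2, 2, 0, 0, 1], K=3))  # 1.0
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))                             # 1.0
```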
74 Both the EM algorithm and the K-means method are used to cluster the data points that are projected into the low-dimension space spanned by the smallest eigenvectors of a graph Laplacian. [sent-148, score-0.324]
75 4.2 Experiment (I): Effectiveness of the Soft Cut Algorithm The clustering results of both the soft cut algorithm and the normalized cut algorithm are summarized in Table 2. [sent-150, score-1.912]
76 In addition to the K-means algorithm, we also apply the EM clustering algorithm to the normalized cut algorithm. [sent-151, score-1.162]
77 In this experiment, the number of eigenvectors used for the normalized cut algorithms is equal to the number of clusters. [sent-152, score-0.813]
78 First, compared to both normalized cut algorithms, we see that the proposed clustering algorithm substantially outperforms them on all datasets. [sent-153, score-1.94]
79 Table 2: Clustering results for different clustering methods. Clustering accuracy is used as the evaluation metric for datasets “L5” and “M5”, and normalized mutual information is used for “Pendigit” and “Ribosome”. [sent-154, score-0.429] [sent-155, score-0.233]
81 Table 3: Clustering accuracy for the normalized cut algorithm with embedding in the eigenspace spanned by K eigenvectors. [sent-180, score-0.724]
82 Second, comparing to the normalized cut algorithm using the K-means method, we see that the soft cut algorithm has smaller variance in its clustering results. [sent-254, score-1.912]
83 This can be explained by the fact that the K-means algorithm uses binary cluster membership and is therefore likely to be trapped in local optima. [sent-255, score-0.34]
84 As indicated in Table 2, if we replace the K-means algorithm with the EM algorithm in the normalized cut approach, the variance in the clustering results is generally reduced, but at the price of degraded clustering performance. [sent-256, score-1.162]
85 Based on the above observation, we conclude that the soft cut algorithm appears to be effective and robust for multi-class clustering. [sent-257, score-0.771]
86 4.3 Experiment (II): Normalized Cut Using Different Numbers of Eigenvectors One potential reason why the normalized cut algorithm performs worse than the proposed algorithm is that the number of clusters may not be the optimal number of eigenvectors. [sent-259, score-0.885]
87 To examine this issue, we test the normalized cut algorithm with different numbers of eigenvectors. [sent-260, score-0.733]
88 The K-means method is used for clustering in the resulting eigenvector space. [sent-261, score-0.1]
89 The results of the normalized cut algorithm using different numbers of eigenvectors are summarized in Table 3. [sent-262, score-0.833]
90 First, we clearly see that the best clustering results may not necessarily happen when the number of eigenvectors is exactly equal to the number of clusters. [sent-264, score-0.529]
91 In fact, for three out of four cases, the best performance is achieved when the number of eigenvectors is larger than the number of clusters. [sent-265, score-0.1]
92 This result indicates that the choice of the number of eigenvectors can have a significant impact on the performance of clustering. [sent-266, score-0.1]
93 Second, comparing the results in Table 3 to the results in Table 2, we see that the soft cut algorithm is still able to outperform the normalized cut algorithm even with the optimal number of eigenvectors. [sent-267, score-1.512]
94 In general, since spectral clustering is originally designed for binary-class classification, it requires an extra step when it is extended to multi-class clustering problems. [sent-268, score-1.027]
95 In contrast, the soft cut algorithm directly targets multi-class clustering problems, and thus is able to achieve better performance than the normalized cut algorithm. [sent-270, score-1.89]
96 5 Conclusion In this paper, we proposed a novel probabilistic algorithm for spectral clustering, called the “soft cut” algorithm. [sent-271, score-0.298]
97 It introduces probabilistic membership into the normalized cut algorithm and directly targets multi-class clustering problems. [sent-272, score-1.354]
98 Our empirical studies with a number of datasets have shown that the proposed algorithm outperforms the normalized cut algorithm considerably. [sent-273, score-0.869]
99 Ding, C., He, X., Zha, H., Gu, M., & Simon, H. (2001). A min-max cut algorithm for graph partitioning and data clustering. Proceedings of the IEEE International Conference on Data Mining. [sent-314, score-0.671]
100 Hagen, L., & Kahng, A. (1991). Fast spectral methods for ratio cut partitioning and clustering. Proceedings of the IEEE International Conference on Computer-Aided Design. [sent-321, score-0.787]
wordName wordTfidf (topN-words)
[('cut', 0.557), ('clustering', 0.429), ('dz', 0.343), ('spectral', 0.169), ('tz', 0.159), ('soft', 0.151), ('kmeans', 0.139), ('normalized', 0.134), ('membership', 0.121), ('ding', 0.104), ('mc', 0.104), ('ribosome', 0.104), ('tl', 0.103), ('cluster', 0.102), ('qi', 0.102), ('shi', 0.102), ('eigenvectors', 0.1), ('jprob', 0.099), ('pendigit', 0.099), ('jn', 0.095), ('tu', 0.085), ('memberships', 0.079), ('clusters', 0.074), ('malik', 0.07), ('relaxation', 0.06), ('kahng', 0.06), ('redner', 0.06), ('datasets', 0.058), ('objective', 0.058), ('trapped', 0.056), ('log', 0.052), ('hagen', 0.052), ('probabilistic', 0.051), ('ng', 0.049), ('zha', 0.047), ('walker', 0.047), ('named', 0.044), ('mutual', 0.043), ('algorithm', 0.042), ('jin', 0.042), ('partitioning', 0.04), ('hartigan', 0.04), ('et', 0.037), ('proposed', 0.036), ('nmi', 0.035), ('salakhutdinov', 0.035), ('qj', 0.034), ('laplacian', 0.033), ('accuracy', 0.033), ('graph', 0.032), ('concave', 0.032), ('table', 0.032), ('banerjee', 0.032), ('gu', 0.032), ('ninth', 0.032), ('yu', 0.031), ('em', 0.031), ('hofmann', 0.029), ('meanwhile', 0.029), ('pi', 0.029), ('outperform', 0.029), ('latent', 0.029), ('bound', 0.028), ('blei', 0.028), ('chung', 0.028), ('smallest', 0.027), ('berkeley', 0.027), ('documents', 0.027), ('experiment', 0.027), ('bach', 0.026), ('eigenvector', 0.026), ('jordan', 0.026), ('advantageous', 0.025), ('multiclass', 0.025), ('labels', 0.025), ('effectiveness', 0.025), ('description', 0.024), ('optimization', 0.024), ('solution', 0.024), ('semantic', 0.024), ('unbalanced', 0.024), ('extend', 0.024), ('proceedings', 0.023), ('probabilities', 0.023), ('evaluation', 0.023), ('rk', 0.023), ('convert', 0.023), ('introducing', 0.022), ('dirichlet', 0.022), ('algorithms', 0.022), ('logarithm', 0.021), ('simon', 0.021), ('robust', 0.021), ('metrics', 0.021), ('points', 0.021), ('ratio', 0.021), ('targets', 0.02), ('conference', 0.019), ('alignment', 0.019), ('likely', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 13 nips-2005-A Probabilistic Approach for Optimizing Spectral Clustering
Author: Rong Jin, Feng Kang, Chris H. Ding
Abstract: Spectral clustering enjoys its success in both data clustering and semisupervised learning. But, most spectral clustering algorithms cannot handle multi-class clustering problems directly. Additional strategies are needed to extend spectral clustering algorithms to multi-class clustering problems. Furthermore, most spectral clustering algorithms employ hard cluster membership, which is likely to be trapped by the local optimum. In this paper, we present a new spectral clustering algorithm, named “Soft Cut”. It improves the normalized cut algorithm by introducing soft membership, and can be efficiently computed using a bound optimization algorithm. Our experiments with a variety of datasets have shown the promising performance of the proposed clustering algorithm. 1
2 0.35414645 177 nips-2005-Size Regularized Cut for Data Clustering
Author: Yixin Chen, Ya Zhang, Xiang Ji
Abstract: We present a novel spectral clustering method that enables users to incorporate prior knowledge of the size of clusters into the clustering process. The cost function, which is named size regularized cut (SRcut), is defined as the sum of the inter-cluster similarity and a regularization term measuring the relative size of two clusters. Finding a partition of the data set to minimize SRcut is proved to be NP-complete. An approximation algorithm is proposed to solve a relaxed version of the optimization problem as an eigenvalue problem. Evaluations over different data sets demonstrate that the method is not sensitive to outliers and performs better than normalized cut. 1
3 0.23503104 102 nips-2005-Kernelized Infomax Clustering
Author: David Barber, Felix V. Agakov
Abstract: We propose a simple information-theoretic approach to soft clustering based on maximizing the mutual information I(x, y) between the unknown cluster labels y and the training patterns x with respect to parameters of specifically constrained encoding distributions. The constraints are chosen such that patterns are likely to be clustered similarly if they lie close to specific unknown vectors in the feature space. The method may be conveniently applied to learning the optimal affinity matrix, which corresponds to learning parameters of the kernelized encoder. The procedure does not require computations of eigenvalues of the Gram matrices, which makes it potentially attractive for clustering large data sets. 1
4 0.2315907 75 nips-2005-Fixing two weaknesses of the Spectral Method
Author: Kevin Lang
Abstract: We discuss two intrinsic weaknesses of the spectral graph partitioning method, both of which have practical consequences. The first is that spectral embeddings tend to hide the best cuts from the commonly used hyperplane rounding method. Rather than cleaning up the resulting suboptimal cuts with local search, we recommend the adoption of flow-based rounding. The second weakness is that for many “power law” graphs, the spectral method produces cuts that are highly unbalanced, thus decreasing the usefulness of the method for visualization (see figure 4(b)) or as a basis for divide-and-conquer algorithms. These balance problems, which occur even though the spectral method’s quotient-style objective function does encourage balance, can be fixed with a stricter balance constraint that turns the spectral mathematical program into an SDP that can be solved for million-node graphs by a method of Burer and Monteiro. 1
5 0.22431719 178 nips-2005-Soft Clustering on Graphs
Author: Kai Yu, Shipeng Yu, Volker Tresp
Abstract: We propose a simple clustering framework on graphs encoding pairwise data similarities. Unlike usual similarity-based methods, the approach softly assigns data to clusters in a probabilistic way. More importantly, a hierarchical clustering is naturally derived in this framework to gradually merge lower-level clusters into higher-level ones. A random walk analysis indicates that the algorithm exposes clustering structures in various resolutions, i.e., a higher level statistically models a longer-term diffusion on graphs and thus discovers a more global clustering structure. Finally we provide very encouraging experimental results. 1
6 0.17336653 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators
7 0.1449441 79 nips-2005-Fusion of Similarity Data in Clustering
8 0.13305146 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
9 0.12805723 159 nips-2005-Q-Clustering
10 0.12032324 127 nips-2005-Mixture Modeling by Affinity Propagation
11 0.098746106 84 nips-2005-Generalization in Clustering with Unobserved Features
12 0.076429643 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
13 0.066462085 104 nips-2005-Laplacian Score for Feature Selection
14 0.061031338 126 nips-2005-Metric Learning by Collapsing Classes
15 0.059249178 41 nips-2005-Coarse sample complexity bounds for active learning
16 0.057255134 71 nips-2005-Fast Krylov Methods for N-Body Learning
17 0.055001993 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis
18 0.049945153 11 nips-2005-A Hierarchical Compositional System for Rapid Object Detection
19 0.048670407 115 nips-2005-Learning Shared Latent Structure for Image Synthesis and Robotic Imitation
20 0.046107333 147 nips-2005-On the Convergence of Eigenspaces in Kernel Principal Component Analysis
topicId topicWeight
[(0, 0.178), (1, 0.133), (2, -0.104), (3, -0.075), (4, -0.46), (5, -0.158), (6, 0.051), (7, -0.038), (8, -0.307), (9, 0.057), (10, -0.17), (11, 0.013), (12, 0.104), (13, 0.049), (14, 0.164), (15, -0.02), (16, 0.096), (17, -0.009), (18, -0.057), (19, -0.006), (20, 0.102), (21, -0.092), (22, 0.008), (23, -0.048), (24, -0.039), (25, 0.087), (26, -0.034), (27, 0.011), (28, 0.075), (29, -0.001), (30, 0.067), (31, -0.027), (32, 0.001), (33, -0.036), (34, -0.04), (35, -0.057), (36, 0.012), (37, 0.071), (38, -0.053), (39, -0.131), (40, -0.008), (41, -0.01), (42, -0.006), (43, 0.023), (44, 0.02), (45, 0.061), (46, 0.055), (47, 0.04), (48, -0.067), (49, -0.035)]
simIndex simValue paperId paperTitle
same-paper 1 0.9787026 13 nips-2005-A Probabilistic Approach for Optimizing Spectral Clustering
Author: Rong Jin, Feng Kang, Chris H. Ding
Abstract: Spectral clustering enjoys its success in both data clustering and semisupervised learning. But, most spectral clustering algorithms cannot handle multi-class clustering problems directly. Additional strategies are needed to extend spectral clustering algorithms to multi-class clustering problems. Furthermore, most spectral clustering algorithms employ hard cluster membership, which is likely to be trapped by the local optimum. In this paper, we present a new spectral clustering algorithm, named “Soft Cut”. It improves the normalized cut algorithm by introducing soft membership, and can be efficiently computed using a bound optimization algorithm. Our experiments with a variety of datasets have shown the promising performance of the proposed clustering algorithm. 1
2 0.92497844 177 nips-2005-Size Regularized Cut for Data Clustering
Author: Yixin Chen, Ya Zhang, Xiang Ji
Abstract: We present a novel spectral clustering method that enables users to incorporate prior knowledge of the size of clusters into the clustering process. The cost function, which is named size regularized cut (SRcut), is defined as the sum of the inter-cluster similarity and a regularization term measuring the relative size of two clusters. Finding a partition of the data set to minimize SRcut is proved to be NP-complete. An approximation algorithm is proposed to solve a relaxed version of the optimization problem as an eigenvalue problem. Evaluations over different data sets demonstrate that the method is not sensitive to outliers and performs better than normalized cut. 1
3 0.70025843 75 nips-2005-Fixing two weaknesses of the Spectral Method
Author: Kevin Lang
Abstract: We discuss two intrinsic weaknesses of the spectral graph partitioning method, both of which have practical consequences. The first is that spectral embeddings tend to hide the best cuts from the commonly used hyperplane rounding method. Rather than cleaning up the resulting suboptimal cuts with local search, we recommend the adoption of flow-based rounding. The second weakness is that for many “power law” graphs, the spectral method produces cuts that are highly unbalanced, thus decreasing the usefulness of the method for visualization (see figure 4(b)) or as a basis for divide-and-conquer algorithms. These balance problems, which occur even though the spectral method’s quotient-style objective function does encourage balance, can be fixed with a stricter balance constraint that turns the spectral mathematical program into an SDP that can be solved for million-node graphs by a method of Burer and Monteiro. 1
4 0.64296216 159 nips-2005-Q-Clustering
Author: Mukund Narasimhan, Nebojsa Jojic, Jeff A. Bilmes
Abstract: We show that Queyranne’s algorithm for minimizing symmetric submodular functions can be used for clustering with a variety of different objective functions. Two specific criteria that we consider in this paper are the single linkage and the minimum description length criteria. The first criterion tries to maximize the minimum distance between elements of different clusters, and is inherently “discriminative”. It is known that optimal clusterings into k clusters for any given k in polynomial time for this criterion can be computed. The second criterion seeks to minimize the description length of the clusters given a probabilistic generative model. We show that the optimal partitioning into 2 clusters, and approximate partitioning (guaranteed to be within a factor of 2 of the the optimal) for more clusters can be computed. To the best of our knowledge, this is the first time that a tractable algorithm for finding the optimal clustering with respect to the MDL criterion for 2 clusters has been given. Besides the optimality result for the MDL criterion, the chief contribution of this paper is to show that the same algorithm can be used to optimize a broad class of criteria, and hence can be used for many application specific criterion for which efficient algorithm are not known.
5 0.62698656 178 nips-2005-Soft Clustering on Graphs
Author: Kai Yu, Shipeng Yu, Volker Tresp
Abstract: We propose a simple clustering framework on graphs encoding pairwise data similarities. Unlike usual similarity-based methods, the approach softly assigns data to clusters in a probabilistic way. More importantly, a hierarchical clustering is naturally derived in this framework to gradually merge lower-level clusters into higher-level ones. A random walk analysis indicates that the algorithm exposes clustering structures in various resolutions, i.e., a higher level statistically models a longer-term diffusion on graphs and thus discovers a more global clustering structure. Finally we provide very encouraging experimental results. 1
6 0.62582195 102 nips-2005-Kernelized Infomax Clustering
7 0.46524835 127 nips-2005-Mixture Modeling by Affinity Propagation
8 0.46152541 84 nips-2005-Generalization in Clustering with Unobserved Features
9 0.45962328 79 nips-2005-Fusion of Similarity Data in Clustering
10 0.42518035 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators
11 0.33121702 71 nips-2005-Fast Krylov Methods for N-Body Learning
12 0.27394772 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
13 0.2542111 180 nips-2005-Spectral Bounds for Sparse PCA: Exact and Greedy Algorithms
14 0.24359274 107 nips-2005-Large scale networks fingerprinting and visualization using the k-core decomposition
15 0.22841814 33 nips-2005-Bayesian Sets
16 0.2221183 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning
17 0.21610494 104 nips-2005-Laplacian Score for Feature Selection
18 0.19228256 205 nips-2005-Worst-Case Bounds for Gaussian Process Models
19 0.19173054 4 nips-2005-A Bayesian Spatial Scan Statistic
20 0.17899624 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
topicId topicWeight
[(3, 0.055), (10, 0.042), (27, 0.02), (31, 0.027), (34, 0.11), (39, 0.017), (44, 0.321), (50, 0.014), (55, 0.024), (69, 0.052), (73, 0.109), (88, 0.074), (91, 0.028)]
simIndex simValue paperId paperTitle
1 0.85390437 64 nips-2005-Efficient estimation of hidden state dynamics from spike trains
Author: Marton G. Danoczy, Richard H. R. Hahnloser
Abstract: Neurons can have rapidly changing spike train statistics dictated by the underlying network excitability or behavioural state of an animal. To estimate the time course of such state dynamics from single- or multiple neuron recordings, we have developed an algorithm that maximizes the likelihood of observed spike trains by optimizing the state lifetimes and the state-conditional interspike-interval (ISI) distributions. Our nonparametric algorithm is free of time-binning and spike-counting problems and has the computational complexity of a Mixed-state Markov Model operating on a state sequence of length equal to the total number of recorded spikes. As an example, we fit a two-state model to paired recordings of premotor neurons in the sleeping songbird. We find that the two state-conditional ISI functions are highly similar to the ones measured during waking and singing, respectively. 1
same-paper 2 0.77907538 13 nips-2005-A Probabilistic Approach for Optimizing Spectral Clustering
Author: Rong Jin, Feng Kang, Chris H. Ding
Abstract: Spectral clustering enjoys its success in both data clustering and semisupervised learning. But, most spectral clustering algorithms cannot handle multi-class clustering problems directly. Additional strategies are needed to extend spectral clustering algorithms to multi-class clustering problems. Furthermore, most spectral clustering algorithms employ hard cluster membership, which is likely to be trapped by the local optimum. In this paper, we present a new spectral clustering algorithm, named “Soft Cut”. It improves the normalized cut algorithm by introducing soft membership, and can be efficiently computed using a bound optimization algorithm. Our experiments with a variety of datasets have shown the promising performance of the proposed clustering algorithm. 1
3 0.7398603 124 nips-2005-Measuring Shared Information and Coordinated Activity in Neuronal Networks
Author: Kristina Klinkner, Cosma Shalizi, Marcelo Camperi
Abstract: Most nervous systems encode information about stimuli in the responding activity of large neuronal networks. This activity often manifests itself as dynamically coordinated sequences of action potentials. Since multiple electrode recordings are now a standard tool in neuroscience research, it is important to have a measure of such network-wide behavioral coordination and information sharing, applicable to multiple neural spike train data. We propose a new statistic, informational coherence, which measures how much better one unit can be predicted by knowing the dynamical state of another. We argue informational coherence is a measure of association and shared information which is superior to traditional pairwise measures of synchronization and correlation. To find the dynamical states, we use a recently-introduced algorithm which reconstructs effective state spaces from stochastic time series. We then extend the pairwise measure to a multivariate analysis of the network by estimating the network multi-information. We illustrate our method by testing it on a detailed model of the transition from gamma to beta rhythms. Much of the most important information in neural systems is shared over multiple neurons or cortical areas, in such forms as population codes and distributed representations [1]. On behavioral time scales, neural information is stored in temporal patterns of activity as opposed to static markers; therefore, as information is shared between neurons or brain regions, it is physically instantiated as coordination between entire sequences of neural spikes. Furthermore, neural systems and regions of the brain often require coordinated neural activity to perform important functions; acting in concert requires multiple neurons or cortical areas to share information [2]. Thus, if we want to measure the dynamic network-wide behavior of neurons and test hypotheses about them, we need reliable, practical methods to detect and quantify behavioral coordination and the associated information sharing across multiple neural units. These would be especially useful in testing ideas about how particular forms of coordination relate to distributed coding (e.g., that of [3]). Current techniques to analyze relations among spike trains handle only pairs of neurons, so we further need a method which is extendible to analyze the coordination in the network, system, or region as a whole. Here we propose a new measure of behavioral coordination and information sharing, informational coherence, based on the notion of dynamical state. Section 1 argues that coordinated behavior in neural systems is often not captured by exist- ing measures of synchronization or correlation, and that something sensitive to nonlinear, stochastic, predictive relationships is needed. Section 2 defines informational coherence as the (normalized) mutual information between the dynamical states of two systems and explains how looking at the states, rather than just observables, fulfills the needs laid out in Section 1. Since we rarely know the right states a prori, Section 2.1 briefly describes how we reconstruct effective state spaces from data. Section 2.2 gives some details about how we calculate the informational coherence and approximate the global information stored in the network. Section 3 applies our method to a model system (a biophysically detailed conductance-based model) comparing our results to those of more familiar second-order statistics. 
In the interest of space, we omit proofs and a full discussion of the existing literature, giving only minimal references here; proofs and references will appear in a longer paper now in preparation. 1 Synchrony or Coherence? Most hypotheses which involve the idea that information sharing is reflected in coordinated activity across neural units invoke a very specific notion of coordinated activity, namely strict synchrony: the units should be doing exactly the same thing (e.g., spiking) at exactly the same time. Investigators then measure coordination by measuring how close the units come to being strictly synchronized (e.g., variance in spike times). From an informational point of view, there is no reason to favor strict synchrony over other kinds of coordination. One neuron consistently spiking 50 ms after another is just as informative a relationship as two simultaneously spiking, but such stable phase relations are missed by strict-synchrony approaches. Indeed, whatever the exact nature of the neural code, it uses temporally extended patterns of activity, and so information sharing should be reflected in coordination of those patterns, rather than just the instantaneous activity. There are three common ways of going beyond strict synchrony: cross-correlation and related second-order statistics, mutual information, and topological generalized synchrony. The cross-correlation function (the normalized covariance function; this includes, for present purposes, the joint peristimulus time histogram [2]), is one of the most widespread measures of synchronization. It can be efficiently calculated from observable series; it handles statistical as well as deterministic relationships between processes; by incorporating variable lags, it reduces the problem of phase locking. Fourier transformation of the covariance function γXY (h) yields the cross-spectrum FXY (ν), which in turn gives the 2 spectral coherence cXY (ν) = FXY (ν)/FX (ν)FY (ν), a normalized correlation between the Fourier components of X and Y . Integrated over frequencies, the spectral coherence measures, essentially, the degree of linear cross-predictability of the two series. ([4] applies spectral coherence to coordinated neural activity.) However, such second-order statistics only handle linear relationships. Since neural processes are known to be strongly nonlinear, there is little reason to think these statistics adequately measure coordination and synchrony in neural systems. Mutual information is attractive because it handles both nonlinear and stochastic relationships and has a very natural and appealing interpretation. Unfortunately, it often seems to fail in practice, being disappointingly small even between signals which are known to be tightly coupled [5]. The major reason is that the neural codes use distinct patterns of activity over time, rather than many different instantaneous actions, and the usual approach misses these extended patterns. Consider two neurons, one of which drives the other to spike 50 ms after it does, the driving neuron spiking once every 500 ms. These are very tightly coordinated, but whether the first neuron spiked at time t conveys little information about what the second neuron is doing at t — it’s not spiking, but it’s not spiking most of the time anyway. Mutual information calculated from the direct observations conflates the “no spike” of the second neuron preparing to fire with its just-sitting-around “no spike”. 
Mutual information is attractive because it handles both nonlinear and stochastic relationships and has a very natural and appealing interpretation. Unfortunately, it often seems to fail in practice, being disappointingly small even between signals which are known to be tightly coupled [5]. The major reason is that neural codes use distinct patterns of activity over time, rather than many different instantaneous actions, and the usual approach misses these extended patterns. Consider two neurons, one of which drives the other to spike 50 ms after it does, the driving neuron spiking once every 500 ms. These are very tightly coordinated, but knowing whether the first neuron spiked at time t conveys little information about what the second neuron is doing at t: it is not spiking, but it is not spiking most of the time anyway. Mutual information calculated from the direct observations conflates the “no spike” of the second neuron preparing to fire with its just-sitting-around “no spike”. Here, mutual information could find the coordination if we used a 50 ms lag, but that won’t work in general. Take two rate-coding neurons with baseline firing rates of 1 Hz, and suppose that a stimulus excites one to 10 Hz and suppresses the other to 0.1 Hz. The spiking rates thus share a lot of information, but whether the one neuron spiked at t is uninformative about what the other neuron did then, and lagging won’t help.

Generalized synchrony is based on the idea of establishing relationships between the states of the various units. “State” here is taken in the sense of physics, dynamics and control theory: the state at time t is a variable which fixes the distribution of observables at all times ≥ t, rendering the past of the system irrelevant [6]. Knowing the state allows us to predict, as well as possible, how the system will evolve, and how it will respond to external forces [7]. Two coupled systems are said to exhibit generalized synchrony if the state of one system is given by a mapping from the state of the other. Applications to data employ state-space reconstruction [8]: if the state x ∈ X evolves according to smooth, d-dimensional deterministic dynamics, and we observe a generic function y = f(x), then the space Y of time-delay vectors [y(t), y(t − τ), …, y(t − (k − 1)τ)] is diffeomorphic to X if k > 2d, for generic choices of lag τ. The various versions of generalized synchrony differ on how, precisely, to quantify the mappings between reconstructed state spaces, but they all appear to be empirically equivalent to one another and to notions of phase synchronization based on Hilbert transforms [5]. Thus all of these measures accommodate nonlinear relationships, and are potentially very flexible. Unfortunately, there is essentially no reason to believe that neural systems have deterministic dynamics at experimentally-accessible levels of detail, much less that there are deterministic relationships among such states for different units.

What we want, then, but none of these alternatives provides, is a quantity which measures predictive relationships among states, but allows those relationships to be nonlinear and stochastic. The next section introduces just such a measure, which we call “informational coherence”.

2 States and Informational Coherence

There are alternatives to calculating the “surface” mutual information between the sequences of observations themselves (which, as described, fails to capture coordination). If we know that the units are phase oscillators, or rate coders, we can estimate their instantaneous phase or rate and, by calculating the mutual information between those variables, see how coordinated the units’ patterns of activity are. However, phases and rates do not exhaust the repertoire of neural patterns, and a more general, common scheme is desirable. The most general notion of “pattern of activity” is simply that of the dynamical state of the system, in the sense mentioned above. We now formalize this.

Assuming the usual notation for Shannon information [9], the information content of a state variable X is H[X] and the mutual information between X and Y is I[X; Y]. As is well known, I[X; Y] ≤ min{H[X], H[Y]}. We use this to normalize the mutual state information to a 0–1 scale, and this is the informational coherence (IC):

ψ(X, Y) = I[X; Y] / min{H[X], H[Y]}, with 0/0 = 0.   (1)
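As a concrete illustration of Eq. 1, here is a minimal sketch of the computation, assuming the two units' dynamical states are already available as integer-valued sequences (in the paper these come from the causal-state reconstruction described next); the plug-in entropy and mutual-information estimators and the toy state sequences are our own simplifying assumptions.

```python
import numpy as np

def entropy(labels):
    """Plug-in Shannon entropy (in bits) of a discrete sequence."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mutual_information(x, y):
    """Plug-in mutual information I[X;Y] from a joint histogram of two discrete sequences."""
    joint = {}
    for a, b in zip(x, y):
        joint[(a, b)] = joint.get((a, b), 0) + 1
    n = len(x)
    px = {a: np.mean(x == a) for a in set(x)}
    py = {b: np.mean(y == b) for b in set(y)}
    return sum((c / n) * np.log2((c / n) / (px[a] * py[b])) for (a, b), c in joint.items())

def informational_coherence(x, y):
    """psi(X, Y) = I[X;Y] / min(H[X], H[Y]), with 0/0 defined as 0 (Eq. 1)."""
    denom = min(entropy(x), entropy(y))
    return 0.0 if denom == 0 else mutual_information(x, y) / denom

# Toy state sequences: y copies x's state 80% of the time, otherwise picks at random.
rng = np.random.default_rng(1)
x = rng.integers(0, 3, size=5000)
y = np.where(rng.random(5000) < 0.8, x, rng.integers(0, 3, size=5000))
print(informational_coherence(x, y))
```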
ψ can be interpreted as follows. I[X; Y] is the Kullback-Leibler divergence between the joint distribution of X and Y and the product of their marginal distributions [9], indicating the error involved in ignoring the dependence between X and Y. The mutual information between predictive, dynamical states thus gauges the error involved in assuming the two systems are independent, i.e., how much predictions could improve by taking into account the dependence. Hence it measures the amount of dynamically-relevant information shared between the two systems. ψ simply normalizes this value, and indicates the degree to which two systems have coordinated patterns of behavior (cf. [10], although this only uses directly observable quantities).

2.1 Reconstruction and Estimation of Effective State Spaces

As mentioned, the state space of a deterministic dynamical system can be reconstructed from a sequence of observations. This is the main tool of experimental nonlinear dynamics [8]; but the assumption of determinism is crucial and false, for almost any interesting neural system. While classical state-space reconstruction won’t work on stochastic processes, such processes do have state-space representations [11], and, in the special case of discrete-valued, discrete-time series, there are ways to reconstruct the state space. Here we use the CSSR algorithm, introduced in [12] (code available at http://bactra.org/CSSR). This produces causal state models, which are stochastic automata capable of statistically-optimal nonlinear prediction; the state of the machine is a minimal sufficient statistic for the future of the observable process [13]. (Causal state models have the same expressive power as observable operator models [14] or predictive state representations [7], and greater power than variable-length Markov models [15, 16].) The basic idea is to form a set of states which should be (1) Markovian, (2) sufficient statistics for the next observable, and (3) have deterministic transitions (in the automata-theory sense). The algorithm begins with a minimal, one-state, IID model, and checks whether these properties hold, by means of hypothesis tests. If they fail, the model is modified, generally but not always by adding more states, and the new model is checked again. Each state of the model corresponds to a distinct distribution over future events, i.e., to a statistical pattern of behavior. Under mild conditions, which do not involve prior knowledge of the state space, CSSR converges in probability to the unique causal state model of the data-generating process [12]. In practice, CSSR is quite fast (linear in the data size), and generalizes at least as well as training hidden Markov models with the EM algorithm and using cross-validation for selection, the standard heuristic [12].

One advantage of the causal state approach (which it shares with classical state-space reconstruction) is that state estimation is greatly simplified. In the general case of nonlinear state estimation, it is necessary to know not just the form of the stochastic dynamics in the state space and the observation function, but also their precise parametric values and the distribution of observation and driving noises. Estimating the state from the observable time series then becomes a computationally-intensive application of Bayes’s Rule [17]. Due to the way causal states are built as statistics of the data, with probability 1 there is a finite time t at which the causal state at time t is certain. This is not just with some degree of belief or confidence: because of the way the states are constructed, it is impossible for the process to be in any other state at that time.
Once the causal state has been established, it can be updated recursively: the causal state at time t + 1 is an explicit function of the causal state at time t and the observation at t + 1. The causal state model can therefore be automatically converted into a finite-state transducer which reads in an observation time series and outputs the corresponding series of states [18, 13]. (Our implementation of CSSR filters its training data automatically.) The result is a new time series of states, from which all non-predictive components have been filtered out.

2.2 Estimating the Coherence

Our algorithm for estimating the matrix of informational coherences is as follows. For each unit, we reconstruct the causal state model, and filter the observable time series to produce a series of causal states. Then, for each pair of neurons, we construct a joint histogram of the state distribution, estimate the mutual information between the states, and normalize by the single-unit state informations. This gives a symmetric matrix of ψ values.

Even if two systems are independent, their estimated IC will, on average, be positive because, while they should have zero mutual information, the empirical estimate of mutual information is non-negative. Thus, the significance of IC values must be assessed against the null hypothesis of system independence. The easiest way to do so is to take the reconstructed state models for the two systems and run them forward, independently of one another, to generate a large number of simulated state sequences, and from these calculate values of the IC. This procedure approximates the sampling distribution of the IC under a null model which preserves the dynamics of each system, but not their interaction. We can then find p-values as usual. We omit them here to save space.
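A sketch of the pairwise procedure above, under stated simplifications: each unit's filtered state series is assumed to be an integer-valued array of common length with labels 0, …, k−1, informational_coherence is the function sketched after Eq. 1, and, in place of running the full causal-state models forward, the null distribution is approximated here with independently fitted first-order Markov chains.

```python
import numpy as np

def ic_matrix(state_series, ic_fn):
    """Symmetric matrix of pairwise informational coherences over a list of state sequences."""
    n = len(state_series)
    psi = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            psi[i, j] = psi[j, i] = ic_fn(state_series[i], state_series[j])
    return psi

def fit_markov(states, n_states):
    """First-order Markov transition matrix, a stand-in for the reconstructed state model."""
    trans = np.ones((n_states, n_states))  # add-one smoothing so every row is a distribution
    for a, b in zip(states[:-1], states[1:]):
        trans[a, b] += 1
    return trans / trans.sum(axis=1, keepdims=True)

def null_ic_samples(states_i, states_j, ic_fn, n_sims=200, seed=0):
    """Sampling distribution of the IC under independence: simulate each chain separately."""
    rng = np.random.default_rng(seed)
    k = int(max(states_i.max(), states_j.max())) + 1
    ti, tj = fit_markov(states_i, k), fit_markov(states_j, k)
    T = len(states_i)
    samples = []
    for _ in range(n_sims):
        si, sj = np.empty(T, dtype=int), np.empty(T, dtype=int)
        si[0], sj[0] = states_i[0], states_j[0]
        for t in range(1, T):
            si[t] = rng.choice(k, p=ti[si[t - 1]])
            sj[t] = rng.choice(k, p=tj[sj[t - 1]])
        samples.append(ic_fn(si, sj))
    return np.array(samples)
```

The observed ψ for a pair is then compared against these surrogate values to obtain a p-value, exactly as described above.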
2.3 Approximating the Network Multi-Information

There is broad agreement [2] that analyses of networks should not just be an analysis of pairs of neurons, averaged over pairs. Ideally, an analysis of information sharing in a network would look at the over-all structure of statistical dependence between the various units, reflected in the complete joint probability distribution P of the states. This would then allow us, for instance, to calculate the n-fold multi-information, I[X_1; X_2; …; X_n] ≡ D(P‖Q), the Kullback-Leibler divergence between the joint distribution P and the product of marginal distributions Q, analogous to the pairwise mutual information [19]. Calculated over the predictive states, the multi-information would give the total amount of shared dynamical information in the system. Just as we normalized the mutual information I[X_1; X_2] by its maximum possible value, min{H[X_1], H[X_2]}, we normalize the multi-information by its maximum, which is the smallest sum of n − 1 marginal entropies:

I[X_1; X_2; …; X_n] ≤ min_k ∑_{i ≠ k} H[X_i]

Unfortunately, P is a distribution over a very high-dimensional space and so is hard to estimate well without strong parametric constraints. We thus consider approximations. The lowest-order approximation treats all the units as independent; this is the distribution Q. One step up are tree distributions, where the global distribution is a function of the joint distributions of pairs of units. Not every pair of units needs to enter into such a distribution, though every unit must be part of some pair. Graphically, a tree distribution corresponds to a spanning tree, with edges linking units whose interactions enter into the global probability, and conversely spanning trees determine tree distributions. Writing E_T for the set of pairs (i, j) and abbreviating X_1 = x_1, X_2 = x_2, …, X_n = x_n by X = x, one has

T(X = x) = [ ∏_{(i,j) ∈ E_T} T(X_i = x_i, X_j = x_j) / (T(X_i = x_i) T(X_j = x_j)) ] ∏_{i=1}^{n} T(X_i = x_i)   (2)

where the marginal distributions T(X_i) and the pair distributions T(X_i, X_j) are estimated by the empirical marginal and pair distributions. We must now pick edges E_T so that T best approximates the true global distribution P. A natural approach is to minimize D(P‖T), the divergence between P and its tree approximation. Chow and Liu [20] showed that the maximum-weight spanning tree gives the divergence-minimizing distribution, taking an edge’s weight to be the mutual information between the variables it links.

There are three advantages to using the Chow-Liu approximation. (1) Estimating T from empirical probabilities gives a consistent maximum-likelihood estimator of the ideal Chow-Liu tree [20], with reasonable rates of convergence, so T can be reliably known even if P cannot. (2) There are efficient algorithms for constructing maximum-weight spanning trees, such as Prim’s algorithm [21, sec. 23.2], which runs in time O(n² + n log n). Thus, the approximation is computationally tractable. (3) The KL divergence of the Chow-Liu distribution from Q gives a lower bound on the network multi-information; that bound is just the sum of the mutual informations along the edges in the tree:

I[X_1; X_2; …; X_n] ≥ D(T‖Q) = ∑_{(i,j) ∈ E_T} I[X_i; X_j]   (3)

Even if we knew P exactly, Eq. 3 would be useful as an alternative to calculating D(P‖Q) directly, which requires evaluating log P(x)/Q(x) for all the exponentially-many configurations x. It is natural to seek higher-order approximations to P, e.g., using three-way interactions not decomposable into pairwise interactions [22, 19]. But it is hard to do so effectively, because finding the optimal approximation to P when such interactions are allowed is NP-hard [23], and analytical formulas like Eq. 3 generally do not exist [19]. We therefore confine ourselves to the Chow-Liu approximation here.
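The Chow-Liu step and the bound of Eq. 3 can be sketched as follows, assuming the pairwise state mutual informations and single-unit state entropies (in bits) have already been estimated; a naive O(n³) Prim's algorithm stands in for the heap-based version cited above, and the example numbers are hypothetical.

```python
import numpy as np

def max_weight_spanning_tree(weights):
    """Prim's algorithm (naive version) for the maximum-weight spanning tree of a dense matrix.

    Returns the list of chosen edges (i, j) and the total edge weight.
    """
    n = weights.shape[0]
    in_tree = [0]
    edges, total = [], 0.0
    while len(in_tree) < n:
        best = None
        for i in in_tree:
            for j in range(n):
                if j not in in_tree and (best is None or weights[i, j] > weights[best]):
                    best = (i, j)
        edges.append(best)
        total += weights[best]
        in_tree.append(best[1])
    return edges, total

def chow_liu_bound(mi_matrix, entropies):
    """Lower bound on the multi-information (Eq. 3) and the normalized global coherence."""
    edges, tree_multi_info = max_weight_spanning_tree(mi_matrix)
    # Maximum possible multi-information: the smallest sum of n-1 marginal entropies,
    # i.e. total entropy minus the largest single-unit entropy.
    max_multi_info = entropies.sum() - entropies.max()
    return tree_multi_info, tree_multi_info / max_multi_info

# Hypothetical 4-unit example (mutual informations and entropies in bits).
mi = np.array([[0.0, 0.5, 0.1, 0.0],
               [0.5, 0.0, 0.2, 0.1],
               [0.1, 0.2, 0.0, 0.4],
               [0.0, 0.1, 0.4, 0.0]])
H = np.array([1.0, 1.2, 0.9, 1.1])
print(chow_liu_bound(mi, H))
```

The second returned value corresponds to the "global coherence" reported in Section 3: the tree estimate of the multi-information divided by its maximum possible value.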
3 Example: A Model of Gamma and Beta Rhythms

We use simulated data as a test case, instead of empirical multiple electrode recordings, which allows us to try the method on a system of over 1000 neurons and compare the measure against expected results. The model, taken from [24], was originally designed to study episodes of gamma (30–80 Hz) and beta (12–30 Hz) oscillations in the mammalian nervous system, which often occur successively with a spontaneous transition between them. More concretely, the rhythms studied were those displayed by in vitro hippocampal (CA1) slice preparations and by in vivo neocortical EEGs.

The model contains two neuron populations: excitatory (AMPA) pyramidal neurons and inhibitory (GABA_A) interneurons, defined by conductance-based Hodgkin-Huxley-style equations. Simulations were carried out in a network of 1000 pyramidal cells and 300 interneurons. Each cell was modeled as a one-compartment neuron with all-to-all coupling, endowed with the basic sodium and potassium spiking currents, an external applied current, and some Gaussian input noise.

The first 10 seconds of the simulation correspond to the gamma rhythm, in which only a group of neurons is made to spike via a linearly increasing applied current. The beta rhythm (subsequent 10 seconds) is obtained by activating pyramidal-pyramidal recurrent connections (potentiated by Hebbian preprocessing as a result of synchrony during the gamma rhythm) and a slow outward after-hyperpolarization (AHP) current (the M-current), suppressed during gamma due to the metabotropic activation used in the generation of the rhythm. During the beta rhythm, pyramidal cells, silent during the gamma rhythm, fire on a subset of interneuron cycles (Fig. 1).

Figure 1: Rastergrams of neuronal spike-times in the network. Excitatory, pyramidal neurons (numbers 1 to 1000) are shown in green, inhibitory interneurons (numbers 1001 to 1300) in red. During the first 10 seconds (panel a), the recurrent connections among the pyramidal cells are suppressed and a gamma rhythm emerges. At t = 10 s, those connections become active, leading to a beta rhythm (panel b).

Fig. 2 compares zero-lag cross-correlation, a second-order method of quantifying coordination, with the informational coherence calculated from the reconstructed states. (In this simulation, we could have calculated the actual states of the model neurons directly, rather than reconstructing them, but for purposes of testing our method we did not.) Cross-correlation finds some of the relationships visible in Fig. 1, but is confused by, for instance, the phase shifts between pyramidal cells. (Surface mutual information, not shown, gives similar results.) Informational coherence, however, has no trouble recognizing the two populations as effectively coordinated blocks. The presence of dynamical noise, problematic for ordinary state reconstruction, is not an issue. The average IC is 0.411 (or 0.797 if the inactive, low-numbered neurons are excluded). The tree estimate of the global multi-information is 3243.7 bits, with a global coherence of 0.777.

Figure 2: Heat-maps of coordination for the network, as measured by zero-lag cross-correlation (top row) and informational coherence (bottom row), contrasting the gamma rhythm (left column) with the beta rhythm (right column). Colors run from red (no coordination) through yellow to pale cream (maximum).

The right half of Fig. 2 repeats this analysis for the beta rhythm; in this stage, the average IC is 0.614, and the tree estimate of the global multi-information is 7377.7 bits, though the estimated global coherence falls very slightly to 0.742. This is because low-numbered neurons which were quiescent before are now active, contributing to the global information, but the over-all pattern is somewhat weaker and more noisy (as can be seen from Fig. 1b). So, as expected, the total information content is higher, but the overall coordination across the network is lower.

4 Conclusion

Informational coherence provides a measure of neural information sharing and coordinated activity which accommodates nonlinear, stochastic relationships between extended patterns of spiking. It is robust to dynamical noise and leads to a genuinely multivariate measure of global coordination across networks or regions. Applied to data from multi-electrode recordings, it should be a valuable tool in evaluating hypotheses about distributed neural representation and function.

Acknowledgments

Thanks to R. Haslinger, E. Ionides and S.
Page; and for support to the Santa Fe Institute (under grants from Intel, the NSF and the MacArthur Foundation, and DARPA agreement F30602-00-2-0583), the Clare Boothe Luce Foundation (KLK) and the James S. McDonnell Foundation (CRS).

References

[1] L. F. Abbott and T. J. Sejnowski, eds. Neural Codes and Distributed Representations. MIT Press, 1998.
[2] E. N. Brown, R. E. Kass, and P. P. Mitra. Nature Neuroscience, 7:456–461, 2004.
[3] D. H. Ballard, Z. Zhang, and R. P. N. Rao. In R. P. N. Rao, B. A. Olshausen, and M. S. Lewicki, eds., Probabilistic Models of the Brain, pp. 273–284. MIT Press, 2002.
[4] D. R. Brillinger and A. E. P. Villa. In D. R. Brillinger, L. T. Fernholz, and S. Morgenthaler, eds., The Practice of Data Analysis, pp. 77–92. Princeton U.P., 1997.
[5] R. Quian Quiroga et al. Physical Review E, 65:041903, 2002.
[6] R. F. Streater. Statistical Dynamics. Imperial College Press, London.
[7] M. L. Littman, R. S. Sutton, and S. Singh. In T. G. Dietterich, S. Becker, and Z. Ghahramani, eds., Advances in Neural Information Processing Systems 14, pp. 1555–1561. MIT Press, 2002.
[8] H. Kantz and T. Schreiber. Nonlinear Time Series Analysis. Cambridge U.P., 1997.
[9] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 1991.
[10] M. Palus et al. Physical Review E, 63:046211, 2001.
[11] F. B. Knight. Annals of Probability, 3:573–596, 1975.
[12] C. R. Shalizi and K. L. Shalizi. In M. Chickering and J. Halpern, eds., Uncertainty in Artificial Intelligence: Proceedings of the Twentieth Conference, pp. 504–511. AUAI Press, 2004.
[13] C. R. Shalizi and J. P. Crutchfield. Journal of Statistical Physics, 104:817–819, 2001.
[14] H. Jaeger. Neural Computation, 12:1371–1398, 2000.
[15] D. Ron, Y. Singer, and N. Tishby. Machine Learning, 25:117–149, 1996.
[16] P. Bühlmann and A. J. Wyner. Annals of Statistics, 27:480–513, 1999.
[17] N. U. Ahmed. Linear and Nonlinear Filtering for Scientists and Engineers. World Scientific, 1998.
[18] D. R. Upper. PhD thesis, University of California, Berkeley, 1997.
[19] E. Schneidman, S. Still, M. J. Berry, and W. Bialek. Physical Review Letters, 91:238701, 2003.
[20] C. K. Chow and C. N. Liu. IEEE Transactions on Information Theory, IT-14:462–467, 1968.
[21] T. H. Cormen et al. Introduction to Algorithms, 2nd ed. MIT Press, 2001.
[22] S. Amari. IEEE Transactions on Information Theory, 47:1701–1711, 2001.
[23] S. Kirshner, P. Smyth, and A. Robertson. Tech. Rep. 04-04, UC Irvine, Information and Computer Science, 2004.
[24] M. S. Olufsen et al. Journal of Computational Neuroscience, 14:33–54, 2003.
4 0.64398056 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations
Author: Aurelie C. Lozano, Sanjeev R. Kulkarni, Robert E. Schapire
Abstract: We study the statistical convergence and consistency of regularized Boosting methods, where the samples are not independent and identically distributed (i.i.d.) but come from empirical processes of stationary β-mixing sequences. Utilizing a technique that constructs a sequence of independent blocks close in distribution to the original samples, we prove the consistency of the composite classifiers resulting from a regularization achieved by restricting the 1-norm of the base classifiers’ weights. When compared to the i.i.d. case, the nature of sampling manifests in the consistency result only through generalization of the original condition on the growth of the regularization parameter.
5 0.50280595 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
Author: Tong Zhang, Rie Kubota Ando
Abstract: We consider a framework for semi-supervised learning using spectral decomposition based un-supervised kernel design. This approach subsumes a class of previously proposed semi-supervised learning methods on data graphs. We examine various theoretical properties of such methods. In particular, we derive a generalization performance bound, and obtain the optimal kernel design by minimizing the bound. Based on the theoretical analysis, we are able to demonstrate why spectral kernel design based methods can often improve the predictive performance. Experiments are used to illustrate the main consequences of our analysis.
6 0.49965233 198 nips-2005-Using ``epitomes'' to model genetic diversity: Rational design of HIV vaccine cocktails
7 0.49317044 102 nips-2005-Kernelized Infomax Clustering
8 0.49087647 71 nips-2005-Fast Krylov Methods for N-Body Learning
9 0.48783073 55 nips-2005-Describing Visual Scenes using Transformed Dirichlet Processes
10 0.48693874 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
11 0.4846074 177 nips-2005-Size Regularized Cut for Data Clustering
12 0.4828226 104 nips-2005-Laplacian Score for Feature Selection
13 0.4781099 77 nips-2005-From Lasso regression to Feature vector machine
14 0.47125843 98 nips-2005-Infinite latent feature models and the Indian buffet process
15 0.46919906 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators
16 0.46905157 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning
17 0.46890885 189 nips-2005-Tensor Subspace Analysis
18 0.46595779 184 nips-2005-Structured Prediction via the Extragradient Method
19 0.46258256 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines
20 0.46239224 178 nips-2005-Soft Clustering on Graphs