nips nips2005 nips2005-177 knowledge-graph by maker-knowledge-mining

177 nips-2005-Size Regularized Cut for Data Clustering


Source: pdf

Author: Yixin Chen, Ya Zhang, Xiang Ji

Abstract: We present a novel spectral clustering method that enables users to incorporate prior knowledge of the size of clusters into the clustering process. The cost function, which is named size regularized cut (SRcut), is defined as the sum of the inter-cluster similarity and a regularization term measuring the relative size of two clusters. Finding a partition of the data set to minimize SRcut is proved to be NP-complete. An approximation algorithm is proposed to solve a relaxed version of the optimization problem as an eigenvalue problem. Evaluations over different data sets demonstrate that the method is not sensitive to outliers and performs better than normalized cut. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 com Abstract We present a novel spectral clustering method that enables users to incorporate prior knowledge of the size of clusters into the clustering process. [sent-10, score-0.616]

2 The cost function, which is named size regularized cut (SRcut), is defined as the sum of the inter-cluster similarity and a regularization term measuring the relative size of two clusters. [sent-11, score-0.822]

3 Finding a partition of the data set to minimize SRcut is proved to be NP-complete. [sent-12, score-0.148]

4 Evaluations over different data sets demonstrate that the method is not sensitive to outliers and performs better than normalized cut. [sent-14, score-0.173]

5 1 Introduction In recent years, spectral clustering based on graph partitioning theories has emerged as one of the most effective data clustering tools. [sent-15, score-0.536]

6 Each edge is assigned a weight describing the similarity between the two nodes connected by the edge. [sent-18, score-0.119]

7 Clustering is then accomplished by finding the best cuts of the graph that optimize certain predefined cost functions. [sent-19, score-0.209]

8 The optimization usually leads to the computation of the top eigenvectors of certain graph affinity matrices, and the clustering result can be derived from the obtained eigen-space [12, 6]. [sent-20, score-0.281]

9 Many cost functions, such as the ratio cut [3], average association [15], spectral k-means [19], normalized cut [15], min-max cut [7], and a measure using conductance and cut [9] have been proposed along with the corresponding eigen-systems for the data clustering purpose. [sent-21, score-2.007]

10 The above data clustering methods, as well as most other methods in the literature, share a common characteristic: they seek partitions that maximize the intra-cluster similarity and/or minimize the inter-cluster similarity. [sent-22, score-0.171]

11 These approaches perform well in some cases, but fail drastically when target data sets possess complex, extreme data distributions, and when the user has special needs for the data clustering task. [sent-23, score-0.186]

12 For example, it has been pointed out by several researchers that normalized cut sometimes displays sensitivity to outliers [7, 14]. [sent-24, score-0.527]

13 Normalized cut tends to find a cluster consisting of a very small number of points if those points are far away from the center of the data set [14]. [sent-25, score-0.432]

14 There has been an abundance of prior work on embedding user’s prior knowledge of the data set in the clustering process. [sent-26, score-0.199]

15 Kernighan and Lin [11] applied a local search procedure that maintained two equally sized clusters while trying to minimize the association between the clusters. [sent-27, score-0.126]

16 Banerjee and Ghosh [2] proposed a method to balance the size of the clusters by considering an explicit soft constraint. [sent-30, score-0.217]

17 [17] presented a method to learn a clustering metric over user specified samples. [sent-32, score-0.167]

18 Yu and Shi [18] introduced a method to include must-link grouping cues in normalized cut. [sent-33, score-0.095]

19 Other related works include leaving a fraction of the points unclustered to avoid the effect of outliers [4] and enforcing a minimum cluster size constraint [10]. [sent-34, score-0.255]

20 In this paper, we present a novel clustering method based on graph partitioning. [sent-35, score-0.262]

21 The new method enables users to incorporate prior knowledge of the expected size of clusters into the clustering process. [sent-36, score-0.406]

22 Specifically, the cost function of the new method is defined as the sum of the inter-cluster similarity and a regularization term that measures the relative size of two clusters. [sent-37, score-0.243]

23 An “optimal” partition corresponds to a tradeoff between the inter-cluster similarity and the relative size of two clusters. [sent-38, score-0.307]

24 We show that the size of the clusters generated by the optimal partition can be controlled by adjusting the weight on the regularization term. [sent-39, score-0.369]

25 2 Size regularized cut We model a given data set using a weighted undirected graph G = G(V, E, W) where V, E, and W denote the vertex set, edge set, and graph affinity matrix, respectively. [sent-42, score-0.819]

26 Each vertex i ∈ V represents a data point, and each edge (i, j) ∈ E is assigned a nonnegative weight Wij to reflect the similarity between the data points i and j. [sent-43, score-0.187]
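
The excerpt does not specify how the pairwise weights Wij are computed. A common choice (an assumption here, not taken from the paper) is a Gaussian kernel on the data points' feature vectors, as in the short NumPy sketch below; the function name and the bandwidth parameter sigma are illustrative.

```python
import numpy as np

def gaussian_affinity(X, sigma=1.0):
    """W[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)), with a zero diagonal."""
    X = np.asarray(X, dtype=float)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-similarity edges
    return W
```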

27 A graph partitioning method attempts to organize vertices into groups so that the intra-cluster similarity is high, and/or the inter-cluster similarity is low. [sent-44, score-0.382]

28 A simple way to quantify the cost for partitioning vertices into two disjoint sets V1 and V2 is the cut size cut(V1, V2) = Σ_{i∈V1, j∈V2} Wij, which can be viewed as the similarity or association between V1 and V2. [sent-45, score-0.795]
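
As a concrete illustration of this cut-size definition, here is a short NumPy sketch; the function name and the toy affinity matrix are illustrative, not from the paper.

```python
import numpy as np

def cut_size(W, idx1, idx2):
    """cut(V1, V2): sum of affinities Wij over i in V1 and j in V2."""
    W = np.asarray(W, dtype=float)
    return W[np.ix_(idx1, idx2)].sum()

# Toy example: two tightly connected pairs, weakly linked to each other.
W = np.array([[0.0, 0.9, 0.1, 0.0],
              [0.9, 0.0, 0.0, 0.1],
              [0.1, 0.0, 0.0, 0.8],
              [0.0, 0.1, 0.8, 0.0]])
print(cut_size(W, [0, 1], [2, 3]))  # 0.2, the inter-cluster similarity
```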

29 Finding a binary partition of the graph that minimizes the cut size is known as the minimum cut problem. [sent-46, score-1.17]

30 However, the minimum cut criterion favors grouping small sets of isolated nodes in the graph [15]. [sent-48, score-0.612]

31 To capture the need for more balanced clusters, it has been proposed to include the cluster size information as a multiplicative penalty factor in the cost function, such as average cut [3] and normalized cut [15]. [sent-49, score-1.127]
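
For reference, a standard way to write the normalized cut criterion of [15] (stated here from the cited work, not quoted from this paper) is:

```latex
\mathrm{Ncut}(V_1, V_2)
  = \frac{\mathrm{cut}(V_1, V_2)}{\mathrm{assoc}(V_1, V)}
  + \frac{\mathrm{cut}(V_1, V_2)}{\mathrm{assoc}(V_2, V)},
\qquad
\mathrm{assoc}(V_k, V) = \sum_{i \in V_k,\; j \in V} W_{ij}.
```

When the cut separating a few outliers is close to zero, both terms are close to zero regardless of how unbalanced the partition is, which is the sensitivity discussed below.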

32 Here, β = [β1, · · · , βN]^T is a weight vector where βi is a nonnegative weight associated with vertex i, and N is the total number of vertices in V. [sent-51, score-0.187]

33 In contrast with minimum cut, average cut and normalized cut tend to generate more balanced clusters. [sent-57, score-0.951]

34 However, due to the multiplicative nature of their cost functions, average cut and normalized cut are still sensitive to outliers. [sent-58, score-0.974]

35 This is because the cut value for separating outliers from the rest of the data points is usually close to zero, and thus makes the multiplicative penalty factor void. [sent-59, score-0.507]

36 To avoid the drawback of the above multiplicative cost functions, we introduce an additive cost function for graph bi-partitioning. [sent-60, score-0.279]

37 The cost function is named size regularized cut (SRcut), and is defined as SRcut(V1, V2) = cut(V1, V2) − α |V1|_β |V2|_β (3), where |Vj|_β (j = 1, 2) is the weighted size described in (2), and β and α > 0 are given a priori. [sent-61, score-0.644]

38 The last term in (3), α |V1|_β |V2|_β, is the size regularization term, which can be interpreted as follows. [sent-62, score-0.112]

39 Therefore, |V1|_β |V2|_β achieves its maximum value when the two clusters are of equal weighted size. [sent-65, score-0.119]

40 Consequently, minimizing SRcut is equivalent to minimizing the similarity between two clusters and, at the same time, searching for a balanced partition. [sent-66, score-0.284]
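
Evaluating the SRcut objective in (3) for a given bipartition is straightforward. The sketch below also spells out the weighted size |Vj|_β as the sum of the vertex weights over Vj, which is the natural reading of (2) (that equation is not reproduced in this excerpt); the function names are illustrative.

```python
import numpy as np

def weighted_size(beta, idx):
    """|Vj|_beta: sum of the vertex weights over one side of the partition."""
    return float(np.asarray(beta, dtype=float)[idx].sum())

def srcut(W, beta, idx1, idx2, alpha):
    """SRcut(V1, V2) = cut(V1, V2) - alpha * |V1|_beta * |V2|_beta."""
    W = np.asarray(W, dtype=float)
    cut = W[np.ix_(idx1, idx2)].sum()
    return cut - alpha * weighted_size(beta, idx1) * weighted_size(beta, idx2)
```

With unit vertex weights (β = 1), the regularizer reduces to α|V1||V2|, the form used in the graph bisection reduction of Section 4.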

41 The tradeoff between the inter-cluster similarity and the balance of the cut depends on the α parameter, which needs to be determined by the prior information on the size of clusters. [sent-67, score-0.627]

42 If α = 0, minimum SRcut will assign all vertices to one cluster. [sent-68, score-0.122]

43 At the other extreme, if α ≫ 0, minimum SRcut will generate two clusters of equal size (if N is an even number). [sent-69, score-0.234]

44 In a spirit similar to that of (3), we can define size regularized association (SRassoc) as SRassoc(V1, V2) = Σ_{i=1,2} cut(Vi, Vi) + 2α |V1|_β |V2|_β, where cut(Vi, Vi) measures the intra-cluster similarity. [sent-71, score-0.184]

45 Hence, minimizing size regularized cut is in fact identical to maximizing size regularized association. [sent-73, score-0.74]

46 In other words, minimizing the size regularized inter-cluster similarity is equivalent to maximizing the size regularized intra-cluster similarity. [sent-74, score-0.412]
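
The identity behind this equivalence can be spelled out in a few lines. A sketch using the definitions above, where T denotes the total edge weight of the graph (a symbol introduced here only for the derivation):

```latex
T = \sum_{i,j} W_{ij}
  = \mathrm{cut}(V_1, V_1) + \mathrm{cut}(V_2, V_2) + 2\,\mathrm{cut}(V_1, V_2),
\\
\mathrm{SRcut}(V_1, V_2)
  = \mathrm{cut}(V_1, V_2) - \alpha\,|V_1|_\beta |V_2|_\beta
  = \tfrac{1}{2}\Big(T - \sum_{i=1,2}\mathrm{cut}(V_i, V_i) - 2\alpha\,|V_1|_\beta |V_2|_\beta\Big)
  = \tfrac{1}{2}\big(T - \mathrm{SRassoc}(V_1, V_2)\big).
```

Since T is a constant for a given graph, minimizing SRcut over bipartitions is the same as maximizing SRassoc.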

47 In this paper, we will use SRcut as the clustering criterion. [sent-75, score-0.143]

48 3 Size ratio monotonicity Let V1 and V2 be a partition of V. [sent-76, score-0.212]

49 The size ratio r = min(|V1|_β, |V2|_β) / max(|V1|_β, |V2|_β) defines the relative size of the two clusters. [sent-77, score-0.217]

50 The following theorem shows that by controlling the parameter α in the SRcut cost function, one can control the balance of the optimal partition. [sent-79, score-0.119]

51 In addition, the size ratio increases monotonically with α. [sent-80, score-0.162]

52 1 (Size Ratio Monotonicity) Let V1^i and V2^i be the clusters generated by the minimum SRcut with α = αi, and the corresponding size ratio, ri, be defined as ri = min(|V1^i|_β, |V2^i|_β) / max(|V1^i|_β, |V2^i|_β). If α1 > α2 ≥ 0, then r1 ≥ r2. [sent-82, score-0.234]

53 Proof: Given vertex weight vector β, let S be the collection of all distinct values that the size regularization term in (3) can have, i.e., the distinct values of |V1|_β |V2|_β over all bipartitions of V. [sent-84, score-0.189]

54 Next, we define cut_i to be the minimal cut satisfying |V1|_β |V2|_β = Si. [sent-89, score-0.255]

55 That is, cut_i = min { cut(V1, V2) : V1 ∪ V2 = V, V1 ∩ V2 = ∅, |V1|_β |V2|_β = Si }, and then min { SRcut(V1, V2) : V1 ∪ V2 = V, V1 ∩ V2 = ∅ } = min_{i=1,···,|S|} (cut_i − α Si). [sent-91, score-0.33]

56 If V1^2 and V2^2 are the clusters generated by the minimum SRcut with α = α2, then |V1^2|_β |V2^2|_β = S_{k*} where k* = argmin_{i=1,···,|S|} (cut_i − α2 Si). [sent-92, score-0.156]

57 Now, let V1^1 and V2^1 be the clusters generated by the minimum SRcut with α = α1, and |V1^1|_β |V2^1|_β = S_{j*} where j* = argmin_{i=1,···,|S|} (cut_i − α1 Si). [sent-96, score-0.156]

58 Unfortunately, minimizing size regularized cut for an arbitrary α is an NP-complete problem. [sent-101, score-0.581]

59 4 Size regularized cut and graph bisection The decision problem for minimum SRcut can be formulated as: whether, given an undirected graph G(V, E, W) with weight vector β and regularization parameter α, a partition exists such that SRcut is less than a given cost. [sent-103, score-1.112]

60 Next we show that graph bisection can be reduced, in polynomial time, to minimum SRcut. [sent-105, score-0.309]

61 Since graph bisection is a classical NP-complete problem [1], so is minimum SRcut. [sent-106, score-0.291]

62 1 (Graph Bisection) Given an undirected graph G = G(V, E, W) with an even number of vertices, where W is the adjacency matrix, find a pair of disjoint subsets V1, V2 ⊂ V of equal size with V1 ∪ V2 = V, such that the number of edges between vertices in V1 and vertices in V2, i.e., the cut size, is minimized. [sent-108, score-0.47]

63 Let cut_i be the minimal cut such that the size of the smaller subset is i. [sent-114, score-0.694]

64 That is, cut_i = min { cut(V1, V2) : V1 ∪ V2 = V, V1 ∩ V2 = ∅, min(|V1|, |V2|) = i }. [sent-116, score-0.258]

65 Clearly, we have d* ≥ cut_{i+1} − cut_i and N − 2i − 1 ≥ 1 for 0 ≤ i ≤ N/2 − 1. [sent-117, score-0.222]

66 If 0 ≤ i ≤ N/2 − 1, then α(N − 2i − 1) > d* ≥ cut_{i+1} − cut_i. [sent-119, score-0.222]

67 This implies that cut_i − α i(N − i) > cut_{i+1} − α(i + 1)(N − i − 1), or, equivalently, min { cut(V1, V2) − α|V1||V2| : V1 ∪ V2 = V, V1 ∩ V2 = ∅, min(|V1|, |V2|) = i } > min { cut(V1, V2) − α|V1||V2| : V1 ∪ V2 = V, V1 ∩ V2 = ∅, min(|V1|, |V2|) = i + 1 } for 0 ≤ i ≤ N/2 − 1. [sent-120, score-0.294]

68 Hence, for any α > d*, minimizing SRcut is identical to minimizing cut(V1, V2) − α|V1||V2| with the constraint that |V1| = |V2| = N/2, V1 ∪ V2 = V, and V1 ∩ V2 = ∅, which is exactly the graph bisection problem since α|V1||V2| = αN²/4 is a constant. [sent-121, score-0.292]

69 5 An approximation algorithm for SRcut Given a partition of the vertex set V into two sets V1 and V2, let x ∈ {−1, 1}^N be an indicator vector such that xi = 1 if i ∈ V1 and xi = −1 if i ∈ V2. [sent-122, score-0.224]

70 Given W, α, and β, SRcut(V1, V2) can be expressed as a quadratic form in x (equation (7)), and argmin_{x∈{−1,1}^N} SRcut(x) = argmax_{x∈{−1,1}^N} x^T (W − αββ^T) x. If we define a normalized indicator vector y = (1/√N) x (i.e., y^T y = 1), the relaxed problem of maximizing y^T (W − αββ^T) y over unit-norm y becomes an eigenvalue problem. [sent-125, score-0.094]

71 The solution is the eigenvector corresponding to the largest eigenvalue of W − αββ^T (referred to as the largest eigenvector). [sent-129, score-0.187]

72 Similar to other spectral graph partitioning techniques that use top eigenvectors to approximate “optimal” partitions, the largest eigenvector of W − αββ^T provides a linear search direction, along which a splitting point can be found. [sent-130, score-0.386]

73 We use a simple approach by checking each element in the largest eigenvector as a possible splitting point. [sent-131, score-0.136]

74 The final partition is determined by the splitting point with the minimum SRcut value. [sent-135, score-0.222]
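
A minimal sketch of this relaxation-and-rounding step, assuming dense NumPy arrays and a given α; the helper names are illustrative and no attempt is made to reproduce the paper's Algorithm 1 verbatim.

```python
import numpy as np

def srcut_value(W, beta, mask, alpha):
    """SRcut of the bipartition given by a boolean mask (True marks V1)."""
    cut = W[np.ix_(mask, ~mask)].sum()
    return cut - alpha * beta[mask].sum() * beta[~mask].sum()

def srcut_bipartition(W, beta, alpha):
    """Approximate minimum-SRcut bipartition from the largest eigenvector of W - alpha*beta*beta^T."""
    W, beta = np.asarray(W, dtype=float), np.asarray(beta, dtype=float)
    M = W - alpha * np.outer(beta, beta)
    M = 0.5 * (M + M.T)                      # guard against round-off asymmetry
    _, vecs = np.linalg.eigh(M)
    v = vecs[:, -1]                          # eigenvector of the largest eigenvalue
    best_mask, best_val = None, np.inf
    for t in np.sort(v)[:-1]:                # every element is a candidate splitting point
        mask = v > t                         # both sides stay nonempty
        val = srcut_value(W, beta, mask, alpha)
        if val < best_val:
            best_mask, best_val = mask, val
    return best_mask, best_val
```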

75 The SRcut value of the partition generated by the largest eigenvector provides an upper bound for SRcut*. [sent-139, score-0.229]

76 As implied by the SRcut cost function in (3), the partition of the data set depends on the value of α, which determines the tradeoff between the inter-cluster similarity and the balance of the partition. [sent-140, score-0.332]

77 Theorem 3.1 indicates that as α increases, the size ratio of the clusters generated by the optimal partition increases monotonically. [sent-142, score-0.37]

78 Although Theorem 3.1 is not proved for the approximated partition derived above, our empirical study shows that, in general, the size ratio of the approximated partition increases along with α. [sent-146, score-0.431]

79 Therefore, we use the prior information on the size of the clusters to select α. [sent-147, score-0.197]

80 Specifically, we define the expected size ratio, R, as R = min(s1, s2) / max(s1, s2), where s1 and s2 are the expected sizes of the two clusters (known a priori). [sent-148, score-0.257]

81 We then search for a value of α such that the resulting size ratio is close to R. [sent-149, score-0.139]

82 A simple one-dimensional search method based on bracketing and bisection is implemented [13]. [sent-150, score-0.117]

83 The pseudo code of the searching algorithm is given in Algorithm 1 along with the rest of the clustering procedure. [sent-151, score-0.166]

84 The input of the algorithm is the graph affinity matrix W, the weight vector β, the expected size ratio R, and α0 > 0 (the initial value of α). [sent-152, score-0.284]

85 If the expected size ratio R is unknown, one can estimate R by assuming that the data are i.i.d. [sent-155, score-0.139]

86 The final partition is chosen to be the one with the minimum cut value by assuming that a “good” partition should have a small cut. [sent-168, score-0.709]
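
One possible rendering of this search, reusing the srcut_bipartition helper sketched above and relying on the empirically monotone size ratio: double α until the target ratio R is bracketed, then bisect, and finally keep the candidate partition with the smallest cut. The tolerance and iteration limits are illustrative, and this is not a transcription of the paper's Algorithm 1.

```python
import numpy as np

def size_ratio(beta, mask):
    s1, s2 = beta[mask].sum(), beta[~mask].sum()
    return min(s1, s2) / max(s1, s2)

def srcut_cluster(W, beta, R, alpha0=1e-3, tol=0.01, max_iter=50):
    """Bipartition whose size ratio is close to R, found by bracketing and bisection on alpha."""
    W, beta = np.asarray(W, dtype=float), np.asarray(beta, dtype=float)
    lo, hi = 0.0, alpha0
    mask, _ = srcut_bipartition(W, beta, hi)
    for _ in range(60):                      # bracketing: grow alpha until the ratio reaches R
        if size_ratio(beta, mask) >= R:
            break
        lo, hi = hi, 2.0 * hi
        mask, _ = srcut_bipartition(W, beta, hi)
    candidates = [mask]
    for _ in range(max_iter):                # bisection on alpha
        mid = 0.5 * (lo + hi)
        mask, _ = srcut_bipartition(W, beta, mid)
        candidates.append(mask)
        r = size_ratio(beta, mask)
        if abs(r - R) <= tol:
            break
        if r < R:
            lo = mid
        else:
            hi = mid
    # keep the candidate with the smallest cut, on the premise that a good partition has a small cut
    cuts = [W[np.ix_(m, ~m)].sum() for m in candidates]
    return candidates[int(np.argmin(cuts))]
```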

87 Similar to other spectral graph clustering methods, the time complexity of SRcut can be significantly reduced if the affinity matrix W is sparse, i.e., if most of its entries are zero. [sent-172, score-0.348]
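
For sparse W, one way (an implementation choice, not prescribed by the paper) to exploit the sparsity is to apply W − αββ^T matrix-free, so the dense rank-one term is never formed, and to use SciPy's sparse eigensolver:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def largest_eigvec_sparse(W, beta, alpha):
    """Largest eigenvector of W - alpha*beta*beta^T without densifying the rank-one term."""
    W = sp.csr_matrix(W)
    beta = np.asarray(beta, dtype=float)
    n = W.shape[0]

    def matvec(x):
        x = np.asarray(x).ravel()
        # (W - alpha * beta beta^T) x, keeping W sparse
        return W.dot(x) - alpha * beta * float(beta @ x)

    op = spla.LinearOperator((n, n), matvec=matvec, dtype=float)
    _, vecs = spla.eigsh(op, k=1, which='LA')   # largest algebraic eigenvalue
    return vecs[:, 0]
```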

88 This leads to a data set of 50 clusters with a total of 9102 documents. [sent-183, score-0.101]

89 We pair each cluster with another cluster to form a data set, so that 190 test data sets are generated. [sent-186, score-0.095]

90 Clearly, SRcut outperforms the normalized cut on both data sets. [sent-194, score-0.464]

91 SRcut performs significantly better than normalized cut on the 20-Newsgroups data set. [sent-195, score-0.464]

92 The results suggest that SRcut is less sensitive to outliers than normalized cut. [sent-197, score-0.154]

93 8 Conclusions We proposed size regularized cut, a novel method that enables users to specify prior knowledge of the size of two clusters in spectral clustering. [sent-198, score-0.489]

94 The SRcut cost function takes into account the inter-cluster similarity and the relative size of the two clusters. [sent-206, score-0.144]

95 The “optimal” partition of the data set corresponds to a tradeoff between the inter-cluster similarity and the balance of the partition. [sent-207, score-0.267]

96 We proved that finding a partition with minimum SRcut is an NP-complete problem. [sent-208, score-0.203]

97 Evaluations over different data sets indicate that the method is not sensitive to outliers and performs better than normalized cut. [sent-210, score-0.173]

98 The SRcut model can be easily adapted to solve multiple-cluster problems by applying the clustering method recursively/iteratively on the data sets. [sent-211, score-0.143]
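
A minimal sketch of that recursive extension, reusing the srcut_bipartition helper sketched earlier; always splitting the largest remaining cluster and using a fixed α per split are illustrative choices, not the paper's prescription.

```python
import numpy as np

def recursive_srcut(W, beta, alpha, k):
    """Split the data into k clusters by recursive SRcut bi-partitioning."""
    W, beta = np.asarray(W, dtype=float), np.asarray(beta, dtype=float)
    assert 2 <= k <= W.shape[0]
    clusters = [np.arange(W.shape[0])]
    while len(clusters) < k:
        i = int(np.argmax([len(c) for c in clusters]))   # split the largest cluster next
        idx = clusters.pop(i)
        sub_W, sub_beta = W[np.ix_(idx, idx)], beta[idx]
        mask, _ = srcut_bipartition(sub_W, sub_beta, alpha)
        clusters.extend([idx[mask], idx[~mask]])
    return clusters
```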

99 Since graph bisection can be reduced to SRcut, the proposed approximation algorithm provides a new spectral technique for graph bisection. [sent-212, score-0.422]

100 Comparing SRcut with other graph bisection algorithms is therefore an interesting future work. [sent-213, score-0.236]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('srcut', 0.76), ('cut', 0.394), ('cuti', 0.222), ('clustering', 0.143), ('partition', 0.13), ('graph', 0.119), ('bisection', 0.117), ('clusters', 0.101), ('regularized', 0.081), ('size', 0.078), ('normalized', 0.07), ('spectral', 0.067), ('vertices', 0.067), ('similarity', 0.066), ('cost', 0.065), ('partitioning', 0.064), ('srassoc', 0.063), ('outliers', 0.063), ('ratio', 0.061), ('eigenvector', 0.055), ('minimum', 0.055), ('vertex', 0.051), ('largest', 0.044), ('ding', 0.041), ('argmini', 0.041), ('vj', 0.04), ('sk', 0.039), ('cluster', 0.038), ('balanced', 0.038), ('balance', 0.038), ('splitting', 0.037), ('undirected', 0.037), ('min', 0.036), ('int', 0.035), ('si', 0.034), ('regularization', 0.034), ('wij', 0.033), ('tradeoff', 0.033), ('cutk', 0.032), ('cutt', 0.032), ('kernighan', 0.032), ('wagstaff', 0.032), ('yixin', 0.032), ('nity', 0.031), ('multiplicative', 0.03), ('users', 0.029), ('documents', 0.029), ('minimizing', 0.028), ('vi', 0.027), ('shi', 0.027), ('assigned', 0.027), ('named', 0.026), ('weight', 0.026), ('af', 0.026), ('grouping', 0.025), ('cuts', 0.025), ('zha', 0.025), ('banerjee', 0.025), ('ghosh', 0.025), ('gu', 0.025), ('association', 0.025), ('user', 0.024), ('indicator', 0.024), ('dhillon', 0.023), ('monotonically', 0.023), ('searching', 0.023), ('relaxed', 0.023), ('km', 0.022), ('repeat', 0.022), ('document', 0.021), ('sensitive', 0.021), ('monotonicity', 0.021), ('enforcing', 0.021), ('mutual', 0.021), ('mining', 0.02), ('knowledge', 0.02), ('penalty', 0.02), ('interval', 0.019), ('optimization', 0.019), ('sets', 0.019), ('lin', 0.019), ('complexity', 0.019), ('evaluations', 0.019), ('polynomial', 0.018), ('prior', 0.018), ('adjacency', 0.018), ('xing', 0.018), ('eigenvalue', 0.018), ('proved', 0.018), ('weighted', 0.018), ('disjoint', 0.017), ('st', 0.017), ('foundations', 0.017), ('nonnegative', 0.017), ('simon', 0.017), ('enables', 0.017), ('yu', 0.016), ('manually', 0.016), ('approximated', 0.016), ('theorem', 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 177 nips-2005-Size Regularized Cut for Data Clustering

2 0.35414645 13 nips-2005-A Probabilistic Approach for Optimizing Spectral Clustering

Author: Rong Jin, Feng Kang, Chris H. Ding

Abstract: Spectral clustering enjoys its success in both data clustering and semisupervised learning. But, most spectral clustering algorithms cannot handle multi-class clustering problems directly. Additional strategies are needed to extend spectral clustering algorithms to multi-class clustering problems. Furthermore, most spectral clustering algorithms employ hard cluster membership, which is likely to be trapped by the local optimum. In this paper, we present a new spectral clustering algorithm, named “Soft Cut”. It improves the normalized cut algorithm by introducing soft membership, and can be efficiently computed using a bound optimization algorithm. Our experiments with a variety of datasets have shown the promising performance of the proposed clustering algorithm. 1

3 0.20763379 75 nips-2005-Fixing two weaknesses of the Spectral Method

Author: Kevin Lang

Abstract: We discuss two intrinsic weaknesses of the spectral graph partitioning method, both of which have practical consequences. The first is that spectral embeddings tend to hide the best cuts from the commonly used hyperplane rounding method. Rather than cleaning up the resulting suboptimal cuts with local search, we recommend the adoption of flow-based rounding. The second weakness is that for many “power law” graphs, the spectral method produces cuts that are highly unbalanced, thus decreasing the usefulness of the method for visualization (see figure 4(b)) or as a basis for divide-and-conquer algorithms. These balance problems, which occur even though the spectral method’s quotient-style objective function does encourage balance, can be fixed with a stricter balance constraint that turns the spectral mathematical program into an SDP that can be solved for million-node graphs by a method of Burer and Monteiro. 1

4 0.17809971 178 nips-2005-Soft Clustering on Graphs

Author: Kai Yu, Shipeng Yu, Volker Tresp

Abstract: We propose a simple clustering framework on graphs encoding pairwise data similarities. Unlike usual similarity-based methods, the approach softly assigns data to clusters in a probabilistic way. More importantly, a hierarchical clustering is naturally derived in this framework to gradually merge lower-level clusters into higher-level ones. A random walk analysis indicates that the algorithm exposes clustering structures in various resolutions, i.e., a higher level statistically models a longer-term diffusion on graphs and thus discovers a more global clustering structure. Finally we provide very encouraging experimental results. 1

5 0.1163355 159 nips-2005-Q-Clustering

Author: Mukund Narasimhan, Nebojsa Jojic, Jeff A. Bilmes

Abstract: We show that Queyranne’s algorithm for minimizing symmetric submodular functions can be used for clustering with a variety of different objective functions. Two specific criteria that we consider in this paper are the single linkage and the minimum description length criteria. The first criterion tries to maximize the minimum distance between elements of different clusters, and is inherently “discriminative”. It is known that optimal clusterings into k clusters for any given k in polynomial time for this criterion can be computed. The second criterion seeks to minimize the description length of the clusters given a probabilistic generative model. We show that the optimal partitioning into 2 clusters, and approximate partitioning (guaranteed to be within a factor of 2 of the the optimal) for more clusters can be computed. To the best of our knowledge, this is the first time that a tractable algorithm for finding the optimal clustering with respect to the MDL criterion for 2 clusters has been given. Besides the optimality result for the MDL criterion, the chief contribution of this paper is to show that the same algorithm can be used to optimize a broad class of criteria, and hence can be used for many application specific criterion for which efficient algorithm are not known.

6 0.099976279 102 nips-2005-Kernelized Infomax Clustering

7 0.094156757 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning

8 0.083422266 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators

9 0.081483111 127 nips-2005-Mixture Modeling by Affinity Propagation

10 0.076434053 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning

11 0.076324776 79 nips-2005-Fusion of Similarity Data in Clustering

12 0.057767086 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration

13 0.055860229 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models

14 0.051661693 41 nips-2005-Coarse sample complexity bounds for active learning

15 0.051171266 107 nips-2005-Large scale networks fingerprinting and visualization using the k-core decomposition

16 0.048138384 84 nips-2005-Generalization in Clustering with Unobserved Features

17 0.045521617 116 nips-2005-Learning Topology with the Generative Gaussian Graph and the EM Algorithm

18 0.044783007 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning

19 0.043618187 125 nips-2005-Message passing for task redistribution on sparse graphs

20 0.041469492 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.155), (1, 0.108), (2, -0.079), (3, -0.093), (4, -0.379), (5, -0.096), (6, -0.004), (7, -0.012), (8, -0.244), (9, 0.081), (10, -0.132), (11, -0.009), (12, 0.057), (13, 0.057), (14, 0.147), (15, -0.032), (16, 0.119), (17, 0.037), (18, -0.035), (19, -0.007), (20, 0.078), (21, -0.092), (22, -0.047), (23, -0.042), (24, -0.06), (25, 0.124), (26, -0.048), (27, 0.009), (28, 0.045), (29, 0.005), (30, 0.076), (31, -0.028), (32, 0.025), (33, -0.068), (34, 0.016), (35, -0.002), (36, -0.027), (37, 0.154), (38, -0.045), (39, -0.179), (40, -0.05), (41, 0.086), (42, -0.017), (43, 0.025), (44, 0.048), (45, 0.076), (46, 0.004), (47, -0.007), (48, -0.106), (49, -0.093)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96258289 177 nips-2005-Size Regularized Cut for Data Clustering

2 0.88988572 13 nips-2005-A Probabilistic Approach for Optimizing Spectral Clustering

3 0.81538141 75 nips-2005-Fixing two weaknesses of the Spectral Method

4 0.56463754 159 nips-2005-Q-Clustering

5 0.53934729 178 nips-2005-Soft Clustering on Graphs

6 0.43097684 102 nips-2005-Kernelized Infomax Clustering

7 0.33516267 127 nips-2005-Mixture Modeling by Affinity Propagation

8 0.32795689 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators

9 0.32572699 79 nips-2005-Fusion of Similarity Data in Clustering

10 0.3164911 107 nips-2005-Large scale networks fingerprinting and visualization using the k-core decomposition

11 0.28456131 84 nips-2005-Generalization in Clustering with Unobserved Features

12 0.27776024 71 nips-2005-Fast Krylov Methods for N-Body Learning

13 0.27579483 180 nips-2005-Spectral Bounds for Sparse PCA: Exact and Greedy Algorithms

14 0.24896717 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning

15 0.23791501 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning

16 0.22776893 116 nips-2005-Learning Topology with the Generative Gaussian Graph and the EM Algorithm

17 0.20285273 125 nips-2005-Message passing for task redistribution on sparse graphs

18 0.20041955 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning

19 0.20028076 205 nips-2005-Worst-Case Bounds for Gaussian Process Models

20 0.20021461 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.094), (10, 0.128), (14, 0.166), (27, 0.021), (31, 0.032), (34, 0.099), (39, 0.023), (41, 0.019), (44, 0.024), (55, 0.036), (69, 0.062), (73, 0.077), (88, 0.061), (91, 0.036)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.84264398 177 nips-2005-Size Regularized Cut for Data Clustering

2 0.84227395 166 nips-2005-Robust Fisher Discriminant Analysis

Author: Seung-jean Kim, Alessandro Magnani, Stephen Boyd

Abstract: Fisher linear discriminant analysis (LDA) can be sensitive to the problem data. Robust Fisher LDA can systematically alleviate the sensitivity problem by explicitly incorporating a model of data uncertainty in a classification problem and optimizing for the worst-case scenario under this model. The main contribution of this paper is show that with general convex uncertainty models on the problem data, robust Fisher LDA can be carried out using convex optimization. For a certain type of product form uncertainty model, robust Fisher LDA can be carried out at a cost comparable to standard Fisher LDA. The method is demonstrated with some numerical examples. Finally, we show how to extend these results to robust kernel Fisher discriminant analysis, i.e., robust Fisher LDA in a high dimensional feature space. 1

3 0.73165786 110 nips-2005-Learning Depth from Single Monocular Images

Author: Ashutosh Saxena, Sung H. Chung, Andrew Y. Ng

Abstract: We consider the task of depth estimation from a single monocular image. We take a supervised learning approach to this problem, in which we begin by collecting a training set of monocular images (of unstructured outdoor environments which include forests, trees, buildings, etc.) and their corresponding ground-truth depthmaps. Then, we apply supervised learning to predict the depthmap as a function of the image. Depth estimation is a challenging problem, since local features alone are insufficient to estimate depth at a point, and one needs to consider the global context of the image. Our model uses a discriminatively-trained Markov Random Field (MRF) that incorporates multiscale local- and global-image features, and models both depths at individual points as well as the relation between depths at different points. We show that, even on unstructured scenes, our algorithm is frequently able to recover fairly accurate depthmaps. 1

4 0.72521567 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration

Author: Benjamin V. Roy

Abstract: We consider approximate value iteration with a parameterized approximator in which the state space is partitioned and the optimal cost-to-go function over each partition is approximated by a constant. We establish performance loss bounds for policies derived from approximations associated with fixed points. These bounds identify benefits to having projection weights equal to the invariant distribution of the resulting policy. Such projection weighting leads to the same fixed points as TD(0). Our analysis also leads to the first performance loss bound for approximate value iteration with an average cost objective. 1 Preliminaries Consider a discrete-time communicating Markov decision process (MDP) with a finite state space S = {1, . . . , |S|}. At each state x ∈ S, there is a finite set Ux of admissible actions. If the current state is x and an action u ∈ Ux is selected, a cost of gu (x) is incurred, and the system transitions to a state y ∈ S with probability pxy (u). For any x ∈ S and u ∈ Ux , y∈S pxy (u) = 1. Costs are discounted at a rate of α ∈ (0, 1) per period. Each instance of such an MDP is defined by a quintuple (S, U, g, p, α). A (stationary deterministic) policy is a mapping µ that assigns an action u ∈ Ux to each state x ∈ S. If actions are selected based on a policy µ, the state follows a Markov process with transition matrix Pµ , where each (x, y)th entry is equal to pxy (µ(x)). The restriction to communicating MDPs ensures that it is possible to reach any state from any other state. Each policy µ is associated with a cost-to-go function Jµ ∈ |S| , defined by Jµ = ∞ t t −1 gµ , where, with some abuse of notation, gµ (x) = gµ(x) (x) t=0 α Pµ gµ = (I − αPµ ) for each x ∈ S. A policy µ is said to be greedy with respect to a function J if µ(x) ∈ argmin(gu (x) + α y∈S pxy (u)J(y)) for all x ∈ S. u∈Ux The optimal cost-to-go function J ∗ ∈ |S| is defined by J ∗ (x) = minµ Jµ (x), for all x ∈ S. A policy µ∗ is said to be optimal if Jµ∗ = J ∗ . It is well-known that an optimal policy exists. Further, a policy µ∗ is optimal if and only if it is greedy with respect to J ∗ . Hence, given the optimal cost-to-go function, optimal actions can computed be minimizing the right-hand side of the above inclusion. Value iteration generates a sequence J converging to J ∗ according to J +1 = T J , where T is the dynamic programming operator, defined by (T J)(x) = minu∈Ux (gu (x) + α y∈S pxy (u)J(y)), for all x ∈ S and J ∈ |S| . This sequence converges to J ∗ for any initialization of J0 . 2 Approximate Value Iteration The state spaces of relevant MDPs are typically so large that computation and storage of a cost-to-go function is infeasible. One approach to dealing with this obstacle involves partitioning the state space S into a manageable number K of disjoint subsets S1 , . . . , SK and approximating the optimal cost-to-go function with a function that is constant over each partition. This can be thought of as a form of state aggregation – all states within a given partition are assumed to share a common optimal cost-to-go. To represent an approximation, we define a matrix Φ ∈ |S|×K such that each kth column is an indicator function for the kth partition Sk . Hence, for any r ∈ K , k, and x ∈ Sk , (Φr)(x) = rk . In this paper, we study variations of value iteration, each of which computes a vector r so that Φr approximates J ∗ . 
The use of such a policy µr which is greedy with respect to Φr is justified by the following result (see [10] for a proof): ˜ Theorem 1 If µ is a greedy policy with respect to a function J ∈ Jµ − J ∗ ≤ ∞ 2α ˜ J∗ − J 1−α |S| then ∞. One common way of approximating a function J ∈ |S| with a function of the form Φr involves projection with respect to a weighted Euclidean norm · π . The weighted Euclidean 1/2 |S| 2 norm: J 2,π = . Here, π ∈ + is a vector of weights that assign x∈S π(x)J (x) relative emphasis among states. The projection Ππ J is the function Φr that attains the minimum of J −Φr 2,π ; if there are multiple functions Φr that attain the minimum, they must form an affine space, and the projection is taken to be the one with minimal norm Φr 2,π . Note that in our context, where each kth column of Φ represents an indicator function for the kth partition, for any π, J, and x ∈ Sk , (Ππ J)(x) = y∈Sk π(y)J(y)/ y∈Sk π(y). Approximate value iteration begins with a function Φr(0) and generates a sequence according to Φr( +1) = Ππ T Φr( ) . It is well-known that the dynamic programming operator T is a contraction mapping with respect to the maximum norm. Further, Ππ is maximum-norm nonexpansive [16, 7, 8]. (This is not true for general Φ, but is true in our context in which columns of Φ are indicator functions for partitions.) It follows that the composition Ππ T is a contraction mapping. By the contraction mapping theorem, Ππ T has a unique fixed point Φ˜, which is the limit of the sequence Φr( ) . Further, the following result holds: r Theorem 2 For any MDP, partition, and weights π with support intersecting every partition, if Φ˜ = Ππ T Φ˜ then r r Φ˜ − J ∗ r ∞ ≤ 2 min J ∗ − Φr 1 − α r∈ K and (1 − α) Jµr − J ∗ ˜ ∞ ≤ ∞, 4α min J ∗ − Φr 1 − α r∈ K ∞. The first inequality of the theorem is an approximation error bound, established in [16, 7, 8] for broader classes of approximators that include state aggregation as a special case. The second is a performance loss bound, derived by simply combining the approximation error bound and Theorem 1. Note that Jµr (x) ≥ J ∗ (x) for all x, so the left-hand side of the performance loss bound ˜ is the maximal increase in cost-to-go, normalized by 1 − α. This normalization is natural, since a cost-to-go function is a linear combination of expected future costs, with coefficients 1, α, α2 , . . ., which sum to 1/(1 − α). Our motivation of the normalizing constant begs the question of whether, for fixed MDP parameters (S, U, g, p) and fixed Φ, minr J ∗ − Φr ∞ also grows with 1/(1 − α). It turns out that minr J ∗ − Φr ∞ = O(1). To see why, note that for any µ, Jµ = (I − αPµ )−1 gµ = 1 λ µ + hµ , 1−α where λµ (x) is the expected average cost if the process starts in state x and is controlled by policy µ, τ −1 1 t λµ = lim Pµ gµ , τ →∞ τ t=0 and hµ is the discounted differential cost function hµ = (I − αPµ )−1 (gµ − λµ ). Both λµ and hµ converge to finite vectors as α approaches 1 [3]. For an optimal policy µ∗ , limα↑1 λµ∗ (x) does not depend on x (in our context of a communicating MDP). Since constant functions lie in the range of Φ, lim min J ∗ − Φr α↑1 r∈ K ∞ ≤ lim hµ∗ α↑1 ∞ < ∞. The performance loss bound still exhibits an undesirable dependence on α through the coefficient 4α/(1 − α). In most relevant contexts, α is close to 1; a representative value might be 0.99. Consequently, 4α/(1 − α) can be very large. Unfortunately, the bound is sharp, as expressed by the following theorem. We will denote by 1 the vector with every component equal to 1. 
Theorem 3 For any δ > 0, α ∈ (0, 1), and ∆ ≥ 0, there exist MDP parameters (S, U, g, p) and a partition such that min_{r∈ℝ^K} ‖J* − Φr‖_∞ = ∆ and, if Φr̃ = Π_π T Φr̃ with π = 1,
(1−α) ‖J_{µ_r̃} − J*‖_∞ ≥ (4α/(1−α)) min_{r∈ℝ^K} ‖J* − Φr‖_∞ − δ.

This theorem is established through an example in [22]. The choice of uniform weights (π = 1) is meant to point out that even for such a simple, perhaps natural, choice of weights, the performance loss bound is sharp.

Based on Theorems 2 and 3, one might expect that there exist MDP parameters (S, U, g, p) and a partition such that, with π = 1,
(1−α) ‖J_{µ_r̃} − J*‖_∞ = Θ((1/(1−α)) min_{r∈ℝ^K} ‖J* − Φr‖_∞).
In other words, that the performance loss is both lower and upper bounded by 1/(1−α) times the smallest possible approximation error. It turns out that this is not true, at least if we restrict to a finite state space. However, as the following theorem establishes, the coefficient multiplying min_{r∈ℝ^K} ‖J* − Φr‖_∞ can grow arbitrarily large as α increases, keeping all else fixed.

Theorem 4 For any L and ∆ ≥ 0, there exist MDP parameters (S, U, g, p) and a partition such that lim_{α↑1} min_{r∈ℝ^K} ‖J* − Φr‖_∞ = ∆ and, if Φr̃ = Π_π T Φr̃ with π = 1,
lim inf_{α↑1} (1−α)(J_{µ_r̃}(x) − J*(x)) ≥ L lim_{α↑1} min_{r∈ℝ^K} ‖J* − Φr‖_∞
for all x ∈ S.

This theorem is also established through an example [22]. For any µ and x,
lim_{α↑1} ((1−α) J_µ(x) − λ_µ(x)) = lim_{α↑1} (1−α) h_µ(x) = 0.
Combined with Theorem 4, this yields the following corollary.

Corollary 1 For any L and ∆ ≥ 0, there exist MDP parameters (S, U, g, p) and a partition such that lim_{α↑1} min_{r∈ℝ^K} ‖J* − Φr‖_∞ = ∆ and, if Φr̃ = Π_π T Φr̃ with π = 1,
lim inf_{α↑1} (λ_{µ_r̃}(x) − λ_{µ*}(x)) ≥ L lim_{α↑1} min_{r∈ℝ^K} ‖J* − Φr‖_∞
for all x ∈ S.

3 Using the Invariant Distribution

In the previous section, we considered an approximation Φr̃ that solves Π_π T Φr̃ = Φr̃ for some arbitrary pre-selected weights π. We now turn to consider use of an invariant state distribution π_r̃ of P_{µ_r̃} as the weight vector.¹ This leads to a circular definition: the weights are used in defining r̃, and now we are defining the weights in terms of r̃. What we are really after here is a vector r̃ that satisfies Π_{π_r̃} T Φr̃ = Φr̃. The following theorem captures the associated benefits. (Due to space limitations, we omit the proof, which is provided in the full-length version of this paper [22].)

Theorem 5 For any MDP and partition, if Φr̃ = Π_{π_r̃} T Φr̃ and π_r̃ has support intersecting every partition, then
(1−α) π_r̃^T (J_{µ_r̃} − J*) ≤ 2α min_{r∈ℝ^K} ‖J* − Φr‖_∞.

When α is close to 1, which is typical, the right-hand side of our new performance loss bound is far less than that of Theorem 2. The primary improvement is in the omission of a factor of 1 − α from the denominator. But for the bounds to be compared in a meaningful way, we must also relate the left-hand-side expressions. A relation can be based on the fact that for all µ, lim_{α↑1} ‖(1−α) J_µ − λ_µ‖_∞ = 0, as explained in Section 2. In particular, based on this, we have
lim_{α↑1} (1−α) ‖J_µ − J*‖_∞ = |λ_µ − λ*| = λ_µ − λ* = lim_{α↑1} π^T (J_µ − J*)
for all policies µ and probability distributions π. Hence, the left-hand-side expressions from the two performance bounds become directly comparable as α approaches 1.

Another interesting comparison can be made by contrasting Corollary 1 against the following immediate consequence of Theorem 5.

Corollary 2 For all MDP parameters (S, U, g, p) and partitions, if Φr̃ = Π_{π_r̃} T Φr̃ and lim inf_{α↑1} Σ_{x∈S_k} π_r̃(x) > 0 for all k, then
lim sup_{α↑1} ‖λ_{µ_r̃} − λ_{µ*}‖_∞ ≤ 2 lim_{α↑1} min_{r∈ℝ^K} ‖J* − Φr‖_∞.

The comparison suggests that solving Φr̃ = Π_{π_r̃} T Φr̃ is strongly preferable to solving Φr̃ = Π_π T Φr̃ with π = 1.

¹ By an invariant state distribution of a transition matrix P, we mean any probability distribution π such that π^T P = π^T. In the event that P_{µ_r̃} has multiple invariant distributions, π_r̃ denotes an arbitrary choice.
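The fixed point Φr̃ = Π_{π_r̃} T Φr̃ is circular, and, as noted in the next section, it need not even have a solution. Purely as an illustration of the self-consistent weighting, the sketch below alternates between extracting a greedy policy, computing an invariant distribution of its chain, and re-projecting. It reuses `project_pi` from the earlier sketch, assumes the induced chain has a unique invariant distribution whose support meets every partition, and is a heuristic illustration rather than an algorithm from the paper.

```python
import numpy as np

def invariant_distribution(P_mu):
    """Invariant distribution of a transition matrix: solve pi^T P = pi^T,
    sum(pi) = 1, via the eigenvector of P^T for the eigenvalue closest to 1.
    Assumes the chain has a unique invariant distribution."""
    eigvals, eigvecs = np.linalg.eig(P_mu.T)
    idx = np.argmin(np.abs(eigvals - 1.0))
    pi = np.abs(np.real(eigvecs[:, idx]))
    return pi / pi.sum()

def self_consistent_weights(g, P, alpha, partition, K, iters=200):
    """Heuristic iteration toward Phi r = Pi_{pi_r} T Phi r: extract the greedy
    policy for Phi r, recompute the invariant distribution of its chain, and
    re-project.  Not guaranteed to converge; the equation may have no solution."""
    num_actions, num_states = len(g), len(g[0])
    r, pi = np.zeros(K), np.full(num_states, 1.0 / num_states)
    for _ in range(iters):
        J = r[partition]                                    # Phi r
        Q = np.array([g[u] + alpha * P[u] @ J for u in range(num_actions)])
        mu = Q.argmin(axis=0)                               # greedy policy
        P_mu = np.array([P[mu[x]][x] for x in range(num_states)])
        pi = invariant_distribution(P_mu)                   # weights from the policy
        r = project_pi(Q.min(axis=0), partition, pi, K)     # reuses earlier sketch
    return r, pi
```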
4 Exploration

If a vector r̃ solves Φr̃ = Π_{π_r̃} T Φr̃ and the support of π_r̃ intersects every partition, Theorem 5 promises a desirable bound. However, there are two significant shortcomings to this solution concept, which we will address in this section. First, in some cases, the equation Π_{π_r̃} T Φr̃ = Φr̃ does not have a solution. It is easy to produce examples of this; though no example has been documented for the particular class of approximators we are using here, [2] offers an example involving a different linearly parameterized approximator that captures the spirit of what can happen. Second, it would be nice to relax the requirement that the support of π_r̃ intersect every partition.

To address these shortcomings, we introduce stochastic policies. A stochastic policy µ maps state-action pairs to probabilities. For each x ∈ S and u ∈ U_x, µ(x, u) is the probability of taking action u when in state x. Hence, µ(x, u) ≥ 0 for all x ∈ S and u ∈ U_x, and Σ_{u∈U_x} µ(x, u) = 1 for all x ∈ S. Given a scalar ε > 0 and a function J, the ε-greedy Boltzmann exploration policy with respect to J is defined by
µ(x, u) = e^{−(T_u J)(x)(|U_x|−1)/(εe)} / Σ_{u'∈U_x} e^{−(T_{u'} J)(x)(|U_x|−1)/(εe)}.
For any ε > 0 and r, let µ_r^ε denote the ε-greedy Boltzmann exploration policy with respect to Φr. Further, we define a modified dynamic programming operator that incorporates Boltzmann exploration:
(T_ε J)(x) = Σ_{u∈U_x} e^{−(T_u J)(x)(|U_x|−1)/(εe)} (T_u J)(x) / Σ_{u'∈U_x} e^{−(T_{u'} J)(x)(|U_x|−1)/(εe)}.
As ε approaches 0, ε-greedy Boltzmann exploration policies become greedy and the modified dynamic programming operators become the dynamic programming operator. More precisely, for all r, x, and J, lim_{ε↓0} µ_r^ε(x, µ_r(x)) = 1 and lim_{ε↓0} T_ε J = T J. These are immediate consequences of the following result (see [4] for a proof).

Lemma 1 For any n and v ∈ ℝ^n,
min_i v_i + ε ≥ Σ_i e^{−v_i(n−1)/(εe)} v_i / Σ_i e^{−v_i(n−1)/(εe)} ≥ min_i v_i.

Because we are only concerned with communicating MDPs, there is a unique invariant state distribution associated with each ε-greedy Boltzmann exploration policy µ_r^ε, and the support of this distribution is S. Let π_r^ε denote this distribution. We consider a vector r̃ that solves Φr̃ = Π_{π_r̃^ε} T_ε Φr̃. For any ε > 0, there exists a solution to this equation (this is an immediate extension of Theorem 5.1 from [4]). We have the following performance loss bound, which parallels Theorem 5 but with an equation for which a solution is guaranteed to exist and without any requirement on the resulting invariant distribution. (Again, we omit the proof, which is available in [22].)

Theorem 6 For any MDP, partition, and ε > 0, if Φr̃ = Π_{π_r̃^ε} T_ε Φr̃ then
(1−α)(π_r̃^ε)^T (J_{µ_r̃^ε} − J*) ≤ 2α min_{r∈ℝ^K} ‖J* − Φr‖_∞ + ε.
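The ε-greedy Boltzmann exploration policy and the modified operator T_ε amount to replacing the minimum over actions with a softmin. A minimal sketch follows, assuming every action is admissible in every state and reusing the tabular arrays from the earlier sketches; names are illustrative.

```python
import numpy as np

def boltzmann_weights(TuJ_x, eps):
    """Softmin weights over actions at one state, using the temperature
    eps*e/(|U_x|-1) appearing in the Boltzmann policy definition above."""
    v = np.asarray(TuJ_x, dtype=float)
    z = -v * (len(v) - 1) / (eps * np.e)
    z -= z.max()                      # numerical stabilization
    w = np.exp(z)
    return w / w.sum()

def T_eps(J, g, P, alpha, eps):
    """Modified operator: (T_eps J)(x) is the Boltzmann-weighted average of
    (T_u J)(x) over actions, rather than the minimum."""
    num_actions, num_states = len(g), len(g[0])
    TuJ = np.array([g[u] + alpha * P[u] @ J for u in range(num_actions)])
    out = np.empty(num_states)
    for x in range(num_states):
        w = boltzmann_weights(TuJ[:, x], eps)
        out[x] = np.dot(w, TuJ[:, x])
    return out
```

By Lemma 1, the weighted average computed in `T_eps` lies within ε of the corresponding minimum, which is the source of the additive ε in Theorem 6.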
5 Computation: TD(0)

Though computation is not a focus of this paper, we offer a brief discussion here. First, we describe a simple algorithm from [16], which draws on ideas from temporal-difference learning [11, 12] and Q-learning [23, 24] to solve Φr̃ = Π_π T Φr̃. It requires an ability to sample a sequence of states x^(0), x^(1), x^(2), . . ., each independent and identically distributed according to π. Also required is a way to efficiently compute (T Φr)(x) = min_{u∈U_x} (g_u(x) + α Σ_{y∈S} p_{xy}(u)(Φr)(y)) for any given x and r. This is typically possible when the action set U_x and the support of p_{x·}(u) (i.e., the set of states that can follow x if action u is selected) are not too large. The algorithm generates a sequence of vectors r^(ℓ) according to
r^(ℓ+1) = r^(ℓ) + γ_ℓ φ(x^(ℓ)) ((T Φr^(ℓ))(x^(ℓ)) − (Φr^(ℓ))(x^(ℓ))),
where γ_ℓ is a step size and φ(x) denotes the column vector made up of components from the xth row of Φ. In [16], using results from [15, 9], it is shown that under appropriate assumptions on the step size sequence, r^(ℓ) converges to a vector r̃ that solves Φr̃ = Π_π T Φr̃.

The equation Φr̃ = Π_π T Φr̃ may have no solution. Further, the requirement that states are sampled independently from the invariant distribution may be impractical. However, a natural extension of the above algorithm leads to an easily implementable version of TD(0) that aims at solving Φr̃ = Π_{π_r̃^ε} T_ε Φr̃. The algorithm requires simulation of a trajectory x_0, x_1, x_2, . . . of the MDP, with each action u_t ∈ U_{x_t} generated by the ε-greedy Boltzmann exploration policy with respect to Φr^(t). The sequence of vectors r^(t) is generated according to
r^(t+1) = r^(t) + γ_t φ(x_t) ((T_ε Φr^(t))(x_t) − (Φr^(t))(x_t)).
Under suitable conditions on the step size sequence, if this algorithm converges, the limit satisfies Φr̃ = Π_{π_r̃^ε} T_ε Φr̃. Whether such an algorithm converges, and whether there are other algorithms that can effectively solve Φr̃ = Π_{π_r̃^ε} T_ε Φr̃ for broad classes of relevant problems, remain open issues.
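A rough sketch of the trajectory-based variant just described, under the same tabular assumptions as before and reusing `boltzmann_weights` from the previous sketch; the step-size schedule and other constants are arbitrary illustrative choices, not taken from the paper.

```python
import numpy as np

def td0_state_aggregation(g, P, partition, K, alpha, eps,
                          num_steps=100_000, x0=0, seed=0):
    """Trajectory-based TD(0)-style iteration aimed at the fixed point
    Phi r = Pi_{pi_r} T_eps Phi r.  Actions are drawn from the eps-greedy
    Boltzmann policy for the current Phi r, and the target (T_eps Phi r)(x_t)
    is computed exactly at the visited state."""
    rng = np.random.default_rng(seed)
    num_actions, num_states = len(g), len(partition)
    r = np.zeros(K)
    x = x0
    for t in range(num_steps):
        J = r[partition]                                        # Phi r
        q = np.array([g[u][x] + alpha * P[u][x] @ J             # (T_u Phi r)(x)
                      for u in range(num_actions)])
        w = boltzmann_weights(q, eps)                           # exploration policy at x
        target = np.dot(w, q)                                   # (T_eps Phi r)(x)
        gamma_t = 1.0 / (1.0 + t / 1000.0)                      # diminishing step size
        k = partition[x]                                        # phi(x) is the k-th unit vector
        r[k] += gamma_t * (target - r[k])
        u = rng.choice(num_actions, p=w)                        # sample an action
        x = rng.choice(num_states, p=P[u][x])                   # simulate the next state
    return r
```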
6 Extensions and Open Issues

Our results demonstrate that weighting a Euclidean norm projection by the invariant distribution of a greedy (or approximately greedy) policy can lead to a dramatic performance gain. It is intriguing that temporal-difference learning implicitly carries out such a projection, and consequently, any limit of convergence obeys the stronger performance loss bound.

This is not the first time that the invariant distribution has been shown to play a critical role in approximate value iteration and temporal-difference learning. In prior work involving approximation of a cost-to-go function for a fixed policy (no control) and a general linearly parameterized approximator (an arbitrary matrix Φ), it was shown that weighting by the invariant distribution is key to ensuring convergence and an approximation error bound [17, 18]. Earlier empirical work anticipated this [13, 14].

The temporal-difference learning algorithm presented in Section 5 is a version of TD(0), which is a special case of TD(λ), parameterized by λ ∈ [0, 1]. It is not known whether the results of this paper can be extended to the general case of λ ∈ [0, 1]. Prior research has suggested that larger values of λ lead to superior results. In particular, an example of [1] and the approximation error bounds of [17, 18], both of which are restricted to the case of a fixed policy, suggest that approximation error is amplified by a factor of 1/(1−α) as λ is changed from 1 to 0. The results of Sections 3 and 4 suggest that this factor vanishes if one considers a controlled process and performance loss rather than approximation error.

Whether the results of this paper can be extended to accommodate approximate value iteration with general linearly parameterized approximators remains an open issue. In this broader context, error and performance loss bounds of the kind offered by Theorem 2 are unavailable, even when the invariant distribution is used to weight the projection. Such error and performance bounds are available, on the other hand, for the solution to a certain linear program [5, 6]. Whether a factor of 1/(1−α) can similarly be eliminated from these bounds is an open issue.

Our results can be extended to accommodate an average cost objective, assuming that the MDP is communicating. With Boltzmann exploration, the equation of interest becomes
Φr̃ = Π_{π_r̃^ε} (T_ε Φr̃ − λ̃ 1).
The variables include an estimate λ̃ ∈ ℝ of the minimal average cost λ* ∈ ℝ and an approximation Φr̃ of the optimal differential cost function h*. The discount factor α is set to 1 in computing an ε-greedy Boltzmann exploration policy as well as T_ε. There is an average-cost version of temporal-difference learning for which any limit of convergence (λ̃, r̃) satisfies this equation [19, 20, 21]. Generalization of Theorem 2 does not lead to a useful result, because the right-hand side of the bound becomes infinite as α approaches 1. On the other hand, generalization of Theorem 6 yields the first performance loss bound for approximate value iteration with an average-cost objective:

Theorem 7 For any communicating MDP with an average-cost objective, partition, and ε > 0, if Φr̃ = Π_{π_r̃^ε} (T_ε Φr̃ − λ̃ 1) then
λ_{µ_r̃^ε} − λ* ≤ 2 min_{r∈ℝ^K} ‖h* − Φr‖_∞ + ε.

Here, λ_{µ_r̃^ε} ∈ ℝ denotes the average cost under policy µ_r̃^ε, which is well defined because the process is irreducible under an ε-greedy Boltzmann exploration policy. This theorem can be proved by taking limits on the left- and right-hand sides of the bound of Theorem 6. It is easy to see that the limit of the left-hand side is λ_{µ_r̃^ε} − λ*. The limit of min_{r∈ℝ^K} ‖J* − Φr‖_∞ on the right-hand side is min_{r∈ℝ^K} ‖h* − Φr‖_∞. (This follows from the analysis of [3].)

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant ECS-9985229 and by the Office of Naval Research under Grant MURI N00014-001-0637. The author's understanding of the topic benefited from collaborations with Dimitri Bertsekas, Daniela de Farias, and John Tsitsiklis. A full-length version of this paper has been submitted to Mathematics of Operations Research and has benefited from a number of useful comments and suggestions made by reviewers.

References

[1] D. P. Bertsekas. A counterexample to temporal-difference learning. Neural Computation, 7:270–279, 1994.
[2] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.
[3] D. Blackwell. Discrete dynamic programming. Annals of Mathematical Statistics, 33:719–726, 1962.
[4] D. P. de Farias and B. Van Roy. On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Optimization Theory and Applications, 105(3), 2000.
[5] D. P. de Farias and B. Van Roy. Approximate dynamic programming via linear programming. In Advances in Neural Information Processing Systems 14. MIT Press, 2002.
[6] D. P. de Farias and B. Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51(6):850–865, 2003.
[7] G. J. Gordon. Stable function approximation in dynamic programming. Technical Report CMU-CS-95-103, Carnegie Mellon University, 1995.
[8] G. J. Gordon. Stable function approximation in dynamic programming. In Machine Learning: Proceedings of the Twelfth International Conference (ICML), San Francisco, CA, 1995.
[9] T. Jaakkola, M. I. Jordan, and S. P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6:1185–1201, 1994.
[10] S. P. Singh and R. C. Yee. An upper bound on the loss from approximate optimal-value functions. Machine Learning, 1994.
[11] R. S. Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, MA, 1984.
[12] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.
[13] R. S. Sutton. On the virtues of linear learning and trajectory distributions. In Proceedings of the Workshop on Value Function Approximation, Machine Learning Conference, 1995.
[14] R. S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems 8, Cambridge, MA, 1996. MIT Press.
[15] J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185–202, 1994.
[16] J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22:59–94, 1996.
[17] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.
[18] J. N. Tsitsiklis and B. Van Roy. Analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems 9, Cambridge, MA, 1997. MIT Press.
[19] J. N. Tsitsiklis and B. Van Roy. Average cost temporal-difference learning. In Proceedings of the IEEE Conference on Decision and Control, 1997.
[20] J. N. Tsitsiklis and B. Van Roy. Average cost temporal-difference learning. Automatica, 35(11):1799–1808, 1999.
[21] J. N. Tsitsiklis and B. Van Roy. On average versus discounted reward temporal-difference learning. Machine Learning, 49(2-3):179–191, 2002.
[22] B. Van Roy. Performance loss bounds for approximate value iteration with state aggregation. Under review with Mathematics of Operations Research, available at www.stanford.edu/~bvr/psfiles/aggregation.pdf, 2005.
[23] C. J. C. H. Watkins. Learning From Delayed Rewards. PhD thesis, Cambridge University, Cambridge, UK, 1989.
[24] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279–292, 1992.

5 0.71911073 75 nips-2005-Fixing two weaknesses of the Spectral Method

Author: Kevin Lang

Abstract: We discuss two intrinsic weaknesses of the spectral graph partitioning method, both of which have practical consequences. The first is that spectral embeddings tend to hide the best cuts from the commonly used hyperplane rounding method. Rather than cleaning up the resulting suboptimal cuts with local search, we recommend the adoption of flow-based rounding. The second weakness is that for many “power law” graphs, the spectral method produces cuts that are highly unbalanced, thus decreasing the usefulness of the method for visualization (see figure 4(b)) or as a basis for divide-and-conquer algorithms. These balance problems, which occur even though the spectral method’s quotient-style objective function does encourage balance, can be fixed with a stricter balance constraint that turns the spectral mathematical program into an SDP that can be solved for million-node graphs by a method of Burer and Monteiro.

1 Background

Graph partitioning is the NP-hard problem of finding a small graph cut subject to the constraint that neither side of the resulting partitioning of the nodes is “too small”. We will be dealing with several versions: the graph bisection problem, which requires perfect 1/2 : 1/2 balance; the β-balanced cut problem (with β a fraction such as 1/3), which requires at least β : (1 − β) balance; and the quotient cut problem, which requires the small side to be large enough to “pay for” the edges in the cut. The quotient cut metric is c / min(a, b), where c is the cutsize and a and b are the sizes of the two sides of the cut. All of the well-known variants of the quotient cut metric (e.g. normalized cut [15]) have similar behavior with respect to the issues discussed in this paper.

The spectral method for graph partitioning was introduced in 1973 by Fiedler and Donath & Hoffman [6]. In the mid-1980s Alon & Milman [1] proved that spectral cuts can be at worst quadratically bad; in the mid-1990s Guattery & Miller [10] proved that this analysis is tight by exhibiting a family of n-node graphs whose spectral bisections cut O(n^{2/3}) edges versus the optimal O(n^{1/3}) edges. On the other hand, Spielman & Teng [16] have proved stronger performance guarantees for the special case of spacelike graphs.

The spectral method can be derived by relaxing a quadratic integer program which encodes the graph bisection problem (see section 3.1). The solution to this relaxation is the “Fiedler vector”, or second smallest eigenvector of the graph’s discrete Laplacian matrix, whose elements x_i can be interpreted as an embedding of the graph on the line. To obtain a specific cut, one must apply a “rounding method” to this embedding. The hyperplane rounding method chooses one of the n − 1 cuts which separate the nodes whose x_i values lie above and below some split value x̂.

Figure 1: The spectral embedding hides the best solution from hyperplane rounding. (A) Graph with nearly balanced 8-cut; (B) spectral embedding; (C) notional flow-based embedding.
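To make the background concrete, here is a minimal NumPy sketch of the one-dimensional spectral embedding and the hyperplane (sweep) rounding it feeds, assuming a small connected graph given as a dense symmetric adjacency matrix; function names are illustrative and this is not the authors' code.

```python
import numpy as np

def fiedler_embedding(adj):
    """One-dimensional spectral embedding: the eigenvector of the graph
    Laplacian L = D - A for its second-smallest eigenvalue (Fiedler vector).
    Assumes a connected, undirected graph given as a symmetric matrix."""
    degrees = adj.sum(axis=1)
    laplacian = np.diag(degrees) - adj
    eigvals, eigvecs = np.linalg.eigh(laplacian)   # ascending eigenvalues
    return eigvecs[:, 1]

def best_hyperplane_quotient_cut(adj, x):
    """Sweep the n-1 threshold cuts induced by the embedding x and return
    the split with the best quotient cut score c / min(a, b)."""
    order = np.argsort(x)
    n = len(x)
    best_score, best_left = np.inf, None
    for i in range(1, n):
        left, right = order[:i], order[i:]
        cut = adj[np.ix_(left, right)].sum()       # weight crossing the split
        score = cut / min(i, n - i)
        if score < best_score:
            best_score, best_left = score, left.copy()
    return best_score, best_left
```

The embedding step is the expensive one in practice; the sweep itself only examines the n − 1 cuts that a hyperplane can produce, which is exactly the limitation discussed next.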
2 Using flow to find cuts that are hidden from hyperplane rounding

Theorists have long known that the spectral method cannot distinguish between deep cuts and long paths, and that this confusion can cause it to cut a graph in the wrong direction, thereby producing the spectral method’s worst-case behavior [10]. In this section we will show by example that even when the spectral method is not fooled into cutting in the wrong direction, the resulting embedding can hide the best cuts from the hyperplane rounding method. This is a possible explanation for the frequently made empirical observation (see e.g. [12]) that hyperplane roundings of spectral embeddings are noisy and therefore benefit from cleanup with a local search method such as Fiduccia-Matheyses [8].

Consider the graph in figure 1(a), which has a near-bisection cutting 8 edges. For this graph the spectral method produces the embedding shown in figure 1(b), and recommends that we make a vertical cut (across the horizontal dimension, which is based on the Fiedler vector). This is correct in a generalized sense, but it is obvious that no hyperplane (or vertical line in this picture) can possibly extract the optimal 8-edge cut.

Some insight into why spectral embeddings tend to have this problem can be obtained from the spectral method’s electrical interpretation. In this view the graph is represented by a resistor network [7]. Current flowing in this network causes voltage drops across the resistors, thus determining the nodes’ voltages and hence their positions. When current flows through a long series of resistors, it induces a progressive voltage drop. This is what causes the excessive length of the embeddings of the horizontal girder-like structures which are blocking all vertical hyperplane cuts in figure 1(b). If the embedding method were somehow not based on current, but rather on flow, which does not distinguish between a pipe and a series of pipes, then the long girders could retract into the two sides of the embedding, as suggested by figure 1(c), and the best cut would be revealed. Because theoretical flow-like embedding methods such as [14] are currently not practical, we point out that in cases like figure 1(b), where the spectral method has not chosen an incorrect direction for the cut, one can use an S-T max flow problem, with the flow running in the recommended direction (horizontally for this embedding), to extract the good cut even though it is hidden from all hyperplanes.

We currently use two different flow-based rounding methods. A method called MQI looks for quotient cuts, and is already described in [13]. Another method, that we shall call Midflow, looks for β-balanced cuts. The input to Midflow is a graph and an ordering of its nodes (obtained e.g. from a spectral embedding or from the projection of any embedding onto a line). We divide the graph’s nodes into 3 sets F, L, and U. The sets F and L respectively contain the first βn and last βn nodes in the ordering, and U contains the remaining U = n − 2βn nodes, which are “up for grabs”. We set up an S-T max flow problem with one node for every graph node plus 2 new nodes for the source and sink. For each graph edge there are two arcs, one in each direction, with unit capacity.

Figure 2: A typical example (see section 2.1) where flow-based rounding beats hyperplane rounding, even when the hyperplane cuts are improved with Fiduccia-Matheyses search. Note that for this spacelike graph, the best quotient cuts have reasonably good balance.
Finally, the nodes in F are pinned to the source and the nodes in L are pinned to the sink by infinite-capacity arcs. This max-flow problem can be solved by a good implementation of the push-relabel algorithm (such as Goldberg and Cherkassky’s hi_pr [4]) in time that empirically is nearly linear with a very good constant factor. Figure 6 shows that solving a MidFlow problem with hi_pr can be 1000 times cheaper than finding a spectral embedding with ARPACK. When the goal is finding good β-balanced cuts, MidFlow rounding is strictly more powerful than hyperplane rounding; from a given node ordering, hyperplane rounding chooses the best of U + 1 candidate cuts, while MidFlow rounding chooses the best of 2^U candidates, including all of those considered by hyperplane rounding. [Similarly, MQI rounding is strictly more powerful than hyperplane rounding for the task of finding good quotient cuts.]

2.1 A concrete example

The plot in figure 2 shows a number of cuts in a 324,800-node nearly planar graph derived from a 700x464 pixel downward-looking view of some clouds over some mountains.¹ The y-axis of the plot is quotient cut score; smaller values are better. We note in passing that the commonly used split point x̂ = 0 does not yield the best hyperplane cut. Our main point is that the two cuts generated by MidFlow rounding of the Fiedler vector (with β = 1/3 and β = 1/4) are nearly twice as good as the best hyperplane cut. Even after the best hyperplane cut has been improved by taking the best result of 100 runs of a version of Fiduccia-Matheyses local search, it is still much worse than the cuts obtained by flow-based rounding.

¹ The graph’s edges are unweighted but are chosen by a randomized rule which is more likely to include an edge between two neighboring pixels if they have a similar grey value. Good cuts in the graph tend to run along discontinuities in the image, as one would expect.
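The MidFlow construction just described can be sketched with an off-the-shelf max-flow solver in place of hi_pr; the version below uses networkx's `minimum_cut` as a stand-in, with `edges` an undirected edge list and `order` a node ordering such as the Fiedler-vector order. All names are illustrative, and this is only an approximation of the authors' setup.

```python
import networkx as nx

def midflow_cut(edges, order, beta):
    """MidFlow rounding sketch: pin the first beta*n nodes in the ordering (F)
    to a source and the last beta*n nodes (L) to a sink, give every graph edge
    unit capacity in both directions, and take the minimum S-T cut.
    Assumes beta*n >= 1."""
    n = len(order)
    m = int(beta * n)
    F, L = set(order[:m]), set(order[-m:])

    G = nx.DiGraph()
    for u, v in edges:
        G.add_edge(u, v, capacity=1)
        G.add_edge(v, u, capacity=1)
    source, sink = "S", "T"
    for u in F:
        G.add_edge(source, u, capacity=float("inf"))   # pin F to the source
    for v in L:
        G.add_edge(v, sink, capacity=float("inf"))     # pin L to the sink

    cut_value, (reachable, non_reachable) = nx.minimum_cut(G, source, sink)
    left = set(reachable) - {source}
    right = set(non_reachable) - {sink}
    return cut_value, left, right
```

Because the pinning arcs have infinite capacity, the minimum cut severs only unit-capacity graph edges, so the returned sides respect the β : (1 − β) balance by construction.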

6 0.70011431 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning

7 0.69196481 13 nips-2005-A Probabilistic Approach for Optimizing Spectral Clustering

8 0.68999606 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations

9 0.68941718 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity

10 0.68557203 84 nips-2005-Generalization in Clustering with Unobserved Features

11 0.68532735 142 nips-2005-Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games

12 0.68282413 112 nips-2005-Learning Minimum Volume Sets

13 0.68199235 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models

14 0.68154442 144 nips-2005-Off-policy Learning with Options and Recognizers

15 0.67806429 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators

16 0.67668897 102 nips-2005-Kernelized Infomax Clustering

17 0.6746335 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines

18 0.67243969 58 nips-2005-Divergences, surrogate loss functions and experimental design

19 0.67182577 198 nips-2005-Using ``epitomes'' to model genetic diversity: Rational design of HIV vaccine cocktails

20 0.67158782 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations