nips nips2005 nips2005-75 knowledge-graph by maker-knowledge-mining

75 nips-2005-Fixing two weaknesses of the Spectral Method

Source: pdf

Author: Kevin Lang

Abstract: We discuss two intrinsic weaknesses of the spectral graph partitioning method, both of which have practical consequences. The first is that spectral embeddings tend to hide the best cuts from the commonly used hyperplane rounding method. Rather than cleaning up the resulting suboptimal cuts with local search, we recommend the adoption of flow-based rounding. The second weakness is that for many “power law” graphs, the spectral method produces cuts that are highly unbalanced, thus decreasing the usefulness of the method for visualization (see figure 4(b)) or as a basis for divide-and-conquer algorithms. These balance problems, which occur even though the spectral method’s quotient-style objective function does encourage balance, can be fixed with a stricter balance constraint that turns the spectral mathematical program into an SDP that can be solved for million-node graphs by a method of Burer and Monteiro.

1 Background

Graph partitioning is the NP-hard problem of finding a small graph cut subject to the constraint that neither side of the resulting partitioning of the nodes is “too small”. We will be dealing with several versions: the graph bisection problem, which requires perfect 1/2 : 1/2 balance; the β-balanced cut problem (with β a fraction such as 1/3), which requires at least β : (1 − β) balance; and the quotient cut problem, which requires the small side to be large enough to “pay for” the edges in the cut. The quotient cut metric is c / min(a, b), where c is the cutsize and a and b are the sizes of the two sides of the cut. All of the well-known variants of the quotient cut metric (e.g. normalized cut [15]) have similar behavior with respect to the issues discussed in this paper.

The spectral method for graph partitioning was introduced in 1973 by Fiedler and Donath & Hoffman [6].
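The cut metrics defined in the Background can be made concrete with a small sketch. The function names and the edge-list/dictionary representation below are illustrative assumptions, not part of the paper; the code only restates the definitions of cutsize, the quotient cut score c / min(a, b), and β-balance.

```python
def cut_size(edges, side):
    """Number of edges whose endpoints lie on different sides.

    `edges` is a list of (u, v) pairs of an undirected, unweighted graph;
    `side` maps every node to True or False (hypothetical representation).
    """
    return sum(1 for u, v in edges if side[u] != side[v])


def quotient_cut_score(edges, side):
    """Quotient cut score c / min(a, b); smaller is better."""
    c = cut_size(edges, side)
    a = sum(1 for on_true_side in side.values() if on_true_side)
    b = len(side) - a
    if min(a, b) == 0:
        raise ValueError("both sides of the cut must be non-empty")
    return c / min(a, b)


def is_beta_balanced(side, beta):
    """True if the small side holds at least a beta fraction of the nodes."""
    a = sum(1 for on_true_side in side.values() if on_true_side)
    n = len(side)
    return min(a, n - a) >= beta * n


# Two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
balanced = {0: True, 1: True, 2: True, 3: False, 4: False, 5: False}
lopsided = {0: True, 1: False, 2: False, 3: False, 4: False, 5: False}
print(quotient_cut_score(edges, balanced), quotient_cut_score(edges, lopsided))
```

In this toy graph the balanced cut also has the better quotient score, which matches the intuition the paper attributes to spacelike graphs.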
In the mid-1980s Alon & Milman [1] proved that spectral cuts can be at worst quadratically bad; in the mid-1990s Guattery & Miller [10] proved that this analysis is tight by exhibiting a family of n-node graphs whose spectral bisections cut O(n^{2/3}) edges versus the optimal O(n^{1/3}) edges. On the other hand, Spielman & Teng [16] have proved stronger performance guarantees for the special case of spacelike graphs.

The spectral method can be derived by relaxing a quadratic integer program which encodes the graph bisection problem (see section 3.1). The solution to this relaxation is the “Fiedler vector”, the eigenvector associated with the second smallest eigenvalue of the graph’s discrete Laplacian matrix, whose elements x_i can be interpreted as an embedding of the graph on the line. To obtain a specific cut, one must apply a “rounding method” to this embedding. The hyperplane rounding method chooses one of the n − 1 cuts which separate the nodes whose x_i values lie above and below some split value x̂.

[Figure 1: (A) Graph with nearly balanced 8-cut. (B) Spectral Embedding. (C) Notional Flow-based Embedding. The spectral embedding hides the best solution from hyperplane rounding.]

2 Using flow to find cuts that are hidden from hyperplane rounding

Theorists have long known that the spectral method cannot distinguish between deep cuts and long paths, and that this confusion can cause it to cut a graph in the wrong direction, thereby producing the spectral method’s worst-case behavior [10]. In this section we will show by example that even when the spectral method is not fooled into cutting in the wrong direction, the resulting embedding can hide the best cuts from the hyperplane rounding method. This is a possible explanation for the frequently made empirical observation (see e.g. [12]) that hyperplane roundings of spectral embeddings are noisy and therefore benefit from cleanup with a local search method such as Fiduccia-Mattheyses [8].
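Hyperplane rounding, as described in section 1, can be sketched as a sweep over the nodes sorted by their embedding value. The sketch below is an illustration under stated assumptions: the function name and input format are hypothetical, the one-dimensional embedding `x` is taken as given (computing the Fiedler vector is outside the sketch), ties in `x` are broken arbitrarily by the sort, and candidates are scored with the quotient cut metric.

```python
def hyperplane_rounding(n, edges, x):
    """Try the n - 1 splits of the nodes sorted by x.

    Returns (best quotient cut score, set of nodes on the low side).
    Assumes nodes are 0..n-1 and the graph is undirected, unweighted,
    and free of self-loops.
    """
    if n < 2:
        raise ValueError("need at least two nodes")
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    order = sorted(range(n), key=lambda i: x[i])
    left = set()
    cut = 0
    best_score, best_k = None, None
    for k, u in enumerate(order[:-1], start=1):
        # Moving u to the low side: edges to nodes already there stop
        # crossing the cut, all other edges of u start crossing it.
        for w in adj[u]:
            cut += -1 if w in left else 1
        left.add(u)
        score = cut / min(k, n - k)
        if best_score is None or score < best_score:
            best_score, best_k = score, k
    return best_score, set(order[:best_k])


# Two triangles joined by the edge (2, 3), with a made-up embedding.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
x = [-1.0, -0.9, -0.5, 0.5, 0.9, 1.0]
print(hyperplane_rounding(6, edges, x))
```

Updating the cutsize incrementally keeps the whole sweep linear in the number of edges after the sort, which is why the method is so commonly used despite considering only n − 1 candidates.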
Consider the graph in figure 1(a), which has a near-bisection cutting 8 edges. For this graph the spectral method produces the embedding shown in figure 1(b), and recommends that we make a vertical cut (across the horizontal dimension which is based on the Fiedler vector). This is correct in a generalized sense, but it is obvious that no hyperplane (or vertical line in this picture) can possibly extract the optimal 8-edge cut.

Some insight into why spectral embeddings tend to have this problem can be obtained from the spectral method’s electrical interpretation. In this view the graph is represented by a resistor network [7]. Current flowing in this network causes voltage drops across the resistors, thus determining the nodes’ voltages and hence their positions. When current flows through a long series of resistors, it induces a progressive voltage drop. This is what causes the excessive length of the embeddings of the horizontal girder-like structures which are blocking all vertical hyperplane cuts in figure 1(b). If the embedding method were somehow not based on current, but rather on flow, which does not distinguish between a pipe and a series of pipes, then the long girders could retract into the two sides of the embedding, as suggested by figure 1(c), and the best cut would be revealed.

Because theoretical flow-like embedding methods such as [14] are currently not practical, we point out that in cases like figure 1(b), where the spectral method has not chosen an incorrect direction for the cut, one can use an S-T max flow problem with the flow running in the recommended direction (horizontally for this embedding) to extract the good cut even though it is hidden from all hyperplanes.

We currently use two different flow-based rounding methods. A method called MQI looks for quotient cuts, and is already described in [13]. Another method, which we shall call Midflow, looks for β-balanced cuts. The input to Midflow is a graph and an ordering of its nodes (obtained e.g.
from a spectral embedding or from the projection of any embedding onto a line). We divide the graph’s nodes into 3 sets F, L, and U. The sets F and L respectively contain the first βn and last βn nodes in the ordering, and U contains the remaining U = n − 2βn nodes, which are “up for grabs”. We set up an S-T max flow problem with one node for every graph node plus 2 new nodes for the source and sink. For each graph edge there are two arcs, one in each direction, with unit capacity. Finally, the nodes in F are pinned to the source and the nodes in L are pinned to the sink by infinite capacity arcs.

[Figure 2: Quotient cut score (cutsize / size of small side) versus number of nodes on the ’left’ side of the cut (out of 324800), comparing hyperplane roundings of the Fiedler vector (including the 50-50 balance and neg-pos splits), the best improvement with local search, and Midflow roundings of the Fiedler vector with β = 1/4 and β = 1/3. A typical example (see section 2.1) where flow-based rounding beats hyperplane rounding, even when the hyperplane cuts are improved with Fiduccia-Mattheyses search. Note that for this spacelike graph, the best quotient cuts have reasonably good balance.]

This max-flow problem can be solved by a good implementation of the push-relabel algorithm (such as Goldberg and Cherkassky’s hi_pr [4]) in time that empirically is nearly linear with a very good constant factor. Figure 6 shows that solving a Midflow problem with hi_pr can be 1000 times cheaper than finding a spectral embedding with ARPACK. When the goal is finding good β-balanced cuts, Midflow rounding is strictly more powerful than hyperplane rounding; from a given node ordering hyperplane rounding chooses the best of U + 1 candidate cuts, while Midflow rounding chooses the best of 2^U candidates, including all of those considered by hyperplane rounding.
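The Midflow construction described above can be sketched in a few dozen lines. This is a minimal illustration, not the author's implementation: it swaps in a simple BFS augmenting-path (Edmonds-Karp) max-flow in place of the push-relabel code hi_pr used in the paper, the function name and arguments are hypothetical, and the parameter `k` plays the role of βn.

```python
from collections import deque


def midflow_rounding(n, edges, order, k):
    """Pin the first k nodes of `order` to the source and the last k to the
    sink, then return (cutsize, source-side node set) of a minimum S-T cut.

    Assumes nodes 0..n-1, an undirected unweighted graph, and 1 <= k <= n/2.
    """
    if k < 1 or 2 * k > n:
        raise ValueError("need 1 <= k <= n/2")
    INF = float("inf")
    S, T = n, n + 1
    # graph[u] holds arcs [head, residual capacity, index of reverse arc].
    graph = [[] for _ in range(n + 2)]

    def add_arcs(u, v, cap_uv, cap_vu):
        graph[u].append([v, cap_uv, len(graph[v])])
        graph[v].append([u, cap_vu, len(graph[u]) - 1])

    for u, v in edges:
        if u != v:                      # self-loops never cross a cut
            add_arcs(u, v, 1, 1)        # one unit arc in each direction
    for f in order[:k]:
        add_arcs(S, f, INF, 0)          # F is pinned to the source
    for l in order[n - k:]:
        add_arcs(l, T, INF, 0)          # L is pinned to the sink

    flow = 0
    while True:
        parent = {S: None}              # BFS tree in the residual graph
        queue = deque([S])
        while queue and T not in parent:
            u = queue.popleft()
            for idx, (v, cap, _) in enumerate(graph[u]):
                if cap > 0 and v not in parent:
                    parent[v] = (u, idx)
                    queue.append(v)
        if T not in parent:
            break                       # no augmenting path: flow is maximum
        bottleneck, v = INF, T
        while parent[v] is not None:
            u, idx = parent[v]
            bottleneck = min(bottleneck, graph[u][idx][1])
            v = u
        v = T
        while parent[v] is not None:
            u, idx = parent[v]
            graph[u][idx][1] -= bottleneck
            graph[v][graph[u][idx][2]][1] += bottleneck
            v = u
        flow += bottleneck

    # Nodes still reachable from S form the source side of a minimum cut.
    return flow, {v for v in parent if v < n}


# Two triangles joined by edge (2, 3); nodes 2 and 3 are swapped in the order.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(midflow_rounding(6, edges, [0, 1, 3, 2, 4, 5], 2))
```

In this toy ordering every balanced hyperplane split cuts at least 2 edges, while the flow-based rounding is free to place the two unpinned nodes independently and so recovers the single-edge cut.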
[Similarly, MQI rounding is strictly more powerful than hyperplane rounding for the task of finding good quotient cuts.]

2.1 A concrete example

The plot in figure 2 shows a number of cuts in a 324,800 node nearly planar graph derived from a 700x464 pixel downward-looking view of some clouds over some mountains (see footnote 1). The y-axis of the plot is quotient cut score; smaller values are better. We note in passing that the commonly used split point x̂ = 0 does not yield the best hyperplane cut. Our main point is that the two cuts generated by Midflow rounding of the Fiedler vector (with β = 1/3 and β = 1/4) are nearly twice as good as the best hyperplane cut. Even after the best hyperplane cut has been improved by taking the best result of 100 runs of a version of Fiduccia-Mattheyses local search, it is still much worse than the cuts obtained by flow-based rounding.

Footnote 1: The graph’s edges are unweighted but are chosen by a randomized rule which is more likely to include an edge between two neighboring pixels if they have a similar grey value. Good cuts in the graph tend to run along discontinuities in the image, as one would expect.

[Figure 3: Scatter plot showing cuts in a “power-law graph” (Yahoo Groups); y-axis: quotient cut score (smaller is better), with the SDP lower bound (SDP-LB).]

Reference: text

Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We discuss two intrinsic weaknesses of the spectral graph partitioning method, both of which have practical consequences. [sent-3, score-0.497]

2 The ﬁrst is that spectral embeddings tend to hide the best cuts from the commonly used hyperplane rounding method. [sent-4, score-1.282]

3 Rather than cleaning up the resulting suboptimal cuts with local search, we recommend the adoption of ﬂow-based rounding. [sent-5, score-0.319]

4 The second weakness is that for many “power law” graphs, the spectral method produces cuts that are highly unbalanced, thus decreasing the usefulness of the method for visualization (see ﬁgure 4(b)) or as a basis for divide-and-conquer algorithms. [sent-6, score-0.638]

5 1 Background Graph partitioning is the NP-hard problem of ﬁnding a small graph cut subject to the constraint that neither side of the resulting partitioning of the nodes is “too small”. [sent-8, score-0.791]

6 The quotient cut metric is c/ min(a, b), where c is the cutsize and a and b are the sizes of the two sides of the cut. [sent-10, score-0.579]

7 All of the well-known variants of the quotient cut metric (e. [sent-11, score-0.545]

8 normalized cut [15]) have similar behavior with respect to the issues discussed in this paper. [sent-13, score-0.292]

9 The spectral method for graph partitioning was introduced in 1973 by Fiedler and Donath & Hoffman [6]. [sent-14, score-0.524]

10 In the mid-1980s Alon & Milman [1] proved that spectral cuts can be at worst quadratically bad; in the mid-1990s Guattery & Miller [10] proved that this analysis is tight by exhibiting a family of n-node graphs whose spectral bisections cut O(n^{2/3}) edges versus the optimal O(n^{1/3}) edges. [sent-15, score-1.348]

11 On the other hand, Spielman & Teng [16] have proved stronger performance guarantees for the special case of spacelike graphs. [sent-16, score-0.094]

12 The spectral method can be derived by relaxing a quadratic integer program which encodes the graph bisection problem (see section 3. [sent-17, score-0.554]

13 The solution to this relaxation is the “Fiedler vector”, or second smallest eigenvector of the graph’s discrete Laplacian matrix, whose elements xi can be interpreted as an embedding of the graph on the line. [sent-19, score-0.385]

14 To obtain a [Figure 1: (A) Graph with nearly balanced 8-cut; (B) Spectral Embedding; (C) Notional Flow-based Embedding. The spectral embedding hides the best solution from hyperplane rounding.] [sent-20, score-0.702]

15 The hyperplane rounding method chooses one of the n − 1 cuts which separate the nodes whose x_i values lie above and below some split value x̂. [sent-22, score-1.001]

16 In this section we will show by example that even when the spectral method is not fooled into cutting in the wrong direction, the resulting embedding can hide the best cuts from the hyperplane rounding method. [sent-24, score-1.383]

17 [12]) that hyperplane roundings of spectral embeddings are noisy and therefore beneﬁt from cleanup with a local search method such as Fiduccia-Matheyses [8]. [sent-27, score-0.56]

18 Consider the graph in ﬁgure 1(a), which has a near-bisection cutting 8 edges. [sent-28, score-0.243]

19 For this graph the spectral method produces the embedding shown in ﬁgure 1(b), and recommends that we make a vertical cut (across the horizontal dimension which is based on the Fiedler vector). [sent-29, score-0.918]

20 This is correct in a generalized sense, but it is obvious that no hyperplane (or vertical line in this picture) can possibly extract the optimal 8-edge cut. [sent-30, score-0.207]

21 Some insight into why spectral embeddings tend to have this problem can be obtained from the spectral method’s electrical interpretation. [sent-31, score-0.538]

22 In this view the graph is represented by a resistor network [7]. [sent-32, score-0.207]

23 This is what causes the excessive length of the embeddings of the horizontal girder-like structures which are blocking all vertical hyperplane cuts in ﬁgure 1(b). [sent-35, score-0.64]

24 If the embedding method were somehow not based on current, but rather on ﬂow, which does not distinguish between a pipe and a series of pipes, then the long girders could retract into the two sides of the embedding, as suggested by ﬁgure 1(c), and the best cut would be revealed. [sent-36, score-0.51]

25 A method called MQI looks for quotient cuts, and is already described in [13]. [sent-39, score-0.309]

26 The input to Midﬂow is a graph and an ordering of its nodes (obtained e. [sent-41, score-0.323]

27 from a spectral embedding or from the projection of any embedding onto a line). [sent-43, score-0.516]

28 We divide the graph’s nodes into 3 sets F, L, and U. [sent-44, score-0.091]

29 The sets F and L respectively contain the first βn and last βn nodes in the ordering, and U contains the remaining [Figure 2 plot: y-axis: quotient cut score (cutsize / size of small side); labels: 50-50 balance, neg-pos split, hyperplane roundings of Fiedler vector] [sent-45, score-0.935]

30 [Figure 2 plot labels: Best hyperplane rounding of Fiedler Vector; Best improvement with local search] [sent-50, score-0.564]

31 [Figure 2 plot label: Midflow rounding, beta = 1/4] [sent-53, score-0.357]

32 [Figure 2 plot label: Midflow rounding of Fiedler Vector, beta = 1/3; x-axis: number of nodes on ’left’ side of cut (out of 324800)] Figure 2: A typical example (see section 2. [sent-54, score-0.785]

33 1) where ﬂow-based rounding beats hyperplane rounding, even when the hyperplane cuts are improved with Fiduccia-Matheyses search. [sent-55, score-1.09]

34 Note that for this spacelike graph, the best quotient cuts have reasonably good balance. [sent-56, score-0.679]

35 We set up an S-T max ﬂow problem with one node for every graph node plus 2 new nodes for the source and sink. [sent-58, score-0.43]

36 For each graph edge there are two arcs, one in each direction, with unit capacity. [sent-59, score-0.207]

37 Finally, the nodes in F are pinned to the source and the nodes in L are pinned to sink by inﬁnite capacity arcs. [sent-60, score-0.25]

38 This max-ﬂow problem can be solved by a good implementation of the push-relabel algorithm (such as Goldberg and Cherkassky’s hi pr [4]) in time that empirically is nearly linear with a very good constant factor. [sent-61, score-0.095]

39 Figure 6 shows that solving a MidFlow problem with hi pr can be 1000 times cheaper than ﬁnding a spectral embedding with ARPACK. [sent-62, score-0.424]

40 [Similarly, MQI rounding is strictly more powerful than hyperplane rounding for the task of ﬁnding good quotient cuts. [sent-64, score-1.174]

41 1 A concrete example The plot in ﬁgure 2 shows a number of cuts in a 324,800 node nearly planar graph derived from a 700x464 pixel downward-looking view of some clouds over some mountains. [sent-66, score-0.615]

42 1 The y-axis of the plot is quotient cut score; smaller values are better. [sent-67, score-0.545]

43 We note in passing that the commonly used split point x̂ = 0 does not yield the best hyperplane cut. [sent-68, score-0.246]

44 Our main point is that the two cuts generated by MidFlow rounding of the Fiedler vector (with β = 1/3 and β = 1/4) are nearly twice as good as the best hyperplane cut. [sent-69, score-0.957]

45 Even after the best hyperplane cut has been improved by taking the best result of 100 runs of a version of Fiduccia-Mattheyses local search, it is still much worse than the cuts obtained by flow-based rounding. [sent-70, score-0.943]

46 Good cuts in the graph tend to run along discontinuities in the image, as one would expect. [sent-72, score-0.555]

47 [Figure 3 plot: y-axis: quotient cut score (smaller is better); curve label: SDP-LB] [sent-73, score-0.606]

48 [Figure 3 plot: scatter plot showing cuts in a "power-law graph" (Yahoo Groups); x-axis: size of small side, from worse balance to better balance] Figure 3: This scatter plot of cuts in a 1. [sent-74, score-0.717]

49 6 million node collaborative ﬁltering graph shows a surprising relationship between cut quality and balance (see section 3). [sent-75, score-0.843]

50 The SDP lower bound proves that all balanced cuts are worse than the unbalanced cuts seen on the left. [sent-76, score-0.773]

51 2 Effectiveness on real graphs and benchmarks We have found the ﬂow-based Midﬂow and MQI rounding methods to be highly effective in practice on diverse classes of graphs including space-like graphs and power law graphs. [sent-78, score-0.872]

52 Results for real-world power law graphs are shown in ﬁgure 5. [sent-79, score-0.197]

53 walshaw/partition, which keeps track of the best nearly balanced cuts ever found for a number of classic benchmarks. [sent-85, score-0.45]

54 Using ﬂow-based rounding to extract cuts from spectral-type embeddings, we have found new record cuts for the majority of the largest graphs on the site, including fe body, t60k, wing, brack2, fe tooth, fe rotor, 598a, 144, wave, m14b, and auto. [sent-86, score-1.289]

55 It is interesting to note that the spectral method previously did not own any of the records for these classic benchmarks, although it could have if ﬂow-based rounding had been used instead of hyperplane rounding. [sent-87, score-0.803]

56 3 Finding balanced cuts in “power law” graphs The spectral method does not require cuts to have perfect balance, but the denominator in its quotient-style objective function does reward balance and punish imbalance. [sent-88, score-1.358]

57 Thus one might expect the spectral method to produce cuts with fairly good balance, and this is what does happen for the class of spacelike graphs that inform much of our intuition. [sent-89, score-0.785]

58 However, there are now many economically important “power law” [5] graphs whose best quotient cuts have extremely bad balance. [sent-90, score-0.817]

59 Examples at Yahoo include the web graph, social graphs based on DBLP co-authorship and Yahoo IM buddy lists, a music similarity graph, and bipartite collaborative filtering graphs relating Yahoo Groups with users, and advertisers with search phrases. [sent-91, score-0.476]

60 To save space we show one scatter plot (ﬁgure 3) of quotient cut scores versus balance that is typical for graphs from this class. [sent-92, score-0.976]

61 We see that apparently there is a tradeoff between these two quantities, and in fact the quotient cut score gets better as Figure 4: Left: a social graph with octopus structure as predicted by Chung and Lu [5]. [sent-93, score-0.908]

62 Center: a “normalized cut” Spectral embedding chops off one tentacle per dimension. [sent-94, score-0.152]

63 Right: an SDP embedding looks better and is more useful for ﬁnding balanced cuts. [sent-95, score-0.267]

64 balance gets worse, which is exactly the opposite of what one would expect. [sent-96, score-0.238]

65 When run on graphs of this type, the spectral method (and other quotient cut methods such as Metis+MQI [13]) wants to chop off tiny pieces. [sent-97, score-1.008]

66 First, cutting off a tiny piece after paying for a computation on the whole graph kills the scalability of divide and conquer algorithms by causing their overall run time to increase e. [sent-99, score-0.308]

67 Second, low-dimensional spectral embeddings of these graphs (see e. [sent-102, score-0.485]

68 These problems can be avoided by solving a semideﬁnite relaxation of graph bisection that has a much stronger balance constraint. [sent-105, score-0.534]

69 In 2004, Arora, Rao, and Vazirani [14] proved the best-ever approximation guarantee for graph partitioning by analysing a version of this SDP which was augmented with certain triangle inequalities that serve much the same purpose as ﬂow (but which are too expensive to solve for large graphs). [sent-107, score-0.311]

70 1 A semideﬁnite program which strengthens the balance requirement The graph bisection problem can be expressed as a Quadratic Integer Program as follows. [sent-109, score-0.553]

71 There is an n-element column vector x of indicator variables xi , each of which assigns one node to a particular side of the cut by assuming a value from the set {−1, 1}. [sent-110, score-0.442]

72 With these indicator values, the objective function (1/4) x^T L x (where L is the graph’s discrete Laplacian matrix) works out to be equal to the number of edges crossing the cut. [sent-111, score-0.106]

73 Finally, the requirement of perfect balance is expressed by the constraint x^T e = 0, where e is a vector of all ones. [sent-112, score-0.238]

74 Since this QIP exactly encodes the graph bisection problem, solving it is NP-hard. [sent-113, score-0.27]

75 The spectral relaxation of this QIP attains solvability by allowing the indicator variables to assume arbitrary real values, provided that their average squared magnitude is 1. [sent-114, score-0.289]

76 More interestingly for the present discussion, the balance constraint x^T e = 0 now permits a qualitatively different kind of balance where a tiny group of nodes moves a long way out from the origin, where the nodes acquire enough leverage to counterbalance everyone else. [sent-117, score-0.694]

77 For graphs where the best quotient cut has good balance (e. [sent-118, score-0.981]

78 meshes) this does not actually happen, but for graphs whose best quotient cut has bad balance, it does happen, as can be seen in ﬁgure 4(b). [sent-120, score-0.79]

79 2 Methods for solving the SDP for large graphs Interior point methods cannot solve (2) for graphs with more than a few thousand nodes, but newer methods achieve better scaling by ensuring that all dense n by n matrices have only an implicit (and approximate) existence. [sent-128, score-0.347]

80 A good example is Helmberg and Rendl’s program SBmethod [11], which can solve the dual of (2) for graphs with about 50,000 nodes by converting it to an equivalent “eigenvalue optimization” problem. [sent-129, score-0.295]

81 The output of SBmethod is a low-rank approximate spectral factorization of the Gram matrix, consisting of an estimated rank r, plus an n by r matrix X whose rows are the nodes’ indicator vectors. [sent-130, score-0.287]

82 We have successfully run Burer and Monteiro’s code on large graphs containing more than a million nodes. [sent-138, score-0.24]

83 We typically run it several times with different small ﬁxed values of r, and then choose the smallest r which allows the objective function to reach its best known value. [sent-139, score-0.095]

84 On medium-size graphs this produces estimates for r which are in rough agreement with those produced by SBmethod. [sent-140, score-0.187]

85 The run time scaling of SDP-LR is compared with that of ARPACK and hi pr in ﬁgure 6. [sent-141, score-0.089]

86 [Plot residue: y-axis: quotient cut score (smaller is better); curve labels: Spectral + Hyperplanes, SDP + Hyperplanes] [sent-166, score-1.28]

87 In all four cases, the spectral line is lower on the left, and the SDP line is lower on the right, which means that Spectral produces better unbalanced cuts and the SDP produces better balanced cuts. [sent-169, score-0.733]

88 Cuts obtained by rounding random 1-d projections of the SDP embedding using Midﬂow (to produce β-balanced cuts) followed by MQI (to improve the quotient cut score) are also shown; these ﬂow-based cuts are consistently better than hyperplane cuts. [sent-170, score-1.609]

89 130301 package (with r < 10) to approximately solve (1) for several large graphs including: a 130,000 node social graph representing co-authorship in DBLP; a 1. [sent-173, score-0.486]

90 9 million node social graph built from the buddy lists of a subset of the users of Yahoo Instant Messenger; a 1. [sent-174, score-0.455]

91 6 million node bipartite graph relating Yahoo Groups and users; and a 1. [sent-175, score-0.344]

92 5 million node graph made by symmetrizing the TREC WT10G web graph. [sent-176, score-0.34]

93 It is clear from figure 5 that in all four cases the SDP embedding leads to better balanced cuts, and that flow-based rounding works better than hyperplane rounding. [sent-177, score-0.831]

94 Also, ﬁgures 4(b) and 4(c) show 3-d Spectral and SDP embeddings of a small subset of the Yahoo IM social graph; the SDP embedding is qualitatively different and arguably better for visualization purposes. [sent-178, score-0.386]

95 [Figure 6 plot: y-axis: run time (seconds); axis ticks omitted] [sent-186, score-0.104]

96 [Figure 6 plot: x-axis: graph size (nodes + edges); axis ticks omitted] Figure 6: Run time scaling on subsets of the Yahoo IM graph. [sent-188, score-0.323]

97 Finding Spectral and SDP embeddings with ARPACK and SDP-LR requires about the same amount of time, while MidFlow rounding with hi pr is about 1000 times faster. [sent-189, score-0.531]

98 Average distances in random graphs with given expected degree sequences. [sent-208, score-0.159]

99 A ﬂow-based method for improving the expansion or conductance of graph cuts. [sent-255, score-0.234]

100 Spectral partitioning works: Planar graphs and ﬁnite element meshes. [sent-266, score-0.237]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('rounding', 0.357), ('cuts', 0.319), ('sdp', 0.312), ('cut', 0.292), ('quotient', 0.253), ('balance', 0.238), ('spectral', 0.212), ('graph', 0.207), ('hyperplane', 0.207), ('graphs', 0.159), ('embedding', 0.152), ('yahoo', 0.149), ('midflow', 0.12), ('ow', 0.116), ('embeddings', 0.114), ('burer', 0.103), ('fiedler', 0.103), ('nodes', 0.091), ('mqi', 0.086), ('partitioning', 0.078), ('mid', 0.074), ('spacelike', 0.068), ('social', 0.066), ('bisection', 0.063), ('score', 0.061), ('balanced', 0.057), ('node', 0.054), ('gure', 0.053), ('million', 0.052), ('donath', 0.051), ('hyperplanes', 0.051), ('monteiro', 0.051), ('satish', 0.051), ('sbmethod', 0.051), ('indicator', 0.051), ('semide', 0.049), ('bad', 0.047), ('worse', 0.047), ('fe', 0.045), ('side', 0.045), ('program', 0.045), ('users', 0.042), ('wit', 0.041), ('best', 0.039), ('law', 0.038), ('sp', 0.038), ('flow', 0.036), ('cutting', 0.036), ('tiny', 0.036), ('xx', 0.035), ('nearly', 0.035), ('arpack', 0.034), ('buddy', 0.034), ('cherkassky', 0.034), ('cutsize', 0.034), ('dblp', 0.034), ('goemans', 0.034), ('guattery', 0.034), ('hide', 0.034), ('hoffman', 0.034), ('lvin', 0.034), ('pinned', 0.034), ('qip', 0.034), ('renato', 0.034), ('resistors', 0.034), ('spielman', 0.034), ('trec', 0.034), ('vazirani', 0.034), ('ype', 0.034), ('scatter', 0.034), ('sd', 0.033), ('unbalanced', 0.031), ('bipartite', 0.031), ('hi', 0.03), ('pr', 0.03), ('meshes', 0.03), ('messenger', 0.03), ('better', 0.029), ('run', 0.029), ('looks', 0.029), ('edges', 0.028), ('produces', 0.028), ('programming', 0.028), ('web', 0.027), ('objective', 0.027), ('lang', 0.027), ('kevin', 0.027), ('arora', 0.027), ('alon', 0.027), ('ows', 0.027), ('method', 0.027), ('groups', 0.027), ('im', 0.026), ('relaxation', 0.026), ('proved', 0.026), ('lx', 0.025), ('samuel', 0.025), ('visualization', 0.025), ('ordering', 0.025), ('plus', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999905 75 nips-2005-Fixing two weaknesses of the Spectral Method

Author: Kevin Lang

2 0.2315907 13 nips-2005-A Probabilistic Approach for Optimizing Spectral Clustering

Author: Rong Jin, Feng Kang, Chris H. Ding

Abstract: Spectral clustering enjoys its success in both data clustering and semisupervised learning. But, most spectral clustering algorithms cannot handle multi-class clustering problems directly. Additional strategies are needed to extend spectral clustering algorithms to multi-class clustering problems. Furthermore, most spectral clustering algorithms employ hard cluster membership, which is likely to be trapped by the local optimum. In this paper, we present a new spectral clustering algorithm, named “Soft Cut”. It improves the normalized cut algorithm by introducing soft membership, and can be efﬁciently computed using a bound optimization algorithm. Our experiments with a variety of datasets have shown the promising performance of the proposed clustering algorithm. 1

3 0.20763379 177 nips-2005-Size Regularized Cut for Data Clustering

Author: Yixin Chen, Ya Zhang, Xiang Ji

Abstract: We present a novel spectral clustering method that enables users to incorporate prior knowledge of the size of clusters into the clustering process. The cost function, which is named size regularized cut (SRcut), is deﬁned as the sum of the inter-cluster similarity and a regularization term measuring the relative size of two clusters. Finding a partition of the data set to minimize SRcut is proved to be NP-complete. An approximation algorithm is proposed to solve a relaxed version of the optimization problem as an eigenvalue problem. Evaluations over different data sets demonstrate that the method is not sensitive to outliers and performs better than normalized cut. 1

4 0.14691648 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning

Author: Tong Zhang, Rie Kubota Ando

Abstract: We consider a framework for semi-supervised learning using spectral decomposition based un-supervised kernel design. This approach subsumes a class of previously proposed semi-supervised learning methods on data graphs. We examine various theoretical properties of such methods. In particular, we derive a generalization performance bound, and obtain the optimal kernel design by minimizing the bound. Based on the theoretical analysis, we are able to demonstrate why spectral kernel design based methods can often improve the predictive performance. Experiments are used to illustrate the main consequences of our analysis.

5 0.098105162 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning

Author: Andreas Argyriou, Mark Herbster, Massimiliano Pontil

Abstract: A foundational problem in semi-supervised learning is the construction of a graph underlying the data. We propose to use a method which optimally combines a number of differently constructed graphs. For each of these graphs we associate a basic graph kernel. We then compute an optimal combined kernel. This kernel solves an extended regularization problem which requires a joint minimization over both the data and the set of graph kernels. We present encouraging results on different OCR tasks where the optimal combined kernel is computed from graphs constructed with a variety of distances functions and the ‘k’ in nearest neighbors. 1

6 0.097040586 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators

7 0.090435997 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning

8 0.085960306 178 nips-2005-Soft Clustering on Graphs

9 0.073895551 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification

10 0.0684181 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models

11 0.063620657 125 nips-2005-Message passing for task redistribution on sparse graphs

12 0.063464962 46 nips-2005-Consensus Propagation

13 0.060782034 41 nips-2005-Coarse sample complexity bounds for active learning

14 0.059577554 105 nips-2005-Large-Scale Multiclass Transduction

15 0.058324341 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification

16 0.057129435 114 nips-2005-Learning Rankings via Convex Hull Separation

17 0.055380844 104 nips-2005-Laplacian Score for Feature Selection

18 0.053281132 102 nips-2005-Kernelized Infomax Clustering

19 0.053200491 184 nips-2005-Structured Prediction via the Extragradient Method

20 0.052440412 199 nips-2005-Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.16), (1, 0.1), (2, -0.069), (3, -0.086), (4, -0.31), (5, -0.048), (6, -0.031), (7, 0.014), (8, -0.08), (9, 0.141), (10, -0.023), (11, -0.026), (12, -0.025), (13, 0.009), (14, 0.056), (15, -0.136), (16, 0.036), (17, 0.05), (18, -0.007), (19, 0.021), (20, 0.071), (21, -0.064), (22, -0.05), (23, 0.009), (24, -0.009), (25, 0.092), (26, -0.012), (27, 0.024), (28, 0.061), (29, -0.008), (30, 0.096), (31, -0.043), (32, -0.057), (33, -0.038), (34, -0.023), (35, 0.038), (36, -0.046), (37, 0.243), (38, -0.072), (39, -0.191), (40, -0.083), (41, 0.159), (42, -0.061), (43, 0.092), (44, -0.031), (45, 0.057), (46, 0.088), (47, 0.028), (48, -0.078), (49, -0.122)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97788066 75 nips-2005-Fixing two weaknesses of the Spectral Method

Author: Kevin Lang

2 0.81292611 177 nips-2005-Size Regularized Cut for Data Clustering

Author: Yixin Chen, Ya Zhang, Xiang Ji

3 0.65938866 13 nips-2005-A Probabilistic Approach for Optimizing Spectral Clustering

Author: Rong Jin, Feng Kang, Chris H. Ding

4 0.36777377 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning

Author: Zhenyue Zhang, Hongyuan Zha

Abstract: We propose a fast manifold learning algorithm based on the methodology of domain decomposition. Starting with the set of sample points partitioned into two subdomains, we develop the solution of the interface problem that can glue the embeddings on the two subdomains into an embedding on the whole domain. We provide a detailed analysis to assess the errors produced by the gluing process using matrix perturbation theory. Numerical examples are given to illustrate the efficiency and effectiveness of the proposed methods.

5 0.35558957 180 nips-2005-Spectral Bounds for Sparse PCA: Exact and Greedy Algorithms

Author: Baback Moghaddam, Yair Weiss, Shai Avidan

Abstract: Sparse PCA seeks approximate sparse "eigenvectors" whose projections capture the maximal variance of data. As a cardinality-constrained and non-convex optimization problem, it is NP-hard and is encountered in a wide range of applied fields, from bio-informatics to finance. Recent progress has focused mainly on continuous approximation and convex relaxation of the hard cardinality constraint. In contrast, we consider an alternative discrete spectral formulation based on variational eigenvalue bounds and provide an effective greedy strategy as well as provably optimal solutions using branch-and-bound search. Moreover, the exact methodology used reveals a simple renormalization step that improves approximate solutions obtained by any continuous method. The resulting performance gain of discrete algorithms is demonstrated on real-world benchmark data and in extensive Monte Carlo evaluation trials.

6 0.35242051 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning

7 0.34834686 178 nips-2005-Soft Clustering on Graphs

8 0.34784204 107 nips-2005-Large scale networks fingerprinting and visualization using the k-core decomposition

9 0.34190896 125 nips-2005-Message passing for task redistribution on sparse graphs

10 0.31933448 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators

11 0.31384495 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models

12 0.30723462 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning

13 0.30064341 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification

14 0.27541196 159 nips-2005-Q-Clustering

15 0.2687678 116 nips-2005-Learning Topology with the Generative Gaussian Graph and the EM Algorithm

16 0.26174983 71 nips-2005-Fast Krylov Methods for N-Body Learning

17 0.26050973 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach

18 0.25948805 105 nips-2005-Large-Scale Multiclass Transduction

19 0.25622541 46 nips-2005-Consensus Propagation

20 0.24314301 204 nips-2005-Walk-Sum Interpretation and Analysis of Gaussian Belief Propagation

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.033), (10, 0.514), (27, 0.02), (31, 0.03), (34, 0.078), (39, 0.01), (41, 0.016), (55, 0.025), (69, 0.044), (73, 0.05), (88, 0.048), (91, 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.97540295 165 nips-2005-Response Analysis of Neuronal Population with Synaptic Depression

Author: Wentao Huang, Licheng Jiao, Shan Tan, Maoguo Gong

Abstract: In this paper, we aim at analyzing the characteristics of neuronal population responses to instantaneous or time-dependent inputs and the role of synapses in neural information processing. We have derived an evolution equation of the membrane potential density function with synaptic depression, and obtained formulas for analytically computing the response of the instantaneous firing rate. Through a technical analysis, we arrive at several significant conclusions: The background inputs play an important role in information processing and act as a switch between temporal integration and coincidence detection. The role of synapses can be regarded as a spatio-temporal filter; the spatial distribution of synapses and the spatial and temporal relation of inputs are important in neural information processing. The instantaneous input frequency can affect the response amplitude and phase delay.

same-paper 2 0.9388026 75 nips-2005-Fixing two weaknesses of the Spectral Method

Author: Kevin Lang

3 0.90934855 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration

Author: Benjamin V. Roy

Abstract: We consider approximate value iteration with a parameterized approximator in which the state space is partitioned and the optimal cost-to-go function over each partition is approximated by a constant. We establish performance loss bounds for policies derived from approximations associated with fixed points. These bounds identify benefits to having projection weights equal to the invariant distribution of the resulting policy. Such projection weighting leads to the same fixed points as TD(0). Our analysis also leads to the first performance loss bound for approximate value iteration with an average cost objective.

1 Preliminaries

Consider a discrete-time communicating Markov decision process (MDP) with a finite state space $S = \{1, \ldots, |S|\}$. At each state $x \in S$, there is a finite set $U_x$ of admissible actions. If the current state is $x$ and an action $u \in U_x$ is selected, a cost of $g_u(x)$ is incurred, and the system transitions to a state $y \in S$ with probability $p_{xy}(u)$. For any $x \in S$ and $u \in U_x$, $\sum_{y \in S} p_{xy}(u) = 1$. Costs are discounted at a rate of $\alpha \in (0, 1)$ per period. Each instance of such an MDP is defined by a quintuple $(S, U, g, p, \alpha)$.

A (stationary deterministic) policy is a mapping $\mu$ that assigns an action $u \in U_x$ to each state $x \in S$. If actions are selected based on a policy $\mu$, the state follows a Markov process with transition matrix $P_\mu$, where each $(x, y)$th entry is equal to $p_{xy}(\mu(x))$. The restriction to communicating MDPs ensures that it is possible to reach any state from any other state. Each policy $\mu$ is associated with a cost-to-go function $J_\mu \in \mathbb{R}^{|S|}$, defined by

$$J_\mu = \sum_{t=0}^{\infty} \alpha^t P_\mu^t g_\mu = (I - \alpha P_\mu)^{-1} g_\mu,$$

where, with some abuse of notation, $g_\mu(x) = g_{\mu(x)}(x)$ for each $x \in S$. A policy $\mu$ is said to be greedy with respect to a function $J$ if

$$\mu(x) \in \operatorname*{argmin}_{u \in U_x} \Big( g_u(x) + \alpha \sum_{y \in S} p_{xy}(u) J(y) \Big) \quad \text{for all } x \in S.$$

The optimal cost-to-go function $J^* \in \mathbb{R}^{|S|}$ is defined by $J^*(x) = \min_\mu J_\mu(x)$, for all $x \in S$. A policy $\mu^*$ is said to be optimal if $J_{\mu^*} = J^*$.
It is well-known that an optimal policy exists. Further, a policy $\mu^*$ is optimal if and only if it is greedy with respect to $J^*$. Hence, given the optimal cost-to-go function, optimal actions can be computed by minimizing the right-hand side of the above inclusion.

Value iteration generates a sequence $J_\ell$ converging to $J^*$ according to $J_{\ell+1} = T J_\ell$, where $T$ is the dynamic programming operator, defined by $(TJ)(x) = \min_{u \in U_x} \big( g_u(x) + \alpha \sum_{y \in S} p_{xy}(u) J(y) \big)$, for all $x \in S$ and $J \in \mathbb{R}^{|S|}$. This sequence converges to $J^*$ for any initialization of $J_0$.

2 Approximate Value Iteration

The state spaces of relevant MDPs are typically so large that computation and storage of a cost-to-go function is infeasible. One approach to dealing with this obstacle involves partitioning the state space $S$ into a manageable number $K$ of disjoint subsets $S_1, \ldots, S_K$ and approximating the optimal cost-to-go function with a function that is constant over each partition. This can be thought of as a form of state aggregation: all states within a given partition are assumed to share a common optimal cost-to-go.

To represent an approximation, we define a matrix $\Phi \in \mathbb{R}^{|S| \times K}$ such that each $k$th column is an indicator function for the $k$th partition $S_k$. Hence, for any $r \in \mathbb{R}^K$, $k$, and $x \in S_k$, $(\Phi r)(x) = r_k$. In this paper, we study variations of value iteration, each of which computes a vector $r$ so that $\Phi r$ approximates $J^*$. The use of a policy $\mu_r$ which is greedy with respect to $\Phi r$ is justified by the following result (see [10] for a proof):

Theorem 1 If $\mu$ is a greedy policy with respect to a function $\tilde{J} \in \mathbb{R}^{|S|}$ then

$$\|J_\mu - J^*\|_\infty \le \frac{2\alpha}{1-\alpha} \|J^* - \tilde{J}\|_\infty.$$

One common way of approximating a function $J \in \mathbb{R}^{|S|}$ with a function of the form $\Phi r$ involves projection with respect to a weighted Euclidean norm $\|\cdot\|_{2,\pi}$, defined by $\|J\|_{2,\pi} = \big( \sum_{x \in S} \pi(x) J^2(x) \big)^{1/2}$. Here, $\pi \in \mathbb{R}_+^{|S|}$ is a vector of weights that assign relative emphasis among states.
The projection $\Pi_\pi J$ is the function $\Phi r$ that attains the minimum of $\|J - \Phi r\|_{2,\pi}$; if there are multiple functions $\Phi r$ that attain the minimum, they must form an affine space, and the projection is taken to be the one with minimal norm $\|\Phi r\|_{2,\pi}$. Note that in our context, where each $k$th column of $\Phi$ represents an indicator function for the $k$th partition, for any $\pi$, $J$, and $x \in S_k$,

$$(\Pi_\pi J)(x) = \frac{\sum_{y \in S_k} \pi(y) J(y)}{\sum_{y \in S_k} \pi(y)}.$$

Approximate value iteration begins with a function $\Phi r^{(0)}$ and generates a sequence according to $\Phi r^{(\ell+1)} = \Pi_\pi T \Phi r^{(\ell)}$. It is well-known that the dynamic programming operator $T$ is a contraction mapping with respect to the maximum norm. Further, $\Pi_\pi$ is maximum-norm nonexpansive [16, 7, 8]. (This is not true for general $\Phi$, but is true in our context in which columns of $\Phi$ are indicator functions for partitions.) It follows that the composition $\Pi_\pi T$ is a contraction mapping. By the contraction mapping theorem, $\Pi_\pi T$ has a unique fixed point $\Phi \tilde{r}$, which is the limit of the sequence $\Phi r^{(\ell)}$. Further, the following result holds:

Theorem 2 For any MDP, partition, and weights $\pi$ with support intersecting every partition, if $\Phi \tilde{r} = \Pi_\pi T \Phi \tilde{r}$ then

$$\|\Phi \tilde{r} - J^*\|_\infty \le \frac{2}{1-\alpha} \min_{r \in \mathbb{R}^K} \|J^* - \Phi r\|_\infty$$

and

$$(1-\alpha) \|J_{\mu_{\tilde{r}}} - J^*\|_\infty \le \frac{4\alpha}{1-\alpha} \min_{r \in \mathbb{R}^K} \|J^* - \Phi r\|_\infty.$$

The first inequality of the theorem is an approximation error bound, established in [16, 7, 8] for broader classes of approximators that include state aggregation as a special case. The second is a performance loss bound, derived by simply combining the approximation error bound and Theorem 1. Note that $J_{\mu_{\tilde{r}}}(x) \ge J^*(x)$ for all $x$, so the left-hand side of the performance loss bound is the maximal increase in cost-to-go, normalized by $1 - \alpha$. This normalization is natural, since a cost-to-go function is a linear combination of expected future costs, with coefficients $1, \alpha, \alpha^2, \ldots$, which sum to $1/(1 - \alpha)$.
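The iteration just described, $\Phi r^{(\ell+1)} = \Pi_\pi T \Phi r^{(\ell)}$, can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the paper: the toy MDP (six states, two actions, random costs and transitions), the partition, the uniform weights, and all function names are assumptions made for the example, and every state is given the same action set for simplicity.

```python
import random

def bellman(J, g, P, alpha):
    """(T J)(x) = min_u [ g_u(x) + alpha * sum_y p_xy(u) J(y) ]."""
    S = len(J)
    return [min(g[u][x] + alpha * sum(P[u][x][y] * J[y] for y in range(S))
                for u in range(len(g)))
            for x in range(S)]

def project(J, part, K, pi):
    """State-aggregation projection: pi-weighted average of J on each partition."""
    num, den = [0.0] * K, [0.0] * K
    for x, k in enumerate(part):
        num[k] += pi[x] * J[x]
        den[k] += pi[x]
    return [n / d for n, d in zip(num, den)]

def approx_value_iteration(g, P, alpha, part, K, pi, iters=500):
    """Iterate r <- Pi_pi T Phi r, where (Phi r)(x) = r[part[x]]."""
    r = [0.0] * K
    for _ in range(iters):
        Phi_r = [r[k] for k in part]
        r = project(bellman(Phi_r, g, P, alpha), part, K, pi)
    return r

# Hypothetical toy MDP (all numbers are illustrative assumptions).
rng = random.Random(0)
S, U, K, alpha = 6, 2, 3, 0.9
g = [[rng.random() for _ in range(S)] for _ in range(U)]   # g[u][x]
P = []                                                     # P[u][x][y]
for u in range(U):
    rows = []
    for x in range(S):
        w = [rng.random() for _ in range(S)]
        rows.append([v / sum(w) for v in w])
    P.append(rows)
part = [0, 0, 1, 1, 2, 2]   # state -> partition index
pi = [1.0] * S              # uniform projection weights
r_tilde = approx_value_iteration(g, P, alpha, part, K, pi)
```

Because $\Pi_\pi T$ is a contraction with modulus $\alpha$, a fixed number of sweeps is enough here; a practical implementation would more likely stop when successive iterates differ by less than a tolerance.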
Our motivation of the normalizing constant begs the question of whether, for fixed MDP parameters $(S, U, g, p)$ and fixed $\Phi$, $\min_r \|J^* - \Phi r\|_\infty$ also grows with $1/(1 - \alpha)$. It turns out that $\min_r \|J^* - \Phi r\|_\infty = O(1)$. To see why, note that for any $\mu$,

$$J_\mu = (I - \alpha P_\mu)^{-1} g_\mu = \frac{1}{1-\alpha} \lambda_\mu + h_\mu,$$

where $\lambda_\mu(x)$ is the expected average cost if the process starts in state $x$ and is controlled by policy $\mu$,

$$\lambda_\mu = \lim_{\tau \to \infty} \frac{1}{\tau} \sum_{t=0}^{\tau-1} P_\mu^t g_\mu,$$

and $h_\mu$ is the discounted differential cost function $h_\mu = (I - \alpha P_\mu)^{-1} (g_\mu - \lambda_\mu)$. Both $\lambda_\mu$ and $h_\mu$ converge to finite vectors as $\alpha$ approaches 1 [3]. For an optimal policy $\mu^*$, $\lim_{\alpha \uparrow 1} \lambda_{\mu^*}(x)$ does not depend on $x$ (in our context of a communicating MDP). Since constant functions lie in the range of $\Phi$,

$$\lim_{\alpha \uparrow 1} \min_{r \in \mathbb{R}^K} \|J^* - \Phi r\|_\infty \le \lim_{\alpha \uparrow 1} \|h_{\mu^*}\|_\infty < \infty.$$

The performance loss bound still exhibits an undesirable dependence on $\alpha$ through the coefficient $4\alpha/(1 - \alpha)$. In most relevant contexts, $\alpha$ is close to 1; a representative value might be 0.99. Consequently, $4\alpha/(1 - \alpha)$ can be very large. Unfortunately, the bound is sharp, as expressed by the following theorem. We will denote by $\mathbf{1}$ the vector with every component equal to 1.

Theorem 3 For any $\delta > 0$, $\alpha \in (0, 1)$, and $\Delta \ge 0$, there exist MDP parameters $(S, U, g, p)$ and a partition such that $\min_{r \in \mathbb{R}^K} \|J^* - \Phi r\|_\infty = \Delta$ and, if $\Phi \tilde{r} = \Pi_\pi T \Phi \tilde{r}$ with $\pi = \mathbf{1}$,

$$(1-\alpha) \|J_{\mu_{\tilde{r}}} - J^*\|_\infty \ge \frac{4\alpha}{1-\alpha} \min_{r \in \mathbb{R}^K} \|J^* - \Phi r\|_\infty - \delta.$$

This theorem is established through an example in [22]. The choice of uniform weights ($\pi = \mathbf{1}$) is meant to point out that even for such a simple, perhaps natural, choice of weights, the performance loss bound is sharp. Based on Theorems 2 and 3, one might expect that there exist MDP parameters $(S, U, g, p)$ and a partition such that, with $\pi = \mathbf{1}$,

$$(1-\alpha) \|J_{\mu_{\tilde{r}}} - J^*\|_\infty = \Theta\Big( \frac{1}{1-\alpha} \min_{r \in \mathbb{R}^K} \|J^* - \Phi r\|_\infty \Big).$$

In other words, that the performance loss is both lower and upper bounded by $1/(1 - \alpha)$ times the smallest possible approximation error. It turns out that this is not true, at least if we restrict to a finite state space.
However, as the following theorem establishes, the coefficient multiplying $\min_{r \in \mathbb{R}^K} \|J^* - \Phi r\|_\infty$ can grow arbitrarily large as $\alpha$ increases, keeping all else fixed.

Theorem 4 For any $L$ and $\Delta \ge 0$, there exist MDP parameters $(S, U, g, p)$ and a partition such that $\lim_{\alpha \uparrow 1} \min_{r \in \mathbb{R}^K} \|J^* - \Phi r\|_\infty = \Delta$ and, if $\Phi \tilde{r} = \Pi_\pi T \Phi \tilde{r}$ with $\pi = \mathbf{1}$,

$$\liminf_{\alpha \uparrow 1} (1-\alpha) \big( J_{\mu_{\tilde{r}}}(x) - J^*(x) \big) \ge L \lim_{\alpha \uparrow 1} \min_{r \in \mathbb{R}^K} \|J^* - \Phi r\|_\infty,$$

for all $x \in S$.

This theorem is also established through an example [22]. For any $\mu$ and $x$,

$$\lim_{\alpha \uparrow 1} \big( (1-\alpha) J_\mu(x) - \lambda_\mu(x) \big) = \lim_{\alpha \uparrow 1} (1-\alpha) h_\mu(x) = 0.$$

Combined with Theorem 4, this yields the following corollary.

Corollary 1 For any $L$ and $\Delta \ge 0$, there exist MDP parameters $(S, U, g, p)$ and a partition such that $\lim_{\alpha \uparrow 1} \min_{r \in \mathbb{R}^K} \|J^* - \Phi r\|_\infty = \Delta$ and, if $\Phi \tilde{r} = \Pi_\pi T \Phi \tilde{r}$ with $\pi = \mathbf{1}$,

$$\liminf_{\alpha \uparrow 1} \big( \lambda_{\mu_{\tilde{r}}}(x) - \lambda_{\mu^*}(x) \big) \ge L \lim_{\alpha \uparrow 1} \min_{r \in \mathbb{R}^K} \|J^* - \Phi r\|_\infty,$$

for all $x \in S$.

3 Using the Invariant Distribution

In the previous section, we considered an approximation $\Phi \tilde{r}$ that solves $\Pi_\pi T \Phi \tilde{r} = \Phi \tilde{r}$ for some arbitrary pre-selected weights $\pi$. We now turn to consider use of an invariant state distribution $\pi_{\tilde{r}}$ of $P_{\mu_{\tilde{r}}}$ as the weight vector (see footnote 1). This leads to a circular definition: the weights are used in defining $\tilde{r}$ and now we are defining the weights in terms of $\tilde{r}$. What we are really after here is a vector $\tilde{r}$ that satisfies $\Pi_{\pi_{\tilde{r}}} T \Phi \tilde{r} = \Phi \tilde{r}$. The following theorem captures the associated benefits. (Due to space limitations, we omit the proof, which is provided in the full length version of this paper [22].)

Theorem 5 For any MDP and partition, if $\Phi \tilde{r} = \Pi_{\pi_{\tilde{r}}} T \Phi \tilde{r}$ and $\pi_{\tilde{r}}$ has support intersecting every partition,

$$(1-\alpha)\, \pi_{\tilde{r}}^T (J_{\mu_{\tilde{r}}} - J^*) \le 2\alpha \min_{r \in \mathbb{R}^K} \|J^* - \Phi r\|_\infty.$$

When $\alpha$ is close to 1, which is typical, the right-hand side of our new performance loss bound is far less than that of Theorem 2. The primary improvement is in the omission of a factor of $1 - \alpha$ from the denominator. But for the bounds to be compared in a meaningful way, we must also relate the left-hand-side expressions.
A relation can be based on the fact that for all $\mu$, $\lim_{\alpha \uparrow 1} \|(1-\alpha) J_\mu - \lambda_\mu\|_\infty = 0$, as explained in Section 2. In particular, based on this, we have

$$\lim_{\alpha \uparrow 1} (1-\alpha) \|J_\mu - J^*\|_\infty = |\lambda_\mu - \lambda^*| = \lambda_\mu - \lambda^* = \lim_{\alpha \uparrow 1} (1-\alpha)\, \pi^T (J_\mu - J^*),$$

for all policies $\mu$ and probability distributions $\pi$. Hence, the left-hand-side expressions from the two performance bounds become directly comparable as $\alpha$ approaches 1. Another interesting comparison can be made by contrasting Corollary 1 against the following immediate consequence of Theorem 5.

Corollary 2 For all MDP parameters $(S, U, g, p)$ and partitions, if $\Phi \tilde{r} = \Pi_{\pi_{\tilde{r}}} T \Phi \tilde{r}$ and $\liminf_{\alpha \uparrow 1} \sum_{x \in S_k} \pi_{\tilde{r}}(x) > 0$ for all $k$,

$$\limsup_{\alpha \uparrow 1} \|\lambda_{\mu_{\tilde{r}}} - \lambda_{\mu^*}\|_\infty \le 2 \lim_{\alpha \uparrow 1} \min_{r \in \mathbb{R}^K} \|J^* - \Phi r\|_\infty.$$

The comparison suggests that solving $\Phi \tilde{r} = \Pi_{\pi_{\tilde{r}}} T \Phi \tilde{r}$ is strongly preferable to solving $\Phi \tilde{r} = \Pi_\pi T \Phi \tilde{r}$ with $\pi = \mathbf{1}$.

Footnote 1: By an invariant state distribution of a transition matrix $P$, we mean any probability distribution $\pi$ such that $\pi^T P = \pi^T$. In the event that $P_{\mu_{\tilde{r}}}$ has multiple invariant distributions, $\pi_{\tilde{r}}$ denotes an arbitrary choice.

4 Exploration

If a vector $\tilde{r}$ solves $\Phi \tilde{r} = \Pi_{\pi_{\tilde{r}}} T \Phi \tilde{r}$ and the support of $\pi_{\tilde{r}}$ intersects every partition, Theorem 5 promises a desirable bound. However, there are two significant shortcomings to this solution concept, which we will address in this section. First, in some cases, the equation $\Pi_{\pi_{\tilde{r}}} T \Phi \tilde{r} = \Phi \tilde{r}$ does not have a solution. It is easy to produce examples of this; though no example has been documented for the particular class of approximators we are using here, [2] offers an example involving a different linearly parameterized approximator that captures the spirit of what can happen. Second, it would be nice to relax the requirement that the support of $\pi_{\tilde{r}}$ intersect every partition.

To address these shortcomings, we introduce stochastic policies. A stochastic policy $\mu$ maps state-action pairs to probabilities. For each $x \in S$ and $u \in U_x$, $\mu(x, u)$ is the probability of taking action $u$ when in state $x$.
Hence, $\mu(x, u) \ge 0$ for all $x \in S$ and $u \in U_x$, and $\sum_{u \in U_x} \mu(x, u) = 1$ for all $x \in S$. Given a scalar $\epsilon > 0$ and a function $J$, the $\epsilon$-greedy Boltzmann exploration policy with respect to $J$ is defined by

$$\mu(x, u) = \frac{e^{-(T_u J)(x)(|U_x|-1)/(\epsilon e)}}{\sum_{u' \in U_x} e^{-(T_{u'} J)(x)(|U_x|-1)/(\epsilon e)}}.$$

For any $\epsilon > 0$ and $r$, let $\mu_r^\epsilon$ denote the $\epsilon$-greedy Boltzmann exploration policy with respect to $\Phi r$. Further, we define a modified dynamic programming operator that incorporates Boltzmann exploration:

$$(T^\epsilon J)(x) = \frac{\sum_{u \in U_x} e^{-(T_u J)(x)(|U_x|-1)/(\epsilon e)} (T_u J)(x)}{\sum_{u \in U_x} e^{-(T_u J)(x)(|U_x|-1)/(\epsilon e)}}.$$

As $\epsilon$ approaches 0, $\epsilon$-greedy Boltzmann exploration policies become greedy and the modified dynamic programming operators become the dynamic programming operator. More precisely, for all $r$, $x$, and $J$, $\lim_{\epsilon \downarrow 0} \mu_r^\epsilon(x, \mu_r(x)) = 1$ and $\lim_{\epsilon \downarrow 0} T^\epsilon J = TJ$. These are immediate consequences of the following result (see [4] for a proof).

Lemma 1 For any $n$ and $v \in \mathbb{R}^n$,

$$\min_i v_i + \epsilon \ge \frac{\sum_i e^{-v_i (n-1)/(\epsilon e)} v_i}{\sum_i e^{-v_i (n-1)/(\epsilon e)}} \ge \min_i v_i.$$

Because we are only concerned with communicating MDPs, there is a unique invariant state distribution associated with each $\epsilon$-greedy Boltzmann exploration policy $\mu_r^\epsilon$ and the support of this distribution is $S$. Let $\pi_r^\epsilon$ denote this distribution. We consider a vector $\tilde{r}$ that solves $\Phi \tilde{r} = \Pi_{\pi_{\tilde{r}}^\epsilon} T^\epsilon \Phi \tilde{r}$. For any $\epsilon > 0$, there exists a solution to this equation (this is an immediate extension of Theorem 5.1 from [4]). We have the following performance loss bound, which parallels Theorem 5 but with an equation for which a solution is guaranteed to exist and without any requirement on the resulting invariant distribution. (Again, we omit the proof, which is available in [22].)

Theorem 6 For any MDP, partition, and $\epsilon > 0$, if $\Phi \tilde{r} = \Pi_{\pi_{\tilde{r}}^\epsilon} T^\epsilon \Phi \tilde{r}$ then

$$(1-\alpha) (\pi_{\tilde{r}}^\epsilon)^T (J_{\mu_{\tilde{r}}^\epsilon} - J^*) \le 2\alpha \min_{r \in \mathbb{R}^K} \|J^* - \Phi r\|_\infty + \epsilon.$$

5 Computation: TD(0)

Though computation is not a focus of this paper, we offer a brief discussion here.
First, we describe a simple algorithm from [16], which draws on ideas from temporal-difference learning [11, 12] and Q-learning [23, 24] to solve $\Phi \tilde{r} = \Pi_\pi T \Phi \tilde{r}$. It requires an ability to sample a sequence of states $x^{(0)}, x^{(1)}, x^{(2)}, \ldots$, each independent and identically distributed according to $\pi$. Also required is a way to efficiently compute $(T \Phi r)(x) = \min_{u \in U_x} \big( g_u(x) + \alpha \sum_{y \in S} p_{xy}(u) (\Phi r)(y) \big)$, for any given $x$ and $r$. This is typically possible when the action set $U_x$ and the support of $p_{x\cdot}(u)$ (i.e., the set of states that can follow $x$ if action $u$ is selected) are not too large. The algorithm generates a sequence of vectors $r^{(\ell)}$ according to

$$r^{(\ell+1)} = r^{(\ell)} + \gamma_\ell\, \phi(x^{(\ell)}) \big( (T \Phi r^{(\ell)})(x^{(\ell)}) - (\Phi r^{(\ell)})(x^{(\ell)}) \big),$$

where $\gamma_\ell$ is a step size and $\phi(x)$ denotes the column vector made up of components from the $x$th row of $\Phi$. In [16], using results from [15, 9], it is shown that under appropriate assumptions on the step size sequence, $r^{(\ell)}$ converges to a vector $\tilde{r}$ that solves $\Phi \tilde{r} = \Pi_\pi T \Phi \tilde{r}$.

The equation $\Phi \tilde{r} = \Pi_{\pi_{\tilde{r}}} T \Phi \tilde{r}$ may have no solution. Further, the requirement that states are sampled independently from the invariant distribution may be impractical. However, a natural extension of the above algorithm leads to an easily implementable version of TD(0) that aims at solving $\Phi \tilde{r} = \Pi_{\pi_{\tilde{r}}^\epsilon} T^\epsilon \Phi \tilde{r}$. The algorithm requires simulation of a trajectory $x_0, x_1, x_2, \ldots$ of the MDP, with each action $u_t \in U_{x_t}$ generated by the $\epsilon$-greedy Boltzmann exploration policy with respect to $\Phi r^{(t)}$. The sequence of vectors $r^{(t)}$ is generated according to

$$r^{(t+1)} = r^{(t)} + \gamma_t\, \phi(x_t) \big( (T^\epsilon \Phi r^{(t)})(x_t) - (\Phi r^{(t)})(x_t) \big).$$

Under suitable conditions on the step size sequence, if this algorithm converges, the limit satisfies $\Phi \tilde{r} = \Pi_{\pi_{\tilde{r}}^\epsilon} T^\epsilon \Phi \tilde{r}$. Whether such an algorithm converges and whether there are other algorithms that can effectively solve $\Phi \tilde{r} = \Pi_{\pi_{\tilde{r}}^\epsilon} T^\epsilon \Phi \tilde{r}$ for broad classes of relevant problems remain open issues.
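A rough Python sketch of the trajectory-based TD(0) variant described above follows. It rests on several assumptions that are not taken from the paper: $(T_u J)(x)$ is read as $g_u(x) + \alpha \sum_y p_{xy}(u) J(y)$, the update target is taken to be the Boltzmann-weighted operator $T^\epsilon$, the step-size schedule `1 / (1 + t / 100)` is a hypothetical choice, and the toy MDP and all function names are made up for illustration. Since convergence is described as an open issue, the sketch simply runs for a fixed number of steps.

```python
import math
import random

def action_values(J, g, P, alpha, x):
    """Assumed form of (T_u J)(x): g_u(x) + alpha * sum_y p_xy(u) J(y), per action u."""
    return [g[u][x] + alpha * sum(p * j for p, j in zip(P[u][x], J))
            for u in range(len(g))]

def boltzmann(v, eps):
    """Weights proportional to exp(-v_u (n - 1) / (eps * e)); shifted by min(v) for stability."""
    n, lo = len(v), min(v)
    w = [math.exp(-(vu - lo) * (n - 1) / (eps * math.e)) for vu in v]
    total = sum(w)
    return [wi / total for wi in w]

def td0(g, P, alpha, part, K, eps, steps, rng):
    """Simulate one trajectory and update only the partition of the visited state."""
    r = [0.0] * K
    x = 0
    for t in range(steps):
        v = action_values([r[k] for k in part], g, P, alpha, x)
        mu = boltzmann(v, eps)
        target = sum(m * vu for m, vu in zip(mu, v))    # (T^eps Phi r)(x)
        gamma = 1.0 / (1.0 + t / 100.0)                 # hypothetical step size
        k = part[x]
        r[k] += gamma * (target - r[k])                 # phi(x) is the indicator of partition k
        u = rng.choices(range(len(v)), weights=mu)[0]   # action from the exploration policy
        x = rng.choices(range(len(part)), weights=P[u][x])[0]
    return r

# Hypothetical toy MDP (illustrative numbers only).
rng = random.Random(0)
S, U, K, alpha, eps = 6, 2, 3, 0.9, 0.1
g = [[rng.random() for _ in range(S)] for _ in range(U)]
P = []
for u in range(U):
    rows = []
    for x in range(S):
        w = [rng.random() for _ in range(S)]
        rows.append([v / sum(w) for v in w])
    P.append(rows)
part = [0, 0, 1, 1, 2, 2]
r = td0(g, P, alpha, part, K, eps, steps=5000, rng=rng)
```

The temperature $\epsilon e / (n - 1)$ used in `boltzmann` is what makes the bound of Lemma 1 hold: the Boltzmann-weighted average of the action values exceeds their minimum by at most $\epsilon$.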
6 Extensions and Open Issues

Our results demonstrate that weighting a Euclidean norm projection by the invariant distribution of a greedy (or approximately greedy) policy can lead to a dramatic performance gain. It is intriguing that temporal-difference learning implicitly carries out such a projection, and consequently, any limit of convergence obeys the stronger performance loss bound. This is not the first time that the invariant distribution has been shown to play a critical role in approximate value iteration and temporal-difference learning. In prior work involving approximation of a cost-to-go function for a fixed policy (no control) and a general linearly parameterized approximator (arbitrary matrix $\Phi$), it was shown that weighting by the invariant distribution is key to ensuring convergence and an approximation error bound [17, 18]. Earlier empirical work anticipated this [13, 14].

The temporal-difference learning algorithm presented in Section 5 is a version of TD(0). This is a special case of TD($\lambda$), which is parameterized by $\lambda \in [0, 1]$. It is not known whether the results of this paper can be extended to the general case of $\lambda \in [0, 1]$. Prior research has suggested that larger values of $\lambda$ lead to superior results. In particular, an example of [1] and the approximation error bounds of [17, 18], both of which are restricted to the case of a fixed policy, suggest that approximation error is amplified by a factor of $1/(1 - \alpha)$ as $\lambda$ is changed from 1 to 0. The results of Sections 3 and 4 suggest that this factor vanishes if one considers a controlled process and performance loss rather than approximation error.

Whether the results of this paper can be extended to accommodate approximate value iteration with general linearly parameterized approximators remains an open issue. In this broader context, error and performance loss bounds of the kind offered by Theorem 2 are unavailable, even when the invariant distribution is used to weight the projection.
Such error and performance bounds are available, on the other hand, for the solution to a certain linear program [5, 6]. Whether a factor of $1/(1 - \alpha)$ can similarly be eliminated from these bounds is an open issue.

Our results can be extended to accommodate an average cost objective, assuming that the MDP is communicating. With Boltzmann exploration, the equation of interest becomes

$$\Phi \tilde{r} = \Pi_{\pi_{\tilde{r}}^\epsilon} (T^\epsilon \Phi \tilde{r} - \tilde{\lambda} \mathbf{1}).$$

The variables include an estimate $\tilde{\lambda} \in \mathbb{R}$ of the minimal average cost $\lambda^* \in \mathbb{R}$ and an approximation $\Phi \tilde{r}$ of the optimal differential cost function $h^*$. The discount factor $\alpha$ is set to 1 in computing an $\epsilon$-greedy Boltzmann exploration policy as well as $T^\epsilon$. There is an average-cost version of temporal-difference learning for which any limit of convergence $(\tilde{\lambda}, \tilde{r})$ satisfies this equation [19, 20, 21]. Generalization of Theorem 2 does not lead to a useful result because the right-hand side of the bound becomes infinite as $\alpha$ approaches 1. On the other hand, generalization of Theorem 6 yields the first performance loss bound for approximate value iteration with an average-cost objective:

Theorem 7 For any communicating MDP with an average-cost objective, partition, and $\epsilon > 0$, if $\Phi \tilde{r} = \Pi_{\pi_{\tilde{r}}^\epsilon} (T^\epsilon \Phi \tilde{r} - \tilde{\lambda} \mathbf{1})$ then

$$\lambda_{\mu_{\tilde{r}}^\epsilon} - \lambda^* \le 2 \min_{r \in \mathbb{R}^K} \|h^* - \Phi r\|_\infty + \epsilon.$$

Here, $\lambda_{\mu_{\tilde{r}}^\epsilon} \in \mathbb{R}$ denotes the average cost under policy $\mu_{\tilde{r}}^\epsilon$, which is well-defined because the process is irreducible under an $\epsilon$-greedy Boltzmann exploration policy. This theorem can be proved by taking limits on the left and right-hand sides of the bound of Theorem 6. It is easy to see that the limit of the left-hand side is $\lambda_{\mu_{\tilde{r}}^\epsilon} - \lambda^*$. The limit of $\min_{r \in \mathbb{R}^K} \|J^* - \Phi r\|_\infty$ on the right-hand side is $\min_{r \in \mathbb{R}^K} \|h^* - \Phi r\|_\infty$. (This follows from the analysis of [3].)

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant ECS-9985229 and by the Office of Naval Research under Grant MURI N00014-001-0637.
The author’s understanding of the topic beneﬁted from collaborations with Dimitri Bertsekas, Daniela de Farias, and John Tsitsiklis. A full length version of this paper has been submitted to Mathematics of Operations Research and has beneﬁted from a number of useful comments and suggestions made by reviewers. References [1] D. P. Bertsekas. A counterexample to temporal-difference learning. Neural Computation, 7:270–279, 1994. [2] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientiﬁc, Belmont, MA, 1996. [3] D. Blackwell. Discrete dynamic programming. Annals of Mathematical Statistics, 33:719–726, 1962. [4] D. P. de Farias and B. Van Roy. On the existence of ﬁxed points for approximate value iteration and temporal-difference learning. Journal of Optimization Theory and Applications, 105(3), 2000. [5] D. P. de Farias and B. Van Roy. Approximate dynamic programming via linear programming. In Advances in Neural Information Processing Systems 14. MIT Press, 2002. [6] D. P. de Farias and B. Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51(6):850–865, 2003. [7] G. J. Gordon. Stable function approximation in dynamic programming. Technical Report CMU-CS-95-103, Carnegie Mellon University, 1995. [8] G. J. Gordon. Stable function approximation in dynamic programming. In Machine Learning: Proceedings of the Twelfth International Conference (ICML), San Francisco, CA, 1995. [9] T. Jaakkola, M. I. Jordan, and S. P. Singh. On the Convergence of Stochastic Iterative Dynamic Programming Algorithms. Neural Computation, 6:1185–1201, 1994. [10] S. P. Singh and R. C. Yee. An upper-bound on the loss from approximate optimalvalue functions. Machine Learning, 1994. [11] R. S. Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, Amherst, MA, 1984. [12] R. S. Sutton. Learning to predict by the methods of temporal differences. 
Machine Learning, 3:9–44, 1988. [13] R. S. Sutton. On the virtues of linear learning and trajectory distributions. In Proceedings of the Workshop on Value Function Approximation, Machine Learning Conference, 1995. [14] R. S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems 8, Cambridge, MA, 1996. MIT Press. [15] J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185–202, 1994. [16] J. N. Tsitsiklis and B. Van Roy. Feature–based methods for large scale dynamic programming. Machine Learning, 22:59–94, 1996. [17] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal–difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997. [18] J. N. Tsitsiklis and B. Van Roy. Analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems 9, Cambridge, MA, 1997. MIT Press. [19] J. N. Tsitsiklis and B. Van Roy. Average cost temporal-difference learning. In Proceedings of the IEEE Conference on Decision and Control, 1997. [20] J. N. Tsitsiklis and B. Van Roy. Average cost temporal-difference learning. Automatica, 35(11):1799–1808, 1999. [21] J. N. Tsitsiklis and B. Van Roy. On average versus discounted reward temporaldifference learning. Machine Learning, 49(2-3):179–191, 2002. [22] B. Van Roy. Performance loss bounds for approximate value iteration with state aggregation. Under review with Mathematics of Operations Research, available at www.stanford.edu/ bvr/psﬁles/aggregation.pdf, 2005. [23] C. J. C. H. Watkins. Learning From Delayed Rewards. PhD thesis, Cambridge University, Cambridge, UK, 1989. [24] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279–292, 1992.

4 0.89639395 156 nips-2005-Prediction and Change Detection

Author: Mark Steyvers, Scott Brown

Abstract: We measure the ability of human observers to predict the next datum in a sequence that is generated by a simple statistical process undergoing change at random points in time. Accurate performance in this task requires the identification of changepoints. We assess individual differences between observers both empirically, and using two kinds of models: a Bayesian approach for change detection and a family of cognitively plausible fast and frugal models. Some individuals detect too many changes and hence perform sub-optimally due to excess variability. Other individuals do not detect enough changes, and perform sub-optimally because they fail to notice short-term temporal trends.

1 Introduction

Decision-making often requires a rapid response to change. For example, stock analysts need to quickly detect changes in the market in order to adjust investment strategies. Coaches need to track changes in a player's performance in order to adjust strategy. When tracking changes, there are costs involved when either more or fewer changes are observed than actually occurred. For example, when using an overly conservative change detection criterion, a stock analyst might miss important short-term trends and interpret them as random fluctuations instead. On the other hand, a change may also be detected too readily. For example, in basketball, a player who makes a series of consecutive baskets is often identified as a "hot hand" player whose underlying ability is perceived to have suddenly increased [1,2]. This might lead to sub-optimal passing strategies, based on random fluctuations. We are interested in explaining individual differences in a sequential prediction task. Observers are shown stimuli generated from a simple statistical process with the task of predicting the next datum in the sequence. The latent parameters of the statistical process change discretely at random points in time.
Performance in this task depends on the accurate detection of those changepoints, as well as inference about future outcomes based on the outcomes that followed the most recent inferred changepoint. There is much prior research in statistics on the problem of identifying changepoints [3,4,5]. In this paper, we adopt a Bayesian approach to the changepoint identification problem and develop a simple inference procedure to predict the next datum in a sequence. The Bayesian model serves as an ideal observer model and is useful to characterize the ways in which individuals deviate from optimality.

The plan of the paper is as follows. We first introduce the sequential prediction task and discuss a Bayesian analysis of this prediction problem. We then discuss the results from a few individuals in this prediction task and show how the Bayesian approach can capture individual differences with a single "twitchiness" parameter that describes how readily changes are perceived in random sequences. We will show that some individuals are too twitchy: their performance is too variable because they base their predictions on too little of the recent data. Other individuals are not twitchy enough, and they fail to capture fast changes in the data. We also show how behavior can be explained with a set of fast and frugal models [6]. These are cognitively realistic models that operate under plausible computational constraints.

2 A prediction task with multiple change points

In the prediction task, stimuli are presented sequentially and the task is to predict the next stimulus in the sequence. After $t$ trials, the observer has been presented with stimuli $y_1, y_2, \ldots, y_t$ and the task is to make a prediction about $y_{t+1}$. After the prediction is made, the actual outcome $y_{t+1}$ is revealed and the next trial proceeds to the prediction of $y_{t+2}$. This procedure starts with $y_1$ and is repeated for $T$ trials.
The observations yt are D-dimensional vectors with elements sampled from binomial distributions. The parameters of those distributions change discretely at random points in time such that the mean increases or decreases after a change point. This generates a sequence of observation vectors, y1, y2, …, yT, where each yt = {yt,1 … yt,D}. Each of the yt,d is sampled from a binomial distribution Bin(θt,d,K), so 0 ≤ yt,d ≤ K. The parameter vector θt = {θt,1 … θt,D} changes depending on the locations of the changepoints. At each time step, xt is a binary indicator for a changepoint occurring at time t+1. The parameter α determines the probability of a change occurring in the sequence. The generative model is specified by the following algorithm:
1. For d=1..D, sample θ1,d from a Uniform(0,1) distribution.
2. For t=2..T,
(a) Sample xt-1 from a Bernoulli(α) distribution.
(b) If xt-1=0, then θt=θt-1; else for d=1..D, sample θt,d from a Uniform(0,1) distribution.
(c) For d=1..D, sample yt,d from a Bin(θt,d,K) distribution.
Table 1 shows some data generated from the changepoint model with T=20, α=.1, and D=1. In the prediction task, y will be observed, but x and θ are not.
Table 1: Example data
t   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20
x   0   0   0   1   0   0   1   0   0   0   0   0   1   0   1   0   0   0   0   0
θ   .68 .68 .68 .68 .48 .48 .48 .74 .74 .74 .74 .74 .74 .19 .19 .87 .87 .87 .87 .87
y   9   7   8   7   4   4   4   9   8   3   6   7   8   2   1   8   9   9   8   8
3 A Bayesian prediction model In both our Bayesian and fast-and-frugal analyses, the prediction task is decomposed into two inference procedures. First, the changepoint locations are identified. This is followed by predictive inference for the next outcome based on the most recent changepoint locations. Several Bayesian approaches have been developed for changepoint problems involving single or multiple changepoints [3,5].
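The generative algorithm of Section 2 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `sample_changepoint_data` and the `seed` argument are hypothetical, and the sketch assumes that y1 is drawn from Bin(θ1,d,K) in the same way as later outcomes, which the algorithm leaves implicit.

```python
import random


def sample_changepoint_data(T, alpha, D, K, seed=None):
    """Sample (x, theta, y) from the changepoint model of Section 2.

    Returns x (length T-1, x[t-1] = 1 means a change before trial t+1),
    theta (T lists of D probabilities) and y (T lists of D counts in 0..K).
    """
    rng = random.Random(seed)

    def draw_theta():
        # Step 1 / step 2(b): each theta_{t,d} ~ Uniform(0, 1)
        return [rng.random() for _ in range(D)]

    def draw_y(theta_t):
        # Step 2(c): Binomial(K, p) as a sum of K Bernoulli(p) draws
        return [sum(rng.random() < p for _ in range(K)) for p in theta_t]

    theta = [draw_theta()]
    y = [draw_y(theta[0])]  # assumption: y_1 is sampled like every other y_t
    x = []
    for _ in range(2, T + 1):
        change = int(rng.random() < alpha)  # step 2(a): Bernoulli(alpha)
        x.append(change)
        theta.append(draw_theta() if change else list(theta[-1]))
        y.append(draw_y(theta[-1]))
    return x, theta, y
```

Calling `sample_changepoint_data(T=20, alpha=0.1, D=1, K=10, seed=0)` yields data of the same shape as Table 1, except that the algorithm as stated produces only T-1 indicators x.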
We apply a Markov Chain Monte Carlo (MCMC) analysis to approximate the joint posterior distribution over changepoint assignments x while integrating out θ. Gibbs sampling will be used to sample from this posterior marginal distribution. The samples can then be used to predict the next outcome in the sequence. 3.1 Inference for changepoint assignments. To apply Gibbs sampling, we evaluate the conditional probability of assigning a changepoint at time i, given all other changepoint assignments and the current α value. By integrating out θ, the conditional probability is
$$P(x_i \mid x_{-i}, y, \alpha) = \int_{\theta} P(x_i, \theta, \alpha \mid x_{-i}, y) \qquad (1)$$
where $x_{-i}$ represents all switch point assignments except $x_i$. This can be simplified by considering the location of the most recent changepoint preceding and following time i and the outcomes occurring between these locations. Let $n_i^L$ be the number of time steps from the last changepoint up to and including the current time step i such that $x_{i-n_i^L} = 1$ and $x_{i-n_i^L+j} = 0$ for $0 < j < n_i^L$. Similarly, let $n_i^R$ be the number of time steps that follow time step i up to the next changepoint such that $x_{i+n_i^R} = 1$ and $x_{i+n_i^R-j} = 0$ for $0 < j < n_i^R$. Let $y_i^L = \sum_{i-n_i^L < k \le i} y_k$ and $y_i^R = \sum_{i < k \le i+n_i^R} y_k$. The update equation for the changepoint assignment can then be simplified to
$$P(x_i = m \mid x_{-i}) \propto \begin{cases} (1-\alpha) \displaystyle\prod_{j=1}^{D} \frac{\Gamma(1 + y_{i,j}^L + y_{i,j}^R)\,\Gamma(1 + K n_i^L + K n_i^R - y_{i,j}^L - y_{i,j}^R)}{\Gamma(2 + K n_i^L + K n_i^R)} & m = 0 \\[2ex] \alpha \displaystyle\prod_{j=1}^{D} \frac{\Gamma(1 + y_{i,j}^L)\,\Gamma(1 + K n_i^L - y_{i,j}^L)}{\Gamma(2 + K n_i^L)} \cdot \frac{\Gamma(1 + y_{i,j}^R)\,\Gamma(1 + K n_i^R - y_{i,j}^R)}{\Gamma(2 + K n_i^R)} & m = 1 \end{cases} \qquad (2)$$
We initialize the Gibbs sampler by sampling each xt from a Bernoulli(α) distribution. All changepoint assignments are then updated sequentially by the Gibbs sampling equation above. The sampler is run for M iterations after which one set of changepoint assignments is saved.
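The Gibbs conditional of Eq. (2) is easiest to evaluate in log space, because the Gamma terms overflow quickly for long segments. The sketch below is one possible way to do this, not the authors' implementation: the function names are hypothetical, the left and right segment sums and lengths are assumed to be computed by the caller, and α is assumed to lie strictly between 0 and 1.

```python
from math import exp, lgamma, log


def log_segment_evidence(y_sum, n_steps, K):
    """log of Gamma(1+y) Gamma(1+Kn-y) / Gamma(2+Kn) for one dimension."""
    total = K * n_steps
    return lgamma(1 + y_sum) + lgamma(1 + total - y_sum) - lgamma(2 + total)


def changepoint_probability(y_left, n_left, y_right, n_right, K, alpha):
    """Normalised P(x_i = 1 | x_-i) from Eq. (2).

    y_left and y_right hold the per-dimension outcome sums (length D) of the
    segments before and after position i; n_left and n_right are the segment
    lengths in time steps.
    """
    log_no_change = log(1 - alpha)  # m = 0: one merged segment
    log_change = log(alpha)         # m = 1: two separate segments
    for yl, yr in zip(y_left, y_right):
        log_no_change += log_segment_evidence(yl + yr, n_left + n_right, K)
        log_change += log_segment_evidence(yl, n_left, K)
        log_change += log_segment_evidence(yr, n_right, K)
    # Subtract the larger log weight before exponentiating, for stability.
    top = max(log_no_change, log_change)
    w0 = exp(log_no_change - top)
    w1 = exp(log_change - top)
    return w1 / (w0 + w1)
```

A Gibbs update would then set xi to 1 with this probability and to 0 otherwise. With K=10 and five time steps on each side, a left segment of all 10s against a right segment of all 0s gives a probability close to 1, while identical segments give a probability below α.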
The Gibbs sampler is then restarted multiple times until S samples have been collected. Although we could have included an update equation for α, in this analysis we treat α as a known constant. This will be useful when characterizing the differences between human observers in terms of differences in α. 3.2 Predictive inference The next latent parameter value θt+1 and outcome yt+1 can be predicted on the basis of observed outcomes that occurred after the last inferred changepoint:
$$\theta_{t+1,j} = \sum_{i=t^*+1}^{t} y_{i,j} / K, \qquad y_{t+1,j} = \mathrm{round}(\theta_{t+1,j} K) \qquad (3)$$
where t* is the location of the most recent change point. By considering multiple Gibbs samples, we get a distribution over outcomes yt+1. We base the model predictions on the mean of this distribution. 3.3 Illustration of model performance Figure 1 illustrates the performance of the model on a one dimensional sequence (D=1) generated from the changepoint model with T=160, α=0.05, and K=10. The Gibbs sampler was run for M=30 iterations and S=200 samples were collected. The top panel shows the actual changepoints (triangles) and the distribution of changepoint assignments averaged over samples. The bottom panel shows the observed data y (thin lines) as well as the θ values in the generative model (rescaled between 0 and 10). At locations with large changes between observations, the marginal changepoint probability is quite high. At other locations, the true change in the mean is very small, and the model is less likely to put in a changepoint. The lower right panel shows the distribution over predicted θt+1 values.
Figure 1. Results of model simulation. (Panels: xt; yt; θt+1.)
4 Prediction experiment We tested performance of 9 human observers in the prediction task. The observers included the authors, a visitor, and one student who were aware of the statistical nature of the task as well as naïve students.
The observers were seated in front of an LCD touch screen displaying a two-dimensional grid of 11 x 11 buttons. The changepoint model was used to generate a sequence of T=1500 stimuli for two binomial variables y1 and y2 (D=2, K=10). The change probability α was set to 0.1. The two variables y1 and y2 specified the two-dimensional button location. The same sequence was used for all observers. On each trial, the observer touched a button on the grid displayed on the touch screen. Following each button press, the button corresponding to the next {y1,y2} outcome in the sequence was highlighted. Observers were instructed to press the button that best predicted the next location of the highlighted button. The 1500 trials were divided into three blocks of 500 trials. Breaks were allowed between blocks. The whole experiment lasted between 15 and 30 minutes. Figure 2 shows the first 50 trials from the third block of the experiment. The top and bottom panels show the actual outcomes for the y1 and y2 button grid coordinates as well as the predictions for two observers (SB and MY). The figure shows that at trial 15, the y1 and y2 coordinates show a large shift followed by an immediate shift in observer MY’s predictions (on trial 16). Observer SB waits until trial 17 to make a shift.
Figure 2. Trial by trial predictions from two observers. (Legend: outcomes, SB predictions, MY predictions; horizontal axis: Trial.)
4.1 Task error We assessed prediction performance by comparing the prediction with the actual outcome in the sequence. Task error was measured by normalized city-block distance
$$\text{task error} = \frac{1}{T-1} \sum_{t=2}^{T} \left| y_{t,1} - y_{t,1}^{O} \right| + \left| y_{t,2} - y_{t,2}^{O} \right| \qquad (4)$$
where yO represents the observer’s prediction. Note that the very first trial is excluded from this calculation.
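As a concrete reading of Eq. (4), the following minimal sketch computes the normalized city-block task error for the two-dimensional case. The function name `task_error` and the list-of-pairs data layout are assumptions made for illustration.

```python
def task_error(outcomes, predictions):
    """Normalized city-block distance of Eq. (4) for D = 2.

    outcomes[k] and predictions[k] are (y1, y2) pairs for trial k + 1.
    The first trial is skipped, so the sum runs over t = 2..T.
    """
    if len(outcomes) != len(predictions) or len(outcomes) < 2:
        raise ValueError("need two equally long sequences of at least 2 trials")
    total = sum(
        abs(o1 - p1) + abs(o2 - p2)
        for (o1, o2), (p1, p2) in zip(outcomes[1:], predictions[1:])
    )
    return total / (len(outcomes) - 1)
```

For example, outcomes `[(0, 0), (5, 5), (7, 3)]` and predictions `[(9, 9), (4, 6), (7, 1)]` give (1 + 1) + (0 + 2) = 4 over T - 1 = 2 trials, so the task error is 2.0. The large miss on the first trial does not count.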
Even though more suitable probabilistic measures for prediction error could have been adopted, we wanted to allow comparison of observers’ performance with both probabilistic and non-probabilistic models. Task error ranged from 2.8 (for participant MY) to 3.3 (for ML). We also assessed the performance of five models – their task errors ranged from 2.78 to 3.20. The Bayesian models (Section 3) had the lowest task errors, just below 2.8. This fits with our definition of the Bayesian models as “ideal observer” models – their task error is lower than any other model’s and any human observer’s task error. The fast and frugal models (Section 5) had task errors ranging from 2.85 to 3.20. 5 Modeling Results We will refer to the models with the following letter codes: B=Bayesian Model, LB=limited Bayesian model, FF1..3=fast and frugal models 1..3. We assessed model fit by comparing the model’s prediction against the human observers’ predictions, again using a normalized city-block distance
$$\text{model error} = \frac{1}{T-1} \sum_{t=2}^{T} \left| y_{t,1}^{M} - y_{t,1}^{O} \right| + \left| y_{t,2}^{M} - y_{t,2}^{O} \right| \qquad (5)$$
where yM represents the model’s prediction. The model error for each individual observer is shown in Figure 3. It is important to note that because each model is associated with a set of free parameters, the parameters optimized for task error and model error are different. For Figure 3, the parameters were optimized to minimize Equation (5) for each individual observer, showing the extent to which these models can capture the performance of individual observers, not necessarily providing the best task performance.
Figure 3. Model error for each individual observer (see footnote 1). (Models: B, LB, FF1, FF2, FF3; observers: MY, MS, MM, EJ, PH, NP, DN, SB, ML; vertical axis: Model Error.)
5.1 Bayesian prediction models At each trial t, the model was provided with the sequence of all previous outcomes. The Gibbs sampling and inference procedures from Eq. (2) and (3) were applied with M=30 iterations and S=200 samples.
The change probability α was a free parameter. In the full Bayesian model, the whole sequence of observations up to the current trial is available for prediction, leading to a memory requirement of up to T=1500 trials – a psychologically unreasonable assumption. We therefore also simulated a limited Bayesian model (LB) where the observed sequence was truncated to the last 10 outcomes. The LB model showed almost no decrement in task performance compared to the full Bayesian model. Figure 3 also shows that it fit human data quite well. 5.2 Individual Differences The right-hand panel of Figure 4 plots each observer’s task error as a function of the mean city-block distance between their subsequent button presses. This shows a clear U-shaped function. Observers with very variable predictions (e.g., ML and DN) had large average changes between successive button pushes, and also had large task error: These observers were too “twitchy”. Observers with very small average button changes (e.g., SB and NP) were not twitchy enough, and also had large task error. Observers in the middle had the lowest task error (e.g., MS and MY). The left-hand panel of Figure 4 shows the same data, but with the x-axis based on the Bayesian model fits. Instead of using mean button change distance to index twitchiness (as in the right-hand panel), the left-hand panel uses the estimated α parameters from the Bayesian model. A similar U-shaped pattern is observed: individuals with too large or too small α estimates have large task errors.
Footnote 1: Error bars indicate bootstrapped 95% confidence intervals.
Figure 4. Task error vs. “twitchiness”. Left-hand panel indexes twitchiness using estimated α parameters from Bayesian model fits. Right-hand panel uses mean distance between successive predictions. (Vertical axes: Task Error; horizontal axes: α on a logarithmic scale, and Mean Button Change.)
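The model-free twitchiness index on the right-hand axis of Figure 4, the mean city-block distance between successive button presses, can be computed as in the sketch below. The function name `mean_button_change` and the list-of-pairs layout are illustrative assumptions, not taken from the paper.

```python
def mean_button_change(presses):
    """Mean city-block distance between successive (y1, y2) button presses."""
    if len(presses) < 2:
        raise ValueError("need at least two button presses")
    steps = [
        abs(a1 - b1) + abs(a2 - b2)
        for (a1, a2), (b1, b2) in zip(presses, presses[1:])
    ]
    return sum(steps) / len(steps)
```

An observer who pressed `[(0, 0), (1, 1), (1, 4)]` moved 2 and then 3 buttons, giving a mean button change of 2.5. Larger values correspond to twitchier observers.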
5.3 Fast-and-Frugal (FF) prediction models These models perform the prediction task using simple heuristics that are cognitively plausible. The FF models keep a short memory of previous stimulus values and make predictions using the same two-step process as the Bayesian model. First, a decision is made as to whether the latent parameter θ has changed. Second, remembered stimulus values that occurred after the most recently detected changepoint are used to generate the next prediction. A simple heuristic is used to detect changepoints: If the distance between the most recent observation and prediction is greater than some threshold amount, a change is inferred. We defined the distance between a prediction (p) and an observation (y) as the difference between the log-likelihoods of y assuming θ=p and θ=y. Thus, if fB(.|θ, K) is the binomial density with parameters θ and K, the distance between observation y and prediction p is defined as d(y,p)=log(fB(y|y,K))-log(fB(y|p,K)). A changepoint on time step t+1 is inferred whenever d(yt,pt)>C. The parameter C governs the twitchiness of the model predictions. If C is large, only very dramatic changepoints will be detected, and the model will be too conservative. If C is small, the model will be too twitchy, and will detect changepoints on the basis of small random fluctuations. Predictions are based on the most recent M observations, which are kept in memory, unless a changepoint has been detected in which case only those observations occurring after the changepoint are used for prediction. The prediction for time step t+1 is simply the mean of these observations, say p. Human observers were reluctant to make predictions very close to the boundaries. This was modeled by allowing the FF model to change its prediction for the next time step, yt+1, towards the mean prediction (0.5). This change reflects a two-way bet.
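The changepoint heuristic d(y,p) > C described above can be sketched as follows. Several details are assumptions rather than statements from the paper: the observation y is taken to be a count in 0..K, the prediction p a proportion, "θ=y" is read as θ=y/K, p is clamped away from 0 and 1 by a hypothetical `eps` so that the logarithm stays finite, and all function names are illustrative. The binomial coefficient is omitted because it cancels in the difference of log-likelihoods.

```python
from math import log


def binomial_log_kernel(y, theta, K):
    """log of theta**y * (1 - theta)**(K - y), treating 0 * log(0) as 0."""
    out = 0.0
    if y > 0:
        out += y * log(theta)
    if K - y > 0:
        out += (K - y) * log(1 - theta)
    return out


def ff_distance(y, p, K, eps=1e-6):
    """d(y, p) = log fB(y | y/K, K) - log fB(y | p, K)."""
    p = min(max(p, eps), 1 - eps)  # assumption: keep p strictly inside (0, 1)
    return binomial_log_kernel(y, y / K, K) - binomial_log_kernel(y, p, K)


def ff_detects_change(y, p, K, C):
    """Infer a changepoint when the distance exceeds the threshold C."""
    return ff_distance(y, p, K) > C
```

With K=10, an observation of 9 against a prediction of 0.2 gives a distance of about 11.5, which exceeds a threshold such as C=3, whereas an observation of 5 against a prediction of 0.5 gives a distance of 0 and no change is inferred.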
If the probability of a change occurring is α, the best guess will be 0.5 if that change occurs, or the mean p if the change does not occur. Thus, the prediction made is actually yt+1 = (1/2)α + (1−α)p. Note that we do not allow perfect knowledge of the probability of a changepoint, α. Instead, an estimated value of α is used based on the number of changepoints detected in the data series up to time t. The FF model nests two simpler FF models that are psychologically interesting. If the twitchiness threshold parameter C becomes arbitrarily large, the model never detects a change and instead becomes a continuous running average model. Predictions from this model are simply a boxcar smooth of the data. Alternatively, if we assume no memory, the model must base each prediction on only the previous stimulus (i.e., M=1). Above, in Figure 3, we labeled the complete FF model as FF1, the boxcar model as FF2 and the memoryless model as FF3. Figure 3 showed that the complete FF model (FF1) fit the data from all observers significantly better than either the boxcar model (FF2) or the memoryless model (FF3). Exceptions were observers PH, DN and ML, for whom all three FF models fit equally well. This result suggests that our observers were (mostly) doing more than just keeping a running average of the data, or using only the most recent observation. The FF1 model fit the data about as well as the Bayesian models for all observers except MY and MS. Note that, in general, the FF1 and Bayesian model fits are very good: the average city block distance between the human data and the model prediction is around 0.75 (out of 10) buttons on both the x- and y-axes. 6 Conclusion We used an online prediction task to study changepoint detection. Human observers had to predict the next observation in stochastic sequences containing random changepoints.
We showed that some observers are too “twitchy”: They perform poorly on the prediction task because they see changes where only random fluctuation exists. Other observers are not twitchy enough, and they perform poorly because they fail to see small changes. We developed a Bayesian changepoint detection model that performed the task optimally, and also provided a good fit to human data when sub-optimal parameter settings were used. Finally, we developed a fast-and-frugal model that showed how participants may be able to perform well at the task using minimal information and simple decision heuristics. Acknowledgments We thank Eric-Jan Wagenmakers and Mike Yi for useful discussions related to this work. This work was supported in part by a grant from the US Air Force Office of Scientific Research (AFOSR grant number FA9550-04-1-0317). References [1] Gilovich, T., Vallone, R. and Tversky, A. (1985). The hot hand in basketball: on the misperception of random sequences. Cognitive Psychology 17, 295-314. [2] Albright, S.C. (1993a). A statistical analysis of hitting streaks in baseball. Journal of the American Statistical Association 88, 1175-1183. [3] Stephens, D.A. (1994). Bayesian retrospective multiple changepoint identification. Applied Statistics 43(1), 159-178. [4] Carlin, B.P., Gelfand, A.E., & Smith, A.F.M. (1992). Hierarchical Bayesian analysis of changepoint problems. Applied Statistics 41(2), 389-405. [5] Green, P.J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82(4), 711-732. [6] Gigerenzer, G., & Goldstein, D.G. (1996). Reasoning the fast and frugal way: Models of bounded rationality. Psychological Review 103, 650-669.

5 0.83756649 110 nips-2005-Learning Depth from Single Monocular Images

Author: Ashutosh Saxena, Sung H. Chung, Andrew Y. Ng

Abstract: We consider the task of depth estimation from a single monocular image. We take a supervised learning approach to this problem, in which we begin by collecting a training set of monocular images (of unstructured outdoor environments which include forests, trees, buildings, etc.) and their corresponding ground-truth depthmaps. Then, we apply supervised learning to predict the depthmap as a function of the image. Depth estimation is a challenging problem, since local features alone are insufficient to estimate depth at a point, and one needs to consider the global context of the image. Our model uses a discriminatively-trained Markov Random Field (MRF) that incorporates multiscale local- and global-image features, and models both depths at individual points as well as the relation between depths at different points. We show that, even on unstructured scenes, our algorithm is frequently able to recover fairly accurate depthmaps.

6 0.55102736 177 nips-2005-Size Regularized Cut for Data Clustering

7 0.52463347 53 nips-2005-Cyclic Equilibria in Markov Games

8 0.51242107 142 nips-2005-Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games

9 0.51118898 61 nips-2005-Dynamical Synapses Give Rise to a Power-Law Distribution of Neuronal Avalanches

10 0.50776112 46 nips-2005-Consensus Propagation

11 0.47997168 194 nips-2005-Top-Down Control of Visual Attention: A Rational Account

12 0.47913471 187 nips-2005-Temporal Abstraction in Temporal-difference Networks

13 0.47481361 153 nips-2005-Policy-Gradient Methods for Planning

14 0.46484506 34 nips-2005-Bayesian Surprise Attracts Human Attention

15 0.45975175 170 nips-2005-Scaling Laws in Natural Scenes and the Inference of 3D Shape

16 0.45716158 144 nips-2005-Off-policy Learning with Options and Recognizers

17 0.45688525 23 nips-2005-An Application of Markov Random Fields to Range Sensing

18 0.45131007 173 nips-2005-Sensory Adaptation within a Bayesian Framework for Perception

19 0.45028144 39 nips-2005-Beyond Pair-Based STDP: a Phenomenological Rule for Spike Triplet and Frequency Effects

20 0.44665784 99 nips-2005-Integrate-and-Fire models with adaptation are good enough