nips nips2008 nips2008-171 knowledge-graph by maker-knowledge-mining

171 nips-2008-Online Prediction on Large Diameter Graphs


Source: pdf

Author: Mark Herbster, Guy Lever, Massimiliano Pontil

Abstract: We continue our study of online prediction of the labelling of a graph. We show a fundamental limitation of Laplacian-based algorithms: if the graph has a large diameter then the number of mistakes made by such algorithms may be proportional to the square root of the number of vertices, even when tackling simple problems. We overcome this drawback by means of an efficient algorithm which achieves a logarithmic mistake bound. It is based on the notion of a spine, a path graph which provides a linear embedding of the original graph. In practice, graphs may exhibit cluster structure; thus in the last part, we present a modified algorithm which achieves the “best of both worlds”: it performs well locally in the presence of cluster structure, and globally on large diameter graphs. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We continue our study of online prediction of the labelling of a graph. [sent-7, score-0.656]

2 We show a fundamental limitation of Laplacian-based algorithms: if the graph has a large diameter then the number of mistakes made by such algorithms may be proportional to the square root of the number of vertices, even when tackling simple problems. [sent-8, score-0.482]

3 It is based on the notion of a spine, a path graph which provides a linear embedding of the original graph. [sent-10, score-0.384]

4 1 Introduction We study the problem of predicting the labelling of a graph in the online learning framework. [sent-12, score-0.918]

5 Consider the following game for predicting the labelling of a graph: Nature presents a graph; nature queries a vertex v_{i1}; the learner predicts ŷ_1 ∈ {−1, 1}, the label of the vertex; nature presents a label y_1; nature queries a vertex v_{i2}; the learner predicts ŷ_2; and so forth. [sent-13, score-1.45]

6 The learner’s goal is to minimise the total number of mistakes M = |{t : ŷ_t ≠ y_t}|. [sent-14, score-0.302]
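
The game above is just a mistake-counting loop. A minimal sketch in Python (the predict/reveal callbacks and the trial sequence are placeholders of our own, not the authors' interface):

def play(predict, reveal, trials):
    """Run the online labelling game and count mistakes M = |{t : y_hat_t != y_t}|.

    predict stands in for any learner (e.g. 1-NN on the spine, Section 5);
    trials is nature's sequence of (vertex, true_label) pairs."""
    mistakes = 0
    for vertex, label in trials:
        y_hat = predict(vertex)        # learner's prediction in {-1, +1}
        if y_hat != label:
            mistakes += 1
        reveal(vertex, label)          # the true label is revealed to the learner
    return mistakes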

7 In [9, 8, 7], the cut size (the number of edges between disagreeing labels) was used as a measure of the complexity of a graph’s labelling, and mistake bounds relative to this and the graph diameter were derived. [sent-17, score-0.672]

8 The apparent deficiency of these methods is that they have poor bounds when the graph diameter is large relative to the number of vertices. [sent-19, score-0.365]

9 In particular, we discuss an example of an n-vertex labelled graph with a single edge between disagreeing label sets. [sent-21, score-0.4]

10 We solve this problem by observing that there exists an approximate structure-preserving embedding of any graph into a path graph. [sent-24, score-0.407]

11 In particular the cut-size of any labelling is increased by no more than a factor of two. [sent-25, score-0.511]

12 A logarithmic mistake bound for learning on a path graph follows by the Halving algorithm analysis. [sent-30, score-0.58]

13 Secondly, we use the spine of the graph as a foundation to add a binary support tree to the original graph. [sent-31, score-0.587]

14 This enables us to prove a bound which is the “best of both worlds” – if the predicted set of vertices has cluster-structure we will obtain a bound appropriate for that case, but if, instead, the predicted set exhibits a large diameter we will obtain a polylogarithmic bound. [sent-32, score-0.494]

15 The seminal approach to semi-supervised learning over graphs in [3] is to predict with a labelling which is consistent with a minimum label-separating cut. [sent-34, score-0.65]

16 In [8, 7] the online graph labelling problem was studied. [sent-36, score-0.845]

17 An aim of those papers was to provide a natural interpretation of the bound on the cumulative mistakes of the kernel perceptron when the kernel is the pseudoinverse of the graph Laplacian – bounds in this case being relative to the cut and (resistance) diameter of the graph. [sent-37, score-0.676]

18 In this paper we necessarily build directly on the very recent results in [7] as those results depend on the resistance diameter of the predicted vertex set as opposed to the whole graph [8]. [sent-38, score-0.84]

19 The online graph labelling problem is also studied in [13], and here the graph structure is not given initially. [sent-39, score-1.066]

20 A slightly weaker logarithmic bound for the online graph labelling problem has also been independently derived via a connection to an online routing problem in the very recent [5]. [sent-40, score-1.082]

21 2 Preliminaries We study the process of predicting a labelling defined on the vertices of a graph. [sent-41, score-0.785]

22 Following the classical online learning framework, a sequence of labelled vertices {(v_{i1}, y_1), (v_{i2}, y_2), . . .} – the trial sequence – is presented to a learning algorithm. [sent-42, score-0.375]

23 On sight of each vertex v_{it}, the learner makes a prediction ŷ_t for the label value, after which the correct label is revealed. [sent-45, score-0.703]

24 We analyse the performance of a learning algorithm in the mistake bound framework [12] – the aim is to minimise the maximum possible cumulative number of mistakes made on the training sequence. [sent-47, score-0.342]

25 A graph G = (V, E) is a collection of vertices V = {v_1, . . . , v_n} together with a set of edges E. [sent-48, score-0.422]

26 Denote i ∼ j whenever vi and vj are connected so that E = {(i, j) : i ∼ j} is the set of unordered pairs of connected vertex indices. [sent-52, score-0.581]

27 The quadratic form associated with the Laplacian relates to the cut size of graph labellings. [sent-57, score-0.331]

28 Given a labelling u ∈ IR^n of G = (V, E) we define the cut size of u by Φ_G(u) = (1/4) uᵀG u = (1/4) Σ_{(i,j)∈E} A_ij (u_i − u_j)².  (1) [sent-59, score-0.711]

29 In particular, if u ∈ {−1, 1}^n we say that a cut occurs on edge (i, j) if u_i ≠ u_j, and Φ_G(u) measures the number of cuts. [sent-60, score-0.329]
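
As a concrete check of (1), a short numpy sketch (helper names are our own; G here is the unnormalized Laplacian D − A):

import numpy as np

def cut_size(A, u):
    """Cut size (1): Phi_G(u) = u^T G u / 4 with Laplacian G = D - A."""
    u = np.asarray(u, dtype=float)
    G = np.diag(A.sum(axis=1)) - A
    return float(u @ G @ u) / 4.0

def cut_size_edges(A, u):
    """Equivalent edge-sum form: (1/4) * sum over (i,j) in E of A_ij (u_i - u_j)^2;
    for u in {-1,+1}^n each cut edge contributes its weight A_ij."""
    n = len(u)
    return sum(A[i, j] * (u[i] - u[j]) ** 2
               for i in range(n) for j in range(i + 1, n)) / 4.0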

30 We evaluate the performance of prediction algorithms in terms of the cut size and the resistance diameter of the graph. [sent-61, score-0.441]

31 There is an established natural connection between graphs and resistive networks where each edge (i, j) ∈ E is viewed as a resistor with resistance 1/Aij [4]. [sent-62, score-0.311]

32 Thus the effective resistance rG (vi , vj ) between vertex vi and vj is the potential difference needed to induce a unit current flow between vi and vj . [sent-63, score-0.925]

33 The resistance diameter of a graph, R_G := max_{v_i, v_j ∈ V} r_G(v_i, v_j), is the maximum effective resistance between any pair of vertices on the graph. [sent-68, score-0.964]
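
Effective resistances can be computed from the pseudoinverse of the Laplacian via the standard identity r_G(v_i, v_j) = L⁺_ii + L⁺_jj − 2 L⁺_ij; a sketch with our own function names:

import numpy as np

def effective_resistances(A):
    """All pairwise effective resistances of a connected weighted graph."""
    L = np.diag(A.sum(axis=1)) - A        # graph Laplacian
    Lp = np.linalg.pinv(L)                # Moore-Penrose pseudoinverse
    d = np.diag(Lp)
    return d[:, None] + d[None, :] - 2 * Lp

def resistance_diameter(A):
    """R_G = max over pairs (i, j) of r_G(v_i, v_j)."""
    return effective_resistances(A).max()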

34 3 Limitations of online minimum semi-norm interpolation As we will show, it is possible to develop online algorithms for predicting the labelling of a graph which have a mistake bound that is a logarithmic function of the number of vertices. [sent-69, score-1.358]

35 Given a partially labelled graph G = (V, E) with |V| = n – that is, such that for some ℓ ≤ n, y ∈ {−1, 1}^ℓ is a labelling defined on the vertices V_ℓ = {v_{i1}, v_{i2}, . [sent-71, score-0.994]

36 The common justification behind the above learning paradigm [14, 2] is that minimizing the cut (1) encourages neighbouring vertices to be similarly labelled. [sent-82, score-0.334]

37 However, we now demonstrate that in the online setting such a regime will perform poorly on certain graph constructions – there exists a trial sequence on which the method will make at least θ(√n) mistakes. [sent-83, score-0.43]

38 An octopus graph of size d is defined to be d path graphs (the tentacles) of length d (that is, with d + 1 vertices each) all adjoined at a common end vertex, to which a further single head vertex is attached, so that n = |V| = d² + 2. [sent-85, score-0.803]

39 Let y = (y_1, . . . , y_{|V|}) be the labelling such that y_i = 1 if v_i is the head vertex and y_i = −1 otherwise. [sent-91, score-0.964]

40 There exists a trial sequence for which online minimum semi-norm interpolation makes θ(√|V|) mistakes. [sent-92, score-0.3]

41 Let the first query vertex be the head vertex, and let the end vertex of a tentacle be queried at each subsequent trial. [sent-94, score-0.762]

42 The solution to the minimum semi-norm interpolation with boundary values problem is precisely the harmonic solution ȳ [4] (that is, for every unlabeled vertex v_j, Σ_{i=1}^{n} A_ij (ȳ_i − ȳ_j) = 0). [sent-96, score-0.581]

43 Using this analogy, suppose that the end points of k < d tentacles are labelled and that the end vertex vq of an unlabelled tentacle is queried. [sent-98, score-0.516]

44 By Kirchhoff’s law, a current of λ flows along each labelled tentacle (in order to obey the harmonic principle at every vertex it is clear that no current flows along the unlabelled tentacles). [sent-100, score-0.499]

45 The above demonstrates a limitation in the method of online Laplacian minimum semi-norm interpolation for predicting a graph labelling – the mistake bound can be proportional to the square root of the number of data points. [sent-104, score-1.205]
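
The whole argument can be replayed numerically. The sketch below (our own construction and indexing, not the authors' code) builds an octopus graph, labels the head +1, and queries one tentacle end per trial: the harmonic (minimum semi-norm) solution stays positive at every queried end while fewer than d tentacles are labelled, so the learner errs on the order of √n times.

import numpy as np

def octopus(d):
    """Octopus graph: centre (index 0), head (index 1), and d tentacles of
    length d; tentacle t occupies indices 2 + t*d .. 1 + (t+1)*d."""
    n = d * d + 2
    A = np.zeros((n, n))
    A[0, 1] = A[1, 0] = 1.0                              # centre -- head
    for t in range(d):
        chain = [0] + [2 + t * d + s for s in range(d)]  # centre to tentacle end
        for a, b in zip(chain, chain[1:]):
            A[a, b] = A[b, a] = 1.0
    return A

def harmonic_value(A, labels, query):
    """Minimum semi-norm interpolation value at query: solve the harmonic
    condition L_UU y_U = -L_UL y_L for the unlabelled block."""
    n = len(A)
    L = np.diag(A.sum(axis=1)) - A
    lab = sorted(labels)
    unl = [i for i in range(n) if i not in labels]
    y_l = np.array([labels[i] for i in lab], dtype=float)
    y_u = np.linalg.solve(L[np.ix_(unl, unl)], -L[np.ix_(unl, lab)] @ y_l)
    return y_u[unl.index(query)]

d = 10
A = octopus(d)
labels = {1: +1.0}                       # head labelled +1 at the first trial
for t in range(d):
    end = 1 + (t + 1) * d                # end vertex of tentacle t
    print(np.sign(harmonic_value(A, labels, end)))  # +1 every trial: a mistake
    labels[end] = -1.0                   # the true label of the tentacle end is -1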

46 4 A linear graph embedding We demonstrate a method of embedding data represented as a connected graph G into a path graph, which we call a spine of G, that partially preserves the structure of G. [sent-106, score-0.943]

47 We would like to find a path graph with the same vertex set as G, which solves min_{P ∈ P_n} max_{u ∈ {−1,1}^n} Φ_P(u) / Φ_G(u). [sent-108, score-0.664]

48 If a Hamiltonian path H of G (a path on G which visits each vertex precisely once) exists, then the approximation ratio is Φ_H(u) / Φ_G(u) ≤ 1. [sent-109, score-0.566]

49 We now detail the construction of a spine of a graph G = (V, E), with |V| = n. [sent-112, score-0.479]

50 Starting from any node, G is traversed in the manner of a depth-first search (that is, each vertex is fully explored before backtracking to the last unexplored vertex).

51 An ordered list V_L = {v_{l1}, v_{l2}, . . . , v_{l_{2m+1}}} of the vertices (m ≤ |E|), in the order that they are visited, is formed, allowing repetitions when a vertex is visited more than once. [sent-116, score-0.573]

52 Let u be an arbitrary labelling of G and denote, as usual, Φ_G(u) = (1/4) Σ_{(i,j)∈E_G} (u_i − u_j)² and Φ_L(u) = (1/4) Σ_{(i,j)∈E_L} (u_i − u_j)². [sent-122, score-0.691]

53 We then take any subsequence V_L′ of V_L containing every vertex in V exactly once. [sent-124, score-0.32]

54 A spine S = (V, E_S) is a graph formed by connecting each vertex in V to its immediate neighbours in the subsequence V_L′ with an edge. [sent-125, score-0.825]

55 Since a cut occurs between connected vertices v_i and v_j in S only if a cut occurs on some edge in E_L located between the corresponding vertices in the list V_L, we have Φ_S(u) ≤ Φ_L(u) ≤ 2Φ_G(u).  (3) [sent-126, score-0.89]

56 Thus we have reduced the problem of learning the cut on a generic graph to that of learning the cut on a path graph. [sent-127, score-0.564]
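
A sketch of the construction (our own naming; the full visit list with repetitions is V_L, and its first-occurrence subsequence is one valid V_L′):

def spine_order(adj, root=None):
    """Vertices of a connected graph in depth-first first-visit order.

    adj maps each vertex to an iterable of neighbours. Joining consecutive
    vertices of the returned list with an edge gives a spine S, and by (3)
    the cut size of any labelling at most doubles."""
    root = next(iter(adj)) if root is None else root
    order, seen = [root], {root}
    stack = [(root, iter(adj[root]))]    # explicit stack avoids deep recursion
    while stack:
        v, nbrs = stack[-1]
        for w in nbrs:
            if w not in seen:            # first visit: append to V_L'
                seen.add(w)
                order.append(w)
                stack.append((w, iter(adj[w])))
                break
        else:
            stack.pop()                  # fully explored: backtrack

    return order

# Example: a star graph; its spine threads the leaves through the hub.
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(spine_order(adj))                  # [0, 1, 2, 3]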

57 Note that the 1-NN algorithm does not perform well on general graphs; on the octopus graph discussed above, for example, it can make at least θ(√n) mistakes, and even θ(n) mistakes on a related graph construction [8]. [sent-129, score-0.608]

58 5 Predicting with a spine We consider implementing the 1-NN algorithm on a path graph and demonstrate that it achieves a mistake bound which is logarithmic in the length of the line. [sent-130, score-0.862]

59 We shall prove the result by noting that the Halving algorithm [1] (under certain conditions on the probabilities assigned to each hypothesis) implements the nearest neighbour algorithm on a path graph. [sent-142, score-0.356]

60 Given an unlabelled example x_t ∈ X at trial t, the predicted label ŷ_t is that which agrees with the majority vote – that is, with the larger total probability mass of the hypotheses consistent with the trials so far (it predicts randomly if this is equal to 1/2) – and the Halving algorithm makes at most M_H mistakes. [sent-150, score-0.38]

61 The position of all cuts fixes the labelling up to flipping every label, and each of these two resulting possible arrangements is equally likely. [sent-154, score-0.588]

62 This recipe associates with each possible labelling u ∈ {−1, 1}^n a probability p(u) which is a function of the labelling’s cut size: p(u) = (1/2) α^{Φ_P(u)} (1 − α)^{n−1−Φ_P(u)}.  (6) [sent-155, score-0.621]

63 This induces a full joint probability distribution on the space of vertex labels. [sent-156, score-0.32]

64 In fact (6) is a Gibbs measure and as such defines a Markov random field over the space of vertex labels [10]. [sent-157, score-0.32]

65 The mass function p therefore satisfies the Markov property p(u_i = γ | u_j = γ_j ∀ j ≠ i) = p(u_i = γ | u_j = γ_j ∀ j ∈ N_i),  (7) where here N_i is the set of vertices neighbouring v_i – those connected to v_i by an edge. [sent-158, score-0.63]

66 Given a path graph P = (V, E), a set of vertices V′ ⊂ V and a vertex v_i ∈ V, we define the boundary vertices v_ℓ, v_r (either of which may be vacuous) to be the two vertices in V′ that are closest to v_i in each direction along the path; its nearest neighbours in each direction. [sent-161, score-1.652]

67 For example, p(uj = γ | uk = γ) is the probability of an even number of cuts between vertex vj and vertex vk . [sent-164, score-0.779]

68 For m = |j − k| we have p(u_j = γ | u_k = γ) = Σ_{s even} C(m, s) α^s (1 − α)^{m−s} = (1 + (1 − 2α)^m)/2.  (10) Likewise we have that p(u_j = γ | u_k ≠ γ) = Σ_{s odd} C(m, s) α^s (1 − α)^{m−s} = (1 − (1 − 2α)^m)/2. Note also that for any single vertex we have p(u_i = γ) = 1/2 for γ ∈ {−1, 1}. [sent-167, score-0.32]
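
Only fragments of equations (9)-(11) survive in this summary; the reconstruction above uses the fact that, under (6), each of the m = |j − k| edges between v_j and v_k is cut independently with probability α, so the parity probabilities take the usual binomial closed forms. A quick brute-force check of that identity:

from math import comb

def p_even_cuts(alpha, m):
    """P(even number of cuts on m edges), each cut independently w.p. alpha."""
    return sum(comb(m, s) * alpha**s * (1 - alpha)**(m - s)
               for s in range(0, m + 1, 2))

for alpha in (0.1, 0.3, 0.49):
    for m in (1, 2, 5, 10):
        closed_form = (1 + (1 - 2 * alpha) ** m) / 2
        assert abs(p_even_cuts(alpha, m) - closed_form) < 1e-12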

69 Given the task of predicting the labelling of an n-vertex path graph online, the Halving algorithm, with a probability distribution over the labellings defined as in (6) and such that 0 < α < 1/2, implements the nearest neighbour algorithm. [sent-169, score-1.19]

70 Suppose that t − 1 trials have been performed so that we have a partial labelling of a subset V′ ⊂ V, {(v_{i1}, y_1), (v_{i2}, y_2), . [sent-171, score-0.511]

71 Suppose the label of vertex v_{it} is queried, so that the Halving algorithm makes the following prediction ŷ_t for vertex v_{it}: ŷ_t = y if p(u_{it} = y | u_{ij} = y_j ∀ 1 ≤ j < t) > 1/2, ŷ_t = −y if p(u_{it} = y | u_{ij} = y_j ∀ 1 ≤ j < t) < 1/2 (and it predicts randomly if this probability is equal to 1/2). [sent-175, score-1.32]

72 We first consider the case where the conditional labelling includes vertices on both sides of vit . [sent-176, score-0.798]

73 To show equivalence with the nearest neighbour method whenever α < 1/2, we have from (9, 10, 11) that p(u_{it} = y | u_ℓ = y, u_r ≠ y) = (1 + (1 − 2α)^{|ℓ − i_t|})(1 − (1 − 2α)^{|r − i_t|}) / (2(1 − (1 − 2α)^{|ℓ − r|})), which is greater than 1/2 if |ℓ − i_t| < |r − i_t| and less than 1/2 if |ℓ − i_t| > |r − i_t|. [sent-179, score-0.322]
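
Assuming the reconstruction of (11) above, the equivalence is easy to verify numerically (positions ℓ < i < r are indices along the path; the function name is ours):

def p_agree_left(alpha, l, r, i):
    """p(u_i = y | u_l = y, u_r != y) on a path, as in (11): even cuts on
    l..i times odd cuts on i..r, normalised by odd cuts on l..r."""
    q = 1 - 2 * alpha
    return (1 + q ** (i - l)) * (1 - q ** (r - i)) / (2 * (1 - q ** (r - l)))

for alpha in (0.05, 0.2, 0.45):           # any 0 < alpha < 1/2
    l, r = 0, 10
    for i in range(l + 1, r):
        if i - l == r - i:
            continue                       # at the midpoint the probability is exactly 1/2
        p = p_agree_left(alpha, l, r, i)
        assert (p > 0.5) == (i - l < r - i)   # agree with the nearer neighbour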

74 A direct application of the Halving algorithm mistake bound (5) now gives M ≤ log₂(1/p(u)) = log₂(2 α^{−Φ_P(u)} (1 − α)^{−(n−1−Φ_P(u))}), where u is any labelling consistent with the trial sequence. [sent-183, score-0.808]
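
Expanding p(u) from (6) makes the dependence on the cut size explicit (routine algebra; the tuning of α below is our assumption, since the paper's choice is not recoverable from this summary):

M ≤ log₂(1/p(u)) = 1 + Φ_P(u) log₂(1/α) + (n − 1 − Φ_P(u)) log₂(1/(1 − α)),

so that, for α of order Φ_P(u)/(n − 1), the bound is O(Φ_P(u) log(n/Φ_P(u))) – logarithmic in the number of vertices when the cut size is small.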

75 The nearest neighbour algorithm can predict the labelling of any graph G = (V, E), by first transferring the data representation to that of a spine S of G, as presented in Section 4. [sent-186, score-1.223]

76 We observe that predicting with the spine is a minimax improvement over Laplacian minimal seminorm interpolation. [sent-193, score-0.331]

77 In fact this trivially generalizes to θ(√(Φ_G(u) n)) mistakes by creating a colony of Φ_G(u) octopi, then identifying each previously separate head vertex as a single central vertex. [sent-195, score-0.477]

78 We compute the spine in O(|E|) time by simply listing vertices in the order in which they are first visited during a depth-first search traversal of G. [sent-198, score-0.485]

79 Using online 1-NN requires O(|V| ln |V|) time to predict an arbitrary vertex sequence using a self-balancing binary search tree (e.g., a red-black tree).

80 The insertion of each vertex into the tree and the determination of the nearest left and right neighbour is O(ln |V|). [sent-201, score-0.6]
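
A sketch of this bookkeeping (Python's bisect over a sorted list plays the role of the balanced tree here; names are ours):

import bisect

class PathNN:
    """Online 1-NN on an n-vertex path: keep labelled positions sorted and
    predict with the label of the nearest labelled neighbour (left on ties,
    where the theory predicts randomly)."""

    def __init__(self):
        self.pos = []        # sorted labelled positions on the path
        self.lab = {}        # position -> label in {-1, +1}

    def predict(self, i):
        if not self.pos:
            return 1                               # arbitrary before any labels
        k = bisect.bisect_left(self.pos, i)
        left = self.pos[k - 1] if k > 0 else None
        right = self.pos[k] if k < len(self.pos) else None
        if left is None:
            return self.lab[right]
        if right is None:
            return self.lab[left]
        return self.lab[left] if i - left <= right - i else self.lab[right]

    def update(self, i, y):
        if i not in self.lab:
            bisect.insort(self.pos, i)
        self.lab[i] = y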

81 6 Prediction with a binary support tree The Pounce online label prediction algorithm [7] is designed to exploit cluster structure of a graph G = (V, E) and achieves the following mistake bound M ≤ N (X, ρ, rG ) + 4ΦG (u)ρ + 1, (13) for any ρ > 0. [sent-202, score-0.757]

82 Here, u ∈ IR^n is any labelling consistent with the trial sequence and X = {v_{i1}, v_{i2}, . . .} ⊆ V is the set of inputs. [sent-203, score-0.612]

83 N(X, ρ, r_G) is a covering number – the minimum number of balls of resistance diameter ρ (see Section 2) required to cover X. [sent-206, score-0.331]

84 The mistake bound (13) can be preferable to (12) whenever the inputs are sufficiently clustered and so have a cover by sets of small diameter. [sent-207, score-0.34]

85 Given a graph G = (V, E), with |V | = n, and spine S, we define a binary support tree of G to be any binary tree T = (VT , ET ) of least possible depth, D, whose leaves are the vertices of S, in order. [sent-216, score-0.864]

86 We show that there is a weighting of the support tree which ensures that the resistance diameter of the support tree is small, but also such that any labelling of the leaf vertices can be extended to the support tree such that its cut size remains small. [sent-218, score-1.47]

87 Suppose each edge (i, j) ∈ E_T has a weight Ā_ij, which is a function of the edge’s depth d = max{d_T(v_i, v_r), d_T(v_j, v_r)}: Ā_ij = W(d), where d_T(v, v′) is the number of edges in the shortest path from v to v′. [sent-225, score-0.416]

88 Consider the unique labelling ū such that, for 1 ≤ i ≤ n, we have ū_i = u_i, and such that for every other vertex v_p ∈ V_T, with child vertices v_{c1}, v_{c2}, we have ū_p = (ū_{c1} + ū_{c2})/2, or ū_p = ū_c in the case where v_p has only one child, v_c. [sent-226, score-1.284]

89 Suppose the edges (p, c1), (p, c2) ∈ E_T are at some depth d in T, and let V′ ⊂ V correspond to the leaf vertices of T descended from v_p. [sent-227, score-0.344]

90 Define Φ_S(u_{V′}) to be the cut of u restricted to vertices in V′. [sent-228, score-0.311]

91 Since the sets of leaf descendants of all vertices at depth d form a partition of V, summing (14) first over all parent nodes at a given depth and then over all integers d ∈ [1, D] gives 4Φ_T(ū) ≤ 2 Σ_{d=1}^{D} W(d) Φ_S(u). [sent-231, score-0.322]
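
The averaging extension of sentence 88 is mechanical; a sketch (our own simplification: a bottom-up reduction where a node with a single child copies it, standing in for the least-depth binary tree):

def extend_labelling(u):
    """Extend a leaf labelling u (values +/-1, in spine order) up a binary
    support tree: each parent takes the mean of its children, and a single
    child is copied. Returns the levels, levels[0] being the leaves."""
    levels = [[float(x) for x in u]]
    while len(levels[-1]) > 1:
        prev, nxt = levels[-1], []
        for k in range(0, len(prev) - 1, 2):
            nxt.append((prev[k] + prev[k + 1]) / 2)   # u_p = (u_c1 + u_c2) / 2
        if len(prev) % 2:
            nxt.append(prev[-1])                       # u_p = u_c (one child)
        levels.append(nxt)
    return levels

print(extend_labelling([-1, -1, +1, +1]))
# [[-1.0, -1.0, 1.0, 1.0], [-1.0, 1.0], [0.0]]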

92 The Pounce algorithm is then used to predict the (partial) labelling defined on Ḡ. [sent-236, score-0.54]

93 Here the covering number is that of the inputs X = {v_{i1}, v_{i2}, . . .} ⊆ V relative to the resistance distance r_G of G, and u ∈ IR^n is any labelling consistent with the trial sequence. [sent-241, score-0.767]

94 Let u be some labelling consistent with the trial sequence. [sent-244, score-0.612]

95 Moreover, by the arguments in Lemma 9 there exists some labelling ū of the weighted support tree T of G, consistent with u on V, such that Φ_T(ū) < Φ_S(u). [sent-246, score-0.67]

96 By Rayleigh’s monotonicity law the addition of the support tree does not increase the resistance between any vertices on G, hence N(X, ρ, r_Ḡ) ≤ N(X, ρ, r_G). [sent-248, score-0.494]

97 7 Conclusion We have explored a deficiency with existing online techniques for predicting the labelling of a graph. [sent-252, score-0.697]

98 As a solution, we have presented an approximate cut-preserving embedding of any graph G = (V, E) into a simple path graph, which we call a spine, such that an implementation of the 1-nearest-neighbours algorithm is an efficient realisation of a Bayes optimal classifier. [sent-253, score-0.384]

99 This therefore achieves a mistake bound which is logarithmic in the size of the vertex set for any graph, and the complexity of our algorithm is O(|E| + |V| ln |V|). [sent-254, score-0.646]

100 Low congestion online routing and an improved mistake bound for online prediction of graph labeling. [sent-285, score-0.701]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('labelling', 0.511), ('vertex', 0.32), ('spine', 0.258), ('graph', 0.221), ('vertices', 0.201), ('rg', 0.184), ('uit', 0.18), ('resistance', 0.155), ('diameter', 0.144), ('neighbour', 0.143), ('mistake', 0.138), ('path', 0.123), ('ur', 0.118), ('mistakes', 0.117), ('pounce', 0.115), ('halving', 0.115), ('online', 0.113), ('cut', 0.11), ('vi', 0.093), ('vl', 0.092), ('uj', 0.09), ('vj', 0.088), ('vit', 0.086), ('vr', 0.086), ('ui', 0.082), ('yt', 0.078), ('tree', 0.076), ('laplacian', 0.073), ('trial', 0.073), ('predicting', 0.073), ('vacuous', 0.072), ('aij', 0.071), ('ln', 0.066), ('irn', 0.066), ('labelled', 0.061), ('nearest', 0.061), ('interpolation', 0.059), ('bound', 0.058), ('unweighted', 0.053), ('cuts', 0.051), ('graphs', 0.05), ('octopus', 0.049), ('tentacle', 0.049), ('tentacles', 0.049), ('vt', 0.049), ('depth', 0.048), ('edge', 0.047), ('vp', 0.044), ('uij', 0.043), ('uv', 0.042), ('connected', 0.04), ('head', 0.04), ('embedding', 0.04), ('logarithmic', 0.04), ('ft', 0.039), ('herbster', 0.039), ('learner', 0.038), ('label', 0.038), ('eg', 0.037), ('unlabelled', 0.037), ('predicts', 0.037), ('el', 0.035), ('worlds', 0.035), ('incurred', 0.034), ('queried', 0.033), ('disagreeing', 0.033), ('multiset', 0.033), ('polylogarithmic', 0.033), ('resistive', 0.033), ('minimum', 0.032), ('prediction', 0.032), ('harmonic', 0.032), ('support', 0.032), ('ows', 0.031), ('law', 0.03), ('predict', 0.029), ('hamiltonian', 0.029), ('kirchoff', 0.029), ('labellings', 0.029), ('minimise', 0.029), ('traversed', 0.029), ('implements', 0.029), ('consistent', 0.028), ('boundary', 0.026), ('routing', 0.026), ('arrangements', 0.026), ('pseudoinverse', 0.026), ('resistor', 0.026), ('neighbours', 0.026), ('edges', 0.026), ('visited', 0.026), ('leaf', 0.025), ('cluster', 0.025), ('yj', 0.024), ('achieves', 0.024), ('rt', 0.024), ('exists', 0.023), ('gu', 0.023), ('neighbouring', 0.023), ('proves', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 171 nips-2008-Online Prediction on Large Diameter Graphs

Author: Mark Herbster, Guy Lever, Massimiliano Pontil

Abstract: We continue our study of online prediction of the labelling of a graph. We show a fundamental limitation of Laplacian-based algorithms: if the graph has a large diameter then the number of mistakes made by such algorithms may be proportional to the square root of the number of vertices, even when tackling simple problems. We overcome this drawback by means of an efficient algorithm which achieves a logarithmic mistake bound. It is based on the notion of a spine, a path graph which provides a linear embedding of the original graph. In practice, graphs may exhibit cluster structure; thus in the last part, we present a modified algorithm which achieves the “best of both worlds”: it performs well locally in the presence of cluster structure, and globally on large diameter graphs. 1

2 0.369068 84 nips-2008-Fast Prediction on a Tree

Author: Mark Herbster, Massimiliano Pontil, Sergio R. Galeano

Abstract: Given an n-vertex weighted tree with structural diameter S and a subset of m vertices, we present a technique to compute a corresponding m × m Gram matrix of the pseudoinverse of the graph Laplacian in O(n + m2 + mS) time. We discuss the application of this technique to fast label prediction on a generic graph. We approximate the graph with a spanning tree and then we predict with the kernel perceptron. We address the approximation of the graph with either a minimum spanning tree or a shortest path tree. The fast computation of the pseudoinverse enables us to address prediction problems on large graphs. We present experiments on two web-spam classification tasks, one of which includes a graph with 400,000 vertices and more than 10,000,000 edges. The results indicate that the accuracy of our technique is competitive with previous methods using the full graph information. 1

3 0.2242361 104 nips-2008-Improved Moves for Truncated Convex Models

Author: Philip Torr, M. P. Kumar

Abstract: We consider the problem of obtaining the approximate maximum a posteriori estimate of a discrete random field characterized by pairwise potentials that form a truncated convex model. For this problem, we propose an improved st-MINCUT based move making algorithm. Unlike previous move making approaches, which either provide a loose bound or no bound on the quality of the solution (in terms of the corresponding Gibbs energy), our algorithm achieves the same guarantees as the standard linear programming (LP) relaxation. Compared to previous approaches based on the LP relaxation, e.g. interior-point algorithms or treereweighted message passing (TRW), our method is faster as it uses only the efficient st-MINCUT algorithm in its design. Furthermore, it directly provides us with a primal solution (unlike TRW and other related methods which solve the dual of the LP). We demonstrate the effectiveness of the proposed approach on both synthetic and standard real data problems. Our analysis also opens up an interesting question regarding the relationship between move making algorithms (such as α-expansion and the algorithms presented in this paper) and the randomized rounding schemes used with convex relaxations. We believe that further explorations in this direction would help design efficient algorithms for more complex relaxations.

4 0.13728155 53 nips-2008-Counting Solution Clusters in Graph Coloring Problems Using Belief Propagation

Author: Lukas Kroc, Ashish Sabharwal, Bart Selman

Abstract: We show that an important and computationally challenging solution space feature of the graph coloring problem (COL), namely the number of clusters of solutions, can be accurately estimated by a technique very similar to one for counting the number of solutions. This cluster counting approach can be naturally written in terms of a new factor graph derived from the factor graph representing the COL instance. Using a variant of the Belief Propagation inference framework, we can efficiently approximate cluster counts in random COL problems over a large range of graph densities. We illustrate the algorithm on instances with up to 100, 000 vertices. Moreover, we supply a methodology for computing the number of clusters exactly using advanced techniques from the knowledge compilation literature. This methodology scales up to several hundred variables. 1

5 0.12606744 34 nips-2008-Bayesian Network Score Approximation using a Metagraph Kernel

Author: Benjamin Yackley, Eduardo Corona, Terran Lane

Abstract: Many interesting problems, including Bayesian network structure-search, can be cast in terms of finding the optimum value of a function over the space of graphs. However, this function is often expensive to compute exactly. We here present a method derived from the study of Reproducing Kernel Hilbert Spaces which takes advantage of the regular structure of the space of all graphs on a fixed number of nodes to obtain approximations to the desired function quickly and with reasonable accuracy. We then test this method on both a small testing set and a real-world Bayesian network; the results suggest that not only is this method reasonably accurate, but that the BDe score itself varies quadratically over the space of all graphs. 1

6 0.11540395 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE

7 0.11488116 107 nips-2008-Influence of graph construction on graph-based clustering measures

8 0.1125536 69 nips-2008-Efficient Exact Inference in Planar Ising Models

9 0.10543729 78 nips-2008-Exact Convex Confidence-Weighted Learning

10 0.10463566 194 nips-2008-Regularized Learning with Networks of Features

11 0.096816115 192 nips-2008-Reducing statistical dependencies in natural signals using radial Gaussianization

12 0.093158647 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions

13 0.078020878 225 nips-2008-Supervised Bipartite Graph Inference

14 0.07294438 88 nips-2008-From Online to Batch Learning with Cutoff-Averaging

15 0.07155773 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms

16 0.071507044 224 nips-2008-Structured ranking learning using cumulative distribution networks

17 0.070179172 150 nips-2008-Near-optimal Regret Bounds for Reinforcement Learning

18 0.063390568 168 nips-2008-Online Metric Learning and Fast Similarity Search

19 0.063328907 214 nips-2008-Sparse Online Learning via Truncated Gradient

20 0.061574455 212 nips-2008-Skill Characterization Based on Betweenness


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.185), (1, 0.001), (2, -0.097), (3, 0.052), (4, 0.016), (5, -0.1), (6, 0.101), (7, -0.041), (8, 0.077), (9, -0.371), (10, -0.164), (11, 0.104), (12, -0.111), (13, -0.113), (14, -0.012), (15, -0.212), (16, -0.009), (17, 0.114), (18, -0.173), (19, -0.007), (20, 0.023), (21, -0.158), (22, 0.133), (23, -0.038), (24, -0.169), (25, 0.001), (26, 0.022), (27, -0.022), (28, 0.007), (29, -0.139), (30, 0.111), (31, -0.006), (32, -0.038), (33, -0.008), (34, -0.053), (35, 0.006), (36, -0.095), (37, -0.042), (38, -0.003), (39, 0.082), (40, 0.08), (41, -0.062), (42, 0.057), (43, 0.104), (44, -0.106), (45, 0.036), (46, -0.008), (47, -0.109), (48, 0.008), (49, 0.006)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9707095 171 nips-2008-Online Prediction on Large Diameter Graphs

Author: Mark Herbster, Guy Lever, Massimiliano Pontil

Abstract: We continue our study of online prediction of the labelling of a graph. We show a fundamental limitation of Laplacian-based algorithms: if the graph has a large diameter then the number of mistakes made by such algorithms may be proportional to the square root of the number of vertices, even when tackling simple problems. We overcome this drawback by means of an efficient algorithm which achieves a logarithmic mistake bound. It is based on the notion of a spine, a path graph which provides a linear embedding of the original graph. In practice, graphs may exhibit cluster structure; thus in the last part, we present a modified algorithm which achieves the “best of both worlds”: it performs well locally in the presence of cluster structure, and globally on large diameter graphs. 1

2 0.88776672 84 nips-2008-Fast Prediction on a Tree

Author: Mark Herbster, Massimiliano Pontil, Sergio R. Galeano

Abstract: Given an n-vertex weighted tree with structural diameter S and a subset of m vertices, we present a technique to compute a corresponding m × m Gram matrix of the pseudoinverse of the graph Laplacian in O(n + m2 + mS) time. We discuss the application of this technique to fast label prediction on a generic graph. We approximate the graph with a spanning tree and then we predict with the kernel perceptron. We address the approximation of the graph with either a minimum spanning tree or a shortest path tree. The fast computation of the pseudoinverse enables us to address prediction problems on large graphs. We present experiments on two web-spam classification tasks, one of which includes a graph with 400,000 vertices and more than 10,000,000 edges. The results indicate that the accuracy of our technique is competitive with previous methods using the full graph information. 1

3 0.59862834 69 nips-2008-Efficient Exact Inference in Planar Ising Models

Author: Nicol N. Schraudolph, Dmitry Kamenetsky

Abstract: We give polynomial-time algorithms for the exact computation of lowest-energy states, worst margin violators, partition functions, and marginals in certain binary undirected graphical models. Our approach provides an interesting alternative to the well-known graph cut paradigm in that it does not impose any submodularity constraints; instead we require planarity to establish a correspondence with perfect matchings in an expanded dual graph. Maximum-margin parameter estimation for a boundary detection task shows our approach to be efficient and effective. A C++ implementation is available from http://nic.schraudolph.org/isinf/. 1

4 0.57505667 115 nips-2008-Learning Bounded Treewidth Bayesian Networks

Author: Gal Elidan, Stephen Gould

Abstract: With the increased availability of data for complex domains, it is desirable to learn Bayesian network structures that are sufficiently expressive for generalization while also allowing for tractable inference. While the method of thin junction trees can, in principle, be used for this purpose, its fully greedy nature makes it prone to overfitting, particularly when data is scarce. In this work we present a novel method for learning Bayesian networks of bounded treewidth that employs global structure modifications and that is polynomial in the size of the graph and the treewidth bound. At the heart of our method is a triangulated graph that we dynamically update in a way that facilitates the addition of chain structures that increase the bound on the model’s treewidth by at most one. We demonstrate the effectiveness of our “treewidth-friendly” method on several real-life datasets. Importantly, we also show that by using global operators, we are able to achieve better generalization even when learning Bayesian networks of unbounded treewidth. 1

5 0.57311821 104 nips-2008-Improved Moves for Truncated Convex Models

Author: Philip Torr, M. P. Kumar

Abstract: We consider the problem of obtaining the approximate maximum a posteriori estimate of a discrete random field characterized by pairwise potentials that form a truncated convex model. For this problem, we propose an improved st-MINCUT based move making algorithm. Unlike previous move making approaches, which either provide a loose bound or no bound on the quality of the solution (in terms of the corresponding Gibbs energy), our algorithm achieves the same guarantees as the standard linear programming (LP) relaxation. Compared to previous approaches based on the LP relaxation, e.g. interior-point algorithms or treereweighted message passing (TRW), our method is faster as it uses only the efficient st-MINCUT algorithm in its design. Furthermore, it directly provides us with a primal solution (unlike TRW and other related methods which solve the dual of the LP). We demonstrate the effectiveness of the proposed approach on both synthetic and standard real data problems. Our analysis also opens up an interesting question regarding the relationship between move making algorithms (such as α-expansion and the algorithms presented in this paper) and the randomized rounding schemes used with convex relaxations. We believe that further explorations in this direction would help design efficient algorithms for more complex relaxations.

6 0.56576902 107 nips-2008-Influence of graph construction on graph-based clustering measures

7 0.52322346 34 nips-2008-Bayesian Network Score Approximation using a Metagraph Kernel

8 0.51248628 53 nips-2008-Counting Solution Clusters in Graph Coloring Problems Using Belief Propagation

9 0.44809377 212 nips-2008-Skill Characterization Based on Betweenness

10 0.41136417 225 nips-2008-Supervised Bipartite Graph Inference

11 0.38540483 194 nips-2008-Regularized Learning with Networks of Features

12 0.37211779 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE

13 0.30771387 22 nips-2008-An Online Algorithm for Maximizing Submodular Functions

14 0.29695806 224 nips-2008-Structured ranking learning using cumulative distribution networks

15 0.26824257 170 nips-2008-Online Optimization in X-Armed Bandits

16 0.26589498 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions

17 0.26441541 78 nips-2008-Exact Convex Confidence-Weighted Learning

18 0.26262712 88 nips-2008-From Online to Batch Learning with Cutoff-Averaging

19 0.24351624 55 nips-2008-Cyclizing Clusters via Zeta Function of a Graph

20 0.23758268 15 nips-2008-Adaptive Martingale Boosting


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(6, 0.076), (7, 0.034), (12, 0.032), (28, 0.157), (57, 0.043), (59, 0.394), (63, 0.041), (64, 0.014), (71, 0.014), (75, 0.017), (77, 0.032), (78, 0.015), (83, 0.044)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.91917169 23 nips-2008-An ideal observer model of infant object perception

Author: Charles Kemp, Fei Xu

Abstract: Before the age of 4 months, infants make inductive inferences about the motions of physical objects. Developmental psychologists have provided verbal accounts of the knowledge that supports these inferences, but often these accounts focus on categorical rather than probabilistic principles. We propose that infant object perception is guided in part by probabilistic principles like persistence: things tend to remain the same, and when they change they do so gradually. To illustrate this idea we develop an ideal observer model that incorporates probabilistic principles of rigidity and inertia. Like previous researchers, we suggest that rigid motions are expected from an early age, but we challenge the previous claim that the inertia principle is relatively slow to develop [1]. We support these arguments by modeling several experiments from the developmental literature. Over the past few decades, ingenious experiments [1, 2] have suggested that infants rely on systematic expectations about physical objects when interpreting visual scenes. Looking time studies suggest, for example, that infants expect objects to follow continuous trajectories through time and space, and understand that two objects cannot simultaneously occupy the same location. Many of these studies have been replicated several times, but there is still no consensus about the best way to characterize the knowledge that gives rise to these findings. Two main approaches can be found in the literature. The verbal approach uses natural language to characterize principles of object perception [1, 3]: for example, Spelke [4] proposes that object perception is consistent with principles including continuity (“a moving object traces exactly one connected path over space and time”) and cohesion (“a moving object maintains its connectedness and boundaries”). The mechanistic approach proposes that physical knowledge is better characterized by describing the mechanisms that give rise to behavior, and researchers working in this tradition often develop computational models that support their theoretical proposals [5]. We pursue a third approach—the ideal observer approach [6, 7, 8]—that combines aspects of both previous traditions. Like the verbal approach, our primary goal is to characterize principles that account for infant behavior, and we will not attempt to characterize the mechanisms that produce this behavior. Like the mechanistic approach, we emphasize the importance of formal models, and suggest that these models can capture forms of knowledge that are difficult for verbal accounts to handle. Ideal observer models [6, 9] specify the conclusions that normatively follow given a certain source of information and a body of background knowledge. These models can therefore address questions about the information and the knowledge that support perception. Approaches to the information question characterize the kinds of perceptual information that human observers use. For example, Geisler [9] discusses which components of the information available at the retina contribute to visual perception, and Banks and Shannon [10] use ideal observer models to study the perceptual consequences of immaturities in the retina. Approaches to the knowledge question characterize the background assumptions that are combined with the available input in order to make inductive inferences. For example, Weiss and Adelson [7] describe several empirical phenomena that are consistent with the a priori assumption that motions tend to be slow and smooth. 
There are few previous attempts to develop ideal observer models of infant perception, and most of them focus only on the information question [10]. This paper addresses the knowledge question, and proposes that the ideal observer approach can help to identify the minimal set of principles needed to account for the visual competence of young infants. Most verbal theories of object perception focus on categorical principles [4], or principles that make a single distinction between possible and impossible scenes. We propose that physical knowledge in infancy is also characterized by probabilistic principles, or expectations that make some possible scenes more surprising than others. We demonstrate the importance of probabilistic principles by focusing on two examples: the rigidity principle states that objects usually maintain their shape and size when they move, and the inertia principle states that objects tend to maintain the same pattern of motion over time. Both principles capture important regularities, but exceptions to these regularities are relatively common. Focusing on rigidity and inertia allows us to demonstrate two contributions that probabilistic approaches can make. First, probabilistic approaches can reinforce current proposals about infant perception. Spelke [3] suggests that rigidity is a core principle that guides object perception from a very early age, and we demonstrate how this idea can be captured by a model that also tolerates exceptions, such as non-rigid biological motion. Second, probabilistic approaches can identify places where existing proposals may need to be revised. Spelke [3] argues that the principle of inertia is slow to develop, but we suggest that a probabilistic version of this principle can help to account for inferences made early in development. 1 An ideal observer approach An ideal observer approach to object perception can be formulated in terms of a generative model for scenes. Scenes can be generated in three steps. First we choose the number n of objects that will appear in the scene, and generate the shape, visual appearance, and initial location of each object. We then choose a velocity field for each object which specifies how the object moves and changes shape over time. Finally, we create a visual scene by taking a two-dimensional projection of the moving objects generated in the two previous steps. An ideal observer approach explores the idea that the inferences made by infants approximate the optimal inferences with respect to this generative model. We work within this general framework but make two simplifications. We will not discuss how the shapes and visual appearances of objects are generated, and we make the projection step simple by working with a two-dimensional world. These simplifications allow us to focus on the expectations about velocity fields that guide motion perception in infants. The next two sections present two prior distributions that can be used to generate velocity fields. The first is a baseline prior that does not incorporate probabilistic principles, and the second incorporates probabilistic versions of rigidity and inertia. The two priors capture different kinds of knowledge, and we argue that the second provides the more accurate characterization of the knowledge that infants bring to object perception. 1.1 A baseline prior on velocity fields Our baseline prior is founded on five categorical principles that are closely related to principles discussed by Spelke [3, 4]. 
The principles we consider rely on three basic notions: space, time, and matter. We also refer to particles, which are small pieces of matter that occupy space-time points. Particles satisfy several principles: C1. Temporal continuity. Particles are not created or destroyed. In other words, every particle that exists at time t1 must also exist at time t2 . C2. Spatial continuity. Each particle traces a continuous trajectory through space. C3. Exclusion. No two particles may occupy the same space-time point. An object is a collection of particles, and these collections satisfy two principles: C4. Discreteness. Each particle belongs to exactly one object. C5. Cohesion. At each point in time, the particles belonging to an object occupy a single connected region of space. Suppose that we are interested in a space-time window specified by a bounded region of space and a bounded interval of time. For simplicity, we will assume that space is two-dimensional, and that the space-time window corresponds to the unit cube. Suppose that a velocity field v assigns a velocity (vx , vy ) to each particle in the space-time window, and let vi be the field created by considering only particles that belong to object i. We develop a theory of object perception by defining a prior distribution p(v) on velocity fields. Consider first the distribution p(v1 ) on fields for a single object. Any field that violates one or more of principles C1–C5 is assigned zero probability. For instance, fields where part of an object winks out of existence violate the principle of temporal continuity, and fields where an object splits into two distinct pieces violate the principle of cohesion. Many fields, however, remain, including fields that specify non-rigid motions and jagged trajectories. For now, assume that we are working with a space of fields that is bounded but very large, and that the prior distribution over this space is uniform for all fields consistent with principles C1–C5: p(v1 ) ∝ f (v1 ) = 0 1 if v1 violates C1–C5 otherwise. (1) Consider now the distribution p(v1 , v2 ) on fields for pairs of objects. Principles C1 through C5 rule out some of these fields, but again we must specify a prior distribution on those that remain. Our prior is induced by the following principle: C6. Independence. Velocity fields for multiple objects are independently generated subject to principles C1 through C5. More formally, the independence principle specifies how the prior for the multiple object case is related to the prior p(v1 ) on velocity fields for a single object (Equation 1): p(v1 , . . . , vn ) ∝ f (v1 , . . . , vn ) = 0 if {vi } collectively violate C1–C5 f (v1 ) . . . f (vn ) otherwise. (2) 1.2 A smoothness prior on velocity fields We now develop a prior p(v) that incorporates probabilistic expectations about the motion of physical objects. Consider again the prior p(v1 ) on the velocity field v1 of a single object. Principles C1–C5 make a single cut that distinguishes possible from impossible fields, but we need to consider whether infants have additional knowledge that makes some of the possible fields less surprising than others. One informal idea that seems relevant is the notion of persistence[11]: things tend to remain the same, and when they change they do so gradually. We focus on two versions of this idea that may guide expectations about velocity fields: S1. Spatial smoothness. Velocity fields tend to be smooth in space. S2. Temporal smoothness. Velocity fields tend to be smooth in time. 
A field is “smooth in space” if neighboring particles tend to have similar velocities at any instant of time. The smoothest possible field will be one where all particles have the same velocity at any instant—in other words, where an object moves rigidly. The principle of spatial smoothness therefore captures the idea that objects tend to maintain the same shape and size. A field is “smooth in time” if any particle tends to have similar velocities at nearby instants of time. The smoothest possible field will be one where each particle maintains the same velocity throughout the entire interval of interest. The principle of temporal smoothness therefore captures the idea that objects tend to maintain their initial pattern of motion. For instance, stationary objects tend to remain stationary, moving objects tend to keep moving, and a moving object following a given trajectory tends to continue along that trajectory. Principles S1 and S2 are related to two principles— rigidity and inertia—that have been discussed in the developmental literature. The rigidity principle states that objects “tend to maintain their size and shape over motion”[3], and the inertia principle states that objects move smoothly in the absence of obstacles [4]. Some authors treat these principles rather differently: for instance, Spelke suggests that rigidity is one of the core principles that guides object perception from a very early age [3], but that the principle of inertia is slow to develop and is weak or fragile once acquired. Since principles S1 and S2 seem closely related, the suggestion that one develops much later than the other seems counterintuitive. The rest of this paper explores the idea that both of these principles are needed to characterize infant perception. Our arguments will be supported by formal analyses, and we therefore need formal versions of S1 and S2. There may be different ways to formalize these principles, but we present a simple L1 L2 U b) 200 L1 L2 U 0 log “ p(H1 |v) p(H2 |v) ” a) −200 baseline smoothness Figure 1: (a) Three scenes inspired by the experiments of Spelke and colleagues [12, 13]. Each scene can be interpreted as a single object, or as a small object on top of a larger object. (b) Relative preferences for the one-object and two-object interpretations according to two models. The baseline model prefers the one-object interpretation in all three cases, but the smoothness model prefers the one-object interpretation only for scenes L1 and L2. approach that builds on existing models of motion perception in adults [7, 8]. We define measures of instantaneous roughness that capture how rapidly a velocity field v varies in space and time: Rspace (v, t) = ∂v(x, y, t) ∂x 2 ∂v(x, y, t) ∂t 1 vol(O(t)) 2 + ∂v(x, y, t) ∂y 2 dxdy (3) O(t) Rtime (v, t) = 1 vol(O(t)) dxdy (4) O(t) where O(t) is the set of all points that are occupied by the object at time t, and vol(O(t)) is the volume of the object at time t. Rspace (v, t) will be large if neighboring particles at time t tend to have different velocities, and Rtime (v, t) will be large if many particles are accelerating at time t. We combine our two roughness measures to create a single smoothness function S(·) that measures the smoothness of a velocity field: S(v) = −λspace Rspace (v, t)dt − λtime Rtime (v, t)dt (5) where λspace and λtime are positive weights that capture the importance of spatial smoothness and temporal smoothness. 
For all analyses in this paper we set λspace = 10000 and λtime = 250, which implies that violations of spatial smoothness are penalized more harshly than violations of temporal smoothness. We now replace Equation 1 with a prior on velocity fields that takes smoothness into account: 0 if v1 violates C1–C5 p(v1 ) ∝ f (v1 ) = (6) exp (S(v1 )) otherwise. Combining Equation 6 with Equation 2 specifies a model of object perception that incorporates probabilistic principles of rigidity and inertia. 2 Empirical findings: spatial smoothness There are many experiments where infants aged 4 months and younger appear to make inferences that are consistent with the principle of rigidity. This section suggests that the principle of spatial smoothness can account for these results. We therefore propose that a probabilistic principle (spatial smoothness) can explain all of the findings previously presented in support of a categorical principle (rigidity), and can help in addition to explain how infants perceive non-rigid motion. One set of studies explores inferences about the number of objects in a scene. When a smaller block is resting on top of a larger block (L1 in Figure 1a), 3-month-olds infer that the scene includes a single object [12]. The same result holds when the small and large blocks are both moving in the same direction (L2 in Figure 1a) [13]. When these blocks are moving in opposite directions (U in Figure 1a), however, infants appear to infer that the scene contains two objects [13]. Results like these suggest that infants may have a default expectation that objects tend to move rigidly. We compared the predictions made by two models about the scenes in Figure 1a. The smoothness model uses a prior p(v1 ) that incorporates principles S1 and S2 (Equation 6), and the baseline model is identical except that it sets λspace = λtime = 0. Both models therefore incorporate principles C1– C6, but only the smoothness model captures the principle of spatial smoothness. Given any of the scenes in Figure 1a, an infant must solve two problems: she must compute the velocity field v for the scene and must decide whether this field specifies the motion of one or two objects. Here we focus on the second problem, and assume that the infant’s perceptual system has already computed a veridical velocity field for each scene that we consider. In principle, however, the smoothness prior in Equation 6 can address both problems. Previous authors have shown how smoothness priors can be used to compute velocity fields given raw image data [7, 8]. Let H1 be the hypothesis that a given velocity field corresponds to a single object, and let H2 be the hypothesis that the field specifies the motions of two objects. We assume that the prior probabilities of these hypotheses are equal, and that P (H1 ) = P (H2 ) = 0.5. An ideal observer can use the posterior odds ratio to choose between these hypotheses: P (H1 |v) P (v|H1 ) P (H1 ) = ≈ P (H2 |v) P (v|H2 ) P (H2 ) f (v) f (v1 )dv1 f (v1 , v2 )dv1 dv2 f (vA , vB ) (7) Equation 7 follows from Equations 2 and 6, and from approximating P (v|H2 ) by considering only the two object interpretation (vA , vB ) with maximum posterior probability. For each scene in Figure 1a, the best two object interpretation will specify a field vA for the small upper block, and a field vB for the large lower block. To approximate the posterior odds ratio in Equation 7 we compute rough approximations of f (v1 )dv1 and f (v1 , v2 )dv1 dv2 by summing over a finite space of velocity fields. 
As described in the supporting material, we consider all fields that can be built from objects with 5 possible shapes, 900 possible starting locations, and 10 possible trajectories. For computational tractability, we convert each continuous velocity field to a discrete field defined over a space-time grid with 45 cells along each spatial dimension and 21 cells along the temporal dimension. Our results show that both models prefer the one-object hypothesis H1 when presented with scenes L1 and L2 (Figure 1b). Since there are many more two-object scenes than one-object scenes, any typical two-object interpretation is assigned lower prior probability than a typical one-object interpretation. This preference for simpler interpretations is a consequence of the Bayesian Occam’s razor. The baseline model makes the same kind of inference about scene U, and again prefers the one-object interpretation. Like infants, however, the smoothness model prefers the two-object interpretation of scene U. This model assigns low probability to a one-object interpretation where adjacent points on the object have very different velocities, and this preference for smooth motion is strong enough to overcome the simplicity preference that makes the difference when interpreting the other two scenes. Other experiments from the developmental literature have produced results consistent with the principle of spatial smoothness. For example, 3.5-month olds are surprised when a tall object is fully hidden behind a short screen, 4 month olds are surprised when a large object appears to pass through a small slot, and 4.5-month olds expect a swinging screen to be interrupted when an object is placed in its path [1, 2]. All three inferences appear to rely on the expectation that objects tend not to shrink or to compress like foam rubber. Many of these experiments are consistent with an account that simply rules out non-rigid motion instead of introducing a graded preference for spatial smoothness. Biological motions, however, are typically non-rigid, and experiments suggest that infants can track and make inferences about objects that follow non-rigid trajectories [14]. Findings like these call for a theory like ours that incorporates a preference for rigid motion, but recognizes that non-rigid motions are possible. 3 Empirical findings: temporal smoothness We now turn to the principle of temporal smoothness (S2) and discuss some of the experimental evidence that bears on this principle. Some researchers suggest that a closely related principle (inertia) is slow to develop, but we argue that expectations about temporal smoothness are needed to capture inferences made before the age of 4 months. Baillargeon and DeVos [15] describe one relevant experiment that explores inferences about moving objects and obstacles. During habituation, 3.5-month-old infants saw a car pass behind an occluder and emerge from the other side (habituation stimulus H in Figure 2a). An obstacle was then placed in the direct path of the car (unlikely scenes U1 and U2) or beside this direct path (likely scene L), and the infants again saw the car pass behind the occluder and emerge from the other side. Looking L U1 U2 p(L) p(U 1) ” log “ p(L) p(U 2) ” 400 H “ 600 a) log log “ pH (L) pH (U 1) ” log “ pH (L) pH (U 2) ” b) 200 X X X baseline 0 smoothness Figure 2: (a) Stimuli inspired by the experiments of [15]. The habituation stimulus H shows a block passing behind a barrier and emerging on the other side. 
After habituation, a new block is added either out of the direct path of the first block (L) or directly in the path of the first block (U1 and U2). In U1, the first block leaps over the second block, and in U2 the second block hops so that the first block can pass underneath. (b) Relative probabilities of scenes L, U1 and U2 according to two models, plotted as log(p(L)/p(U1)) and log(p(L)/p(U2)) before habituation and log(pH(L)/pH(U1)) and log(pH(L)/pH(U2)) after habituation. The baseline model finds all three scenes equally likely a priori, and considers L and U2 equally likely after habituation. The smoothness model considers L more likely than the other scenes both before and after habituation.

Figure 3: (a) Stimuli inspired by the experiments of Spelke et al. [16]. (b) Model predictions, plotted as log(p(L)/p(U)) a priori, and as log(pH1(L)/pH1(U)) and log(pH2(L)/pH2(U)) after habituation. After habituation to H1, the smoothness model assigns roughly equal probabilities to L and U. After habituation to H2, the model considers L more likely. (c) A stronger test of the inertia principle. Now the best interpretation of stimulus U involves multiple changes of direction.

Looking time measurements suggested that the infants were more surprised to see the car emerge when the obstacle lay within the direct path of the car. This result is consistent with the principle of temporal smoothness, which suggests that infants expected the car to maintain a straight-line trajectory and expected the obstacle to remain stationary.

We compared the smoothness model and the baseline model on a schematic version of this task. To model this experiment, we again assume that the infant's perceptual system has recovered a veridical velocity field, but now we must allow for occlusion. An ideal observer approach that treats a two-dimensional scene as a projection of a three-dimensional world can represent the occluder as an object in its own right. Here, however, we continue to work with a two-dimensional world, and treat the occluded parts of the scene as missing data. An ideal observer should integrate over all possible values of the missing data, but for computational simplicity we approximate this approach by considering only one or two high-probability interpretations of each occluded scene. We also need to account for habituation, including cases where the habituation stimulus itself involves occlusion. We assume that an ideal observer computes a habituation field vH: the velocity field with maximum posterior probability given the habituation stimulus. In Figure 2a, the inferred habituation field vH specifies a trajectory where the block moves smoothly from the left to the right of the scene. We now assume that the observer expects subsequent velocity fields to be similar to vH. Formally, we use a product-of-experts approach to define a post-habituation distribution on velocity fields:

$$p_H(v) \propto p(v)\, p(v \mid v_H) \qquad (8)$$

The first expert p(v) uses the prior distribution in Equation 6, and the second expert p(v|vH) assumes that field v is drawn from a Gaussian distribution centered on vH. Intuitively, after habituation to vH the second expert expects that subsequent velocity fields will be similar to vH. More information about this model of habituation is provided in the supporting material. Given these assumptions, the black and dark gray bars in Figure 2 indicate relative a priori probabilities for scenes L, U1 and U2. The baseline model considers all three scenes equally probable, but the smoothness model prefers L.
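A minimal log-space sketch of the product of experts in Equation 8, reusing the S function from the earlier sketch; the Gaussian expert's bandwidth σH is our own placeholder, since the paper defers the details to its supporting material.

    import numpy as np

    SIGMA_H = 1.0  # assumed bandwidth of the Gaussian habituation expert

    def log_pH_unnormalized(v, vH, S):
        # Equation 8 in log space, up to an additive constant:
        # log p(v) + log p(v | vH), with S the smoothness score
        # (the log of Equation 6's f) from the sketch above.
        log_prior = S(v)
        log_expert = -np.sum((v - vH) ** 2) / (2.0 * SIGMA_H ** 2)
        return log_prior + log_expert

Renormalizing these scores over the same finite candidate space yields the post-habituation preferences summarized in Figures 2 and 3.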
After habituation, the baseline model is still unable to account for the behavioral data, since it considers scenes L and U2 to be equally probable. The smoothness model, however, continues to prefer L. We previously mentioned three consequences of the principle of temporal smoothness: stationary objects tend to remain stationary, moving objects tend to keep moving, and moving objects tend to maintain a steady trajectory. The "car and obstacle" task addresses the first and third of these proposals, but other tasks provide support for the second. Many authors have studied settings where one moving object comes to a stop and a second object starts to move [17]. Compared to the case where the first object collides with the second, infants appear to be surprised by the "no-contact" case where the two objects never touch. This finding is consistent with the temporal smoothness principle, which predicts that infants expect the first object to continue moving until forced to stop, and expect the second object to remain stationary until forced to start.

Other experiments [18] provide support for the principle of temporal smoothness, but there are also studies that appear inconsistent with this principle. In one of these studies [16], infants are initially habituated to a block that moves from one corner of an enclosure to another (H1 in Figure 3a). After habituation, infants see a block that begins from a different corner, and the occluder is then removed to reveal the block either in a location consistent with a straight-line trajectory (L) or in a location that matches the final resting place during the habituation phase (U). Looking times suggest that infants aged 4–12 months are no more surprised by the inertia-violating outcome (U) than by the inertia-consistent outcome (L). The smoothness model, however, can account for this finding. The outcome in U is contrary to temporal smoothness but consistent with habituation, and the tradeoff between these factors leads the model to assign roughly the same probability to scenes L and U (Figure 3b). Only one of the inertia experiments described by Spelke et al. [16] and Spelke et al. [1] avoids this tradeoff between habituation and smoothness. This experiment considers a case where the habituation stimulus (H2 in Figure 3a) is equally similar to the two test stimuli. The results suggest that 8-month-olds are now surprised by the inertia-violating outcome, and the predictions of our model are consistent with this finding (Figure 3b). 4- and 6-month-olds, however, continue to look equally at the two outcomes.

Note, however, that the trajectories in Figure 3 include at most one inflection point. Experiments that consider trajectories with many inflection points can provide a more powerful way of exploring whether 4-month-olds have expectations about temporal smoothness. One possible experiment is sketched in Figure 3c. The task is very similar to the task in Figure 3a, except that a barrier is added after habituation. In order for the block to end up in the same location as before, it must now follow a tortuous path around the barrier (U). Based on the principle of temporal smoothness, we predict that 4-month-olds will be more surprised to see the outcome in stimulus U than the outcome in stimulus L. This experimental design is appealing in part because previous work shows that infants are surprised by a case similar to U where the barrier extends all the way from one wall to the other [16], and our proposed experiment is a minor variant of this task.
Although there is room for debate about the status of temporal smoothness, we have presented two reasons to revisit the conclusion that this principle develops relatively late. First, some version of this principle seems necessary to account for experiments like the car and obstacle experiment in Figure 2. Second, most of the inertia experiments that produced null results use a habituation stimulus that may have prevented infants from revealing their default expectations, and the one experiment that escapes this objection considers a relatively minor violation of temporal smoothness. Additional experiments are needed to explore this principle, but we predict that the inertia principle will turn out to be yet another example of knowledge that is available earlier than researchers once thought.

4 Discussion and Conclusion

We argued that characterizations of infant knowledge should include room for probabilistic expectations, and that probabilistic expectations about spatial and temporal smoothness appear to play a role in infant object perception. To support these claims we described an ideal observer model that includes both categorical principles (C1 through C5) and probabilistic principles (S1 and S2), and demonstrated that the categorical principles alone are insufficient to account for several experimental findings. Our two probabilistic principles are related to principles (rigidity and inertia) that have previously been described as categorical principles. Although rigidity and inertia appear to play a role in some early inferences, formulating these principles as probabilistic expectations helps to explain how infants deal with non-rigid motion and violations of inertia.

Our analysis focused on some of the many existing experiments in the developmental literature, but new experiments will be needed to explore our probabilistic approach in depth. Categorical versions of a given principle (e.g. rigidity) allow room for only two kinds of behavior, depending on whether the principle is violated or not. Probabilistic principles can be violated to a greater or lesser extent, and our approach predicts that violations of different magnitude may lead to different behaviors. Future studies of rigidity and inertia can consider violations of these principles that range from mild (Figure 3a) to severe (Figure 3c), and can explore whether infants respond to these violations differently. Future work should also consider whether the categorical principles we described (C1 through C5) are better characterized as probabilistic expectations. In particular, future studies can explore whether young infants consider large violations of cohesion (C5) or spatial continuity (C2) more surprising than smaller violations of these principles.

Although we did not focus on learning, our approach allows us to begin thinking formally about how principles of object perception might be acquired. First, we can explore how parameters like the smoothness parameters in our model (λspace and λtime) might be tuned by experience. Second, we can use statistical model selection to explore transitions between different sets of principles. For instance, if a learner begins with the baseline model we considered (principles C1–C6), we can explore which subsequent observations provide the strongest statistical evidence for smoothness principles S1 and S2, and how much of this evidence is required before an ideal learner would prefer our smoothness model over the baseline model.
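For instance, the running evidence tally could be computed as below; this is our illustration rather than a procedure from the paper, and log_lik_smooth and log_lik_base are assumed to return normalized log-likelihoods of a velocity field under the smoothness and baseline models (e.g., via the log-space sums in the first sketch).

    def log_bayes_factor(observed_fields, log_lik_smooth, log_lik_base):
        # Positive totals favor the smoothness model; an ideal learner could
        # adopt principles S1 and S2 once the total crosses a threshold.
        return sum(log_lik_smooth(v) - log_lik_base(v) for v in observed_fields)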
It is not yet clear which principles of object perception could be learned, but the ideal observer approach can help to resolve this question.

References

[1] E. S. Spelke, K. Breinlinger, J. Macomber, and K. Jacobson. Origins of knowledge. Psychological Review, 99:605–632, 1992.
[2] R. Baillargeon, L. Kotovsky, and A. Needham. The acquisition of physical knowledge in infancy. In D. Sperber, D. Premack, and A. J. Premack, editors, Causal Cognition: A multidisciplinary debate, pages 79–116. Clarendon Press, Oxford, 1995.
[3] E. S. Spelke. Principles of object perception. Cognitive Science, 14:29–56, 1990.
[4] E. Spelke. Initial knowledge: six suggestions. Cognition, 50:431–445, 1994.
[5] D. Mareschal and S. P. Johnson. Learning to perceive object unity: a connectionist account. Developmental Science, 5:151–172, 2002.
[6] D. Kersten and A. Yuille. Bayesian models of object perception. Current Opinion in Neurobiology, 13:150–158, 2003.
[7] Y. Weiss and E. H. Adelson. Slow and smooth: a Bayesian theory for the combination of local motion signals in human vision. Technical Report A.I. Memo No. 1624, MIT, 1998.
[8] A. L. Yuille and N. M. Grzywacz. A mathematical analysis of the motion coherence theory. International Journal of Computer Vision, 3:155–175, 1989.
[9] W. S. Geisler. Physical limits of acuity and hyperacuity. Journal of the Optical Society of America, 1(7):775–782, 1984.
[10] M. S. Banks and E. Shannon. Spatial and chromatic visual efficiency in human neonates. In Visual perception and cognition in infancy, pages 1–46. Lawrence Erlbaum Associates, Hillsdale, NJ, 1993.
[11] R. Baillargeon. Innate ideas revisited: for a principle of persistence in infants' physical reasoning. Perspectives on Psychological Science, 3(3):2–13, 2008.
[12] R. Kestenbaum, N. Termine, and E. S. Spelke. Perception of objects and object boundaries by three-month-old infants. British Journal of Developmental Psychology, 5:367–383, 1987.
[13] E. S. Spelke, C. von Hofsten, and R. Kestenbaum. Object perception and object-directed reaching in infancy: interaction of spatial and kinetic information for object boundaries. Developmental Psychology, 25:185–196, 1989.
[14] G. Huntley-Fenner, S. Carey, and A. Solimando. Objects are individuals but stuff doesn't count: perceived rigidity and cohesiveness influence infants' representations of small groups of discrete entities. Cognition, 85:203–221, 2002.
[15] R. Baillargeon and J. DeVos. Object permanence in young infants: further evidence. Child Development, 61(6):1227–1246, 1991.
[16] E. S. Spelke, G. Katz, S. E. Purcell, S. M. Ehrlich, and K. Breinlinger. Early knowledge of object motion: continuity and inertia. Cognition, 51:131–176, 1994.
[17] L. Kotovsky and R. Baillargeon. Reasoning about collisions involving inert objects in 7.5-month-old infants. Developmental Science, 3(3):344–359, 2000.
[18] T. Wilcox and A. Schweinle. Infants' use of speed information to individuate objects in occlusion events. Infant Behavior and Development, 26:253–282, 2003.

2 0.83413285 198 nips-2008-Resolution Limits of Sparse Coding in High Dimensions

Author: Sundeep Rangan, Vivek Goyal, Alyson K. Fletcher

Abstract: This paper addresses the problem of sparsity pattern detection for unknown k-sparse n-dimensional signals observed through m noisy, random linear measurements. Sparsity pattern recovery arises in a number of settings including statistical model selection, pattern detection, and image acquisition. The main results in this paper are necessary and sufficient conditions for asymptotically-reliable sparsity pattern recovery in terms of the dimensions m, n and k as well as the signal-to-noise ratio (SNR) and the minimum-to-average ratio (MAR) of the nonzero entries of the signal. We show that m > 2k log(n − k)/(SNR · MAR) is necessary for any algorithm to succeed, regardless of complexity; this matches a previous sufficient condition for maximum likelihood estimation within a constant factor under certain scalings of k, SNR and MAR with n. We also show a sufficient condition for a computationally-trivial thresholding algorithm that is larger than the previous expression by only a factor of 4(1 + SNR) and larger than the requirement for lasso by only a factor of 4/MAR. This provides insight on the precise value and limitations of convex programming-based algorithms.

same-paper 3 0.80953336 171 nips-2008-Online Prediction on Large Diameter Graphs

Author: Mark Herbster, Guy Lever, Massimiliano Pontil

Abstract: We continue our study of online prediction of the labelling of a graph. We show a fundamental limitation of Laplacian-based algorithms: if the graph has a large diameter then the number of mistakes made by such algorithms may be proportional to the square root of the number of vertices, even when tackling simple problems. We overcome this drawback by means of an efficient algorithm which achieves a logarithmic mistake bound. It is based on the notion of a spine, a path graph which provides a linear embedding of the original graph. In practice, graphs may exhibit cluster structure; thus in the last part, we present a modified algorithm which achieves the “best of both worlds”: it performs well locally in the presence of cluster structure, and globally on large diameter graphs. 1

4 0.77404016 230 nips-2008-Temporal Difference Based Actor Critic Learning - Convergence and Neural Implementation

Author: Dotan D. Castro, Dmitry Volkinshtein, Ron Meir

Abstract: Actor-critic algorithms for reinforcement learning are achieving renewed popularity due to their good convergence properties in situations where other approaches often fail (e.g., when function approximation is involved). Interestingly, there is growing evidence that actor-critic approaches based on phasic dopamine signals play a key role in biological learning through cortical and basal ganglia loops. We derive a temporal difference based actor critic learning algorithm, for which convergence can be proved without assuming widely separated time scales for the actor and the critic. The approach is demonstrated by applying it to networks of spiking neurons. The established relation between phasic dopamine and the temporal difference signal lends support to the biological relevance of such algorithms. 1

5 0.6888839 132 nips-2008-Measures of Clustering Quality: A Working Set of Axioms for Clustering

Author: Shai Ben-David, Margareta Ackerman

Abstract: Aiming towards the development of a general clustering theory, we discuss abstract axiomatization for clustering. In this respect, we follow up on the work of Kleinberg [1], which showed an impossibility result for such an axiomatization. We argue that an impossibility result is not an inherent feature of clustering, but rather, to a large extent, it is an artifact of the specific formalism used in [1]. As opposed to previous work focusing on clustering functions, we propose to address clustering quality measures as the object to be axiomatized. We show that principles like those formulated in Kleinberg's axioms can be readily expressed in the latter framework without leading to inconsistency. A clustering-quality measure (CQM) is a function that, given a data set and its partition into clusters, returns a non-negative real number representing how strong or conclusive the clustering is. We analyze what clustering-quality measures should look like and introduce a set of requirements (axioms) for such measures. Our axioms capture the principles expressed by Kleinberg's axioms while retaining consistency. We propose several natural clustering quality measures, all satisfying the proposed axioms. In addition, we analyze the computational complexity of evaluating the quality of a given clustering and show that, for the proposed CQMs, it can be computed in polynomial time. 1

6 0.55257058 84 nips-2008-Fast Prediction on a Tree

7 0.54312885 179 nips-2008-Phase transitions for high-dimensional joint support recovery

8 0.52194619 215 nips-2008-Sparse Signal Recovery Using Markov Random Fields

9 0.51858574 10 nips-2008-A rational model of preference learning and choice prediction by children

10 0.50210238 201 nips-2008-Robust Near-Isometric Matching via Structured Learning of Graphical Models

11 0.50151616 149 nips-2008-Near-minimax recursive density estimation on the binary hypercube

12 0.49869314 202 nips-2008-Robust Regression and Lasso

13 0.49773201 136 nips-2008-Model selection and velocity estimation using novel priors for motion patterns

14 0.49452999 99 nips-2008-High-dimensional support union recovery in multivariate regression

15 0.49205193 104 nips-2008-Improved Moves for Truncated Convex Models

16 0.49172199 121 nips-2008-Learning to Use Working Memory in Partially Observable Environments through Dopaminergic Reinforcement

17 0.49057975 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE

18 0.4873786 196 nips-2008-Relative Margin Machines

19 0.48596737 95 nips-2008-Grouping Contours Via a Related Image

20 0.48501739 75 nips-2008-Estimating vector fields using sparse basis field expansions