nips nips2012 nips2012-196 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Xiao-ming Wu, Zhenguo Li, Anthony M. So, John Wright, Shih-fu Chang
Abstract: We propose a novel stochastic process that is with probability αi being absorbed at current state i, and with probability 1 − αi follows a random edge out of it. We analyze its properties and show its potential for exploring graph structures. We prove that under proper absorption rates, a random walk starting from a set S of low conductance will be mostly absorbed in S. Moreover, the absorption probabilities vary slowly inside S, while dropping sharply outside, thus implementing the desirable cluster assumption for graph-based learning. Remarkably, the partially absorbing process unifies many popular models arising in a variety of contexts, provides new insights into them, and makes it possible for transferring findings from one paradigm to another. Simulation results demonstrate its promising applications in retrieval and classification.
Reference: text
sentIndex sentText sentNum sentScore
1 hk Abstract We propose a novel stochastic process that is with probability αi being absorbed at current state i, and with probability 1 − αi follows a random edge out of it. [sent-6, score-0.283]
2 We prove that under proper absorption rates, a random walk starting from a set S of low conductance will be mostly absorbed in S. [sent-8, score-1.164]
3 Moreover, the absorption probabilities vary slowly inside S, while dropping sharply outside, thus implementing the desirable cluster assumption for graph-based learning. [sent-9, score-1.108]
4 Remarkably, the partially absorbing process unifies many popular models arising in a variety of contexts, provides new insights into them, and makes it possible for transferring findings from one paradigm to another. [sent-10, score-0.224]
5 In graph-based learning one often adopts the cluster assumption, which states that the semantics usually vary smoothly for vertices within regions of high density [17], and suggests to place the prediction boundary in regions of low density [5]. [sent-13, score-0.238]
6 It is thus interesting to ask how the cluster assumption can be realized in terms of random walks. [sent-14, score-0.139]
7 Although a random walk appears to explore the graph globally, it converges to a stationary distribution determined solely by vertex degrees regardless of the starting points, a phenomenon well known as the mixing of random walks [11]. [sent-15, score-0.506]
8 This causes some random walk approaches intended to capture non-local graph structures to fail, especially when the underlying graph is well connected, i. [sent-16, score-0.258]
9 For example, it was recently proven in [16] that under some mild conditions the hitting and commute times on large graphs do not take into account the global structure of the graph at all, despite the fact that they have integrated all the relevant paths on the graph. [sent-19, score-0.244]
10 A natural question is: can we design a random walk which implements the cluster assumption with some guarantees? [sent-22, score-0.226]
11 In this paper, we propose partially absorbing random walks (PARWs), a novel random walk model whose properties can be analyzed theoretically. [sent-23, score-0.417]
12 In PARWs, a random walk is with probability αi being absorbed at current state i, and with probability 1 − αi follows a random edge out of it. [sent-24, score-0.389]
13 PARWs are guaranteed to implement the cluster assumption in the sense that under proper absorp1 pkk pii k' p jj j' i' k k t =0 t =1 j i t=2 (a) (b) j i (c) Figure 1: A partially absorbing random walk. [sent-25, score-0.501]
14 tion rates, a random walk starting from a set S of low conductance will be mostly absorbed in S. [sent-29, score-0.473]
15 Furthermore, we show that by setting the absorption rates, the absorption probabilities can vary slowly inside S, while dropping sharply outside S. [sent-30, score-1.684]
16 More interestingly, it turns out that many existing models including PageRank, hitting and commute times, and label propagation algorithms in semi-supervised learning, can be unified or related in PARWs, which brings at least two benefits. [sent-32, score-0.194]
17 As this process continues, the amount of flow stored in each vertex will accumulate and there will be less and less flow left running on the graph. [sent-42, score-0.147]
18 Starting from state i, there is some probability pii that the process will stay at i in the next step; and once it stays, the process will be absorbed into state i. [sent-58, score-0.457]
19 Hence, we shall call the above process a partially absorbing random walk (PARW), where pii is the absorption rate of state i. [sent-59, score-1.21]
20 If 0 < pii < 1, then we say that i is a partially absorbing state. [sent-60, score-0.346]
21 If pii = 1, then we say that i is a fully absorbing state. [sent-61, score-0.315]
22 Finally, if pii = 0, then we say that i is a transient state. [sent-62, score-0.18]
23 Note that if pii ∈ {0, 1} for every state i ∈ N , then the above process reduces to a standard Markov chain [9]. [sent-63, score-0.208]
24 One can observe that any PARW can be realized as a standard Markov chain by adding a sink (fully absorbing state) to each vertex in the graph, as illustrated in Fig. [sent-65, score-0.317]
25 The transition 2 probability from i to its sink i′ equals the absorption rate pii in PARWs. [sent-67, score-0.912]
26 , the process is absorbed at a state only after it has stayed at that state for m-consecutive steps. [sent-70, score-0.273]
27 , dn ) with di = j wij as the degree of vertex i, and define the Laplacian of G by L = D − W [6]. [sent-79, score-0.218]
28 Suppose that we define the first order transition probabilities of a PARW by pij = λi λi +di , wij λi +di , i = j, (2) i = j. [sent-88, score-0.167]
29 Then, we see that state i is an absorbing state (either partially or fully) when λi > 0, and is a transient state when λi = 0. [sent-89, score-0.33]
30 In particular, the matrix Λ acts like a regularizer that controls the absorption rate of each state, i. [sent-90, score-0.707]
31 We are interested in the probability aij that a random walk starting from state i, is absorbed at state j in any finite number of steps. [sent-95, score-0.5]
32 Let A = [aij ] ∈ Rn×n be the matrix of absorption probabilities. [sent-96, score-0.691]
33 Now, observe that the absorbing probabilities {aij } satisfy the following equations: aii aij = λi ×1+ λi + di = k=i j=i wij aji , λi + di (3) wik akj , i = j. [sent-106, score-0.469]
34 This means that a PARW starting from any vertex will eventually be absorbed, provided that there is at least one absorbing state in the state space. [sent-115, score-0.411]
35 1, we see that the absorption probabilities (A) are governed by both the structure of the graph (L) and the regularizer matrix (Λ). [sent-118, score-0.833]
36 α at α = 0, implying that L+ reflects the variation of absorption probabilities when the absorption rate is very small. [sent-145, score-1.442]
37 By (6), we see that ranking by L+ is essentially the same as ranking by Aα , when α is sufficiently small. [sent-146, score-0.198]
38 Let a be the absorption probability vector of a PARW starting from vertex i. [sent-150, score-0.882]
39 This shows that personalized PageRank is a special case of PARWs with absorption rates pii = λiλi i = β. [sent-157, score-0.937]
40 By (6), when Λ = αI and α is sufficiently small, ranking with Hij or Cij (say, with respect to i) is the same as ranking by Aα − Aα or jj ij Aα + Aα − 2Aα respectively. [sent-160, score-0.274]
41 This appears to be not particularly meaningful because the term Aα jj ij jj ii is the self-absorption probability that does not contain any essential information with the starting vertex i. [sent-161, score-0.321]
42 This argument is also supported in a recent study by [16], where the hitting and commute times are shown to be dominated by the inverse of degrees of vertices. [sent-163, score-0.213]
43 This happens to suggest using absorption probabilities for ranking and as similarity measure, because when α is sufficiently small, ranking by the off-diagonal terms of L+ is essentially the same as ranking by Aα , i. [sent-166, score-1.048]
44 , the absorption probability of starting ij from i and being absorbed at j. [sent-168, score-0.967]
45 The harmonic function method [20] is a PARW when setting λi = ∞ (absorption rate 1) for the labeled vertices while λi = 0 (absorption rate 0) for the unlabeled. [sent-172, score-0.171]
46 In [19] the authors have made this interpretation in terms of absorbing random walks, where a random walk arriving at an absorbing state will stay there forever. [sent-173, score-0.492]
47 PARWs can be viewed as an extension of absorbing random walks. [sent-174, score-0.17]
48 The regularized harmonic function method [5] is also a PARW when setting λi = α for the labeled vertices while λi = 0 for the unlabeled. [sent-175, score-0.171]
49 If we add an additional sink to the graph, a variant of harmonic function method [10] and a variant of the regularized harmonic function method [3] can all be included as instances of PARWs. [sent-179, score-0.194]
50 For instance, hitting and commute times are not suitable for ranking given its interpretation in absorption probabilities, as discussed above. [sent-185, score-0.968]
51 3 PARWs with Graph Conductance In this section, we present results on the properties of the absorption probability vector ai obtained by a PARW starting from vertex i (i. [sent-191, score-0.906]
52 We show that properties of ai i relate closely to the connectivity between i and the rest of graph, which can be captured by the conductance of the cluster S where i belongs. [sent-194, score-0.237]
53 We also find that properties of ai depend on the setting of absorption rates. [sent-195, score-0.715]
54 In general, the probability mass of ai is mostly absorbed by S. [sent-197, score-0.257]
55 Under proper absorption rates, ai can vary slowly within S while dropping sharply outside S. [sent-198, score-0.952]
56 ¯ The conductance of a subset S ⊂ V of vertices is defined as Φ(S) = minw(S,S) S)) , where (d(S),d( ¯ ¯ ¯ w(S, S) := ¯ wij is the cut between S and its complement S [6]. [sent-200, score-0.225]
57 In terms of the conductance of S, the following theorem gives an upper bound on the expected probability mass escaped from S if the distribution of the starting vertex is π S . [sent-205, score-0.34]
58 1 shows that most of the probability mass will be absorbed in S, provided that S is of small conductance and the random walk starts from S according to π S . [sent-212, score-0.44]
59 To identify the entire cluster, what is more desirable would be that the absorption probabilities vary slowly within the cluster while dropping sharply outside. [sent-214, score-1.084]
60 As such, the cluster can be identified by detecting the sharp drop. [sent-215, score-0.123]
61 We show below that such property can be achieved by setting appropriate absorption rates at vertices. [sent-216, score-0.732]
62 To vary slowly within the cluster, the flow needs to be distributed evenly within it; while to drop sharply outside, the flow must be prevented from escaping. [sent-221, score-0.23]
63 This means that the absorption rates should be small in the interior but large near the boundary area of the cluster. [sent-222, score-0.788]
64 It corresponds to the absorption rates pii = λiλi i = α+di , which decrease monoton+d ically with di . [sent-224, score-0.941]
65 Since the degrees of vertices are usually relatively large in the interior of the cluster due to denser connections, and small near its boundary area (Fig. [sent-225, score-0.289]
66 2(a)), the absorption rates are therefore much larger at its boundary than in its interior (Fig. [sent-226, score-0.771]
67 State differently, a random walk may move freely inside the cluster, but it will get absorbed with high probability when traveling near the cluster’s boundary. [sent-228, score-0.35]
68 In this way, the absorption rates set up a bounding “wall” around the cluster to prevent the random walk from escaping, leading to an absorption probability vector that 1 A cluster is understood as a subset of vertices of small conductance. [sent-229, score-1.847]
69 5 −3 4 −3 x 10 4 2 2 0 0 (a) x 10 300 (b) 600 900 0 0 (c) 300 600 900 (d) Figure 2: Absorption rates and absorption probabilities. [sent-230, score-0.732]
70 (a) A data set of three Gaussians with the degrees of vertices in the underlying graph shown (see Section 4 for the descriptions of the data and graph construction). [sent-231, score-0.247]
71 (b–c) Absorption rates and absorption probabilities for Λ = αI (α = 10−3 ). [sent-233, score-0.792]
72 For illustration purpose, in (a–b), the degrees of vertices and the absorption rates have been properly scaled, and in (c), the data are arranged such that points within each Gaussian appear consecutively. [sent-235, score-0.877]
73 varies slowly within the cluster while dropping sharply outside (Figs. [sent-236, score-0.315]
74 , the absorption probability of starting from i and absorbed at j is equal to the probability of starting from j and absorbed at i. [sent-241, score-1.199]
75 For simplicity, we use the abbreviated notation a to denote ai , the absorption probability vector for the PARW starting from vertex i. [sent-242, score-0.906]
76 By (3) and the symmetry property, we immediately see that a has the following “harmonic” property: a(i) = λi + λi + di k=i wik a(k), λi + di a(j) = k=j wjk a(k), j = i. [sent-243, score-0.15]
77 Another desirable property one should notice for this PARW is that the starting vertex always has the largest absorption probability, as shown by the following lemma. [sent-245, score-0.885]
78 2 and without loss of generality, we assume the vertices are sorted so that a(1) > a(2) ≥ · · · ≥ a(n), where vertex 1 is the starting vertex. [sent-250, score-0.253]
79 ¯ The following theorem quantifies the drop of the absorption probabilities between Sk and Sk . [sent-256, score-0.811]
80 3 shows that the weighted difference in absorption probabilities between Sk and Sk is k α 1 − j=1 a(j) , implying that it drops slowly when α is small and as k increases, as expected. [sent-264, score-0.83]
81 Next we show the variation of absorption probabilities with graph conductance. [sent-265, score-0.817]
82 The following theorem says that a(j +1) will drop little from a(j) if the set Sj has high conductance or if the vertex j is far away from the starting vertex 1 (i. [sent-267, score-0.435]
83 2(a) with the starting vertex denoted in black circle. [sent-296, score-0.173]
84 5 tells us that if the set Sj has high conductance, then there will be a set Sk much larger than Sj where the absorption probability a(k) remains large. [sent-348, score-0.725]
85 1, we see that the absorption probability vector of the PARW with Λ = αI has the nice property of varying slowly within the cluster while dropping sharply outside. [sent-353, score-0.999]
86 We remark that similar analyses have been conducted in [1, 2] on personalized PageRank, for the local clustering problem [15] whose goal is to find a local cut of low conductance near a specified starting vertex. [sent-354, score-0.257]
87 As shown in Section 2, personalized PageRank is a special case of PARWs with β Λ = αD = 1−β D, which corresponds to setting the same absorption rate pii = β at each vertex. [sent-355, score-0.896]
88 This explains the “heuristic” used in [1, 2] where the personalized PageRank vector is divided by the degrees of vertices to generate a sharp drop. [sent-359, score-0.197]
89 The similarity between vertices i and j is computed as wij = exp(−d2 /σ) if i is within j’s k nearest neighbors or vice versa, and wij = 0 ij otherwise (wii = 0), where σ = 0. [sent-363, score-0.215]
90 The first experiment is to examine the absorption probabilities when varying absorption rates. [sent-365, score-1.442]
91 For Λ = αI, when α is large, most probability mass is absorbed in the cluster of the starting vertex (Fig. [sent-379, score-0.491]
92 As it becomes appropriately small, the probability mass distributes evenly within the cluster, and a sharp drop emerges (Fig. [sent-381, score-0.186]
93 As α → 0, the probability mass distributes more evenly within each cluster and also on the entire graph (Figs. [sent-383, score-0.274]
94 This is due to the uniform absorption rates on the graph, which makes the flow favor vertices with denser connections (i. [sent-387, score-0.83]
95 For PARWs, we classify each unlabeled instance u to the class of the labeled vertex v where u is most likely to be absorbed, i. [sent-406, score-0.125]
96 , v = arg maxi∈L aui where L denotes the labeled data and aui is the absorption probability. [sent-408, score-0.766]
97 5 Conclusions We have presented partially absorbing random walks (PARWs), a novel stochastic process generalizing ordinary random walks. [sent-417, score-0.327]
98 Moreover, a new algorithm developed from PARWs has been theoretically shown to be able to reveal cluster structure under the cluster assumption. [sent-419, score-0.2]
99 Detecting sharp drops in pagerank and a simplified local partitioning algorithm. [sent-429, score-0.219]
100 The pagerank citation ranking: Bringing order to the web. [sent-501, score-0.167]
wordName wordTfidf (topN-words)
[('absorption', 0.691), ('parws', 0.373), ('parw', 0.276), ('absorbed', 0.169), ('pagerank', 0.167), ('absorbing', 0.154), ('pii', 0.146), ('walk', 0.11), ('vertex', 0.106), ('cluster', 0.1), ('ranking', 0.099), ('commute', 0.098), ('conductance', 0.096), ('walks', 0.09), ('ow', 0.083), ('sj', 0.08), ('hitting', 0.08), ('vertices', 0.08), ('harmonic', 0.072), ('starting', 0.067), ('sharply', 0.067), ('graph', 0.066), ('di', 0.063), ('probabilities', 0.06), ('personalized', 0.059), ('dropping', 0.058), ('jj', 0.054), ('slowly', 0.05), ('wij', 0.049), ('state', 0.042), ('rates', 0.041), ('hmn', 0.041), ('sk', 0.041), ('hij', 0.038), ('drop', 0.038), ('lgc', 0.037), ('aij', 0.036), ('degrees', 0.035), ('pij', 0.035), ('sink', 0.034), ('mass', 0.031), ('partially', 0.031), ('drops', 0.029), ('cij', 0.028), ('aui', 0.028), ('diffusion', 0.026), ('usps', 0.025), ('outside', 0.025), ('gaussians', 0.025), ('wik', 0.024), ('laplacian', 0.024), ('ai', 0.024), ('realized', 0.023), ('sharp', 0.023), ('transition', 0.023), ('evenly', 0.023), ('andersen', 0.022), ('vary', 0.022), ('ndings', 0.022), ('ij', 0.022), ('theorem', 0.022), ('distributes', 0.021), ('sheds', 0.021), ('digits', 0.021), ('stored', 0.021), ('desirable', 0.021), ('boundary', 0.021), ('relations', 0.02), ('unify', 0.02), ('aii', 0.02), ('inside', 0.02), ('process', 0.02), ('transferring', 0.019), ('lim', 0.019), ('implementing', 0.019), ('labeled', 0.019), ('simulation', 0.019), ('xt', 0.019), ('transient', 0.019), ('clustering', 0.018), ('denser', 0.018), ('probability', 0.018), ('interior', 0.018), ('emerges', 0.017), ('near', 0.017), ('suppose', 0.017), ('relate', 0.017), ('manifold', 0.016), ('brings', 0.016), ('random', 0.016), ('included', 0.016), ('regularizer', 0.016), ('tells', 0.016), ('bousquet', 0.016), ('focs', 0.015), ('arranged', 0.015), ('say', 0.015), ('mostly', 0.015), ('semisupervised', 0.015), ('within', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 196 nips-2012-Learning with Partially Absorbing Random Walks
Author: Xiao-ming Wu, Zhenguo Li, Anthony M. So, John Wright, Shih-fu Chang
Abstract: We propose a novel stochastic process that is with probability αi being absorbed at current state i, and with probability 1 − αi follows a random edge out of it. We analyze its properties and show its potential for exploring graph structures. We prove that under proper absorption rates, a random walk starting from a set S of low conductance will be mostly absorbed in S. Moreover, the absorption probabilities vary slowly inside S, while dropping sharply outside, thus implementing the desirable cluster assumption for graph-based learning. Remarkably, the partially absorbing process unifies many popular models arising in a variety of contexts, provides new insights into them, and makes it possible for transferring findings from one paradigm to another. Simulation results demonstrate its promising applications in retrieval and classification.
2 0.098421797 141 nips-2012-GenDeR: A Generic Diversified Ranking Algorithm
Author: Jingrui He, Hanghang Tong, Qiaozhu Mei, Boleslaw Szymanski
Abstract: Diversified ranking is a fundamental task in machine learning. It is broadly applicable in many real world problems, e.g., information retrieval, team assembling, product search, etc. In this paper, we consider a generic setting where we aim to diversify the top-k ranking list based on an arbitrary relevance function and an arbitrary similarity function among all the examples. We formulate it as an optimization problem and show that in general it is NP-hard. Then, we show that for a large volume of the parameter space, the proposed objective function enjoys the diminishing returns property, which enables us to design a scalable, greedy algorithm to find the (1 − 1/e) near-optimal solution. Experimental results on real data sets demonstrate the effectiveness of the proposed algorithm.
3 0.085561492 165 nips-2012-Iterative ranking from pair-wise comparisons
Author: Sahand Negahban, Sewoong Oh, Devavrat Shah
Abstract: The question of aggregating pairwise comparisons to obtain a global ranking over a collection of objects has been of interest for a very long time: be it ranking of online gamers (e.g. MSR’s TrueSkill system) and chess players, aggregating social opinions, or deciding which product to sell based on transactions. In most settings, in addition to obtaining ranking, finding ‘scores’ for each object (e.g. player’s rating) is of interest to understanding the intensity of the preferences. In this paper, we propose a novel iterative rank aggregation algorithm for discovering scores for objects from pairwise comparisons. The algorithm has a natural random walk interpretation over the graph of objects with edges present between two objects if they are compared; the scores turn out to be the stationary probability of this random walk. The algorithm is model independent. To establish the efficacy of our method, however, we consider the popular Bradley-Terry-Luce (BTL) model in which each object has an associated score which determines the probabilistic outcomes of pairwise comparisons between objects. We bound the finite sample error rates between the scores assumed by the BTL model and those estimated by our algorithm. This, in essence, leads to order-optimal dependence on the number of samples required to learn the scores well by our algorithm. Indeed, the experimental evaluation shows that our (model independent) algorithm performs as well as the Maximum Likelihood Estimator of the BTL model and outperforms a recently proposed algorithm by Ammar and Shah [1]. 1
4 0.069704451 69 nips-2012-Clustering Sparse Graphs
Author: Yudong Chen, Sujay Sanghavi, Huan Xu
Abstract: We develop a new algorithm to cluster sparse unweighted graphs – i.e. partition the nodes into disjoint clusters so that there is higher density within clusters, and low across clusters. By sparsity we mean the setting where both the in-cluster and across cluster edge densities are very small, possibly vanishing in the size of the graph. Sparsity makes the problem noisier, and hence more difficult to solve. Any clustering involves a tradeoff between minimizing two kinds of errors: missing edges within clusters and present edges across clusters. Our insight is that in the sparse case, these must be penalized differently. We analyze our algorithm’s performance on the natural, classical and widely studied “planted partition” model (also called the stochastic block model); we show that our algorithm can cluster sparser graphs, and with smaller clusters, than all previous methods. This is seen empirically as well. 1
5 0.067421354 260 nips-2012-Online Sum-Product Computation Over Trees
Author: Mark Herbster, Stephen Pasteris, Fabio Vitale
Abstract: We consider the problem of performing efficient sum-product computations in an online setting over a tree. A natural application of our methods is to compute the marginal distribution at a vertex in a tree-structured Markov random field. Belief propagation can be used to solve this problem, but requires time linear in the size of the tree, and is therefore too slow in an online setting where we are continuously receiving new data and computing individual marginals. With our method we aim to update the data and compute marginals in time that is no more than logarithmic in the size of the tree, and is often significantly less. We accomplish this via a hierarchical covering structure that caches previous local sum-product computations. Our contribution is three-fold: we i) give a linear time algorithm to find an optimal hierarchical cover of a tree; ii) give a sum-productlike algorithm to efficiently compute marginals with respect to this cover; and iii) apply “i” and “ii” to find an efficient algorithm with a regret bound for the online allocation problem in a multi-task setting. 1
6 0.066039748 70 nips-2012-Clustering by Nonnegative Matrix Factorization Using Graph Random Walk
7 0.062243726 68 nips-2012-Clustering Aggregation as Maximum-Weight Independent Set
8 0.061364546 323 nips-2012-Statistical Consistency of Ranking Methods in A Rank-Differentiable Probability Space
9 0.045256287 337 nips-2012-The Lovász ϑ function, SVMs and finding large dense subgraphs
10 0.044481032 309 nips-2012-Semi-supervised Eigenvectors for Locally-biased Learning
11 0.043010499 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking
12 0.042479962 99 nips-2012-Dip-means: an incremental clustering method for estimating the number of clusters
13 0.04219972 155 nips-2012-Human memory search as a random walk in a semantic network
14 0.041939694 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models
15 0.040920645 135 nips-2012-Forging The Graphs: A Low Rank and Positive Semidefinite Graph Learning Approach
16 0.040486995 169 nips-2012-Label Ranking with Partial Abstention based on Thresholded Probabilistic Models
17 0.039786641 326 nips-2012-Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses
18 0.03964299 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification
19 0.039347131 38 nips-2012-Algorithms for Learning Markov Field Policies
20 0.038967419 253 nips-2012-On Triangular versus Edge Representations --- Towards Scalable Modeling of Networks
topicId topicWeight
[(0, 0.105), (1, 0.011), (2, 0.015), (3, -0.037), (4, -0.008), (5, -0.03), (6, -0.02), (7, -0.001), (8, -0.021), (9, 0.11), (10, 0.009), (11, -0.083), (12, -0.051), (13, 0.0), (14, 0.052), (15, -0.005), (16, -0.03), (17, 0.032), (18, 0.012), (19, -0.0), (20, -0.019), (21, -0.052), (22, -0.025), (23, 0.034), (24, 0.01), (25, 0.041), (26, -0.034), (27, 0.094), (28, -0.041), (29, -0.054), (30, 0.051), (31, -0.046), (32, 0.032), (33, 0.02), (34, 0.02), (35, -0.086), (36, -0.003), (37, 0.007), (38, 0.043), (39, 0.019), (40, 0.005), (41, 0.015), (42, 0.019), (43, 0.03), (44, -0.055), (45, -0.023), (46, -0.06), (47, -0.029), (48, -0.024), (49, -0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.9010846 196 nips-2012-Learning with Partially Absorbing Random Walks
Author: Xiao-ming Wu, Zhenguo Li, Anthony M. So, John Wright, Shih-fu Chang
Abstract: We propose a novel stochastic process that is with probability αi being absorbed at current state i, and with probability 1 − αi follows a random edge out of it. We analyze its properties and show its potential for exploring graph structures. We prove that under proper absorption rates, a random walk starting from a set S of low conductance will be mostly absorbed in S. Moreover, the absorption probabilities vary slowly inside S, while dropping sharply outside, thus implementing the desirable cluster assumption for graph-based learning. Remarkably, the partially absorbing process unifies many popular models arising in a variety of contexts, provides new insights into them, and makes it possible for transferring findings from one paradigm to another. Simulation results demonstrate its promising applications in retrieval and classification.
2 0.69842702 141 nips-2012-GenDeR: A Generic Diversified Ranking Algorithm
Author: Jingrui He, Hanghang Tong, Qiaozhu Mei, Boleslaw Szymanski
Abstract: Diversified ranking is a fundamental task in machine learning. It is broadly applicable in many real world problems, e.g., information retrieval, team assembling, product search, etc. In this paper, we consider a generic setting where we aim to diversify the top-k ranking list based on an arbitrary relevance function and an arbitrary similarity function among all the examples. We formulate it as an optimization problem and show that in general it is NP-hard. Then, we show that for a large volume of the parameter space, the proposed objective function enjoys the diminishing returns property, which enables us to design a scalable, greedy algorithm to find the (1 − 1/e) near-optimal solution. Experimental results on real data sets demonstrate the effectiveness of the proposed algorithm.
3 0.63887513 69 nips-2012-Clustering Sparse Graphs
Author: Yudong Chen, Sujay Sanghavi, Huan Xu
Abstract: We develop a new algorithm to cluster sparse unweighted graphs – i.e. partition the nodes into disjoint clusters so that there is higher density within clusters, and low across clusters. By sparsity we mean the setting where both the in-cluster and across cluster edge densities are very small, possibly vanishing in the size of the graph. Sparsity makes the problem noisier, and hence more difficult to solve. Any clustering involves a tradeoff between minimizing two kinds of errors: missing edges within clusters and present edges across clusters. Our insight is that in the sparse case, these must be penalized differently. We analyze our algorithm’s performance on the natural, classical and widely studied “planted partition” model (also called the stochastic block model); we show that our algorithm can cluster sparser graphs, and with smaller clusters, than all previous methods. This is seen empirically as well. 1
4 0.61346388 323 nips-2012-Statistical Consistency of Ranking Methods in A Rank-Differentiable Probability Space
Author: Yanyan Lan, Jiafeng Guo, Xueqi Cheng, Tie-yan Liu
Abstract: This paper is concerned with the statistical consistency of ranking methods. Recently, it was proven that many commonly used pairwise ranking methods are inconsistent with the weighted pairwise disagreement loss (WPDL), which can be viewed as the true loss of ranking, even in a low-noise setting. This result is interesting but also surprising, given that the pairwise ranking methods have been shown very effective in practice. In this paper, we argue that the aforementioned result might not be conclusive, depending on what kind of assumptions are used. We give a new assumption that the labels of objects to rank lie in a rank-differentiable probability space (RDPS), and prove that the pairwise ranking methods become consistent with WPDL under this assumption. What is especially inspiring is that RDPS is actually not stronger than but similar to the low-noise setting. Our studies provide theoretical justifications of some empirical findings on pairwise ranking methods that are unexplained before, which bridge the gap between theory and applications.
5 0.60185128 165 nips-2012-Iterative ranking from pair-wise comparisons
Author: Sahand Negahban, Sewoong Oh, Devavrat Shah
Abstract: The question of aggregating pairwise comparisons to obtain a global ranking over a collection of objects has been of interest for a very long time: be it ranking of online gamers (e.g. MSR’s TrueSkill system) and chess players, aggregating social opinions, or deciding which product to sell based on transactions. In most settings, in addition to obtaining ranking, finding ‘scores’ for each object (e.g. player’s rating) is of interest to understanding the intensity of the preferences. In this paper, we propose a novel iterative rank aggregation algorithm for discovering scores for objects from pairwise comparisons. The algorithm has a natural random walk interpretation over the graph of objects with edges present between two objects if they are compared; the scores turn out to be the stationary probability of this random walk. The algorithm is model independent. To establish the efficacy of our method, however, we consider the popular Bradley-Terry-Luce (BTL) model in which each object has an associated score which determines the probabilistic outcomes of pairwise comparisons between objects. We bound the finite sample error rates between the scores assumed by the BTL model and those estimated by our algorithm. This, in essence, leads to order-optimal dependence on the number of samples required to learn the scores well by our algorithm. Indeed, the experimental evaluation shows that our (model independent) algorithm performs as well as the Maximum Likelihood Estimator of the BTL model and outperforms a recently proposed algorithm by Ammar and Shah [1]. 1
6 0.60039079 68 nips-2012-Clustering Aggregation as Maximum-Weight Independent Set
7 0.56981581 169 nips-2012-Label Ranking with Partial Abstention based on Thresholded Probabilistic Models
8 0.54701418 70 nips-2012-Clustering by Nonnegative Matrix Factorization Using Graph Random Walk
9 0.54250109 135 nips-2012-Forging The Graphs: A Low Rank and Positive Semidefinite Graph Learning Approach
10 0.50903744 309 nips-2012-Semi-supervised Eigenvectors for Locally-biased Learning
11 0.50701654 22 nips-2012-A latent factor model for highly multi-relational data
12 0.49980631 337 nips-2012-The Lovász ϑ function, SVMs and finding large dense subgraphs
13 0.48886609 133 nips-2012-Finding Exemplars from Pairwise Dissimilarities via Simultaneous Sparse Recovery
14 0.456278 286 nips-2012-Random Utility Theory for Social Choice
15 0.45600083 155 nips-2012-Human memory search as a random walk in a semantic network
16 0.45549315 85 nips-2012-Convergence and Energy Landscape for Cheeger Cut Clustering
17 0.45270059 157 nips-2012-Identification of Recurrent Patterns in the Activation of Brain Networks
18 0.4510425 291 nips-2012-Reducing statistical time-series problems to binary classification
19 0.45046502 307 nips-2012-Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning
20 0.44930834 128 nips-2012-Fast Resampling Weighted v-Statistics
topicId topicWeight
[(0, 0.036), (2, 0.272), (17, 0.025), (21, 0.02), (36, 0.01), (38, 0.146), (42, 0.018), (54, 0.035), (55, 0.016), (74, 0.052), (76, 0.128), (80, 0.074), (92, 0.061)]
simIndex simValue paperId paperTitle
same-paper 1 0.788872 196 nips-2012-Learning with Partially Absorbing Random Walks
Author: Xiao-ming Wu, Zhenguo Li, Anthony M. So, John Wright, Shih-fu Chang
Abstract: We propose a novel stochastic process that is with probability αi being absorbed at current state i, and with probability 1 − αi follows a random edge out of it. We analyze its properties and show its potential for exploring graph structures. We prove that under proper absorption rates, a random walk starting from a set S of low conductance will be mostly absorbed in S. Moreover, the absorption probabilities vary slowly inside S, while dropping sharply outside, thus implementing the desirable cluster assumption for graph-based learning. Remarkably, the partially absorbing process unifies many popular models arising in a variety of contexts, provides new insights into them, and makes it possible for transferring findings from one paradigm to another. Simulation results demonstrate its promising applications in retrieval and classification.
2 0.76814193 8 nips-2012-A Generative Model for Parts-based Object Segmentation
Author: S. Eslami, Christopher Williams
Abstract: The Shape Boltzmann Machine (SBM) [1] has recently been introduced as a stateof-the-art model of foreground/background object shape. We extend the SBM to account for the foreground object’s parts. Our new model, the Multinomial SBM (MSBM), can capture both local and global statistics of part shapes accurately. We combine the MSBM with an appearance model to form a fully generative model of images of objects. Parts-based object segmentations are obtained simply by performing probabilistic inference in the model. We apply the model to two challenging datasets which exhibit significant shape and appearance variability, and find that it obtains results that are comparable to the state-of-the-art. There has been significant focus in computer vision on object recognition and detection e.g. [2], but a strong desire remains to obtain richer descriptions of objects than just their bounding boxes. One such description is a parts-based object segmentation, in which an image is partitioned into multiple sets of pixels, each belonging to either a part of the object of interest, or its background. The significance of parts in computer vision has been recognized since the earliest days of the field (e.g. [3, 4, 5]), and there exists a rich history of work on probabilistic models for parts-based segmentation e.g. [6, 7]. Many such models only consider local neighborhood statistics, however several models have recently been proposed that aim to increase the accuracy of segmentations by also incorporating prior knowledge about the foreground object’s shape [8, 9, 10, 11]. In such cases, probabilistic techniques often mainly differ in how accurately they represent and learn about the variability exhibited by the shapes of the object’s parts. Accurate models of the shapes and appearances of parts can be necessary to perform inference in datasets that exhibit large amounts of variability. In general, the stronger the models of these two components, the more performance is improved. A generative model has the added benefit of being able to generate samples, which allows us to visually inspect the quality of its understanding of the data and the problem. Recently, a generative probabilistic model known as the Shape Boltzmann Machine (SBM) has been used to model binary object shapes [1]. The SBM has been shown to constitute the state-of-the-art and it possesses several highly desirable characteristics: samples from the model look realistic, and it generalizes to generate samples that differ from the limited number of examples it is trained on. The main contributions of this paper are as follows: 1) In order to account for object parts we extend the SBM to use multinomial visible units instead of binary ones, resulting in the Multinomial Shape Boltzmann Machine (MSBM), and we demonstrate that the MSBM constitutes a strong model of parts-based object shape. 2) We combine the MSBM with an appearance model to form a fully generative model of images of objects (see Fig. 1). We show how parts-based object segmentations can be obtained simply by performing probabilistic inference in the model. We apply our model to two challenging datasets and find that in addition to being principled and fully generative, the model’s performance is comparable to the state-of-the-art. 1 Train labels Train images Test image Appearance model Joint Model Shape model Parsing Figure 1: Overview. Using annotated images separate models of shape and appearance are trained. Given an unseen test image, its parsing is obtained via inference in the proposed joint model. In Secs. 1 and 2 we present the model and propose efficient inference and learning schemes. In Sec. 3 we compare and contrast the resulting joint model with existing work in the literature. We describe our experimental results in Sec. 4 and conclude with a discussion in Sec. 5. 1 Model We consider datasets of cropped images of an object class. We assume that the images are constructed through some combination of a fixed number of parts. Given a dataset D = {Xd }, d = 1...n of such images X, each consisting of P pixels {xi }, i = 1...P , we wish to infer a segmentation S for the image. S consists of a labeling si for every pixel, where si is a 1-of-(L+1) encoded variable, and L is the fixed number of parts that combine to generate the foreground. In other words, si = (sli ), P l = 0...L, sli 2 {0, 1} and l sli = 1. Note that the background is also treated as a ‘part’ (l = 0). Accurate inference of S is driven by models for 1) part shapes and 2) part appearances. Part shapes: Several types of models can be used to define probabilistic distributions over segmentations S. The simplest approach is to model each pixel si independently with categorical variables whose parameters are specified by the object’s mean shape (Fig. 2(a)). Markov Random Fields (MRFs, Fig. 2(b)) additionally model interactions between nearby pixels using pairwise potential functions that efficiently capture local properties of images like smoothness and continuity. Restricted Boltzmann Machines (RBMs) and their multi-layered counterparts Deep Boltzmann Machines (DBMs, Fig. 2(c)) make heavy use of hidden variables to efficiently define higher-order potentials that take into account the configuration of larger groups of image pixels. The introduction of such hidden variables provides a way to efficiently capture complex, global properties of image pixels. RBMs and DBMs are powerful generative models, but they also have many parameters. Segmented images, however, are expensive to obtain and datasets are typically small (hundreds of examples). In order to learn a model that accurately captures the properties of part shapes we use DBMs but also impose carefully chosen connectivity and capacity constraints, following the structure of the Shape Boltzmann Machine (SBM) [1]. We further extend the model to account for multi-part shapes to obtain the Multinomial Shape Boltzmann Machine (MSBM). The MSBM has two layers of latent variables: h1 and h2 (collectively H = {h1 , h2 }), and defines a P Boltzmann distribution over segmentations p(S) = h1 ,h2 exp{ E(S, h1 , h2 |✓s )}/Z(✓s ) where X X X X X 1 2 E(S, h1 , h2 |✓s ) = bli sli + wlij sli h1 + c 1 h1 + wjk h1 h2 + c2 h2 , (1) j j j j k k k i,l j i,j,l j,k k where j and k range over the first and second layer hidden variables, and ✓s = {W 1 , W 2 , b, c1 , c2 } are the shape model parameters. In the first layer, local receptive fields are enforced by connecting each hidden unit in h1 only to a subset of the visible units, corresponding to one of four patches, as shown in Fig. 2(d,e). Each patch overlaps its neighbor by b pixels, which allows boundary continuity to be learned at the lowest layer. We share weights between the four sets of first-layer hidden units and patches, and purposely restrict the number of units in h2 . These modifications significantly reduce the number of parameters whilst taking into account an important property of shapes, namely that the strongest dependencies between pixels are typically local. 2 h2 1 1 h S S (a) Mean h S (b) MRF h2 h2 h1 S S (c) DBM b (d) SBM (e) 2D SBM Figure 2: Models of shape. Object shape is modeled with undirected graphical models. (a) 1D slice of a mean model. (b) Markov Random Field in 1D. (c) Deep Boltzmann Machine in 1D. (d) 1D slice of a Shape Boltzmann Machine. (e) Shape Boltzmann Machine in 2D. In all models latent units h are binary and visible units S are multinomial random variables. Based on Fig. 2 of [1]. k=1 k=2 k=3 k=1 k=2 k=3 k=1 k=2 k=3 ⇡ l=0 l=1 l=2 Figure 3: A model of appearances. Left: An exemplar dataset. Here we assume one background (l = 0) and two foreground (l = 1, non-body; l = 2, body) parts. Right: The corresponding appearance model. In this example, L = 2, K = 3 and W = 6. Best viewed in color. Part appearances: Pixels in a given image are assumed to have been generated by W fixed Gaussians in RGB space. During pre-training, the means {µw } and covariances {⌃w } of these Gaussians are extracted by training a mixture model with W components on every pixel in the dataset, ignoring image and part structure. It is also assumed that each of the L parts can have different appearances in different images, and that these appearances can be clustered into K classes. The classes differ in how likely they are to use each of the W components when ‘coloring in’ the part. The generative process is as follows. For part l in an image, one of the K classes is chosen (represented by a 1-of-K indicator variable al ). Given al , the probability distribution defined on pixels associated with part l is given by a Gaussian mixture model with means {µw } and covariances {⌃w } and mixing proportions { lkw }. The prior on A = {al } specifies the probability ⇡lk of appearance class k being chosen for part l. Therefore appearance parameters ✓a = {⇡lk , lkw } (see Fig. 3) and: a p(xi |A, si , ✓ ) = p(A|✓a ) = Y l Y l a sli p(xi |al , ✓ ) p(al |✓a ) = = Y Y X YY l l k w lkw N (xi |µw , ⌃w ) !alk !sli (⇡lk )alk . , (2) (3) k Combining shapes and appearances: To summarize, the latent variables for X are A, S, H, and the model’s active parameters ✓ include shape parameters ✓s and appearance parameters ✓a , so that p(X, A, S, H|✓) = Y 1 p(A|✓a )p(S, H|✓s ) p(xi |A, si , ✓a ) , Z( ) i (4) where the parameter adjusts the relative contributions of the shape and appearance components. See Fig. 4 for an illustration of the complete graphical model. During learning, we find the values of ✓ that maximize the likelihood of the training data D, and segmentation is performed on a previously-unseen image by querying the marginal distribution p(S|Xtest , ✓). Note that Z( ) is constant throughout the execution of the algorithms. We set via trial and error in our experiments. 3 n H ✓a si al H xi L+1 ✓s S X A P Figure 4: A model of shape and appearance. Left: The joint model. Pixels xi are modeled via appearance variables al . The model’s belief about each layer’s shape is captured by shape variables H. Segmentation variables si assign each pixel to a layer. Right: Schematic for an image X. 2 Inference and learning Inference: We approximate p(A, S, H|X, ✓) by drawing samples of A, S and H using block-Gibbs Markov Chain Monte Carlo (MCMC). The desired distribution p(S|X, ✓) can then be obtained by considering only the samples for S (see Algorithm 1). In order to sample p(A|S, H, X, ✓) we consider the conditional distribution of appearance class k being chosen for part l which is given by: Q P ·s ⇡lk i ( w lkw N (xi |µw , ⌃w )) li h Q P i. p(alk = 1|S, X, ✓) = P (5) K ·sli r=1 ⇡lr i( w lrw N (xi |µw , ⌃w )) Since the MSBM only has edges between each pair of adjacent layers, all hidden units within a layer are conditionally independent given the units in the other two layers. This property can be exploited to make inference in the shape model exact and efficient. The conditional probabilities are: X X 1 2 p(h1 = 1|s, h2 , ✓) = ( wlij sli + wjk h2 + c1 ), (6) j k j i,l p(h2 k 1 = 1|h , ✓) = ( X k 2 wjk h1 j + c2 ), j (7) j where (y) = 1/(1 + exp( y)) is the sigmoid function. To sample from p(H|S, X, ✓) we iterate between Eqns. 6 and 7 multiple times and keep only the final values of h1 and h2 . Finally, we draw samples for the pixels in p(S|A, H, X, ✓) independently: P 1 exp( j wlij h1 + bli ) p(xi |A, sli = 1, ✓) j p(sli = 1|A, H, X, ✓) = PL . (8) P 1 1 m=1 exp( j wmij hj + bmi ) p(xi |A, smi = 1, ✓) Seeding: Since the latent-space is extremely high-dimensional, in practice we find it helpful to run several inference chains, each initializing S(1) to a different value. The ‘best’ inference is retained and the others are discarded. The computation of the likelihood p(X|✓) of image X is intractable, so we approximate the quality of each inference using a scoring function: 1X Score(X|✓) = p(X, A(t) , S(t) , H(t) |✓), (9) T t where {A(t) , S(t) , H(t) }, t = 1...T are the samples obtained from the posterior p(A, S, H|X, ✓). If the samples were drawn from the prior p(A, S, H|✓) the scoring function would be an unbiased estimator of p(X|✓), but would be wildly inaccurate due to the high probability of missing the important regions of latent space (see e.g. [12, p. 107-109] for further discussion of this issue). Learning: Learning of the model involves maximizing the log likelihood log p(D|✓a , ✓s ) of the training dataset D with respect to the model parameters ✓a and ✓s . Since training is partially supervised, in that for each image X its corresponding segmentation S is also given, we can learn the parameters of the shape and appearance components separately. For appearances, the learning of the mixing coefficients and the histogram parameters decomposes into standard mixture updates independently for each part. For shapes, we follow the standard deep 4 Algorithm 1 MCMC inference algorithm. 1: procedure I NFER(X, ✓) 2: Initialize S(1) , H(1) 3: for t 2 : chain length do 4: A(t) ⇠ p(A|S(t 1) , H(t 1) , X, ✓) 5: S(t) ⇠ p(S|A(t) , H(t 1) , X, ✓) 6: H(t) ⇠ p(H|S(t) , ✓) 7: return {S(t) }t=burnin:chain length learning literature closely [13, 1]. In the pre-training phase we greedily train the model bottom up, one layer at a time. We begin by training an RBM on the observed data using stochastic maximum likelihood learning (SML; also referred to as ‘persistent CD’; [14, 13]). Once this RBM is trained, we infer the conditional mean of the hidden units for each training image. The resulting vectors then serve as the training data for a second RBM which is again trained using SML. We use the parameters of these two RBMs to initialize the parameters of the full MSBM model. In the second phase we perform approximate stochastic gradient ascent in the likelihood of the full model to finetune the parameters in an EM-like scheme as described in [13]. 3 Related work Existing probabilistic models of images can be categorized by the amount of variability they expect to encounter in the data and by how they model this variability. A significant portion of the literature models images using only two parts: a foreground object and its background e.g. [15, 16, 17, 18, 19]. Models that account for the parts within the foreground object mainly differ in how accurately they learn about and represent the variability of the shapes of the object’s parts. In Probabilistic Index Maps (PIMs) [8] a mean partitioning is learned, and the deformable PIM [9] additionally allows for local deformations of this mean partitioning. Stel Component Analysis [10] accounts for larger amounts of shape variability by learning a number of different template means for the object that are blended together on a pixel-by-pixel basis. Factored Shapes and Appearances [11] models global properties of shape using a factor analysis-like model, and ‘masked’ RBMs have been used to model more local properties of shape [20]. However, none of these models constitute a strong model of shape in terms of realism of samples and generalization capabilities [1]. We demonstrate in Sec. 4 that, like the SBM, the MSBM does in fact possess these properties. The closest works to ours in terms of ability to deal with datasets that exhibit significant variability in both shape and appearance are the works of Bo and Fowlkes [21] and Thomas et al. [22]. Bo and Fowlkes [21] present an algorithm for pedestrian segmentation that models the shapes of the parts using several template means. The different parts are composed using hand coded geometric constraints, which means that the model cannot be automatically extended to other application domains. The Implicit Shape Model (ISM) used in [22] is reliant on interest point detectors and defines distributions over segmentations only in the posterior, and therefore is not fully generative. The model presented here is entirely learned from data and fully generative, therefore it can be applied to new datasets and diagnosed with relative ease. Due to its modular structure, we also expect it to rapidly absorb future developments in shape and appearance models. 4 Experiments Penn-Fudan pedestrians: The first dataset that we considered is Penn-Fudan pedestrians [23], consisting of 169 images of pedestrians (Fig. 6(a)). The images are annotated with ground-truth segmentations for L = 7 different parts (hair, face, upper and lower clothes, shoes, legs, arms; Fig. 6(d)). We compare the performance of the model with the algorithm of Bo and Fowlkes [21]. For the shape component, we trained an MSBM on the 684 images of a labeled version of the HumanEva dataset [24] (at 48 ⇥ 24 pixels; also flipped horizontally) with overlap b = 4, and 400 and 50 hidden units in the first and second layers respectively. Each layer was pre-trained for 3000 epochs (iterations). After pre-training, joint training was performed for 1000 epochs. 5 (c) Completion (a) Sampling (b) Diffs ! ! ! Figure 5: Learned shape model. (a) A chain of samples (1000 samples between frames). The apparent ‘blurriness’ of samples is not due to averaging or resizing. We display the probability of each pixel belonging to different parts. If, for example, there is a 50-50 chance that a pixel belongs to the red or blue parts, we display that pixel in purple. (b) Differences between the samples and their most similar counterparts in the training dataset. (c) Completion of occlusions (pink). To assess the realism and generalization characteristics of the learned MSBM we sample from it. In Fig. 5(a) we show a chain of unconstrained samples from an MSBM generated via block-Gibbs MCMC (1000 samples between frames). The model captures highly non-linear correlations in the data whilst preserving the object’s details (e.g. face and arms). To demonstrate that the model has not simply memorized the training data, in Fig. 5(b) we show the difference between the sampled shapes in Fig. 5(a) and their closest images in the training set (based on per-pixel label agreement). We see that the model generalizes in non-trivial ways to generate realistic shapes that it had not encountered during training. In Fig. 5(c) we show how the MSBM completes rectangular occlusions. The samples highlight the variability in possible completions captured by the model. Note how, e.g. the length of the person’s trousers on one leg affects the model’s predictions for the other, demonstrating the model’s knowledge about long-range dependencies. An interactive M ATLAB GUI for sampling from this MSBM has been included in the supplementary material. The Penn-Fudan dataset (at 200 ⇥ 100 pixels) was then split into 10 train/test cross-validation splits without replacement. We used the training images in each split to train the appearance component with a vocabulary of size W = 50 and K = 100 mixture components1 . We additionally constrained the model by sharing the appearance models for the arms and legs with that of the face. We assess the quality of the appearance model by performing the following experiment: for each test image, we used the scoring function described in Eq. 9 to evaluate a number of different proposal segmentations for that image. We considered 10 randomly chosen segmentations from the training dataset as well as the ground-truth segmentation for the test image, and found that the appearance model correctly assigns the highest score to the ground-truth 95% of the time. During inference, the shape and appearance models (which are defined on images of different sizes), were combined at 200 ⇥ 100 pixels via M ATLAB’s imresize function, and we set = 0.8 (Eq. 8) via trial and error. Inference chains were seeded at 100 exemplar segmentations from the HumanEva dataset (obtained using the K-medoids algorithm with K = 100), and were run for 20 Gibbs iterations each (with 5 iterations of Eqs. 6 and 7 per Gibbs iteration). Our unoptimized M ATLAB implementation completed inference for each chain in around 7 seconds. We compute the conditional probability of each pixel belonging to different parts given the last set of samples obtained from the highest scoring chain, assign each pixel independently to the most likely part at that pixel, and report the percentage of correctly labeled pixels (see Table 1). We find that accuracy can be improved using superpixels (SP) computed on X (pixels within a superpixel are all assigned the most common label within it; as with [21] we use gPb-OWT-UCM [25]). We also report the accuracy obtained, had the top scoring seed segmentation been returned as the final segmentation for each image. Here the quality of the seed is determined solely by the appearance model. We observe that the model has comparable performance to the state-of-the-art but pedestrianspecific algorithm of [21], and that inference in the model significantly improves the accuracy of the segmentations over the baseline (top seed+SP). Qualitative results can be seen in Fig. 6(c). 1 We obtained the best quantitative results with these settings. The appearances exhibited by the parts in the dataset are highly varied, and the complexity of the appearance model reflects this fact. 6 Table 1: Penn-Fudan pedestrians. We report the percentage of correctly labeled pixels. The final column is an average of the background, upper and lower body scores (as reported in [21]). FG BG Upper Body Lower Body Head Average Bo and Fowlkes [21] 73.3% 81.1% 73.6% 71.6% 51.8% 69.5% MSBM MSBM + SP 70.7% 71.6% 72.8% 73.8% 68.6% 69.9% 66.7% 68.5% 53.0% 54.1% 65.3% 66.6% Top seed Top seed + SP 59.0% 61.6% 61.8% 67.3% 56.8% 60.8% 49.8% 54.1% 45.5% 43.5% 53.5% 56.4% Table 2: ETHZ cars. We report the percentage of pixels belonging to each part that are labeled correctly. The final column is an average weighted by the frequency of occurrence of each label. BG Body Wheel Window Bumper License Light Average ISM [22] 93.2% 72.2% 63.6% 80.5% 73.8% 56.2% 34.8% 86.8% MSBM 94.6% 72.7% 36.8% 74.4% 64.9% 17.9% 19.9% 86.0% Top seed 92.2% 68.4% 28.3% 63.8% 45.4% 11.2% 15.1% 81.8% ETHZ cars: The second dataset that we considered is the ETHZ labeled cars dataset [22], which itself is a subset of the LabelMe dataset [23], consisting of 139 images of cars, all in the same semiprofile view (Fig. 7(a)). The images are annotated with ground-truth segmentations for L = 6 parts (body, wheel, window, bumper, license plate, headlight; Fig. 7(d)). We compare the performance of the model with the ISM of Thomas et al. [22], who also report their results on this dataset. The dataset was split into 10 train/test cross-validation splits without replacement. We used the training images in each split to train both the shape and appearance components. For the shape component, we trained an MSBM at 50 ⇥ 50 pixels with overlap b = 4, and 2000 and 100 hidden units in the first and second layers respectively. Each layer was pre-trained for 3000 epochs and joint training was performed for 1000 epochs. The appearance model was trained with a vocabulary of size W = 50 and K = 100 mixture components and we set = 0.7. Inference chains were seeded at 50 exemplar segmentations (obtained using K-medoids). We find that the use of superpixels does not help with this dataset (due to the poor quality of superpixels obtained for these images). Qualitative and quantitative results that show the performance of model to be comparable to the state-of-the-art ISM can be seen in Fig. 7(c) and Table 2. We believe the discrepancy in accuracy between the MSBM and ISM on the ‘license’ and ‘light’ labels to mainly be due to ISM’s use of interest-points, as they are able to locate such fine structures accurately. By incorporating better models of part appearance into the generative model, we expect to see this discrepancy decrease. 5 Conclusions and future work In this paper we have shown how the SBM can be extended to obtain the MSBM, and presented a principled probabilistic model of images of objects that exploits the MSBM as its model for part shapes. We demonstrated how object segmentations can be obtained simply by performing MCMC inference in the model. The model can also be treated as a probabilistic evaluator of segmentations: given a proposal segmentation it can be used to estimate its likelihood. This leads us to believe that the combination of a generative model such as ours, with a discriminative, bottom-up segmentation algorithm could be highly effective. We are currently investigating how textured appearance models, which take into account the spatial structure of pixels, affect the learning and inference algorithms and the performance of the model. Acknowledgments Thanks to Charless Fowlkes and Vittorio Ferrari for access to datasets, and to Pushmeet Kohli and John Winn for valuable discussions. AE has received funding from the Carnegie Trust, the SORSAS scheme, and the IST Programme under the PASCAL2 Network of Excellence (IST-2007-216886). 7 (a) Test (c) MSBM (b) Bo and Fowlkes (d) Ground truth Background Hair Face Upper Shoes Legs Lower Arms (d) Ground truth (c) MSBM (b) Thomas et al. (a) Test Figure 6: Penn-Fudan pedestrians. (a) Test images. (b) Results reported by Bo and Fowlkes [21]. (c) Output of the joint model. (d) Ground-truth images. Images shown are those selected by [21]. Background Body Wheel Window Bumper License Headlight Figure 7: ETHZ cars. (a) Test images. (b) Results reported by Thomas et al. [22]. (c) Output of the joint model. (d) Ground-truth images. Images shown are those selected by [22]. 8 References [1] S. M. Ali Eslami, Nicolas Heess, and John Winn. The Shape Boltzmann Machine: a Strong Model of Object Shape. In IEEE CVPR, 2012. [2] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. International Journal of Computer Vision, 88:303–338, 2010. [3] Martin Fischler and Robert Elschlager. The Representation and Matching of Pictorial Structures. IEEE Transactions on Computers, 22(1):67–92, 1973. [4] David Marr. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. Freeman, 1982. [5] Irving Biederman. Recognition-by-components: A theory of human image understanding. Psychological Review, 94:115–147, 1987. [6] Ashish Kapoor and John Winn. Located Hidden Random Fields: Learning Discriminative Parts for Object Detection. In ECCV, pages 302–315, 2006. [7] John Winn and Jamie Shotton. The Layout Consistent Random Field for Recognizing and Segmenting Partially Occluded Objects. In IEEE CVPR, pages 37–44, 2006. [8] Nebojsa Jojic and Yaron Caspi. Capturing Image Structure with Probabilistic Index Maps. In IEEE CVPR, pages 212–219, 2004. [9] John Winn and Nebojsa Jojic. LOCUS: Learning object classes with unsupervised segmentation. In ICCV, pages 756–763, 2005. [10] Nebojsa Jojic, Alessandro Perina, Marco Cristani, Vittorio Murino, and Brendan Frey. Stel component analysis. In IEEE CVPR, pages 2044–2051, 2009. [11] S. M. Ali Eslami and Christopher K. I. Williams. Factored Shapes and Appearances for Partsbased Object Understanding. In BMVC, pages 18.1–18.12, 2011. [12] Nicolas Heess. Learning generative models of mid-level structure in natural images. PhD thesis, University of Edinburgh, 2011. [13] Ruslan Salakhutdinov and Geoffrey Hinton. Deep Boltzmann Machines. In AISTATS, volume 5, pages 448–455, 2009. [14] Tijmen Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML, pages 1064–1071, 2008. [15] Carsten Rother, Vladimir Kolmogorov, and Andrew Blake. “GrabCut”: interactive foreground extraction using iterated graph cuts. ACM SIGGRAPH, 23:309–314, 2004. [16] Eran Borenstein, Eitan Sharon, and Shimon Ullman. Combining Top-Down and Bottom-Up Segmentation. In CVPR Workshop on Perceptual Organization in Computer Vision, 2004. [17] Himanshu Arora, Nicolas Loeff, David Forsyth, and Narendra Ahuja. Unsupervised Segmentation of Objects using Efficient Learning. IEEE CVPR, pages 1–7, 2007. [18] Bogdan Alexe, Thomas Deselaers, and Vittorio Ferrari. ClassCut for unsupervised class segmentation. In ECCV, pages 380–393, 2010. [19] Nicolas Heess, Nicolas Le Roux, and John Winn. Weakly Supervised Learning of ForegroundBackground Segmentation using Masked RBMs. In ICANN, 2011. [20] Nicolas Le Roux, Nicolas Heess, Jamie Shotton, and John Winn. Learning a Generative Model of Images by Factoring Appearance and Shape. Neural Computation, 23(3):593–650, 2011. [21] Yihang Bo and Charless Fowlkes. Shape-based Pedestrian Parsing. In IEEE CVPR, 2011. [22] Alexander Thomas, Vittorio Ferrari, Bastian Leibe, Tinne Tuytelaars, and Luc Van Gool. Using Recognition and Annotation to Guide a Robot’s Attention. IJRR, 28(8):976–998, 2009. [23] Bryan Russell, Antonio Torralba, Kevin Murphy, and William Freeman. LabelMe: A Database and Tool for Image Annotation. International Journal of Computer Vision, 77:157–173, 2008. [24] Leonid Sigal, Alexandru Balan, and Michael Black. HumanEva. International Journal of Computer Vision, 87(1-2):4–27, 2010. [25] Pablo Arbelaez, Michael Maire, Charless C. Fowlkes, and Jitendra Malik. From Contours to Regions: An Empirical Evaluation. In IEEE CVPR, 2009. 9
3 0.74163109 37 nips-2012-Affine Independent Variational Inference
Author: Edward Challis, David Barber
Abstract: We consider inference in a broad class of non-conjugate probabilistic models based on minimising the Kullback-Leibler divergence between the given target density and an approximating ‘variational’ density. In particular, for generalised linear models we describe approximating densities formed from an affine transformation of independently distributed latent variables, this class including many well known densities as special cases. We show how all relevant quantities can be efficiently computed using the fast Fourier transform. This extends the known class of tractable variational approximations and enables the fitting for example of skew variational densities to the target density. 1
4 0.73600155 17 nips-2012-A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter Bound
Author: Shusen Wang, Zhihua Zhang
Abstract: The CUR matrix decomposition is an important extension of Nystr¨ m approximao tion to a general matrix. It approximates any data matrix in terms of a small number of its columns and rows. In this paper we propose a novel randomized CUR algorithm with an expected relative-error bound. The proposed algorithm has the advantages over the existing relative-error CUR algorithms that it possesses tighter theoretical bound and lower time complexity, and that it can avoid maintaining the whole data matrix in main memory. Finally, experiments on several real-world datasets demonstrate significant improvement over the existing relative-error algorithms. 1
Author: Lars Buesing, Maneesh Sahani, Jakob H. Macke
Abstract: Latent linear dynamical systems with generalised-linear observation models arise in a variety of applications, for instance when modelling the spiking activity of populations of neurons. Here, we show how spectral learning methods (usually called subspace identification in this context) for linear systems with linear-Gaussian observations can be extended to estimate the parameters of a generalised-linear dynamical system model despite a non-linear and non-Gaussian observation process. We use this approach to obtain estimates of parameters for a dynamical model of neural population data, where the observed spike-counts are Poisson-distributed with log-rates determined by the latent dynamical process, possibly driven by external inputs. We show that the extended subspace identification algorithm is consistent and accurately recovers the correct parameters on large simulated data sets with a single calculation, avoiding the costly iterative computation of approximate expectation-maximisation (EM). Even on smaller data sets, it provides an effective initialisation for EM, avoiding local optima and speeding convergence. These benefits are shown to extend to real neural data.
6 0.63349199 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration
7 0.63072634 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes
8 0.62972295 199 nips-2012-Link Prediction in Graphs with Autoregressive Features
9 0.62810367 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity
10 0.62806547 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback
11 0.62776494 38 nips-2012-Algorithms for Learning Markov Field Policies
12 0.62751287 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models
13 0.62638432 260 nips-2012-Online Sum-Product Computation Over Trees
14 0.62616765 148 nips-2012-Hamming Distance Metric Learning
15 0.62611383 292 nips-2012-Regularized Off-Policy TD-Learning
16 0.62571156 112 nips-2012-Efficient Spike-Coding with Multiplicative Adaptation in a Spike Response Model
17 0.62509859 113 nips-2012-Efficient and direct estimation of a neural subunit model for sensory coding
18 0.62508225 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models
19 0.62452477 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes
20 0.62429631 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance