nips2013-66: Computing the Stationary Distribution Locally
Source: pdf
Author: Christina E. Lee, Asuman Ozdaglar, Devavrat Shah
Abstract: Computing the stationary distribution of a large finite or countably infinite state space Markov Chain (MC) has become central in many problems such as statistical inference and network analysis. Standard methods involve large matrix multiplications as in power iteration, or simulations of long random walks, as in Markov Chain Monte Carlo (MCMC). Power iteration is costly, as it involves computation at every state. For MCMC, it is difficult to determine whether the random walks are long enough to guarantee convergence. In this paper, we provide a novel algorithm that answers whether a chosen state in a MC has stationary probability larger than some ∆ ∈ (0, 1), and outputs an estimate of the stationary probability. Our algorithm is constant time, using information from a local neighborhood of the state on the graph induced by the MC, which has constant size relative to the state space. The multiplicative error of the estimate is upper bounded by a function of the mixing properties of the MC. Simulation results show MCs for which this method gives tight estimates.
1 Introduction

Computing the stationary distribution of a Markov chain (MC) with a very large state space (finite, or countably infinite) has become central to statistical inference. MCMC methods involve sampling states from a long random walk over the entire state space [5, 6]. However, there is no clearly defined stopping condition in general settings, and computations must be performed at every state of the MC.
Our algorithm answers the following question: for a given node i of a countable state space MC, is the stationary probability of i larger than a given threshold ∆ ∈ (0, 1), and can we approximate it? For chosen parameters ∆, ε, and α, our algorithm guarantees that for nodes such that the estimate π̂i < ∆/(1 + ε), the true value πi is also less than ∆ with probability at least 1 − α. In addition, if π̂i ≥ ∆/(1 + ε), then with probability at least 1 − α, the estimate is within an ε · Zmax(i) multiplicative factor of the true πi, where Zmax(i) is effectively a "local mixing time" for i derived from the fundamental matrix of the transition probability matrix P. Our algorithm uses only a "local" neighborhood of the state i, defined with respect to the Markov graph.
Its correctness relies on a basic property: the stationary probability of each node is inversely proportional to the mean of its "return time." Therefore, we sample return times to the node and use the empirical average as an estimate. Since return times can be arbitrarily long, we truncate sample return times at a chosen threshold. We utilize the exponential concentration of return times in Markov chains to establish theoretical guarantees for the algorithm. For countably infinite state space Markov chains, we build upon a result by Hajek [9] on the concentration of certain types of hitting times to derive concentration of return times to a given node. For graphs that mix quickly, the distribution over return times concentrates more sharply around its mean, resulting in tighter performance guarantees. We illustrate the wide applicability of our local algorithm for computing network centralities and stationary distributions of queuing models.
MCMC was originally proposed in [5], and a tractable way to design a random walk for a target distribution was proposed by Hastings [6]. Given a distribution π(x), the method designs a Markov chain whose stationary distribution is equal to the target distribution. Without using the full transition matrix of the Markov chain, Monte Carlo sampling techniques estimate the distribution by sampling random walks via the transition probabilities at each node. As the length of the random walk approaches infinity, the distribution over possible states of the random walk approaches the stationary distribution. The majority of work following the initial introduction of the algorithm involves analyzing the convergence rates and mixing times of this random walk [8, 12]. For general non-reversible countably infinite state space Markov chains, little is known about the mixing time. Thus, it is difficult to verify whether the random walk has sufficiently converged to the stationary distribution, and before that point there is no guarantee whether the estimate obtained from the random walk is larger or smaller than the true stationary probability.
Although power iteration can be implemented in a distributed manner, each iteration requires computation to be performed by every state in the MC, which is expensive for large state space MCs. For countably infinite state space MCs, there is no clear analog to matrix multiplication. In the specialized setting of PageRank, the goal is to compute the stationary distribution of a specific Markov chain described by a transition matrix P = (1 − β)Q + β 1·rᵀ, where Q is a stochastic transition probability matrix, and β is a scalar in (0, 1). This can be interpreted as a random walk in which every step either follows Q with probability 1 − β, or with probability β jumps to a node according to the distribution specified by the vector r.
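As an illustrative sketch (not from the paper), the PageRank chain can be assembled from any stochastic matrix Q and restart vector r; the numbers below are arbitrary choices of ours:

```python
import numpy as np

# Hypothetical 4-node PageRank chain P = (1 - beta) Q + beta * 1 r^T.
beta = 0.15
Q = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0, 0.0]])  # stochastic transition matrix
r = np.full(4, 0.25)                  # restart distribution (uniform here)

P = (1 - beta) * Q + beta * np.outer(np.ones(4), r)

# Stationary distribution via power iteration on the left eigenvector;
# the restart term makes the chain irreducible and aperiodic, so this
# converges geometrically (subdominant eigenvalue at most 1 - beta).
pi = np.full(4, 0.25)
for _ in range(500):
    pi = pi @ P
```

Each row of P sums to one, and pi converges to the unique stationary distribution of the PageRank chain.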
The algorithm of [17] outputs a set of "important" nodes: with probability 1 − o(1), it includes all nodes with PageRank greater than a given threshold ∆, and does not include nodes with PageRank less than ∆/c for a given c > 1.
2 Setup, problem statement & algorithm

Consider a discrete time, irreducible, positive-recurrent MC {Xt}t≥0 on a countable state space Σ, having transition probability matrix P. Let Ti be the return time to a node i, and let Hi be the maximal hitting time to node i, such that

Ti = inf{t ≥ 1 | Xt = i}   and   Hi = max_{j∈Σ} Ej[Ti].

An irreducible positive recurrent Markov chain has a unique stationary distribution satisfying [18, 8]:

πi = Ei[ Σ_{t=1}^{Ti} 1{Xt = i} ] / Ei[Ti] = 1 / Ei[Ti]   for all i ∈ Σ.   (2)

The Markov chain can be visualized as a random walk over a weighted directed graph G = (Σ, E, P), where Σ is the set of nodes, E = {(i, j) ∈ Σ × Σ : Pij > 0} is the set of edges, and P describes the weights of the edges.
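The identity πi = 1/Ei[Ti] can be checked numerically; the sketch below (our own, with arbitrary transition probabilities) computes Ei[Ti] from the standard hitting-time linear system and compares it with the eigenvector-based stationary distribution.

```python
import numpy as np

# Arbitrary irreducible 3-state chain (illustrative numbers only).
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
i = 0

# Stationary distribution: left eigenvector of P for eigenvalue 1.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi /= pi.sum()

# Zero out the column of i so that reaching i stops the walk; then
# h = (I - P')^{-1} 1 gives expected times to reach i from each state,
# and the i-th entry is exactly the expected return time E_i[T_i].
Pp = P.copy()
Pp[:, i] = 0.0
h = np.linalg.solve(np.eye(3) - Pp, np.ones(3))

print(pi[i], 1.0 / h[i])  # the two values agree
```

This is the property the algorithm exploits: estimating Ei[Ti] suffices to estimate πi.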
The local neighborhood of size r around node i ∈ Σ is defined as {j ∈ Σ | dG(i, j) ≤ r}, where dG(i, j) is the length of the shortest directed path (in terms of number of edges) from i to j in G. An algorithm is local if it only uses information within a local neighborhood of size r around i, where r is constant with respect to the size of the state space.
The fundamental matrix Z of a finite state space Markov chain is

Z ≜ Σ_{t=0}^{∞} (P^(t) − 1πᵀ) = (I − P + 1πᵀ)^{−1},

such that Zjk = Σ_{t=0}^{∞} (P^(t)_jk − πk). Since P^(t)_jk denotes the probability that a random walk beginning at node j is at node k after t steps, Zjk represents how quickly the probability mass at node k from a random walk beginning at node j converges to πk.
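For a small finite chain, Z can be computed directly; the sketch below (our illustration, with arbitrary numbers) builds the inverse (I − P + 1πᵀ)^{−1}, verifies the basic identities Z1 = 1 and πᵀZ = πᵀ, and compares a truncated series, which matches the inverse up to the rank-one term 1πᵀ (conventions differ on whether that term is included).

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
n = P.shape[0]

w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi /= pi.sum()
A = np.outer(np.ones(n), pi)          # the rank-one matrix 1 pi^T

Z = np.linalg.inv(np.eye(n) - P + A)  # fundamental matrix, closed form

# Truncated series sum_t (P^(t) - 1 pi^T); converges geometrically.
S = np.zeros((n, n))
Pt = np.eye(n)
for _ in range(300):
    S += Pt - A
    Pt = Pt @ P

# Basic identities: Z 1 = 1, pi^T Z = pi^T, and S = Z - 1 pi^T.
print(np.allclose(Z @ np.ones(n), np.ones(n)),
      np.allclose(pi @ Z, pi),
      np.allclose(S, Z - A))
```

Large diagonal entries of Z correspond to nodes whose local probability mass is slow to converge, which is the sense in which Zmax(i) acts as a local mixing time.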
2.1 Problem Statement

Consider a discrete time, irreducible, aperiodic, positive recurrent MC {Xt}t≥0 on a countable state space Σ with transition probability matrix P : Σ × Σ → [0, 1]. We answer this with a local algorithm, which uses only edges within a local neighborhood around i of constant size with respect to the state space.
Consider the Clique-Cycle Markov chain shown in Figure 1(a) with n nodes, composed of a size-k clique connected to a size-(n − k + 1) cycle. For a node j in the clique excluding i, with probability 1/2 the random walk stays at node j, and with probability 1/2 the random walk chooses a clique neighbor uniformly at random. For a node j in the cycle, with probability 1/2 the random walk stays at node j, and with probability 1/2 the random walk travels counterclockwise to the subsequent node in the cycle. For node i, with probability ε the random walk enters the cycle, with probability 1/2 the random walk chooses a neighbor in the clique uniformly at random, and with probability 1/2 − ε the random walk stays at node i. We can show that the expected return time to node i is (1 − 2ε)k + 2εn.
Suppose we observe only the local neighborhood of constant size r around node i. All Clique-Cycle Markov chains with more than k + 2r nodes have identical local neighborhoods. Therefore, for any ∆ ∈ (0, 1), there exist two Clique-Cycle Markov chains with the same ε and k but different values of n, such that even though their local neighborhoods are identical, πi > ∆ in the MC with the smaller n, while πi < ∆ in the MC with the larger n. Therefore, by restricting ourselves to a local neighborhood around i of constant size, we will not be able to correctly determine whether πi > ∆ for every node i in an arbitrary MC. Throughout the paper, Markov chain and random walk on a network are used interchangeably; similarly, nodes and states are used interchangeably.
Figure 1: Examples of Markov chains. (a) Clique-Cycle Markov chain; (b) MM1 queue.

2.2 Algorithm

Given a threshold ∆ ∈ (0, 1) and a node i ∈ Σ, the algorithm obtains an estimate π̂i of πi, and uses π̂i to determine whether to output 0 (πi ≤ ∆) or 1 (πi > ∆).
It takes many independent samples of a truncated random walk that begins at node i and stops either when the random walk returns to node i, or when the length exceeds a predetermined maximum denoted by θ. Each sample is generated by simulating the random walk using "crawl" operations over the MC. The expected length of each random walk sample is Ei[min(Ti, θ)], which is close to Ei[Ti] when θ is large. We use Chernoff's bound to choose a sufficiently large number of samples as a function of θ to guarantee that, with probability 1 − α, the average length of the sample random walks lies within (1 ± ε) of Ei[min(Ti, θ)].
The algorithm searches for an appropriate size for the local neighborhood by beginning small and increasing the size geometrically. Input: anchor node i ∈ Σ and parameters ∆ = threshold for importance, ε = closeness of the estimate, and α = probability of failure. For each k ∈ {1, . . . , N^(t)}, generate independent samples sk ∼ min(Ti, θ^(t)) by simulating paths of the MC beginning at node i, and setting sk to be the length of the k-th sample path.
This algorithm outputs two estimates for the anchor node i: π̂i, which relies on the second expression in Eq. (2), and π̃i. We refer to the total number of iterations used in the algorithm as the value of t at the time of termination, denoted by tmax. The total number of random walk steps taken within the first t iterations is Σ_{k=1}^{t} N^(k) · T̂i^(k). Since θ^(t) governs the radius of the local neighborhood that the algorithm utilizes, this implies that our algorithm is local: the maximum distance is strictly upper bounded by 1/∆, which is constant with respect to the MC.
Thus when the algorithm terminates at stopping condition (a), π̂i < ∆ with high probability. When the algorithm terminates at condition (b), the fraction of samples truncated is small, which will imply that the percentage error of the estimate π̂i^(t) is upper bounded as a function of ε and properties of the MC.
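The sampling core and the role of the truncation fraction can be sketched as follows (our simplification, not the paper's pseudocode: a single fixed truncation θ and sample count N, in place of the geometric θ^(t) schedule and the Chernoff-derived N^(t)):

```python
import numpy as np

def estimate_pi(P, i, theta, num_samples, rng):
    """Truncated Monte Carlo estimate of pi_i = 1 / E_i[T_i].

    Each sample walks from node i until it returns to i or the length
    reaches theta; the fraction of truncated samples indicates whether
    theta was large enough to trust the estimate."""
    n = P.shape[0]
    lengths = np.empty(num_samples)
    truncated = 0
    for s in range(num_samples):
        state, t = i, 0
        while True:
            state = rng.choice(n, p=P[state])
            t += 1
            if state == i or t >= theta:
                break
        truncated += (state != i)
        lengths[s] = t
    return 1.0 / lengths.mean(), truncated / num_samples

rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])  # stationary distribution (5/6, 1/6)
pi_hat, frac = estimate_pi(P, 0, theta=200, num_samples=5000, rng=rng)
print(pi_hat, frac)  # pi_hat close to 5/6, frac essentially zero
```

With θ this large almost no samples are truncated, so the estimate is close to πi; when the truncation fraction is substantial, the full algorithm instead grows θ before trusting the estimate.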
For an aperiodic, irreducible, positive recurrent, countable state space Markov chain, and for any i ∈ Σ, with probability greater than 1 − α:

1. The number of iterations tmax and the total number of steps (or neighbor queries) used by the algorithm are bounded above by

tmax ≤ ln(1/∆)   and   Σ_{t=1}^{tmax} N^(t) · T̂i^(t) ≤ Õ( ln(1/α) / (ε³∆) ).

Part 1 is proved by using Chernoff's bound to show that N^(t) is large enough to guarantee that, with probability greater than 1 − α, for all iterations t, T̂i^(t) concentrates around its mean.
The analysis depends on how sharply the distribution over return times concentrates around the mean.
Therefore, with probability greater than 1 − α, if the algorithm terminates at condition (b), then

|π̂i^(t) − πi| / π̂i^(t) ≤ (3 Zmax(i) + 1) ε.

The proof relies on the fact that the distribution of the return time Ti has an exponentially decaying tail [8], ensuring that Ti concentrates around its mean Ei[Ti]. When the algorithm terminates at stopping condition (b), P(Ti > θ) ≤ (4ε/3 + ε) with high probability, thus the percentage error is bounded by O(ε · Zmax(i)). The proof also uses the property of an exponentially decaying tail as a function of Hi to show that for large θ^(t), with high probability, Pi(Ti > θ^(t)) will be small and π̂i^(t) will be close to πi, and thus the algorithm will terminate at one of the stopping conditions. The bound is a function of how sharply the distribution over return times concentrates around the mean.
Part (a) states that for low probability nodes, the algorithm will terminate at stopping condition (a) after enough iterations. (b) For all nodes i ∈ Σ, with probability greater than 1 − α, the total number of steps used by the algorithm, Σ_{t=1}^{tmax} N^(t) · T̂i^(t), is bounded above. These results require the state space of the MC to be finite, so we can upper bound the tail of the distribution of Ti using the maximal hitting time Hi. In fact, these results can be extended to many countably infinite state space Markov chains as well.
We prove that the tail of the distribution of Ti decays exponentially for any node i in any countable state space Markov chain that satisfies Assumption 3. The Markov chain {Xt} is aperiodic and irreducible. But in fact, this is quite reasonable: by the Foster-Lyapunov criteria [20], a countable state space Markov chain is positive recurrent if and only if there exists a Lyapunov function V : Σ → R+ that satisfies conditions (1) and (3), as well as (2'): E[V(Xt+1) | Xt = x] < ∞ for all x ∈ Σ.
The existence of the Lyapunov function allows us to decompose the state space into sets B and Bᶜ such that for all nodes x ∈ Bᶜ, there is an expected decrease in the Lyapunov function in the next step or transition. In addition, in any single step, the random walk cannot escape "too far."
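As a toy illustration (ours, not the paper's construction): for an MM1-style walk on Z+ that moves up with probability p and down with probability 1 − p, the identity map V(x) = x is a natural Lyapunov candidate when p < 1/2, with B = {0} as the finite exception set.

```python
# Drift of V(x) = x for a hypothetical MM1-style walk on Z+:
# from x >= 1 the walk moves up w.p. p and down w.p. 1 - p;
# from 0 it moves up w.p. p and otherwise stays.
p = 0.3

def drift(x, p):
    """E[V(X_{t+1}) | X_t = x] - V(x) with V(x) = x."""
    if x == 0:
        return p * 1 + (1 - p) * 0 - 0
    return p * (x + 1) + (1 - p) * (x - 1) - x  # = 2p - 1 < 0 for p < 1/2

# Outside B = {0} the drift is the negative constant 2p - 1, i.e. the
# expected decrease required by the Foster-Lyapunov criteria; single
# steps change V by at most 1, so the walk cannot escape "too far".
drifts = [drift(x, p) for x in range(1, 5)]
```

Both ingredients of the decomposition (uniform negative drift outside B, bounded jumps) are visible in this one-dimensional example.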
Using the concentration bounds for the countable state space setting, we can prove the following theorems, which parallel those stated for the finite state space setting.
(a) For any node i ∈ B such that πi < (1 − ε)∆/(1 + ε), with probability greater than 1 − α, the total number of steps used by the algorithm is bounded above by

Σ_{t=1}^{tmax} N^(t) · T̂i^(t) ≤ Õ( (ln(1/α)/ε²) · (1/(1 − 2^(−1/Ri))) · Ri ln( ( (1 + ε)/((1 − ε)∆) − πi )^(−1) ) ).
(b) For all nodes i ∈ B, with probability greater than 1 − α, the total number of steps used by the algorithm is bounded above by

Σ_{t=1}^{tmax} N^(t) · T̂i^(t) ≤ Õ( (ln(1/α)/ε²) · ( Ri ln(1/(πi α)) + 1/∆ ) / (1 − 2^(−1/Ri)) ).

In order to prove these theorems, we build upon results of [9], and establish that return times have exponentially decaying tails for countable state space MCs that satisfy Assumption 3.
Given a scalar parameter β and a stochastic transition matrix P, let {Xt} be the Markov chain with transition matrix (β/n) 1·1ᵀ + (1 − β)P. In every step, there is a probability β of jumping uniformly at random to any other node in the network.
In queuing theory, Markov chains are used to model the queue length at a server, which evolves over time as requests arrive and are processed. We use the basic MM1 queue, equivalent to a random walk on Z+. The queue length is modeled with the Markov chain shown in Figure 1(b), where p is the probability that a new request arrives before the current request is fully processed.
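Under the convention that the queue increments with probability p and decrements with probability 1 − p (with p < 1/2 for positive recurrence), the stationary distribution is geometric with ratio ρ = p/(1 − p). A numerical check on a truncated buffer (our illustration, with an assumed value of p):

```python
import numpy as np

p = 0.3                      # assumed arrival probability, p < 1/2
rho = p / (1 - p)
N = 200                      # truncation level for the numerical check

# Birth-death transition matrix on the truncated state space {0, ..., N}.
P = np.zeros((N + 1, N + 1))
P[0, 0], P[0, 1] = 1 - p, p
for x in range(1, N):
    P[x, x - 1], P[x, x + 1] = 1 - p, p
P[N, N - 1], P[N, N] = 1 - p, p     # reflect at the buffer boundary

# Solve pi P = pi with the normalization sum(pi) = 1.
A = P.T - np.eye(N + 1)
A[-1, :] = 1.0
b = np.zeros(N + 1)
b[-1] = 1.0
pi = np.linalg.solve(A, b)

geometric = (1 - rho) * rho ** np.arange(N + 1)
print(np.abs(pi - geometric).max())  # negligibly small
```

Because the geometric tail ρ^N is astronomically small at N = 200, the truncated chain's stationary distribution is indistinguishable from the infinite-queue formula.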
Figures 2(a) and 2(b) plot π̂i^(tmax) and π̃i^(tmax) for each node in the PageRank and MM1 queue MCs, respectively.

Figure 2: Simulations showing results of our algorithm applied to the PageRank and MM1 queue settings.

Observe that the algorithm indeed obtains close estimates for nodes such that πi > ∆, and for nodes such that πi ≤ ∆, the algorithm successfully outputs 0 (i.e., πi ≤ ∆). Figures 2(c) and 2(d) show the computation time, or total number of random walk steps taken by our algorithm, as a function of ∆.
5 Summary

We proposed a local algorithm for estimating the stationary probability of a node in a MC. The algorithm is a truncated Monte Carlo method, sampling return paths to the node of interest. Second, it only uses a constant size neighborhood around the node of interest, upper bounded by 1/∆. Third, it only performs computation at the node of interest. For MCs that mix well, the estimate will be tight with high probability for nodes such that πi > ∆.
D. Aldous and J. Fill. Reversible Markov Chains and Random Walks on Graphs: Chapter 2 (General Markov Chains).
wordName wordTfidf (topN-words)
[('pagerank', 0.391), ('walk', 0.33), ('ti', 0.289), ('node', 0.257), ('mc', 0.229), ('chain', 0.185), ('zmax', 0.177), ('markov', 0.171), ('mcs', 0.155), ('tmax', 0.153), ('stationary', 0.15), ('countable', 0.136), ('irreducible', 0.136), ('queue', 0.129), ('chains', 0.126), ('nodes', 0.1), ('xt', 0.098), ('countably', 0.097), ('state', 0.096), ('neighborhood', 0.094), ('ln', 0.089), ('return', 0.087), ('hi', 0.083), ('ei', 0.08), ('walks', 0.077), ('transition', 0.075), ('lyapunov', 0.074), ('anchor', 0.072), ('aperiodic', 0.072), ('terminates', 0.07), ('stopping', 0.068), ('queuing', 0.067), ('concentrates', 0.064), ('terminate', 0.059), ('greater', 0.059), ('lids', 0.059), ('carlo', 0.055), ('monte', 0.055), ('hitting', 0.054), ('recurrent', 0.053), ('local', 0.051), ('request', 0.051), ('mcmc', 0.049), ('personalized', 0.048), ('diaconis', 0.046), ('bounded', 0.045), ('aldous', 0.044), ('asuman', 0.044), ('avrachenkov', 0.044), ('fogaras', 0.044), ('pjk', 0.044), ('rosenbluth', 0.044), ('tilde', 0.044), ('around', 0.044), ('mixing', 0.044), ('truncated', 0.044), ('probability', 0.042), ('sharply', 0.042), ('termination', 0.042), ('iterations', 0.04), ('tail', 0.04), ('decaying', 0.039), ('pij', 0.039), ('pi', 0.039), ('bahmani', 0.039), ('jeh', 0.039), ('ozdaglar', 0.039), ('theorems', 0.039), ('steps', 0.039), ('clique', 0.038), ('eecs', 0.038), ('condition', 0.038), ('length', 0.036), ('borgs', 0.036), ('devavrat', 0.036), ('century', 0.036), ('hat', 0.036), ('zjk', 0.036), ('concentration', 0.035), ('massachusetts', 0.034), ('reversible', 0.034), ('stays', 0.034), ('chernoff', 0.032), ('power', 0.032), ('beginning', 0.032), ('exponentially', 0.031), ('estimate', 0.031), ('nite', 0.031), ('space', 0.031), ('requests', 0.031), ('outputs', 0.031), ('threshold', 0.031), ('upper', 0.031), ('web', 0.031), ('neighbor', 0.031), ('dg', 0.03), ('server', 0.03), ('ri', 0.029), ('times', 0.029), ('guarantee', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 66 nips-2013-Computing the Stationary Distribution Locally
Author: Christina E. Lee, Asuman Ozdaglar, Devavrat Shah
Abstract: Computing the stationary distribution of a large finite or countably infinite state space Markov Chain (MC) has become central in many problems such as statistical inference and network analysis. Standard methods involve large matrix multiplications as in power iteration, or simulations of long random walks, as in Markov Chain Monte Carlo (MCMC). Power iteration is costly, as it involves computation at every state. For MCMC, it is difficult to determine whether the random walks are long enough to guarantee convergence. In this paper, we provide a novel algorithm that answers whether a chosen state in a MC has stationary probability larger than some ∆ ∈ (0, 1), and outputs an estimate of the stationary probability. Our algorithm is constant time, using information from a local neighborhood of the state on the graph induced by the MC, which has constant size relative to the state space. The multiplicative error of the estimate is upper bounded by a function of the mixing properties of the MC. Simulation results show MCs for which this method gives tight estimates. 1
2 0.18027644 178 nips-2013-Locally Adaptive Bayesian Multivariate Time Series
Author: Daniele Durante, Bruno Scarpa, David Dunson
Abstract: In modeling multivariate time series, it is important to allow time-varying smoothness in the mean and covariance process. In particular, there may be certain time intervals exhibiting rapid changes and others in which changes are slow. If such locally adaptive smoothness is not accounted for, one can obtain misleading inferences and predictions, with over-smoothing across erratic time intervals and under-smoothing across times exhibiting slow variation. This can lead to miscalibration of predictive intervals, which can be substantially too narrow or wide depending on the time. We propose a continuous multivariate stochastic process for time series having locally varying smoothness in both the mean and covariance matrix. This process is constructed utilizing latent dictionary functions in time, which are given nested Gaussian process priors and linearly related to the observed data through a sparse mapping. Using a differential equation representation, we bypass usual computational bottlenecks in obtaining MCMC and online algorithms for approximate Bayesian inference. The performance is assessed in simulations and illustrated in a financial application. 1 1.1
3 0.14700586 155 nips-2013-Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition
Author: Tzu-Kuo Huang, Jeff Schneider
Abstract: Learning dynamic models from observed data has been a central issue in many scientific studies or engineering tasks. The usual setting is that data are collected sequentially from trajectories of some dynamical system operation. In quite a few modern scientific modeling tasks, however, it turns out that reliable sequential data are rather difficult to gather, whereas out-of-order snapshots are much easier to obtain. Examples include the modeling of galaxies, chronic diseases such Alzheimer’s, or certain biological processes. Existing methods for learning dynamic model from non-sequence data are mostly based on Expectation-Maximization, which involves non-convex optimization and is thus hard to analyze. Inspired by recent advances in spectral learning methods, we propose to study this problem from a different perspective: moment matching and spectral decomposition. Under that framework, we identify reasonable assumptions on the generative process of non-sequence data, and propose learning algorithms based on the tensor decomposition method [2] to provably recover firstorder Markov models and hidden Markov models. To the best of our knowledge, this is the first formal guarantee on learning from non-sequence data. Preliminary simulation results confirm our theoretical findings. 1
4 0.14309672 288 nips-2013-Scalable Influence Estimation in Continuous-Time Diffusion Networks
Author: Nan Du, Le Song, Manuel Gomez-Rodriguez, Hongyuan Zha
Abstract: If a piece of information is released from a media site, can we predict whether it may spread to one million web pages, in a month ? This influence estimation problem is very challenging since both the time-sensitive nature of the task and the requirement of scalability need to be addressed simultaneously. In this paper, we propose a randomized algorithm for influence estimation in continuous-time diffusion networks. Our algorithm can estimate the influence of every node in a network with |V| nodes and |E| edges to an accuracy of using n = O(1/ 2 ) randomizations and up to logarithmic factors O(n|E|+n|V|) computations. When used as a subroutine in a greedy influence maximization approach, our proposed algorithm is guaranteed to find a set of C nodes with the influence of at least (1 − 1/e) OPT −2C , where OPT is the optimal value. Experiments on both synthetic and real-world data show that the proposed algorithm can easily scale up to networks of millions of nodes while significantly improves over previous state-of-the-arts in terms of the accuracy of the estimated influence and the quality of the selected nodes in maximizing the influence. 1
5 0.12824649 299 nips-2013-Solving inverse problem of Markov chain with partial observations
Author: Tetsuro Morimura, Takayuki Osogami, Tsuyoshi Ide
Abstract: The Markov chain is a convenient tool to represent the dynamics of complex systems such as traffic and social systems, where probabilistic transition takes place between internal states. A Markov chain is characterized by initial-state probabilities and a state-transition probability matrix. In the traditional setting, a major goal is to study properties of a Markov chain when those probabilities are known. This paper tackles an inverse version of the problem: we find those probabilities from partial observations at a limited number of states. The observations include the frequency of visiting a state and the rate of reaching a state from another. Practical examples of this task include traffic monitoring systems in cities, where we need to infer the traffic volume on single link on a road network from a limited number of observation points. We formulate this task as a regularized optimization problem, which is efficiently solved using the notion of natural gradient. Using synthetic and real-world data sets including city traffic monitoring data, we demonstrate the effectiveness of our method.
6 0.123083 127 nips-2013-Generalized Denoising Auto-Encoders as Generative Models
7 0.10773309 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints
8 0.10403206 289 nips-2013-Scalable kernels for graphs with continuous attributes
9 0.10030184 151 nips-2013-Learning Chordal Markov Networks by Constraint Satisfaction
10 0.092044488 272 nips-2013-Regularized Spectral Clustering under the Degree-Corrected Stochastic Blockmodel
11 0.088368356 196 nips-2013-Modeling Overlapping Communities with Node Popularities
12 0.08591488 7 nips-2013-A Gang of Bandits
13 0.084512152 213 nips-2013-Nonparametric Multi-group Membership Model for Dynamic Networks
14 0.082944341 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC
15 0.08238057 118 nips-2013-Fast Determinantal Point Process Sampling with Application to Clustering
16 0.076941267 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration
17 0.075047858 359 nips-2013-Σ-Optimality for Active Learning on Gaussian Random Fields
18 0.074276701 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
19 0.073632762 308 nips-2013-Spike train entropy-rate estimation using hierarchical Dirichlet process priors
20 0.073403426 35 nips-2013-Analyzing the Harmonic Structure in Graph-Based Learning
topicId topicWeight
[(0, 0.19), (1, -0.006), (2, 0.022), (3, 0.014), (4, 0.007), (5, 0.097), (6, 0.116), (7, -0.036), (8, 0.048), (9, -0.056), (10, 0.025), (11, -0.14), (12, 0.165), (13, 0.031), (14, 0.002), (15, 0.018), (16, -0.001), (17, 0.075), (18, 0.029), (19, 0.027), (20, 0.121), (21, -0.017), (22, 0.006), (23, -0.08), (24, 0.026), (25, 0.102), (26, -0.025), (27, 0.08), (28, -0.102), (29, -0.124), (30, -0.077), (31, 0.04), (32, -0.256), (33, -0.056), (34, -0.095), (35, 0.009), (36, 0.002), (37, 0.076), (38, 0.032), (39, -0.088), (40, 0.079), (41, 0.114), (42, -0.012), (43, -0.004), (44, -0.157), (45, -0.128), (46, 0.036), (47, 0.009), (48, -0.099), (49, -0.006)]
simIndex simValue paperId paperTitle
same-paper 1 0.96206623 66 nips-2013-Computing the Stationary Distribution Locally
Author: Christina E. Lee, Asuman Ozdaglar, Devavrat Shah
Abstract: Computing the stationary distribution of a large finite or countably infinite state space Markov Chain (MC) has become central in many problems such as statistical inference and network analysis. Standard methods involve large matrix multiplications as in power iteration, or simulations of long random walks, as in Markov Chain Monte Carlo (MCMC). Power iteration is costly, as it involves computation at every state. For MCMC, it is difficult to determine whether the random walks are long enough to guarantee convergence. In this paper, we provide a novel algorithm that answers whether a chosen state in a MC has stationary probability larger than some ∆ ∈ (0, 1), and outputs an estimate of the stationary probability. Our algorithm is constant time, using information from a local neighborhood of the state on the graph induced by the MC, which has constant size relative to the state space. The multiplicative error of the estimate is upper bounded by a function of the mixing properties of the MC. Simulation results show MCs for which this method gives tight estimates. 1
2 0.76841247 288 nips-2013-Scalable Influence Estimation in Continuous-Time Diffusion Networks
Author: Nan Du, Le Song, Manuel Gomez-Rodriguez, Hongyuan Zha
Abstract: If a piece of information is released from a media site, can we predict whether it may spread to one million web pages, in a month ? This influence estimation problem is very challenging since both the time-sensitive nature of the task and the requirement of scalability need to be addressed simultaneously. In this paper, we propose a randomized algorithm for influence estimation in continuous-time diffusion networks. Our algorithm can estimate the influence of every node in a network with |V| nodes and |E| edges to an accuracy of using n = O(1/ 2 ) randomizations and up to logarithmic factors O(n|E|+n|V|) computations. When used as a subroutine in a greedy influence maximization approach, our proposed algorithm is guaranteed to find a set of C nodes with the influence of at least (1 − 1/e) OPT −2C , where OPT is the optimal value. Experiments on both synthetic and real-world data show that the proposed algorithm can easily scale up to networks of millions of nodes while significantly improves over previous state-of-the-arts in terms of the accuracy of the estimated influence and the quality of the selected nodes in maximizing the influence. 1
3 0.74146086 299 nips-2013-Solving inverse problem of Markov chain with partial observations
Author: Tetsuro Morimura, Takayuki Osogami, Tsuyoshi Ide
Abstract: The Markov chain is a convenient tool to represent the dynamics of complex systems such as traffic and social systems, where probabilistic transitions take place between internal states. A Markov chain is characterized by initial-state probabilities and a state-transition probability matrix. In the traditional setting, a major goal is to study properties of a Markov chain when those probabilities are known. This paper tackles an inverse version of the problem: we find those probabilities from partial observations at a limited number of states. The observations include the frequency of visiting a state and the rate of reaching a state from another. Practical examples of this task include traffic monitoring systems in cities, where we need to infer the traffic volume on a single link of a road network from a limited number of observation points. We formulate this task as a regularized optimization problem, which is efficiently solved using the notion of natural gradient. Using synthetic and real-world data sets including city traffic monitoring data, we demonstrate the effectiveness of our method.
4 0.65124029 178 nips-2013-Locally Adaptive Bayesian Multivariate Time Series
Author: Daniele Durante, Bruno Scarpa, David Dunson
Abstract: In modeling multivariate time series, it is important to allow time-varying smoothness in the mean and covariance process. In particular, there may be certain time intervals exhibiting rapid changes and others in which changes are slow. If such locally adaptive smoothness is not accounted for, one can obtain misleading inferences and predictions, with over-smoothing across erratic time intervals and under-smoothing across times exhibiting slow variation. This can lead to miscalibration of predictive intervals, which can be substantially too narrow or wide depending on the time. We propose a continuous multivariate stochastic process for time series having locally varying smoothness in both the mean and covariance matrix. This process is constructed utilizing latent dictionary functions in time, which are given nested Gaussian process priors and linearly related to the observed data through a sparse mapping. Using a differential equation representation, we bypass usual computational bottlenecks in obtaining MCMC and online algorithms for approximate Bayesian inference. The performance is assessed in simulations and illustrated in a financial application.
5 0.54970074 151 nips-2013-Learning Chordal Markov Networks by Constraint Satisfaction
Author: Jukka Corander, Tomi Janhunen, Jussi Rintanen, Henrik Nyman, Johan Pensar
Abstract: We investigate the problem of learning the structure of a Markov network from data. It is shown that the structure of such networks can be described in terms of constraints, which enables the use of existing solver technology with optimization capabilities to compute optimal networks starting from initial scores computed from the data. To achieve efficient encodings, we develop a novel characterization of Markov network structure using a balancing condition on the separators between cliques forming the network. The resulting translations into propositional satisfiability and its extensions, such as maximum satisfiability, satisfiability modulo theories, and answer set programming, enable us to prove the optimality of certain networks which have previously been found by stochastic search.
6 0.5348767 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent
7 0.49757993 272 nips-2013-Regularized Spectral Clustering under the Degree-Corrected Stochastic Blockmodel
8 0.4852871 298 nips-2013-Small-Variance Asymptotics for Hidden Markov Models
9 0.48365289 155 nips-2013-Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition
10 0.46451062 127 nips-2013-Generalized Denoising Auto-Encoders as Generative Models
11 0.46358338 213 nips-2013-Nonparametric Multi-group Membership Model for Dynamic Networks
12 0.45697743 289 nips-2013-Scalable kernels for graphs with continuous attributes
13 0.44673899 154 nips-2013-Learning Gaussian Graphical Models with Observed or Latent FVSs
14 0.44635895 359 nips-2013-Σ-Optimality for Active Learning on Gaussian Random Fields
15 0.43754649 7 nips-2013-A Gang of Bandits
16 0.42361853 291 nips-2013-Sensor Selection in High-Dimensional Gaussian Trees with Nuisances
17 0.42332938 332 nips-2013-Tracking Time-varying Graphical Structure
18 0.4118236 3 nips-2013-A* Lasso for Learning a Sparse Bayesian Network Structure for Continuous Variables
19 0.41004258 101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models
20 0.40447852 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration
topicId topicWeight
[(2, 0.233), (16, 0.047), (33, 0.125), (34, 0.088), (41, 0.036), (49, 0.021), (56, 0.183), (70, 0.017), (85, 0.079), (89, 0.039), (93, 0.02)]
simIndex simValue paperId paperTitle
1 0.92379767 191 nips-2013-Minimax Optimal Algorithms for Unconstrained Linear Optimization
Author: Brendan McMahan, Jacob Abernethy
Abstract: We design and analyze minimax-optimal algorithms for online linear optimization games where the player’s choice is unconstrained. The player strives to minimize regret, the difference between his loss and the loss of a post-hoc benchmark strategy. While the standard benchmark is the loss of the best strategy chosen from a bounded comparator set, we consider a very broad range of benchmark functions. The problem is cast as a sequential multi-stage zero-sum game, and we give a thorough analysis of the minimax behavior of the game, providing characterizations for the value of the game, as well as both the player’s and the adversary’s optimal strategy. We show how these objects can be computed efficiently under certain circumstances, and by selecting an appropriate benchmark, we construct a novel hedging strategy for an unconstrained betting game.
2 0.88221693 248 nips-2013-Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs
Author: Liam C. MacDermed, Charles Isbell
Abstract: We present four major results towards solving decentralized partially observable Markov decision problems (DecPOMDPs), culminating in an algorithm that outperforms all existing algorithms on all but one of the standard infinite-horizon benchmark problems. (1) We give an integer program that solves collaborative Bayesian games (CBGs). The program is notable because its linear relaxation is very often integral. (2) We show that a DecPOMDP with bounded belief can be converted to a POMDP (albeit with actions exponential in the number of beliefs). These actions correspond to strategies of a CBG. (3) We present a method to transform any DecPOMDP into a DecPOMDP with bounded beliefs (the number of beliefs is a free parameter) using optimal (not lossless) belief compression. (4) We show that the combination of these results opens the door for new classes of DecPOMDP algorithms based on previous POMDP algorithms. We choose one such algorithm, point-based value iteration, and modify it to produce the first tractable value iteration method for DecPOMDPs that outperforms existing algorithms.
3 0.84888464 123 nips-2013-Flexible sampling of discrete data correlations without the marginal distributions
Author: Alfredo Kalaitzis, Ricardo Silva
Abstract: Learning the joint dependence of discrete variables is a fundamental problem in machine learning, with many applications including prediction, clustering and dimensionality reduction. More recently, the framework of copula modeling has gained popularity due to its modular parameterization of joint distributions. Among other properties, copulas provide a recipe for combining flexible models for univariate marginal distributions with parametric families suitable for potentially high dimensional dependence structures. More radically, the extended rank likelihood approach of Hoff (2007) bypasses learning marginal models completely when such information is ancillary to the learning task at hand as in, e.g., standard dimensionality reduction problems or copula parameter estimation. The main idea is to represent data by their observable rank statistics, ignoring any other information from the marginals. Inference is typically done in a Bayesian framework with Gaussian copulas, and it is complicated by the fact that this implies sampling within a space where the number of constraints increases quadratically with the number of data points. The result is slow mixing when using off-the-shelf Gibbs sampling. We present an efficient algorithm based on recent advances on constrained Hamiltonian Markov chain Monte Carlo that is simple to implement and does not require paying for a quadratic cost in sample size.

1 Contribution

There are many ways of constructing multivariate discrete distributions: from full contingency tables in the small dimensional case [1], to structured models given by sparsity constraints [11] and (hierarchies of) latent variable models [6]. More recently, the idea of copula modeling [16] has been combined with such standard building blocks. Our contribution is a novel algorithm for efficient Markov chain Monte Carlo (MCMC) for the copula framework introduced by [7], extending algorithmic ideas introduced by [17].
A copula is a continuous cumulative distribution function (CDF) with uniformly distributed univariate marginals in the unit interval [0, 1]. It complements graphical models and other formalisms that provide a modular parameterization of joint distributions. The core idea is simple and given by the following observation: suppose we are given a (say) bivariate CDF F(y1, y2) with marginals F1(y1) and F2(y2). This CDF can then be rewritten as F(F1^{-1}(F1(y1)), F2^{-1}(F2(y2))). The function C(·, ·) given by F(F1^{-1}(·), F2^{-1}(·)) is a copula. For discrete distributions, this decomposition is not unique but still well-defined [16]. Copulas have found numerous applications in statistics and machine learning since they provide a way of constructing flexible multivariate distributions by mix-and-matching different copulas with different univariate marginals. For instance, one can combine flexible univariate marginals Fi(·) with useful but more constrained high-dimensional copulas. We will not further motivate the use of copula models, which has been discussed at length in recent machine learning publications and conference workshops, and for which comprehensive textbooks exist [e.g., 9]. For a recent discussion on the applications of copulas from a machine learning perspective, [4] provides an overview. [10] is an early reference in machine learning. The core idea dates back at least to the 1950s [16]. In the discrete case, copulas can be difficult to apply: transforming a copula CDF into a probability mass function (PMF) is computationally intractable in general. For the continuous case, a common trick goes as follows: transform variables by defining âi ≡ F̂i(yi) for an estimate F̂i(·) of Fi(·), and then fit a copula density c(·, . . . , ·) to the resulting âi [e.g., 9]. It is not hard to check that this breaks down in the discrete case [7]. An alternative is to represent the CDF-to-PMF transformation for each data point by a continuous integral on a bounded space.
Sampling methods can then be used. This trick has allowed many applications of the Gaussian copula to discrete domains. Readers familiar with probit models will recognize the similarities to models where an underlying latent Gaussian field is discretized into observable integers, as in Gaussian process classifiers and ordinal regression [18]. Such models can be indirectly interpreted as special cases of the Gaussian copula. In what follows, we describe in Section 2 the Gaussian copula and the general framework for constructing Bayesian estimators of Gaussian copulas by [7], the extended rank likelihood framework. This framework entails computational issues which are discussed. A recent general approach for MCMC in constrained Gaussian fields by [17] can in principle be directly applied to this problem as a blackbox, but at a cost that scales quadratically in sample size and as such it is not practical in general. Our key contribution is given in Section 4. An application experiment on the Bayesian Gaussian copula factor model is performed in Section 5. Conclusions are discussed in the final section.

2 Gaussian copulas and the extended rank likelihood

It is not hard to see that any multivariate Gaussian copula is fully defined by a correlation matrix C, since marginal distributions have no free parameters. In practice, the following equivalent generative model is used to define a sample U according to a Gaussian copula GC(C):

1. Sample Z from a zero-mean Gaussian with covariance matrix C.
2. For each Zj, set Uj = Φ(Zj), where Φ(·) is the CDF of the standard Gaussian.

It is clear that each Uj follows a uniform distribution in [0, 1]. To obtain a model for variables {y1, y2, . . . , yp} with marginal distributions Fj(·) and copula GC(C), one can add the deterministic step yj = Fj^{-1}(uj). Now, given n samples of observed data Y ≡ {y1^{(1)}, . . . , yp^{(1)}, y1^{(2)}, . . .
, yp^{(n)}}, one is interested in inferring C via a Bayesian approach and the posterior distribution

p(C, θF | Y) ∝ pGC(Y | C, θF) π(C, θF),

where π(·) is a prior distribution, θF are marginal parameters for each Fj(·), which in general might need to be marginalized since they will be unknown, and pGC(·) is the PMF of a (here discrete) distribution with a Gaussian copula and marginals given by θF. Let Z be the underlying latent Gaussians of the corresponding copula for dataset Y. Although Y is a deterministic function of Z, this mapping is not invertible due to the discreteness of the distribution: each marginal Fj(·) has jumps. Instead, the reverse mapping only enforces the constraints where yj^{(i1)} < yj^{(i2)} implies zj^{(i1)} < zj^{(i2)}. Based on this observation, [7] considers the event Z ∈ D(y), where D(y) is the set of values of Z in R^{n×p} obeying those constraints, that is

D(y) ≡ { Z ∈ R^{n×p} : max{ zj^{(i)} s.t. yj^{(i)} < yj^{(k)} } < zj^{(k)} < min{ zj^{(i)} s.t. yj^{(k)} < yj^{(i)} } }.

Since {Y = y} ⇒ Z(y) ∈ D(y), we have

pGC(Y | C, θF) = pGC(Z ∈ D(y), Y | C, θF) = pN(Z ∈ D(y) | C) × pGC(Y | Z ∈ D(y), C, θF),   (1)

the first factor of the last line being that of a zero-mean Gaussian density function marginalized over D(y). The extended rank likelihood is defined by the first factor of (1). With this likelihood, inference for C is given simply by marginalizing

p(C, Z | Y) ∝ I(Z ∈ D(y)) pN(Z | C) π(C),   (2)

the first factor of the right-hand side being the usual binary indicator function. Strictly speaking, this is not a fully Bayesian method since partial information on the marginals is ignored. Nevertheless, it is possible to show that under some mild conditions there is information in the extended rank likelihood to consistently estimate C [13].
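The two-step generative model for GC(C) from Section 2 (Z ∼ N(0, C), then Uj = Φ(Zj)) can be sketched in a few lines. The bivariate case, the fixed correlation ρ, and the helper names below are assumptions for illustration only:

```python
import math
import random

def phi(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def sample_gaussian_copula(rho, n, seed=0):
    """Draw n points U in [0,1]^2 from a bivariate Gaussian copula GC(C) with
    C = [[1, rho], [rho, 1]]: sample Z ~ N(0, C) (hand-coded Cholesky), then
    push each coordinate through Phi so the marginals become Uniform[0,1]."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        out.append((phi(z1), phi(z2)))
    return out

u = sample_gaussian_copula(0.8, 5000)
```

Each coordinate of `u` is marginally uniform, while the pair inherits the dependence of C, which is exactly the modularity the copula construction is after.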
It has two important properties: first, in many applications where marginal distributions are nuisance parameters, this sidesteps any major assumptions about the shape of {Fi(·)} – applications include learning the degree of dependence among variables (e.g., to understand relationships between social indicators as in [7] and [13]) and copula-based dimensionality reduction (a generalization of correlation-based principal component analysis, e.g., [5]); second, MCMC inference in the extended rank likelihood is conceptually simpler than with the joint likelihood, since dropping marginal models will remove complicated entanglements between C and θF. Therefore, even if θF is necessary (when, for instance, predicting missing values of Y), an estimate of C can be computed separately and will not depend on the choice of estimator for {Fi(·)}. The standard model with a full correlation matrix C can be further refined to take into account structure implied by sparse inverse correlation matrices [2] or low-rank decompositions via higher-order latent variable models [13], among others. We explore the latter case in Section 5. An off-the-shelf algorithm for sampling from (2) is full Gibbs sampling: first, given Z, the (full or structured) correlation matrix C can be sampled by standard methods. More to the point, sampling Z is straightforward if for each variable j and data point i we sample zj^{(i)} conditioned on all other variables. The corresponding distribution is a univariate truncated Gaussian. This is the approach used originally by Hoff. However, mixing can be severely compromised by the sampling of Z, and that is where novel sampling methods can facilitate inference.

3 Exact HMC for truncated Gaussian distributions

Hoff’s algorithm modifies the positions of all zj^{(i)} associated with a particular discrete value of Yj, conditioned on the remaining points.
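For a single entry zj^{(k)}, the event Z ∈ D(y) reduces to an interval: zj^{(k)} must stay above every latent value whose observation ranks lower, and below every one that ranks higher. A minimal sketch of computing that truncation interval for one column (function and variable names are hypothetical):

```python
def rank_bounds(y, z, k):
    """Truncation interval for z[k] under the extended rank likelihood:
    y[i] < y[k] forces z[i] < z[k], and y[k] < y[i] forces z[k] < z[i].
    y is one observed (ordinal) column, z the current latent column."""
    lo = max((z[i] for i in range(len(y)) if y[i] < y[k]), default=float("-inf"))
    hi = min((z[i] for i in range(len(y)) if y[i] > y[k]), default=float("inf"))
    return lo, hi

# Toy column: ordinal observations and a latent column consistent with them.
y = [0, 1, 1, 2]
z = [-1.2, -0.1, 0.3, 1.5]
```

Here `rank_bounds(y, z, 1)` returns (-1.2, 1.5): the middle level is squeezed between the latent values of the levels below and above it, which is exactly the boundary-tightening effect discussed next.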
As the number of data points increases, the spread of the hard boundaries on zj^{(i)}, given by the data points of Zj associated with other levels of Yj, increases. This reduces the space in which the variables zj^{(i)} can move at a time. To improve the mixing, we aim to sample from the joint Gaussian distribution of all latent variables zj^{(i)}, i = 1 . . . n, conditioned on the other columns of the data, such that the constraints between them are satisfied and thus the ordering at the observation level is conserved. Standard Gibbs approaches for sampling from truncated Gaussians reduce the problem to sampling from univariate truncated Gaussians. Even though each step is computationally simple, mixing can be slow when strong correlations are induced by very tight truncation bounds. In the following, we briefly describe the methodology recently introduced by [17] that deals with the problem of sampling from log p(x) ∝ −(1/2) xᵀMx + rᵀx, where x, r ∈ Rⁿ and M is positive definite, with linear constraints of the form fjᵀx ≤ gj, where fj ∈ Rⁿ, j = 1 . . . m, is the normal vector to some linear boundary in the sample space. Later in this section we shall describe how this framework can be applied to the Gaussian copula extended rank likelihood model. More importantly, the observed rank statistics impose only linear constraints of the form xi1 ≤ xi2. We shall describe how this special structure can be exploited to reduce the runtime complexity of the constrained sampler from O(n²) (in the number of observations) to O(n) in practice.

3.1 Hamiltonian Monte Carlo for the Gaussian distribution

Hamiltonian Monte Carlo (HMC) [15] is an MCMC method that extends the sampling space with auxiliary variables so that (ideally) deterministic moves in the joint space bring the sampler to potentially far places in the original variable space. Deterministic moves cannot in general be done, but this is possible in the Gaussian case.
The form of the Hamiltonian for the general d-dimensional Gaussian case with mean µ and precision matrix M is:

H = (1/2) xᵀMx − rᵀx + (1/2) sᵀM⁻¹s,   (3)

where M is also known in the present context as the mass matrix, r = Mµ, and s is the velocity. Both x and s are Gaussian distributed, so this Hamiltonian can be seen (up to a constant) as the negative log of the product of two independent Gaussian random variables. The physical interpretation is that of a sum of potential and kinetic energy terms, where the total energy of the system is conserved. In a system where this Hamiltonian function is constant, we can exactly compute its evolution through the pair of differential equations:

ẋ = ∇_s H = M⁻¹s,   ṡ = −∇_x H = −Mx + r.   (4)

These are solved exactly by x(t) = µ + a sin(t) + b cos(t), where a and b can be identified from the initial conditions (t = 0):

a = ẋ(0) = M⁻¹s,   b = x(0) − µ.   (5)

Therefore, the exact HMC algorithm can be summarised as follows:

• Initialise the allowed travel time T and some initial position x0.
• Repeat for HMC samples k = 1 . . . N:
  1. Sample sk ∼ N(0, M).
  2. Use sk and xk to update a and b, and store the new position at the end of the trajectory, xk+1 = x(T), as an HMC sample.

It can be easily shown that the Markov chain of sampled positions has the desired equilibrium distribution N(µ, M⁻¹) [17].

3.2 Sampling with linear constraints

Sampling from multivariate Gaussians does not require any method as sophisticated as HMC, but the plot thickens when the target distribution is truncated by linear constraints of the form Fx ≤ g. Here, F ∈ R^{m×n} is a constraint matrix whose every row is the normal vector to a linear boundary in the sample space. This is equivalent to sampling from a Gaussian that is confined in the (not necessarily bounded) convex polyhedron {x : Fx ≤ g}. In general, to remain within the boundaries of each wall, once a new velocity has been sampled one must compute all possible collision times with the walls.
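Before walls enter the picture, the unconstrained exact update of equations (3)–(5) can be sketched directly. The 2-D precision matrix, the hand-coded Cholesky factor, and the travel time T = π/2 below are arbitrary choices for illustration:

```python
import math
import random

def exact_hmc_gaussian(M, mu, n, T=math.pi / 2, seed=0):
    """Exact HMC for a 2-D Gaussian N(mu, M^{-1}) with precision/mass matrix M.
    Each step: sample a velocity s ~ N(0, M), then move along
    x(t) = mu + a sin(t) + b cos(t), with a = M^{-1} s and b = x(0) - mu."""
    rng = random.Random(seed)
    (m11, m12), (_, m22) = M
    det = m11 * m22 - m12 * m12
    # Cholesky factor L of M (M = L L^T), used to sample s = L g, g ~ N(0, I).
    l11 = math.sqrt(m11)
    l21 = m12 / l11
    l22 = math.sqrt(m22 - l21 * l21)
    x = list(mu)  # start at the mean
    samples = []
    for _ in range(n):
        g1, g2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        s1, s2 = l11 * g1, l21 * g1 + l22 * g2        # s ~ N(0, M)
        a1 = (m22 * s1 - m12 * s2) / det              # a = M^{-1} s
        a2 = (m11 * s2 - m12 * s1) / det
        b1, b2 = x[0] - mu[0], x[1] - mu[1]           # b = x(0) - mu
        x = [mu[0] + a1 * math.sin(T) + b1 * math.cos(T),
             mu[1] + a2 * math.sin(T) + b2 * math.cos(T)]
        samples.append(tuple(x))
    return samples

xs = exact_hmc_gaussian([[2.0, 0.6], [0.6, 1.0]], [1.0, -2.0], 20000)
```

With T = π/2 the cosine term vanishes, so each step lands at µ + M⁻¹s, and the chain's equilibrium is N(µ, M⁻¹) as the text states.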
The smallest of all collision times signifies the wall that the particle should bounce from at that collision time. Figure 1 illustrates the concept with two simple examples in 2 and 3 dimensions. The collision times can be computed analytically and their equations can be found in the supplementary material. We also point the reader to [17] for a more detailed discussion of this implementation. Once the wall to be hit has been found, the position and velocity at impact time are computed and the velocity is reflected about the boundary normal¹. The constrained HMC sampler is summarized as follows:

• Initialise the allowed travel time T and some initial position x0.
• Repeat for HMC samples k = 1 . . . N:
  1. Sample sk ∼ N(0, M).
  2. Use sk and xk to update a and b.
  3. Reset the remaining travel time, Tleft ← T. Until no travel time is left or no walls can be reached (no solutions exist), do:
    (a) Compute the impact times with all walls and pick the smallest one, th (if a solution exists).
    (b) Compute v(th) and reflect it about the hyperplane fh. This is the updated velocity after impact. The updated position is x(th).
    (c) Tleft ← Tleft − th.
  4. Store the new position at the end of the trajectory, xk+1, as an HMC sample.

¹ Also equivalent to transforming the velocity with a Householder reflection matrix about the bounding hyperplane.

Figure 1: Left: Trajectories of the first 40 iterations of the exact HMC sampler on a 2D truncated Gaussian. A reflection of the velocity can clearly be seen when the particle meets wall #2. Here, the constraint matrix F is a 4 × 2 matrix. Center: The same example after 40000 samples. The coloring of each sample indicates its density value. Right: The anatomy of a 3D Gaussian. The walls are now planes and in this case F is a 2 × 3 matrix. Figure best seen in color.

In general, all walls are candidates for impact, so the runtime of the sampler is linear in m, the number of constraints.
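For one wall fᵀx = g, the collision time comes from fᵀx(t) = fᵀµ + (fᵀa) sin t + (fᵀb) cos t, a single shifted sinusoid. The following is a sketch of the smallest-positive-root computation, a plausible reading of the analytic solution rather than a transcription of the paper's supplementary equations:

```python
import math

def first_hit_time(f, g, mu, a, b, eps=1e-9):
    """Smallest t > 0 with f.x(t) = g for x(t) = mu + a sin(t) + b cos(t),
    or None if the trajectory never reaches the wall."""
    dot = lambda p, q: sum(pi * qi for pi, qi in zip(p, q))
    u, v, c = dot(f, a), dot(f, b), g - dot(f, mu)
    R = math.hypot(u, v)
    if R == 0.0 or R < abs(c):
        return None  # unreachable: |f.x(t) - f.mu| <= R for all t
    psi = math.atan2(v, u)           # u sin(t) + v cos(t) = R sin(t + psi)
    base = math.asin(c / R)
    cands = []
    for k in (-1, 0, 1):
        cands.append(base - psi + 2 * math.pi * k)
        cands.append(math.pi - base - psi + 2 * math.pi * k)
    ts = [t for t in cands if t > eps]
    return min(ts) if ts else None

# Example: x(t) = sin(t) in 1-D, wall at x = 0.5 -> first hit at t = pi/6.
t_hit = first_hit_time([1.0], 0.5, [0.0], [1.0], [0.0])
```

Running this over every wall and taking the minimum is exactly step 3(a) above, which is why a naive implementation costs O(m) per bounce.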
This means that the computational load is concentrated in step 3(a). Another consideration is that of the allocated travel time T. Depending on the shape of the bounding polyhedron and the number of walls, a very large travel time can induce many more bounces, thus requiring more computations per sample. On the other hand, a very small travel time explores the distribution more locally, so the mixing of the chain can suffer. What makes a given travel time “large” or “small” is relative to the dimensionality, the number of constraints, and the structure of the constraints. Due to the nature of our problem, the number of constraints, when explicitly expressed as linear functions, is O(n²). Clearly, this restricts any direct application of the HMC framework for Gaussian copula estimation to small-sample (n) datasets. More importantly, we show how to exploit the structure of the constraints to reduce the number of candidate walls (prior to each bounce) to O(n).

4 HMC for the Gaussian copula extended rank likelihood model

Given some discrete data Y ∈ R^{n×p}, the task is to infer the correlation matrix of the underlying Gaussian copula. Hoff’s sampling algorithm proceeds by alternating between sampling the continuous latent representation zj^{(i)} of each yj^{(i)}, for i = 1 . . . n, j = 1 . . . p, and sampling a covariance matrix from an inverse-Wishart distribution conditioned on the sampled matrix Z ∈ R^{n×p}, which is then renormalized as a correlation matrix. From here on, we use matrix notation for the samples, as opposed to the random variables – with Zi,j replacing zj^{(i)}, Z:,j being a column of Z, and Z:,\j being the submatrix of Z without the j-th column. In a similar vein to Hoff’s sampling algorithm, we replace the successive sampling of each Zi,j conditioned on Zi,\j (a conditional univariate truncated Gaussian) with the simultaneous sampling of Z:,j conditioned on Z:,\j.
This is done through an HMC step from a conditional multivariate truncated Gaussian. The added benefit of this HMC step over the standard Gibbs approach is that of a handle for regulating the trade-off between exploration and runtime via the allocated travel time T. Larger travel times potentially allow for larger moves in the sample space, but this comes at a cost, as explained in the sequel.

4.1 The Hough envelope algorithm

The special structure of the constraints. Recall that the number of constraints is quadratic in the dimension of the distribution. This is because every Z sample must satisfy the conditions of the event Z ∈ D(y) of the extended rank likelihood (see Section 2). In other words, for any column Z:,j, all entries are organised into a partition L(j) of |L(j)| levels, the number of unique values observed for the discrete or ordinal variable Y(j). Thereby, for any two adjacent levels lk, lk+1 ∈ L(j) and any pair i1 ∈ lk, i2 ∈ lk+1, it must be true that Zi1,j < Zi2,j. Equivalently, a constraint f exists where fi1 = 1, fi2 = −1 and g = 0. It is easy to see that O(n²) such constraints are induced by the order statistics of the j-th variable. To deal with this boundary explosion, we developed the Hough envelope algorithm to search efficiently, within all pairs in {Z:,j}, in practically linear time. Recall in HMC (Section 3.2) that the trajectory of the particle, x(t), is decomposed as

xi(t) = ai sin(t) + bi cos(t) + µi,   (6)

and there are n such functions, grouped into a partition of levels as described above. The Hough envelope² is found for every pair of adjacent levels. We illustrate this with an example of 10 dimensions and two levels in Figure 2, without loss of generality to any number of levels or dimensions. Assume we represent trajectories for points in level lk with blue curves, and points in lk+1 with red curves. Assuming we start with a valid state, at time t = 0 all red curves are above all blue curves.
The goal is to find the smallest t where a blue curve meets a red curve. This will be our collision time, where a bounce will be necessary.

Figure 2: The trajectories xj(t) of each component are sinusoid functions. The right-most green dot signifies the wall and the time th of the earliest bounce, where the first inter-level pair (that is, any two components respectively from the blue and red level) becomes equal, in this case the constraint activated being xblue2 = xred2.

1. First we find the largest component bluemax of the blue level at t = 0. This takes O(n) time. Clearly, this will be the largest component until its sinusoid intersects that of any other component.
2. To find the next largest component, we compute the roots of xbluemax(t) − xi(t) = 0 for all components and pick the smallest (earliest) one (represented by a green dot). This also takes O(n) time.
3. We repeat this procedure until a red sinusoid crosses the highest running blue sinusoid. When this happens, the time of the earliest bounce and its constraint are found.

In the worst-case scenario, n such repetitions have to be made, but in practice we can safely assume a fixed upper bound h on the number of blue crossings before an inter-level crossing occurs. In experiments, we found h ≪ n, no more than 10 in simulations with hundreds of thousands of curves. Thus, this search strategy takes O(n) time in practice to complete, mirroring the analysis of other output-sensitive algorithms such as the gift wrapping algorithm for computing convex hulls [8]. Our HMC sampling approach is summarized in Algorithm 1.

² The name is inspired by the fact that each xi(t) is the sinusoid representation, in angle–distance space, of all lines that pass through the point (ai, bi) in a–b space, a representation known in image processing as the Hough transform [3].
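For intuition, the inter-level collision search can be written naively as O(|blue|·|red|) closed-form root computations over curves parameterized as in equation (6); the envelope algorithm above reduces this to roughly O(n) by tracking only the running maximum. The names and the brute-force structure below are illustrative, not the envelope implementation itself:

```python
import math

def first_meet(ai, bi, mi, aj, bj, mj, eps=1e-9):
    """Earliest t > 0 where curve i meets curve j, for curves of the form
    x(t) = a sin(t) + b cos(t) + mu; returns math.inf if they never meet."""
    u, v, c = ai - aj, bi - bj, mj - mi   # reduce to u sin(t) + v cos(t) = c
    R = math.hypot(u, v)
    if R == 0.0 or R < abs(c):
        return math.inf
    psi, base = math.atan2(v, u), math.asin(c / R)
    cands = [base - psi + 2 * math.pi * k for k in (-1, 0, 1)]
    cands += [math.pi - base - psi + 2 * math.pi * k for k in (-1, 0, 1)]
    ts = [t for t in cands if t > eps]
    return min(ts) if ts else math.inf

def earliest_bounce(blue, red):
    """Brute-force inter-level collision search: earliest meeting time over
    all (blue, red) curve pairs, each curve given as a (a, b, mu) triple."""
    return min(first_meet(*ci, *cj) for ci in blue for cj in red)

# Blue curves sin(t) and the constant -1; red curve: the constant 0.5.
# Earliest crossing solves sin(t) = 0.5, i.e. t = pi/6.
t_first = earliest_bounce([(1.0, 0.0, 0.0), (0.0, 0.0, -1.0)],
                          [(0.0, 0.0, 0.5)])
```

The envelope's saving is that most blue curves never become the running maximum, so their pairwise roots are never computed.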
Algorithm 1 HMC for GCERL
# Notation: TMN(µ, C, F) is a truncated multivariate normal with location vector µ, scale matrix C and constraints encoded by F and g = 0.
# IW(df, V0) is an inverse-Wishart prior with degrees of freedom df and scale matrix V0.
Input: Y ∈ R^{n×p}, allocated travel time T, a starting Z and variable covariance V ∈ R^{p×p}, df = p + 2, V0 = df Ip and chain size N.
Generate constraints F(j) from Y:,j, for j = 1 . . . p.
for samples k = 1 . . . N do
    # Resample Z as follows:
    for variables j = 1 . . . p do
        Compute parameters: σj² = Vjj − Vj,\j V\j,\j⁻¹ V\j,j and µj = Z:,\j V\j,\j⁻¹ V\j,j.
        Get one sample Z:,j ∼ TMN(µj, σj² I, F(j)) efficiently by using the Hough envelope algorithm, see Section 4.1.
    end for
    Resample V ∼ IW(df + n, V0 + ZᵀZ).
    Compute the correlation matrix C, s.t. Ci,j = Vi,j / √(Vi,i Vj,j), and store the sample, C(k) ← C.
end for

5 An application on the Bayesian Gaussian copula factor model

In this section we describe an experiment that highlights the benefits of our HMC treatment, compared to a state-of-the-art parameter expansion (PX) sampling scheme. During this experiment we ask the important question: “How do the two schemes compare when we exploit the full advantage of the HMC machinery to jointly sample parameters and the augmented data Z, in a model of latent variables and structured correlations?” We argue that under such circumstances the superior convergence speed and mixing of HMC undeniably compensate for its computational overhead.

Experimental setup. In this section we provide results from an application on the Gaussian copula latent factor model of [13] (Hoff’s model [7] for low-rank structured correlation matrices). We modify the parameter expansion (PX) algorithm used by [13] by replacing two of its Gibbs steps with a single HMC step. We show a much faster convergence to the true mode with considerable support on its vicinity.
We show that, unlike HMC, the PX algorithm falls short of properly exploring the posterior in any reasonable finite amount of time, even for small models and even for small samples. Worse, PX fails in ways one cannot easily detect. Namely, we sample each row of the factor loadings matrix Λ jointly with the corresponding column of the augmented data matrix Z, conditioning on the higher-order latent factors. This step is analogous to Pakman and Paninski’s [17, sec. 3.1] use of HMC in the context of a binary probit model (the extension to many levels in the discrete marginal is straightforward with direct application of the constraint matrix F and the Hough envelope algorithm). The sampling of the higher-level latent factors remains identical to [13]. Our scheme involves no parameter expansion. We do, however, interweave the Gibbs step for the Z matrix similarly to Hoff. This has the added benefit of exploring the Z sample space within their current boundaries, complementing the joint (λ, z) sampling which moves the boundaries jointly. The value of such “interweaving” schemes has been addressed in [19].

Results. We perform simulations of 10000 iterations, n = 1000 observations (rows of Y), and travel time π/2 for HMC, with the setups listed in the following table, along with the elapsed times of each sampling scheme. These experiments were run on Intel Core i7 desktops with 4 cores and 8GB of RAM. Both methods were parallelized across the observed variables (p).

Figure | p (vars) | k (latent factors) | M (ordinal levels) | elapsed (mins): HMC | PX
3(a)   | 20       | 5                  | 2                  | 115                 | 8
3(b)   | 10       | 3                  | 2                  | 80                  | 6
3(c)   | 10       | 3                  | 5                  | 203                 | 16

Many functionals of the loadings matrix Λ can be assessed. We focus on reconstructing the true (low-rank) correlation matrix of the Gaussian copula.
In particular, we summarize the algorithm’s outcome with the root mean squared error (RMSE) of the differences between entries of the ground-truth correlation matrix and the implied correlation matrix at each iteration of an MCMC scheme (so the following plots look like a time series of 10000 timepoints); see Figures 3(a), 3(b) and 3(c).

Figure 3: Reconstruction (RMSE per iteration) of the low-rank structured correlation matrix of the Gaussian copula, and its histogram (along the left side). (a) Simulation setup: 20 variables, 5 factors, 5 levels. HMC (blue) reaches a better mode faster (in iterations/CPU-time) than PX (red). Even more importantly, the RMSE posterior samples of PX are concentrated in a much smaller region compared to HMC, even after 10000 iterations. This illustrates that PX poorly explores the true distribution. (b) Simulation setup: 10 vars, 3 factors, 2 levels. We observe behaviors similar to Figure 3(a). Note that the histogram counts RMSEs after the burn-in period of PX (iteration #500). (c) Simulation setup: 10 vars, 3 factors, 5 levels. We observe behaviors similar to Figures 3(a) and 3(b), but with a thinner tail for HMC. Note that the histogram counts RMSEs after the burn-in period of PX (iteration #2000).

Main message. HMC reaches a better mode faster (in iterations/CPU-time). Even more importantly, the RMSE posterior samples of PX are concentrated in a much smaller region compared to HMC, even after 10000 iterations. This illustrates that PX poorly explores the true distribution. As an analogous situation we refer to the top and bottom panels of Figure 14 of Radford Neal’s slice sampler paper [14]. If there were no comparison against HMC, there would be no evidence from the PX plot alone that the algorithm is performing poorly. This mirrors Radford Neal’s statement opening Section 8 of his paper: “a wrong answer is obtained without any obvious indication that something is amiss”.
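The RMSE-per-iteration summary plotted in Figure 3 is straightforward to compute; a hypothetical sketch (function and variable names are ours, not from the paper), comparing only the off-diagonal entries since a correlation matrix's diagonal is fixed at 1:

```python
import numpy as np

def rmse_trace(chain, C_true):
    """RMSE between each sampled correlation matrix and the ground truth.

    `chain` is a sequence of p x p correlation matrices, one per MCMC
    iteration; the result is a 1-D array, the 'time series' of RMSEs
    described in the text.
    """
    iu = np.triu_indices_from(C_true, k=1)  # strict upper triangle
    return np.array(
        [np.sqrt(np.mean((C[iu] - C_true[iu]) ** 2)) for C in chain]
    )
```

A histogram of this trace after discarding burn-in gives the marginal summaries shown along the side of each panel.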
The concentration of PX around its posterior mode in these simulations gives a misleading picture of the truth. PX might seem a bit simpler to implement, but it seems one cannot avoid using complex algorithms for complex models. We urge practitioners to revisit their past work with this model to find out by how much the credible intervals of functionals of interest have been overconfident. Whether trivially or severely, our algorithm offers the first principled approach for checking this.

6 Conclusion

Sampling large random vectors simultaneously in order to improve mixing is in general a very hard problem, and this is why clever methods such as HMC or elliptical slice sampling [12] are necessary. We expect that the method developed here is useful not only for those with data analysis problems within the large family of Gaussian copula extended rank likelihood models, but that the method itself and its behaviour might provide new insights on MCMC sampling in constrained spaces in general. Another direction of future work consists of exploring methods for elliptical copulas, and related possible extensions of general HMC for non-Gaussian copula models.

Acknowledgements

The quality of this work has benefited largely from comments by our anonymous reviewers and useful discussions with Simon Byrne and Vassilios Stathopoulos. Research was supported by EPSRC grant EP/J013293/1.

References

[1] Y. Bishop, S. Fienberg, and P. Holland. Discrete Multivariate Analysis: Theory and Practice. MIT Press, 1975.
[2] A. Dobra and A. Lenkoski. Copula Gaussian graphical models and their application to modeling functional disability data. Annals of Applied Statistics, 5:969–993, 2011.
[3] R. O. Duda and P. E. Hart. Use of the Hough transformation to detect lines and curves in pictures. Communications of the ACM, 15(1):11–15, 1972.
[4] G. Elidan. Copulas and machine learning. Proceedings of the Copulae in Mathematical and Quantitative Finance workshop, to appear, 2013.
[5] F. Han and H. Liu.
Semiparametric principal component analysis. Advances in Neural Information Processing Systems, 25:171–179, 2012.
[6] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[7] P. Hoff. Extending the rank likelihood for semiparametric copula estimation. Annals of Applied Statistics, 1:265–283, 2007.
[8] R. Jarvis. On the identification of the convex hull of a finite set of points in the plane. Information Processing Letters, 2(1):18–21, 1973.
[9] H. Joe. Multivariate Models and Dependence Concepts. Chapman & Hall, 1997.
[10] S. Kirshner. Learning with tree-averaged densities and distributions. Neural Information Processing Systems, 2007.
[11] S. Lauritzen. Graphical Models. Oxford University Press, 1996.
[12] I. Murray, R. Adams, and D. MacKay. Elliptical slice sampling. JMLR Workshop and Conference Proceedings: AISTATS 2010, 9:541–548, 2010.
[13] J. Murray, D. Dunson, L. Carin, and J. Lucas. Bayesian Gaussian copula factor models for mixed data. Journal of the American Statistical Association, to appear, 2013.
[14] R. Neal. Slice sampling. The Annals of Statistics, 31:705–767, 2003.
[15] R. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, pages 113–162, 2010.
[16] R. Nelsen. An Introduction to Copulas. Springer-Verlag, 2007.
[17] A. Pakman and L. Paninski. Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. arXiv:1208.4118, 2012.
[18] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[19] Y. Yu and X. L. Meng. To center or not to center: That is not the question — An ancillarity-sufficiency interweaving strategy (ASIS) for boosting MCMC efficiency. Journal of Computational and Graphical Statistics, 20(3):531–570, 2011.
4 0.84660727 51 nips-2013-Bayesian entropy estimation for binary spike train data using parametric prior knowledge
Author: Evan W. Archer, Il M. Park, Jonathan W. Pillow
Abstract: Shannon’s entropy is a basic quantity in information theory, and a fundamental building block for the analysis of neural codes. Estimating the entropy of a discrete distribution from samples is an important and difficult problem that has received considerable attention in statistics and theoretical neuroscience. However, neural responses have characteristic statistical structure that generic entropy estimators fail to exploit. For example, existing Bayesian entropy estimators make the naive assumption that all spike words are equally likely a priori, which makes for an inefficient allocation of prior probability mass in cases where spikes are sparse. Here we develop Bayesian estimators for the entropy of binary spike trains using priors designed to flexibly exploit the statistical structure of simultaneously recorded spike responses. We define two prior distributions over spike words using mixtures of Dirichlet distributions centered on simple parametric models. The parametric model captures high-level statistical features of the data, such as the average spike count in a spike word, which allows the posterior over entropy to concentrate more rapidly than with standard estimators (e.g., in cases where the probability of spiking differs strongly from 0.5). Conversely, the Dirichlet distributions assign prior mass to distributions far from the parametric model, ensuring consistent estimates for arbitrary distributions. We devise a compact representation of the data and prior that allow for computationally efficient implementations of Bayesian least squares and empirical Bayes entropy estimators with large numbers of neurons. We apply these estimators to simulated and real neural data and show that they substantially outperform traditional methods.
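For contrast with the Bayesian estimators this abstract describes, the naive baseline they improve upon is the plug-in (maximum-likelihood) entropy estimate, which simply plugs empirical word frequencies into Shannon's formula. A minimal sketch with illustrative names (spike words encoded as hashable objects such as tuples of 0/1):

```python
import numpy as np
from collections import Counter

def plugin_entropy(words):
    """Plug-in (maximum-likelihood) entropy estimate in bits.

    `words` is a sample of discrete symbols, e.g. binary spike words
    encoded as tuples. This estimator is known to be biased downward
    when the sample is small relative to the alphabet, which is what
    motivates the Bayesian treatments discussed above.
    """
    counts = np.array(list(Counter(words).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())
```

For a uniform two-symbol sample this returns exactly 1 bit; for sparse spiking, most mass sits on the all-zeros word and the estimate drops well below the maximum of (number of neurons) bits.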
5 0.84488398 171 nips-2013-Learning with Noisy Labels
Author: Nagarajan Natarajan, Inderjit Dhillon, Pradeep Ravikumar, Ambuj Tewari
Abstract: In this paper, we theoretically study the problem of binary classification in the presence of random classification noise — the learner, instead of seeing the true labels, sees labels that have independently been flipped with some small probability. Moreover, random label noise is class-conditional — the flip probability depends on the class. We provide two approaches to suitably modify any given surrogate loss function. First, we provide a simple unbiased estimator of any loss, and obtain performance bounds for empirical risk minimization in the presence of iid data with noisy labels. If the loss function satisfies a simple symmetry condition, we show that the method leads to an efficient algorithm for empirical minimization. Second, by leveraging a reduction of risk minimization under noisy labels to classification with weighted 0-1 loss, we suggest the use of a simple weighted surrogate loss, for which we are able to obtain strong empirical risk bounds. This approach has a very remarkable consequence — methods used in practice such as biased SVM and weighted logistic regression are provably noise-tolerant. On a synthetic non-separable dataset, our methods achieve over 88% accuracy even when 40% of the labels are corrupted, and are competitive with respect to recently proposed methods for dealing with label noise in several benchmark datasets.
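The first approach in this abstract (an unbiased estimator of any loss under class-conditional flip rates) corresponds to the standard "method of unbiased estimators" correction; a minimal sketch with hypothetical names, assuming the flip rates rho_pos = P(flip | y = +1) and rho_neg = P(flip | y = -1) are known and rho_pos + rho_neg < 1:

```python
def unbiased_loss(loss, rho_pos, rho_neg):
    """Build a corrected loss l~ from a base loss l(t, y), y in {-1, +1}.

        l~(t, y) = ((1 - rho_{-y}) * l(t, y) - rho_y * l(t, -y))
                   / (1 - rho_pos - rho_neg)

    Taking the expectation of l~ over the noisy label recovers l on the
    clean label, so empirical risk minimization with l~ on noisy data
    targets the clean risk.
    """
    denom = 1.0 - rho_pos - rho_neg
    def l_tilde(t, y):
        rho_y = rho_pos if y == +1 else rho_neg    # flip rate of class y
        rho_my = rho_neg if y == +1 else rho_pos   # flip rate of class -y
        return ((1.0 - rho_my) * loss(t, y) - rho_y * loss(t, -y)) / denom
    return l_tilde
```

Unbiasedness is easy to verify numerically: for any prediction t and clean label y, mixing l~(t, y) and l~(t, -y) with weights (1 - rho_y, rho_y) returns exactly loss(t, y).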
same-paper 6 0.84400326 66 nips-2013-Computing the Stationary Distribution Locally
7 0.8323949 182 nips-2013-Manifold-based Similarity Adaptation for Label Propagation
8 0.80675578 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence
9 0.80520785 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries
10 0.8012377 240 nips-2013-Optimization, Learning, and Games with Predictable Sequences
11 0.78923744 325 nips-2013-The Pareto Regret Frontier
12 0.78899288 310 nips-2013-Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.
13 0.77293783 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
14 0.76034206 252 nips-2013-Predictive PAC Learning and Process Decompositions
15 0.75991422 79 nips-2013-DESPOT: Online POMDP Planning with Regularization
16 0.75757158 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
17 0.75710595 230 nips-2013-Online Learning with Costly Features and Labels
18 0.75384164 25 nips-2013-Adaptive Anonymity via $b$-Matching
19 0.75306499 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings
20 0.75296497 142 nips-2013-Information-theoretic lower bounds for distributed statistical estimation with communication constraints