jmlr jmlr2013 jmlr2013-43 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Vinayak Rao, Yee Whye Teh
Abstract: Markov jump processes (or continuous-time Markov chains) are a simple and important class of continuous-time dynamical systems. In this paper, we tackle the problem of simulating from the posterior distribution over paths in these models, given partial and noisy observations. Our approach is an auxiliary variable Gibbs sampler, and is based on the idea of uniformization. This sets up a Markov chain over paths by alternately sampling a finite set of virtual jump times given the current path, and then sampling a new path given the set of extant and virtual jump times. The first step involves simulating a piecewise-constant inhomogeneous Poisson process, while for the second, we use a standard hidden Markov model forward filtering-backward sampling algorithm. Our method is exact and does not involve approximations like time-discretization. We demonstrate how our sampler extends naturally to MJP-based models like Markov-modulated Poisson processes and continuous-time Bayesian networks, and show significant computational benefits over state-ofthe-art MCMC samplers for these models. Keywords: Markov jump process, MCMC, Gibbs sampler, uniformization, Markov-modulated Poisson process, continuous-time Bayesian network
Reference: text
sentIndex sentText sentNum sentScore
1 This sets up a Markov chain over paths by alternately sampling a finite set of virtual jump times given the current path, and then sampling a new path given the set of extant and virtual jump times. [sent-12, score-0.557]
2 We demonstrate how our sampler extends naturally to MJP-based models like Markov-modulated Poisson processes and continuous-time Bayesian networks, and show significant computational benefits over state-ofthe-art MCMC samplers for these models. [sent-15, score-0.333]
3 Introduction The Markov jump process (MJP) extends the discrete-time Markov chain to continuous time, and forms a simple and popular class of continuous-time dynamical systems. [sent-17, score-0.225]
4 Importantly, our sampler is easily adapted to complicated extensions of MJPs such as MMPPs and continuous-time Bayesian networks (CTBNs) (Nodelman et al. [sent-39, score-0.27]
5 Like many existing methods, our sampler introduces auxiliary variables which simplify the structure of the MJP, using an idea called uniformization. [sent-41, score-0.321]
6 The overall structure of our algorithm is that of an auxiliary variable Gibbs sampler, alternately resampling the auxiliary variables given the MJP trajectory, and the trajectory given the auxiliary variables. [sent-45, score-0.312]
7 In Section 3 we introduce the idea of uniformization and describe our MCMC sampler for the simple case of a discretely observed MJP. [sent-47, score-0.375]
8 In Section 4, we apply our sampler to the Markov-modulated Poisson process, while in Section 5, we describe continuous-time Bayesian networks, and extend our algorithm to that setting. [sent-48, score-0.27]
9 Markov Jump Processes (MJPs) A Markov jump process (S(t), t ∈ R+ ) is a stochastic process with right-continuous, piecewiseconstant paths (see for example Cinlar, 1975). [sent-52, score-0.217]
10 Let π0 be a density with respect to the counting measure µS on (S , ΣS ); this defines the initial distribution over states at tstart . [sent-65, score-0.253]
11 The MJP jumps to a new state si = s at time ti , for an s = si−1 , with probability proportional to Assi−1 . [sent-79, score-0.216]
12 From Gillespie’s algorithm, we see that sampling an MJP trajectory involves sequentially sampling n + 1 waiting times from exponential densities with one of N rates, and n new states from one of N discrete distributions, each depending on the previous state. [sent-84, score-0.285]
13 In the simplest case, we observe the process at the boundaries tstart and tend . [sent-103, score-0.241]
14 Run a discrete-time Markov chain with initial distribution π0 and transition matrix B on the times in W ; this is a Markov chain subordinated to the Poisson process W . [sent-140, score-0.291]
15 The Markov chain will assign a set of states (v0 ,V ); v0 at the initial time tstart , and V = (v1 , . [sent-141, score-0.354]
16 This corrects for the fact that since the Poisson rate Ω dominates the leaving rates of all states of the MJP, W will on average contain more events than there are jumps in the MJP path. [sent-150, score-0.228]
17 The second term gives the marginal distribution over states for a discrete-time Markov chain after n steps, given that the initial state is drawn from π0 , and subsequent states are assigned according to a transition matrix B. [sent-158, score-0.307]
18 Since the transition kernels induced by the uniformization procedure agree with those of the Markov jump process (exp(At)) for all t, and since the two processes also share the same initial distribution of states, π0 , all finite dimensional distributions agree. [sent-160, score-0.354]
19 Call the virtual jump times UT ; associated with UT is a sequence of states US . [sent-185, score-0.262]
20 Figure 2: Uniformization-based Gibbs sampler: starting with an MJP trajectory (left), resample the thinned events (middle) and then resample the trajectory given all Poisson events (right). [sent-188, score-0.609]
21 Given an MJP trajectory (s0 , S, T ) (Figure 2 (left)), we proceed by first sampling the set of virtual jumps UT given (s0 , S, T ), as a result recovering the uniformized characterization (s0 ,V,W ) (Figure 2 (middle)). [sent-190, score-0.316]
22 The next proposition shows that conditioned on (s0 , S, T ), the virtual jump times UT are distributed as an inhomogeneous Poisson process with intensity R(t) = Ω + AS(t) (we remind the reader that A has a negative diagonal, so that R(t) ≤ Ω). [sent-195, score-0.353]
23 This intensity is piecewise-constant, taking the value ri = Ω + Asi on the interval [ti ,ti+1 ) (with t0 = tstart and tn+1 = tend ), so it is easy to sample UT and thus U. [sent-196, score-0.285]
24 In other words, the Markov jump process (s0 , S, T ) along with virtual jumps U drawn M from the inhomogeneous Poisson process as above is equivalent to the times W being drawn from a Poisson process with rate Ω, followed by the states (v0 ,V ) being drawn from the subordinated Markov chain. [sent-201, score-0.522]
25 Firstly, note that by assumption, X depends only on the MJP trajectory (s0 , S, T ) and not on the auxiliary jumps U. [sent-222, score-0.276]
26 Effectively, ˜ ˜ ˜ a new MJP path S(t) ˜ given an MJP path, an iteration of our algorithm corresponds to introducing thinned events, relabelling the thinned and actual transitions using FFBS, and then discarding the new thinned events to obtain a new MJP. [sent-236, score-0.306]
27 3302 FAST MCMC S AMPLING FOR MJP S AND E XTENSIONS Algorithm 2 Blocked Gibbs sampler for an MJP on the interval [tstart ,tend ] A set of observations X, and parameters A (the rate matrix), π0 (the initial distribution over states) and Ω > maxs (|As |). [sent-238, score-0.467]
28 ˜ ˜ ˜ Proposition 3 The auxiliary variable Gibbs sampler described above has the posterior distribution p(s0 , S, T |X) as its stationary distribution. [sent-249, score-0.372]
29 Thus, there is positive probability density of sampling appropriate auxiliary jump times U and moving from any MJP trajectory to any other. [sent-253, score-0.385]
30 ˜ Since FFBS returns a new state sequence V that is independent of V given W , the only dependence between successive MCMC samples arises because the new candidate jump times include the old jump times, that is, T ⊂ W . [sent-256, score-0.343]
31 This means that the new MJP trajectory has non-zero probability of making a jump at a same time as the old trajectory. [sent-257, score-0.312]
32 This proceeds by producing independent posterior samples of the Poisson events W in the interval between observations, and then (like our sampler) running a discrete-time Markov chain on this set of times to sample a new trajectory. [sent-273, score-0.298]
33 However, sampling from the posterior distribution over W is not easy, depending crucially on the observation process, and usually requires a random number of O(N 3 ) matrix multiplications (as the sampler iterates over the possible number of Poisson events). [sent-274, score-0.369]
34 Moreover, we demonstrate that our sampler mixes very rapidly. [sent-277, score-0.27]
35 We point out here that as the number of states N increases, if the transition rates Ass′ , s = s′ , remain O(1), then the uniformization rate Ω and the total number of state transitions are O(N). [sent-278, score-0.327]
36 This prior is conjugate, with sufficient statistics for the posterior distribution given a trajectory S(t) being the total amount of time Ts spent in each state s and the number of transitions ns′ s from each s to s′ . [sent-285, score-0.357]
37 ˜ ˜ A new rate matrix A implies a new uniformization rate Ω, and in the latter case, we must also ˜ Besides being more complicated, this account for the probability of the Poisson events W under Ω. [sent-287, score-0.279]
38 Thus, we first discard the thinned events U, update A conditioned only on the MJP trajectory, and then reintroduce the thinned events under the new parameters. [sent-289, score-0.299]
39 We can view the sampler of Algorithm 2 as a transition kernel KA ((s0 , S, T ), (s0 , S, T )) that preserves the posterior distribution under the rate matrix A. [sent-290, score-0.428]
40 Our ˜ ˜ ˜ overall sampler then alternately updates (s0 , S, T ) via the transition kernel KA (·, ·), and then updates A given (s0 , S, T ). [sent-291, score-0.317]
41 Note that computing 3 ) eigenvector problem, so that in this case, the this stationary distribution requires solving an O(N overall Gibbs sampler scales cubically even though Algorithm 2 scales quadratically. [sent-295, score-0.291]
42 The state of this MJP trajectory was observed via a Poisson process likelihood model (see Section 4), and posterior samples given the observations and A were produced by a C++ implementation of our algorithm. [sent-299, score-0.337]
43 We found this to be true in general: when embedded within an outer MCMC sampler, our sampler produced similar effective ESSs as an MJP sampler that produces independent trajectories. [sent-310, score-0.54]
44 The latter is typically more expensive, and in any case, we will show that the computational savings provided by our sampler far outweigh the cost of dependent trajectories. [sent-311, score-0.27]
45 Figure 4 shows the initial burn-in of a sampler with this setting for different initializations. [sent-313, score-0.293]
46 The vertical axis shows the number of state transitions in the MJP trajectory of each iteration. [sent-314, score-0.248]
47 Fearnhead and Sherlock (2006) developed an exact sampler for MMPPs based on a dynamic program for calculating the probability of O marginalizing out the MJP trajectory. [sent-333, score-0.27]
48 A backward sampling step then draws an exact posterior sample of the MJP trajectory (S(t),t ∈ O) evaluated at the times in O. [sent-336, score-0.262]
49 Finally a uniformization-based endpoint conditioned MJP sampler is used to fill in the MJP trajectory between every pair of times in O. [sent-337, score-0.476]
50 Our MCMC sampler outlined in the previous section can be straightforwardly extended to the MMPP without any of these disadvantages. [sent-343, score-0.27]
51 Resampling the auxiliary jump events (step 1 in algorithm 2) remains unaffected, since conditioned on the current MJP trajectory, they are independent of the observations O. [sent-344, score-0.296]
52 For our sampler, increasing the number of observed events leaves the computation time largely unaffected, while for the sampler of Fearnhead and Sherlock (2006), this increases quite significantly. [sent-365, score-0.376]
53 This reiterates the point that our sampler works at the time scale of the latent MJP, while Fearnhead and Sherlock (2006) work at the time scale of the observed Poisson process. [sent-366, score-0.33]
54 In both cases, our sampler again offers increased efficiency of up to two orders of magnitude. [sent-368, score-0.27]
55 In fact, the only problems where we observed the sampler of Fearnhead and Sherlock (2006) to outperform ours were low-dimensional problems with only a few Poisson observations in a long interval, and with one very unstable state. [sent-369, score-0.295]
56 A few very stable MJP states and a few very unstable ones results in a high uniformization rate Ω but only a few state transitions. [sent-370, score-0.239]
57 The resulting large number of virtual jumps can make our sampler inefficient. [sent-371, score-0.401]
58 We see that our sampler (plotted with squares) offers substantial speed-up over the sampler of Fearnhead and Sherlock (2006) (plotted with circles). [sent-374, score-0.54]
59 However, recall that this cubic scaling is not a property of our MJP trajectory sampler; rather it is a consequence of using the equilibrium distribution of a sampled rate matrix as the initial distribution over states, which requires calculating an eigenvector of a proposed rate matrix. [sent-376, score-0.28]
60 If we fix the initial distribution over states (to the discrete uniform distribution), giving the line plotted with inverted triangles in the figure, we observe that our sampler scales quadratically. [sent-377, score-0.341]
61 Like Section 2, we represent the trajectory of the CTBN, S(t), with the initial state s0 and the pair of sequences (S, T ). [sent-412, score-0.23]
62 Now, si , the ith element of S, is an m-component vector representing the states of all nodes at ti , the time of the ith state change of the CTBN. [sent-414, score-0.224]
63 The rate i i matrix of a node n varies over time as the configuration of its parents changes, and we will write An,t for the relevant matrix at time t. [sent-417, score-0.22]
64 Following Equation (2), we can write down the probability density 3310 FAST MCMC S AMPLING FOR MJP S AND E XTENSIONS of (s0 , S, T ) as p(s0 , S, T ) = π0 (s0 ) |T | m exp − ∑ ki ∏ A ki i i−1 k ,t si si−1 i=1 tend k=1 tstart |Ak,t(t) |dt . [sent-418, score-0.241]
65 Sk (10) Algorithm 3 Algorithm to sample a CTBN trajectory on the interval [tstart ,tend ] Input: Output: The CTBN graph G , a set of rate matrices {A} for all nodes and for all parent configurations and an initial distribution over states π0 . [sent-419, score-0.366]
66 This suggests a Gibbs sampling scheme where the trajectory of each node is resampled given the configuration of its Markov blanket. [sent-445, score-0.227]
67 However, even without any associated observations, sampling a node trajectory conditioned on the complete trajectory of its Markov blanket is not straightforward. [sent-448, score-0.407]
68 Thus, even over an interval of time where the parent configuration remains constant, the conditional distribution of the trajectory of a node is not a homogeneous MJP because of the effect of the node’s children, which act as ‘observations’ that are continuously observed. [sent-453, score-0.326]
69 2 Auxiliary Variable Gibbs Sampling for CTBNs We now show how our uniformization-based sampler can easily be adapted to conditionally sample a trajectory for node k without resorting to approximations. [sent-464, score-0.471]
70 For node k, the MJP trajectory (s0 , S, T ) has a uniformized construction from a subordinating Poisson process. [sent-467, score-0.243]
71 Recall also that our algorithm first reconstructs the thinned Poisson events UT using a piecewise homogeneous Poisson process with rate (Ωt + Ak,t ), and then updates the trajectory usS(t) ing the forward-backward algorithm (so that W = T ∪ UT forms the candidate transitions times of the MJP). [sent-470, score-0.457]
72 ˜ The new trajectory Sk (t) is now obtained using the forward-filtering backward-sampling algorithm, with the given inhomogeneous transition matrices and likelihood functions. [sent-479, score-0.269]
73 The following proposition now follows directly from our previous results in Section 3: Proposition 4 The auxiliary variable Gibbs sampler described above converges to the posterior distribution over the CTBN sample paths. [sent-480, score-0.372]
74 Note that our algorithm produces a new trajectory that is dependent, through T , on the previous trajectory (unlike a true Gibbs update as in El-Hay et al. [sent-481, score-0.318]
75 1 T HE L OTKA -VOLTERRA P ROCESS We first apply our sampler to the Lotka-Volterra process (Wilkinson, 2009; Opper and Sanguinetti, 2007). [sent-490, score-0.301]
76 In its present form, our sampler cannot handle this infinite state-space (but see Rao and Teh, 2012). [sent-496, score-0.27]
77 Given these observations (as well as the true parameter values), we approximated the posterior distribution over paths by two methods: using 1000 samples from our MCMC sampler (with a 3313 R AO AND T EH 20 30 True path Mean−field approx. [sent-503, score-0.401]
78 3 We could not apply the implementation of the Gibbs sampler of El-Hay et al. [sent-507, score-0.27]
79 4 Average Relative Error vs Number of Samples For the remaining experiments, we compared our sampler with the Gibbs sampler of El-Hay et al. [sent-516, score-0.54]
80 The statistics in question are the time spent by each node in different states as well as the number of transitions from each state to the others. [sent-530, score-0.237]
81 The states of the nodes were observed at times 0 and 20 and we produced endpoint-conditioned posterior samples of paths over the time interval [0, 20]. [sent-539, score-0.287]
82 (2008) sampler took about 6 minutes, while our sampler ran in about 30 seconds. [sent-546, score-0.54]
83 5 Time Requirements for the Chain-Shaped CTBN In the next two experiments, we compare the times required by CTBN-RLE and our uniformizationbased sampler to produce 100 effective samples as the size of the chain-shaped CTBN increased in different ways. [sent-548, score-0.319]
84 The time spent calculating matrix exponentials does grow linearly, however our uniformization-based sampler always takes less time than even this. [sent-565, score-0.38]
85 As seen in the middle plot, once again, our sampler is always faster. [sent-567, score-0.27]
86 Asymptotically, we expect our sampler to scale as O(N 2 ) and El-Hay et al. [sent-568, score-0.27]
87 While we have not hit that regime yet, we can see that the cost of our sampler grows more slowly with the number of states. [sent-570, score-0.27]
88 6 Time Requirements for the Drug-Effect CTBN Our final experiment, reported in the rightmost plot of Figure 11, measures the time required as the interval length (tend − tstart ) increases. [sent-572, score-0.263]
89 However, since our sampler produces exact samples (up to numerical precision), our comparison is fair. [sent-577, score-0.293]
90 Discussion We proposed a novel Markov chain Monte Carlo sampling method for Markov jump processes. [sent-580, score-0.22]
91 This constructs a Markov jump process by subordinating a Markov chain to a Poisson process, and amounts to running a Markov chain on a random discretization of time. [sent-582, score-0.372]
92 Our sampler is a blocked Gibbs sampler in this augmented represen3316 FAST MCMC S AMPLING FOR MJP S AND E XTENSIONS tation and proceeds by alternately resampling the discretization given the Markov chain and vice versa. [sent-583, score-0.645]
93 Experimentally, we find that this auxiliary variable Gibbs sampler is computationally very efficient. [sent-584, score-0.321]
94 The sampler easily generalizes to other MJP-based models, and we presented samplers for Markov-modulated Poisson processes and continuous-time Bayesian networks. [sent-585, score-0.333]
95 First, our algorithm is easily applicable to inhomogeneous Markov jump processes where techniques based on matrix exponentiation cannot be applied. [sent-593, score-0.268]
96 Following recent work (Rao and Teh, 2011b), we can also look at generalizing our sampler to semi-Markov processes where the holding times of the states follow non-exponential distributions. [sent-594, score-0.369]
97 By combining our technique with slice sampling ideas (Neal, 2003), we can explore Markov jump processes with countably infinite state spaces. [sent-596, score-0.222]
98 Of course, in practical settings, any trajectory from this process is bounded with probability 1, and we can extend our method to this case by treating Ω as a trajectory dependent random variable. [sent-599, score-0.349]
99 We allow the chain to be inhomogeneous, with Bt being the state transition matrix at time t (so that p(St+1 = s′ |St = s) = Bts′ s ). [sent-608, score-0.218]
100 An exact Gibbs sampler for the Markov-modulated Poisson process. [sent-711, score-0.27]
wordName wordTfidf (topN-words)
[('mjp', 0.7), ('sampler', 0.27), ('ctbn', 0.231), ('poisson', 0.203), ('tstart', 0.182), ('trajectory', 0.159), ('mcmc', 0.154), ('jump', 0.123), ('fearnhead', 0.119), ('uniformization', 0.105), ('sherlock', 0.098), ('ao', 0.091), ('xtensions', 0.084), ('markov', 0.082), ('events', 0.076), ('chain', 0.071), ('mjps', 0.07), ('jumps', 0.066), ('ess', 0.065), ('ampling', 0.065), ('eh', 0.065), ('virtual', 0.065), ('ffbs', 0.063), ('inhomogeneous', 0.063), ('nodelman', 0.063), ('thinned', 0.063), ('maxs', 0.06), ('gibbs', 0.058), ('ctbns', 0.056), ('gillespie', 0.056), ('asi', 0.054), ('auxiliary', 0.051), ('interval', 0.051), ('posterior', 0.051), ('st', 0.051), ('hobolth', 0.049), ('states', 0.048), ('hay', 0.048), ('state', 0.048), ('transition', 0.047), ('sk', 0.046), ('trajectories', 0.045), ('rao', 0.045), ('ot', 0.043), ('node', 0.042), ('mmpp', 0.042), ('subordinating', 0.042), ('lt', 0.042), ('transitions', 0.041), ('ti', 0.041), ('bayesian', 0.04), ('rate', 0.038), ('ut', 0.038), ('samplers', 0.038), ('resample', 0.038), ('parents', 0.036), ('ass', 0.035), ('exponentiation', 0.035), ('mmpps', 0.035), ('predator', 0.035), ('sanguinetti', 0.035), ('shelton', 0.035), ('wilkinson', 0.035), ('wi', 0.034), ('discretization', 0.034), ('children', 0.033), ('paths', 0.032), ('si', 0.031), ('process', 0.031), ('time', 0.03), ('guration', 0.029), ('tend', 0.028), ('bts', 0.028), ('prey', 0.028), ('queueing', 0.028), ('spent', 0.028), ('vi', 0.027), ('times', 0.026), ('el', 0.026), ('sampling', 0.026), ('nodes', 0.026), ('processes', 0.025), ('observations', 0.025), ('cpu', 0.024), ('avi', 0.024), ('sp', 0.024), ('intensity', 0.024), ('teh', 0.024), ('initial', 0.023), ('chains', 0.023), ('homogeneous', 0.023), ('samples', 0.023), ('stone', 0.022), ('matrix', 0.022), ('mixing', 0.021), ('conditioned', 0.021), ('parent', 0.021), ('carter', 0.021), ('cubically', 0.021), ('daley', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 43 jmlr-2013-Fast MCMC Sampling for Markov Jump Processes and Extensions
Author: Vinayak Rao, Yee Whye Teh
Abstract: Markov jump processes (or continuous-time Markov chains) are a simple and important class of continuous-time dynamical systems. In this paper, we tackle the problem of simulating from the posterior distribution over paths in these models, given partial and noisy observations. Our approach is an auxiliary variable Gibbs sampler, and is based on the idea of uniformization. This sets up a Markov chain over paths by alternately sampling a finite set of virtual jump times given the current path, and then sampling a new path given the set of extant and virtual jump times. The first step involves simulating a piecewise-constant inhomogeneous Poisson process, while for the second, we use a standard hidden Markov model forward filtering-backward sampling algorithm. Our method is exact and does not involve approximations like time-discretization. We demonstrate how our sampler extends naturally to MJP-based models like Markov-modulated Poisson processes and continuous-time Bayesian networks, and show significant computational benefits over state-ofthe-art MCMC samplers for these models. Keywords: Markov jump process, MCMC, Gibbs sampler, uniformization, Markov-modulated Poisson process, continuous-time Bayesian network
2 0.096741423 16 jmlr-2013-Bayesian Nonparametric Hidden Semi-Markov Models
Author: Matthew J. Johnson, Alan S. Willsky
Abstract: There is much interest in the Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM) as a natural Bayesian nonparametric extension of the ubiquitous Hidden Markov Model for learning from sequential and time-series data. However, in many settings the HDP-HMM’s strict Markovian constraints are undesirable, particularly if we wish to learn or encode non-geometric state durations. We can extend the HDP-HMM to capture such structure by drawing upon explicit-duration semi-Markov modeling, which has been developed mainly in the parametric non-Bayesian setting, to allow construction of highly interpretable models that admit natural prior information on state durations. In this paper we introduce the explicit-duration Hierarchical Dirichlet Process Hidden semiMarkov Model (HDP-HSMM) and develop sampling algorithms for efficient posterior inference. The methods we introduce also provide new methods for sampling inference in the finite Bayesian HSMM. Our modular Gibbs sampling methods can be embedded in samplers for larger hierarchical Bayesian models, adding semi-Markov chain modeling as another tool in the Bayesian inference toolbox. We demonstrate the utility of the HDP-HSMM and our inference methods on both synthetic and real experiments. Keywords: Bayesian nonparametrics, time series, semi-Markov, sampling algorithms, Hierarchical Dirichlet Process Hidden Markov Model
3 0.05125067 90 jmlr-2013-Quasi-Newton Method: A New Direction
Author: Philipp Hennig, Martin Kiefel
Abstract: Four decades after their invention, quasi-Newton methods are still state of the art in unconstrained numerical optimization. Although not usually interpreted thus, these are learning algorithms that fit a local quadratic approximation to the objective function. We show that many, including the most popular, quasi-Newton methods can be interpreted as approximations of Bayesian linear regression under varying prior assumptions. This new notion elucidates some shortcomings of classical algorithms, and lights the way to a novel nonparametric quasi-Newton method, which is able to make more efficient use of available information at computational cost similar to its predecessors. Keywords: optimization, numerical analysis, probability, Gaussian processes
4 0.049546197 11 jmlr-2013-Algorithms for Discovery of Multiple Markov Boundaries
Author: Alexander Statnikov, Nikita I. Lytkin, Jan Lemeire, Constantin F. Aliferis
Abstract: Algorithms for Markov boundary discovery from data constitute an important recent development in machine learning, primarily because they offer a principled solution to the variable/feature selection problem and give insight on local causal structure. Over the last decade many sound algorithms have been proposed to identify a single Markov boundary of the response variable. Even though faithful distributions and, more broadly, distributions that satisfy the intersection property always have a single Markov boundary, other distributions/data sets may have multiple Markov boundaries of the response variable. The latter distributions/data sets are common in practical data-analytic applications, and there are several reasons why it is important to induce multiple Markov boundaries from such data. However, there are currently no sound and efficient algorithms that can accomplish this task. This paper describes a family of algorithms TIE* that can discover all Markov boundaries in a distribution. The broad applicability as well as efficiency of the new algorithmic family is demonstrated in an extensive benchmarking study that involved comparison with 26 state-of-the-art algorithms/variants in 15 data sets from a diversity of application domains. Keywords: Markov boundary discovery, variable/feature selection, information equivalence, violations of faithfulness
5 0.044563998 104 jmlr-2013-Sparse Single-Index Model
Author: Pierre Alquier, Gérard Biau
Abstract: Let (X,Y ) be a random pair taking values in R p × R. In the so-called single-index model, one has Y = f ⋆ (θ⋆T X) +W , where f ⋆ is an unknown univariate measurable function, θ⋆ is an unknown vector in Rd , and W denotes a random noise satisfying E[W |X] = 0. The single-index model is known to offer a flexible way to model a variety of high-dimensional real-world phenomena. However, despite its relative simplicity, this dimension reduction scheme is faced with severe complications as soon as the underlying dimension becomes larger than the number of observations (“p larger than n” paradigm). To circumvent this difficulty, we consider the single-index model estimation problem from a sparsity perspective using a PAC-Bayesian approach. On the theoretical side, we offer a sharp oracle inequality, which is more powerful than the best known oracle inequalities for other common procedures of single-index recovery. The proposed method is implemented by means of the reversible jump Markov chain Monte Carlo technique and its performance is compared with that of standard procedures. Keywords: single-index model, sparsity, regression estimation, PAC-Bayesian, oracle inequality, reversible jump Markov chain Monte Carlo method
6 0.043339789 75 jmlr-2013-Nested Expectation Propagation for Gaussian Process Classification with a Multinomial Probit Likelihood
7 0.041066106 98 jmlr-2013-Segregating Event Streams and Noise with a Markov Renewal Process Model
8 0.040901355 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components
9 0.038056955 45 jmlr-2013-GPstuff: Bayesian Modeling with Gaussian Processes
10 0.036544312 108 jmlr-2013-Stochastic Variational Inference
11 0.034724794 15 jmlr-2013-Bayesian Canonical Correlation Analysis
12 0.033068471 97 jmlr-2013-Risk Bounds of Learning Processes for Lévy Processes
13 0.032108683 58 jmlr-2013-Language-Motivated Approaches to Action Recognition
14 0.031814717 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
15 0.031761553 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
16 0.031232435 44 jmlr-2013-Finding Optimal Bayesian Networks Using Precedence Constraints
17 0.028862614 121 jmlr-2013-Variational Inference in Nonconjugate Models
18 0.026973186 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
19 0.026647093 114 jmlr-2013-The Rate of Convergence of AdaBoost
20 0.024309468 71 jmlr-2013-Message-Passing Algorithms for Quadratic Minimization
topicId topicWeight
[(0, -0.133), (1, -0.082), (2, -0.013), (3, 0.017), (4, 0.009), (5, 0.017), (6, -0.033), (7, 0.023), (8, -0.036), (9, 0.068), (10, 0.046), (11, -0.007), (12, -0.144), (13, -0.079), (14, 0.037), (15, 0.005), (16, -0.026), (17, -0.084), (18, 0.124), (19, 0.164), (20, -0.007), (21, 0.068), (22, -0.39), (23, -0.046), (24, 0.05), (25, 0.013), (26, 0.009), (27, -0.068), (28, -0.153), (29, -0.164), (30, -0.053), (31, 0.066), (32, -0.134), (33, 0.172), (34, -0.075), (35, -0.049), (36, 0.048), (37, 0.155), (38, 0.093), (39, -0.099), (40, -0.12), (41, -0.078), (42, 0.11), (43, -0.038), (44, -0.056), (45, -0.057), (46, 0.055), (47, 0.003), (48, -0.055), (49, -0.086)]
simIndex simValue paperId paperTitle
same-paper 1 0.94886655 43 jmlr-2013-Fast MCMC Sampling for Markov Jump Processes and Extensions
Author: Vinayak Rao, Yee Whye Teh
Abstract: Markov jump processes (or continuous-time Markov chains) are a simple and important class of continuous-time dynamical systems. In this paper, we tackle the problem of simulating from the posterior distribution over paths in these models, given partial and noisy observations. Our approach is an auxiliary variable Gibbs sampler, and is based on the idea of uniformization. This sets up a Markov chain over paths by alternately sampling a finite set of virtual jump times given the current path, and then sampling a new path given the set of extant and virtual jump times. The first step involves simulating a piecewise-constant inhomogeneous Poisson process, while for the second, we use a standard hidden Markov model forward filtering-backward sampling algorithm. Our method is exact and does not involve approximations like time-discretization. We demonstrate how our sampler extends naturally to MJP-based models like Markov-modulated Poisson processes and continuous-time Bayesian networks, and show significant computational benefits over state-ofthe-art MCMC samplers for these models. Keywords: Markov jump process, MCMC, Gibbs sampler, uniformization, Markov-modulated Poisson process, continuous-time Bayesian network
2 0.74536014 16 jmlr-2013-Bayesian Nonparametric Hidden Semi-Markov Models
Author: Matthew J. Johnson, Alan S. Willsky
Abstract: There is much interest in the Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM) as a natural Bayesian nonparametric extension of the ubiquitous Hidden Markov Model for learning from sequential and time-series data. However, in many settings the HDP-HMM’s strict Markovian constraints are undesirable, particularly if we wish to learn or encode non-geometric state durations. We can extend the HDP-HMM to capture such structure by drawing upon explicit-duration semi-Markov modeling, which has been developed mainly in the parametric non-Bayesian setting, to allow construction of highly interpretable models that admit natural prior information on state durations. In this paper we introduce the explicit-duration Hierarchical Dirichlet Process Hidden semiMarkov Model (HDP-HSMM) and develop sampling algorithms for efficient posterior inference. The methods we introduce also provide new methods for sampling inference in the finite Bayesian HSMM. Our modular Gibbs sampling methods can be embedded in samplers for larger hierarchical Bayesian models, adding semi-Markov chain modeling as another tool in the Bayesian inference toolbox. We demonstrate the utility of the HDP-HSMM and our inference methods on both synthetic and real experiments. Keywords: Bayesian nonparametrics, time series, semi-Markov, sampling algorithms, Hierarchical Dirichlet Process Hidden Markov Model
3 0.44057962 11 jmlr-2013-Algorithms for Discovery of Multiple Markov Boundaries
Author: Alexander Statnikov, Nikita I. Lytkin, Jan Lemeire, Constantin F. Aliferis
Abstract: Algorithms for Markov boundary discovery from data constitute an important recent development in machine learning, primarily because they offer a principled solution to the variable/feature selection problem and give insight on local causal structure. Over the last decade many sound algorithms have been proposed to identify a single Markov boundary of the response variable. Even though faithful distributions and, more broadly, distributions that satisfy the intersection property always have a single Markov boundary, other distributions/data sets may have multiple Markov boundaries of the response variable. The latter distributions/data sets are common in practical data-analytic applications, and there are several reasons why it is important to induce multiple Markov boundaries from such data. However, there are currently no sound and efficient algorithms that can accomplish this task. This paper describes a family of algorithms TIE* that can discover all Markov boundaries in a distribution. The broad applicability as well as efficiency of the new algorithmic family is demonstrated in an extensive benchmarking study that involved comparison with 26 state-of-the-art algorithms/variants in 15 data sets from a diversity of application domains. Keywords: Markov boundary discovery, variable/feature selection, information equivalence, violations of faithfulness
4 0.35997373 15 jmlr-2013-Bayesian Canonical Correlation Analysis
Author: Arto Klami, Seppo Virtanen, Samuel Kaski
Abstract: Canonical correlation analysis (CCA) is a classical method for seeking correlations between two multivariate data sets. During the last ten years, it has received more and more attention in the machine learning community in the form of novel computational formulations and a plethora of applications. We review recent developments in Bayesian models and inference methods for CCA which are attractive for their potential in hierarchical extensions and for coping with the combination of large dimensionalities and small sample sizes. The existing methods have not been particularly successful in fulfilling the promise yet; we introduce a novel efficient solution that imposes group-wise sparsity to estimate the posterior of an extended model which not only extracts the statistical dependencies (correlations) between data sets but also decomposes the data into shared and data set-specific components. In statistics literature the model is known as inter-battery factor analysis (IBFA), for which we now provide a Bayesian treatment. Keywords: Bayesian modeling, canonical correlation analysis, group-wise sparsity, inter-battery factor analysis, variational Bayesian approximation
5 0.33079967 98 jmlr-2013-Segregating Event Streams and Noise with a Markov Renewal Process Model
Author: Dan Stowell, Mark D. Plumbley
Abstract: We describe an inference task in which a set of timestamped event observations must be clustered into an unknown number of temporal sequences with independent and varying rates of observations. Various existing approaches to multi-object tracking assume a fixed number of sources and/or a fixed observation rate; we develop an approach to inferring structure in timestamped data produced by a mixture of an unknown and varying number of similar Markov renewal processes, plus independent clutter noise. The inference simultaneously distinguishes signal from noise as well as clustering signal observations into separate source streams. We illustrate the technique via synthetic experiments as well as an experiment to track a mixture of singing birds. Source code is available. Keywords: multi-target tracking, clustering, point processes, flow network, sound
6 0.32817128 90 jmlr-2013-Quasi-Newton Method: A New Direction
7 0.25392255 97 jmlr-2013-Risk Bounds of Learning Processes for Lévy Processes
8 0.24353084 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
9 0.22064 45 jmlr-2013-GPstuff: Bayesian Modeling with Gaussian Processes
10 0.2140467 58 jmlr-2013-Language-Motivated Approaches to Action Recognition
11 0.20455123 104 jmlr-2013-Sparse Single-Index Model
12 0.19291008 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components
13 0.18668529 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres
14 0.18012384 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
15 0.17691313 115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation
16 0.16743936 75 jmlr-2013-Nested Expectation Propagation for Gaussian Process Classification with a Multinomial Probit Likelihood
17 0.15747075 42 jmlr-2013-Fast Generalized Subset Scan for Anomalous Pattern Detection
18 0.15578274 82 jmlr-2013-Optimally Fuzzy Temporal Memory
19 0.15065651 68 jmlr-2013-Machine Learning with Operational Costs
20 0.14347789 71 jmlr-2013-Message-Passing Algorithms for Quadratic Minimization
topicId topicWeight
[(0, 0.044), (5, 0.094), (6, 0.06), (10, 0.07), (20, 0.016), (23, 0.027), (44, 0.431), (68, 0.032), (70, 0.032), (75, 0.047), (85, 0.02), (87, 0.013)]
simIndex simValue paperId paperTitle
1 0.97073436 67 jmlr-2013-MLPACK: A Scalable C++ Machine Learning Library
Author: Ryan R. Curtin, James R. Cline, N. P. Slagle, William B. March, Parikshit Ram, Nishant A. Mehta, Alexander G. Gray
Abstract: MLPACK is a state-of-the-art, scalable, multi-platform C++ machine learning library released in late 2011 offering both a simple, consistent API accessible to novice users and high performance and flexibility to expert users by leveraging modern features of C++. MLPACK provides cutting-edge algorithms whose benchmarks exhibit far better performance than other leading machine learning libraries. MLPACK version 1.0.3, licensed under the LGPL, is available at http://www.mlpack.org. Keywords: C++, dual-tree algorithms, machine learning software, open source software, largescale learning 1. Introduction and Goals Though several machine learning libraries are freely available online, few, if any, offer efficient algorithms to the average user. For instance, the popular Weka toolkit (Hall et al., 2009) emphasizes ease of use but scales poorly; the distributed Apache Mahout library offers scalability at a cost of higher overhead (such as clusters and powerful servers often unavailable to the average user). Also, few libraries offer breadth; for instance, libsvm (Chang and Lin, 2011) and the Tilburg MemoryBased Learner (TiMBL) are highly scalable and accessible yet each offer only a single method. MLPACK, intended to be the machine learning analog to the general-purpose LAPACK linear algebra library, aims to combine efficiency and accessibility. Written in C++, MLPACK uses the highly efficient Armadillo matrix library (Sanderson, 2010) and is freely available under the GNU Lesser General Public License (LGPL). Through the use of C++ templates, MLPACK both eliminates unnecessary copying of data sets and performs expression optimizations unavailable in other languages. Also, MLPACK is, to our knowledge, unique among existing libraries in using generic programming features of C++ to allow customization of the available machine learning methods without incurring performance penalties. c 2013 Ryan R. Curtin, James R. Cline, N. P. Slagle, William B. March, Parikshit Ram, Nishant A. Mehta and Alexander G. Gray. C URTIN , C LINE , S LAGLE , M ARCH , R AM , M EHTA AND G RAY In addition, users ranging from students to experts should find the consistent, intuitive interface of MLPACK to be highly accessible. Finally, the source code provides references and comprehensive documentation. Four major goals of the development team of MLPACK are • • • • to implement scalable, fast machine learning algorithms, to design an intuitive, consistent, and simple API for non-expert users, to implement a variety of machine learning methods, and to provide cutting-edge machine learning algorithms unavailable elsewhere. This paper offers both an introduction to the simple and extensible API and a glimpse of the superior performance of the library. 2. Package Overview Each algorithm available in MLPACK features both a set of C++ library functions and a standalone command-line executable. Version 1.0.3 includes the following methods: • • • • • • • • • • • • • • nearest/furthest neighbor search with cover trees or kd-trees (k-nearest-neighbors) range search with cover trees or kd-trees Gaussian mixture models (GMMs) hidden Markov models (HMMs) LARS / Lasso regression k-means clustering fast hierarchical clustering (Euclidean MST calculation)1 (March et al., 2010) kernel PCA (and regular PCA) local coordinate coding1 (Yu et al., 2009) sparse coding using dictionary learning RADICAL (Robust, Accurate, Direct ICA aLgorithm) (Learned-Miller and Fisher, 2003) maximum variance unfolding (MVU) via LRSDP1 (Burer and Monteiro, 2003) the naive Bayes classifier density estimation trees1 (Ram and Gray, 2011) The development team manages MLPACK with Subversion and the Trac bug reporting system, allowing easy downloads and simple bug reporting. The entire development process is transparent, so any interested user can easily contribute to the library. MLPACK can compile from source on Linux, Mac OS, and Windows; currently, different Linux distributions are reviewing MLPACK for inclusion in their package managers, which will allow users to install MLPACK without needing to compile from source. 3. A Consistent, Simple API MLPACK features a highly accessible API, both in style (such as consistent naming schemes and coding conventions) and ease of use (such as templated defaults), as well as stringent documentation standards. Consequently, a new user can execute algorithms out-of-the-box often with little or no adjustment to parameters, while the seasoned expert can expect extreme flexibility in algorithmic 1. This algorithm is not available in any other comparable software package. 802 MLPACK: A S CALABLE C++ M ACHINE L EARNING L IBRARY Data Set wine cloud wine-qual isolet miniboone yp-msd corel covtype mnist randu MLPACK 0.0003 0.0069 0.0290 13.0197 20.2045 5430.0478 4.9716 14.3449 2719.8087 1020.9142 Weka 0.0621 0.1174 0.8868 213.4735 216.1469 >9000.0000 14.4264 45.9912 >9000.0000 2665.0921 Shogun 0.0277 0.5000 4.3617 37.6190 2351.4637 >9000.0000 555.9600 >9000.0000 3536.4477 >9000.0000 MATLAB 0.0021 0.0210 0.6465 46.9518 1088.1127 >9000.0000 60.8496 >9000.0000 4838.6747 1679.2893 mlpy 0.0025 0.3520 4.0431 52.0437 3219.2696 >9000.0000 209.5056 >9000.0000 5192.3586 >9000.0000 sklearn 0.0008 0.0192 0.1668 46.8016 714.2385 >9000.0000 160.4597 651.6259 5363.9650 8780.0176 Table 1: k-NN benchmarks (in seconds). Data Set UCI Name Size wine Wine 178x13 cloud Cloud 2048x10 wine-qual Wine Quality 6497x11 isolet ISOLET 7797x617 miniboone MiniBooNE 130064x50 Data Set UCI Name Size yp-msd YearPredictionMSD 515345x90 corel Corel 37749x32 covtype Covertype 581082x54 mnist N/A 70000x784 randu N/A 1000000x10 Table 2: Benchmark data set sizes. tuning. For example, the following line initializes an object which will perform the standard kmeans clustering in Euclidean space: KMeans
2 0.8206265 85 jmlr-2013-Pairwise Likelihood Ratios for Estimation of Non-Gaussian Structural Equation Models
Author: Aapo Hyvärinen, Stephen M. Smith
Abstract: We present new measures of the causal direction, or direction of effect, between two non-Gaussian random variables. They are based on the likelihood ratio under the linear non-Gaussian acyclic model (LiNGAM). We also develop simple first-order approximations of the likelihood ratio and analyze them based on related cumulant-based measures, which can be shown to find the correct causal directions. We show how to apply these measures to estimate LiNGAM for more than two variables, and even in the case of more variables than observations. We further extend the method to cyclic and nonlinear models. The proposed framework is statistically at least as good as existing ones in the cases of few data points or noisy data, and it is computationally and conceptually very simple. Results on simulated fMRI data indicate that the method may be useful in neuroimaging where the number of time points is typically quite small. Keywords: structural equation model, Bayesian network, non-Gaussianity, causality, independent component analysis
same-paper 3 0.74270433 43 jmlr-2013-Fast MCMC Sampling for Markov Jump Processes and Extensions
Author: Vinayak Rao, Yee Whye Teh
Abstract: Markov jump processes (or continuous-time Markov chains) are a simple and important class of continuous-time dynamical systems. In this paper, we tackle the problem of simulating from the posterior distribution over paths in these models, given partial and noisy observations. Our approach is an auxiliary variable Gibbs sampler, and is based on the idea of uniformization. This sets up a Markov chain over paths by alternately sampling a finite set of virtual jump times given the current path, and then sampling a new path given the set of extant and virtual jump times. The first step involves simulating a piecewise-constant inhomogeneous Poisson process, while for the second, we use a standard hidden Markov model forward filtering-backward sampling algorithm. Our method is exact and does not involve approximations like time-discretization. We demonstrate how our sampler extends naturally to MJP-based models like Markov-modulated Poisson processes and continuous-time Bayesian networks, and show significant computational benefits over state-ofthe-art MCMC samplers for these models. Keywords: Markov jump process, MCMC, Gibbs sampler, uniformization, Markov-modulated Poisson process, continuous-time Bayesian network
4 0.36339328 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning
Author: Andrea Tacchetti, Pavan K. Mallapragada, Matteo Santoro, Lorenzo Rosasco
Abstract: We present GURLS, a least squares, modular, easy-to-extend software library for efficient supervised learning. GURLS is targeted to machine learning practitioners, as well as non-specialists. It offers a number state-of-the-art training strategies for medium and large-scale learning, and routines for efficient model selection. The library is particularly well suited for multi-output problems (multi-category/multi-label). GURLS is currently available in two independent implementations: Matlab and C++. It takes advantage of the favorable properties of regularized least squares algorithm to exploit advanced tools in linear algebra. Routines to handle computations with very large matrices by means of memory-mapped storage and distributed task execution are available. The package is distributed under the BSD license and is available for download at https://github.com/LCSL/GURLS. Keywords: regularized least squares, big data, linear algebra 1. Introduction and Design Supervised learning has become a fundamental tool for the design of intelligent systems and the analysis of high dimensional data. Key to this success has been the availability of efficient, easy-touse software packages. New data collection technologies make it easy to gather high dimensional, multi-output data sets of increasing size. This trend calls for new software solutions for the automatic training, tuning and testing of supervised learning methods. These observations motivated the design of GURLS (Grand Unified Regularized Least Squares). The package was developed to pursue the following goals: Speed: Fast training/testing procedures for learning problems with potentially large/huge number of points, features and especially outputs (e.g., classes). Memory: Flexible data management to work with large data sets by means of memory-mapped storage. Performance: ∗. Also in the Laboratory for Computational and Statistical Learning, Istituto Italiano di Tecnologia and Massachusetts Institute of Technology c 2013 Andrea Tacchetti, Pavan K. Mallapragada, Matteo Santoro and Lorenzo Rosasco. TACCHETTI , M ALLAPRAGADA , S ANTORO AND ROSASCO State of the art results in high-dimensional multi-output problems. Usability and modularity: Easy to use and to expand. GURLS is based on Regularized Least Squares (RLS) and takes advantage of all the favorable properties of these methods (Rifkin et al., 2003). Since the algorithm reduces to solving a linear system, GURLS is set up to exploit the powerful tools, and recent advances, of linear algebra (including randomized solver, first order methods, etc.). Second, it makes use of RLS properties which are particularly suited for high dimensional learning. For example: (1) RLS has natural primal and dual formulation (hence having complexity which is the smallest between number of examples and features); (2) efficient parameter selection (closed form expression of the leave one out error and efficient computations of regularization path); (3) natural and efficient extension to multiple outputs. Specific attention has been devoted to handle large high dimensional data sets. We rely on data structures that can be serialized using memory-mapped files, and on a distributed task manager to perform a number of key steps (such as matrix multiplication) without loading the whole data set in memory. Efforts were devoted to to provide a lean API and an exhaustive documentation. GURLS has been deployed and tested successfully on Linux, MacOS and Windows. The library is distributed under the simplified BSD license, and can be downloaded from https://github.com/LCSL/GURLS. 2. Description of the Library The library comprises four main modules. GURLS and bGURLS—both implemented in Matlab— are aimed at solving learning problems with small/medium and large-scale data sets respectively. GURLS++ and bGURLS++ are their C++ counterparts. The Matlab and C++ versions share the same design, but the C++ modules have significant improvements, which make them faster and more flexible. The specification of the desired machine learning experiment in the library is straightforward. Basically, it is a formal description of a pipeline, that is, an ordered sequence of steps. Each step identifies an actual learning task, and belongs to a predefined category. The core of the library is a method (a class in the C++ implementation) called GURLScore, which is responsible for processing the sequence of tasks in the proper order and for linking the output of the former task to the input of the subsequent one. A key role is played by the additional “options” structure, referred to as OPT. OPT is used to store all configuration parameters required to customize the behavior of individual tasks in the pipeline. Tasks receive configuration parameters from OPT in read-only mode and—upon termination—the results are appended to the structure by GURLScore in order to make them available to subsequent tasks. This allows the user to skip the execution of some tasks in a pipeline, by simply inserting the desired results directly into the options structure. Currently, we identify six different task categories: data set splitting, kernel computation, model selection, training, evaluation and testing and performance assessment and analysis. Tasks belonging to the same category may be interchanged with each other. 2.1 Learning From Large Data Sets Two modules in GURLS have been specifically designed to deal with big data scenarios. The approach we adopted is mainly based on a memory-mapped abstraction of matrix and vector data structures, and on a distributed computation of a number of standard problems in linear algebra. For learning on big data, we decided to focus specifically on those situations where one seeks a linear model on a large set of (possibly non linear) features. A more accurate specification of what “large” means in GURLS is related to the number of features d and the number of training 3202 GURLS: A L EAST S QUARES L IBRARY FOR S UPERVISED L EARNING data set optdigit landast pendigit letter isolet # of samples 3800 4400 7400 10000 6200 # of classes 10 6 10 26 26 # of variables 64 36 16 16 600 Table 1: Data sets description. examples n: we require it must be possible to store a min(d, n) × min(d, n) matrix in memory. In practice, this roughly means we can train models with up-to 25k features on machines with 8Gb of RAM, and up-to 50k features on machines with 36Gb of RAM. We do not require the data matrix itself to be stored in memory: within GURLS it is possible to manage an arbitrarily large set of training examples. We distinguish two different scenarios. Data sets that can fully reside in RAM without any memory mapping techniques—such as swapping—are considered to be small/medium. Larger data sets are considered to be “big” and learning must be performed using either bGURLS or bGURLS++ . These two modules include all the design patterns described above, and have been complemented with additional big data and distributed computation capabilities. Big data support is obtained using a data structure called bigarray, which allows to handle data matrices as large as the space available on the hard drive: we store the entire data set on disk and load only small chunks in memory when required. There are some differences between the Matlab and C++ implementations. bGURLS relies on a simple, ad hoc interface, called GURLS Distributed Manager (GDM), to distribute matrix-matrix multiplications, thus allowing users to perform the important task of kernel matrix computation on a distributed network of computing nodes. After this step, the subsequent tasks behave as in GURLS. bGURLS++ (currently in active development) offers more interesting features because it is based on the MPI libraries. Therefore, it allows for a full distribution within every single task of the pipeline. All the processes read the input data from a shared filesystem over the network and then start executing the same pipeline. During execution, each process’ task communicates with the corresponding ones. Every process maintains its local copy of the options. Once the same task is completed by all processes, the local copies of the options are synchronized. This architecture allows for the creation of hybrid pipelines comprising serial one-process-based tasks from GURLS++ . 3. Experiments We decided to focus the experimental analysis in the paper to the assessment of GURLS’ performance both in terms of accuracy and time. In our experiments we considered 5 popular data sets, briefly described in Table 1. Experiments were run on a Intel Xeon 5140 @ 2.33GHz processor with 8GB of RAM, and running Ubuntu 8.10 Server (64 bit). optdigit accuracy (%) GURLS (linear primal) GURLS (linear dual) LS-SVM linear GURLS (500 random features) GURLS (1000 random features) GURLS (Gaussian kernel) LS-SVM (Gaussian kernel) time (s) landsat accuracy (%) time (s) pendigit accuracy (%) time (s) 92.3 92.3 92.3 96.8 97.5 98.3 98.3 0.49 726 7190 25.6 207 13500 26100 63.68 66.3 64.6 63.5 63.5 90.4 90.51 0.22 1148 6526 28.0 187 20796 18430 82.24 82.46 82.3 96.7 95.8 98.4 98.36 0.23 5590 46240 31.6 199 100600 120170 Table 2: Comparison between GURLS and LS-SVM. 3203 TACCHETTI , M ALLAPRAGADA , S ANTORO AND ROSASCO Performance (%) 1 0.95 0.9 0.85 isolet(∇) letter(×) 0.8 pendigit(∆) 0.75 landsat(♦) optdigit(◦) 0.7 LIBSVM:rbf 0.65 GURLS++:rbf GURLS:randomfeatures-1000 0.6 GURLS:randomfeatures-500 0.55 0.5 0 10 GURLS:rbf 1 10 2 10 3 10 4 Time (s) 10 Figure 1: Prediction accuracy vs. computing time. The color represents the training method and the library used. In blue: the Matlab implementation of RLS with RBF kernel, in red: its C++ counterpart. In dark red: results of LIBSVM with RBF kernel. In yellow and green: results obtained using a linear kernel on 500 and 1000 random features respectively. We set up different pipelines and compared the performance to SVM, for which we used the python modular interface to LIBSVM (Chang and Lin, 2011). Automatic selection of the optimal regularization parameter is implemented identically in all experiments: (i) split the data; (ii) define a set of regularization parameter on a regular grid; (iii) perform hold-out validation. The variance of the Gaussian kernel has been fixed by looking at the statistics of the pairwise distances among training examples. The prediction accuracy of GURLS and GURLS++ is identical—as expected—but the implementation in C++ is significantly faster. The prediction accuracy of standard RLS-based methods is in many cases higher than SVM. Exploiting the primal formulation of RLS, we further ran experiments with the random features approximation (Rahimi and Recht, 2008). As show in Figure 1, the performance of this method is comparable to that of SVM at a much lower computational cost in the majority of the tested data sets. We further compared GURLS with another available least squares based toolbox: the LS-SVM toolbox (Suykens et al., 2001), which includes routines for parameter selection such as coupled simulated annealing and line/grid search. The goal of this experiment is to benchmark the performance of the parameter selection with random data splitting included in GURLS. For a fair comparison, we considered only the Matlab implementation of GURLS. Results are reported in Table 2. As expected, using the linear kernel with the primal formulation—not available in LS-SVM—is the fastest approach since it leverages the lower dimensionality of the input space. When the Gaussian kernel is used, GURLS and LS-SVM have comparable computing time and classification performance. Note, however, that in GURLS the number of parameter in the grid search is fixed to 400, while in LS-SVM it may vary and is limited to 70. The interesting results obtained with the random features implementation in GURLS, make it an interesting choice in many applications. Finally, all GURLS pipelines, in their Matlab implementation, are faster than LS-SVM and further improvements can be achieved with GURLS++ . Acknowledgments We thank Tomaso Poggio, Zak Stone, Nicolas Pinto, Hristo S. Paskov and CBCL for comments and insights. 3204 GURLS: A L EAST S QUARES L IBRARY FOR S UPERVISED L EARNING References C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www. csie.ntu.edu.tw/˜cjlin/libsvm. A. Rahimi and B. Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems, volume 21, pages 1313–1320, 2008. R. Rifkin, G. Yeo, and T. Poggio. Regularized least-squares classification. Nato Science Series Sub Series III Computer and Systems Sciences, 190:131–154, 2003. J. Suykens, T. V. Gestel, J. D. Brabanter, B. D. Moor, and J. Vandewalle. Least Squares Support Vector Machines. World Scientific, 2001. ISBN 981-238-151-1. 3205
5 0.33378091 75 jmlr-2013-Nested Expectation Propagation for Gaussian Process Classification with a Multinomial Probit Likelihood
Author: Jaakko Riihimäki, Pasi Jylänki, Aki Vehtari
Abstract: This paper considers probabilistic multinomial probit classification using Gaussian process (GP) priors. Challenges with multiclass GP classification are the integration over the non-Gaussian posterior distribution, and the increase of the number of unknown latent variables as the number of target classes grows. Expectation propagation (EP) has proven to be a very accurate method for approximate inference but the existing EP approaches for the multinomial probit GP classification rely on numerical quadratures, or independence assumptions between the latent values associated with different classes, to facilitate the computations. In this paper we propose a novel nested EP approach which does not require numerical quadratures, and approximates accurately all betweenclass posterior dependencies of the latent values, but still scales linearly in the number of classes. The predictive accuracy of the nested EP approach is compared to Laplace, variational Bayes, and Markov chain Monte Carlo (MCMC) approximations with various benchmark data sets. In the experiments nested EP was the most consistent method compared to MCMC sampling, but in terms of classification accuracy the differences between all the methods were small from a practical point of view. Keywords: Gaussian process, multiclass classification, multinomial probit, approximate inference, expectation propagation
6 0.32940525 19 jmlr-2013-BudgetedSVM: A Toolbox for Scalable SVM Approximations
7 0.32742742 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs
8 0.31435394 120 jmlr-2013-Variational Algorithms for Marginal MAP
9 0.31143317 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
11 0.31010249 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
12 0.30991882 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
14 0.3067669 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
15 0.30604124 3 jmlr-2013-A Framework for Evaluating Approximation Methods for Gaussian Process Regression
16 0.30582243 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
17 0.30454111 22 jmlr-2013-Classifying With Confidence From Incomplete Information
18 0.30328465 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
19 0.30203596 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
20 0.30191928 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering