jmlr jmlr2013 jmlr2013-16 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Matthew J. Johnson, Alan S. Willsky
Abstract: There is much interest in the Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM) as a natural Bayesian nonparametric extension of the ubiquitous Hidden Markov Model for learning from sequential and time-series data. However, in many settings the HDP-HMM’s strict Markovian constraints are undesirable, particularly if we wish to learn or encode non-geometric state durations. We can extend the HDP-HMM to capture such structure by drawing upon explicit-duration semi-Markov modeling, which has been developed mainly in the parametric non-Bayesian setting, to allow construction of highly interpretable models that admit natural prior information on state durations. In this paper we introduce the explicit-duration Hierarchical Dirichlet Process Hidden semi-Markov Model (HDP-HSMM) and develop sampling algorithms for efficient posterior inference. The techniques we introduce also provide new methods for sampling inference in the finite Bayesian HSMM. Our modular Gibbs sampling methods can be embedded in samplers for larger hierarchical Bayesian models, adding semi-Markov chain modeling as another tool in the Bayesian inference toolbox. We demonstrate the utility of the HDP-HSMM and our inference methods on both synthetic and real experiments. Keywords: Bayesian nonparametrics, time series, semi-Markov, sampling algorithms, Hierarchical Dirichlet Process Hidden Markov Model
Reference: text
sentIndex sentText sentNum sentScore
1 We can extend the HDP-HMM to capture such structure by drawing upon explicit-duration semi-Markov modeling, which has been developed mainly in the parametric non-Bayesian setting, to allow construction of highly interpretable models that admit natural prior information on state durations. [sent-8, score-0.162]
2 Our modular Gibbs sampling methods can be embedded in samplers for larger hierarchical Bayesian models, adding semi-Markov chain modeling as another tool in the Bayesian inference toolbox. [sent-11, score-0.326]
3 Or, in separating a home power signal into the power signals of individual devices, we could perform the task much better by exploiting our prior knowledge about the levels and durations of each device’s power modes (Kolter and Johnson, 2011). [sent-17, score-0.82]
4 However, it shares the HDP-HMM’s restriction to geometric state durations, thus limiting the model’s expressiveness regarding duration structure. [sent-33, score-0.595]
5 Moreover, its global self-transition bias is shared among all states, and so it does not allow for learning state-specific duration information. [sent-34, score-0.489]
6 , 2009) induces non-Markovian state durations at the coarser levels of its state hierarchy, but even the coarser levels are constrained to have a sum-of-geometrics form, and hence it can be difficult to incorporate prior information. [sent-36, score-0.478]
7 We combine semi-Markovian ideas with the HDP-HMM to construct a general class of models that allow for both Bayesian nonparametric inference of state complexity as well as general duration distributions. [sent-39, score-0.726]
8 In addition, the sampling techniques we develop for the Hierarchical Dirichlet Process Hidden semi-Markov Model (HDP-HSMM) provide new approaches to inference in HDP-HMMs that can avoid some of the difficulties which result in slow mixing rates. [sent-40, score-0.166]
9 We also provide a brief treatment of the Bayesian nonparametric HDP-HMM and sampling inference algorithms. [sent-44, score-0.171]
10 In synthetic experiments, we demonstrate that our sampler mixes very quickly on data generated by both HMMs and HSMMs and accurately learns parameter values and state cardinality. [sent-49, score-0.267]
11 1 HMMs The core of the HMM consists of two layers: a layer of hidden state variables and a layer of observation or emission variables, as shown in Figure 1. [sent-62, score-0.268]
12 The hidden state sequence, x = (x1, . . . , xT), is a sequence of random variables on a finite alphabet, xt ∈ {1, 2, . . . , N}. [sent-63, score-0.277]
13 We focus on explicit duration semi-Markov modeling; that is, we are interested in the setting where each state’s duration is given an explicit distribution. [sent-75, score-0.978]
14 The basic idea underlying this HSMM formalism is to augment the generative process of a standard HMM with a random state duration time, drawn from some state-specific distribution when the state is entered. [sent-77, score-0.597]
15 The number of shaded nodes associated with each zs, denoted by Ds, is drawn from a state-specific duration distribution. [sent-90, score-0.559]
16 The state remains constant until the duration expires, at which point there is a Markov transition to a new state. [sent-92, score-0.637]
17 We use the random variable Dt to denote the duration of a state that is entered at time t, and we write the probability mass function for the random variable as p(dt |xt = i). [sent-93, score-0.566]
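To make this explicit-duration formalism concrete, here is a minimal simulation sketch of the generative process just described. The particular transition matrix, Poisson-plus-one durations, and Gaussian emissions are illustrative assumptions for this sketch, not the models used later in the paper.

```python
import numpy as np

def sample_hsmm(pi_0, A, sample_duration, emit_means, emit_std, T):
    """Simulate an explicit-duration HSMM for roughly T observations.

    pi_0            : initial state distribution
    A               : transition matrix with zeroed diagonal (no self-transitions)
    sample_duration : callable, sample_duration(i) draws a duration for state i
    emit_means      : per-state Gaussian emission means (illustrative choice)
    """
    states, obs = [], []
    z = np.random.choice(len(pi_0), p=pi_0)
    while len(obs) < T:
        d = sample_duration(z)                    # state-specific duration D_s
        d = min(d, T - len(obs))                  # right-censor the final segment
        states.extend([z] * d)
        obs.extend(np.random.normal(emit_means[z], emit_std, size=d))
        z = np.random.choice(len(A), p=A[z])      # Markov transition to a new state
    return np.array(states), np.array(obs)

# Toy example: 3 states, Poisson(10)+1 durations, no self-transitions.
A = np.array([[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]])
x, y = sample_hsmm(np.ones(3) / 3, A,
                   sample_duration=lambda i: np.random.poisson(10) + 1,
                   emit_means=[-2.0, 0.0, 2.0], emit_std=0.5, T=200)
```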
18 When defining an HSMM model, one must also choose whether the observation sequence ends exactly on a segment boundary or whether the observations are censored at the end, so that the final segment may possibly be cut off in the observations. [sent-101, score-0.177]
19 It is possible to perform efficient message-passing inference along an HSMM state chain (conditioned on parameters and observations) in a way similar to the standard alpha-beta dynamic programming algorithm for HMMs. [sent-104, score-0.202]
20 Dt+1 represents the duration of the segment beginning at time t + 1. [sent-112, score-0.542]
21 The final additive term in the expression for Bt(i) is described in Guédon (2007); it constitutes the contribution of state segments that run off the end of the provided observations, as per the censoring assumption, and depends on the survival function of the duration distribution. [sent-115, score-0.604]
22 The greater expressive power of the HSMM model necessarily increases the computational cost: the above message passing requires O(T 2 N + T N 2 ) basic operations for a chain of length T and state cardinality N , while the corresponding HMM message passing algorithm requires only O(T N 2 ). [sent-117, score-0.351]
23 However, if the support of the duration distribution is limited, or if we truncate possible segment lengths included in the inference messages to some maximum dmax , we can instead express the asymptotic message passing cost as O(T dmax N + T N 2 ). [sent-118, score-0.701]
24 Such truncations are often natural because the duration prior typically causes the message contributions to decay rapidly for sufficiently large d. [sent-119, score-0.583]
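As a rough illustration of how the dmax truncation enters the computation, the following sketch implements a simplified log-space version of HSMM backward message passing; the censoring correction for segments that run past the end of the data (discussed above) is omitted, and the variable names and cumulative-likelihood bookkeeping are our own conventions rather than the paper's.

```python
import numpy as np
from scipy.special import logsumexp

def hsmm_backward_messages(loglike, logA, logdur, dmax):
    """Simplified HSMM backward messages in log space, truncated at dmax.

    loglike[t, i]  : log p(y_{t+1} | state i)   (rows index observations 1..T)
    logA[i, j]     : log transition probability i -> j (self-transitions excluded)
    logdur[i, d-1] : log p(D = d | state i), for d = 1..dmax
    Returns (B, Bstar); the censoring correction is omitted for brevity.
    """
    T, N = loglike.shape
    B = np.zeros((T + 1, N))        # B[T] = log 1: nothing left to explain
    Bstar = np.zeros((T + 1, N))
    cum = np.vstack([np.zeros((1, N)), np.cumsum(loglike, axis=0)])
    for t in range(T - 1, -1, -1):
        for i in range(N):          # O(N * dmax) work per time step
            d = np.arange(1, min(dmax, T - t) + 1)
            seg_ll = cum[t + d, i] - cum[t, i]          # log p(y_{t+1:t+d} | state i)
            Bstar[t, i] = logsumexp(logdur[i, d - 1] + seg_ll + B[t + d, i])
        B[t] = logsumexp(logA + Bstar[t], axis=1)       # O(N^2) work per time step
    return B, Bstar
```

The two inner computations make the stated O(T dmax N + T N^2) cost visible directly in the loop structure.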
25 Though the increased complexity of message-passing over an HMM significantly increases the cost per iteration of sampling inference for a global model, the cost is offset because HSMM samplers need far fewer total iterations to converge. [sent-120, score-0.185]
26 The model employs an HDP prior over an infinite state space, which enables both inference of state complexity and Bayesian mixing over models of varying complexity. [sent-125, score-0.358]
27 The HDP plays the role of a prior over infinite transition matrices: each πj is a DP draw and is interpreted as the transition distribution from state j. [sent-142, score-0.268]
28 (2008), which has been shown to reduce mixing times for HDP-HMM inference while sacrificing little of the “tail” of the infinite transition matrix. [sent-148, score-0.19]
29 While the Sticky HDP-HMM allows some control over duration statistics, the state duration distributions remain geometric; the goal of this work is to provide a model in which any duration distributions may be used. [sent-158, score-1.606]
30 However, there is a clear modeling gain by eliminating self-transitions: when self-transitions are allowed, the “explicit duration distributions” do not model the state duration statistics directly. [sent-163, score-1.091]
31 We do not investigate here the problem of selecting particular observation and duration distribution classes; model selection is a fundamental challenge in generative modeling, and models must be chosen to capture structure in any particular data. [sent-165, score-0.599]
32 Instead, we provide the HDP-HSMM and related models as tools in which modeling choices (such as the selection of observation and duration distribution classes to fit particular data) can be made flexibly and naturally. [sent-166, score-0.604]
33 1 Finite Bayesian HSMM The finite Bayesian HSMM is a combination of the Bayesian HMM approach with semi-Markov state durations and is the model we generalize to the HDP-HSMM. [sent-168, score-0.27]
34 It is instructive to compare this construction with that of the finite model used in the weak-limit HDP-HSMM sampler that will be described in Section 4. [sent-172, score-0.158]
35 The generative process for a Bayesian HSMM with N states and observation and duration parameter prior distributions of H and G, respectively, can be summarized as πi ∼ Dir(α(1 − δi)) and (θi, ωi) ∼ H × G, drawn independently for i = 1, 2, . . . , N. [sent-174, score-0.785]
36 Within each segment s, the state labels are set to zs (xt = zs for t1s ≤ t ≤ t2s, where t1s and t2s denote the first and last time indices of segment s), the observations in the segment are drawn independently as yt ∼ f(θzs), and the endpoint satisfies t2s = t1s + Ds − 1 for segment duration Ds. The censoring term takes the form p(Dt+1 > T − t | xt+1 = i) p(yt+1:T | xt+1 = i, Dt+1 > T − t), where p̃ represents the duration distribution restricted to the set of possible durations D ⊂ N+ and re-normalized. [sent-180, score-0.842]
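A minimal sketch of drawing parameters from this finite Bayesian HSMM prior follows; the concrete observation prior H (Gaussian means) and duration prior G (Poisson rates) below are placeholder choices for illustration only, not the paper's.

```python
import numpy as np

def sample_bayesian_hsmm_prior(N, alpha=4.0):
    """Draw HSMM parameters from the finite Bayesian prior sketched above.

    Each transition row pi_i is a Dirichlet draw over the other N-1 states
    (self-transitions removed, matching Dir(alpha * (1 - delta_i))).  The
    observation prior H and duration prior G are illustrative placeholders.
    """
    pi = np.zeros((N, N))
    for i in range(N):
        off_diag = np.random.dirichlet(alpha * np.ones(N - 1))
        pi[i, np.arange(N) != i] = off_diag          # zero self-transition probability
    theta = np.random.normal(0.0, 3.0, size=N)       # H: per-state emission means
    omega = np.random.gamma(2.0, 5.0, size=N)        # G: per-state Poisson duration rates
    return pi, theta, omega

pi, theta, omega = sample_bayesian_hsmm_prior(N=4)
```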
37 First, we compare the HDP-HSMM direct assignment sampler to the weak limit sampler as well as the Sticky HDP-HMM direct assignment sampler, showing that the HDP-HSMM direct assignment sampler has similar performance to that of the Sticky HDP-HMM and that the weak limit sampler is much faster. [sent-185, score-0.883]
38 Next, we evaluate the HDP-HSMM weak limit sampler on synthetic data generated from finite HSMMs and HMMs. [sent-186, score-0.227]
39 We show that the HDP-HSMM applied to HSMM data can efficiently learn the correct model, including the correct number of states and state labels, while the HDP-HMM is unable to capture non-geometric duration statistics. [sent-187, score-0.633]
40 We also apply the HDP-HSMM to data generated by an HMM and demonstrate that, when equipped with a duration distribution class that includes geometric durations, the HDP-HSMM can also efficiently learn an HMM model when appropriate with little loss in efficiency. [sent-188, score-0.547]
41 Next, we use the HDP-HSMM in a factorial (Ghahramani and Jordan, 1997) structure for the purpose of disaggregating a whole-home power signal into the power draws of individual devices. [sent-189, score-0.38]
42 We show that encoding simple duration prior information when modeling individual devices can greatly improve performance, and further that a Bayesian treatment of the parameters is advantageous. [sent-190, score-0.65]
43 691 J OHNSON AND W ILLSKY (a) (b) Figure 11: 11(a) compares the Geometric-HDP-HSMM direct assignment sampler with that of the Sticky HDP-HMM, both applied to HMM data. [sent-194, score-0.217]
44 The sticky parameter κ was chosen to maximize mixing. [sent-195, score-0.189]
45 11(b) compares the HDP-HSMM direct assignment sampler with the weak limit sampler. [sent-196, score-0.254]
46 1 Synthetic Data Figure 11 compares the HDP-HSMM direct assignment sampler to that of the Sticky HDP-HMM as well as the HDP-HSMM weak limit sampler. [sent-199, score-0.254]
47 Figure 11(a) shows that the direct assignment sampler for a Geometric-HDP-HSMM performs similarly to the Sticky HDP-HMM direct assignment sampler when applied to data generated by an HMM with scalar Gaussian emissions. [sent-200, score-0.434]
48 Figure 11(b) shows that the weak limit sampler mixes much more quickly than the direct assignment sampler. [sent-201, score-0.254]
49 Each iteration of the weak limit sampler is also much faster to execute (approximately 50x faster in our implementations in Python). [sent-202, score-0.195]
50 Due to its much greater efficiency, we focus on the weak limit sampler for the rest of this section; we believe it is a superior inference algorithm whenever an adequately large approximation parameter L can be chosen a priori. [sent-203, score-0.276]
51 In the 25 Gibbs sampling runs for each model, we applied 5 chains to each of 5 generated observation sequences. [sent-205, score-0.152]
52 By setting the class of duration distributions to be a superclass of the class of geometric distributions, we can allow an HDP-HSMM model to learn an HMM from data when appropriate. [sent-208, score-0.578]
53 In each plot, the blue line indicates the error of the chain with the median error across 25 independent Gibbs chains, while the red dashed lines indicate the chains with the 10th and 90th percentile errors at each iteration. [sent-224, score-0.187]
54 In each plot, the blue line indicates the error of the chain with the median error across 25 independent Gibbs chains, while the red dashed lines indicate the chains with the 10th and 90th percentile errors at each iteration. [sent-227, score-0.187]
55 In each plot, the blue line indicates the error of the chain with the median error across 25 independent Gibbs chains, while the red dashed line indicates the chains with the 10th and 90th percentile error at each iteration. [sent-239, score-0.187]
56 model to learn geometric durations as well as significantly non-geometric distributions with modes away from zero. [sent-240, score-0.365]
57 The sampler chains quickly concentrated at r = 1 for all state duration distributions. [sent-250, score-0.786]
58 This experiment demonstrates that with the appropriate choice of duration distribution the HDP-HSMM can effectively learn an HMM model. [sent-252, score-0.518]
59 2 Power Disaggregation In this section we show an application of the HDP-HSMM factorial structure to an unsupervised power signal disaggregation problem. [sent-254, score-0.461]
60 The task is to estimate the power draw from individual devices, such as refrigerators and microwaves, given an aggregated whole-home power consumption signal. [sent-255, score-0.377]
61 This application demonstrates the utility of including duration information in priors as well as the significant speedup achieved with changepoint-based inference. [sent-257, score-0.554]
62 The power disaggregation problem has a rich history (Zeifman and Roth, 2011) with many proposed approaches for a variety of problem specifications. [sent-258, score-0.361]
63 (2010) also explores many other aspects of the problem, such as additional data features, and builds a very compelling complete solution to the disaggregation problem, while we focus on the factorial time series modeling itself. [sent-262, score-0.357]
64 We chose the top 5 power-drawing devices (refrigerator, lighting, dishwasher, microwave, furnace) across several houses and identified 18 24-hour segments across 4 houses for which many (but not always all) of the devices switched on at least once. [sent-264, score-0.152]
65 We constructed simple priors that set the rough power draw levels and duration statistics of the modes for several devices. [sent-266, score-0.818]
66 For example, the power draw from home lighting changes infrequently and can have many different levels, so an HDP-HSMM with a bias towards longer negative-binomial durations is appropriate. [sent-267, score-0.416]
67 On the other hand, a refrigerator’s power draw cycle is very regular and usually exhibits only three modes, so our priors biased the refrigerator HDP-HSMM to have fewer modes and set the power levels accordingly. [sent-268, score-0.557]
68 We did not truncate the duration distributions during inference, and we set the weak limit approximation parameter L to be twice the number of expected modes for each device; for example, for the refrigerator device we set L = 6 and for lighting we set L = 20. [sent-270, score-0.924]
69 We performed sampling inference independently on each observation sequence. [sent-271, score-0.171]
70 As a baseline for comparison, we also constructed a factorial sticky HDP-HMM (Fox et al. [sent-272, score-0.289]
71 , 2008) with the same observation priors and with duration biases that induced the same average mode durations as the corresponding HDP-HSMM priors. [sent-273, score-0.823]
72 We also compare to the factorial HMM performance presented in Kolter and Johnson (2011), which fit device models using an EM algorithm on training data. [sent-274, score-0.283]
73 The set of possible changepoints is easily identifiable in these data, and a primary task of the model is to organize the jumps observed in the observations into an explanation in terms of the individual device models. [sent-276, score-0.206]
74 By simply computing first differences and thresholding, we reduce the number of potential changepoints we need to consider from 5000 to 100-200, speeding up state sequence resampling by orders of magnitude. [sent-277, score-0.164]
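The first-difference changepoint pre-detection described here can be sketched in a few lines; the jump threshold below is an assumed illustrative value, not the one used in the experiments.

```python
import numpy as np

def candidate_changepoints(signal, threshold=20.0):
    """Return indices where the aggregate signal jumps by more than `threshold`.

    Restricting segment boundaries to these candidate changepoints means the
    block state-sequence resampling only needs to consider a few hundred
    boundary positions instead of every time step.
    """
    jumps = np.abs(np.diff(signal))
    return np.flatnonzero(jumps > threshold) + 1   # +1: index of the sample after the jump

# e.g., changepoints = candidate_changepoints(aggregate_watts, threshold=20.0)
```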
75 The accuracy measure is 1 − (Σt Σi |yt(i) − ŷt(i)|) / (2 Σt ȳt), where ȳt refers to the observed total power consumption at time t, yt(i) is the true power consumed at time t by device i, and ŷt(i) is the estimated power consumption. [sent-280, score-0.955]
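A direct transcription of this accuracy measure as code (a sketch; the time-by-device array layout is an assumption of this snippet):

```python
import numpy as np

def disaggregation_accuracy(y_true, y_est, y_total):
    """Accuracy = 1 - sum_{t,i} |y_t^(i) - yhat_t^(i)| / (2 * sum_t ybar_t).

    y_true, y_est : per-device power, arrays of shape (T, num_devices)
    y_total       : observed aggregate power, array of shape (T,)
    """
    return 1.0 - np.abs(y_true - y_est).sum() / (2.0 * y_total.sum())
```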
76 1 seconds, including block sampling the state sequence and resampling all observation, duration, and transition parameters. [sent-283, score-0.223]
77 Figure 16: Overall accuracy comparison between the EM-trained FHMM of Kolter and Johnson (2011), the factorial sticky HDP-HMM, and the factorial HDP-HSMM. [sent-290, score-0.389]
78 could be fit during inference, while the EM-based approach fixed device model parameters that may not be consistent across homes. [sent-291, score-0.147]
79 Furthermore, the incorporation of duration structure and prior information provided a significant performance increase for the HDP-HSMM approach. [sent-292, score-0.538]
80 Finally, Figures 18 and 19 show total power consumption estimates for the two models on two selected data sequences. [sent-294, score-0.244]
81 We note that the nonparametric prior was very important for modeling the power consumption due to lighting. [sent-295, score-0.336]
82 For the other devices the number of power modes (and hence states) is not so uncertain, but duration statistics can provide a [sent-297, score-0.788]
83 (a) (b) (c) Figure 18: Estimated total power consumption for a data sequence where the HDP-HSMM significantly outperformed the HDP-HMM due to its modeling of duration regularities. [sent-314, score-0.761]
84 697 J OHNSON AND W ILLSKY (a) (b) (c) Figure 19: Estimated total power consumption for a data sequence where both the HDP-HMM and HDP-HSMM approaches performed well. [sent-315, score-0.236]
85 strong clue for disaggregation; for these, the main advantage of our model is in providing Bayesian inference and duration modeling. [sent-316, score-0.57]
86 Conclusion We have developed the HDP-HSMM and two Gibbs sampling inference algorithms, the weak limit and direct assignment samplers, uniting explicit-duration semi-Markov modeling with new Bayesian nonparametric techniques. [sent-318, score-0.303]
87 These models and algorithms not only allow learning from complex sequential data with non-Markov duration statistics in supervised and unsupervised settings, but also can be used as tools in constructing and performing inference in larger hierarchical models. [sent-319, score-0.586]
88 Power Disaggregation Priors We used simple hand-set priors for the power disaggregation experiments in Section 5. [sent-327, score-0.426]
89 2, where each prior had two free parameters that were set to encode rough means and variances for each device mode’s durations and emissions. [sent-328, score-0.421]
90 To put priors on multiple modes we used atomic mixture models in the priors. [sent-329, score-0.184]
91 Similarly, we use NegBin(α, β; r) to denote Negative Binomial duration distribution priors where a latent state-specific “success” parameter p is drawn from p ∼ Beta(α, β) and the parameter r is fixed, so that state durations for that state are then drawn from NegBin(p, r). [sent-333, score-0.901]
92 (Note that choosing r = 1 recovers the geometric duration class.) [sent-334, score-0.518]
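A sketch of drawing durations under this NegBin(α, β; r) prior; numpy's negative-binomial parameterization and the shift to keep durations at least one are assumed conventions for this snippet, not necessarily those used in the paper.

```python
import numpy as np

def sample_negbin_duration_prior(alpha, beta, r, num_samples=1000):
    """Draw durations under the NegBin(alpha, beta; r) prior described above.

    A state-specific success parameter p ~ Beta(alpha, beta) is drawn once, and
    durations are then drawn from NegBin(p, r).  numpy's negative_binomial
    counts failures before r successes (support starting at 0), so we shift by
    one to keep durations >= 1 -- an assumed convention.  With r = 1 this
    reduces to a (shifted) geometric distribution.
    """
    p = np.random.beta(alpha, beta)
    return np.random.negative_binomial(r, p, size=num_samples) + 1

durations = sample_negbin_duration_prior(alpha=2.0, beta=2.0, r=5)
```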
93 We set the priors for the factorial Sticky HDP-HMM by using the same set of observation prior parameters as for the HDP-HSMM and setting state-specific sticky bias parameters so as to match the expected durations encoded in the HDP-HSMM duration priors. [sent-335, score-1.028]
94 Inference in hidden Markov models with explicit state duration distributions. [sent-361, score-0.74]
95 Exploring the state sequence space for hidden Markov and semi-Markov chains. [sent-389, score-0.243]
96 A Bayesian approach to hidden semi-Markov model based speech synthesis. [sent-404, score-0.221]
97 Observation priors encode rough power levels that are expected from devices. [sent-408, score-0.278]
98 Duration priors encode duration statistics that are expected from devices. [sent-409, score-0.586]
99 Markov chain Monte Carlo in approximate Dirichlet and beta two-parameter process hierarchical models. [sent-421, score-0.176]
100 Construction of nonparametric Bayesian models from parametric Bayes equations. [sent-472, score-0.204]
wordName wordTfidf (topN-words)
[('duration', 0.489), ('hsmm', 0.25), ('negbin', 0.25), ('disaggregation', 0.221), ('durations', 0.193), ('sticky', 0.189), ('sampler', 0.158), ('hmm', 0.152), ('gauss', 0.147), ('device', 0.147), ('power', 0.14), ('illsky', 0.133), ('ohnson', 0.133), ('bayesian', 0.125), ('hsmms', 0.118), ('idden', 0.118), ('fox', 0.114), ('watts', 0.113), ('kolter', 0.103), ('arkov', 0.101), ('emi', 0.101), ('factorial', 0.1), ('hidden', 0.096), ('onparametric', 0.091), ('refrigerator', 0.088), ('modes', 0.083), ('inference', 0.081), ('state', 0.077), ('xt', 0.076), ('devices', 0.076), ('johnson', 0.071), ('dirichlet', 0.071), ('transition', 0.071), ('zs', 0.07), ('consumption', 0.068), ('priors', 0.065), ('yt', 0.064), ('odels', 0.064), ('chains', 0.062), ('hierarchical', 0.061), ('assignment', 0.059), ('changepoints', 0.059), ('teh', 0.058), ('samplers', 0.057), ('hdp', 0.057), ('segment', 0.053), ('emission', 0.052), ('iid', 0.052), ('percentile', 0.05), ('lighting', 0.049), ('prior', 0.049), ('sampling', 0.047), ('gibbs', 0.047), ('message', 0.045), ('dishwasher', 0.044), ('furnace', 0.044), ('gem', 0.044), ('microwave', 0.044), ('willsky', 0.044), ('chain', 0.044), ('nonparametric', 0.043), ('observation', 0.043), ('markov', 0.042), ('murphy', 0.042), ('hamming', 0.042), ('levels', 0.041), ('dt', 0.04), ('mixing', 0.038), ('states', 0.038), ('censoring', 0.038), ('markovian', 0.038), ('weak', 0.037), ('modeling', 0.036), ('models', 0.036), ('ds', 0.034), ('home', 0.034), ('messages', 0.033), ('mode', 0.033), ('encode', 0.032), ('synthetic', 0.032), ('posterior', 0.032), ('hmms', 0.032), ('median', 0.031), ('bt', 0.031), ('beal', 0.031), ('distributions', 0.031), ('generative', 0.031), ('dewar', 0.029), ('diarization', 0.029), ('hashimoto', 0.029), ('redd', 0.029), ('refrigerators', 0.029), ('tranter', 0.029), ('zeifman', 0.029), ('backwards', 0.029), ('geometric', 0.029), ('learn', 0.029), ('dp', 0.028), ('sequence', 0.028), ('monitoring', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 16 jmlr-2013-Bayesian Nonparametric Hidden Semi-Markov Models
Author: Matthew J. Johnson, Alan S. Willsky
Abstract: There is much interest in the Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM) as a natural Bayesian nonparametric extension of the ubiquitous Hidden Markov Model for learning from sequential and time-series data. However, in many settings the HDP-HMM’s strict Markovian constraints are undesirable, particularly if we wish to learn or encode non-geometric state durations. We can extend the HDP-HMM to capture such structure by drawing upon explicit-duration semi-Markov modeling, which has been developed mainly in the parametric non-Bayesian setting, to allow construction of highly interpretable models that admit natural prior information on state durations. In this paper we introduce the explicit-duration Hierarchical Dirichlet Process Hidden semiMarkov Model (HDP-HSMM) and develop sampling algorithms for efficient posterior inference. The methods we introduce also provide new methods for sampling inference in the finite Bayesian HSMM. Our modular Gibbs sampling methods can be embedded in samplers for larger hierarchical Bayesian models, adding semi-Markov chain modeling as another tool in the Bayesian inference toolbox. We demonstrate the utility of the HDP-HSMM and our inference methods on both synthetic and real experiments. Keywords: Bayesian nonparametrics, time series, semi-Markov, sampling algorithms, Hierarchical Dirichlet Process Hidden Markov Model
2 0.10081006 108 jmlr-2013-Stochastic Variational Inference
Author: Matthew D. Hoffman, David M. Blei, Chong Wang, John Paisley
Abstract: We develop stochastic variational inference, a scalable algorithm for approximating posterior distributions. We develop this technique for a large class of probabilistic models and we demonstrate it with two probabilistic topic models, latent Dirichlet allocation and the hierarchical Dirichlet process topic model. Using stochastic variational inference, we analyze several large collections of documents: 300K articles from Nature, 1.8M articles from The New York Times, and 3.8M articles from Wikipedia. Stochastic inference can easily handle data sets of this size and outperforms traditional variational inference, which can only handle a smaller subset. (We also show that the Bayesian nonparametric topic model outperforms its parametric counterpart.) Stochastic variational inference lets us apply complex Bayesian models to massive data sets. Keywords: Bayesian inference, variational inference, stochastic optimization, topic models, Bayesian nonparametrics
3 0.096741423 43 jmlr-2013-Fast MCMC Sampling for Markov Jump Processes and Extensions
Author: Vinayak Rao, Yee Whye Teh
Abstract: Markov jump processes (or continuous-time Markov chains) are a simple and important class of continuous-time dynamical systems. In this paper, we tackle the problem of simulating from the posterior distribution over paths in these models, given partial and noisy observations. Our approach is an auxiliary variable Gibbs sampler, and is based on the idea of uniformization. This sets up a Markov chain over paths by alternately sampling a finite set of virtual jump times given the current path, and then sampling a new path given the set of extant and virtual jump times. The first step involves simulating a piecewise-constant inhomogeneous Poisson process, while for the second, we use a standard hidden Markov model forward filtering-backward sampling algorithm. Our method is exact and does not involve approximations like time-discretization. We demonstrate how our sampler extends naturally to MJP-based models like Markov-modulated Poisson processes and continuous-time Bayesian networks, and show significant computational benefits over state-ofthe-art MCMC samplers for these models. Keywords: Markov jump process, MCMC, Gibbs sampler, uniformization, Markov-modulated Poisson process, continuous-time Bayesian network
4 0.082067765 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
Author: Sébastien Gerchinovitz
Abstract: We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T . We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW and based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design. Keywords: sparsity, online linear regression, individual sequences, adaptive regret bounds
5 0.076537259 58 jmlr-2013-Language-Motivated Approaches to Action Recognition
Author: Manavender R. Malgireddy, Ifeoma Nwogu, Venu Govindaraju
Abstract: We present language-motivated approaches to detecting, localizing and classifying activities and gestures in videos. In order to obtain statistical insight into the underlying patterns of motions in activities, we develop a dynamic, hierarchical Bayesian model which connects low-level visual features in videos with poses, motion patterns and classes of activities. This process is somewhat analogous to the method of detecting topics or categories from documents based on the word content of the documents, except that our documents are dynamic. The proposed generative model harnesses both the temporal ordering power of dynamic Bayesian networks such as hidden Markov models (HMMs) and the automatic clustering power of hierarchical Bayesian models such as the latent Dirichlet allocation (LDA) model. We also introduce a probabilistic framework for detecting and localizing pre-specified activities (or gestures) in a video sequence, analogous to the use of filler models for keyword detection in speech processing. We demonstrate the robustness of our classification model and our spotting framework by recognizing activities in unconstrained real-life video sequences and by spotting gestures via a one-shot-learning approach. Keywords: dynamic hierarchical Bayesian networks, topic models, activity recognition, gesture spotting, generative models
6 0.068782978 121 jmlr-2013-Variational Inference in Nonconjugate Models
7 0.058981847 15 jmlr-2013-Bayesian Canonical Correlation Analysis
8 0.05561481 98 jmlr-2013-Segregating Event Streams and Noise with a Markov Renewal Process Model
9 0.055464033 115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation
10 0.053663332 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
11 0.044731729 11 jmlr-2013-Algorithms for Discovery of Multiple Markov Boundaries
12 0.043321509 90 jmlr-2013-Quasi-Newton Method: A New Direction
13 0.043140039 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
14 0.036873721 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
15 0.036262609 94 jmlr-2013-Ranked Bandits in Metric Spaces: Learning Diverse Rankings over Large Document Collections
16 0.033906505 76 jmlr-2013-Nonparametric Sparsity and Regularization
17 0.033711154 71 jmlr-2013-Message-Passing Algorithms for Quadratic Minimization
18 0.032315955 116 jmlr-2013-Truncated Power Method for Sparse Eigenvalue Problems
19 0.032275427 45 jmlr-2013-GPstuff: Bayesian Modeling with Gaussian Processes
20 0.028856575 8 jmlr-2013-A Theory of Multiclass Boosting
topicId topicWeight
[(0, -0.168), (1, -0.153), (2, -0.057), (3, 0.017), (4, -0.076), (5, 0.026), (6, 0.055), (7, 0.009), (8, 0.003), (9, 0.04), (10, 0.052), (11, -0.008), (12, -0.169), (13, -0.081), (14, 0.079), (15, 0.017), (16, 0.005), (17, -0.107), (18, 0.201), (19, 0.156), (20, -0.086), (21, 0.057), (22, -0.374), (23, -0.051), (24, 0.023), (25, -0.005), (26, -0.006), (27, -0.071), (28, -0.193), (29, 0.05), (30, 0.033), (31, 0.115), (32, -0.12), (33, -0.028), (34, 0.064), (35, -0.099), (36, 0.049), (37, 0.045), (38, 0.093), (39, -0.089), (40, -0.04), (41, 0.015), (42, 0.041), (43, 0.008), (44, -0.033), (45, -0.037), (46, 0.037), (47, -0.009), (48, -0.103), (49, -0.04)]
simIndex simValue paperId paperTitle
same-paper 1 0.95878601 16 jmlr-2013-Bayesian Nonparametric Hidden Semi-Markov Models
Author: Matthew J. Johnson, Alan S. Willsky
Abstract: There is much interest in the Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM) as a natural Bayesian nonparametric extension of the ubiquitous Hidden Markov Model for learning from sequential and time-series data. However, in many settings the HDP-HMM’s strict Markovian constraints are undesirable, particularly if we wish to learn or encode non-geometric state durations. We can extend the HDP-HMM to capture such structure by drawing upon explicit-duration semi-Markov modeling, which has been developed mainly in the parametric non-Bayesian setting, to allow construction of highly interpretable models that admit natural prior information on state durations. In this paper we introduce the explicit-duration Hierarchical Dirichlet Process Hidden semiMarkov Model (HDP-HSMM) and develop sampling algorithms for efficient posterior inference. The methods we introduce also provide new methods for sampling inference in the finite Bayesian HSMM. Our modular Gibbs sampling methods can be embedded in samplers for larger hierarchical Bayesian models, adding semi-Markov chain modeling as another tool in the Bayesian inference toolbox. We demonstrate the utility of the HDP-HSMM and our inference methods on both synthetic and real experiments. Keywords: Bayesian nonparametrics, time series, semi-Markov, sampling algorithms, Hierarchical Dirichlet Process Hidden Markov Model
2 0.79215407 43 jmlr-2013-Fast MCMC Sampling for Markov Jump Processes and Extensions
Author: Vinayak Rao, Yee Whye Teh
Abstract: Markov jump processes (or continuous-time Markov chains) are a simple and important class of continuous-time dynamical systems. In this paper, we tackle the problem of simulating from the posterior distribution over paths in these models, given partial and noisy observations. Our approach is an auxiliary variable Gibbs sampler, and is based on the idea of uniformization. This sets up a Markov chain over paths by alternately sampling a finite set of virtual jump times given the current path, and then sampling a new path given the set of extant and virtual jump times. The first step involves simulating a piecewise-constant inhomogeneous Poisson process, while for the second, we use a standard hidden Markov model forward filtering-backward sampling algorithm. Our method is exact and does not involve approximations like time-discretization. We demonstrate how our sampler extends naturally to MJP-based models like Markov-modulated Poisson processes and continuous-time Bayesian networks, and show significant computational benefits over state-ofthe-art MCMC samplers for these models. Keywords: Markov jump process, MCMC, Gibbs sampler, uniformization, Markov-modulated Poisson process, continuous-time Bayesian network
3 0.51092523 15 jmlr-2013-Bayesian Canonical Correlation Analysis
Author: Arto Klami, Seppo Virtanen, Samuel Kaski
Abstract: Canonical correlation analysis (CCA) is a classical method for seeking correlations between two multivariate data sets. During the last ten years, it has received more and more attention in the machine learning community in the form of novel computational formulations and a plethora of applications. We review recent developments in Bayesian models and inference methods for CCA which are attractive for their potential in hierarchical extensions and for coping with the combination of large dimensionalities and small sample sizes. The existing methods have not been particularly successful in fulfilling the promise yet; we introduce a novel efficient solution that imposes group-wise sparsity to estimate the posterior of an extended model which not only extracts the statistical dependencies (correlations) between data sets but also decomposes the data into shared and data set-specific components. In statistics literature the model is known as inter-battery factor analysis (IBFA), for which we now provide a Bayesian treatment. Keywords: Bayesian modeling, canonical correlation analysis, group-wise sparsity, inter-battery factor analysis, variational Bayesian approximation
4 0.41272715 11 jmlr-2013-Algorithms for Discovery of Multiple Markov Boundaries
Author: Alexander Statnikov, Nikita I. Lytkin, Jan Lemeire, Constantin F. Aliferis
Abstract: Algorithms for Markov boundary discovery from data constitute an important recent development in machine learning, primarily because they offer a principled solution to the variable/feature selection problem and give insight on local causal structure. Over the last decade many sound algorithms have been proposed to identify a single Markov boundary of the response variable. Even though faithful distributions and, more broadly, distributions that satisfy the intersection property always have a single Markov boundary, other distributions/data sets may have multiple Markov boundaries of the response variable. The latter distributions/data sets are common in practical data-analytic applications, and there are several reasons why it is important to induce multiple Markov boundaries from such data. However, there are currently no sound and efficient algorithms that can accomplish this task. This paper describes a family of algorithms TIE* that can discover all Markov boundaries in a distribution. The broad applicability as well as efficiency of the new algorithmic family is demonstrated in an extensive benchmarking study that involved comparison with 26 state-of-the-art algorithms/variants in 15 data sets from a diversity of application domains. Keywords: Markov boundary discovery, variable/feature selection, information equivalence, violations of faithfulness
5 0.38366917 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
Author: Sébastien Gerchinovitz
Abstract: We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T . We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW and based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design. Keywords: sparsity, online linear regression, individual sequences, adaptive regret bounds
6 0.34099159 115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation
7 0.33387223 98 jmlr-2013-Segregating Event Streams and Noise with a Markov Renewal Process Model
8 0.29317698 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
9 0.28772843 58 jmlr-2013-Language-Motivated Approaches to Action Recognition
10 0.2639603 108 jmlr-2013-Stochastic Variational Inference
11 0.26363581 57 jmlr-2013-Kernel Bayes' Rule: Bayesian Inference with Positive Definite Kernels
12 0.25643963 116 jmlr-2013-Truncated Power Method for Sparse Eigenvalue Problems
13 0.23845215 90 jmlr-2013-Quasi-Newton Method: A New Direction
14 0.23650198 38 jmlr-2013-Dynamic Affine-Invariant Shape-Appearance Handshape Features and Classification in Sign Language Videos
15 0.2265591 121 jmlr-2013-Variational Inference in Nonconjugate Models
16 0.18320617 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
17 0.18163827 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning
18 0.18125711 45 jmlr-2013-GPstuff: Bayesian Modeling with Gaussian Processes
19 0.17000882 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
20 0.16730843 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
topicId topicWeight
[(0, 0.024), (5, 0.09), (6, 0.047), (10, 0.62), (20, 0.019), (23, 0.017), (44, 0.011), (53, 0.02), (68, 0.015), (70, 0.012), (75, 0.025), (87, 0.018)]
simIndex simValue paperId paperTitle
1 0.97493851 96 jmlr-2013-Regularization-Free Principal Curve Estimation
Author: Samuel Gerber, Ross Whitaker
Abstract: Principal curves and manifolds provide a framework to formulate manifold learning within a statistical context. Principal curves define the notion of a curve passing through the middle of a distribution. While the intuition is clear, the formal definition leads to some technical and practical difficulties. In particular, principal curves are saddle points of the mean-squared projection distance, which poses severe challenges for estimation and model selection. This paper demonstrates that the difficulties in model selection associated with the saddle point property of principal curves are intrinsically tied to the minimization of the mean-squared projection distance. We introduce a new objective function, facilitated through a modification of the principal curve estimation approach, for which all critical points are principal curves and minima. Thus, the new formulation removes the fundamental issue for model selection in principal curve estimation. A gradient-descent-based estimator demonstrates the effectiveness of the new formulation for controlling model complexity on numerical experiments with synthetic and real data. Keywords: principal curve, manifold estimation, unsupervised learning, model complexity, model selection
same-paper 2 0.94574988 16 jmlr-2013-Bayesian Nonparametric Hidden Semi-Markov Models
Author: Matthew J. Johnson, Alan S. Willsky
Abstract: There is much interest in the Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM) as a natural Bayesian nonparametric extension of the ubiquitous Hidden Markov Model for learning from sequential and time-series data. However, in many settings the HDP-HMM’s strict Markovian constraints are undesirable, particularly if we wish to learn or encode non-geometric state durations. We can extend the HDP-HMM to capture such structure by drawing upon explicit-duration semi-Markov modeling, which has been developed mainly in the parametric non-Bayesian setting, to allow construction of highly interpretable models that admit natural prior information on state durations. In this paper we introduce the explicit-duration Hierarchical Dirichlet Process Hidden semiMarkov Model (HDP-HSMM) and develop sampling algorithms for efficient posterior inference. The methods we introduce also provide new methods for sampling inference in the finite Bayesian HSMM. Our modular Gibbs sampling methods can be embedded in samplers for larger hierarchical Bayesian models, adding semi-Markov chain modeling as another tool in the Bayesian inference toolbox. We demonstrate the utility of the HDP-HSMM and our inference methods on both synthetic and real experiments. Keywords: Bayesian nonparametrics, time series, semi-Markov, sampling algorithms, Hierarchical Dirichlet Process Hidden Markov Model
3 0.92234564 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach
Author: Gang Niu, Bo Dai, Lin Shang, Masashi Sugiyama
Abstract: The large volume principle proposed by Vladimir Vapnik, which advocates that hypotheses lying in an equivalence class with a larger volume are more preferable, is a useful alternative to the large margin principle. In this paper, we introduce a new discriminative clustering model based on the large volume principle called maximum volume clustering (MVC), and then propose two approximation schemes to solve this MVC model: A soft-label MVC method using sequential quadratic programming and a hard-label MVC method using semi-definite programming, respectively. The proposed MVC is theoretically advantageous for three reasons. The optimization involved in hard-label MVC is convex, and under mild conditions, the optimization involved in soft-label MVC is akin to a convex one in terms of the resulting clusters. Secondly, the soft-label MVC method possesses a clustering error bound. Thirdly, MVC includes the optimization problems of a spectral clustering, two relaxed k-means clustering and an information-maximization clustering as special limit cases when its regularization parameter goes to infinity. Experiments on several artificial and benchmark data sets demonstrate that the proposed MVC compares favorably with state-of-the-art clustering methods. Keywords: discriminative clustering, large volume principle, sequential quadratic programming, semi-definite programming, finite sample stability, clustering error
4 0.81663513 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding
Author: Ery Arias-Castro, Bruno Pelletier
Abstract: Maximum Variance Unfolding is one of the main methods for (nonlinear) dimensionality reduction. We study its large sample limit, providing specific rates of convergence under standard assumptions. We find that it is consistent when the underlying submanifold is isometric to a convex subset, and we provide some simple examples where it fails to be consistent. Keywords: maximum variance unfolding, isometric embedding, U-processes, empirical processes, proximity graphs.
5 0.60602969 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
Author: Wendelin Böhmer, Steffen Grünewälder, Yun Shen, Marek Musial, Klaus Obermayer
Abstract: Linear reinforcement learning (RL) algorithms like least-squares temporal difference learning (LSTD) require basis functions that span approximation spaces of potential value functions. This article investigates methods to construct these bases from samples. We hypothesize that an ideal approximation spaces should encode diffusion distances and that slow feature analysis (SFA) constructs such spaces. To validate our hypothesis we provide theoretical statements about the LSTD value approximation error and induced metric of approximation spaces constructed by SFA and the state-of-the-art methods Krylov bases and proto-value functions (PVF). In particular, we prove that SFA minimizes the average (over all tasks in the same environment) bound on the above approximation error. Compared to other methods, SFA is very sensitive to sampling and can sometimes fail to encode the whole state space. We derive a novel importance sampling modification to compensate for this effect. Finally, the LSTD and least squares policy iteration (LSPI) performance of approximation spaces constructed by Krylov bases, PVF, SFA and PCA is compared in benchmark tasks and a visual robot navigation experiment (both in a realistic simulation and with a robot). The results support our hypothesis and suggest that (i) SFA provides subspace-invariant features for MDPs with self-adjoint transition operators, which allows strong guarantees on the approximation error, (ii) the modified SFA algorithm is best suited for LSPI in both discrete and continuous state spaces and (iii) approximation spaces encoding diffusion distances facilitate LSPI performance. Keywords: reinforcement learning, diffusion distance, proto value functions, slow feature analysis, least-squares policy iteration, visual robot navigation c 2013 Wendelin B¨ hmer, Steffen Gr¨ new¨ lder, Yun Shen, Marek Musial and Klaus Obermayer. o u a ¨ ¨ ¨ B OHMER , G R UNEW ALDER , S HEN , M USIAL AND O BERMAYER
6 0.5935902 100 jmlr-2013-Similarity-based Clustering by Left-Stochastic Matrix Factorization
7 0.58471668 37 jmlr-2013-Divvy: Fast and Intuitive Exploratory Data Analysis
8 0.57981879 59 jmlr-2013-Large-scale SVD and Manifold Learning
9 0.57971466 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering
10 0.57926297 86 jmlr-2013-Parallel Vector Field Embedding
11 0.56755501 27 jmlr-2013-Consistent Selection of Tuning Parameters via Variable Selection Stability
12 0.56702507 69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses
13 0.56479782 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs
14 0.56235957 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning
15 0.55770421 42 jmlr-2013-Fast Generalized Subset Scan for Anomalous Pattern Detection
16 0.55750674 3 jmlr-2013-A Framework for Evaluating Approximation Methods for Gaussian Process Regression
17 0.55702585 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
18 0.55594146 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components
20 0.54894853 41 jmlr-2013-Experiment Selection for Causal Discovery