nips nips2008 nips2008-154 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Alan S. Willsky, Erik B. Sudderth, Michael I. Jordan, Emily B. Fox
Abstract: Many nonlinear dynamical phenomena can be effectively modeled by a system that switches among a set of conditionally linear dynamical modes. We consider two such models: the switching linear dynamical system (SLDS) and the switching vector autoregressive (VAR) process. Our nonparametric Bayesian approach utilizes a hierarchical Dirichlet process prior to learn an unknown number of persistent, smooth dynamical modes. We develop a sampling algorithm that combines a truncated approximation to the Dirichlet process with efficient joint sampling of the mode and state sequences. The utility and flexibility of our model are demonstrated on synthetic data, sequences of dancing honey bees, and the IBOVESPA stock index.
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract Many nonlinear dynamical phenomena can be effectively modeled by a system that switches among a set of conditionally linear dynamical modes. [sent-9, score-0.526]
2 We consider two such models: the switching linear dynamical system (SLDS) and the switching vector autoregressive (VAR) process. [sent-10, score-0.784]
3 Our nonparametric Bayesian approach utilizes a hierarchical Dirichlet process prior to learn an unknown number of persistent, smooth dynamical modes. [sent-11, score-0.35]
4 We develop a sampling algorithm that combines a truncated approximation to the Dirichlet process with efficient joint sampling of the mode and state sequences. [sent-12, score-0.369]
5 The utility and flexibility of our model are demonstrated on synthetic data, sequences of dancing honey bees, and the IBOVESPA stock index. [sent-13, score-0.352]
6 1 Introduction Linear dynamical systems (LDSs) are useful in describing dynamical phenomena as diverse as human motion [9], financial time-series [4], maneuvering targets [6, 10], and the dance of honey bees [8]. [sent-14, score-0.783]
7 For example, a coasting ballistic missile makes an evasive maneuver; a country experiences a recession, a central bank intervention, or some national or global event; a honey bee changes from a waggle to a turn right dance. [sent-16, score-0.327]
8 In addition, there is always the possibility of a new, previously unseen dynamical behavior. [sent-18, score-0.217]
9 These considerations motivate us to develop a nonparametric Bayesian approach for learning switching LDS (SLDS) models. [sent-19, score-0.296]
10 We also consider a special case of the SLDS—the switching vector autoregressive (VAR) process—in which direct observations of the underlying dynamical process are assumed available. [sent-20, score-0.598]
11 One can view switching dynamical processes as an extension of hidden Markov models (HMMs) in which each HMM state, or mode, is associated with a dynamical process. [sent-22, score-0.747]
12 Existing methods for learning SLDSs and switching VAR processes rely on either fixing the number of HMM modes, such as in [8], or considering a change-point detection formulation where each inferred change is to a new, previously unseen dynamical mode, such as in [14]. [sent-23, score-0.611]
13 In this paper we show how one can remain agnostic about the number of dynamical modes while still allowing for returns to previously exhibited dynamical behaviors. [sent-24, score-0.55]
14 Hierarchical Dirichlet processes (HDP) can be used as a prior on the parameters of HMMs with unknown mode space cardinality [2, 12]. [sent-25, score-0.312]
15 In this paper we make use of a variant of the HDPHMM—the sticky HDP-HMM of [5]—that provides improved control over the number of modes inferred by the HDP-HMM; such control is crucial for the problems we examine. [sent-26, score-0.342]
16 Although the HDP-HMM and its sticky extension are very flexible time series models, they do make a strong Markovian assumption that observations are conditionally independent given the HMM mode. [sent-27, score-0.295]
17 Our nonparametric Bayesian approach for learning switching dynamical processes extends the sticky HDP-HMM formulation to learn an unknown number of persistent, smooth dynamical modes and thereby capture a wider range of temporal dependencies. [sent-29, score-1.146]
18 2 Background: Switching Linear Dynamic Systems A state space (SS) model provides a general framework for analyzing many dynamical phenomena. [sent-30, score-0.248]
19 The model consists of an underlying state, xt ∈ Rn , with linear dynamics observed via y t ∈ Rd . [sent-31, score-0.148]
20 A linear time-invariant SS model, in which the dynamics do not depend on time, is given by xt = Axt−1 + et y t = Cxt + wt , (1) where et and wt are independent Gaussian noise processes with covariances Σ and R, respectively. [sent-32, score-0.309]
21 An order r VAR process, denoted by VAR(r), with observations y t ∈ Rd , can be defined as r yt = Ai y t−i + et et ∼ N (0, Σ). [sent-33, score-0.141]
22 The dynamical phenomena we examine in this paper exhibit behaviors better modeled as switches between a set of linear dynamical models. [sent-66, score-0.496]
23 Due to uncertainty in the mode of the process, the overall model is nonlinear. [sent-67, score-0.231]
24 We define a switching linear dynamical system (SLDS) by xt = A(zt ) xt−1 + et (zt ) y t = Cxt + wt . [sent-68, score-0.654]
25 (4) The first-order Markov process zt indexes the mode-specific LDS at time t, which is driven by Gaussian noise et (zt ) ∼ N (0, Σ(zt ) ). [sent-69, score-0.613]
26 We similarly define a switching VAR(r) process by r (zt ) yt = Ai y t−i + et (zt ) et (zt ) ∼ N (0, Σ(zt ) ). [sent-70, score-0.405]
27 (5) i=1 Note that the underlying state dynamics of the SLDS are equivalent to a switching VAR(1) process. [sent-71, score-0.326]
28 (b) HDP mixture model with πj ∼ DP(α, β), zji ∼ πj , and yji ∼ f (y | θzji ). [sent-76, score-0.097]
29 (c)-(d) Sticky HDP-HMM prior on switching VAR(2) and SLDS processes with the mode evolving as zt+1 ∼ πzt for πk ∼ DP(α + κ, (αβ + κδk )/(α + κ)). [sent-77, score-0.576]
30 Taking a hierarchical Bayesian approach, the HDP draws G0 from a Dirichlet process prior DP(γ, H), and then draws group specific distributions Gj ∼ DP(α, G0 ). [sent-86, score-0.101]
31 The HDP can be used to develop an HMM with a potentially infinite mode space [2, 12]. [sent-113, score-0.231]
32 For this HDP-HMM, each HDP group-specific distribution, πj , is a mode-specific transition distribution and, due to the infinite mode space, there are infinitely many groups. [sent-114, score-0.231]
33 Let zt denote the mode of the Markov chain at time t. [sent-115, score-0.749]
34 For discrete Markov processes zt ∼ πzt−1 , so that zt−1 indexes the group to which yt is assigned. [sent-116, score-0.635]
35 The current HMM mode zt then indexes the parameter θzt used to generate observation yt . [sent-117, score-0.817]
36 By sampling πj ∼ DP(α, β), the HDP prior encourages modes to have similar transition distributions (E[πjk | β] = βk ). [sent-120, score-0.182]
37 When modeling dynamical processes with mode persistence, the flexible nature of the HDP-HMM prior allows for mode sequences with unrealistically fast dynamics to have large posterior probability. [sent-122, score-0.879]
38 Recently, it has been shown [5] that one may mitigate this problem by instead considering a sticky HDP-HMM where πj is distributed as follows: πj ∼ DP α + κ, αβ + κδj α+κ . [sent-123, score-0.226]
39 4 The HDP-SLDS and HDP-AR-HMM Models For greater modeling flexibility, we take a nonparametric approach in defining the mode space of our switching dynamical processes. [sent-132, score-0.744]
40 Specifically, we develop extensions of the sticky HDP-HMM for both the SLDS and switching VAR models. [sent-133, score-0.49]
41 Let ˜ ˜ C ∈ Rd×n , n ≥ d, be the measurement matrix associated with a dynamical system defined by A, ˜ and assume C has full row rank. [sent-138, score-0.217]
42 [6] considered a related, yet simpler formulation for modeling a maneuvering target as a fixed LDS driven by a switching exogenous input. [sent-143, score-0.402]
43 Since the number of maneuver modes was assumed unknown, the exogenous input was taken to be the emissions of a HDP-HMM. [sent-144, score-0.222]
44 [3] in which the exogenous input was an independent noise process generated from a DP mixture model. [sent-147, score-0.11]
45 The HDP-SLDS is a major departure from these works since the dynamic parameters themselves change with the mode and are learned from the data, providing a much more expressive model. [sent-148, score-0.317]
46 The switching VAR(r) process can similarly be posed as an HDP-HMM in which the observations are modeled as conditionally VAR(r). [sent-149, score-0.372]
47 The generative processes for these two models are summarized as follows: HDP-AR-HMM HDP-SLDS Mode dynamics zt ∼ πzt−1 zt ∼ πzt−1 (zt ) r Observation dynamics y t = i=1 Ai y t−i + et (zt ) xt = A(zt ) xt−1 + et (zt ) y t = Cxt + wt (13) Here, πj is as defined in Sec. [sent-152, score-1.35]
48 1 Posterior Inference of Dynamic Parameters In this section we focus on developing a prior to regularize the learning of different dynamical modes conditioned on a fixed mode assignment z1:T . [sent-156, score-0.626]
49 For the SLDS, we analyze the posterior distribution of the dynamic parameters given a fixed, known state sequence x1:T . [sent-157, score-0.148]
50 Methods for learning the number of modes and resampling the sequences x1:T and z1:T are discussed in Sec. [sent-158, score-0.179]
51 Conditioned on the mode sequence, one may partition the observations into K different linear regression problems, where K = |{z1 , . [sent-161, score-0.27]
52 That is, for each mode k, we may form a matrix Y(k) with Nk columns consisting of the observations y t with zt = k. [sent-165, score-0.788]
53 The posterior distribution over the VAR(r) parameters associated with the k th mode decomposes as follows: p(A(k) , Σ(k) | D(k) ) = p(A(k) | Σ(k) , D(k) )p(Σ(k) | D(k) ). [sent-171, score-0.256]
54 2 Gibbs Sampler For the switching VAR(r) process, our sampler iterates between sampling the mode sequence, z1:T , and both the dynamic and sticky HDP-HMM parameters. [sent-183, score-0.87]
55 The sampler for the SLDS is identical to that of a switching VAR(1) process with the additional step of sampling the state sequence, x1:T , and conditioning on the state sequence when resampling dynamic parameters. [sent-184, score-0.55]
56 The resulting Gibbs sampler is described below and further elaborated upon in supplemental Appendix II. [sent-185, score-0.098]
57 Sampling Dynamic Parameters Conditioned on a sample of the mode sequence, z1:T , and the observations, y 1:T , or state sequence, x1:T , we can sample the dynamic parameters θ = {A(k) , Σ(k) } from the posterior density described in Sec. [sent-186, score-0.343]
58 Sampling z1:T As shown in [5], the mixing rate of the Gibbs sampler for the HDP-HMM can be dramatically improved by using a truncated approximation to the HDP, such as the weak limit approximation, and jointly sampling the mode sequence using a variant of the forward-backward algorithm. [sent-190, score-0.383]
59 Specifically, we compute backward messages mt+1,t (zt ) ∝ p(y t+1:T |zt , y t−r+1:t , π, θ) and then recursively sample each zt conditioned on zt−1 from p(zt | zt−1 , y 1:T , π, θ) ∝ p(zt | πzt−1 )p(y t | y t−r:t−1 , A(zt ) , Σ(zt ) )mt+1,t (zt ), r i=1 (21) (z ) Ai t y t−i , Σ(zt ) ). [sent-191, score-0.57]
60 where p(y t | y t−r:t−1 , A(zt ) , Σ(zt ) ) = N ( Joint sampling of the mode sequence is especially important when the observations are directly correlated via a dynamical process since this correlation further slows the mixing rate of the direct assignment sampler of [12]. [sent-192, score-0.655]
61 Sampling x1:T (HDP-SLDS only) Conditioned on the mode sequence z1:T and the set of dynamic parameters θ, our dynamical process simplifies to a time-varying linear dynamical system. [sent-195, score-0.796]
62 We can then block sample x1:T by first running a backward filter to compute mt+1,t (xt ) ∝ p(y t+1:T |xt , zt+1:T , θ) and then recursively sampling each xt conditioned on xt−1 from p(xt | xt−1 , y 1:T , z1:T , θ) ∝ p(xt | xt−1 , A(zt ) , Σ(zt ) )p(y t | xt , R)mt+1,t (xt ). [sent-196, score-0.32]
63 The associated 10th, 50th, and 90th Hamming distance quantiles over 100 trials are shown for the (b) HDP-VAR(1)HMM, (c) HDP-VAR(2)-HMM, (d) HDP-SLDS with C = I (top and bottom) and C = [1 0] (middle), and (e) sticky HDP-HMM using first difference observations. [sent-268, score-0.264]
64 2, we compare the performance of the HDP-VAR(1)-HMM, HDP-VAR(2)HMM, HDP-SLDS, and a baseline sticky HDP-HMM on three sets of test data (see Fig. [sent-270, score-0.226]
65 The Hamming distance error is calculated by first choosing the optimal mapping of indices maximizing overlap between the true and estimated mode sequences. [sent-272, score-0.269]
66 For the first scenario, the data were generated from a 5-mode switching VAR(1) process. [sent-273, score-0.264]
67 The three switching linear dynamical models provide comparable performance since both the HDP-VAR(2)-HMM and HDP-SLDS with C = I contain the class of HDP-VAR(1)-HMMs. [sent-274, score-0.481]
68 Note that the HDP-SLDS sampler is slower to mix since the hidden, three-dimensional continuous state is also sampled. [sent-275, score-0.09]
69 In the second scenario, the data were generated from a 3-mode switching AR(2) process. [sent-276, score-0.264]
70 Together, these results demonstrate both the differences between our models as well as the models’ ability to learn switching processes with varying numbers of modes. [sent-282, score-0.313]
71 Finally, note that all of the switching models yielded significant improvements relative to the baseline sticky HDP-HMM, even when the latter was given first differences of the observations. [sent-283, score-0.49]
72 In [4], a 2-mode Markov switching stochastic volatility (MSSV) model is used to identify periods of higher volatility in the daily returns. [sent-288, score-0.437]
73 The MSSV assumes that the log-volatilities follow an AR(1) process with a Markov switching mean. [sent-289, score-0.303]
74 This underlying process is observed via conditionally independent and normally distributed daily returns. [sent-290, score-0.1]
75 8 False Alarm Rate 1 (a) (b) (c) (d) Figure 3: (a) IBOVESPA stock index daily returns from 01/03/1997 to 01/16/2001. [sent-314, score-0.104]
76 We clearly see the advantage of using a SLDS model combined with the sticky HDP-HMM prior on the mode sequence. [sent-325, score-0.489]
77 Without the sticky extension, the HDP-SLDS over-segments the data and rapidly switches between redundant states which leads to a dramatically larger number of inferred change points. [sent-326, score-0.309]
78 Dancing Honey Bees We test the HDP-VAR(1)-HMM on a set of six dancing honey bee sequences, aiming to segment the sequences into the three dances displayed in Fig. [sent-327, score-0.396]
79 ) The data consist of measurements y t = [cos(θt ) sin(θt ) xt yt ]T , where (xt , yt ) denotes the 2D coordinates of the bee’s body and θt its head angle. [sent-330, score-0.201]
80 4(d)-(e), our model achieves a superior segmentation compared to the change-point formulation in almost all cases, while also identifying modes which reoccur over time. [sent-333, score-0.141]
81 [8] also presented an analysis of the honey bee data, using an SLDS with a fixed number of modes. [sent-335, score-0.23]
82 (The authors also considered a “parameterized segmental SLDS (PS-SLDS),” which makes use of domain knowledge specific to honey bee dancing and requires additional supervision during the learning process. [sent-337, score-0.329]
83 As seen in Table 1, the HDP-VAR(1)-HMM yields very good performance on sequences 4 to 6 in terms of the learned segmentation and number of modes (see Fig. [sent-339, score-0.179]
84 For sequences 1 to 3—which are much less regular than sequences 4 to 6—the performance of the unsupervised procedure is substantially worse. [sent-341, score-0.16]
85 This motivated us to also consider a partially supervised variant of the HDP-VAR(1)-HMM in which we fix the ground truth mode sequences for five out of six of the sequences, and jointly infer both a combined set of dynamic parameters and the left-out mode sequence. [sent-342, score-0.621]
86 4 is the extreme variation in head angle during the waggle dances of sequences 1 to 3. [sent-345, score-0.192]
87 Indeed, our learned segmentations consistently identify turn-right and turn-left modes, but often create a new, sequence-specific waggle dance mode. [sent-347, score-0.168]
88 Many of our errors can be attributed to creating multiple waggle dance modes within a sequence. [sent-348, score-0.255]
89 6 Discussion In this paper, we have addressed the problem of learning switching linear dynamical models with an unknown number of modes for describing complex dynamical phenomena. [sent-350, score-0.814]
90 We presented a non- (2) (3) (4) 2 3 Estimated mode Estimated mode Estimated mode 3 2 1 200 400 Time (a) 600 0 2 1 1 0 Detection Rate 4 3 200 400 Time (b) 600 800 0 (5) (6) 1 4 200 400 Time (c) 600 1 0. [sent-351, score-0.693]
91 4 HDP−VAR−HMM, unsupervised HDP−VAR−HMM, supervised Change−point formulation Viterbi sequence 0. [sent-355, score-0.135]
92 4 HDP−VAR−HMM, unsupervised HDP−VAR−HMM, supervised Change−point formulation Viterbi sequence 0. [sent-362, score-0.135]
93 8 1 False Alarm Rate (e) Figure 4: (top) Trajectories of the dancing honey bees for sequences 1 to 6, colored by waggle (red), turn right (blue), and turn left (green) dances. [sent-367, score-0.441]
94 (a)-(c) Estimated mode sequences representing the median error for sequences 4, 5, and 6 at the 200th Gibbs iteration, with errors indicated in red. [sent-368, score-0.357]
95 (d)-(e) ROC curves for the unsupervised HDP-VAR-HMM, partially supervised HDP-VAR-HMM, and change-point formulation of [14] using the Viterbi sequence for segmenting datasets 1-3 and 4-6, respectively. [sent-369, score-0.135]
96 0 Table 1: Median label accuracy of the HDP-VAR(1)-HMM using unsupervised and partially supervised Gibbs sampling, compared to accuracy of the supervised PS-SLDS and SLDS procedures, where the latter algorithms were based on a supervised MCMC procedure (DD-MCMC) [8]. [sent-394, score-0.154]
97 Using the same parameter settings, in one case we are able to learn changes in the volatility of the IBOVESPA stock exchange while in another case we learn segmentations of data into waggle, turn-right, and turn-left honey bee dances. [sent-396, score-0.426]
98 Bayesian inference for dynamic models with Dirichlet process mixtures. [sent-419, score-0.095]
99 Simulation-based sequential analysis of Markov switching stochastic volatility models. [sent-427, score-0.335]
100 Learning and inferring motion patterns using parametric segmental switching linear dynamic systems. [sent-468, score-0.348]
wordName wordTfidf (topN-words)
[('zt', 0.518), ('slds', 0.339), ('switching', 0.264), ('mode', 0.231), ('sticky', 0.226), ('hdp', 0.218), ('dynamical', 0.217), ('var', 0.194), ('honey', 0.145), ('syy', 0.145), ('hamming', 0.121), ('xt', 0.117), ('hmm', 0.116), ('modes', 0.116), ('dp', 0.105), ('waggle', 0.097), ('bee', 0.085), ('cxt', 0.081), ('ibovespa', 0.081), ('dirichlet', 0.079), ('stock', 0.073), ('dancing', 0.071), ('volatility', 0.071), ('bees', 0.065), ('maneuvering', 0.065), ('sequences', 0.063), ('ar', 0.061), ('ss', 0.06), ('sampler', 0.059), ('iw', 0.057), ('dynamic', 0.056), ('iteration', 0.052), ('lds', 0.052), ('processes', 0.049), ('exogenous', 0.048), ('fox', 0.048), ('dance', 0.042), ('zji', 0.042), ('yt', 0.042), ('sudderth', 0.042), ('normalized', 0.04), ('supervised', 0.04), ('process', 0.039), ('jt', 0.039), ('supplemental', 0.039), ('autoregressive', 0.039), ('observations', 0.039), ('distance', 0.038), ('mt', 0.037), ('gem', 0.036), ('sequence', 0.036), ('viterbi', 0.034), ('unsupervised', 0.034), ('sampling', 0.034), ('dir', 0.033), ('prior', 0.032), ('dances', 0.032), ('maneuver', 0.032), ('mssv', 0.032), ('oh', 0.032), ('rehg', 0.032), ('yji', 0.032), ('gibbs', 0.032), ('phenomena', 0.032), ('nonparametric', 0.032), ('alarm', 0.031), ('daily', 0.031), ('gj', 0.031), ('state', 0.031), ('dynamics', 0.031), ('change', 0.03), ('switches', 0.03), ('conditionally', 0.03), ('hierarchical', 0.03), ('conditioned', 0.03), ('et', 0.03), ('segmentations', 0.029), ('sy', 0.028), ('segmental', 0.028), ('xuan', 0.028), ('date', 0.027), ('indexes', 0.026), ('emissions', 0.026), ('caron', 0.026), ('detection', 0.026), ('wt', 0.026), ('markov', 0.025), ('posterior', 0.025), ('formulation', 0.025), ('place', 0.024), ('persistent', 0.024), ('willsky', 0.024), ('roc', 0.024), ('mixture', 0.023), ('dramatically', 0.023), ('exchange', 0.023), ('recursively', 0.022), ('exibility', 0.022), ('vec', 0.022), ('fusion', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 154 nips-2008-Nonparametric Bayesian Learning of Switching Linear Dynamical Systems
Author: Alan S. Willsky, Erik B. Sudderth, Michael I. Jordan, Emily B. Fox
Abstract: Many nonlinear dynamical phenomena can be effectively modeled by a system that switches among a set of conditionally linear dynamical modes. We consider two such models: the switching linear dynamical system (SLDS) and the switching vector autoregressive (VAR) process. Our nonparametric Bayesian approach utilizes a hierarchical Dirichlet process prior to learn an unknown number of persistent, smooth dynamical modes. We develop a sampling algorithm that combines a truncated approximation to the Dirichlet process with efficient joint sampling of the mode and state sequences. The utility and flexibility of our model are demonstrated on synthetic data, sequences of dancing honey bees, and the IBOVESPA stock index.
2 0.19634962 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms
Author: Sham M. Kakade, Ambuj Tewari
Abstract: This paper examines the generalization properties of online convex programming algorithms when the loss function is Lipschitz and strongly convex. Our main result is a sharp bound, that holds with high probability, on the excess risk of the output of an online algorithm in terms of the average regret. This allows one to use recent algorithms with logarithmic cumulative regret guarantees to achieve fast convergence rates for the excess risk with high probability. As a corollary, we characterize the convergence rate of P EGASOS (with high probability), a recently proposed method for solving the SVM optimization problem. 1
3 0.12305904 168 nips-2008-Online Metric Learning and Fast Similarity Search
Author: Prateek Jain, Brian Kulis, Inderjit S. Dhillon, Kristen Grauman
Abstract: Metric learning algorithms can provide useful distance functions for a variety of domains, and recent work has shown good accuracy for problems where the learner can access all distance constraints at once. However, in many real applications, constraints are only available incrementally, thus necessitating methods that can perform online updates to the learned metric. Existing online algorithms offer bounds on worst-case performance, but typically do not perform well in practice as compared to their offline counterparts. We present a new online metric learning algorithm that updates a learned Mahalanobis metric based on LogDet regularization and gradient descent. We prove theoretical worst-case performance bounds, and empirically compare the proposed method against existing online metric learning algorithms. To further boost the practicality of our approach, we develop an online locality-sensitive hashing scheme which leads to efficient updates to data structures used for fast approximate similarity search. We demonstrate our algorithm on multiple datasets and show that it outperforms relevant baselines.
4 0.10638361 195 nips-2008-Regularized Policy Iteration
Author: Amir M. Farahmand, Mohammad Ghavamzadeh, Shie Mannor, Csaba Szepesvári
Abstract: In this paper we consider approximate policy-iteration-based reinforcement learning algorithms. In order to implement a flexible function approximation scheme we propose the use of non-parametric methods with regularization, providing a convenient way to control the complexity of the function approximator. We propose two novel regularized policy iteration algorithms by adding L2 -regularization to two widely-used policy evaluation methods: Bellman residual minimization (BRM) and least-squares temporal difference learning (LSTD). We derive efficient implementation for our algorithms when the approximate value-functions belong to a reproducing kernel Hilbert space. We also provide finite-sample performance bounds for our algorithms and show that they are able to achieve optimal rates of convergence under the studied conditions. 1
5 0.10011537 112 nips-2008-Kernel Measures of Independence for non-iid Data
Author: Xinhua Zhang, Le Song, Arthur Gretton, Alex J. Smola
Abstract: Many machine learning algorithms can be formulated in the framework of statistical independence such as the Hilbert Schmidt Independence Criterion. In this paper, we extend this criterion to deal with structured and interdependent observations. This is achieved by modeling the structures using undirected graphical models and comparing the Hilbert space embeddings of distributions. We apply this new criterion to independent component analysis and sequence clustering. 1
6 0.08047539 206 nips-2008-Sequential effects: Superstition or rational behavior?
7 0.080469966 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
8 0.080113791 242 nips-2008-Translated Learning: Transfer Learning across Different Feature Spaces
9 0.074418612 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs
10 0.072158962 208 nips-2008-Shared Segmentation of Natural Scenes Using Dependent Pitman-Yor Processes
11 0.069372952 43 nips-2008-Cell Assemblies in Large Sparse Inhibitory Networks of Biologically Realistic Spiking Neurons
12 0.068492472 28 nips-2008-Asynchronous Distributed Learning of Topic Models
13 0.067011312 229 nips-2008-Syntactic Topic Models
14 0.062976234 12 nips-2008-Accelerating Bayesian Inference over Nonlinear Differential Equations with Gaussian Processes
15 0.0580904 189 nips-2008-Rademacher Complexity Bounds for Non-I.I.D. Processes
16 0.057471141 221 nips-2008-Stochastic Relational Models for Large-scale Dyadic Data using MCMC
17 0.056892619 57 nips-2008-Deflation Methods for Sparse PCA
18 0.056733392 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions
19 0.056504849 231 nips-2008-Temporal Dynamics of Cognitive Control
20 0.055224463 234 nips-2008-The Infinite Factorial Hidden Markov Model
topicId topicWeight
[(0, -0.157), (1, 0.046), (2, 0.002), (3, -0.032), (4, -0.065), (5, 0.14), (6, 0.006), (7, 0.144), (8, 0.05), (9, -0.054), (10, -0.063), (11, -0.06), (12, 0.073), (13, 0.038), (14, 0.024), (15, 0.028), (16, -0.054), (17, -0.013), (18, -0.045), (19, -0.126), (20, -0.104), (21, 0.028), (22, 0.012), (23, 0.024), (24, -0.004), (25, 0.026), (26, -0.031), (27, 0.049), (28, 0.089), (29, -0.021), (30, -0.031), (31, -0.076), (32, 0.004), (33, -0.085), (34, -0.017), (35, -0.0), (36, 0.149), (37, 0.02), (38, -0.045), (39, -0.017), (40, -0.074), (41, 0.038), (42, -0.05), (43, -0.132), (44, -0.061), (45, 0.19), (46, 0.083), (47, 0.007), (48, 0.16), (49, 0.082)]
simIndex simValue paperId paperTitle
same-paper 1 0.94745183 154 nips-2008-Nonparametric Bayesian Learning of Switching Linear Dynamical Systems
Author: Alan S. Willsky, Erik B. Sudderth, Michael I. Jordan, Emily B. Fox
Abstract: Many nonlinear dynamical phenomena can be effectively modeled by a system that switches among a set of conditionally linear dynamical modes. We consider two such models: the switching linear dynamical system (SLDS) and the switching vector autoregressive (VAR) process. Our nonparametric Bayesian approach utilizes a hierarchical Dirichlet process prior to learn an unknown number of persistent, smooth dynamical modes. We develop a sampling algorithm that combines a truncated approximation to the Dirichlet process with efficient joint sampling of the mode and state sequences. The utility and flexibility of our model are demonstrated on synthetic data, sequences of dancing honey bees, and the IBOVESPA stock index.
2 0.48748529 13 nips-2008-Adapting to a Market Shock: Optimal Sequential Market-Making
Author: Sanmay Das, Malik Magdon-Ismail
Abstract: We study the profit-maximization problem of a monopolistic market-maker who sets two-sided prices in an asset market. The sequential decision problem is hard to solve because the state space is a function. We demonstrate that the belief state is well approximated by a Gaussian distribution. We prove a key monotonicity property of the Gaussian state update which makes the problem tractable, yielding the first optimal sequential market-making algorithm in an established model. The algorithm leads to a surprising insight: an optimal monopolist can provide more liquidity than perfectly competitive market-makers in periods of extreme uncertainty, because a monopolist is willing to absorb initial losses in order to learn a new valuation rapidly so she can extract higher profits later. 1
3 0.46787435 167 nips-2008-One sketch for all: Theory and Application of Conditional Random Sampling
Author: Ping Li, Kenneth W. Church, Trevor J. Hastie
Abstract: Conditional Random Sampling (CRS) was originally proposed for efficiently computing pairwise (l2 , l1 ) distances, in static, large-scale, and sparse data. This study modifies the original CRS and extends CRS to handle dynamic or streaming data, which much better reflect the real-world situation than assuming static data. Compared with many other sketching algorithms for dimension reductions such as stable random projections, CRS exhibits a significant advantage in that it is “one-sketch-for-all.” In particular, we demonstrate the effectiveness of CRS in efficiently computing the Hamming norm, the Hamming distance, the lp distance, and the χ2 distance. A generic estimator and an approximate variance formula are also provided, for approximating any type of distances. We recommend CRS as a promising tool for building highly scalable systems, in machine learning, data mining, recommender systems, and information retrieval. 1
4 0.46778905 221 nips-2008-Stochastic Relational Models for Large-scale Dyadic Data using MCMC
Author: Shenghuo Zhu, Kai Yu, Yihong Gong
Abstract: Stochastic relational models (SRMs) [15] provide a rich family of choices for learning and predicting dyadic data between two sets of entities. The models generalize matrix factorization to a supervised learning problem that utilizes attributes of entities in a hierarchical Bayesian framework. Previously variational Bayes inference was applied for SRMs, which is, however, not scalable when the size of either entity set grows to tens of thousands. In this paper, we introduce a Markov chain Monte Carlo (MCMC) algorithm for equivalent models of SRMs in order to scale the computation to very large dyadic data sets. Both superior scalability and predictive accuracy are demonstrated on a collaborative filtering problem, which involves tens of thousands users and half million items. 1 Stochastic Relational Models Stochastic relational models (SRMs) [15] are generalizations of Gaussian process (GP) models [11] to the relational domain, where each observation is a dyadic datum, indexed by a pair of entities. They model dyadic data by a multiplicative interaction of two Gaussian process priors. Let U be the feature representation (or index) space of a set of entities. A pair-wise similarity in U is given by a kernel (covariance) function Σ : U × U → R. A Gaussian process (GP) defines a random function f : U → R, whose distribution is characterized by a mean function and the covariance function Σ, denoted by f ∼ N∞ (0, Σ)1 , where, for simplicity, we assume the mean to be the constant zero. GP complies with the intuition regarding the smoothness — if two entities ui and uj are similar according to Σ, then f (ui ) and f (uj ) are similar with a high probability. A domain of dyadic data must involve another set of entities, let it be represented (or indexed) by V. In a similar way, this entity set is associated with another kernel function Ω. For example, in a typical collaborative filtering domain, U represents users while V represents items, then, Σ measures the similarity between users and Ω measures the similarity between items. Being the relation between a pair of entities from different sets, a dyadic variable y is indexed by the product space U × V. Then an SRM aims to model y(u, v) by the following generative process, Model 1. The generative model of an SRM: 1. Draw kernel functions Σ ∼ IW ∞ (δ, Σ◦ ), and Ω ∼ IW ∞ (δ, Ω◦ ); 2. For k = 1, . . . , d: draw random functions fk ∼ N∞ (0, Σ), and gk ∼ N∞ (0, Ω); 1 We denote an n dimensional Gaussian distribution with a covariance matrix Σ by Nn (0, Σ). Then N∞ (0, Σ) explicitly indicates that a GP follows an “infinite dimensional” Gaussian distribution. 1 3. For each pair (u, v): draw y(u, v) ∼ p(y(u, v)|z(u, v), γ), where d 1 z(u, v) = √ fk (u)gk (v) + b(u, v). d k=1 In this model, IW ∞ (δ, Σ◦ ) and IW ∞ (δ, Ω◦ ) are hyper priors, whose details will be introduced later. p(y|z, γ) is the problem-specific noise model. For example, it can follow a Gaussian noise distribution y ∼ N1 (z, γ) if y is numerical, or, a Bernoulli distribution if y is binary. Function b(u, v) is the bias function over the U × V. For simplicity, we assume b(u, v) = 0. In the limit d → ∞, the model converges to a special case where fk and gk can be analytically marginalized out and z becomes a Gaussian process z ∼ N∞ (0, Σ ⊗ Ω) [15], with the covariance between pairs being a tensor kernel K ((ui , vs ), (uj , vt )) = Σ(ui , uj )Ω(vs , vt ). In anther special case, if Σ and Ω are both fixed to be Dirac delta functions, and U, V are finite sets, it is easy to see that the model reduces to probabilistic matrix factorization. The hyper prior IW ∞ (δ, Σ◦ ) is called inverted Wishart Process that generalizes the finite ndimensional inverted Wishart distribution [2] IW n (Σ|δ, Σ◦ ) ∝ |Σ| − 1 (δ+2n) 2 1 etr − Σ−1 Σ◦ , 2 where δ is the degree-of-freedom parameter, and Σ◦ is a positive definite kernel matrix. We note that the above definition is different from the popular formulation [3] or [4] in the machine learning community. The advantage of this new notation is demonstrated by the following theorem [2]. Theorem 1. Let A ∼ IW m (δ, K), A ∈ R+ , K ∈ R+ , and A and K be partitioned as A= A11 , A12 , A21 , A22 K= K11 , K12 K21 , K22 where A11 and K11 are two n × n sub matrices, n < m, then A11 ∼ IW n (δ, K11 ). The new formulation of inverted Wishart is consistent under marginalization. Therefore, similar to the way of deriving GPs from Gaussian distributions, we define a distribution of infinite-dimensional kernel functions, denoted by Σ ∼ IW ∞ (δ, Σ◦ ), such that any sub kernel matrix of size m × m follows Σ ∼ IW m (δ, Σ◦ ), where both Σ and Σ◦ are positive definite kernel functions. In case when U and V are sets of entity indices, SRMs let Σ◦ and Ω◦ both be Dirac delta functions, i.e., any of their sub kernel matrices is an identity matrix. Similar to GP regression/classification, the major application of SRMs is supervised prediction based on observed relational values and input features of entities. Formally, let YI = {y(u, v)|(u, v) ∈ I} be the set of noisy observations, where I ⊂ U × V, the model aims to predict the noise-free values ZO = {z(u, v)|(u, v) ∈ O} on O ⊂ U × V. As our computation is always on a finite set containing both I and O, from now on, we only consider the finite subset U0 × V0 , a finite support subset of U × V that contains I ∪ O. Accordingly we let Σ be the covariance matrix of Σ on U0 , and Ω be the covariance matrix of Ω on V0 . Previously a variational Bayesian method was applied to SRMs [15], which computes the maximum a posterior estimates of Σ and Ω, given YI , and then predicts ZO based on the estimated Σ and Ω. There are two limitations of this empirical Bayesian approach: (1) The variational method is not a fully Bayesian treatment. Ideally we wish to integrate Σ and Ω; (2) The more critical issue is, the algorithm has the complexity O(m3 + n3 ), with m = |U0 | and n = |V0 |, is not scalable to a large relational domain where m or n exceeds several thousands. In this paper we will introduce a fully Bayesian inference algorithm using Markov chain Monte Carlo sampling. By deriving equivalent sampling processes, we show the algorithms can be applied to a dataset, which is 103 times larger than the previous work [15], and produce an excellent accuracy. In the rest of this paper, we present our algorithms for Bayesian inference of SRMs in Section 2. Some related work is discussed in Section 3, followed by experiment results of SRMs in Section 4. Section 5 concludes. 2 2 Bayesian Models and MCMC Inference In this paper, we tackle the scalability issue with a fully Bayesian paradigm. We estimate the expectation of ZO directly from YI using Markov-chain Monte Carlo (MCMC) algorithm (specifically, Gibbs sampling), instead of evaluating that from estimated Σ or Ω. Our contribution is in how to make the MCMC inference more efficient for large scale data. We first introduce some necessary notation here. Bold capital letters, e.g. X, indicate matrices. I(m) is an identity matrix of size m × m. Nd , Nm,d , IW m , χ−2 are the multivariate normal distribution, the matrix-variate normal distribution, the inverse-Wishart distribution, and the inverse chi-square distribution, respectively. 2.1 Models with Non-informative Priors Let r = |I|, m = |U0 | and n = |V0 |. It is assumed that d min(m, n), and the observed set, I, is sparse, i.e. r mn. First, we consider the case of Σ◦ = αI(m) and Ω◦ = βI(n) . Let {fk } on U0 denoted by matrix variate F of size m × d, {gk } on V0 denoted by matrix variate G of size n × d. Then the generative model is written as Model 2 and depicted in Figure 1. Model 2. The generative model of a matrix-variate SRM: Σ 1. Draw Σ ∼ IW m (δ, αI(m) ) and Ω ∼ IW n (δ, βI(n) ); Ω I(d) G F 2. Draw F|Σ ∼ Nm,d (0, Σ ⊗ I(d) ) and G|Ω ∼ Nn,d (0, Ω ⊗ I(d) ); I(d) Z 3. Draw s2 ∼ χ−2 (ν, σ 2 ) ; 4. Draw Y|F, G, s2 ∼ Nm,n (Z, s2 I(m) ⊗ I(n) ), where Z = FG . s2 Y where Nm,d is the matrix-variate normal distribution of size m × d; α, Figure 1: Model 2 β, δ, ν and σ 2 are scalar parameters of the model. A slight difference √ between this finite model and Model 1 is that the coefficient 1/ d is ignored for simplicity because this coefficient can be absorbed by α or β. As we can explicitly compute Pr(Σ|F), Pr(Ω|G), Pr(F|YI , G, Σ, s2 ), Pr(G|YI , F, Ω, s2 ), Pr(s2 |YI , F, G), we can apply Gibbs sampling algorithm to compute ZO . However, the computational time complexity is at least O(m3 + n3 ), which is not practical for large scale data. 2.2 Gibbs Sampling Method To overcome the inefficiency in sampling large covariance matrices, we rewrite the sampling process using the property of Theorem 2 to take the advantage of d min(m, n). αI(d) αI(m) Theorem 2. If 1. Σ ∼ IW m (δ, αI(m) ) and F|Σ ∼ Nm,d (0, Σ ⊗ I(d) ), 2. K ∼ IW d (δ, αI(d) ) and H|K ∼ Nm,d (0, I(m) ⊗ K), then, matrix variates, F and H, have the same distribution. Proof sketch. Matrix variate F follows a matrix variate t distribution, t(δ, 0, αI(m) , I(d) ), which is written as 1 Σ I(d) F → I(m) K F Figure 2: Theorem 2 1 p(F) ∝ |I(m) + (αI(m) )−1 F(I(d) )−1 F |− 2 (δ+m+d−1) = |I(m) + α−1 FF |− 2 (δ+m+d−1) Matrix variate H follows a matrix variate t distribution, t(δ, 0, I(m) , αI(d) ), which can be written as 1 1 p(H) ∝ |I(m) + (I(m) )−1 H(αI(d) )−1 H |− 2 (δ+m+d−1) = |I(m) + α−1 HH |− 2 (δ+m+d−1) Thus, matrix variates, F and H, have the same distribution. 3 This theorem allows us to sample a smaller covariance matrix K of size d × d on the column side instead of sampling a large covariance matrix Σ of size m × m on the row side. The translation is depicted in Figure 2. This theorem applies to G as well, thus we rewrite the model as Model 3 (or Figure 3). A similar idea was used in our previous work [16]. Model 3. The alternative generative model of a matrix-variate SRM: I(m) I(n) K R 1. Draw K ∼ IW d (δ, αI(d) ) and R ∼ IW d (δ, βI(d) ); G F 2. Draw F|K ∼ Nm,d (0, I(m) ⊗ K), and G|R ∼ Nn,d (0, I(n) ⊗ R), 3. Draw s2 ∼ χ−2 (ν, σ 2 ) ; Z 4. Draw Y|F, G, s2 ∼ Nm,n (Z, s2 I(m) ⊗ I(n) ), where Z = FG . s2 Y Let column vector f i be the i-th row of matrix F, and column vector gj Figure 3: Model 3 be the j-th row of matrix G. In Model 3, {f i } are independent given K, 2 G and s . Similar independence applies to {gj } as well. The conditional posterior distribution of K, R, {f i }, {gj } and s2 can be easily computed, thus the Gibbs sampling for SRM is named BSRM (for Bayesian SRM). We use Gibbs sampling to compute the mean of ZO , which is derived from the samples of FG . Because of the sparsity of I, each iteration in this sampling algorithm can be computed in O(d2 r + d3 (m + n)) time complexity2 , which is a dramatic reduction from the previous time complexity O(m3 + n3 ) . 2.3 Models with Informative Priors An important characteristic of SRMs is that it allows the inclusion of certain prior knowledge of entities into the model. Specifically, the prior information is encoded as the prior covariance parameters, i.e. Σ◦ and Ω◦ . In the general case, it is difficult to run sampling process due to the size of Σ◦ and Ω◦ . We assume that Σ◦ and Ω◦ have a special form, i.e. Σ◦ = F◦ (F◦ ) + αI(m) , where F◦ is an m × p matrix, and Ω◦ = G◦ (G◦ ) + βI(n) , where G◦ is an n × q matrix, and the magnitude of p and q is about the same as or less than that of d. This prior knowledge can be obtained from some additional features of entities. Although such an informative Σ◦ prevents us from directly sampling each row of F independently, as we do in Model 3, we can expand matrix F of size m × d to (F, F◦ ) of size m × (d + p), and derive an equivalent model, where rows of F are conditionally independent given F◦ . Figure 4 illustrates this transformation. Theorem 3. Let δ > p, Σ◦ = F◦ (F◦ ) + αI(m) , where F◦ is an m × p matrix. If 1. Σ ∼ IW m (δ, Σ◦ ) and F|Σ ∼ Nm,d (0, Σ ⊗ I(d) ), K11 K12 ∼ IW d+p (δ − p, αI(d+p) ) and K21 K22 H|K ∼ Nm,d (F◦ K−1 K21 , I(m) ⊗ K11·2 ), 22 2. K = αI(d+p) Σ0 Σ I(d) F → I(m) K (F, F0 ) Figure 4: Theorem 3 where K11·2 = K11 − K12 K−1 K21 , then F and H have the same distribution. 22 Proof sketch. Consider the distribution (H1 , H2 )|K ∼ Nm,d+p (0, I(m) ⊗ K). (1) Because H1 |H2 ∼ Nm,d (H2 K−1 K21 , I(m) ⊗ K11·2 ), p(H) = p(H1 |H2 = F◦ ). On the other 22 hand, we have a matrix-variate t distribution, (H1 , H2 ) ∼ tm,d+p (δ − p, 0, αI(m) , I(d+p) ). By Theorem 4.3.9 in [4], we have H1 |H2 ∼ tm,d (δ, 0, αI(m) + H2 H2 , I(d) ) = tm,d (δ, 0, Σ◦ , I(d) ), which implies p(F) = p(H1 |H2 = F◦ ) = p(H). 2 |Y − FG |2 can be efficiently computed in O(dr) time. I 4 The following corollary allows us to compute the posterior distribution of K efficiently. Corollary 4. K|H ∼ IW d+p (δ + m, αI(d+p) + (H, F◦ ) (H, F◦ )). Proof sketch. Because normal distribution and inverse Wishart distribution are conjugate, we can derive the posterior distribution K from Eq. (1). Thus, we can explicitly sample from the conditional posterior distributions, as listed in Algorithm 1 (BSRM/F for BSRM with features) in Appendix. We note that when p = q = 0, Algorithm 1 (BSRM/F) reduces to the exact algorithm for BSRM. Each iteration in this sampling algorithm can be computed in O(d2 r + d3 (m + n) + dpm + dqn) time complexity. 2.4 Unblocking for Sampling Implementation Blocking Gibbs sampling technique is commonly used to improve the sampling efficiency by reducing the sample variance according to the Rao-Blackwell theorem (c.f. [9]). However, blocking Gibbs sampling is not necessary to be computationally efficient. To improve the computational efficiency of Algorithm 1, we use unblocking sampling to reduce the major computational cost is Step 2 and Step 4. We consider sampling each element of F conditionally. The sampling process is written as Step 4 and Step 9 of Algorithm 2, which is called BSRM/F with conditional Gibss sampling. We can reduce the computational cost of each iteration to O(dr + d2 (m + n) + dpm + dqn), which is comparable to other low-rank matrix factorization approaches. Though such a conditional sampling process increases the sample variance comparing to Algorithm 1, we can afford more samples within a given amount of time due to its faster speed. Our experiments show that the overall computational cost of Algorithm 2 is usually less than that of Algorithm 1 when achieving the same accuracy. Additionally, since {f i } are independent, we can parallelize the for loops in Step 4 and Step 9 of Algorithm 2. 3 Related Work SRMs fall into a class of statistical latent-variable relational models that explain relations by latent factors of entities. Recently a number of such models were proposed that can be roughly put into two groups, depending on whether the latent factors are continuous or discrete: (1) Discrete latent-state relational models: a large body of research infers latent classes of entities and explains the entity relationship by the probability conditioned on the joint state of participating entities, e.g., [6, 14, 7, 1]. In another work [10], binary latent factors are modeled; (2) Continuous latent-variable relational models: many such models assume relational data underlain by multiplicative effects between latent variables of entities, e.g. [5]. A simple example is matrix factorization, which recently has become very popular in collaborative filtering applications, e.g., [12, 8, 13]. The latest Bayesian probabilistic matrix factorization [13] reported the state-of-the-art accuracy of matrix factorization on Netflix data. Interestingly, the model turns out to be similar to our Model 3 under the non-informative prior. This paper reveals the equivalence between different models and offers a more general Bayesian framework that allows informative priors from entity features to play a role. The framework also generalizes Gaussian processes [11] to a relational domain, where a nonparametric prior for stochastic relational processes is described. 4 Experiments Synthetic data: We compare BSRM under noninformative priors against two other algorithms: the fast max-margin matrix factorization (fMMMF) in [12] with a square loss, and SRM using variational Bayesian approach (SRM-VB) in [15]. We generate a 30 × 20 random matrix (Figure 5(a)), then add Gaussian noise with σ 2 = 0.1 (Figure 5(b)). The root mean squared noise is 0.32. We select 70% elements as the observed data and use the rest of the elements for testing. The reconstruction matrix and root mean squared errors (RMSEs) of predictions on the test elements are shown in Figure 5(c)-5(e). BSRM outperforms the variational approach of SRMs and fMMMF. Note that because of the log-determinant penalty of the inverse Wishart prior, SRM-VB enforces the rank to be smaller, thus the result of SRM-VB looks smoother than that of BSRM. 5 5 5 5 5 5 10 10 10 10 10 15 15 15 15 15 20 20 20 20 20 25 25 25 25 30 30 2 4 6 8 10 12 14 16 18 20 30 2 4 6 8 10 12 14 16 18 20 25 30 2 4 6 8 10 12 14 16 18 20 30 2 4 6 8 10 12 14 16 18 20 (a) Original Matrix (b) With Noise(0.32) (c) fMMMF (0.27) (d) SRM-VB(0.22) 2 4 6 8 10 12 14 16 18 20 (e) BSRM(0.19) Figure 5: Experiments on synthetic data. RMSEs are shown in parentheses. RMSE MAE User Mean 1.425 1.141 Movie Mean 1.387 1.103 fMMMF [12] 1.186 0.943 VB [8] 1.165 0.915 Table 1: RMSE (root mean squared error) and MAE (mean absolute error) of the experiments on EachMovie data. All standard errors are 0.001 or less. EachMovie data: We test the accuracy and the efficiency of our algorithms on EachMovie. The dataset contains 74, 424 users’ 2, 811, 718 ratings on 1, 648 movies, i.e. about 2.29% are rated by zero-to-five stars. We put all the ratings into a matrix, and randomly select 80% as observed data to predict the remaining ratings. The random selection was carried out 10 times independently. We compare our approach against several competing methods: 1) User Mean, predicting ratings by the sample mean of the same user’s ratings; 2) Move Mean, predicting rating by the sample mean of ratings on the same movie; 3) fMMMF [12]; 4) VB introduced in [8], which is a probabilistic lowrank matrix factorization using variational approximation. Because of the data size, we cannot run the SRM-VB of [15]. We test the algorithms BSRM and BSRM/F, both following Algorithm 2, which run without and with features, respectively. The features used in BSRM/F are generated from the PCA result of the binary indicator matrix that indicates whether the user rates the movie. The 10 top factors of both the user side and the movie side are used as features, i.e. p = 10, q = 10. We run the experiments with different d = 16, 32, 100, 200, 300. The hyper parameters are set to some trivial values, δ = p + 1 = 11, α = β = 1, σ 2 = 1, and ν = 1. The results are shown in Table 1 and 2. We find that the accuracy improves as the number of d is increased. Once d is greater than 100, the further improvement is not very significant within a reasonable amount of running time. rank (d) BSRM RMSE MAE BSRM/F RMSE MAE 16 1.0983 0.8411 1.0952 0.8311 32 1.0924 0.8321 1.0872 0.8280 100 1.0905 0.8335 1.0848 0.8289 200 1.0903 0.8340 1.0846 0.8293 300 1.0902 0.8393 1.0852 0.8292 Table 2: RMSE (root mean squared error) and MAE (mean absolute error) of experiments on EachMovie data. All standard errors are 0.001 or less. RMSE To compare the overall computational efficiency of the two Gibbs sampling procedures, Algorithm 1 and Algorithm 2, we run both algorithms 1.2 and record the running time and accuracy Algorithm 1 Algorithm 2 in RMSE. The dimensionality d is set to 1.18 be 100. We compute the average ZO and burn-in ends evaluate it after a certain number of itera1.16 tions. The evaluation results are shown in Figure 6. We run both algorithms for 100 1.14 burn-in ends iterations as the burn-in period, so that we 1.12 can have an independent start sample. After the burn-in period, we restart to compute 1.1 the averaged ZO and evaluate them, therefore there are abrupt points at 100 iterations 1.08 in both cases. The results show that the 0 1000 2000 3000 4000 5000 6000 7000 8000 overall accuracy of Algorithm 2 is better at Running time (sec) any given time. Figure 6: Time-Accuracy of Algorithm 1 and 2 6 Netflix data: We also test the algorithms on the large collection of user ratings from netflix.com. The dataset consists of 100, 480, 507 ratings from 480, 189 users on 17, 770 movies. In addition, Netflix also provides a set of validation data with 1, 408, 395 ratings. In order to evaluate the prediction accuracy, there is a test set containing 2, 817, 131 ratings whose values are withheld and unknown for all the participants. The features used in BSRM/F are generated from the PCA result of a binary matrix that indicates whether or not the user rated the movie. The top 30 user-side factors are used as features, none of movie-side factors are used, i.e. p = 30, q = 0. The hyper parameters are set to some trivial values, δ = p + 1 = 31, α = β = 1, σ 2 = 1, and ν = 1. The results on the validation data are shown in Table 3. The submitted result of BSRM/F(400) achieves RMSE 0.8881 on the test set. The running time is around 21 minutes per iteration for 400 latent dimensions on an Intel Xeon 2GHz PC. RMSE VB[8] 0.9141 BPMF [13] 0.8920 100 0.8930 BSRM 200 0.8910 400 0.8895 100 0.8926 BSRM/F 200 400 0.8880 0.8874 Table 3: RMSE (root mean squared error) of experiments on Netflix data. 5 Conclusions In this paper, we study the fully Bayesian inference for stochastic relational models (SRMs), for learning the real-valued relation between entities of two sets. We overcome the scalability issue by transforming SRMs into equivalent models, which can be efficiently sampled. The experiments show that the fully Bayesian inference outperforms the previously used variational Bayesian inference on SRMs. In addition, some techniques for efficient computation in this paper can be applied to other large-scale Bayesian inferences, especially for models involving inverse-Wishart distributions. Acknowledgment: We thank the reviewers and Sarah Tyler for constructive comments. References [1] E. Airodi, D. Blei, S. Fienberg, and E. P. Xing. Mixed membership stochastic blockmodels. In Journal of Machine Learning Research, 2008. [2] A. P. Dawid. Some matrix-variate distribution theory: notational considerations and a Bayesian application. Biometrika, 68:265–274, 1981. [3] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall, New York, 1995. [4] A. K. Gupta and D. K. Nagar. Matrix Variate Distributions. Chapman & Hall/CRC, 2000. [5] P. Hoff. Multiplicative latent factor models for description and prediction of social networks. Computational and Mathematical Organization Theory, 2007. [6] T. Hofmann. Latent semantic models for collaborative filtering. ACM Trans. Inf. Syst., 22(1):89–115, 2004. [7] C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI), 2006. [8] Y. J. Lim and Y. W. Teh. Variational Bayesian approach to movie rating prediction. In Proceedings of KDD Cup and Workshop, 2007. [9] J. S. Liu. Monte Carlo Strategies in Scientific Computing. Springer, 2001. [10] E. Meeds, Z. Ghahramani, R. Neal, and S. T. Roweis. Modeling dyadic data with binary latent factors. In Advances in Neural Information Processing Systems 19, 2007. [11] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006. [12] J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In ICML, 2005. 7 [13] R. Salakhutdinov and A. Mnih. Bayeisna probabilistic matrix factorization using Markov chain Monte Carlo. In The 25th International Conference on Machine Learning, 2008. [14] Z. Xu, V. Tresp, K. Yu, and H.-P. Kriegel. Infinite hidden relational models. In Proceedings of the 22nd International Conference on Uncertainty in Artificial Intelligence (UAI), 2006. [15] K. Yu, W. Chu, S. Yu, V. Tresp, and Z. Xu. Stochastic relational models for discriminative link prediction. In Advances in Neural Information Processing Systems 19 (NIPS), 2006. [16] S. Zhu, K. Yu, and Y. Gong. Predictive matrix-variate t models. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, NIPS ’07: Advances in Neural Information Processing Systems 20, pages 1721–1728. MIT Press, Cambridge, MA, 2008. Appendix Before presenting the algorithms, we introduce the necessary notation. Let Ii = {j|(i, j) ∈ I} and Ij = {i|(i, j) ∈ I}. A matrix with subscripts indicates its submatrix, which consists its entries at the given indices in the subscripts, for example, XIj ,j is a subvector of the j-th column of X whose row indices are in set Ij , X·,j is the j-th column of X (· indicates the full set). Xi,j denotes the (i, j)-th 2 entry of X. |X|2 is the squared sum of elements in set I, i.e. (i,j)∈I Xi,j . We fill the unobserved I elements in Y with 0 for simplicity in notation Algorithm 1 BSRM/F: Gibbs sampling for SRM with features 1: Draw K ∼ IW d+p (δ + m, αI(d+p) + (F, F◦ ) (F, F◦ )); 2: For each i ∈ U0 , draw f i ∼ Nd (K(i) (s−2 G Y i,· + K−1 K12 K−1 f ◦ ), K(i) ), 11·2 22 i −1 where K(i) = s−2 (GIi ,· ) GIi ,· + K−1 ; 11·2 3: Draw R ∼ IW d+q (δ + n, βI(d+q) + (G, G◦ ) (G, G◦ )); 4: For each j ∈ V0 , draw gj ∼ Nd (R(j) (s−2 F Y ·,j + R−1 R12 R−1 g◦ ), R(j) ), 11·2 22 j −1 where R(j) = s−2 (FIj ,· ) FIj ,· + R−1 ; 11·2 5: Draw s2 ∼ χ−2 (ν + r, σ 2 + |Y − FG |2 ). I Algorithm 2 BSRM/F: Conditional Gibbs sampling for SRM with features 1: ∆i,j ← Yi,j − k Fi,k Gj,k , for (i, j) ∈ I; 2: Draw Φ ∼ Wd+p (δ + m + d + p − 1, (αI(d+p) + (F, F◦ ) (F, F◦ ))−1 ); 3: for each (i, k) ∈ U0 × {1, · · · , d} do 4: Draw f ∼ N1 (φ−1 (s−2 ∆i,Ii GIi ,k − Fi,· Φ·,k ), φ−1 ), where φ = s−2 (GIi ,k ) GIi ,k + Φk,k ; 5: Update Fi,k ← Fi,k + f , and ∆i,j ← ∆i,j − f Gj,k , for j ∈ Ii ; 6: end for 7: Draw Ψ ∼ Wd+q (δ + n + d + q − 1, (βI(d+q) + (G, G◦ ) (G, G◦ ))−1 ); 8: for each (j, k) ∈ V0 × {1, · · · , d} do 9: Draw g ∼ N1 (ψ −1 (s−2 ∆Ij ,j FIj ,k −Gj,· Ψ·,k ), ψ −1 ), where ψ = s−2 (FIj ,k ) FIj ,k +Ψk,k ; 10: Update Gj,k ← Gj,k + g and ∆i,j ← ∆i,j − gFi,k , for i ∈ Ij ; 11: end for 12: Draw s2 ∼ χ−2 (ν + r, σ 2 + |∆|2 ). I 8
5 0.46037203 168 nips-2008-Online Metric Learning and Fast Similarity Search
Author: Prateek Jain, Brian Kulis, Inderjit S. Dhillon, Kristen Grauman
Abstract: Metric learning algorithms can provide useful distance functions for a variety of domains, and recent work has shown good accuracy for problems where the learner can access all distance constraints at once. However, in many real applications, constraints are only available incrementally, thus necessitating methods that can perform online updates to the learned metric. Existing online algorithms offer bounds on worst-case performance, but typically do not perform well in practice as compared to their offline counterparts. We present a new online metric learning algorithm that updates a learned Mahalanobis metric based on LogDet regularization and gradient descent. We prove theoretical worst-case performance bounds, and empirically compare the proposed method against existing online metric learning algorithms. To further boost the practicality of our approach, we develop an online locality-sensitive hashing scheme which leads to efficient updates to data structures used for fast approximate similarity search. We demonstrate our algorithm on multiple datasets and show that it outperforms relevant baselines.
6 0.43668896 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms
7 0.40849742 206 nips-2008-Sequential effects: Superstition or rational behavior?
8 0.40159625 112 nips-2008-Kernel Measures of Independence for non-iid Data
9 0.37951863 242 nips-2008-Translated Learning: Transfer Learning across Different Feature Spaces
10 0.37763277 57 nips-2008-Deflation Methods for Sparse PCA
11 0.37301007 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs
12 0.35294577 134 nips-2008-Mixed Membership Stochastic Blockmodels
13 0.34356573 7 nips-2008-A computational model of hippocampal function in trace conditioning
14 0.34118223 236 nips-2008-The Mondrian Process
15 0.33603376 90 nips-2008-Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity
16 0.33187845 234 nips-2008-The Infinite Factorial Hidden Markov Model
17 0.31788456 203 nips-2008-Scalable Algorithms for String Kernels with Inexact Matching
18 0.31510264 165 nips-2008-On the Reliability of Clustering Stability in the Large Sample Regime
19 0.31005761 195 nips-2008-Regularized Policy Iteration
20 0.30702841 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions
topicId topicWeight
[(6, 0.051), (7, 0.084), (12, 0.029), (28, 0.165), (57, 0.09), (59, 0.017), (63, 0.026), (71, 0.012), (72, 0.328), (77, 0.043), (78, 0.011), (83, 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.77969164 154 nips-2008-Nonparametric Bayesian Learning of Switching Linear Dynamical Systems
Author: Alan S. Willsky, Erik B. Sudderth, Michael I. Jordan, Emily B. Fox
Abstract: Many nonlinear dynamical phenomena can be effectively modeled by a system that switches among a set of conditionally linear dynamical modes. We consider two such models: the switching linear dynamical system (SLDS) and the switching vector autoregressive (VAR) process. Our nonparametric Bayesian approach utilizes a hierarchical Dirichlet process prior to learn an unknown number of persistent, smooth dynamical modes. We develop a sampling algorithm that combines a truncated approximation to the Dirichlet process with efficient joint sampling of the mode and state sequences. The utility and flexibility of our model are demonstrated on synthetic data, sequences of dancing honey bees, and the IBOVESPA stock index.
2 0.63342875 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
Author: Francis R. Bach
Abstract: For supervised and unsupervised learning, positive definite kernels allow to use large and potentially infinite dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the ℓ1 -norm or the block ℓ1 -norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to non linear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.
3 0.5486415 118 nips-2008-Learning Transformational Invariants from Natural Movies
Author: Charles Cadieu, Bruno A. Olshausen
Abstract: We describe a hierarchical, probabilistic model that learns to extract complex motion from movies of the natural environment. The model consists of two hidden layers: the first layer produces a sparse representation of the image that is expressed in terms of local amplitude and phase variables. The second layer learns the higher-order structure among the time-varying phase variables. After training on natural movies, the top layer units discover the structure of phase-shifts within the first layer. We show that the top layer units encode transformational invariants: they are selective for the speed and direction of a moving pattern, but are invariant to its spatial structure (orientation/spatial-frequency). The diversity of units in both the intermediate and top layers of the model provides a set of testable predictions for representations that might be found in V1 and MT. In addition, the model demonstrates how feedback from higher levels can influence representations at lower levels as a by-product of inference in a graphical model. 1
4 0.54626036 200 nips-2008-Robust Kernel Principal Component Analysis
Author: Minh H. Nguyen, Fernando Torre
Abstract: Kernel Principal Component Analysis (KPCA) is a popular generalization of linear PCA that allows non-linear feature extraction. In KPCA, data in the input space is mapped to higher (usually) dimensional feature space where the data can be linearly modeled. The feature space is typically induced implicitly by a kernel function, and linear PCA in the feature space is performed via the kernel trick. However, due to the implicitness of the feature space, some extensions of PCA such as robust PCA cannot be directly generalized to KPCA. This paper presents a technique to overcome this problem, and extends it to a unified framework for treating noise, missing data, and outliers in KPCA. Our method is based on a novel cost function to perform inference in KPCA. Extensive experiments, in both synthetic and real data, show that our algorithm outperforms existing methods. 1
5 0.5461244 138 nips-2008-Modeling human function learning with Gaussian processes
Author: Thomas L. Griffiths, Chris Lucas, Joseph Williams, Michael L. Kalish
Abstract: Accounts of how people learn functional relationships between continuous variables have tended to focus on two possibilities: that people are estimating explicit functions, or that they are performing associative learning supported by similarity. We provide a rational analysis of function learning, drawing on work on regression in machine learning and statistics. Using the equivalence of Bayesian linear regression and Gaussian processes, we show that learning explicit rules and using similarity can be seen as two views of one solution to this problem. We use this insight to define a Gaussian process model of human function learning that combines the strengths of both approaches. 1
6 0.54363328 197 nips-2008-Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation
7 0.54360616 62 nips-2008-Differentiable Sparse Coding
8 0.54322255 66 nips-2008-Dynamic visual attention: searching for coding length increments
9 0.54320711 71 nips-2008-Efficient Sampling for Gaussian Process Inference using Control Variables
10 0.54248649 192 nips-2008-Reducing statistical dependencies in natural signals using radial Gaussianization
11 0.54217637 4 nips-2008-A Scalable Hierarchical Distributed Language Model
12 0.5414412 31 nips-2008-Bayesian Exponential Family PCA
13 0.54014254 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
14 0.53944862 129 nips-2008-MAS: a multiplicative approximation scheme for probabilistic inference
15 0.53764778 176 nips-2008-Partially Observed Maximum Entropy Discrimination Markov Networks
16 0.53746074 221 nips-2008-Stochastic Relational Models for Large-scale Dyadic Data using MCMC
17 0.53660244 235 nips-2008-The Infinite Hierarchical Factor Regression Model
18 0.53657633 231 nips-2008-Temporal Dynamics of Cognitive Control
19 0.53648609 30 nips-2008-Bayesian Experimental Design of Magnetic Resonance Imaging Sequences
20 0.53640294 219 nips-2008-Spectral Hashing