nips nips2007 nips2007-156 knowledge-graph by maker-knowledge-mining

156 nips-2007-Predictive Matrix-Variate t Models


Source: pdf

Author: Shenghuo Zhu, Kai Yu, Yihong Gong

Abstract: It is becoming increasingly important to learn from a partially-observed random matrix and predict its missing elements. We assume that the entire matrix is a single sample drawn from a matrix-variate t distribution and suggest a matrix-variate t model (MVTM) to predict those missing elements. We show that MVTM generalizes a range of known probabilistic models, and automatically performs model selection to encourage sparse predictive models. Due to the non-conjugacy of its prior, it is difficult to make predictions by computing the mode or mean of the posterior distribution. We suggest an optimization method that sequentially minimizes a convex upper-bound of the log-likelihood, which is very efficient and scalable. The experiments on a toy dataset and the EachMovie dataset show good predictive accuracy of the model. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract It is becoming increasingly important to learn from a partially-observed random matrix and predict its missing elements. [sent-6, score-0.174]

2 We assume that the entire matrix is a single sample drawn from a matrix-variate t distribution and suggest a matrixvariate t model (MVTM) to predict those missing elements. [sent-7, score-0.286]

3 We show that MVTM generalizes a range of known probabilistic models, and automatically performs model selection to encourage sparse predictive models. [sent-8, score-0.167]

4 Due to the non-conjugacy of its prior, it is difficult to make predictions by computing the mode or mean of the posterior distribution. [sent-9, score-0.216]

5 We suggest an optimization method that sequentially minimizes a convex upper-bound of the log-likelihood, which is very efficient and scalable. [sent-10, score-0.139]

6 , singular value decomposition (SVD), have been widely used in various data analysis applications. [sent-14, score-0.075]

7 An important class of applications is to predict missing elements given a partially observed random matrix. [sent-15, score-0.111]

8 For example, putting ratings of users into a matrix form, the goal of collaborative filtering is to predict those unseen ratings in the matrix. [sent-16, score-0.404]

9 To predict unobserved elements in matrices, the structures of the matrices play an important role, for example, the similarity between columns and between rows. [sent-17, score-0.172]

10 Such structures imply that elements in a random matrix are no longer independent and identically-distributed (i.i.d.). [sent-18, score-0.139]

11 In this paper, we model the random matrix of interest as a single sample drawn from a matrix-variate t distribution, which is a generalization of the Student-t distribution. [sent-26, score-0.153]

12 We call the predictive model under such a prior the matrix-variate t model (MVTM). [sent-27, score-0.112]

13 First, it continues the line of gradual generalizations across several known probabilistic models on random matrices, namely, from probabilistic principal component analysis (PPCA) [11], to Gaussian process latent-variable models (GPLVMs) [7], and to multi-task Gaussian processes (MTGPs) [13]. [sent-29, score-0.132]

14 From a Bayesian modeling point of view, the marginalization of hyper-parameters amounts to automatic model selection and usually leads to better generalization performance [8]. Second, the model selection by MVTMs explicitly encourages simpler predictive models that have lower ranks. [sent-31, score-0.219]

15 This property allows one to generalize distributions for finite matrices to infinite stochastic processes. [sent-33, score-0.099]

16 Figure 1: Models for matrix prediction (panels (a)-(d)). [sent-34, score-0.101]

17 (b) and (c) are two normal-inverse-Wishart models, equivalent to MVTM when the covariance variable S (or R) is marginalized. [sent-36, score-0.111]

18 (d) MTGP, which requires optimizing the covariance variable S. [sent-37, score-0.111]

19 It is thus difficult to make predictions by computing the mode or mean of the posterior distribution. [sent-40, score-0.216]

20 We suggest an optimization method that sequentially minimizes a convex upper-bound of the log-likelihood, which is highly efficient and scalable. [sent-41, score-0.139]

21 In the experiments, the algorithm shows very good efficiency and excellent prediction accuracy. [sent-42, score-0.097]

22 Let Y be a p × m observational matrix and T be the underlying p × m noise-free random matrix. [sent-51, score-0.101]

23 Probabilistic Principal Component Analysis (PPCA) [11] assumes that yj , the j-th column vector of Y, can be generated from a latent vector vj in a k-dimensional linear space (k < p). [sent-54, score-0.172]

24 The model is defined as yj = Wvj + µ + εj and vj ∼ Nk(vj; 0, Ik), where εj ∼ Np(εj; 0, σ²Ip), and W is a p × k loading matrix. [sent-55, score-0.172]

25 By integrating out vj, we obtain the marginal distribution yj ∼ Np(yj; µ, WW⊤ + σ²Ip). [sent-56, score-0.285]
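
To make this marginalization concrete, the following sketch (a hypothetical NumPy example; the dimensions, noise level, and sample size are arbitrary choices, not taken from the paper) samples columns from the PPCA generative model and checks that their empirical covariance approaches WW⊤ + σ²Ip.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, n = 5, 2, 200_000        # observed dim, latent dim, number of columns
sigma = 0.1
W = rng.normal(size=(p, k))    # loading matrix
mu = rng.normal(size=p)        # mean vector

# Generative model: y_j = W v_j + mu + eps_j
V = rng.normal(size=(k, n))                # v_j ~ N_k(0, I_k)
E = sigma * rng.normal(size=(p, n))        # eps_j ~ N_p(0, sigma^2 I_p)
Y = W @ V + mu[:, None] + E

# The marginal covariance of y_j should approach W W^T + sigma^2 I_p
empirical = np.cov(Y)
theoretical = W @ W.T + sigma**2 * np.eye(p)
print(np.max(np.abs(empirical - theoretical)))   # small for large n
```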

26 It considers the same linear relationship from latent representation vj to observations yj . [sent-60, score-0.172]

27 Instead of treating vj as random variables, GPLVM assigns a prior on W and treats {vj} as parameters: yj = Wvj + εj, and W ∼ Np,k(W; 0, Ip, Ik), where the elements of W are independent Gaussian random variables. [sent-61, score-0.251]

28 sample from a Gaussian process prior with the covariance VV⊤ + σ²Im and V = [v1, . [sent-65, score-0.152]

29 From a matrix modeling point of view, GPLVM estimates the covariance between the rows and assumes the columns to be conditionally independent. [sent-71, score-0.243]

30 Multi-task Gaussian Process (MTGP) [13] is a multi-task learning model where each column of Y is a predictive function of one task, sampled from a Gaussian process prior: yj = tj + εj, and tj ∼ Np(0, S), where εj ∼ Np(0, σ²Ip). [sent-72, score-0.199]

31 It introduces a hierarchical model where an inverse-Wishart prior is added for the covariance: Yi,j = Ti,j + εi,j, T ∼ Np,m(T; 0, S, Im), S ∼ IWp(S; ν, Ip). MTGP utilizes the inverse-Wishart prior as the regularization and obtains a maximum a posteriori (MAP) estimate of S. [sent-73, score-0.082]

32 PPCA models the row covariance of Y, GPLVM models the column covariance, and MTGP assigns a hyper prior to prevent over-fitting when estimating the (row) covariance. [sent-76, score-0.214]

33 From a matrix modeling point of view, capturing the dependence structure of Y by its row or column covariance is a matter of choices, which are not fundamentally different. [sent-77, score-0.212]

34 We thus extend the MTGP model in two directions: (1) assume T ∼ Np,m(T; 0, S, Im), so that the matrix has covariances on both sides; (2) marginalize the covariance S on one side (see Figure 1(b)). [sent-81, score-0.196]

35 Then we have a marginal distribution of T, Pr(T) = ∫ Np,m(T; 0, S, Im) IWp(S; ν, Ip) dS = tp,m(T; ν, 0, Ip, Im), (1) which is a matrix-variate t distribution. [sent-82, score-0.075]
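
The two-stage generative view behind Eq. (1) can be simulated directly. The sketch below assumes SciPy's invwishart and matrix_normal distributions; the mapping between the paper's degree-of-freedom ν and SciPy's df parameter is a guess (inverse-Wishart conventions differ), so treat this as an illustration of the marginalization rather than a faithful reproduction.

```python
import numpy as np
from scipy.stats import invwishart, matrix_normal

rng = np.random.default_rng(1)
p, m, nu = 4, 3, 6.0
df = nu + p - 1   # assumed correspondence to IW_p(S; nu, I_p); conventions vary

def sample_T():
    # Stage 1: draw a row covariance S from the inverse-Wishart prior
    S = invwishart.rvs(df=df, scale=np.eye(p), random_state=rng)
    # Stage 2: draw T | S from the matrix normal N_{p,m}(0, S, I_m)
    return matrix_normal.rvs(mean=np.zeros((p, m)), rowcov=S,
                             colcov=np.eye(m), random_state=rng)

# Marginally (S integrated out), T follows a heavy-tailed matrix-variate t law.
samples = np.stack([sample_T() for _ in range(2000)])
print(samples.mean(), samples.std())
```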

36 One important property of matrix-variate t distribution is that the marginal distribution of its sub-matrix still follows a matrix-variate t distribution with the same degree of freedom (see Section 3. [sent-86, score-0.141]

37 Interestingly, the same matrix-variate t distribution can be equivalently derived by putting another hierarchical generative process on the covariance R, as described in Figure 1(c), where R follows an inverse-Wishart distribution. [sent-92, score-0.175]

38 In other words, by integrating out the covariance on either side, we obtain the same model. [sent-93, score-0.149]

39 The log-determinant term encourages T to be sparse in the sense of having lower rank. [sent-97, score-0.151]

40 This property has been used as the heuristic for minimizing the rank of the matrix in [3]. [sent-98, score-0.157]
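
The connection between this penalty and the singular values can be checked numerically: ln|Ip + TT⊤| equals the sum of ln(1 + σi²) over the singular values σi of T, so each singular value is penalized only logarithmically, in contrast with the linear growth of the trace (nuclear) norm. The snippet below is a small NumPy verification, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
T = rng.normal(size=(30, 20))
s = np.linalg.svd(T, compute_uv=False)          # singular values of T

logdet = np.linalg.slogdet(np.eye(30) + T @ T.T)[1]
via_svd = np.sum(np.log1p(s ** 2))              # sum_i ln(1 + sigma_i^2)
print(logdet, via_svd)                          # identical up to floating point

# Trace (nuclear) norm for comparison: grows linearly in each singular value
print(s.sum())
```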

41 Usually we just set these to the defaults. (GPLVM offers the advantage of using a nonlinear covariance function based on attributes.) [sent-102, score-0.111]

42 For the mean matrix M, in our experiments we simply use the sample average of all observed elements. [sent-105, score-0.145]

43 For some tasks, when we have prior knowledge about the covariance between columns or between rows, we can use those covariance matrices in place of Im or Ip. [sent-106, score-0.363]

44 3 Prediction Methods When the evaluation of the prediction is the sum of individual losses, the optimal prediction is to find the individual mode of the marginal posterior distribution, i. [sent-107, score-0.408]

45 One way to make predictions is to compute the mode of the joint posterior distribution of T, i. [sent-112, score-0.302]

46 the prediction problem is T̂ = arg maxT {ln Pr(YI | T) + ln Pr(T)}. [sent-114, score-0.287]

47 An alternative way is to use the individual mean of the posterior distribution to approximate the individual mode. [sent-118, score-0.125]

48 Since the collection of individual means happens to be the mean of the joint distribution, we only need to compute the joint posterior distribution. [sent-119, score-0.136]

49 The problem of prediction by means is written as T̂ = E(T | YI). [sent-120, score-0.097]

50 From our experiments, the prediction by means usually outperforms the prediction by modes. [sent-125, score-0.231]

51 Before discussing the prediction methods, we introduce a few useful properties in Section 3. [sent-126, score-0.097]

52 1 and suggest an optimization method as the efficient tool for prediction in Section 3. [sent-127, score-0.161]

53 We encounter log-determinant terms in computing the mode or mean estimates. [sent-145, score-0.168]

54 The following theorem provides a quadratic upper bound for the log-determinant terms, which makes it possible to apply the optimization method in Section 3. [sent-146, score-0.099]

55 If X is a p × p positive definite matrix, it holds that ln |X| ≤ tr(X) − p. [sent-149, score-0.258]
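
A quick numerical check of the bound (an illustrative NumPy snippet, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 6
for _ in range(5):
    A = rng.normal(size=(p, p))
    X = A @ A.T + 1e-3 * np.eye(p)        # random positive definite matrix
    lhs = np.linalg.slogdet(X)[1]         # ln |X|
    rhs = np.trace(X) - p
    assert lhs <= rhs + 1e-10

# Equality at X = I_p: ln|I| = 0 and tr(I) - p = 0
print(np.isclose(np.linalg.slogdet(np.eye(p))[1], np.trace(np.eye(p)) - p))
```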

56 The equality holds when X is an orthonormal matrix. [sent-150, score-0.113]

57 Therefore, when X is an orthonormal matrix (especially X = Ip ), the equality holds. [sent-156, score-0.182]

58 Also it holds that ∂h(T; T0, Σ, Ω)/∂T |T=T0 = 2(Σ + T0Ω⁻¹T0⊤)⁻¹T0Ω⁻¹ = ∂ ln |Σ + TΩ⁻¹T⊤| / ∂T |T=T0. [sent-159, score-0.222]
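
This identity can be verified with finite differences; the snippet below is a sanity-check sketch (all matrices are random and the symbol names are ours, not taken from the paper's code).

```python
import numpy as np

rng = np.random.default_rng(4)
p, m = 4, 3
A = rng.normal(size=(p, p)); Sigma = A @ A.T + np.eye(p)   # p x p positive definite
B = rng.normal(size=(m, m)); Omega = B @ B.T + np.eye(m)   # m x m positive definite
Oi = np.linalg.inv(Omega)
T0 = rng.normal(size=(p, m))

def logdet_term(T):
    return np.linalg.slogdet(Sigma + T @ Oi @ T.T)[1]       # ln |Sigma + T Omega^-1 T^T|

# Closed-form gradient from the identity: 2 (Sigma + T0 Omega^-1 T0^T)^-1 T0 Omega^-1
grad = 2 * np.linalg.inv(Sigma + T0 @ Oi @ T0.T) @ T0 @ Oi

# Central finite-difference approximation
eps = 1e-6
num = np.zeros_like(T0)
for i in range(p):
    for j in range(m):
        D = np.zeros_like(T0); D[i, j] = eps
        num[i, j] = (logdet_term(T0 + D) - logdet_term(T0 - D)) / (2 * eps)
print(np.max(np.abs(grad - num)))   # should be tiny
```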

59 Actually h(·) is a quadratic convex function with respect to T, as (Σ + T0Ω⁻¹T0⊤)⁻¹ and Ω⁻¹ are positive definite matrices. [sent-162, score-0.097]

60 2 Optimization Method Once the objective is given, the prediction becomes an optimization problem. [sent-164, score-0.134]

61 For a fixed T′, Q(T; T′) is quadratic and convex with respect to T. [sent-171, score-0.097]

62 Since Q(T; T0) is quadratic with respect to T, we can apply the Newton-Raphson method to minimize Q(T; T0). [sent-174, score-0.094]

63 After we find a Ti , we repeat the procedure to find a Ti+1 so that J (Ti+1 ) < J (Ti ), unless Ti is a local minimum or saddle point of J . [sent-178, score-0.078]

64 Repeating this procedure, Ti converges to a local minimum or saddle point of J, as long as T0 is not a local maximum. [sent-179, score-0.078]

65 (2), the goal is to minimize the objective function J(T) := ℓ(T) + ((ν+m+p−1)/2) ln |Ip + TT⊤|, (12) where ℓ(T) := − ln Pr(YI) = (1/2σ²) Σ_{(i,j)∈I} (Tij − Yij)² + const. [sent-182, score-0.768]
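
In code, the objective of Eq. (12) takes the following form; the function below is a sketch (the observation mask and argument names are ours, not the paper's) with ℓ(T) written as the masked squared loss.

```python
import numpy as np

def mvtm_mode_objective(T, Y, mask, nu, sigma2):
    """J(T) = (1 / (2 sigma^2)) * sum over observed (i,j) of (T_ij - Y_ij)^2
              + ((nu + m + p - 1) / 2) * ln|I_p + T T^T|      (Eq. 12, up to a constant)."""
    p, m = T.shape
    loss = 0.5 / sigma2 * np.sum((mask * (T - Y)) ** 2)
    penalty = 0.5 * (nu + m + p - 1) * np.linalg.slogdet(np.eye(p) + T @ T.T)[1]
    return loss + penalty
```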

66 Here, we introduce an auxiliary function, Q(T; T′) := ℓ(T) + h(T; T′, Ip, Im) + h0(T′, Ip, Im). [sent-184, score-0.178]

67 Because ℓ and h are quadratic and convex, Q is quadratic and convex as well. [sent-186, score-0.159]

68 Therefore, we consider a low-rank approximation, using UV⊤ to approximate T, where U is a p × k matrix and V is an m × k matrix. [sent-191, score-0.157]
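
A minimal sketch of this low-rank estimation is given below. It substitutes T = UV⊤ into the objective above and runs plain gradient descent; the paper instead minimizes the convex quadratic upper bound Q with Newton-Raphson steps, so this is only an assumption-laden stand-in (learning rate, rank, and hyperparameters are arbitrary). On the toy setting one would call fit_low_rank_mvtm(Y, mask) on a partially observed matrix and compare the returned completion with the held-out entries.

```python
import numpy as np

def fit_low_rank_mvtm(Y, mask, k=5, nu=5.0, sigma2=0.25, lr=1e-3, iters=2000, seed=0):
    """Gradient-descent sketch for min over U, V of J(U V^T); not the paper's bound minimization."""
    rng = np.random.default_rng(seed)
    p, m = Y.shape
    U = 0.1 * rng.normal(size=(p, k))
    V = 0.1 * rng.normal(size=(m, k))
    c = 0.5 * (nu + m + p - 1)
    for _ in range(iters):
        T = U @ V.T
        # dJ/dT: masked squared-loss gradient plus the log-determinant gradient
        dT = mask * (T - Y) / sigma2 \
             + 2 * c * np.linalg.solve(np.eye(p) + T @ T.T, T)
        dU, dV = dT @ V, dT.T @ U        # chain rule for T = U V^T
        U -= lr * dU
        V -= lr * dV
    return U @ V.T
```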

69 This result can be considered as the SVD of an incomplete matrix using matrix-variate t regularization. [sent-196, score-0.101]

70 4 Variational Mean Prediction As the difficulty in explicitly computing the posterior distribution of T, we take a variational approach to approximate its posterior distribution by a matrix-variate t distribution via an expanded model. [sent-199, score-0.195]

71 We expand the model by adding matrix variates Θ, Φ and Ψ with distributions as in Eq. [sent-200, score-0.216]

72 (5); since this is the same as the prior of T, we can derive the original model by marginalizing out Θ, Φ and Ψ. [sent-203, score-0.076]

73 However, instead of integrating out Θ, Φ and Ψ, we use them as the parameters to approximate T’s posterior distribution. [sent-204, score-0.086]

74 Therefore, the estimation of the parameters is to minimize − ln Pr(YI, Θ, Φ, Ψ) = − ln Pr(Θ, Φ, Ψ) − ln ∫ Pr(T | Θ, Φ, Ψ) Pr(YI | T) dT (14) over Θ, Φ and Ψ. [sent-205, score-0.602]

75 (14) can be written as − ln Pr(Θ, Φ, Ψ) = − ln Pr(Θ) − ln Pr(Φ|Θ) − ln Pr(Ψ|Θ, Φ) = ((ν+q+n+p+m−1)/2) ln |Iq + ΘΘ⊤| + ((ν+q+n+m−1)/2) ln |Im + Φ⊤AΦ| + ((ν+q+n+p−1)/2) ln |Ip + ΨBΨ⊤| + const. (15) [sent-207, score-1.33]

76 (16) Because − ln Pr(YI | T) is quadratic with respect to T, we only need the integral in terms of the mean and variance of Tij under Pr(T | Θ, Φ, Ψ), which is given by Eq. [sent-210, score-0.296]

77 Because of this, the prediction by means usually outperforms the prediction by modes. [sent-213, score-0.231]

78 We reparameterize J by U := ΨB^(1/2), V := Φ⊤A^(1/2), and S := Θ. [sent-219, score-0.534]

79 4 Related work Maximum Margin Matrix Factorization (MMMF) [9] is not in the framework of stochastic matrix analysis, but there are some similarities between MMMF and our mode estimation in Section 3. [sent-224, score-0.255]

80 Using the trace norm of the matrix as regularization, MMMF overcomes the over-fitting problem in factorizing a matrix with missing values. [sent-226, score-0.241]

81 From the regularization viewpoint, the prediction by mode of MVTM uses log-determinants as the regularization term in Eq. [sent-227, score-0.221]

82 Stochastic Relational Models (SRMs) [12] extend MTGPs by estimating the covariance matrices for each side. [sent-230, score-0.18]

83 The covariance functions are required to be estimated from observations. [sent-231, score-0.111]

84 Written in a matrix format, RPP is T ∼ Np,m(T; µ1, WW⊤, U), U = diag{ui}, ui ∼ IG(ui | ν/2, ν/2), where IG is the inverse Gamma distribution. [sent-238, score-0.135]

85 Though RPP unties the scale factors between feature vectors, which could make the estimation more robust, it does not integrate out the covariance matrix, as we do in MVTM. [sent-239, score-0.111]

86 Also, RPP results in different models depending on which side we assume to be independent; therefore it is not suitable for matrix prediction. [sent-241, score-0.132]

87 (f) MVTM mode (0. [sent-246, score-0.124]

88 Synthetic data: We generate a 30 × 20 matrix (Fig. [sent-251, score-0.176]

89 The reconstructed matrices and root mean squared errors of prediction on the unobserved elements (compared to the original matrix) are shown in Figure 2(c)-2(g), respectively. (Figure 3: singular values of recovered matrices.) [sent-263, score-0.306]

90 To verify this, we depict the singular values of the MMMF method and two MVTM prediction methods in Figure 3. [sent-267, score-0.172]

91 There are only two significant singular values. [sent-268, score-0.147]

92 The singular values of the mode estimation decrease faster than the MMMF ones at the beginning, but more slowly after a threshold. [sent-284, score-0.199]

93 The dataset contains 74,424 users’ 2,811,718 ratings on 1,648 movies, i. [sent-287, score-0.081]

94 We put all ratings into a matrix, and randomly select 80% as observed data to predict the remaining ratings. [sent-291, score-0.115]

95 We compare our approach with four other approaches: 1) USER MEAN, predicting a rating by the sample mean of the same user’s ratings; 2) MOVIE MEAN, predicting a rating by the sample mean of users’ ratings of the same movie; 3) MMMF [9]; 4) PPCA [11]. [sent-293, score-0.221]
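
For reference, the two mean baselines and the RMSE/MAE metrics can be written as below; this is an illustrative sketch for a users-by-movies rating matrix with a 0/1 observation mask, not the EachMovie preprocessing actually used.

```python
import numpy as np

def baselines(Y, mask):
    """USER MEAN and MOVIE MEAN predictions; Y is users x movies, mask is 1 where observed."""
    row_cnt = np.maximum(mask.sum(axis=1, keepdims=True), 1)
    col_cnt = np.maximum(mask.sum(axis=0, keepdims=True), 1)
    user_mean = (Y * mask).sum(axis=1, keepdims=True) / row_cnt    # per-user average rating
    movie_mean = (Y * mask).sum(axis=0, keepdims=True) / col_cnt   # per-movie average rating
    return np.broadcast_to(user_mean, Y.shape), np.broadcast_to(movie_mean, Y.shape)

def rmse_mae(pred, Y, test_mask):
    err = (pred - Y)[test_mask.astype(bool)]
    return float(np.sqrt(np.mean(err ** 2))), float(np.mean(np.abs(err)))
```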

96 6 Conclusions In this paper we introduce matrix-variate t models for matrix prediction. [sent-298, score-0.132]

97 The entire matrix is modeled as a sample drawn from a matrix-variate t distribution. [sent-299, score-0.101]

98 The implicit model selection of the MVTM encourages sparse models with lower ranks. [sent-301, score-0.111]

99 To minimize the log-likelihood with log-determinant terms, we propose an optimization method by sequentially minimizing its convex quadratic upper bound. [sent-302, score-0.206]

100 Log-det heuristic for matrix rank minimization with applications to Hankel and Euclidean distance matrices. [sent-321, score-0.157]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mvtm', 0.519), ('ppca', 0.268), ('ip', 0.264), ('im', 0.231), ('mtgp', 0.208), ('mmmf', 0.206), ('pr', 0.2), ('ln', 0.19), ('def', 0.178), ('rpp', 0.13), ('gplvm', 0.124), ('mode', 0.124), ('covariance', 0.111), ('matrix', 0.101), ('vj', 0.098), ('prediction', 0.097), ('iq', 0.09), ('ratings', 0.081), ('mvtms', 0.078), ('srm', 0.078), ('ww', 0.078), ('singular', 0.075), ('yj', 0.074), ('eachmovie', 0.073), ('predictive', 0.071), ('matrices', 0.069), ('tij', 0.068), ('ti', 0.066), ('quadratic', 0.062), ('np', 0.059), ('rank', 0.056), ('equality', 0.052), ('inversewishart', 0.052), ('matrixvariate', 0.052), ('mtgps', 0.052), ('usv', 0.052), ('variate', 0.052), ('vv', 0.052), ('wvj', 0.052), ('saddle', 0.052), ('movie', 0.05), ('encourages', 0.05), ('posterior', 0.048), ('yi', 0.047), ('yu', 0.045), ('mean', 0.044), ('marginal', 0.042), ('collaborative', 0.042), ('ig', 0.041), ('skipped', 0.041), ('prior', 0.041), ('sequentially', 0.04), ('missing', 0.039), ('mae', 0.039), ('integrating', 0.038), ('elements', 0.038), ('usually', 0.037), ('optimization', 0.037), ('iw', 0.036), ('tr', 0.036), ('convex', 0.035), ('marginalizing', 0.035), ('probabilistic', 0.035), ('svd', 0.035), ('viewpoint', 0.035), ('users', 0.034), ('predict', 0.034), ('ui', 0.034), ('format', 0.033), ('uv', 0.033), ('rhs', 0.033), ('distribution', 0.033), ('holds', 0.032), ('minimize', 0.032), ('models', 0.031), ('encourage', 0.031), ('putting', 0.031), ('tresp', 0.031), ('columns', 0.031), ('sparse', 0.03), ('stochastic', 0.03), ('expand', 0.03), ('marginalize', 0.03), ('rmse', 0.03), ('marginalization', 0.03), ('gaussian', 0.029), ('orthonormal', 0.029), ('covariances', 0.029), ('relational', 0.028), ('chapman', 0.028), ('multivariate', 0.028), ('user', 0.028), ('nite', 0.027), ('tj', 0.027), ('suggest', 0.027), ('rating', 0.026), ('sides', 0.026), ('root', 0.026), ('minimum', 0.026), ('principal', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999958 156 nips-2007-Predictive Matrix-Variate t Models

Author: Shenghuo Zhu, Kai Yu, Yihong Gong

Abstract: It is becoming increasingly important to learn from a partially-observed random matrix and predict its missing elements. We assume that the entire matrix is a single sample drawn from a matrix-variate t distribution and suggest a matrixvariate t model (MVTM) to predict those missing elements. We show that MVTM generalizes a range of known probabilistic models, and automatically performs model selection to encourage sparse predictive models. Due to the non-conjugacy of its prior, it is difficult to make predictions by computing the mode or mean of the posterior distribution. We suggest an optimization method that sequentially minimizes a convex upper-bound of the log-likelihood, which is very efficient and scalable. The experiments on a toy data and EachMovie dataset show a good predictive accuracy of the model. 1

2 0.14440607 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

Author: Kai Yu, Wei Chu

Abstract: This paper aims to model relational data on edges of networks. We describe appropriate Gaussian Processes (GPs) for directed, undirected, and bipartite networks. The inter-dependencies of edges can be effectively modeled by adapting the GP hyper-parameters. The framework suggests an intimate connection between link prediction and transfer learning, which were traditionally two separate research topics. We develop an efficient learning algorithm that can handle a large number of observations. The experimental results on several real-world data sets verify superior learning capacity. 1

3 0.11102989 135 nips-2007-Multi-task Gaussian Process Prediction

Author: Edwin V. Bonilla, Kian M. Chai, Christopher Williams

Abstract: In this paper we investigate multi-task learning in the context of Gaussian Processes (GP). We propose a model that learns a shared covariance function on input-dependent features and a “free-form” covariance matrix over tasks. This allows for good flexibility when modelling inter-task dependencies while avoiding the need for large amounts of data for training. We show that under the assumption of noise-free observations and a block design, predictions for a given task only depend on its target values and therefore a cancellation of inter-task transfer occurs. We evaluate the benefits of our model on two practical applications: a compiler performance prediction problem and an exam score prediction task. Additionally, we make use of GP approximations and properties of our model in order to provide scalability to large data sets. 1

4 0.10933597 41 nips-2007-COFI RANK - Maximum Margin Matrix Factorization for Collaborative Ranking

Author: Markus Weimer, Alexandros Karatzoglou, Quoc V. Le, Alex J. Smola

Abstract: In this paper, we consider collaborative filtering as a ranking problem. We present a method which uses Maximum Margin Matrix Factorization and optimizes ranking instead of rating. We employ structured output prediction to optimize directly for ranking scores. Experimental results show that our method gives very good ranking scores and scales well on collaborative filtering tasks. 1

5 0.093301013 96 nips-2007-Heterogeneous Component Analysis

Author: Shigeyuki Oba, Motoaki Kawanabe, Klaus-Robert Müller, Shin Ishii

Abstract: In bioinformatics it is often desirable to combine data from various measurement sources and thus structured feature vectors are to be analyzed that possess different intrinsic blocking characteristics (e.g., different patterns of missing values, observation noise levels, effective intrinsic dimensionalities). We propose a new machine learning tool, heterogeneous component analysis (HCA), for feature extraction in order to better understand the factors that underlie such complex structured heterogeneous data. HCA is a linear block-wise sparse Bayesian PCA based not only on a probabilistic model with block-wise residual variance terms but also on a Bayesian treatment of a block-wise sparse factor-loading matrix. We study various algorithms that implement our HCA concept extracting sparse heterogeneous structure by obtaining common components for the blocks and specific components within each block. Simulations on toy and bioinformatics data underline the usefulness of the proposed structured matrix factorization concept. 1

6 0.093204573 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis

7 0.081166178 97 nips-2007-Hidden Common Cause Relations in Relational Learning

8 0.076410271 158 nips-2007-Probabilistic Matrix Factorization

9 0.066742957 213 nips-2007-Variational Inference for Diffusion Processes

10 0.065242946 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model

11 0.063038252 8 nips-2007-A New View of Automatic Relevance Determination

12 0.059097007 211 nips-2007-Unsupervised Feature Selection for Accurate Recommendation of High-Dimensional Image Data

13 0.058535293 69 nips-2007-Discriminative Batch Mode Active Learning

14 0.058534045 12 nips-2007-A Spectral Regularization Framework for Multi-Task Structure Learning

15 0.057185486 187 nips-2007-Structured Learning with Approximate Inference

16 0.055849351 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering

17 0.05373596 206 nips-2007-Topmoumoute Online Natural Gradient Algorithm

18 0.05370931 4 nips-2007-A Constraint Generation Approach to Learning Stable Linear Dynamical Systems

19 0.051720783 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

20 0.051262517 104 nips-2007-Inferring Neural Firing Rates from Spike Trains Using Gaussian Processes


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.2), (1, 0.033), (2, -0.059), (3, 0.048), (4, 0.004), (5, -0.006), (6, -0.083), (7, -0.089), (8, -0.095), (9, -0.077), (10, -0.051), (11, -0.061), (12, 0.099), (13, 0.077), (14, -0.034), (15, -0.026), (16, 0.0), (17, 0.005), (18, 0.036), (19, -0.064), (20, -0.039), (21, 0.071), (22, -0.088), (23, -0.008), (24, -0.072), (25, 0.21), (26, -0.032), (27, -0.106), (28, -0.062), (29, -0.029), (30, -0.023), (31, -0.046), (32, 0.002), (33, -0.04), (34, -0.072), (35, -0.051), (36, 0.113), (37, 0.006), (38, 0.007), (39, -0.054), (40, -0.054), (41, -0.089), (42, 0.065), (43, 0.206), (44, -0.079), (45, -0.096), (46, -0.082), (47, -0.039), (48, 0.003), (49, -0.131)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93461454 156 nips-2007-Predictive Matrix-Variate t Models

Author: Shenghuo Zhu, Kai Yu, Yihong Gong

Abstract: It is becoming increasingly important to learn from a partially-observed random matrix and predict its missing elements. We assume that the entire matrix is a single sample drawn from a matrix-variate t distribution and suggest a matrixvariate t model (MVTM) to predict those missing elements. We show that MVTM generalizes a range of known probabilistic models, and automatically performs model selection to encourage sparse predictive models. Due to the non-conjugacy of its prior, it is difficult to make predictions by computing the mode or mean of the posterior distribution. We suggest an optimization method that sequentially minimizes a convex upper-bound of the log-likelihood, which is very efficient and scalable. The experiments on a toy data and EachMovie dataset show a good predictive accuracy of the model. 1

2 0.85673738 158 nips-2007-Probabilistic Matrix Factorization

Author: Andriy Mnih, Ruslan Salakhutdinov

Abstract: Many existing approaches to collaborative filtering can neither handle very large datasets nor easily deal with users who have very few ratings. In this paper we present the Probabilistic Matrix Factorization (PMF) model which scales linearly with the number of observations and, more importantly, performs well on the large, sparse, and very imbalanced Netflix dataset. We further extend the PMF model to include an adaptive prior on the model parameters and show how the model capacity can be controlled automatically. Finally, we introduce a constrained version of the PMF model that is based on the assumption that users who have rated similar sets of movies are likely to have similar preferences. The resulting model is able to generalize considerably better for users with very few ratings. When the predictions of multiple PMF models are linearly combined with the predictions of Restricted Boltzmann Machines models, we achieve an error rate of 0.8861, that is nearly 7% better than the score of Netflix’s own system.

3 0.70559204 96 nips-2007-Heterogeneous Component Analysis

Author: Shigeyuki Oba, Motoaki Kawanabe, Klaus-Robert Müller, Shin Ishii

Abstract: In bioinformatics it is often desirable to combine data from various measurement sources and thus structured feature vectors are to be analyzed that possess different intrinsic blocking characteristics (e.g., different patterns of missing values, observation noise levels, effective intrinsic dimensionalities). We propose a new machine learning tool, heterogeneous component analysis (HCA), for feature extraction in order to better understand the factors that underlie such complex structured heterogeneous data. HCA is a linear block-wise sparse Bayesian PCA based not only on a probabilistic model with block-wise residual variance terms but also on a Bayesian treatment of a block-wise sparse factor-loading matrix. We study various algorithms that implement our HCA concept extracting sparse heterogeneous structure by obtaining common components for the blocks and specific components within each block. Simulations on toy and bioinformatics data underline the usefulness of the proposed structured matrix factorization concept. 1

4 0.57457191 41 nips-2007-COFI RANK - Maximum Margin Matrix Factorization for Collaborative Ranking

Author: Markus Weimer, Alexandros Karatzoglou, Quoc V. Le, Alex J. Smola

Abstract: In this paper, we consider collaborative filtering as a ranking problem. We present a method which uses Maximum Margin Matrix Factorization and optimizes ranking instead of rating. We employ structured output prediction to optimize directly for ranking scores. Experimental results show that our method gives very good ranking scores and scales well on collaborative filtering tasks. 1

5 0.52400374 28 nips-2007-Augmented Functional Time Series Representation and Forecasting with Gaussian Processes

Author: Nicolas Chapados, Yoshua Bengio

Abstract: We introduce a functional representation of time series which allows forecasts to be performed over an unspecified horizon with progressively-revealed information sets. By virtue of using Gaussian processes, a complete covariance matrix between forecasts at several time-steps is available. This information is put to use in an application to actively trade price spreads between commodity futures contracts. The approach delivers impressive out-of-sample risk-adjusted returns after transaction costs on a portfolio of 30 spreads. 1

6 0.51221359 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

7 0.50746155 97 nips-2007-Hidden Common Cause Relations in Relational Learning

8 0.49655831 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

9 0.48879966 135 nips-2007-Multi-task Gaussian Process Prediction

10 0.48525092 131 nips-2007-Modeling homophily and stochastic equivalence in symmetric relational data

11 0.47889683 211 nips-2007-Unsupervised Feature Selection for Accurate Recommendation of High-Dimensional Image Data

12 0.47722891 12 nips-2007-A Spectral Regularization Framework for Multi-Task Structure Learning

13 0.45009071 167 nips-2007-Regulator Discovery from Gene Expression Time Series of Malaria Parasites: a Hierachical Approach

14 0.41702482 196 nips-2007-The Infinite Gamma-Poisson Feature Model

15 0.39821625 144 nips-2007-On Ranking in Survival Analysis: Bounds on the Concordance Index

16 0.39464545 8 nips-2007-A New View of Automatic Relevance Determination

17 0.38141727 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis

18 0.36710548 4 nips-2007-A Constraint Generation Approach to Learning Stable Linear Dynamical Systems

19 0.36466444 206 nips-2007-Topmoumoute Online Natural Gradient Algorithm

20 0.35844585 193 nips-2007-The Distribution Family of Similarity Distances


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.029), (13, 0.024), (16, 0.039), (18, 0.021), (21, 0.085), (31, 0.013), (35, 0.046), (47, 0.082), (49, 0.029), (83, 0.103), (85, 0.043), (87, 0.014), (88, 0.017), (90, 0.098), (92, 0.266)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.76326907 47 nips-2007-Collapsed Variational Inference for HDP

Author: Yee W. Teh, Kenichi Kurihara, Max Welling

Abstract: A wide variety of Dirichlet-multinomial ‘topic’ models have found interesting applications in recent years. While Gibbs sampling remains an important method of inference in such models, variational techniques have certain advantages such as easy assessment of convergence, easy optimization without the need to maintain detailed balance, a bound on the marginal likelihood, and side-stepping of issues with topic-identifiability. The most accurate variational technique thus far, namely collapsed variational latent Dirichlet allocation, did not deal with model selection nor did it include inference for hyperparameters. We address both issues by generalizing the technique, obtaining the first variational algorithm to deal with the hierarchical Dirichlet process and to deal with hyperparameters of Dirichlet variables. Experiments show a significant improvement in accuracy. 1

same-paper 2 0.73051441 156 nips-2007-Predictive Matrix-Variate t Models

Author: Shenghuo Zhu, Kai Yu, Yihong Gong

Abstract: It is becoming increasingly important to learn from a partially-observed random matrix and predict its missing elements. We assume that the entire matrix is a single sample drawn from a matrix-variate t distribution and suggest a matrixvariate t model (MVTM) to predict those missing elements. We show that MVTM generalizes a range of known probabilistic models, and automatically performs model selection to encourage sparse predictive models. Due to the non-conjugacy of its prior, it is difficult to make predictions by computing the mode or mean of the posterior distribution. We suggest an optimization method that sequentially minimizes a convex upper-bound of the log-likelihood, which is very efficient and scalable. The experiments on a toy data and EachMovie dataset show a good predictive accuracy of the model. 1

3 0.55602288 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

Author: Chuan-sheng Foo, Chuong B. Do, Andrew Y. Ng

Abstract: In problems where input features have varying amounts of noise, using distinct regularization hyperparameters for different features provides an effective means of managing model complexity. While regularizers for neural networks and support vector machines often rely on multiple hyperparameters, regularizers for structured prediction models (used in tasks such as sequence labeling or parsing) typically rely only on a single shared hyperparameter for all features. In this paper, we consider the problem of choosing regularization hyperparameters for log-linear models, a class of structured prediction probabilistic models which includes conditional random fields (CRFs). Using an implicit differentiation trick, we derive an efficient gradient-based method for learning Gaussian regularization priors with multiple hyperparameters. In both simulations and the real-world task of computational RNA secondary structure prediction, we find that multiple hyperparameter learning can provide a significant boost in accuracy compared to using only a single regularization hyperparameter. 1

4 0.55385351 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

Author: Kai Yu, Wei Chu

Abstract: This paper aims to model relational data on edges of networks. We describe appropriate Gaussian Processes (GPs) for directed, undirected, and bipartite networks. The inter-dependencies of edges can be effectively modeled by adapting the GP hyper-parameters. The framework suggests an intimate connection between link prediction and transfer learning, which were traditionally two separate research topics. We develop an efficient learning algorithm that can handle a large number of observations. The experimental results on several real-world data sets verify superior learning capacity. 1

5 0.55359662 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data

Author: Michael Ross, Andrew Cohen

Abstract: This paper describes a new model for human visual classification that enables the recovery of image features that explain human subjects’ performance on different visual classification tasks. Unlike previous methods, this algorithm does not model their performance with a single linear classifier operating on raw image pixels. Instead, it represents classification as the combination of multiple feature detectors. This approach extracts more information about human visual classification than previous methods and provides a foundation for further exploration. 1

6 0.55282587 195 nips-2007-The Generalized FITC Approximation

7 0.54636306 182 nips-2007-Sparse deep belief net model for visual area V2

8 0.54625618 63 nips-2007-Convex Relaxations of Latent Variable Training

9 0.54596543 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images

10 0.54338324 158 nips-2007-Probabilistic Matrix Factorization

11 0.54316211 104 nips-2007-Inferring Neural Firing Rates from Spike Trains Using Gaussian Processes

12 0.54312307 174 nips-2007-Selecting Observations against Adversarial Objectives

13 0.54189962 18 nips-2007-A probabilistic model for generating realistic lip movements from speech

14 0.54117215 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model

15 0.54072207 28 nips-2007-Augmented Functional Time Series Representation and Forecasting with Gaussian Processes

16 0.53968871 175 nips-2007-Semi-Supervised Multitask Learning

17 0.53965008 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

18 0.53876239 122 nips-2007-Locality and low-dimensions in the prediction of natural experience from fMRI

19 0.53843051 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations

20 0.53821254 154 nips-2007-Predicting Brain States from fMRI Data: Incremental Functional Principal Component Regression