nips nips2012 nips2012-74 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Neil Houlsby, Ferenc Huszar, Zoubin Ghahramani, Jose M. Hernández-lobato
Abstract: We present a new model based on Gaussian processes (GPs) for learning pairwise preferences expressed by multiple users. Inference is simplified by using a preference kernel for GPs which allows us to combine supervised GP learning of user preferences with unsupervised dimensionality reduction for multi-user systems. The model not only exploits collaborative information from the shared structure in user behavior, but may also incorporate user features if they are available. Approximate inference is implemented using a combination of expectation propagation and variational Bayes. Finally, we present an efficient active learning strategy for querying preferences. The proposed technique performs favorably on real-world data against state-of-the-art multi-user preference learning algorithms. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Inference is simplified by using a preference kernel for GPs which allows us to combine supervised GP learning of user preferences with unsupervised dimensionality reduction for multi-user systems. [sent-2, score-0.858]
2 The model not only exploits collaborative information from the shared structure in user behavior, but may also incorporate user features if they are available. [sent-3, score-0.708]
3 Finally, we present an efficient active learning strategy for querying preferences. [sent-5, score-0.139]
4 The proposed technique performs favorably on real-world data against state-of-the-art multi-user preference learning algorithms. [sent-6, score-0.41]
5 1 Introduction Preference learning is concerned with making inference from data consisting of pairs of items and corresponding binary labels indicating user preferences. [sent-7, score-0.465]
6 A popular modeling approach assumes the existence of a utility function f such that f (x) gives the utility of an item with feature vector x; f (xi ) > f (xj ) indicates that item i is preferred to item j. [sent-9, score-0.518]
7 Bayesian methods can be used to learn f , for example, by modeling f independently for each user as a draw from a Gaussian process (GP) prior [4]. [sent-10, score-0.318]
8 However, when data from many users is available, such methods do not leverage similarities in preferences across users. [sent-11, score-0.332]
9 Current multi-user approaches require that features are available for each user and assume that users with similar features have similar preferences [2], or perform single-user learning, ignoring user features, but tie information across users with a hierarchical prior [1]. [sent-12, score-1.182]
10 These methods are not flexible: each addresses only one of two possible scenarios, a) user features are available and useful for prediction, or b) user features are absent or uninformative. [sent-13, score-0.351]
11 Our approach, by contrast, can address both a) and b) by combining informative user features with collaborative information. [sent-16, score-0.416]
12 The proposed method is based on a connection between preference learning and GP binary classification. [sent-22, score-0.435]
13 We show that both problems are equivalent when a covariance function called the preference kernel is used. [sent-23, score-0.489]
14 Finally, in real scenarios, querying users for preference may be costly and intrusive, so it is desirable to learn preferences using the least data possible. [sent-25, score-0.717]
15 With this objective, we present BALD (Bayesian active learning by disagreement), an efficient active learning strategy for binary classification problems with GP priors. [sent-26, score-0.25]
16 2 Pairwise preference learning as a special case of binary classification The problem of pairwise preference learning can be recast as a special case of binary classification. [sent-27, score-0.934]
17 Let us consider two items i and j with corresponding feature vectors xi , xj ∈ X . [sent-28, score-0.269]
18 In the pairwise preference learning problem, we are given pairs of feature vectors xi and xj and corresponding class labels y ∈ {−1, 1} such that y = 1 if the user prefers item i to item j and y = −1 otherwise. [sent-29, score-1.292]
19 This problem can be addressed by introducing a latent preference function f : X → R such that f (xi ) > f (xj ) whenever the user prefers item i to item j and f (xi ) < f (xj ) otherwise [4]. [sent-31, score-1.096]
20 The preference learning problem can be solved by combining a GP prior on f with the likelihood function in (1) [4]. [sent-33, score-0.425]
21 The posterior for f can then be used to make predictions on the user preferences for new pairs of items. [sent-34, score-0.523]
22 Let g : X 2 → R be the latent function g(xi , xj ) = f (xi ) − f (xj ). [sent-36, score-0.183]
23 When the evaluation of g is contaminated with standard Gaussian noise, the likelihood for g given xi, xj and y is P(y|xi, xj, g) = Φ[g(xi, xj) y]. (2) [sent-38, score-0.425]
24 The covariance function kpref of the GP prior on g can be computed from the covariance function k of the GP on f as kpref ((xi , xj ), (xk , xl )) = k(xi , xk )+k(xj , xl )−k(xi , xl )−k(xj , xk ). [sent-40, score-0.523]
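As a concrete illustration of this construction (a sketch, not the authors' code), the preference kernel can be evaluated from any base kernel k on item features; the squared-exponential base kernel and the helper name preference_kernel below are assumptions made for the example.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential base kernel k(a, b) on item feature vectors."""
    return np.exp(-0.5 * np.sum((a - b) ** 2) / lengthscale ** 2)

def preference_kernel(pair_1, pair_2, k=rbf_kernel):
    """Preference kernel induced by a base kernel k:
    kpref((xi, xj), (xk, xl)) = k(xi, xk) + k(xj, xl) - k(xi, xl) - k(xj, xk),
    i.e. the covariance of g(xi, xj) = f(xi) - f(xj) under a GP prior on f."""
    (xi, xj), (xk, xl) = pair_1, pair_2
    return k(xi, xk) + k(xj, xl) - k(xi, xl) - k(xj, xk)
```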
25 However, to our knowledge, the preference kernel has not been used previously for GP-based models. [sent-44, score-0.439]
26 The combination of (2) with a GP prior based on the preference kernel allows us to transform the pairwise preference learning problem into binary classification with GPs. [sent-45, score-0.937]
27 This means that state-of-the-art methods for GP binary classification, such as expectation propagation [14], can be applied directly to preference learning. [sent-46, score-0.469]
28 3 Multi-user preference learning Consider I items with feature vectors xi ∈ X for i = 1, . . . , I. [sent-48, score-0.544]
29 Our approach to the multi-user problem is to assume common structure in the user latent functions. [sent-53, score-0.318]
30 In particular, we assume a set of D shared latent functions, hd : X² → R for d = 1, . . . , D, [sent-54, score-0.268]
31 such that the user latent functions are generated by a linear combination of these functions, namely gu(xj, xk) = Σ_{d=1}^{D} wu,d hd(xj, xk), (3) where wu,d ∈ R is the weight given to function hd for user u. [sent-57, score-1.056]
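To make the structure of equation (3) concrete, a minimal sketch (illustrative sizes, not the authors' code): stacking the weights wu,d into a U × D matrix and the shared-function values at the observed pairs into a D × P matrix, each user's latent function is one row of their product.

```python
import numpy as np

rng = np.random.default_rng(0)
U, D, P = 5, 2, 10           # users, shared latent functions, observed item pairs

W = rng.normal(size=(U, D))  # W[u, d] = w_{u,d}, weight of shared function h_d for user u
H = rng.normal(size=(D, P))  # H[d, i] = h_d evaluated at the i-th observed item pair

G = W @ H                    # G[u, i] = sum_d w_{u,d} * h_d(pair i), as in equation (3)
assert G.shape == (U, P)
```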
32 We place a GP prior over the shared latent functions h1 , . . . , hD , [sent-58, score-0.145]
33 using the preference kernel described in the previous section. [sent-61, score-0.439]
34 This model allows the preferences of the different users to share some common structure represented by the latent functions h1 , . . . , hD . [sent-62, score-0.374]
35 We may extend this model further to the case in which, for each user u, there is a feature vector uu ∈ U containing information that might be useful for prediction. [sent-67, score-0.378]
36 The user features are now incorporated by placing a separate GP prior over the users' weights. [sent-72, score-0.527]
37 These weight functions describe the contribution of shared latent function hd to the user latent function gu as a function of the user feature vector uu . [sent-74, score-1.039]
38 The data consists of L, the set of feature vectors for the users U (if available), the item features X = {x1 , . . . , xI }, [sent-79, score-0.401]
39 and U sets of preference judgements, one for each user, D = {{zu,i , yu,i }_{i=1}^{Mu} }_{u=1}^{U} , where zu,i indexes the i-th pair evaluated by user u, yu,i = 1 if this user prefers the first item in the pair to the second and yu,i = −1 otherwise. [sent-82, score-1.188]
40 Mu is the number of preference judgements made by the u-th user. [sent-83, score-0.42]
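As a rough illustration of how such data could be organised in code (the container and field names are hypothetical, not taken from the paper):

```python
# One list of judgements per user: (pair_index z_{u,i}, label y_{u,i}).
# z_{u,i} indexes into the global list L of evaluated item pairs;
# y_{u,i} is +1 if the user prefers the first item of the pair, -1 otherwise.
preference_data = {
    0: [(3, +1), (7, -1), (12, +1)],  # user 0: M_0 = 3 judgements
    1: [(7, +1), (0, -1)],            # user 1: M_1 = 2 judgements
}
num_judgements = {u: len(js) for u, js in preference_data.items()}  # M_u per user
```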
41 3.1 Probabilistic description To address the task of predicting preferences on unseen item pairs, we cast the model into a probabilistic framework. [sent-85, score-0.593]
42 Let G be a U × P 'user-function' matrix, where each row corresponds to a particular user's latent function, that is, the entry in the u-th row and i-th column is gu,i = gu (xα(i) , xβ(i) ) and α(i) and β(i) denote respectively the first and second item in the i-th pair from L. [sent-86, score-0.381]
43 Let H be a D × P 'shared-function' matrix, where each row represents one of the shared latent functions, that is, the entry in the d-th row and i-th column is hd,i = hd (xα(i) , xβ(i) ). [sent-87, score-0.352]
44 Finally, we introduce the U × D weight matrix W such that each row contains a user’s weights, that is, the entry in the u-th row and d-th column of this matrix is wd (uu ). [sent-88, score-0.123]
45 If user features are unavailable, Kusers becomes the identity matrix. [sent-104, score-0.324]
46 Each of h1 , . . . , hD is sampled a priori from a GP with zero mean and covariance function given by a preference kernel. [sent-108, score-0.438]
47 Let Kitems be the P × P preference covariance matrix for the item pairs in L. [sent-109, score-0.643]
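Reusing the preference-kernel sketch from Section 2, Kitems could be assembled by evaluating that kernel on every pair of observed item pairs; the function below is illustrative only and takes the kernel as an argument so it stands alone.

```python
import numpy as np

def build_k_items(item_pairs, k_pref):
    """P x P Gram matrix of the preference kernel over the observed item pairs in L."""
    P = len(item_pairs)
    K = np.empty((P, P))
    for a in range(P):
        for b in range(P):
            K[a, b] = k_pref(item_pairs[a], item_pairs[b])
    return K
```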
48 BALD samples from the region of greatest uncertainty in the latent function. [sent-116, score-0.146]
49 k (a scalar) is the prior variance of hd (xα(P+1) , xβ(P+1) ) and k (a P-dimensional vector) contains the prior covariances between hd (xα(P+1) , xβ(P+1) ) and hd (xα(1) , xβ(1) ), . . . , hd (xα(P) , xβ(P) ). [sent-117, score-0.554]
50 We want to learn user preferences with the proposed model from the least amount of data possible. [sent-124, score-0.419]
51 Therefore we want to actively query users about their preferences on the most informative pairs of items [3]. [sent-125, score-0.444]
52 This method exploits the preference kernel and so may be trivially generalized to GP binary classification problems also. [sent-127, score-0.486]
53 4 Bayesian active learning by disagreement The goal of active learning is to choose item pairs such that we learn the preference functions for the users using minimal data. [sent-128, score-0.973]
54 Information theoretic approaches to active learning are popular because they do not require prior knowledge of loss functions or test domains. [sent-129, score-0.126]
55 For preference learning (see Section 2), this implies choosing the new item features xi and xj that maximize H[P(g|D)] − E_{P(y|xi,xj,D)}[H[P(g|y, xi, xj, D)]], (9) where D are the user preferences observed so far and H[p(x)] = −∫ p(x) log p(x) dx represents the Shannon entropy. [sent-131, score-1.351]
56 Using this, we can rearrange the objective to compute entropies in y space: H[P(y|xi, xj, D)] − E_{P(g|D)}[H[P(y|xi, xj, g)]]. (10) [sent-139, score-0.269]
57 Expression (10) also provides us with an intuition about the objective; we seek the xi and xj for which a) the model is marginally uncertain about y (high H[P(y|xi, xj, D)]) and b) conditioned on a particular value of g the model is confident about y (low E_{P(g|D)}[H[P(y|xi, xj, g)]]). [sent-144, score-0.395]
58 This can be interpreted as seeking the pair xi and xj for which the latent functions g, under the posterior, ‘disagree’ with each other the most about the outcome, that is, the preference judgement. [sent-145, score-0.649]
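A minimal Monte Carlo sketch of objective (10) for one candidate pair, assuming samples of g(xi, xj) from (an approximation to) its posterior are available; it illustrates the rearranged objective rather than the closed-form approximation derived in the next section, and all names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def binary_entropy(p, eps=1e-12):
    """Binary entropy in bits, clipped away from 0 and 1 for numerical safety."""
    p = np.clip(p, eps, 1 - eps)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def bald_mc(g_samples):
    """Monte Carlo estimate of H[P(y|D)] - E_g[ H[P(y|g)] ] for one candidate pair.

    g_samples: draws of g(xi, xj) from the (approximate) posterior given D.
    """
    p_y = norm.cdf(g_samples)   # P(y = 1 | g) = Phi(g), per posterior sample
    marginal = p_y.mean()       # approximate P(y = 1 | D)
    return binary_entropy(marginal) - binary_entropy(p_y).mean()
```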
59 In the following section we show how (10) can be applied to binary classification with GPs, and hence via the preference kernel also to any preference learning problem. [sent-148, score-0.874]
60 4.1 BALD in binary classification with GPs Most approximate inference methods for the problem of binary classification with GPs produce a Gaussian approximation to the posterior distribution of f , the latent function of interest. [sent-150, score-0.275]
61 In the binary GP classifier, the entropy of y given the corresponding value of f can be expressed in terms of the binary entropy function, h[f ] = −f log f − (1 − f ) log(1 − f ). [sent-151, score-0.182]
62 When a Gaussian is used to approximate the posterior of f , we have that for each x, fx = f (x) will follow a Gaussian distribution with mean µx and variance σx². [sent-153, score-0.188]
63 The first term in (10), that is, H[p(y|x, D)], can be handled analytically in this case: H[p(y|x, D)] ≈ h( ∫ Φ(fx) N(fx|µx, σx²) dfx ) = h( Φ( µx (σx² + 1)^{−1/2} ) ), where ≈ represents here the Gaussian approximation to the posterior of fx . [sent-154, score-0.163]
64 This result is obtained by using the Gaussian approximation to the posterior of fx and then approximating h[Φ(fx)] by the squared exponential curve exp(−fx² / (π log 2)) (details can be found in Section 3 of the supplementary material). [sent-156, score-0.202]
65 To summarize, the BALD algorithm for active binary GP classification / preference learning first applies any approximate inference method to obtain the posterior mean µx and variance σx² of f at each point of interest x. [sent-157, score-0.635]
66 It then selects the x that maximizes the resulting objective (11); BALD assigns a high value to the feature vector x when the model is both uncertain about the label (µx close to 0) and there is high uncertainty about fx (σx² is large). [sent-159, score-0.155]
67 By contrast, BALD samples data from the region of greatest uncertainty in the latent function. [sent-163, score-0.146]
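Putting the two approximations above together gives a closed-form acquisition score per candidate input. The sketch below measures entropies in bits and uses the constant C² = π ln 2 / 2 implied by the squared-exponential approximation to h[Φ(·)]; it should be read as an illustration of the BALD score, not a verbatim transcription of equation (11).

```python
import numpy as np
from scipy.stats import norm

def binary_entropy(p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def bald_score(mu, var):
    """Approximate BALD score from the Gaussian posterior N(mu, var) over f at one input.

    First term: entropy of the approximate predictive label distribution.
    Second term: expected entropy given f, using h[Phi(f)] ~ exp(-f^2 / (pi * ln 2)).
    """
    c2 = np.pi * np.log(2.0) / 2.0                       # C^2 = pi * ln(2) / 2
    marginal_entropy = binary_entropy(norm.cdf(mu / np.sqrt(var + 1.0)))
    expected_entropy = np.sqrt(c2 / (var + c2)) * np.exp(-mu ** 2 / (2.0 * (var + c2)))
    return marginal_entropy - expected_entropy

# Query the candidate with the largest score, e.g. (post_means, post_vars are hypothetical):
# best = int(np.argmax([bald_score(m, v) for m, v in zip(post_means, post_vars)]))
```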
68 5 Expectation propagation and variational Bayes Approximate inference in our model is implemented using a combination of expectation propagation (EP) [13] and variational Bayes (VB) [7]. [sent-164, score-0.152]
69 EP refines the approximate factors f̂1 , . . . , f̂4 by minimizing the Kullback-Leibler (KL) divergence between the product of Q\a and fa and the product of Q\a and f̂a , where Q\a is the ratio between Q and f̂a . [sent-189, score-0.12]
70 However, this does not perform well for refining f̂2 ; details on this problem can be found in Section 4 of the supplementary material and in [19]. [sent-190, score-0.127]
71 In our case, refining f̂3 has cost O(DU³), where U is the number of users, and D the number of shared latent functions. [sent-202, score-0.13]
72 The cost of refining f̂4 is O(DP³), where P is the number of observed item pairs. [sent-203, score-0.185]
73 These approximations allow us to refine f̂3 and f̂4 in O(DU0²U) and O(DP0²P) operations, where U0 and P0 are the number of pseudo-inputs for the users and for the item pairs, respectively. [sent-207, score-0.329]
74 6 Experiments and Discussion The performance of our collaborative preference model with the BALD active learning strategy is evaluated in a series of experiments with simulated and real-world data. [sent-210, score-0.567]
75 Two versions of the proposed collaborative preference (CP) model are used. [sent-215, score-0.453]
76 The first version (CPU) takes into account the available user features, as described in Section 3. [sent-216, score-0.308]
77 This method does not use user features, and captures similarities between users with a hierarchical GP model. [sent-220, score-0.475]
78 In particular, a common GP prior is assumed for the preference function of each user; using this prior the model learns the full GP posterior for each user. [sent-221, score-0.524]
79 In this model there exists one high-dimensional function which depends on both the features of the two items to be compared and on the features of the user who makes the comparison. [sent-224, score-0.438]
80 Relationships between users’ behaviors are captured only via the user features. [sent-225, score-0.281]
81 We implement BO and BI using the preference kernel and EP for approximate inference¹. [sent-226, score-0.486]
82 BI does not include user features, but learns U GPs, so has complexity O(UP³); the equivalent version of our model (CP) has cost O(NP + DP³), which is lower because D ≪ U. [sent-230, score-0.303]
83 Finally, we consider a single user approach (SU) which fits a different GP classifier independently to the data of each user. [sent-232, score-0.281]
84 1 Although this is not the same as the original implementations (sampling-based for BI, Laplace approximation for BO), the preference kernel and EP are likely to augment the performance of these algorithms, and provide the fairest comparison of the underlying models. [sent-233, score-0.439]
85 The available data were split randomly into training and test sets of item pairs, where the training sets contain 20 pairs per user in Sushi, MovieLens and Election, 15 pairs in Jura and 30 in Synthetic. [sent-334, score-0.555]
86 In our model, we can also estimate the kernel lengthscales by maximizing the EP approximation of the model evidence, as illustrated in Section 9 of the supplementary material. [sent-344, score-0.125]
87 Overall, CP and CPU outperform SU and BO, and break even with BI; the final result is notable because BI learns the full mean and covariance structure across all users, whereas ours uses only a few latent dimensions, which provides the key to scaling to many more users. [sent-353, score-0.12]
88 In the real-world datasets, users with similar features do not seem to have similar preferences and so correlating behavior of users with similar features is detrimental. [sent-355, score-0.556]
89 In this case, the unsupervised learning of similarities in user preferences is more useful for prediction than the user features. [sent-356, score-0.728]
90 We now use all the available users from each dataset, with a maximum of 1000 users. [sent-364, score-0.193]
91 For each user the available preference data are split randomly into training, pool and test sets with 5, 35 and 5 data points respectively. [sent-365, score-0.724]
92 [Figure 2 panels: Synthetic, Sushi and Election; x-axis: num samples, y-axis: average test error.] [sent-371, score-0.15]
93 Figure 2: Average test error for CPU, CP and SU, using the strategies BALD (-B), entropy (-E) and random (-R) for active learning. [sent-397, score-0.208]
94 Average errors after 10 queries from the pool set of each user are summarized in Table 3. [sent-407, score-0.309]
95 For each model (CPU, CP and SU), the results of the best active learning strategy are highlighted in bold. [sent-408, score-0.157]
96 7 Conclusions We have proposed a multi-user model that combines collaborative filtering methods with GP binary preference modeling. [sent-413, score-0.5]
97 We have shown that the task of learning user preferences can be recast as a particular case of binary classification with GPs when a covariance function called the preference kernel is used. [sent-414, score-0.993]
98 We have also presented BALD, a novel active learning strategy for binary classification models with GPs. [sent-415, score-0.161]
99 The proposed multi-user model with BALD performs favorably on simulated and real-world data against single-user methods and existing approaches for multi-user preference learning, whilst having significantly lower computational times than competing multi-user methods. [sent-416, score-0.41]
100 Multi-task preference learning with an application to hearing aid personalization. [sent-423, score-0.388]
wordName wordTfidf (topN-words)
[('preference', 0.388), ('bald', 0.353), ('user', 0.281), ('gp', 0.276), ('cp', 0.246), ('cpu', 0.188), ('election', 0.175), ('users', 0.166), ('item', 0.163), ('hd', 0.16), ('jura', 0.157), ('preferences', 0.138), ('ep', 0.137), ('bo', 0.13), ('sushi', 0.128), ('kusers', 0.118), ('xj', 0.113), ('su', 0.102), ('fx', 0.101), ('mu', 0.098), ('gps', 0.098), ('active', 0.089), ('movielens', 0.084), ('kitems', 0.078), ('num', 0.075), ('items', 0.071), ('bi', 0.07), ('latent', 0.07), ('fitc', 0.069), ('uu', 0.068), ('collaborative', 0.065), ('posterior', 0.062), ('fa', 0.06), ('kpref', 0.059), ('mes', 0.057), ('xi', 0.056), ('pp', 0.055), ('synthetic', 0.051), ('kernel', 0.051), ('covariance', 0.05), ('binary', 0.047), ('wu', 0.047), ('entropy', 0.044), ('pseudo', 0.043), ('highlighted', 0.043), ('features', 0.043), ('entropies', 0.043), ('pairs', 0.042), ('gu', 0.042), ('wd', 0.039), ('infosys', 0.039), ('wud', 0.039), ('supplementary', 0.039), ('recast', 0.038), ('shared', 0.038), ('prior', 0.037), ('classi', 0.036), ('disagreement', 0.036), ('vb', 0.035), ('multiuser', 0.035), ('lengthscales', 0.035), ('propagation', 0.034), ('gaussian', 0.033), ('judgements', 0.032), ('mg', 0.032), ('prefers', 0.031), ('xl', 0.031), ('xk', 0.031), ('kt', 0.031), ('mw', 0.03), ('contaminated', 0.03), ('row', 0.03), ('variational', 0.03), ('feature', 0.029), ('similarities', 0.028), ('material', 0.028), ('pool', 0.028), ('greatest', 0.027), ('informative', 0.027), ('available', 0.027), ('kl', 0.026), ('pairwise', 0.026), ('querying', 0.025), ('bonilla', 0.025), ('approximate', 0.025), ('strategy', 0.025), ('uncertainty', 0.025), ('datasets', 0.025), ('sa', 0.025), ('column', 0.024), ('region', 0.024), ('inference', 0.024), ('mh', 0.023), ('factors', 0.023), ('implement', 0.022), ('inputs', 0.022), ('bayes', 0.022), ('pair', 0.022), ('cost', 0.022), ('favorably', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 74 nips-2012-Collaborative Gaussian Processes for Preference Learning
Author: Neil Houlsby, Ferenc Huszar, Zoubin Ghahramani, Jose M. Hernández-lobato
Abstract: We present a new model based on Gaussian processes (GPs) for learning pairwise preferences expressed by multiple users. Inference is simplified by using a preference kernel for GPs which allows us to combine supervised GP learning of user preferences with unsupervised dimensionality reduction for multi-user systems. The model not only exploits collaborative information from the shared structure in user behavior, but may also incorporate user features if they are available. Approximate inference is implemented using a combination of expectation propagation and variational Bayes. Finally, we present an efficient active learning strategy for querying preferences. The proposed technique performs favorably on real-world data against state-of-the-art multi-user preference learning algorithms. 1
2 0.21045931 272 nips-2012-Practical Bayesian Optimization of Machine Learning Algorithms
Author: Jasper Snoek, Hugo Larochelle, Ryan P. Adams
Abstract: The use of machine learning algorithms frequently involves careful tuning of learning parameters and model hyperparameters. Unfortunately, this tuning is often a “black art” requiring expert experience, rules of thumb, or sometimes bruteforce search. There is therefore great appeal for automatic approaches that can optimize the performance of any given learning algorithm to the problem at hand. In this work, we consider this problem through the framework of Bayesian optimization, in which a learning algorithm’s generalization performance is modeled as a sample from a Gaussian process (GP). We show that certain choices for the nature of the GP, such as the type of kernel and the treatment of its hyperparameters, can play a crucial role in obtaining a good optimizer that can achieve expertlevel performance. We describe new algorithms that take into account the variable cost (duration) of learning algorithm experiments and that can leverage the presence of multiple cores for parallel experimentation. We show that these proposed algorithms improve on previous automatic procedures and can reach or surpass human expert-level optimization for many algorithms including latent Dirichlet allocation, structured SVMs and convolutional neural networks. 1
3 0.20231961 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback
Author: Andriy Mnih, Yee W. Teh
Abstract: User preferences for items can be inferred from either explicit feedback, such as item ratings, or implicit feedback, such as rental histories. Research in collaborative filtering has concentrated on explicit feedback, resulting in the development of accurate and scalable models. However, since explicit feedback is often difficult to collect it is important to develop effective models that take advantage of the more widely available implicit feedback. We introduce a probabilistic approach to collaborative filtering with implicit feedback based on modelling the user’s item selection process. In the interests of scalability, we restrict our attention to treestructured distributions over items and develop a principled and efficient algorithm for learning item trees from data. We also identify a problem with a widely used protocol for evaluating implicit feedback models and propose a way of addressing it using a small quantity of explicit feedback data. 1
4 0.19681916 75 nips-2012-Collaborative Ranking With 17 Parameters
Author: Maksims Volkovs, Richard S. Zemel
Abstract: The primary application of collaborate filtering (CF) is to recommend a small set of items to a user, which entails ranking. Most approaches, however, formulate the CF problem as rating prediction, overlooking the ranking perspective. In this work we present a method for collaborative ranking that leverages the strengths of the two main CF approaches, neighborhood- and model-based. Our novel method is highly efficient, with only seventeen parameters to optimize and a single hyperparameter to tune, and beats the state-of-the-art collaborative ranking methods. We also show that parameters learned on datasets from one item domain yield excellent results on a dataset from very different item domain, without any retraining. 1
5 0.17551726 49 nips-2012-Automatic Feature Induction for Stagewise Collaborative Filtering
Author: Joonseok Lee, Mingxuan Sun, Guy Lebanon, Seung-jean Kim
Abstract: Recent approaches to collaborative filtering have concentrated on estimating an algebraic or statistical model, and using the model for predicting missing ratings. In this paper we observe that different models have relative advantages in different regions of the input space. This motivates our approach of using stagewise linear combinations of collaborative filtering algorithms, with non-constant combination coefficients based on kernel smoothing. The resulting stagewise model is computationally scalable and outperforms a wide selection of state-of-the-art collaborative filtering algorithms. 1
6 0.17444576 33 nips-2012-Active Learning of Model Evidence Using Bayesian Quadrature
7 0.16369082 233 nips-2012-Multiresolution Gaussian Processes
8 0.15784934 246 nips-2012-Nonparametric Max-Margin Matrix Factorization for Collaborative Prediction
9 0.1510223 187 nips-2012-Learning curves for multi-task Gaussian process regression
10 0.13953878 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems
11 0.13528612 286 nips-2012-Random Utility Theory for Social Choice
12 0.13195331 288 nips-2012-Rational inference of relative preferences
13 0.12031204 127 nips-2012-Fast Bayesian Inference for Non-Conjugate Gaussian Process Regression
14 0.10678103 55 nips-2012-Bayesian Warped Gaussian Processes
15 0.091962203 194 nips-2012-Learning to Discover Social Circles in Ego Networks
16 0.086621888 165 nips-2012-Iterative ranking from pair-wise comparisons
17 0.08621183 169 nips-2012-Label Ranking with Partial Abstention based on Thresholded Probabilistic Models
18 0.080238797 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing
19 0.079650737 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries
20 0.075430043 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization
topicId topicWeight
[(0, 0.208), (1, 0.046), (2, 0.01), (3, -0.013), (4, -0.137), (5, -0.108), (6, 0.008), (7, 0.182), (8, 0.019), (9, -0.079), (10, -0.335), (11, 0.074), (12, -0.144), (13, 0.094), (14, -0.004), (15, 0.115), (16, -0.055), (17, 0.103), (18, -0.081), (19, 0.047), (20, -0.063), (21, -0.093), (22, 0.09), (23, -0.041), (24, 0.007), (25, 0.003), (26, -0.019), (27, -0.038), (28, 0.084), (29, 0.073), (30, -0.011), (31, 0.014), (32, -0.025), (33, 0.027), (34, -0.039), (35, -0.02), (36, -0.014), (37, 0.048), (38, 0.004), (39, -0.001), (40, 0.008), (41, 0.051), (42, 0.013), (43, 0.023), (44, 0.046), (45, -0.027), (46, 0.085), (47, -0.035), (48, 0.054), (49, -0.079)]
simIndex simValue paperId paperTitle
same-paper 1 0.92859191 74 nips-2012-Collaborative Gaussian Processes for Preference Learning
Author: Neil Houlsby, Ferenc Huszar, Zoubin Ghahramani, Jose M. Hernández-lobato
Abstract: We present a new model based on Gaussian processes (GPs) for learning pairwise preferences expressed by multiple users. Inference is simplified by using a preference kernel for GPs which allows us to combine supervised GP learning of user preferences with unsupervised dimensionality reduction for multi-user systems. The model not only exploits collaborative information from the shared structure in user behavior, but may also incorporate user features if they are available. Approximate inference is implemented using a combination of expectation propagation and variational Bayes. Finally, we present an efficient active learning strategy for querying preferences. The proposed technique performs favorably on real-world data against state-of-the-art multi-user preference learning algorithms. 1
2 0.68778282 49 nips-2012-Automatic Feature Induction for Stagewise Collaborative Filtering
Author: Joonseok Lee, Mingxuan Sun, Guy Lebanon, Seung-jean Kim
Abstract: Recent approaches to collaborative filtering have concentrated on estimating an algebraic or statistical model, and using the model for predicting missing ratings. In this paper we observe that different models have relative advantages in different regions of the input space. This motivates our approach of using stagewise linear combinations of collaborative filtering algorithms, with non-constant combination coefficients based on kernel smoothing. The resulting stagewise model is computationally scalable and outperforms a wide selection of state-of-the-art collaborative filtering algorithms. 1
3 0.68721259 55 nips-2012-Bayesian Warped Gaussian Processes
Author: Miguel Lázaro-gredilla
Abstract: Warped Gaussian processes (WGP) [1] model output observations in regression tasks as a parametric nonlinear transformation of a Gaussian process (GP). The use of this nonlinear transformation, which is included as part of the probabilistic model, was shown to enhance performance by providing a better prior model on several data sets. In order to learn its parameters, maximum likelihood was used. In this work we show that it is possible to use a non-parametric nonlinear transformation in WGP and variationally integrate it out. The resulting Bayesian WGP is then able to work in scenarios in which the maximum likelihood WGP failed: Low data regime, data with censored values, classification, etc. We demonstrate the superior performance of Bayesian warped GPs on several real data sets.
4 0.67169845 75 nips-2012-Collaborative Ranking With 17 Parameters
Author: Maksims Volkovs, Richard S. Zemel
Abstract: The primary application of collaborate filtering (CF) is to recommend a small set of items to a user, which entails ranking. Most approaches, however, formulate the CF problem as rating prediction, overlooking the ranking perspective. In this work we present a method for collaborative ranking that leverages the strengths of the two main CF approaches, neighborhood- and model-based. Our novel method is highly efficient, with only seventeen parameters to optimize and a single hyperparameter to tune, and beats the state-of-the-art collaborative ranking methods. We also show that parameters learned on datasets from one item domain yield excellent results on a dataset from very different item domain, without any retraining. 1
5 0.65924406 272 nips-2012-Practical Bayesian Optimization of Machine Learning Algorithms
Author: Jasper Snoek, Hugo Larochelle, Ryan P. Adams
Abstract: The use of machine learning algorithms frequently involves careful tuning of learning parameters and model hyperparameters. Unfortunately, this tuning is often a “black art” requiring expert experience, rules of thumb, or sometimes bruteforce search. There is therefore great appeal for automatic approaches that can optimize the performance of any given learning algorithm to the problem at hand. In this work, we consider this problem through the framework of Bayesian optimization, in which a learning algorithm’s generalization performance is modeled as a sample from a Gaussian process (GP). We show that certain choices for the nature of the GP, such as the type of kernel and the treatment of its hyperparameters, can play a crucial role in obtaining a good optimizer that can achieve expertlevel performance. We describe new algorithms that take into account the variable cost (duration) of learning algorithm experiments and that can leverage the presence of multiple cores for parallel experimentation. We show that these proposed algorithms improve on previous automatic procedures and can reach or surpass human expert-level optimization for many algorithms including latent Dirichlet allocation, structured SVMs and convolutional neural networks. 1
6 0.64128554 33 nips-2012-Active Learning of Model Evidence Using Bayesian Quadrature
7 0.61277056 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback
8 0.60951865 233 nips-2012-Multiresolution Gaussian Processes
9 0.58610147 288 nips-2012-Rational inference of relative preferences
10 0.57354695 246 nips-2012-Nonparametric Max-Margin Matrix Factorization for Collaborative Prediction
11 0.51412582 187 nips-2012-Learning curves for multi-task Gaussian process regression
12 0.50213432 189 nips-2012-Learning from the Wisdom of Crowds by Minimax Entropy
13 0.48708466 127 nips-2012-Fast Bayesian Inference for Non-Conjugate Gaussian Process Regression
14 0.4432219 194 nips-2012-Learning to Discover Social Circles in Ego Networks
15 0.43121499 286 nips-2012-Random Utility Theory for Social Choice
16 0.41492042 11 nips-2012-A Marginalized Particle Gaussian Process Regression
17 0.41202641 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization
18 0.40444273 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification
19 0.40200868 111 nips-2012-Efficient Sampling for Bipartite Matching Problems
20 0.39069158 165 nips-2012-Iterative ranking from pair-wise comparisons
topicId topicWeight
[(0, 0.032), (21, 0.021), (38, 0.129), (39, 0.04), (42, 0.038), (54, 0.016), (55, 0.017), (74, 0.044), (76, 0.225), (80, 0.093), (86, 0.199), (92, 0.053)]
simIndex simValue paperId paperTitle
1 0.94246721 137 nips-2012-From Deformations to Parts: Motion-based Segmentation of 3D Objects
Author: Soumya Ghosh, Matthew Loper, Erik B. Sudderth, Michael J. Black
Abstract: We develop a method for discovering the parts of an articulated object from aligned meshes of the object in various three-dimensional poses. We adapt the distance dependent Chinese restaurant process (ddCRP) to allow nonparametric discovery of a potentially unbounded number of parts, while simultaneously guaranteeing a spatially connected segmentation. To allow analysis of datasets in which object instances have varying 3D shapes, we model part variability across poses via affine transformations. By placing a matrix normal-inverse-Wishart prior on these affine transformations, we develop a ddCRP Gibbs sampler which tractably marginalizes over transformation uncertainty. Analyzing a dataset of humans captured in dozens of poses, we infer parts which provide quantitatively better deformation predictions than conventional clustering methods.
2 0.91046482 57 nips-2012-Bayesian estimation of discrete entropy with mixtures of stick-breaking priors
Author: Evan Archer, Il M. Park, Jonathan W. Pillow
Abstract: We consider the problem of estimating Shannon’s entropy H in the under-sampled regime, where the number of possible symbols may be unknown or countably infinite. Dirichlet and Pitman-Yor processes provide tractable prior distributions over the space of countably infinite discrete distributions, and have found major applications in Bayesian non-parametric statistics and machine learning. Here we show that they provide natural priors for Bayesian entropy estimation, due to the analytic tractability of the moments of the induced posterior distribution over entropy H. We derive formulas for the posterior mean and variance of H given data. However, we show that a fixed Dirichlet or Pitman-Yor process prior implies a narrow prior on H, meaning the prior strongly determines the estimate in the under-sampled regime. We therefore define a family of continuous mixing measures such that the resulting mixture of Dirichlet or Pitman-Yor processes produces an approximately flat prior over H. We explore the theoretical properties of the resulting estimators and show that they perform well on data sampled from both exponential and power-law tailed distributions. 1
3 0.88073879 303 nips-2012-Searching for objects driven by context
Author: Bogdan Alexe, Nicolas Heess, Yee W. Teh, Vittorio Ferrari
Abstract: The dominant visual search paradigm for object class detection is sliding windows. Although simple and effective, it is also wasteful, unnatural and rigidly hardwired. We propose strategies to search for objects which intelligently explore the space of windows by making sequential observations at locations decided based on previous observations. Our strategies adapt to the class being searched and to the content of a particular test image, exploiting context as the statistical relation between the appearance of a window and its location relative to the object, as observed in the training set. In addition to being more elegant than sliding windows, we demonstrate experimentally on the PASCAL VOC 2010 dataset that our strategies evaluate two orders of magnitude fewer windows while achieving higher object detection performance. 1
same-paper 4 0.87741661 74 nips-2012-Collaborative Gaussian Processes for Preference Learning
Author: Neil Houlsby, Ferenc Huszar, Zoubin Ghahramani, Jose M. Hernández-lobato
Abstract: We present a new model based on Gaussian processes (GPs) for learning pairwise preferences expressed by multiple users. Inference is simplified by using a preference kernel for GPs which allows us to combine supervised GP learning of user preferences with unsupervised dimensionality reduction for multi-user systems. The model not only exploits collaborative information from the shared structure in user behavior, but may also incorporate user features if they are available. Approximate inference is implemented using a combination of expectation propagation and variational Bayes. Finally, we present an efficient active learning strategy for querying preferences. The proposed technique performs favorably on real-world data against state-of-the-art multi-user preference learning algorithms. 1
5 0.83740991 275 nips-2012-Privacy Aware Learning
Author: Martin J. Wainwright, Michael I. Jordan, John C. Duchi
Abstract: We study statistical risk minimization problems under a version of privacy in which the data is kept confidential even from the learner. In this local privacy framework, we establish sharp upper and lower bounds on the convergence rates of statistical estimation procedures. As a consequence, we exhibit a precise tradeoff between the amount of privacy the data preserves and the utility, measured by convergence rate, of any statistical estimator. 1
6 0.81285536 318 nips-2012-Sparse Approximate Manifolds for Differential Geometric MCMC
7 0.81231248 203 nips-2012-Locating Changes in Highly Dependent Data with Unknown Number of Change Points
8 0.80953056 163 nips-2012-Isotropic Hashing
9 0.80773258 291 nips-2012-Reducing statistical time-series problems to binary classification
10 0.80695903 338 nips-2012-The Perturbed Variation
11 0.80609053 352 nips-2012-Transelliptical Graphical Models
12 0.8056125 249 nips-2012-Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison
13 0.80556446 188 nips-2012-Learning from Distributions via Support Measure Machines
14 0.80519867 326 nips-2012-Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses
15 0.80518329 280 nips-2012-Proper losses for learning from partial labels
16 0.8050611 294 nips-2012-Repulsive Mixtures
18 0.8047024 277 nips-2012-Probabilistic Low-Rank Subspace Clustering
19 0.8043136 41 nips-2012-Ancestor Sampling for Particle Gibbs
20 0.80407488 142 nips-2012-Generalization Bounds for Domain Adaptation