nips nips2012 nips2012-178 knowledge-graph by maker-knowledge-mining

178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback


Source: pdf

Author: Andriy Mnih, Yee W. Teh

Abstract: User preferences for items can be inferred from either explicit feedback, such as item ratings, or implicit feedback, such as rental histories. Research in collaborative filtering has concentrated on explicit feedback, resulting in the development of accurate and scalable models. However, since explicit feedback is often difficult to collect, it is important to develop effective models that take advantage of the more widely available implicit feedback. We introduce a probabilistic approach to collaborative filtering with implicit feedback based on modelling the user’s item selection process. In the interests of scalability, we restrict our attention to tree-structured distributions over items and develop a principled and efficient algorithm for learning item trees from data. We also identify a problem with a widely used protocol for evaluating implicit feedback models and propose a way of addressing it using a small quantity of explicit feedback data. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Gatsby Computational Neuroscience Unit, University College London. Abstract: User preferences for items can be inferred from either explicit feedback, such as item ratings, or implicit feedback, such as rental histories. [sent-7, score-0.94]

2 However, since explicit feedback is often difficult to collect, it is important to develop effective models that take advantage of the more widely available implicit feedback. [sent-9, score-0.434]

3 We introduce a probabilistic approach to collaborative filtering with implicit feedback based on modelling the user’s item selection process. [sent-10, score-0.963]

4 In the interests of scalability, we restrict our attention to tree-structured distributions over items and develop a principled and efficient algorithm for learning item trees from data. [sent-11, score-0.861]

5 We also identify a problem with a widely used protocol for evaluating implicit feedback models and propose a way of addressing it using a small quantity of explicit feedback data. [sent-12, score-0.72]

6 However, since explicit feedback is often difficult to collect, it is essential to develop effective models that take advantage of the more abundant implicit feedback, such as logs of user purchases, rentals, or clicks. [sent-20, score-0.641]

7 The difficulty of modelling implicit feedback comes from the fact that it contains only positive examples, since users explicitly express their interest by selecting items, but not their disinterest. [sent-21, score-0.474]

8 Just like their explicit feedback counterparts, the most successful implicit feedback collaborative filtering (IFCF) methods are based on matrix factorization [4, 10, 9]. [sent-23, score-0.778]

9 Recently [11] introduced a new method, called Bayesian Personalized Ranking (BPR), for modelling implicit feedback that is based on more realistic assumptions than BMF. [sent-29, score-0.414]

10 Instead of assuming that users like the selected items and dislike the unselected ones, it assumes that users merely prefer the former to the latter. [sent-30, score-0.582]

11 The model is presented with selected/unselected item pairs and is trained to rank the selected items above the unselected ones. [sent-31, score-0.867]

12 Since the number of such pairs is typically very large, the unselected items are sampled at random. [sent-32, score-0.463]

13 In this paper we develop a new method that explicitly models the user’s item selection process using a probabilistic model that, unlike the existing approaches, can generate new item lists. [sent-33, score-1.055]

14 Like BPR it assumes that selected items are more interesting than the unselected ones. [sent-34, score-0.462]

15 Unlike BPR, however, it represents the appeal of items to a user using a probability distribution, producing a complete ordering of items by probability value. [sent-35, score-0.933]

16 Since the accuracy of the resulting models depends heavily on the choice of the tree structure, we develop an algorithm for learning trees from data that takes into account the structure of the model the tree will be used with. [sent-37, score-0.475]

17 We then turn our attention to the task of evaluating implicit feedback models and point out a problem with a widely used evaluation protocol, which stems from the assumption that all items not selected by a user are irrelevant. [sent-38, score-1.005]

18 2 Modelling item selection. We propose a new approach to collaborative filtering with implicit feedback based on modelling the item selection process performed by each user. [sent-40, score-1.346]

19 The identities of the items selected by a user are modelled as independent samples from a user-specific distribution over all available items. [sent-41, score-0.61]

20 First, we assume that user preferences do not change with time and model all items chosen by a user as independent samples from a fixed user-specific distribution. [sent-45, score-0.805]

21 We believe that sampling with replacement is a reasonable approximation to sampling without replacement in this case because the space of items is large while the number of items selected by a user is relatively small. [sent-47, score-0.971]

22 These simplifications allow us to model the identities of the items selected by a user as IID samples. [sent-48, score-0.61]

23 As is typical for matrix factorization methods in collaborative filtering, we represent users and items with real-valued vectors of latent factors. [sent-50, score-0.608]

24 The factor vectors for user u and item i will be denoted by Uu and Vi respectively. [sent-51, score-0.657]

25 Intuitively, Uu captures the preferences of user u, while Vi encodes the properties of item i. [sent-52, score-0.618]

26 Both user and item factor vectors are unobserved and so have to be learned from the observed user/item pairs. [sent-53, score-0.744]

27 The dot product between Uu and Vi quantifies the preference of user u for item i. [sent-54, score-0.608]

28 We define the probability of user u choosing item i as $P(i|u) = \frac{\exp(U_u^\top V_i + c_i)}{\sum_k \exp(U_u^\top V_k + c_k)}$ (1), where $c_i$ is the bias parameter that captures the overall popularity of item i and index k ranges over all items in the inventory. [sent-55, score-1.336]
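
A minimal NumPy sketch of Eq. 1 (not from the paper; array names are illustrative): it scores every item with the user's factor vector and normalizes over the whole inventory.

```python
import numpy as np

def item_probabilities(U_u, V, c):
    """Flat item-selection model of Eq. 1: a softmax over the whole inventory.

    U_u : (D,) factor vector for user u
    V   : (N, D) matrix of item factor vectors
    c   : (N,) vector of item bias (popularity) parameters
    """
    scores = V @ U_u + c                  # U_u . V_i + c_i for every item i
    scores -= scores.max()                # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # normalize over the full inventory
```

The denominator sums over every item in the inventory, which is exactly the linear-in-inventory cost that motivates the hierarchical model below.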

29 The main weakness of the model is that its training time is linear in the inventory size because computing the gradient of the log-probability of a single item requires explicitly considering all available items. [sent-57, score-0.464]

30 Though linear time complexity might not seem prohibitive, it severely limits the applicability of the model since collaborative filtering tasks with tens or even hundreds of thousands of items are now common. [sent-58, score-0.485]

31 3 Hierarchical item selection model. The linear time complexity of the gradient computation is a consequence of normalization over the entire inventory in Eq. 1. [sent-59, score-0.435]

32 We can speed up normalization, and thus learning, exponentially by assuming that the space of items has a known tree structure. [sent-61, score-0.534]

33 We start by supposing that we are given a K-ary tree with items at the leaves and exactly one item per leaf. [sent-62, score-0.917]

34 Such a tree is uniquely determined by specifying for each item the path from the root to the leaf containing the item. [sent-64, score-0.612]

35 To allow each user to have a different distribution over items we make the probability of choosing each child a function of the user’s factor vector. [sent-70, score-0.644]

36 The probability will also depend on the child node’s factor vector and bias in the same way as the probability of choosing an item in Eq. 1 does. [sent-71, score-0.457]

37 Then for user u, the probability of moving from node $n_j$ to node n on a root-to-leaf tree traversal is given by $P(n|n_j, u) = \frac{\exp(U_u^\top Q_n + b_n)}{\sum_{m \in C(n_j)} \exp(U_u^\top Q_m + b_m)}$ (2) if n is a child of $n_j$, and 0 otherwise. [sent-74, score-0.728]

38 The probability of selecting item i is then given by the product of the probabilities of the decisions that lead from the root to the leaf containing i: $P(i|u) = \prod_{j=1}^{L_i} P(n_j^i \mid n_{j-1}^i, u)$. [sent-76, score-0.432]
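
The following sketch is illustrative only (the tree representation and names are assumptions, not the authors' code): it evaluates P(i|u) by multiplying the per-node choice probabilities of Eq. 2 along the item's root-to-leaf path.

```python
import numpy as np

def tree_item_probability(U_u, Q, b, children, path):
    """Hierarchical model: P(i|u) as a product of the node choices of Eq. 2.

    U_u      : (D,) user factor vector
    Q        : (num_nodes, D) node factor vectors
    b        : (num_nodes,) node biases
    children : dict mapping a node id to the list of its children's ids
    path     : node ids from a child of the root down to item i's leaf; node 0 is the root
    """
    prob, parent = 1.0, 0
    for node in path:
        siblings = children[parent]                  # the competing children C(n_j)
        scores = Q[siblings] @ U_u + b[siblings]
        scores -= scores.max()
        probs = np.exp(scores) / np.exp(scores).sum()
        prob *= probs[siblings.index(node)]          # P(node | parent, u), Eq. 2
        parent = node
    return prob
```

Each normalization is over at most K children, so the cost per item is proportional to K times the tree depth rather than to the inventory size.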

39 Since computing the probability of a single item takes time linear in the item’s depth in the tree, we want to avoid trees that are too unbalanced. [sent-81, score-0.478]

40 To produce a model that generalizes well we also want to avoid trees with difficult classification problems at the internal nodes [1], which correspond to hard-to-predict item paths. [sent-82, score-0.502]

41 One way to produce a tree that results in relatively easy classification problems is to assign similar items to the same class, which is the approach of [7] and [14]. [sent-83, score-0.534]

42 In Section 5 we will develop a scalable model-based algorithm for learning trees with item paths that are easy to predict using Eq. [sent-85, score-0.554]

43 Of these approaches our algorithm is most similar to the one in [14], which looks for a tree structure that avoids, as much as possible, having to discriminate between easily confused items. [sent-99, score-0.534]

44 Since finding the best tree is intractable, we take a greedy approach that constructs the tree one level at a time, learning the lth node of each item path before fixing it and advancing to the (l + 1)st node. [sent-106, score-0.95]
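
A schematic sketch of this greedy loop (all helper functions are hypothetical placeholders, not the paper's implementation):

```python
def learn_tree_greedily(items, num_levels, init_level, improve_level, finetune):
    """Greedy level-by-level construction: fix the l-th node of every item path
    before moving on to level l + 1. All helpers are hypothetical placeholders."""
    paths = {i: [] for i in items}                      # partial root-to-leaf paths
    for level in range(num_levels):
        assignment = init_level(paths, level)           # e.g. random or clustering-based init
        assignment = improve_level(assignment, level)   # incremental arg-max updates (see Eq. 7)
        for i in items:
            paths[i].append(assignment[i])              # fix the l-th node of each path
    return finetune(paths)                              # finetune all parameters given the paths
```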

45 node biases and factor vectors, jointly with the item paths. [sent-109, score-0.496]

46 We then extract the user vectors learned by the model and use them to learn a better tree from the data. [sent-114, score-0.449]

47 Finetuning is necessary because the parameters learned while building the tree are based on the fixed user factor vectors from the random-tree-based model. [sent-118, score-0.483]

48 Suppose we have learned the first l − 1 nodes of each item path and would like to learn the lth node. [sent-122, score-0.546]

49 Let Ui be the set of users who rated item i in the training set. [sent-123, score-0.513]

50 The contribution made by item i to the log-likelihood is then given by $L_i = \log \prod_{u \in U_i} P(i|u) = \sum_{u \in U_i} \log \prod_j P(n_j^i \mid n_{j-1}^i, u) = \sum_{u \in U_i} \sum_j \log P(n_j^i \mid n_{j-1}^i, u)$. [sent-124, score-0.473]

51 The third term is the log-probability of item i under the subtree rooted at node $n_l^i$, which depends on the structure and parameters of that subtree, which we have not learned yet. [sent-127, score-0.969]

52 To emphasize the fact that this term is based on a user-dependent distribution over items under node $n_l^i$, we will denote it by $\log P(i|n_l^i, u)$. [sent-128, score-0.917]

53 Since jointly optimizing over the lth node in all item paths is infeasible, we have to resort to incremental updates, maximizing Ll over the lth node in one item path at a time. [sent-131, score-1.121]

54 Unfortunately, even this operation is intractable because evaluating each value of $n_l^i$ requires knowing the optimal contribution from the still-to-be-learned levels of the tree, which is the second term in Eq. [sent-132, score-0.513]

55 In other words, to find the optimal $n_l^i$ we need to compute $n_l^i = \arg\max_{n \in C(n_{l-1}^i)} \left[ \sum_{u \in U_i} \log P(n \mid n_{l-1}^i, u) + F(n, n_{l-1}^i) \right]$, (7) where we left out the terms that do not depend on $n_l^i$. [sent-134, score-1.828]

56 3 Approximating the future. The value of $F(n_l^i, n_{l-1}^i)$ quantifies the difficulty of discriminating between items assigned to node $n_{l-1}^i$ using the best tree structure and parameter setting possible, given that item i is assigned to the child $n_l^i$ of that node. [sent-137, score-2.425]

57 Since Eq. 8 rules out degenerate solutions where all items below a node are assigned to the same child of it, leaving $F(n_l^i, n_{l-1}^i)$ out to make the optimization problem easier is not an option. [sent-139, score-0.951]

58 We maximize Eq. 7 while avoiding the degenerate solutions by approximating the user-dependent distributions $P(k|n_l^k, u)$ by simpler distributions that make it much easier to evaluate $F(n_l^i, n_{l-1}^i)$ for each candidate value of $n_l^i$. [sent-141, score-0.902]

59 Since computing $F(n_l^i, n_{l-1}^i)$ requires maximizing over the free parameters of $P(k|n_l^k, u)$, choosing a parameterization of $P(k|n_l^k, u)$ that makes this maximization easy can greatly speed up this computation. [sent-142, score-0.451]

60 The main advantage of this parameterization is that the optimal $P(k|n_l^k)$ can be computed by counting the number of times each item assigned to node $n_l^k$ occurs in the training data and normalizing. [sent-144, score-0.636]

61 Under the constraint of Eq. 8, the maximum is achieved at $P(i|n_l^k) = \begin{cases} N_i / \sum_{m \in I(n_l^k)} N_m & \text{if } i \in I(n_l^k) \\ 0 & \text{otherwise} \end{cases}$ (9), where $N_i$ is the number of times item i occurs in the training set. [sent-146, score-0.412]

62 The corresponding value of $F(n_l^i, n_{l-1}^i)$ is given by $F(n_l^i, n_{l-1}^i) = \sum_{k \in I(n_{l-1}^i)} N_k \log \frac{N_k}{\sum_{m \in I(n_l^k)} N_m}$. [sent-147, score-0.926]
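
A small sketch of this count-based computation of F (illustrative only; the data structures are assumptions): every item k below the parent contributes $N_k \log(N_k / Z_c)$, where $Z_c$ sums the counts of the items assigned to the same child as k.

```python
import numpy as np

def future_term(counts, child_of):
    """Count-based value of F from the equation above.

    counts   : dict item -> N_k (occurrences in the training set)
    child_of : dict item -> child node the item is currently assigned to
    """
    z = {}
    for k, c in child_of.items():
        z[c] = z.get(c, 0.0) + counts[k]            # Z_c: total count under child c
    return sum(counts[k] * np.log(counts[k] / z[child_of[k]]) for k in child_of)
```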

63 Since adding a constant to $F(n_l^i, n_{l-1}^i)$ has no effect on the solution of Eq. [sent-150, score-0.451]

64 7 and the first term in the equation does not depend on $n_l^i$, we can drop it to get $\tilde{F}(n_l^i, n_{l-1}^i) = -\sum_{c \in C(n_{l-1}^i)} Z_c \log Z_c$ (12). [sent-151, score-0.926]

65 To compute $\tilde{F}(n_l^i, n_{l-1}^i)$ efficiently, we store the $Z_c$’s and the old $\tilde{F}(n_l^i, n_{l-1}^i)$ value, updating them whenever an item is moved to a different node. [sent-152, score-1.307]
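
An illustrative sketch of that incremental bookkeeping (hypothetical names; it assumes $Z_c$ is the total count of the items assigned to child c): moving one item only touches two terms of Eq. 12.

```python
import numpy as np

def move_item(F_tilde, Z, count_i, old_child, new_child):
    """Incrementally update F~ = -sum_c Z_c log Z_c (Eq. 12) when an item with
    count count_i is moved from old_child to new_child.

    Z : dict child node -> Z_c (updated in place). Returns the new F~ value.
    """
    def term(c):
        zc = Z.get(c, 0.0)
        return -zc * np.log(zc) if zc > 0 else 0.0

    F_tilde -= term(old_child) + term(new_child)   # remove the two stale terms
    Z[old_child] = Z.get(old_child, 0.0) - count_i
    Z[new_child] = Z.get(new_child, 0.0) + count_i
    F_tilde += term(old_child) + term(new_child)   # add the two refreshed terms
    return F_tilde
```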

66 The first term in Eq. 7, corresponding to the contribution of the lth path node for item i, can be computed efficiently. [sent-155, score-0.581]

67 Since we assume that the user factor vectors are known and fixed, we precompute $R_i = \sum_{u \in U_i} U_u$ for each item, which can be seen as creating a surrogate representation for item i. [sent-158, score-0.657]

68 Substituting this into Eq. 7 gives us the following update for item nodes: $n_l^i = \arg\max_{n \in C(n_{l-1}^i)} \left[ R_i^\top Q_n + |U_i| b_n + \tilde{F}(n, n_{l-1}^i) \right]$ (14). [sent-161, score-1.285]
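
A minimal sketch of this update (names are illustrative, not the authors' code): with $R_i$ and $|U_i|$ precomputed, choosing the l-th path node for item i is a single arg-max over the candidate children.

```python
import numpy as np

def best_child(R_i, num_users_i, Q, b, candidates, F_tilde):
    """Greedy choice of the l-th path node for item i (Eq. 14).

    R_i         : precomputed sum of the factor vectors of the users who chose item i
    num_users_i : |U_i|, the number of such users
    Q, b        : node factor vectors and node biases
    candidates  : list of candidate child node ids, C(n^i_{l-1})
    F_tilde     : dict node id -> approximated future term F~(n, n^i_{l-1})
    """
    scores = [R_i @ Q[n] + num_users_i * b[n] + F_tilde[n] for n in candidates]
    return candidates[int(np.argmax(scores))]
```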

69 6 Evaluating models of implicit feedback. Establishing sensible evaluation protocols for machine learning problems is important because they effectively define what “better” performance means and implicitly guide the development of future methods. [sent-162, score-0.436]

70 Given that the problem of implicit feedback collaborative filtering is relatively new, it is not surprising that the typical evaluation protocol was adopted from information retrieval. [sent-163, score-0.546]

71 Implicit feedback models are typically evaluated using information retrieval metrics, such as Mean Average Precision (MAP), that require knowing which items are relevant and which are irrelevant to each user. [sent-165, score-0.676]

72 It is typical to assume that the items the user selected are relevant and all others are not [10]. [sent-166, score-0.61]

73 However, this approach is problematic because it fails to distinguish between the items the user really has no interest in (i.e. [sent-167, score-0.57]

74 the truly irrelevant ones) and the relevant items the user simply did not rate. [sent-169, score-0.665]

75 And while the irrelevant items do tend to be far more numerous than the unobserved relevant ones, the effect of the latter can still be strong enough to affect model comparison, as we demonstrate in the next section. [sent-170, score-0.477]

76 To address this issue, we propose using some explicit feedback information to identify a small number of truly irrelevant items for each user and using them in place of items of unknown relevance in the evaluation. [sent-171, score-1.315]

77 Thus the models will be evaluated on their ability to rank the truly relevant items above the truly irrelevant ones, which we believe is the ultimate task of collaborative filtering. [sent-172, score-0.645]

78 The model that assigns the correct item a probability of 1 has a perplexity of 1, while the model that assigns all N items the same probability (1/N) has a perplexity of N. [sent-175, score-1.016]
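
The paper's exact perplexity formula (its Eq. 15) is not reproduced in this summary; the sketch below assumes the standard definition as the exponentiated negative mean log-probability and reproduces the two boundary cases mentioned above.

```python
import numpy as np

def perplexity(log_probs):
    """Per-item perplexity: exp of the negative mean log-probability the model
    assigns to the held-out (user, selected item) pairs."""
    return float(np.exp(-np.mean(log_probs)))

# Sanity checks matching the text: a perfect model scores 1, a uniform model scores N.
assert abs(perplexity(np.log([1.0, 1.0, 1.0])) - 1.0) < 1e-9
N = 1000
assert abs(perplexity(np.full(5, np.log(1.0 / N))) - N) < 1e-6
```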

79 Unlike the ranking metrics above, perplexity is computed based on the selected/relevant items alone and does not require assuming that the unselected items are irrelevant. [sent-176, score-1.026]

80 We trained three models with 5-dimensional factor vectors. (Footnote 1: The implicit assumption here is that the selected items are more relevant than the unselected ones.) [sent-182, score-0.621]

81 Table 1: Test set scores in percent on the MovieLens 10M dataset obtained by treating items with low ratings as irrelevant. [sent-183, score-0.456]

82 Table 2: Test set scores in percent on the MovieLens 10M dataset obtained by treating all unobserved items as irrelevant. [sent-220, score-0.46]

83 The training process for the learned-tree model, which included training a random-tree model, learning a tree from the resulting user factor vectors, and finetuning all the parameters, took 1 hour. [sent-238, score-0.492]

84 These results suggest that even when the number of items is relatively small our tree-based approach to item selection modelling can yield an order-of-magnitude reduction in training times relative to the flat model without hurting the predictive accuracy. [sent-240, score-0.859]

85 We used the evaluation approach described in the previous section, which involved having the models rank only the items with known relevance status. [sent-244, score-0.433]

86 We used the rating values to determine relevance, considering items rated below 3 as irrelevant and items rated 4 and above as relevant. [sent-245, score-0.892]
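
An illustrative sketch of this evaluation protocol (not the authors' evaluation code): per-user average precision is computed only over the items whose relevance is known from ratings, ignoring unrated items.

```python
import numpy as np

def average_precision(scores, relevant, irrelevant):
    """Average precision over items with known relevance only:
    rated 4 and above -> relevant, rated below 3 -> irrelevant, unrated ignored.

    scores     : dict item -> model score (e.g. P(i|u)) for one user
    relevant   : list of that user's relevant test items
    irrelevant : list of that user's irrelevant test items
    """
    items = list(relevant) + list(irrelevant)
    labels = np.array([1] * len(relevant) + [0] * len(irrelevant))
    order = np.argsort([-scores[i] for i in items])     # highest score first
    labels = labels[order]
    hits = np.cumsum(labels)
    precision_at_k = hits / (np.arange(len(labels)) + 1)
    return float(precision_at_k[labels == 1].mean())
```

Mean Average Precision (MAP) is then the mean of this quantity over all test users.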

87 We compared our hierarchical item selection model to two state-of-the-art models for implicit feedback: the Bayesian Personalized Ranking model (BPR) and the Binary Matrix Factorization model (BMF). [sent-246, score-0.564]

88 The methods are as follows: “Random” generates random balanced trees; “LearnedRI” is the method from Section 5 with randomly initialized item-node assignments; “LearnedCI” is the same method with item-node assignments initialized by clustering surrogate item representations Ri from Section 5. [sent-249, score-0.422]

89 Training a flat item selection model on this dataset was infeasible, as a single pass through the data took six hours, compared to a mere two minutes for CIS (LearnedCI). [sent-251, score-0.427]

90 Comparing the results of CIS (Learned) and CIS (Random) shows that the structure of the tree used has a strong effect on the performance of CIS models and that using trees learned with the proposed algorithm makes CIS competitive with the best collaborative filtering models. [sent-259, score-0.444]

91 We then determined how discriminative the decisions were at each level of the tree by replacing the user-dependent distributions under all nodes at a particular depth by the best user-independent approximations (frequencies of items under the node). [sent-263, score-0.598]

92 To highlight the importance of excluding items of unknown relevance when evaluating implicit feedback models, we recomputed the performance metrics, treating all items not rated by a user as irrelevant. [sent-267, score-1.469]

93 In retrospect, these changes in relative performance are not particularly surprising since the training algorithm for BMF treats unobserved items as negative examples, which perfectly matches the assumption the evaluation is based on, namely that unobserved items are irrelevant. [sent-269, score-0.873]

94 8 Discussion. We proposed a model that, in addition to being competitive with the best implicit feedback models in terms of predictive accuracy, also provides calibrated item selection probabilities for each user, which quantify the user’s interest in the items. [sent-271, score-0.775]

95 These probabilities allow comparing the degree of interest in an item across users, making it possible to maximize the total user satisfaction when item availability is limited. [sent-272, score-0.973]

96 Our algorithm can be applied in this setting by noticing the correspondence between items and labels, and between user factor vectors and input vectors. [sent-280, score-0.637]

97 However, unlike in collaborative filtering where user factor vectors have to be learned, in this case input vectors are observed, which eliminates the need to train a model based on a random tree before applying the tree-learning algorithm. [sent-281, score-0.6]

98 We believe that evaluation protocols for implicit feedback models deserve more attention than they have received. [sent-282, score-0.426]

99 In this paper we observed that one widely used protocol can produce misleading results due to an unrealistic assumption it makes about item relevance. [sent-283, score-0.455]

100 We proposed using a small quantity of explicit feedback data to directly estimate item relevance in order to avoid having to make that assumption. [sent-284, score-0.689]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ni', 0.451), ('item', 0.383), ('items', 0.363), ('cis', 0.325), ('feedback', 0.23), ('user', 0.207), ('bpr', 0.178), ('tree', 0.171), ('perplexity', 0.135), ('nk', 0.127), ('collaborative', 0.122), ('implicit', 0.122), ('uu', 0.113), ('bmf', 0.096), ('trees', 0.095), ('node', 0.079), ('unselected', 0.078), ('ltering', 0.075), ('lth', 0.072), ('ui', 0.071), ('zc', 0.065), ('modelling', 0.062), ('users', 0.06), ('learnedci', 0.059), ('learnedri', 0.059), ('nj', 0.052), ('protocol', 0.052), ('qn', 0.05), ('unobserved', 0.049), ('bn', 0.048), ('metrics', 0.048), ('irrelevant', 0.046), ('movielens', 0.045), ('ratings', 0.045), ('ppl', 0.044), ('explicit', 0.044), ('nl', 0.043), ('language', 0.043), ('rated', 0.041), ('child', 0.04), ('ranking', 0.039), ('andriy', 0.039), ('rating', 0.038), ('learned', 0.038), ('factor', 0.034), ('vectors', 0.033), ('relevance', 0.032), ('scalable', 0.032), ('truly', 0.03), ('factorization', 0.03), ('ifcf', 0.03), ('inventory', 0.03), ('yehuda', 0.03), ('path', 0.029), ('training', 0.029), ('leaf', 0.029), ('preferences', 0.028), ('scores', 0.027), ('sensible', 0.027), ('personalized', 0.027), ('maxn', 0.026), ('constructs', 0.025), ('gatsby', 0.025), ('evaluating', 0.024), ('nodes', 0.024), ('mnih', 0.024), ('log', 0.024), ('nm', 0.024), ('paths', 0.024), ('pan', 0.023), ('updating', 0.022), ('pairs', 0.022), ('took', 0.022), ('selection', 0.022), ('probabilistic', 0.022), ('weakness', 0.022), ('label', 0.021), ('treating', 0.021), ('vi', 0.021), ('selected', 0.021), ('decisions', 0.02), ('levels', 0.02), ('level', 0.02), ('balanced', 0.02), ('develop', 0.02), ('unrealistic', 0.02), ('evaluation', 0.02), ('clustering', 0.019), ('protocols', 0.019), ('hierarchical', 0.019), ('relevant', 0.019), ('identities', 0.019), ('preference', 0.018), ('contribution', 0.018), ('assigned', 0.018), ('models', 0.018), ('recommender', 0.018), ('subtree', 0.018), ('rong', 0.018), ('believe', 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback

Author: Andriy Mnih, Yee W. Teh

Abstract: User preferences for items can be inferred from either explicit feedback, such as item ratings, or implicit feedback, such as rental histories. Research in collaborative filtering has concentrated on explicit feedback, resulting in the development of accurate and scalable models. However, since explicit feedback is often difficult to collect, it is important to develop effective models that take advantage of the more widely available implicit feedback. We introduce a probabilistic approach to collaborative filtering with implicit feedback based on modelling the user’s item selection process. In the interests of scalability, we restrict our attention to tree-structured distributions over items and develop a principled and efficient algorithm for learning item trees from data. We also identify a problem with a widely used protocol for evaluating implicit feedback models and propose a way of addressing it using a small quantity of explicit feedback data. 1

2 0.24840783 75 nips-2012-Collaborative Ranking With 17 Parameters

Author: Maksims Volkovs, Richard S. Zemel

Abstract: The primary application of collaborate filtering (CF) is to recommend a small set of items to a user, which entails ranking. Most approaches, however, formulate the CF problem as rating prediction, overlooking the ranking perspective. In this work we present a method for collaborative ranking that leverages the strengths of the two main CF approaches, neighborhood- and model-based. Our novel method is highly efficient, with only seventeen parameters to optimize and a single hyperparameter to tune, and beats the state-of-the-art collaborative ranking methods. We also show that parameters learned on datasets from one item domain yield excellent results on a dataset from very different item domain, without any retraining. 1

3 0.24619544 49 nips-2012-Automatic Feature Induction for Stagewise Collaborative Filtering

Author: Joonseok Lee, Mingxuan Sun, Guy Lebanon, Seung-jean Kim

Abstract: Recent approaches to collaborative filtering have concentrated on estimating an algebraic or statistical model, and using the model for predicting missing ratings. In this paper we observe that different models have relative advantages in different regions of the input space. This motivates our approach of using stagewise linear combinations of collaborative filtering algorithms, with non-constant combination coefficients based on kernel smoothing. The resulting stagewise model is computationally scalable and outperforms a wide selection of state-of-the-art collaborative filtering algorithms. 1

4 0.20231961 74 nips-2012-Collaborative Gaussian Processes for Preference Learning

Author: Neil Houlsby, Ferenc Huszar, Zoubin Ghahramani, Jose M. Hernández-lobato

Abstract: We present a new model based on Gaussian processes (GPs) for learning pairwise preferences expressed by multiple users. Inference is simplified by using a preference kernel for GPs which allows us to combine supervised GP learning of user preferences with unsupervised dimensionality reduction for multi-user systems. The model not only exploits collaborative information from the shared structure in user behavior, but may also incorporate user features if they are available. Approximate inference is implemented using a combination of expectation propagation and variational Bayes. Finally, we present an efficient active learning strategy for querying preferences. The proposed technique performs favorably on real-world data against state-of-the-art multi-user preference learning algorithms. 1

5 0.20044534 111 nips-2012-Efficient Sampling for Bipartite Matching Problems

Author: Maksims Volkovs, Richard S. Zemel

Abstract: Bipartite matching problems characterize many situations, ranging from ranking in information retrieval to correspondence in vision. Exact inference in realworld applications of these problems is intractable, making efficient approximation methods essential for learning and inference. In this paper we propose a novel sequential matching sampler based on a generalization of the PlackettLuce model, which can effectively make large moves in the space of matchings. This allows the sampler to match the difficult target distributions common in these problems: highly multimodal distributions with well separated modes. We present experimental results with bipartite matching problems—ranking and image correspondence—which show that the sequential matching sampler efficiently approximates the target distribution, significantly outperforming other sampling approaches. 1

6 0.18894519 60 nips-2012-Bayesian nonparametric models for ranked data

7 0.18569899 165 nips-2012-Iterative ranking from pair-wise comparisons

8 0.12565444 30 nips-2012-Accuracy at the Top

9 0.12238095 246 nips-2012-Nonparametric Max-Margin Matrix Factorization for Collaborative Prediction

10 0.110565 107 nips-2012-Effective Split-Merge Monte Carlo Methods for Nonparametric Models of Sequential Data

11 0.10088141 339 nips-2012-The Time-Marginalized Coalescent Prior for Hierarchical Clustering

12 0.093692884 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

13 0.087989196 260 nips-2012-Online Sum-Product Computation Over Trees

14 0.086445354 189 nips-2012-Learning from the Wisdom of Crowds by Minimax Entropy

15 0.083244435 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

16 0.078015551 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking

17 0.077039689 81 nips-2012-Context-Sensitive Decision Forests for Object Detection

18 0.072613351 215 nips-2012-Minimizing Uncertainty in Pipelines

19 0.072316825 155 nips-2012-Human memory search as a random walk in a semantic network

20 0.071232893 180 nips-2012-Learning Mixtures of Tree Graphical Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.171), (1, 0.039), (2, -0.021), (3, -0.044), (4, -0.114), (5, -0.141), (6, -0.004), (7, 0.146), (8, -0.022), (9, 0.302), (10, -0.228), (11, 0.12), (12, -0.164), (13, 0.017), (14, 0.039), (15, 0.063), (16, -0.047), (17, 0.078), (18, 0.062), (19, 0.038), (20, 0.005), (21, -0.013), (22, 0.124), (23, -0.143), (24, -0.045), (25, 0.022), (26, 0.015), (27, -0.093), (28, 0.132), (29, 0.027), (30, -0.039), (31, 0.092), (32, -0.072), (33, -0.116), (34, -0.041), (35, 0.069), (36, -0.019), (37, 0.079), (38, -0.082), (39, -0.027), (40, 0.087), (41, 0.055), (42, -0.037), (43, -0.09), (44, -0.009), (45, -0.043), (46, -0.039), (47, 0.054), (48, 0.011), (49, -0.05)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96757299 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback

Author: Andriy Mnih, Yee W. Teh

Abstract: User preferences for items can be inferred from either explicit feedback, such as item ratings, or implicit feedback, such as rental histories. Research in collaborative filtering has concentrated on explicit feedback, resulting in the development of accurate and scalable models. However, since explicit feedback is often difficult to collect it is important to develop effective models that take advantage of the more widely available implicit feedback. We introduce a probabilistic approach to collaborative filtering with implicit feedback based on modelling the user’s item selection process. In the interests of scalability, we restrict our attention to treestructured distributions over items and develop a principled and efficient algorithm for learning item trees from data. We also identify a problem with a widely used protocol for evaluating implicit feedback models and propose a way of addressing it using a small quantity of explicit feedback data. 1

2 0.81054729 49 nips-2012-Automatic Feature Induction for Stagewise Collaborative Filtering

Author: Joonseok Lee, Mingxuan Sun, Guy Lebanon, Seung-jean Kim

Abstract: Recent approaches to collaborative filtering have concentrated on estimating an algebraic or statistical model, and using the model for predicting missing ratings. In this paper we observe that different models have relative advantages in different regions of the input space. This motivates our approach of using stagewise linear combinations of collaborative filtering algorithms, with non-constant combination coefficients based on kernel smoothing. The resulting stagewise model is computationally scalable and outperforms a wide selection of state-of-the-art collaborative filtering algorithms. 1

3 0.78981537 75 nips-2012-Collaborative Ranking With 17 Parameters

Author: Maksims Volkovs, Richard S. Zemel

Abstract: The primary application of collaborate filtering (CF) is to recommend a small set of items to a user, which entails ranking. Most approaches, however, formulate the CF problem as rating prediction, overlooking the ranking perspective. In this work we present a method for collaborative ranking that leverages the strengths of the two main CF approaches, neighborhood- and model-based. Our novel method is highly efficient, with only seventeen parameters to optimize and a single hyperparameter to tune, and beats the state-of-the-art collaborative ranking methods. We also show that parameters learned on datasets from one item domain yield excellent results on a dataset from very different item domain, without any retraining. 1

4 0.59604597 111 nips-2012-Efficient Sampling for Bipartite Matching Problems

Author: Maksims Volkovs, Richard S. Zemel

Abstract: Bipartite matching problems characterize many situations, ranging from ranking in information retrieval to correspondence in vision. Exact inference in realworld applications of these problems is intractable, making efficient approximation methods essential for learning and inference. In this paper we propose a novel sequential matching sampler based on a generalization of the PlackettLuce model, which can effectively make large moves in the space of matchings. This allows the sampler to match the difficult target distributions common in these problems: highly multimodal distributions with well separated modes. We present experimental results with bipartite matching problems—ranking and image correspondence—which show that the sequential matching sampler efficiently approximates the target distribution, significantly outperforming other sampling approaches. 1

5 0.58972383 189 nips-2012-Learning from the Wisdom of Crowds by Minimax Entropy

Author: Dengyong Zhou, Sumit Basu, Yi Mao, John C. Platt

Abstract: An important way to make large training sets is to gather noisy labels from crowds of nonexperts. We propose a minimax entropy principle to improve the quality of these labels. Our method assumes that labels are generated by a probability distribution over workers, items, and labels. By maximizing the entropy of this distribution, the method naturally infers item confusability and worker expertise. We infer the ground truth by minimizing the entropy of this distribution, which we show minimizes the Kullback-Leibler (KL) divergence between the probability distribution and the unknown truth. We show that a simple coordinate descent scheme can optimize minimax entropy. Empirically, our results are substantially better than previously published methods for the same problem. 1

6 0.58663321 165 nips-2012-Iterative ranking from pair-wise comparisons

7 0.57630181 30 nips-2012-Accuracy at the Top

8 0.54716003 74 nips-2012-Collaborative Gaussian Processes for Preference Learning

9 0.50609481 155 nips-2012-Human memory search as a random walk in a semantic network

10 0.50477296 107 nips-2012-Effective Split-Merge Monte Carlo Methods for Nonparametric Models of Sequential Data

11 0.49716803 246 nips-2012-Nonparametric Max-Margin Matrix Factorization for Collaborative Prediction

12 0.46441209 194 nips-2012-Learning to Discover Social Circles in Ego Networks

13 0.464053 60 nips-2012-Bayesian nonparametric models for ranked data

14 0.4016203 288 nips-2012-Rational inference of relative preferences

15 0.38457495 223 nips-2012-Multi-criteria Anomaly Detection using Pareto Depth Analysis

16 0.37967664 260 nips-2012-Online Sum-Product Computation Over Trees

17 0.36707032 183 nips-2012-Learning Partially Observable Models Using Temporally Abstract Decision Trees

18 0.35296252 339 nips-2012-The Time-Marginalized Coalescent Prior for Hierarchical Clustering

19 0.34825194 215 nips-2012-Minimizing Uncertainty in Pipelines

20 0.33622554 232 nips-2012-Multiplicative Forests for Continuous-Time Processes


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.051), (9, 0.021), (21, 0.042), (38, 0.225), (39, 0.014), (42, 0.025), (53, 0.013), (54, 0.03), (55, 0.015), (72, 0.166), (74, 0.07), (76, 0.117), (80, 0.078), (92, 0.045)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.89336097 250 nips-2012-On-line Reinforcement Learning Using Incremental Kernel-Based Stochastic Factorization

Author: Doina Precup, Joelle Pineau, Andre S. Barreto

Abstract: Kernel-based stochastic factorization (KBSF) is an algorithm for solving reinforcement learning tasks with continuous state spaces which builds a Markov decision process (MDP) based on a set of sample transitions. What sets KBSF apart from other kernel-based approaches is the fact that the size of its MDP is independent of the number of transitions, which makes it possible to control the trade-off between the quality of the resulting approximation and the associated computational cost. However, KBSF’s memory usage grows linearly with the number of transitions, precluding its application in scenarios where a large amount of data must be processed. In this paper we show that it is possible to construct KBSF’s MDP in a fully incremental way, thus freeing the space complexity of this algorithm from its dependence on the number of sample transitions. The incremental version of KBSF is able to process an arbitrary amount of data, which results in a model-based reinforcement learning algorithm that can be used to solve continuous MDPs in both off-line and on-line regimes. We present theoretical results showing that KBSF can approximate the value function that would be computed by conventional kernel-based learning with arbitrary precision. We empirically demonstrate the effectiveness of the proposed algorithm in the challenging threepole balancing task, in which the ability to process a large number of transitions is crucial for success. 1

same-paper 2 0.89220995 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback

Author: Andriy Mnih, Yee W. Teh

Abstract: User preferences for items can be inferred from either explicit feedback, such as item ratings, or implicit feedback, such as rental histories. Research in collaborative filtering has concentrated on explicit feedback, resulting in the development of accurate and scalable models. However, since explicit feedback is often difficult to collect, it is important to develop effective models that take advantage of the more widely available implicit feedback. We introduce a probabilistic approach to collaborative filtering with implicit feedback based on modelling the user’s item selection process. In the interests of scalability, we restrict our attention to tree-structured distributions over items and develop a principled and efficient algorithm for learning item trees from data. We also identify a problem with a widely used protocol for evaluating implicit feedback models and propose a way of addressing it using a small quantity of explicit feedback data. 1

3 0.8795709 107 nips-2012-Effective Split-Merge Monte Carlo Methods for Nonparametric Models of Sequential Data

Author: Michael C. Hughes, Erik B. Sudderth, Emily B. Fox

Abstract: Applications of Bayesian nonparametric methods require learning and inference algorithms which efficiently explore models of unbounded complexity. We develop new Markov chain Monte Carlo methods for the beta process hidden Markov model (BP-HMM), enabling discovery of shared activity patterns in large video and motion capture databases. By introducing split-merge moves based on sequential allocation, we allow large global changes in the shared feature structure. We also develop data-driven reversible jump moves which more reliably discover rare or unique behaviors. Our proposals apply to any choice of conjugate likelihood for observed data, and we show success with multinomial, Gaussian, and autoregressive emission models. Together, these innovations allow tractable analysis of hundreds of time series, where previous inference required clever initialization and lengthy burn-in periods for just six sequences. 1

4 0.87488759 9 nips-2012-A Geometric take on Metric Learning

Author: Søren Hauberg, Oren Freifeld, Michael J. Black

Abstract: Multi-metric learning techniques learn local metric tensors in different parts of a feature space. With such an approach, even simple classifiers can be competitive with the state-of-the-art because the distance measure locally adapts to the structure of the data. The learned distance measure is, however, non-metric, which has prevented multi-metric learning from generalizing to tasks such as dimensionality reduction and regression in a principled way. We prove that, with appropriate changes, multi-metric learning corresponds to learning the structure of a Riemannian manifold. We then show that this structure gives us a principled way to perform dimensionality reduction and regression according to the learned metrics. Algorithmically, we provide the first practical algorithm for computing geodesics according to the learned metrics, as well as algorithms for computing exponential and logarithmic maps on the Riemannian manifold. Together, these tools let many Euclidean algorithms take advantage of multi-metric learning. We illustrate the approach on regression and dimensionality reduction tasks that involve predicting measurements of the human body from shape data. 1 Learning and Computing Distances Statistics relies on measuring distances. When the Euclidean metric is insufficient, as is the case in many real problems, standard methods break down. This is a key motivation behind metric learning, which strives to learn good distance measures from data. In the most simple scenarios a single metric tensor is learned, but in recent years, several methods have proposed learning multiple metric tensors, such that different distance measures are applied in different parts of the feature space. This has proven to be a very powerful approach for classification tasks [1, 2], but the approach has not generalized to other tasks. Here we consider the generalization of Principal Component Analysis (PCA) and linear regression; see Fig. 1 for an illustration of our approach. The main problem with generalizing multi-metric learning is that it is based on assumptions that make the feature space both non-smooth and non-metric. Specifically, it is often assumed that straight lines form geodesic curves and that the metric tensor stays constant along these lines. These assumptions are made because it is believed that computing the actual geodesics is intractable, requiring a discretization of the entire feature space [3]. We solve these problems by smoothing the transitions between different metric tensors, which ensures a metric space where geodesics can be computed. In this paper, we consider the scenario where the metric tensor at a given point in feature space is defined as the weighted average of a set of learned metric tensors. In this model, we prove that the feature space becomes a chart for a Riemannian manifold. This ensures a metric feature space, i.e. dist(x, y) = 0 ⇔ x = y , dist(x, y) = dist(y, x) (symmetry), (1) dist(x, z) ≤ dist(x, y) + dist(y, z) (triangle inequality). To compute statistics according to the learned metric, we need to be able to compute distances, which implies that we need to compute geodesics. Based on the observation that geodesics are 1 (a) Local Metrics & Geodesics (b) Tangent Space Representation (c) First Principal Geodesic Figure 1: Illustration of Principal Geodesic Analysis. (a) Geodesics are computed between the mean and each data point. (b) Data is mapped to the Euclidean tangent space and the first principal component is computed. 
(c) The principal component is mapped back to the feature space. smooth curves in Riemannian spaces, we derive an algorithm for computing geodesics that only requires a discretization of the geodesic rather than the entire feature space. Furthermore, we show how to compute the exponential and logarithmic maps of the manifold. With this we can map any point back and forth between a Euclidean tangent space and the manifold. This gives us a general strategy for incorporating the learned metric tensors in many Euclidean algorithms: map the data to the tangent of the manifold, perform the Euclidean analysis and map the results back to the manifold. Before deriving the algorithms (Sec. 3) we set the scene by an analysis of the shortcomings of current state-of-the-art methods (Sec. 2), which motivate our final model. The model is general and can be used for many problems. Here we illustrate it with several challenging problems in 3D body shape modeling and analysis (Sec. 4). All proofs can be found in the supplementary material along with algorithmic details and further experimental results. 2 Background and Related Work Single-metric learning learns a metric tensor, M, such that distances are measured as dist2 (xi , xj ) = xi − xj 2 M ≡ (xi − xj )T M(xi − xj ) , (2) where M is a symmetric and positive definite D × D matrix. Classic approaches for finding such a metric tensor include PCA, where the metric is given by the inverse covariance matrix of the training data; and linear discriminant analysis (LDA), where the metric tensor is M = S−1 SB S−1 , with Sw W W and SB being the within class scatter and the between class scatter respectively [9]. A more recent approach tries to learn a metric tensor from triplets of data points (xi , xj , xk ), where the metric should obey the constraint that dist(xi , xj ) < dist(xi , xk ). Here the constraints are often chosen such that xi and xj belong to the same class, while xi and xk do not. Various relaxed versions of this idea have been suggested such that the metric can be learned by solving a semi-definite or a quadratic program [1, 2, 4–8]. Among the most popular approaches is the Large Margin Nearest Neighbor (LMNN) classifier [5], which finds a linear transformation that satisfies local distance constraints, making the approach suitable for multi-modal classes. For many problems, a single global metric tensor is not enough, which motivates learning several local metric tensors. The classic work by Hastie and Tibshirani [9] advocates locally learning metric tensors according to LDA and using these as part of a kNN classifier. In a somewhat similar fashion, Weinberger and Saul [5] cluster the training data and learn a separate metric tensor for each cluster using LMNN. A more extreme point of view was taken by Frome et al. [1, 2], who learn a diagonal metric tensor for every point in the training set, such that distance rankings are preserved. Similarly, Malisiewicz and Efros [6] find a diagonal metric tensor for each training point such that the distance to a subset of the training data from the same class is kept small. Once a set of metric tensors {M1 , . . . , MR } has been learned, the distance dist(a, b) is measured according to (2) where “the nearest” metric tensor is used, i.e. R M(x) = r=1 wr (x) ˜ Mr , where wr (x) = ˜ ˜ j wj (x) 1 0 x − xr 2 r ≤ x − xj M otherwise 2 Mj , ∀j , (3) where x is either a or b depending on the algorithm. Note that this gives a non-metric distance function as it is not symmetric. 
To derive this equation, it is necessary to assume that 1) geodesics 2 −8 −8 Assumed Geodesics Location of Metric Tensors Test Points −6 −8 Actual Geodesics Location of Metric Tensors Test Points −6 Riemannian Geodesics Location of Metric Tensors Test Points −6 −4 −4 −4 −2 −2 −2 0 0 0 2 2 2 4 4 4 6 −8 6 −8 −6 −4 −2 0 (a) 2 4 6 −6 −4 −2 0 2 4 6 6 −8 −6 (b) −4 −2 (c) 0 2 4 6 (d) Figure 2: (a)–(b) An illustrative example where straight lines do not form geodesics and where the metric tensor does not stay constant along lines; see text for details. The background color is proportional to the trace of the metric tensor, such that light grey corresponds to regions where paths are short (M1 ), and dark grey corresponds to regions they are long (M2 ). (c) The suggested geometric model along with the geodesics. Again, background colour is proportional to the trace of the metric tensor; the colour scale is the same is used in (a) and (b). (d) An illustration of the exponential and logarithmic maps. form straight lines, and 2) the metric tensor stays constant along these lines [3]. Both assumptions are problematic, which we illustrate with a simple example in Fig. 2a–c. Assume we are given two metric tensors M1 = 2I and M2 = I positioned at x1 = (2, 2)T and x2 = (4, 4)T respectively. This gives rise to two regions in feature space in which x1 is nearest in the first and x2 is nearest in the second, according to (3). This is illustrated in Fig. 2a. In the same figure, we also show the assumed straight-line geodesics between selected points in space. As can be seen, two of the lines goes through both regions, such that the assumption of constant metric tensors along the line is violated. Hence, it would seem natural to measure the length of the line, by adding the length of the line segments which pass through the different regions of feature space. This was suggested by Ramanan and Baker [3] who also proposed a polynomial time algorithm for measuring these line lengths. This gives a symmetric distance function. Properly computing line lengths according to the local metrics is, however, not enough to ensure that the distance function is metric. As can be seen in Fig. 2a the straight line does not form a geodesic as a shorter path can be found by circumventing the region with the “expensive” metric tensor M1 as illustrated in Fig. 2b. This issue makes it trivial to construct cases where the triangle inequality is violated, which again makes the line length measure non-metric. In summary, if we want a metric feature space, we can neither assume that geodesics are straight lines nor that the metric tensor stays constant along such lines. In practice, good results have been reported using (3) [1,3,5], so it seems obvious to ask: is metricity required? For kNN classifiers this does not appear to be the case, with many successes based on dissimilarities rather than distances [10]. We, however, want to generalize PCA and linear regression, which both seek to minimize the reconstruction error of points projected onto a subspace. As the notion of projection is hard to define sensibly in non-metric spaces, we consider metricity essential. In order to build a model with a metric feature space, we change the weights in (3) to be smooth functions. This impose a well-behaved geometric structure on the feature space, which we take advantage of in order to perform statistical analysis according to the learned metrics. 
However, first we review the basics of Riemannian geometry as this provides the theoretical foundation of our work. 2.1 Geodesics and Riemannian Geometry We start by defining Riemannian manifolds, which intuitively are smoothly curved spaces equipped with an inner product. Formally, they are smooth manifolds endowed with a Riemannian metric [11]: Definition A Riemannian metric M on a manifold M is a smoothly varying inner product < a, b >x = aT M(x)b in the tangent space Tx M of each point x ∈ M . 3 Often Riemannian manifolds are represented by a chart; i.e. a parameter space for the curved surface. An example chart is the spherical coordinate system often used to represent spheres. While such charts are often flat spaces, the curvature of the manifold arises from the smooth changes in the metric. On a Riemannian manifold M, the length of a smooth curve c : [0, 1] → M is defined as the integral of the norm of the tangent vector (interpreted as speed) along the curve: 1 Length(c) = 1 c (λ) M(c(λ)) dλ c (λ)T M(c(λ))c (λ)dλ , = (4) 0 0 where c denotes the derivative of c and M(c(λ)) is the metric tensor at c(λ). A geodesic curve is then a length-minimizing curve connecting two given points x and y, i.e. (5) cgeo = arg min Length(c) with c(0) = x and c(1) = y . c The distance between x and y is defined as the length of the geodesic. Given a tangent vector v ∈ Tx M, there exists a unique geodesic cv (t) with initial velocity v at x. The Riemannian exponential map, Expx , maps v to a point on the manifold along the geodesic cv at t = 1. This mapping preserves distances such that dist(cv (0), cv (1)) = v . The inverse of the exponential map is the Riemannian logarithmic map denoted Logx . Informally, the exponential and logarithmic maps move points back and forth between the manifold and the tangent space while preserving distances (see Fig. 2d for an illustration). This provides a general strategy for generalizing many Euclidean techniques to Riemannian domains: data points are mapped to the tangent space, where ordinary Euclidean techniques are applied and the results are mapped back to the manifold. 3 A Metric Feature Space With the preliminaries settled we define the new model. Let C = RD denote the feature space. We endow C with a metric tensor in every point x, which we define akin to (3), R M(x) = wr (x)Mr , where wr (x) = r=1 wr (x) ˜ R ˜ j=1 wj (x) , (6) with wr > 0. The only difference from (3) is that we shall not restrict ourselves to binary weight ˜ functions wr . We assume the metric tensors Mr have already been learned; Sec. 4 contain examples ˜ where they have been learned using LMNN [5] and LDA [9]. From the definition of a Riemannian metric, we trivially have the following result: Lemma 1 The space C = RD endowed with the metric tensor from (6) is a chart of a Riemannian manifold, iff the weights wr (x) change smoothly with x. Hence, by only considering smooth weight functions wr we get a well-studied geometric structure ˜ on the feature space, which ensures us that it is metric. To illustrate the implications we return to the example in Fig. 2. We change the weight functions from binary to squared exponentials, which gives the feature space shown in Fig. 2c. As can be seen, the metric tensor now changes smoothly, which also makes the geodesics smooth curves (a property we will use when computing the geodesics). It is worth noting that Ramanan and Baker [3] also consider the idea of smoothly averaging the metric tensor. 
They, however, only evaluate the metric tensor at the test point of their classifier and then assume straight line geodesics with a constant metric tensor. Such assumptions violate the premise of a smoothly changing metric tensor and, again, the distance measure becomes non-metric. Lemma 1 shows that metric learning can be viewed as manifold learning. The main difference between our approach and techniques such as Isomap [12] is that, while Isomap learns an embedding of the data points, we learn the actual manifold structure. This gives us the benefit that we can compute geodesics as well as the exponential and logarithmic maps. These provide us with mappings back and forth between the manifold and Euclidean representation of the data, which preserve distances as well as possible. The availability of such mappings is in stark contrast to e.g. Isomap. In the next section we will derive a system of ordinary differential equations (ODE’s) that geodesics in C have to satisfy, which provides us with algorithms for computing geodesics as well as exponential and logarithmic maps. With these we can generalize many Euclidean techniques. 4 3.1 Computing Geodesics, Maps and Statistics At minima of (4) we know that the Euler-Lagrange equation must hold [11], i.e. ∂L d ∂L , where L(λ, c, c ) = c (λ)T M(c(λ))c (λ) . = ∂c dλ ∂c As we have an explicit expression for the metric tensor we can compute (7) in closed form: (7) Theorem 2 Geodesic curves in C satisfy the following system of 2nd order ODE’s M(c(λ))c (λ) = − 1 ∂vec [M(c(λ))] 2 ∂c(λ) T (c (λ) ⊗ c (λ)) , (8) where ⊗ denotes the Kronecker product and vec [·] stacks the columns of a matrix into a vector [13]. Proof See supplementary material. This result holds for any smooth weight functions wr . We, however, still need to compute ∂vec[M] , ˜ ∂c which depends on the specific choice of wr . Any smooth weighting scheme is applicable, but we ˜ restrict ourselves to the obvious smooth generalization of (3) and use squared exponentials. From this assumption, we get the following result Theorem 3 For wr (x) = exp − ρ x − xr ˜ 2 ∂vec [M(c)] = ∂c the derivative of the metric tensor from (6) is R ρ R j=1 2 Mr R 2 wj ˜ T r=1 T wj (c − xj ) Mj − (c − xr ) Mr ˜ wr vec [Mr ] ˜ . (9) j=1 Proof See supplementary material. Computing Geodesics. Any geodesic curve must be a solution to (8). Hence, to compute a geodesic between x and y, we can solve (8) subject to the constraints c(0) = x and c(1) = y . (10) This is a boundary value problem, which has a smooth solution. This allows us to solve the problem numerically using a standard three-stage Lobatto IIIa formula, which provides a fourth-order accurate C 1 –continuous solution [14]. Ramanan and Baker [3] discuss the possibility of computing geodesics, but arrive at the conclusion that this is intractable based on the assumption that it requires discretizing the entire feature space. Our solution avoids discretizing the feature space by discretizing the geodesic curve instead. As this is always one-dimensional the approach remains tractable in high-dimensional feature spaces. Computing Logarithmic Maps. Once a geodesic c is found, it follows from the definition of the logarithmic map, Logx (y), that it can be computed as v = Logx (y) = c (0) Length(c) . c (0) (11) In practice, we solve (8) by rewriting it as a system of first order ODE’s, such that we compute both c and c simultaneously (see supplementary material for details). Computing Exponential Maps. 
Computing Exponential Maps. Given a starting point x on the manifold and a vector v in the tangent space, the exponential map, Exp_x(v), finds the unique geodesic starting at x with initial velocity v. As the geodesic must fulfill (8), we can compute the exponential map by solving this system of ODEs with the initial conditions

c(0) = x \quad \text{and} \quad c'(0) = v .   (12)

This initial value problem has a unique solution, which we find numerically using a standard Runge-Kutta scheme [15].

3.1.1 Generalizing PCA and Regression

At this stage, we know that the feature space is Riemannian and we know how to compute geodesics and exponential and logarithmic maps. We now seek to generalize PCA and linear regression, which becomes straightforward since solutions are available in Riemannian spaces [16, 17]. These generalizations can be summarized as mapping the data to the tangent space at the mean, performing standard Euclidean analysis in the tangent space and mapping the results back. The first step is to compute the mean value on the manifold, which is defined as the point that minimizes the sum-of-squares distances to the data points. Pennec [18] provides an efficient gradient descent approach for computing this point, which we also summarize in the supplementary material. The empirical covariance of a set of points is defined as the ordinary Euclidean covariance in the tangent space at the mean value [18]. With this in mind, it is not surprising that the principal components of a dataset have been generalized as the geodesics starting at the mean with initial velocity corresponding to the eigenvectors of the covariance [16],

\gamma_{v_d}(t) = \mathrm{Exp}_{\mu}(t v_d) ,   (13)

where v_d denotes the d-th eigenvector of the covariance. This approach is called Principal Geodesic Analysis (PGA), and the geodesic curve \gamma_{v_d} is called the principal geodesic. An illustration of the approach can be seen in Fig. 1 and more algorithmic details are in the supplementary material. Linear regression has been generalized in a similar way [17] by performing regression in the tangent space of the mean and mapping the resulting line back to the manifold using the exponential map. The idea of working in the tangent space is both efficient and convenient, but comes with an element of approximation, as the logarithmic map is only guaranteed to preserve distances to the origin of the tangent space and not between all pairs of data points. Practical experience, however, indicates that this is a good tradeoff; see [19] for a more in-depth discussion of when the approximation is suitable.
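Continuing the sketch (again an illustration under the assumptions above, not the paper's implementation), the exponential map is the initial value problem (12), and PGA then amounts to an intrinsic mean followed by ordinary PCA in the tangent space; `geodesic_ode` and `log_map` are the hypothetical routines from the previous sketch, and the step size and iteration counts are arbitrary choices.

```python
import numpy as np
from scipy.integrate import solve_ivp

def exp_map(x, v, metric):
    """Exp_x(v): integrate (8) forward from the initial conditions (12)."""
    D = x.size
    f = lambda lam, s: geodesic_ode(lam, s[:, None], metric)[:, 0]
    sol = solve_ivp(f, (0.0, 1.0), np.concatenate([x, v]), method='RK45')
    return sol.y[:D, -1]                                       # geodesic end point c(1)

def intrinsic_mean(X, metric, n_iter=20, step=0.5):
    """Gradient-descent estimate of the point minimizing the sum of squared
    geodesic distances to the rows of X (cf. Pennec [18])."""
    mu = X.mean(axis=0)                                        # Euclidean mean as a start
    for _ in range(n_iter):
        tangents = np.array([log_map(mu, x, metric) for x in X])
        mu = exp_map(mu, step * tangents.mean(axis=0), metric)
    return mu

def pga(X, metric):
    """Tangent-space PCA at the intrinsic mean; the eigenvectors v_d define
    the principal geodesics gamma_{v_d}(t) = Exp_mu(t v_d) of (13)."""
    mu = intrinsic_mean(X, metric)
    V = np.array([log_map(mu, x, metric) for x in X])          # data in the tangent space
    evals, evecs = np.linalg.eigh(np.cov(V, rowvar=False))
    order = np.argsort(evals)[::-1]
    return mu, evals[order], evecs[:, order]
```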
4 Experiments

To illustrate the framework we consider an example in human body analysis, and then we analyze the scalability of the approach. (Our software implementation for computing geodesics and performing manifold statistics is available at http://ps.is.tue.mpg.de/project/Smooth Metric Learning.) But first, to build intuition, Fig. 3a shows synthetically generated data samples from two classes. We sample random points x_r and learn a local LDA metric [9] by considering all data points within a radius; this locally pushes the two classes apart. We combine the local metrics using (6) and Fig. 3b shows the data in the tangent space of the resulting manifold. As can be seen, the two classes are now globally further apart, which shows the effect of local metrics.

4.1 Human Body Shape

We consider a regression example concerning human body shape analysis. We study 986 female body laser scans from the CAESAR [20] data set; each shape is represented using the leading 35 principal components of the data, learned using a SCAPE-like model [21, 22]. Each shape is associated with anthropometric measurements such as body height, shoe size, etc. We show results for shoulder-to-wrist distance and shoulder breadth; results for more measurements are in the supplementary material. To predict the measurements from shape coefficients, we learn local metrics and perform linear regression according to these. As a further experiment, we use PGA to reduce the dimensionality of the shape coefficients according to the local metrics, and measure the quality of the reduction by performing linear regression to predict the measurements. As a baseline we use the corresponding Euclidean techniques.

To learn the local metrics we do the following. First we whiten the data such that the variance captured by PGA will only be due to the change of metric; this allows easy visualization of the impact of the learned metrics. We then cluster the body shapes into equal-sized clusters according to the measurement and learn an LMNN metric for each cluster [5], which we associate with the mean of each cluster. These push the clusters apart, which introduces variance along the directions where the measurement changes. From this we construct a Riemannian manifold according to (6), compute the mean value on the manifold, map the data to the tangent space at the mean and perform linear regression in the tangent space.

Figure 3: Left panels: synthetic data. (a) Samples from two classes along with illustratively sampled metric tensors from (6). (b) The data represented in the tangent space of a manifold constructed from local LDA metrics learned at random positions. Right panels: real data. (c) Average error of linearly predicted body measurements (mm). (d) Running time (sec.) of the geodesic computation as a function of dimensionality.

Figure 4: Left: body shape data in the first two principal components according to the Euclidean metric. Point color indicates cluster membership. Center: as on the left, but according to the Riemannian model. Right: regression error as a function of the dimensionality of the shape space; again the Euclidean metric and the Riemannian metric are compared.

As a first visualization we plot the data expressed in the leading two dimensions of PGA in Fig. 4; as can be seen, the learned metrics provide principal geodesics which are more strongly related to the measurements than the Euclidean model. In order to predict the measurements from the body shape, we perform linear regression, both directly in the shape space according to the Euclidean metric and in the tangent space of the manifold corresponding to the learned metrics (using the logarithmic map from (11)). We measure the prediction error using leave-one-out cross-validation.
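The tangent-space regression used here can be sketched in the same spirit (an illustration only, reusing the hypothetical `pga` and `log_map` routines from the sketches above): each shape is mapped to the tangent space at the intrinsic mean and the measurement is regressed linearly on those coordinates.

```python
import numpy as np

def tangent_regression(X, y, metric):
    """Fit linear regression in the tangent space at the intrinsic mean.
    X: (N, D) shape coefficients; y: (N,) measurements, e.g. shoulder breadth."""
    mu, _, _ = pga(X, metric)                                  # intrinsic mean (PGA sketch)
    V = np.array([log_map(mu, x, metric) for x in X])          # tangent-space coordinates
    A = np.hstack([V, np.ones((len(X), 1))])                   # design matrix with intercept
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return mu, w

def predict_measurement(x_new, mu, w, metric):
    """Predict the measurement of a new shape from its tangent coordinates."""
    v = log_map(mu, x_new, metric)
    return np.append(v, 1.0) @ w
```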
To further illustrate the power of the PGA model, we repeat this experiment for different dimensionalities of the data. The results are plotted in Fig. 4, showing that regression according to the learned metrics outperforms the Euclidean model. To verify that the learned metrics improve accuracy, we average the prediction errors over all millimeter measurements. The result in Fig. 3c shows that much can be gained in lower dimensions by using the local metrics. To provide visual insights into the behavior of the learned metrics, we uniformly sample body shapes along the first principal geodesic (in the range ±7 times the standard deviation) according to the different metrics. The results are available as a movie in the supplementary material, but are also shown in Fig. 5. As can be seen, the learned metrics pick up intuitive relationships between body shape and the measurements, e.g. shoulder-to-wrist distance is related to overall body size, while shoulder breadth is related to body weight.

Figure 5: Shapes corresponding to the mean (center) and ±7 times the standard deviations along the principal geodesics (left and right), for the shoulder-to-wrist distance and shoulder breadth measurements. Movies are available in the supplementary material.

4.2 Scalability

The human body data set is small enough (986 samples in 35 dimensions) that computing a geodesic only takes a few seconds. To show that the current unoptimized Matlab implementation can handle somewhat larger datasets, we briefly consider a dimensionality reduction task on the classic MNIST handwritten digit data set. We use the preprocessed data available with [3], where the original 28×28 gray scale images were deskewed and projected onto their leading 164 Euclidean principal components (which capture 95% of the variance in the original data). We learn one diagonal LMNN metric per class, which we associate with the mean of the class. From this we construct a Riemannian manifold from (6), compute the mean value on the manifold and compute geodesics between the mean and each data point; this is the computationally expensive part of performing PGA. Fig. 3d plots the average running time (sec.) for the computation of geodesics as a function of the dimensionality of the training data. A geodesic can be computed in 100 dimensions in approximately 5 sec., whereas in 150 dimensions it takes about 30 sec. In this experiment, we train a PGA model on 60,000 data points, and test a nearest neighbor classifier in the tangent space as we decrease the dimensionality of the model. Compared to a Euclidean model, this gives a modest improvement in classification accuracy of 2.3 percent, when averaged across different dimensionalities. Plots of the results can be found in the supplementary material.

5 Discussion

This work shows that multi-metric learning techniques are indeed applicable outside the realm of kNN classifiers. The idea of defining the metric tensor at any given point as the weighted average of a finite set of learned metrics is quite natural from a modeling point of view, which is also validated by the Riemannian structure of the resulting space. This opens both a theoretical and a practical toolbox for analyzing and developing algorithms that use local metric tensors. Specifically, we show how to use local metric tensors for both regression and dimensionality reduction tasks. Others have attempted to solve non-classification problems using local metrics, but we feel that our approach is the first to have a solid theoretical backing.
For example, Hastie and Tibshirani [9] use local LDA metrics for dimensionality reduction by averaging the local metrics and using the resulting metric as part of a Euclidean PCA, which essentially is a linear approach. Another approach was suggested by Hong et al. [23], who simply compute the principal components according to each metric separately, such that one low-dimensional model is learned per metric. The suggested approach is, however, not without practical difficulties in its current implementation. Currently, we are using off-the-shelf numerical solvers for computing geodesics, which can be computationally demanding. While we managed to analyze medium-sized datasets, we believe that the run-time can be drastically improved by developing specialized numerical solvers. In the experiments, we learned local metrics using techniques specialized for classification tasks as this is all the current literature provides. We expect improvements by learning the metrics specifically for regression and dimensionality reduction, but doing so is currently an open problem.

Acknowledgments: Søren Hauberg is supported in part by the Villum Foundation, and Oren Freifeld is supported in part by NIH-NINDS EUREKA (R01-NS066311).

References

[1] Andrea Frome, Yoram Singer, and Jitendra Malik. Image retrieval and classification using local distance functions. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19 (NIPS), pages 417–424, Cambridge, MA, 2007. MIT Press.
[2] Andrea Frome, Fei Sha, Yoram Singer, and Jitendra Malik. Learning globally-consistent local distance functions for shape-based image retrieval and classification. In International Conference on Computer Vision (ICCV), pages 1–8, 2007.
[3] Deva Ramanan and Simon Baker. Local distance functions: A taxonomy, new algorithms, and an evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(4):794–806, 2011.
[4] Shai Shalev-Shwartz, Yoram Singer, and Andrew Y. Ng. Online and batch learning of pseudo-metrics. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML '04), pages 94–101. ACM, 2004.
[5] Kilian Q. Weinberger and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research, 10:207–244, 2009.
[6] Tomasz Malisiewicz and Alexei A. Efros. Recognition by association via learning per-exemplar distances. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008.
[7] Yiming Ying and Peng Li. Distance metric learning with eigenvalue optimization. The Journal of Machine Learning Research, 13:1–26, 2012.
[8] Matthew Schultz and Thorsten Joachims. Learning a distance metric from relative comparisons. In Advances in Neural Information Processing Systems 16 (NIPS), 2004.
[9] Trevor Hastie and Robert Tibshirani. Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6):607–616, June 1996.
[10] Elzbieta Pekalska, Pavel Paclik, and Robert P. W. Duin. A generalized kernel approach to dissimilarity-based classification. Journal of Machine Learning Research, 2:175–211, 2002.
[11] Manfredo Perdigao do Carmo. Riemannian Geometry. Birkhäuser Boston, January 1992.
[12] Joshua B. Tenenbaum, Vin De Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
[13] Jan R. Magnus and Heinz Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics. John Wiley & Sons, 2007.
[14] Jacek Kierzenka and Lawrence F. Shampine. A BVP solver based on residual control and the Matlab PSE. ACM Transactions on Mathematical Software, 27(3):299–316, 2001.
[15] John R. Dormand and P. J. Prince. A family of embedded Runge-Kutta formulae. Journal of Computational and Applied Mathematics, 6:19–26, 1980.
[16] P. Thomas Fletcher, Conglin Lu, Stephen M. Pizer, and Sarang Joshi. Principal geodesic analysis for the study of nonlinear statistics of shape. IEEE Transactions on Medical Imaging, 23(8):995–1005, 2004.
[17] Peter E. Jupp and John T. Kent. Fitting smooth paths to spherical data. Applied Statistics, 36(1):34–46, 1987.
[18] Xavier Pennec. Probabilities and statistics on Riemannian manifolds: Basic tools for geometric measurements. In Proceedings of Nonlinear Signal and Image Processing, pages 194–198, 1999.
[19] Stefan Sommer, François Lauze, Søren Hauberg, and Mads Nielsen. Manifold valued statistics, exact principal geodesic analysis and the effect of linear approximations. In European Conference on Computer Vision (ECCV), pages 43–56, 2010.
[20] Kathleen M. Robinette, Hein Daanen, and Eric Paquet. The CAESAR project: a 3-D surface anthropometry survey. In 3-D Digital Imaging and Modeling, pages 380–386, 1999.
[21] Dragomir Anguelov, Praveen Srinivasan, Daphne Koller, Sebastian Thrun, Jim Rodgers, and James Davis. SCAPE: shape completion and animation of people. ACM Transactions on Graphics, 24(3):408–416, 2005.
[22] Oren Freifeld and Michael J. Black. Lie bodies: A manifold representation of 3D human shape. In A. Fitzgibbon et al., editors, European Conference on Computer Vision (ECCV), Part I, LNCS 7572, pages 1–14. Springer-Verlag, October 2012.
[23] Yi Hong, Quannan Li, Jiayan Jiang, and Zhuowen Tu. Learning a mixture of sparse distance metrics for classification and dimensionality reduction. In International Conference on Computer Vision (ICCV), pages 906–913, 2011.

5 0.87153226 258 nips-2012-Online L1-Dictionary Learning with Application to Novel Document Detection

Author: Shiva P. Kasiviswanathan, Huahua Wang, Arindam Banerjee, Prem Melville

Abstract: Given their pervasive use, social media, such as Twitter, have become a leading source of breaking news. A key task in the automated identification of such news is the detection of novel documents from a voluminous stream of text documents in a scalable manner. Motivated by this challenge, we introduce the problem of online 1 -dictionary learning where unlike traditional dictionary learning, which uses squared loss, the 1 -penalty is used for measuring the reconstruction error. We present an efficient online algorithm for this problem based on alternating directions method of multipliers, and establish a sublinear regret bound for this algorithm. Empirical results on news-stream and Twitter data, shows that this online 1 -dictionary learning algorithm for novel document detection gives more than an order of magnitude speedup over the previously known batch algorithm, without any significant loss in quality of results. 1

6 0.83927244 101 nips-2012-Discriminatively Trained Sparse Code Gradients for Contour Detection

7 0.83617771 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs

8 0.83393693 114 nips-2012-Efficient coding provides a direct link between prior and likelihood in perceptual Bayesian inference

9 0.83388221 86 nips-2012-Convex Multi-view Subspace Learning

10 0.8318758 187 nips-2012-Learning curves for multi-task Gaussian process regression

11 0.83102709 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization

12 0.82991105 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification

13 0.82954395 246 nips-2012-Nonparametric Max-Margin Matrix Factorization for Collaborative Prediction

14 0.82915151 143 nips-2012-Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins

15 0.82843369 17 nips-2012-A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter Bound

16 0.82759434 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

17 0.82687759 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity

18 0.8267476 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

19 0.82629579 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

20 0.82558131 227 nips-2012-Multiclass Learning with Simplex Coding