nips nips2012 nips2012-189 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Dengyong Zhou, Sumit Basu, Yi Mao, John C. Platt
Abstract: An important way to make large training sets is to gather noisy labels from crowds of nonexperts. We propose a minimax entropy principle to improve the quality of these labels. Our method assumes that labels are generated by a probability distribution over workers, items, and labels. By maximizing the entropy of this distribution, the method naturally infers item confusability and worker expertise. We infer the ground truth by minimizing the entropy of this distribution, which we show minimizes the Kullback-Leibler (KL) divergence between the probability distribution and the unknown truth. We show that a simple coordinate descent scheme can optimize minimax entropy. Empirically, our results are substantially better than previously published methods for the same problem. 1
Reference: text
sentIndex sentText sentNum sentScore
1 We propose a minimax entropy principle to improve the quality of these labels. [sent-4, score-0.224]
2 By maximizing the entropy of this distribution, the method naturally infers item confusability and worker expertise. [sent-6, score-0.616]
3 We infer the ground truth by minimizing the entropy of this distribution, which we show minimizes the Kullback-Leibler (KL) divergence between the probability distribution and the unknown truth. [sent-7, score-0.154]
4 A fundamental challenge in crowdsourcing is inferring ground truth from noisy labels by a crowd of nonexperts. [sent-15, score-0.184]
5 When each item is labeled several times by different workers, a straightforward approach is to use the most common label as the true label. [sent-16, score-0.222]
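For concreteness, a minimal sketch of this majority-vote baseline (assuming labels are collected as a mapping from items to the lists of labels they received; ties are broken arbitrarily):

```python
from collections import Counter

def majority_vote(labels_per_item):
    """labels_per_item: dict mapping item id -> list of labels from different workers.
    Returns the most common label per item (ties broken arbitrarily)."""
    return {item: Counter(labels).most_common(1)[0][0]
            for item, labels in labels_per_item.items()}

# Item 'a' is resolved to label 1, item 'b' to label 0.
print(majority_vote({'a': [1, 1, 2], 'b': [0, 0, 1]}))
```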
6 When many items are simultaneously labeled, it is reasonable to assume that the performance of a worker is consistent across different items. [sent-19, score-0.311]
7 This assumption underlies the work of Dawid and Skene [5, 18, 19, 11, 17], where each worker is associated with a probabilistic confusion matrix that generates her labels. [sent-20, score-0.322]
8 Given the observed responses, the true labels for each item and the confusion matrices for each worker can be jointly estimated by a maximum likelihood method. [sent-22, score-0.576]
9 We propose a novel minimax entropy principle to jointly estimate the distributions and the ground truth given the observed labels by workers in Section 2. [sent-27, score-0.423]
10 To prevent overfitting, we relax the minimax entropy optimization in Section 3. [sent-30, score-0.179]
11 We describe an easy-to-implement technique to carry out the minimax program in Section 4 and link minimax entropy to a principle of measurement objectivity in Section 5. [sent-31, score-0.498]
12 Each row corresponds to a crowdsourced worker indexed by i (from 1 to m). [sent-91, score-0.284]
13 Each column corresponds to an item to be labeled, indexed by j (from 1 to n). [sent-92, score-0.189]
14 Each item has an unobserved label represented as a vector yjl , which is 1 when item j is in class l (from 1 to c), and 0 otherwise. [sent-93, score-0.984]
15 More generally, we can treat yjl as the probability that item j is in class l. [sent-94, score-0.793]
16 The label matrix can also be represented as a tensor zijk , which is 1 when worker i labels item j as class k , and 0 otherwise. [sent-96, score-0.711]
17 We assume that zij are drawn from πij , which is the distribution for worker i to generate a label for item j. [sent-97, score-0.497]
18 Again, πij can also be represented as a tensor πijk , which is the probability that worker i labels item j as class k. [sent-98, score-0.544]
19 Our method will estimate yjl from the observed zij . [sent-99, score-0.624]
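To make the tensor notation concrete, the following sketch (function name and toy data are illustrative assumptions) builds the observation tensor z from a list of (worker, item, label) triples:

```python
import numpy as np

def build_label_tensor(triples, m, n, c):
    """triples: iterable of (worker i, item j, class k), all 0-based.
    Returns z with z[i, j, k] = 1 if worker i labeled item j as class k, else 0."""
    z = np.zeros((m, n, c), dtype=int)
    for i, j, k in triples:
        z[i, j, k] = 1
    return z

# Three workers, two items, two classes; worker 2 did not label item 1.
z = build_label_tensor([(0, 0, 1), (1, 0, 1), (2, 0, 0), (0, 1, 0), (1, 1, 0)], m=3, n=2, c=2)
```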
20 We specify the form of πij through the maximum entropy principle, where the constraints on the maximum entropy combine the best ideas from previous work. [sent-100, score-0.25]
21 Majority voting suggests that we should constrain the $\pi_{ij}$ per column: the empirical number of votes per class per item, $\sum_i z_{ijk}$, should match $\sum_i \pi_{ijk}$. [sent-101, score-0.38]
22 Dawid and Skene’s method suggests that we should constrain the $\pi_{ij}$ per row: the empirical confusion matrix per worker, $\sum_j y_{jl} z_{ijk}$, should match $\sum_j y_{jl}\pi_{ijk}$. [sent-102, score-1.691]
23 We thus have the following maximum entropy model for $\pi_{ij}$ given $y_{jl}$:
$$\max_{\pi}\; -\sum_{i=1}^{m}\sum_{j=1}^{n}\sum_{k=1}^{c} \pi_{ijk}\ln\pi_{ijk}$$
$$\text{s.t.}\quad \sum_{i=1}^{m}\pi_{ijk} = \sum_{i=1}^{m} z_{ijk},\ \forall j,k, \qquad \sum_{j=1}^{n} y_{jl}\,\pi_{ijk} = \sum_{j=1}^{n} y_{jl}\, z_{ijk},\ \forall i,k,l, \qquad (1a)$$
$$\sum_{k=1}^{c}\pi_{ijk} = 1,\ \forall i,j, \qquad \pi_{ijk} \ge 0,\ \forall i,j,k. \qquad (1b)$$
[sent-103, score-0.938] [sent-105, score-1.359]
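The quantities matched in (1a) are simple sums of the observed tensor; a minimal NumPy sketch (names are illustrative) of the per-item vote counts and per-worker confusion counts:

```python
import numpy as np

def empirical_marginals(z, y):
    """z: (m, n, c) observed label tensor; y: (n, c) item label matrix (rows sum to 1).
    Returns the per-item vote counts sum_i z[i, j, k] and the per-worker confusion
    counts sum_j y[j, l] * z[i, j, k] that the maximum entropy model must match."""
    column = z.sum(axis=0)                      # (n, c): votes per class per item
    row = np.einsum('jl,ijk->ikl', y, z)        # (m, c, c): worker i's empirical confusion, class l -> k
    return column, row
```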
25 We propose that, to infer $y_{jl}$, we should choose $y_{jl}$ to minimize the entropy in Equation (1). [sent-106, score-1.308]
26 Intuitively, making πij “peaky” means that zij is the least random given yjl . [sent-107, score-0.624]
27 Thus, the inference for $y_{jl}$ can be expressed by a minimax entropy program:
$$\min_{y}\max_{\pi}\; -\sum_{i=1}^{m}\sum_{j=1}^{n}\sum_{k=1}^{c} \pi_{ijk}\ln\pi_{ijk}$$
$$\text{s.t.}\quad \sum_{i=1}^{m}\pi_{ijk} = \sum_{i=1}^{m} z_{ijk},\ \forall j,k, \qquad \sum_{j=1}^{n} y_{jl}\,\pi_{ijk} = \sum_{j=1}^{n} y_{jl}\, z_{ijk},\ \forall i,k,l, \qquad (2a)$$
$$\sum_{k=1}^{c}\pi_{ijk} = 1,\ \forall i,j, \quad \pi_{ijk}\ge 0,\ \forall i,j,k, \qquad \sum_{l=1}^{c} y_{jl} = 1,\ \forall j, \quad y_{jl}\ge 0,\ \forall j,l. \qquad (2b)$$
[sent-110, score-0.997] [sent-112, score-2.563]
29 2.1 Justification for Minimum Entropy: Now we justify the principle of choosing $y_{jl}$ by minimizing entropy. [sent-114, score-0.66]
30 Think of yjl as a set of parameters to the worker-item label models πij . [sent-115, score-0.614]
31 The goal in choosing the $y_{jl}$ is to select $\pi_{ij}$ that are as close as possible to the true distributions $\pi^*_{ij}$. [sent-116, score-0.613]
32 To find a principle to choose the $y_{jl}$, assume that we have access to the row and column measurements on the true distributions $\pi^*_{ij}$. [sent-117, score-0.668]
33 That is, assume that we know the true values of the column measurements $\phi_{jk} = \sum_i \pi^*_{ijk}$ and row measurements $\varphi_{ikl} = \sum_j y_{jl}\,\pi^*_{ijk}$, for a chosen set of $y_{jl}$ values. [sent-118, score-1.246]
34 Knowing these true row and column measurements, we can apply the maximum entropy principle to generate distributions $\pi_{ij}$:
$$\max_{\pi}\; -\sum_{i=1}^{m}\sum_{j=1}^{n}\sum_{k=1}^{c} \pi_{ijk}\ln\pi_{ijk} \qquad (3)$$
$$\text{s.t.}\quad \sum_{i=1}^{m}\pi_{ijk} = \phi_{jk},\ \forall j,k, \qquad \sum_{j=1}^{n} y_{jl}\,\pi_{ijk} = \varphi_{ikl},\ \forall i,k,l.$$
[sent-119, score-0.247]
35 We can choose $y_{jl}$ to minimize a loss of $\pi_{ij}$ with respect to $\pi^*_{ij}$ given by
$$\ell(\pi^*, \pi) = \sum_{i=1}^{m}\sum_{j=1}^{n} D_{\mathrm{KL}}\!\left(\pi^*_{ij}\,\|\,\pi_{ij}\right). \qquad (4)$$
[sent-123, score-0.602]
36 The minimum loss can be attained by choosing $y_{jl}$ to minimize the entropy of the maximum entropy distributions $\pi_{ij}$. [sent-124, score-0.733]
37 Thus,
$$\frac{\partial L}{\partial \pi_{ijk}} = -\ln\pi_{ijk} - 1 + \lambda_{ij} + \sum_{l=1}^{c} y_{jl}\left(\tau_{jk} + \sigma_{ikl}\right) = 0,\ \forall i,j,k,$$
which can be rearranged as
$$\pi_{ijk} = \exp\!\left(\sum_{l=1}^{c} y_{jl}\left(\tau_{jk} + \sigma_{ikl}\right) + \lambda_{ij} - 1\right),\ \forall i,j,k. \qquad (5)$$
[sent-127, score-1.265]
38 To be a probability measure, the variables $\pi_{ijk}$ have to satisfy
$$\sum_{k=1}^{c}\pi_{ijk} = \sum_{k=1}^{c}\exp\!\left(\sum_{l=1}^{c} y_{jl}\left(\tau_{jk} + \sigma_{ikl}\right) + \lambda_{ij} - 1\right) = 1,\ \forall i,j. \qquad (6)$$
[sent-128, score-0.626]
39 Eliminating $\lambda_{ij}$ by jointly considering Equations (5) and (6), we obtain a labeling model in the exponential family:
$$\pi_{ijk} = \frac{\exp\!\left(\sum_{l=1}^{c} y_{jl}\left(\tau_{jk} + \sigma_{ikl}\right)\right)}{\sum_{s=1}^{c}\exp\!\left(\sum_{l=1}^{c} y_{jl}\left(\tau_{js} + \sigma_{isl}\right)\right)},\ \forall i,j,k. \qquad (7)$$
[sent-129, score-0.697]
40 Plugging Equation (7) into (4) and performing some algebraic manipulations, we prove Theorem 2.1. [sent-130, score-0.64]
41 The second term is the only term that depends on $y_{jl}$. [sent-133, score-0.602]
42 Therefore, we should choose yjl to minimize the entropy of the maximum entropy distributions. [sent-134, score-0.826]
43 For each worker i, the multiplier set {σikl } is a measure of her expertise, while for each item j, the multiplier set {τjk } is a measure of its confusability. [sent-136, score-0.483]
44 A worker correctly labels an item either because she has good expertise or because the item is not that confusing. [sent-137, score-0.761]
45 When the item or worker parameters are shifted by an arbitrary constant, the probability given by Equation (7) does not change. [sent-138, score-0.463]
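The labeling model in Equation (7), specialized to deterministic labels, is just a softmax over classes; the sketch below (variable names and shapes are assumptions) also illustrates the shift invariance noted above:

```python
import numpy as np

def label_probs(tau_j, sigma_i, true_class):
    """Equation (7) for a deterministic label: tau_j is the (c,) confusability vector of
    item j, sigma_i the (c, c) expertise matrix of worker i (indexed [k, l]), and
    true_class the class l with y_jl = 1. Returns P(worker i labels item j as class k)."""
    scores = tau_j + sigma_i[:, true_class]
    scores = scores - scores.max()              # for numerical stability
    p = np.exp(scores)
    return p / p.sum()

rng = np.random.default_rng(0)
tau_j, sigma_i = rng.normal(size=3), rng.normal(size=(3, 3))
p1 = label_probs(tau_j, sigma_i, true_class=1)
p2 = label_probs(tau_j + 5.0, sigma_i, true_class=1)    # shift the item parameters by a constant
assert np.allclose(p1, p2)                              # the probabilities do not change
```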
46 3 Constraint Relaxation In real crowdsourcing applications, each item is usually labeled only a few times. [sent-140, score-0.284]
47 Moreover, a worker usually only labels a small subset of items rather than all of them. [sent-141, score-0.37]
48 As in the literature of regularized maximum entropy [14, 1, 9], we relax the optimization problem to prevent overfitting:
$$\min_{y}\max_{\pi,\xi,\zeta}\; -\sum_{i=1}^{m}\sum_{j=1}^{n}\sum_{k=1}^{c}\pi_{ijk}\ln\pi_{ijk} \;-\; \sum_{j=1}^{n}\sum_{k=1}^{c}\frac{\xi_{jk}^{2}}{2\alpha_{j}} \;-\; \sum_{i=1}^{m}\sum_{k=1}^{c}\sum_{l=1}^{c}\frac{\zeta_{ikl}^{2}}{2\beta_{i}}$$
$$\text{s.t.}\quad \sum_{i=1}^{m}\left(\pi_{ijk} - z_{ijk}\right) = \xi_{jk},\ \forall j,k, \qquad \sum_{j=1}^{n} y_{jl}\left(\pi_{ijk} - z_{ijk}\right) = \zeta_{ikl},\ \forall i,k,l, \qquad (8a)$$
[sent-143, score-1.112]
49 $$\sum_{l=1}^{c} y_{jl} = 1,\ \forall j, \quad y_{jl}\ge 0,\ \forall j,l, \qquad \sum_{k=1}^{c}\pi_{ijk} = 1,\ \forall i,j, \quad \pi_{ijk}\ge 0,\ \forall i,j,k, \qquad (8b)$$
where $\alpha_{j}$ and $\beta_{i}$ are regularization parameters. [sent-145, score-1.214]
50 Theorem 2.1 can be extended to the regularized minimax entropy formulation (8) with minor modifications. [sent-151, score-0.198]
51 Instead of knowing the exact marginals, we need to choose $\pi_{ij}$ based on noisy marginals:
$$\phi_{jk} = \sum_{i=1}^{m}\pi^{*}_{ijk} + \xi^{*}_{jk},\ \forall j,k, \qquad \varphi_{ikl} = \sum_{j=1}^{n} y_{jl}\,\pi^{*}_{ijk} + \zeta^{*}_{ikl},\ \forall i,k,l.$$
[sent-152, score-0.612]
52 We thus maximize the regularized entropy subject to the relaxed constraints:
$$\sum_{i=1}^{m}\pi_{ijk} + \xi_{jk} = \phi_{jk},\ \forall j,k, \qquad \sum_{j=1}^{n} y_{jl}\,\pi_{ijk} + \zeta_{ikl} = \varphi_{ikl},\ \forall i,k,l.$$
[sent-153, score-0.725]
53 $$\xi^{*}_{jk}\,\xi_{jk} \le \left(\xi^{*2}_{jk} + \xi^{2}_{jk}\right)/2,\ \forall j,k, \qquad \zeta^{*}_{ikl}\,\zeta_{ikl} \le \left(\zeta^{*2}_{ikl} + \zeta^{2}_{ikl}\right)/2,\ \forall i,k,l. \qquad (11)$$
Denote by $\Omega(\pi, \xi, \zeta)$ the objective function of the regularized minimax entropy program (8). [sent-167, score-0.822]
54 So minimizing the regularized maximum entropy leads to minimizing an upper bound of the loss. [sent-169, score-0.165]
55 The Lagrangian of program (8) can be written as
$$L = -\sum_{i=1}^{m}\sum_{j=1}^{n}\ln\frac{\exp\!\left(\sum_{k=1}^{c} z_{ijk}\sum_{l=1}^{c} y_{jl}\left(\tau_{jk} + \sigma_{ikl}\right)\right)}{\sum_{s=1}^{c}\exp\!\left(\sum_{l=1}^{c} y_{jl}\left(\tau_{js} + \sigma_{isl}\right)\right)} + \sum_{j=1}^{n}\sum_{k=1}^{c}\frac{\alpha_{j}\tau_{jk}^{2}}{2} + \sum_{i=1}^{m}\sum_{k=1}^{c}\sum_{l=1}^{c}\frac{\beta_{i}\sigma_{ikl}^{2}}{2}.$$
[sent-172, score-1.488]
56 The dual problem minimizes $L$ subject to the constraints $\Delta = \{y_{jl} \mid \sum_{l=1}^{c} y_{jl} = 1,\ \forall j;\ y_{jl}\ge 0,\ \forall j,l\}$. [sent-173, score-1.214]
57 When the yjl are restricted to be {0, 1}, that is, deterministic labels, the coordinate descent procedure can be simplified. [sent-176, score-0.623]
58 Let
$$p_{jl} = \prod_{i=1}^{m}\frac{\exp\!\left(\sum_{k=1}^{c} z_{ijk}\left(\tau_{jk} + \sigma_{ikl}\right)\right)}{\sum_{s=1}^{c}\exp\!\left(\tau_{js} + \sigma_{isl}\right)}.$$
For any set of real-valued numbers $\{\mu_{jl} \mid \sum_{l=1}^{c}\mu_{jl} = 1,\ \forall j;\ \mu_{jl} > 0,\ \forall j,l\}$, we have the inequality
[sent-177, score-0.962]
59 $$\sum_{i=1}^{m}\sum_{j=1}^{n}\ln\frac{\exp\!\left(\sum_{k=1}^{c} z_{ijk}\sum_{l=1}^{c} y_{jl}\left(\tau_{jk} + \sigma_{ikl}\right)\right)}{\sum_{s=1}^{c}\exp\!\left(\sum_{l=1}^{c} y_{jl}\left(\tau_{js} + \sigma_{isl}\right)\right)} = \sum_{j=1}^{n}\ln\sum_{l=1}^{c} y_{jl}\, p_{jl} \quad \text{(deterministic labels)}$$
$$= \sum_{j=1}^{n}\ln\sum_{l=1}^{c}\mu_{jl}\,\frac{y_{jl}\, p_{jl}}{\mu_{jl}} \;\ge\; \sum_{j=1}^{n}\sum_{l=1}^{c}\mu_{jl}\ln\left(y_{jl}\, p_{jl}\right) - \sum_{j=1}^{n}\sum_{l=1}^{c}\mu_{jl}\ln\mu_{jl} \quad \text{(Jensen's inequality)}.$$
[sent-178, score-3.043]
60 It can be shown that we must have yjl = µjl at any stationary point of F. [sent-180, score-0.602]
61 We initialize yjl with majority vote in Equation (13). [sent-182, score-0.634]
62 In each iteration step, we first optimize over τjk and σikl in (14a), which can be solved by any convex optimization procedure, and next optimize over yjl using a simple closed form in (14b). [sent-183, score-0.602]
63 The optimization over yjl is the same as applying Bayes’ theorem where the result from the last iteration is considered as a prior. [sent-184, score-0.602]
64 Algorithm 1: Minimax Entropy Learning from Crowds.
Input: $\{z_{ijk}\}\in\{0,1\}^{m\times n\times c}$, $\{\alpha_j\}\in\mathbb{R}^{n}_{+}$, $\{\beta_i\}\in\mathbb{R}^{m}_{+}$.
Initialization: $y^{0}_{jl} = \sum_{i=1}^{m} z_{ijl} \big/ \sum_{i=1}^{m}\sum_{k=1}^{c} z_{ijk},\ \forall j,l$.
For $t = 1, 2, \dots$: solve (14a) for $\tau_{jk}$ and $\sigma_{ikl}$ with $y$ fixed, then update $y_{jl}$ by the closed form (14b).
[sent-186, score-0.757]
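A compact NumPy sketch of Algorithm 1 follows. The inner step (14a) is solved here with plain gradient descent on the regularized dual, the learning rate and iteration counts are arbitrary, and sums are restricted to observed worker-item pairs; all of these are implementation assumptions rather than choices made by the paper, which only requires some convex solver for (14a).

```python
import numpy as np

def minimax_entropy_crowds(z, alpha, beta, outer_iters=25, inner_iters=200, lr=0.05):
    """Coordinate-descent sketch of Algorithm 1.
    z: (m, n, c), z[i, j, k] = 1 if worker i labels item j as class k;
    alpha: (n,) item regularizers; beta: (m,) worker regularizers."""
    m, n, c = z.shape
    observed = z.sum(axis=2)                                    # (m, n) mask of labeled pairs

    # Initialization (13): per-item empirical label frequencies (soft majority vote)
    counts = z.sum(axis=0)
    y = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)

    tau = np.zeros((n, c))                                      # item confusability multipliers
    sigma = np.zeros((m, c, c))                                 # worker expertise multipliers [i, k, l]

    def model_probs(y, tau, sigma):
        # Equation (7): pi[i, j, k] proportional to exp(sum_l y[j, l] (tau[j, k] + sigma[i, k, l]))
        scores = tau[None, :, :] + np.einsum('jl,ikl->ijk', y, sigma)
        scores -= scores.max(axis=2, keepdims=True)
        p = np.exp(scores)
        return p / p.sum(axis=2, keepdims=True)

    for _ in range(outer_iters):
        # Step (14a): fit tau, sigma for fixed y by gradient descent on the regularized dual
        for _ in range(inner_iters):
            pi = model_probs(y, tau, sigma) * observed[:, :, None]
            resid = pi - z                                      # moment-matching residuals (8a)
            grad_tau = resid.sum(axis=0) + alpha[:, None] * tau
            grad_sigma = np.einsum('jl,ijk->ikl', y, resid) + beta[:, None, None] * sigma
            tau -= lr * grad_tau
            sigma -= lr * grad_sigma

        # Step (14b): Bayes-style closed-form update of y, with the previous y as the prior
        pair = tau[None, :, :, None] + sigma[:, None, :, :]     # (m, n, s, l): tau[j, s] + sigma[i, s, l]
        smax = pair.max(axis=2, keepdims=True)
        log_norm = np.log(np.exp(pair - smax).sum(axis=2)) + smax[:, :, 0, :]       # (m, n, l)
        log_lik = (np.einsum('ijk,jk->ij', z, tau)[:, :, None]
                   + np.einsum('ijk,ikl->ijl', z, sigma) - log_norm)                # (m, n, l)
        log_post = np.log(y + 1e-12) + (log_lik * observed[:, :, None]).sum(axis=0)
        log_post -= log_post.max(axis=1, keepdims=True)
        y = np.exp(log_post)
        y /= y.sum(axis=1, keepdims=True)

    return y, tau, sigma
```

The inferred label for item j is then the class with the largest entry in the returned row y[j].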
65 The first statement is about the objectivity of item confusability. [sent-190, score-0.299]
66 The second statement is about the objectivity of worker expertise. [sent-191, score-0.404]
67 For deterministic labels, we show that the labeling model in Equation (7) can be recovered from the measurement objectivity principle. [sent-193, score-0.199]
68 From Equation (7), given item $j$ in class $l$, the probability that worker $i$ labels it as class $k$ is
$$\pi_{ijkl} = \frac{\exp\left(\tau_{jk} + \sigma_{ikl}\right)}{\sum_{s=1}^{c}\exp\left(\tau_{js} + \sigma_{isl}\right)}. \qquad (15)$$
[sent-194, score-0.57]
69 Assume that a worker $i$ has labeled two items $j$ and $j'$, both of which are from the same class $l$. [sent-195, score-0.378]
70 With respect to the given worker $i$, for each item, we measure the confusability for class $k$ by
$$\rho_{ijk} = \frac{\pi_{ijkl}}{\pi_{ijll}}, \qquad \rho_{ij'k} = \frac{\pi_{ij'kl}}{\pi_{ij'll}}. \qquad (16)$$
[sent-196, score-0.377]
71 For comparing the item confusabilities, we compute a ratio between them. [sent-197, score-0.217]
72 To maintain the objectivity of confusability, the ratio should not depend on which worker is involved in the comparison. [sent-198, score-0.432]
73 Hence, given another worker $i'$, we should have
$$\frac{\pi_{ijkl}/\pi_{ijll}}{\pi_{ij'kl}/\pi_{ij'll}} = \frac{\pi_{i'jkl}/\pi_{i'jll}}{\pi_{i'j'kl}/\pi_{i'j'll}}. \qquad (17)$$
[sent-199, score-0.452]
74 It is straightforward to verify that the labeling model in Equation (15) indeed satisfies the objectivity requirement given by Equation (17). [sent-200, score-0.193]
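That verification can also be checked numerically; the small sketch below (parameter values arbitrary) confirms that the double ratio in Equation (17) does not depend on the worker:

```python
import numpy as np

def prob(tau_j, sigma_i, k, l):
    # Equation (15): P(worker with expertise sigma_i labels an item of class l,
    # with confusability tau_j, as class k)
    scores = tau_j + sigma_i[:, l]
    p = np.exp(scores - scores.max())
    return (p / p.sum())[k]

rng = np.random.default_rng(1)
c, k, l = 4, 2, 0
tau_j, tau_j2 = rng.normal(size=c), rng.normal(size=c)              # two items of class l
sig_i, sig_i2 = rng.normal(size=(c, c)), rng.normal(size=(c, c))    # two workers

def double_ratio(sig):
    # (pi_ijkl / pi_ijll) / (pi_ij'kl / pi_ij'll) for a given worker, Equation (17)
    return (prob(tau_j, sig, k, l) / prob(tau_j, sig, l, l)) / \
           (prob(tau_j2, sig, k, l) / prob(tau_j2, sig, l, l))

assert np.isclose(double_ratio(sig_i), double_ratio(sig_i2))        # same for every worker
```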
75 With a referenced worker $0$ and a referenced item $0$, Equation (17) gives
$$\frac{\pi_{ijkl}}{\pi_{ijll}} = \frac{\pi_{i0kl}\,\pi_{0jkl}\,\pi_{00ll}}{\pi_{i0ll}\,\pi_{0jll}\,\pi_{00kl}}. \qquad (18)$$
Assume that the referenced worker $0$ chooses a class uniformly at random for the referenced item $0$. [sent-204, score-0.505]
76 Symmetrically, we can also start from the objectivity of worker expertise to recover the labeling model in (15). [sent-209, score-0.511]
77 Assume that two workers $i$ and $i'$ have labeled a common item $j$ which is from class $l$. [sent-210, score-0.314]
78 With respect to the given item $j$, for each worker, we measure the confusion from class $l$ to $k$ by
$$\rho_{ijk} = \frac{\pi_{ijkl}}{\pi_{ijll}}, \qquad \rho_{i'jk} = \frac{\pi_{i'jkl}}{\pi_{i'jll}}. \qquad (19)$$
[sent-211, score-0.495]
79 For comparing the worker expertises, we compute a ratio between them. [sent-212, score-0.335]
80 To maintain the objectivity of expertise, the ratio should not depend on which item is involved in the comparison. [sent-213, score-0.327]
81 Hence, given another item $j'$ in class $l$, we should have
$$\frac{\pi_{ijkl}/\pi_{ijll}}{\pi_{i'jkl}/\pi_{i'jll}} = \frac{\pi_{ij'kl}/\pi_{ij'll}}{\pi_{i'j'kl}/\pi_{i'j'll}}. \qquad (20)$$
[sent-214, score-0.359]
82 A worker labeled an image at most once, and each image was labeled 10 times. [sent-223, score-0.368]
83 It is difficult to evaluate a worker if she labeled only a few images. [sent-224, score-0.315]
84 We thus only consider the workers who labeled at least 40 images, which yields a label set that contains 7354 labels by 52 workers. [sent-225, score-0.194]
85 The worker who labeled the most labeled 345 images and achieved an accuracy of 68. [sent-231, score-0.37]
86 The average worker confusion matrix between breeds is shown in Table 2. [sent-233, score-0.348]
87 For our method, the regularization parameters are set as αj = 100/(number of labels for item j), βi = 100/(number of labels by worker i). [sent-236, score-0.591]
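In the tensor notation used earlier, these settings are two one-liners (the toy tensor below is only a stand-in so the snippet runs):

```python
import numpy as np

rng = np.random.default_rng(0)
z = (rng.random((5, 8, 3)) < 0.2).astype(int)        # toy (m, n, c) stand-in for the label tensor

alpha = 100.0 / np.maximum(z.sum(axis=(0, 2)), 1)    # alpha_j = 100 / (#labels for item j)
beta = 100.0 / np.maximum(z.sum(axis=(1, 2)), 1)     # beta_i  = 100 / (#labels by worker i)
```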
88 Table 2: Average worker confusion (%) using consensus from 9 experts. [sent-266, score-0.322]
89 Each pair was judged by around 6 workers, and each worker judged a subset of the pairs. [sent-268, score-0.354]
90 The worker who judged the most judged 1225 pairs and achieved an accuracy of 76. [sent-272, score-0.364]
91 For our method, the regularization parameters are set as αj = 200/(number of labels for item j), βi = 200/(number of labels by worker i). [sent-274, score-0.591]
92 The essential difference between our work and Dawid and Skene’s work is that, in addition to worker expertise, we also take item confusability into account. [sent-279, score-0.512]
93 In computer vision, a minimax entropy method was proposed for estimating the probability density of certain visual patterns such as textures [22]. [sent-280, score-0.189]
94 The authors formulate the combined density estimation and feature selection as a minimax entropy problem. [sent-284, score-0.189]
95 The measurement objectivity principle is inspired by the Rasch model [16], used to design and analyze psychological and educational measurements. [sent-285, score-0.187]
96 In the Rasch model, given an examinee and a test item, the probability of a correct response is modeled as a logistic function of the difference between the examinee ability and the item difficulty. [sent-286, score-0.257]
97 The specific objectivity property of the Rasch model comes from the algebraic separation of examinee and item parameters. [sent-288, score-0.352]
98 If the probability of a correct response is modeled with other forms, such as a logistic function of the ratio between the examinee ability and the item difficulty [21], objective measurements cannot be achieved. [sent-289, score-0.246]
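In code, the Rasch response model mentioned above is a one-line logistic function of the ability-difficulty gap (variable names are illustrative):

```python
import math

def rasch_correct_prob(ability, difficulty):
    """Rasch model: P(correct response) = logistic(ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

print(rasch_correct_prob(ability=1.2, difficulty=0.5))   # about 0.67
```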
99 8 Conclusion We have proposed a minimax entropy principle for estimating the true labels from the judgements of a crowd of nonexperts. [sent-291, score-0.297]
100 We have also shown that the labeling model derived from the minimax entropy principle uniquely satisfies an objectivity principle for measuring worker expertise and item confusability. [sent-292, score-0.959]
wordName wordTfidf (topN-words)
[('yjl', 0.602), ('ijk', 0.379), ('ikl', 0.377), ('worker', 0.284), ('jk', 0.227), ('item', 0.179), ('zijk', 0.155), ('objectivity', 0.12), ('entropy', 0.104), ('ij', 0.092), ('workers', 0.092), ('ijkl', 0.087), ('jl', 0.087), ('dawid', 0.079), ('isl', 0.078), ('minimax', 0.075), ('crowdsourcing', 0.074), ('skene', 0.071), ('ln', 0.061), ('expertise', 0.06), ('js', 0.059), ('labels', 0.059), ('ijll', 0.058), ('terrier', 0.051), ('confusability', 0.049), ('norfolk', 0.049), ('norwich', 0.049), ('scottish', 0.049), ('labeling', 0.047), ('principle', 0.045), ('pjl', 0.043), ('irish', 0.04), ('examinee', 0.039), ('jkl', 0.039), ('jll', 0.039), ('confusion', 0.038), ('judged', 0.035), ('dogs', 0.035), ('rasch', 0.034), ('kl', 0.032), ('labeled', 0.031), ('mturk', 0.029), ('wolfhound', 0.029), ('equation', 0.029), ('crowds', 0.028), ('wisdom', 0.028), ('items', 0.027), ('ll', 0.026), ('breeds', 0.026), ('voting', 0.024), ('exp', 0.024), ('zij', 0.022), ('measurement', 0.022), ('majority', 0.021), ('relevance', 0.021), ('program', 0.02), ('deerhound', 0.019), ('judging', 0.019), ('regularized', 0.019), ('truth', 0.019), ('ground', 0.018), ('web', 0.018), ('mechanical', 0.018), ('amazon', 0.017), ('lagrangian', 0.017), ('measurements', 0.016), ('whichever', 0.016), ('maximum', 0.016), ('referenced', 0.015), ('images', 0.014), ('crowd', 0.014), ('symmetrically', 0.014), ('algebraic', 0.014), ('dkl', 0.013), ('minimizing', 0.013), ('justi', 0.013), ('manipulations', 0.013), ('slack', 0.013), ('label', 0.012), ('class', 0.012), ('em', 0.012), ('ratio', 0.012), ('image', 0.011), ('distributions', 0.011), ('vote', 0.011), ('chris', 0.011), ('coordinate', 0.011), ('deterministic', 0.01), ('column', 0.01), ('rating', 0.01), ('redundancy', 0.01), ('accuracy', 0.01), ('regularization', 0.01), ('constraints', 0.01), ('density', 0.01), ('tensor', 0.01), ('multiplier', 0.01), ('constraining', 0.01), ('knowing', 0.01), ('plugging', 0.009)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 189 nips-2012-Learning from the Wisdom of Crowds by Minimax Entropy
Author: Dengyong Zhou, Sumit Basu, Yi Mao, John C. Platt
Abstract: An important way to make large training sets is to gather noisy labels from crowds of nonexperts. We propose a minimax entropy principle to improve the quality of these labels. Our method assumes that labels are generated by a probability distribution over workers, items, and labels. By maximizing the entropy of this distribution, the method naturally infers item confusability and worker expertise. We infer the ground truth by minimizing the entropy of this distribution, which we show minimizes the Kullback-Leibler (KL) divergence between the probability distribution and the unknown truth. We show that a simple coordinate descent scheme can optimize minimax entropy. Empirically, our results are substantially better than previously published methods for the same problem. 1
2 0.1062564 265 nips-2012-Parametric Local Metric Learning for Nearest Neighbor Classification
Author: Jun Wang, Alexandros Kalousis, Adam Woznica
Abstract: We study the problem of learning local metrics for nearest neighbor classification. Most previous works on local metric learning learn a number of local unrelated metrics. While this ”independence” approach delivers an increased flexibility its downside is the considerable risk of overfitting. We present a new parametric local metric learning method in which we learn a smooth metric matrix function over the data manifold. Using an approximation error bound of the metric matrix function we learn local metrics as linear combinations of basis metrics defined on anchor points over different regions of the instance space. We constrain the metric matrix function by imposing on the linear combinations manifold regularization which makes the learned metric matrix function vary smoothly along the geodesics of the data manifold. Our metric learning method has excellent performance both in terms of predictive power and scalability. We experimented with several largescale classification problems, tens of thousands of instances, and compared it with several state of the art metric learning methods, both global and local, as well as to SVM with automatic kernel selection, all of which it outperforms in a significant manner. 1
3 0.086445354 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback
Author: Andriy Mnih, Yee W. Teh
Abstract: User preferences for items can be inferred from either explicit feedback, such as item ratings, or implicit feedback, such as rental histories. Research in collaborative filtering has concentrated on explicit feedback, resulting in the development of accurate and scalable models. However, since explicit feedback is often difficult to collect it is important to develop effective models that take advantage of the more widely available implicit feedback. We introduce a probabilistic approach to collaborative filtering with implicit feedback based on modelling the user’s item selection process. In the interests of scalability, we restrict our attention to treestructured distributions over items and develop a principled and efficient algorithm for learning item trees from data. We also identify a problem with a widely used protocol for evaluating implicit feedback models and propose a way of addressing it using a small quantity of explicit feedback data. 1
4 0.072993644 359 nips-2012-Variational Inference for Crowdsourcing
Author: Qiang Liu, Jian Peng, Alex Ihler
Abstract: Crowdsourcing has become a popular paradigm for labeling large datasets. However, it has given rise to the computational task of aggregating the crowdsourced labels provided by a collection of unreliable annotators. We approach this problem by transforming it into a standard inference problem in graphical models, and applying approximate variational methods, including belief propagation (BP) and mean field (MF). We show that our BP algorithm generalizes both majority voting and a recent algorithm by Karger et al. [1], while our MF method is closely related to a commonly used EM algorithm. In both cases, we find that the performance of the algorithms critically depends on the choice of a prior distribution on the workers’ reliability; by choosing the prior properly, both BP and MF (and EM) perform surprisingly well on both simulated and real-world datasets, competitive with state-of-the-art algorithms based on more complicated modeling assumptions. 1
5 0.064691529 49 nips-2012-Automatic Feature Induction for Stagewise Collaborative Filtering
Author: Joonseok Lee, Mingxuan Sun, Guy Lebanon, Seung-jean Kim
Abstract: Recent approaches to collaborative filtering have concentrated on estimating an algebraic or statistical model, and using the model for predicting missing ratings. In this paper we observe that different models have relative advantages in different regions of the input space. This motivates our approach of using stagewise linear combinations of collaborative filtering algorithms, with non-constant combination coefficients based on kernel smoothing. The resulting stagewise model is computationally scalable and outperforms a wide selection of state-of-the-art collaborative filtering algorithms. 1
6 0.055777732 352 nips-2012-Transelliptical Graphical Models
7 0.055186711 107 nips-2012-Effective Split-Merge Monte Carlo Methods for Nonparametric Models of Sequential Data
8 0.053509071 307 nips-2012-Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning
9 0.049076162 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes
10 0.046837125 111 nips-2012-Efficient Sampling for Bipartite Matching Problems
11 0.046690419 60 nips-2012-Bayesian nonparametric models for ranked data
12 0.04294008 74 nips-2012-Collaborative Gaussian Processes for Preference Learning
13 0.041146897 165 nips-2012-Iterative ranking from pair-wise comparisons
14 0.039507706 57 nips-2012-Bayesian estimation of discrete entropy with mixtures of stick-breaking priors
15 0.039106555 75 nips-2012-Collaborative Ranking With 17 Parameters
16 0.037982795 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing
17 0.036303788 246 nips-2012-Nonparametric Max-Margin Matrix Factorization for Collaborative Prediction
18 0.035777763 103 nips-2012-Distributed Probabilistic Learning for Camera Networks with Missing Data
19 0.035568528 119 nips-2012-Entropy Estimations Using Correlated Symmetric Stable Random Projections
20 0.031074744 351 nips-2012-Transelliptical Component Analysis
topicId topicWeight
[(0, 0.076), (1, 0.023), (2, 0.007), (3, -0.02), (4, -0.014), (5, -0.033), (6, 0.009), (7, 0.058), (8, 0.043), (9, 0.057), (10, -0.047), (11, 0.016), (12, -0.033), (13, 0.004), (14, 0.023), (15, 0.013), (16, 0.03), (17, 0.008), (18, 0.02), (19, 0.042), (20, 0.041), (21, -0.086), (22, 0.039), (23, 0.007), (24, -0.012), (25, -0.028), (26, -0.008), (27, -0.007), (28, 0.104), (29, 0.031), (30, -0.064), (31, 0.003), (32, -0.001), (33, -0.068), (34, -0.069), (35, 0.005), (36, 0.046), (37, 0.051), (38, -0.105), (39, 0.055), (40, -0.023), (41, 0.058), (42, -0.053), (43, -0.066), (44, 0.023), (45, -0.035), (46, 0.079), (47, 0.051), (48, 0.01), (49, 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.91881186 189 nips-2012-Learning from the Wisdom of Crowds by Minimax Entropy
Author: Dengyong Zhou, Sumit Basu, Yi Mao, John C. Platt
Abstract: An important way to make large training sets is to gather noisy labels from crowds of nonexperts. We propose a minimax entropy principle to improve the quality of these labels. Our method assumes that labels are generated by a probability distribution over workers, items, and labels. By maximizing the entropy of this distribution, the method naturally infers item confusability and worker expertise. We infer the ground truth by minimizing the entropy of this distribution, which we show minimizes the Kullback-Leibler (KL) divergence between the probability distribution and the unknown truth. We show that a simple coordinate descent scheme can optimize minimax entropy. Empirically, our results are substantially better than previously published methods for the same problem. 1
2 0.56904072 49 nips-2012-Automatic Feature Induction for Stagewise Collaborative Filtering
Author: Joonseok Lee, Mingxuan Sun, Guy Lebanon, Seung-jean Kim
Abstract: Recent approaches to collaborative filtering have concentrated on estimating an algebraic or statistical model, and using the model for predicting missing ratings. In this paper we observe that different models have relative advantages in different regions of the input space. This motivates our approach of using stagewise linear combinations of collaborative filtering algorithms, with non-constant combination coefficients based on kernel smoothing. The resulting stagewise model is computationally scalable and outperforms a wide selection of state-of-the-art collaborative filtering algorithms. 1
3 0.54545343 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback
Author: Andriy Mnih, Yee W. Teh
Abstract: User preferences for items can be inferred from either explicit feedback, such as item ratings, or implicit feedback, such as rental histories. Research in collaborative filtering has concentrated on explicit feedback, resulting in the development of accurate and scalable models. However, since explicit feedback is often difficult to collect it is important to develop effective models that take advantage of the more widely available implicit feedback. We introduce a probabilistic approach to collaborative filtering with implicit feedback based on modelling the user’s item selection process. In the interests of scalability, we restrict our attention to treestructured distributions over items and develop a principled and efficient algorithm for learning item trees from data. We also identify a problem with a widely used protocol for evaluating implicit feedback models and propose a way of addressing it using a small quantity of explicit feedback data. 1
4 0.49937701 75 nips-2012-Collaborative Ranking With 17 Parameters
Author: Maksims Volkovs, Richard S. Zemel
Abstract: The primary application of collaborate filtering (CF) is to recommend a small set of items to a user, which entails ranking. Most approaches, however, formulate the CF problem as rating prediction, overlooking the ranking perspective. In this work we present a method for collaborative ranking that leverages the strengths of the two main CF approaches, neighborhood- and model-based. Our novel method is highly efficient, with only seventeen parameters to optimize and a single hyperparameter to tune, and beats the state-of-the-art collaborative ranking methods. We also show that parameters learned on datasets from one item domain yield excellent results on a dataset from very different item domain, without any retraining. 1
5 0.47255662 107 nips-2012-Effective Split-Merge Monte Carlo Methods for Nonparametric Models of Sequential Data
Author: Michael C. Hughes, Erik B. Sudderth, Emily B. Fox
Abstract: Applications of Bayesian nonparametric methods require learning and inference algorithms which efficiently explore models of unbounded complexity. We develop new Markov chain Monte Carlo methods for the beta process hidden Markov model (BP-HMM), enabling discovery of shared activity patterns in large video and motion capture databases. By introducing split-merge moves based on sequential allocation, we allow large global changes in the shared feature structure. We also develop data-driven reversible jump moves which more reliably discover rare or unique behaviors. Our proposals apply to any choice of conjugate likelihood for observed data, and we show success with multinomial, Gaussian, and autoregressive emission models. Together, these innovations allow tractable analysis of hundreds of time series, where previous inference required clever initialization and lengthy burn-in periods for just six sequences. 1
6 0.44872418 74 nips-2012-Collaborative Gaussian Processes for Preference Learning
7 0.44161481 30 nips-2012-Accuracy at the Top
8 0.43928853 223 nips-2012-Multi-criteria Anomaly Detection using Pareto Depth Analysis
9 0.42216331 265 nips-2012-Parametric Local Metric Learning for Nearest Neighbor Classification
10 0.39630818 111 nips-2012-Efficient Sampling for Bipartite Matching Problems
11 0.39491996 242 nips-2012-Non-linear Metric Learning
12 0.38653338 165 nips-2012-Iterative ranking from pair-wise comparisons
13 0.36229914 246 nips-2012-Nonparametric Max-Margin Matrix Factorization for Collaborative Prediction
14 0.36165664 194 nips-2012-Learning to Discover Social Circles in Ego Networks
15 0.34872612 288 nips-2012-Rational inference of relative preferences
16 0.33428457 279 nips-2012-Projection Retrieval for Classification
17 0.31904814 103 nips-2012-Distributed Probabilistic Learning for Camera Networks with Missing Data
18 0.31368646 351 nips-2012-Transelliptical Component Analysis
19 0.30884892 343 nips-2012-Tight Bounds on Profile Redundancy and Distinguishability
20 0.30626222 57 nips-2012-Bayesian estimation of discrete entropy with mixtures of stick-breaking priors
topicId topicWeight
[(0, 0.011), (18, 0.417), (21, 0.025), (38, 0.121), (39, 0.017), (42, 0.034), (53, 0.048), (54, 0.013), (55, 0.01), (74, 0.022), (76, 0.075), (80, 0.051), (92, 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.74005514 189 nips-2012-Learning from the Wisdom of Crowds by Minimax Entropy
Author: Dengyong Zhou, Sumit Basu, Yi Mao, John C. Platt
Abstract: An important way to make large training sets is to gather noisy labels from crowds of nonexperts. We propose a minimax entropy principle to improve the quality of these labels. Our method assumes that labels are generated by a probability distribution over workers, items, and labels. By maximizing the entropy of this distribution, the method naturally infers item confusability and worker expertise. We infer the ground truth by minimizing the entropy of this distribution, which we show minimizes the Kullback-Leibler (KL) divergence between the probability distribution and the unknown truth. We show that a simple coordinate descent scheme can optimize minimax entropy. Empirically, our results are substantially better than previously published methods for the same problem. 1
2 0.62930441 130 nips-2012-Feature-aware Label Space Dimension Reduction for Multi-label Classification
Author: Yao-nan Chen, Hsuan-tien Lin
Abstract: Label space dimension reduction (LSDR) is an efficient and effective paradigm for multi-label classification with many classes. Existing approaches to LSDR, such as compressive sensing and principal label space transformation, exploit only the label part of the dataset, but not the feature part. In this paper, we propose a novel approach to LSDR that considers both the label and the feature parts. The approach, called conditional principal label space transformation, is based on minimizing an upper bound of the popular Hamming loss. The minimization step of the approach can be carried out efficiently by a simple use of singular value decomposition. In addition, the approach can be extended to a kernelized version that allows the use of sophisticated feature combinations to assist LSDR. The experimental results verify that the proposed approach is more effective than existing ones to LSDR across many real-world datasets. 1
3 0.36333355 30 nips-2012-Accuracy at the Top
Author: Stephen Boyd, Corinna Cortes, Mehryar Mohri, Ana Radovanovic
Abstract: We introduce a new notion of classification accuracy based on the top ⌧ -quantile values of a scoring function, a relevant criterion in a number of problems arising for search engines. We define an algorithm optimizing a convex surrogate of the corresponding loss, and discuss its solution in terms of a set of convex optimization problems. We also present margin-based guarantees for this algorithm based on the top ⌧ -quantile value of the scores of the functions in the hypothesis set. Finally, we report the results of several experiments in the bipartite setting evaluating the performance of our solution and comparing the results to several other algorithms seeking high precision at the top. In most examples, our solution achieves a better performance in precision at the top. 1
4 0.36170924 328 nips-2012-Submodular-Bregman and the Lovász-Bregman Divergences with Applications
Author: Rishabh Iyer, Jeff A. Bilmes
Abstract: We introduce a class of discrete divergences on sets (equivalently binary vectors) that we call the submodular-Bregman divergences. We consider two kinds, defined either from tight modular upper or tight modular lower bounds of a submodular function. We show that the properties of these divergences are analogous to the (standard continuous) Bregman divergence. We demonstrate how they generalize many useful divergences, including the weighted Hamming distance, squared weighted Hamming, weighted precision, recall, conditional mutual information, and a generalized KL-divergence on sets. We also show that the generalized Bregman divergence on the Lov´ sz extension of a submodular function, which we a call the Lov´ sz-Bregman divergence, is a continuous extension of a submodular a Bregman divergence. We point out a number of applications, and in particular show that a proximal algorithm defined through the submodular Bregman divergence provides a framework for many mirror-descent style algorithms related to submodular function optimization. We also show that a generalization of the k-means algorithm using the Lov´ sz Bregman divergence is natural in clustering scenarios where a ordering is important. A unique property of this algorithm is that computing the mean ordering is extremely efficient unlike other order based distance measures. 1
5 0.36083642 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes
Author: Thomas Furmston, David Barber
Abstract: Parametric policy search algorithms are one of the methods of choice for the optimisation of Markov Decision Processes, with Expectation Maximisation and natural gradient ascent being popular methods in this field. In this article we provide a unifying perspective of these two algorithms by showing that their searchdirections in the parameter space are closely related to the search-direction of an approximate Newton method. This analysis leads naturally to the consideration of this approximate Newton method as an alternative optimisation method for Markov Decision Processes. We are able to show that the algorithm has numerous desirable properties, absent in the naive application of Newton’s method, that make it a viable alternative to either Expectation Maximisation or natural gradient ascent. Empirical results suggest that the algorithm has excellent convergence and robustness properties, performing strongly in comparison to both Expectation Maximisation and natural gradient ascent. 1 Markov Decision Processes Markov Decision Processes (MDPs) are the most commonly used model for the description of sequential decision making processes in a fully observable environment, see e.g. [5]. A MDP is described by the tuple {S, A, H, p1 , p, π, R}, where S and A are sets known respectively as the state and action space, H ∈ N is the planning horizon, which can be either finite or infinite, and {p1 , p, π, R} are functions that are referred as the initial state distribution, transition dynamics, policy (or controller) and the reward function. In general the state and action spaces can be arbitrary sets, but we restrict our attention to either discrete sets or subsets of Rn , where n ∈ N. We use boldface notation to represent a vector and also use the notation z = (s, a) to denote a state-action pair. Given a MDP the trajectory of the agent is determined by the following recursive procedure: Given the agent’s state, st , at a given time-point, t ∈ NH , an action is selected according to the policy, at ∼ π(·|st ); The agent will then transition to a new state according to the transition dynamics, st+1 ∼ p(·|at , st ); this process is iterated sequentially through all of the time-points in the planning horizon, where the state of the initial time-point is determined by the initial state distribution s1 ∼ p1 (·). At each time-point the agent receives a (scalar) reward that is determined by the reward function, where this function depends on the current action and state of the environment. Typically the reward function is assumed to be bounded, but as the objective is linear in the reward function we assume w.l.o.g that it is non-negative. The most widely used objective in the MDP framework is to maximise the total expected reward of the agent over the course of the planning horizon. This objective can take various forms, including an infinite planning horizon, with either discounted or average rewards, or a finite planning horizon. The theoretical contributions of this paper are applicable to all three frameworks, but for notational ease and for reasons of space we concern ourselves with the infinite horizon framework with discounted rewards. In this framework the boundedness of the objective function is ensured by the 1 introduction of a discount factor, γ ∈ [0, 1), which scales the rewards of the various time-points in a geometric manner. 
Writing the objective function and trajectory distribution directly in terms of the parameter vector then, for any w ∈ W, the objective function takes the form ∞ Ept (a,s;w) γ t−1 R(a, s) , U (w) = (1) t=1 where we have denoted the parameter space by W and have used the notation pt (a, s; w) to represent the marginal p(st = s, at = a; w) of the joint state-action trajectory distribution H−1 p(a1:H , s1:H ; w) = π(aH |sH ; w) p(st+1 |at , st )π(at |st ; w) p1 (s1 ), H ∈ N. (2) t=1 Note that the policy is now written in terms of its parametric representation, π(a|s; w). It is well known that the global optimum of (1) can be obtained through dynamic programming, see e.g. [5]. However, due to various issues, such as prohibitively large state-action spaces or highly non-linear transition dynamics, it is not possible to find the global optimum of (1) in most real-world problems of interest. Instead most research in this area focuses on obtaining approximate solutions, for which there exist numerous techniques, such as approximate dynamic programming methods [6], Monte-Carlo tree search methods [19] and policy search methods, both parametric [27, 21, 16, 18] and non-parametric [2, 25]. This work is focused solely on parametric policy search methods, by which we mean gradient-based methods, such as steepest and natural gradient ascent [23, 1], along with Expectation Maximisation [11], which is a bound optimisation technique from the statistics literature. Since their introduction [14, 31, 10, 16] these methods have been the centre of a large amount of research, with much of it focusing on gradient estimation [21, 4], variance reduction techniques [30, 15], function approximation techniques [27, 8, 20] and real-world applications [18, 26]. While steepest gradient ascent has enjoyed some success it is known to suffer from some substantial issues that often make it unattractive in practice, such as slow convergence and susceptibility to poor scaling of the objective function [23]. Various optimisation methods have been introduced as an alternative, most notably natural gradient ascent [16, 24, 3] and Expectation Maximisation [18, 28], which are currently the methods of choice among parametric policy search algorithms. In this paper our primary focus is on the search-direction (in the parameter space) of these two methods. 2 Search Direction Analysis In this section we will perform a novel analysis of the search-direction of both natural gradient ascent and Expectation Maximisation. In gradient-based algorithms of Markov Decision Processes the update of the policy parameters take the form wnew = w + αM(w) w U (w), (3) + where α ∈ R is the step-size parameter and M(w) is some positive-definite matrix that possibly depends on w. It is well-known that such an update will increase the total expected reward, provided that α is sufficiently small, and this process will converge to a local optimum of (1) provided the step-size sequence is appropriately selected. While EM doesn’t have an update of the form given in (3) we shall see that the algorithm is closely related to such an update. It is convenient for later reference to note that the gradient w U (w) can be written in the following form w U (w) = Epγ (z;w)Q(z;w) w log π(a|s; w) , (4) where we use the expectation notation E[·] to denote the integral/summation w.r.t. a non-negative function. 
The term pγ (z; w) is a geometric weighted average of state-action occupancy marginals given by ∞ γ t−1 pt (z; w), pγ (z; w) = t=1 while the term Q(z; w) is referred to as the state-action value function and is equal to the total expected future reward from the current time-point onwards, given the current state-action pair, z, 2 and parameter vector, w, i.e. ∞ Ept (z ;w) γ t−1 R(z ) z1 = z . Q(z; w) = t=1 This is a standard result and due to reasons of space we have omitted the details, but see e.g. [27] or section(6.1) of the supplementary material for more details. An immediate issue concerning updates of the form (3) is in the selection of the ‘optimal’ choice of the matrix M(w), which clearly depends on the sense in which optimality is defined. There are numerous reasonable properties that are desirable of such an update, including the numerical stability and computational complexity of the parameter update, as well as the rate of convergence of the overall algorithm resulting from these updates. While all reasonable criteria the rate of convergence is of such importance in an optimisation algorithm that it is a logical starting point in our analysis. For this reason we concern ourselves with relating these two parametric policy search algorithms to the Newton method, which has the highly desirable property of having a quadratic rate of convergence in the vicinity of a local optimum. The Newton method is well-known to suffer from problems that make it either infeasible or unattractive in practice, but in terms of forming a basis for theoretical comparisons it is a logical starting point. We shall discuss some of the issues with the Newton method in more detail in section(3). In the Newton method the matrix M(w) is set to the negative inverse Hessian, i.e. M(w) = −H−1 (w), where H(w) = w T w U (w). where we have denoted the Hessian by H(w). Using methods similar to those used to calculate the gradient, it can be shown that the Hessian takes the form H(w) = H1 (w) + H2 (w), (5) where ∞ Ep(z1:t ;w) γ t−1 R(zt ) w Ep(z1:t ;w) γ t−1 R(zt ) H1 (w) = w log p(z1:t ; w) T w log p(z1:t ; w) , (6) t=1 ∞ H2 (w) = T w log p(z1:t ; w) . (7) t=1 We have omitted the details of the derivation, but these can be found in section(6.2) of the supplementary material, with a similar derivation of a sample-based estimate of the Hessian given in [4]. 2.1 Natural Gradient Ascent To overcome some of the issues that can hinder steepest gradient ascent an alternative, natural gradient, was introduced in [16]. Natural gradient ascent techniques originated in the neural network and blind source separation literature, see e.g. [1], and take the perspective that the parameter space has a Riemannian manifold structure, as opposed to a Euclidean structure. Deriving the steepest ascent direction of U (w) w.r.t. a local norm defined on this parameter manifold (as opposed to w.r.t. the Euclidean norm, which is the case in steepest gradient ascent) results in natural gradient ascent. We denote the quadratic form that induces this local norm on the parameter manifold by G(w), i.e. d(w)2 = wT G(w)w. The derivation for natural gradient ascent is well-known, see e.g. [1], and its application to the objective (1) results in a parameter update of the form wk+1 = wk + αk G−1 (wk ) w U (wk ). −1 In terms of (3) this corresponds to M(w) = G (w). In the case of MDPs the most commonly used local norm is given by the Fisher information matrix of the trajectory distribution, see e.g. 
[3, 24], and due to the Markovian structure of the dynamics it is given by G(w) = −Epγ (z;w) w T w log π(a|s; w) . (8) We note that there is an alternate, but equivalent, form of writing the Fisher information matrix, see e.g. [24], but we do not use it in this work. 3 In order to relate natural gradient ascent to the Newton method we first rewrite the matrix (7) into the following form H2 (w) = Epγ (z;w)Q(z;w) w T w log π(a|s; w) . (9) For reasons of space the details of this reformulation of (7) are left to section(6.2) of the supplementary material. Comparing the Fisher information matrix (8) with the form of H2 (w) given in (9) it is clear that natural gradient ascent has a relationship with the approximate Newton method that uses H2 (w) in place of H(w). In terms of (3) this approximate Newton method corresponds to setting −1 M(w) = −H2 (w). In particular it can be seen that the difference between the two methods lies in the non-negative function w.r.t. which the expectation is taken in (8) and (9). (It also appears that there is a difference in sign, but observing the form of M(w) for each algorithm shows that this is not the case.) In the Fisher information matrix the expectation is taken w.r.t. to the geometrically weighted summation of state-action occupancy marginals of the trajectory distribution, while in H2 (w) there is an additional weighting from the state-action value function. Hence, H2 (w) incorporates information about the reward structure of the objective function, whereas the Fisher information matrix does not, and so it will generally contain more information about the curvature of the objective function. 2.2 Expectation Maximisation The Expectation Maximisation algorithm, or EM-algorithm, is a powerful optimisation technique from the statistics literature, see e.g. [11], that has recently been the centre of much research in the planning and reinforcement learning communities, see e.g. [10, 28, 18]. A quantity of central importance in the EM-algorithm for MDPs is the following lower-bound on the log-objective log U (w) ≥ Hentropy (q(z1:t , t)) + Eq(z1:t ,t) log γ t−1 R(zt )p(z1:t ; w) , (10) where Hentropy is the entropy function and q(z1:t , t) is known as the ‘variational distribution’. Further details of the EM-algorithm for MDPs and a derivation of (10) are given in section(6.3) of the supplementary material or can be found in e.g. [18, 28]. The parameter update of the EM-algorithm is given by the maximum (w.r.t. w) of the ‘energy’ term, Q(w, wk ) = Epγ (z;wk )Q(z;wk ) log π(a|s; w) . (11) Note that Q is a two-parameter function, where the first parameter occurs inside the expectation and the second parameter defines the non-negative function w.r.t. the expectation is taken. This decoupling allows the maximisation over w to be performed explicitly in many cases of interest. For example, when the log-policy is quadratic in w the maximisation problems is equivalent to a least-squares problem and the optimum can be found through solving a linear system of equations. It has previously been noted, again see e.g. [18], that the parameter update of steepest gradient ascent and the EM-algorithm can be related through this ‘energy’ term. In particular the gradient (4) evaluated at wk can also be written as follows w|w=wk U (w) = 10 w|w=wk Q(w, wk ), where 10 we use the notation w to denote the first derivative w.r.t. the first parameter, while the update of the EM-algorithm is given by wk+1 = argmaxw∈W Q(w, wk ). 
In other words, steepest gradient ascent moves in the direction that most rapidly increases Q(w, wk ) w.r.t. the first variable, while the EM-algorithm maximises Q(w, wk ) w.r.t. the first variable. While this relationship is true, it is also quite a negative result. It states that in situations where it is not possible to explicitly perform the maximisation over w in (11) then the alternative, in terms of the EM-algorithm, is this generalised EM-algorithm, which is equivalent to steepest gradient ascent. Considering that algorithms such as EM are typically considered because of the negative aspects related to steepest gradient ascent this is an undesirable alternative. It is possible to find the optimum of (11) numerically, but this is also undesirable as it results in a double-loop algorithm that could be computationally expensive. Finally, this result provides no insight into the behaviour of the EM-algorithm, in terms of the direction of its parameter update, when the maximisation over w in (11) can be performed explicitly. Instead we provide the following result, which shows that the step-direction of the EM-algorithm has an underlying relationship with the Newton method. In particular we show that, under suitable 4 regularity conditions, the direction of the EM-update, i.e. wk+1 − wk , is the same, up to first order, as the direction of an approximate Newton method that uses H2 (w) in place of H(w). Theorem 1. Suppose we are given a Markov Decision Process with objective (1) and Markovian trajectory distribution (2). Consider the update of the parameter through Expectation Maximisation at the k th iteration of the algorithm, i.e. wk+1 = argmaxw∈W Q(w, wk ). Provided that Q(w, wk ) is twice continuously differentiable in the first parameter we have that −1 wk+1 − wk = −H2 (wk ) w|w=wk U (w) + O( wk+1 − wk 2 ). (12) Additionally, in the case where the log-policy is quadratic the relation to the approximate Newton method is exact, i.e. the second term on the r.h.s. (12) is zero. Proof. The idea of the proof is simple and only involves performing a Taylor expansion of 10 w Q(w, wk ). As Q is assumed to be twice continuously differentiable in the first component this Taylor expansion is possible and gives 10 w Q(wk+1 , wk ) = 10 w Q(wk , wk ) + 20 w Q(wk , wk )(wk+1 − wk ) + O( wk+1 − wk 2 ). (13) As wk+1 = argmaxw∈W Q(w, wk ) it follows that 10 Q(wk+1 , wk ) = 0. This means that, upon w ignoring higher order terms in wk+1 − wk , the Taylor expansion (13) can be rewritten into the form wk+1 − wk = − 20 −1 w Q(wk , wk ) 10 w Q(wk , wk ). (14) 10 = The proof is completed by observing that w|w=wk U (w) and w Q(wk , wk ) 20 Q(wk , wk ) = H2 (wk ). The second statement follows because in the case where the log-policy w is quadratic the higher order terms in the Taylor expansion vanish. 2.3 Summary In this section we have provided a novel analysis of both natural gradient ascent and Expectation Maximisation when applied to the MDP framework. Previously, while both of these algorithms have proved popular methods for MDP optimisation, there has been little understanding of them in terms of their search-direction in the parameter space or their relation to the Newton method. Firstly, our analysis shows that the Fisher information matrix, which is used in natural gradient ascent, is similar to H2 (w) in (5) with the exception that the information about the reward structure of the problem is not contained in the Fisher information matrix, while such information is contained in H2 (w). 
Additionally we have shown that the step-direction of the EM-algorithm is, up to first order, an approximate Newton method that uses H2 (w) in place of H(w) and employs a constant step-size of one. 3 An Approximate Newton Method −1 A natural follow on from the analysis in section(2) is the consideration of using M(w) = −H2 (w) in (3), a method we call the full approximate Newton method from this point onwards. In this section we show that this method has many desirable properties that make it an attractive alternative to other parametric policy search methods. Additionally, denoting the diagonal matrix formed from the diagonal elements of H2 (w) by D2 (w), we shall also consider the method that uses M(w) = −1 −D2 (w) in (3). We call this second method the diagonal approximate Newton method. Recall that in (3) it is necessary that M(w) is positive-definite (in the Newton method this corresponds to requiring the Hessian to be negative-definite) to ensure an increase of the objective. In general the objective (1) is not concave, which means that the Hessian will not be negative-definite over the entire parameter space. In such cases the Newton method can actually lower the objective and this is an undesirable aspect of the Newton method. An attractive property of the approximate Hessian, H2 (w), is that it is always negative-definite when the policy is log–concave in the policy parameters. This fact follows from the observation that in such cases H2 (w) is a non-negative mixture of negative-definite matrices, which again is negative-definite [9]. Additionally, the diagonal 5 terms of a negative-definite matrix are negative and so D2 (w) is also negative-definite when the controller is log-concave. To motivate this result we now briefly consider some widely used policies that are either log-concave or blockwise log-concave. Firstly, consider the Gibb’s policy, π(a|s; w) ∝ exp wT φ(a, s), where φ(a, s) ∈ Rnw is a feature vector. This policy is widely used in discrete systems and is logconcave in w, which can be seen from the fact that log π(a|s; w) is the sum of a linear term and a negative log-sum-exp term, both of which are concave [9]. In systems with a continuous stateaction space a common choice of controller is π(a|s; wmean , Σ) = N (a|Kφ(s) + m, Σ(s)), where wmean = {K, m} and φ(s) ∈ Rnw is a feature vector. The notation Σ(s) is used because there are cases where is it beneficial to have state dependent noise in the controller. This controller is not jointly log-concave in wmean and Σ, but it is blockwise log-concave in wmean and Σ−1 . In terms of wmean the log-policy is quadratic and the coefficient matrix of the quadratic term is negative-definite. In terms of Σ−1 the log-policy consists of a linear term and a log-determinant term, both of which are concave. In terms of evaluating the search direction it is clear from the forms of D2 (w) and H2 (w) that many of the pre-existing gradient evaluation techniques can be extended to the approximate Newton framework with little difficulty. In particular, gradient evaluation requires calculating the expectation of the derivative of the log-policy w.r.t. pγ (z; w)Q(z; w). In terms of inference the only additional calculation necessary to implement either the full or diagonal approximate Newton methods is the calculation of the expectation (w.r.t. to the same function) of the Hessian of the log-policy, or its diagonal terms. 
As an example, in section(6.5) of the supplementary material we detail the extension of the recurrent-state formulation of gradient evaluation in the average-reward framework, see e.g. [31], to the approximate Newton method. We use this extension in the Tetris experiment that we consider in section(4). Given n_s samples and n_w parameters, the complexity of these extensions scales as O(n_s n_w) for the diagonal approximate Newton method, and as O(n_s n_w^2) for the full approximate Newton method.

An issue with the Newton method is the inversion of the Hessian matrix, which scales as O(n_w^3) in the worst case. In the standard application of the Newton method this inversion has to be performed at each iteration, and in systems with many parameters it becomes prohibitively costly. In general H(w) will be dense and no computational savings will be possible when performing this matrix inversion. The same is not true, however, of the matrices D_2(w) and H_2(w). Firstly, as D_2(w) is a diagonal matrix it is trivial to invert. Secondly, there is an immediate source of sparsity that comes from taking the second derivative of the log-trajectory distribution in (7). This property ensures that any (product) sparsity over the control parameters in the log-trajectory distribution will correspond to sparsity in H_2(w). For example, in a partially observable Markov Decision Process where the policy is modelled through a finite state controller, see e.g. [22], there are three functions to be optimised, namely the initial belief distribution, the belief transition dynamics and the policy. When the parameters of these three functions are independent, H_2(w) will be block-diagonal (across the parameters of the three functions) and the matrix inversion can be performed more efficiently by inverting each of the blocks individually, as sketched below. The reason that H(w) does not exhibit any such sparsity properties is the term H_1(w) in (5), which consists of a non-negative mixture of outer-product matrices. The vector in these outer products is the derivative of the log-trajectory distribution, and this typically produces a dense matrix.

An undesirable aspect of steepest gradient ascent is that its performance is affected by the choice of basis used to represent the parameter space. An important and desirable property of the Newton method is that it is invariant to non-singular linear (affine) transformations of the parameter space, see e.g. [9]. This means that, given a non-singular linear (affine) mapping T \in R^{n_w \times n_w}, the Newton update of the objective \tilde{U}(w) = U(Tw) is related to the Newton update of the original objective through the same linear (affine) mapping, i.e. v + \Delta v_{nt} = T(w + \Delta w_{nt}), where v = Tw and \Delta v_{nt} and \Delta w_{nt} denote the respective Newton steps. In other words, running the Newton method on U(w) and on \tilde{U}(T^{-1}w) will give identical results. An important point to note is that this desirable property is maintained when using H_2(w) in an approximate Newton method, while using D_2(w) results in a method that is invariant to rescaling of the parameters, i.e. to the case where T is a diagonal matrix with non-zero elements along the diagonal. This can be seen by using the linearity of the expectation operator, and a proof of this statement is provided in section(6.4) of the supplementary material.
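The block-wise solve referred to above can be written in a few lines; the function name and the representation of the blocks as a list of dense arrays are illustrative assumptions.

    import numpy as np

    def solve_blockwise(blocks, grad):
        """Solve H2 x = grad when H2 is block-diagonal across independent parameter
        groups (e.g. the three functions of a finite state controller), inverting
        each block separately instead of the full n_w x n_w matrix."""
        x, i = [], 0
        for B in blocks:                        # blocks: list of (k_j, k_j) diagonal blocks
            k = B.shape[0]
            x.append(np.linalg.solve(B, grad[i:i + k]))
            i += k
        return np.concatenate(x)

The full approximate Newton direction is then -solve_blockwise(blocks, grad_U), and the cost is dominated by the largest block rather than by n_w as a whole.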
Figure 1: (a) An empirical illustration of the affine invariance of the approximate Newton method, performed on the two-state MDP of [16]; the axes are the two policy parameters θ1 and θ2. The plot shows the trace of the policy during training for the two different parameter spaces, where the results of the latter have been mapped back into the original parameter space for comparison. It shows the two steepest gradient ascent traces (blue cross and blue circle) and the two traces of the full approximate Newton method (red cross and red circle). (b) Results on the Tetris problem (completed lines against training iterations) for steepest gradient ascent (black), natural gradient ascent (green), the diagonal approximate Newton method (blue) and the full approximate Newton method (red).

4 Experiments

The first experiment we performed was an empirical illustration that the full approximate Newton method is invariant to linear transformations of the parameter space. We considered the simple two-state example of [16] because its policy has only two parameters, which allows us to plot the trace of the policy during training. The policy was trained using both steepest gradient ascent and the full approximate Newton method, in both the original and a linearly transformed parameter space. The policy traces of the two algorithms are plotted in figure(1.a). As expected, steepest gradient ascent is affected by such mappings, whilst the full approximate Newton method is invariant to them.

The second experiment was aimed at demonstrating the scalability of the full and diagonal approximate Newton methods to problems with a large state space. We considered the Tetris domain, a popular computer game designed by Alexey Pajitnov in 1985; see [12] for more details. Firstly, we compared the performance of the full and diagonal approximate Newton methods to other parametric policy search methods. Tetris is typically played on a 20 × 10 grid, but due to computational costs we considered a 10 × 10 grid in the experiment. This results in a state space with roughly 7 × 2^100 states. We modelled the policy through a Gibbs distribution, where we considered a feature vector with the following features: the height of each column, the difference in height between adjacent columns, the maximum height and the number of 'holes'. Under this policy it is not possible to obtain the explicit maximum over w in (11), so a straightforward application of EM is not possible in this problem. We therefore compared the diagonal and full approximate Newton methods with steepest and natural gradient ascent. For reasons of space, the exact implementation of the experiment is detailed in section(6.6) of the supplementary material. We ran 100 repetitions of the experiment, each consisting of 100 training iterations, and the mean and standard error of the results are given in figure(1.b). It can be seen that the full approximate Newton method outperforms all of the other methods, while the performance of the diagonal approximate Newton method is comparable to that of natural gradient ascent. We also ran several training runs of the full approximate Newton method on the full-sized 20 × 10 board and were able to obtain a score in the region of 14,000 completed lines, reached after roughly 40 training iterations. An approximate dynamic programming based method has previously been applied to the Tetris domain in [7].
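A sketch of the kind of Gibbs policy used here is given below: the feature vector is computed on the board that would result from each candidate placement, and the policy is a softmax over those placements. The exact feature definitions (in particular how holes are counted) and the choice to evaluate features on successor boards are our own assumptions; the paper's precise construction is in its supplementary material.

    import numpy as np

    def tetris_features(board):
        """Illustrative feature vector: column heights, adjacent height differences,
        maximum height and number of holes.  `board` is a 2D 0/1 array, row 0 = top."""
        rows, cols = board.shape
        filled = board > 0
        tops = np.where(filled.any(axis=0), filled.argmax(axis=0), rows)
        heights = rows - tops                              # height of each column
        diffs = np.abs(np.diff(heights))                   # adjacent height differences
        holes = sum(int(np.sum(~filled[tops[c]:, c]))      # empty cells below a column's top
                    for c in range(cols) if heights[c] > 0)
        return np.concatenate([heights, diffs, [heights.max(), holes]]).astype(float)

    def gibbs_policy(candidate_boards, w):
        """pi(a|s; w) proportional to exp(w^T phi(s, a)), as a softmax over the boards
        that would result from each candidate placement a."""
        phi = np.array([tetris_features(b) for b in candidate_boards])
        logits = phi @ w
        logits -= logits.max()                             # numerical stability
        p = np.exp(logits)
        return p / p.sum()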
The same set of features was used there and a score of roughly 4,500 completed lines was obtained after around 6 training iterations, after which the solution deteriorated.

In the third experiment we considered a finite-horizon (controlled) linear dynamical system. This allowed the search directions of the various algorithms to be computed exactly using [13] and removed any issues of approximate inference from the comparison. In particular we considered a 3-link rigid manipulator, linearised through feedback linearisation, see e.g. [17]. This system has a 6-dimensional state space, a 3-dimensional action space and a 22-dimensional parameter space. Further details of the system can be found in section(6.7) of the supplementary material.

Figure 2: (a) The normalised total expected reward plotted against training time, in seconds, for the 3-link rigid manipulator. The plot shows the results for steepest gradient ascent (black), EM (blue), natural gradient ascent (green) and the approximate Newton method (red), giving the mean and standard error of the results. (b) The normalised total expected reward plotted against training iterations for the synthetic non-linear system of [29]. The plot shows the results for EM (blue), steepest gradient ascent (black), natural gradient ascent (green) and the approximate Newton method (red), again giving the mean and standard error of the results.

We ran the experiment 100 times and the mean and standard error of the results are plotted in figure(2.a). In this experiment the approximate Newton method found substantially better solutions than steepest gradient ascent, natural gradient ascent or Expectation Maximisation. The superiority of the results in comparison to either steepest or natural gradient ascent can be explained by the fact that H_2(w) gives a better estimate of the curvature of the objective function. Expectation Maximisation performed poorly in this experiment, exhibiting sub-linear convergence. Steepest gradient ascent required 3684 ± 314 training iterations in this experiment which, in comparison to the 203 ± 34 and 310 ± 40 iterations of natural gradient ascent and the approximate Newton method respectively, illustrates the susceptibility of this method to poor scaling.

In the final experiment we considered the synthetic non-linear system of [29]. Full details of the system and the experiment can be found in section(6.8) of the supplementary material. We ran the experiment 100 times and the mean and standard error of the results are plotted in figure(2.b). Again the approximate Newton method outperforms both steepest and natural gradient ascent. In this example only the mean parameters of the Gaussian controller are optimised, while the parameters of the noise are held fixed, which means that the log-policy is quadratic in the policy parameters. Hence, in this example the EM-algorithm is a particular (less general) instance of the approximate Newton method in which a fixed step-size of one is used throughout. The marked difference in performance between the EM-algorithm and the approximate Newton method shows the benefit of being able to tune the step-size sequence.
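The role of the step-size sequence can be seen in a generic training loop like the one below. The decaying schedule alpha_k = a / (b + k) is an assumption made for illustration; the schedules actually used in the experiments are described in the paper's supplementary material.

    def train(w, direction_fn, iterations=100, a=1.0, b=10.0):
        # direction_fn(w) returns the approximate Newton direction -H2(w)^{-1} grad U(w);
        # a constant step of one would recover the EM update when the log-policy is quadratic.
        for k in range(iterations):
            alpha_k = a / (b + k)                 # assumed decaying step-size schedule
            w = w + alpha_k * direction_fn(w)
        return w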
In this experiment we considered five different step-size sequences for the approximate Newton method, and all of them obtained better results than the EM-algorithm. In contrast, only one of the seven step-size sequences considered for steepest and natural gradient ascent outperformed the EM-algorithm.

5 Conclusion

The contributions of this paper are twofold: firstly, we have given a novel analysis of Expectation Maximisation and natural gradient ascent when applied to the MDP framework, showing that both have close connections to an approximate Newton method; secondly, prompted by this analysis, we have considered the direct application of this approximate Newton method to the optimisation of MDPs, showing that it has numerous desirable properties that are not present in the naive application of the Newton method. In terms of empirical performance we have found the approximate Newton method to perform consistently well in comparison to EM and natural gradient ascent, highlighting its viability as an alternative to either of these methods. At present we have only considered actor-type implementations of the approximate Newton method, and the extension to actor-critic methods is a point of future research.

References

[1] S. Amari. Natural Gradient Works Efficiently in Learning. Neural Computation, 10:251-276, 1998.
[2] M. Azar, V. Gómez, and H. Kappen. Dynamic Policy Programming with Function Approximation. Journal of Machine Learning Research - Proceedings Track, 15:119-127, 2011.
[3] J. Bagnell and J. Schneider. Covariant Policy Search. IJCAI, 18:1019-1024, 2003.
[4] J. Baxter and P. Bartlett. Infinite Horizon Policy Gradient Estimation. Journal of Artificial Intelligence Research, 15:319-350, 2001.
[5] D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, second edition, 2000.
[6] D. P. Bertsekas. Approximate Policy Iteration: A Survey and Some New Methods. Research report, Massachusetts Institute of Technology, 2010.
[7] D. P. Bertsekas and S. Ioffe. Temporal Differences-Based Policy Iteration and Applications in Neuro-Dynamic Programming. Research report, Massachusetts Institute of Technology, 1997.
[8] S. Bhatnagar, R. Sutton, M. Ghavamzadeh, and M. Lee. Natural Actor-Critic Algorithms. Automatica, 45:2471-2482, 2009.
[9] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[10] P. Dayan and G. E. Hinton. Using Expectation-Maximization for Reinforcement Learning. Neural Computation, 9:271-278, 1997.
[11] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1-38, 1977.
[12] C. Fahey. Tetris AI, Computers Play Tetris. http://colinfahey.com/tetris/tetris_en.html, 2003.
[13] T. Furmston and D. Barber. Efficient Inference for Markov Control Problems. UAI, 29:221-229, 2011.
[14] P. W. Glynn. Likelihood Ratio Gradient Estimation for Stochastic Systems. Communications of the ACM, 33:75-84, 1990.
[15] E. Greensmith, P. Bartlett, and J. Baxter. Variance Reduction Techniques for Gradient-Based Estimates in Reinforcement Learning. Journal of Machine Learning Research, 5:1471-1530, 2004.
[16] S. Kakade. A Natural Policy Gradient. NIPS, 14:1531-1538, 2002.
[17] H. Khalil. Nonlinear Systems. Prentice Hall, 2001.
[18] J. Kober and J. Peters. Policy Search for Motor Primitives in Robotics. Machine Learning, 84(1-2):171-203, 2011.
[19] L. Kocsis and C. Szepesvári. Bandit Based Monte-Carlo Planning. European Conference on Machine Learning (ECML), 17:282-293, 2006.
[20] V. R. Konda and J. N. Tsitsiklis. On Actor-Critic Algorithms. SIAM J. Control Optim., 42(4):1143-1166, 2003.
[21] P. Marbach and J. Tsitsiklis. Simulation-Based Optimisation of Markov Reward Processes. IEEE Transactions on Automatic Control, 46(2):191-209, 2001.
[22] N. Meuleau, L. Peshkin, K. Kim, and L. Kaelbling. Learning Finite-State Controllers for Partially Observable Environments. UAI, 15:427-436, 1999.
[23] J. Nocedal and S. Wright. Numerical Optimization. Springer, 2006.
[24] J. Peters and S. Schaal. Natural Actor-Critic. Neurocomputing, 71(7-9):1180-1190, 2008.
[25] K. Rawlik, M. Toussaint, and S. Vijayakumar. On Stochastic Optimal Control and Reinforcement Learning by Approximate Inference. Robotics: Science and Systems, 2012.
[26] S. Richter, D. Aberdeen, and J. Yu. Natural Actor-Critic for Road Traffic Optimisation. NIPS, 19:1169-1176, 2007.
[27] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy Gradient Methods for Reinforcement Learning with Function Approximation. NIPS, 13:1057-1063, 2000.
[28] M. Toussaint, S. Harmeling, and A. Storkey. Probabilistic Inference for Solving (PO)MDPs. Research Report EDI-INF-RR-0934, University of Edinburgh, School of Informatics, 2006.
[29] N. Vlassis, M. Toussaint, G. Kontes, and S. Piperidis. Learning Model-Free Robot Control by a Monte Carlo EM Algorithm. Autonomous Robots, 27(2):123-130, 2009.
[30] L. Weaver and N. Tao. The Optimal Reward Baseline for Gradient-Based Reinforcement Learning. UAI, 17(29):538-545, 2001.
[31] R. Williams. Simple Statistical Gradient Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8:229-256, 1992.
6 0.36027721 313 nips-2012-Sketch-Based Linear Value Function Approximation
7 0.35924765 359 nips-2012-Variational Inference for Crowdsourcing
8 0.35821271 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)
9 0.3564215 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification
10 0.3558009 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization
11 0.35518587 236 nips-2012-Near-Optimal MAP Inference for Determinantal Point Processes
12 0.35486612 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound
13 0.3544566 227 nips-2012-Multiclass Learning with Simplex Coding
14 0.35419965 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models
15 0.35374117 120 nips-2012-Exact and Stable Recovery of Sequences of Signals with Sparse Increments via Differential 1-Minimization
16 0.3535378 187 nips-2012-Learning curves for multi-task Gaussian process regression
17 0.35335112 114 nips-2012-Efficient coding provides a direct link between prior and likelihood in perceptual Bayesian inference
18 0.35306317 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback
19 0.35262054 324 nips-2012-Stochastic Gradient Descent with Only One Projection
20 0.35190654 358 nips-2012-Value Pursuit Iteration