nips nips2011 nips2011-240 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Daniel Hernández-lobato, Jose M. Hernández-lobato, Pierre Dupont
Abstract: Multi-class Gaussian Process Classifiers (MGPCs) are often affected by overfitting problems when labeling errors occur far from the decision boundaries. To prevent this, we investigate a robust MGPC (RMGPC) which considers labeling errors independently of their distance to the decision boundaries. Expectation propagation is used for approximate inference. Experiments with several datasets in which noise is injected in the labels illustrate the benefits of RMGPC. This method performs better than other Gaussian process alternatives based on considering latent Gaussian noise or heavy-tailed processes. When no noise is injected in the labels, RMGPC still performs equal or better than the other methods. Finally, we show how RMGPC can be used for successfully identifying data instances which are difficult to classify correctly in practice. 1
Reference: text
sentIndex sentText sentNum sentScore
1 be Abstract Multi-class Gaussian Process Classifiers (MGPCs) are often affected by overfitting problems when labeling errors occur far from the decision boundaries. [sent-7, score-0.176]
2 To prevent this, we investigate a robust MGPC (RMGPC) which considers labeling errors independently of their distance to the decision boundaries. [sent-8, score-0.231]
3 Experiments with several datasets in which noise is injected in the labels illustrate the benefits of RMGPC. [sent-10, score-0.136]
4 This method performs better than other Gaussian process alternatives based on considering latent Gaussian noise or heavy-tailed processes. [sent-11, score-0.204]
5 When no noise is injected in the labels, RMGPC still performs equal or better than the other methods. [sent-12, score-0.11]
6 Finally, we show how RMGPC can be used for successfully identifying data instances which are difficult to classify correctly in practice. [sent-13, score-0.12]
7 MGPCs assume that there are some latent functions (one per class) whose value at a certain location is related by some rule to the probability of observing a specific class there. [sent-15, score-0.132]
8 The task of interest is to make inference about the latent functions using Bayes’ theorem. [sent-17, score-0.093]
9 Nevertheless, exact Bayesian inference in MGPCs is typically intractable and one has to rely on approximate methods. [sent-18, score-0.106]
10 Approximate inference can be implemented using Markov-chain Monte Carlo sampling, the Laplace approximation or expectation propagation [2, 3, 4, 5]. [sent-19, score-0.105]
11 The consequence is that over-fitting can become a serious problem when errors far from these boundaries are observed in practice. [sent-21, score-0.103]
12 A notable exception is found in the binary classification case when the labeling rule suggested in [6] is used. [sent-22, score-0.099]
13 Such rule considers the possibility of observing errors independently of their distance to the decision boundary [7, 8]. [sent-23, score-0.196]
14 Existing generalizations 1 are in practice simplified so that the probability of observing errors in the labels is zero [3]. [sent-25, score-0.131]
15 Labeling errors in the context of MGPCs are often accounted for by considering that the latent functions of the MGPC are contaminated with additive Gaussian noise [1]. [sent-26, score-0.247]
16 Nevertheless, this approach has again the disadvantage of considering only errors near the decision boundaries of the resulting classifier and is expected to lead to over-fitting problems when errors are actually observed far from the boundaries. [sent-27, score-0.24]
17 These processes have marginal distributions with heavier tails than those of a Gaussian distribution and are in consequence expected to be more robust to labeling errors far from the decision boundaries. [sent-29, score-0.275]
18 In this paper we investigate a robust MGPC (RMGPC) that addresses labeling errors by introducing a set of binary latent variables. [sent-30, score-0.241]
19 These latent variables indicate whether the assumed labeling rule is satisfied for the associated instances or not. [sent-32, score-0.288]
20 This is used as a back-up mechanism to explain data instances that are highly unlikely to stem from the assumed labeling rule. [sent-34, score-0.273]
21 The resulting likelihood function depends only on the total number of errors, and not on the distances of these errors to the decision boundaries. [sent-35, score-0.109]
22 In this model, expectation propagation (EP) can be used to efficiently carry out approximate inference [10]. [sent-37, score-0.153]
23 The cost of EP is O(ln3 ), where n is the number of training instances and l is the number of different classes. [sent-38, score-0.14]
24 When labeling noise is introduced in the data, RMGPC outperforms other MGPC approaches based on considering latent Gaussian noise or heavy-tailed processes. [sent-41, score-0.266]
25 Extra experiments also illustrate the utility of RMGPC to identify data instances that are unlikely to stem from the assumed labeling rule. [sent-43, score-0.293]
26 Section 3 describes how expectation propagation can be used for approximate Bayesian inference. [sent-45, score-0.129]
27 2 Robust Multi-Class Gaussian Process Classification Consider n training instances in the form of a collection of feature vectors X = {x1 , . [sent-48, score-0.12]
28 We follow [3] and assume that, in the noise free scenario, the predictive rule for yi given xi is yi = arg maxfk (xi ) , (1) k where f1 , . [sent-58, score-0.405]
29 , fl are unknown latent functions that have to be estimated. [sent-61, score-0.163]
30 In this latter case, the pair (xi , yi ) is considered to be an outlier and, instead of assuming that yi is generated by (1), we assume that xi is assigned a random class sampled uniformly from C. [sent-67, score-0.371]
31 , fl have been contaminated with an infinite amount of noise and serves as a back-up mechanism to explain observations which are highly unlikely to originate from (1). [sent-71, score-0.209]
32 , fl (xn ))T given y, X and z is 1−zi n z 1 i P(y|X, z, f ) = Θ(fyi (xi ) − fk (xi )) , (2) l i=1 k=yi where Θ(·) is the Heaviside step function. [sent-84, score-0.241]
33 In (2), the contribution to the likelihood of each instance (xi , yi ) is a a mixture of two terms: A first term equal to k=yi Θ(fyi (xi ) − fk (xi )) and a second term equal to 1/l. [sent-85, score-0.3]
34 The mixing coefficient is the prior probability of zi = 1. [sent-86, score-0.165]
35 Thus, the likelihood function described in 2 (2) considers only the total number of prediction errors made by f and not the distance of these errors to the decision boundary. [sent-89, score-0.207]
36 The consequence is that (2) is expected to be robust when the observed data contain labeling errors far from the decision boundaries. [sent-90, score-0.207]
37 Thus, z is set to follow a priori a factorizing multivariate Bernoulli distribution: n ρzi (1 − ρ)1−zi , P(z|ρ) = Bern(z|ρ) = (3) i=1 where ρ is the prior fraction of training instances expected to be outliers. [sent-92, score-0.174]
38 The prior for ρ is set to be a conjugate beta distribution, namely ρa0 −1 (1 − ρ)b0 −1 P(ρ) = Beta(ρ|a0 , b0 ) = , (4) B(a0 , b0 ) where B(·, ·) is the beta function and a0 and b0 are free hyper-parameters. [sent-93, score-0.153]
39 , fl is set to be a product of Gaussian processes with means equal to 0 and covariance matrices K1 , . [sent-99, score-0.181]
40 , cl (·, ·): l N (fk |0, Kk ) P(f ) = (5) k=1 where N (·|µ, Σ) denotes a multivariate Gaussian density with mean vector µ and covariance matrix Σ, f is defined as in (2) and fk = (fk (x1 ), fk (x2 ), . [sent-105, score-0.371]
41 The posterior distribution and the likelihood function can be used to compute a predictive distribution for the label y ∈ C associated to a new observation x : P(y |x , y, X) = P(y |x , z , f )P(z |ρ)P(f |f )P(ρ, z, f |y, X) df df dρ , (7) z ,z where f = (f1 (x ), . [sent-114, score-0.148]
42 , fl (x ))T , P(y |x , z , f ) = k=y Θ(fk (x ) − fy (x ))1−z (1/l)z , P(z |ρ) = ρz (1 − ρ)1−z and P(f |f ) is a product of l conditional Gaussians with zero mean and covariance matrices given by the covariance functions of K1 , . [sent-117, score-0.202]
43 For this, we only have to marginalize (8) with respect to all the components of z except zi . [sent-125, score-0.144]
44 Nevertheless, these expressions can be approximated using expectation propagation [10]. [sent-127, score-0.107]
45 (11) k=1 In (11) the dependence of the exact and the approximate factors on f , z and ρ has been removed ˜ to improve readability. [sent-131, score-0.154]
46 The approximate factors ψ are constrained to belong to the same family of exponential distributions, but they do not have to integrate to one. [sent-132, score-0.12]
47 Therefore, Q has the same form as the approximate factors and Z can be readily computed. [sent-136, score-0.12]
48 In practice, the form of Q is selected first, and the approximate ˜ factors are then constrained to have the same form as Q. [sent-137, score-0.12]
49 Initialize all the approximate factors ψ and the posterior approximation Q to be uniform. [sent-141, score-0.179]
50 ˜ ˜ (b) Update the approximate factor ψ ˜ ˜ (c) Update the posterior approximation Q to the normalized version of ψQ\ψ . [sent-145, score-0.107]
51 When = 0, no update of the approximate factors occurs. [sent-155, score-0.12]
52 Note that Q factorizes with respect to fk for k = 1, . [sent-165, score-0.147]
53 More accurate approximations can be obtained at a cubic cost in l by considering correlations among the fk . [sent-170, score-0.195]
54 The approximate factors must have the same functional form as Q but they need not be normalized. [sent-172, score-0.12]
55 , n and k = yi , corresponding to the likelihood, (2), only depend on fk (xi ), fyi (xi ) and zi . [sent-176, score-0.492]
56 Thus, the beta part of the corresponding approximate factors can be removed and the multivariate Gaussian distributions simplify to univariate Gaussians. [sent-177, score-0.208]
57 ˜ Specifically, the approximate factors ψik with i = 1, . [sent-178, score-0.12]
58 , n and k = yi are: (fyi (xi ) − µyi )2 ˜ik 1 (fk (xi ) − µik )2 ˜ ˜ ψik (f , z, ρ) = sik exp − ˜ + yi 2 νik ˜ νik ˜ pzi (1 − pik )1−zi , ˜ik ˜ (14) where sik , pik , µik , νik , µyi and νik are free parameters to be estimated by EP. [sent-181, score-0.523]
59 Similarly, the exact ˜ ˜ ˜ ˜ ˜ik ˜yi factors ψi , with i = 1, . [sent-182, score-0.106]
60 , n, corresponding to the prior for the latent variables z, (3), only depend on ρ and zi . [sent-185, score-0.234]
61 Thus, the Gaussian part of the corresponding approximate factors can be removed and the multivariate Bernoulli distribution simplifies to a univariate Bernoulli. [sent-186, score-0.153]
62 The resulting factors are: ˜ ˜ ψi (f , z, ρ) = si ρai −1 (1 − ρ)bi −1 pzi (1 − pi )1−zi , ˜ ˜ ˜i ˜ (15) for i = 1, . [sent-187, score-0.156]
63 The same applies to ˜ the exact factors ψk , for k = 1, . [sent-194, score-0.106]
64 2 The EP Update Operations ˜ The approximate factors ψik , for i = 1, . [sent-210, score-0.12]
65 , n and k = yi , corresponding to the likelihood, are ˜ refined in parallel, as in [17]. [sent-213, score-0.119]
66 3 Model Evidence, Prediction and Outlier Identification Once EP has converged, we can evaluate the approximation to the model evidence as the integral of the product of all the approximate terms. [sent-229, score-0.095]
67 Specifically, one can show that, if EP has converged, the gradient of the free parameters of the approximate factors with respect to θkj is zero [18]. [sent-242, score-0.142]
68 , bk )T with bk = n 1 2 i k=yi 1 ∂Kk −1 k + (υ k )T (M−1 )T M υ , k 2 ∂θkj k (18) µyi /˜ik , if k = yi , and bk = µik /˜ik otherwise. [sent-246, score-0.29]
69 , l, with kk equal to the covariances between x and X, and with κk equal to the corresponding variance at x , as computed by ck (·, ·). [sent-250, score-0.155]
70 The posterior (8) of z can be similarly approximated by marginalizing Q with respect to ρ and f : n pzi (1 − pi )1−zi , i P(z|y, X) ≈ Bern(z|p) = (21) i=1 where p = (p1 , . [sent-253, score-0.169]
71 Thus, these parameters can be used to identify the data instances that are more likely to be outliers. [sent-260, score-0.14]
72 SMGPC explains data instances for which (1) is not satisfied in practice by considering Gaussian noise in the estimation of the functions f1 , . [sent-269, score-0.199]
73 , fl , which is the typical approach found in the literature [1]. [sent-272, score-0.094]
74 In HTPC, the prior for each latent function fk , with k = 1, . [sent-274, score-0.237]
75 , l, is a Gaussian Process that has been non-linearly transformed to have marginals that follow hyperbolic secant distributions with scale parameter bk . [sent-277, score-0.117]
76 The hyperbolic secant distribution has heavier tails than the Gaussian distribution and is expected to perform better in the presence of outliers. [sent-278, score-0.105]
77 1 Classification of Noisy Data We carry out experiments on four datasets extracted from the UCI repository [11] and from other sources [12] to evaluate the predictive performance of RMGPC, SMGPC and HTPC when different fractions of outliers are present in the data2 . [sent-280, score-0.107]
78 Furthermore, the labels of η ∈ {0%, 5%, 10%, 20%} of the training instances are selected uniformly at random from C. [sent-285, score-0.146]
79 The BCR of a method with prediction accuracy ak on those instances of class k (k = 1, . [sent-291, score-0.12]
80 Dataset New-thyroid Wine Glass SVMguide2 # Instances 215 178 214 319 # Attributes 5 13 9 20 # Classes 3 3 6 3 # Source UCI UCI UCI LIBSVM In our experiments, the different methods analyzed (RMGPC, SMGPC and HTPC) use the same covariance function for each latent function, i. [sent-297, score-0.113]
81 Preliminary experiments on the datasets analyzed show no significant benefit from considering a different covariance function for each latent function. [sent-303, score-0.165]
82 , l, of SMGPC are also added an extra term equal to ϑ2 to account for latent Gaussian noise with variance ϑ2 k k around fk [1]. [sent-307, score-0.289]
83 These extra terms are used by SMGPC to explain those instances that are unlikely to stem from (1). [sent-308, score-0.228]
84 2 Outlier Identification A second batch of experiments shows the utility of RMGPC to identify data instances that are likely to be outliers. [sent-424, score-0.14]
85 After normalizing the Glass dataset, we run RMGPC on the whole data and estimate the posterior probability that each instance is an outlier using (21). [sent-430, score-0.191]
86 Figure 1 shows for each instance (xi , yi ) of the Glass dataset, with i = 1, . [sent-432, score-0.153]
87 Note that most of the instances are considered to be outliers with very low posterior probability. [sent-436, score-0.211]
88 Nevertheless, there is a small set of instances that have very high posterior probabilities. [sent-437, score-0.179]
89 These instances are unlikely to stem from (1) and are expected to be misclassified when placed on the test set. [sent-438, score-0.206]
90 Consider the set of instances that are more likely to be outliers than normal instances (i. [sent-439, score-0.272]
91 Table 3 displays the fraction of times that each of these instances is misclassified by RMGPC, SMGPC and HTPC when placed on the test set. [sent-443, score-0.12]
92 The table shows that all the instances are typically misclassified by all the classifiers investigated, which confirms the difficulty of obtaining accurate predictions for them in practice. [sent-445, score-0.12]
93 RMGPC considers only the number of errors made, and not the distance of such errors to the decision boundaries of the classifier. [sent-499, score-0.236]
94 This is achieved by introducing binary latent variables that indicate when a given instance is considered to be an outlier (wrongly labeled instance) or not. [sent-500, score-0.201]
95 RMGPC can also identify the training instances that are more likely to be outliers. [sent-501, score-0.14]
96 Nevertheless, approximate inference can be efficiently carried out using expectation propagation (EP). [sent-503, score-0.153]
97 Experiments in four multi-class classification problems show the benefits of RMGPC when labeling noise is injected in the data. [sent-505, score-0.153]
98 In this case, RMGPC performs better than other alternatives based on considering latent Gaussian noise or noise which follows a distribution with heavy tails. [sent-506, score-0.223]
99 Our experiments also confirm the utility of RMGPC to identify data instances that are difficult to classify accurately in practice. [sent-508, score-0.14]
100 These instances are typically misclassified by different predictors when included in the test set. [sent-509, score-0.12]
wordName wordTfidf (topN-words)
[('rmgpc', 0.69), ('smgpc', 0.296), ('ik', 0.264), ('htpc', 0.197), ('ep', 0.149), ('fk', 0.147), ('zi', 0.144), ('instances', 0.12), ('yi', 0.119), ('kk', 0.119), ('glass', 0.118), ('mgpcs', 0.099), ('outlier', 0.098), ('fl', 0.094), ('bcr', 0.087), ('fyi', 0.082), ('mgpc', 0.082), ('errors', 0.074), ('factors', 0.072), ('latent', 0.069), ('labeling', 0.067), ('kj', 0.063), ('gaussian', 0.062), ('posterior', 0.059), ('pik', 0.058), ('bk', 0.057), ('classi', 0.056), ('beta', 0.055), ('propagation', 0.053), ('noise', 0.051), ('wine', 0.05), ('hern', 0.05), ('pzi', 0.049), ('sik', 0.049), ('approximate', 0.048), ('stem', 0.047), ('uci', 0.047), ('bern', 0.045), ('covariance', 0.044), ('louvain', 0.043), ('unlikely', 0.039), ('mk', 0.038), ('ck', 0.036), ('pi', 0.035), ('injected', 0.035), ('decision', 0.035), ('xi', 0.035), ('exact', 0.034), ('instance', 0.034), ('misclassi', 0.033), ('barbe', 0.033), ('sainte', 0.033), ('secant', 0.033), ('multivariate', 0.033), ('rule', 0.032), ('process', 0.032), ('outliers', 0.032), ('df', 0.031), ('robust', 0.031), ('observing', 0.031), ('laplace', 0.03), ('boundaries', 0.029), ('jos', 0.029), ('catholique', 0.029), ('icteam', 0.029), ('expectation', 0.028), ('considering', 0.028), ('evidence', 0.027), ('predictive', 0.027), ('nevertheless', 0.027), ('miguel', 0.027), ('hyperbolic', 0.027), ('bayesian', 0.026), ('labels', 0.026), ('approximated', 0.026), ('belgium', 0.025), ('contaminated', 0.025), ('heavier', 0.025), ('kl', 0.025), ('christopher', 0.024), ('inference', 0.024), ('performs', 0.024), ('repository', 0.024), ('considers', 0.024), ('datasets', 0.024), ('edward', 0.024), ('processes', 0.023), ('damping', 0.023), ('ers', 0.022), ('free', 0.022), ('extra', 0.022), ('zoubin', 0.021), ('er', 0.021), ('prior', 0.021), ('xn', 0.02), ('tails', 0.02), ('libsvm', 0.02), ('cost', 0.02), ('product', 0.02), ('identify', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 240 nips-2011-Robust Multi-Class Gaussian Process Classification
Author: Daniel Hernández-lobato, Jose M. Hernández-lobato, Pierre Dupont
Abstract: Multi-class Gaussian Process Classifiers (MGPCs) are often affected by overfitting problems when labeling errors occur far from the decision boundaries. To prevent this, we investigate a robust MGPC (RMGPC) which considers labeling errors independently of their distance to the decision boundaries. Expectation propagation is used for approximate inference. Experiments with several datasets in which noise is injected in the labels illustrate the benefits of RMGPC. This method performs better than other Gaussian process alternatives based on considering latent Gaussian noise or heavy-tailed processes. When no noise is injected in the labels, RMGPC still performs equal or better than the other methods. Finally, we show how RMGPC can be used for successfully identifying data instances which are difficult to classify correctly in practice. 1
2 0.10234656 258 nips-2011-Sparse Bayesian Multi-Task Learning
Author: Shengbo Guo, Onno Zoeter, Cédric Archambeau
Abstract: We propose a new sparse Bayesian model for multi-task regression and classification. The model is able to capture correlations between tasks, or more specifically a low-rank approximation of the covariance matrix, while being sparse in the features. We introduce a general family of group sparsity inducing priors based on matrix-variate Gaussian scale mixtures. We show the amount of sparsity can be learnt from the data by combining an approximate inference approach with type II maximum likelihood estimation of the hyperparameters. Empirical evaluations on data sets from biology and vision demonstrate the applicability of the model, where on both regression and classification tasks it achieves competitive predictive performance compared to previously proposed methods. 1
3 0.087171659 260 nips-2011-Sparse Features for PCA-Like Linear Regression
Author: Christos Boutsidis, Petros Drineas, Malik Magdon-Ismail
Abstract: Principal Components Analysis (PCA) is often used as a feature extraction procedure. Given a matrix X ∈ Rn×d , whose rows represent n data points with respect to d features, the top k right singular vectors of X (the so-called eigenfeatures), are arbitrary linear combinations of all available features. The eigenfeatures are very useful in data analysis, including the regularization of linear regression. Enforcing sparsity on the eigenfeatures, i.e., forcing them to be linear combinations of only a small number of actual features (as opposed to all available features), can promote better generalization error and improve the interpretability of the eigenfeatures. We present deterministic and randomized algorithms that construct such sparse eigenfeatures while provably achieving in-sample performance comparable to regularized linear regression. Our algorithms are relatively simple and practically efficient, and we demonstrate their performance on several data sets.
4 0.080290005 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning
Author: Jun Zhu, Ning Chen, Eric P. Xing
Abstract: Unlike existing nonparametric Bayesian models, which rely solely on specially conceived priors to incorporate domain knowledge for discovering improved latent representations, we study nonparametric Bayesian inference with regularization on the desired posterior distributions. While priors can indirectly affect posterior distributions through Bayes’ theorem, imposing posterior regularization is arguably more direct and in some cases can be much easier. We particularly focus on developing infinite latent support vector machines (iLSVM) and multi-task infinite latent support vector machines (MT-iLSVM), which explore the largemargin idea in combination with a nonparametric Bayesian model for discovering predictive latent features for classification and multi-task learning, respectively. We present efficient inference methods and report empirical studies on several benchmark datasets. Our results appear to demonstrate the merits inherited from both large-margin learning and Bayesian nonparametrics.
5 0.072450563 288 nips-2011-Thinning Measurement Models and Questionnaire Design
Author: Ricardo Silva
Abstract: Inferring key unobservable features of individuals is an important task in the applied sciences. In particular, an important source of data in fields such as marketing, social sciences and medicine is questionnaires: answers in such questionnaires are noisy measures of target unobserved features. While comprehensive surveys help to better estimate the latent variables of interest, aiming at a high number of questions comes at a price: refusal to participate in surveys can go up, as well as the rate of missing data; quality of answers can decline; costs associated with applying such questionnaires can also increase. In this paper, we cast the problem of refining existing models for questionnaire data as follows: solve a constrained optimization problem of preserving the maximum amount of information found in a latent variable model using only a subset of existing questions. The goal is to find an optimal subset of a given size. For that, we first define an information theoretical measure for quantifying the quality of a reduced questionnaire. Three different approximate inference methods are introduced to solve this problem. Comparisons against a simple but powerful heuristic are presented. 1 Contribution A common goal in the applied sciences is to measure concepts of interest that are not directly observable (Bartholomew et al., 2008). Such is the case in the social sciences, medicine, economics and other fields, where quantifying key attributes such as “consumer satisfaction,” “anxiety” and “recession” requires the development of indicators: observable variables that are postulated to measure the target latent variables up to some measurement error (Bollen, 1989; Carroll et al., 1995). In a probabilistic framework, this often boils down to a latent variable model (Bishop, 1998). One common setup is to assume each observed indicator Yi as being generated independently given the set of latent variables X. Conditioning on any given observed data point Y gives information about the distribution of the latent vector X, which can then be used for ranking, clustering, visualization or smoothing, among other tasks. Figure 1 provides an illustration. Questionnaires from large surveys are sometimes used to provide such indicators, each Yi recording an answer that typically corresponds to a Bernoulli or ordinal variable. For instance, experts can be given questions concerning whether there is freedom of press in a particular nation, as a way of measuring its democratization level (Bollen, 1989; Palomo et al., 2007). Nations can then be clustering or ranked within an interpretable latent space. Long questionnaires have nevertheless drawbacks, as summarized by Stanton et al. (2002) in the context of psychometric studies: Longer surveys take more time to complete, tend to have more missing data, and have higher refusal rates than short surveys. Arguably, then, techniques to reducing the length of scales while maintaining psychometric quality are wortwhile. 1 Factor scores: countries in the latent space Y2 Y3 Y4 Y5 0 Y1 5 X2 (Democratization) X1 (Industrialization) Democratization 10 Dem1960 Dem1965 1 5 9 13 18 23 28 33 38 43 48 53 58 63 68 73 Country (ordered by industrialization factor) (a) (b) Figure 1: (a) A graphical representation of a latent variable model. Notice that in general latent variables will be dependent. Here, the question is how to quantify democratization and industrialization levels of nations given observed indicators Y such as freedom of press and gross national product, among others (Bollen, 1989; Palomo et al., 2007). (b) An example of a result implied by the model (adapted from Palomo et al. (2007)): barplots of the conditional distribution of democratization levels given the observed indicators at two time points, ordered by the posterior mean industrialization level. The distribution of the latent variables given the observations is the basis of the analysis. Our contribution is a methodology for choosing which indicators to preserve (e.g., which items to keep in a questionnaire) given: i.) a latent variable model specification of the domain of interest; ii.) a target number of indicators that should be preserved. To accomplish this, we provide: i.) a target objective function that quantifies the amount of information preserved by a choice of a subset of indicators, with respect to the full set; ii.) algorithms for optimizing this choice of subset with respect to the objective function. The general idea is to start with a target posterior distribution of latent variables, defined by some latent variable measurement model M (i.e., PM (X | Y)). We want to choose a subset Yz ⊂ Y so that the resulting conditional distribution PM (X | Yz ) is as close as possible to the original one according to some metric. Model M is provided either by expertise or by numerous standard approaches that can be applied to learn it from data (e.g., methods in Bishop, 2009). We call this task measurement model thinning. Notice that the size of Yz is a domain-dependent choice. Assuming M is a good model for the data, choosing a subset of indicators will incur some information loss. It is up to the analyst to choose a trade-off between loss of information and the design of simpler, cheaper ways of measuring latent variables. Even if a shorter questionnaire is not to be deployed, the outcome of measurement model thinning provides a formal sensitivity analysis of the target latent distribution with respect to the available indicators. The result is useful to generate different insights into the domain. This paper is organized as follows: Section 2 defines a formal criterion to quantify how appropriate a subset Yz is. Section 3 describes different approaches in which this criterion can be optimized. Related work is briefly discussed in Section 4. Experiments with synthetic and real data are discussed in Section 5, followed by the conclusion. 2 An Information-Theoretical Criterion Our focus is on domains where latent variables are not a by-product of a dimensionality reduction technique, but the target of the analysis as in the example of Figure 1. That is, measurement error problems where the variables to be recorded are designed specifically to obtain information concerning such unknowns (Carroll et al., 1995; Bartholomew et al., 2008). As such, we postulate that the outcome of any analysis should be a functional of PM (X | Y), the conditional distribution of unobservables X given observables Y within a model M. It is assumed that M specifies the joint PM (X, Y). We further assume that observed variables are conditionally independent given X, i.e. p PM (X, Y) = PM (X) i=1 PM (Yi | X), with p being the number of observed indicators. 2 If z ≡ (z1 , z2 , . . . , zp ) is a binary vector of the same dimensionality as Y, and Yz is the subset of Y corresponding the non-zero entries of z, we can assess z by the KL divergence PM (X | Y) dX KL(PM (X | Y) || PM (X | Yz )) ≡ PM (X | Y) log PM (X | Yz ) This is well-defined, since both distributions lie in the same sample space despite the difference of dimensionality between Y and Yz . Moreover, since Y is itself a random vector, our criterion becomes the expected KL divergence KL(PM (X | Y) || PM (X | Yz )) PM (Y) where · denotes expectation. Our goal is to minimize this function with respect to z. Rearranging this expression to drop all constants that do not depend on z, and multiplying it by −1 to get a maximization problem, we obtain the problem of finding z⋆ such that z⋆ = argmaxz log(PM (Yz | X)) PM (X,Yz ) − log(PM (Yz )) PM (Yz ) p zi log(PM (Yi | X)) = argmaxz ≡ + HM (Yz ) i=1 argmaxz FM (z) PM (X,Yi ) p i=1 zi = K for a choice of K, and zi ∈ {0, 1}. HM (·) denotes here the entropy of subject to a distribution parameterized by M. Notice we used the assumption that indicators are mutually independent given X. There is an intuitive appeal of having a joint entropy term to reward not only marginal relationships between indicators and latent variables, but also selections that are jointly diverse. Notice that optimizing this objective function turns out to be equivalent to minimizing the conditional entropy of latent variables given Yz . Motivating conditional entropy from a more fundamental principle illustrates that other functions can be obtained by changing the divergence. 3 Approaches for Approximate Optimization The problem of optimizing FM (z) subject to the constraints p zi = K, zi ∈ {0, 1}, is hard i=1 not only for its combinatorial nature, but due to the entropy term. This needs to be approximated, and the nature of the approximation should depend on the form taken by M. We will assume that it is possible to efficiently compute any marginals of PM (Y) of modest dimensionality (say, 10 dimensions). This is the case, for instance, in the probit model for binary data: X ∼ N (0, Σ), Yi⋆ ∼ N (ΛT X + λi;0 , 1), i Yi = 1, if Yi⋆ > 0, and 0 otherwise where N (m, S) is the multivariate Gaussian distribution with mean m and covariance matrix S. The probit model is one of the most common latent variable models for questionnaire data (Bartholomew et al., 2008), with a straigthforward extension to ordinal data. In this model, marginals for a few dozen variables can be obtained efficiently since this corresponds to calculating multivariate Gaussian probabilities (Genz, 1992). Parameters can be fit by a variety of methods (Hahn et al., 2010). We also assume that M allows for the computation of log(PM (Yi | X)) PM (X,Yi ) at little cost. Again, in the binary probit model this is simple, since this requires integrating away a single binary variable Yi and a univariate Gaussian ΛT X. i 3.1 Gaussian Entropy One approximation to FM (z) is to replace its entropy term by the corresponding entropy from some Gaussian distribution PN (Yz ). The entropy of a Gaussian distribution is proportional to the logarithm of the determinant of its covariance matrix, and hence can be computed in O(p3 ) steps. This Gaussian can be chosen as the one closest to PM (Yz ) in a KL(PM || PN ) sense: that is, the one with the same first and second moments as PM (Yz ). In our case, computing these moments can be done deterministically (up to numerical error) using standard bivariate quadrature methods. No expectation-propagation (Minka, 2001) is necessary. The corresponding objective function is p zi log(PM (Yi | X)) FM;N (z) ≡ i=1 3 PM (X,Yi ) + 0.5 log |Σz | where Σz is the covariance matrix of Yz – which for binary and ordinal data has a sensible interpretation. This function is also an upper bound on the exact function, FM (z), since the Gaussian is the distribution with the largest entropy for a given mean vector and covariance matrix. The resulting function is non-linear in z. In our experiments, we optimize for z using a greedy scheme: for all possible pairs (i, j) such that zi = 1 and zj = 0, we swap its values (so that i zi is always K). We choose the pair with the highest increase in FM;N (z) and repeat the process until convergence. 3.2 Entropy with Bounded Neighborhoods An alternative bound can be derived from a standard fact in information theory: H(Y | S) ≤ H(Y | S ′ ) for S ′ ⊆ S, where H(· | ·) denotes conditional entropy. This was exploited by Globerson and Jaakkola (2007) to define an upper bound in the entropy of a distribution as follows: consider a permutation e of the set {1, 2, . . . , p}, with e(i) being the i-th element of e. Denote by e(1 : i) the first i elements of this permutation (an empty set if i < 1). Moreover, let N (e, i) be a subset of e(1 : i − 1). For a given set variables Y = {Y1 , Y2 , . . . , Yp } the following bound holds: p n H(Ye(i) | YN (e,i) ) H(Ye(i) | Ye(1:i−1) ) ≤ H(Y1 , Y2 , . . . Yp ) = (1) i=1 i=1 If each set N (e, i) is no larger than some constant D, then this bound can be computed in O(p · 2D ) steps for binary probit models. The bound holds for any choice of e, but we want it to be as tight as possible so that it gets weighted in a reasonable way against the other terms in FM (·). Since the entropy function is decomposable as a sum of functions that depend on i and N (e, i) only, one can minimize this bound with respect to e by using permutation optimization methods such as (Jaakkola et al., 2010). In our implementation, we use a method similar to Teyssier and Koller (2005) that shuffles neighboring entries of e to generate candidates, chooses the optimal N (e, i) for each i given the candidate e, and picks as the next permutation the candidate e with the greatest decrease in the bound. Notice that a permutation choice e and neighborhood choices N (e, i) define a Bayesian network where N (e, i) are the parents of Ye(i) . Therefore, if this Bayesian network model provides a good approximation to PM (Y), the bound will be reasonably tight. Given e, we will further relax this bound with the goal of obtaining an integer programming formulation for the problem of optimizing an upper bound to FM (z). For any given z, we define the local term HL (z, i) as HL (z, i) ≡ HM (Ye(i) | Yz ∩N (e, i)) = S∈P (N (e,i)) j∈S zj k∈N (e,i)\S (1 − zk ) HM (Ye(i) | S) (2) where P (·) denotes the power set of a set. The new approximate objective function becomes p p zi log(PM (Yi | X)) FM;D (z) ≡ PM (X,Yi ) ze(i) HL (z, i) + (3) i=1 i=1 Notice that HL (z, i) is still an upper bound on HM (Ye(i) | Ye(1:i−1) ). The intuition is that we are bounding HM (Yz ) by the entropy of a Bayesian network where a vertex Ye(i) is included if ze(i) = 1, with corresponding parents given by Yz ∩ N (e, i). This is a well-defined Bayesian network for any choice of z. The shortcoming is that ideally we would like this Bayesian network to be the actual marginal of the model given by e and N (e, i). It is not: if the network implied by e and N (e, i) was, for instance, Y1 → Y2 → Y3 , the choice of z = (1, 0, 1) would result on the entropy of the disconnected graph {Y1 , Y3 }, while the true marginal would correspond instead to the graph Y1 → Y3 . However, our simplified marginalization has the advantage of avoiding an intractable problem. Moreover, it allows us to redefine the problem as an integer linear program (ILP). Each product ze(i) j zj k (1−zk ) appearing in (3) results in a sum of O(2D ) terms, each of which has (up to a sign) the form qM ≡ m∈M zm for some set M . It is still the case that qM ∈ {0, 1}. Therefore, objective function (3) can be interpreted as being linear on a set of binary variables {{z}, {q}}. We need further to enforce the constraints coming from qM = 1 ⇒ {∀m ∈ M, zm = 1}; qM = 0 ⇒ {∃m ∈ M s.t. zm = 0} 4 It is well-known (Glover and Woolsey, 1974) that this corresponds to the linear constraints qM = 1 ⇒ {∀m ∈ M, zm = 1} ⇔ ∀m ∈ M, qM − zm ≤ 0 qM = 0 ⇒ {∃m ∈ M s.t. zm = 0} ⇔ m∈M zm − qM ≤ |M | − 1 p which combined with the linear constraint i=1 zi = K implies that optimizing FM;D (z) is an ILP with O(p · 2D ) variables and O(p2 · 2D ) constraints. In our experiments in Section 5, we were able to solve essentially all of such ILPs exactly using linear programming relaxations with branch-and-bound. 3.3 Entropy with Tree-Structured Bounds The previous bound simplifies marginalization, which might badly overestimate entropies where the corresponding Yz are uniformly spread out in permutation e. We now propose a different type of bound which treats different marginalizations on an equal footing. It comes from the following observation: since H(Ye(i) | Ye(1:i−1) ) is less than or equal to any conditional entropy H(Ye(i) | Yj ) for j ∈ e(1 : i − 1), we have that the tighest bound given by singleton conditioning sets is H(Ye(i) | Ye(1:i−1) ) ≤ min j∈e(1:i−1) HM (Ye(i) | Yj ), resulting in the objective function p p zi log(PM (Yi | X)) FM;tree (z) ≡ ze(i) · PM (X,Yi ) + i=1 i=1 min {Yj ∈Ye(1:i−1) ∩Yz } H(Ye(i) | Yj ) (4) where min{Yj ∈Ye(1:i−1) ∩Yz } H(Ye(i) | Yj ) ≡ H(Ye(i) ) if Ye(1:i−1) ∩ Yz = ∅. The intuition is that we are bounding the exact entropy using the entropy of a directed tree rooted at Yez (1) , the first element of Yz according to e. That is, all variables are marginally dependent in the approximation regardless of what z is, and for a fixed z the tree is, by construction, the one obtained by the usual greedy algorithm of adding edges corresponding to the next legal pair of vertices with maximum mutual information (following an ordering, in this case). It turns out we can also write (4) as a linear objective function of a polynomial number of 0\1 (i−1) (2) (1) be the values of set variables and constraints. Let zi ≡ 1 − zi . Let Hi , Hi , . . . , Hi ¯ (i−1) (1) be{HM (Ye(i) | Ye(1) ), . . . , HM (Ye(i) | Ye(i−1) )} sorted in ascending order, with zi , . . . , zi ing the corresponding permutation of {ze(1) , . . . , ze(i−1) }. We have min{Yj ∈Ye(1:i−1) ∩Yz } H(Ye(i) | Yj ) (j) (j) j−1 (1) (1) (1) (2) (2) (1) (2) (3) (3) ¯ ¯ ¯ = z i Hi + z i z i H i + z i z i z i H i + . . . (i−2) (i−1) (i−1) (1) i−1 (j) + j=1 zi HM (Ye(i) ) ¯ Hi zi ¯ zi . . . zi ¯ (i) i−1 (j) (j) + qi HM (Ye(i) ) ≡ j=1 qi Hi (k) ¯ where qi ≡ zi k=1 zi , and also a binary 0\1 variable. Plugging this expression into (4) gives a linear objective function in this extended variable space. The corresponding constraints are (j−1) (j) (1) (j) , zi }, zm = 1} ¯ z qi = 1 ⇒ {∀zm ∈ {¯i , . . . , zi (j−1) (j) (1) (j) , zi } s.t. zm = 0} ¯ z qi = 0 ⇒ {∃zm ∈ {¯i , . . . , zi which, as shown in the previous section, can be written as linear constraints (substituting each zi ¯ by 1 − zi ). The total number of constraints is however O(p3 ), which can be expensive, and often a linear relaxation procedure with branch-and-bound fails to provide guarantees of optimality. 3.4 The Reliability Score Finally, it is important to design cheap, effective criteria whose maxima correlate with the maxima of FM (·). Empirically, we have found high quality selections in binary probit models using the solution to the problem p p wi zi , subject to zi ∈ {0, 1}, maximize FM;R (z) = zi = K i=1 i=1 5 where wi = ΛT ΣΛi . This can be solved by picking the corresponding indicators with the highest i K weights wi . Assuming a probit model where the measurement error for each Yi⋆ has the same variance of 1, this score is related to the “reliability” of an indicator. Simply put, the reliability Ri of an indicator is the proportion of its variance that is due to the latent variables (Bollen, 1989, Chapter 6): Ri = wi /(wi + 1) for each Yi⋆ . There is no current theory linking this solution to the problem of maximizing FM (·): since there is no entropy term, we can set an adversarial problem to easily defeat this method. For instance, this happens in a model where the K indicators of highest reliability all measure the same latent variable Xi and nothing else – much information about Xi would be preserved, but little about other variables. In any case, we found this criterion to be fairly competitive even if at times it produces extreme failures. An honest account of more sophisticated selection mechanisms cannot be performed without including it, as we do in Section 5. 4 Related Work The literature on survey analysis, in the context of latent variable models, contains several examples of guidelines on how to simplify questionnaires (sometimes described as providing “shortened versions” of scales). Much of the literature, however, consists of describing general guidelines and rules-of-thumb to accomplish this task (e.g, Richins, 2004; Stanton et al., 2002). One possible exception is Leite et al. (2008), which uses different model fitness criteria with respect to a given dataset to score candidate solutions, along with an expensive combinatorial optimization method. This conflates model selection and questionnaire thinning, and there is no theory linking the score functions to the amount of information preserved. In the machine learning and statistics literature, there is a large body of research in active learning, which is related to our task. One of the closest approaches is the one by Liang et al. (2009), which casts the classical problem of measurement selection within a Bayesian graphical model perspective. In that work, one has to choose which measurements to add. This is done sequentially, partially motivated by problems where collecting new measurements can be done relatively fast and cheap (say, by paying graduate students to annotate text data), and so the choice of next measurement can make use of fresh data. In our case, it not might be realistic to expect we can perform a large number of iterations of data collection – and as such the task of reducing the number of measurements from a large initial collection might be more relevant in practice. Liang et al. also focus on (multivariate) supervised learning instead of purely unsupervised learning. In statistics there is also a considerable body of literature on sufficient dimension reduction and its sparse variants (e.g., Chen et al., 2010). Such techniques create a bottleneck between two sets of variables in a regression problem (say, the mapping from Y to X) while eliminating some of the input variables. In principle one might want to adapt such models to take a latent variable model M as the target mapping. Besides some loss of interpretability, the computational implications might be problematic, though. Moreover, this framework has another free parameter corresponding to the dimensionality of the bottleneck that has to be set. It is not clear how this parameter, along with a choice of sparsity level, would interact with a fixed choice K of indicators to be kept. 5 Experiments In this section, we first describe some synthetic experiments to provide insights about the different methods, followed by one brief description of a case study. In all of the experiments, the target models M are binary probit. We set the neighborhood parameter for FM;N (·) to 9. The ordering e for the tree-structured method is obtained by the same greedy search of Section 3.2, where now the score is the average of all H(Yi | Yj ) for all Yj preceding Yi . Finally, all ordering optimization methods were initialized by sorting indicators in a descending order according to their reliability scores, and the initial solution for all entropy-based optimization methods was given by the reliability score solution of Section 3.4. The integer program solver G UROBI 4.02 was used in all experiments. 5.1 Synthetic studies We start with a batch of synthetic experiments. We generated 80 models with 40 indicators and 10 latent variables1 . We further preprocess such models into two groups: in 40 of them, we select a 1 Details on the model generation: we generate 40 models by sampling the latent covariance matrix from an inverse Wishart distribution with 10 degrees of freedom and scale matrix 10I, I being the identity matrix. 6 Improvement ratio: high signal Mean error: high signal Improvement ratio: low signal Mean error: low signal 0.6 0.5 0.5 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0 0.25 Reliability score Reliability score 0.4 0.2 0.15 0.25 0.2 0.15 −0.1 −0.1 0.1 N/R T/R G/R N/S T/S G/S N/R T/R G/R N/S T/S 0.1 G/S 0.1 0.15 0.2 0.25 0.3 0.1 0.15 0.2 Tree bound (a) (c) (b) 0.25 0.3 Tree bound (d) Figure 2: (a) A comparison of the bounded neighborhood (N ), tree-based (T ) and Gaussian (G) methods with respect to a random solution (R) and the reliability score (S). (b) A similar comparison for models where indicators are more weakly correlated to the latent variables than in (a). (c) and (d) Scatterplots of the average absolute deviance for the tree-based method (horizontal axis) against the reliability method (vertical axis). The bottom-left clouds correspond to the K = 32 trials. target reliability ri for each indicator Yi , uniformly at random from the interval [0.4 0.7]. We then rescale coefficients Λi such that the reliability (defined in Section 3.4) of the respective Yi⋆ becomes ri . For the remaining 40 models, we sample ri uniformly at random from the interval [0.2 0.4]. We perform two choices of subsets: sets Yz of size 20 and 32 (50% and 80% of the total number of indicators). Our evaluation is as follows: since the expected value is perhaps the most common functional of the posterior distribution PM (X | Y), we calculate the expected value of the latent variables for a sample {y(1) , y(2) , . . . , y(1000) } of size 1000 taken from the respective synthetic models. This is done for the full set of 40 indicators, and for each set chosen by our four criteria: for each data point i and each objective function F , we evaluate the average distance (i) (i) (i) (i) ˆ ˆ dF ≡ 10 |ˆj − xj;F |/10. In this case, xj is the expected value of Xj obtained by conditioning j=1 x (i) on all indicators, while xj;F is the one obtained with the subset selected by optimizing F . We denote ˆ (1) (2) (1000) by mF the average of {dF , dF , . . . , dF }. Finally, we compare the three main methods with respect to the reliability score method using the improvement ratio statistic sF = 1 − mF /mFM;R , the proportion of average error decrease with respect to the reliability score. In order to provide a sense of scale on the difficulty of each problem, we compute the same ratios with respect to a random selection, obtained by choosing K = 20 and K = 32 indicators uniformly at random. Figure 2 provides a summary of the results. In Figure 2(a), each boxplot shows the distribution over the 40 probit models where reliabilities were sampled between [0.4 0.7] (the “high signal” models). The first three boxplots show the scores sF of the bounded neighborhood, tree-structured and Gaussian methods, respectively, compared against random selections. The last three boxplots are comparisons against the reliability heuristic. The tree-based method easily beats the Gaussian method, with about 75% of its outcomes being better than the median Gaussian outcome. The Gaussian approach is also less reliable, with results showing a long lower tail. Although the reliability score is on average a good approach, in only a handful of cases it was better than the tree-based method, and by considerably smaller magnitudes compared to the upper tails in the tree-based outcome distribution. A separate panel (Figure 2(b)) is shown for the 40 models with lower reliabilities. In this case, all methods show stronger improvements over the reliability score, although now there is a less clear difference between the tree method and the Gaussian one. Finally, in panels (c) and (d) we present scatterplots for the average deviances mF of the tree-based method against the reliability score. The two clouds correspond to the solutions with 20 and 32 indicators. Notice that in the vast majority of the cases the tree-based method does better. We then rescale the matrix to make all variances equal to 1. We also generate 40 models using as the inverse Wishart scale matrix the correlation matrix will all off-diagonal entries set to 0.5. Coefficients linking indicators to latent variables were set to zero with probability 0.8, and sampled from a standard Gaussian otherwise. If some latent variable ends up with no child, or an indicator ends up with no parent, we uniformly choose one child/parent to be linked to it. Code to fully replicate the synthetic experiments is available at HTTP :// WWW. HOMEPAGES . UCL . AC . UK /∼ UCGTRBD /. 7 5.2 Case study The National Health Service (NHS) is the public health system of the United Kingdom. In 2009, a major survey called the National Health Service National Staff Survey was deployed with the goal of “collect(ing) staff views about working in their local NHS trust” (Care Quality Comission and Aston University, 2010). A questionnaire of 206 items was filled by 156, 951 respondents. We designed a measurement model based on the structure of the questionnaire and fit it by the posterior expected value estimator. Gaussian and inverse Wishart priors were used, along with Gibbs sampling and a random subset of 50, 000 respondents. See the Supplementary Material for more details. Several items in this survey asked for the NHS staff member to provide degrees of agreement in a Likert scale (Bartholomew et al., 2008) to questions such as • • • • . . . have you ever come to work despite not feeling well enough to perform . . . ? Have you felt pressure from your manager to come to work? Have you felt pressure from colleagues to come to work? Have you put yourself under pressure to come to work? as different probes into an unobservable self-assessed level of work pressure. We preprocessed and binarized the data to first narrow it down to 63 questions. We compare selections of 32 (50%) and 50 (80%) items using the same statistics of the previous section. sF ;D sF ;tree sF ;N sF ;random mF ;tree mF ;R K = 32 K = 50 7.8% 10.5% 6.3% 11.9% 0.01% 7.6% −16.0% −0.05% 0.238 0.123 0.255 0.140 Although gains were relatively small (as measured by the difference between reconstruction errors mF ;tree − mF ;R and the good performance of a random selection), we showed that: i.) we do improve results over a popular measure of indicator quality; ii.) we do provide some guarantees about the diversity of the selected items via a information-theoretical measure with an entropy term, with theoretically sound approximations to such a measure. For more details on the preprocessing, and more insights into the different selections, please refer to the Supplementary Material. 6 Conclusion There are problems where one posits that the relevant information is encoded in the posterior distribution of a set of latent variables. Questionnaires (and other instruments) can be used as evidence to generate this posterior, but there is a cost associated with complex questionnaires. One problem is how to simplify such instruments of measurement. To the best of our knowledge, we provide the first formal account on how to solve it. Nevertheless, we would like to stress there is no substitute for common sense. While the tools we provide here can be used for a variety of analyses – from deploying simpler questionnaires to sensitivity analysis – the value and cost of keeping particular indicators can go much beyond the information contained in the latent posterior distribution. How to combine this criterion with other domain-dependent criteria is a matter of future research. Another problem of importance is how to deal with model specification and transportability across studies. A measurement model built for a very specific population of respondents might transfer poorly to another group, and therefore taking into account model uncertainty will be important. The Bayesian setup discussed by Liang et al. (2009) might provide some directions on this issue. Also, there is further structure in real-world questionnaires we are not exploiting in the current work. Namely, it is not uncommon to have questionnaires with branching questions and other dynamic behaviour more commonly associated with Web based surveys and/or longitudinal studies. Finally, hybrid approaches combining the bounded neighborhood and the tree-structured methods, along with more sophisticated ordering optimization procedures and the use of other divergence measures and determinant-based criteria (e.g. Kulesza and Taskar, 2011), will also be studied in the future. Acknowledgments The author would like to thank James Cussens and Simon Lacoste-Julien for helpful discussions, as well as the anonymous reviewers for further comments. 8 References D. Bartholomew, F. Steele, I. Moustaki, and J. Galbraith. Analysis of Multivariate Social Science Data, 2nd edition. Chapman & Hall, 2008. C. Bishop. Latent variable models. In M. Jordan (editor), Learning in Graphical Models, pages 371–403, 1998. C. Bishop. Pattern Recognition and Machine Learning. Springer, 2009. K. Bollen. Structural Equations with Latent Variables. John Wiley & Sons, 1989. R. Carroll, D. Ruppert, and L. Stefanski. Measurement Error in Nonlinear Models. Chapman & Hall, 1995. X. Chen, C. Zou, and R. Cook. Coordinate-independent sparse sufficient dimension reduction and variable selection. Annals of Statistics, 38:3696–3723, 2010. Care Quality Commission and Aston University. Aston Business School, National Health Service National Staff Survey, 2009 [computer file]. Colchester, Essex: UK Data Archive [distributor], October 2010. Available at HTTPS :// WWW. ESDS . AC . UK, SN: 6570, 2010. A. Genz. Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics, 1:141–149, 1992. A. Globerson and T. Jaakkola. Approximate inference using conditional entropy decompositions. Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS 2007), pages 141–149, 2007. F. Glover and E. Woolsey. Converting the 0-1 polynomial programming problem to a 0-1 linear program. Operations Research, 22:180–182, 1974. P. Hahn, J. Scott, and C. Carvalho. A sparse factor-analytic probit model for congressional voting patterns. Duke University Department of Statistical Science, Discussion Paper 2009-22, 2010. T. Jaakkola, D. Sontag, A. Globerson, and M. Meila. Learning Bayesian network structure using LP relaxations. Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS 2010), pages 366–373, 2010. A. Kulesza and B. Taskar. k-DPPs: fixed-size determinantal point processes. Proceedings of the 28th International Conference on Machine Learning (ICML), pages 1193–1200, 2011. W. Leite, I-C. Huang, and G. Marcoulides. Item selection for the development of short forms of scales using an ant colony optimization algorithm. Multivariate Behavioral Research, 43:411– 431, 2008. P. Liang, M. Jordan, and D. Klein. Learning from measurements in exponential families. Proceedings of the 26th Annual International Conference on Machine Learning (ICML ’09), 2009. T. Minka. A family of algorithms for approximate Bayesian inference. PhD Thesis, Massachussets Institute of Technology, 2001. J. Palomo, D. Dunson, and K. Bollen. Bayesian structural equation modeling. In Sik-Yum Lee (ed.), Handbook of Latent Variable and Related Models, pages 163–188, 2007. M. Richins. The material values scale: Measurement properties and development of a short form. The Journal of Consumer Research, 31:209–219, 2004. J. Stanton, E. Sinar, W. Balzer, and P. Smith. Issues and strategies for reducing the length of selfreported scales. Personnel Psychology, 55:167–194, 2002. M. Teyssier and D. Koller. Ordering-based search: A simple and effective algorithm for learning Bayesian networks. Proceedings of the Twenty-first Conference on Uncertainty in AI (UAI ’05), pages 584–590, 2005. 9
6 0.069415055 180 nips-2011-Multiple Instance Filtering
7 0.057471201 267 nips-2011-Spectral Methods for Learning Multivariate Latent Tree Structure
8 0.055581257 301 nips-2011-Variational Gaussian Process Dynamical Systems
9 0.054980464 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition
10 0.052209601 168 nips-2011-Maximum Margin Multi-Instance Learning
11 0.051278345 100 nips-2011-Gaussian Process Training with Input Noise
12 0.050622102 306 nips-2011-t-divergence Based Approximate Inference
13 0.05036433 238 nips-2011-Relative Density-Ratio Estimation for Robust Distribution Comparison
14 0.049293175 224 nips-2011-Probabilistic Modeling of Dependencies Among Visual Short-Term Memory Representations
15 0.04895198 104 nips-2011-Generalized Beta Mixtures of Gaussians
16 0.047130521 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
17 0.046679139 302 nips-2011-Variational Learning for Recurrent Spiking Networks
18 0.046112321 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
19 0.046109494 188 nips-2011-Non-conjugate Variational Message Passing for Multinomial and Binary Regression
20 0.045161117 119 nips-2011-Higher-Order Correlation Clustering for Image Segmentation
topicId topicWeight
[(0, 0.151), (1, 0.03), (2, 0.001), (3, -0.03), (4, -0.017), (5, -0.035), (6, 0.035), (7, -0.076), (8, -0.02), (9, 0.077), (10, -0.054), (11, -0.075), (12, 0.026), (13, -0.025), (14, -0.03), (15, 0.042), (16, -0.054), (17, -0.016), (18, 0.035), (19, -0.045), (20, 0.013), (21, -0.027), (22, 0.097), (23, -0.001), (24, 0.025), (25, -0.032), (26, 0.027), (27, -0.012), (28, -0.033), (29, 0.009), (30, -0.027), (31, 0.027), (32, 0.058), (33, -0.026), (34, -0.044), (35, -0.025), (36, -0.046), (37, 0.021), (38, -0.044), (39, -0.093), (40, -0.058), (41, -0.064), (42, 0.088), (43, -0.074), (44, 0.1), (45, -0.024), (46, 0.039), (47, -0.093), (48, -0.057), (49, 0.038)]
simIndex simValue paperId paperTitle
same-paper 1 0.90727067 240 nips-2011-Robust Multi-Class Gaussian Process Classification
Author: Daniel Hernández-lobato, Jose M. Hernández-lobato, Pierre Dupont
Abstract: Multi-class Gaussian Process Classifiers (MGPCs) are often affected by overfitting problems when labeling errors occur far from the decision boundaries. To prevent this, we investigate a robust MGPC (RMGPC) which considers labeling errors independently of their distance to the decision boundaries. Expectation propagation is used for approximate inference. Experiments with several datasets in which noise is injected in the labels illustrate the benefits of RMGPC. This method performs better than other Gaussian process alternatives based on considering latent Gaussian noise or heavy-tailed processes. When no noise is injected in the labels, RMGPC still performs equal or better than the other methods. Finally, we show how RMGPC can be used for successfully identifying data instances which are difficult to classify correctly in practice. 1
2 0.67079067 288 nips-2011-Thinning Measurement Models and Questionnaire Design
Author: Ricardo Silva
Abstract: Inferring key unobservable features of individuals is an important task in the applied sciences. In particular, an important source of data in fields such as marketing, social sciences and medicine is questionnaires: answers in such questionnaires are noisy measures of target unobserved features. While comprehensive surveys help to better estimate the latent variables of interest, aiming at a high number of questions comes at a price: refusal to participate in surveys can go up, as well as the rate of missing data; quality of answers can decline; costs associated with applying such questionnaires can also increase. In this paper, we cast the problem of refining existing models for questionnaire data as follows: solve a constrained optimization problem of preserving the maximum amount of information found in a latent variable model using only a subset of existing questions. The goal is to find an optimal subset of a given size. For that, we first define an information theoretical measure for quantifying the quality of a reduced questionnaire. Three different approximate inference methods are introduced to solve this problem. Comparisons against a simple but powerful heuristic are presented. 1 Contribution A common goal in the applied sciences is to measure concepts of interest that are not directly observable (Bartholomew et al., 2008). Such is the case in the social sciences, medicine, economics and other fields, where quantifying key attributes such as “consumer satisfaction,” “anxiety” and “recession” requires the development of indicators: observable variables that are postulated to measure the target latent variables up to some measurement error (Bollen, 1989; Carroll et al., 1995). In a probabilistic framework, this often boils down to a latent variable model (Bishop, 1998). One common setup is to assume each observed indicator Yi as being generated independently given the set of latent variables X. Conditioning on any given observed data point Y gives information about the distribution of the latent vector X, which can then be used for ranking, clustering, visualization or smoothing, among other tasks. Figure 1 provides an illustration. Questionnaires from large surveys are sometimes used to provide such indicators, each Yi recording an answer that typically corresponds to a Bernoulli or ordinal variable. For instance, experts can be given questions concerning whether there is freedom of press in a particular nation, as a way of measuring its democratization level (Bollen, 1989; Palomo et al., 2007). Nations can then be clustering or ranked within an interpretable latent space. Long questionnaires have nevertheless drawbacks, as summarized by Stanton et al. (2002) in the context of psychometric studies: Longer surveys take more time to complete, tend to have more missing data, and have higher refusal rates than short surveys. Arguably, then, techniques to reducing the length of scales while maintaining psychometric quality are wortwhile. 1 Factor scores: countries in the latent space Y2 Y3 Y4 Y5 0 Y1 5 X2 (Democratization) X1 (Industrialization) Democratization 10 Dem1960 Dem1965 1 5 9 13 18 23 28 33 38 43 48 53 58 63 68 73 Country (ordered by industrialization factor) (a) (b) Figure 1: (a) A graphical representation of a latent variable model. Notice that in general latent variables will be dependent. Here, the question is how to quantify democratization and industrialization levels of nations given observed indicators Y such as freedom of press and gross national product, among others (Bollen, 1989; Palomo et al., 2007). (b) An example of a result implied by the model (adapted from Palomo et al. (2007)): barplots of the conditional distribution of democratization levels given the observed indicators at two time points, ordered by the posterior mean industrialization level. The distribution of the latent variables given the observations is the basis of the analysis. Our contribution is a methodology for choosing which indicators to preserve (e.g., which items to keep in a questionnaire) given: i.) a latent variable model specification of the domain of interest; ii.) a target number of indicators that should be preserved. To accomplish this, we provide: i.) a target objective function that quantifies the amount of information preserved by a choice of a subset of indicators, with respect to the full set; ii.) algorithms for optimizing this choice of subset with respect to the objective function. The general idea is to start with a target posterior distribution of latent variables, defined by some latent variable measurement model M (i.e., PM (X | Y)). We want to choose a subset Yz ⊂ Y so that the resulting conditional distribution PM (X | Yz ) is as close as possible to the original one according to some metric. Model M is provided either by expertise or by numerous standard approaches that can be applied to learn it from data (e.g., methods in Bishop, 2009). We call this task measurement model thinning. Notice that the size of Yz is a domain-dependent choice. Assuming M is a good model for the data, choosing a subset of indicators will incur some information loss. It is up to the analyst to choose a trade-off between loss of information and the design of simpler, cheaper ways of measuring latent variables. Even if a shorter questionnaire is not to be deployed, the outcome of measurement model thinning provides a formal sensitivity analysis of the target latent distribution with respect to the available indicators. The result is useful to generate different insights into the domain. This paper is organized as follows: Section 2 defines a formal criterion to quantify how appropriate a subset Yz is. Section 3 describes different approaches in which this criterion can be optimized. Related work is briefly discussed in Section 4. Experiments with synthetic and real data are discussed in Section 5, followed by the conclusion. 2 An Information-Theoretical Criterion Our focus is on domains where latent variables are not a by-product of a dimensionality reduction technique, but the target of the analysis as in the example of Figure 1. That is, measurement error problems where the variables to be recorded are designed specifically to obtain information concerning such unknowns (Carroll et al., 1995; Bartholomew et al., 2008). As such, we postulate that the outcome of any analysis should be a functional of PM (X | Y), the conditional distribution of unobservables X given observables Y within a model M. It is assumed that M specifies the joint PM (X, Y). We further assume that observed variables are conditionally independent given X, i.e. p PM (X, Y) = PM (X) i=1 PM (Yi | X), with p being the number of observed indicators. 2 If z ≡ (z1 , z2 , . . . , zp ) is a binary vector of the same dimensionality as Y, and Yz is the subset of Y corresponding the non-zero entries of z, we can assess z by the KL divergence PM (X | Y) dX KL(PM (X | Y) || PM (X | Yz )) ≡ PM (X | Y) log PM (X | Yz ) This is well-defined, since both distributions lie in the same sample space despite the difference of dimensionality between Y and Yz . Moreover, since Y is itself a random vector, our criterion becomes the expected KL divergence KL(PM (X | Y) || PM (X | Yz )) PM (Y) where · denotes expectation. Our goal is to minimize this function with respect to z. Rearranging this expression to drop all constants that do not depend on z, and multiplying it by −1 to get a maximization problem, we obtain the problem of finding z⋆ such that z⋆ = argmaxz log(PM (Yz | X)) PM (X,Yz ) − log(PM (Yz )) PM (Yz ) p zi log(PM (Yi | X)) = argmaxz ≡ + HM (Yz ) i=1 argmaxz FM (z) PM (X,Yi ) p i=1 zi = K for a choice of K, and zi ∈ {0, 1}. HM (·) denotes here the entropy of subject to a distribution parameterized by M. Notice we used the assumption that indicators are mutually independent given X. There is an intuitive appeal of having a joint entropy term to reward not only marginal relationships between indicators and latent variables, but also selections that are jointly diverse. Notice that optimizing this objective function turns out to be equivalent to minimizing the conditional entropy of latent variables given Yz . Motivating conditional entropy from a more fundamental principle illustrates that other functions can be obtained by changing the divergence. 3 Approaches for Approximate Optimization The problem of optimizing FM (z) subject to the constraints p zi = K, zi ∈ {0, 1}, is hard i=1 not only for its combinatorial nature, but due to the entropy term. This needs to be approximated, and the nature of the approximation should depend on the form taken by M. We will assume that it is possible to efficiently compute any marginals of PM (Y) of modest dimensionality (say, 10 dimensions). This is the case, for instance, in the probit model for binary data: X ∼ N (0, Σ), Yi⋆ ∼ N (ΛT X + λi;0 , 1), i Yi = 1, if Yi⋆ > 0, and 0 otherwise where N (m, S) is the multivariate Gaussian distribution with mean m and covariance matrix S. The probit model is one of the most common latent variable models for questionnaire data (Bartholomew et al., 2008), with a straigthforward extension to ordinal data. In this model, marginals for a few dozen variables can be obtained efficiently since this corresponds to calculating multivariate Gaussian probabilities (Genz, 1992). Parameters can be fit by a variety of methods (Hahn et al., 2010). We also assume that M allows for the computation of log(PM (Yi | X)) PM (X,Yi ) at little cost. Again, in the binary probit model this is simple, since this requires integrating away a single binary variable Yi and a univariate Gaussian ΛT X. i 3.1 Gaussian Entropy One approximation to FM (z) is to replace its entropy term by the corresponding entropy from some Gaussian distribution PN (Yz ). The entropy of a Gaussian distribution is proportional to the logarithm of the determinant of its covariance matrix, and hence can be computed in O(p3 ) steps. This Gaussian can be chosen as the one closest to PM (Yz ) in a KL(PM || PN ) sense: that is, the one with the same first and second moments as PM (Yz ). In our case, computing these moments can be done deterministically (up to numerical error) using standard bivariate quadrature methods. No expectation-propagation (Minka, 2001) is necessary. The corresponding objective function is p zi log(PM (Yi | X)) FM;N (z) ≡ i=1 3 PM (X,Yi ) + 0.5 log |Σz | where Σz is the covariance matrix of Yz – which for binary and ordinal data has a sensible interpretation. This function is also an upper bound on the exact function, FM (z), since the Gaussian is the distribution with the largest entropy for a given mean vector and covariance matrix. The resulting function is non-linear in z. In our experiments, we optimize for z using a greedy scheme: for all possible pairs (i, j) such that zi = 1 and zj = 0, we swap its values (so that i zi is always K). We choose the pair with the highest increase in FM;N (z) and repeat the process until convergence. 3.2 Entropy with Bounded Neighborhoods An alternative bound can be derived from a standard fact in information theory: H(Y | S) ≤ H(Y | S ′ ) for S ′ ⊆ S, where H(· | ·) denotes conditional entropy. This was exploited by Globerson and Jaakkola (2007) to define an upper bound in the entropy of a distribution as follows: consider a permutation e of the set {1, 2, . . . , p}, with e(i) being the i-th element of e. Denote by e(1 : i) the first i elements of this permutation (an empty set if i < 1). Moreover, let N (e, i) be a subset of e(1 : i − 1). For a given set variables Y = {Y1 , Y2 , . . . , Yp } the following bound holds: p n H(Ye(i) | YN (e,i) ) H(Ye(i) | Ye(1:i−1) ) ≤ H(Y1 , Y2 , . . . Yp ) = (1) i=1 i=1 If each set N (e, i) is no larger than some constant D, then this bound can be computed in O(p · 2D ) steps for binary probit models. The bound holds for any choice of e, but we want it to be as tight as possible so that it gets weighted in a reasonable way against the other terms in FM (·). Since the entropy function is decomposable as a sum of functions that depend on i and N (e, i) only, one can minimize this bound with respect to e by using permutation optimization methods such as (Jaakkola et al., 2010). In our implementation, we use a method similar to Teyssier and Koller (2005) that shuffles neighboring entries of e to generate candidates, chooses the optimal N (e, i) for each i given the candidate e, and picks as the next permutation the candidate e with the greatest decrease in the bound. Notice that a permutation choice e and neighborhood choices N (e, i) define a Bayesian network where N (e, i) are the parents of Ye(i) . Therefore, if this Bayesian network model provides a good approximation to PM (Y), the bound will be reasonably tight. Given e, we will further relax this bound with the goal of obtaining an integer programming formulation for the problem of optimizing an upper bound to FM (z). For any given z, we define the local term HL (z, i) as HL (z, i) ≡ HM (Ye(i) | Yz ∩N (e, i)) = S∈P (N (e,i)) j∈S zj k∈N (e,i)\S (1 − zk ) HM (Ye(i) | S) (2) where P (·) denotes the power set of a set. The new approximate objective function becomes p p zi log(PM (Yi | X)) FM;D (z) ≡ PM (X,Yi ) ze(i) HL (z, i) + (3) i=1 i=1 Notice that HL (z, i) is still an upper bound on HM (Ye(i) | Ye(1:i−1) ). The intuition is that we are bounding HM (Yz ) by the entropy of a Bayesian network where a vertex Ye(i) is included if ze(i) = 1, with corresponding parents given by Yz ∩ N (e, i). This is a well-defined Bayesian network for any choice of z. The shortcoming is that ideally we would like this Bayesian network to be the actual marginal of the model given by e and N (e, i). It is not: if the network implied by e and N (e, i) was, for instance, Y1 → Y2 → Y3 , the choice of z = (1, 0, 1) would result on the entropy of the disconnected graph {Y1 , Y3 }, while the true marginal would correspond instead to the graph Y1 → Y3 . However, our simplified marginalization has the advantage of avoiding an intractable problem. Moreover, it allows us to redefine the problem as an integer linear program (ILP). Each product ze(i) j zj k (1−zk ) appearing in (3) results in a sum of O(2D ) terms, each of which has (up to a sign) the form qM ≡ m∈M zm for some set M . It is still the case that qM ∈ {0, 1}. Therefore, objective function (3) can be interpreted as being linear on a set of binary variables {{z}, {q}}. We need further to enforce the constraints coming from qM = 1 ⇒ {∀m ∈ M, zm = 1}; qM = 0 ⇒ {∃m ∈ M s.t. zm = 0} 4 It is well-known (Glover and Woolsey, 1974) that this corresponds to the linear constraints qM = 1 ⇒ {∀m ∈ M, zm = 1} ⇔ ∀m ∈ M, qM − zm ≤ 0 qM = 0 ⇒ {∃m ∈ M s.t. zm = 0} ⇔ m∈M zm − qM ≤ |M | − 1 p which combined with the linear constraint i=1 zi = K implies that optimizing FM;D (z) is an ILP with O(p · 2D ) variables and O(p2 · 2D ) constraints. In our experiments in Section 5, we were able to solve essentially all of such ILPs exactly using linear programming relaxations with branch-and-bound. 3.3 Entropy with Tree-Structured Bounds The previous bound simplifies marginalization, which might badly overestimate entropies where the corresponding Yz are uniformly spread out in permutation e. We now propose a different type of bound which treats different marginalizations on an equal footing. It comes from the following observation: since H(Ye(i) | Ye(1:i−1) ) is less than or equal to any conditional entropy H(Ye(i) | Yj ) for j ∈ e(1 : i − 1), we have that the tighest bound given by singleton conditioning sets is H(Ye(i) | Ye(1:i−1) ) ≤ min j∈e(1:i−1) HM (Ye(i) | Yj ), resulting in the objective function p p zi log(PM (Yi | X)) FM;tree (z) ≡ ze(i) · PM (X,Yi ) + i=1 i=1 min {Yj ∈Ye(1:i−1) ∩Yz } H(Ye(i) | Yj ) (4) where min{Yj ∈Ye(1:i−1) ∩Yz } H(Ye(i) | Yj ) ≡ H(Ye(i) ) if Ye(1:i−1) ∩ Yz = ∅. The intuition is that we are bounding the exact entropy using the entropy of a directed tree rooted at Yez (1) , the first element of Yz according to e. That is, all variables are marginally dependent in the approximation regardless of what z is, and for a fixed z the tree is, by construction, the one obtained by the usual greedy algorithm of adding edges corresponding to the next legal pair of vertices with maximum mutual information (following an ordering, in this case). It turns out we can also write (4) as a linear objective function of a polynomial number of 0\1 (i−1) (2) (1) be the values of set variables and constraints. Let zi ≡ 1 − zi . Let Hi , Hi , . . . , Hi ¯ (i−1) (1) be{HM (Ye(i) | Ye(1) ), . . . , HM (Ye(i) | Ye(i−1) )} sorted in ascending order, with zi , . . . , zi ing the corresponding permutation of {ze(1) , . . . , ze(i−1) }. We have min{Yj ∈Ye(1:i−1) ∩Yz } H(Ye(i) | Yj ) (j) (j) j−1 (1) (1) (1) (2) (2) (1) (2) (3) (3) ¯ ¯ ¯ = z i Hi + z i z i H i + z i z i z i H i + . . . (i−2) (i−1) (i−1) (1) i−1 (j) + j=1 zi HM (Ye(i) ) ¯ Hi zi ¯ zi . . . zi ¯ (i) i−1 (j) (j) + qi HM (Ye(i) ) ≡ j=1 qi Hi (k) ¯ where qi ≡ zi k=1 zi , and also a binary 0\1 variable. Plugging this expression into (4) gives a linear objective function in this extended variable space. The corresponding constraints are (j−1) (j) (1) (j) , zi }, zm = 1} ¯ z qi = 1 ⇒ {∀zm ∈ {¯i , . . . , zi (j−1) (j) (1) (j) , zi } s.t. zm = 0} ¯ z qi = 0 ⇒ {∃zm ∈ {¯i , . . . , zi which, as shown in the previous section, can be written as linear constraints (substituting each zi ¯ by 1 − zi ). The total number of constraints is however O(p3 ), which can be expensive, and often a linear relaxation procedure with branch-and-bound fails to provide guarantees of optimality. 3.4 The Reliability Score Finally, it is important to design cheap, effective criteria whose maxima correlate with the maxima of FM (·). Empirically, we have found high quality selections in binary probit models using the solution to the problem p p wi zi , subject to zi ∈ {0, 1}, maximize FM;R (z) = zi = K i=1 i=1 5 where wi = ΛT ΣΛi . This can be solved by picking the corresponding indicators with the highest i K weights wi . Assuming a probit model where the measurement error for each Yi⋆ has the same variance of 1, this score is related to the “reliability” of an indicator. Simply put, the reliability Ri of an indicator is the proportion of its variance that is due to the latent variables (Bollen, 1989, Chapter 6): Ri = wi /(wi + 1) for each Yi⋆ . There is no current theory linking this solution to the problem of maximizing FM (·): since there is no entropy term, we can set an adversarial problem to easily defeat this method. For instance, this happens in a model where the K indicators of highest reliability all measure the same latent variable Xi and nothing else – much information about Xi would be preserved, but little about other variables. In any case, we found this criterion to be fairly competitive even if at times it produces extreme failures. An honest account of more sophisticated selection mechanisms cannot be performed without including it, as we do in Section 5. 4 Related Work The literature on survey analysis, in the context of latent variable models, contains several examples of guidelines on how to simplify questionnaires (sometimes described as providing “shortened versions” of scales). Much of the literature, however, consists of describing general guidelines and rules-of-thumb to accomplish this task (e.g, Richins, 2004; Stanton et al., 2002). One possible exception is Leite et al. (2008), which uses different model fitness criteria with respect to a given dataset to score candidate solutions, along with an expensive combinatorial optimization method. This conflates model selection and questionnaire thinning, and there is no theory linking the score functions to the amount of information preserved. In the machine learning and statistics literature, there is a large body of research in active learning, which is related to our task. One of the closest approaches is the one by Liang et al. (2009), which casts the classical problem of measurement selection within a Bayesian graphical model perspective. In that work, one has to choose which measurements to add. This is done sequentially, partially motivated by problems where collecting new measurements can be done relatively fast and cheap (say, by paying graduate students to annotate text data), and so the choice of next measurement can make use of fresh data. In our case, it not might be realistic to expect we can perform a large number of iterations of data collection – and as such the task of reducing the number of measurements from a large initial collection might be more relevant in practice. Liang et al. also focus on (multivariate) supervised learning instead of purely unsupervised learning. In statistics there is also a considerable body of literature on sufficient dimension reduction and its sparse variants (e.g., Chen et al., 2010). Such techniques create a bottleneck between two sets of variables in a regression problem (say, the mapping from Y to X) while eliminating some of the input variables. In principle one might want to adapt such models to take a latent variable model M as the target mapping. Besides some loss of interpretability, the computational implications might be problematic, though. Moreover, this framework has another free parameter corresponding to the dimensionality of the bottleneck that has to be set. It is not clear how this parameter, along with a choice of sparsity level, would interact with a fixed choice K of indicators to be kept. 5 Experiments In this section, we first describe some synthetic experiments to provide insights about the different methods, followed by one brief description of a case study. In all of the experiments, the target models M are binary probit. We set the neighborhood parameter for FM;N (·) to 9. The ordering e for the tree-structured method is obtained by the same greedy search of Section 3.2, where now the score is the average of all H(Yi | Yj ) for all Yj preceding Yi . Finally, all ordering optimization methods were initialized by sorting indicators in a descending order according to their reliability scores, and the initial solution for all entropy-based optimization methods was given by the reliability score solution of Section 3.4. The integer program solver G UROBI 4.02 was used in all experiments. 5.1 Synthetic studies We start with a batch of synthetic experiments. We generated 80 models with 40 indicators and 10 latent variables1 . We further preprocess such models into two groups: in 40 of them, we select a 1 Details on the model generation: we generate 40 models by sampling the latent covariance matrix from an inverse Wishart distribution with 10 degrees of freedom and scale matrix 10I, I being the identity matrix. 6 Improvement ratio: high signal Mean error: high signal Improvement ratio: low signal Mean error: low signal 0.6 0.5 0.5 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0 0.25 Reliability score Reliability score 0.4 0.2 0.15 0.25 0.2 0.15 −0.1 −0.1 0.1 N/R T/R G/R N/S T/S G/S N/R T/R G/R N/S T/S 0.1 G/S 0.1 0.15 0.2 0.25 0.3 0.1 0.15 0.2 Tree bound (a) (c) (b) 0.25 0.3 Tree bound (d) Figure 2: (a) A comparison of the bounded neighborhood (N ), tree-based (T ) and Gaussian (G) methods with respect to a random solution (R) and the reliability score (S). (b) A similar comparison for models where indicators are more weakly correlated to the latent variables than in (a). (c) and (d) Scatterplots of the average absolute deviance for the tree-based method (horizontal axis) against the reliability method (vertical axis). The bottom-left clouds correspond to the K = 32 trials. target reliability ri for each indicator Yi , uniformly at random from the interval [0.4 0.7]. We then rescale coefficients Λi such that the reliability (defined in Section 3.4) of the respective Yi⋆ becomes ri . For the remaining 40 models, we sample ri uniformly at random from the interval [0.2 0.4]. We perform two choices of subsets: sets Yz of size 20 and 32 (50% and 80% of the total number of indicators). Our evaluation is as follows: since the expected value is perhaps the most common functional of the posterior distribution PM (X | Y), we calculate the expected value of the latent variables for a sample {y(1) , y(2) , . . . , y(1000) } of size 1000 taken from the respective synthetic models. This is done for the full set of 40 indicators, and for each set chosen by our four criteria: for each data point i and each objective function F , we evaluate the average distance (i) (i) (i) (i) ˆ ˆ dF ≡ 10 |ˆj − xj;F |/10. In this case, xj is the expected value of Xj obtained by conditioning j=1 x (i) on all indicators, while xj;F is the one obtained with the subset selected by optimizing F . We denote ˆ (1) (2) (1000) by mF the average of {dF , dF , . . . , dF }. Finally, we compare the three main methods with respect to the reliability score method using the improvement ratio statistic sF = 1 − mF /mFM;R , the proportion of average error decrease with respect to the reliability score. In order to provide a sense of scale on the difficulty of each problem, we compute the same ratios with respect to a random selection, obtained by choosing K = 20 and K = 32 indicators uniformly at random. Figure 2 provides a summary of the results. In Figure 2(a), each boxplot shows the distribution over the 40 probit models where reliabilities were sampled between [0.4 0.7] (the “high signal” models). The first three boxplots show the scores sF of the bounded neighborhood, tree-structured and Gaussian methods, respectively, compared against random selections. The last three boxplots are comparisons against the reliability heuristic. The tree-based method easily beats the Gaussian method, with about 75% of its outcomes being better than the median Gaussian outcome. The Gaussian approach is also less reliable, with results showing a long lower tail. Although the reliability score is on average a good approach, in only a handful of cases it was better than the tree-based method, and by considerably smaller magnitudes compared to the upper tails in the tree-based outcome distribution. A separate panel (Figure 2(b)) is shown for the 40 models with lower reliabilities. In this case, all methods show stronger improvements over the reliability score, although now there is a less clear difference between the tree method and the Gaussian one. Finally, in panels (c) and (d) we present scatterplots for the average deviances mF of the tree-based method against the reliability score. The two clouds correspond to the solutions with 20 and 32 indicators. Notice that in the vast majority of the cases the tree-based method does better. We then rescale the matrix to make all variances equal to 1. We also generate 40 models using as the inverse Wishart scale matrix the correlation matrix will all off-diagonal entries set to 0.5. Coefficients linking indicators to latent variables were set to zero with probability 0.8, and sampled from a standard Gaussian otherwise. If some latent variable ends up with no child, or an indicator ends up with no parent, we uniformly choose one child/parent to be linked to it. Code to fully replicate the synthetic experiments is available at HTTP :// WWW. HOMEPAGES . UCL . AC . UK /∼ UCGTRBD /. 7 5.2 Case study The National Health Service (NHS) is the public health system of the United Kingdom. In 2009, a major survey called the National Health Service National Staff Survey was deployed with the goal of “collect(ing) staff views about working in their local NHS trust” (Care Quality Comission and Aston University, 2010). A questionnaire of 206 items was filled by 156, 951 respondents. We designed a measurement model based on the structure of the questionnaire and fit it by the posterior expected value estimator. Gaussian and inverse Wishart priors were used, along with Gibbs sampling and a random subset of 50, 000 respondents. See the Supplementary Material for more details. Several items in this survey asked for the NHS staff member to provide degrees of agreement in a Likert scale (Bartholomew et al., 2008) to questions such as • • • • . . . have you ever come to work despite not feeling well enough to perform . . . ? Have you felt pressure from your manager to come to work? Have you felt pressure from colleagues to come to work? Have you put yourself under pressure to come to work? as different probes into an unobservable self-assessed level of work pressure. We preprocessed and binarized the data to first narrow it down to 63 questions. We compare selections of 32 (50%) and 50 (80%) items using the same statistics of the previous section. sF ;D sF ;tree sF ;N sF ;random mF ;tree mF ;R K = 32 K = 50 7.8% 10.5% 6.3% 11.9% 0.01% 7.6% −16.0% −0.05% 0.238 0.123 0.255 0.140 Although gains were relatively small (as measured by the difference between reconstruction errors mF ;tree − mF ;R and the good performance of a random selection), we showed that: i.) we do improve results over a popular measure of indicator quality; ii.) we do provide some guarantees about the diversity of the selected items via a information-theoretical measure with an entropy term, with theoretically sound approximations to such a measure. For more details on the preprocessing, and more insights into the different selections, please refer to the Supplementary Material. 6 Conclusion There are problems where one posits that the relevant information is encoded in the posterior distribution of a set of latent variables. Questionnaires (and other instruments) can be used as evidence to generate this posterior, but there is a cost associated with complex questionnaires. One problem is how to simplify such instruments of measurement. To the best of our knowledge, we provide the first formal account on how to solve it. Nevertheless, we would like to stress there is no substitute for common sense. While the tools we provide here can be used for a variety of analyses – from deploying simpler questionnaires to sensitivity analysis – the value and cost of keeping particular indicators can go much beyond the information contained in the latent posterior distribution. How to combine this criterion with other domain-dependent criteria is a matter of future research. Another problem of importance is how to deal with model specification and transportability across studies. A measurement model built for a very specific population of respondents might transfer poorly to another group, and therefore taking into account model uncertainty will be important. The Bayesian setup discussed by Liang et al. (2009) might provide some directions on this issue. Also, there is further structure in real-world questionnaires we are not exploiting in the current work. Namely, it is not uncommon to have questionnaires with branching questions and other dynamic behaviour more commonly associated with Web based surveys and/or longitudinal studies. Finally, hybrid approaches combining the bounded neighborhood and the tree-structured methods, along with more sophisticated ordering optimization procedures and the use of other divergence measures and determinant-based criteria (e.g. Kulesza and Taskar, 2011), will also be studied in the future. Acknowledgments The author would like to thank James Cussens and Simon Lacoste-Julien for helpful discussions, as well as the anonymous reviewers for further comments. 8 References D. Bartholomew, F. Steele, I. Moustaki, and J. Galbraith. Analysis of Multivariate Social Science Data, 2nd edition. Chapman & Hall, 2008. C. Bishop. Latent variable models. In M. Jordan (editor), Learning in Graphical Models, pages 371–403, 1998. C. Bishop. Pattern Recognition and Machine Learning. Springer, 2009. K. Bollen. Structural Equations with Latent Variables. John Wiley & Sons, 1989. R. Carroll, D. Ruppert, and L. Stefanski. Measurement Error in Nonlinear Models. Chapman & Hall, 1995. X. Chen, C. Zou, and R. Cook. Coordinate-independent sparse sufficient dimension reduction and variable selection. Annals of Statistics, 38:3696–3723, 2010. Care Quality Commission and Aston University. Aston Business School, National Health Service National Staff Survey, 2009 [computer file]. Colchester, Essex: UK Data Archive [distributor], October 2010. Available at HTTPS :// WWW. ESDS . AC . UK, SN: 6570, 2010. A. Genz. Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics, 1:141–149, 1992. A. Globerson and T. Jaakkola. Approximate inference using conditional entropy decompositions. Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS 2007), pages 141–149, 2007. F. Glover and E. Woolsey. Converting the 0-1 polynomial programming problem to a 0-1 linear program. Operations Research, 22:180–182, 1974. P. Hahn, J. Scott, and C. Carvalho. A sparse factor-analytic probit model for congressional voting patterns. Duke University Department of Statistical Science, Discussion Paper 2009-22, 2010. T. Jaakkola, D. Sontag, A. Globerson, and M. Meila. Learning Bayesian network structure using LP relaxations. Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS 2010), pages 366–373, 2010. A. Kulesza and B. Taskar. k-DPPs: fixed-size determinantal point processes. Proceedings of the 28th International Conference on Machine Learning (ICML), pages 1193–1200, 2011. W. Leite, I-C. Huang, and G. Marcoulides. Item selection for the development of short forms of scales using an ant colony optimization algorithm. Multivariate Behavioral Research, 43:411– 431, 2008. P. Liang, M. Jordan, and D. Klein. Learning from measurements in exponential families. Proceedings of the 26th Annual International Conference on Machine Learning (ICML ’09), 2009. T. Minka. A family of algorithms for approximate Bayesian inference. PhD Thesis, Massachussets Institute of Technology, 2001. J. Palomo, D. Dunson, and K. Bollen. Bayesian structural equation modeling. In Sik-Yum Lee (ed.), Handbook of Latent Variable and Related Models, pages 163–188, 2007. M. Richins. The material values scale: Measurement properties and development of a short form. The Journal of Consumer Research, 31:209–219, 2004. J. Stanton, E. Sinar, W. Balzer, and P. Smith. Issues and strategies for reducing the length of selfreported scales. Personnel Psychology, 55:167–194, 2002. M. Teyssier and D. Koller. Ordering-based search: A simple and effective algorithm for learning Bayesian networks. Proceedings of the Twenty-first Conference on Uncertainty in AI (UAI ’05), pages 584–590, 2005. 9
3 0.62535137 42 nips-2011-Bayesian Bias Mitigation for Crowdsourcing
Author: Fabian L. Wauthier, Michael I. Jordan
Abstract: Biased labelers are a systemic problem in crowdsourcing, and a comprehensive toolbox for handling their responses is still being developed. A typical crowdsourcing application can be divided into three steps: data collection, data curation, and learning. At present these steps are often treated separately. We present Bayesian Bias Mitigation for Crowdsourcing (BBMC), a Bayesian model to unify all three. Most data curation methods account for the effects of labeler bias by modeling all labels as coming from a single latent truth. Our model captures the sources of bias by describing labelers as influenced by shared random effects. This approach can account for more complex bias patterns that arise in ambiguous or hard labeling tasks and allows us to merge data curation and learning into a single computation. Active learning integrates data collection with learning, but is commonly considered infeasible with Gibbs sampling inference. We propose a general approximation strategy for Markov chains to efficiently quantify the effect of a perturbation on the stationary distribution and specialize this approach to active learning. Experiments show BBMC to outperform many common heuristics. 1
4 0.60041279 269 nips-2011-Spike and Slab Variational Inference for Multi-Task and Multiple Kernel Learning
Author: Miguel Lázaro-gredilla, Michalis K. Titsias
Abstract: We introduce a variational Bayesian inference algorithm which can be widely applied to sparse linear models. The algorithm is based on the spike and slab prior which, from a Bayesian perspective, is the golden standard for sparse inference. We apply the method to a general multi-task and multiple kernel learning model in which a common set of Gaussian process functions is linearly combined with task-specific sparse weights, thus inducing relation between tasks. This model unifies several sparse linear models, such as generalized linear models, sparse factor analysis and matrix factorization with missing values, so that the variational algorithm can be applied to all these cases. We demonstrate our approach in multioutput Gaussian process regression, multi-class classification, image processing applications and collaborative filtering. 1
5 0.58745748 306 nips-2011-t-divergence Based Approximate Inference
Author: Nan Ding, Yuan Qi, S.v.n. Vishwanathan
Abstract: Approximate inference is an important technique for dealing with large, intractable graphical models based on the exponential family of distributions. We extend the idea of approximate inference to the t-exponential family by defining a new t-divergence. This divergence measure is obtained via convex duality between the log-partition function of the t-exponential family and a new t-entropy. We illustrate our approach on the Bayes Point Machine with a Student’s t-prior. 1
6 0.58178008 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning
7 0.57800651 77 nips-2011-Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression
8 0.5777427 33 nips-2011-An Exact Algorithm for F-Measure Maximization
9 0.55894369 188 nips-2011-Non-conjugate Variational Message Passing for Multinomial and Binary Regression
10 0.55243582 258 nips-2011-Sparse Bayesian Multi-Task Learning
11 0.53641874 301 nips-2011-Variational Gaussian Process Dynamical Systems
12 0.53001899 238 nips-2011-Relative Density-Ratio Estimation for Robust Distribution Comparison
13 0.5104484 232 nips-2011-Ranking annotators for crowdsourced labeling tasks
14 0.48904732 277 nips-2011-Submodular Multi-Label Learning
15 0.48799893 111 nips-2011-Hashing Algorithms for Large-Scale Learning
16 0.48738301 68 nips-2011-Demixed Principal Component Analysis
17 0.47079656 83 nips-2011-Efficient inference in matrix-variate Gaussian models with \iid observation noise
18 0.46854374 191 nips-2011-Nonnegative dictionary learning in the exponential noise model for adaptive music signal representation
19 0.46568382 107 nips-2011-Global Solution of Fully-Observed Variational Bayesian Matrix Factorization is Column-Wise Independent
20 0.46151021 180 nips-2011-Multiple Instance Filtering
topicId topicWeight
[(0, 0.015), (4, 0.022), (20, 0.02), (26, 0.014), (31, 0.496), (33, 0.015), (43, 0.086), (45, 0.105), (57, 0.034), (74, 0.038), (83, 0.035), (84, 0.01), (99, 0.028)]
simIndex simValue paperId paperTitle
1 0.99662298 94 nips-2011-Facial Expression Transfer with Input-Output Temporal Restricted Boltzmann Machines
Author: Matthew D. Zeiler, Graham W. Taylor, Leonid Sigal, Iain Matthews, Rob Fergus
Abstract: We present a type of Temporal Restricted Boltzmann Machine that defines a probability distribution over an output sequence conditional on an input sequence. It shares the desirable properties of RBMs: efficient exact inference, an exponentially more expressive latent state than HMMs, and the ability to model nonlinear structure and dynamics. We apply our model to a challenging real-world graphics problem: facial expression transfer. Our results demonstrate improved performance over several baselines modeling high-dimensional 2D and 3D data. 1
2 0.99557316 225 nips-2011-Probabilistic amplitude and frequency demodulation
Author: Richard Turner, Maneesh Sahani
Abstract: A number of recent scientific and engineering problems require signals to be decomposed into a product of a slowly varying positive envelope and a quickly varying carrier whose instantaneous frequency also varies slowly over time. Although signal processing provides algorithms for so-called amplitude- and frequencydemodulation (AFD), there are well known problems with all of the existing methods. Motivated by the fact that AFD is ill-posed, we approach the problem using probabilistic inference. The new approach, called probabilistic amplitude and frequency demodulation (PAFD), models instantaneous frequency using an auto-regressive generalization of the von Mises distribution, and the envelopes using Gaussian auto-regressive dynamics with a positivity constraint. A novel form of expectation propagation is used for inference. We demonstrate that although PAFD is computationally demanding, it outperforms previous approaches on synthetic and real signals in clean, noisy and missing data settings. 1
3 0.99537146 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
Author: Dylan A. Simon, Nathaniel D. Daw
Abstract: There is much evidence that humans and other animals utilize a combination of model-based and model-free RL methods. Although it has been proposed that these systems may dominate according to their relative statistical efficiency in different circumstances, there is little specific evidence — especially in humans — as to the details of this trade-off. Accordingly, we examine the relative performance of different RL approaches under situations in which the statistics of reward are differentially noisy and volatile. Using theory and simulation, we show that model-free TD learning is relatively most disadvantaged in cases of high volatility and low noise. We present data from a decision-making experiment manipulating these parameters, showing that humans shift learning strategies in accord with these predictions. The statistical circumstances favoring model-based RL are also those that promote a high learning rate, which helps explain why, in psychology, the distinction between these strategies is traditionally conceived in terms of rulebased vs. incremental learning. 1
4 0.97676265 23 nips-2011-Active dendrites: adaptation to spike-based communication
Author: Balazs B. Ujfalussy, Máté Lengyel
Abstract: Computational analyses of dendritic computations often assume stationary inputs to neurons, ignoring the pulsatile nature of spike-based communication between neurons and the moment-to-moment fluctuations caused by such spiking inputs. Conversely, circuit computations with spiking neurons are usually formalized without regard to the rich nonlinear nature of dendritic processing. Here we address the computational challenge faced by neurons that compute and represent analogue quantities but communicate with digital spikes, and show that reliable computation of even purely linear functions of inputs can require the interplay of strongly nonlinear subunits within the postsynaptic dendritic tree. Our theory predicts a matching of dendritic nonlinearities and synaptic weight distributions to the joint statistics of presynaptic inputs. This approach suggests normative roles for some puzzling forms of nonlinear dendritic dynamics and plasticity. 1
5 0.96040791 137 nips-2011-Iterative Learning for Reliable Crowdsourcing Systems
Author: David R. Karger, Sewoong Oh, Devavrat Shah
Abstract: Crowdsourcing systems, in which tasks are electronically distributed to numerous “information piece-workers”, have emerged as an effective paradigm for humanpowered solving of large scale problems in domains such as image classification, data entry, optical character recognition, recommendation, and proofreading. Because these low-paid workers can be unreliable, nearly all crowdsourcers must devise schemes to increase confidence in their answers, typically by assigning each task multiple times and combining the answers in some way such as majority voting. In this paper, we consider a general model of such crowdsourcing tasks, and pose the problem of minimizing the total price (i.e., number of task assignments) that must be paid to achieve a target overall reliability. We give a new algorithm for deciding which tasks to assign to which workers and for inferring correct answers from the workers’ answers. We show that our algorithm significantly outperforms majority voting and, in fact, is asymptotically optimal through comparison to an oracle that knows the reliability of every worker. 1
same-paper 6 0.95793647 240 nips-2011-Robust Multi-Class Gaussian Process Classification
7 0.88487679 241 nips-2011-Scalable Training of Mixture Models via Coresets
8 0.86485523 249 nips-2011-Sequence learning with hidden units in spiking neural networks
9 0.85346115 131 nips-2011-Inference in continuous-time change-point models
10 0.84080589 75 nips-2011-Dynamical segmentation of single trials from population neural data
11 0.83908993 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning
12 0.83646077 229 nips-2011-Query-Aware MCMC
13 0.83523244 221 nips-2011-Priors over Recurrent Continuous Time Processes
14 0.8153376 38 nips-2011-Anatomically Constrained Decoding of Finger Flexion from Electrocorticographic Signals
15 0.81464052 292 nips-2011-Two is better than one: distinct roles for familiarity and recollection in retrieving palimpsest memories
16 0.8124336 87 nips-2011-Energetically Optimal Action Potentials
17 0.80508727 86 nips-2011-Empirical models of spiking in neural populations
18 0.80475527 184 nips-2011-Neuronal Adaptation for Sampling-Based Probabilistic Inference in Perceptual Bistability
19 0.79951984 194 nips-2011-On Causal Discovery with Cyclic Additive Noise Models
20 0.79650491 11 nips-2011-A Reinforcement Learning Theory for Homeostatic Regulation