nips nips2009 nips2009-42 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ricardo Henao, Ole Winther
Abstract: In this paper we present a novel approach to learn directed acyclic graphs (DAGs) and factor models within the same framework while also allowing for model comparison between them. For this purpose, we exploit the connection between factor models and DAGs to propose Bayesian hierarchies based on spike and slab priors to promote sparsity, heavy-tailed priors to ensure identifiability and predictive densities to perform the model comparison. We require identifiability to be able to produce variable orderings leading to valid DAGs and sparsity to learn the structures. The effectiveness of our approach is demonstrated through extensive experiments on artificial and biological data showing that our approach outperforms a number of state of the art methods.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In this paper we present a novel approach to learn directed acyclic graphs (DAGs) and factor models within the same framework while also allowing for model comparison between them. [sent-5, score-0.24]
2 For this purpose, we exploit the connection between factor models and DAGs to propose Bayesian hierarchies based on spike and slab priors to promote sparsity, heavy-tailed priors to ensure identifiability and predictive densities to perform the model comparison. [sent-6, score-0.45]
3 We require identifiability to be able to produce variable orderings leading to valid DAGs and sparsity to learn the structures. [sent-7, score-0.21]
4 1 Introduction Sparse factor models have proven to be a very versatile tool for detailed modeling and interpretation of multivariate data, for example in the context of gene expression data analysis [1, 2]. [sent-9, score-0.168]
5 A sparse factor model encodes the prior knowledge that the latent factors only affect a limited number of the observed variables. [sent-10, score-0.249]
6 The corresponding multiple regression model, in which each variable is a linear function of the remaining ones, is a well-defined multivariate probabilistic model if the connectivity (non-zero weights) defines a directed acyclic graph (DAG). [sent-12, score-0.147]
7 What is usually done in practice is to consider either factor models or DAG models. [sent-13, score-0.134]
8 A more principled idea, which can be phrased in Bayesian terms, is for example to find an equivalence between both models, then represent them using a common/comparable hierarchy, and finally use a marginal likelihood or a predictive density to select one of them. [sent-16, score-0.117]
9 Although a formal connection between factor models and DAGs has already been established in [3], this paper makes important extensions such as explicitly modeling sparsity, stochastic search over the order of the variables and model comparison. [sent-17, score-0.211]
10 A commonly used approach for structure learning is to split the problem into two stages, using the fact that the space of variable orderings is far smaller than the space of all possible structures, e.g. [sent-19, score-0.18]
11 by first attempting to learn a suitable permutation of the variables and then the skeleton of the structure given the found ordering, or vice versa. [sent-21, score-0.182]
12 Most of the work so far for continuous data assumes linearity and Gaussian variables; hence these methods can only recover the DAG structure up to Markov equivalence [5, 6, 7, 8], which means that some subset of links can be reversed without changing the likelihood [9]. [sent-22, score-0.292]
13 In this work we follow the line of [3] by starting from a linear factor model and ensuring identifiability by using non-normal, heavy-tailed latent variables. [sent-25, score-0.175]
14 As a byproduct we find a set of candidate orderings compatible with a linear DAG, i.e. orderings under which the estimated mixing matrix can be permuted close to triangular form. [sent-26, score-0.218]
15 Finally, we may perform model comparison between the factor and DAG models inferred with fixed orderings taken from the candidate set. [sent-29, score-0.393]
16 2 From DAGs to factor models We will assume that an ordered d-dimensional data vector Px can be represented as a directed acyclic graph with only observed nodes, where P is the usually unknown true permutation matrix. [sent-32, score-0.326]
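Concretely, the underlying structural equation, presumably the missing equation (1) given the reference to equation (2) below, reads Px = BPx + Pz, with B strictly lower triangular under the true ordering and z a vector of independent, non-Gaussian driving signals.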
17 Solving for x we can rewrite the problem as x = P−1APz = P−1(I − B)−1Pz, (2) which corresponds to a noise-free linear factor model with the restriction that P−1AP must have a sparsity pattern that can be permuted to a triangular form, since (I − B)−1 is triangular. [sent-35, score-0.379]
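As a quick sanity check of this relation, the following Python sketch (the toy matrices and names are ours, not the paper's) builds a strictly lower triangular B, forms A = (I − B)−1, and shows how an unknown permutation hides the triangular structure that the ordering search must recover:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Strictly lower triangular DAG weights under the true ordering.
B = np.tril(rng.normal(size=(d, d)), k=-1)

# Mixing matrix implied by the DAG: A = (I - B)^{-1} is again lower triangular.
A = np.linalg.inv(np.eye(d) - B)
assert np.allclose(A, np.tril(A))

# An unknown permutation scrambles rows/columns; for a permutation matrix P,
# P^{-1} = P^T, and P A P^T is generally no longer triangular.
P = np.eye(d)[rng.permutation(d)]
A_obs = P @ A @ P.T
print(np.allclose(A_obs, np.tril(A_obs)))  # typically False
```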
18 In our work we also exploit the relation between factor models and linear DAGs. [sent-44, score-0.134]
19 We apply a Bayesian approach to learning sparse factor models and DAGs, and the stochastic search for P is performed as an integrated part of the inference of the sparse factor model. [sent-45, score-0.433]
20 The inference of the factor model (including the order) and of the DAG parameters is performed as two separate inferences, such that the only input passed from the first part is a set of candidate orders. [sent-46, score-0.295]
21 3 From factor models to DAGs Our first goal is to perform model inference in the families of factor and linear DAG models. [sent-47, score-0.36]
22 We write the joint distribution for the factor model as p(X, A, Z, Ψ, P, ·) = p(X|A, Z, P, ·) p(A|·) p(Z|·) p(Ψ|·) p(P|·) p(·), where X = [x1, . . . , xN]. [sent-50, score-0.134]
23 The prior over the permutation, p(P|·), will always be chosen to be uniform over the d! possible orderings. [sent-57, score-0.126]
24 The actual sampling based inference for P is discussed in the next section and the standard Gibbs sampling components are provided in the supplementary material. [sent-59, score-0.123]
25 The marginal likelihood is more difficult to calculate with sampling than obtaining samples from the posterior, so we use predictive densities on a test set as a yardstick. [sent-61, score-0.185]
26 These ambiguities do not affect our ability to find the correct permutation P of the rows. [sent-62, score-0.123]
27 3.2 Factor model Instead of using the noise-free factor model of equation (2) we allow for additive noise, x = Pr−1APc z + ε, where ε is an additional Gaussian noise term with diagonal covariance matrix Ψ, i.e. [sent-63, score-0.216]
28 uncorrelated noise, to account for independent measurement noise; Pr = P is the permutation matrix for the rows of A, and Pc = Pf Pr is another permutation for the columns, with Pf accounting for the permutation freedom of the factors. [sent-65, score-0.273]
29 Instead we infer Pr and Pc using a stochastic search based on closeness to triangularity, as measured by a masked likelihood (see below). [sent-67, score-0.168]
30 Now we can specify a hierarchy for the Bayesian model as follows: X|Pr, A, Pc, Z, Ψ ∼ N(X|Pr−1APc Z, Ψ), A ∼ ρ(A|·), Z ∼ π(Z|·), ψi−1|ss, sr ∼ Gamma(ψi−1|ss, sr), (3) where the ψi are the elements of Ψ. [sent-68, score-0.207]
31 For convenience, to exploit conjugate exponential families, we place a gamma prior on the precision of ε, with shape ss and rate sr. [sent-69, score-0.208]
32 The prior should favor small values of ψi while also providing support for ψi = 1, such that certain variables can be explained solely by noise (we set ss = 2 and sr = 0. [sent-71, score-0.131]
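Under this conjugate choice, the Gibbs update of each noise precision ψi−1 has a standard closed form; a minimal sketch (our notation, with a placeholder rate since the exact sr value is truncated above):

```python
import numpy as np

def sample_noise_precision(residuals, s_s=2.0, s_r=0.05, rng=None):
    """Conjugate Gibbs draw of one noise precision psi_i^{-1}.

    residuals: length-N array of x_in minus its model mean for variable i.
    Posterior: Gamma(shape = s_s + N/2, rate = s_r + 0.5 * sum(residuals^2)).
    s_r = 0.05 is a placeholder; the value used in the paper is truncated above.
    """
    rng = rng or np.random.default_rng()
    shape = s_s + 0.5 * residuals.size
    rate = s_r + 0.5 * np.sum(residuals ** 2)
    return rng.gamma(shape, 1.0 / rate)  # numpy's gamma takes shape and scale = 1/rate
```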
33 Figure 1: Graphical model for the Bayesian hierarchy in equation (3); plates run over factors j = 1:d, observations n = 1:N and variables i = 1:d, with nodes νj, λ, υjn, τij, qij, aij, zjn, rij, ηij, ψi and xin. [sent-81, score-0.568]
34 The prior ρ(A|·) for the mixing matrix should be biased towards sparsity because we want to infer something close to a triangular matrix. [sent-82, score-0.256]
35 Here we adopt a two-layer discrete spike and slab prior for the elements aij of A similar to the one in [2]. [sent-83, score-0.261]
36 The first layer in the prior controls the sparsity of each element aij individually, whereas the second layer imposes a per-factor sparsity level that allows elements within the same factor to share information. [sent-84, score-0.391]
37 The prior above specifies a point-mass mixture over aij with mask rij. [sent-86, score-0.27]
38 The expected probability of aij being non-zero is ηij, which is controlled through a beta hyperprior with mean αm and precision αp. [sent-87, score-0.173]
39 Besides, each factor has a common sparsity rate νj that lets the elements ηij be exactly zero with probability 1 − νj through a beta distribution with mean βm and precision βp, turning the distribution of ηij bimodal over the unit interval. [sent-88, score-0.264]
40 The masking variables rij with parameters ηij should be somewhat diffuse while favoring relatively large masking probabilities, e.g. [sent-92, score-0.263]
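To make the two-layer construction concrete, here is a hedged generative sketch of the mask R: a per-factor beta layer draws the sparsity rates νj; the element-wise layer sets ηij to zero with probability 1 − νj and to a beta draw otherwise; and the mask entries rij are Bernoulli(ηij). The hyperparameter values are illustrative, not the paper's:

```python
import numpy as np

def sample_mask(d, rng, alpha_m=0.9, alpha_p=10.0, beta_m=0.5, beta_p=2.0):
    """Draw a d x d spike-and-slab mask R from the two-layer prior (sketch).

    A beta with mean m and precision p is parameterized as Beta(m*p, (1-m)*p).
    """
    # Second layer: per-factor sparsity rates nu_j.
    nu = rng.beta(beta_m * beta_p, (1 - beta_m) * beta_p, size=d)
    # eta_ij is exactly zero with probability 1 - nu_j, otherwise beta-distributed
    # with mean alpha_m, making its distribution bimodal over the unit interval.
    active = rng.random((d, d)) < nu[None, :]
    eta = np.where(active,
                   rng.beta(alpha_m * alpha_p, (1 - alpha_m) * alpha_p, size=(d, d)),
                   0.0)
    # First layer: element-wise Bernoulli masks; a_ij is non-zero only where r_ij = 1.
    R = (rng.random((d, d)) < eta).astype(int)
    return R, eta, nu

R, eta, nu = sample_mask(d=6, rng=np.random.default_rng(1))
```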
41 The factor model needs only a shared variance parameter λ for the Laplace-distributed zjn, because a change of scale in A is equivalent to a change of variance in zjn. [sent-101, score-0.455]
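The Laplace prior is handled through its Gaussian scale-mixture representation: a Laplace variable with scale b is a zero-mean Gaussian whose variance υ is exponentially distributed with mean 2b². A short numerical check of this standard identity:

```python
import numpy as np

rng = np.random.default_rng(2)
b = 1.5            # Laplace scale; a rate parameterization corresponds to 1/b
n = 200_000

# Scale mixture: upsilon ~ Exponential(mean 2 b^2), z | upsilon ~ N(0, upsilon).
upsilon = rng.exponential(scale=2 * b**2, size=n)
z_mix = rng.normal(0.0, np.sqrt(upsilon))

z_lap = rng.laplace(0.0, b, size=n)   # direct Laplace draws for comparison
print(np.mean(np.abs(z_mix)), np.mean(np.abs(z_lap)))  # both close to b
```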
42 4 Sampling based inference For a given permutation P, Gibbs sampling can be used for inference of the remaining parameters. [sent-107, score-0.229]
43 The other option is to let the factor model be unrestricted and search for P according to a criterion that does not affect parameter inference. [sent-110, score-0.243]
44 First, joint combinatorial and parameter inference in this model will probably have poor mixing with slow convergence. [sent-112, score-0.152]
45 Second, we are also interested in comparing the factor model against the DAG for cases when we cannot really assume that the data is well approximated by a DAG. [sent-113, score-0.175]
46 For the linear DAG model we are not performing joint inference of P and the model parameters. [sent-116, score-0.133]
47 Rather, we use the set of Ps found during factor model inference as good candidates for the DAG. [sent-117, score-0.283]
48 The stochastic search for P = Pc goes as follows: we make inference for the unrestricted factor model, then propose Pr⋆ and Pc⋆ independently according to q(Pr⋆|Pr) q(Pc⋆|Pc), which is the uniform two-variable random exchange. [sent-118, score-0.253]
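A hedged sketch of one such search step: swap two indices in each ordering and accept with a Metropolis ratio on the masked-likelihood score (masked_score below is a stand-in for the paper's closeness-to-triangular criterion):

```python
import numpy as np

def propose_swap(perm, rng):
    """Uniform two-variable random exchange q(P* | P): swap two positions."""
    new = perm.copy()
    i, j = rng.choice(len(perm), size=2, replace=False)
    new[i], new[j] = new[j], new[i]
    return new

def search_step(perm_r, perm_c, A, masked_score, rng):
    """One Metropolis step on (Pr, Pc); the swap proposal is symmetric, so the
    acceptance ratio only involves the score difference."""
    prop_r, prop_c = propose_swap(perm_r, rng), propose_swap(perm_c, rng)
    log_ratio = masked_score(A, prop_r, prop_c) - masked_score(A, perm_r, perm_c)
    if np.log(rng.random()) < log_ratio:
        return prop_r, prop_c
    return perm_r, perm_c
```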
49 To learn DAGs we first perform inference on the factor model specified by the hierarchy in (3) to obtain a set of ordering candidates, sorted according to their usage during sampling after the burn-in period. [sent-120, score-0.456]
50 For instance, a false zero entry in A may allow several orderings, leading to several lower triangular versions of A, only one of which is actually correct. [sent-123, score-0.28]
51 Thus, we propose to use not only the best candidate but a whole set of top candidates of size mtop = 10. [sent-124, score-0.177]
52 Then we perform inference on the DAG model corresponding to the structure search hierarchy in (7), for each one of the permutation candidates being considered, Pr(1), . . . , Pr(mtop). [sent-125, score-0.42]
53 Finally, we select the DAG model among the candidates using the predictive distribution for the DAG when a test set is available, or just the likelihood if not. [sent-129, score-0.266]
54 In order to perform the model comparison we use predictive densities p(X⋆|X, M) with M = {MFA, MDAG}, instead of marginal likelihoods, because the latter are difficult and expensive to compute by sampling, requiring for example thermodynamic integration. [sent-131, score-0.224]
55 Factor model We can compute the predictive distribution by taking the likelihood in equation (3) and marginalizing Z. [sent-138, score-0.129]
56 Since the integral has no closed form, we approximate it using the Gaussian distribution from the scale mixture representation as p(X⋆|A, Ψ, ·) = ∫ p(X⋆|A, Z, Ψ) p(Z|·) dZ ≈ Πn (1/rep) Σr=1:rep N(xn⋆ | 0, A⊤Un(r)A + Ψ), [sent-139, score-0.16]
57 where Un(r) = diag(υ1n, . . . , υdn), the υjn are sampled from the prior, and rep is the number of samples generated to approximate the intractable integral (rep = 500 in the experiments). [sent-142, score-0.115]
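A sketch of this Monte Carlo approximation, following our reconstruction of the formula above (so the covariance orientation A⊤UnA + Ψ is an assumption):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def log_predictive_factor(X_star, A, Psi, sample_upsilon, rep=500, rng=None):
    """Monte Carlo estimate of log p(X* | A, Psi) under the scale-mixture prior.

    X_star: (d, N*) test data; A: (d, d) mixing matrix; Psi: length-d noise variances;
    sample_upsilon: callable rng -> length-d draw of prior scales upsilon_1n..upsilon_dn.
    """
    rng = rng or np.random.default_rng()
    d, n_star = X_star.shape
    logps = np.empty((rep, n_star))
    for r in range(rep):
        for n in range(n_star):
            U = np.diag(sample_upsilon(rng))     # U_n for this replicate
            cov = A.T @ U @ A + np.diag(Psi)     # Gaussian once the scales are fixed
            logps[r, n] = multivariate_normal.logpdf(X_star[:, n], mean=np.zeros(d), cov=cov)
    # Average replicates inside the probability, then take the product over test points.
    return float(np.sum(logsumexp(logps, axis=0) - np.log(rep)))
```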
58 In practice we compute the predictive densities for a particular X⋆ during sampling and then select the model based on their ratio. [sent-145, score-0.255]
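Selecting between MFA and MDAG then reduces to comparing the two estimates, for instance:

```python
def select_model(logp_fa, logp_dag):
    """Compare estimated log p(X* | X, M); a positive log-ratio favors the factor model."""
    log_ratio = logp_fa - logp_dag
    return ("factor model" if log_ratio > 0 else "DAG"), log_ratio
```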
59 Note that both predictive distributions depend directly on λ, the rate of the Laplace distribution, making the estimates highly dependent on its value. [sent-146, score-0.124]
60 The order search in LiNGAM assumes that there are no estimation errors during fastICA model inference, so a single ordering candidate is produced. [sent-149, score-0.206]
61 LiNGAM also produces and selects a final model among several candidates, but in contrast to our method such candidates are not DAGs with different variable orderings but rather DAGs with different sparsity levels. [sent-150, score-0.388]
62 The factor model inference in LiNGAM, namely fastICA, is very efficient; however, its structure search involves repeated inversions of matrices of size d2 × d2, which can make it prohibitive for large problems. [sent-151, score-0.293]
63 In contrast, the complexity in our case is O(Nite d2 N), where Nite is the total number of samples, including burn-in periods, for both the factor model and the DAG inferences. [sent-153, score-0.175]
64 In every case we ran 2000 samples after a burn-in period of 4000 iterations and three independent chains for the factor model, and a single chain with 1000 samples and 2000 as burn-in for the DAG. [sent-162, score-0.134]
65 We also report p(R|X, ·), where R is the matrix with elements rij. [sent-170, score-0.135]
66 AUC is an important measure because it quantifies how the model accounts for the uncertainty of presence or absence of links in the DAG. [sent-171, score-0.18]
67 It is important to mention that being able to compute a probability for a link in the DAG to be zero, p(bij = 0|X, ·), turns out to be very useful in practice, for example to reject links with high uncertainty or to rank them. [sent-219, score-0.168]
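Given post-burn-in Gibbs samples of the mask variables, both the link probabilities and the AUC are easy to compute; a minimal sketch (R_samples and the binary ground-truth adjacency are placeholders, and the same recipe applies to the DAG-side masks behind p(bij = 0|X, ·)):

```python
import numpy as np

def link_probabilities(R_samples):
    """Posterior p(r_ij = 1 | X, .): average of sampled masks, shape (n_samples, d, d)."""
    return np.mean(R_samples, axis=0)

def auc(scores, truth):
    """Rank-based (Mann-Whitney) AUC of link scores against binary truth; ties ignored."""
    scores = scores.ravel()
    truth = truth.ravel().astype(bool)
    ranks = np.empty(scores.size)
    ranks[np.argsort(scores)] = np.arange(1, scores.size + 1)
    n_pos, n_neg = truth.sum(), (~truth).sum()
    return (ranks[truth].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# p(b_ij = 0 | X, .) is then 1 - link_probabilities(R_samples)[i, j], which can
# be thresholded to reject uncertain links or used to rank them.
```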
68 6 Bayesian networks repository Next we want to compare some of the state of the art (Gaussian) approaches to DAG learning on 7 well-known structures, namely alarm, barley, carpo, hailfinder, insurance, mildew and water (d = 37, 48, 61, 56, 27, 35, 32 respectively). [sent-228, score-0.186]
69 Apart from ours (sFA), we considered the following methods: standard DAG search (DS), order search (OS), sparse candidate pruning then DAG search (DSC) [6], L1MB then DAG search (DSL) [8], and sparse candidate pruning then order search (OSC) [7]. [sent-230, score-0.224]
70 Results are shown in Figure 3, including the number of reversed links found due to ordering errors. [sent-231, score-0.321]
71 Figure 3: Performance measures for the Bayesian networks repository experiments, comparing DS, OS, OSC, DSC, DSL and sFA across the alarm, barley, carpo, hailfinder, insurance, mildew and water networks; the panels include (a) the false positive rate and (d) the number of reversed links. [sent-243, score-0.168]
74 The true negative rate is comparable to that of the other methods, suggesting that our model is in some cases sparser than the others. [sent-245, score-0.113]
75 AUC estimates are significantly better because we have continuous probabilities for links to be zero (in the other methods we had to use a binary value). [sent-246, score-0.139]
76 From Figure 3(d), the number of reversed links in the other methods is quite high as expected due to lack of identifiability. [sent-247, score-0.261]
77 Our model produced a small number of reversed links because it was not able to find any of the true orderings, although it found something quite close. [sent-248, score-0.338]
78 On the other hand, our approach performs similarly, but the number of reversed links increases significantly since the model is no longer identified. [sent-251, score-0.302]
79 7 Model comparison For this experiment we generated 1000 different datasets/models with d = 5 and N = {500, 1000}, in a similar way to the first experiment, but this time we selected the true model to be a factor model or a DAG uniformly at random. [sent-255, score-0.252]
80 In order to generate a factor model we basically just need to make sure that A cannot be permuted to a triangular form. [sent-256, score-0.318]
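Checking that condition amounts to asking whether the off-diagonal sparsity pattern of A encodes a DAG, i.e. admits a topological order; a sketch using Kahn's algorithm (our helper, not the paper's code):

```python
import numpy as np

def permutable_to_triangular(A, tol=1e-12):
    """True iff some permutation P makes P A P^T lower triangular,
    i.e. the off-diagonal sparsity pattern of A encodes a DAG."""
    d = A.shape[0]
    adj = (np.abs(A) > tol) & ~np.eye(d, dtype=bool)  # edge j -> i when a_ij != 0
    indeg = adj.sum(axis=1).astype(int)               # number of parents of node i
    stack = [i for i in range(d) if indeg[i] == 0]
    seen = 0
    while stack:
        i = stack.pop()
        seen += 1
        for k in np.nonzero(adj[:, i])[0]:            # children k with edge i -> k
            indeg[k] -= 1
            if indeg[k] == 0:
                stack.append(int(k))
    return seen == d                                  # all nodes ordered <=> acyclic
```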
81 We kept 20% of the data to compute the predictive densities to then select between all estimated DAG candidates and the factor model. [sent-257, score-0.42]
82 For N = 1000 the true DAG and true factor model selection rates increased to roughly 98%. [sent-261, score-0.247]
83 Figure 4(e) shows test likelihoods for the best ordering DAG (dashed) and the factor model (solid). [sent-294, score-0.269]
84 Figure 4(d) shows likelihood ratios (solid) and structure errors (dashed) for all candidates considered by our method, together with their usage. [sent-295, score-0.139]
85 The Bayesian network is not able to identify the direction of the links with only observational data. [sent-296, score-0.246]
86 We used pure observational data and randomly selected 20% of the data to compute the predictive densities. [sent-297, score-0.163]
87 From the 21 possible links in Figure 4(a), the model from [17] was able to find 9, but it also produced one falsely added link. [sent-300, score-0.212]
88 Our model in Figure 4(c) was able to find 10 true links, one falsely added link and only two reversed links (RL): one of them is PIP2 → PIP3, which according to the ground truth is bidirectional, and the other is PLCγ → PIP3, which was also found reversed using experimental data in [17]. [sent-302, score-0.521]
89 The predictive densities for the best candidate (sixth in Figure 4(e)) are shown in Figure 4(d) and suggest that the factor model is a better option, which makes sense considering that the estimated DAG in Figure 4(c) is a substructure of the ground truth. [sent-304, score-0.393]
90 We also examined the estimated factor model and found that three factors could correspond to unmeasured proteins (PI3K, MKK and IP3); see Figure 2 and Table 3 in [17]. [sent-305, score-0.223]
91 Results were very similar to our method in terms of true positives (≈ 9) and true negatives (≈ 32); however, none of them was able to produce fewer than 6 reversed links, which corresponds to approximately two-thirds of the total true positives. [sent-307, score-0.369]
92 8 Discussion We have proposed a novel approach to perform inference and model comparison of sparse factor models and DAGs within the same framework. [sent-308, score-0.265]
93 The key ingredients for both Bayesian models are spike and slab priors to promote sparsity, heavy-tailed priors to ensure identifiability and predictive densities to perform the comparison. [sent-309, score-0.275]
94 A set of candidate orderings is produced by the factor model. [sent-310, score-0.352]
95 This setting can be very beneficial in situations where the prior evidence suggests DAG structure and/or unmeasured variables in the data. [sent-313, score-0.114]
96 For example in the protein signaling network [17], the textbook ground truth suggests both DAG structure and a number of unmeasured proteins. [sent-314, score-0.189]
97 The previous approach [17] only performed structure learning in DAGs but our results suggest that the data is better explained by the factor model. [sent-315, score-0.165]
98 Natural extensions include graphs with directed/undirected links and observed/latent nodes, as well as the ability to use experimental data. [sent-318, score-0.139]
99 Bayesian factor regression models in the “large p, small n” paradigm. [sent-323, score-0.134]
100 High-dimensional sparse factor modeling: Applications in gene expression genomics. [sent-415, score-0.207]
simIndex simValue paperId paperTitle
2 0.20885067 170 nips-2009-Nonlinear directed acyclic structure learning with weakly additive noise models
Author: Arthur Gretton, Peter Spirtes, Robert E. Tillman
Abstract: The recently proposed additive noise model has advantages over previous directed structure learning approaches since it (i) does not assume linearity or Gaussianity and (ii) can discover a unique DAG rather than its Markov equivalence class. However, for certain distributions, e.g. linear Gaussians, the additive noise model is invertible and thus not useful for structure learning, and it was originally proposed for the two variable case with a multivariate extension which requires enumerating all possible DAGs. We introduce weakly additive noise models, which extends this framework to cases where the additive noise model is invertible and when additive noise is not present. We then provide an algorithm that learns an equivalence class for such models from data, by combining a PC style search using recent advances in kernel measures of conditional dependence with local searches for additive noise models in substructures of the Markov equivalence class. This results in a more computationally efficient approach that is useful for arbitrary distributions even when additive noise models are invertible. 1
3 0.12553561 23 nips-2009-Accelerating Bayesian Structural Inference for Non-Decomposable Gaussian Graphical Models
Author: Baback Moghaddam, Emtiyaz Khan, Kevin P. Murphy, Benjamin M. Marlin
Abstract: We make several contributions in accelerating approximate Bayesian structural inference for non-decomposable GGMs. Our first contribution is to show how to efficiently compute a BIC or Laplace approximation to the marginal likelihood of non-decomposable graphs using convex methods for precision matrix estimation. This optimization technique can be used as a fast scoring function inside standard Stochastic Local Search (SLS) for generating posterior samples. Our second contribution is a novel framework for efficiently generating large sets of high-quality graph topologies without performing local search. This graph proposal method, which we call “Neighborhood Fusion” (NF), samples candidate Markov blankets at each node using sparse regression techniques. Our third contribution is a hybrid method combining the complementary strengths of NF and SLS. Experimental results in structural recovery and prediction tasks demonstrate that NF and hybrid NF/SLS out-perform state-of-the-art local search methods, on both synthetic and real-world datasets, when realistic computational limits are imposed.
4 0.081964605 3 nips-2009-AUC optimization and the two-sample problem
Author: Nicolas Vayatis, Marine Depecker, Stéphan J. Clémençcon
Abstract: The purpose of the paper is to explore the connection between multivariate homogeneity tests and AUC optimization. The latter problem has recently received much attention in the statistical learning literature. From the elementary observation that, in the two-sample problem setup, the null assumption corresponds to the situation where the area under the optimal ROC curve is equal to 1/2, we propose a two-stage testing method based on data splitting. A nearly optimal scoring function in the AUC sense is first learnt from one of the two half-samples. Data from the remaining half-sample are then projected onto the real line and eventually ranked according to the scoring function computed at the first stage. The last step amounts to performing a standard Mann-Whitney Wilcoxon test in the onedimensional framework. We show that the learning step of the procedure does not affect the consistency of the test as well as its properties in terms of power, provided the ranking produced is accurate enough in the AUC sense. The results of a numerical experiment are eventually displayed in order to show the efficiency of the method. 1
5 0.078977801 100 nips-2009-Gaussian process regression with Student-t likelihood
Author: Jarno Vanhatalo, Pasi Jylänki, Aki Vehtari
Abstract: In the Gaussian process regression the observation model is commonly assumed to be Gaussian, which is convenient in computational perspective. However, the drawback is that the predictive accuracy of the model can be significantly compromised if the observations are contaminated by outliers. A robust observation model, such as the Student-t distribution, reduces the influence of outlying observations and improves the predictions. The problem, however, is the analytically intractable inference. In this work, we discuss the properties of a Gaussian process regression model with the Student-t likelihood and utilize the Laplace approximation for approximate inference. We compare our approach to a variational approximation and a Markov chain Monte Carlo scheme, which utilize the commonly used scale mixture representation of the Student-t distribution. 1
6 0.075349562 78 nips-2009-Efficient Moments-based Permutation Tests
7 0.070490465 244 nips-2009-The Wisdom of Crowds in the Recollection of Order Information
8 0.064887248 168 nips-2009-Non-stationary continuous dynamic Bayesian networks
9 0.064165704 1 nips-2009-$L 1$-Penalized Robust Estimation for a Class of Inverse Problems Arising in Multiview Geometry
10 0.060940936 102 nips-2009-Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models
11 0.060749173 246 nips-2009-Time-Varying Dynamic Bayesian Networks
12 0.059325106 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction
13 0.057135347 152 nips-2009-Measuring model complexity with the prior predictive
14 0.05708446 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models
15 0.05591961 59 nips-2009-Construction of Nonparametric Bayesian Models from Parametric Bayes Equations
16 0.055342838 203 nips-2009-Replacing supervised classification learning by Slow Feature Analysis in spiking neural networks
17 0.054361679 123 nips-2009-Large Scale Nonparametric Bayesian Inference: Data Parallelisation in the Indian Buffet Process
18 0.053469207 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior
19 0.053335756 140 nips-2009-Linearly constrained Bayesian matrix factorization for blind source separation
20 0.051535394 197 nips-2009-Randomized Pruning: Efficiently Calculating Expectations in Large Dynamic Programs
simIndex simValue paperId paperTitle
4 0.49982578 152 nips-2009-Measuring model complexity with the prior predictive
Author: Wolf Vanpaemel
Abstract: In the last few decades, model complexity has received a lot of press. While many methods have been proposed that jointly measure a model’s descriptive adequacy and its complexity, few measures exist that measure complexity in itself. Moreover, existing measures ignore the parameter prior, which is an inherent part of the model and affects the complexity. This paper presents a stand alone measure for model complexity, that takes the number of parameters, the functional form, the range of the parameters and the parameter prior into account. This Prior Predictive Complexity (PPC) is an intuitive and easy to compute measure. It starts from the observation that model complexity is the property of the model that enables it to fit a wide range of outcomes. The PPC then measures how wide this range exactly is. keywords: Model Selection & Structure Learning; Model Comparison Methods; Perception The recent revolution in model selection methods in the cognitive sciences was driven to a large extent by the observation that computational models can differ in their complexity. Differences in complexity put models on unequal footing when their ability to approximate empirical data is assessed. Therefore, models should be penalized for their complexity when their adequacy is measured. The balance between descriptive adequacy and complexity has been termed generalizability [1, 2]. Much attention has been devoted to developing, advocating, and comparing different measures of generalizability (for a recent overview, see [3]). In contrast, measures of complexity have received relatively little attention. The aim of the current paper is to propose and illustrate a stand alone measure of model complexity, called the Prior Predictive Complexity (PPC). The PPC is based on the intuitive idea that a complex model can predict many outcomes and a simple model can predict a few outcomes only. First, I discuss existing approaches to measuring model complexity and note some of their limitations. In particular, I argue that currently existing measures ignore one important aspect of a model: the prior distribution it assumes over the parameters. I then introduce the PPC, which, unlike the existing measures, is sensitive to the parameter prior. Next, the PPC is illustrated by calculating the complexities of two popular models of information integration. 1 Previous approaches to measuring model complexity A first approach to assess the (relative) complexity of models relies on simulated data. Simulationbased methods differ in how these artificial data are generated. A first, atheoretical approach uses random data [4, 5]. In the semi-theoretical approach, the data are generated from some theoretically ∗ I am grateful to Michael Lee and Liz Bonawitz. 1 interesting functions, such as the exponential or the logistic function [4]. Using these approaches, the models under consideration are equally complex if each model provides the best optimal fit to roughly the same number of data sets. A final approach to generating artificial data is a theoretical one, in which the data are generated from the models of interest themselves [6, 7]. The parameter sets used in the generation can either be hand-picked by the researcher, estimated from empirical data or drawn from a previously specified distribution. If the models under consideration are equally complex, each model should provide the best optimal fit to self-generated data more often than the other models under consideration do. 
One problem with this simulation-based approach is that it is very labor intensive. It requires generating a large amount of artificial data sets, and fitting the models to all these data sets. Further, it relies on choices that are often made in an arbitrary fashion that nonetheless bias the results. For example, in the semi-theoretical approach, a crucial choice is which functions to use. Similarly, in the theoretical approach, results are heavily influenced by the parameter values used in generating the data. If they are fixed, on what basis? If they are estimated from empirical data, from which data? If they are drawn randomly, from which distribution? Further, a simulation study only gives a rough idea of complexity differences but provides no direct measure reflecting the complexity. A number of proposals have been made to measure model complexity more directly. Consider a model M with k parameters, summarized in the parameter vector θ = (θ1 , θ2 , . . . , θk , ) which has a range indicated by Ω. Let d denote the data and p(d|θ, M ) the likelihood. The most straightforward measure of model complexity is the parametric complexity (PC), which simply counts the number of parameters: PC = k. (1) PC is attractive as a measure of model complexity since it is very easy to calculate. Further, it has a direct and well understood relation toward complexity: the more parameters, the more complex the model. It is included as the complexity term of several generalizability measures such as AIC [8] and BIC [9], and it is at the heart of the Likelihood Ratio Test. Despite this intuitive appeal, PC is not free from problems. One problem with PC is that it reflects only a single aspect of complexity. Also the parameter range and the functional form (the way the parameters are combined in the model equation) influence a model’s complexity, but these dimensions of complexity are ignored in PC [2, 6]. A complexity measure that takes these three dimensions into account is provided by the geometric complexity (GC) measure, which is inspired by differential geometry [10]. In GC, complexity is conceptualized as the number of distinguishable probability distributions a model can generate. It is defined by GC = k n ln + ln 2 2π det I(θ|M )dθ, (2) Ω where n indicates the size of the data sample and I(θ) is the Fisher Information Matrix: Iij (θ|M ) = −Eθ ∂ 2 ln p(d|θ, M ) . ∂θi ∂θj (3) Note that I(θ|M ) is determined by the likelihood function p(d|θ, M ), which is in turn determined by the model equation. Hence GC is sensitive to the number of parameters (through k), the functional form (through I), and the range (through Ω). Quite surprisingly, GC turns out to be equal to the complexity term used in one version of Minimum Description Length (MDL), a measure of generalizability developed within the domain of information theory [2, 11, 12, 13]. GC contrasts favorably with PC, in the sense that it takes three dimensions of complexity into account rather than a single one. A major drawback of GC is that, unlike PC, it requires considerable technical sophistication to be computed, as it relies on the second derivative of the likelihood. A more important limitation of both PC and GC is that these measures are insensitive to yet another important dimension contributing to model complexity: the prior distribution over the model parameters. The relation between the parameter prior distribution and model complexity is discussed next. 
2 2 Model complexity and the parameter prior The growing popularity of Bayesian methods in psychology has not only raised awareness that model complexity should be taken into account when testing models [6], it has also drawn attention to the fact that in many occasions, relevant prior information is available [14]. In Bayesian methods, there is room to incorporate this information in two different flavors: as a prior distribution over the models, or as a prior distribution over the parameters. Specifying a model prior is a daunting task, so almost invariably, the model prior is taken to be uniform (but see [15] for an exception). In contrast, information regarding the parameter is much easier to include, although still challenging (e.g., [16]). There are two ways to formalize prior information about a model’s parameters: using the parameter prior range (often referred to as simply the range) and using the parameter prior distribution (often referred to as simply the prior). The prior range indicates which parameter values are allowed and which are forbidden. The prior distribution indicates which parameter values are likely and which are unlikely. Models that share the same equation and the same range but differ in the prior distribution can be considered different models (or at least different model versions), just like models that share the same equation but differ in range are different model versions. Like the parameter prior range, the parameter prior distribution influences the model complexity. In general, a model with a vague parameter prior distribution is more complex than a model with a sharply peaked parameter prior distribution, much as a model with a broad-ranged parameter is more complex than the same model where the parameter is heavily restricted. To drive home the point that the parameter prior should be considered when model complexity is assessed, consider the following “fair coin” model Mf and a “biased coin” model Mb . There is a clear intuitive complexity difference between these models: Mb is more complex than Mf . The most straightforward way to formalize these models is as follows, where ph denotes the probability of observing heads: ph = 1/2, (4) ph = θ 0≤θ≤1 p(θ) = 1, (5) for model Mf and the triplet of equations jointly define model Mb . The range forbids values smaller than 0 or greater than 1 because ph is a proportion. As Mf and Mb have a different number of parameters, both PC and GC, being sensitive to the number of parameters, pick up the difference in model complexity between the models. Alternatively, model Mf could be defined as follows: ph = θ 0≤θ≤1 1 p(θ) = δ(θ − ), 2 (6) where δ(x) is the Dirac delta. Note that the model formalized in Equation 6 is exactly identical the model formalized in Equation 4. However, relying on the formulation of model Mf in Equation 6, PC and GC now judge Mf and Mb to be equally complex: both models share the same model equation (which implies they have the same number of parameters and the same functional form) and the same range for the parameter. Hence, PC and GC make an incorrect judgement of the complexity difference between both models. This misjudgement is a direct result of the insensitivity of these measures to the parameter prior. As models Mf and Mb have different prior distributions over their parameter, a measure sensitive to the prior would pick up the complexity difference between these models. Such a measure is introduced next. 
3 The Prior Predictive Complexity Model complexity refers to the property of the model that enables it to predict a wide range of data patterns [2]. The idea of the PPC is to measure how wide this range exactly is. A complex model 3 can predict many outcomes, and a simple model can predict a few outcomes only. Model simplicity, then, refers to the property of placing restrictions on the possible outcomes: the greater restrictions, the greater the simplicity. To understand how model complexity is measured in the PPC, it is useful to think about the universal interval (UI) and the predicted interval (PI). The universal interval is the range of outcomes that could potentially be observed, irrespective of any model. For example, in an experiment with n binomial trials, it is impossible to observe less that zero successes, or more than n successes, so the range of possible outcomes is [0, n] . Similarly, the universal interval for a proportion is [0, 1]. The predicted interval is the interval containing all outcomes the model predicts. An intuitive way to gauge model complexity is then the cardinality of the predicted interval, relative to the cardinality of the universal interval, averaged over all m conditions or stimuli: PPC = 1 m m i=1 |PIi | . |UIi | (7) A key aspect of the PPC is deriving the predicted interval. For a parameterized likelihood-based model, prediction takes the form of a distribution over all possible outcomes for some future, yet-tobe-observed data d under some model M . This distribution is called the prior predictive distribution (ppd) and can be calculated using the law of total probability: p(d|M ) = p(d|θ, M )p(θ|M )dθ. (8) Ω Predicting the probability of unseen future data d arising under the assumption that model M is true involves integrating the probability of the data for each of the possible parameter values, p(d|θ, M ), as weighted by the prior probability of each of these values, p(θ|M ). Note that the ppd relies on the number of parameters (through the number of integrals and the likelihood), the model equation (through the likelihood), and the parameter range (through Ω). Therefore, as GC, the PPC is sensitive to all these aspects. In contrast to GC, however, the ppd, and hence the PPC, also relies on the parameter prior. Since predictions are made probabilistically, virtually all outcomes will be assigned some prior weight. This implies that, in principle, the predicted interval equals the universal interval. However, for some outcomes the assigned weight will be extremely small. Therefore, it seems reasonable to restrict the predicted interval to the smallest interval that includes some predetermined amount of the prior mass. For example, the 95% predictive interval is defined by those outcomes with the highest prior mass that together make up 95% of the prior mass. Analytical solutions to the integral defining the ppd are rarely available. Instead, one should rely on approximations to the ppd by drawing samples from it. In the current study, sampling was performed using WinBUGS [17, 18], a highly versatile, user friendly, and freely available software package. It contains sophisticated and relatively general-purpose Markov Chain Monte Carlo (MCMC) algorithms to sample from any distribution of interest. 
4 An application example The PPC is illustrated by comparing the complexity of two popular models of information integration, which attempt to account for how people merge potentially ambiguous or conflicting information from various sensorial sources to create subjective experience. These models either assume that the sources of information are combined additively (the Linear Integration Model; LIM; [19]) or multiplicatively (the Fuzzy Logical Model of Perception; FLMP; [20, 21]). 4.1 Information integration tasks A typical information integration task exposes participants simultaneously to different sources of information and requires this combined experience to be identified in a forced-choice identification task. The presented stimuli are generated from a factorial manipulation of the sources of information by systematically varying the ambiguity of each of the sources. The relevant empirical data consist 4 of, for each of the presented stimuli, the counts km of the number of times the mth stimulus was identified as one of the response alternatives, out of the tm trials on which it was presented. For example, an experiment in phonemic identification could involve two phonemes to be identified, /ba/ and /da/ and two sources of information, auditory and visual. Stimuli are created by crossing different levels of audible speech, varying between /ba/ and /da/, with different levels of visible speech, also varying between these alternatives. The resulting set of stimuli spans a continuum between the two syllables. The participant is then asked to listen and to watch the speaker, and based on this combined audiovisual experience, to identify the syllable as being either /ba/ or /da/. In the so-called expanded factorial design, not only bimodal stimuli (containing both auditory and visual information) but also unimodal stimuli (providing only a single source of information) are presented. 4.2 Information integration models In what follows, the formal description of the LIM and the FLMP is outlined for a design with two response alternatives (/da/ or /ba/) and two sources (auditory and visual), with I and J levels, respectively. In such a two-choice identification task, the counts km follow a Binomial distribution: km ∼ Binomial(pm , tm ), (9) where pm indicates the probability that the mth stimulus is identified as /da/. 4.2.1 Model equation The probability for the stimulus constructed with the ith level of the first source and the jth level of the second being identified as /da/ is computed according to the choice rule: pij = s (ij, /da/) , s (ij, /da/) + s (ij, /ba/) (10) where s (ij, /da/) represents the overall degree of support for the stimulus to be /da/. The sources of information are assumed to be evaluated independently, implying that different parameters are used for the different modalities. In the present example, the degree of auditory support for /da/ is denoted by ai (i = 1, . . . , I) and the degree of visual support for /da/ by bj (j = 1, . . . , J). When a unimodal stimulus is presented, the overall degree of support for each alternative is given by s (i∗, /da/) = ai and s (∗j, /da/) = bj , where the asterisk (*) indicates the absence of information, implying that Equation 10 reduces to pi∗ = ai and p∗j = bj . (11) When a bimodal stimulus is presented, the overall degree of support for each alternative is based on the integration or blending of both these sources. Hence, for bimodal stimuli, s (ij, /da/) = ai bj , where the operator denotes the combination of both sources. 
Hence, Equation 10 reduces to ai bj . (12) pij = ai bj + (1 − ai ) (1 − bj ) = +, so Equation 12 becomes The LIM assumes an additive combination, i.e., pij = ai + bj . 2 (13) The FLMP, in contrast, assumes a multiplicative combination, i.e., = ×, so Equation 12 becomes ai bj . ai bj + (1 − ai )(1 − bj ) (14) pij = 5 4.2.2 Parameter prior range and distribution Each level of auditory and visual support for /da/ (i.e., ai and bj , respectively) is associated with a free parameter, which implies that the FLMP and the LIM have an equal number of free parameters, I + J. Each of these parameters is constrained to satisfy 0 ≤ ai , bj ≤ 1. The original formulations of the LIM and FLMP unfortunately left the parameter priors unspecified. However, an implicit assumption that has been commonly used is a uniform prior for each of the parameters. This assumption implicitly underlies classical and widely adopted methods for model evaluation using accounted percentage of variance or maximum likelihood. ai ∼ Uniform(0, 1) and bi ∼ Uniform(0, 1) for i = 1, . . . , I; j = 1, . . . , J. (15) The models relying on this set of uniform priors will be referred to as LIMu and FLMPu . Note that LIMu and FLMPu treat the different parameters as independent. This approach misses important information. In particular, the experimental design is such that the amount of support for each level i + 1 is always higher than for level i. Because parameter ai (or bi ) corresponds to the degree of auditory (or visual) support for a unimodal stimulus at the ith level, it seems reasonable to expect the following orderings among the parameters to hold (see also [6]): aj > ai and bj > bi for j > i. (16) The models relying on this set of ordered priors will be referred to as LIMo and FLMPo . 4.3 Complexity and experimental design It is tempting to consider model complexity as an inherent characteristic of a model. For some models and for some measures of complexity this is clearly the case. Consider, for example, model Mb . In any experimental design (i.e., a number of coin tosses), PCMb = 1. However, more generally, this is not the case. Focusing on the FLMP and the LIM, it is clear that even a simple measure as PC depends crucially on (some aspects of) the experimental design. In particular, every level corresponds to a new parameter, so PC = I + J . Similarly, GC is dependent on design choices. The PPC is not different in this respect. The design sensitivity implies that one can only make sensible conclusions about differences in model complexity by using different designs. In an information integration task, the design decisions include the type of design (expanded or not), the number of sources, the number of response alternatives, the number of levels for each source, and the number of observations for each stimulus (sample size). The present study focuses on the expanded factorial designs with two sources and two response alternatives. The additional design features were varied: both a 5 × 5 and a 8 × 2 design were considered, using three different sample sizes (20, 60 and 150, following [2]). 4.4 Results Figure 1 shows the 99% predicted interval in the 8×2 design with n = 150. Each panel corresponds to a different model. In each panel, each of the 26 stimuli is displayed on the x-axis. The first eight stimuli correspond to the stimuli with the lowest level of visual support, and are ordered in increasing order of auditory support. The next eight stimuli correspond to the stimuli with the highest level of visual support. 
The next eight stimuli correspond to the unimodal stimuli where only auditory information is provided (again ranked in increasing order). The final two stimuli are the unimodal visual stimuli. Panel A shows that the predicted interval of LIMu nearly equals the universal interval, ranging between 0 and 1. This indicates that almost all outcomes are given a non-negligible prior mass by LIMu , making it almost maximally complex. FLMPu is even more complex. The predicted interval, shown in Panel B, virtually equals the universal interval, indicating that the model predicts virtually every possible outcome. Panels C and D show the dramatic effect of incorporating relevant prior information in the models. The predicted intervals of both LIMo and FLMPo are much smaller than their counterparts using the uniform priors. Focusing on the comparison between LIM and FLMP, the PPC indicates that the latter is more complex than the former. This observation holds irrespective of the model version (assuming uniform 6 0.9 0.8 0.8 Proportion of /da/ responses 1 0.9 Proportion of /da/ responses 1 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0.7 0.6 0.5 0.4 0.3 0.2 0.1 11 21 A 1* 0 *1 11 21 B 1* *1 1* *1 0.8 Proportion of /da/ responses 0.9 0.8 21 1 0.9 Proportion of /da/ responses 1 11 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0.7 0.6 0.5 0.4 0.3 0.2 0.1 11 21 C 1* 0 *1 D Figure 1: The 99% predicted interval for each of the 26 stimuli (x-axis) according to LIMu (Panel A), FLMPu (Panel B), LIMo (Panel C), and FLMPo (Panel D). Table 1: PPC, based on the 99% predicted interval, for four models across six different designs. 20 LIMu FLMPu LIMo FLMPo 5×5 60 150 20 8×2 60 150 0.97 1 0.75 0.83 0.94 1 0.67 0.80 .97 1 0.77 0.86 0.95 1 0.69 0.82 0.93 0.99 0.64 0.78 7 0.94 0.99 0.66 0.81 vs. ordered priors). The smaller complexity of LIM is in line with previous attempts to measure the relative complexities of LIM and FLMP, such as the atheoretical simulation-based approach ([4] but see [5]), the semi-theoretical simulation-based approach [4], the theoretical simulation-based approach [2, 6, 22], and a direct computation of the GC [2]. The PPC’s for all six designs considered are displayed in Table 1. It shows that the observations made for the 8 × 2, n = 150 design holds across the five remaining designs as well: LIM is simpler than FLMP; and models assuming ordered priors are simpler than models assuming uniform priors. Note that these conclusions would not have been possible based on PC or GC. For PC, all four models have the same complexity. GC, in contrast, would detect complexity differences between LIM and FLMP (i.e., the first conclusion), but due to its insensitivity to the parameter prior, the complexity differences between LIMu and LIMo on the one hand, and FLMPu and FLMPo on the other hand (i.e., the second conclusion) would have gone unnoticed. 5 Discussion A theorist defining a model should clearly and explicitly specify at least the three following pieces of information: the model equation, the parameter prior range, and the parameter prior distribution. If any of these pieces is missing, the model should be regarded as incomplete, and therefore untestable. Consequently, any measure of generalizability should be sensitive to all three aspects of the model definition. Many currently popular generalizability measures do not satisfy this criterion, including AIC, BIC and MDL. A measure of generalizability that does take these three aspects of a model into account is the marginal likelihood [6, 7, 14, 23]. 
5 Discussion

A theorist defining a model should clearly and explicitly specify at least the following three pieces of information: the model equation, the parameter prior range, and the parameter prior distribution. If any of these pieces is missing, the model should be regarded as incomplete, and therefore untestable. Consequently, any measure of generalizability should be sensitive to all three aspects of the model definition. Many currently popular generalizability measures do not satisfy this criterion, including the AIC, the BIC and the MDL. A measure of generalizability that does take these three aspects of a model into account is the marginal likelihood [6, 7, 14, 23].

Often, the marginal likelihood is criticized precisely for its sensitivity to the prior range and distribution (e.g., [24]). However, since the prior is part of the model definition, I see the sensitivity of the marginal likelihood to the prior as an asset rather than a nuisance. It is precisely the measures of generalizability that are insensitive to the prior that miss an important aspect of the model. Similarly, any stand-alone measure of model complexity should be sensitive to all three aspects of the model definition, as all three contribute to the model's complexity (with the model equation contributing two factors: the number of parameters and the functional form). Existing measures of complexity do not satisfy this requirement and are therefore incomplete. PC takes only part of the model equation into account, whereas GC takes only the model equation and the range into account. In contrast, the PPC proposed here is sensitive to all three aspects. It assesses model complexity using the predicted interval, which contains all possible outcomes a model can generate. A narrow predicted interval (relative to the universal interval) indicates a simple model; a complex model is characterized by a wide predicted interval.

There is a tight coupling between the notions of information, knowledge and uncertainty, and the notion of model complexity. As parameters correspond to unknown variables, having more information available leads to fewer parameters and hence to a simpler model. Similarly, the more information there is available, the sharper the parameter prior, implying a simpler model. To put it differently, the less uncertainty present in a model, the narrower its predicted interval, and the simpler the model. For example, in model Mb there is maximal uncertainty: nothing but the range is known about θ, so all values of θ are equally likely. In contrast, in model Mf there is minimal uncertainty: ph is known for sure, so only a single value of θ is possible. This difference in uncertainty translates into a difference in complexity. The same is true for the information integration models. Incorporating the order constraints in the priors reduces the uncertainty compared to the models without these constraints (it tells you, for example, that parameter a_1 is smaller than a_2). This reduction in uncertainty is reflected in a smaller complexity.

There are many different sources of prior information that can be translated into a range or distribution. The illustration using the information integration models highlighted that prior information can reflect meaningful information in the design. Alternatively, priors can be informed by previous applications of similar models in similar settings. Probably the purest form of priors are those that translate the theoretical assumptions made by a model (see [16]). The fact that it is often difficult to formalize this prior information should not be used as an excuse to leave the prior unspecified. It is certainly a challenging task, but so is translating theoretical assumptions into the model equation. Formalizing theory, intuitions, and information is what model building is all about.

References
[1] Myung, I. J. (2000) The importance of complexity in model selection. Journal of Mathematical Psychology, 44, 190–204.
[2] Pitt, M. A., Myung, I. J., and Zhang, S. (2002) Toward a method of selecting among computational models of cognition. Psychological Review, 109, 472–491.
[3] Shiffrin, R. M., Lee, M. D., Kim, W., and Wagenmakers, E. J. (2008) A survey of model evaluation approaches with a tutorial on hierarchical Bayesian methods. Cognitive Science, 32, 1248–1284.
[4] Cutting, J. E., Bruno, N., Brady, N. P., and Moore, C. (1992) Selectivity, scope, and simplicity of models: A lesson from fitting judgments of perceived depth. Journal of Experimental Psychology: General, 121, 364–381.
[5] Dunn, J. (2000) Model complexity: The fit to random data reconsidered. Psychological Research, 63, 174–182.
[6] Myung, I. J. and Pitt, M. A. (1997) Applying Occam's razor in modeling cognition: A Bayesian approach. Psychonomic Bulletin & Review, 4, 79–95.
[7] Vanpaemel, W. and Storms, G. (in press) Abstraction and model evaluation in category learning. Behavior Research Methods.
[8] Akaike, H. (1973) Information theory and an extension of the maximum likelihood principle. In Petrov, B. N. and Csaki, F. (eds.), Second International Symposium on Information Theory, pp. 267–281, Akademiai Kiado.
[9] Schwarz, G. (1978) Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
[10] Myung, I. J., Balasubramanian, V., and Pitt, M. A. (2000) Counting probability distributions: Differential geometry and model selection. Proceedings of the National Academy of Sciences, 97, 11170–11175.
[11] Lee, M. D. (2002) Generating additive clustering models with minimal stochastic complexity. Journal of Classification, 19, 69–85.
[12] Rissanen, J. (1996) Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42, 40–47.
[13] Grünwald, P. (2000) Model selection based on minimum description length. Journal of Mathematical Psychology, 44, 133–152.
[14] Lee, M. D. and Wagenmakers, E. J. (2005) Bayesian statistical inference in psychology: Comment on Trafimow (2003). Psychological Review, 112, 662–668.
[15] Lee, M. D. and Vanpaemel, W. (2008) Exemplars, prototypes, similarities and rules in category representation: An example of hierarchical Bayesian analysis. Cognitive Science, 32, 1403–1424.
[16] Vanpaemel, W. and Lee, M. D. (submitted) Using priors to formalize theory: Optimal attention and the generalized context model.
[17] Lee, M. D. (2008) Three case studies in the Bayesian analysis of cognitive models. Psychonomic Bulletin & Review, 15, 1–15.
[18] Spiegelhalter, D., Thomas, A., Best, N., and Lunn, D. (2004) WinBUGS User Manual Version 2.0. Medical Research Council Biostatistics Unit, Institute of Public Health, Cambridge.
[19] Anderson, N. H. (1981) Foundations of Information Integration Theory. Academic Press.
[20] Oden, G. C. and Massaro, D. W. (1978) Integration of featural information in speech perception. Psychological Review, 85, 172–191.
[21] Massaro, D. W. (1998) Perceiving Talking Faces: From Speech Perception to a Behavioral Principle. MIT Press.
[22] Massaro, D. W., Cohen, M. M., Campbell, C. S., and Rodriguez, T. (2001) Bayes factor of model selection validates FLMP. Psychonomic Bulletin & Review, 8, 1–17.
[23] Kass, R. E. and Raftery, A. E. (1995) Bayes factors. Journal of the American Statistical Association, 90, 773–795.
[24] Liu, C. C. and Aitkin, M. (2008) Bayes factors: Prior sensitivity and model generalizability. Journal of Mathematical Psychology, 53, 362–375.
5 0.47328568 3 nips-2009-AUC optimization and the two-sample problem
Author: Nicolas Vayatis, Marine Depecker, Stéphan J. Clémençon
Abstract: The purpose of the paper is to explore the connection between multivariate homogeneity tests and AUC optimization. The latter problem has recently received much attention in the statistical learning literature. From the elementary observation that, in the two-sample problem setup, the null assumption corresponds to the situation where the area under the optimal ROC curve is equal to 1/2, we propose a two-stage testing method based on data splitting. A nearly optimal scoring function in the AUC sense is first learnt from one of the two half-samples. Data from the remaining half-sample are then projected onto the real line and eventually ranked according to the scoring function computed at the first stage. The last step amounts to performing a standard Mann-Whitney Wilcoxon test in the one-dimensional framework. We show that the learning step of the procedure does not affect the consistency of the test or its properties in terms of power, provided the ranking produced is accurate enough in the AUC sense. The results of a numerical experiment are eventually displayed in order to show the efficiency of the method. 1
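As a concrete illustration of the two-stage procedure sketched in this abstract (my own sketch, not the authors' code; the choice of logistic regression as the scoring rule and the synthetic data are assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
y = rng.integers(0, 2, 400)          # sample labels; here the two samples are identical

# Stage 1: learn a scoring function on one half of the data.
idx = rng.permutation(len(y)); tr, te = idx[:200], idx[200:]
scorer = LogisticRegression().fit(X[tr], y[tr])

# Stage 2: project the held-out half onto the real line via the learnt scores
# and run a standard Mann-Whitney Wilcoxon rank test.
s = scorer.decision_function(X[te])
stat, pval = mannwhitneyu(s[y[te] == 1], s[y[te] == 0], alternative="two-sided")
print(f"MWW p-value: {pval:.3f}")    # expected to be large under the null
```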
6 0.44589877 244 nips-2009-The Wisdom of Crowds in the Recollection of Order Information
7 0.44359657 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior
8 0.44091535 100 nips-2009-Gaussian process regression with Student-t likelihood
9 0.43709967 78 nips-2009-Efficient Moments-based Permutation Tests
10 0.40497139 168 nips-2009-Non-stationary continuous dynamic Bayesian networks
11 0.40438154 155 nips-2009-Modelling Relational Data using Bayesian Clustered Tensor Factorization
12 0.40314499 206 nips-2009-Riffled Independence for Ranked Data
13 0.40070841 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs
14 0.39542937 173 nips-2009-Nonparametric Greedy Algorithms for the Sparse Learning Problem
15 0.39424482 59 nips-2009-Construction of Nonparametric Bayesian Models from Parametric Bayes Equations
16 0.38144037 140 nips-2009-Linearly constrained Bayesian matrix factorization for blind source separation
17 0.37293831 195 nips-2009-Probabilistic Relational PCA
18 0.36922625 51 nips-2009-Clustering sequence sets for motif discovery
19 0.35271516 203 nips-2009-Replacing supervised classification learning by Slow Feature Analysis in spiking neural networks
20 0.35086894 25 nips-2009-Adaptive Design Optimization in Experiments with People
topicId topicWeight
[(7, 0.014), (24, 0.021), (25, 0.067), (35, 0.531), (36, 0.065), (39, 0.037), (58, 0.055), (71, 0.051), (81, 0.014), (86, 0.045), (91, 0.016)]
simIndex simValue paperId paperTitle
1 0.96711558 10 nips-2009-A Gaussian Tree Approximation for Integer Least-Squares
Author: Jacob Goldberger, Amir Leshem
Abstract: This paper proposes a new algorithm for the linear least squares problem where the unknown variables are constrained to be in a finite set. The factor graph that corresponds to this problem is very loopy; in fact, it is a complete graph. Hence, applying the Belief Propagation (BP) algorithm yields very poor results. The algorithm described here is based on an optimal tree approximation of the Gaussian density of the unconstrained linear system. It is shown that even though the approximation is not directly applied to the exact discrete distribution, applying the BP algorithm to the modified factor graph outperforms current methods in terms of both performance and complexity. The improved performance of the proposed algorithm is demonstrated on the problem of MIMO detection.
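To illustrate the tree-approximation step described in this abstract (a sketch under my own assumptions, not the authors' algorithm): for a Gaussian, the optimal tree-structured approximation in the Kullback-Leibler sense is a maximum spanning tree on pairwise mutual informations, which for Gaussians depend only on the correlations.

```python
import numpy as np

def gaussian_tree_edges(cov):
    """Chow-Liu-style tree for a Gaussian with covariance `cov`.
    Bivariate Gaussian mutual information: -0.5 * log(1 - rho^2)."""
    d = np.sqrt(np.diag(cov))
    rho = np.clip(cov / np.outer(d, d), -0.999999, 0.999999)
    mi = -0.5 * np.log1p(-rho ** 2)
    n = len(cov)
    in_tree, edges = {0}, []
    while len(in_tree) < n:                      # Prim's algorithm on MI weights
        best = max(((i, j, mi[i, j]) for i in in_tree for j in range(n)
                    if j not in in_tree), key=lambda e: e[2])
        edges.append(best[:2]); in_tree.add(best[1])
    return edges

cov = np.array([[2.0, 0.8, 0.1], [0.8, 1.0, 0.5], [0.1, 0.5, 1.5]])
print(gaussian_tree_edges(cov))                  # [(0, 1), (1, 2)]
```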
same-paper 2 0.92275757 42 nips-2009-Bayesian Sparse Factor Models and DAGs Inference and Comparison
Author: Ricardo Henao, Ole Winther
Abstract: In this paper we present a novel approach to learn directed acyclic graphs (DAGs) and factor models within the same framework while also allowing for model comparison between them. For this purpose, we exploit the connection between factor models and DAGs to propose Bayesian hierarchies based on spike and slab priors to promote sparsity, heavy-tailed priors to ensure identifiability and predictive densities to perform the model comparison. We require identifiability to be able to produce variable orderings leading to valid DAGs and sparsity to learn the structures. The effectiveness of our approach is demonstrated through extensive experiments on artificial and biological data showing that our approach outperform a number of state of the art methods. 1
3 0.90634698 236 nips-2009-Structured output regression for detection with partial truncation
Author: Andrea Vedaldi, Andrew Zisserman
Abstract: We develop a structured output model for object category detection that explicitly accounts for alignment, multiple aspects and partial truncation in both training and inference. The model is formulated as large margin learning with latent variables and slack rescaling, and both training and inference are computationally efficient. We make the following contributions: (i) we note that extending the Structured Output Regression formulation of Blaschko and Lampert [1] to include a bias term significantly improves performance; (ii) that alignment (to account for small rotations and anisotropic scalings) can be included as a latent variable and efficiently determined and implemented; (iii) that the latent variable extends to multiple aspects (e.g. left facing, right facing, front) with the same formulation; and (iv), most significantly for performance, that truncated instances can be included in both training and inference with an explicit truncation mask. We demonstrate the method by training and testing on the PASCAL VOC 2007 data set – training includes the truncated examples, and in testing object instances are detected at multiple scales, alignments, and with significant truncations. 1
4 0.88353634 209 nips-2009-Robust Value Function Approximation Using Bilinear Programming
Author: Marek Petrik, Shlomo Zilberstein
Abstract: Existing value function approximation methods have been successfully used in many applications, but they often lack useful a priori error bounds. We propose approximate bilinear programming, a new formulation of value function approximation that provides strong a priori guarantees. In particular, this approach provably finds an approximate value function that minimizes the Bellman residual. Solving a bilinear program optimally is NP-hard, but this is unavoidable because the Bellman-residual minimization itself is NP-hard. We therefore employ and analyze a common approximate algorithm for bilinear programs. The analysis shows that this algorithm offers a convergent generalization of approximate policy iteration. Finally, we demonstrate that the proposed approach can consistently minimize the Bellman residual on a simple benchmark problem.

1 Motivation

Solving large Markov Decision Problems (MDPs) is a very useful, but computationally challenging problem addressed widely in the AI literature, particularly in the area of reinforcement learning. It is widely accepted that large MDPs can only be solved approximately. The commonly used approximation methods can be divided into three broad categories: 1) policy search, which explores a restricted space of all policies, 2) approximate dynamic programming, which searches a restricted space of value functions, and 3) approximate linear programming, which approximates the solution using a linear program. While all of these methods have achieved impressive results in many domains, they have significant limitations.

Policy search methods rely on local search in a restricted policy space. The policy may be represented, for example, as a finite-state controller [22] or as a greedy policy with respect to an approximate value function [24]. Policy search methods have achieved impressive results in such domains as Tetris [24] and helicopter control [1]. However, they are notoriously hard to analyze. We are not aware of any theoretical guarantees regarding the quality of the solution.

Approximate dynamic programming (ADP) methods iteratively approximate the value function [4, 20, 23]. They have been extensively analyzed and are the most commonly used methods. However, ADP methods typically do not converge and they only provide weak guarantees of approximation quality. The approximation error bounds are usually expressed in terms of the worst-case approximation of the value function over all policies [4]. In addition, most available bounds are with respect to the L∞ norm, while the algorithms often minimize the L2 norm. While there exist some L2-based bounds [14], they require values that are difficult to obtain.

Approximate linear programming (ALP) uses a linear program to compute the approximate value function in a particular vector space [7]. ALP has been previously used in a wide variety of settings [2, 9, 10]. Although ALP often does not perform as well as ADP, there have been some recent efforts to close the gap [18]. ALP has better theoretical properties than ADP and policy search. It is guaranteed to converge and return the closest L1-norm approximation ṽ of the optimal value function v* up to a multiplicative factor. However, the L1 norm must be properly weighted to guarantee a small policy loss, and there is no reliable method for selecting appropriate weights [7].

To summarize, the existing reinforcement learning techniques often provide good solutions, but typically require significant domain knowledge [20].
The domain knowledge is needed partly because useful a priori error bounds are not available, as mentioned above. Our goal is to develop a more robust method that is guaranteed to minimize an actual bound on the policy loss.

We present a new formulation of value function approximation that provably minimizes a bound on the policy loss. Unlike in some other algorithms, the bound in this case does not rely on values that are hard to obtain. The new method unifies policy search and value-function search methods to minimize the L∞ norm of the Bellman residual, which bounds the policy loss.

We start with a description of the framework and notation in Section 2. Then, in Section 3, we describe the proposed Approximate Bilinear Programming (ABP) formulation. A drawback of this formulation is its computational complexity, which may be exponential. We show in Section 4 that this is unavoidable, because minimizing the approximation error bound is in fact NP-hard. Although our focus is on the formulation and its properties, we also discuss some simple algorithms for solving bilinear programs. Section 5 shows that ABP can be seen as an improvement of ALP and Approximate Policy Iteration (API). Section 6 demonstrates the applicability of ABP using a common reinforcement learning benchmark problem. A complete discussion of sampling strategies (an essential component for achieving robustness) is beyond the scope of this paper, but the issue is briefly discussed in Section 6. Complete proofs of the theorems can be found in [19].

2 Solving MDPs using ALP

In this section, we formally define MDPs, their ALP formulation, and the approximation errors involved. These notions serve as a basis for developing the ABP formulation.

A Markov Decision Process is a tuple (S, A, P, r, α), where S is the finite set of states and A is the finite set of actions. P : S × S × A → [0, 1] is the transition function, where P(s', s, a) represents the probability of transiting to state s' from state s, given action a. The function r : S × A → R is the reward function, and α : S → [0, 1] is the initial state distribution. The objective is to maximize the infinite-horizon discounted cumulative reward. To shorten the notation, we assume an arbitrary ordering of the states: s_1, s_2, . . . , s_n. Then, P_a and r_a are used to denote the probabilistic transition matrix and reward for action a.

The solution of an MDP is a policy π : S × A → [0, 1] from a set of possible policies Π, such that for all s ∈ S, Σ_{a∈A} π(s, a) = 1. We assume that the policies may be stochastic, but stationary [21]. A policy is deterministic when π(s, a) ∈ {0, 1} for all s ∈ S and a ∈ A. The transition and reward functions for a given policy are denoted by P_π and r_π. The value function update for a policy π is denoted by L_π, and the Bellman operator is denoted by L. That is:

  L_π v = γ P_π v + r_π,    L v = max_{π∈Π} L_π v.

The optimal value function, denoted v*, satisfies v* = L v*.

We focus on linear value function approximation for discounted infinite-horizon problems. In linear value function approximation, the value function is represented as a linear combination of nonlinear basis functions (vectors). For each state s, we define a row vector φ(s) of features. The rows of the basis matrix M correspond to φ(s), and the approximation space is generated by the columns of the matrix. That is, the basis matrix M has one row − φ(s_i) − per state s_i, and the value function is represented as v = M x.
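A small numerical sketch of these definitions (my own illustration; the random MDP, the feature matrix and all names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma = 4, 2, 0.95
P = rng.dirichlet(np.ones(S), size=(A, S))   # P[a, s, s'] = transition probability
r = rng.uniform(size=(A, S))                 # r[a, s] = reward
M = np.hstack([np.ones((S, 1)), rng.normal(size=(S, 2))])  # features; 1 in span(M)

def L_pi(v, pi):
    """Policy update L_pi v = gamma * P_pi v + r_pi (deterministic policy pi[s])."""
    return np.array([r[pi[s], s] + gamma * P[pi[s], s] @ v for s in range(S)])

def L(v):
    """Bellman operator L v = max_pi L_pi v (state-wise max over actions)."""
    return np.max(r.T + gamma * np.einsum("ast,t->sa", P, v), axis=1)

x = rng.normal(size=3)
v = M @ x                                    # a representable value function
print("transitive-feasible (v >= Lv):", np.all(v >= L(v)))
```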
Definition 1. A value function, v, is representable if v ∈ M ⊆ R^|S|, where M = colspan(M), and is transitive-feasible when v ≥ Lv. We denote the set of transitive-feasible value functions as K = {v ∈ R^|S| : v ≥ Lv}.

Notice that the optimal value function v* is transitive-feasible, and M is a linear space. Also, all the inequalities are element-wise.

Because the new formulation is related to ALP, we introduce it first. It is well known that an infinite-horizon discounted MDP problem may be formulated in terms of solving the following linear program:

  minimize_v   Σ_{s∈S} c(s) v(s)
  subject to   v(s) − γ Σ_{s'∈S} P(s', s, a) v(s') ≥ r(s, a)   ∀(s, a) ∈ (S, A)        (1)

We use A as a shorthand notation for the constraint matrix and b for the right-hand side. The value c represents a distribution over the states, usually a uniform one; that is, Σ_{s∈S} c(s) = 1. The linear program in Eq. (1) is often too large to be solved precisely, so it is approximated by assuming that v ∈ M [8], which yields the approximate linear program:

  minimize_x   cᵀ v
  subject to   A v ≥ b,   v ∈ M                                                        (2)

The constraint v ∈ M denotes the approximation. To actually solve this linear program, the value function is represented as v = M x. In the remainder of the paper, we assume that 1 ∈ M to guarantee the feasibility of the ALP, where 1 is a vector of all ones. The optimal solution of the ALP, ṽ, satisfies ṽ ≥ v*. The objective of Eq. (2) then represents the minimization of ‖ṽ − v*‖_{1,c}, where ‖·‖_{1,c} is a c-weighted L1 norm [7].

The ultimate goal of the optimization is not to obtain a good value function ṽ, but a good policy. The quality of the policy, typically chosen to be greedy with respect to ṽ, depends non-trivially on the approximate value function. The ABP formulation will minimize the policy loss by minimizing ‖Lṽ − ṽ‖∞, which bounds the policy loss as follows.

Theorem 2 (e.g. [25]). Let ṽ be an arbitrary value function, and let v̂ be the value of the greedy policy with respect to ṽ. Then:

  ‖v* − v̂‖∞ ≤ (2 / (1 − γ)) ‖Lṽ − ṽ‖∞.

In addition, if ṽ ≥ Lṽ, the policy loss is smallest for the greedy policy.

Policies, like value functions, can be represented as vectors. Assume an arbitrary ordering of the state-action pairs, such that o(s, a) → N maps a state and an action to its position. The policies are represented as θ ∈ R^{|S|×|A|}, and we use the shorthand notation θ(s, a) = θ(o(s, a)).

Remark 3. The corresponding π and θ are denoted as π^θ and θ^π and satisfy π^θ(s, a) = θ^π(s, a).

We will also consider approximations of the policies in the policy space, generated by the columns of a matrix N. A policy is representable when π ∈ N, where N = colspan(N).
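A sketch of the ALP of Eq. (2) on the toy MDP from the previous sketch (my own illustration; scipy's linprog stands in for whatever LP solver one prefers):

```python
from scipy.optimize import linprog

# Stack the ALP constraints A v >= b over all state-action pairs, with v = M x.
rows, rhs = [], []
for a in range(A):
    for s in range(S):
        e = np.zeros(S); e[s] = 1.0
        rows.append((e - gamma * P[a, s]) @ M)   # (v - gamma * P_a v)(s) in terms of x
        rhs.append(r[a, s])
A_mat, b_vec = np.array(rows), np.array(rhs)

c = np.full(S, 1.0 / S)                          # uniform state distribution c
res = linprog(c @ M,                             # minimize c^T v = (c^T M) x
              A_ub=-A_mat, b_ub=-b_vec,          # A v >= b  <=>  -A v <= -b
              bounds=[(None, None)] * M.shape[1])
v_alp = M @ res.x
print("ALP solution is transitive-feasible:", np.all(v_alp >= L(v_alp) - 1e-8))
```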
An approximate bilinear program can now be formulated as follows. minimize θT λ + λ θ λ,λ ,v Bθ = 1 z = Av − b s.t. θ≥0 z≥0 (4) λ+λ1≥z λ≥0 θ∈N v∈M All variables are vectors except λ , which is a scalar. The symbol z is only used to simplify the notation and does not need to represent an optimization variable. The variable v is defined for each state and represents the value function. Matrix A represents constraints that are identical to the constraints in Eq. (2). The variables λ correspond to all state-action pairs. These variables represent the Bellman residuals that are being minimized. The variables θ are defined for all state-action pairs and represent policies in Remark 3. The matrix B represents the following constraints: θ(s, a) = 1 ∀s ∈ S. a∈A As with approximate linear programs, we initially assume that all the constraints on z are used. In realistic settings, however, the constraints would be sampled or somehow reduced. We defer the discussion of this issue until Section 6. Note that the constraints in our formulation correspond to elements of z and θ. Thus when constraints are omitted, also the corresponding elements of z and θ are omitted. To simplify the notation, the value function approximation in this problem is denoted only implicitly by v ∈ M, and the policy approximation is denoted by θ ∈ N . In an actual implementation, the optimization variables would be x, y using the relationships v = M x and θ = N y. We do not assume any approximation of the policy space, unless mentioned otherwise. We also use v or θ to refer to partial solutions of Eq. (4) with the other variables chosen appropriately to achieve feasibility. The ABP formulation is closely related to approximate linear programs, and we discuss the connection in Section 5. We first analyze the properties of the optimal solutions of the bilinear program and then show and discuss the solution methods in Section 4. The following theorem states the main property of the bilinear formulation. ˜˜ ˜ ˜ Theorem 5. b Let (θ, v , λ, λ ) be an optimal solution of Eq. (4) and assume that 1 ∈ M. Then: ˜ ˜ ˜ θT λ + λ = L˜ − v v ˜ ∞ ≤ min v∈K∩M Lv − v ∞ ≤ 2 min Lv − v v∈M ∞ ≤ 2(1 + γ) min v − v ∗ v∈M ∞. ˜ In addition, π θ minimizes the Bellman residual with regard to v , and its value function v satisfies: ˜ ˆ 2 min Lv − v ∞ . v − v∗ ∞ ≤ ˆ 1 − γ v∈M The proof of the theorem can be found in [19]. It is important to note that, as Theorem 5 states, the ABP approach is equivalent to a minimization over all representable value functions, not only the transitive-feasible ones. Notice also the missing coefficient 2 (2 instead of 4) in the last equation of Theorem 5. This follows by subtracting a constant vector 1 from v to balance the lower bounds ˜ on the Bellman residual error with the upper ones. This modified approximate value function will have 1/2 of the original Bellman residual but an identical greedy policy. Finally, note that whenever v ∗ ∈ M, both ABP and ALP will return the optimal value function. The ABP solution minimizes the L∞ norm of the Bellman residual due to: 1) the correspondence between θ and the policies, and 2) the dual representation with respect to variables λ and λ . The theorem then follows using techniques similar to those used for approximate linear programs [7]. 4 Algorithm 1: Iterative algorithm for solving Eq. 
Algorithm 1: Iterative algorithm for solving Eq. (3)
  (x_0, w_0) ← random
  (y_0, z_0) ← argmin_{y,z} f(w_0, x_0, y, z)
  i ← 1
  while y_{i−1} ≠ y_i or x_{i−1} ≠ x_i do
      (y_i, z_i) ← argmin { f(w_{i−1}, x_{i−1}, y, z) : A₂ y + B₂ z = b₂, y, z ≥ 0 }
      (x_i, w_i) ← argmin { f(w, x, y_i, z_i) : A₁ x + B₁ w = b₁, x, w ≥ 0 }
      i ← i + 1
  return f(w_i, x_i, y_i, z_i)

4 Solving Bilinear Programs

In this section we describe simple methods for solving ABPs. We first describe optimal methods, which have exponential complexity, and then discuss some approximation strategies.

Solving a bilinear program is an NP-complete problem [3]. The membership in NP follows from the finite number of basic feasible solutions of the individual linear programs, each of which can be checked in polynomial time. The NP-hardness is shown by a reduction from the SAT problem [3]. The NP-completeness of ABP compares unfavorably with the polynomial complexity of ALP. However, most other ADP algorithms are not guaranteed to converge to a solution in finite time. The following theorem shows that the computational complexity of the ABP formulation is asymptotically the same as the complexity of the problem it solves.

Theorem 6. Determining whether min_{v∈K∩M} ‖Lv − v‖∞ < ε is NP-complete for the full constraint representation, 0 < γ < 1, and a given ε > 0. In addition, the problem remains NP-complete when 1 ∈ M, and therefore determining whether min_{v∈M} ‖Lv − v‖∞ < ε is also NP-complete.

As the theorem states, the value function approximation does not become computationally simpler even when 1 ∈ M, a universal assumption in the paper. Notice that ALP can determine whether min_{v∈K∩M} ‖Lv − v‖∞ = 0 in polynomial time. The proof of Theorem 6 is based on a reduction from SAT and can be found in [19]. The policy in the reduction determines the true literal in each clause, and the approximate value function corresponds to the truth value of the literals. The approximation basis forces literals that share the same variable to have consistent values.

Bilinear programs are non-convex and are typically solved using global optimization techniques. The common solution methods are based on concave cuts [11] or branch-and-bound [6]. In ABP settings with a small number of features, the successive approximation algorithm [17] may be applied efficiently. We are, however, not aware of commercial solvers available for solving bilinear programs. Bilinear programs can be formulated as concave quadratic minimization problems [11], or as mixed integer linear programs [11, 16], for which there are numerous commercial solvers available. Because we are interested in solving very large bilinear programs, we describe simple approximate algorithms next. Optimal scalable methods are beyond the scope of this paper.

The most common approximate method for solving bilinear programs is shown in Algorithm 1. It is designed for the general formulation shown in Eq. (3), where f(w, x, y, z) represents the objective function. The minimizations in the algorithm are linear programs which can be easily solved. Interestingly, as we will show in Section 5, Algorithm 1 applied to ABP generalizes a version of API. While Algorithm 1 is not guaranteed to find an optimal solution, its empirical performance is often remarkably good [13]. Its basic properties are summarized by the following proposition.

Proposition 7 (e.g. [3]). Algorithm 1 is guaranteed to converge, assuming that the linear program solutions are in a vertex of the optimality simplex. In addition, the global optimum is a fixed point of the algorithm, and the objective value monotonically improves during execution.
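A compact sketch of Algorithm 1 for the normal form of Eq. (3) (my own illustration and initialization choice, not the paper's code): with (w, x) fixed, the problem in (y, z) is a linear program, and vice versa, so the two linear programs are alternated until the solution stops changing.

```python
from scipy.optimize import linprog

def solve_bilinear(s1, r1, C, r2, s2, A1, B1, b1, A2, B2, b2, iters=50):
    """Alternating linear programs for the separable bilinear program (3).
    Converges to a (possibly local) optimum; the objective never increases.
    linprog's default bounds (0, None) enforce w, x, y, z >= 0."""
    nw, ny = B1.shape[1], A2.shape[1]
    x = np.zeros(A1.shape[1])                 # crude init; x only shapes the first LP
    for _ in range(iters):
        # Fix (w, x): minimize (C^T x + r2)^T y + s2^T z  s.t.  A2 y + B2 z = b2.
        res = linprog(np.concatenate([C.T @ x + r2, s2]),
                      A_eq=np.hstack([A2, B2]), b_eq=b2)
        y = res.x[:ny]
        # Fix (y, z): minimize s1^T w + (r1 + C y)^T x  s.t.  A1 x + B1 w = b1.
        res = linprog(np.concatenate([s1, r1 + C @ y]),
                      A_eq=np.hstack([B1, A1]), b_eq=b1)
        w, x_new = res.x[:nw], res.x[nw:]
        if np.allclose(x_new, x):             # simplified stopping test on x only
            break
        x = x_new
    return w, x, y
```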
The proof is based on the finite count of the basic feasible solutions of the individual linear programs. Because the objective function does not increase in any iteration, the algorithm will eventually converge.

In the context of MDPs, Algorithm 1 can be further refined. For example, the constraint v ∈ M in Eq. (4) serves mostly to simplify the bilinear program, and a value function that violates it may still be acceptable. The following proposition motivates the construction of a new value function from two transitive-feasible value functions.

Proposition 8. Let ṽ₁ and ṽ₂ be feasible value functions in Eq. (4). Then the value function ṽ(s) = min{ṽ₁(s), ṽ₂(s)} is also feasible in Eq. (4). Therefore ṽ ≥ v* and

  ‖v* − ṽ‖∞ ≤ min { ‖v* − ṽ₁‖∞ , ‖v* − ṽ₂‖∞ }.

The proof of the proposition is based on Jensen's inequality and can be found in [19]. Proposition 8 can be used to extend Algorithm 1 when solving ABPs. One option is to take the state-wise minimum of values from multiple random executions of Algorithm 1, which preserves the transitive feasibility of the value function. However, the increasing number of value functions used to obtain ṽ also increases the potential sampling error.

5 Relationship to ALP and API

In this section, we describe the important connections between ABP and the two closely related ADP methods: ALP, and API with L∞ minimization. Both of these methods are commonly used, for example to solve factored MDPs [10]. Our analysis sheds light on some of their observed properties and leads to a new convergent form of API.

ABP addresses some important issues with ALP: 1) ALP provides value function bounds with respect to the L1 norm, which does not guarantee a small policy loss; 2) ALP's solution quality depends significantly on the heuristically chosen objective function c in Eq. (2) [7]; and 3) incomplete constraint samples in ALP easily lead to unbounded linear programs. The drawback of using ABP, however, is the higher computational complexity.

Both the first and the second issue in ALP can be addressed by choosing the right objective function [7]. Because this objective function depends on the optimal ALP solution, it cannot be practically computed. Instead, various heuristics are usually used. The heuristic objective functions may lead to significant improvements in specific domains, but they do not provide any guarantees. ABP, on the other hand, has no such parameters that require adjustments.

The third issue arises when the constraints of an ALP need to be sampled in some large domains. The ALP may become unbounded with incomplete samples because its objective value is defined using the L1 norm on the states, while the constraints are defined using the L∞ norm of the Bellman residual. In ABP, the Bellman residual is used in both the constraints and the objective function, so the objective function of ABP is bounded below by 0 for an arbitrarily small number of samples.

ABP can also improve on API with L∞ minimization (L∞-API for short), which is a leading method for solving factored MDPs [10]. Minimizing the L∞ approximation error is theoretically preferable, since it is compatible with the existing bounds on policy loss [10]. In contrast, few practical bounds exist for API with L2 norm minimization [14], such as LSPI [12]. L∞-API is shown in Algorithm 2, where f(π) is calculated using the following program:

  minimize_{φ,v}   φ
  subject to       (I − γP_π) v + 1φ ≥ r_π                                   (5)
                   −(I − γP_π) v + 1φ ≥ −r_π
                   v ∈ M

Here I denotes the identity matrix.
We are not aware of a convergence or a divergence proof of L∞-API, and this analysis is beyond the scope of this paper.

Algorithm 2: Approximate policy iteration, where f(π) denotes a custom value-function approximation for the policy π.
  π_0, k ← rand, 1
  while π_k ≠ π_{k−1} do
      ṽ_k ← f(π_{k−1})
      π_k(s) ← argmax_{a∈A} [ r(s, a) + γ Σ_{s'∈S} P(s', s, a) ṽ_k(s') ]   ∀s ∈ S
      k ← k + 1

We propose Optimistic Approximate Policy Iteration (OAPI), a modification of API. OAPI is shown in Algorithm 2, where f(π) is calculated using the following program:

  minimize_{φ,v}   φ
  subject to       A v ≥ b    (≡ (I − γP_π) v ≥ r_π ∀π ∈ Π)                  (6)
                   −(I − γP_π) v + 1φ ≥ −r_π
                   v ∈ M

In fact, OAPI corresponds to Algorithm 1 applied to ABP, because Eq. (6) corresponds to Eq. (4) with fixed θ. Then, using Proposition 7, we get the following corollary.

Corollary 9. Optimistic approximate policy iteration converges in finite time. In addition, the Bellman residual of the generated value functions monotonically decreases.

OAPI differs from L∞-API in two ways: 1) OAPI constrains the Bellman residuals by 0 from below and by φ from above, and then minimizes φ, whereas L∞-API constrains the Bellman residuals by φ from both above and below; 2) OAPI, like API, uses only the current policy for the upper bound on the Bellman residual, but uses all the policies for the lower bound.

L∞-API cannot return an approximate value function that has a lower Bellman residual than ABP, given the optimality of ABP described in Theorem 5. However, even OAPI, an approximate ABP algorithm, performs comparably to L∞-API, as the following theorem states.

Theorem 10. Assume that L∞-API converges to a policy π and a value function v that together satisfy φ = ‖v − L_π v‖∞ = ‖v − Lv‖∞. Then ṽ = v + (φ / (1 − γ)) 1 is feasible in Eq. (4), and it is a fixed point of OAPI. In addition, the greedy policies with respect to v and ṽ are identical.

The proof is based on two facts. First, ṽ is feasible with respect to the constraints in Eq. (4): the Bellman residual changes identically for all policies, since a constant vector is added. Second, because L_π is greedy with respect to ṽ, we have that ṽ ≥ L_π ṽ ≥ Lṽ; the value function ṽ is therefore transitive-feasible. The full proof can be found in [19].

To summarize, OAPI guarantees convergence while matching the performance of L∞-API. The convergence of OAPI is achieved because, given a non-negative Bellman residual, the greedy policy also minimizes the Bellman residual. Because OAPI ensures that the Bellman residual is always non-negative, it can progressively reduce it. In comparison, the greedy policy in L∞-API does not minimize the Bellman residual, and therefore L∞-API does not always reduce it.

Theorem 10 also explains why API provides better solutions than ALP, as observed in [10]. From the discussion above, ALP can be seen as an L1-norm approximation of a single iteration of OAPI. L∞-API, on the other hand, performs many such ALP-like iterations.

6 Empirical Evaluation

As we showed in Theorem 10, even OAPI, the very simple approximate algorithm for ABP, can perform as well as existing state-of-the-art methods on factored MDPs. However, a deeper understanding of the formulation and potential solution methods will be necessary in order to determine the full practical impact of the proposed methods. In this section, we validate the approach by applying it to the mountain car problem, a simple reinforcement learning benchmark problem.
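A sketch of OAPI on the toy MDP used in the earlier sketches (my own illustration; it reuses M, P, r, gamma, S, A_mat, b_vec and linprog from above, and solves Eq. (6) over the variables (x, φ) with v = M x):

```python
def oapi(iters=20):
    """Optimistic approximate policy iteration: solve Eq. (6) for the current
    greedy policy, then re-derive the greedy policy, until it stops changing."""
    k = M.shape[1]
    pi = np.zeros(S, dtype=int)
    for _ in range(iters):
        # Constraints: A v >= b for all (s, a), and (I - gamma*P_pi)v - 1*phi <= r_pi.
        lhs = [np.append(-row, 0.0) for row in A_mat]        # -A M x <= -b
        rhs = list(-b_vec)
        for s in range(S):
            e = np.zeros(S); e[s] = 1.0
            lhs.append(np.append((e - gamma * P[pi[s], s]) @ M, -1.0))
            rhs.append(r[pi[s], s])
        res = linprog(np.append(np.zeros(k), 1.0),           # minimize phi
                      A_ub=np.array(lhs), b_ub=np.array(rhs),
                      bounds=[(None, None)] * k + [(0, None)])
        v = M @ res.x[:k]
        new_pi = np.argmax(r.T + gamma * np.einsum("ast,t->sa", P, v), axis=1)
        if np.array_equal(new_pi, pi):
            break
        pi = new_pi
    return v, pi, res.x[-1]         # value function, greedy policy, residual bound phi
```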
We have so far considered that all the constraints involving z are present in the ABP in Eq. (4). Because the constraints correspond to all state-action pairs, it is often impractical to even enumerate them. This issue can be addressed in at least two ways. First, a small randomly selected subset of the constraints can be used in the ABP, a common approach in ALP [9, 5]; the ALP sampling bounds can be easily extended to ABP. Second, the structure of the MDP can be used to reduce the number of constraints. Such a reduction is possible, for example, in factored MDPs with L∞-API and ALP [10], and can be easily extended to OAPI and ABP.

In the mountain-car benchmark, an underpowered car needs to climb a hill [23]. To do so, it first needs to back up to an opposite hill to gain sufficient momentum. The car receives a reward of 1 when it climbs the hill. In the experiments we used a discount factor γ = 0.99. The experiments are designed to determine whether OAPI reliably minimizes the Bellman residual in comparison with API and ALP. We use a uniformly spaced linear spline to approximate the value function. The constraints were based on 200 uniformly sampled states with all 3 actions per state. We evaluated the methods with 100 and 144 approximation features, corresponding to the number of linear segments.

The results of ABP (in particular OAPI), ALP, API with L2 minimization, and LSPI are shown in Table 1, for both the L∞ norm and the uniformly weighted L2 norm. The runtimes of all these methods are comparable, with ALP being the fastest. Since API (LSPI) is not guaranteed to converge, we ran it for at most 20 iterations, which was an upper bound on the number of iterations of OAPI. The results demonstrate that ABP minimizes the L∞ Bellman residual much more consistently than the other methods. Note, however, that all the considered algorithms would perform significantly better given a finer approximation.

Table 1: Bellman residual of the final value function. The values are averages over 5 executions, with the standard deviations shown in parentheses.

(a) L∞ error of the Bellman residual
Features    OAPI           ALP          LSPI         API
100         0.21 (0.23)    13. (13.)    9. (14.)     0.46 (0.08)
144         0.13 (0.1)     3.6 (4.3)    3.9 (7.7)    0.86 (1.18)

(b) L2 error of the Bellman residual
Features    OAPI          ALP          LSPI         API
100         0.2 (0.3)     9.5 (18.)    1.2 (1.5)    0.04 (0.01)
144         0.1 (1.9)     0.3 (0.4)    0.9 (0.1)    0.08 (0.08)

7 Conclusion and Future Work

We proposed and analyzed approximate bilinear programming, a new value-function approximation method which provably minimizes the L∞ Bellman residual. ABP returns the optimal approximate value function with respect to the Bellman residual bounds, despite the formulation with regard to transitive-feasible value functions. We also showed that there is no asymptotically simpler formulation, since finding the closest value function and solving a bilinear program are both NP-complete problems. Finally, the formulation leads to the development of OAPI, a new convergent form of API which monotonically improves its objective value.

While we only discussed approximate solutions of the ABP, a deeper study of bilinear solvers may render optimal solution methods feasible. ABPs have a small number of essential variables (those that determine the value function) and a large number of constraints, which can be leveraged by the solvers [15]. The L∞ error bound provides good theoretical guarantees, but it may be too conservative in practice.
A similar formulation based on L2 norm minimization may be more practical. We believe that the proposed formulation will help to deepen the understanding of value function approximation and the characteristics of existing solution methods, and potentially lead to the development of more robust and widely applicable reinforcement learning algorithms.

Acknowledgements

This work was supported by the Air Force Office of Scientific Research under Grant No. FA9550-08-1-0171. We also thank the anonymous reviewers for their useful comments.

References
[1] Pieter Abbeel, Varun Ganapathi, and Andrew Y. Ng. Learning vehicular dynamics, with application to modeling helicopters. In Advances in Neural Information Processing Systems, pages 1–8, 2006.
[2] Daniel Adelman. A price-directed approach to stochastic inventory/routing. Operations Research, 52:499–514, 2004.
[3] Kristin P. Bennett and O. L. Mangasarian. Bilinear separation of two sets in n-space. Technical report, Computer Science Department, University of Wisconsin, 1992.
[4] Dimitri P. Bertsekas and Sergey Ioffe. Temporal differences-based policy iteration and applications in neuro-dynamic programming. Technical Report LIDS-P-2349, LIDS, 1997.
[5] Giuseppe Calafiore and M. C. Campi. Uncertain convex programs: Randomized solutions and confidence levels. Mathematical Programming, Series A, 102:25–46, 2005.
[6] Alberto Caprara and Michele Monaci. Bidimensional packing by bilinear programming. Mathematical Programming, Series A, 118:75–108, 2009.
[7] Daniela P. de Farias. The Linear Programming Approach to Approximate Dynamic Programming: Theory and Application. PhD thesis, Stanford University, 2002.
[8] Daniela P. de Farias and Ben Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51:850–856, 2003.
[9] Daniela Pucci de Farias and Benjamin Van Roy. On constraint sampling in the linear programming approach to approximate dynamic programming. Mathematics of Operations Research, 29(3):462–478, 2004.
[10] Carlos Guestrin, Daphne Koller, Ronald Parr, and Shobha Venkataraman. Efficient solution algorithms for factored MDPs. Journal of Artificial Intelligence Research, 19:399–468, 2003.
[11] Reiner Horst and Hoang Tuy. Global Optimization: Deterministic Approaches. Springer, 1996.
[12] Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.
[13] O. L. Mangasarian. The linear complementarity problem as a separable bilinear program. Journal of Global Optimization, 12:1–7, 1995.
[14] Remi Munos. Error bounds for approximate policy iteration. In International Conference on Machine Learning, pages 560–567, 2003.
[15] Marek Petrik and Shlomo Zilberstein. Anytime coordination using separable bilinear programs. In Conference on Artificial Intelligence, pages 750–755, 2007.
[16] Marek Petrik and Shlomo Zilberstein. Average reward decentralized Markov decision processes. In International Joint Conference on Artificial Intelligence, pages 1997–2002, 2007.
[17] Marek Petrik and Shlomo Zilberstein. A bilinear programming approach for multiagent planning. Journal of Artificial Intelligence Research, 35:235–274, 2009.
[18] Marek Petrik and Shlomo Zilberstein. Constraint relaxation in approximate linear programs. In International Conference on Machine Learning, pages 809–816, 2009.
[19] Marek Petrik and Shlomo Zilberstein. Robust value function approximation using bilinear programming. Technical Report UM-CS-2009-052, Department of Computer Science, University of Massachusetts Amherst, 2009.
[20] Warren B. Powell. Approximate Dynamic Programming. Wiley-Interscience, 2007.
[21] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 2005.
[22] Kenneth O. Stanley and Risto Miikkulainen. Competitive coevolution through evolutionary complexification. Journal of Artificial Intelligence Research, 21:63–100, 2004.
[23] Richard S. Sutton and Andrew Barto. Reinforcement Learning. MIT Press, 1998.
[24] Istvan Szita and Andras Lorincz. Learning Tetris using the noisy cross-entropy method. Neural Computation, 18(12):2936–2941, 2006.
[25] Ronald J. Williams and Leemon C. Baird. Tight performance bounds on greedy policies based on imperfect value functions. In Yale Workshop on Adaptive and Learning Systems, 1994.
5 0.83681923 34 nips-2009-Anomaly Detection with Score functions based on Nearest Neighbor Graphs
Author: Manqi Zhao, Venkatesh Saligrama
Abstract: We propose a novel non-parametric adaptive anomaly detection algorithm for high dimensional data based on score functions derived from nearest neighbor graphs on n-point nominal data. Anomalies are declared whenever the score of a test sample falls below α, which is supposed to be the desired false alarm level. The resulting anomaly detector is shown to be asymptotically optimal in that it is uniformly most powerful for the specified false alarm level, α, for the case when the anomaly density is a mixture of the nominal and a known density. Our algorithm is computationally efficient, being linear in dimension and quadratic in data size. It does not require choosing complicated tuning parameters or function approximation classes and it can adapt to local structure such as local change in dimensionality. We demonstrate the algorithm on both artificial and real data sets in high dimensional feature spaces.
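A minimal sketch in the spirit of this abstract (my own illustration; the particular score, the fraction of nominal points whose k-NN radius exceeds the test point's k-NN distance, is an assumption, not the authors' exact statistic):

```python
import numpy as np

def knn_radius(X, Z, k):
    """Distance from each row of Z to its k-th nearest neighbor in X."""
    d = np.linalg.norm(Z[:, None, :] - X[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, k - 1]

def anomaly_scores(X_nominal, X_test, k=5):
    """Empirical score: fraction of nominal points with a larger k-NN radius
    than the test point (a small score means the point lies in a sparse region)."""
    r_nom = knn_radius(X_nominal, X_nominal, k + 1)   # k+1 skips the self-distance
    r_test = knn_radius(X_nominal, X_test, k)
    return (r_nom[None, :] >= r_test[:, None]).mean(axis=1)

rng = np.random.default_rng(3)
nominal = rng.normal(size=(500, 2))
test = np.vstack([rng.normal(size=(5, 2)), [[6.0, 6.0]]])   # last point is anomalous
alpha = 0.05
print(anomaly_scores(nominal, test) < alpha)   # expect the last entry to be True
```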
6 0.81773603 219 nips-2009-Slow, Decorrelated Features for Pretraining Complex Cell-like Networks
7 0.60639977 159 nips-2009-Multi-Step Dyna Planning for Policy Evaluation and Control
8 0.57174504 23 nips-2009-Accelerating Bayesian Structural Inference for Non-Decomposable Gaussian Graphical Models
9 0.55611986 187 nips-2009-Particle-based Variational Inference for Continuous Systems
10 0.5527848 113 nips-2009-Improving Existing Fault Recovery Policies
11 0.55265504 30 nips-2009-An Integer Projected Fixed Point Method for Graph Matching and MAP Inference
12 0.55182159 131 nips-2009-Learning from Neighboring Strokes: Combining Appearance and Context for Multi-Domain Sketch Recognition
13 0.5476706 28 nips-2009-An Additive Latent Feature Model for Transparent Object Recognition
14 0.54499197 100 nips-2009-Gaussian process regression with Student-t likelihood
15 0.52762055 167 nips-2009-Non-Parametric Bayesian Dictionary Learning for Sparse Image Representations
16 0.52414215 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference
17 0.52069938 3 nips-2009-AUC optimization and the two-sample problem
18 0.51497418 162 nips-2009-Neural Implementation of Hierarchical Bayesian Inference by Importance Sampling
19 0.51279664 48 nips-2009-Bootstrapping from Game Tree Search
20 0.50641221 35 nips-2009-Approximating MAP by Compensating for Structural Relaxations