andrew_gelman_stats andrew_gelman_stats-2014 andrew_gelman_stats-2014-2351 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Yajuan Si, Natesh Pillai, and I write: It has historically been a challenge to perform Bayesian inference in a design-based survey context. The present paper develops a Bayesian model for sampling inference using inverse-probability weights. We use a hierarchical approach in which we model the distribution of the weights of the nonsampled units in the population and simultaneously include them as predictors in a nonparametric Gaussian process regression. We use simulation studies to evaluate the performance of our procedure and compare it to the classical design-based estimator. We apply our method to the Fragile Families and Child Wellbeing Study. Our studies find the Bayesian nonparametric finite population estimator to be more robust than the classical design-based estimator without loss in efficiency. More work needs to be done for this to be a general practical tool—in particular, in the setup of this paper you only have survey weights and no direct poststratification variables …
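To make the abstract concrete, here is a minimal numerical sketch of the general idea, not the authors' implementation: regress the outcome on log-weights with a Gaussian process and average the fitted curve over the population weight distribution, comparing the result with the classical Hajek estimator. The simulation settings are invented, and the sketch cheats by using the true inclusion probabilities for the nonsampled units, which the paper instead models hierarchically.

```python
# A minimal sketch of the idea, not the paper's implementation. Assumptions:
# sklearn's off-the-shelf GP, a toy population, and (unrealistically) known
# inclusion probabilities for every unit in the population.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

# Toy population: inclusion probability grows with a latent size x, and the
# outcome y is correlated with x, so the weights are informative.
N = 20_000
x = rng.gamma(2.0, 1.0, size=N)
pi = np.clip(0.01 * x, 1e-4, 1.0)            # inclusion probabilities
y = 1.0 + 0.5 * np.log(x) + rng.normal(0.0, 1.0, size=N)

# Poisson sampling; weights are inverse inclusion probabilities.
sampled = rng.random(N) < pi
w, ys = 1.0 / pi[sampled], y[sampled]

# Classical design-based (Hajek) estimator of the population mean.
hajek = np.sum(w * ys) / np.sum(w)

# Model-based sketch: nonparametric regression of y on log-weight.
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(np.log(w).reshape(-1, 1), ys)

# Average the fitted mean function over the population weights. In the paper
# the weights of the nonsampled units are unknown and modeled; here we cheat
# and use the true pi for illustration.
gp_est = gp.predict(np.log(1.0 / pi).reshape(-1, 1)).mean()

print(f"true mean {y.mean():.3f}  Hajek {hajek:.3f}  GP {gp_est:.3f}")
```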
sentIndex sentText sentNum sentScore
1 Yajuan Si, Natesh Pillai, and I write: It has historically been a challenge to perform Bayesian inference in a design-based survey context. [sent-1, score-0.615]
2 The present paper develops a Bayesian model for sampling inference using inverse-probability weights. [sent-2, score-0.434]
3 We use a hierarchical approach in which we model the distribution of the weights of the nonsampled units in the population and simultaneously include them as predictors in a nonparametric Gaussian process regression. [sent-3, score-1.171]
4 We use simulation studies to evaluate the performance of our procedure and compare it to the classical design-based estimator. [sent-4, score-0.724]
5 We apply our method to the Fragile Families and Child Wellbeing Study. [sent-5, score-0.077]
6 Our studies find the Bayesian nonparametric finite population estimator to be more robust than the classical design-based estimator without loss in efficiency. [sent-6, score-1.499]
7 I’m very excited about this general line of research. [sent-8, score-0.253]
wordName wordTfidf (topN-words)
[('weights', 0.329), ('estimator', 0.247), ('poststratification', 0.229), ('nonparametric', 0.223), ('population', 0.197), ('survey', 0.184), ('yajuan', 0.175), ('natesh', 0.165), ('pillai', 0.165), ('classical', 0.162), ('si', 0.152), ('develops', 0.144), ('historically', 0.138), ('fragile', 0.138), ('bayesian', 0.136), ('general', 0.134), ('setup', 0.133), ('feed', 0.129), ('mister', 0.122), ('simultaneously', 0.121), ('demonstrates', 0.121), ('studies', 0.119), ('excited', 0.119), ('unknown', 0.118), ('units', 0.116), ('inference', 0.111), ('finite', 0.107), ('gaussian', 0.106), ('model', 0.103), ('child', 0.101), ('robust', 0.1), ('simulation', 0.1), ('loss', 0.097), ('evaluate', 0.094), ('challenge', 0.091), ('perform', 0.091), ('procedure', 0.09), ('tool', 0.09), ('sizes', 0.09), ('framework', 0.089), ('family', 0.086), ('needs', 0.084), ('predictors', 0.082), ('fitting', 0.082), ('practical', 0.081), ('compare', 0.08), ('performance', 0.079), ('apply', 0.077), ('estimated', 0.077), ('sampling', 0.076)]
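The (word, score) pairs above look like TF-IDF weights, and the per-sentence scores earlier plausibly aggregate them, though the mining pipeline behind this page is not documented. A hedged sketch of how such scores could be computed, on an invented three-post corpus:

```python
# A hedged sketch of TF-IDF scoring; the corpus is invented, and the actual
# pipeline behind the (word, score) list above is an assumption.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Bayesian nonparametric weighted sampling inference with survey weights",
    "weighting and prediction in sample surveys",
    "hierarchical modeling as a framework for extrapolation",
]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# Top-scoring words for the first document, analogous to the list above.
scores = X[0].toarray().ravel()
terms = vec.get_feature_names_out()
top = sorted(zip(terms, scores), key=lambda t: -t[1])[:5]
print([(word, round(s, 3)) for word, s in top])
```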
simIndex simValue blogId blogTitle
same-blog 1 1.0 2351 andrew gelman stats-2014-05-28-Bayesian nonparametric weighted sampling inference
2 0.27064326 784 andrew gelman stats-2011-07-01-Weighting and prediction in sample surveys
Introduction: A couple years ago Rod Little was invited to write an article for the diamond jubilee of the Calcutta Statistical Association Bulletin. His article was published with discussions from Danny Pfeffermann, J. N. K. Rao, Don Rubin, and myself. Here it all is. I’ll paste my discussion below, but it’s worth reading the others’ perspectives too. Especially the part in Rod’s rejoinder where he points out a mistake I made. Survey weights, like sausage and legislation, are designed and best appreciated by those who are placed a respectable distance from their manufacture. For those of us working inside the factory, vigorous discussion of methods is appreciated. I enjoyed Rod Little’s review of the connections between modeling and survey weighting and have just a few comments. I like Little’s discussion of model-based shrinkage of post-stratum averages, which, as he notes, can be seen to correspond to shrinkage of weights. I would only add one thing to his formula at the end of his …
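The "model-based shrinkage of post-stratum averages" that Little discusses can be illustrated with the standard normal-normal partial-pooling formula; this is a generic sketch with invented numbers, not his exact expression:

```python
# Generic partial pooling of post-stratum means toward the grand mean; the
# variances are assumed known and the data are invented.
import numpy as np

ybar = np.array([2.1, 3.4, 2.8, 5.0])   # observed post-stratum means
n = np.array([50, 20, 80, 5])           # post-stratum sample sizes
sigma2, tau2 = 1.0, 0.5                 # within- and between-stratum variances

grand = np.average(ybar, weights=n)
shrink = (n / sigma2) / (n / sigma2 + 1.0 / tau2)  # pooling factor per stratum
pooled = shrink * ybar + (1.0 - shrink) * grand
print(pooled)  # the small stratum (n=5) is pulled hardest toward the grand mean
```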
3 0.20370899 1509 andrew gelman stats-2012-09-24-Analyzing photon counts
Introduction: Via Tom LaGatta, Boris Glebov writes: My labmates have a statistics problem. We are all experimentalists, but need input on a fine statistics point. The problem is as follows. The data set consists of photon counts measured at a series of coordinates. The number of input photons is known, but the system transmission (T) is not known and needs to be estimated. The number of transmitted photons at each coordinate follows a binomial distribution, not a Gaussian one. The spatial distribution of T values is then fit using a Levenberg-Marquardt method modified to use weights for each data point. At present, my labmates are not sure how to properly calculate and use the weights. The equations are designed for Gaussian distributions, not binomial ones, and this is a problem because in many cases the photon counts are near the edge (say, zero), where a Gaussian width is nonsensical. Could you recommend a source they could use to guide their calculations? My reply: I don’t know a …
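One standard answer to the labmates' question, sketched below, is to sidestep Gaussian weights entirely and maximize the binomial log-likelihood, which behaves correctly for counts at or near zero. The Gaussian transmission profile and its parameters are hypothetical stand-ins, not the lab's actual model:

```python
# Fit a transmission profile by binomial maximum likelihood instead of
# Gaussian-weighted least squares. Model and parameters are hypothetical.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import binom

rng = np.random.default_rng(1)
coord = np.linspace(-3, 3, 61)
n_in = 200                                    # input photons per coordinate

def T_model(x, amp, width):                   # toy transmission profile
    return amp * np.exp(-x**2 / (2 * width**2))

counts = rng.binomial(n_in, T_model(coord, 0.8, 1.2))  # simulated counts

def negloglik(params):
    amp, width = params
    p = np.clip(T_model(coord, amp, width), 1e-9, 1 - 1e-9)
    return -binom.logpmf(counts, n_in, p).sum()

fit = minimize(negloglik, x0=[0.5, 1.0], method="Nelder-Mead")
print(fit.x)  # recovered (amp, width), valid even where counts hit zero
```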
4 0.19111577 405 andrew gelman stats-2010-11-10-Estimation from an out-of-date census
Introduction: Suguru Mizunoya writes: When we estimate the number of people from a national sampling survey (such as a labor force survey) using sampling weights, don’t we obtain an underestimated number of people if the country’s population is growing and the sampling frame is based on old census data? In countries with increasing populations, the probability of inclusion changes over time, but the weights can’t be adjusted frequently because the census takes place only once every five or ten years. I am currently working for UNICEF on a project on estimating the number of out-of-school children in developing countries. The project leader is comfortable using estimates of the number of people from DHS and other surveys. But I am concerned that we may need to adjust the estimated number of people by the population projection; otherwise the estimates will be underestimated. I googled around on this issue, but I could not find the right article or paper on this. My reply: I don’t know if there’s a paper …
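The adjustment being contemplated is a ratio rescaling of the weights by the projection; a toy sketch with invented numbers:

```python
# Ratio-adjust survey weights from an old census frame so they reflect a
# projected current population. All numbers are invented for illustration.
import numpy as np

census_total = 10_000_000        # population at the time of the sampling frame
projected_total = 11_500_000     # current population projection

w = np.array([1200.0, 800.0, 1500.0])  # toy survey weights from the old frame
y = np.array([1, 0, 1])                # indicator: out-of-school child

w_adj = w * (projected_total / census_total)  # rescale to the projection
print("unadjusted weighted total:", (w * y).sum())
print("adjusted weighted total:  ", (w_adj * y).sum())
```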
5 0.18749896 1430 andrew gelman stats-2012-07-26-Some thoughts on survey weighting
Introduction: From a comment I made in an email exchange: My work on survey adjustments has very much been inspired by the ideas of Rod Little. Much of my efforts have gone toward the goal of integrating hierarchical modeling (which is so helpful for small-area estimation) with poststratification (which adjusts for known differences between sample and population). In the surveys I’ve dealt with, nonresponse/nonavailability can be a big issue, and I’ve always tried to emphasize that (a) the probability of a person being included in the sample is just about never known, and (b) even if this probability were known, I’d rather know the empirical n/N than the probability p (which is only valid in expectation). Regarding nonparametric modeling: I haven’t done much of that (although I hope to at some point), but Rod and his students have. As I wrote in the first sentence of the above-linked paper, I do think the current theory and practice of survey weighting is a mess, in that much depends on so …
6 0.16810474 352 andrew gelman stats-2010-10-19-Analysis of survey data: Design based models vs. hierarchical modeling?
7 0.15806609 1858 andrew gelman stats-2013-05-15-Reputations changeable, situations tolerable
8 0.15588696 1205 andrew gelman stats-2012-03-09-Coming to agreement on philosophy of statistics
9 0.15175928 1868 andrew gelman stats-2013-05-23-Validation of Software for Bayesian Models Using Posterior Quantiles
11 0.13473137 1383 andrew gelman stats-2012-06-18-Hierarchical modeling as a framework for extrapolation
12 0.13185199 972 andrew gelman stats-2011-10-25-How do you interpret standard errors from a regression fit to the entire population?
13 0.12916601 1469 andrew gelman stats-2012-08-25-Ways of knowing
14 0.12828358 781 andrew gelman stats-2011-06-28-The holes in my philosophy of Bayesian data analysis
16 0.12167539 251 andrew gelman stats-2010-09-02-Interactions of predictors in a causal model
17 0.11657862 1981 andrew gelman stats-2013-08-14-The robust beauty of improper linear models in decision making
18 0.11636073 244 andrew gelman stats-2010-08-30-Useful models, model checking, and external validation: a mini-discussion
19 0.11426127 246 andrew gelman stats-2010-08-31-Somewhat Bayesian multilevel modeling
20 0.11057238 1779 andrew gelman stats-2013-03-27-“Two Dogmas of Strong Objective Bayesianism”
topicId topicWeight
[(0, 0.164), (1, 0.188), (2, 0.047), (3, -0.037), (4, 0.003), (5, 0.066), (6, -0.064), (7, 0.004), (8, 0.069), (9, -0.049), (10, 0.049), (11, -0.09), (12, -0.037), (13, 0.093), (14, -0.005), (15, -0.003), (16, 0.025), (17, 0.013), (18, -0.01), (19, 0.039), (20, -0.041), (21, 0.008), (22, -0.052), (23, 0.031), (24, -0.015), (25, 0.023), (26, -0.025), (27, -0.004), (28, 0.045), (29, 0.081), (30, 0.037), (31, -0.009), (32, -0.015), (33, 0.041), (34, -0.049), (35, 0.027), (36, -0.011), (37, -0.009), (38, -0.02), (39, 0.016), (40, 0.029), (41, 0.041), (42, 0.057), (43, -0.031), (44, 0.022), (45, -0.033), (46, 0.048), (47, 0.019), (48, 0.024), (49, 0.051)]
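A hedged guess at how the simValue rankings on this page are produced: cosine similarity between per-post topic-weight vectors like the one above. The vectors below are invented:

```python
# Cosine similarity between topic-weight vectors; an assumption about how
# simValue is computed, with toy vectors rather than real topic-model output.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

post_a = np.array([0.164, 0.188, 0.047, -0.037, 0.003])
post_b = np.array([0.150, 0.170, 0.060, -0.020, 0.010])
post_c = np.array([-0.050, 0.010, 0.200, 0.090, -0.080])

print(cosine(post_a, post_a))  # 1.0: a post is maximally similar to itself
print(cosine(post_a, post_b))  # high: related posts
print(cosine(post_a, post_c))  # low: unrelated posts
```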
simIndex simValue blogId blogTitle
same-blog 1 0.97714406 2351 andrew gelman stats-2014-05-28-Bayesian nonparametric weighted sampling inference
2 0.79567212 784 andrew gelman stats-2011-07-01-Weighting and prediction in sample surveys
3 0.76020145 405 andrew gelman stats-2010-11-10-Estimation from an out-of-date census
4 0.75450569 352 andrew gelman stats-2010-10-19-Analysis of survey data: Design based models vs. hierarchical modeling?
Introduction: Alban Zeber writes: Suppose I have survey data from, say, 10 countries, whereby each country collected the data based on different sampling routines, the result being that each country has its own weights that can be used in the analyses. If I analyse the data of each country separately, then I can incorporate the survey design in the analyses, e.g., in Stata one can use svyset …. But what happens when I want to do a pooled analysis of all the data from the 10 countries? Presumably either (1) I analyse the data from each country separately (using multiple or logistic regression, …), accounting for the survey design, and then combine the estimates using a meta-analysis (fixed or random), or (2) I assume that the data from each country are a simple random sample from the population, combine the data from the 10 countries, and then use multilevel or hierarchical models. My question is: which of the methods is likely to give better estimates? Or is the …
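For concreteness, here is a toy sketch of option (1): per-country design-based estimates combined by a fixed-effect, inverse-variance meta-analysis. Option (2) would replace this two-stage procedure with a single hierarchical model. The estimates and standard errors are invented:

```python
# Fixed-effect (inverse-variance) meta-analysis of per-country estimates;
# the numbers are hypothetical.
import numpy as np

est = np.array([0.12, 0.08, 0.15, 0.10])  # per-country survey estimates
se = np.array([0.02, 0.03, 0.04, 0.02])   # design-based standard errors

w = 1.0 / se**2                           # precision weights
pooled = np.sum(w * est) / np.sum(w)
pooled_se = np.sqrt(1.0 / np.sum(w))
print(f"pooled estimate {pooled:.3f} (se {pooled_se:.3f})")
```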
5 0.73466241 288 andrew gelman stats-2010-09-21-Discussion of the paper by Girolami and Calderhead on Bayesian computation
Introduction: Here’s my discussion of this article for the Journal of the Royal Statistical Society: I will comment on this paper in my role as an applied statistician and consumer of Bayesian computation. In the last few years, my colleagues and I have felt the need to fit predictive survey responses given multiple discrete predictors, for example estimating voting given ethnicity and income within each of the fifty states, or estimating public opinion about gay marriage given age, sex, ethnicity, education, and state. We would like to be able to fit such models with ten or more predictors–for example, religion, religious attendance, marital status, and urban/rural/suburban residence in addition to the factors mentioned above. There are (at least) three reasons for fitting a model with many predictive factors and potentially a huge number of interactions among them: 1. Deep interactions can be of substantive interest. For example, Gelman et al. (2009) discuss the importance of interaction …
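The poststratification step behind these fitted models has a simple core, sketched below with invented cell values; producing good cell-level estimates (via deep interactions in the regression) is the hard part:

```python
# Poststratification: average model-based cell estimates with population
# cell counts as weights. Cells and counts are toy values (e.g., ethnicity
# by income cells within one state).
import numpy as np

cell_estimate = np.array([0.55, 0.48, 0.62, 0.40])        # model-based cell means
cell_pop = np.array([120_000, 340_000, 80_000, 60_000])   # census cell counts

state_estimate = np.average(cell_estimate, weights=cell_pop)
print(f"poststratified state estimate: {state_estimate:.3f}")
```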
7 0.72302032 1430 andrew gelman stats-2012-07-26-Some thoughts on survey weighting
8 0.67994076 1511 andrew gelman stats-2012-09-26-What do statistical p-values mean when the sample = the population?
9 0.67973411 1455 andrew gelman stats-2012-08-12-Probabilistic screening to get an approximate self-weighted sample
10 0.67874444 454 andrew gelman stats-2010-12-07-Diabetes stops at the state line?
11 0.67558074 2176 andrew gelman stats-2014-01-19-Transformations for non-normal data
12 0.66354924 725 andrew gelman stats-2011-05-21-People kept emailing me this one so I think I have to blog something
13 0.65795672 1165 andrew gelman stats-2012-02-13-Philosophy of Bayesian statistics: my reactions to Wasserman
15 0.65493876 136 andrew gelman stats-2010-07-09-Using ranks as numbers
17 0.64779174 1371 andrew gelman stats-2012-06-07-Question 28 of my final exam for Design and Analysis of Sample Surveys
18 0.64776897 1868 andrew gelman stats-2013-05-23-Validation of Software for Bayesian Models Using Posterior Quantiles
19 0.64532316 1294 andrew gelman stats-2012-05-01-Modeling y = a + b + c
20 0.64496833 2152 andrew gelman stats-2013-12-28-Using randomized incentives as an instrument for survey nonresponse?
topicId topicWeight
[(16, 0.01), (21, 0.043), (24, 0.12), (53, 0.072), (64, 0.019), (73, 0.029), (84, 0.07), (86, 0.016), (87, 0.126), (88, 0.065), (89, 0.026), (95, 0.029), (99, 0.275)]
simIndex simValue blogId blogTitle
same-blog 1 0.95729858 2351 andrew gelman stats-2014-05-28-Bayesian nonparametric weighted sampling inference
Introduction: The above graph shows the estimated support, by state, for the Employment Nondiscrimination Act, a gay rights bill that the Senate will be voting on this Monday. The estimates were constructed by Kate Krimmel, Jeff Lax, and Justin Phillips using multilevel regression and poststratification. Check out that graph again. The scale goes from 20% to 80%, but every state is in the yellow-to-red range. A law making it illegal to discriminate against gays has majority support in every state. And in most states the support is very strong. And here’s the research paper by Krimmel, Lax, and Phillips, which begins: Public majorities have supported several gay rights policies for some time, yet Congress has responded slowly if at all. We address this puzzle through dyadic analysis of the opinion-vote relationship on 23 roll-call votes between 1993 and 2010, matching members of Congress to policy-specific opinion in their state or district. We also extend the MRP opinion …
3 0.90212834 1868 andrew gelman stats-2013-05-23-Validation of Software for Bayesian Models Using Posterior Quantiles
Introduction: Every once in a while I get a question that I can directly answer from my published research. When that happens it makes me so happy. Here’s an example. Patrick Lam wrote: Suppose one develops a Bayesian model to estimate a parameter theta. Now suppose one wants to evaluate the model via simulation by generating fake data where you know the value of theta and seeing how well you recover theta with your model, assuming that you use the posterior mean as the estimate. The traditional frequentist way of evaluating it might be to generate many datasets and see how well your estimator performs each time in terms of unbiasedness or mean squared error or something. But given that unbiasedness means nothing to a Bayesian and there is no repeated-sampling interpretation in a Bayesian model, how would you suggest one evaluate a Bayesian model? My reply: I actually have a paper on this! It is by Cook, Gelman, and Rubin. The idea is to draw theta from the prior distribution.
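The Cook, Gelman, and Rubin procedure is easy to run for a conjugate model. The normal-mean example below is my own illustration of the idea, not code from the paper: draw theta from the prior, simulate data, record the posterior quantile of the true theta, and check that the quantiles are uniform.

```python
# Validation via posterior quantiles for a normal mean with known variance,
# where the posterior is available in closed form. If the posterior is
# computed correctly, the recorded quantiles are uniform on (0, 1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
mu0, tau, sigma, n = 0.0, 1.0, 1.0, 10
quantiles = []
for _ in range(1000):
    theta = rng.normal(mu0, tau)              # draw theta from the prior
    y = rng.normal(theta, sigma, size=n)      # simulate fake data given theta
    # Conjugate normal posterior via precision algebra.
    prec = 1 / tau**2 + n / sigma**2
    post_mean = (mu0 / tau**2 + y.sum() / sigma**2) / prec
    post_sd = np.sqrt(1 / prec)
    quantiles.append(stats.norm.cdf(theta, post_mean, post_sd))

# A miscoded posterior shows up as non-uniform quantiles.
print(stats.kstest(quantiles, "uniform"))
```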
Introduction: John Kastellec points me to this blog by Ezra Klein criticizing the following graph from a recent Republican Party report: Klein (following Alexander Hart) slams the graph for not going all the way to zero on the y-axis, thus making the projected change seem bigger than it really is. I agree with Klein and Hart that, if you’re gonna do a bar chart, you want the bars to go down to 0. On the other hand, a projected change from 19% to 23% is actually pretty big, and I don’t see the point of using a graphical display that hides it. The solution: Ditch the bar graph entirely and replace it with a lineplot, in particular a time series with year-by-year data. The time series would have several advantages: 1. Data are placed in context. You’d see every year, instead of discrete averages, and you’d get to see the changes in the context of year-to-year variation. 2. With the time series, you can use whatever y-axis works with the data. No need to go to zero. P.S. …
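The suggested replacement is easy to sketch; the series below is invented to show a rise from about 19% to 23% in the context of year-to-year variation:

```python
# A year-by-year lineplot instead of a two-bar chart; the data are invented
# to illustrate the 19% -> 23% change discussed above.
import matplotlib.pyplot as plt

years = list(range(2000, 2021))
share = [18.5, 18.7, 18.6, 18.9, 19.0, 19.1, 18.9, 19.2, 19.4, 19.3,
         19.6, 19.9, 20.3, 20.8, 21.2, 21.6, 22.0, 22.3, 22.6, 22.8, 23.0]

plt.plot(years, share, marker="o")
plt.ylabel("Percent")  # the axis can start near the data; no zero baseline needed
plt.title("Projected share, year by year (toy data)")
plt.show()
```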
5 0.89798337 569 andrew gelman stats-2011-02-12-Get the Data
Introduction: At GetTheData, you can ask and answer data-related questions. Here’s a preview: I’m not sure a Q&A site is the best way to do this. My pipe dream is to create a taxonomy of variables and instances, and collect spreadsheets annotated this way. Imagine doing a search of the type: “give me datasets where an instance is a person and the variables are age, gender, and weight,” and out would come datasets, each one tagged with descriptions of the variables that were held constant for the whole dataset (person_type=student, location=Columbia, time_of_study=1/1/2009, study_type=longitudinal). It would even be possible to automatically convert one variable into another if necessary (like age = time_of_measurement - time_of_birth). Maybe the dream of the Semantic Web will actually be implemented for relatively structured statistical data rather than much fuzzier “knowledge”; just consider the difficulties of developing a universal Freebase. Wolfram|Alpha is perhaps currently the clos …
7 0.89550972 1773 andrew gelman stats-2013-03-21-2.15
8 0.89408755 355 andrew gelman stats-2010-10-20-Andy vs. the Ideal Point Model of Voting
9 0.88638103 233 andrew gelman stats-2010-08-25-Lauryn Hill update
10 0.88470197 583 andrew gelman stats-2011-02-21-An interesting assignment for statistical graphics
12 0.88216436 2136 andrew gelman stats-2013-12-16-Whither the “bet on sparsity principle” in a nonsparse world?
13 0.8804884 1788 andrew gelman stats-2013-04-04-When is there “hidden structure in data” to be discovered?
14 0.87853742 918 andrew gelman stats-2011-09-21-Avoiding boundary estimates in linear mixed models
15 0.87503594 629 andrew gelman stats-2011-03-26-Is it plausible that 1% of people pick a career based on their first name?
16 0.87282503 880 andrew gelman stats-2011-08-30-Annals of spam
17 0.87277943 783 andrew gelman stats-2011-06-30-Don’t stop being a statistician once the analysis is done
18 0.87237769 1960 andrew gelman stats-2013-07-28-More on that machine learning course
19 0.87187904 248 andrew gelman stats-2010-09-01-Ratios where the numerator and denominator both change signs
20 0.87137401 1877 andrew gelman stats-2013-05-30-Infill asymptotics and sprawl asymptotics