
70 andrew gelman stats-2010-06-07-Mister P goes on a date


meta information for this blog

Source: html

Introduction: I recently wrote something on the much-discussed OK Cupid analysis of political attitudes of a huge sample of people in their dating database. My quick comment was that their analysis was interesting, but participants on an online dating site must certainly be far from a random sample of Americans. But suppose I want to not just criticize but also think in a positive direction. OK Cupid’s database is huge, and one thing statistical methods are good at–Bayesian methods in particular–is combining a huge amount of noisy, biased data with a smaller amount of good data. This is what we did in our radon study, using a high-quality survey of 5000 houses in 125 counties to calibrate a set of crappier surveys totaling 80,000 houses in 3000 counties. How would it work for OK Cupid? We’d want to take their data and poststratify on: Age Sex Marital/family status Education Income Partisanship Ideology Political participation Religion and religious attendance State Urban/rural/


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 I recently wrote something on the much-discussed OK Cupid analysis of political attitudes of a huge sample of people in their dating database. [sent-1, score-0.611]

2 My quick comment was that their analysis was interesting, but participants on an online dating site must certainly be far from a random sample of Americans. [sent-2, score-0.547]

3 But suppose I want to not just criticize but also think in a positive direction. [sent-3, score-0.07]

4 OK Cupid’s database is huge, and one thing statistical methods are good at–Bayesian methods in particular–is combining a huge amount of noisy, biased data with a smaller amount of good data. [sent-4, score-0.85]

5 This is what we did in our radon study, using a high-quality survey of 5000 houses in 125 counties to calibrate a set of crappier surveys totaling 80,000 houses in 3000 counties. [sent-5, score-0.826]

6 We’d want to take their data and poststratify on: Age Sex Marital/family status Education Income Partisanship Ideology Political participation Religion and religious attendance State Urban/rural/suburban Probably some other key variables that I’m not thinking of right now. [sent-7, score-0.48]

7 We’d do multilevel regression and poststratification (MRP, “Mister P”), with enough cells that it’s reasonable to think of the OK Cupid people as being a random sample within each cell. [sent-8, score-0.416]

8 This is not a trivial project–it would involve also including Census data and large public opinion surveys such as Annenberg or Pew–but it could be worth it. [sent-9, score-0.404]

9 The goal would be to get the flexibility and power of the OK Cupid analyses, but with the warm feelings that come from matching their sample to the U. [sent-10, score-0.635]

10 Inferences would necessarily be strongly model-based–for example, any claims about married people would be essentially 100% based on regression-based extrapolation–but, hey, that’s the way it is. [sent-13, score-0.211]

11 The goal is to be as honest as possible with the data available. [sent-14, score-0.229]
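The poststratification step described in sentences 6 and 7 can be sketched in a few lines. This is a minimal illustration, not the analysis from the post: the two cell variables, the per-cell estimates, and the census counts below are all made up, and in a real MRP analysis the per-cell estimates would come from a fitted multilevel regression, which is not shown here. The function name `poststratify` is hypothetical.

```python
from typing import Dict, Tuple

# A cell is one combination of poststratification variables.
# Two variables are used here for brevity; the post lists many more.
Cell = Tuple[str, str]  # (age group, marital status), a hypothetical choice


def poststratify(estimates: Dict[Cell, float],
                 census_counts: Dict[Cell, int]) -> float:
    """Return the population-weighted average of per-cell estimates."""
    missing = set(census_counts) - set(estimates)
    if missing:
        raise ValueError(f"no estimate for cells: {sorted(missing)}")
    total = sum(census_counts.values())
    return sum(estimates[cell] * n for cell, n in census_counts.items()) / total


# Made-up per-cell estimates (assumed to come from a multilevel model).
cell_estimates = {
    ("18-29", "single"): 0.70,
    ("18-29", "married"): 0.60,
    ("30+", "single"): 0.50,
    ("30+", "married"): 0.40,
}

# Made-up population counts per cell (assumed to come from Census data).
census_counts = {
    ("18-29", "single"): 15,
    ("18-29", "married"): 5,
    ("30+", "single"): 20,
    ("30+", "married"): 60,
}

national = poststratify(cell_estimates, census_counts)
print(round(national, 3))
```

The weights come only from the population counts, so a cell that is rare on the dating site but common in the population (married people, in the post's example) still gets its full population share. Its estimate then rests almost entirely on the regression model, which is the extrapolation the post warns about.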


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('cupid', 0.649), ('dating', 0.236), ('ok', 0.201), ('houses', 0.165), ('sample', 0.159), ('huge', 0.152), ('surveys', 0.119), ('crappier', 0.118), ('poststratify', 0.118), ('amount', 0.108), ('annenberg', 0.103), ('extrapolation', 0.095), ('goal', 0.093), ('pew', 0.091), ('attendance', 0.09), ('cells', 0.09), ('random', 0.09), ('counties', 0.088), ('trivial', 0.088), ('flexibility', 0.087), ('partisanship', 0.087), ('calibrate', 0.086), ('radon', 0.085), ('mrp', 0.083), ('warm', 0.083), ('mister', 0.082), ('database', 0.081), ('married', 0.081), ('poststratification', 0.077), ('participation', 0.077), ('matching', 0.074), ('feelings', 0.074), ('honest', 0.072), ('religion', 0.072), ('census', 0.072), ('criticize', 0.07), ('combining', 0.07), ('ideology', 0.07), ('biased', 0.069), ('methods', 0.069), ('noisy', 0.068), ('involve', 0.068), ('religious', 0.068), ('would', 0.065), ('data', 0.064), ('political', 0.064), ('status', 0.063), ('site', 0.062), ('sex', 0.06), ('smaller', 0.06)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0 70 andrew gelman stats-2010-06-07-Mister P goes on a date


2 0.12317827 2061 andrew gelman stats-2013-10-14-More on Mister P and how it does what it does

Introduction: Following up on our discussion the other day, Matt Buttice and Ben Highton write: It was nice to see our article mentioned and discussed by Andrew, Jeff Lax, Justin Phillips, and Yair Ghitza on Andrew’s blog in this post on Wednesday. As noted in the post, we recently published an article in Political Analysis on how well multilevel regression and poststratification (MRP) performs at producing estimates of state opinion with conventional national surveys where N≈1,500. Our central claims are that (i) the performance of MRP is highly variable, (ii) in the absence of knowing the true values, it is difficult to determine the quality of the MRP estimates produced on the basis of a single national sample, and, (iii) therefore, our views about the usefulness of MRP in instances where a researcher has a single sample of N≈1,500 are less optimistic than the ones expressed in previous research on the topic. Obviously we were interested in the blog posts. We found them stimulating

3 0.12248286 288 andrew gelman stats-2010-09-21-Discussion of the paper by Girolami and Calderhead on Bayesian computation

Introduction: Here’s my discussion of this article for the Journal of the Royal Statistical Society: I will comment on this paper in my role as applied statistician and consumer of Bayesian computation. In the last few years, my colleagues and I have felt the need to fit predictive survey responses given multiple discrete predictors, for example estimating voting given ethnicity and income within each of the fifty states, or estimating public opinion about gay marriage given age, sex, ethnicity, education, and state. We would like to be able to fit such models with ten or more predictors–for example, religion, religious attendance, marital status, and urban/rural/suburban residence in addition to the factors mentioned above. There are (at least) three reasons for fitting a model with many predictive factors and potentially a huge number of interactions among them: 1. Deep interactions can be of substantive interest. For example, Gelman et al. (2009) discuss the importance of interaction

4 0.12032289 2056 andrew gelman stats-2013-10-09-Mister P: What’s its secret sauce?

Introduction: This is a long and technical post on an important topic: the use of multilevel regression and poststratification (MRP) to estimate state-level public opinion. MRP as a research method, and state-level opinion (or, more generally, attitudes in demographic and geographic subpopulation) as a subject, have both become increasingly important in political science—and soon, I expect, will become increasingly important in other social sciences as well. Being able to estimate state-level opinion from national surveys is just such a powerful thing, that if it can be done, people will do it. It’s taken 15 years or so for the method to really catch on, but the ready availability of survey data and of computing power—as well as our increasing comfort level, as a profession, with these techniques, has made MRP become more of a routine research tool. As a method becomes used more and more widely, there will be natural concerns about its domains of applicability. That is the subject of the pres

5 0.11087326 678 andrew gelman stats-2011-04-25-Democrats do better among the most and least educated groups

Introduction: These are based on raw Pew data, reweighted to adjust for voter turnout by state, income, and ethnicity. No modeling of vote on age, education, and ethnicity. I think our future estimates based on the 9-way model will be better, but these are basically OK, I think. All but six of the dots in the graph are based on sample sizes greater than 30. I published these last year but they’re still relevant, I think. There’s lots of confusion when it comes to education and voting.

6 0.1073962 544 andrew gelman stats-2011-01-29-Splitting the data

7 0.10010792 2359 andrew gelman stats-2014-06-04-All the Assumptions That Are My Life

8 0.09484385 352 andrew gelman stats-2010-10-19-Analysis of survey data: Design based models vs. hierarchical modeling?

9 0.094002321 1934 andrew gelman stats-2013-07-11-Yes, worry about generalizing from data to population. But multilevel modeling is the solution, not the problem

10 0.092366755 1628 andrew gelman stats-2012-12-17-Statistics in a world where nothing is random

11 0.091351204 1365 andrew gelman stats-2012-06-04-Question 25 of my final exam for Design and Analysis of Sample Surveys

12 0.089145221 1787 andrew gelman stats-2013-04-04-Wanna be the next Tyler Cowen? It’s not as easy as you might think!

13 0.088503793 2096 andrew gelman stats-2013-11-10-Schiminovich is on The Simpsons

14 0.084225759 383 andrew gelman stats-2010-10-31-Analyzing the entire population rather than a sample

15 0.083675504 2062 andrew gelman stats-2013-10-15-Last word on Mister P (for now)

16 0.081259951 446 andrew gelman stats-2010-12-03-Is 0.05 too strict as a p-value threshold?

17 0.081013963 1455 andrew gelman stats-2012-08-12-Probabilistic screening to get an approximate self-weighted sample

18 0.080003299 1920 andrew gelman stats-2013-06-30-“Non-statistical” statistics tools

19 0.079981387 695 andrew gelman stats-2011-05-04-Statistics ethics question

20 0.079686634 1149 andrew gelman stats-2012-02-01-Philosophy of Bayesian statistics: my reactions to Cox and Mayo


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.155), (1, 0.013), (2, 0.091), (3, -0.046), (4, 0.026), (5, 0.045), (6, -0.054), (7, 0.008), (8, 0.007), (9, -0.01), (10, 0.022), (11, -0.053), (12, 0.021), (13, 0.071), (14, 0.028), (15, 0.006), (16, -0.016), (17, -0.01), (18, 0.022), (19, 0.0), (20, -0.007), (21, -0.006), (22, -0.039), (23, -0.001), (24, -0.035), (25, -0.033), (26, 0.01), (27, -0.0), (28, 0.009), (29, 0.031), (30, 0.044), (31, -0.051), (32, 0.024), (33, 0.025), (34, -0.008), (35, 0.055), (36, 0.026), (37, 0.007), (38, -0.029), (39, -0.003), (40, 0.076), (41, -0.008), (42, 0.014), (43, -0.029), (44, 0.019), (45, -0.027), (46, 0.019), (47, 0.026), (48, -0.007), (49, 0.011)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.9493534 70 andrew gelman stats-2010-06-07-Mister P goes on a date


2 0.786062 405 andrew gelman stats-2010-11-10-Estimation from an out-of-date census

Introduction: Suguru Mizunoya writes: When we estimate the number of people from a national sampling survey (such as labor force survey) using sampling weights, don’t we obtain underestimated number of people, if the country’s population is growing and the sampling frame is based on an old census data? In countries with increasing populations, the probability of inclusion changes over time, but the weights can’t be adjusted frequently because census takes place only once every five or ten years. I am currently working for UNICEF for a project on estimating number of out-of-school children in developing countries. The project leader is comfortable to use estimates of number of people from DHS and other surveys. But, I am concerned that we may need to adjust the estimated number of people by the population projection, otherwise the estimates will be underestimated. I googled around on this issue, but I could not find a right article or paper on this. My reply: I don’t know if there’s a pa

3 0.76104337 1455 andrew gelman stats-2012-08-12-Probabilistic screening to get an approximate self-weighted sample

Introduction: Sharad had a survey sampling question: We’re trying to use mechanical turk to conduct some surveys, and have quickly discovered that turkers tend to be quite young. We’d really like a representative sample of the U.S., or at the least be able to recruit a diverse enough sample from turk that we can post-stratify to adjust the estimates. The approach we ended up taking is to pay turkers a small amount to answer a couple of screening questions (age & sex), and then probabilistically recruit individuals to complete the full survey (for more money) based on the estimated turk population parameters and our desired target distribution. We use rejection sampling, so the end result is that individuals who are invited to take the full survey look as if they came from a representative sample, at least in terms of age and sex. I’m wondering whether this sort of technique—a two step design in which participants are first screened and then probabilistically selected to mimic a target distributio

4 0.74160051 352 andrew gelman stats-2010-10-19-Analysis of survey data: Design based models vs. hierarchical modeling?

Introduction: Alban Zeber writes: Suppose I have survey data from say 10 countries whereby each country collected the data based on different sampling routines – the results of this being that each country has its own weights for the data that can be used in the analyses. If I analyse the data of each country separately then I can incorporate the survey design in the analyses, e.g. in Stata one can use svyset ….. But what happens when I want to do a pooled analysis of all the data from the 10 countries: Presumably either 1. I analyse the data from each country separately (using multiple or logistic regression, …) accounting for the survey design and then combine the estimates using a meta analysis (fixed or random) OR 2. Assume that the data from each country is a simple random sample from the population, combine the data from the 10 countries and then use multilevel or hierarchical models My question is which of the methods is likely to give better estimates? Or is the

5 0.73982537 454 andrew gelman stats-2010-12-07-Diabetes stops at the state line?

Introduction: From Discover : Razib Khan asks: But follow the gradient from El Paso to the Illinois-Missouri border. The differences are small across state lines, but the consistent differences along the borders really don’t make. Are there state-level policies or regulations causing this? Or, are there state-level differences in measurement? This weird pattern shows up in other CDC data I’ve seen. Turns out that CDC isn’t providing data , they’re providing model . Frank Howland answered: I suspect the answer has to do with the manner in which the county estimates are produced. I went to the original data source, the CDC, and then to the relevant FAQ . There they say that the diabetes prevalence estimates come from the “CDC’s Behavioral Risk Factor Surveillance System (BRFSS) and data from the U.S. Census Bureau’s Population Estimates Program. The BRFSS is an ongoing, monthly, state-based telephone survey of the adult population. The survey provides state-specific informati

6 0.73760957 2056 andrew gelman stats-2013-10-09-Mister P: What’s its secret sauce?

7 0.7093156 2061 andrew gelman stats-2013-10-14-More on Mister P and how it does what it does

8 0.70690894 784 andrew gelman stats-2011-07-01-Weighting and prediction in sample surveys

9 0.70479441 1511 andrew gelman stats-2012-09-26-What do statistical p-values mean when the sample = the population?

10 0.69572848 1294 andrew gelman stats-2012-05-01-Modeling y = a + b + c

11 0.6923914 972 andrew gelman stats-2011-10-25-How do you interpret standard errors from a regression fit to the entire population?

12 0.68939143 769 andrew gelman stats-2011-06-15-Mr. P by another name . . . is still great!

13 0.68810046 2167 andrew gelman stats-2014-01-10-Do you believe that “humans and other living things have evolved over time”?

14 0.68586928 977 andrew gelman stats-2011-10-27-Hack pollster Doug Schoen illustrates a general point: The #1 way to lie with statistics is . . . to just lie!

15 0.68343377 385 andrew gelman stats-2010-10-31-Wacky surveys where they don’t tell you the questions they asked

16 0.68206191 1365 andrew gelman stats-2012-06-04-Question 25 of my final exam for Design and Analysis of Sample Surveys

17 0.67903906 1940 andrew gelman stats-2013-07-16-A poll that throws away data???

18 0.67811638 1289 andrew gelman stats-2012-04-29-We go to war with the data we have, not the data we want

19 0.67690134 820 andrew gelman stats-2011-07-25-Design of nonrandomized cluster sample study

20 0.67562628 107 andrew gelman stats-2010-06-24-PPS in Georgia


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(2, 0.056), (15, 0.03), (16, 0.016), (17, 0.012), (21, 0.011), (24, 0.143), (36, 0.038), (43, 0.202), (47, 0.01), (63, 0.033), (86, 0.015), (95, 0.01), (97, 0.011), (99, 0.288)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.96123254 1754 andrew gelman stats-2013-03-08-Cool GSS training video! And cumulative file 1972-2012!

Introduction: Felipe Osorio made the above video to help people use the General Social Survey and R to answer research questions in social science. Go for it! Meanwhile, Tom Smith reports: The initial release of the General Social Survey (GSS), cumulative file for 1972-2012 is now on our website . Codebooks and copies of questionnaires will be posted shortly. Later additional files including the GSS reinterview panels and additional variables in the cumulative file will be added. P.S. R scripts are here .

2 0.95592254 1077 andrew gelman stats-2011-12-21-In which I compare “POLITICO’s chief political columnist” unfavorably to a cranky old dead guy and one of the funniest writers who’s ever lived

Introduction: Neil Malhotra writes: I just wanted to alert to this completely misinformed Politico article by Roger Simon, equating sampling theory with “magic.” Normally, I wouldn’t send you this, but I sent him a helpful email and he was a complete jerk about it. Wow—this is really bad. It’s so bad I refuse to link to it. I don’t know who this dude is, but it’s pitiful. Andy Rooney could do better. And I don’t mean Andy Rooney in his prime, I mean Andy Rooney right now. The piece appears to be an attempt at jocularity, but it’s about 10 million times worse than whatever the worst thing is that Dave Barry has ever written. My question to Neil Malhotra is . . . what made you click on this in the first place? P.S. John Sides piles on with some Gallup quotes.

3 0.9524495 314 andrew gelman stats-2010-10-03-Disconnect between drug and medical device approval

Introduction: Sanjay Kaul writes: By statute (“the least burdensome” pathway), the approval standard for devices by the US FDA is lower than for drugs. Before a new drug can be marketed, the sponsor must show “substantial evidence of effectiveness” as based on two or more well-controlled clinical studies (which literally means 2 trials, each with a p value of <0.05, or 1 large trial with a robust p value <0.00125). In contrast, the sponsor of a new device, especially those that are designated as high-risk (Class III) device, need only demonstrate "substantial equivalence" to an FDA-approved device via the 510(k) exemption or a "reasonable assurance of safety and effectiveness", evaluated through a pre-market approval and typically based on a single study. What does “reasonable assurance” or “substantial equivalence” imply to you as a Bayesian? These are obviously qualitative constructs, but if one were to quantify them, how would you go about addressing it? The regulatory definitions for

4 0.93492442 1707 andrew gelman stats-2013-02-05-Glenn Hubbard and I were on opposite sides of a court case and I didn’t even know it!

Introduction: Matt Taibbi writes : Glenn Hubbard, Leading Academic and Mitt Romney Advisor, Took $1200 an Hour to Be Countrywide’s Expert Witness . . . Hidden among the reams of material recently filed in connection with the lawsuit of monoline insurer MBIA against Bank of America and Countrywide is a deposition of none other than Columbia University’s Glenn Hubbard. . . . Hubbard testified on behalf of Countrywide in the MBIA suit. He conducted an “analysis” that essentially concluded that Countrywide’s loans weren’t any worse than the loans produced by other mortgage originators, and that therefore the monstrous losses that investors in those loans suffered were due to other factors related to the economic crisis – and not caused by the serial misrepresentations and fraud in Countrywide’s underwriting. That’s interesting, because I worked on the other side of this case! I was hired by MBIA’s lawyers. It wouldn’t be polite of me to reveal my consulting rate, and I never actually got depose

5 0.93401003 857 andrew gelman stats-2011-08-17-Bayes pays

Introduction: George Leckie writes: The Centre for Multilevel Modelling at the University of Bristol is seeking to appoint an applied statistician to work on a new ESRC-funded project, Longitudinal Effects, Multilevel Modelling and Applications (LEMMA 3). LEMMA 3 is one of six Nodes of the National Centre for Research Methods (NCRM). The LEMMA 3 Node will focus on methods for the analysis of longitudinal data. The appointment, at Research Assistant or Research Associate level, will be for 2.5 years with likelihood of extension to the end of September 2014. For further details, including information on how to apply online, please go to http://www.bris.ac.uk/boris/jobs/feeds/ads?ID=100571 By “modelling,” I think he means “modeling.” And by “centre,” I think he means “center.” But I think you get the basic idea. It looks like a great place to do research.

6 0.92591882 1347 andrew gelman stats-2012-05-27-Macromuddle

7 0.91524613 601 andrew gelman stats-2011-03-05-Against double-blind reviewing: Political science and statistics are not like biology and physics

8 0.91475463 1253 andrew gelman stats-2012-04-08-Technology speedup graph

same-blog 9 0.91300905 70 andrew gelman stats-2010-06-07-Mister P goes on a date

10 0.89296722 2330 andrew gelman stats-2014-05-12-Historical Arc of Universities

11 0.89092642 1920 andrew gelman stats-2013-06-30-“Non-statistical” statistics tools

12 0.88867158 1882 andrew gelman stats-2013-06-03-The statistical properties of smart chains (and referral chains more generally)

13 0.88206351 806 andrew gelman stats-2011-07-17-6 links

14 0.88191843 1860 andrew gelman stats-2013-05-17-How can statisticians help psychologists do their research better?

15 0.88004291 75 andrew gelman stats-2010-06-08-“Is the cyber mob a threat to freedom?”

16 0.86997336 538 andrew gelman stats-2011-01-25-Postdoc Position #2: Hierarchical Modeling and Statistical Graphics

17 0.867948 770 andrew gelman stats-2011-06-15-Still more Mr. P in public health

18 0.86616254 2338 andrew gelman stats-2014-05-19-My short career as a Freud expert

19 0.86324346 1815 andrew gelman stats-2013-04-20-Displaying inferences from complex models

20 0.86103505 481 andrew gelman stats-2010-12-22-The Jumpstart financial literacy survey and the different purposes of tests