2343 andrew gelman stats-2014-05-22-Big Data needs Big Model


meta info for this blog

Source: html

Introduction: Gary Marcus and Ernest Davis wrote this useful news article on the promise and limitations of “big data.” And let me add this related point: Big data are typically not random samples, hence the need for “big model” to map from sample to population. Here’s an example (with Wei Wang, David Rothschild, and Sharad Goel): Election forecasts have traditionally been based on representative polls, in which randomly sampled individuals are asked for whom they intend to vote. While representative polling has historically proven to be quite effective, it comes at considerable financial and time costs. Moreover, as response rates have declined over the past several decades, the statistical benefits of representative sampling have diminished. In this paper, we show that with proper statistical adjustment, non-representative polls can be used to generate accurate election forecasts, and often faster and at less expense than traditional survey methods. We demonstrate this approach


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Gary Marcus and Ernest Davis wrote this useful news article on the promise and limitations of “big data.” [sent-1, score-0.274]

2 Here’s an example (with Wei Wang, David Rothschild, and Sharad Goel): Election forecasts have traditionally been based on representative polls, in which randomly sampled individuals are asked for whom they intend to vote. [sent-3, score-0.886]

3 While representative polling has historically proven to be quite effective, it comes at considerable financial and time costs. [sent-4, score-0.673]

4 Moreover, as response rates have declined over the past several decades, the statistical benefits of representative sampling have diminished. [sent-5, score-0.319]

5 In this paper, we show that with proper statistical adjustment, non-representative polls can be used to generate accurate election forecasts, and often faster and at less expense than traditional survey methods. [sent-6, score-1.119]

6 We demonstrate this approach by creating forecasts from a novel and highly non-representative survey dataset: a series of daily voter intention polls for the 2012 presidential election conducted on the Xbox gaming platform. [sent-7, score-1.636]

7 After adjusting the Xbox responses via multilevel regression and poststratification, we obtain estimates in line with forecasts from leading poll analysts, which were based on aggregating hundreds of traditional polls conducted during the election cycle. [sent-8, score-1.445]

8 We conclude by arguing that non-representative polling shows promise not only for election forecasting, but also for measuring public opinion on a broad range of social, economic and cultural issues. [sent-9, score-0.872]
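To make the "map from sample to population" idea concrete, here is a minimal sketch of the poststratification step only. All cell names and numbers are invented for illustration; in the paper the per-cell estimates come from a multilevel regression, which this sketch does not fit, and the helper names `raw_mean` and `poststratify` are hypothetical.

```python
# Hypothetical demographic cells. "support" stands in for a model-based
# estimate of candidate support within the cell (invented numbers).
# "sample_n" is how many respondents the skewed sample has in the cell;
# "population_n" is the cell's size in the target population.
cells = [
    {"group": "men 18-29",   "support": 0.40, "sample_n": 600, "population_n": 100},
    {"group": "men 30+",     "support": 0.45, "sample_n": 300, "population_n": 400},
    {"group": "women 18-29", "support": 0.55, "sample_n": 80,  "population_n": 100},
    {"group": "women 30+",   "support": 0.60, "sample_n": 20,  "population_n": 400},
]


def raw_mean(cells):
    """Support weighted by sample counts, i.e. what the unadjusted sample says."""
    total = sum(c["sample_n"] for c in cells)
    return sum(c["support"] * c["sample_n"] for c in cells) / total


def poststratify(cells):
    """Support weighted by population counts, i.e. the poststratified estimate."""
    total = sum(c["population_n"] for c in cells)
    return sum(c["support"] * c["population_n"] for c in cells) / total


print("unadjusted:    ", round(raw_mean(cells), 3))
print("poststratified:", round(poststratify(cells), 3))
```

With these made-up numbers the unadjusted figure is pulled toward the over-represented group (young men), while the poststratified figure reweights each cell to its population share. How good the result is depends on how good the per-cell estimates are, which is where the "big model" comes in.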


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('forecasts', 0.314), ('election', 0.287), ('polls', 0.285), ('xbox', 0.281), ('representative', 0.23), ('promise', 0.194), ('polling', 0.166), ('conducted', 0.142), ('traditional', 0.13), ('rothschild', 0.128), ('wei', 0.128), ('aggregating', 0.121), ('ernest', 0.115), ('goel', 0.108), ('gaming', 0.105), ('intention', 0.103), ('big', 0.102), ('sharad', 0.101), ('historically', 0.101), ('marcus', 0.101), ('davis', 0.099), ('expense', 0.097), ('intend', 0.097), ('wang', 0.094), ('survey', 0.09), ('analysts', 0.089), ('declined', 0.089), ('considerable', 0.088), ('proven', 0.088), ('adjusting', 0.088), ('traditionally', 0.087), ('moreover', 0.086), ('sampled', 0.085), ('poststratification', 0.084), ('voter', 0.082), ('limitations', 0.08), ('faster', 0.079), ('adjustment', 0.078), ('proper', 0.078), ('novel', 0.078), ('obtain', 0.078), ('measuring', 0.077), ('gary', 0.076), ('creating', 0.075), ('forecasting', 0.075), ('daily', 0.075), ('broad', 0.074), ('cultural', 0.074), ('randomly', 0.073), ('generate', 0.073)]
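The page does not say how these weights were computed. As a rough sketch, assuming plain term counts, a log inverse document frequency, and L2 normalisation (all three are assumptions, as are the toy corpus and the function name `tfidf_top_words`), a top-N word list of this shape could be produced like this:

```python
import math
from collections import Counter


def tfidf_top_words(docs, doc_index, top_n=5):
    """Return the top_n (word, weight) pairs for one document.

    Weight = term count * log(N / document frequency), then L2-normalised.
    This is one common tf-idf variant, not necessarily the one used here.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each word appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    tf = Counter(tokenized[doc_index])
    weights = {w: count * math.log(n_docs / df[w]) for w, count in tf.items()}

    norm = math.sqrt(sum(v * v for v in weights.values()))
    if norm == 0:
        return []

    # Sort by descending weight, breaking ties alphabetically.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(w, round(v / norm, 3)) for w, v in ranked[:top_n]]


# Toy corpus, invented for illustration.
docs = [
    "xbox polls forecast the election",
    "polls move before the election",
    "the stock market is a random walk",
]
print(tfidf_top_words(docs, 0, top_n=3))
```

Words that appear in every document (here "the") get weight zero, and words unique to one document rise to the top. That matches the list above, where "forecasts" and "xbox" outrank a common word like "big".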

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999988 2343 andrew gelman stats-2014-05-22-Big Data needs Big Model

2 0.29049802 1570 andrew gelman stats-2012-11-08-Poll aggregation and election forecasting

Introduction: At the sister blog, Henry writes about poll averaging and election forecasts. Henry writes that “These models need to crunch lots of polls, at the state and national level, if they’re going to provide good predictions.” Actually, you can get reasonable predictions from national-level forecasting models plus previous state-level election results, then when the election comes closer you can use national and state polls as needed. See my paper with Kari Lock, Bayesian combination of state polls and election forecasts. (That said, the method in that paper is fairly complicated, much more so than simply taking weighted averages of state polls, if such abundant data happen to be available. And I’m sure our approach would need to be altered if it were used for real-time forecasts.) Having a steady supply of polls of varying quality from various sources allows poll aggregators to produce news every day (in the sense of pushing their estimates around) but it doesn’t help much with a

3 0.21223372 1512 andrew gelman stats-2012-09-27-A Non-random Walk Down Campaign Street

Introduction: Political campaigns are commonly understood as random walks, during which, at any point in time, the level of support for any party or candidate is equally likely to go up or down. Each shift in the polls is then interpreted as the result of some combination of news and campaign strategies. A completely different story of campaigns is the mean reversion model in which the elections are determined by fundamental factors of the economy and partisanship; the role of the campaign is to give voters a chance to reach their predetermined positions. The popularity of the random walk model for polls may be partially explained via analogy to the widespread idea that stock prices reflect all available information, as popularized in Burton Malkiel’s book, A Random Walk Down Wall Street. Once the idea has sunk in that short-term changes in the stock market are inherently unpredictable, it is natural for journalists to think the same of polls. For example, political analyst Nate Silver wrote

4 0.20456369 270 andrew gelman stats-2010-09-12-Comparison of forecasts for the 2010 congressional elections

Introduction: Yesterday at the sister blog, Nate Silver forecast that the Republicans have a two-thirds chance of regaining the House of Representatives in the upcoming election, with an expected gain of 45 House seats. Last month, Bafumi, Erikson, and Wlezien released their forecast that gives the Republicans an 80% chance of takeover and an expected gain of 50 seats. As all the above writers emphasize, these forecasts are full of uncertainty, so I treat the two predictions–a 45-seat swing or a 50-seat swing–as essentially identical at the national level. And, as regular readers know, as far back as a year ago, the generic Congressional ballot (those questions of the form, “Which party do you plan to vote for in November?”) was also pointing to big Republican gains. As Bafumi et al. point out, early generic polls are strongly predictive of the election outcome, but they need to be interpreted carefully. The polls move in a generally predictable manner during the year leading up to an

5 0.2039603 391 andrew gelman stats-2010-11-03-Some thoughts on election forecasting

Introduction: I’ve written a lot on polls and elections (“a poll is a snapshot, not a forecast,” etc., or see here for a more technical paper with Kari Lock) but had a few things to add in light of Sam Wang’s recent efforts. As a biologist with a physics degree, Wang brings an outsider’s perspective to political forecasting, which can be a good thing. (I’m a bit of an outsider to political science myself, as is my sometime collaborator Nate Silver, who’s done a lot of good work in the past few years.) But there are two places where Wang misses the point, I think. He refers to his method as a “transparent, low-assumption calculation” and compares it favorably to “fancy modeling” and “assumption-laden models.” Assumptions are a bad thing, right? Well, no, I don’t think so. Bad assumptions are a bad thing. Good assumptions are just fine. Similarly for fancy modeling. I don’t see why a model should get credit for not including a factor that might be important. Let me clarify. I

6 0.17046122 210 andrew gelman stats-2010-08-16-What I learned from those tough 538 commenters

7 0.17019962 364 andrew gelman stats-2010-10-22-Politics is not a random walk: Momentum and mean reversion in polling

8 0.13320819 131 andrew gelman stats-2010-07-07-A note to John

9 0.12423159 1564 andrew gelman stats-2012-11-06-Choose your default, or your default will choose you (election forecasting edition)

10 0.11897452 300 andrew gelman stats-2010-09-28-A calibrated Cook gives Dems the edge in Nov, sez Sandy

11 0.1129095 389 andrew gelman stats-2010-11-01-Why it can be rational to vote

12 0.1129095 1565 andrew gelman stats-2012-11-06-Why it can be rational to vote

13 0.11203999 2173 andrew gelman stats-2014-01-15-Postdoc involving pathbreaking work in MRP, Stan, and the 2014 election!

14 0.11184336 1567 andrew gelman stats-2012-11-07-Election reports

15 0.10784683 2221 andrew gelman stats-2014-02-23-Postdoc with Huffpost Pollster to do Bayesian poll tracking

16 0.10219335 1946 andrew gelman stats-2013-07-19-Prior distributions on derived quantities rather than on parameters themselves

17 0.10189509 649 andrew gelman stats-2011-04-05-Internal and external forecasting

18 0.099125482 784 andrew gelman stats-2011-07-01-Weighting and prediction in sample surveys

19 0.097604349 292 andrew gelman stats-2010-09-23-Doug Hibbs on the fundamentals in 2010

20 0.096780211 288 andrew gelman stats-2010-09-21-Discussion of the paper by Girolami and Calderhead on Bayesian computation


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.129), (1, 0.005), (2, 0.142), (3, 0.017), (4, -0.021), (5, 0.05), (6, -0.111), (7, -0.04), (8, -0.019), (9, -0.02), (10, 0.111), (11, -0.031), (12, 0.019), (13, 0.001), (14, -0.093), (15, -0.024), (16, -0.019), (17, 0.021), (18, 0.025), (19, 0.013), (20, -0.06), (21, 0.055), (22, -0.049), (23, 0.089), (24, -0.005), (25, 0.001), (26, 0.042), (27, 0.005), (28, -0.003), (29, 0.105), (30, -0.053), (31, 0.02), (32, -0.022), (33, -0.04), (34, -0.008), (35, 0.064), (36, 0.022), (37, 0.007), (38, 0.034), (39, -0.015), (40, -0.041), (41, 0.078), (42, 0.033), (43, -0.055), (44, -0.031), (45, 0.037), (46, -0.004), (47, -0.048), (48, 0.043), (49, -0.004)]
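The page does not state how simValue is derived from these topic weights. A common choice for comparing topic vectors is cosine similarity, so the sketch below assumes that. The second vector is hypothetical, and `cosine_similarity` is an illustrative helper, not the site's code. The function accepts (topicId, weight) pairs, so it works for dense lsi vectors like the one above and for sparse lda vectors alike.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two topic vectors given as (topicId, weight) pairs.

    Topics missing from a vector are treated as weight 0.
    """
    da, db = dict(a), dict(b)
    dot = sum(w * db.get(topic, 0.0) for topic, w in da.items())
    norm_a = math.sqrt(sum(w * w for w in da.values()))
    norm_b = math.sqrt(sum(w * w for w in db.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# First three lsi weights from the list above.
this_post = [(0, 0.129), (1, 0.005), (2, 0.142)]
# Hypothetical weights for some other post (invented for illustration).
other_post = [(0, 0.110), (2, 0.150), (5, 0.020)]

print(round(cosine_similarity(this_post, other_post), 3))
```

Ranking every other post by this score and keeping the top 20 would give a list shaped like the ones on this page, though the actual pipeline may differ.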

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97962952 2343 andrew gelman stats-2014-05-22-Big Data needs Big Model

2 0.81087279 270 andrew gelman stats-2010-09-12-Comparison of forecasts for the 2010 congressional elections

Introduction: Yesterday at the sister blog, Nate Silver forecast that the Republicans have a two-thirds chance of regaining the House of Representatives in the upcoming election, with an expected gain of 45 House seats. Last month, Bafumi, Erikson, and Wlezien released their forecast that gives the Republicans an 80% chance of takeover and an expected gain of 50 seats. As all the above writers emphasize, these forecasts are full of uncertainty, so I treat the two predictions–a 45-seat swing or a 50-seat swing–as essentially identical at the national level. And, as regular readers know, as far back as a year ago, the generic Congressional ballot (those questions of the form, “Which party do you plan to vote for in November?”) was also pointing to big Republican gains. As Bafumi et al. point out, early generic polls are strongly predictive of the election outcome, but they need to be interpreted carefully. The polls move in a generally predictable manner during the year leading up to an

3 0.80519181 1570 andrew gelman stats-2012-11-08-Poll aggregation and election forecasting

Introduction: At the sister blog, Henry writes about poll averaging and election forecasts. Henry writes that “These models need to crunch lots of polls, at the state and national level, if they’re going to provide good predictions.” Actually, you can get reasonable predictions from national-level forecasting models plus previous state-level election results, then when the election comes closer you can use national and state polls as needed. See my paper with Kari Lock, Bayesian combination of state polls and election forecasts. (That said, the method in that paper is fairly complicated, much more so than simply taking weighted averages of state polls, if such abundant data happen to be available. And I’m sure our approach would need to be altered if it were used for real-time forecasts.) Having a steady supply of polls of varying quality from various sources allows poll aggregators to produce news every day (in the sense of pushing their estimates around) but it doesn’t help much with a

4 0.7907837 364 andrew gelman stats-2010-10-22-Politics is not a random walk: Momentum and mean reversion in polling

Introduction: Nate Silver and Justin Wolfers are having a friendly blog-dispute about momentum in political polling. Nate and Justin each make good points but are also missing parts of the picture. These questions relate to my own research so I thought I’d discuss them here. There ain’t no mo’ Nate led off the discussion by writing that pundits are always talking about “momentum” in the polls: Turn on the news or read through much of the analysis put out by some of our friends, and you’re likely to hear a lot of talk about “momentum”: the term is used about 60 times per day by major media outlets in conjunction with articles about polling. When people say a particular candidate has momentum, what they are implying is that present trends are likely to perpetuate themselves into the future. Say, for instance, that a candidate trailed by 10 points in a poll three weeks ago — and now a new poll comes out showing the candidate down by just 5 points. It will frequently be said that this

5 0.76747185 1512 andrew gelman stats-2012-09-27-A Non-random Walk Down Campaign Street

Introduction: Political campaigns are commonly understood as random walks, during which, at any point in time, the level of support for any party or candidate is equally likely to go up or down. Each shift in the polls is then interpreted as the result of some combination of news and campaign strategies. A completely different story of campaigns is the mean reversion model in which the elections are determined by fundamental factors of the economy and partisanship; the role of the campaign is to give voters a chance to reach their predetermined positions. The popularity of the random walk model for polls may be partially explained via analogy to the widespread idea that stock prices reflect all available information, as popularized in Burton Malkiel’s book, A Random Walk Down Wall Street. Once the idea has sunk in that short-term changes in the stock market are inherently unpredictable, it is natural for journalists to think the same of polls. For example, political analyst Nate Silver wrote

6 0.76672351 300 andrew gelman stats-2010-09-28-A calibrated Cook gives Dems the edge in Nov, sez Sandy

7 0.74025869 131 andrew gelman stats-2010-07-07-A note to John

8 0.71750885 43 andrew gelman stats-2010-05-19-What do Tuesday’s elections tell us about November?

9 0.67421585 292 andrew gelman stats-2010-09-23-Doug Hibbs on the fundamentals in 2010

10 0.67169201 200 andrew gelman stats-2010-08-11-Separating national and state swings in voting and public opinion, or, How I avoided blogorific embarrassment: An agony in four acts

11 0.67072618 406 andrew gelman stats-2010-11-10-Translating into Votes: The Electoral Impact of Spanish-Language Ballots

12 0.65579814 391 andrew gelman stats-2010-11-03-Some thoughts on election forecasting

13 0.6280517 649 andrew gelman stats-2011-04-05-Internal and external forecasting

14 0.61654437 237 andrew gelman stats-2010-08-27-Bafumi-Erikson-Wlezien predict a 50-seat loss for Democrats in November

15 0.59444857 2005 andrew gelman stats-2013-09-02-“Il y a beaucoup de candidats démocrates, et leurs idéologies ne sont pas très différentes. Et la participation est imprévisible.”

16 0.57248026 1000 andrew gelman stats-2011-11-10-Forecasting 2012: How much does ideology matter?

17 0.56968242 210 andrew gelman stats-2010-08-16-What I learned from those tough 538 commenters

18 0.56158918 142 andrew gelman stats-2010-07-12-God, Guns, and Gaydar: The Laws of Probability Push You to Overestimate Small Groups

19 0.54781777 1940 andrew gelman stats-2013-07-16-A poll that throws away data???

20 0.54586643 2167 andrew gelman stats-2014-01-10-Do you believe that “humans and other living things have evolved over time”?


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(5, 0.072), (15, 0.016), (16, 0.044), (18, 0.01), (24, 0.047), (34, 0.08), (39, 0.022), (45, 0.01), (50, 0.04), (57, 0.043), (59, 0.021), (85, 0.013), (86, 0.171), (89, 0.012), (99, 0.272)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.9591614 2343 andrew gelman stats-2014-05-22-Big Data needs Big Model

2 0.92637569 76 andrew gelman stats-2010-06-09-Both R and Stata

Introduction: A student I’m working with writes: I was planning on getting an applied stat text as a desk reference, and for that I’m assuming you’d recommend your own book. Also, being an economics student, I was initially planning on doing my analysis in STATA, but I noticed on your blog that you use R, and apparently so does the rest of the statistics profession. Would you rather I do my programming in R this summer, or does it not matter? It doesn’t look too hard to learn, so just let me know what’s most convenient for you. My reply: Yes, I recommend my book with Jennifer Hill. Also the book by John Fox, An R and S-plus Companion to Applied Regression, is a good way to get into R. I recommend you use both Stata and R. If you’re already familiar with Stata, then stick with it–it’s a great system for working with big datasets. You can grab your data in Stata, do some basic manipulations, then save a smaller dataset to read into R (using R’s read.dta() function). Once you want to make fu

3 0.92568886 904 andrew gelman stats-2011-09-13-My wikipedia edit

Introduction: The other day someone mentioned my complaint about the Wikipedia article on “Bayesian inference” (see footnote 1 of this article) and he said I should fix the Wikipedia entry myself. And so I did. I didn’t have the energy to rewrite the whole article–in particular, all of its examples involve discrete parameters, whereas the Bayesian problems I work on generally have continuous parameters, and its “mathematical foundations” section focuses on “independent identically distributed observations x” rather than data y which can have different distributions. It’s just a wacky, unbalanced article. But I altered the first few paragraphs to get rid of the stuff about the posterior probability that a model is true. I much prefer the Scholarpedia article on Bayesian statistics by David Spiegelhalter and Kenneth Rice, but I couldn’t bring myself to simply delete the Wikipedia article and replace it with the Scholarpedia content. Just to be clear: I’m not at all trying to disparage

4 0.92156821 873 andrew gelman stats-2011-08-26-Luck or knowledge?

Introduction: Joan Ginther has won the Texas lottery four times. First, she won $5.4 million, then a decade later, she won $2 million, then two years later $3 million and in the summer of 2010, she hit a $10 million jackpot. The odds of this have been calculated at one in eighteen septillion and luck like this could only come once every quadrillion years. According to Forbes, the residents of Bishop, Texas, seem to believe God was behind it all. The Texas Lottery Commission told Mr Rich that Ms Ginther must have been ‘born under a lucky star’, and that they don’t suspect foul play. Harper’s reporter Nathaniel Rich recently wrote an article about Ms Ginther, which calls the validity of her ‘luck’ into question. First, he points out, Ms Ginther is a former math professor with a PhD from Stanford University specialising in statistics. More at Daily Mail. [Edited Saturday] In comments, C Ryan King points to the original article at Harper’s and Bill Jefferys to Wired.

5 0.91782308 1718 andrew gelman stats-2013-02-11-Toward a framework for automatic model building

Introduction: Patrick Caldon writes: I saw your recent blog post where you discussed in passing an iterative-chain-of models approach to AI. I essentially built such a thing for my PhD thesis – not in a Bayesian context, but in a logic programming context – and proved it had a few properties and showed how you could solve some toy problems. The important bit of my framework was that at various points you also go and get more data in the process – in a statistical context this might be seen as building a little univariate model on a subset of the data, then iteratively extending into a better model with more data and more independent variables – a generalized forward stepwise regression if you like. It wrapped a proper computational framework around E.M. Gold’s identification/learning in the limit based on a logic my advisor (Eric Martin) had invented. What’s not written up in the thesis is a few months of failed struggle trying to shoehorn some simple statistical inference into this

6 0.91059625 253 andrew gelman stats-2010-09-03-Gladwell vs Pinker

7 0.90794635 515 andrew gelman stats-2011-01-13-The Road to a B

8 0.90484667 1586 andrew gelman stats-2012-11-21-Readings for a two-week segment on Bayesian modeling?

9 0.90451461 276 andrew gelman stats-2010-09-14-Don’t look at just one poll number–unless you really know what you’re doing!

10 0.90430832 1547 andrew gelman stats-2012-10-25-College football, voting, and the law of large numbers

11 0.90400046 1530 andrew gelman stats-2012-10-11-Migrating your blog from Movable Type to WordPress

12 0.89980364 769 andrew gelman stats-2011-06-15-Mr. P by another name . . . is still great!

13 0.8979463 759 andrew gelman stats-2011-06-11-“2 level logit with 2 REs & large sample. computational nightmare – please help”

14 0.89288306 866 andrew gelman stats-2011-08-23-Participate in a research project on combining information for prediction

15 0.89117956 1427 andrew gelman stats-2012-07-24-More from the sister blog

16 0.88812339 2082 andrew gelman stats-2013-10-30-Berri Gladwell Loken football update

17 0.8856712 1552 andrew gelman stats-2012-10-29-“Communication is a central task of statistics, and ideally a state-of-the-art data analysis can have state-of-the-art displays to match”

18 0.88539612 1327 andrew gelman stats-2012-05-18-Comments on “A Bayesian approach to complex clinical diagnoses: a case-study in child abuse”

19 0.88240176 2260 andrew gelman stats-2014-03-22-Postdoc at Rennes on multilevel missing data imputation

20 0.88110244 1278 andrew gelman stats-2012-04-23-“Any old map will do” meets “God is in every leaf of every tree”