andrew_gelman_stats andrew_gelman_stats-2013 andrew_gelman_stats-2013-1981 knowledge-graph by maker-knowledge-mining

1981 andrew gelman stats-2013-08-14-The robust beauty of improper linear models in decision making


meta information for this blog

Source: html

Introduction: Andreas Graefe writes (see here here here ): The usual procedure for developing linear models to predict any kind of target variable is to identify a subset of most important predictors and to estimate weights that provide the best possible solution for a given sample. The resulting “optimally” weighted linear composite is then used when predicting new data. This approach is useful in situations with large and reliable datasets and few predictor variables. However, a large body of analytical and empirical evidence since the 1970s shows that the weighting of variables is of little, if any, value in situations with small and noisy datasets and a large number of predictor variables. In such situations, including all relevant variables is more important than their weighting. These findings have yet to impact many fields. This study uses data from nine established U.S. election-forecasting models whose forecasts are regularly published in academic journals to demonstrate the value of weighting all predictors equally and including all relevant variables in the model.


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Andreas Graefe writes (see here here here ): The usual procedure for developing linear models to predict any kind of target variable is to identify a subset of most important predictors and to estimate weights that provide the best possible solution for a given sample. [sent-1, score-0.873]

2 The resulting “optimally” weighted linear composite is then used when predicting new data. [sent-2, score-0.608]

3 This approach is useful in situations with large and reliable datasets and few predictor variables. [sent-3, score-0.912]

4 However, a large body of analytical and empirical evidence since the 1970s shows that the weighting of variables is of little, if any, value in situations with small and noisy datasets and a large number of predictor variables. [sent-4, score-1.779]

5 In such situations, including all relevant variables is more important than their weighting. [sent-5, score-0.413]

6 election-forecasting models whose forecasts are regularly published in academic journals to demonstrate the value of weighting all predictors equally and including all relevant variables in the model. [sent-9, score-1.53]

7 Across the ten elections from 1976 to 2012, equally weighted predictors reduced the forecast error of the original regression models on average by four percent. [sent-10, score-1.421]

8 An equal-weights model that includes all variables provided well-calibrated forecasts that reduced the error of the most accurate regression model by 29 percent. [sent-11, score-1.134]
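The qualitative effect is easy to reproduce in a toy simulation (a sketch of the general phenomenon, not Graefe's election data; the sample sizes, noise level, and true coefficients below are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy version of the small-n, noisy, many-predictors regime:
# 25 training points, 10 predictors, roughly equal true effects.
n_train, n_test, p = 25, 1000, 10
beta = np.ones(p)  # assumed true coefficients

ols_err, eq_err = [], []
for _ in range(200):
    X = rng.normal(size=(n_train, p))
    y = X @ beta + rng.normal(scale=3.0, size=n_train)
    X_new = rng.normal(size=(n_test, p))
    y_new = X_new @ beta + rng.normal(scale=3.0, size=n_test)

    # "Optimal" weights: OLS fit to the small training sample
    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    # Equal weights: every predictor gets the mean estimated coefficient
    b_eq = np.full(p, b_ols.mean())

    ols_err.append(np.mean(np.abs(y_new - X_new @ b_ols)))
    eq_err.append(np.mean(np.abs(y_new - X_new @ b_eq)))

mae_ols, mae_eq = np.mean(ols_err), np.mean(eq_err)
```

With small, noisy samples the estimated weights mostly add variance without adding signal, so the equal-weights forecast error comes out lower on average in this regime.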

9 I haven’t actually read the paper, but I have no reason to disbelieve it. [sent-12, score-0.138]

10 I assume that you could get even better performance using a Bayesian approach that puts a strong prior distribution on the coefficients being close to each other. [sent-13, score-0.432]

11 This can be done, for example, in a multiplicative model like this: Suppose your original model is y = b_0 + b_1*x_1 + b_2*x_2 + . . . [sent-14, score-0.518]

12 . . . + b_10*x_10, and suppose you want the coefficients b_3,…,b_10 to be close to each other. [sent-17, score-0.368]

13 A Bayesian version could set g_j ~ N(1,s^2), where s is some small value such as 0. [sent-20, score-0.247]
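For concreteness, here is a minimal numpy sketch of that multiplicative prior: write b_j = B*g_j with g_j ~ N(1, s^2), so a small s pulls the b_j toward the common value B. Holding B and the residual scale sigma fixed (both assumed values here, as is the simulated data), the posterior mode for g is a ridge-style estimate shrunk toward 1:

```python
import numpy as np

def map_g(X, y, B, s, sigma=1.0):
    """Posterior mode of g in y ~ N(B * X @ g, sigma^2 I), with g ~ N(1, s^2 I).

    Setting the gradient of the log posterior to zero gives the linear system
    (B^2 X'X / sigma^2 + I / s^2) g = B X'y / sigma^2 + ones / s^2.
    """
    p = X.shape[1]
    A = (B**2 / sigma**2) * X.T @ X + np.eye(p) / s**2
    rhs = (B / sigma**2) * X.T @ y + np.ones(p) / s**2
    return np.linalg.solve(A, rhs)

# Simulated data whose coefficients really are near a common value
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 8))
g_true = 1 + 0.1 * rng.normal(size=8)
y = 2.0 * X @ g_true + rng.normal(size=30)

g_strong = map_g(X, y, B=2.0, s=0.1)   # small s: g_j shrunk hard toward 1
g_weak = map_g(X, y, B=2.0, s=10.0)    # large s: essentially least squares
```

In a full Bayesian treatment B, s, and sigma would get priors too; the closed form above is just the conditional mode with them fixed.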


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('situations', 0.249), ('variables', 0.213), ('predictors', 0.207), ('weighted', 0.197), ('reduced', 0.182), ('equally', 0.182), ('weighting', 0.181), ('forecasts', 0.179), ('datasets', 0.172), ('predictor', 0.168), ('value', 0.16), ('coefficients', 0.148), ('model', 0.144), ('andreas', 0.138), ('disbelieve', 0.138), ('large', 0.131), ('linear', 0.13), ('optimally', 0.127), ('multiplicative', 0.124), ('composite', 0.118), ('close', 0.112), ('analytical', 0.111), ('models', 0.11), ('suppose', 0.108), ('relevant', 0.107), ('original', 0.106), ('nine', 0.105), ('error', 0.103), ('reliable', 0.099), ('regularly', 0.098), ('regression', 0.094), ('established', 0.093), ('approach', 0.093), ('including', 0.093), ('weights', 0.092), ('body', 0.092), ('subset', 0.091), ('small', 0.087), ('forecast', 0.087), ('target', 0.087), ('resulting', 0.086), ('noisy', 0.084), ('developing', 0.08), ('ten', 0.079), ('puts', 0.079), ('predicting', 0.077), ('bayesian', 0.076), ('procedure', 0.076), ('accurate', 0.075), ('elections', 0.074)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0 1981 andrew gelman stats-2013-08-14-The robust beauty of improper linear models in decision making


2 0.18051054 1506 andrew gelman stats-2012-09-21-Building a regression model . . . with only 27 data points

Introduction: Dan Silitonga writes: I was wondering whether you would have any advice on building a regression model on a very small dataset. I’m in the midst of revamping the model to predict tax collections from unincorporated businesses. But I only have 27 data points, 27 years of annual data. Any advice would be much appreciated. My reply: This sounds tough, especially given that 27 years of annual data isn’t even 27 independent data points. I have various essentially orthogonal suggestions: 1 [added after seeing John Cook's comment below]. Do your best, making as many assumptions as you need. In a Bayesian context, this means that you’d use a strong and informative prior and let the data update it as appropriate. In a less formal setting, you’d start with a guess of a model and then alter it to the extent that your data contradict your original guess. 2. Get more data. Not by getting information on more years (I assume you can’t do that) but by breaking up the data you do

3 0.17437741 250 andrew gelman stats-2010-09-02-Blending results from two relatively independent multi-level models

Introduction: David Shor writes: I [Shor] am working on a Bayesian Forecasting model for the Mid-term elections that has two components: 1) A poll aggregation system with pooled and hierarchical house and design effects across every race with polls (Average Standard error for house seat level vote-share ~.055) 2) A Bafumi-style regression that applies national-swing to individual seats. (Average Standard error for house seat level vote-share ~.06) Since these two estimates are essentially independent, estimates can probably be made more accurate by pooling them together. But If a house effect changes in one draw, that changes estimates in every race. Changes in regression coefficients and National swing have a similar effect. In the face of high and possibly differing seat-to-seat correlations from each method, I’m not sure what the correct way to “blend” these models would be, either for individual or top-line seat estimates. In the mean-time, I’m just creating variance-weighted avera

4 0.16535161 1248 andrew gelman stats-2012-04-06-17 groups, 6 group-level predictors: What to do?

Introduction: Yi-Chun Ou writes: I am using a multilevel model with three levels. I read that you wrote a book about multilevel models, and wonder if you can solve the following question. The data structure is like this: Level one: customer (8444 customers) Level two: companies (90 companies) Level three: industry (17 industries) I use 6 level-three variables (i.e. industry characteristics) to explain the variance of the level-one effect across industries. The question here is whether there is an over-fitting problem since there are only 17 industries. I understand that this must be a problem for non-multilevel models, but is it also a problem for multilevel models? My reply: Yes, this could be a problem. I’d suggest combining some of your variables into a common score, or using only some of the variables, or using strong priors to control the inferences. This is an interesting and important area of statistics research, to do this sort of thing systematically. There’s lots o

5 0.15438822 1430 andrew gelman stats-2012-07-26-Some thoughts on survey weighting

Introduction: From a comment I made in an email exchange: My work on survey adjustments has very much been inspired by the ideas of Rod Little. Much of my efforts have gone toward the goal of integrating hierarchical modeling (which is so helpful for small-area estimation) with post stratification (which adjusts for known differences between sample and population). In the surveys I’ve dealt with, nonresponse/nonavailability can be a big issue, and I’ve always tried to emphasize that (a) the probability of a person being included in the sample is just about never known, and (b) even if this probability were known, I’d rather know the empirical n/N than the probability p (which is only valid in expectation). Regarding nonparametric modeling: I haven’t done much of that (although I hope to at some point) but Rod and his students have. As I wrote in the first sentence of the above-linked paper, I do think the current theory and practice of survey weighting is a mess, in that much depends on so

6 0.15221897 1196 andrew gelman stats-2012-03-04-Piss-poor monocausal social science

7 0.1521244 352 andrew gelman stats-2010-10-19-Analysis of survey data: Design based models vs. hierarchical modeling?

8 0.14681755 1340 andrew gelman stats-2012-05-23-Question 13 of my final exam for Design and Analysis of Sample Surveys

9 0.14513856 1656 andrew gelman stats-2013-01-05-Understanding regression models and regression coefficients

10 0.14495718 391 andrew gelman stats-2010-11-03-Some thoughts on election forecasting

11 0.14265393 753 andrew gelman stats-2011-06-09-Allowing interaction terms to vary

12 0.14073414 2017 andrew gelman stats-2013-09-11-“Informative g-Priors for Logistic Regression”

13 0.13533959 1999 andrew gelman stats-2013-08-27-Bayesian model averaging or fitting a larger model

14 0.13479897 1486 andrew gelman stats-2012-09-07-Prior distributions for regression coefficients

15 0.13311532 1972 andrew gelman stats-2013-08-07-When you’re planning on fitting a model, build up to it by fitting simpler models first. Then, once you have a model you like, check the hell out of it

16 0.1317268 1247 andrew gelman stats-2012-04-05-More philosophy of Bayes

17 0.13167182 1367 andrew gelman stats-2012-06-05-Question 26 of my final exam for Design and Analysis of Sample Surveys

18 0.12714127 1941 andrew gelman stats-2013-07-16-Priors

19 0.12677287 888 andrew gelman stats-2011-09-03-A psychology researcher asks: Is Anova dead?

20 0.12669091 1955 andrew gelman stats-2013-07-25-Bayes-respecting experimental design and other things


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.23), (1, 0.188), (2, 0.101), (3, -0.042), (4, 0.052), (5, 0.022), (6, 0.011), (7, -0.058), (8, 0.07), (9, 0.065), (10, 0.105), (11, 0.017), (12, -0.024), (13, 0.023), (14, -0.047), (15, 0.015), (16, 0.03), (17, 0.008), (18, 0.028), (19, -0.012), (20, -0.018), (21, 0.082), (22, 0.009), (23, -0.021), (24, 0.005), (25, 0.015), (26, 0.042), (27, -0.065), (28, -0.019), (29, -0.017), (30, 0.064), (31, 0.06), (32, 0.017), (33, -0.02), (34, 0.007), (35, -0.002), (36, 0.057), (37, 0.014), (38, -0.011), (39, -0.063), (40, -0.045), (41, -0.041), (42, 0.022), (43, 0.029), (44, 0.01), (45, 0.009), (46, 0.03), (47, 0.05), (48, 0.045), (49, 0.034)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.96630758 1981 andrew gelman stats-2013-08-14-The robust beauty of improper linear models in decision making


2 0.82940316 257 andrew gelman stats-2010-09-04-Question about standard range for social science correlations

Introduction: Andrew Eppig writes: I’m a physicist by training who is transitioning to the social sciences. I recently came across a reference in the Economist to a paper on IQ and parasites which I read as I have more than a passing interest in IQ research (having read much that you and others (e.g., Shalizi, Wicherts) have written). In this paper I note that the authors find a very high correlation between national IQ and parasite prevalence. The strength of the correlation (-0.76 to -0.82) surprised me, as I’m used to much weaker correlations in the social sciences. To me, it’s a bit too high, suggesting that there are other factors at play or that one of the variables is merely a proxy for a large number of other variables. But I have no basis for this other than a gut feeling and a memory of a plot on Language Log about the distribution of correlation coefficients in social psychology. So my question is this: Is a correlation in the range of (-0.82,-0.76) more likely to be a correlatio

3 0.80331892 1094 andrew gelman stats-2011-12-31-Using factor analysis or principal components analysis or measurement-error models for biological measurements in archaeology?

Introduction: Greg Campbell writes: I am a Canadian archaeologist (BSc in Chemistry) researching the past human use of European Atlantic shellfish. After two decades of practice I am finally getting a MA in archaeology at Reading. I am seeing if the habitat or size of harvested mussels (Mytilus edulis) can be reconstructed from measurements of the umbo (the pointy end, and the only bit that survives well in archaeological deposits) using log-transformed measurements (or allometry; relationships between dimensions are more likely exponential than linear). Of course multivariate regressions in most statistics packages (Minitab, SPSS, SAS) assume you are trying to predict one variable from all the others (a Model I regression), and use ordinary least squares to fit the regression line. For organismal dimensions this makes little sense, since all the dimensions are (at least in theory) free to change their mutual proportions during growth. So there is no predictor and predicted, mutual variation of

4 0.80075222 1294 andrew gelman stats-2012-05-01-Modeling y = a + b + c

Introduction: Brandon Behlendorf writes: I [Behlendorf] am replicating some previous research using OLS [he's talking about what we call "linear regression"---ed.] to regress a logged rate (to reduce skew) of Y on a number of predictors (Xs). Y is the count of a phenomena divided by the population of the unit of the analysis. The problem that I am encountering is that Y is composite count of a number of distinct phenomena [A+B+C], and these phenomena are not uniformly distributed across the sample. Most of the research in this area has conducted regressions either with Y or with individual phenomena [A or B or C] as the dependent variable. Yet it seems that if [A, B, C] are not uniformly distributed across the sample of units in the same proportion, then the use of Y would be biased, since as a count of [A+B+C] divided by the population, it would treat as equivalent units both [2+0.5+1.5] and [4+0+0]. My goal is trying to find a methodology which allows a researcher to regress Y on a

5 0.79499233 1218 andrew gelman stats-2012-03-18-Check your missing-data imputations using cross-validation

Introduction: Elena Grewal writes: I am currently using the iterative regression imputation model as implemented in the Stata ICE package. I am using data from a survey of about 90,000 students in 142 schools and my variable of interest is parent level of education. I want only this variable to be imputed with as little bias as possible as I am not using any other variable. So I scoured the survey for every variable I thought could possibly predict parent education. The main variable I found is parent occupation, which explains about 35% of the variance in parent education for the students with complete data on both. I then include the 20 other variables I found in the survey in a regression predicting parent education, which explains about 40% of the variance in parent education for students with complete data on all the variables. My question is this: many of the other variables I found have more missing values than the parent education variable, and also, although statistically significant

6 0.78706229 627 andrew gelman stats-2011-03-24-How few respondents are reasonable to use when calculating the average by county?

7 0.78379107 1121 andrew gelman stats-2012-01-15-R-squared for multilevel models

8 0.76719099 1462 andrew gelman stats-2012-08-18-Standardizing regression inputs

9 0.75983816 250 andrew gelman stats-2010-09-02-Blending results from two relatively independent multi-level models

10 0.75627017 14 andrew gelman stats-2010-05-01-Imputing count data

11 0.75062197 770 andrew gelman stats-2011-06-15-Still more Mr. P in public health

12 0.74292189 251 andrew gelman stats-2010-09-02-Interactions of predictors in a causal model

13 0.73973471 1966 andrew gelman stats-2013-08-03-Uncertainty in parameter estimates using multilevel models

14 0.73677826 1196 andrew gelman stats-2012-03-04-Piss-poor monocausal social science

15 0.73475736 1900 andrew gelman stats-2013-06-15-Exploratory multilevel analysis when group-level variables are of importance

16 0.72407746 1967 andrew gelman stats-2013-08-04-What are the key assumptions of linear regression?

17 0.72318095 1703 andrew gelman stats-2013-02-02-Interaction-based feature selection and classification for high-dimensional biological data

18 0.72309744 1656 andrew gelman stats-2013-01-05-Understanding regression models and regression coefficients

19 0.72027892 1908 andrew gelman stats-2013-06-21-Interpreting interactions in discrete-data regression

20 0.71846884 1374 andrew gelman stats-2012-06-11-Convergence Monitoring for Non-Identifiable and Non-Parametric Models


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(2, 0.029), (16, 0.076), (21, 0.032), (24, 0.177), (50, 0.08), (53, 0.027), (63, 0.034), (74, 0.013), (76, 0.027), (86, 0.071), (96, 0.019), (99, 0.324)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.9823482 1981 andrew gelman stats-2013-08-14-The robust beauty of improper linear models in decision making


2 0.9724623 1805 andrew gelman stats-2013-04-16-Memo to Reinhart and Rogoff: I think it’s best to admit your errors and go on from there

Introduction: Jeff Ratto points me to this news article by Dean Baker reporting the work of three economists, Thomas Herndon, Michael Ash, and Robert Pollin, who found errors in a much-cited article by Carmen Reinhart and Kenneth Rogoff analyzing historical statistics of economic growth and public debt. Mike Konczal provides a clear summary; that’s where I got the above image. Errors in data processing and data analysis It turns out that Reinhart and Rogoff flubbed it. Herndon et al. write of “spreadsheet errors, omission of available data, weighting, and transcription.” The spreadsheet errors are the most embarrassing, but the other choices in data analysis seem pretty bad too. It can be tough to work with small datasets, so I have sympathy for Reinhart and Rogoff, but it does look like they were jumping to conclusions in their paper. Perhaps the urgency of the topic moved them to publish as fast as possible rather than carefully considering the impact of their data-analytic choi

3 0.96793872 541 andrew gelman stats-2011-01-27-Why can’t I be more like Bill James, or, The use of default and default-like models

Introduction: During our discussion of estimates of teacher performance, Steve Sailer wrote : I suspect we’re going to take years to work the kinks out of overall rating systems. By way of analogy, Bill James kicked off the modern era of baseball statistics analysis around 1975. But he stuck to doing smaller scale analyses and avoided trying to build one giant overall model for rating players. In contrast, other analysts such as Pete Palmer rushed into building overall ranking systems, such as his 1984 book, but they tended to generate curious results such as the greatness of Roy Smalley Jr.. James held off until 1999 before unveiling his win share model for overall rankings. I remember looking at Pete Palmer’s book many years ago and being disappointed that he did everything through his Linear Weights formula. A hit is worth X, a walk is worth Y, etc. Some of this is good–it’s presumably an improvement on counting walks as 0 or 1 hits, also an improvement on counting doubles and triples a

4 0.96508157 1636 andrew gelman stats-2012-12-23-Peter Bartlett on model complexity and sample size

Introduction: Zach Shahn saw this and writes: I just heard a talk by Peter Bartlett about model selection in “unlimited” data situations that essentially addresses this curve. He talks about the problem of model selection given a computational budget (rather than given a sample size). You can either use your computational budget to get more data or fit a more complex model. He shows that you can get oracle inequalities for model selection algorithms under this paradigm (as long as the candidate models are nested). I can’t follow all the details but it looks cool! This is what they should be teaching in theoretical statistics class, instead of sufficient statistics and the Neyman-Pearson lemma and all that other old stuff. Zach also asks: I have a question about political science. I always hear that the direction of the economy is one of the best predictors of election outcome. What’s your thinking about the causal mechanism(s) behind the success of economic trend indicators as pr

5 0.96282089 1793 andrew gelman stats-2013-04-08-The Supreme Court meets the fallacy of the one-sided bet

Introduction: Doug Hartmann writes ( link from Jay Livingston): Justice Antonin Scalia’s comment in the Supreme Court hearings on the U.S. law defining marriage that “there’s considerable disagreement among sociologists as to what the consequences of raising a child in a single-sex family, whether that is harmful to the child or not.” Hartmann argues that Scalia is factually incorrect—there is not actually “considerable disagreement among sociologists” on this issue—and quotes a recent report from the American Sociological Association to this effect. Assuming there’s no other considerable group of sociologists (Hartmann knows of only one small group) arguing otherwise, it seems that Hartmann has a point. Scalia would’ve been better off omitting the phrase “among sociologists”—then he’d have been on safe ground, because you can always find somebody to take a position on the issue. Jerry Falwell’s no longer around but there’s a lot more where he came from. Even among scientists, there’s

6 0.96123296 781 andrew gelman stats-2011-06-28-The holes in my philosophy of Bayesian data analysis

7 0.95966315 2040 andrew gelman stats-2013-09-26-Difficulties in making inferences about scientific truth from distributions of published p-values

8 0.95959121 2140 andrew gelman stats-2013-12-19-Revised evidence for statistical standards

9 0.95915246 777 andrew gelman stats-2011-06-23-Combining survey data obtained using different modes of sampling

10 0.95893276 106 andrew gelman stats-2010-06-23-Scientists can read your mind . . . as long as they’re allowed to look at more than one place in your brain and then make a prediction after seeing what you actually did

11 0.95882922 1695 andrew gelman stats-2013-01-28-Economists argue about Bayes

12 0.95845413 1518 andrew gelman stats-2012-10-02-Fighting a losing battle

13 0.9584384 351 andrew gelman stats-2010-10-18-“I was finding the test so irritating and boring that I just started to click through as fast as I could”

14 0.95839047 899 andrew gelman stats-2011-09-10-The statistical significance filter

15 0.95832431 1117 andrew gelman stats-2012-01-13-What are the important issues in ethics and statistics? I’m looking for your input!

16 0.95801878 1763 andrew gelman stats-2013-03-14-Everyone’s trading bias for variance at some point, it’s just done at different places in the analyses

17 0.9578796 61 andrew gelman stats-2010-05-31-A data visualization manifesto

18 0.95772529 2281 andrew gelman stats-2014-04-04-The Notorious N.H.S.T. presents: Mo P-values Mo Problems

19 0.95758241 288 andrew gelman stats-2010-09-21-Discussion of the paper by Girolami and Calderhead on Bayesian computation

20 0.95728016 2120 andrew gelman stats-2013-12-02-Does a professor’s intervention in online discussions have the effect of prolonging discussion or cutting it off?