Regression and causality and variable ordering (Andrew Gelman, statistics blog, 2014-06-08)
Bill Harris wrote in with a question:

David Hogg points out in one of his general articles on data modeling that regression assumptions require one to put the variable with the highest variance in the ‘y’ position and the variable you know best (lowest variance) in the ‘x’ position. As he points out, others speak of independent and dependent variables, as if causality determined the form of a regression formula. In a quick scan of ARM and BDA, I don’t see clear advice, but I do see the use of ‘independent’ and ‘dependent.’

I recently fit a model to data in which we know the ‘effect’ pretty well (we measure it), while we know the ‘cause’ less well (it’s estimated by people who only need to get it approximately correct). A model of the form ‘cause ~ effect’ fit visually much better than one of the form ‘effect ~ cause’, but interpreting it seems challenging.

For a simple example, let the effect be energy use in a building for cooling (E), and let the cause be outdoor air temperature (T). We typically get T at a “nearby” location (within 5–10 miles, perhaps), but we know microclimates cause that reading to be in error for what counts at the particular building. So ‘E ~ T’ makes sense, but ‘T ~ E’ may violate fewer regression assumptions. At least in the short term, and over a volume bigger than that covered by the exhaust plume from the air conditioner, the natural interpretation of the latter (“the outdoor air temperature is a function of the energy you consume to cool the building”) is hard to swallow. In a complete modeling sense, I see modeling the uncertainty in both x and y, but often a simpler lm(y ~ x) suffices.
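The asymmetry Bill describes is easy to see in simulation: lm(E ~ T) and lm(T ~ E) fit two different lines, not one line solved two ways, because each minimizes errors in a different direction. A minimal sketch (the slope, noise level, and temperature range below are invented for illustration, not Bill’s data):

```python
import random

random.seed(1)
n = 10_000

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    # Sample covariance (denominator n - 1); cov(a, a) is the sample variance.
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)

def slope(y, x):
    """Least-squares slope of y regressed on x."""
    return cov(y, x) / cov(x, x)

# Hypothetical setup: temperature drives cooling energy with slope 2.0.
T = [random.uniform(15, 35) for _ in range(n)]
E = [2.0 * t - 10 + random.gauss(0, 5) for t in T]

b_ET = slope(E, T)  # slope of E ~ T, close to the true 2.0
b_TE = slope(T, E)  # slope of T ~ E; note this is NOT 1 / b_ET
r2 = cov(E, T) ** 2 / (cov(E, E) * cov(T, T))

# The product of the two slopes equals r^2, so the fitted lines
# coincide only when the correlation is perfect.
print(b_ET, b_TE, b_ET * b_TE, r2)
```

The identity b_ET · b_TE = r² makes the point algebraically: unless the variables are perfectly correlated, reversing the regression is not just inverting the line.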
I replied:

Do we really use the terms “independent” and “dependent” variables in this sense in ARM and BDA? In ARM I think we make it pretty clear that regression is about predicting y from x. Sometimes people want to predict y from x, but x is not observed; all that is available is z, which is some noisy measure of x. In this case one can fit a measurement-error model.
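The cost of ignoring that noise is worth seeing concretely. If the regression of interest is y on x but we only observe z = x + noise, the least-squares slope of y on z is attenuated toward zero by the reliability ratio var(x) / (var(x) + var(noise)). A small simulation (the true slope 3.0 and unit variances are invented for illustration):

```python
import random

random.seed(2)
n = 100_000

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    # Sample covariance; cov(a, a) is the sample variance.
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)

def slope(y, x):
    """Least-squares slope of y regressed on x."""
    return cov(y, x) / cov(x, x)

# Invented example: true slope of y on x is 3.0, but we only observe
# z = x + measurement noise, with x and the noise both unit-variance.
x = [random.gauss(0, 1) for _ in range(n)]
y = [3.0 * xi + random.gauss(0, 1) for xi in x]
z = [xi + random.gauss(0, 1) for xi in x]

b_true = slope(y, x)   # close to 3.0
b_naive = slope(y, z)  # attenuated toward 3.0 * 1/(1 + 1) = 1.5
print(b_true, b_naive)
```

With equal variances for the signal and the noise, the naive slope is biased by a factor of one half, which is exactly the kind of error a measurement-error model is meant to fix.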
Bill followed up, referring to p. 37 of ARM:

It seemed crystal clear until I read Hogg (below); then it wasn’t clear whether the predictor on p. 37 of ARM really means what I think it means (energy use doesn’t drive outside air temperature, at least in the short term, but I could interpret it as: energy use can be used to predict outdoor air temperature more accurately than temperature can predict energy use). They mention that you should regress x on y, not y on x, in those cases if you don’t model the measurement error. Perhaps that’s something to cover more fully in a new ARM: is there anything to do in particular when working up from a simple lm() to a full-blown measurement-error model (or perhaps you have and I forgot or missed it)?
My reply: We’ll definitely cover this in the next edition of ARM. We’ll do it in Stan, where it’s very easy to write a measurement-error model.
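In the meantime, the flavor of such a model can be conveyed with the simplest classical fix: if the measurement-noise variance is known, the attenuated slope can be rescaled by the reliability ratio. (In Stan one would instead treat the true predictor as a latent parameter and put a prior on it; the setup below is an invented illustration, not the book’s example.)

```python
import random

random.seed(3)
n = 100_000

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    # Sample covariance; cov(a, a) is the sample variance.
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)

# Invented example: y depends on the true x with slope 3.0, but we
# observe z = x + noise, where the noise variance (1.0) is known.
x = [random.gauss(0, 1) for _ in range(n)]
y = [3.0 * xi + random.gauss(0, 1) for xi in x]
z = [xi + random.gauss(0, 1) for xi in x]

noise_var = 1.0
b_naive = cov(y, z) / cov(z, z)

# Method-of-moments correction: var(x) = var(z) - noise_var, and
# cov(y, z) = cov(y, x) since the noise is independent of y, so the
# corrected slope is cov(y, z) / (var(z) - noise_var).
b_corrected = cov(y, z) / (cov(z, z) - noise_var)
print(b_naive, b_corrected)  # roughly 1.5 and 3.0
```

This moment correction recovers the true slope on average, but unlike a full Bayesian measurement-error model it needs the noise variance handed to it and gives no honest uncertainty about the latent x.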