andrew_gelman_stats andrew_gelman_stats-2014 andrew_gelman_stats-2014-2364 knowledge-graph by maker-knowledge-mining

2364 andrew gelman stats-2014-06-08-Regression and causality and variable ordering


meta info for this blog

Source: html

Introduction: Bill Harris wrote in with a question: David Hogg points out in one of his general articles on data modeling that regression assumptions require one to put the variable with the highest variance in the ‘y’ position and the variable you know best (lowest variance) in the ‘x’ position. As he points out, others speak of independent and dependent variables, as if causality determined the form of a regression formula. In a quick scan of ARM and BDA, I don’t see clear advice, but I do see the use of ‘independent’ and ‘dependent.’ I recently did a model over data in which we know the ‘effect’ pretty well (we measure it), while we know the ‘cause’ less well (it’s estimated by people who only need to get it approximately correct). A model of the form ‘cause ~ effect’ fit visually much better than one of the form ‘effect ~ cause’, but interpreting it seems challenging. For a simplistic example, let the effect be energy use in a building for cooling (E), and let the cause be outdoor air temperature (T).


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Bill Harris wrote in with a question: David Hogg points out in one of his general articles on data modeling that regression assumptions require one to put the variable with the highest variance in the ‘y’ position and the variable you know best (lowest variance) in the ‘x’ position. [sent-1, score-0.515]

2 As he points out, others speak of independent and dependent variables, as if causality determined the form of a regression formula. [sent-2, score-0.429]

3 In a quick scan of ARM and BDA, I don’t see clear advice, but I do see the use of ‘independent’ and ‘dependent. [sent-3, score-0.294]

4 I recently did a model over data in which we know the ‘effect’ pretty well (we measure it), while we know the ‘cause’ less well (it’s estimated by people who only need to get it approximately correct). [sent-4, score-0.304]

5 A model of the form ‘cause ~ effect’ fit visually much better than one of the form ‘effect ~ cause’, but interpreting it seems challenging. [sent-5, score-0.323]

6 For a simplistic example, let the effect be energy use in a building for cooling (E), and let the cause be outdoor air temperature (T). [sent-6, score-1.623]

7 We typically get T at a “nearby” location (within 5-10 miles, perhaps), but we know microclimates cause that to be in error for what counts at the particular building. [sent-8, score-0.402]

8 So ‘E ~ T’ makes sense, but ‘T ~ E’ may violate fewer regression assumptions. [sent-9, score-0.171]

9 At least in the short term and over a volume that’s bigger than covered by the exhaust plume from the air conditioner, the natural interpretation of that (“the outdoor air temperature is a function of the energy you consume to cool the building”) is hard to swallow. [sent-10, score-1.49]

10 In a complete modeling sense, I see modeling the uncertainty in x and y, but often a simpler ‘lm(y ~ x)’ suffices. [sent-12, score-0.176]
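The asymmetry behind this question is easy to see in simulation: the least-squares slope of y on x is not the reciprocal of the slope of x on y, and the two slopes multiply to the squared correlation, so ‘which variable goes on the left’ genuinely changes the fitted line. A minimal sketch in Python (standing in for the ‘lm(y ~ x)’ call mentioned above; the true slope 0.5 and noise scale are arbitrary illustration choices, not values from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(scale=1.0, size=n)  # true slope 0.5

def ols_slope(pred, resp):
    """Least-squares slope of resp ~ pred (after centering)."""
    pred = pred - pred.mean()
    resp = resp - resp.mean()
    return float(pred @ resp / (pred @ pred))

b_yx = ols_slope(x, y)   # slope of lm(y ~ x)
b_xy = ols_slope(y, x)   # slope of lm(x ~ y), NOT 1 / b_yx
r2 = np.corrcoef(x, y)[0, 1] ** 2

# The two regressions answer different prediction questions;
# algebraically, b_yx * b_xy equals the squared correlation.
print(b_yx, b_xy, b_yx * b_xy, r2)
```

With these simulated values b_yx lands near 0.5 while b_xy lands near 0.4 rather than near 1/0.5 = 2, which is the sense in which swapping y and x is not just an algebraic inversion.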

11 I replied: Do we really use the terms “independent” and “dependent” variables in this sense in ARM and BDA? [sent-15, score-0.255]

12 In ARM I think we make it pretty clear that regression is about predicting y from x. [sent-19, score-0.209]

13 Sometimes people want to predict y from x, but x is not observed; all that is available is z, which is some noisy measure of x. [sent-21, score-0.182]

14 In this case one can fit a measurement error model. [sent-22, score-0.284]
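To see what a measurement error model is correcting, here is a hypothetical simulation (Python as a stand-in for R/Stan; the true slope 2.0 and noise scale 0.8 are made-up assumptions). Regressing y on the noisy z attenuates the slope by the reliability ratio var(x) / (var(x) + var(noise)); in this simple classical setup, dividing by that ratio undoes the bias:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(size=n)                        # true predictor (unobserved)
z = x + rng.normal(scale=0.8, size=n)         # noisy measurement of x
y = 2.0 * x + rng.normal(scale=0.5, size=n)   # outcome depends on the true x

def ols_slope(pred, resp):
    """Least-squares slope of resp ~ pred (after centering)."""
    pred = pred - pred.mean()
    resp = resp - resp.mean()
    return float(pred @ resp / (pred @ pred))

naive = ols_slope(z, y)        # attenuated toward zero
# Classical correction: divide by the reliability ratio
# lambda = var(x) / (var(x) + var(noise)); here 1 / (1 + 0.8**2).
lam = 1.0 / (1.0 + 0.8**2)
corrected = naive / lam
print(naive, corrected)
```

A full measurement error model (e.g., in Stan) would instead treat the true x as a latent quantity with its own prior, which also propagates the extra uncertainty into the slope estimate rather than just rescaling a point estimate.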

15 37, which seemed crystal clear until I read Hogg (below); then it wasn’t clear if the predictor on p. [sent-26, score-0.281]

16 37 of ARM really means what I think it means (energy use doesn’t drive outside air temperature, at least on the short term, but I /could/ interpret it as energy use can be used to /predict/ outdoor air temperature more accurately than temperature can predict energy use). [sent-27, score-2.276]

17 mention that you should regress x on y, not y on x, in those cases if you don’t model the measurement error. [sent-31, score-0.298]
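The regress-the-other-way advice can be checked numerically: when x is observed only through a noisy z, the forward slope from y ~ z is biased toward zero while the inverted slope from z ~ y is biased away from zero, so the true coefficient sits between the two. A hedged sketch under a made-up setup (true slope 2.0, measurement noise sd 0.8; assumptions for illustration, not from the post):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
x = rng.normal(size=n)                        # true predictor (unobserved)
z = x + rng.normal(scale=0.8, size=n)         # noisy measurement of x
y = 2.0 * x + rng.normal(scale=0.5, size=n)   # outcome depends on the true x

def ols_slope(pred, resp):
    """Least-squares slope of resp ~ pred (after centering)."""
    pred = pred - pred.mean()
    resp = resp - resp.mean()
    return float(pred @ resp / (pred @ pred))

forward = ols_slope(z, y)        # y ~ z: biased toward zero
inverse = 1.0 / ols_slope(y, z)  # invert z ~ y: biased away from zero
print(forward, inverse)          # the true slope lies between them
```

Neither regression recovers the true slope on its own, which is why modeling the measurement error explicitly is the cleaner resolution.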

18 Perhaps that’s something to cover more fully in a new ARM: is there anything to do in particular when working up from a simple lm() to a full-blown model of measurement error (or perhaps you have and I forgot or missed it). [sent-33, score-0.571]

19 My reply: We’ll definitely cover this in the next edition of ARM. [sent-34, score-0.192]

20 We’ll do it in Stan, where it’s very easy to write a measurement error model. [sent-35, score-0.284]


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('arm', 0.295), ('lm', 0.277), ('temperature', 0.258), ('air', 0.246), ('hogg', 0.241), ('energy', 0.234), ('outdoor', 0.22), ('cause', 0.217), ('measurement', 0.166), ('independent', 0.121), ('error', 0.118), ('variance', 0.114), ('use', 0.114), ('regression', 0.108), ('dependent', 0.106), ('bda', 0.105), ('measure', 0.104), ('clear', 0.101), ('effect', 0.1), ('edition', 0.1), ('form', 0.094), ('cover', 0.092), ('modeling', 0.088), ('interpret', 0.087), ('building', 0.082), ('sense', 0.08), ('crystal', 0.079), ('cooling', 0.079), ('scan', 0.079), ('predict', 0.078), ('bill', 0.078), ('exhaust', 0.076), ('consume', 0.073), ('simplistic', 0.073), ('term', 0.072), ('visually', 0.069), ('variable', 0.069), ('know', 0.067), ('model', 0.066), ('regress', 0.066), ('forgot', 0.065), ('short', 0.065), ('harris', 0.064), ('perhaps', 0.064), ('violate', 0.063), ('glm', 0.063), ('variables', 0.061), ('miles', 0.061), ('means', 0.061), ('nearby', 0.059)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0 2364 andrew gelman stats-2014-06-08-Regression and causality and variable ordering


2 0.20604686 906 andrew gelman stats-2011-09-14-Another day, another stats postdoc

Introduction: This post is from Phil Price.  I work in the Environmental Energy Technologies Division at Lawrence Berkeley National Laboratory, and I am looking for a postdoc who knows substantially more than I do about time-series modeling; in practice this probably means someone whose dissertation work involved that sort of thing.  The work involves developing models to predict and/or forecast the time-dependent energy use in buildings, given historical data and some covariates such as outdoor temperature.  Simple regression approaches (e.g. using time-of-week indicator variables, plus outdoor temperature) work fine for a lot of things, but we still have a variety of problems.  To give one example, sometimes building behavior changes — due to retrofits, or a change in occupant behavior — so that a single model won’t fit well over a long time period. We want to recognize these changes automatically .  We have many other issues besides: heteroskedasticity, need for good uncertainty estimates, abilit

3 0.19273058 2340 andrew gelman stats-2014-05-20-Thermodynamic Monte Carlo: Michael Betancourt’s new method for simulating from difficult distributions and evaluating normalizing constants

Introduction: I hate to keep bumping our scheduled posts but this is just too important and too exciting to wait. So it’s time to jump the queue. The news is a paper from Michael Betancourt that presents a super-cool new way to compute normalizing constants: A common strategy for inference in complex models is the relaxation of a simple model into the more complex target model, for example the prior into the posterior in Bayesian inference. Existing approaches that attempt to generate such transformations, however, are sensitive to the pathologies of complex distributions and can be difficult to implement in practice. Leveraging the geometry of thermodynamic processes I introduce a principled and robust approach to deforming measures that presents a powerful new tool for inference. The idea is to generalize Hamiltonian Monte Carlo so that it moves through a family of distributions (that is, it transitions through an “inverse temperature” variable called beta that indexes the family) a

4 0.17355296 1010 andrew gelman stats-2011-11-14-“Free energy” and economic resources

Introduction: By “free energy” I don’t mean perpetual motion machines, cars that run on water and get 200 mpg, or the latest cold-fusion hype. No, I’m referring to the term from physics. The free energy of a system is, roughly, the amount of energy that can be directly extracted from it. For example, a rock at room temperature is just full of energy—not just the energy locked in its nuclei, but basic thermal energy—but at room temperature you can’t extract any of it. To the physicists in the audience: Yes, I realize that free energy has a technical meaning in statistical mechanics and that my above definition is sloppy. Please bear with me. And, to the non-physicists: feel free to head to Wikipedia or a physics textbook for a more careful treatment. I was thinking about free energy the other day when hearing someone on the radio say something about China bailing out the E.U. I did a double-take. Huh? The E.U. is rich, China’s not so rich. How can a middle-income country bail out a

5 0.16875075 1501 andrew gelman stats-2012-09-18-More studies on the economic effects of climate change

Introduction: After writing yesterday’s post , I was going through Solomon Hsiang’s blog and found a post pointing to three studies from researchers at business schools: Severe Weather and Automobile Assembly Productivity Gérard P. Cachon, Santiago Gallino and Marcelo Olivares Abstract: It is expected that climate change could lead to an increased frequency of severe weather. In turn, severe weather intuitively should hamper the productivity of work that occurs outside. But what is the effect of rain, snow, fog, heat and wind on work that occurs indoors, such as the production of automobiles? Using weekly production data from 64 automobile plants in the United States over a ten-year period, we find that adverse weather conditions lead to a significant reduction in production. For example, one additional day of high wind advisory by the National Weather Service (i.e., maximum winds generally in excess of 44 miles per hour) reduces production by 26%, which is comparable in order of magnitude t

6 0.13903767 1486 andrew gelman stats-2012-09-07-Prior distributions for regression coefficients

7 0.13804887 180 andrew gelman stats-2010-08-03-Climate Change News

8 0.13653359 1966 andrew gelman stats-2013-08-03-Uncertainty in parameter estimates using multilevel models

9 0.12066182 1422 andrew gelman stats-2012-07-20-Likelihood thresholds and decisions

10 0.11902675 1972 andrew gelman stats-2013-08-07-When you’re planning on fitting a model, build up to it by fitting simpler models first. Then, once you have a model you like, check the hell out of it

11 0.11881583 888 andrew gelman stats-2011-09-03-A psychology researcher asks: Is Anova dead?

12 0.11784673 754 andrew gelman stats-2011-06-09-Difficulties with Bayesian model averaging

13 0.11657762 2321 andrew gelman stats-2014-05-05-On deck this week

14 0.1164207 257 andrew gelman stats-2010-09-04-Question about standard range for social science correlations

15 0.11159076 1162 andrew gelman stats-2012-02-11-Adding an error model to a deterministic model

16 0.11073975 1801 andrew gelman stats-2013-04-13-Can you write a program to determine the causal order?

17 0.1087804 1967 andrew gelman stats-2013-08-04-What are the key assumptions of linear regression?

18 0.10849468 1201 andrew gelman stats-2012-03-07-Inference = data + model

19 0.10756249 1656 andrew gelman stats-2013-01-05-Understanding regression models and regression coefficients

20 0.10741665 1418 andrew gelman stats-2012-07-16-Long discussion about causal inference and the use of hierarchical models to bridge between different inferential settings


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.215), (1, 0.081), (2, 0.048), (3, -0.009), (4, 0.097), (5, 0.015), (6, 0.039), (7, -0.063), (8, 0.085), (9, 0.031), (10, -0.017), (11, 0.043), (12, 0.025), (13, -0.035), (14, -0.017), (15, 0.01), (16, 0.023), (17, 0.005), (18, -0.004), (19, -0.019), (20, -0.006), (21, 0.02), (22, 0.022), (23, 0.008), (24, 0.017), (25, 0.02), (26, 0.03), (27, -0.064), (28, 0.015), (29, 0.013), (30, 0.048), (31, 0.061), (32, -0.005), (33, -0.036), (34, -0.028), (35, -0.02), (36, -0.001), (37, -0.028), (38, -0.013), (39, -0.074), (40, 0.011), (41, 0.009), (42, -0.07), (43, -0.008), (44, 0.009), (45, 0.035), (46, -0.063), (47, -0.018), (48, -0.056), (49, 0.065)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.95748365 2364 andrew gelman stats-2014-06-08-Regression and causality and variable ordering


2 0.81235135 906 andrew gelman stats-2011-09-14-Another day, another stats postdoc


3 0.76875293 1094 andrew gelman stats-2011-12-31-Using factor analysis or principal components analysis or measurement-error models for biological measurements in archaeology?

Introduction: Greg Campbell writes: I am a Canadian archaeologist (BSc in Chemistry) researching the past human use of European Atlantic shellfish. After two decades of practice I am finally getting a MA in archaeology at Reading. I am seeing if the habitat or size of harvested mussels (Mytilus edulis) can be reconstructed from measurements of the umbo (the pointy end, and the only bit that survives well in archaeological deposits) using log-transformed measurements (or allometry; relationships between dimensions are more likely exponential than linear). Of course multivariate regressions in most statistics packages (Minitab, SPSS, SAS) assume you are trying to predict one variable from all the others (a Model I regression), and use ordinary least squares to fit the regression line. For organismal dimensions this makes little sense, since all the dimensions are (at least in theory) free to change their mutual proportions during growth. So there is no predictor and predicted, mutual variation of

4 0.76264763 770 andrew gelman stats-2011-06-15-Still more Mr. P in public health

Introduction: When it rains it pours . . . John Transue writes: I saw a post on Andrew Sullivan’s blog today about life expectancy in different US counties. With a bunch of the worst counties being in Mississippi, I thought that it might be another case of analysts getting extreme values from small counties. However, the paper (see here ) includes a pretty interesting methods section. This is from page 5, “Specifically, we used a mixed-effects Poisson regression with time, geospatial, and covariate components. Poisson regression fits count outcome variables, e.g., death counts, and is preferable to a logistic model because the latter is biased when an outcome is rare (occurring in less than 1% of observations).” They have downloadable data. I believe that the data are predicted values from the model. A web appendix also gives 90% CIs for their estimates. Do you think they solved the small county problem and that the worst counties really are where their spreadsheet suggests? My re

5 0.74954003 1196 andrew gelman stats-2012-03-04-Piss-poor monocausal social science

Introduction: Dan Kahan writes: Okay, have done due diligence here & can’t find the reference. It was in recent blog — and was more or less an aside — but you ripped into researchers (pretty sure econometricians, but this could be my memory adding to your account recollections it conjured from my own experience) who purport to make estimates or predictions based on multivariate regression in which the value of particular predictor is set at some level while others “held constant” etc., on ground that variance in that particular predictor independent of covariance in other model predictors is unrealistic. You made it sound, too, as if this were one of the pet peeves in your menagerie — leading me to think you had blasted into it before. Know what I’m talking about? Also — isn’t this really just a way of saying that the model is misspecified — at least if the goal is to try to make a valid & unbiased estimate of the impact of that particular predictor? The problem can’t be that one is usin

6 0.74251884 245 andrew gelman stats-2010-08-31-Predicting marathon times

7 0.73502946 1656 andrew gelman stats-2013-01-05-Understanding regression models and regression coefficients

8 0.73201007 1164 andrew gelman stats-2012-02-13-Help with this problem, win valuable prizes

9 0.73020571 1967 andrew gelman stats-2013-08-04-What are the key assumptions of linear regression?

10 0.72965568 1761 andrew gelman stats-2013-03-13-Lame Statistics Patents

11 0.72677517 775 andrew gelman stats-2011-06-21-Fundamental difficulty of inference for a ratio when the denominator could be positive or negative

12 0.72496367 938 andrew gelman stats-2011-10-03-Comparing prediction errors

13 0.72338831 1395 andrew gelman stats-2012-06-27-Cross-validation (What is it good for?)

14 0.72287083 888 andrew gelman stats-2011-09-03-A psychology researcher asks: Is Anova dead?

15 0.70473552 1462 andrew gelman stats-2012-08-18-Standardizing regression inputs

16 0.7040624 1981 andrew gelman stats-2013-08-14-The robust beauty of improper linear models in decision making

17 0.70044988 2204 andrew gelman stats-2014-02-09-Keli Liu and Xiao-Li Meng on Simpson’s paradox

18 0.69746041 1703 andrew gelman stats-2013-02-02-Interaction-based feature selection and classification for high-dimensional biological data

19 0.69549066 2311 andrew gelman stats-2014-04-29-Bayesian Uncertainty Quantification for Differential Equations!

20 0.68964314 375 andrew gelman stats-2010-10-28-Matching for preprocessing data for causal inference


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(1, 0.037), (5, 0.012), (21, 0.034), (24, 0.196), (31, 0.057), (42, 0.011), (47, 0.012), (54, 0.016), (58, 0.034), (86, 0.11), (95, 0.029), (99, 0.332)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97739398 2364 andrew gelman stats-2014-06-08-Regression and causality and variable ordering


2 0.96986628 2093 andrew gelman stats-2013-11-07-I’m negative on the expression “false positives”

Introduction: After seeing a document sent to me and others regarding the crisis of spurious, statistically-significant research findings in psychology research, I had the following reaction: I am unhappy with the use in the document of the phrase “false positives.” I feel that this expression is unhelpful as it frames science in terms of “true” and “false” claims, which I don’t think is particularly accurate. In particular, in most of the recent disputed Psych Science type studies (the ESP study excepted, perhaps), there is little doubt that there is _some_ underlying effect. The issue, as I see it, as that the underlying effects are much smaller, and much more variable, than mainstream researchers imagine. So what happens is that Psych Science or Nature or whatever will publish a result that is purported to be some sort of universal truth, but it is actually a pattern specific to one data set, one population, and one experimental condition. In a sense, yes, these journals are publishing

3 0.96565342 305 andrew gelman stats-2010-09-29-Decision science vs. social psychology

Introduction: Dan Goldstein sends along this bit of research , distinguishing terms used in two different subfields of psychology. Dan writes: Intuitive calls included not listing words that don’t occur 3 or more times in both programs. I [Dan] did this because when I looked at the results, those cases tended to be proper names or arbitrary things like header or footer text. It also narrowed down the space of words to inspect, which means I could actually get the thing done in my copious free time. I think the bar graphs are kinda ugly, maybe there’s a better way to do it based on classifying the words according to content? Also the whole exercise would gain a new dimension by comparing several areas instead of just two. Maybe that’s coming next.

4 0.96486545 872 andrew gelman stats-2011-08-26-Blog on applied probability modeling

Introduction: Joseph Wilson points me to this blog on applied probability modeling. He sent me the link a couple months ago. If he’s still adding new entries, then his blog is probably already longer-lasting than most!

5 0.96422338 1983 andrew gelman stats-2013-08-15-More on AIC, WAIC, etc

Introduction: Following up on our discussion from the other day, Angelika van der Linde sends along this paper from 2012 (link to journal here ). And Aki pulls out this great quote from Geisser and Eddy (1979): This discussion makes clear that in the nested case this method, as Akaike’s, is not consistent; i.e., even if $M_k$ is true, it will be rejected with probability $\alpha$ as $N\to\infty$. This point is also made by Schwarz (1978). However, from the point of view of prediction, this is of no great consequence. For large numbers of observations, a prediction based on the falsely assumed $M_k$, will not differ appreciably from one based on the true $M_k$. For example, if we assert that two normal populations have different means when in fact they have the same mean, then the use of the group mean as opposed to the grand mean for predicting a future observation results in predictors which are asymptotically equivalent and whose predictive variances are $\sigma^2[1 + (1/2n)]$ and $\si

6 0.96196771 781 andrew gelman stats-2011-06-28-The holes in my philosophy of Bayesian data analysis

7 0.96104878 1971 andrew gelman stats-2013-08-07-I doubt they cheated

8 0.96047783 1327 andrew gelman stats-2012-05-18-Comments on “A Bayesian approach to complex clinical diagnoses: a case-study in child abuse”

9 0.96047163 1552 andrew gelman stats-2012-10-29-“Communication is a central task of statistics, and ideally a state-of-the-art data analysis can have state-of-the-art displays to match”

10 0.95904446 2027 andrew gelman stats-2013-09-17-Christian Robert on the Jeffreys-Lindley paradox; more generally, it’s good news when philosophical arguments can be transformed into technical modeling issues

11 0.9580431 759 andrew gelman stats-2011-06-11-“2 level logit with 2 REs & large sample. computational nightmare – please help”

12 0.95800143 866 andrew gelman stats-2011-08-23-Participate in a research project on combining information for prediction

13 0.95775664 86 andrew gelman stats-2010-06-14-“Too much data”?

14 0.95699441 494 andrew gelman stats-2010-12-31-Type S error rates for classical and Bayesian single and multiple comparison procedures

15 0.95561051 1886 andrew gelman stats-2013-06-07-Robust logistic regression

16 0.95457131 899 andrew gelman stats-2011-09-10-The statistical significance filter

17 0.95424676 2102 andrew gelman stats-2013-11-15-“Are all significant p-values created equal?”

18 0.9541254 1950 andrew gelman stats-2013-07-22-My talks that were scheduled for Tues at the Data Skeptics meetup and Wed at the Open Statistical Programming meetup

19 0.95386934 2365 andrew gelman stats-2014-06-09-I hate polynomials

20 0.95371687 669 andrew gelman stats-2011-04-19-The mysterious Gamma (1.4, 0.4)