andrew_gelman_stats andrew_gelman_stats-2010 andrew_gelman_stats-2010-301 knowledge-graph by maker-knowledge-mining

301 andrew gelman stats-2010-09-28-Correlation, prediction, variation, etc.


meta info for this blog

Source: html

Introduction: Hamdan Azhar writes: I [Azhar] write with a question about language in the context of statistics. Consider the three statements below. a) Y is significantly associated (correlated) with X; b) knowledge of X allows us to account for __% of the variance in Y; c) Y can be predicted to a significant extent given knowledge of X. To what extent are these statements equivalent? Much of the (non-statistical) scientific literature doesn’t seem to distinguish between these notions. Is this just about semantics — or are there meaningful differences here, particularly between b and c? Consider a framework where X constitutes a predictor space of p variables (x1,…,xp). We wish to generate a linear combination of these variables to yield a score that optimally correlates with Y. Can we substitute the word “predicts” for “optimally correlates with” in this context? One can argue that “correlating” or “accounting for variance” suggests that we are trying to maximize goodness-of-fit (i
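The setup Azhar describes, a linear combination of the predictors (x1,…,xp) that optimally correlates with Y, is exactly what least squares produces in-sample: the squared correlation between the fitted score and Y equals R-squared. A minimal numpy sketch (simulated data; the coefficients and sample sizes are invented for illustration, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -0.5, 0.25]) + rng.normal(size=n)

# OLS via least squares: with an intercept included, the fitted
# linear combination maximizes the sample correlation with y over
# all linear scores of the predictors.
Xd = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
score = Xd @ beta

r = np.corrcoef(score, y)[0, 1]
r2 = 1 - np.sum((y - score) ** 2) / np.sum((y - y.mean()) ** 2)
print(r ** 2, r2)  # identical up to floating point: corr(score, y)^2 == R^2
```

So in the in-sample sense, "optimally correlates with" and "accounts for __% of the variance" are two readings of the same least-squares fit; the distinction the questioner is after only bites out of sample.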


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Hamdan Azhar writes: I [Azhar] write with a question about language in the context of statistics. [sent-1, score-0.253]

2 a) Y is significantly associated (correlated) with X; b) knowledge of X allows us to account for __% of the variance in Y; c) Y can be predicted to a significant extent given knowledge of X. [sent-3, score-0.952]

3 To what extent are these statements equivalent? [sent-4, score-0.339]

4 Much of the (non-statistical) scientific literature doesn’t seem to distinguish between these notions. [sent-5, score-0.087]

5 Is this just about semantics — or are there meaningful differences here, particularly between b and c? [sent-6, score-0.22]

6 Consider a framework where X constitutes a predictor space of p variables (x1,…,xp). [sent-7, score-0.377]

7 We wish to generate a linear combination of these variables to yield a score that optimally correlates with Y. [sent-8, score-1.112]

8 Can we substitute the word “predicts” for “optimally correlates with” in this context? [sent-9, score-0.432]

9 One can argue that “correlating” or “accounting for variance” suggests that we are trying to maximize goodness-of-fit (i. [sent-10, score-0.109]

10 On the other hand, “prediction” implies that we engage in some form of cross-validation where we seek to minimize some measure of prediction error. [sent-13, score-0.588]

11 Is it alright to substitute “prediction” for “accounting for variance”? [sent-15, score-0.211]

12 Or are these distinct concepts that we should be careful not to conflate? [sent-16, score-0.191]

13 My reply: If interpreted generally enough, these statements are equivalent. [sent-17, score-0.3]

14 “Correlation” refers to a linear relation, whereas “association” is more general. [sent-18, score-0.216]

15 Similarly, you can get information without accounting for “variance,” but if you replace the term by “variation” then this might work. [sent-19, score-0.387]

16 I don’t think you get anything useful out of worrying about these different expressions in general. [sent-20, score-0.217]
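The b-vs-c distinction raised in the question (accounting for variance vs. prediction under cross-validation) can be made concrete: in-sample R-squared never decreases as predictors are added, while cross-validated prediction error can get worse. A hedged numpy sketch (simulated data; the fold count, sample size, and number of noise predictors are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x_signal = rng.normal(size=(n, 1))
y = x_signal[:, 0] + rng.normal(size=n)

def in_sample_r2(X, y):
    # R-squared of an OLS fit with intercept, evaluated on the training data.
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

def cv_mse(X, y, k=5):
    # K-fold cross-validated mean squared prediction error.
    idx = np.arange(len(y))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        Xd_tr = np.column_stack([np.ones(len(train)), X[train]])
        beta, *_ = np.linalg.lstsq(Xd_tr, y[train], rcond=None)
        Xd_te = np.column_stack([np.ones(len(fold)), X[fold]])
        errs.append(np.mean((y[fold] - Xd_te @ beta) ** 2))
    return np.mean(errs)

# Append 30 pure-noise predictors: in-sample fit improves mechanically,
# while out-of-sample prediction degrades from overfitting.
X_big = np.column_stack([x_signal, rng.normal(size=(n, 30))])
print("in-sample R2:", in_sample_r2(x_signal, y), in_sample_r2(X_big, y))
print("5-fold CV MSE:", cv_mse(x_signal, y), cv_mse(X_big, y))
```

Under this reading, "accounting for variance" and "prediction" are the same target estimated two ways, and the gap between them is exactly the overfitting that cross-validation is designed to expose.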


similar blogs computed by the tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('accounting', 0.301), ('azhar', 0.274), ('variance', 0.263), ('optimally', 0.253), ('correlates', 0.221), ('statements', 0.213), ('substitute', 0.211), ('prediction', 0.21), ('language', 0.138), ('hamdan', 0.137), ('correlating', 0.131), ('semantics', 0.131), ('linear', 0.129), ('extent', 0.126), ('knowledge', 0.122), ('context', 0.115), ('constitutes', 0.114), ('minimize', 0.112), ('expressions', 0.11), ('maximize', 0.109), ('worrying', 0.107), ('variables', 0.106), ('distinct', 0.101), ('predicts', 0.095), ('seek', 0.093), ('significantly', 0.091), ('concepts', 0.09), ('engage', 0.09), ('consider', 0.089), ('meaningful', 0.089), ('refers', 0.087), ('interpreted', 0.087), ('yield', 0.087), ('distinguish', 0.087), ('replace', 0.086), ('focusing', 0.084), ('predictor', 0.083), ('generate', 0.083), ('implies', 0.083), ('relation', 0.081), ('wish', 0.081), ('allows', 0.078), ('predicted', 0.078), ('score', 0.077), ('combination', 0.075), ('equivalent', 0.075), ('correlated', 0.075), ('framework', 0.074), ('account', 0.072), ('association', 0.07)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0 301 andrew gelman stats-2010-09-28-Correlation, prediction, variation, etc.


2 0.17484698 810 andrew gelman stats-2011-07-20-Adding more information can make the variance go up (depending on your model)

Introduction: Andy McKenzie writes: In their March 9 “ counterpoint ” in nature biotech to the prospect that we should try to integrate more sources of data in clinical practice (see “ point ” arguing for this), Isaac Kohane and David Margulies claim that, “Finally, how much better is our new knowledge than older knowledge? When is the incremental benefit of a genomic variant(s) or gene expression profile relative to a family history or classic histopathology insufficient and when does it add rather than subtract variance?” Perhaps I am mistaken (thus this email), but it seems that this claim runs contra to the definition of conditional probability. That is, if you have a hierarchical model, and the family history / classical histopathology already suggests a parameter estimate with some variance, how could the new genomic info possibly increase the variance of that parameter estimate? Surely the question is how much variance the new genomic info reduces and whether it therefore justifies t
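McKenzie's question, whether new information can increase the variance of a parameter estimate, has a textbook answer in the normal model with unknown variance: a surprising new observation inflates the estimated scale enough that the marginal posterior variance of the mean goes up. A small sketch under the standard noninformative prior p(mu, sigma^2) proportional to 1/sigma^2 (the data values are invented for illustration):

```python
import numpy as np

def posterior_var_mu(y):
    # Under the noninformative prior p(mu, sigma^2) ∝ 1/sigma^2, the marginal
    # posterior of mu is t with n-1 degrees of freedom, center ybar, and
    # scale^2 = s^2/n, so its variance is (s^2/n) * (n-1)/(n-3).
    y = np.asarray(y, dtype=float)
    n = len(y)
    s2 = y.var(ddof=1)
    return (s2 / n) * (n - 1) / (n - 3)

y = [0.0, 0.1, -0.1, 0.05, -0.05]
v_before = posterior_var_mu(y)
v_after = posterior_var_mu(y + [10.0])  # one surprising new observation
print(v_before, v_after)
```

With the extra data point the posterior variance for mu rises sharply: the new observation carries information, but it mostly tells the model that sigma was underestimated, which is the "depending on your model" part of the title.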

3 0.14030479 490 andrew gelman stats-2010-12-29-Brain Structure and the Big Five

Introduction: Many years ago, a research psychologist whose judgment I greatly respect told me that the characterization of personality by the so-called Big Five traits (extraversion, etc.) was old-fashioned. So I’m always surprised to see that the Big Five keeps cropping up. I guess not everyone agrees that it’s a bad idea. For example, Hamdan Azhar wrote to me: I was wondering if you’d seen this recent paper (De Young et al. 2010) that finds significant correlations between brain volume in selected regions and personality trait measures (from the Big Five). This is quite a ground-breaking finding and it was covered extensively in the mainstream media. I think readers of your blog would be interested in your thoughts, statistically speaking, on their methodology and findings. My reply: I’d be interested in my thoughts on this too! But I don’t know enough to say anything useful. From the abstract of the paper under discussion: Controlling for age, sex, and whole-brain volume

4 0.13863762 1862 andrew gelman stats-2013-05-18-uuuuuuuuuuuuugly

Introduction: Hamdan Azhar writes: I came across this graphic of vaccine-attributed decreases in mortality and was curious if you found it as unattractive and unintuitive as I did. Hope all is well with you! My reply: All’s well with me. And yes, that’s one horrible graph. It has all the problems with a bad infographic with none of the virtues. Compared to this monstrosity, the typical USA Today graph is a stunning, beautiful masterpiece. I don’t think I want to soil this webpage with the image. In fact, I don’t even want to link to it.

5 0.12349215 2315 andrew gelman stats-2014-05-02-Discovering general multidimensional associations

Introduction: Continuing our discussion of general measures of correlations, Ben Murrell sends along this paper (with corresponding R package), which begins: When two variables are related by a known function, the coefficient of determination (denoted R-squared) measures the proportion of the total variance in the observations that is explained by that function. This quantifies the strength of the relationship between variables by describing what proportion of the variance is signal as opposed to noise. For linear relationships, this is equal to the square of the correlation coefficient, ρ. When the parametric form of the relationship is unknown, however, it is unclear how to estimate the proportion of explained variance equitably – assigning similar values to equally noisy relationships. Here we demonstrate how to directly estimate a generalized R-squared when the form of the relationship is unknown, and we question the performance of the Maximal Information Coefficient (MIC) – a recently pr

6 0.11606237 63 andrew gelman stats-2010-06-02-The problem of overestimation of group-level variance parameters

7 0.10985494 352 andrew gelman stats-2010-10-19-Analysis of survey data: Design based models vs. hierarchical modeling?

8 0.10142849 464 andrew gelman stats-2010-12-12-Finite-population standard deviation in a hierarchical model

9 0.098277748 1196 andrew gelman stats-2012-03-04-Piss-poor monocausal social science

10 0.095188588 1737 andrew gelman stats-2013-02-25-Correlation of 1 . . . too good to be true?

11 0.093451582 1206 andrew gelman stats-2012-03-10-95% intervals that I don’t believe, because they’re from a flat prior I don’t believe

12 0.092314132 1365 andrew gelman stats-2012-06-04-Question 25 of my final exam for Design and Analysis of Sample Surveys

13 0.091956653 1981 andrew gelman stats-2013-08-14-The robust beauty of improper linear models in decision making

14 0.08923097 846 andrew gelman stats-2011-08-09-Default priors update?

15 0.086630382 779 andrew gelman stats-2011-06-25-Avoiding boundary estimates using a prior distribution as regularization

16 0.08467792 1248 andrew gelman stats-2012-04-06-17 groups, 6 group-level predictors: What to do?

17 0.083622605 1967 andrew gelman stats-2013-08-04-What are the key assumptions of linear regression?

18 0.083550893 2143 andrew gelman stats-2013-12-22-The kluges of today are the textbook solutions of tomorrow.

19 0.080591701 2145 andrew gelman stats-2013-12-24-Estimating and summarizing inference for hierarchical variance parameters when the number of groups is small

20 0.079202183 451 andrew gelman stats-2010-12-05-What do practitioners need to know about regression?


similar blogs computed by the lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.123), (1, 0.053), (2, 0.055), (3, -0.028), (4, 0.025), (5, -0.009), (6, 0.022), (7, -0.02), (8, 0.019), (9, 0.072), (10, 0.003), (11, 0.016), (12, 0.021), (13, -0.001), (14, -0.001), (15, 0.011), (16, -0.015), (17, -0.007), (18, 0.01), (19, 0.015), (20, 0.009), (21, -0.01), (22, 0.037), (23, 0.015), (24, 0.036), (25, 0.023), (26, 0.044), (27, 0.031), (28, -0.004), (29, -0.017), (30, 0.051), (31, 0.042), (32, 0.003), (33, -0.02), (34, 0.035), (35, 0.015), (36, 0.028), (37, -0.021), (38, 0.013), (39, -0.037), (40, 0.021), (41, -0.056), (42, 0.042), (43, 0.052), (44, -0.04), (45, -0.019), (46, 0.024), (47, -0.0), (48, 0.009), (49, -0.012)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.98018003 301 andrew gelman stats-2010-09-28-Correlation, prediction, variation, etc.


2 0.77836132 2315 andrew gelman stats-2014-05-02-Discovering general multidimensional associations


3 0.71641725 257 andrew gelman stats-2010-09-04-Question about standard range for social science correlations

Introduction: Andrew Eppig writes: I’m a physicist by training who is transitioning to the social sciences. I recently came across a reference in the Economist to a paper on IQ and parasites which I read as I have more than a passing interest in IQ research (having read much that you and others (e.g., Shalizi, Wicherts) have written). In this paper I note that the authors find a very high correlation between national IQ and parasite prevalence. The strength of the correlation (-0.76 to -0.82) surprised me, as I’m used to much weaker correlations in the social sciences. To me, it’s a bit too high, suggesting that there are other factors at play or that one of the variables is merely a proxy for a large number of other variables. But I have no basis for this other than a gut feeling and a memory of a plot on Language Log about the distribution of correlation coefficients in social psychology. So my question is this: Is a correlation in the range of (-0.82,-0.76) more likely to be a correlatio

4 0.70383 810 andrew gelman stats-2011-07-20-Adding more information can make the variance go up (depending on your model)


5 0.68889832 1966 andrew gelman stats-2013-08-03-Uncertainty in parameter estimates using multilevel models

Introduction: David Hsu writes: I have a (perhaps) simple question about uncertainty in parameter estimates using multilevel models — what is an appropriate threshold for measuring parameter uncertainty in a multilevel model? The reason why I ask is that I set out to do a crossed two-way model with two varying intercepts, similar to your flight simulator example in your 2007 book. The difference is that I have a lot of predictors specific to each cell (I think equivalent to airport and pilot in your example), and I find after modeling this in JAGS, I happily find that the predictors are much less important than the variability by cell (airport and pilot effects). Happily because this is what I am writing a paper about. However, I then went to check subsets of predictors using lm() and lmer(). I understand that they all use different estimation methods, but what I can’t figure out is why the errors on all of the coefficient estimates are *so* different. For example, using JAGS, and th

6 0.68653804 1918 andrew gelman stats-2013-06-29-Going negative

7 0.68642539 1686 andrew gelman stats-2013-01-21-Finite-population Anova calculations for models with interactions

8 0.67604703 888 andrew gelman stats-2011-09-03-A psychology researcher asks: Is Anova dead?

9 0.67260826 753 andrew gelman stats-2011-06-09-Allowing interaction terms to vary

10 0.6576038 2145 andrew gelman stats-2013-12-24-Estimating and summarizing inference for hierarchical variance parameters when the number of groups is small

11 0.6482451 303 andrew gelman stats-2010-09-28-“Genomics” vs. genetics

12 0.64776635 1908 andrew gelman stats-2013-06-21-Interpreting interactions in discrete-data regression

13 0.64164901 1121 andrew gelman stats-2012-01-15-R-squared for multilevel models

14 0.63978493 1462 andrew gelman stats-2012-08-18-Standardizing regression inputs

15 0.63834608 2274 andrew gelman stats-2014-03-30-Adjudicating between alternative interpretations of a statistical interaction?

16 0.63745701 1967 andrew gelman stats-2013-08-04-What are the key assumptions of linear regression?

17 0.63554668 1703 andrew gelman stats-2013-02-02-Interaction-based feature selection and classification for high-dimensional biological data

18 0.63368034 2204 andrew gelman stats-2014-02-09-Keli Liu and Xiao-Li Meng on Simpson’s paradox

19 0.63356024 1663 andrew gelman stats-2013-01-09-The effects of fiscal consolidation

20 0.6324271 1230 andrew gelman stats-2012-03-26-Further thoughts on nonparametric correlation measures


similar blogs computed by the lda model

lda for this blog:

topicId topicWeight

[(2, 0.018), (6, 0.013), (7, 0.024), (16, 0.074), (17, 0.013), (19, 0.025), (21, 0.019), (24, 0.186), (42, 0.012), (43, 0.021), (53, 0.013), (60, 0.015), (74, 0.055), (86, 0.075), (95, 0.097), (97, 0.02), (99, 0.229)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97695351 301 andrew gelman stats-2010-09-28-Correlation, prediction, variation, etc.


2 0.94158673 266 andrew gelman stats-2010-09-09-The future of R

Introduction: Some thoughts from Christian , including this bit: We need to consider separately 1. R’s brilliant library 2. R’s not-so-brilliant language and/or interpreter. I don’t know that R’s library is so brilliant as all that–if necessary, I don’t think it would be hard to reprogram the important packages in a new language. I would say, though, that the problems with R are not just in the technical details of the language. I think the culture of R has some problems too. As I’ve written before, R functions used to be lean and mean, and now they’re full of exception-handling and calls to other packages. R functions are spaghetti-like messes of connections in which I keep expecting to run into syntax like “GOTO 120.” I learned about these problems a couple years ago when writing bayesglm(), which is a simple adaptation of glm(). But glm(), and its workhorse, glm.fit(), are a mess: They’re about 10 lines of functioning code, plus about 20 lines of necessary front-end, plus a cou

3 0.92840457 2135 andrew gelman stats-2013-12-15-The UN Plot to Force Bayesianism on Unsuspecting Americans (penalized B-Spline edition)

Introduction: Mike Spagat sent me an email with the above heading, referring to this paper by Leontine Alkema and Jin Rou New, which begins: National estimates of the under-5 mortality rate (U5MR) are used to track progress in reducing child mortality and to evaluate countries’ performance related to United Nations Millennium Development Goal 4, which calls for a reduction in the U5MR by two-thirds between 1990 and 2015. However, for the great majority of developing countries without well-functioning vital registration systems, estimating levels and trends in child mortality is challenging, not only because of limited data availability but also because of issues with data quality. Global U5MR estimates are often constructed without accounting for potential biases in data series, which may lead to inaccurate point estimates and/or credible intervals. We describe a Bayesian penalized B-spline regression model for assessing levels and trends in the U5MR for all countries in the world, whereby bi

4 0.92421263 599 andrew gelman stats-2011-03-03-Two interesting posts elsewhere on graphics

Introduction: Have data graphics progressed in the last century? The first addresses familiar subjects to readers of the blog, with some nice examples of where infographics emphasize the obvious, or increase the probability of an incorrect insight. Your Help Needed: the Effect of Aesthetics on Visualization I borrow the term ‘insight’ from the second link, a study by a group of design & software researchers based around a single interactive graphic. This is similar in spirit to Unwin’s ‘caption this graphic’ assignment.

5 0.92132926 829 andrew gelman stats-2011-07-29-Infovis vs. statgraphics: A clear example of their different goals

Introduction: I recently came across a data visualization that perfectly demonstrates the difference between the “infovis” and “statgraphics” perspectives. Here’s the image ( link from Tyler Cowen): That’s the infovis. The statgraphic version would simply be a dotplot, something like this: (I purposely used the default settings in R with only minor modifications here to demonstrate what happens if you just want to plot the data with minimal effort.) Let’s compare the two graphs: From a statistical graphics perspective, the second graph dominates. The countries are directly comparable and the numbers are indicated by positions rather than area. The first graph is full of distracting color and gives the misleading visual impression that the total GDP of countries 5-10 is about equal to that of countries 1-4. If the goal is to get attention , though, it’s another story. There’s nothing special about the top graph above except how it looks. It represents neither a dat

6 0.92070872 2154 andrew gelman stats-2013-12-30-Bill Gates’s favorite graph of the year

7 0.92061365 404 andrew gelman stats-2010-11-09-“Much of the recent reported drop in interstate migration is a statistical artifact”

8 0.91998649 1575 andrew gelman stats-2012-11-12-Thinking like a statistician (continuously) rather than like a civilian (discretely)

9 0.91995633 1737 andrew gelman stats-2013-02-25-Correlation of 1 . . . too good to be true?

10 0.91937351 639 andrew gelman stats-2011-03-31-Bayes: radical, liberal, or conservative?

11 0.91865909 783 andrew gelman stats-2011-06-30-Don’t stop being a statistician once the analysis is done

12 0.91772282 1474 andrew gelman stats-2012-08-29-More on scaled-inverse Wishart and prior independence

13 0.91749394 1367 andrew gelman stats-2012-06-05-Question 26 of my final exam for Design and Analysis of Sample Surveys

14 0.9168337 1834 andrew gelman stats-2013-05-01-A graph at war with its caption. Also, how to visualize the same numbers without giving the display a misleading causal feel?

15 0.91669714 899 andrew gelman stats-2011-09-10-The statistical significance filter

16 0.91619432 1612 andrew gelman stats-2012-12-08-The Case for More False Positives in Anti-doping Testing

17 0.91606808 1422 andrew gelman stats-2012-07-20-Likelihood thresholds and decisions

18 0.91539878 1240 andrew gelman stats-2012-04-02-Blogads update

19 0.91456246 1164 andrew gelman stats-2012-02-13-Help with this problem, win valuable prizes

20 0.91207248 494 andrew gelman stats-2010-12-31-Type S error rates for classical and Bayesian single and multiple comparison procedures