andrew_gelman_stats andrew_gelman_stats-2010 andrew_gelman_stats-2010-56 knowledge-graph by maker-knowledge-mining

56 andrew gelman stats-2010-05-28-Another argument in favor of expressing conditional probability statements using the population distribution


meta information for this blog

Source: html

Introduction: Yesterday we had a spirited discussion of the following conditional probability puzzle: “I have two children. One is a boy born on a Tuesday. What is the probability I have two boys?” This reminded me of the principle, familiar from statistics instruction and the cognitive psychology literature, that the best way to teach these sorts of examples is through integers rather than fractions. For example, consider this classic problem: “10% of persons have disease X. You are tested for the disease and test positive, and the test has 80% accuracy. What is the probability that you have the disease?” This can be solved directly using conditional probability but it appears to be clearer to do it using integers: Start with 100 people. 10 will have the disease and 90 will not. Of the 10 with the disease, 8 will test positive and 2 will test negative. Of the 90 without the disease, 18 will test positive and 72 will test negative. (72 = 0.8*90.) So, out of the origin
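
The integer bookkeeping in the disease example can be sketched in a few lines of Python. This is only an illustration, not code from the post: the function name and its parameters are hypothetical, and it assumes that "80% accuracy" means the test is right 80% of the time both for people with the disease and for people without it, which is what the counts in the text imply. It also assumes that everyone in the population is tested.

```python
from fractions import Fraction


def natural_frequencies(population=100,
                        base_rate=Fraction(1, 10),
                        accuracy=Fraction(4, 5)):
    """Count test outcomes in a hypothetical population where everyone is tested.

    Assumption (not stated explicitly in the post): `accuracy` applies equally
    to people with and without the disease.
    """
    diseased = population * base_rate       # people who have the disease
    healthy = population - diseased         # people who do not
    true_pos = diseased * accuracy          # diseased and test positive
    false_neg = diseased - true_pos         # diseased but test negative
    true_neg = healthy * accuracy           # healthy and test negative
    false_pos = healthy - true_neg          # healthy but test positive
    positives = true_pos + false_pos        # everyone who tests positive
    return {
        "true_pos": true_pos,
        "false_neg": false_neg,
        "false_pos": false_pos,
        "true_neg": true_neg,
        "positives": positives,
        # Share of positive testers who actually have the disease.
        "p_disease_given_positive": true_pos / positives,
    }


result = natural_frequencies()
print(result["positives"], result["p_disease_given_positive"])
```

With the default inputs the counts are 8, 2, 18 and 72, so 26 people test positive and 8 of them have the disease, a fraction of 4/13 (about 31%). Exact fractions are used so that the counts stay whole numbers rather than picking up floating-point noise.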


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Yesterday we had a spirited discussion of the following conditional probability puzzle: “I have two children. [sent-1, score-0.529]

2 ” This reminded me of the principle, familiar from statistics instruction and the cognitive psychology literature, that the best way to teach these sorts of examples is through integers rather than fractions. [sent-4, score-0.29]

3 For example, consider this classic problem: “10% of persons have disease X. [sent-5, score-0.733]

4 You are tested for the disease and test positive, and the test has 80% accuracy. [sent-6, score-1.221]

5 What is the probability that you have the disease? [sent-7, score-0.23]

6 ” This can be solved directly using conditional probability but it appears to be clearer to do it using integers: Start with 100 people. [sent-8, score-0.59]

7 Of the 10 with the disease, 8 will test positive and 2 will test negative. [sent-10, score-0.626]

8 Of the 90 without the disease, 18 will test positive and 72 will test negative. [sent-11, score-0.626]

9 ) So, out of the original 100 people, 26 have tested positive, and 8 of these actually have the disease. [sent-14, score-0.194]

10 Expressing the problem using a population distribution rather than a probability distribution has an additional advantage: it forces us to be explicit about the data-generating process. [sent-18, score-0.953]

11 The key assumption is that everybody (or, equivalently, a random sample of people) is tested. [sent-20, score-0.233]

12 Or, to put it another way, we’re assuming that the 10% base rate applies to the population of people who get tested. [sent-21, score-0.43]

13 If, for example, you get tested only if you think it’s likely you have the disease, then the above simplified model won’t work. [sent-22, score-0.282]

14 This condition is a bit hidden in the probability model, but it jumps out (at least, to me) in the “population distribution” formulation. [sent-23, score-0.444]

15 The key phrases above: “Of the 10 with the disease . [sent-24, score-0.711]

16 The crucial unstated assumption was that, every time someone had exactly two children with at least one born on a Tuesday, he would give you this information. [sent-32, score-0.764]

17 It’s hard to keep this straight, given the artificial nature of the problem and the strange bit of linguistics (“I have two children” = “exactly two,” but “One is a boy” = “exactly one”). [sent-33, score-0.397]

18 But if you do it with a population distribution (start with 4×49 families and go from there), then it’s clear that you’re assuming that everyone in this situation is telling you this particular information. [sent-34, score-0.517]

19 It becomes less of a vague question of “what are we conditioning on? [sent-35, score-0.195]

20 ” and more clearly an assumption about where the data came from. [sent-36, score-0.156]
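
The "start with 4×49 families" approach to the Tuesday puzzle can also be written out as a direct enumeration. The sketch below is an illustration under stated assumptions, not the author's code: it assumes all 196 combinations of two sexes and two birth days are equally likely, and that every parent with at least one boy born on a Tuesday gives you that information. The function name and the choice of which day index stands for Tuesday are hypothetical.

```python
from fractions import Fraction
from itertools import product

SEXES = ("b", "g")
DAYS = range(7)   # days of the week as 0..6
TUESDAY = 1       # arbitrary index; any fixed day gives the same count


def two_boys_given_tuesday_boy():
    """Enumerate the 4 x 49 equally likely families and count the relevant ones."""
    # Each family is (sex of child 1, day of child 1, sex of child 2, day of child 2).
    families = list(product(SEXES, DAYS, SEXES, DAYS))

    # Assumption: everyone with at least one Tuesday-born boy reports it.
    reporting = [
        f for f in families
        if (f[0] == "b" and f[1] == TUESDAY) or (f[2] == "b" and f[3] == TUESDAY)
    ]
    both_boys = [f for f in reporting if f[0] == "b" and f[2] == "b"]

    return (len(families), len(reporting), len(both_boys),
            Fraction(len(both_boys), len(reporting)))


print(two_boys_given_tuesday_boy())
```

Under these assumptions, 27 of the 196 families have at least one Tuesday-born boy and 13 of those have two boys, giving 13/27. Changing the filter that defines `reporting` is the place where a different assumption about who volunteers this information would enter, which is the point of writing the problem as a population.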


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('disease', 0.565), ('test', 0.231), ('probability', 0.23), ('integers', 0.208), ('tested', 0.194), ('positive', 0.164), ('population', 0.156), ('assumption', 0.156), ('assuming', 0.15), ('distribution', 0.144), ('boy', 0.143), ('born', 0.13), ('exactly', 0.121), ('spirited', 0.104), ('children', 0.103), ('two', 0.098), ('conditional', 0.097), ('consider', 0.096), ('linguistics', 0.094), ('unstated', 0.094), ('simplified', 0.088), ('jumps', 0.084), ('equivalently', 0.082), ('instruction', 0.082), ('key', 0.077), ('forces', 0.075), ('artificial', 0.074), ('tuesday', 0.072), ('persons', 0.072), ('conditioning', 0.071), ('explicit', 0.071), ('boys', 0.07), ('phrases', 0.069), ('start', 0.069), ('expressing', 0.069), ('problem', 0.068), ('puzzle', 0.068), ('vague', 0.067), ('solved', 0.067), ('families', 0.067), ('clearer', 0.066), ('condition', 0.065), ('using', 0.065), ('hidden', 0.065), ('applies', 0.064), ('strange', 0.063), ('crucial', 0.062), ('base', 0.06), ('straight', 0.057), ('becomes', 0.057)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0 56 andrew gelman stats-2010-05-28-Another argument in favor of expressing conditional probability statements using the population distribution

2 0.28914726 1221 andrew gelman stats-2012-03-19-Whassup with deviance having a high posterior correlation with a parameter in the model?

Introduction: Jean Richardson writes: Do you know what might lead to a large negative cross-correlation (-0.95) between deviance and one of the model parameters? Here’s the (brief) background: I [Richardson] have written a Bayesian hierarchical site occupancy model for presence of disease on individual amphibians. The response variable is therefore binary (disease present/absent) and the probability of disease being present in an individual (psi) depends on various covariates (species of amphibian, location sampled, etc.) parameterized using a logit link function. Replicates are individuals sampled (tested for presence of disease) together. The possibility of imperfect detection is included as p = (prob. disease detected given disease is present). Posterior distributions were estimated using WinBUGS via R2WinBUGS. Simulated data from the model fit the real data very well and posterior distribution densities seem robust to any changes in the model (different priors, etc.) All autocor

3 0.19014268 54 andrew gelman stats-2010-05-27-Hype about conditional probability puzzles

Introduction: Jason Kottke posts this puzzle from Gary Foshee that reportedly impressed people at a puzzle-designers’ convention: I have two children. One is a boy born on a Tuesday. What is the probability I have two boys? The first thing you think is “What has Tuesday got to do with it?” Well, it has everything to do with it. I thought I should really figure this one out myself before reading any further, and I decided this was a good time to apply my general principle that it’s always best to solve such problems from scratch rather than trying to guess at the answer. So I laid out all the 4 x 49 possibilities. The 4 is bb, bg, gb, gg, and the 49 are all possible pairs of days of the week. Then I ruled out all the possibilities that were inconsistent with the data: this leaves the following: bb with all pairs of days that include a Tuesday. That’s 13 possibilities (Mon/Tues, Tues/Tues, Wed/Tues, …, Tues/Mon, …, Sun/Tues, remembering not to count Tues/Tues twice). bg with all

4 0.17939827 2200 andrew gelman stats-2014-02-05-Prior distribution for a predicted probability

Introduction: I received the following email: I have an interesting thought on a prior for a logistic regression, and would love your input on how to make it “work.” Some of my research, two published papers, are on mathematical models of **. Along those lines, I’m interested in developing more models for **. . . . Empirical studies show that the public is rather smart and that the wisdom-of-the-crowd is fairly accurate. So, my thought would be to treat the public’s probability of the event as a prior, and then see how adding data, through a model, would change or perturb our inferred probability of **. (Similarly, I could envision using previously published epidemiological research as a prior probability of a disease, and then seeing how the addition of new testing protocols would update that belief.) However, everything I learned about hierarchical Bayesian models has a prior as a distribution on the coefficients. I don’t know how to start with a prior point estimate for the probabili

5 0.14674586 1524 andrew gelman stats-2012-10-07-An (impressive) increase in survival rate from 50% to 60% corresponds to an R-squared of (only) 1%. Counterintuitive, huh?

Introduction: I was just reading an old post and came across this example which I’d like to share with you again: Here’s a story of R-squared = 1%. Consider a 0/1 outcome with about half the people in each category. For example, half the people with some disease die in a year and half live. Now suppose there’s a treatment that increases survival rate from 50% to 60%. The unexplained sd is 0.5 and the explained sd is 0.05, hence R-squared is 0.01.

6 0.14329834 1527 andrew gelman stats-2012-10-10-Another reason why you can get good inferences from a bad model

7 0.13786167 1572 andrew gelman stats-2012-11-10-I don’t like this cartoon

8 0.13189685 1288 andrew gelman stats-2012-04-29-Clueless Americans think they’ll never get sick

9 0.12533712 972 andrew gelman stats-2011-10-25-How do you interpret standard errors from a regression fit to the entire population?

10 0.11011356 1618 andrew gelman stats-2012-12-11-The consulting biz

11 0.10880621 961 andrew gelman stats-2011-10-16-The “Washington read” and the algebra of conditional distributions

12 0.1081041 996 andrew gelman stats-2011-11-07-Chi-square FAIL when many cells have small expected values

13 0.10459258 351 andrew gelman stats-2010-10-18-“I was finding the test so irritating and boring that I just started to click through as fast as I could”

14 0.10318659 341 andrew gelman stats-2010-10-14-Confusion about continuous probability densities

15 0.10143227 2155 andrew gelman stats-2013-12-31-No on Yes-No decisions

16 0.099959992 1364 andrew gelman stats-2012-06-04-Massive confusion about a study that purports to show that exercise may increase heart risk

17 0.097717263 1941 andrew gelman stats-2013-07-16-Priors

18 0.094597034 7 andrew gelman stats-2010-04-27-Should Mister P be allowed-encouraged to reside in counter-factual populations?

19 0.094508983 2121 andrew gelman stats-2013-12-02-Should personal genetic testing be regulated? Battle of the blogroll

20 0.093200088 602 andrew gelman stats-2011-03-06-Assumptions vs. conditions


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.174), (1, 0.063), (2, 0.053), (3, -0.024), (4, 0.006), (5, -0.015), (6, 0.05), (7, 0.051), (8, 0.019), (9, -0.079), (10, -0.069), (11, -0.01), (12, -0.001), (13, -0.056), (14, -0.062), (15, -0.003), (16, 0.018), (17, -0.007), (18, 0.013), (19, -0.028), (20, 0.037), (21, -0.022), (22, -0.001), (23, -0.044), (24, 0.012), (25, 0.043), (26, -0.058), (27, 0.032), (28, 0.02), (29, 0.007), (30, -0.041), (31, -0.002), (32, -0.048), (33, 0.083), (34, -0.01), (35, -0.046), (36, 0.008), (37, 0.011), (38, -0.025), (39, 0.021), (40, 0.029), (41, -0.064), (42, 0.049), (43, -0.092), (44, -0.01), (45, 0.034), (46, 0.029), (47, 0.073), (48, -0.019), (49, -0.004)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97174275 56 andrew gelman stats-2010-05-28-Another argument in favor of expressing conditional probability statements using the population distribution

2 0.80889946 341 andrew gelman stats-2010-10-14-Confusion about continuous probability densities

Introduction: I had the following email exchange with a reader of Bayesian Data Analysis. My correspondent wrote: Exercise 1(b) involves evaluating the normal pdf at a single point. But p(Y=y|mu,sigma) = 0 (and is not simply N(y|mu,sigma)), since the normal distribution is continuous. So it seems that part (b) of the exercise is inappropriate. The solution does actually evaluate the probability as the value of the pdf at the single point, which is wrong. The probabilities should all be 0, so the answer to (b) is undefined. I replied: The pdf is the probability density function, which for a continuous distribution is defined as the derivative of the cumulative density function. The notation in BDA is rigorous but we do not spell out all the details, so I can see how confusion is possible. My correspondent: I agree that the pdf is the derivative of the cdf. But to compute P(a .lt. Y .lt. b) for a continuous distribution (with support in the real line) requires integrating over t

3 0.79580307 54 andrew gelman stats-2010-05-27-Hype about conditional probability puzzles

Introduction: Jason Kottke posts this puzzle from Gary Foshee that reportedly impressed people at a puzzle-designers’ convention: I have two children. One is a boy born on a Tuesday. What is the probability I have two boys? The first thing you think is “What has Tuesday got to do with it?” Well, it has everything to do with it. I thought I should really figure this one out myself before reading any further, and I decided this was a good time to apply my general principle that it’s always best to solve such problems from scratch rather than trying to guess at the answer. So I laid out all the 4 x 49 possibilities. The 4 is bb, bg, gb, gg, and the 49 are all possible pairs of days of the week. Then I ruled out all the possibilities that were inconsistent with the data: this leaves the following: bb with all pairs of days that include a Tuesday. That’s 13 possibilities (Mon/Tues, Tues/Tues, Wed/Tues, …, Tues/Mon, …, Sun/Tues, remembering not to count Tues/Tues twice). bg with all

4 0.72915369 996 andrew gelman stats-2011-11-07-Chi-square FAIL when many cells have small expected values

Introduction: William Perkins, Mark Tygert, and Rachel Ward write: If a discrete probability distribution in a model being tested for goodness-of-fit is not close to uniform, then forming the Pearson χ2 statistic can involve division by nearly zero. This often leads to serious trouble in practice — even in the absence of round-off errors . . . The problem is not merely that the chi-squared statistic doesn’t have the advertised chi-squared distribution —a reference distribution can always be computed via simulation, either using the posterior predictive distribution or by conditioning on a point estimate of the cell expectations and then making a degrees-of-freedom sort of adjustment. Rather, the problem is that, when there are lots of cells with near-zero expectation, the chi-squared test is mostly noise. And this is not merely a theoretical problem. It comes up in real examples. Here’s one, taken from the classic 1992 genetics paper of Guo and Thompson: And here are the e

5 0.70122868 923 andrew gelman stats-2011-09-24-What is the normal range of values in a medical test?

Introduction: Geoffrey Sheean writes: I am having trouble thinking Bayesianly about the so-called ‘normal’ or ‘reference’ values that I am supposed to use in some of the tests I perform. These values are obtained from purportedly healthy people. Setting aside concerns about ascertainment bias, non-parametric distributions, and the like, the values are usually obtained by setting the limits at ± 2SD from the mean. In some cases, supposedly because of a non-normal distribution, the third highest and lowest value observed in the healthy group sets the limits, on the assumption that no more than 2 results (out of 20 samples) are allowed to exceed these values: if there are 3 or more, then the test is assumed to be abnormal and the reference range is said to reflect the 90th percentile. The results are binary – normal, abnormal. The relevance to the diseased state is this. People who are known unequivocally to have condition X show Y abnormalities in these tests. Therefore, when people suspected

6 0.69503534 138 andrew gelman stats-2010-07-10-Creating a good wager based on probability estimates

7 0.68852466 2128 andrew gelman stats-2013-12-09-How to model distributions that have outliers in one direction

8 0.66089606 1527 andrew gelman stats-2012-10-10-Another reason why you can get good inferences from a bad model

9 0.65898389 2322 andrew gelman stats-2014-05-06-Priors I don’t believe

10 0.65303767 2029 andrew gelman stats-2013-09-18-Understanding posterior p-values

11 0.64817131 1387 andrew gelman stats-2012-06-21-Will Tiger Woods catch Jack Nicklaus? And a discussion of the virtues of using continuous data even if your goal is discrete prediction

12 0.6431908 2155 andrew gelman stats-2013-12-31-No on Yes-No decisions

13 0.64306945 2342 andrew gelman stats-2014-05-21-Models with constraints

14 0.63738251 2258 andrew gelman stats-2014-03-21-Random matrices in the news

15 0.63662571 791 andrew gelman stats-2011-07-08-Censoring on one end, “outliers” on the other, what can we do with the middle?

16 0.63310826 1518 andrew gelman stats-2012-10-02-Fighting a losing battle

17 0.62919712 1221 andrew gelman stats-2012-03-19-Whassup with deviance having a high posterior correlation with a parameter in the model?

18 0.62410551 777 andrew gelman stats-2011-06-23-Combining survey data obtained using different modes of sampling

19 0.62126714 858 andrew gelman stats-2011-08-17-Jumping off the edge of the world

20 0.61812723 1284 andrew gelman stats-2012-04-26-Modeling probability data


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(2, 0.024), (6, 0.02), (16, 0.071), (21, 0.038), (24, 0.149), (34, 0.017), (40, 0.099), (82, 0.01), (86, 0.024), (97, 0.012), (99, 0.393)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.98903406 1679 andrew gelman stats-2013-01-18-Is it really true that only 8% of people who buy Herbalife products are Herbalife distributors?

Introduction: A reporter emailed me the other day with a question about a case I’d never heard of before, a company called Herbalife that is being accused of being a pyramid scheme. The reporter pointed me to this document which describes a survey conducted by “a third party firm called Lieberman Research”: Two independent studies took place using real time (aka “river”) sampling, in which respondents were intercepted across a wide array of websites Sample size of 2,000 adults 18+ matched to U.S. census on age, gender, income, region and ethnicity “River sampling” in this case appears to mean, according to the reporter, that “people were invited into it through online ads.” The survey found that 5% of U.S. households had purchased Herbalife products during the past three months (with a “0.8% margin of error,” ha ha ha). Then they did a multiplication and a division to estimate that only 8% of households who bought these products were Herbalife distributors: 480,000 active distributor

same-blog 2 0.9840945 56 andrew gelman stats-2010-05-28-Another argument in favor of expressing conditional probability statements using the population distribution

3 0.98322439 962 andrew gelman stats-2011-10-17-Death!

Introduction: This graph shows the estimate that Kenny Shirley and I have of support for the death penalty by sex and race in the U.S. since 1955: We also found that capital punishment used to be more popular in the Northeast than in the South, but now it’s the other way around. Here’s the abstract to our paper : One of the longest running questions that has been regularly included in Gallup’s national public opinion poll is “Do you favor or oppose the death penalty for persons convicted of murder?” Because the death penalty is governed by state laws rather than federal laws, it is of special interest to know how public opinion varies by state, and how it has changed over time within each state. In this paper we combine dozens of national polls taken over a fifty-year span and fit a Bayesian multilevel logistic regression model to individual response data to estimate changes in state-level public opinion over time. Such a long span of polls has not been analyzed this way before, partly

4 0.97890383 1803 andrew gelman stats-2013-04-14-Why girls do better in school

Introduction: Wayne Folta writes, “In light of your recent blog post on women in higher education, here’s one I just read about on a techie website regarding elementary education”: Why do girls get better grades in elementary school than boys—even when they perform worse on standardized tests? New research . . . suggests that it’s because of their classroom behavior, which may lead teachers to assign girls higher grades than their male counterparts. . . . The study, co-authored by [Christopher] Cornwell and David Mustard at UGA and Jessica Van Parys at Columbia, analyzed data on more than 5,800 students from kindergarten through fifth grade. It examined students’ performance on standardized tests in three categories—reading, math and science-linking test scores to teachers’ assessments of their students’ progress, both academically and more broadly. The data show, for the first time, that gender disparities in teacher grades start early and uniformly favor girls. In every subject area, bo

5 0.9757008 1245 andrew gelman stats-2012-04-03-Redundancy and efficiency: In praise of Penn Station

Introduction: In reaction to this news article by Michael Kimmelman, I’d like to repost this from four years ago: Walking through Penn Station in New York, I remembered how much I love its open structure. By “open,” I don’t mean bright and airy. I mean “open” in a topological sense. The station has three below-ground levels–the uppermost has ticket counters (and, what is more relevant nowadays, ticket machines), some crappy stores and restaurants, and a crappy waiting area. The middle level has Long Island Rail Road ticket counters, some more crappy stores and restaurants, and entrances to the 7th and 8th Avenue subway lines. The lower level has train tracks and platforms. There are stairs, escalators, and elevators going everywhere. As a result, it’s easy to get around, there are lots of shortcuts, and the train loads fast–some people come down the escalators and elevators from the top level, others take the stairs from the middle level. The powers-that-be keep threatening to spend a coupl

6 0.97395182 1671 andrew gelman stats-2013-01-13-Preregistration of Studies and Mock Reports

7 0.97324866 2212 andrew gelman stats-2014-02-15-Mary, Mary, why ya buggin

8 0.97217464 1445 andrew gelman stats-2012-08-06-Slow progress

9 0.97128338 1277 andrew gelman stats-2012-04-23-Infographic of the year

10 0.97105783 2130 andrew gelman stats-2013-12-11-Multilevel marketing as a way of liquidating participants’ social networks

11 0.9700408 1153 andrew gelman stats-2012-02-04-More on the economic benefits of universities

12 0.96995437 961 andrew gelman stats-2011-10-16-The “Washington read” and the algebra of conditional distributions

13 0.96896631 2263 andrew gelman stats-2014-03-24-Empirical implications of Empirical Implications of Theoretical Models

14 0.96886808 2236 andrew gelman stats-2014-03-07-Selection bias in the reporting of shaky research

15 0.96752995 1482 andrew gelman stats-2012-09-04-Model checking and model understanding in machine learning

16 0.96724671 288 andrew gelman stats-2010-09-21-Discussion of the paper by Girolami and Calderhead on Bayesian computation

17 0.96715569 2350 andrew gelman stats-2014-05-27-A whole fleet of gremlins: Looking more carefully at Richard Tol’s twice-corrected paper, “The Economic Effects of Climate Change”

18 0.96635449 1198 andrew gelman stats-2012-03-05-A cloud with a silver lining

19 0.96634257 2270 andrew gelman stats-2014-03-28-Creating a Lenin-style democracy

20 0.96631169 2120 andrew gelman stats-2013-12-02-Does a professor’s intervention in online discussions have the effect of prolonging discussion or cutting it off?