andrew_gelman_stats andrew_gelman_stats-2012 andrew_gelman_stats-2012-1337 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: 12. A researcher fits a regression model predicting some political behavior given predictors for demographics and several measures of economic ideology. The coefficients for the ideology measures are not statistically significant, and the researcher creates a new measure, adding up the ideology questions and creating a common score, and then fits a new regression including the new score and removing the individual ideology questions from the model. Which of the following statements are basically true? (Indicate all that apply.) (a) If the original ideology measures are close to 100% correlated with each other, there will be essentially no benefit from this approach. (b) If the original ideology measures are not on a common scale, they should be rescaled before adding them up. (c) If the original result was not statistically significant, the researcher should stop, so as to avoid data dredging and selection bias. (d) Another reasonable option would be to perform a factor analysi
sentIndex sentText sentNum sentScore
1 A researcher fits a regression model predicting some political behavior given predictors for demographics and several measures of economic ideology. [sent-2, score-0.945]
2 (a) If the original ideology measures are close to 100% correlated with each other, there will be essentially no benefit from this approach. [sent-6, score-1.03]
3 (b) If the original ideology measures are not on a common scale, they should be rescaled before adding them up. [sent-7, score-1.312]
4 (c) If the original result was not statistically significant, the researcher should stop, so as to avoid data dredging and selection bias. [sent-8, score-0.65]
5 (d) Another reasonable option would be to perform a factor analysis on the ideology measures and create a common score in that way. [sent-9, score-1.201]
6 Solution to question 11 From yesterday : 11. [sent-10, score-0.066]
7 Here is the result of fitting a logistic regression to Republican vote in the 1972 NES. [sent-11, score-0.414]
8 Approximately how much more likely is a person in income category 4 to vote Republican, compared to a person in income category 2? [sent-13, score-0.974]
9 Give an approximate estimate, standard error, and 95% interval. [sent-14, score-0.077]
10 Solution: On the logit scale, the estimate is 0. [sent-15, score-0.212]
11 To switch to the probability scale, divide by 4 and round down: the estimate is then 0. [sent-23, score-0.392]
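The divide-by-4 step in sentence 11 can be sketched in a few lines of Python. This is only an illustration: the coefficient and standard error below are hypothetical placeholders, not the values from the 1972 NES fit (those are cut off in the extracted sentences), and the function name `logit_to_prob_scale` is made up for this sketch. It assumes the comparison spans `delta` units of the predictor (two units for income category 4 versus 2) and uses the usual estimate +/- 2 se interval.

```python
def logit_to_prob_scale(coef, se, delta=1.0):
    """Approximate difference in probability for a `delta`-unit change
    in a predictor, using the divide-by-4 rule.

    The logistic curve is steepest at p = 0.5, where its slope is coef/4,
    so coef * delta / 4 is an upper bound on the change in probability.
    """
    est = coef * delta / 4.0          # estimate on the probability scale
    est_se = se * delta / 4.0         # standard error scales the same way
    interval = (est - 2 * est_se, est + 2 * est_se)  # approximate 95% interval
    return est, est_se, interval


# Hypothetical inputs (assumed for illustration, not taken from the NES output)
est, est_se, (lo, hi) = logit_to_prob_scale(coef=0.3, se=0.05, delta=2)
print(f"estimate {est:.3f}, se {est_se:.3f}, 95% interval [{lo:.3f}, {hi:.3f}]")
```

Because the rule gives an upper bound, rounding the result down, as the solution suggests, is the conservative choice.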
wordName wordTfidf (topN-words)
[('ideology', 0.481), ('measures', 0.268), ('score', 0.216), ('se', 0.199), ('income', 0.181), ('researcher', 0.174), ('scale', 0.174), ('common', 0.158), ('interval', 0.154), ('category', 0.151), ('original', 0.147), ('fits', 0.144), ('adding', 0.14), ('sures', 0.135), ('republican', 0.131), ('regression', 0.13), ('estimate', 0.126), ('solution', 0.122), ('dredging', 0.122), ('vote', 0.12), ('rescaled', 0.118), ('statistically', 0.112), ('significant', 0.102), ('removing', 0.098), ('result', 0.095), ('person', 0.095), ('creates', 0.094), ('demographics', 0.094), ('round', 0.093), ('questions', 0.091), ('divide', 0.088), ('logit', 0.086), ('switch', 0.085), ('creating', 0.08), ('approximate', 0.077), ('option', 0.076), ('indicate', 0.075), ('approximately', 0.074), ('new', 0.073), ('predicting', 0.071), ('perform', 0.07), ('logistic', 0.069), ('correlated', 0.069), ('coefficients', 0.069), ('statements', 0.066), ('stop', 0.066), ('yesterday', 0.066), ('create', 0.065), ('benefit', 0.065), ('predictors', 0.064)]
simIndex simValue blogId blogTitle
same-blog 1 1.0 1337 andrew gelman stats-2012-05-22-Question 12 of my final exam for Design and Analysis of Sample Surveys
Introduction: 12. A researcher fits a regression model predicting some political behavior given predictors for demographics and several measures of economic ideology. The coefficients for the ideology measures are not statistically significant, and the researcher creates a new measure, adding up the ideology questions and creating a common score, and then fits a new regression including the new score and removing the individual ideology questions from the model. Which of the following statements are basically true? (Indicate all that apply.) (a) If the original ideology measures are close to 100% correlated with each other, there will be essentially no benefit from this approach. (b) If the original ideology measures are not on a common scale, they should be rescaled before adding them up. (c) If the original result was not statistically significant, the researcher should stop, so as to avoid data dredging and selection bias. (d) Another reasonable option would be to perform a factor analysi
2 0.73654807 1340 andrew gelman stats-2012-05-23-Question 13 of my final exam for Design and Analysis of Sample Surveys
Introduction: 13. A survey of American adults is conducted that includes too many women and not enough men in the sample. In the resulting weighting, each female respondent is given a weight of 1 and each male respondent is given a weight of 1.5. The sample includes 600 women and 380 men, of whom 400 women and 100 men respond Yes to a particular question of interest. Give an estimate and standard error for the proportion of American adults who would answer Yes to this question if asked. Solution to question 12 From yesterday : 12. A researcher fits a regression model predicting some political behavior given predictors for demographics and several measures of economic ideology. The coefficients for the ideology measures are not statistically significant, and the researcher creates a new measure, adding up the ideology questions and creating a common score, and then fits a new regression including the new score and removing the individual ideology questions from the model. Which of the follo
3 0.40919635 1334 andrew gelman stats-2012-05-21-Question 11 of my final exam for Design and Analysis of Sample Surveys
Introduction: 11. Here is the result of fitting a logistic regression to Republican vote in the 1972 NES. Income is on a 1–5 scale. Approximately how much more likely is a person in income category 4 to vote Republican, compared to a person in income category 2? Give an approximate estimate, standard error, and 95% interval. Solution to question 10 From yesterday : 10. Out of a random sample of 100 Americans, zero report having ever held political office. From this information, give a 95% confidence interval for the proportion of Americans who have ever held political office. Solution: Use the Agresti-Coull interval based on (y+2)/(n+4). Estimate is p.hat=2/104=0.02, se is sqrt(p.hat*(1-p.hat)/104)=0.013, 95% interval is [0.02 +/- 2*0.013] = [0,0.05].
4 0.14306532 1333 andrew gelman stats-2012-05-20-Question 10 of my final exam for Design and Analysis of Sample Surveys
Introduction: 10. Out of a random sample of 100 Americans, zero report having ever held political office. From this information, give a 95% confidence interval for the proportion of Americans who have ever held political office. Solution to question 9 From yesterday : 9. Out of a population of 100 medical records, 40 are randomly sampled and then audited. 10 out of the 40 audits reveal fraud. From this information, give an estimate, standard error, and 95% confidence interval for the proportion of audits in the population with fraud. Solution: estimate is p.hat=10/40=0.25. Se is sqrt(1-f)*sqrt(p.hat*(1-p.hat)/n)=sqrt(1-0.4)*sqrt(0.25*0.75/40)=0.053. 95% interval is [0.25 +/- 2*0.053] = [0.14,0.36].
Introduction: Someone who wants to remain anonymous writes: I am working to create a more accurate in-game win probability model for basketball games. My idea is for each timestep in a game (a second, 5 seconds, etc), use the Vegas line, the current score differential, who has the ball, and the number of possessions played already (to account for differences in pace) to create a point estimate probability of the home team winning. This problem would seem to fit a multi-level model structure well. It seems silly to estimate 2,000 regressions (one for each timestep), but the coefficients should vary at each timestep. Do you have suggestions for what type of model this could/would be? Additionally, I believe this needs to be some form of logit/probit given the binary dependent variable (win or loss). Finally, do you have suggestions for what package could accomplish this in Stata or R? To answer the questions in reverse order: 3. I’d hope this could be done in Stan (which can be run from R)
6 0.14053042 1367 andrew gelman stats-2012-06-05-Question 26 of my final exam for Design and Analysis of Sample Surveys
7 0.13305426 888 andrew gelman stats-2011-09-03-A psychology researcher asks: Is Anova dead?
8 0.13131681 1042 andrew gelman stats-2011-12-05-Timing is everything!
9 0.13067912 726 andrew gelman stats-2011-05-22-Handling multiple versions of an outcome variable
10 0.1155167 1349 andrew gelman stats-2012-05-28-Question 18 of my final exam for Design and Analysis of Sample Surveys
11 0.11507669 1072 andrew gelman stats-2011-12-19-“The difference between . . .”: It’s not just p=.05 vs. p=.06
12 0.10902046 201 andrew gelman stats-2010-08-12-Are all rich people now liberals?
14 0.1055982 769 andrew gelman stats-2011-06-15-Mr. P by another name . . . is still great!
15 0.10433118 1368 andrew gelman stats-2012-06-06-Question 27 of my final exam for Design and Analysis of Sample Surveys
16 0.10162514 1227 andrew gelman stats-2012-03-23-Voting patterns of America’s whites, from the masses to the elites
17 0.099947743 1365 andrew gelman stats-2012-06-04-Question 25 of my final exam for Design and Analysis of Sample Surveys
18 0.098888554 775 andrew gelman stats-2011-06-21-Fundamental difficulty of inference for a ratio when the denominator could be positive or negative
19 0.098804206 2201 andrew gelman stats-2014-02-06-Bootstrap averaging: Examples where it works and where it doesn’t work
20 0.098761685 451 andrew gelman stats-2010-12-05-What do practitioners need to know about regression?
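The Agresti-Coull solution to question 10, quoted in entry 3 of the list above, can be checked with a short Python sketch. The function name `agresti_coull` and the clipping of the interval to [0, 1] are choices made for this sketch; the +/- 2 se multiplier follows the quoted solution rather than the exact 1.96.

```python
import math


def agresti_coull(y, n):
    """Approximate 95% interval for a proportion based on (y+2)/(n+4)."""
    n_adj = n + 4                      # add four pseudo-observations
    p_hat = (y + 2) / n_adj            # two of them counted as successes
    se = math.sqrt(p_hat * (1 - p_hat) / n_adj)
    lo = max(0.0, p_hat - 2 * se)      # a proportion cannot be below 0
    hi = min(1.0, p_hat + 2 * se)      # or above 1
    return p_hat, se, (lo, hi)


# Question 10: zero of 100 sampled Americans report having held office
p_hat, se, (lo, hi) = agresti_coull(y=0, n=100)
print(round(p_hat, 2), round(se, 3), (round(lo, 2), round(hi, 2)))
```

With y=0 and n=100 this reproduces the rounded values in the quoted solution: estimate 0.02, standard error 0.013, and interval [0, 0.05].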
topicId topicWeight
[(0, 0.143), (1, 0.047), (2, 0.223), (3, -0.05), (4, 0.031), (5, 0.028), (6, -0.015), (7, -0.004), (8, -0.001), (9, -0.039), (10, 0.052), (11, 0.008), (12, -0.024), (13, 0.05), (14, -0.02), (15, -0.034), (16, -0.011), (17, -0.02), (18, 0.023), (19, -0.079), (20, 0.099), (21, -0.011), (22, 0.105), (23, -0.08), (24, 0.1), (25, -0.006), (26, 0.082), (27, -0.142), (28, -0.099), (29, -0.14), (30, 0.046), (31, 0.008), (32, 0.047), (33, -0.026), (34, 0.087), (35, -0.004), (36, -0.07), (37, 0.038), (38, -0.032), (39, -0.087), (40, -0.077), (41, -0.044), (42, 0.007), (43, 0.072), (44, 0.058), (45, 0.019), (46, -0.023), (47, -0.0), (48, -0.019), (49, -0.024)]
simIndex simValue blogId blogTitle
same-blog 1 0.9925521 1337 andrew gelman stats-2012-05-22-Question 12 of my final exam for Design and Analysis of Sample Surveys
Introduction: 12. A researcher fits a regression model predicting some political behavior given predictors for demographics and several measures of economic ideology. The coefficients for the ideology measures are not statistically significant, and the researcher creates a new measure, adding up the ideology questions and creating a common score, and then fits a new regression including the new score and removing the individual ideology questions from the model. Which of the following statements are basically true? (Indicate all that apply.) (a) If the original ideology measures are close to 100% correlated with each other, there will be essentially no benefit from this approach. (b) If the original ideology measures are not on a common scale, they should be rescaled before adding them up. (c) If the original result was not statistically significant, the researcher should stop, so as to avoid data dredging and selection bias. (d) Another reasonable option would be to perform a factor analysi
2 0.89397585 1340 andrew gelman stats-2012-05-23-Question 13 of my final exam for Design and Analysis of Sample Surveys
Introduction: 13. A survey of American adults is conducted that includes too many women and not enough men in the sample. In the resulting weighting, each female respondent is given a weight of 1 and each male respondent is given a weight of 1.5. The sample includes 600 women and 380 men, of whom 400 women and 100 men respond Yes to a particular question of interest. Give an estimate and standard error for the proportion of American adults who would answer Yes to this question if asked. Solution to question 12 From yesterday : 12. A researcher fits a regression model predicting some political behavior given predictors for demographics and several measures of economic ideology. The coefficients for the ideology measures are not statistically significant, and the researcher creates a new measure, adding up the ideology questions and creating a common score, and then fits a new regression including the new score and removing the individual ideology questions from the model. Which of the follo
3 0.84777701 1334 andrew gelman stats-2012-05-21-Question 11 of my final exam for Design and Analysis of Sample Surveys
Introduction: 11. Here is the result of fitting a logistic regression to Republican vote in the 1972 NES. Income is on a 1–5 scale. Approximately how much more likely is a person in income category 4 to vote Republican, compared to a person in income category 2? Give an approximate estimate, standard error, and 95% interval. Solution to question 10 From yesterday : 10. Out of a random sample of 100 Americans, zero report having ever held political office. From this information, give a 95% confidence interval for the proportion of Americans who have ever held political office. Solution: Use the Agresti-Coull interval based on (y+2)/(n+4). Estimate is p.hat=2/104=0.02, se is sqrt(p.hat*(1-p.hat)/104)=0.013, 95% interval is [0.02 +/- 2*0.013] = [0,0.05].
4 0.66686648 1333 andrew gelman stats-2012-05-20-Question 10 of my final exam for Design and Analysis of Sample Surveys
Introduction: 10. Out of a random sample of 100 Americans, zero report having ever held political office. From this information, give a 95% confidence interval for the proportion of Americans who have ever held political office. Solution to question 9 From yesterday : 9. Out of a population of 100 medical records, 40 are randomly sampled and then audited. 10 out of the 40 audits reveal fraud. From this information, give an estimate, standard error, and 95% confidence interval for the proportion of audits in the population with fraud. Solution: estimate is p.hat=10/40=0.25. Se is sqrt(1-f)*sqrt(p.hat*(1-p.hat)/n)=sqrt(1-0.4)*sqrt(0.25*0.75/40)=0.053. 95% interval is [0.25 +/- 2*0.053] = [0.14,0.36].
5 0.62444615 1341 andrew gelman stats-2012-05-24-Question 14 of my final exam for Design and Analysis of Sample Surveys
Introduction: 14. A public health survey of elderly Americans includes many questions, including “How many hours per week did you exercise in your most active years as a young adult?” and also several questions about current mobility and health status. Response rates are high for the questions about recent activities and status, but there is a lot of nonresponse for the question on past activity. You are considering imputing the missing values on the question, “How many hours per week did you exercise in your most active years as a young adult?” Which of the following statements are basically correct? (Indicate all that apply.) (a) If done reasonably well, imputation is preferred to available-case and complete-case analysis. (b) If you do impute, you should also present the available-case and complete-case analysis and analyze how the imputed estimates differ. (c) It is OK to include current health status variables as predictors in a model imputing past activities: anything that adds informati
7 0.5700236 1348 andrew gelman stats-2012-05-27-Question 17 of my final exam for Design and Analysis of Sample Surveys
8 0.54929674 1331 andrew gelman stats-2012-05-19-Question 9 of my final exam for Design and Analysis of Sample Surveys
9 0.54611933 918 andrew gelman stats-2011-09-21-Avoiding boundary estimates in linear mixed models
10 0.53159261 1672 andrew gelman stats-2013-01-14-How do you think about the values in a confidence interval?
11 0.53026873 1349 andrew gelman stats-2012-05-28-Question 18 of my final exam for Design and Analysis of Sample Surveys
12 0.52764195 1761 andrew gelman stats-2013-03-13-Lame Statistics Patents
13 0.52724838 2201 andrew gelman stats-2014-02-06-Bootstrap averaging: Examples where it works and where it doesn’t work
14 0.52262861 1326 andrew gelman stats-2012-05-17-Question 7 of my final exam for Design and Analysis of Sample Surveys
15 0.51727396 627 andrew gelman stats-2011-03-24-How few respondents are reasonable to use when calculating the average by county?
16 0.51726413 1361 andrew gelman stats-2012-06-02-Question 23 of my final exam for Design and Analysis of Sample Surveys
18 0.50940591 39 andrew gelman stats-2010-05-18-The 1.6 rule
19 0.50218093 1344 andrew gelman stats-2012-05-25-Question 15 of my final exam for Design and Analysis of Sample Surveys
20 0.49647751 1377 andrew gelman stats-2012-06-13-A question about AIC
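The finite-population correction used in the solution to question 9, quoted in entry 4 of the list above, can be sketched the same way. The function name `srs_proportion` is hypothetical; the formula is the one in the quoted solution, se = sqrt(1-f) * sqrt(p.hat*(1-p.hat)/n) with sampling fraction f = n/N, and the interval again uses +/- 2 se.

```python
import math


def srs_proportion(y, n, N):
    """Estimate, standard error, and approximate 95% interval for a
    proportion from a simple random sample of n units drawn without
    replacement from a finite population of N units."""
    p_hat = y / n
    f = n / N                                   # sampling fraction
    se = math.sqrt(1 - f) * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, se, (p_hat - 2 * se, p_hat + 2 * se)


# Question 9: 10 of 40 audited records show fraud, population of 100 records
p_hat, se, (lo, hi) = srs_proportion(y=10, n=40, N=100)
print(round(p_hat, 2), round(se, 3), (round(lo, 2), round(hi, 2)))
```

With these inputs the sketch gives 0.25, 0.053, and [0.14, 0.36], matching the quoted solution. Without the sqrt(1-f) factor the standard error would be about 0.068, so sampling 40% of the population noticeably tightens the interval.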
topicId topicWeight
[(9, 0.043), (15, 0.017), (16, 0.072), (21, 0.029), (24, 0.128), (41, 0.048), (63, 0.039), (65, 0.025), (69, 0.066), (76, 0.033), (86, 0.042), (88, 0.025), (99, 0.326)]
simIndex simValue blogId blogTitle
same-blog 1 0.98390478 1337 andrew gelman stats-2012-05-22-Question 12 of my final exam for Design and Analysis of Sample Surveys
Introduction: 12. A researcher fits a regression model predicting some political behavior given predictors for demographics and several measures of economic ideology. The coefficients for the ideology measures are not statistically significant, and the researcher creates a new measure, adding up the ideology questions and creating a common score, and then fits a new regression including the new score and removing the individual ideology questions from the model. Which of the following statements are basically true? (Indicate all that apply.) (a) If the original ideology measures are close to 100% correlated with each other, there will be essentially no benefit from this approach. (b) If the original ideology measures are not on a common scale, they should be rescaled before adding them up. (c) If the original result was not statistically significant, the researcher should stop, so as to avoid data dredging and selection bias. (d) Another reasonable option would be to perform a factor analysi
2 0.97321653 923 andrew gelman stats-2011-09-24-What is the normal range of values in a medical test?
Introduction: Geoffrey Sheean writes: I am having trouble thinking Bayesianly about the so-called ‘normal’ or ‘reference’ values that I am supposed to use in some of the tests I perform. These values are obtained from purportedly healthy people. Setting aside concerns about ascertainment bias, non-parametric distributions, and the like, the values are usually obtained by setting the limits at ± 2SD from the mean. In some cases, supposedly because of a non-normal distribution, the third highest and lowest value observed in the healthy group sets the limits, on the assumption that no more than 2 results (out of 20 samples) are allowed to exceed these values: if there are 3 or more, then the test is assumed to be abnormal and the reference range is said to reflect the 90th percentile. The results are binary – normal, abnormal. The relevance to the diseased state is this. People who are known unequivocally to have condition X show Y abnormalities in these tests. Therefore, when people suspected
Introduction: Jonathan Chait writes : Parties and candidates will kill themselves to move the needle a percentage point or two in a presidential race. And again, the fundamentals determine the bigger picture, but within that big picture political tactics and candidate quality still matters around the margins. I agree completely. This is the central message of Steven Rosenstone’s excellent 1983 book, Forecasting Presidential Elections. So, given that Chait and I agree 100%, why was I so upset at his recent column on “The G.O.P.’s Dukakis Problem”? I’ll put the reasons for my displeasure below the fold because my main point is that I’m happy with Chait’s quote above. For completeness I want to explain where I’m coming from but my take-home point is that we’re mostly in agreement. — OK, so what upset me about Chait’s article? 1. The title. I’m pretty sure that Mike Dukakis, David Mamet, Bill Clinton, and the ghost of Lee Atwater will disagree with me on this one, but Duka
4 0.96471387 1769 andrew gelman stats-2013-03-18-Tibshirani announces new research result: A significance test for the lasso
Introduction: Lasso and me For a long time I was wrong about lasso. Lasso (“least absolute shrinkage and selection operator”) is a regularization procedure that shrinks regression coefficients toward zero, and in its basic form is equivalent to maximum penalized likelihood estimation with a penalty function that is proportional to the sum of the absolute values of the regression coefficients. I first heard about lasso from a talk that Trevor Hastie Rob Tibshirani gave at Berkeley in 1994 or 1995. He demonstrated that it shrunk regression coefficients to zero. I wasn’t impressed, first because it seemed like no big deal (if that’s the prior you use, that’s the shrinkage you get) and second because, from a Bayesian perspective, I don’t want to shrink things all the way to zero. In the sorts of social and environmental science problems I’ve worked on, just about nothing is zero. I’d like to control my noisy estimates but there’s nothing special about zero. At the end of the talk I stood
Introduction: Jonathan Chait writes that the most important aspect of a presidential candidate is “political talent”: Republicans have generally understood that an agenda tilted toward the desires of the powerful requires a skilled frontman who can pitch Middle America. Favorite character types include jocks, movie stars, folksy Texans and war heroes. . . . [But the frontrunners for the 2012 Republican nomination] make Michael Dukakis look like John F. Kennedy. They are qualified enough to serve as president, but wildly unqualified to run for president. . . . [Mitch] Daniels’s drawbacks begin — but by no means end — with his lack of height, hair and charisma. . . . [Jeb Bush] suffers from an inherent branding challenge [because of his last name]. . . . [Chris] Christie . . . doesn’t cut a trim figure and who specializes in verbally abusing his constituents. . . . [Haley] Barbour is the comic embodiment of his party’s most negative stereotypes. A Barbour nomination would be the rough equivalent
6 0.96333456 518 andrew gelman stats-2011-01-15-Regression discontinuity designs: looking for the keys under the lamppost?
7 0.96269101 749 andrew gelman stats-2011-06-06-“Sampling: Design and Analysis”: a course for political science graduate students
8 0.96079868 1267 andrew gelman stats-2012-04-17-Hierarchical-multilevel modeling with “big data”
9 0.96076846 1909 andrew gelman stats-2013-06-21-Job openings at conservative political analytics firm!
10 0.95997137 315 andrew gelman stats-2010-10-03-He doesn’t trust the fit . . . r=.999
11 0.95956492 89 andrew gelman stats-2010-06-16-A historical perspective on financial bailouts
12 0.95929968 32 andrew gelman stats-2010-05-14-Causal inference in economics
13 0.95882642 158 andrew gelman stats-2010-07-22-Tenants and landlords
14 0.95801437 1357 andrew gelman stats-2012-06-01-Halloween-Valentine’s update
15 0.95794308 678 andrew gelman stats-2011-04-25-Democrats do better among the most and least educated groups
17 0.95722085 384 andrew gelman stats-2010-10-31-Two stories about the election that I don’t believe
18 0.95675474 288 andrew gelman stats-2010-09-21-Discussion of the paper by Girolami and Calderhead on Bayesian computation
19 0.9566583 1823 andrew gelman stats-2013-04-24-The Tweets-Votes Curve
20 0.95650184 2263 andrew gelman stats-2014-03-24-Empirical implications of Empirical Implications of Theoretical Models