andrew_gelman_stats-2011-1017: Lack of complete overlap (andrew gelman stats, 2011-11-18)
Source: html
Introduction: Evens Salies writes:

I have a question regarding a randomization constraint in my currently funded electricity experiment. After eliminating missing data we have 110 volunteer households from a larger population (resource constraints do not allow us to have more households!). I randomly assign them to treated and non-treated groups, where the treatment is an ICT device that allows treated households to track their electricity consumption in real time. The ICT consists of two devices, one plugged into the household's modem and the other into the electric meter. A necessary condition for being treated is that the distance d between the box and the meter be below some threshold, approximately 20 meters. 50 ICTs can be installed, so 60 households will be in the control group. But I can assign only 6 households to the control group for whom d is less than 20 meters. Therefore, I have only 6 households in the control group who have a counterfactual in the treated group. To put it differently, for 54 households in the control group the overlap assumption is violated, because these 54 households could never have been treated. Could you point me to a paper on program evaluation/causal inference that addresses this issue? Should I discard the 54 households who could not be treated (due to the distance constraint)?

My response: I don't know of any references on this (beyond chapters 9 and 10 of my book with Jennifer). My quick answer is that you should model your outcome conditional on the treatment indicator and also this distance variable. If distance doesn't matter at all, maybe you're ok, and if it does matter, maybe a reasonable model will correct for the non-overlap. Or maybe you'll be able to keep most of the data and just discard some cases with extreme values of the predictor.
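To make the suggestion concrete, here is a minimal sketch in Python with statsmodels, on simulated data. The outcome y, the treatment indicator, and the simulated distances are hypothetical stand-ins (the post does not specify the actual outcome measure); the sketch fits the outcome conditional on treatment and distance, then repeats the fit on the trimmed sample restricted to the region of overlap (d < 20 meters):

```python
# A minimal sketch of the advice above: regress the outcome on the
# treatment indicator plus distance, and compare with an analysis that
# keeps only the region of overlap. Variable names and the simulated
# data are hypothetical, not from the original experiment.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 110

# Simulated modem-to-meter distances (meters); only households with
# distance below 20 m are eligible for treatment.
distance = rng.uniform(2.0, 60.0, n)
eligible = distance < 20.0
treated = (eligible & (rng.random(n) < 0.9)).astype(int)

# Hypothetical outcome, e.g. post-period electricity consumption.
y = 10.0 - 1.5 * treated + 0.05 * distance + rng.normal(0.0, 1.0, n)
df = pd.DataFrame({"y": y, "treated": treated, "distance": distance})

# Option 1: model the outcome conditional on the treatment indicator
# and the distance variable, letting the model adjust for non-overlap.
fit_all = smf.ols("y ~ treated + distance", data=df).fit()
print(fit_all.params)

# Option 2: trim to the region of overlap, discarding the control
# households that could never have been treated (distance >= 20 m).
fit_trim = smf.ols("y ~ treated + distance", data=df[eligible]).fit()
print(fit_trim.params)
```

If the two estimates of the treatment coefficient diverge, the full-sample model is extrapolating across the non-overlap region, which argues for the trimmed analysis.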