andrew_gelman_stats andrew_gelman_stats-2010 andrew_gelman_stats-2010-245 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Frank Hansen writes: I [Hansen] signed up for my first marathon race. Everyone asks me my predicted time. The predictors online seem geared to or are based off of elite runners. And anyway they seem a bit limited. So I decided to do some analysis of my own. I was going to put together a web page where people could get their race time predictions, maybe sell some ads for sports gps watches, but it might also be publishable. I have 2 requests which obviously I don’t want you to spend more than a few seconds on. 1. I was wondering if you knew of any sports performance researchers working on performance of not just elite athletes, but the full range of runners. 2. Can you suggest a way to do multilevel modeling of this. There are several natural subsets for the data but it’s not obvious what makes sense. I describe the data below. 3. Phil (the runner/co-blogger who posted about weight loss) might be interested. I collected race results for the Chicago marathon and 3
sentIndex sentText sentNum sentScore
1 Frank Hansen writes: I [Hansen] signed up for my first marathon race. [sent-1, score-0.715]
2 The predictors online seem geared to or are based off of elite runners. [sent-3, score-0.135]
3 I was going to put together a web page where people could get their race time predictions, maybe sell some ads for sports gps watches, but it might also be publishable. [sent-6, score-0.466]
4 I have 2 requests which obviously I don’t want you to spend more than a few seconds on. [sent-7, score-0.13]
5 I was wondering if you knew of any sports performance researchers working on performance of not just elite athletes, but the full range of runners. [sent-9, score-0.259]
6 There are several natural subsets for the data but it’s not obvious what makes sense. [sent-12, score-0.107]
7 I collected race results for the Chicago marathon and 3 shorter races: Chicago Half Marathon, Soldier Field 10 Miler, Ravenswood 5k. [sent-16, score-1.284]
8 Within each year I matched results for finishers between each shorter race and that year’s marathon based on full name and age. [sent-18, score-1.294]
9 I used python to scrape web pages for the results. [sent-19, score-0.161]
10 Of course in a particular year a given marathoner may have run more than one of the shorter races. [sent-20, score-0.298]
11 At this point I am ignoring that, treating them as independent records even though they have the same marathon finish data. [sent-21, score-0.666]
12 I would think that knowing several shorter races to predict a marathon time would help, but demanding several matches really cuts down the data. [sent-22, score-1.15]
13 I also collected weather data, so I know the temperature, humidity, wind speed near 8 am for each race (in Chicago). [sent-23, score-0.662]
14 A record contains a marathon time, a short race time, the type of short race, the temperature, humidity and wind speed difference between the short race and the marathon. [sent-25, score-1.968]
15 I also know the age and sex of the marathon finisher. [sent-26, score-0.812]
16 2e-16 In the regression results the marathon and short race “pace” variable is in seconds per mile, so the short. [sent-95, score-1.221]
17 typehalf equal to 82 means roughly add 82 seconds to your half marathon mile pace to get the marathon mile pace, and so on for the independent variables. [sent-97, score-2.059]
18 Marathon day for 2009 was really cold, predicting pace for 2009 based on a fit of the other years has larger errors than predicting 2008 using a fit for the non-2008 data. [sent-99, score-0.324]
19 My main piece of advice is to never ever ever ever ever use “summary” to display regression outputs in R. [sent-100, score-0.455]
20 Unless, that is, you care that your standard error is “4. [sent-102, score-0.065]
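The pace interpretation in sentence 17 can be turned into a small worked example. The sketch below is only an illustration, not Hansen's fitted model: it takes the quoted half-marathon offset of 82 seconds per mile at face value and ignores age, sex, and the weather differences that the regression also includes. The function names and the 26.2188-mile marathon distance constant are assumptions introduced here.

```python
# Illustrative sketch only: applies the "add 82 seconds per mile" reading of the
# typehalf coefficient quoted above, ignoring every other covariate in the model.

MARATHON_MILES = 26.2188  # 42.195 km expressed in miles (assumed constant)


def predict_marathon_seconds(half_pace_sec_per_mile, offset_sec_per_mile=82.0):
    """Predicted marathon finish time in seconds from a half-marathon mile pace."""
    marathon_pace = half_pace_sec_per_mile + offset_sec_per_mile
    return marathon_pace * MARATHON_MILES


def format_hms(total_seconds):
    """Format a duration in seconds as H:MM:SS, rounded to the nearest second."""
    total = int(round(total_seconds))
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


if __name__ == "__main__":
    # A runner averaging 8:00 per mile (480 seconds) in the half marathon.
    print(format_hms(predict_marathon_seconds(480)))
```

Under these assumptions, a change in any other predictor (for example, a hotter marathon day) would shift the result, so the output is a rough baseline rather than a forecast.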
wordName wordTfidf (topN-words)
[('marathon', 0.666), ('race', 0.279), ('pace', 0.194), ('humidity', 0.185), ('shorter', 0.179), ('mile', 0.174), ('wind', 0.161), ('seconds', 0.13), ('temperature', 0.113), ('speed', 0.113), ('chicago', 0.111), ('collected', 0.109), ('short', 0.095), ('hansen', 0.093), ('races', 0.089), ('ever', 0.086), ('age', 0.083), ('elite', 0.077), ('sports', 0.07), ('error', 0.065), ('predicting', 0.065), ('sex', 0.063), ('display', 0.062), ('fahrenheit', 0.062), ('finishers', 0.062), ('marathoner', 0.062), ('web', 0.059), ('df', 0.058), ('geared', 0.058), ('gps', 0.058), ('logs', 0.058), ('soldier', 0.058), ('several', 0.058), ('year', 0.057), ('performance', 0.056), ('half', 0.055), ('coefplot', 0.054), ('scrape', 0.054), ('codes', 0.052), ('demanding', 0.052), ('results', 0.051), ('lm', 0.051), ('min', 0.051), ('signed', 0.049), ('outputs', 0.049), ('subsets', 0.049), ('matches', 0.048), ('python', 0.048), ('athletes', 0.048), ('residuals', 0.048)]
simIndex simValue blogId blogTitle
same-blog 1 0.99999988 245 andrew gelman stats-2010-08-31-Predicting marathon times
2 0.30974725 1351 andrew gelman stats-2012-05-29-A Ph.D. thesis is not really a marathon
Introduction: Thomas Basbøll writes: A blog called The Thesis Whisperer was recently pointed out to me. I [Basbøll] haven’t looked at it closely, but I’ll be reading it regularly for a while before I recommend it. I’m sure it’s a good place to go to discover that you’re not alone, especially when you’re struggling with your dissertation. One post caught my eye immediately. It suggested that writing a thesis is not a sprint, it’s a marathon. As a metaphorical adjustment to a particular attitude about writing, it’s probably going to help some people. But if we think it through, it’s not really a very good analogy. No one is really a “sprinter”; and writing a dissertation is nothing like running a marathon. . . . Here’s Ben’s explication of the analogy at the Thesis Whisperer, which seems initially plausible. …writing a dissertation is a lot like running a marathon. They are both endurance events, they last a long time and they require a consistent and carefully calculated amount of effor
3 0.15122364 1978 andrew gelman stats-2013-08-12-Fixing the race, ethnicity, and national origin questions on the U.S. Census
Introduction: In his new book, “What is Your Race? The Census and Our Flawed Efforts to Classify Americans,” former Census Bureau director Ken Prewitt recommends taking the race question off the decennial census: He recommends gradual changes, integrating the race and national origin questions while improving both. In particular, he would replace the main “race” question by a “race or origin” question, with the instruction to “Mark one or more” of the following boxes: “White,” “Black, African Am., or Negro,” “Hispanic, Latino, or Spanish origin,” “American Indian or Alaska Native,” “Asian”, “Native Hawaiian or Other Pacific Islander,” and “Some other race or origin.” Then the next question is to write in “specific race, origin, or enrolled or principal tribe.” Prewitt writes: His suggestion is to go with these questions in 2020 and 2030, then in 2040 “drop the race question and use only the national origin question.” He’s also relying on the American Community Survey to gather a lo
4 0.13299723 273 andrew gelman stats-2010-09-13-Update on marathon statistics
Introduction: Frank Hansen updates his story and writes: Here is a link to the new stuff. The update is a little less than half way down the page. 1. used display() instead of summary() 2. include a proxy for [non] newbies — whether I can find their name in a previous Chicago Marathon. 3. graph actual pace vs. fitted pace (color code newbie proxy) 4. estimate the model separately for newbies and non-newbies. some incidental discussion of sd of errors. There are a few things unfinished but I have to get to bed, I’m running the 2010 Chicago Half tomorrow morning, and they moved the start up from 7:30 to 7:00 because it’s the day of the Bears home opener too.
5 0.12412085 1773 andrew gelman stats-2013-03-21-2.15
Introduction: Jake Hofman writes that he saw my recent newspaper article on running (“How fast do we slow down? . . . For each doubling of distance, the world record time is multiplied by about 2.15. . . . for sprints of 200 meters to 1,000 meters, a doubling of distance corresponds to an increase of a factor of 2.3 in world record running times; for longer distances from 1,000 meters to the marathon, a doubling of distance increases the time by a factor of 2.1. . . . similar patterns for men and women, and for swimming as well as running”) and writes: If you’re ever interested in getting or playing with Olympics data, I [Jake] wrote some code to scrape it all from sportsreference.com this past summer for a blog post. Enjoy!
6 0.10172753 1501 andrew gelman stats-2012-09-18-More studies on the economic effects of climate change
7 0.095080756 1831 andrew gelman stats-2013-04-29-The Great Race
8 0.094029568 2364 andrew gelman stats-2014-06-08-Regression and causality and variable ordering
9 0.090272419 374 andrew gelman stats-2010-10-27-No matter how famous you are, billions of people have never heard of you.
10 0.076167598 180 andrew gelman stats-2010-08-03-Climate Change News
11 0.074763909 2224 andrew gelman stats-2014-02-25-Basketball Stats: Don’t model the probability of win, model the expected score differential.
12 0.073349454 1881 andrew gelman stats-2013-06-03-Boot
13 0.07320638 1277 andrew gelman stats-2012-04-23-Infographic of the year
14 0.07052125 962 andrew gelman stats-2011-10-17-Death!
15 0.070081711 1086 andrew gelman stats-2011-12-27-The most dangerous jobs in America
16 0.066409729 250 andrew gelman stats-2010-09-02-Blending results from two relatively independent multi-level models
17 0.06566032 1403 andrew gelman stats-2012-07-02-Moving beyond hopeless graphics
18 0.065215498 2341 andrew gelman stats-2014-05-20-plus ça change, plus c’est la même chose
19 0.06390731 1173 andrew gelman stats-2012-02-17-Sports examples in class
topicId topicWeight
[(0, 0.115), (1, 0.003), (2, 0.034), (3, -0.002), (4, 0.063), (5, 0.009), (6, 0.016), (7, -0.012), (8, 0.032), (9, -0.008), (10, 0.001), (11, -0.004), (12, 0.013), (13, -0.004), (14, -0.02), (15, 0.034), (16, 0.028), (17, 0.008), (18, 0.03), (19, -0.009), (20, -0.013), (21, 0.058), (22, -0.016), (23, 0.008), (24, 0.006), (25, 0.006), (26, -0.001), (27, -0.03), (28, 0.012), (29, -0.008), (30, 0.013), (31, 0.007), (32, -0.024), (33, -0.005), (34, 0.034), (35, -0.023), (36, -0.001), (37, -0.011), (38, 0.01), (39, -0.01), (40, -0.024), (41, -0.023), (42, -0.047), (43, -0.024), (44, -0.027), (45, -0.013), (46, -0.008), (47, -0.002), (48, -0.022), (49, 0.047)]
simIndex simValue blogId blogTitle
same-blog 1 0.93121672 245 andrew gelman stats-2010-08-31-Predicting marathon times
2 0.77243888 1164 andrew gelman stats-2012-02-13-Help with this problem, win valuable prizes
Introduction: Corrected equation This post is by Phil. In the comments to an earlier post , I mentioned a problem I am struggling with right now. Several people mentioned having (and solving!) similar problems in the past, so this seems like a great way for me and a bunch of other blog readers to learn something. I will describe the problem, one or more of you will tell me how to solve it, and you will win…wait for it….my thanks, and the approval and admiration of your fellow blog readers, and a big thank-you in any publication that includes results from fitting the model. You can’t ask fairer than that! Here’s the problem. The goal is to estimate six parameters that characterize the leakiness (or air-tightness) of a house with an attached garage. We are specifically interested in the parameters that describe the connection between the house and the garage; this is of interest because of the effect on the air quality in the house if there are toxic chemic
Introduction: Alexander at GiveWell writes : The Disease Control Priorities in Developing Countries (DCP2), a major report funded by the Gates Foundation . . . provides an estimate of $3.41 per disability-adjusted life-year (DALY) for the cost-effectiveness of soil-transmitted-helminth (STH) treatment, implying that STH treatment is one of the most cost-effective interventions for global health. In investigating this figure, we have corresponded, over a period of months, with six scholars who had been directly or indirectly involved in the production of the estimate. Eventually, we were able to obtain the spreadsheet that was used to generate the $3.41/DALY estimate. That spreadsheet contains five separate errors that, when corrected, shift the estimated cost effectiveness of deworming from $3.41 to $326.43. [I think they mean to say $300 -- ed.] We came to this conclusion a year after learning that the DCP2’s published cost-effectiveness estimate for schistosomiasis treatment – another kind of
4 0.70474231 2364 andrew gelman stats-2014-06-08-Regression and causality and variable ordering
Introduction: Bill Harris wrote in with a question: David Hogg points out in one of his general articles on data modeling that regression assumptions require one to put the variable with the highest variance in the ‘y’ position and the variable you know best (lowest variance) in the ‘x’ position. As he points out, others speak of independent and dependent variables, as if causality determined the form of a regression formula. In a quick scan of ARM and BDA, I don’t see clear advice, but I do see the use of ‘independent’ and ‘dependent.’ I recently did a model over data in which we know the ‘effect’ pretty well (we measure it), while we know the ’cause’ less well (it’s estimated by people who only need to get it approximately correct). A model of the form ’cause ~ effect’ fit visually much better than one of the form ‘effect ~ cause’, but interpreting it seems challenging. For a simplistic example, let the effect be energy use in a building for cooling (E), and let the cause be outdoor ai
Introduction: I know next to nothing about golf. My mini-golf scores typically approach the maximum of 7 per hole, and I’ve never actually played macro-golf. I did publish a paper on golf once ( A Probability Model for Golf Putting , with Deb Nolan), but it’s not so rare for people to publish papers on topics they know nothing about. Those who can’t, research. But I certainly have the ability to post other people’s ideas. Charles Murray writes: I [Murray] am playing around with the likelihood of Tiger Woods breaking Nicklaus’s record in the Majors. I’ve already gone on record two years ago with the reason why he won’t, but now I’m looking at it from a non-psychological perspective. Given the history of the majors, what how far above the average _for other great golfers_ does Tiger have to perform? Here’s the procedure I’ve been working on: 1. For all golfers who have won at at least one major since 1934 (the year the Masters began), create 120 lines: one for each Major for each year f
6 0.70089179 770 andrew gelman stats-2011-06-15-Still more Mr. P in public health
8 0.6903553 906 andrew gelman stats-2011-09-14-Another day, another stats postdoc
9 0.66765577 513 andrew gelman stats-2011-01-12-“Tied for Warmest Year On Record”
10 0.66326964 1462 andrew gelman stats-2012-08-18-Standardizing regression inputs
11 0.66079634 549 andrew gelman stats-2011-02-01-“Roughly 90% of the increase in . . .” Hey, wait a minute!
12 0.6596759 1201 andrew gelman stats-2012-03-07-Inference = data + model
14 0.65788502 1086 andrew gelman stats-2011-12-27-The most dangerous jobs in America
17 0.64486599 1017 andrew gelman stats-2011-11-18-Lack of complete overlap
18 0.64137608 708 andrew gelman stats-2011-05-12-Improvement of 5 MPG: how many more auto deaths?
19 0.64119875 273 andrew gelman stats-2010-09-13-Update on marathon statistics
20 0.64102983 1501 andrew gelman stats-2012-09-18-More studies on the economic effects of climate change
topicId topicWeight
[(11, 0.026), (14, 0.206), (15, 0.031), (16, 0.065), (20, 0.011), (21, 0.016), (24, 0.116), (27, 0.015), (29, 0.012), (30, 0.019), (53, 0.01), (56, 0.013), (86, 0.032), (87, 0.011), (90, 0.013), (95, 0.013), (99, 0.23)]
simIndex simValue blogId blogTitle
1 0.95751107 1724 andrew gelman stats-2013-02-16-Zero Dark Thirty and Bayes’ theorem
Introduction: A moviegoing colleague writes: I just watched the movie Zero Dark Thirty about the hunt for Osama Bin Laden. What struck me about it was: (1) Bayes theorem underlies the whole movie; (2) CIA top brass do not know Bayes theorem (at least as portrayed in the movie). Obviously one does not need to know physics to play billiards, but it helps with the reasoning. Essentially, at some point the key CIA agent locates what she strongly believes is OBL’s hiding place in Pakistan. Then it takes the White House some 150 days to make the decision to attack the compound. Why so long? And why, even on the eve of the operation, were senior brass only some 60% OBL was there? Fear of false positives is the answer. After all, the compound could belong to a drug lord, or some other terrorist. Here is the math: There are two possibilities, according to movie: OBL is in a compound (C) in a city or he is in the mountains in tribal regions. Say P(OBL in C) = 0.5. A diagnosis is made on
2 0.92903018 130 andrew gelman stats-2010-07-07-A False Consensus about Public Opinion on Torture
Introduction: John Sides reports on this finding by Paul Gronke, Darius Rejali, Dustin Drenguis, James Hicks, Peter Miller, and Bryan Nakayama, from a survey in 2008: Gronke et al. write (as excerpted by Sides): Many journalists and politicians believe that during the Bush administration, a majority of Americans supported torture if they were assured that it would prevent a terrorist attack….But this view was a misperception…we show here that a majority of Americans were opposed to torture throughout the Bush presidency…even when respondents were asked about an imminent terrorist attack, even when enhanced interrogation techniques were not called torture, and even when Americans were assured that torture would work to get crucial information. Opposition to torture remained stable and consistent during the entire Bush presidency. Gronke et al. attribute confusion of beliefs to the so-called false consensus effect studied by cognitive psychologists, in which people tend to assume th
3 0.92460942 1770 andrew gelman stats-2013-03-19-Retraction watch
Introduction: Here (from the Annals of Applied Statistics). “Thus, arguably, all of Section 3 is wrong until proven otherwise.” As with retractions in general, it makes me wonder about the rest of this guy’s work. Dr. Anil Potti would be pooping in his pants spinning in his retirement.
4 0.92409074 1809 andrew gelman stats-2013-04-17-NUTS discussed on Xi’an’s Og
Introduction: Xi’an’s Og (aka Christian Robert’s blog) is featuring a very nice presentation of NUTS by Marco Banterle, with discussion and some suggestions. I’m not even sure how they found Michael Betancourt’s paper on geometric NUTS — I don’t see it on the arXiv yet, or I’d provide a link.
5 0.91346359 1051 andrew gelman stats-2011-12-10-Towards a Theory of Trust in Networks of Humans and Computers
Introduction: Hey, this looks cool: Towards a Theory of Trust in Networks of Humans and Computers Virgil Gligor Carnegie Mellon University We argue that a general theory of trust in networks of humans and computers must be built on both a theory of behavioral trust and a theory of computational trust. This argument is motivated by increased participation of people in social networking, crowdsourcing, human computation, and socio-economic protocols, e.g., protocols modeled by trust and gift-exchange games, norms-establishing contracts, and scams/deception. User participation in these protocols relies primarily on trust, since on-line verification of protocol compliance is often impractical; e.g., verification can lead to undecidable problems, co-NP complete test procedures, and user inconvenience. Trust is captured by participant preferences (i.e., risk and betrayal aversion) and beliefs in the trustworthiness of other protocol participants. Both preferences and beliefs can be enhanced
6 0.91237891 824 andrew gelman stats-2011-07-26-Milo and Milo
7 0.90364236 755 andrew gelman stats-2011-06-09-Recently in the award-winning sister blog
9 0.8954699 1696 andrew gelman stats-2013-01-29-The latest in economics exceptionalism
same-blog 10 0.88593042 245 andrew gelman stats-2010-08-31-Predicting marathon times
11 0.88335842 1471 andrew gelman stats-2012-08-27-Why do we never see a full decision analysis for a clinical trial?
12 0.83125591 1303 andrew gelman stats-2012-05-06-I’m skeptical about this skeptical article about left-handedness
13 0.82898498 1236 andrew gelman stats-2012-03-29-Resolution of Diederik Stapel case
14 0.82523978 2344 andrew gelman stats-2014-05-23-The gremlins did it? Iffy statistics drive strong policy recommendations
15 0.82314324 2237 andrew gelman stats-2014-03-08-Disagreeing to disagree
16 0.81128466 252 andrew gelman stats-2010-09-02-R needs a good function to make line plots
17 0.80753046 2114 andrew gelman stats-2013-11-26-“Please make fun of this claim”
18 0.80483377 2117 andrew gelman stats-2013-11-29-The gradual transition to replicable science
19 0.80066389 2336 andrew gelman stats-2014-05-16-How much can we learn about individual-level causal claims from state-level correlations?
20 0.79478788 1634 andrew gelman stats-2012-12-21-Two reviews of Nate Silver’s new book, from Kaiser Fung and Cathy O’Neil