andrew_gelman_stats andrew_gelman_stats-2010 andrew_gelman_stats-2010-99 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Mark Palko writes: I’ve got a stat problem I’d like to run past you. It’s one of those annoying problems that feels like it should be obvious but the solution has evaded me and the colleagues I’ve discussed it with. I’m working on a project where the metric of interest is defined in relation to pairs of data points. It has nothing to do with sports or betting but the following analogy (which I also post on the blog) covers the basic situation: “You want to build a model predicting the spread for games in a new football league. Because the line-up of teams is still in flux, you decide to use only stats from individual teams as inputs (for example, an indicator variable for when the Ambushers play the Ravagers would not be allowed).” Is there a standard approach for modeling this kind of data? My reply: I don’t quite understand your question, but are you familiar with the Bradley-Terry and Thurstone-Mosteller models for paired comparisons? These are old–from the 1920s and 194
sentIndex sentText sentNum sentScore
1 Mark Palko writes: I’ve got a stat problem I’d like to run past you. [sent-1, score-0.294]
2 It’s one of those annoying problems that feels like it should be obvious but the solution has evaded me and the colleagues I’ve discussed it with. [sent-2, score-0.85]
3 I’m working on a project where the metric of interest is defined in relation to pairs of data points. [sent-3, score-0.638]
4 It has nothing to do with sports or betting but the following analogy (which I also post on the blog) covers the basic situation: “You want to build a model predicting the spread for games in a new football league. [sent-4, score-1.29]
5 Because the line-up of teams is still in flux, you decide to use only stats from individual teams as inputs (for example, an indicator variable for when the Ambushers play the Ravagers would not be allowed).” [sent-5, score-1.364]
6 Is there a standard approach for modeling this kind of data? [sent-6, score-0.082]
7 My reply: I don’t quite understand your question, but are you familiar with the Bradley-Terry and Thurstone-Mosteller models for paired comparisons? [sent-7, score-0.409]
8 Interesting work has been done on these models recently by Hal Stern, Mark Glickman, and others, to allow the underlying parameters to vary over time. [sent-9, score-0.517]
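A minimal sketch of the Bradley-Terry model mentioned in the reply, assuming only win/loss records are available (the original question concerns point spreads, which this simple version does not model). Each team i gets a positive strength p_i, and the probability that i beats j is p_i / (p_i + p_j). The function names, the third team "Marauders", and the game records are hypothetical, made up for illustration; the fitting loop is the standard minorization-maximization update for this model, not something taken from the post.

```python
from collections import defaultdict


def fit_bradley_terry(games, n_iter=200):
    """Fit Bradley-Terry strengths from a list of (winner, loser) pairs.

    Assumes every team has at least one win and one loss; otherwise the
    maximum-likelihood estimate sits on the boundary and this loop is unreliable.
    Returns a dict mapping team -> strength, normalized to sum to 1.
    """
    teams = sorted({team for game in games for team in game})
    wins = defaultdict(int)      # total wins per team
    n_games = defaultdict(int)   # games played per unordered pair
    for winner, loser in games:
        wins[winner] += 1
        n_games[frozenset((winner, loser))] += 1

    strength = {team: 1.0 / len(teams) for team in teams}
    for _ in range(n_iter):
        updated = {}
        for i in teams:
            denom = 0.0
            for j in teams:
                if j == i:
                    continue
                n_ij = n_games.get(frozenset((i, j)), 0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = wins[i] / denom
        total = sum(updated.values())
        strength = {team: s / total for team, s in updated.items()}
    return strength


def win_prob(strength, a, b):
    """Model probability that team a beats team b."""
    return strength[a] / (strength[a] + strength[b])


# Hypothetical results for illustration only: (winner, loser)
games = (
    [("Ambushers", "Ravagers")] * 3 + [("Ravagers", "Ambushers")] * 1
    + [("Ravagers", "Marauders")] * 2 + [("Marauders", "Ravagers")] * 1
    + [("Ambushers", "Marauders")] * 2 + [("Marauders", "Ambushers")] * 1
)
strength = fit_bradley_terry(games)
for team, s in sorted(strength.items(), key=lambda kv: -kv[1]):
    print(f"{team}: {s:.3f}")
print("P(Ambushers beat Ravagers) =", round(win_prob(strength, "Ambushers", "Ravagers"), 3))
```

Because the fitted quantities are per-team strengths rather than per-matchup indicators, the model can predict a pairing that has never been played, which is the constraint described in the question. Extending it to point spreads or to team-level covariates would need a different likelihood.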
wordName wordTfidf (topN-words)
[('teams', 0.272), ('evaded', 0.22), ('flux', 0.22), ('paired', 0.191), ('mark', 0.188), ('metric', 0.173), ('stern', 0.173), ('betting', 0.17), ('hal', 0.159), ('covers', 0.155), ('inputs', 0.153), ('indicator', 0.152), ('pairs', 0.145), ('football', 0.141), ('annoying', 0.135), ('games', 0.133), ('stat', 0.132), ('stats', 0.127), ('spread', 0.126), ('palko', 0.125), ('feels', 0.125), ('sports', 0.125), ('relation', 0.122), ('build', 0.122), ('allowed', 0.121), ('vary', 0.118), ('predicting', 0.115), ('analogy', 0.114), ('decide', 0.112), ('models', 0.11), ('familiar', 0.108), ('situation', 0.108), ('play', 0.105), ('defined', 0.104), ('allow', 0.102), ('solution', 0.099), ('underlying', 0.099), ('obvious', 0.097), ('project', 0.094), ('comparisons', 0.093), ('colleagues', 0.092), ('variable', 0.09), ('basic', 0.089), ('parameters', 0.088), ('run', 0.084), ('old', 0.083), ('kind', 0.082), ('discussed', 0.082), ('individual', 0.081), ('past', 0.078)]
simIndex simValue blogId blogTitle
same-blog 1 0.99999982 99 andrew gelman stats-2010-06-19-Paired comparisons
Introduction: Mark Palko writes: I’ve got a stat problem I’d like to run past you. It’s one of those annoying problems that feels like it should be obvious but the solution has evaded me and the colleagues I’ve discussed it with. I’m working on a project where the metric of interest is defined in relation to pairs of data points. It has nothing to do with sports or betting but the following analogy (which I also post on the blog) covers the basic situation: “You want to build a model predicting the spread for games in a new football league. Because the line-up of teams is still in flux, you decide to use only stats from individual teams as inputs (for example, an indicator variable for when the Ambushers play the Ravagers would not be allowed).” Is there a standard approach for modeling this kind of data? My reply: I don’t quite understand your question, but are you familiar with the Bradley-Terry and Thurstone-Mosteller models for paired comparisons? These are old–from the 1920s and 194
2 0.15184808 1318 andrew gelman stats-2012-05-13-Stolen jokes
Introduction: Fun stories here (from Kliph Nesteroff, link from Mark Palko).
3 0.1321339 29 andrew gelman stats-2010-05-12-Probability of successive wins in baseball
Introduction: Dan Goldstein did an informal study asking people the following question: When two baseball teams play each other on two consecutive days, what is the probability that the winner of the first game will be the winner of the second game? You can make your own guess and then continue reading below. Dan writes: We asked two colleagues knowledgeable in baseball and the mathematics of forecasting. The answers came in between 65% and 70%. The true answer [based on Dan's analysis of a database of baseball games]: 51.3%, a little better than a coin toss. I have to say, I’m surprised his colleagues gave such extreme guesses. I was guessing something like 50%, myself, based on the following very crude reasoning: Suppose two unequal teams are playing, and the chance of team A beating team B is 55%. (This seems like a reasonable average of all matchups, which will include some more extreme disparities but also many more equal contests.) Then the chance of the same team
4 0.11530489 1173 andrew gelman stats-2012-02-17-Sports examples in class
Introduction: Karl Broman writes: I [Karl] personally would avoid sports entirely, as I view the subject to be insufficiently serious. . . . Certainly lots of statisticians are interested in sports. . . . And I’m not completely uninterested in sports: I like to watch football, particularly Nebraska, Green Bay, and Baltimore, and to see Notre Dame or any team from Florida or Texas lose. But statistics about sports? Yawn. As a person who loves sports, statistics, and sports statistics, I have a few thoughts: 1. Not everyone likes sports, and even fewer are interested in any particular sport. It’s ok to use sports examples, but don’t delude yourself into thinking that everyone in the class cares about it. 2. Don’t forget foreign students. A lot of them don’t even know the rules of kickball, fer chrissake! 3. Of the students who care about a sport, there will be a minority who really care. We had some serious basketball fans in our class last year. 4. I think the best solution
5 0.1139659 1804 andrew gelman stats-2013-04-15-How effective are football coaches?
Introduction: Dave Berri writes: A recent study published in the Social Science Quarterly suggests that these moves may not lead to the happiness the fans envision (HT: the Sports Economist). E. Scott Adler, Michael J. Berry, and David Doherty looked at coaching changes from 1997 to 2010. What they found should give pause to people who demanded a coaching change (or still hope for one). Here is how these authors summarize their findings: . . . we use matching techniques to compare the performance of football programs that replaced their head coach to those where the coach was retained. The analysis has two major innovations over existing literature. First, we consider how entry conditions moderate the effects of coaching replacements. Second, we examine team performance for several years following the replacement to assess its effects. We find that for particularly poorly performing teams, coach replacements have little effect on team performance as measured against comparable teams that
6 0.1137388 1547 andrew gelman stats-2012-10-25-College football, voting, and the law of large numbers
7 0.1046141 2296 andrew gelman stats-2014-04-19-Index or indicator variables
8 0.10456102 1482 andrew gelman stats-2012-09-04-Model checking and model understanding in machine learning
9 0.10348601 753 andrew gelman stats-2011-06-09-Allowing interaction terms to vary
10 0.10302177 1582 andrew gelman stats-2012-11-18-How to teach methods we don’t like?
11 0.098045096 554 andrew gelman stats-2011-02-04-An addition to the model-makers’ oath
13 0.094373852 1072 andrew gelman stats-2011-12-19-“The difference between . . .”: It’s not just p=.05 vs. p=.06
14 0.094124213 533 andrew gelman stats-2011-01-23-The scalarization of America
15 0.093308397 1150 andrew gelman stats-2012-02-02-The inevitable problems with statistical significance and 95% intervals
17 0.088382684 1767 andrew gelman stats-2013-03-17-The disappearing or non-disappearing middle class
18 0.087505147 1919 andrew gelman stats-2013-06-29-R sucks
19 0.086059034 1346 andrew gelman stats-2012-05-27-Average predictive comparisons when changing a pair of variables
20 0.085157029 260 andrew gelman stats-2010-09-07-QB2
topicId topicWeight
[(0, 0.175), (1, 0.046), (2, 0.007), (3, 0.007), (4, 0.075), (5, 0.038), (6, 0.02), (7, -0.009), (8, 0.056), (9, 0.035), (10, 0.02), (11, 0.027), (12, -0.004), (13, -0.033), (14, -0.035), (15, 0.01), (16, 0.015), (17, -0.016), (18, -0.0), (19, -0.014), (20, -0.004), (21, 0.011), (22, -0.034), (23, 0.031), (24, -0.022), (25, -0.032), (26, 0.025), (27, 0.039), (28, -0.014), (29, -0.1), (30, 0.013), (31, -0.041), (32, 0.025), (33, -0.013), (34, 0.036), (35, 0.001), (36, 0.024), (37, 0.06), (38, 0.007), (39, 0.103), (40, -0.012), (41, 0.037), (42, 0.004), (43, -0.031), (44, -0.004), (45, 0.041), (46, -0.038), (47, 0.06), (48, -0.055), (49, -0.055)]
simIndex simValue blogId blogTitle
same-blog 1 0.95470428 99 andrew gelman stats-2010-06-19-Paired comparisons
Introduction: Mark Palko writes: I’ve got a stat problem I’d like to run past you. It’s one of those annoying problems that feels like it should be obvious but the solution has evaded me and the colleagues I’ve discussed it with. I’m working on a project where the metric of interest is defined in relation to pairs of data points. It has nothing to do with sports or betting but the following analogy (which I also post on the blog) covers the basic situation: “You want to build a model predicting the spread for games in a new football league. Because the line-up of teams is still in flux, you decide to use only stats from individual teams as inputs (for example, an indicator variable for when the Ambushers play the Ravagers would not be allowed).” Is there a standard approach for modeling this kind of data? My reply: I don’t quite understand your question, but are you familiar with the Bradley-Terry and Thurstone-Mosteller models for paired comparisons? These are old–from the 1920s and 194
Introduction: I know next to nothing about golf. My mini-golf scores typically approach the maximum of 7 per hole, and I’ve never actually played macro-golf. I did publish a paper on golf once (A Probability Model for Golf Putting, with Deb Nolan), but it’s not so rare for people to publish papers on topics they know nothing about. Those who can’t, research. But I certainly have the ability to post other people’s ideas. Charles Murray writes: I [Murray] am playing around with the likelihood of Tiger Woods breaking Nicklaus’s record in the Majors. I’ve already gone on record two years ago with the reason why he won’t, but now I’m looking at it from a non-psychological perspective. Given the history of the majors, how far above the average _for other great golfers_ does Tiger have to perform? Here’s the procedure I’ve been working on: 1. For all golfers who have won at least one major since 1934 (the year the Masters began), create 120 lines: one for each Major for each year f
Introduction: From a commenter on the web, 21 May 2010: Tampa Bay: Playing .732 ball in the toughest division in baseball, wiped their feet on NY twice. If they sweep Houston, which seems pretty likely, they will be at .750, which I [the commenter] have never heard of. At the time of that posting, the Rays were 30-11. Quick calculation: if a team is good enough to be expected to win 100 games, that is, Pr(win) = 100/162 = .617, then there’s a 5% chance that they’ll have won at least 30 of their first 41 games. That’s a calculation based on simple probability theory of independent events, which isn’t quite right here but will get you close and is a good way to train one’s intuition, I think. Having a .732 record after 41 games is not unheard-of. The Detroit Tigers won 35 of their first 40 games in 1984: that’s .875. (I happen to remember that fast start, having been an Orioles fan at the time.) Now on to the key ideas The passage quoted above illustrates three statistical fa
4 0.66578513 948 andrew gelman stats-2011-10-10-Combining data from many sources
Introduction: Mark Grote writes: I’d like to request general feedback and references for a problem of combining disparate data sources in a regression model. We’d like to model log crop yield as a function of environmental predictors, but the observations come from many data sources and are peculiarly structured. Among the issues are: 1. Measurement precision in predictors and outcome varies widely with data sources. Some observations are in very coarse units of measurement, due to rounding or even observer guesswork. 2. There are obvious clusters of observations arising from studies in which crop yields were monitored over successive years in spatially proximate communities. Thus some variables may be constant within clusters–this is true even for log yield, probably due to rounding of similar yields. 3. Cluster size and intra-cluster association structure (temporal, spatial or both) vary widely across the dataset. My [Grote's] intuition is that we can learn about central tendency
5 0.65624267 704 andrew gelman stats-2011-05-10-Multiple imputation and multilevel analysis
Introduction: Robert Birkelbach: I am writing my Bachelor Thesis in which I want to assess the reading competencies of German elementary school children using the PIRLS2006 data. My levels are classrooms and the individuals. However, my dependent variable is a multiple imputed (m=5) reading test. The problem I have is, that I do not know, whether I can just calculate 5 linear multilevel models and then average all the results (the coefficients, standard deviation, bic, intra class correlation, R2, t-statistics, p-values etc) or if I need different formulas for integrating the results of the five models into one because it is a multilevel analysis? Do you think there’s a better way in solving my problem? I would greatly appreciate if you could help me with a problem regarding my analysis — I am quite a newbie to multilevel modeling and especially to multiple imputation. Also: Is it okay to use frequentist models when the multiple imputation was done bayesian? Would the different philosophies of sc
6 0.65000415 417 andrew gelman stats-2010-11-17-Clutering and variance components
7 0.64465261 772 andrew gelman stats-2011-06-17-Graphical tools for understanding multilevel models
8 0.6408267 802 andrew gelman stats-2011-07-13-Super Sam Fuld Needs Your Help (with Foul Ball stats)
9 0.64072442 295 andrew gelman stats-2010-09-25-Clusters with very small numbers of observations
10 0.63550222 29 andrew gelman stats-2010-05-12-Probability of successive wins in baseball
12 0.63185525 2130 andrew gelman stats-2013-12-11-Multilevel marketing as a way of liquidating participants’ social networks
13 0.62910104 2262 andrew gelman stats-2014-03-23-Win probabilities during a sporting event
14 0.62852108 1804 andrew gelman stats-2013-04-15-How effective are football coaches?
15 0.62746292 154 andrew gelman stats-2010-07-18-Predictive checks for hierarchical models
17 0.62197489 864 andrew gelman stats-2011-08-21-Going viral — not!
18 0.61700016 559 andrew gelman stats-2011-02-06-Bidding for the kickoff
19 0.61592102 1248 andrew gelman stats-2012-04-06-17 groups, 6 group-level predictors: What to do?
20 0.61479062 243 andrew gelman stats-2010-08-30-Computer models of the oil spill
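The quick calculation in the Tampa Bay excerpt in the list above (a team with Pr(win) = 100/162 winning at least 30 of its first 41 games) can be reproduced as a binomial tail probability. This is a sketch under the same independence assumption the excerpt states; the function name is hypothetical, and the exact figure it prints depends on that assumption and on rounding, so it need not match the rough number quoted in the excerpt.

```python
from math import comb


def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), summing the exact probabilities."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k, n + 1))


p_win = 100 / 162                   # per-game win probability for a 100-win team
prob = binom_tail(41, 30, p_win)    # at least 30 wins in the first 41 games
print(f"P(at least 30 wins in 41 games) = {prob:.3f}")
```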
topicId topicWeight
[(9, 0.055), (16, 0.071), (24, 0.099), (85, 0.015), (86, 0.078), (89, 0.012), (96, 0.21), (99, 0.361)]
simIndex simValue blogId blogTitle
1 0.97819293 1731 andrew gelman stats-2013-02-21-If a lottery is encouraging addictive gambling, don’t expand it!
Introduction: This story from Vivian Yee seems just horrible to me. First the background: Pronto Lotto’s real business takes place in the carpeted, hushed area where its most devoted customers watch video screens from a scattering of tall silver tables, hour after hour, day after day. The players — mostly men, about a dozen at any given time — come on their lunch breaks or after work to study the screens, which are programmed with the Quick Draw lottery game, and flash a new set of winning numbers every four minutes. They have helped make Pronto Lotto the top Quick Draw vendor in the state, selling $3.3 million worth of tickets last year, more than $1 million more than the second busiest location, a World Books shop in Penn Station. Some stay for just a few minutes. Others play for the length of a workday, repeatedly traversing the few yards between their seats and the cash register as they hand the next wager to a clerk with a dollar bill or two, and return to wait. “It’s like my job, 24
2 0.96092981 1306 andrew gelman stats-2012-05-07-Lists of Note and Letters of Note
Introduction: These (from Shaun Usher) are surprisingly good, especially since he appears to come up with new lists and letters pretty regularly. I suppose a lot of them get sent in from readers, but still. Here’s my favorite recent item, a letter sent to the Seattle Bureau of Prohibition in 1931: Dear Sir: My husband is in the habit of buying a quart of wiskey every other day from a Chinese bootlegger named Chin Waugh living at 317-16th near Alder street. We need this money for household expenses. Will you please have his place raided? He keeps a supply planted in the garden and a smaller quantity under the back steps for quick delivery. If you make the raid at 9:30 any morning you will be sure to get the goods and Chin also as he leaves the house at 10 o’clock and may clean up before he goes. Thanking you in advance, I remain yours truly, Mrs. Hillyer
Introduction: Paul Nee sends in this amusing item: MELA Sciences claimed success in a clinical trial of its experimental skin cancer detection device only by altering the statistical method used to analyze the data in violation of an agreement with U.S. regulators, charges an independent healthcare analyst in a report issued last week. . . The BER report, however, relies on its own analysis to suggest that MELA struck out with FDA because the agency’s medical device reviewers discovered the MELAFind pivotal study failed to reach statistical significance despite the company’s claims to the contrary. And now here’s where it gets interesting: MELA claims that a phase III study of MELAFind met its primary endpoint by detecting accurately 112 of 114 eligible melanomas for a “sensitivity” rate of 98%. The lower confidence bound of the sensitivity analysis was 95.1%, which met the FDA’s standard for statistical significance in the study spelled out in a binding agreement with MELA, the compa
4 0.95328581 327 andrew gelman stats-2010-10-07-There are never 70 distinct parameters
Introduction: Sam Seaver writes: I’m a graduate student in computational biology, and I’m relatively new to advanced statistics, and am trying to teach myself how best to approach a problem I have. My dataset is a small sparse matrix of 150 cases and 70 predictors, it is sparse as in many zeros, not many ‘NA’s. Each case is a nutrient that is fed into an in silico organism, and its response is whether or not it stimulates growth, and each predictor is one of 70 different pathways that the nutrient may or may not belong to. Because all of the nutrients do not belong to all of the pathways, there are thus many zeros in my matrix. My goal is to be able to use the pathways themselves to predict whether or not a nutrient could stimulate growth, thus I wanted to compute regression coefficients for each pathway, with which I could apply to other nutrients for other species. There are quite a few singularities in the dataset (summary(glm) reports that 14 coefficients are not defined because of sin
same-blog 5 0.94905829 99 andrew gelman stats-2010-06-19-Paired comparisons
Introduction: Mark Palko writes: I’ve got a stat problem I’d like to run past you. It’s one of those annoying problems that feels like it should be obvious but the solution has evaded me and the colleagues I’ve discussed it with. I’m working on a project where the metric of interest is defined in relation to pairs of data points. It has nothing to do with sports or betting but the following analogy (which I also post on the blog) covers the basic situation: “You want to build a model predicting the spread for games in a new football league. Because the line-up of teams is still in flux, you decide to use only stats from individual teams as inputs (for example, an indicator variable for when the Ambushers play the Ravagers would not be allowed).” Is there a standard approach for modeling this kind of data? My reply: I don’t quite understand your question, but are you familiar with the Bradley-Terry and Thurstone-Mosteller models for paired comparisons? These are old–from the 1920s and 194
6 0.94305038 1023 andrew gelman stats-2011-11-22-Going Beyond the Book: Towards Critical Reading in Statistics Teaching
7 0.93798804 934 andrew gelman stats-2011-09-30-Nooooooooooooooooooo!
8 0.93111062 302 andrew gelman stats-2010-09-28-This is a link to a news article about a scientific paper
9 0.92890918 319 andrew gelman stats-2010-10-04-“Who owns Congress”
10 0.92715508 405 andrew gelman stats-2010-11-10-Estimation from an out-of-date census
11 0.92497492 205 andrew gelman stats-2010-08-13-Arnold Zellner
12 0.91594857 787 andrew gelman stats-2011-07-05-Different goals, different looks: Infovis and the Chris Rock effect
13 0.9117977 1405 andrew gelman stats-2012-07-04-“Titanic Thompson: The Man Who Would Bet on Everything”
14 0.91051698 1887 andrew gelman stats-2013-06-07-“Happy Money: The Science of Smarter Spending”
15 0.90865093 690 andrew gelman stats-2011-05-01-Peter Huber’s reflections on data analysis
16 0.90436625 1338 andrew gelman stats-2012-05-23-Advice on writing research articles
17 0.89248437 1642 andrew gelman stats-2012-12-28-New book by Stef van Buuren on missing-data imputation looks really good!
18 0.88784128 678 andrew gelman stats-2011-04-25-Democrats do better among the most and least educated groups
19 0.88773525 2065 andrew gelman stats-2013-10-17-Cool dynamic demographic maps provide beautiful illustration of Chris Rock effect
20 0.88561714 236 andrew gelman stats-2010-08-26-Teaching yourself mathematics