andrew_gelman_stats-2012-1214 knowledge-graph by maker-knowledge-mining

1214 andrew gelman stats-2012-03-15-Of forecasts and graph theory and characterizing a statistical method by the information it uses


meta information for this blog

Source: html

Introduction: Wayne Folta points me to “EigenBracket 2012: Using Graph Theory to Predict NCAA March Madness Basketball” and writes, “I [Folta] have got to believe that he’s simply re-invented a statistical method in a graph-ish context, but don’t know enough to judge.” I have not looked in detail at the method being presented here—I’m not much of a college basketball fan—but I’d like to use this as an excuse to make one of my favorite general points, which is that a good way to characterize any statistical method is by what information it uses. The basketball ranking method here uses score differentials between teams in the past season. On the plus side, that is better than simply using won-loss records (which (a) discards score differentials and (b) discards information on who played whom). On the minus side, the method appears to be discretizing the scores (thus throwing away information on the exact score differential) and doesn’t use any external information such as external ratings. A


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Wayne Folta points me to “EigenBracket 2012: Using Graph Theory to Predict NCAA March Madness Basketball” and writes, “I [Folta] have got to believe that he’s simply re-invented a statistical method in a graph-ish context, but don’t know enough to judge. [sent-1, score-0.522]

2 ” I have not looked in detail at the method being presented here—I’m not much of a college basketball fan—but I’d like to use this as an excuse to make one of my favorite general points, which is that a good way to characterize any statistical method is by what information it uses. [sent-2, score-1.92]

3 The basketball ranking method here uses score differentials between teams in the past season. [sent-3, score-1.435]

4 On the plus side, that is better than simply using won-loss records (which (a) discards score differentials and (b) discards information on who played whom). [sent-4, score-1.826]

5 On the minus side, the method appears to be discretizing the scores (thus throwing away information on the exact score differential) and doesn’t use any external information such as external ratings. [sent-5, score-1.926]

6 Anyway, my point is that the writeup of the method focuses on statistical operations (forming a matrix of a graph, computing eigensomethingorothers), and, sure, something like that is necessary, but to me, what’s interesting is to know what information went into the rankings. [sent-6, score-1.158]

7 If I wanted to use the information that this guy was using, I’d probably just fit a simple normal linear model with a latent parameter for each team. [sent-9, score-0.617]
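To make that last sentence concrete, here is a minimal sketch, in Python with made-up teams and scores, of the kind of simple normal linear model described there: a latent strength parameter for each team, fit to score differentials by least squares. This is an illustration of the alternative being suggested, not the EigenBracket method and not code from the post.

import numpy as np

# Hypothetical games: (team 1, team 2, score differential from team 1's perspective)
games = [("A", "B", 7), ("B", "C", 3), ("A", "C", 12), ("C", "A", -5)]

teams = sorted({t for g in games for t in g[:2]})
idx = {t: i for i, t in enumerate(teams)}

# Design matrix: +1 for team 1, -1 for team 2 in each game, so that
# E[differential] = theta[team 1] - theta[team 2]
X = np.zeros((len(games), len(teams)))
y = np.array([float(diff) for _, _, diff in games])
for row, (t1, t2, _) in enumerate(games):
    X[row, idx[t1]] = 1.0
    X[row, idx[t2]] = -1.0

# Least-squares fit of the latent strengths; they are identified only up to
# an additive constant, so center them before ranking.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
theta -= theta.mean()

for team, strength in sorted(zip(teams, theta), key=lambda p: -p[1]):
    print(f"{team}: {strength:+.2f}")

A fuller version might add a home-court term or partial pooling across teams, but the point of the post stands either way: what matters is which information (exact score differentials and who played whom) enters the fit.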


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('method', 0.328), ('basketball', 0.307), ('discards', 0.3), ('differentials', 0.283), ('folta', 0.247), ('score', 0.239), ('information', 0.237), ('external', 0.187), ('ncaa', 0.141), ('writeup', 0.135), ('wayne', 0.123), ('side', 0.123), ('forming', 0.118), ('ranking', 0.114), ('simply', 0.107), ('operations', 0.107), ('minus', 0.107), ('graph', 0.104), ('differential', 0.103), ('excuse', 0.101), ('characterize', 0.099), ('march', 0.098), ('focuses', 0.094), ('records', 0.094), ('using', 0.094), ('played', 0.093), ('teams', 0.093), ('latent', 0.091), ('throwing', 0.089), ('matrix', 0.089), ('use', 0.087), ('statistical', 0.087), ('fan', 0.084), ('exact', 0.083), ('scores', 0.081), ('computing', 0.081), ('plus', 0.079), ('detail', 0.078), ('favorite', 0.075), ('uses', 0.071), ('necessary', 0.071), ('predict', 0.07), ('normal', 0.07), ('presented', 0.069), ('team', 0.069), ('linear', 0.067), ('parameter', 0.065), ('appears', 0.064), ('college', 0.063), ('looked', 0.061)]
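For readers wondering how a (word, weight) list like the one above is typically produced: below is a minimal sketch using scikit-learn's TfidfVectorizer on a few placeholder documents. The actual pipeline behind this page is not documented here, so the vectorizer settings and corpus are assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder corpus standing in for the full set of blog posts
docs = [
    "graph theory basketball ranking method score differentials information",
    "model the expected score differential not the probability of a win",
    "win probabilities during a sporting event from betting markets",
]

vec = TfidfVectorizer(stop_words="english")
tfidf = vec.fit_transform(docs)

# Top-weighted words for the first document, analogous to the topN-words list above
terms = vec.get_feature_names_out()
weights = tfidf[0].toarray().ravel()
top = sorted(zip(terms, weights), key=lambda pair: -pair[1])[:10]
print([(word, round(w, 3)) for word, w in top])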

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000001 1214 andrew gelman stats-2012-03-15-Of forecasts and graph theory and characterizing a statistical method by the information it uses

2 0.27673355 2226 andrew gelman stats-2014-02-26-Econometrics, political science, epidemiology, etc.: Don’t model the probability of a discrete outcome, model the underlying continuous variable

Introduction: This is an echo of yesterday’s post, Basketball Stats: Don’t model the probability of win, model the expected score differential. As with basketball, so with baseball: as the great Bill James wrote, if you want to predict a pitcher’s win-loss record, it’s better to use last year’s ERA than last year’s W-L. As with basketball and baseball, so with epidemiology: as Joseph Delaney points out in my favorite blog that nobody reads, you will see much better prediction if you first model change in the parameter (e.g. blood pressure) and then convert that to the binary disease state (e.g. hypertension) than if you just develop a logistic model for prob(hypertension). As with basketball, baseball, and epidemiology, so with political science: instead of modeling election winners, better to model vote differential, a point that I made back in 1993 (see page 120 here) but which seems to continually need repeating. A forecasting method should get essentially no credit for correctl

3 0.24591812 1146 andrew gelman stats-2012-01-30-Convenient page of data sources from the Washington Post

Introduction: Wayne Folta points us to this list.

4 0.19823514 2262 andrew gelman stats-2014-03-23-Win probabilities during a sporting event

Introduction: Todd Schneider writes: Apropos of your recent blog post about modeling score differential of basketball games, I thought you might enjoy a site I built, gambletron2000.com, that gathers real-time win probabilities from betting markets for most major sports (including NBA and college basketball). My original goal was to use the variance of changes in win probabilities to quantify which games were the most exciting, but I got a bit carried away and ended up pursuing a bunch of other ideas, which you can read about in the full writeup here. This particular passage from the anonymous someone in your post: My idea is for each timestep in a game (a second, 5 seconds, etc), use the Vegas line, the current score differential, who has the ball, and the number of possessions played already (to account for differences in pace) to create a point estimate probability of the home team winning. reminded me of a graph I made, which shows the mean-reverting tendency of N

5 0.19669703 1923 andrew gelman stats-2013-07-03-Bayes pays!

Introduction: Jason Rosenfeld, who has the amazing title of “Manager of Basketball Analytics” at the Charlotte Bobcats, announces the following jobs : Basketball Operations: Statistics Basketball Operations Systems Developer – Charlotte Bobcats (Charlotte, NC) POSITION OVERVIEW The Basketball Operations System Developer will collect and import data to our database, check data, and field requests from the Basketball Operations staff.  This position will be instrumental in molding and improving our database to assist the staff in player personnel and coaching efforts. ESSENTIAL DUTIES AND RESPONSIBILITIES • Respond to data and database requests from the front office. • Build user-friendly software tools for use by the basketball operations staff. • Accumulate data from various sources to input and organize into our system to assist the basketball operations staff with decisions. • Check and clean data for accuracy and import to our database. • Provide ideas and play a key ro

6 0.19365352 891 andrew gelman stats-2011-09-05-World Bank data now online

7 0.18927228 2224 andrew gelman stats-2014-02-25-Basketball Stats: Don’t model the probability of win, model the expected score differential.

8 0.14366215 306 andrew gelman stats-2010-09-29-Statistics and the end of time

9 0.12736434 1825 andrew gelman stats-2013-04-25-It’s binless! A program for computing normalizing functions

10 0.10810835 2222 andrew gelman stats-2014-02-24-On deck this week

11 0.10577042 1940 andrew gelman stats-2013-07-16-A poll that throws away data???

12 0.10338566 496 andrew gelman stats-2011-01-01-Tukey’s philosophy

13 0.10150811 2109 andrew gelman stats-2013-11-21-Hidden dangers of noninformative priors

14 0.099021085 1149 andrew gelman stats-2012-02-01-Philosophy of Bayesian statistics: my reactions to Cox and Mayo

15 0.097321346 309 andrew gelman stats-2010-10-01-Why Development Economics Needs Theory?

16 0.095664926 300 andrew gelman stats-2010-09-28-A calibrated Cook gives Dems the edge in Nov, sez Sandy

17 0.094544552 1019 andrew gelman stats-2011-11-19-Validation of Software for Bayesian Models Using Posterior Quantiles

18 0.086979114 2017 andrew gelman stats-2013-09-11-“Informative g-Priors for Logistic Regression”

19 0.083842136 1606 andrew gelman stats-2012-12-05-The Grinch Comes Back

20 0.08295732 878 andrew gelman stats-2011-08-29-Infovis, infographics, and data visualization: Where I’m coming from, and where I’d like to go


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.143), (1, 0.056), (2, 0.016), (3, 0.04), (4, 0.073), (5, -0.044), (6, -0.014), (7, 0.018), (8, -0.017), (9, -0.009), (10, 0.006), (11, 0.016), (12, -0.055), (13, -0.027), (14, -0.088), (15, 0.007), (16, 0.049), (17, -0.015), (18, 0.023), (19, -0.051), (20, -0.014), (21, 0.04), (22, 0.012), (23, 0.034), (24, 0.089), (25, 0.033), (26, 0.041), (27, 0.068), (28, -0.031), (29, -0.06), (30, 0.058), (31, 0.024), (32, 0.047), (33, -0.011), (34, 0.016), (35, 0.023), (36, 0.031), (37, 0.012), (38, -0.041), (39, 0.01), (40, 0.04), (41, -0.019), (42, 0.066), (43, 0.024), (44, -0.049), (45, 0.012), (46, -0.01), (47, -0.056), (48, -0.09), (49, 0.037)]
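As a rough illustration of what the topic weights above represent: LSI projects each post's tfidf vector onto a small number of latent dimensions (here there are 50), and the simValue scores in the list below are similarities computed in that space. The sketch below, using scikit-learn's truncated SVD and placeholder documents, is an assumption about how such numbers could be produced, not the actual pipeline behind this page.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder corpus standing in for the full set of blog posts
docs = [
    "graph theory basketball ranking method score differentials information",
    "model the expected score differential not the probability of a win",
    "win probabilities during a sporting event from betting markets",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Project onto 2 latent topics (the list above uses 50); rows are per-post topic weights
lsi = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# Cosine similarity in topic space gives simValue-style scores for the first post
print(cosine_similarity(lsi)[0])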

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.96209621 1214 andrew gelman stats-2012-03-15-Of forecasts and graph theory and characterizing a statistical method by the information it uses

2 0.68787342 2262 andrew gelman stats-2014-03-23-Win probabilities during a sporting event

Introduction: Todd Schneider writes: Apropos of your recent blog post about modeling score differential of basketball games, I thought you might enjoy a site I built, gambletron2000.com, that gathers real-time win probabilities from betting markets for most major sports (including NBA and college basketball). My original goal was to use the variance of changes in win probabilities to quantify which games were the most exciting, but I got a bit carried away and ended up pursuing a bunch of other ideas, which you can read about in the full writeup here. This particular passage from the anonymous someone in your post: My idea is for each timestep in a game (a second, 5 seconds, etc), use the Vegas line, the current score differential, who has the ball, and the number of possessions played already (to account for differences in pace) to create a point estimate probability of the home team winning. reminded me of a graph I made, which shows the mean-reverting tendency of N

3 0.62744176 1062 andrew gelman stats-2011-12-16-Mr. Pearson, meet Mr. Mandelbrot: Detecting Novel Associations in Large Data Sets

Introduction: Jeremy Fox asks what I think about this paper by David N. Reshef, Yakir Reshef, Hilary Finucane, Sharon Grossman, Gilean McVean, Peter Turnbaugh, Eric Lander, Michael Mitzenmacher, and Pardis Sabeti which proposes a new nonlinear R-squared-like measure. My quick answer is that it looks really cool! From my quick reading of the paper, it appears that the method reduces on average to the usual R-squared when fit to data of the form y = a + bx + error, and that it also has a similar interpretation when “a + bx” is replaced by other continuous functions. Unlike R-squared, the method of Reshef et al. depends on a tuning parameter that controls the level of discretization, in a “How long is the coast of Britain” sort of way. The dependence on scale is inevitable for such a general method. Just consider: if you sample 1000 points from the unit bivariate normal distribution, (x,y) ~ N(0,I), you’ll be able to fit them perfectly by a 999-degree polynomial fit to the data. So the sca

4 0.62264937 2224 andrew gelman stats-2014-02-25-Basketball Stats: Don’t model the probability of win, model the expected score differential.

Introduction: Someone who wants to remain anonymous writes: I am working to create a more accurate in-game win probability model for basketball games. My idea is for each timestep in a game (a second, 5 seconds, etc), use the Vegas line, the current score differential, who has the ball, and the number of possessions played already (to account for differences in pace) to create a point estimate probability of the home team winning. This problem would seem to fit a multi-level model structure well. It seems silly to estimate 2,000 regressions (one for each timestep), but the coefficients should vary at each timestep. Do you have suggestions for what type of model this could/would be? Additionally, I believe this needs to be some form of logit/probit given the binary dependent variable (win or loss). Finally, do you have suggestions for what package could accomplish this in Stata or R? To answer the questions in reverse order: 3. I’d hope this could be done in Stan (which can be run from R)

5 0.62113136 2247 andrew gelman stats-2014-03-14-The maximal information coefficient

Introduction: Justin Kinney writes: I wanted to let you know that the critique Mickey Atwal and I wrote regarding equitability and the maximal information coefficient has just been published. We discussed this paper last year, under the heading, Too many MC’s not enough MIC’s, or What principles should govern attempts to summarize bivariate associations in large multivariate datasets? Kinney and Atwal’s paper is interesting, with my only criticism being that in some places they seem to aim for what might not be possible. For example, they write that “mutual information is already widely believed to quantify dependencies without bias for relationships of one type or another,” which seems a bit vague to me. And later they write, “How to compute such an estimate that does not bias the resulting mutual information value remains an open problem,” which seems to me to miss the point in that unbiased statistical estimates are not generally possible and indeed are often not desirable. Their

6 0.61819178 1706 andrew gelman stats-2013-02-04-Too many MC’s not enough MIC’s, or What principles should govern attempts to summarize bivariate associations in large multivariate datasets?

7 0.60394263 1230 andrew gelman stats-2012-03-26-Further thoughts on nonparametric correlation measures

8 0.60110736 2226 andrew gelman stats-2014-02-26-Econometrics, political science, epidemiology, etc.: Don’t model the probability of a discrete outcome, model the underlying continuous variable

9 0.5981555 2311 andrew gelman stats-2014-04-29-Bayesian Uncertainty Quantification for Differential Equations!

10 0.59411585 1387 andrew gelman stats-2012-06-21-Will Tiger Woods catch Jack Nicklaus? And a discussion of the virtues of using continuous data even if your goal is discrete prediction

11 0.58146864 1903 andrew gelman stats-2013-06-17-Weak identification provides partial information

12 0.57331413 2324 andrew gelman stats-2014-05-07-Once more on nonparametric measures of mutual information

13 0.56980127 559 andrew gelman stats-2011-02-06-Bidding for the kickoff

14 0.56592602 623 andrew gelman stats-2011-03-21-Baseball’s greatest fielders

15 0.55920517 2314 andrew gelman stats-2014-05-01-Heller, Heller, and Gorfine on univariate and multivariate information measures

16 0.55855322 171 andrew gelman stats-2010-07-30-Silly baseball example illustrates a couple of key ideas they don’t usually teach you in statistics class

17 0.55116707 562 andrew gelman stats-2011-02-06-Statistician cracks Toronto lottery

18 0.55042535 607 andrew gelman stats-2011-03-11-Rajiv Sethi on the interpretation of prediction market data

19 0.54972756 1146 andrew gelman stats-2012-01-30-Convenient page of data sources from the Washington Post

20 0.5467065 2076 andrew gelman stats-2013-10-24-Chasing the noise: W. Edwards Deming would be spinning in his grave


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(9, 0.012), (15, 0.024), (16, 0.069), (21, 0.035), (24, 0.15), (27, 0.013), (41, 0.193), (54, 0.014), (57, 0.046), (59, 0.014), (86, 0.037), (89, 0.023), (97, 0.022), (99, 0.239)]
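Similarly, the sparse (topicId, topicWeight) list above is what a fitted LDA topic model reports for this post: each document gets a probability distribution over topics, and only the non-negligible entries are listed. The sketch below uses scikit-learn and placeholder documents; the real model, vocabulary, and number of topics behind this page are assumptions here.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder corpus standing in for the full set of blog posts
docs = [
    "graph theory basketball ranking method score differentials information",
    "model the expected score differential not the probability of a win",
    "win probabilities during a sporting event from betting markets",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=5, random_state=0)
doc_topics = lda.fit_transform(counts)  # each row is a distribution over topics

# Keep only topics with non-negligible weight, as in the list above
for topic_id, weight in enumerate(doc_topics[0]):
    if weight > 0.05:
        print(topic_id, round(weight, 3))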

similar blogs list:

simIndex simValue blogId blogTitle

1 0.9486227 1626 andrew gelman stats-2012-12-16-The lamest, grudgingest, non-retraction retraction ever

Introduction: In politics we’re familiar with the non-apology apology (well described in Wikipedia as “a statement that has the form of an apology but does not express the expected contrition”). Here’s the scientific equivalent: the non-retraction retraction. Sanjay Srivastava points to an amusing yet barfable story of a pair of researchers who (inadvertently, I assume) made a data coding error and were eventually moved to issue a correction notice, but even then refused to fully admit their error. As Srivastava puts it, the story “ended up with Lew [Goldberg] and colleagues [Kibeom Lee and Michael Ashton] publishing a comment on an erratum – the only time I’ve ever heard of that happening in a scientific journal.” From the comment on the erratum: In their “erratum and addendum,” Anderson and Ones (this issue) explained that we had brought their attention to the “potential” of a “possible” misalignment and described the results computed from re-aligned data as being based on a “post-ho

same-blog 2 0.93576485 1214 andrew gelman stats-2012-03-15-Of forecasts and graph theory and characterizing a statistical method by the information it uses

3 0.93074191 303 andrew gelman stats-2010-09-28-“Genomics” vs. genetics

Introduction: John Cook and Joseph Delaney point to an article by Yurii Aulchenko et al., who write: 54 loci showing strong statistical evidence for association to human height were described, providing us with potential genomic means of human height prediction. In a population-based study of 5748 people, we find that a 54-loci genomic profile explained 4-6% of the sex- and age-adjusted height variance, and had limited ability to discriminate tall/short people. . . . In a family-based study of 550 people, with both parents having height measurements, we find that the Galtonian mid-parental prediction method explained 40% of the sex- and age-adjusted height variance, and showed high discriminative accuracy. . . . The message is that the simple approach of predicting child’s height using a regression model given parents’ average height performs much better than the method they have based on combining 54 genes. They also find that, if you start with the prediction based on parents’ heigh

4 0.89388859 516 andrew gelman stats-2011-01-14-A new idea for a science core course based entirely on computer simulation

Introduction: Columbia College has for many years had a Core Curriculum, in which students read classics such as Plato (in translation) etc. A few years ago they created a Science core course. There was always some confusion about this idea: On one hand, how much would college freshmen really learn about science by reading the classic writings of Galileo, Laplace, Darwin, Einstein, etc.? And they certainly wouldn’t get much out by puzzling over the latest issues of Nature, Cell, and Physical Review Letters. On the other hand, what’s the point of having them read Dawkins, Gould, or even Brian Greene? These sorts of popularizations give you a sense of modern science (even to the extent of conveying some of the debates in these fields), but reading them might not give the same intellectual engagement that you’d get from wrestling with the Bible or Shakespeare. I have a different idea. What about structuring the entire course around computer programming and simulation? Start with a few weeks t

5 0.89088005 1300 andrew gelman stats-2012-05-05-Recently in the sister blog

Introduction: Culture war: The rules You can only accept capital punishment if you’re willing to have innocent people executed every now and then The politics of America’s increasing economic inequality

6 0.88304865 2262 andrew gelman stats-2014-03-23-Win probabilities during a sporting event

7 0.87966114 454 andrew gelman stats-2010-12-07-Diabetes stops at the state line?

8 0.87445879 2224 andrew gelman stats-2014-02-25-Basketball Stats: Don’t model the probability of win, model the expected score differential.

9 0.87184364 1669 andrew gelman stats-2013-01-12-The power of the puzzlegraph

10 0.87064993 447 andrew gelman stats-2010-12-03-Reinventing the wheel, only more so.

11 0.86840993 2311 andrew gelman stats-2014-04-29-Bayesian Uncertainty Quantification for Differential Equations!

12 0.86616993 685 andrew gelman stats-2011-04-29-Data mining and allergies

13 0.86611569 2204 andrew gelman stats-2014-02-09-Keli Liu and Xiao-Li Meng on Simpson’s paradox

14 0.86376452 1895 andrew gelman stats-2013-06-12-Peter Thiel is writing another book!

15 0.86256266 1019 andrew gelman stats-2011-11-19-Validation of Software for Bayesian Models Using Posterior Quantiles

16 0.85903716 2185 andrew gelman stats-2014-01-25-Xihong Lin on sparsity and density

17 0.85579598 1923 andrew gelman stats-2013-07-03-Bayes pays!

18 0.85462034 778 andrew gelman stats-2011-06-24-New ideas on DIC from Martyn Plummer and Sumio Watanabe

19 0.85005105 1297 andrew gelman stats-2012-05-03-New New York data research organizations

20 0.8490026 1062 andrew gelman stats-2011-12-16-Mr. Pearson, meet Mr. Mandelbrot: Detecting Novel Associations in Large Data Sets