andrew_gelman_stats andrew_gelman_stats-2014 andrew_gelman_stats-2014-2224 knowledge-graph by maker-knowledge-mining

2224 andrew gelman stats-2014-02-25-Basketball Stats: Don’t model the probability of win, model the expected score differential.


metadata for this blog

Source: html

Introduction: Someone who wants to remain anonymous writes: I am working to create a more accurate in-game win probability model for basketball games. My idea is for each timestep in a game (a second, 5 seconds, etc), use the Vegas line, the current score differential, who has the ball, and the number of possessions played already (to account for differences in pace) to create a point estimate probability of the home team winning. This problem would seem to fit a multi-level model structure well. It seems silly to estimate 2,000 regressions (one for each timestep), but the coefficients should vary at each timestep. Do you have suggestions for what type of model this could/would be? Additionally, I believe this needs to be some form of logit/probit given the binary dependent variable (win or loss). Finally, do you have suggestions for what package could accomplish this in Stata or R? To answer the questions in reverse order: 3. I’d hope this could be done in Stan (which can be run from R)
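To make the Stan-from-R suggestion concrete, here is a minimal sketch of the kind of varying-coefficient model the questioner describes, fit via the rstanarm package. Everything in it is assumed for illustration: the data frame games and its columns (final_score_diff, vegas_line, score_diff, home_possession, possessions, timestep) are hypothetical, with one row per game-timestep, and the outcome is the final score differential rather than win/loss, anticipating the advice in the summary below.

# A minimal sketch, not the post's actual model. Assumes a hypothetical
# data frame `games` with one row per game-timestep.
library(rstanarm)

fit <- stan_glmer(
  final_score_diff ~ vegas_line + score_diff + home_possession + possessions +
    (1 + score_diff | timestep),  # intercept and score_diff slope vary by
                                  # timestep, partially pooled, instead of
                                  # ~2,000 separate regressions
  data = games
)

Partial pooling across timesteps answers the "2,000 regressions" worry directly: each timestep gets its own coefficients, but they are shrunk toward a common distribution.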


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Someone who wants to remain anonymous writes: I am working to create a more accurate in-game win probability model for basketball games. [sent-1, score-1.227]

2 My idea is for each timestep in a game (a second, 5 seconds, etc), use the Vegas line, the current score differential, who has the ball, and the number of possessions played already (to account for differences in pace) to create a point estimate probability of the home team winning. [sent-2, score-1.202]

3 This problem would seem to fit a multi-level model structure well. [sent-3, score-0.357]

4 It seems silly to estimate 2,000 regressions (one for each timestep), but the coefficients should vary at each timestep. [sent-4, score-0.358]

5 Do you have suggestions for what type of model this could/would be? [sent-5, score-0.337]

6 Additionally, I believe this needs to be some form of logit/probit given the binary dependent variable (win or loss). [sent-6, score-0.156]

7 Finally, do you have suggestions for what package could accomplish this in Stata or R? [sent-7, score-0.31]

8 Yes, a model with varying coefficients would make sense. [sent-11, score-0.412]

9 I’d play around with the data, graph some estimates based on different timesteps, and then from there fit a parametric model that fits the data and makes sense. [sent-12, score-0.525]

10 Don’t model the probability of win, model the expected score differential. [sent-14, score-0.817]

11 But the most efficient way to get there is to model the score differential and then map that back to win probabilities. [sent-16, score-1.463]

12 The exact same issue comes up in election modeling: it makes sense to predict vote differential and then map that to Pr(win), rather than predicting Pr(win) directly. [sent-17, score-0.83]

13 This is most obvious in very close games (or elections) or blowouts; in either of these settings the win/loss outcome provides essentially zero information. [sent-18, score-0.144]

14 But it’s true more generally that there’s a lot of information in the score (or vote) differential that’s thrown away if you just look at win/loss. [sent-19, score-0.724]
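As a rough illustration of sentence 9 ("play around with the data, graph some estimates based on different timesteps"), here is a base-R sketch, under the same hypothetical games data frame as above, that fits a separate regression at each timestep and graphs how the coefficient of the current score differential drifts over the game; a plot like this is what you would use to choose a parametric form.

# Exploratory pass: one least-squares fit per timestep, then graph the
# coefficient of the current score differential against time.
timesteps <- sort(unique(games$timestep))
slopes <- sapply(timesteps, function(t) {
  coef(lm(final_score_diff ~ vegas_line + score_diff,
          data = subset(games, timestep == t)))[["score_diff"]]
})
plot(timesteps, slopes, type = "l",
     xlab = "timestep",
     ylab = "coefficient of current score differential")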
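And to make sentences 10-12 concrete, here is a minimal sketch of "model the score differential, then map back to win probability", again under the same hypothetical data assumptions, using a normal approximation to the predictive distribution of the final differential.

# Fit the continuous outcome, then convert to Pr(win).
fit <- lm(final_score_diff ~ vegas_line + score_diff + home_possession +
            possessions, data = games)

mu_hat    <- predict(fit, newdata = games)  # expected final differential
sigma_hat <- summary(fit)$sigma             # residual sd (ignores parameter
                                            # uncertainty; fine for a sketch)

# Pr(home win) = Pr(final differential > 0) under the normal approximation;
# basketball has no ties, so Pr(differential == 0) can be ignored.
games$p_home_win <- pnorm(mu_hat / sigma_hat)

Nothing here is specific to least squares: the multilevel fit sketched after the introduction would feed the same mapping, with partially pooled, timestep-varying coefficients.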


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('win', 0.398), ('differential', 0.361), ('score', 0.279), ('timestep', 0.262), ('model', 0.207), ('pr', 0.172), ('map', 0.144), ('coefficients', 0.133), ('suggestions', 0.13), ('create', 0.126), ('probability', 0.124), ('vegas', 0.123), ('possessions', 0.118), ('vote', 0.116), ('accomplish', 0.111), ('additionally', 0.103), ('pace', 0.103), ('parametric', 0.096), ('seconds', 0.093), ('basketball', 0.089), ('ball', 0.088), ('fit', 0.085), ('anonymous', 0.084), ('thrown', 0.084), ('stata', 0.083), ('dependent', 0.083), ('estimate', 0.082), ('played', 0.081), ('games', 0.079), ('reverse', 0.078), ('efficient', 0.074), ('binary', 0.073), ('loss', 0.073), ('exact', 0.073), ('regressions', 0.073), ('varying', 0.072), ('yeah', 0.071), ('vary', 0.07), ('fits', 0.07), ('predicting', 0.069), ('package', 0.069), ('accurate', 0.067), ('remain', 0.067), ('makes', 0.067), ('elections', 0.067), ('game', 0.065), ('settings', 0.065), ('account', 0.065), ('wants', 0.065), ('structure', 0.065)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999988 2224 andrew gelman stats-2014-02-25-Basketball Stats: Don’t model the probability of win, model the expected score differential.


2 0.48445839 2262 andrew gelman stats-2014-03-23-Win probabilities during a sporting event

Introduction: Todd Schneider writes: Apropos of your recent blog post about modeling score differential of basketball games, I thought you might enjoy a site I built, gambletron2000.com, that gathers real-time win probabilities from betting markets for most major sports (including NBA and college basketball). My original goal was to use the variance of changes in win probabilities to quantify which games were the most exciting, but I got a bit carried away and ended up pursuing a bunch of other ideas, which you can read about in the full writeup here. This particular passage from the anonymous someone in your post: “My idea is for each timestep in a game (a second, 5 seconds, etc), use the Vegas line, the current score differential, who has the ball, and the number of possessions played already (to account for differences in pace) to create a point estimate probability of the home team winning” reminded me of a graph I made, which shows the mean-reverting tendency of N

3 0.38022134 2226 andrew gelman stats-2014-02-26-Econometrics, political science, epidemiology, etc.: Don’t model the probability of a discrete outcome, model the underlying continuous variable

Introduction: This is an echo of yesterday’s post, Basketball Stats: Don’t model the probability of win, model the expected score differential. As with basketball, so with baseball: as the great Bill James wrote, if you want to predict a pitcher’s win-loss record, it’s better to use last year’s ERA than last year’s W-L. As with basketball and baseball, so with epidemiology: as Joseph Delaney points out in my favorite blog that nobody reads, you will see much better prediction if you first model change in the parameter (e.g. blood pressure) and then convert that to the binary disease state (e.g. hypertension) than if you just develop a logistic model for prob(hypertension). As with basketball, baseball, and epidemiology, so with political science: instead of modeling election winners, better to model vote differential, a point that I made back in 1993 (see page 120 here) but which seems to continually need repeating. A forecasting method should get essentially no credit for correctl

4 0.2121838 2222 andrew gelman stats-2014-02-24-On deck this week

Introduction: Mon: “Edlin’s rule” for routinely scaling down published estimates Tues: Basketball Stats: Don’t model the probability of win, model the expected score differential Wed: A good comment on one of my papers Thurs: “What Can we Learn from the Many Labs Replication Project?” Fri: God/leaf/tree Sat: “We are moving from an era of private data and public analyses to one of public data and private analyses. Just as we have learned to be cautious about data that are missing, we may have to be cautious about missing analyses also.”

5 0.19560207 2311 andrew gelman stats-2014-04-29-Bayesian Uncertainty Quantification for Differential Equations!

Introduction: Mark Girolami points us to this paper and software (with Oksana Chkrebtii, David Campbell, and Ben Calderhead). They write: We develop a general methodology for the probabilistic integration of differential equations via model based updating of a joint prior measure on the space of functions and their temporal and spatial derivatives. This results in a posterior measure over functions reflecting how well they satisfy the system of differential equations and corresponding initial and boundary values. We show how this posterior measure can be naturally incorporated within the Kennedy and O’Hagan framework for uncertainty quantification and provides a fully Bayesian approach to model calibration. . . . A broad variety of examples are provided to illustrate the potential of this framework for characterising discretization uncertainty, including initial value, delay, and boundary value differential equations, as well as partial differential equations. We also demonstrate our methodolo

6 0.18927228 1214 andrew gelman stats-2012-03-15-Of forecasts and graph theory and characterizing a statistical method by the information it uses

7 0.15813895 1387 andrew gelman stats-2012-06-21-Will Tiger Woods catch Jack Nicklaus? And a discussion of the virtues of using continuous data even if your goal is discrete prediction

8 0.15155411 1544 andrew gelman stats-2012-10-22-Is it meaningful to talk about a probability of “65.7%” that Obama will win the election?

9 0.14117435 1337 andrew gelman stats-2012-05-22-Question 12 of my final exam for Design and Analysis of Sample Surveys

10 0.1345775 391 andrew gelman stats-2010-11-03-Some thoughts on election forecasting

11 0.13431546 934 andrew gelman stats-2011-09-30-Nooooooooooooooooooo!

12 0.12666984 1972 andrew gelman stats-2013-08-07-When you’re planning on fitting a model, build up to it by fitting simpler models first. Then, once you have a model you like, check the hell out of it

13 0.12478106 2291 andrew gelman stats-2014-04-14-Transitioning to Stan

14 0.12302741 1340 andrew gelman stats-2012-05-23-Question 13 of my final exam for Design and Analysis of Sample Surveys

15 0.12168244 1562 andrew gelman stats-2012-11-05-Let’s try this: Instead of saying, “The probability is 75%,” say “There’s a 25% chance I’m wrong”

16 0.11575038 1242 andrew gelman stats-2012-04-03-Best lottery story ever

17 0.11555862 1999 andrew gelman stats-2013-08-27-Bayesian model averaging or fitting a larger model

18 0.11320458 171 andrew gelman stats-2010-07-30-Silly baseball example illustrates a couple of key ideas they don’t usually teach you in statistics class

19 0.11268449 1367 andrew gelman stats-2012-06-05-Question 26 of my final exam for Design and Analysis of Sample Surveys

20 0.11235623 1540 andrew gelman stats-2012-10-18-“Intrade to the 57th power”


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.193), (1, 0.107), (2, 0.122), (3, 0.086), (4, 0.058), (5, 0.009), (6, -0.014), (7, -0.073), (8, 0.068), (9, -0.055), (10, 0.045), (11, 0.09), (12, -0.071), (13, -0.082), (14, -0.131), (15, -0.025), (16, 0.053), (17, -0.026), (18, 0.026), (19, -0.021), (20, -0.033), (21, 0.083), (22, 0.007), (23, -0.016), (24, 0.013), (25, 0.039), (26, 0.044), (27, 0.05), (28, -0.083), (29, -0.176), (30, 0.016), (31, -0.055), (32, -0.008), (33, 0.028), (34, 0.007), (35, 0.014), (36, 0.066), (37, -0.002), (38, -0.053), (39, 0.001), (40, -0.047), (41, -0.074), (42, 0.023), (43, -0.034), (44, -0.003), (45, 0.026), (46, 0.006), (47, 0.035), (48, -0.136), (49, 0.033)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.94460797 2224 andrew gelman stats-2014-02-25-Basketball Stats: Don’t model the probability of win, model the expected score differential.


2 0.82559752 2226 andrew gelman stats-2014-02-26-Econometrics, political science, epidemiology, etc.: Don’t model the probability of a discrete outcome, model the underlying continuous variable


3 0.81507874 2262 andrew gelman stats-2014-03-23-Win probabilities during a sporting event


4 0.72694016 1387 andrew gelman stats-2012-06-21-Will Tiger Woods catch Jack Nicklaus? And a discussion of the virtues of using continuous data even if your goal is discrete prediction

Introduction: I know next to nothing about golf. My mini-golf scores typically approach the maximum of 7 per hole, and I’ve never actually played macro-golf. I did publish a paper on golf once (A Probability Model for Golf Putting, with Deb Nolan), but it’s not so rare for people to publish papers on topics they know nothing about. Those who can’t, research. But I certainly have the ability to post other people’s ideas. Charles Murray writes: I [Murray] am playing around with the likelihood of Tiger Woods breaking Nicklaus’s record in the Majors. I’ve already gone on record two years ago with the reason why he won’t, but now I’m looking at it from a non-psychological perspective. Given the history of the majors, how far above the average _for other great golfers_ does Tiger have to perform? Here’s the procedure I’ve been working on: 1. For all golfers who have won at least one major since 1934 (the year the Masters began), create 120 lines: one for each Major for each year f

5 0.70143467 1284 andrew gelman stats-2012-04-26-Modeling probability data

Introduction: Rafael Huber writes: I conducted an experiment in which subjects were asked to estimate the probability of a certain event given a number of pieces of information (like a weather forecaster or a stock-market trader). These probability estimates are the dependent variable of my experiment. My goal is to model the data with a (hierarchical) Bayesian regression. A linear equation with all the presented information (quantified as log odds) defines the mu of a normal likelihood. The tau as precision is another free parameter.

y[r] ~ dnorm( mu[r] , tau[ subj[r] ] )
mu[r] <- b0[ subj[r] ] + b1[ subj[r] ] * x1[r] + b2[ subj[r] ] * x2[r] + b3[ subj[r] ] * x3[r]

My problem is that I do not believe that the normal is the correct probability distribution to model probability data (because the error is bounded). However, until now nobody was able to tell me how I can correctly model probability data. My reply: You can take the logit of the data before analyzing them. That is assuming there

6 0.64236414 934 andrew gelman stats-2011-09-30-Nooooooooooooooooooo!

7 0.63984227 151 andrew gelman stats-2010-07-16-Wanted: Probability distributions for rank orderings

8 0.62983578 1544 andrew gelman stats-2012-10-22-Is it meaningful to talk about a probability of “65.7%” that Obama will win the election?

9 0.62041843 782 andrew gelman stats-2011-06-29-Putting together multinomial discrete regressions by combining simple logits

10 0.61989772 485 andrew gelman stats-2010-12-25-Unlogging

11 0.61662155 1214 andrew gelman stats-2012-03-15-Of forecasts and graph theory and characterizing a statistical method by the information it uses

12 0.60692871 559 andrew gelman stats-2011-02-06-Bidding for the kickoff

13 0.60420251 171 andrew gelman stats-2010-07-30-Silly baseball example illustrates a couple of key ideas they don’t usually teach you in statistics class

14 0.60203356 391 andrew gelman stats-2010-11-03-Some thoughts on election forecasting

15 0.59879208 82 andrew gelman stats-2010-06-12-UnConMax – uncertainty consideration maxims 7 ± 2

16 0.59089446 562 andrew gelman stats-2011-02-06-Statistician cracks Toronto lottery

17 0.58713317 1981 andrew gelman stats-2013-08-14-The robust beauty of improper linear models in decision making

18 0.58062923 2311 andrew gelman stats-2014-04-29-Bayesian Uncertainty Quantification for Differential Equations!

19 0.57929116 1540 andrew gelman stats-2012-10-18-“Intrade to the 57th power”

20 0.57875299 1562 andrew gelman stats-2012-11-05-Let’s try this: Instead of saying, “The probability is 75%,” say “There’s a 25% chance I’m wrong”


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(9, 0.013), (11, 0.012), (15, 0.015), (16, 0.058), (24, 0.236), (27, 0.039), (40, 0.011), (41, 0.084), (51, 0.014), (60, 0.021), (77, 0.019), (86, 0.104), (89, 0.039), (99, 0.237)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.971138 2224 andrew gelman stats-2014-02-25-Basketball Stats: Don’t model the probability of win, model the expected score differential.


2 0.95657367 2262 andrew gelman stats-2014-03-23-Win probabilities during a sporting event


3 0.95155334 1062 andrew gelman stats-2011-12-16-Mr. Pearson, meet Mr. Mandelbrot: Detecting Novel Associations in Large Data Sets

Introduction: Jeremy Fox asks what I think about this paper by David N. Reshef, Yakir Reshef, Hilary Finucane, Sharon Grossman, Gilean McVean, Peter Turnbaugh, Eric Lander, Michael Mitzenmacher, and Pardis Sabeti which proposes a new nonlinear R-squared-like measure. My quick answer is that it looks really cool! From my quick reading of the paper, it appears that the method reduces on average to the usual R-squared when fit to data of the form y = a + bx + error, and that it also has a similar interpretation when “a + bx” is replaced by other continuous functions. Unlike R-squared, the method of Reshef et al. depends on a tuning parameter that controls the level of discretization, in a “How long is the coast of Britain” sort of way. The dependence on scale is inevitable for such a general method. Just consider: if you sample 1000 points from the unit bivariate normal distribution, (x,y) ~ N(0,I), you’ll be able to fit them perfectly by a 999-degree polynomial fit to the data. So the sca

4 0.9507004 846 andrew gelman stats-2011-08-09-Default priors update?

Introduction: Ryan King writes: I was wondering if you have a brief comment on the state of the art for objective priors for hierarchical generalized linear models (generalized linear mixed models). I have been working off the papers in Bayesian Analysis (2006) 1, Number 3 (Browne and Draper, Kass and Natarajan, Gelman). There seems to have been continuous work for matching priors in linear mixed models, but GLMMs less so because of the lack of an analytic marginal likelihood for the variance components. There are a number of additional suggestions in the literature since 2006, but little robust practical guidance. I’m interested in both mean parameters and the variance components. I’m almost always concerned with logistic random effect models. I’m fascinated by the matching-priors idea of higher-order asymptotic improvements to maximum likelihood, and need to make some kind of defensible default recommendation. Given the massive scale of the datasets (genetics …), extensive sensitivity a

5 0.94699669 953 andrew gelman stats-2011-10-11-Steve Jobs’s cancer and science-based medicine

Introduction: Interesting discussion from David Gorski (which I found via this link from Joseph Delaney). I don’t have anything really to add to this discussion except to note the value of this sort of anecdote in a statistics discussion. It’s only n=1 and adds almost nothing to the literature on the effectiveness of various treatments, but a story like this can help focus one’s thoughts on the decision problems.

6 0.9437803 779 andrew gelman stats-2011-06-25-Avoiding boundary estimates using a prior distribution as regularization

7 0.94176888 1367 andrew gelman stats-2012-06-05-Question 26 of my final exam for Design and Analysis of Sample Surveys

8 0.94042659 494 andrew gelman stats-2010-12-31-Type S error rates for classical and Bayesian single and multiple comparison procedures

9 0.93769425 1474 andrew gelman stats-2012-08-29-More on scaled-inverse Wishart and prior independence

10 0.93288517 1019 andrew gelman stats-2011-11-19-Validation of Software for Bayesian Models Using Posterior Quantiles

11 0.93260795 1368 andrew gelman stats-2012-06-06-Question 27 of my final exam for Design and Analysis of Sample Surveys

12 0.92860317 1214 andrew gelman stats-2012-03-15-Of forecasts and graph theory and characterizing a statistical method by the information it uses

13 0.92569631 1004 andrew gelman stats-2011-11-11-Kaiser Fung on how not to critique models

14 0.92380774 1240 andrew gelman stats-2012-04-02-Blogads update

15 0.92329162 2312 andrew gelman stats-2014-04-29-Ken Rice presents a unifying approach to statistical inference and hypothesis testing

16 0.92273641 301 andrew gelman stats-2010-09-28-Correlation, prediction, variation, etc.

17 0.92270678 2017 andrew gelman stats-2013-09-11-“Informative g-Priors for Logistic Regression”

18 0.92254776 2231 andrew gelman stats-2014-03-03-Running into a Stan Reference by Accident

19 0.92239815 1838 andrew gelman stats-2013-05-03-Setting aside the politics, the debate over the new health-care study reveals that we’re moving to a new high standard of statistical journalism

20 0.92201859 2102 andrew gelman stats-2013-11-15-“Are all significant p-values created equal?”