andrew_gelman_stats andrew_gelman_stats-2012 andrew_gelman_stats-2012-1516 knowledge-graph by maker-knowledge-mining

1516 andrew gelman stats-2012-09-30-Computational problems with glm etc.


meta info for this blog

Source: html

Introduction: John Mount provides some useful background and follow-up on our discussion from last year on computational instability of the usual logistic regression solver. Just to refresh your memory, here’s a simple logistic regression with only a constant term and no separation, nothing pathological at all:

> y <- rep (c(1,0),c(10,5))
> display (glm (y ~ 1, family=binomial(link="logit")))
glm(formula = y ~ 1, family = binomial(link = "logit"))
            coef.est coef.se
(Intercept) 0.69     0.55
---
  n = 15, k = 1
  residual deviance = 19.1, null deviance = 19.1 (difference = 0.0)

And here’s what happens when we give it the not-outrageous starting value of -2:

> display (glm (y ~ 1, family=binomial(link="logit"), start=-2))
glm(formula = y ~ 1, family = binomial(link = "logit"), start = -2)
            coef.est coef.se
(Intercept) 71.97    17327434.18
---
  n = 15, k = 1
  residual deviance = 360.4, null deviance = 19.1 (difference = -341.3)
Warning message:
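To see mechanically what goes wrong, here is a minimal sketch of the undamped Newton-Raphson (equivalently, IRLS) update for this intercept-only model, started at -2 as above. This is illustrative only, not glm()'s actual code:

y <- rep(c(1, 0), c(10, 5))
beta <- -2                            # the not-outrageous starting value
for (iter in 1:4) {
  p <- plogis(beta)                   # fitted probability
  score <- sum(y - p)                 # gradient of the log-likelihood
  info <- length(y) * p * (1 - p)     # information (all predictors equal 1 here)
  beta <- beta + score / info         # full, undamped Newton step
  cat("iteration", iter, "beta =", round(beta, 2), "\n")
}

The iterates overshoot to about 3.2, swing back to about -4.7, then jump to around 70; at that point the fitted probability is numerically 1, the information underflows to zero, and the next step is infinite. That kind of runaway is what sits behind the 71.97 estimate with the absurd standard error above.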


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 John Mount provides some useful background and follow-up on our discussion from last year on computational instability of the usual logistic regression solver. [sent-1, score-0.481]

2 0.55 --- n = 15, k = 1 residual deviance = 19.1, null deviance = 19.1 (difference = 0.0) [sent-6, score-0.461]

3 And here’s what happens when we give it the not-outrageous starting value of -2: > display (glm (y ~ 1, family=binomial(link="logit"), start=-2)) glm(formula = y ~ 1, family = binomial(link = "logit"), start = -2) coef.est coef.se (Intercept) 71.97 17327434.18 [sent-9, score-0.332]

4 --- n = 15, k = 1 residual deviance = 360.4, null deviance = 19.1 (difference = -341.3) [sent-13, score-0.461]

5 Mount explains what’s going on: From a theoretical point of view the logistic generalized linear model is an easy problem to solve. [sent-18, score-0.574]

6 The quantity being optimized (deviance or perplexity) is log-concave. [sent-19, score-0.071]

7 This in turn implies there is a unique global maximum and no local maxima to get trapped in. [sent-20, score-0.152]

8 However, the standard methods of solving the logistic generalized linear model are the Newton-Raphson method or the closely related iteratively reweighted least squares method. [sent-22, score-0.716]

9 And these methods, while typically very fast, do not guarantee convergence in all conditions. [sent-23, score-0.06]

10 The problem is fixable, because optimizing logistic divergence or perplexity is a very nice optimization problem (log-concave). [sent-27, score-0.912]

11 But most common statistical packages do not invest effort in this situation. [sent-28, score-0.068]

12 Mount points out that, in addition to patches which will redirect exploding Newton steps, “many other optimization techniques can be used”: - stochastic gradient descent - conjugate gradient - EM (see “Direct calculation of the information matrix via the EM. [sent-29, score-0.654]

13 Or you can try to solve a different, but related, problem: “Exact logistic regression: theory and examples”, C. R. Mehta and N. R. Patel, Statist Med, 1995 vol. [sent-33, score-0.303]

14 Maybe we should also change what we write in Bayesian Data Analysis about how to fit a logistic regression. [sent-37, score-0.303]
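To spell out the log-concavity point in sentences 5-7, here is the standard calculation, in LaTeX, with predictors x_i and fitted probabilities p_i = logit^{-1}(x_i^T beta):

\ell(\beta) = \sum_{i=1}^n \Big[ y_i x_i^\top \beta - \log\big(1 + e^{x_i^\top \beta}\big) \Big],
\qquad
\nabla \ell(\beta) = X^\top (y - p),
\qquad
\nabla^2 \ell(\beta) = -X^\top W X, \quad W = \operatorname{diag}\big(p_i (1 - p_i)\big).

Since every p_i(1 - p_i) is nonnegative, the Hessian is negative semidefinite everywhere, so the log-likelihood is concave (the deviance is convex) and there is a single global optimum. The failure in the example above is entirely in the Newton/IRLS steps, not in the objective being optimized.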
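As for the fixes in sentences 10-12, here is a minimal illustrative sketch (not the source of any particular package) of two of them: step-halving, which keeps an exploding Newton step from ever being accepted, and direct optimization of the log-concave objective with base R's optim(), using the conjugate-gradient method. Both recover the sensible 0.69 from the same starting value of -2:

y <- rep(c(1, 0), c(10, 5))
loglik <- function(beta) sum(dbinom(y, 1, plogis(beta), log = TRUE))

# Damped Newton: halve the step until it no longer decreases the log-likelihood
beta <- -2
for (iter in 1:25) {
  p <- plogis(beta)
  step <- sum(y - p) / (length(y) * p * (1 - p))   # full Newton step
  while (loglik(beta + step) < loglik(beta)) step <- step / 2
  beta <- beta + step
}
beta                                               # about 0.69 = log(10/5)

# Direct optimization of the log-concave objective, here by conjugate gradient
optim(par = -2, fn = function(b) -loglik(b), method = "CG")$par   # also about 0.69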
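And on sentence 14, one concrete option in that direction is bayesglm() from the same arm package that provides display() above; a minimal sketch, assuming arm is installed:

library(arm)
y <- rep(c(1, 0), c(10, 5))
display(bayesglm(y ~ 1, family = binomial(link = "logit")))
# bayesglm() adds weakly informative default priors on the coefficients, which
# also stabilizes the computation; here it should again give an intercept near
# 0.69 rather than blowing up.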


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('deviance', 0.333), ('glm', 0.316), ('logistic', 0.303), ('mount', 0.279), ('binomial', 0.239), ('logit', 0.214), ('perplexity', 0.169), ('family', 0.167), ('residual', 0.128), ('gradient', 0.128), ('intercept', 0.124), ('occurred', 0.113), ('formula', 0.111), ('regression', 0.108), ('link', 0.108), ('optimization', 0.106), ('generalized', 0.104), ('null', 0.095), ('problem', 0.092), ('display', 0.086), ('maxima', 0.085), ('reweighted', 0.085), ('iteratively', 0.085), ('nr', 0.085), ('refresh', 0.085), ('med', 0.08), ('numerically', 0.08), ('polson', 0.08), ('patches', 0.08), ('starting', 0.079), ('pathological', 0.076), ('divergence', 0.076), ('linear', 0.075), ('optimizing', 0.074), ('stripped', 0.074), ('exploding', 0.071), ('newton', 0.071), ('conjugate', 0.071), ('optimized', 0.071), ('rep', 0.07), ('descent', 0.07), ('instability', 0.07), ('invest', 0.068), ('gradients', 0.068), ('trapped', 0.067), ('related', 0.064), ('em', 0.062), ('royal', 0.061), ('separation', 0.06), ('guarantee', 0.06)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000002 1516 andrew gelman stats-2012-09-30-Computational problems with glm etc.


2 0.62378514 696 andrew gelman stats-2011-05-04-Whassup with glm()?

Introduction: We’re having problem with starting values in glm(). A very simple logistic regression with just an intercept with a very simple starting value (beta=5) blows up. Here’s the R code: > y <- rep (c(1,0),c(10,5)) > glm (y ~ 1, family=binomial(link="logit")) Call: glm(formula = y ~ 1, family = binomial(link = "logit")) Coefficients: (Intercept) 0.6931 Degrees of Freedom: 14 Total (i.e. Null); 14 Residual Null Deviance: 19.1 Residual Deviance: 19.1 AIC: 21.1 > glm (y ~ 1, family=binomial(link="logit"), start=2) Call: glm(formula = y ~ 1, family = binomial(link = "logit"), start = 2) Coefficients: (Intercept) 0.6931 Degrees of Freedom: 14 Total (i.e. Null); 14 Residual Null Deviance: 19.1 Residual Deviance: 19.1 AIC: 21.1 > glm (y ~ 1, family=binomial(link="logit"), start=5) Call: glm(formula = y ~ 1, family = binomial(link = "logit"), start = 5) Coefficients: (Intercept) 1.501e+15 Degrees of Freedom: 14 Total (i.

3 0.30894631 729 andrew gelman stats-2011-05-24-Deviance as a difference

Introduction: Peng Yu writes: On page 180 of BDA2, deviance is defined as D(y,\theta) = -2 \log p(y|\theta). However, according to GLM 2/e by McCullagh and Nelder, deviance is the difference of the log-likelihoods of the full model and the base model (times 2) (see the equation on the wiki webpage). The English word ‘deviance’ implies the difference from a standard (in this case, the base model). I’m wondering what the rationale is for your definition of deviance, which consists of only 1 term rather than 2 terms. My reply: Deviance is typically computed as a relative quantity; that is, people look at the difference in deviance. So the two definitions are equivalent.
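In symbols (LaTeX), the two definitions being compared are

D_{\text{BDA}}(y, \theta) = -2 \log p(y \mid \theta),
\qquad
D_{\text{GLM}}(y, \theta) = -2 \big[ \log p(y \mid \theta) - \log p(y \mid \theta_{\text{sat}}) \big],

where \theta_{\text{sat}} is the saturated (full) model. The second differs from the first only by a term that does not depend on the model being fit, so when two models are compared on the same data the difference in deviance is identical under either definition.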

4 0.20111185 1221 andrew gelman stats-2012-03-19-Whassup with deviance having a high posterior correlation with a parameter in the model?

Introduction: Jean Richardson writes: Do you know what might lead to a large negative cross-correlation (-0.95) between deviance and one of the model parameters? Here’s the (brief) background: I [Richardson] have written a Bayesian hierarchical site occupancy model for presence of disease on individual amphibians. The response variable is therefore binary (disease present/absent) and the probability of disease being present in an individual (psi) depends on various covariates (species of amphibian, location sampled, etc.) parameterized using a logit link function. Replicates are individuals sampled (tested for presence of disease) together. The possibility of imperfect detection is included as p = (prob. disease detected given disease is present). Posterior distributions were estimated using WinBUGS via R2WinBUGS. Simulated data from the model fit the real data very well and posterior distribution densities seem robust to any changes in the model (different priors, etc.) All autocor

5 0.17418006 1886 andrew gelman stats-2013-06-07-Robust logistic regression

Introduction: Corey Yanofsky writes: In your work, you’ve robustificated logistic regression by having the logit function saturate at, e.g., 0.01 and 0.99, instead of 0 and 1. Do you have any thoughts on a sensible setting for the saturation values? My intuition suggests that it has something to do with proportion of outliers expected in the data (assuming a reasonable model fit). It would be desirable to have them fit in the model, but my intuition is that integrability of the posterior distribution might become an issue. My reply: it should be no problem to put these saturation values in the model, I bet it would work fine in Stan if you give them uniform (0,.1) priors or something like that. Or you could just fit the robit model. And this reminds me . . . I’ve been told that when Stan’s on its optimization setting, it fits generalized linear models just about as fast as regular glm or bayesglm in R. This suggests to me that we should have some precompiled regression models in Stan,
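A minimal sketch of the saturated-link likelihood being described, written in R so it could be handed to optim() (or translated into a Stan model block); the 0.01 and 0.99 bounds are the example values from the question, and the function name is illustrative only:

# success probability squeezed into [0.01, 0.99] rather than (0, 1)
robust_negloglik <- function(beta, X, y) {
  p <- 0.01 + 0.98 * plogis(X %*% beta)   # logit link, saturating at 0.01 and 0.99
  -sum(dbinom(y, 1, p, log = TRUE))
}
# e.g.: fit <- optim(rep(0, ncol(X)), robust_negloglik, X = X, y = y, method = "BFGS")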

6 0.15121856 547 andrew gelman stats-2011-01-31-Using sample size in the prior distribution

7 0.14560793 247 andrew gelman stats-2010-09-01-How does Bayes do it?

8 0.13946322 39 andrew gelman stats-2010-05-18-The 1.6 rule

9 0.13787307 684 andrew gelman stats-2011-04-28-Hierarchical ordered logit or probit

10 0.13410652 888 andrew gelman stats-2011-09-03-A psychology researcher asks: Is Anova dead?

11 0.13287573 1445 andrew gelman stats-2012-08-06-Slow progress

12 0.1130449 782 andrew gelman stats-2011-06-29-Putting together multinomial discrete regressions by combining simple logits

13 0.11059181 2017 andrew gelman stats-2013-09-11-“Informative g-Priors for Logistic Regression”

14 0.10956841 773 andrew gelman stats-2011-06-18-Should we always be using the t and robit instead of the normal and logit?

15 0.10867137 2200 andrew gelman stats-2014-02-05-Prior distribution for a predicted probability

16 0.10206026 1761 andrew gelman stats-2013-03-13-Lame Statistics Patents

17 0.098605663 417 andrew gelman stats-2010-11-17-Clutering and variance components

18 0.097978711 77 andrew gelman stats-2010-06-09-Sof[t]

19 0.09264452 266 andrew gelman stats-2010-09-09-The future of R

20 0.089489117 2072 andrew gelman stats-2013-10-21-The future (and past) of statistical sciences


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.137), (1, 0.098), (2, 0.004), (3, 0.018), (4, 0.06), (5, -0.021), (6, 0.011), (7, -0.05), (8, 0.008), (9, -0.009), (10, 0.007), (11, -0.002), (12, -0.01), (13, -0.032), (14, -0.033), (15, 0.039), (16, -0.009), (17, -0.041), (18, -0.026), (19, -0.051), (20, 0.058), (21, 0.033), (22, 0.067), (23, -0.059), (24, 0.028), (25, 0.002), (26, 0.024), (27, -0.03), (28, 0.004), (29, -0.061), (30, 0.004), (31, 0.091), (32, 0.041), (33, 0.019), (34, -0.009), (35, -0.066), (36, -0.034), (37, 0.005), (38, -0.009), (39, 0.076), (40, -0.019), (41, 0.044), (42, -0.051), (43, -0.031), (44, 0.089), (45, 0.07), (46, 0.01), (47, 0.021), (48, 0.013), (49, 0.108)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.95943284 1516 andrew gelman stats-2012-09-30-Computational problems with glm etc.


2 0.88286173 696 andrew gelman stats-2011-05-04-Whassup with glm()?


3 0.69846046 684 andrew gelman stats-2011-04-28-Hierarchical ordered logit or probit

Introduction: Jeff writes: How far off is bglmer and can it handle ordered logit or multinom logit? My reply: bglmer is very close. No ordered logit but I was just talking about it with Sophia today. My guess is that the easiest way to fit a hierarchical ordered logit or multinom logit will be to use stan. For right now I’d recommend using glmer/bglmer to fit the ordered logits in order (e.g., 1 vs. 2,3,4, then 2 vs. 3,4, then 3 vs. 4). Or maybe there’s already a hierarchical multinomial logit in mcmcpack or somewhere?
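A minimal sketch of the “fit the ordered logits in order” idea for the non-hierarchical case, assuming an ordered outcome y coded 1:4 and a predictor x (with glmer/bglmer the pattern is the same, adding the group terms):

# 1 vs. {2,3,4} on everyone, then 2 vs. {3,4} among y >= 2, then 3 vs. 4 among y >= 3
fits <- lapply(2:4, function(k) {
  keep <- y >= k - 1
  glm((y >= k)[keep] ~ x[keep], family = binomial(link = "logit"))
})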

4 0.67249399 1761 andrew gelman stats-2013-03-13-Lame Statistics Patents

Introduction: Manoel Galdino wrote in a comment off-topic on another post (which I erased): I know you commented before about patents on statistical methods. Did you know this patent ( http://www.archpatent.com/patents/8032473 )? Do you have any comment on patents that don’t describe mathematically how it works and how and if they’re any different from previous methods? And what about the lack of scientific validation of the claims in such a method? The patent in question, “US 8032473: “Generalized reduced error logistic regression method,” begins with the following “claim”: A system for machine learning comprising: a computer including a computer-readable medium having software stored thereon that, when executed by said computer, performs a method comprising the steps of being trained to learn a logistic regression match to a target class variable so to exhibit classification learning by which: an estimated error in each variable’s moment in the logistic regression be modeled and reduce

5 0.66594052 39 andrew gelman stats-2010-05-18-The 1.6 rule

Introduction: In ARM we discuss how you can go back and forth between logit and probit models by dividing by 1.6. Or, to put it another way, logistic regression corresponds to a latent-variable model with errors that are approximately normally distributed with mean 0 and standard deviation 1.6. (This is well known, it’s nothing original with our book.) Anyway, John Cook discusses the approximation here .
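A quick numerical check of the 1.6 rule in R: the logistic cdf and a normal cdf with standard deviation 1.6 track each other closely over the whole range:

x <- seq(-4, 4, by = 0.5)
round(cbind(logit = plogis(x), probit_scaled = pnorm(x / 1.6)), 3)
# the two columns agree to within about 0.02 everywhere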

6 0.63339114 782 andrew gelman stats-2011-06-29-Putting together multinomial discrete regressions by combining simple logits

7 0.62392908 2311 andrew gelman stats-2014-04-29-Bayesian Uncertainty Quantification for Differential Equations!

8 0.61939979 2190 andrew gelman stats-2014-01-29-Stupid R Tricks: Random Scope

9 0.61815912 2110 andrew gelman stats-2013-11-22-A Bayesian model for an increasing function, in Stan!

10 0.60367417 1462 andrew gelman stats-2012-08-18-Standardizing regression inputs

11 0.60201484 729 andrew gelman stats-2011-05-24-Deviance as a difference

12 0.58951068 1875 andrew gelman stats-2013-05-28-Simplify until your fake-data check works, then add complications until you can figure out where the problem is coming from

13 0.58323669 861 andrew gelman stats-2011-08-19-Will Stan work well with 40×40 matrices?

14 0.57249206 770 andrew gelman stats-2011-06-15-Still more Mr. P in public health

15 0.57000554 1221 andrew gelman stats-2012-03-19-Whassup with deviance having a high posterior correlation with a parameter in the model?

16 0.56868774 1094 andrew gelman stats-2011-12-31-Using factor analysis or principal components analysis or measurement-error models for biological measurements in archaeology?

17 0.56753194 2342 andrew gelman stats-2014-05-21-Models with constraints

18 0.56636834 1849 andrew gelman stats-2013-05-09-Same old same old

19 0.56543887 1908 andrew gelman stats-2013-06-21-Interpreting interactions in discrete-data regression

20 0.56231016 1975 andrew gelman stats-2013-08-09-Understanding predictive information criteria for Bayesian models


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(7, 0.013), (16, 0.034), (21, 0.032), (24, 0.137), (35, 0.155), (53, 0.03), (54, 0.032), (61, 0.039), (63, 0.033), (73, 0.02), (82, 0.018), (84, 0.019), (86, 0.052), (90, 0.02), (99, 0.238)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.94801366 1516 andrew gelman stats-2012-09-30-Computational problems with glm etc.


2 0.93448639 1443 andrew gelman stats-2012-08-04-Bayesian Learning via Stochastic Gradient Langevin Dynamics

Introduction: Burak Bayramli writes: In this paper by Sunjin Ahn, Anoop Korattikara, and Max Welling and this paper by Welling and Yee Whye Teh, there are some arguments on big data and the use of MCMC. Both papers have suggested improvements to speed up MCMC computations. I was wondering what your thoughts were, especially on this paragraph: When a dataset has a billion data-cases (as is not uncommon these days) MCMC algorithms will not even have generated a single (burn-in) sample when a clever learning algorithm based on stochastic gradients may already be making fairly good predictions. In fact, the intriguing results of Bottou and Bousquet (2008) seem to indicate that in terms of “number of bits learned per unit of computation”, an algorithm as simple as stochastic gradient descent is almost optimally efficient. We therefore argue that for Bayesian methods to remain useful in an age when the datasets grow at an exponential rate, they need to embrace the ideas of the stochastic optimiz

3 0.93004912 473 andrew gelman stats-2010-12-17-Why a bonobo won’t play poker with you

Introduction: Sciencedaily has posted an article titled Apes Unwilling to Gamble When Odds Are Uncertain : The apes readily distinguished between the different probabilities of winning: they gambled a lot when there was a 100 percent chance, less when there was a 50 percent chance, and only rarely when there was no chance In some trials, however, the experimenter didn’t remove a lid from the bowl, so the apes couldn’t assess the likelihood of winning a banana The odds from the covered bowl were identical to those from the risky option: a 50 percent chance of getting the much sought-after banana. But apes of both species were less likely to choose this ambiguous option. Like humans, they showed “ambiguity aversion” — preferring to gamble more when they knew the odds than when they didn’t. Given some of the other differences between chimps and bonobos, Hare and Rosati had expected to find the bonobos to be more averse to ambiguity, but that didn’t turn out to be the case. Thanks to Sta

4 0.92632174 837 andrew gelman stats-2011-08-04-Is it rational to vote?

Introduction: Hear me interviewed on the topic here . P.S. The interview was fine but I don’t agree with everything on the linked website. For example, this bit: Global warming is not the first case of a widespread fear based on incomplete knowledge turned out to be false or at least greatly exaggerated. Global warming has many of the characteristics of a popular delusion, an irrational fear or cause that is embraced by millions of people because, well, it is believed by millions of people! All right, then.

5 0.91649711 2049 andrew gelman stats-2013-10-03-On house arrest for p-hacking

Introduction: People keep pointing me to this excellent news article by David Brown, about a scientist who was convicted of data manipulation: In all, 330 patients were randomly assigned to get either interferon gamma-1b or placebo injections. Disease progression or death occurred in 46 percent of those on the drug and 52 percent of those on placebo. That was not a significant difference, statistically speaking. When only survival was considered, however, the drug looked better: 10 percent of people getting the drug died, compared with 17 percent of those on placebo. However, that difference wasn’t “statistically significant,” either. Specifically, the so-called P value — a mathematical measure of the strength of the evidence that there’s a true difference between a treatment and placebo — was 0.08. . . . Technically, the study was a bust, although the results leaned toward a benefit from interferon gamma-1b. Was there a group of patients in which the results tipped? Harkonen asked the statis

6 0.91637796 591 andrew gelman stats-2011-02-25-Quantitative Methods in the Social Sciences M.A.: Innovative, interdisciplinary social science research program for a data-rich world

7 0.90954721 942 andrew gelman stats-2011-10-04-45% hitting, 25% fielding, 25% pitching, and 100% not telling us how they did it

8 0.90941519 296 andrew gelman stats-2010-09-26-A simple semigraphic display

9 0.90449023 895 andrew gelman stats-2011-09-08-How to solve the Post Office’s problems?

10 0.90275729 1926 andrew gelman stats-2013-07-05-More plain old everyday Bayesianism

11 0.90105546 881 andrew gelman stats-2011-08-30-Rickey Henderson and Peter Angelos, together again

12 0.88408571 1264 andrew gelman stats-2012-04-14-Learning from failure

13 0.88223642 388 andrew gelman stats-2010-11-01-The placebo effect in pharma

14 0.87545967 1130 andrew gelman stats-2012-01-20-Prior beliefs about locations of decision boundaries

15 0.86547011 392 andrew gelman stats-2010-11-03-Taleb + 3.5 years

16 0.86048716 535 andrew gelman stats-2011-01-24-Bleg: Automatic Differentiation for Log Prob Gradients?

17 0.85777134 2274 andrew gelman stats-2014-03-30-Adjudicating between alternative interpretations of a statistical interaction?

18 0.85650384 1753 andrew gelman stats-2013-03-06-Stan 1.2.0 and RStan 1.2.0

19 0.85489315 1886 andrew gelman stats-2013-06-07-Robust logistic regression

20 0.85359699 1661 andrew gelman stats-2013-01-08-Software is as software does