andrew_gelman_stats andrew_gelman_stats-2010 andrew_gelman_stats-2010-266 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Some thoughts from Christian, including this bit: We need to consider separately 1. R’s brilliant library 2. R’s not-so-brilliant language and/or interpreter. I don’t know that R’s library is so brilliant as all that–if necessary, I don’t think it would be hard to reprogram the important packages in a new language. I would say, though, that the problems with R are not just in the technical details of the language. I think the culture of R has some problems too. As I’ve written before, R functions used to be lean and mean, and now they’re full of exception-handling and calls to other packages. R functions are spaghetti-like messes of connections in which I keep expecting to run into syntax like “GOTO 120.” I learned about these problems a couple years ago when writing bayesglm(), which is a simple adaptation of glm(). But glm(), and its workhorse, glm.fit(), are a mess: They’re about 10 lines of functioning code, plus about 20 lines of necessary front-end, plus a couple hundred lines of naming, exception-handling, repetitions of chunks of code, pseudo-structured-programming-through-naming-of-variables, and general buck-passing. I still don’t know if my modifications are quite right–I did what was needed to the meat of the function but no way can I keep track of all the if-else possibilities. If R is redone, I hope its functions return to the lean-and-mean aesthetic of the original S (but with better graphics defaults).
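To see the glm.fit() bloat for yourself, here is a minimal sketch in plain R (my addition; no packages assumed) that prints the function’s source and counts its lines:

> src <- deparse(glm.fit)  # the source of glm.fit() as a character vector, one element per line
> length(src)              # total line count: a few hundred in recent R versions
> head(src, 20)            # the opening stretch is mostly argument checks and setup

The exact counts vary by R version, but the ratio of housekeeping to actual fitting code is the point.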
simIndex simValue blogId blogTitle
same-blog 1 1.0 266 andrew gelman stats-2010-09-09-The future of R
2 0.12737454 1799 andrew gelman stats-2013-04-12-Stan 1.3.0 and RStan 1.3.0 Ready for Action
Introduction: The Stan Development Team is happy to announce that Stan 1.3.0 and RStan 1.3.0 are available for download. Follow the links on the Stan home page: http://mc-stan.org/ Please let us know if you have problems updating. Here’s the full set of release notes.

v1.3.0 (12 April 2013)
======================================================================

Enhancements
----------------------------------
Modeling Language
* forward sampling (random draws from distributions) in generated quantities
* better error messages in parser
* new distributions:
  + exp_mod_normal
  + gumbel
  + skew_normal
* new special functions:
  + owenst
* new broadcast (repetition) functions for vectors, arrays, matrices:
  + rep_array
  + rep_matrix
  + rep_row_vector
  + rep_vector

Command-Line
* added option to display autocorrelations in the command-line program to print output
* changed default point estimation routine from the command line to
3 0.11992633 696 andrew gelman stats-2011-05-04-Whassup with glm()?
Introduction: We’re having a problem with starting values in glm(). A very simple logistic regression with just an intercept and a very simple starting value (beta=5) blows up. Here’s the R code:

> y <- rep(c(1,0), c(10,5))
> glm(y ~ 1, family=binomial(link="logit"))

Call:  glm(formula = y ~ 1, family = binomial(link = "logit"))

Coefficients:
(Intercept)
     0.6931

Degrees of Freedom: 14 Total (i.e. Null);  14 Residual
Null Deviance:     19.1
Residual Deviance: 19.1    AIC: 21.1

> glm(y ~ 1, family=binomial(link="logit"), start=2)

Call:  glm(formula = y ~ 1, family = binomial(link = "logit"), start = 2)

Coefficients:
(Intercept)
     0.6931

Degrees of Freedom: 14 Total (i.e. Null);  14 Residual
Null Deviance:     19.1
Residual Deviance: 19.1    AIC: 21.1

> glm(y ~ 1, family=binomial(link="logit"), start=5)

Call:  glm(formula = y ~ 1, family = binomial(link = "logit"), start = 5)

Coefficients:
(Intercept)
  1.501e+15

Degrees of Freedom: 14 Total (i.
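A minimal sketch of how one might poke at this (my addition, not from the post): glm() accepts control=glm.control(trace=TRUE), which prints the deviance at each iteratively reweighted least squares step, so you can watch the divergence happen; and bayesglm() in the arm package (the subject of the main post above) is one stabilized alternative, assuming arm is installed:

> y <- rep(c(1,0), c(10,5))   # same data as above
> glm(y ~ 1, family=binomial(link="logit"), start=5,
+     control=glm.control(trace=TRUE))    # prints the deviance each iteration; watch it fail to settle
> library(arm)                            # provides bayesglm()
> bayesglm(y ~ 1, family=binomial(link="logit"))   # weakly informative prior keeps the estimate finite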
4 0.11097538 1753 andrew gelman stats-2013-03-06-Stan 1.2.0 and RStan 1.2.0
Introduction: Stan 1.2.0 and RStan 1.2.0 are now available for download. See: http://mc-stan.org/ Here are the highlights.

Full Mass Matrix Estimation during Warmup
Yuanjun Gao, a first-year grad student here at Columbia (!), built a regularized mass-matrix estimator. This helps for posteriors with high correlation among parameters and varying scales. We’re still testing this ourselves, so the estimation procedure may change in the future (don’t worry, it satisfies detailed balance as is, but we might be able to make it more computationally efficient in terms of time per effective sample). It’s not the default option. The major reason is that the matrix operations required are expensive, raising the algorithm cost to O(L N p^2), where L is the average number of leapfrog steps, N is the number of iterations, and p is the number of parameters. Yuanjun did a great job with the Cholesky factorizations and implemented this about as efficiently as is possible. (His homework for Andrew’s class w
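For intuition only, here is a sketch of what a regularized mass-matrix estimate can look like (my own toy version, not Stan’s actual estimator): shrink the sample covariance of the warmup draws toward the identity, then take its Cholesky factor:

> draws <- matrix(rnorm(1000 * 3), ncol = 3)   # stand-in for warmup draws, p = 3 parameters
> lambda <- 0.1                                # shrinkage weight, a hypothetical choice
> Sigma <- (1 - lambda) * cov(draws) + lambda * diag(3)   # regularized covariance estimate
> L <- chol(Sigma)                             # Cholesky factor used inside the leapfrog updates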
5 0.10230814 1009 andrew gelman stats-2011-11-14-Wickham R short course
Introduction: Hadley writes: I [Hadley] am going to be teaching an R development master class in New York City on Dec 12-13. The basic idea of the class is to help you write better code, focused on the mantra of “do not repeat yourself”. In day one you will learn powerful new tools of abstraction, allowing you to solve a wider range of problems with fewer lines of code. Day two will teach you how to make packages, the fundamental unit of code distribution in R, letting others save time by using your code. To get the most out of this course, you should have some experience programming in R already: you should be familiar with writing functions, and the basic data structures of R: vectors, matrices, arrays, lists and data frames. You will find the course particularly useful if you’re an experienced R user looking to take the next step, or if you’re moving to R from other programming languages and you want to quickly get up to speed with R’s unique features. A coupl
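A two-line illustration of the mantra (my example, not from the course): replace copy-pasted per-column code with one application of a function over the columns:

> d <- data.frame(x = 1:3, y = 4:6, z = 7:9)
> c(mean(d$x), mean(d$y), mean(d$z))   # repeating yourself: three near-identical calls
[1] 2 5 8
> sapply(d, mean)                      # the abstracted version: one call covers every column
x y z
2 5 8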
6 0.097172394 272 andrew gelman stats-2010-09-13-Ross Ihaka to R: Drop Dead
7 0.09264452 1516 andrew gelman stats-2012-09-30-Computational problems with glm etc.
8 0.091482006 1399 andrew gelman stats-2012-06-28-Life imitates blog
9 0.091074333 1807 andrew gelman stats-2013-04-17-Data problems, coding errors…what can be done?
10 0.086079247 1627 andrew gelman stats-2012-12-17-Stan and RStan 1.1.0
11 0.084553517 1886 andrew gelman stats-2013-06-07-Robust logistic regression
12 0.084245078 2365 andrew gelman stats-2014-06-09-I hate polynomials
13 0.08279781 535 andrew gelman stats-2011-01-24-Bleg: Automatic Differentiation for Log Prob Gradients?
14 0.081969127 198 andrew gelman stats-2010-08-11-Multilevel modeling in R on a Mac
15 0.076091602 1655 andrew gelman stats-2013-01-05-The statistics software signal
16 0.076079741 2089 andrew gelman stats-2013-11-04-Shlemiel the Software Developer and Unknown Unknowns
17 0.075308889 787 andrew gelman stats-2011-07-05-Different goals, different looks: Infovis and the Chris Rock effect
18 0.073651187 1482 andrew gelman stats-2012-09-04-Model checking and model understanding in machine learning
19 0.072207406 1814 andrew gelman stats-2013-04-20-A mess with which I am comfortable
simIndex simValue blogId blogTitle
same-blog 1 0.95299286 266 andrew gelman stats-2010-09-09-The future of R
2 0.8269912 2089 andrew gelman stats-2013-11-04-Shlemiel the Software Developer and Unknown Unknowns
Introduction: The Stan meeting today reminded me of Joel Spolsky’s recasting of the Yiddish joke about Shlemiel the Painter. Joel retold it on his blog, Joel on Software, in the post Back to Basics: Shlemiel gets a job as a street painter, painting the dotted lines down the middle of the road. On the first day he takes a can of paint out to the road and finishes 300 yards of the road. “That’s pretty good!” says his boss, “you’re a fast worker!” and pays him a kopeck. The next day Shlemiel only gets 150 yards done. “Well, that’s not nearly as good as yesterday, but you’re still a fast worker. 150 yards is respectable,” and pays him a kopeck. The next day Shlemiel paints 30 yards of the road. “Only 30!” shouts his boss. “That’s unacceptable! On the first day you did ten times that much work! What’s going on?” “I can’t help it,” says Shlemiel. “Every day I get farther and farther away from the paint can!” Joel used it as an example of the kind of string processing naive programmers ar
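Joel’s examples are in C; here is the same trap in R (my translation, not his code). Growing a string one piece at a time re-copies everything built so far (Shlemiel walking back to the paint can), while collecting the pieces and joining once avoids it:

> pieces <- rep("x", 10000)
> s <- ""
> for (p in pieces) s <- paste0(s, p)   # Shlemiel: each paste0 re-copies the whole string, O(n^2) overall
> s2 <- paste(pieces, collapse = "")    # join once over the collected pieces, O(n) overall
> identical(s, s2)
[1] TRUE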
3 0.78126574 1716 andrew gelman stats-2013-02-09-iPython Notebook
Introduction: Burak Bayramli writes: I wanted to inform you on iPython Notebook technology – allowing markup and Python code to reside in one document. Someone ported one of your examples from ARM. The iPynb file is actually a live document; it can be downloaded and rerun locally, hence a change of code in the document means a change of images and results. Graphs (as well as text output) generated by the code are placed inside the document automatically. No more referencing image files separately. For now, running notebooks locally requires a notebook server, but that part can live “on the cloud” as part of educational software. Viewers, such as nbviewer.ipython.org, do not even need that much, since all recent results of a notebook are embedded in the notebook itself. A lot of people are excited about this; also, out of nowhere, the Alfred P. Sloan Foundation dropped a $1.15 million grant on the developers of IPython, which provided some extra energy on the project. Cool. We’ll have to do that ex
4 0.75719064 535 andrew gelman stats-2011-01-24-Bleg: Automatic Differentiation for Log Prob Gradients?
Introduction: We need help picking out an automatic differentiation package for Hamiltonian Monte Carlo sampling from the posterior of a generalized linear model with deep interactions. Specifically, we need to compute gradients for log probability functions with thousands of parameters that involve matrix (determinants, eigenvalues, inverses), stats (distributions), and math (log gamma) functions. Any suggestions? The Application: Hybrid Monte Carlo for Posteriors We’re getting serious about implementing posterior sampling using Hamiltonian Monte Carlo. HMC speeds up mixing by including gradient information to help guide the Metropolis proposals toward areas of high probability. In practice, the algorithm requires a handful of gradient calculations per sample, but there are many dimensions and the functions are hairy enough that we don’t want to compute derivatives by hand. Auto Diff: Perhaps not What you Think It may not have been clear to readers of this blog that automatic diffe
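For a sense of the tooling, base R can already differentiate simple closed-form expressions symbolically; a minimal sketch (my illustration, nowhere near the thousands-of-parameters setting the post asks about):

> # toy log posterior: log p(theta) = -theta^2/2 + 3*log(theta)
> g <- deriv(~ -theta^2/2 + 3 * log(theta), "theta", function.arg = TRUE)
> g(2)   # returns the value, with the gradient (-theta + 3/theta) attached as an attribute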
5 0.74855626 1807 andrew gelman stats-2013-04-17-Data problems, coding errors…what can be done?
Introduction: This post is by Phil A recent post on this blog discusses a prominent case of an Excel error leading to substantially wrong results from a statistical analysis. Excel is notorious for this because it is easy to add a row or column of data (or intermediate results) but forget to update equations so that they correctly use the new data. That particular error is less common in a language like R because R programmers usually refer to data by variable name (or by applying functions to a named variable), so the same code works even if you add or remove data. Still, there is plenty of opportunity for errors no matter what language one uses. Andrew ran into problems fairly recently, and also blogged about another instance. I’ve never had to retract a paper, but that’s partly because I haven’t published a whole lot of papers. Certainly I have found plenty of substantial errors pretty late in some of my data analyses, and I obviously don’t have sufficient mechanisms in place to be sure
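A minimal sketch of the named-variable point (my example): the same line of R keeps using all the data after a row is added, where a fixed Excel cell range silently would not:

> d <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))
> mean(d$y)                                  # refers to the column by name, not by cell range
[1] 20
> d <- rbind(d, data.frame(x = 4, y = 40))   # append a new row of data
> mean(d$y)                                  # the same code automatically includes it
[1] 25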
6 0.74824411 272 andrew gelman stats-2010-09-13-Ross Ihaka to R: Drop Dead
7 0.73937362 2190 andrew gelman stats-2014-01-29-Stupid R Tricks: Random Scope
8 0.71823937 1134 andrew gelman stats-2012-01-21-Lessons learned from a recent R package submission
9 0.71424109 597 andrew gelman stats-2011-03-02-RStudio – new cross-platform IDE for R
10 0.71101272 324 andrew gelman stats-2010-10-07-Contest for developing an R package recommendation system
11 0.6981315 1655 andrew gelman stats-2013-01-05-The statistics software signal
12 0.67027414 1919 andrew gelman stats-2013-06-29-R sucks
13 0.66432744 818 andrew gelman stats-2011-07-23-Parallel JAGS RNGs
15 0.65788263 1520 andrew gelman stats-2012-10-03-Advice that’s so eminently sensible but so difficult to follow
16 0.65079069 1048 andrew gelman stats-2011-12-09-Maze generation algorithms!
17 0.64870596 1753 andrew gelman stats-2013-03-06-Stan 1.2.0 and RStan 1.2.0
18 0.64368951 1736 andrew gelman stats-2013-02-24-Rcpp class in Sat 9 Mar in NYC
19 0.64056146 1277 andrew gelman stats-2012-04-23-Infographic of the year
20 0.63864654 418 andrew gelman stats-2010-11-17-ff
simIndex simValue blogId blogTitle
1 0.96796805 832 andrew gelman stats-2011-07-31-Even a good data display can sometimes be improved
Introduction: When I first saw this graphic, I thought “boy, that’s great, sometimes the graphic practically makes itself.” Normally it’s hard to use lots of different colors to differentiate items of interest, because there’s usually not an intuitive mapping between color and item (e.g. for countries, or states, or whatever). But the colors of crayons, what could be more perfect? So this graphic seemed awesome. But, as they discovered after some experimentation at datapointed.net, there is an even BETTER possibility here. Click the link to see. Crayola Crayon colors by year
Introduction: Greg Kaplan writes: I noticed that you have blogged a little about interstate migration trends in the US, and thought that you might be interested in a new working paper of mine (joint with Sam Schulhofer-Wohl from the Minneapolis Fed) which I have attached. Briefly, we show that much of the recent reported drop in interstate migration is a statistical artifact: The Census Bureau made an undocumented change in its imputation procedures for missing data in 2006, and this change significantly reduced the number of imputed interstate moves. The change in imputation procedures — not any actual change in migration behavior — explains 90 percent of the reported decrease in interstate migration between the 2005 and 2006 Current Population Surveys, and 42 percent of the decrease between 2000 and 2010. I haven’t had a chance to give it a serious look, so could only make the quick suggestion to make the graphs smaller and put multiple graphs on a page. This would allow the reader to bett
3 0.95507199 1820 andrew gelman stats-2013-04-23-Foundation for Open Access Statistics
Introduction: Now here’s a foundation I (Bob) can get behind: Foundation for Open Access Statistics (FOAS). Their mission is to “promote free software, open access publishing, and reproducible research in statistics.” To me, that’s like supporting motherhood and apple pie! FOAS spun out of and is partially designed to support the Journal of Statistical Software (aka JSS, aka JStatSoft). I adore JSS because it (a) is open access, (b) publishes systems papers on statistical software, (c) has fast reviewing turnaround times, and (d) is free for authors and readers. One of the next items on my to-do list is to write up the Stan modeling language and submit it to JSS. As a not-for-profit with no visible source of income, they are quite sensibly asking for donations (don’t complain — it beats $3K author fees or not being able to read papers).
Introduction: Ben Hyde sends along this: Stuck in the middle of the supplemental data, reporting the total workup for their compounds, was this gem: Emma, please insert NMR data here! where are they? and for this compound, just make up an elemental analysis . . . I’m reminded of our recent discussions of coauthorship, where I argued that I see real advantages to having multiple people taking responsibility for the result. Jay Verkuilen responded: “On the flipside of collaboration . . . is diffusion of responsibility, where everybody thinks someone else ‘has that problem’ and thus things don’t get solved.” That’s what seems to have happened (hilariously) here.
5 0.90987158 1862 andrew gelman stats-2013-05-18-uuuuuuuuuuuuugly
Introduction: Hamdan Azhar writes: I came across this graphic of vaccine-attributed decreases in mortality and was curious if you found it as unattractive and unintuitive as I did. Hope all is well with you! My reply: All’s well with me. And yes, that’s one horrible graph. It has all the problems of a bad infographic with none of the virtues. Compared to this monstrosity, the typical USA Today graph is a stunning, beautiful masterpiece. I don’t think I want to soil this webpage with the image. In fact, I don’t even want to link to it.
6 0.90764809 12 andrew gelman stats-2010-04-30-More on problems with surveys estimating deaths in war zones
same-blog 7 0.90550828 266 andrew gelman stats-2010-09-09-The future of R
8 0.90090579 1164 andrew gelman stats-2012-02-13-Help with this problem, win valuable prizes
9 0.89044464 876 andrew gelman stats-2011-08-28-Vaguely related to the coke-dumping story
10 0.87997705 1086 andrew gelman stats-2011-12-27-The most dangerous jobs in America
11 0.87083715 1308 andrew gelman stats-2012-05-08-chartsnthings !
12 0.86919862 519 andrew gelman stats-2011-01-16-Update on the generalized method of moments
13 0.85978878 2135 andrew gelman stats-2013-12-15-The UN Plot to Force Bayesianism on Unsuspecting Americans (penalized B-Spline edition)
14 0.85725564 829 andrew gelman stats-2011-07-29-Infovis vs. statgraphics: A clear example of their different goals
15 0.84419298 627 andrew gelman stats-2011-03-24-How few respondents are reasonable to use when calculating the average by county?
17 0.83066988 1595 andrew gelman stats-2012-11-28-Should Harvard start admitting kids at random?
18 0.8281284 1737 andrew gelman stats-2013-02-25-Correlation of 1 . . . too good to be true?
19 0.82078063 1070 andrew gelman stats-2011-12-19-The scope for snooping
20 0.82057405 2154 andrew gelman stats-2013-12-30-Bill Gates’s favorite graph of the year