andrew_gelman_stats andrew_gelman_stats-2013 andrew_gelman_stats-2013-1661 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: We had a recent discussion about statistics packages where people talked about the structure and capabilities of different computer languages. One thing I wanted to add to this discussion is some sociology. To me, a statistics package is not just its code, it’s also its community, it’s what people do with it. R, for example, is nothing special for graphics (again, I think in retrospect my graphs would be better if I’d been making them in Fortran all these years); what makes R graphics work so well is that there’s a clear path from the numbers to the graphs, there’s a tradition in R of postprocessing. In comparison, consider Sas. I’ve never directly used Sas but whenever I’ve seen it used, whether by people working for me or with me or just people down the hall who left Sas output sitting in the printer, in all these cases there’s no postprocessing. It doesn’t look interactive at all. The user runs some procedure and then there are pages and pages and pages of output. The po
sentIndex sentText sentNum sentScore
1 We had a recent discussion about statistics packages where people talked about the structure and capabilities of different computer languages. [sent-1, score-0.928]
2 One thing I wanted to add to this discussion is some sociology. [sent-2, score-0.077]
3 To me, a statistics package is not just its code, it’s also its community, it’s what people do with it. [sent-3, score-0.3]
4 R, for example, is nothing special for graphics (again, I think in retrospect my graphs would be better if I’d been making them in Fortran all these years); what makes R graphics work so well is that there’s a clear path from the numbers to the graphs, there’s a tradition in R of postprocessing. [sent-4, score-1.057]
5 I’ve never directly used Sas but whenever I’ve seen it used, whether by people working for me or with me or just people down the hall who left Sas output sitting in the printer, in all these cases there’s no postprocessing. [sent-6, score-0.99]
6 The user runs some procedure and then there are pages and pages and pages of output. [sent-8, score-1.071]
7 The point about R graphics is not that they’re so great, it’s that R users such as myself graph what we want. [sent-9, score-0.375]
8 In fact, lots of default R graphics are horrible. [sent-10, score-0.454]
9 (Try applying the default plot() function to the output of a linear model if you want to see some yuck.) [sent-11, score-0.588]
10 I think Sas is horrible, not out of some inherent sense of its structure as a computer program but because I see what people do with it. [sent-12, score-0.697]
11 In contrast, I see people using Stata creatively and flexibly, so I have much warmer feelings toward Stata (even though I don’t actually know how to use it myself). [sent-13, score-0.457]
12 The internals of a program do have something to do with how it’s used, I’m sure. [sent-14, score-0.137]
13 I assume that Excel really is crappy, it’s not just that people use it to make crappy graphs. [sent-15, score-0.364]
14 But as a user, I don’t really care about that, I just know to avoid it. [sent-16, score-0.072]
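The rows above can be read back into structured records. The following is a minimal sketch, assuming each row follows the layout `<index> <text> [sent-<n>, score-<x>]` as it appears in this listing; the names `ROW_PATTERN` and `parse_row` are hypothetical helpers introduced here for illustration, not part of whatever tool produced the table.

```python
import re

# Two rows copied from the table above. The row layout
# "<index> <text> [sent-<n>, score-<x>]" is an assumption based on this listing.
rows = [
    "2 One thing I wanted to add to this discussion is some sociology. [sent-2, score-0.077]",
    "6 The user runs some procedure and then there are pages and pages and pages of output. [sent-8, score-1.071]",
]

# Hypothetical pattern: leading index, sentence text, then the bracketed tag.
ROW_PATTERN = re.compile(r"^(\d+) (.*) \[sent-(\d+), score-([0-9.]+)\]$")


def parse_row(row):
    """Split one row into (index, text, sentence number, score)."""
    match = ROW_PATTERN.match(row)
    if match is None:
        raise ValueError(f"unrecognized row: {row!r}")
    index, text, sent_num, score = match.groups()
    return int(index), text, int(sent_num), float(score)


parsed = [parse_row(r) for r in rows]

# Rank sentences from highest to lowest score.
ranked = sorted(parsed, key=lambda record: record[3], reverse=True)
for index, text, sent_num, score in ranked:
    print(f"{score:.3f}  sent-{sent_num}  {text}")
```

Sorting on the score column puts the sentences the scorer weighted most heavily first, which is one plausible way to skim a post from this table.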
wordName wordTfidf (topN-words)
[('sas', 0.36), ('graphics', 0.29), ('pages', 0.239), ('crappy', 0.213), ('stata', 0.2), ('output', 0.196), ('user', 0.18), ('default', 0.164), ('structure', 0.156), ('people', 0.151), ('computer', 0.15), ('printer', 0.149), ('flexibly', 0.149), ('warmer', 0.142), ('capabilities', 0.142), ('program', 0.137), ('fortran', 0.133), ('graphs', 0.116), ('hall', 0.114), ('used', 0.114), ('excel', 0.108), ('interactive', 0.107), ('tradition', 0.103), ('inherent', 0.103), ('whenever', 0.102), ('feelings', 0.099), ('retrospect', 0.097), ('packages', 0.097), ('path', 0.093), ('runs', 0.092), ('sitting', 0.092), ('talked', 0.089), ('applying', 0.089), ('users', 0.085), ('horrible', 0.084), ('package', 0.083), ('procedure', 0.082), ('community', 0.08), ('discussion', 0.077), ('plot', 0.076), ('avoid', 0.072), ('linear', 0.07), ('left', 0.07), ('contrast', 0.07), ('comparison', 0.069), ('function', 0.069), ('code', 0.069), ('special', 0.068), ('statistics', 0.066), ('toward', 0.065)]
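The word list above is printed as a Python list of `(word, weight)` tuples, so it can be loaded directly. This is a minimal sketch, assuming the text is a well-formed Python literal; `top_words` is a hypothetical helper introduced here, and only the first five entries are copied in.

```python
import ast

# First five entries of the list above, kept as the literal text it is printed in.
raw = "[('sas', 0.36), ('graphics', 0.29), ('pages', 0.239), ('crappy', 0.213), ('stata', 0.2)]"

# literal_eval parses Python literals only and, unlike eval, does not run code.
pairs = ast.literal_eval(raw)
weights = dict(pairs)


def top_words(weights, n):
    """Return the n words with the largest weights, highest first."""
    return sorted(weights, key=weights.get, reverse=True)[:n]


print(top_words(weights, 3))
```

A dictionary keyed by word makes it easy to look up a single weight, for example `weights['stata']`, or to compare word lists across posts.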
simIndex simValue blogId blogTitle
same-blog 1 0.99999994 1661 andrew gelman stats-2013-01-08-Software is as software does
2 0.32400784 1655 andrew gelman stats-2013-01-05-The statistics software signal
Introduction: Tyler Cowen links to a post by Sean Taylor, who writes the following about users of R: You are willing to invest in learning something difficult. You do not care about aesthetics, only availability of packages and getting results quickly. To me, R is easy and Sas is difficult. I once worked with some students who were running Sas and the output was unreadable! Pages and pages of numbers that made no sense. When it comes to ease or difficulty of use, I think it depends on what you’re used to! And I really don’t understand the bit about aesthetics. What about this? One reason I use R is to make pretty graphs. That said, if I’d never learned R, I’d just be making pretty graphs in Fortran or whatever. My guess is, the way I program, R is actually hindering rather than helping my ability to make attractive graphs. Half the time I’m scrambling around, writing custom code to get around R’s defaults.
3 0.18844959 855 andrew gelman stats-2011-08-16-Infovis and statgraphics update update
Introduction: To continue our discussion from last week , consider three positions regarding the display of information: (a) The traditional tabular approach. This is how most statisticians, econometricians, political scientists, sociologists, etc., seem to operate. They understand the appeal of a pretty graph, and they’re willing to plot some data as part of an exploratory data analysis, but they see their serious research as leading to numerical estimates, p-values, tables of numbers. These people might use a graph to illustrate their points but they don’t see them as necessary in their research. (b) Statistical graphics as performed by Howard Wainer, Bill Cleveland, Dianne Cook, etc. They–we–see graphics as central to the process of statistical modeling and data analysis and are interested in graphs (static and dynamic) that display every data point as transparently as possible. (c) Information visualization or infographics, as performed by graphics designers and statisticians who are
4 0.17324382 83 andrew gelman stats-2010-06-13-Silly Sas lays out old-fashioned statistical thinking
Introduction: People keep telling me that Sas isn’t as bad as everybody says, but then I see (from Christian Robert ) this listing from the Sas website of “disadvantages in using Bayesian analysis”: There is no correct way to choose a prior. Bayesian inferences require skills to translate prior beliefs into a mathematically formulated prior. If you do not proceed with caution, you can generate misleading results. . . . From a practical point of view, it might sometimes be difficult to convince subject matter experts who do not agree with the validity of the chosen prior. That is so tacky! As if least squares, logistic regressions, Cox models, and all those other likelihoods mentioned in the Sas documentation are so automatically convincing to subject matter experts. P.S. For some more serious objections to Bayesian statistics, see here and here . P.P.S. In case you’re wondering why I’m commenting on month-old blog entries . . . I have a monthlong backlog of entries, and I’m spooling
5 0.17025113 1764 andrew gelman stats-2013-03-15-How do I make my graphs?
Introduction: Someone who wishes to remain anonymous writes: I’ve been following your blog a long time and enjoy your posts on visualization/statistical graphics matters. I don’t recall however you ever describing the details of your setup for plotting. I’m a new R user (convert from matplotlib) and would love to know your thoughts on the ideal setup: do you use mainly the R base? Do you use lattice? What do you think of ggplot2? etc. I found ggplot2 nearly indecipherable until a recent eureka moment, and I think its default theme wastes a tremendous amount of ink (all those silly grey backgrounds and grids are really unnecessary), but if you customize that away it can be made to look like ordinary, pretty statistical graphs. Feel free to respond on your blog, but if you do, please remove my name from the post (my colleagues already make fun of me for thinking about visualization too much.) I love that last bit! Anyway, my response is that I do everything in base graphics (using my
6 0.15007259 1275 andrew gelman stats-2012-04-22-Please stop me before I barf again
7 0.14680743 76 andrew gelman stats-2010-06-09-Both R and Stata
9 0.14513735 1584 andrew gelman stats-2012-11-19-Tradeoffs in information graphics
10 0.14350915 2266 andrew gelman stats-2014-03-25-A statistical graphics course and statistical graphics advice
12 0.1321367 530 andrew gelman stats-2011-01-22-MS-Bayes?
13 0.12577233 546 andrew gelman stats-2011-01-31-Infovis vs. statistical graphics: My talk tomorrow (Tues) 1pm at Columbia
14 0.11879788 1848 andrew gelman stats-2013-05-09-A tale of two discussion papers
15 0.114852 61 andrew gelman stats-2010-05-31-A data visualization manifesto
16 0.11433159 1134 andrew gelman stats-2012-01-21-Lessons learned from a recent R package submission
17 0.11138011 1482 andrew gelman stats-2012-09-04-Model checking and model understanding in machine learning
18 0.10798667 2279 andrew gelman stats-2014-04-02-Am I too negative?
19 0.10766104 1604 andrew gelman stats-2012-12-04-An epithet I can live with
20 0.1074512 798 andrew gelman stats-2011-07-12-Sometimes a graph really is just ugly
topicId topicWeight
[(0, 0.188), (1, -0.019), (2, -0.058), (3, 0.109), (4, 0.15), (5, -0.098), (6, -0.088), (7, 0.011), (8, -0.033), (9, 0.001), (10, -0.022), (11, -0.029), (12, -0.02), (13, -0.01), (14, 0.014), (15, -0.027), (16, -0.02), (17, -0.028), (18, 0.011), (19, 0.04), (20, 0.002), (21, 0.013), (22, 0.001), (23, 0.028), (24, -0.058), (25, -0.01), (26, -0.014), (27, 0.028), (28, -0.027), (29, 0.024), (30, -0.018), (31, 0.033), (32, 0.053), (33, -0.002), (34, -0.03), (35, -0.053), (36, -0.007), (37, 0.06), (38, -0.004), (39, 0.016), (40, 0.021), (41, -0.013), (42, -0.013), (43, 0.012), (44, -0.007), (45, -0.016), (46, -0.032), (47, 0.025), (48, 0.026), (49, 0.025)]
simIndex simValue blogId blogTitle
same-blog 1 0.93424118 1661 andrew gelman stats-2013-01-08-Software is as software does
2 0.85789061 1764 andrew gelman stats-2013-03-15-How do I make my graphs?
Introduction: Someone who wishes to remain anonymous writes: I’ve been following your blog a long time and enjoy your posts on visualization/statistical graphics matters. I don’t recall however you ever describing the details of your setup for plotting. I’m a new R user (convert from matplotlib) and would love to know your thoughts on the ideal setup: do you use mainly the R base? Do you use lattice? What do you think of ggplot2? etc. I found ggplot2 nearly indecipherable until a recent eureka moment, and I think its default theme wastes a tremendous amount of ink (all those silly grey backgrounds and grids are really unnecessary), but if you customize that away it can be made to look like ordinary, pretty statistical graphs. Feel free to respond on your blog, but if you do, please remove my name from the post (my colleagues already make fun of me for thinking about visualization too much.) I love that last bit! Anyway, my response is that I do everything in base graphics (using my
3 0.84255797 855 andrew gelman stats-2011-08-16-Infovis and statgraphics update update
Introduction: To continue our discussion from last week , consider three positions regarding the display of information: (a) The traditional tabular approach. This is how most statisticians, econometricians, political scientists, sociologists, etc., seem to operate. They understand the appeal of a pretty graph, and they’re willing to plot some data as part of an exploratory data analysis, but they see their serious research as leading to numerical estimates, p-values, tables of numbers. These people might use a graph to illustrate their points but they don’t see them as necessary in their research. (b) Statistical graphics as performed by Howard Wainer, Bill Cleveland, Dianne Cook, etc. They–we–see graphics as central to the process of statistical modeling and data analysis and are interested in graphs (static and dynamic) that display every data point as transparently as possible. (c) Information visualization or infographics, as performed by graphics designers and statisticians who are
4 0.83049417 319 andrew gelman stats-2010-10-04-“Who owns Congress”
Introduction: Curt Yeske pointed me to this. Wow–these graphs are really hard to read! The old me would’ve said that each of these graphs would be better replaced by a dotplot (or, better still, a series of lineplots showing time trends). The new me would still like the dotplots and lineplots, but I’d say it’s fine to have the eye-grabbing but hard-to-read graphs as is, and then to have the more informative statistical graphics underneath, as it were. The idea is, you’d click on the pretty but hard-to-read “infovis” graphs, and this would then reveal informative “full Cleveland” graphs. And then if you click again you’d get a spreadsheet with the raw numbers. That I’d like to see, as a new model for graphical presentation.
Introduction: I continue to struggle to convey my thoughts on statistical graphics so I’ll try another approach, this time giving my own story. For newcomers to this discussion: the background is that Antony Unwin and I wrote an article on the different goals embodied in information visualization and statistical graphics, but I have difficulty communicating on this point with the infovis people. Maybe if I tell my own story, and then they tell their stories, this will point a way forward to a more constructive discussion. So here goes. I majored in physics in college and I worked in a couple of research labs during the summer. Physicists graph everything. I did most of my plotting on graph paper–this continued through my second year of grad school–and became expert at putting points at 1/5, 2/5, 3/5, and 4/5 between the x and y grid lines. In grad school in statistics, I continued my physics habits and graphed everything I could. I did notice, though, that the faculty and the other
6 0.79800433 61 andrew gelman stats-2010-05-31-A data visualization manifesto
7 0.79302722 1584 andrew gelman stats-2012-11-19-Tradeoffs in information graphics
8 0.79142064 372 andrew gelman stats-2010-10-27-A use for tables (really)
10 0.77318245 1275 andrew gelman stats-2012-04-22-Please stop me before I barf again
11 0.76958269 736 andrew gelman stats-2011-05-29-Response to “Why Tables Are Really Much Better Than Graphs”
12 0.76902896 794 andrew gelman stats-2011-07-09-The quest for the holy graph
13 0.76517653 1896 andrew gelman stats-2013-06-13-Against the myth of the heroic visualization
14 0.76515043 2266 andrew gelman stats-2014-03-25-A statistical graphics course and statistical graphics advice
15 0.76451713 324 andrew gelman stats-2010-10-07-Contest for developing an R package recommendation system
16 0.76249868 1606 andrew gelman stats-2012-12-05-The Grinch Comes Back
17 0.76249814 1848 andrew gelman stats-2013-05-09-A tale of two discussion papers
18 0.75697589 1604 andrew gelman stats-2012-12-04-An epithet I can live with
19 0.74980348 1684 andrew gelman stats-2013-01-20-Ugly ugly ugly
20 0.74288762 2319 andrew gelman stats-2014-05-05-Can we make better graphs of global temperature history?
topicId topicWeight
[(13, 0.021), (16, 0.056), (24, 0.21), (35, 0.022), (42, 0.02), (55, 0.03), (66, 0.024), (67, 0.033), (73, 0.045), (75, 0.029), (82, 0.064), (86, 0.038), (89, 0.047), (90, 0.016), (99, 0.253)]
simIndex simValue blogId blogTitle
1 0.95397508 931 andrew gelman stats-2011-09-29-Hamiltonian Monte Carlo stories
Introduction: Tomas Iesmantas had asked me for advice on a regression problem with 50 parameters, and I’d recommended Hamiltonian Monte Carlo. A few weeks later he reported back: After trying several modifications (HMC for all parameters at once, HMC just for first level parameters and Riemann manifold Hamiltonian Monte Carlo method), I finally got it running with HMC just for first level parameters and for others using direct sampling, since conditional distributions turned out to have closed form. However, even in this case it is quite tricky, since I had to employ mass matrix and not just diagonal but at the beginning of algorithm generated it randomly (ensuring it is positive definite). Such random generation of mass matrix is quite blind step, but it proved to be quite helpful. Riemann manifold HMC is quite vagarious, or to be more specific, metric of manifold is very sensitive. In my model log-likelihood I had exponents and values of metrics matrix elements was very large and wh
2 0.95290649 846 andrew gelman stats-2011-08-09-Default priors update?
Introduction: Ryan King writes: I was wondering if you have a brief comment on the state of the art for objective priors for hierarchical generalized linear models (generalized linear mixed models). I have been working off the papers in Bayesian Analysis (2006) 1, Number 3 (Browne and Draper, Kass and Natarajan, Gelman). There seems to have been continuous work for matching priors in linear mixed models, but GLMMs less so because of the lack of an analytic marginal likelihood for the variance components. There are a number of additional suggestions in the literature since 2006, but little robust practical guidance. I’m interested in both mean parameters and the variance components. I’m almost always concerned with logistic random effect models. I’m fascinated by the matching-priors idea of higher-order asymptotic improvements to maximum likelihood, and need to make some kind of defensible default recommendation. Given the massive scale of the datasets (genetics …), extensive sensitivity a
3 0.94991159 1367 andrew gelman stats-2012-06-05-Question 26 of my final exam for Design and Analysis of Sample Surveys
Introduction: 26. You have just graded an exam with 28 questions and 15 students. You fit a logistic item-response model estimating ability, difficulty, and discrimination parameters. Which of the following statements are basically true? (Indicate all that apply.) (a) If a question is answered correctly by students with very low and very high ability, but is missed by students in the middle, it will have a high value for its discrimination parameter. (b) It is not possible to fit an item-response model when you have more questions than students. In order to fit the model, you either need to reduce the number of questions (for example, by discarding some questions or by putting together some questions into a combined score) or increase the number of students in the dataset. (c) To keep the model identified, you can set one of the difficulty parameters or one of the ability parameters to zero and set one of the discrimination parameters to 1. (d) If two students answer the same number of q
4 0.94797146 1474 andrew gelman stats-2012-08-29-More on scaled-inverse Wishart and prior independence
Introduction: I’ve had a couple of email conversations in the past couple days on dependence in multivariate prior distributions. Modeling the degrees of freedom and scale parameters in the t distribution First, in our Stan group we’ve been discussing the choice of priors for the degrees-of-freedom parameter in the t distribution. I wrote that also there’s the question of parameterization. It does not necessarily make sense to have independent priors on the df and scale parameters. In some sense, the meaning of the scale parameter changes with the df. Prior dependence between correlation and scale parameters in the scaled inverse-Wishart model The second case of parameterization in prior distribution arose from an email I received from Chris Chatham pointing me to this exploration by Matt Simpson of the scaled inverse-Wishart prior distribution for hierarchical covariance matrices. Simpson writes: A popular prior for Σ is the inverse-Wishart distribution [ not the same as the
same-blog 5 0.94640177 1661 andrew gelman stats-2013-01-08-Software is as software does
6 0.94409347 2086 andrew gelman stats-2013-11-03-How best to compare effects measured in two different time periods?
7 0.94331998 2231 andrew gelman stats-2014-03-03-Running into a Stan Reference by Accident
8 0.94323248 1072 andrew gelman stats-2011-12-19-“The difference between . . .”: It’s not just p=.05 vs. p=.06
9 0.94247013 1240 andrew gelman stats-2012-04-02-Blogads update
10 0.94197839 1208 andrew gelman stats-2012-03-11-Gelman on Hennig on Gelman on Bayes
11 0.94181514 801 andrew gelman stats-2011-07-13-On the half-Cauchy prior for a global scale parameter
12 0.94126022 1644 andrew gelman stats-2012-12-30-Fixed effects, followed by Bayes shrinkage?
14 0.94109255 401 andrew gelman stats-2010-11-08-Silly old chi-square!
15 0.94102818 1421 andrew gelman stats-2012-07-19-Alexa, Maricel, and Marty: Three cellular automata who got on my nerves
18 0.94063514 1155 andrew gelman stats-2012-02-05-What is a prior distribution?
19 0.94056344 2109 andrew gelman stats-2013-11-21-Hidden dangers of noninformative priors
20 0.93986225 1702 andrew gelman stats-2013-02-01-Don’t let your standard errors drive your research agenda