andrew gelman stats-2012-07-02-1403: Moving beyond hopeless graphics
Source: html
Introduction: I was at a talk a while ago where the speaker presented tables with 4, 5, 6, even 8 significant digits even though, as is usual, only the first or second digit of each number conveyed any useful information. A graph would be better, but even if you’re too lazy to make a plot, a bit of rounding would seem to be required. I mentioned this to a colleague, who responded: “I don’t know how to stop this practice. Logic doesn’t work. Maybe ridicule? Best hope is the departure from the field of those who do it. (Theories don’t die, but the people who follow those theories retire.)” Another possibility, I think, is helpful software defaults. If we can get to the people who write the software, maybe we could have some impact. Once the software is written, however, it’s probably too late. I’m not far from the center of the R universe, but I don’t know if I’ll ever succeed in my goals of increasing the default number of histogram bars or reducing the default number of decimal places in regression results.
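(Since the complaint is about R defaults, here is a minimal sketch, mine rather than the post's, of overriding both defaults on a per-call basis; x and fit are hypothetical stand-ins for your own objects.)

    x   <- rnorm(1000)
    fit <- lm(dist ~ speed, data = cars)   # built-in dataset, for illustration only

    hist(x, breaks = 50)                   # ask for many more bars than the Sturges default
    print(summary(fit), digits = 2)        # print fewer digits than summary() shows by default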
Neal Beck points us to this (turn to his article beginning on page 4): Numbers in the text of articles and in tables should be reported with no more precision than they are measured and are substantively meaningful. In general, the number of places to the right of the decimal point for a measure should be one more than the number of zeros to the right of the decimal point on the standard error of this measure.
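(To make the rule concrete, here is one way to code it up; this is my sketch of the rule as stated, and the helper name beck_places is hypothetical, not from Beck's article.)

    beck_places <- function(se) {
      stopifnot(is.numeric(se), se > 0)
      zeros <- max(0, -1 - floor(log10(se)))  # zeros to the right of the decimal point in se
      zeros + 1                               # report one more decimal place than that
    }

    se  <- 0.0046                 # two zeros to the right of the decimal point
    est <- 0.12345
    round(est, beck_places(se))   # beck_places(se) is 3, so the estimate prints as 0.123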
Variables in tables should be rescaled so the entire table (or portion of the table) has a uniform number of digits reported. A table should not have regression coefficients reported at, say, 77000 in one line and … With suitable rescaling of variables (e.g., from thousands to millions of dollars, or population in millions per square mile to population in thousands per square mile), it should be possible to provide regression coefficients that are easily comprehensible numbers.
Rescaled units should be intuitively meaningful, so that, for example, dollar figures would be reported in thousands or millions of dollars. The rescaling of variables should aid, not impede, the clarity of a table.
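(A made-up R example of this advice; the data frame d and its variables are hypothetical.)

    set.seed(1)
    d <- data.frame(y = rnorm(100), income = runif(100, 2e4, 2e5))  # income in dollars

    coef(lm(y ~ income,   data = d))  # slope is a tiny number with several leading zeros
    d$income_k <- d$income / 1000     # rescale: dollars to thousands of dollars
    coef(lm(y ~ income_k, data = d))  # same fit, but the slope is now 1000 times larger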
In most cases, the uncertainty of numerical estimates is better conveyed by confidence intervals or standard errors (or complete likelihood functions or posterior distributions), rather than by hypothesis tests and p-values. … Coefficients significant at the .001, .01, and .05 levels may be flagged with 3, 2, and 1 asterisks, respectively, with notes that they are significant at the given levels.
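(A quick illustration of the interval-reporting advice, using the same toy regression as above rather than anything from the post.)

    fit <- lm(dist ~ speed, data = cars)
    round(confint(fit, level = 0.95), 1)   # 95% intervals, rounded per the precision rule above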
Political Analysis follows the conventional usage that the unmodified term “significant” implies statistical significance at the 5% level. Authors should not depart from this convention without good reason and without clearly indicating to readers the departure from convention.
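(Base R already implements this kind of flagging: symnum() is what the standard regression printout uses to build its “Signif. codes” line. A sketch with made-up p-values:)

    p <- c(0.0004, 0.03, 0.2)
    symnum(p, corr = FALSE, na = FALSE,
           cutpoints = c(0, 0.001, 0.01, 0.05, 1),
           symbols   = c("***", "**", "*", " "))
    # 0.0004 gets ***, 0.03 gets *, and 0.2 is left unflagged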
All articles should strive for maximal clarity. In the end all decisions about clarity must be made by the author (with some help from referees and editors).