andrew_gelman_stats andrew_gelman_stats-2013 andrew_gelman_stats-2013-1807 knowledge-graph by maker-knowledge-mining

1807 andrew gelman stats-2013-04-17-Data problems, coding errors…what can be done?


Meta info for this blog

Source: html

Introduction: This post is by Phil. A recent post on this blog discusses a prominent case of an Excel error leading to substantially wrong results from a statistical analysis. Excel is notorious for this because it is easy to add a row or column of data (or intermediate results) but forget to update equations so that they correctly use the new data. That particular error is less common in a language like R because R programmers usually refer to data by variable name (or by applying functions to a named variable), so the same code works even if you add or remove data. Still, there is plenty of opportunity for errors no matter what language one uses. Andrew ran into problems fairly recently, and also blogged about another instance. I’ve never had to retract a paper, but that’s partly because I haven’t published a whole lot of papers. Certainly I have found plenty of substantial errors pretty late in some of my data analyses, and I obviously don’t have sufficient mechanisms in place to be sure


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Excel is notorious for this because it is easy to add a row or column of data (or intermediate results) but forget to update equations so that they correctly use the new data. [sent-2, score-0.522]

2 That particular error is less common in a language like R because R programmers usually refer to data by variable name (or by applying functions to a named variable), so the same code works even if you add or remove data. [sent-3, score-0.813]

3 Certainly I have found plenty of substantial errors pretty late in some of my data analyses, and I obviously don’t have sufficient mechanisms in place to be sure that errors can’t persist all the way through to the end. [sent-7, score-0.455]

4 I sometimes used to refer to the wrong column (or, occasionally, row) of data. [sent-11, score-0.476]

5 The solution here is easy: assign column names, and refer to columns by name instead of number. [sent-12, score-0.75]

6 Ideally, column headers are already in the data file; if not, and if it doesn’t make sense to edit the data file to put them in there, then read the datafile normally and assign column names with the very next command. [sent-13, score-1.261]
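The fix in sentences 5 and 6 can be sketched in a few lines of R; the file name and column names here are hypothetical stand-ins:

```r
# Read a data file that lacks a header row, then assign column names
# with the very next command, as the text suggests.
dat <- read.table("measurements.txt")
names(dat) <- c("site", "date", "temperature", "pressure")

# From here on, refer to columns by name, never by number:
mean(dat$temperature)   # still correct if columns are later added or reordered
```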

7 I sometimes make a change or fix an error in one place in my code but not another. [sent-14, score-0.551]

8 I might have a block of code that analyzes the first dataset, and an almost duplicated block of code that analyzes the second dataset. [sent-16, score-0.712]

9 I try to make myself use two mechanisms to make sure I don’t have this problem: (a) label the outputs and look at the labels. [sent-18, score-0.532]

10 Unfortunately, the way I usually work I often won’t end up looking at those when it comes time to do something with the results; for instance, I might make an output matrix with something like  rbind(results1, results2), where results1 and results2 are outputs of the quantile function. [sent-25, score-0.657]

11 In the current example, rather than calling the quantile() function in two different places, I could write a really simple function like this: myquantile = function(datavec) { return(quantile(datavec,probs=c(0. [sent-27, score-0.512]
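The snippet in sentence 11 is cut off mid-expression. A completed sketch of the idea follows; the specific probability levels `c(0.025, 0.5, 0.975)` are an assumed choice, since the original values are truncated:

```r
# One wrapper function, so the quantile levels live in exactly one place
# instead of being repeated (and possibly diverging) at two call sites.
myquantile <- function(datavec) {
  quantile(datavec, probs = c(0.025, 0.5, 0.975))  # probs are assumed here
}

results1 <- myquantile(rnorm(1000))
results2 <- myquantile(rnorm(1000, mean = 2))
rbind(results1, results2)   # the quantile labels carry through automatically
```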

12 I sometimes use the wrong variable name in a function, and this problem can be hard to find. [sent-35, score-0.475]

13 Great, except that I sometimes make an error like this: inside the function I define a variable like std = sqrt(var(dat)), and then later in the function I say something like x = (dat - mean(dat))/stdev. [sent-39, score-1.081]

14 I wish R had a “strict” option or something, that would make a function give an error or at least a warning if I use a variable that isn’t local. [sent-45, score-0.706]
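R has no built-in "strict" mode, but the codetools package that ships with R can catch many slips of exactly the kind in sentence 13. A sketch, reusing the std/stdev example (the printed messages are approximate):

```r
library(codetools)  # bundled with base R

standardize <- function(dat) {
  std <- sqrt(var(dat))
  (dat - mean(dat)) / stdev   # bug: 'stdev' was never defined in this function
}

checkUsage(standardize)
# reports something like:
#   no visible binding for global variable 'stdev'
#   local variable 'std' assigned but may not be used
```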

15 Say I have data on a bunch of buildings, and frame1 has a list of buildings that have certain characteristics, and frame2 has a big list of lots of buildings and some data on each one. [sent-47, score-0.73]

16 Then frame2[match(frame1$buildingname, frame2$buildingname),] gives me just the rows of frame2 that match the buildings in frame1. [sent-48, score-0.384]

17 Even so, I can run into trouble if I want to do further subsetting, like look at only the rows in frame2 for which the name occurs in frame1 AND some other column of frame1 has a certain characteristic, or for which the name occurs in frame1 and some other condition of frame2 is also true. [sent-53, score-0.795]
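The compound-condition subsetting in sentences 15 through 17 can be sketched with hypothetical toy frames; `retrofit` and `kwh` are invented columns for illustration:

```r
# Hypothetical frames mirroring the building example in the text.
frame1 <- data.frame(buildingname = c("A", "B"), retrofit = c(TRUE, FALSE))
frame2 <- data.frame(buildingname = c("A", "B", "C"), kwh = c(10, 20, 30))

# Rows of frame2 whose name occurs in frame1 AND frame1 says retrofit is TRUE:
keep <- frame1$buildingname[frame1$retrofit]
frame2[frame2$buildingname %in% keep, ]

# merge() keeps the bookkeeping straight when conditions span both frames:
both <- merge(frame1, frame2, by = "buildingname")
both[both$retrofit & both$kwh > 5, ]
```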

18 This seems simple and obvious, and in the example I’ve given here it is indeed simple and obvious, but sometimes when I have multiple data sources, and maybe data structures that are more complicated than data frames, it’s not so easy. [sent-57, score-0.609]

19 table() and then just go ahead and start referring to various rows and columns (by name or otherwise); open the datafile, or write out the results from your work, and take a look. [sent-60, score-0.473]

20 Many problems will show up as suspicious patterns: you’ll find that a whole column has the same value, or that something that should always be positive has some negative values somehow, or that a variable you thought you had normalized to 1 has some values over 1, or whatever. [sent-64, score-0.455]
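The suspicious-pattern checks in sentence 20 are cheap to automate right after reading the data. A minimal sketch, with a hypothetical data frame standing in for freshly read data:

```r
# Hypothetical data frame; column names are invented for illustration.
dat <- data.frame(site = c("a", "b"),
                  pressure = c(101.3, 99.8),
                  weight = c(0.4, 0.6))

summary(dat)                              # eyeball ranges, NAs, constant columns
stopifnot(all(dat$pressure > 0))          # a should-always-be-positive column
stopifnot(all(dat$weight <= 1))           # a variable normalized to 1
stopifnot(length(unique(dat$site)) > 1)   # a whole column with one value is suspicious
```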


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('quantile', 0.28), ('column', 0.253), ('buildingname', 0.246), ('function', 0.221), ('dat', 0.168), ('buildings', 0.163), ('code', 0.147), ('variable', 0.137), ('name', 0.135), ('sometimes', 0.127), ('rows', 0.126), ('datafile', 0.123), ('columns', 0.119), ('error', 0.118), ('excel', 0.115), ('data', 0.114), ('stdev', 0.112), ('block', 0.111), ('std', 0.101), ('try', 0.099), ('analyzes', 0.098), ('refer', 0.096), ('match', 0.095), ('results', 0.093), ('make', 0.091), ('outputs', 0.09), ('errors', 0.09), ('list', 0.088), ('edit', 0.088), ('names', 0.087), ('reduce', 0.087), ('percentile', 0.087), ('decide', 0.085), ('mechanisms', 0.085), ('row', 0.079), ('solution', 0.076), ('use', 0.076), ('plenty', 0.076), ('solutions', 0.075), ('occurs', 0.073), ('assign', 0.071), ('simple', 0.07), ('fix', 0.068), ('file', 0.067), ('ve', 0.066), ('matrix', 0.066), ('kinds', 0.066), ('common', 0.066), ('something', 0.065), ('option', 0.063)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000002 1807 andrew gelman stats-2013-04-17-Data problems, coding errors…what can be done?


2 0.19825132 397 andrew gelman stats-2010-11-06-Multilevel quantile regression

Introduction: Ryan Seals writes: I’m an epidemiologist at Emory University, and I’m working on a project of release patterns in jails (basically trying to model how long individuals are in jail before they’re release, for purposes of designing short-term health interventions, i.e. HIV testing, drug counseling, etc…). The question lends itself to quantile regression; we’re interested in the # of days it takes for 50% and 75% of inmates to be released. But being a clustered/nested data structure, it also obviously lends itself to multilevel modeling, with the group-level being individual jails. So: do you know of any work on multilevel quantile regression? My quick lit search didn’t yield much, and I don’t see any preprogrammed way to do it in SAS. My reply: To start with, I’m putting in the R keyword here, on the hope that some readers might be able to refer you to an R function that does what you want. Beyond this, I think it should be possible to program something in Bugs. In ARM we hav

3 0.15982363 1447 andrew gelman stats-2012-08-07-Reproducible science FAIL (so far): What’s stoppin people from sharin data and code?

Introduction: David Karger writes: Your recent post on sharing data was of great interest to me, as my own research in computer science asks how to incentivize and lower barriers to data sharing. I was particularly curious about your highlighting of effort as the major dis-incentive to sharing. I would love to hear more, as this question of effort is on we specifically target in our development of tools for data authoring and publishing. As a straw man, let me point out that sharing data technically requires no more than posting an excel spreadsheet online. And that you likely already produced that spreadsheet during your own analytic work. So, in what way does such low-tech publishing fail to meet your data sharing objectives? Our own hypothesis has been that the effort is really quite low, with the problem being a lack of *immediate/tangible* benefits (as opposed to the long-term values you accurately describe). To attack this problem, we’re developing tools (and, since it appear

4 0.1504854 2190 andrew gelman stats-2014-01-29-Stupid R Tricks: Random Scope

Introduction: Andrew and I have been discussing how we’re going to define functions in Stan for defining systems of differential equations; see our evolving ode design doc ; comments welcome, of course. About Scope I mentioned to Andrew I would prefer pure lexical, static scoping, as found in languages like C++ and Java. If you’re not familiar with the alternatives, there’s a nice overview in the Wikipedia article on scope . Let me call out a few passages that will help set the context. A fundamental distinction in scoping is what “context” means – whether name resolution depends on the location in the source code (lexical scope, static scope, which depends on the lexical context) or depends on the program state when the name is encountered (dynamic scope, which depends on the execution context or calling context). Lexical resolution can be determined at compile time, and is also known as early binding, while dynamic resolution can in general only be determined at run time, and thus

5 0.14665151 1753 andrew gelman stats-2013-03-06-Stan 1.2.0 and RStan 1.2.0

Introduction: Stan 1.2.0 and RStan 1.2.0 are now available for download. See: http://mc-stan.org/ Here are the highlights. Full Mass Matrix Estimation during Warmup Yuanjun Gao, a first-year grad student here at Columbia (!), built a regularized mass-matrix estimator. This helps for posteriors with high correlation among parameters and varying scales. We’re still testing this ourselves, so the estimation procedure may change in the future (don’t worry — it satisfies detailed balance as is, but we might be able to make it more computationally efficient in terms of time per effective sample). It’s not the default option. The major reason is the matrix operations required are expensive, raising the algorithm cost to , where is the average number of leapfrog steps, is the number of iterations, and is the number of parameters. Yuanjun did a great job with the Cholesky factorizations and implemented this about as efficiently as is possible. (His homework for Andrew’s class w

6 0.13725612 907 andrew gelman stats-2011-09-14-Reproducibility in Practice

7 0.13427629 2107 andrew gelman stats-2013-11-20-NYT (non)-retraction watch

8 0.13244548 1760 andrew gelman stats-2013-03-12-Misunderstanding the p-value

9 0.13214657 852 andrew gelman stats-2011-08-13-Checking your model using fake data

10 0.12968771 1605 andrew gelman stats-2012-12-04-Write This Book

11 0.12516153 1289 andrew gelman stats-2012-04-29-We go to war with the data we have, not the data we want

12 0.12429141 1808 andrew gelman stats-2013-04-17-Excel-bashing

13 0.12178656 1799 andrew gelman stats-2013-04-12-Stan 1.3.0 and RStan 1.3.0 Ready for Action

14 0.11636905 1919 andrew gelman stats-2013-06-29-R sucks

15 0.11540385 272 andrew gelman stats-2010-09-13-Ross Ihaka to R: Drop Dead

16 0.1146196 1117 andrew gelman stats-2012-01-13-What are the important issues in ethics and statistics? I’m looking for your input!

17 0.11394101 1955 andrew gelman stats-2013-07-25-Bayes-respecting experimental design and other things

18 0.11292131 2337 andrew gelman stats-2014-05-18-Never back down: The culture of poverty and the culture of journalism

19 0.11121241 2211 andrew gelman stats-2014-02-14-The popularity of certain baby names is falling off the clifffffffffffff

20 0.11051593 1472 andrew gelman stats-2012-08-28-Migrating from dot to underscore


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.253), (1, 0.001), (2, 0.003), (3, 0.004), (4, 0.125), (5, -0.031), (6, 0.057), (7, -0.061), (8, 0.042), (9, -0.051), (10, -0.022), (11, 0.003), (12, -0.037), (13, -0.038), (14, -0.008), (15, 0.033), (16, -0.012), (17, -0.038), (18, 0.018), (19, 0.003), (20, 0.035), (21, 0.025), (22, -0.008), (23, 0.024), (24, -0.043), (25, -0.017), (26, 0.037), (27, -0.01), (28, 0.006), (29, -0.004), (30, 0.064), (31, 0.027), (32, -0.001), (33, 0.024), (34, 0.022), (35, -0.068), (36, -0.027), (37, 0.052), (38, -0.05), (39, -0.014), (40, -0.001), (41, -0.033), (42, 0.004), (43, 0.008), (44, -0.026), (45, 0.088), (46, -0.012), (47, 0.02), (48, 0.06), (49, 0.059)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.95249486 1807 andrew gelman stats-2013-04-17-Data problems, coding errors…what can be done?


2 0.86018437 266 andrew gelman stats-2010-09-09-The future of R

Introduction: Some thoughts from Christian , including this bit: We need to consider separately 1. R’s brilliant library 2. R’s not-so-brilliant language and/or interpreter. I don’t know that R’s library is so brilliant as all that–if necessary, I don’t think it would be hard to reprogram the important packages in a new language. I would say, though, that the problems with R are not just in the technical details of the language. I think the culture of R has some problems too. As I’ve written before, R functions used to be lean and mean, and now they’re full of exception-handling and calls to other packages. R functions are spaghetti-like messes of connections in which I keep expecting to run into syntax like “GOTO 120.” I learned about these problems a couple years ago when writing bayesglm(), which is a simple adaptation of glm(). But glm(), and its workhorse, glm.fit(), are a mess: They’re about 10 lines of functioning code, plus about 20 lines of necessary front-end, plus a cou

3 0.84617758 272 andrew gelman stats-2010-09-13-Ross Ihaka to R: Drop Dead

Introduction: Christian Robert posts these thoughts : I [Ross Ihaka] have been worried for some time that R isn’t going to provide the base that we’re going to need for statistical computation in the future. (It may well be that the future is already upon us.) There are certainly efficiency problems (speed and memory use), but there are more fundamental issues too. Some of these were inherited from S and some are peculiar to R. One of the worst problems is scoping. Consider the following little gem. f =function() { if (runif(1) > .5) x = 10 x } The x being returned by this function is randomly local or global. There are other examples where variables alternate between local and non-local throughout the body of a function. No sensible language would allow this. It’s ugly and it makes optimisation really difficult. This isn’t the only problem, even weirder things happen because of interactions between scoping and lazy evaluation. In light of this, I [Ihaka] have come to the c

4 0.8304916 2190 andrew gelman stats-2014-01-29-Stupid R Tricks: Random Scope


5 0.81356508 1716 andrew gelman stats-2013-02-09-iPython Notebook

Introduction: Burak Bayramli writes: I wanted to inform you on iPython Notebook technology – allowing markup, Python code to reside in one document. Someone ported one of your examples from ARM . iPynb file is actually a live document, can be downloaded and reran locally, hence change of code on document means change of images, results. Graphs (as well as text output) which are generated by the code, are placed inside the document automatically. No more referencing image files seperately. For now running notebooks locally require a notebook server, but that part can live “on the cloud” as part of an educational software. Viewers, such as nbviewer.ipython.org, do not even need that much, since all recent results of a notebook are embedded in the notebook itself. A lot of people are excited about this; Also out of nowhere, Alfred P. Sloan Foundation dropped a $1.15 million grant on the developers of ipython which provided some extra energy on the project. Cool. We’ll have to do that ex

6 0.80410337 2089 andrew gelman stats-2013-11-04-Shlemiel the Software Developer and Unknown Unknowns

7 0.80049473 818 andrew gelman stats-2011-07-23-Parallel JAGS RNGs

8 0.79388779 2337 andrew gelman stats-2014-05-18-Never back down: The culture of poverty and the culture of journalism

9 0.78722388 470 andrew gelman stats-2010-12-16-“For individuals with wine training, however, we find indications of a positive relationship between price and enjoyment”

10 0.78169215 907 andrew gelman stats-2011-09-14-Reproducibility in Practice

11 0.78020006 1164 andrew gelman stats-2012-02-13-Help with this problem, win valuable prizes

12 0.77363229 1808 andrew gelman stats-2013-04-17-Excel-bashing

13 0.7734502 324 andrew gelman stats-2010-10-07-Contest for developing an R package recommendation system

14 0.77279365 105 andrew gelman stats-2010-06-23-More on those divorce prediction statistics, including a discussion of the innumeracy of (some) mathematicians

15 0.77232856 1462 andrew gelman stats-2012-08-18-Standardizing regression inputs

16 0.77124709 597 andrew gelman stats-2011-03-02-RStudio – new cross-platform IDE for R

17 0.7606768 1655 andrew gelman stats-2013-01-05-The statistics software signal

18 0.75262851 1470 andrew gelman stats-2012-08-26-Graphs showing regression uncertainty: the code!

19 0.75241143 268 andrew gelman stats-2010-09-10-Fighting Migraine with Multilevel Modeling

20 0.74909323 1403 andrew gelman stats-2012-07-02-Moving beyond hopeless graphics


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(9, 0.02), (15, 0.024), (16, 0.073), (17, 0.023), (21, 0.027), (22, 0.011), (24, 0.122), (34, 0.011), (35, 0.016), (43, 0.017), (57, 0.012), (61, 0.031), (73, 0.012), (75, 0.014), (76, 0.019), (77, 0.013), (78, 0.037), (86, 0.035), (95, 0.038), (99, 0.322)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.97962123 1910 andrew gelman stats-2013-06-22-Struggles over the criticism of the “cannabis users and IQ change” paper

Introduction: Ole Rogeberg points me to a discussion of a discussion of a paper: Did pre-release of my [Rogeberg's] PNAS paper on methodological problems with Meier et al’s 2012 paper on cannabis and IQ reduce the chances that it will have its intended effect? In my case, serious methodological issues related to causal inference from non-random observational data became framed as a conflict over conclusions, forcing the original research team to respond rapidly and insufficiently to my concerns, and prompting them to defend their conclusions and original paper in a way that makes a later, more comprehensive reanalysis of their data less likely. This fits with a recurring theme on this blog: the defensiveness of researchers who don’t want to admit they were wrong. Setting aside cases of outright fraud and plagiarism, I think the worst case remains that of psychologists Neil Anderson and Deniz Ones, who denied any problems even in the presence of a smoking gun of a graph revealing their data

2 0.97941774 1492 andrew gelman stats-2012-09-11-Using the “instrumental variables” or “potential outcomes” approach to clarify causal thinking

Introduction: As I’ve written here many times, my experiences in social science and public health research have left me skeptical of statistical methods that hypothesize or try to detect zero relationships between observational data (see, for example, the discussion starting at the bottom of page 960 in my review of causal inference in the American Journal of Sociology). In short, I have a taste for continuous rather than discrete models. As discussed in the above-linked article (with respect to the writings of cognitive scientist Steven Sloman), I think that common-sense thinking about causal inference can often mislead. In many cases, I have found that that the theoretical frameworks of instrumental variables and potential outcomes (for a review see, for example, chapters 9 and 10 of my book with Jennifer) help clarify my thinking. Here is an example that came up in a recent blog discussion. Computer science student Elias Bareinboim gave the following example: “suppose we know nothing a

same-blog 3 0.97828007 1807 andrew gelman stats-2013-04-17-Data problems, coding errors…what can be done?


4 0.97792006 775 andrew gelman stats-2011-06-21-Fundamental difficulty of inference for a ratio when the denominator could be positive or negative

Introduction: Ratio estimates are common in statistics. In survey sampling, the ratio estimate is when you use y/x to estimate Y/X (using the notation in which x,y are totals of sample measurements and X,Y are population totals). In textbook sampling examples, the denominator X will be an all-positive variable, something that is easy to measure and is, ideally, close to proportional to Y. For example, X is last year’s sales and Y is this year’s sales, or X is the number of people in a cluster and Y is some count. Ratio estimation doesn’t work so well if X can be either positive or negative. More generally we can consider any estimate of a ratio, with no need for a survey sampling context. The problem with estimating Y/X is that the very interpretation of Y/X can change completely if the sign of X changes. Everything is ok for a point estimate: you get X.hat and Y.hat, you can take the ratio Y.hat/X.hat, no problem. But the inference falls apart if you have enough uncertainty in X.hat th

5 0.97665632 110 andrew gelman stats-2010-06-26-Philosophy and the practice of Bayesian statistics

Introduction: Here’s an article that I believe is flat-out entertaining to read. It’s about philosophy, so it’s supposed to be entertaining, in any case. Here’s the abstract: A substantial school in the philosophy of science identifies Bayesian inference with inductive inference and even rationality as such, and seems to be strengthened by the rise and practical success of Bayesian statistics. We argue that the most successful forms of Bayesian statistics do not actually support that particular philosophy but rather accord much better with sophisticated forms of hypothetico-deductivism. We examine the actual role played by prior distributions in Bayesian models, and the crucial aspects of model checking and model revision, which fall outside the scope of Bayesian confirmation theory. We draw on the literature on the consistency of Bayesian updating and also on our experience of applied work in social science. Clarity about these matters should benefit not just philosophy of science, but

6 0.97658437 785 andrew gelman stats-2011-07-02-Experimental reasoning in social science

7 0.97622538 2337 andrew gelman stats-2014-05-18-Never back down: The culture of poverty and the culture of journalism

8 0.9761849 2303 andrew gelman stats-2014-04-23-Thinking of doing a list experiment? Here’s a list of reasons why you should think again

9 0.97614372 315 andrew gelman stats-2010-10-03-He doesn’t trust the fit . . . r=.999

10 0.97586888 431 andrew gelman stats-2010-11-26-One fun thing about physicists . . .

11 0.97584057 32 andrew gelman stats-2010-05-14-Causal inference in economics

12 0.97575808 2350 andrew gelman stats-2014-05-27-A whole fleet of gremlins: Looking more carefully at Richard Tol’s twice-corrected paper, “The Economic Effects of Climate Change”

13 0.97469652 670 andrew gelman stats-2011-04-20-Attractive but hard-to-read graph could be made much much better

14 0.97461504 1620 andrew gelman stats-2012-12-12-“Teaching effectiveness” as another dimension in cognitive ability

15 0.97444099 1289 andrew gelman stats-2012-04-29-We go to war with the data we have, not the data we want

16 0.97437203 2218 andrew gelman stats-2014-02-20-Do differences between biology and statistics explain some of our diverging attitudes regarding criticism and replication of scientific claims?

17 0.97429478 2235 andrew gelman stats-2014-03-06-How much time (if any) should we spend criticizing research that’s fraudulent, crappy, or just plain pointless?

18 0.97409517 2050 andrew gelman stats-2013-10-04-Discussion with Dan Kahan on political polarization, partisan information processing. And, more generally, the role of theory in empirical social science

19 0.97408396 2008 andrew gelman stats-2013-09-04-Does it matter that a sample is unrepresentative? It depends on the size of the treatment interactions

20 0.97383988 1996 andrew gelman stats-2013-08-24-All inference is about generalizing from sample to population