andrew_gelman_stats andrew_gelman_stats-2011 andrew_gelman_stats-2011-938 knowledge-graph by maker-knowledge-mining

938 andrew gelman stats-2011-10-03-Comparing prediction errors


meta information for this blog

Source: html

Introduction: Someone named James writes: I’m working on a classification task, sentence segmentation. The classifier algorithm we use (BoosTexter, a boosted learning algorithm) classifies each word independently conditional on its features, i.e. a bag-of-words model, so any contextual clues need to be encoded into the features. The feature extraction system I am proposing in my thesis uses a heteroscedastic LDA to transform data to produce the features the classifier runs on. The HLDA system has a couple of parameters I’m testing, and I’m running a 3×2 full factorial experiment. That’s the background, which may or may not be relevant to the question. The output of each trial is a class (there are only 2 classes, right now) for every word in the dataset. Because of the nature of the task, one class strongly predominates, say 90-95% of the data. My question is this: in terms of overall performance (we use F1 score), many of these trials are pretty close together, which leads me to ask whether the parameter settings don’t matter, or whether they do but the performance of the trials just happened to be very similar. Is there a statistical test to see if two trials (a vector of classes calculated from the same ordered list of data) are significantly different, especially given that they will both pick the majority class very often?
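Given the 90-95% class imbalance James describes, even quite different parameter settings can produce overall scores that sit close together. A quick hypothetical sketch (not from the post; the flip rates are invented) of how that bunching looks:

```python
# Hypothetical illustration: two classifiers that each flip ~2% of labels
# on data with a small minority class end up with very similar F1 scores.
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
y_true = rng.random(10_000) < 0.07             # ~7% minority (positive) class
pred_a = y_true ^ (rng.random(10_000) < 0.02)  # trial A: ~2% of words flipped
pred_b = y_true ^ (rng.random(10_000) < 0.02)  # trial B: independently similar

print(f1_score(y_true, pred_a))  # the two scores typically land close together
print(f1_score(y_true, pred_b))
```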


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Someone named James writes: I’m working on a classification task, sentence segmentation. [sent-1, score-0.245]

2 The classifier algorithm we use (BoosTexter, a boosted learning algorithm) classifies each word independently conditional on its features, i.e. [sent-2, score-0.611]

3 a bag-of-words model, so any contextual clues need to be encoded into the features. [sent-4, score-0.372]

4 The feature extraction system I am proposing in my thesis uses a heteroscedastic LDA to transform data to produce the features the classifier runs on. [sent-5, score-1.14]

5 The HLDA system has a couple parameters I’m testing, and I’m running a 3×2 full factorial experiment. [sent-6, score-0.239]

6 That’s the background which may or may not be relevant to the question. [sent-7, score-0.144]

7 The output of each trial is a class (there are only 2 classes, right now) for every word in the dataset. [sent-8, score-0.446]

8 Because of the nature of the task, one class strongly predominates, say 90-95% of the data. [sent-9, score-0.152]

9 My question is this: in terms of overall performance (we use F1 score), many of these trials are pretty close together, which leads me to ask whether the parameter settings don’t matter, or whether they do but the performance of the trials just happened to be very similar. [sent-10, score-0.73]

10 Is there a statistical test to see if two trials (a vector of classes calculated from the same ordered list of data) are significantly different, especially given that they will both pick the majority class very often? [sent-11, score-0.984]

11 My reply: I too have found that error rates and even log-likelihoods can be noisy measures of prediction accuracy for discrete-data models. [sent-12, score-0.283]

12 One way you could compare two methods would be a head-to-head comparison where you saw which cases were predicted correctly by both, incorrectly by both, or correctly by A and incorrectly by B (or vice versa). [sent-13, score-1.314]

13 What’s left should be more informative about the differences; essentially you’re doing something closer to a matched-pairs or two-way comparison, which will give you information even when the prediction errors are correlated (which I expect they will be). [sent-15, score-0.326]
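This head-to-head tabulation is essentially what McNemar’s test formalizes: the concordant cases (predicted correctly by both, or incorrectly by both) are set aside, and the test asks whether the two discordant counts are compatible with the methods being equally accurate. A minimal sketch, assuming statsmodels is available; the labels and predictions below are made up for illustration:

```python
# Matched-pairs comparison of two classifiers on the same data, in the
# spirit of the reply above, via McNemar's test on the discordant cases.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def compare_trials(y_true, pred_a, pred_b):
    a_ok = pred_a == y_true
    b_ok = pred_b == y_true
    # 2x2 table of (A correct?, B correct?); only the off-diagonal
    # (discordant) cells carry information about the difference.
    table = np.array([
        [np.sum(a_ok & b_ok),  np.sum(a_ok & ~b_ok)],
        [np.sum(~a_ok & b_ok), np.sum(~a_ok & ~b_ok)],
    ])
    return table, mcnemar(table, exact=True)

# Hypothetical gold labels and two trials' predictions (0 = majority class):
y  = np.array([0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1])
pa = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1])
pb = np.array([0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0])
table, result = compare_trials(y, pa, pb)
print(table)          # most mass sits in the "both correct" cell
print(result.pvalue)  # driven only by the cases where the trials disagree
```

Because both trials pick the majority class for most words, the "both correct" cell dominates the table; the paired comparison stays informative precisely because it conditions on that agreement, which is the point about correlated prediction errors.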


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('classifier', 0.246), ('trials', 0.241), ('correctly', 0.223), ('exclude', 0.215), ('incorrectly', 0.201), ('cases', 0.2), ('task', 0.161), ('class', 0.152), ('classes', 0.147), ('predicted', 0.146), ('algorithm', 0.144), ('features', 0.14), ('encoded', 0.136), ('factorial', 0.136), ('lda', 0.136), ('conditions', 0.135), ('prediction', 0.131), ('clues', 0.129), ('extraction', 0.129), ('word', 0.125), ('performance', 0.124), ('comparison', 0.12), ('proposing', 0.115), ('intercepts', 0.11), ('contextual', 0.107), ('system', 0.103), ('transform', 0.098), ('classification', 0.098), ('independently', 0.096), ('calculated', 0.096), ('ordered', 0.095), ('tricky', 0.095), ('vector', 0.094), ('significantly', 0.086), ('output', 0.085), ('trial', 0.084), ('thesis', 0.082), ('runs', 0.079), ('noisy', 0.079), ('closer', 0.075), ('varying', 0.075), ('produce', 0.075), ('named', 0.074), ('accuracy', 0.073), ('sentence', 0.073), ('majority', 0.073), ('feature', 0.073), ('score', 0.073), ('may', 0.072), ('logistic', 0.07)]
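The list above gives the top tf-idf weights for this post, and the simValue figures below are similarity scores against the rest of the corpus. The mining pipeline itself isn’t documented here, but a minimal scikit-learn sketch of the same idea, on a made-up mini-corpus, would look like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [  # hypothetical stand-ins for the blog posts in the index
    "comparing prediction errors classifier trials correctly incorrectly",
    "ethical concerns in medical trials volunteers",
    "prior beliefs about locations of decision boundaries logistic regression",
]
vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)

# Top-weighted words for the first post, analogous to the (word, weight) list above
row = X[0].toarray().ravel()
terms = vec.get_feature_names_out()
top = row.argsort()[::-1][:5]
print([(terms[i], round(row[i], 3)) for i in top])

# Cosine similarities, analogous to the simValue column in the lists below
print(cosine_similarity(X)[0].round(3))
```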

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999982 938 andrew gelman stats-2011-10-03-Comparing prediction errors


2 0.11168783 411 andrew gelman stats-2010-11-13-Ethical concerns in medical trials

Introduction: I just read this article on the treatment of medical volunteers, written by doctor and bioethicist Carl Elliott. As a statistician who has done a small amount of consulting for pharmaceutical companies, I have a slightly different perspective. As a doctor, Elliott focuses on individual patients, whereas, as a statistician, I’ve been trained to focus on the goal of accurately estimating treatment effects. I’ll go through Elliott’s article and give my reactions. Elliott: In Miami, investigative reporters for Bloomberg Markets magazine discovered that a contract research organisation called SFBC International was testing drugs on undocumented immigrants in a rundown motel; since that report, the motel has been demolished for fire and safety violations. . . . SFBC had recently been named one of the best small businesses in America by Forbes magazine. The Holiday Inn testing facility was the largest in North America, and had been operating for nearly ten years before inspecto

3 0.10363732 1605 andrew gelman stats-2012-12-04-Write This Book

Introduction: This post is by Phil Price. I’ve been preparing a review of a new statistics textbook aimed at students and practitioners in the “physical sciences,” as distinct from the social sciences and also distinct from people who intend to take more statistics courses. I figured that since it’s been years since I looked at an intro stats textbook, I should look at a few others and see how they differ from this one, so in addition to the book I’m reviewing I’ve looked at some other textbooks aimed at similar audiences: Milton and Arnold; Hines, Montgomery, Goldsman, and Borror; and a few others. I also looked at the table of contents of several more. There is a lot of overlap in the coverage of these books — they all have discussions of common discrete and continuous distributions, joint distributions, descriptive statistics, parameter estimation, hypothesis testing, linear regression, ANOVA, factorial experimental design, and a few other topics. I can see how, from a statisti

4 0.097815558 1130 andrew gelman stats-2012-01-20-Prior beliefs about locations of decision boundaries

Introduction: Forest Gregg writes: I want to incorporate a prior belief into an estimation of a logistic regression classifier of points distributed in a 2d space. My prior belief is a funny kind of prior though. It’s a belief about where the decision boundary between classes should fall. Over the 2d space, I lay a grid, and I believe that a decision boundary that separates any two classes should fall along any of the grid lines with some probability, and that the decision boundary should fall anywhere except a gridline with a much lower probability. For the two class case, and a logistic regression model parameterized by W and data X, my prior could perhaps be expressed Pr(W) = (normalizing constant)/exp(d) where d = f(grid,W,X) such that when logistic(W^TX)= .5 and X is ‘far’ from grid lines, then d is large. Have you ever seen a model like this, or do you have any notions about a good avenue to pursue? My real data consist of geocoded Craigslist’s postings that are labeled with the

5 0.097707696 1191 andrew gelman stats-2012-03-01-Hoe noem je?

Introduction: Gerrit Storms reports on an interesting linguistic research project in which you can participate! Here’s the description: Over the past few weeks, we have been trying to set up a scientific study that is important for many researchers interested in words, word meaning, semantics, and cognitive science in general. It is a huge word association project, in which people are asked to participate in a small task that doesn’t last longer than 5 minutes. Our goal is to build a global word association network that contains connections between about 40,000 words, the size of the lexicon of an average adult. Setting up such a network might teach us a lot about semantic memory, how it develops, and maybe also about how it can deteriorate (like in Alzheimer’s disease). Most people enjoy doing the task, but we need thousands of participants to succeed. Up till today, we found about 53,000 participants willing to do the little task, but we need more subjects. That is why we address you. Would

6 0.096038125 888 andrew gelman stats-2011-09-03-A psychology researcher asks: Is Anova dead?

7 0.09482915 726 andrew gelman stats-2011-05-22-Handling multiple versions of an outcome variable

8 0.09363161 861 andrew gelman stats-2011-08-19-Will Stan work well with 40×40 matrices?

9 0.093567021 1289 andrew gelman stats-2012-04-29-We go to war with the data we have, not the data we want

10 0.092973158 2115 andrew gelman stats-2013-11-27-Three unblinded mice

11 0.091704935 1933 andrew gelman stats-2013-07-10-Please send all comments to -dev-ripley

12 0.091302946 2129 andrew gelman stats-2013-12-10-Cross-validation and Bayesian estimation of tuning parameters

13 0.089595132 244 andrew gelman stats-2010-08-30-Useful models, model checking, and external validation: a mini-discussion

14 0.088621438 318 andrew gelman stats-2010-10-04-U-Haul statistics

15 0.087805703 1431 andrew gelman stats-2012-07-27-Overfitting

16 0.085573032 1966 andrew gelman stats-2013-08-03-Uncertainty in parameter estimates using multilevel models

17 0.084522203 1368 andrew gelman stats-2012-06-06-Question 27 of my final exam for Design and Analysis of Sample Surveys

18 0.084412798 350 andrew gelman stats-2010-10-18-Subtle statistical issues to be debated on TV.

19 0.082832567 2041 andrew gelman stats-2013-09-27-Setting up Jitts online

20 0.082770742 2017 andrew gelman stats-2013-09-11-“Informative g-Priors for Logistic Regression”


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.177), (1, 0.059), (2, 0.031), (3, -0.015), (4, 0.053), (5, 0.036), (6, 0.033), (7, 0.012), (8, -0.008), (9, 0.01), (10, 0.01), (11, 0.048), (12, -0.014), (13, -0.046), (14, -0.031), (15, -0.021), (16, 0.007), (17, -0.02), (18, 0.009), (19, -0.004), (20, 0.008), (21, 0.031), (22, 0.006), (23, 0.014), (24, -0.001), (25, -0.011), (26, 0.017), (27, -0.015), (28, 0.022), (29, -0.009), (30, 0.019), (31, 0.031), (32, 0.028), (33, 0.027), (34, 0.004), (35, -0.033), (36, -0.031), (37, 0.009), (38, 0.023), (39, -0.006), (40, 0.023), (41, 0.001), (42, -0.032), (43, -0.002), (44, -0.024), (45, 0.0), (46, -0.01), (47, -0.012), (48, 0.031), (49, 0.006)]
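The 50 (topicId, topicWeight) pairs above are this post’s coordinates in a latent semantic indexing (LSI) space. Assuming the usual construction, truncated SVD over the tf-idf matrix, a toy sketch (the corpus and component count are illustrative, not the pipeline’s actual settings):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [  # hypothetical stand-ins; the real index appears to use 50 topics
    "comparing prediction errors classifier trials",
    "handling multiple versions of an outcome variable",
    "there are never seventy distinct parameters regression",
]
X = TfidfVectorizer().fit_transform(docs)
Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# Rows of Z play the role of the topicWeight vectors above; similarities in
# this reduced space give simValue-style scores as in the list below.
print(cosine_similarity(Z)[0].round(3))
```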

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.9663617 938 andrew gelman stats-2011-10-03-Comparing prediction errors


2 0.7789219 327 andrew gelman stats-2010-10-07-There are never 70 distinct parameters

Introduction: Sam Seaver writes: I’m a graduate student in computational biology, and I’m relatively new to advanced statistics, and am trying to teach myself how best to approach a problem I have. My dataset is a small sparse matrix of 150 cases and 70 predictors, it is sparse as in many zeros, not many ‘NA’s. Each case is a nutrient that is fed into an in silico organism, and its response is whether or not it stimulates growth, and each predictor is one of 70 different pathways that the nutrient may or may not belong to. Because all of the nutrients do not belong to all of the pathways, there are thus many zeros in my matrix. My goal is to be able to use the pathways themselves to predict whether or not a nutrient could stimulate growth, thus I wanted to compute regression coefficients for each pathway, with which I could apply to other nutrients for other species. There are quite a few singularities in the dataset (summary(glm) reports that 14 coefficients are not defined because of sin

3 0.77418804 726 andrew gelman stats-2011-05-22-Handling multiple versions of an outcome variable

Introduction: Jay Ulfelder asks: I have a question for you about what to do in a situation where you have two measures of your dependent variable and no prior reasons to strongly favor one over the other. Here’s what brings this up: I’m working on a project with Michael Ross where we’re modeling transitions to and from democracy in countries worldwide since 1960 to estimate the effects of oil income on the likelihood of those events’ occurrence. We’ve got a TSCS data set, and we’re using a discrete-time event history design, splitting the sample by regime type at the start of each year and then using multilevel logistic regression models with parametric measures of time at risk and random intercepts at the country and region levels. (We’re also checking for the usefulness of random slopes for oil wealth at one or the other level and then including them if they improve a model’s goodness of fit.) All of this is being done in Stata with the gllamm module. Our problem is that we have two plausib

4 0.76899374 1142 andrew gelman stats-2012-01-29-Difficulties with the 1-4-power transformation

Introduction: John Hayes writes: I am a fan of the quarter root transform ever since reading about it on your blog . However, today my student and I hit a wall that I’m hoping you might have some insight on. By training, I am a psychophysicist (think SS Stevens), and people in my field often log transform data prior to analysis. However, this data frequently contains zeros, so I’ve tried using quarter root transforms to get around this. But until today, I had never tried to back transform the plot axis for readability. I assumed this would be straightforward – alas it is not. Specifically, we quarter root transformed our data, performed an ANOVA, got what we thought was a reasonable effect, and then plotted the data. So far so good. However, the LS means in question are below 1, meaning that raising them to the 4th power just makes them smaller, and uninterpretable in the original metric. Do you have any thoughts or insights you might share? My reply: I don’t see the problem with pre

5 0.76798129 888 andrew gelman stats-2011-09-03-A psychology researcher asks: Is Anova dead?

Introduction: A research psychologist writes in with a question that’s so long that I’ll put my answer first, then put the question itself below the fold. Here’s my reply: As I wrote in my Anova paper and in my book with Jennifer Hill, I do think that multilevel models can completely replace Anova. At the same time, I think the central idea of Anova should persist in our understanding of these models. To me the central idea of Anova is not F-tests or p-values or sums of squares, but rather the idea of predicting an outcome based on factors with discrete levels, and understanding these factors using variance components. The continuous or categorical response thing doesn’t really matter so much to me. I have no problem using a normal linear model for continuous outcomes (perhaps suitably transformed) and a logistic model for binary outcomes. I don’t want to throw away interactions just because they’re not statistically significant. I’d rather partially pool them toward zero using an inform

6 0.74706638 2041 andrew gelman stats-2013-09-27-Setting up Jitts online

7 0.7466656 1164 andrew gelman stats-2012-02-13-Help with this problem, win valuable prizes

8 0.74475992 1703 andrew gelman stats-2013-02-02-Interaction-based feature selection and classification for high-dimensional biological data

9 0.74384534 245 andrew gelman stats-2010-08-31-Predicting marathon times

10 0.7433601 315 andrew gelman stats-2010-10-03-He doesn’t trust the fit . . . r=.999

11 0.73932683 250 andrew gelman stats-2010-09-02-Blending results from two relatively independent multi-level models

12 0.73818564 1807 andrew gelman stats-2013-04-17-Data problems, coding errors…what can be done?

13 0.7381835 2311 andrew gelman stats-2014-04-29-Bayesian Uncertainty Quantification for Differential Equations!

14 0.73488754 212 andrew gelman stats-2010-08-17-Futures contracts, Granger causality, and my preference for estimation to testing

15 0.73298699 1070 andrew gelman stats-2011-12-19-The scope for snooping

16 0.72926837 1918 andrew gelman stats-2013-06-29-Going negative

17 0.72521317 272 andrew gelman stats-2010-09-13-Ross Ihaka to R: Drop Dead

18 0.72503889 246 andrew gelman stats-2010-08-31-Somewhat Bayesian multilevel modeling

19 0.72463506 226 andrew gelman stats-2010-08-23-More on those L.A. Times estimates of teacher effectiveness

20 0.7245627 2364 andrew gelman stats-2014-06-08-Regression and causality and variable ordering


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(9, 0.015), (16, 0.044), (24, 0.614), (99, 0.207)]
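The sparse mixture above, with most topics omitted, matches the output format of a gensim-style LDA model, which reports only topics above a probability threshold. A toy sketch under that assumption (the corpus and topic count are hypothetical):

```python
from gensim import corpora
from gensim.models import LdaModel

texts = [  # hypothetical tokenized posts
    ["prediction", "errors", "classifier", "trials", "classes"],
    ["medical", "trials", "ethics", "volunteers"],
    ["logistic", "regression", "priors", "classes"],
]
dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(bow, id2word=dictionary, num_topics=4, random_state=0)

# Sparse [(topicId, topicWeight), ...] for the first post, as in the list above
print(lda[bow[0]])
```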

similar blogs list:

simIndex simValue blogId blogTitle

1 0.99720675 1046 andrew gelman stats-2011-12-07-Neutral noninformative and informative conjugate beta and gamma prior distributions

Introduction: Jouni Kerman did a cool bit of research justifying the Beta (1/3, 1/3) prior as noninformative for binomial data, and the Gamma (1/3, 0) prior for Poisson data. You probably thought that nothing new could be said about noninformative priors in such basic problems, but you were wrong! Here’s the story : The conjugate binomial and Poisson models are commonly used for estimating proportions or rates. However, it is not well known that the conventional noninformative conjugate priors tend to shrink the posterior quantiles toward the boundary or toward the middle of the parameter space, making them thus appear excessively informative. The shrinkage is always largest when the number of observed events is small. This behavior persists for all sample sizes and exposures. The effect of the prior is therefore most conspicuous and potentially controversial when analyzing rare events. As alternative default conjugate priors, I [Jouni] introduce Beta(1/3, 1/3) and Gamma(1/3, 0), which I cal

2 0.99671447 240 andrew gelman stats-2010-08-29-ARM solutions

Introduction: People sometimes email asking if a solution set is available for the exercises in ARM. The answer, unfortunately, is no. Many years ago, I wrote up 50 solutions for BDA and it was a lot of work–really, it was like writing a small book in itself. The trouble is that, once I started writing them up, I wanted to do it right, to set a good example. That’s a lot more effort than simply scrawling down some quick answers.

3 0.9949646 1437 andrew gelman stats-2012-07-31-Paying survey respondents

Introduction: I agree with Casey Mulligan that participants in government surveys should be paid, and I think it should be part of the code of ethics for commercial pollsters to compensate their respondents also. As Mulligan points out, if a survey is worth doing, it should be worth compensating the participants for their time and effort. P.S. Just to clarify, I do not recommend that Census surveys be made voluntary, I just think that respondents (who can be required to participate) should be paid a small amount. P.P.S. More rant here .

4 0.99334246 545 andrew gelman stats-2011-01-30-New innovations in spam

Introduction: I received the following (unsolicited) email today: Hello Andrew, I’m interested in whether you are accepting guest article submissions for your site Statistical Modeling, Causal Inference, and Social Science? I’m the owner of the recently created nonprofit site OnlineEngineeringDegree.org and am interested in writing / submitting an article for your consideration to be published on your site. Is that something you’d be willing to consider, and if so, what specs in terms of topics or length requirements would you be looking for? Thanks you for your time, and if you have any questions or are interested, I’d appreciate you letting me know. Sincerely, Samantha Rhodes Huh? P.S. My vote for most obnoxious spam remains this one , which does its best to dilute whatever remains of the reputation of Wolfram Research. Or maybe that particular bit of spam was written by a particularly awesome cellular automaton that Wolfram discovered? I guess in the world of big-time software

5 0.99291903 643 andrew gelman stats-2011-04-02-So-called Bayesian hypothesis testing is just as bad as regular hypothesis testing

Introduction: Steve Ziliak points me to this article by the always-excellent Carl Bialik, slamming hypothesis tests. I only wish Carl had talked with me before so hastily posting, though! I would’ve argued with some of the things in the article. In particular, he writes: Reese and Brad Carlin . . . suggest that Bayesian statistics are a better alternative, because they tackle the probability that the hypothesis is true head-on, and incorporate prior knowledge about the variables involved. Brad Carlin does great work in theory, methods, and applications, and I like the bit about the prior knowledge (although I might prefer the more general phrase “additional information”), but I hate that quote! My quick response is that the hypothesis of zero effect is almost never true! The problem with the significance testing framework–Bayesian or otherwise–is in the obsession with the possibility of an exact zero effect. The real concern is not with zero, it’s with claiming a positive effect whe

6 0.99281573 471 andrew gelman stats-2010-12-17-Attractive models (and data) wanted for statistical art show.

7 0.98188722 38 andrew gelman stats-2010-05-18-Breastfeeding, infant hyperbilirubinemia, statistical graphics, and modern medicine

same-blog 8 0.9737035 938 andrew gelman stats-2011-10-03-Comparing prediction errors

9 0.97277695 241 andrew gelman stats-2010-08-29-Ethics and statistics in development research

10 0.97128439 1092 andrew gelman stats-2011-12-29-More by Berger and me on weakly informative priors

11 0.97119302 1978 andrew gelman stats-2013-08-12-Fixing the race, ethnicity, and national origin questions on the U.S. Census

12 0.97035122 1479 andrew gelman stats-2012-09-01-Mothers and Moms

13 0.96425509 59 andrew gelman stats-2010-05-30-Extended Binary Format Support for Mac OS X

14 0.96405131 1787 andrew gelman stats-2013-04-04-Wanna be the next Tyler Cowen? It’s not as easy as you might think!

15 0.96283305 2229 andrew gelman stats-2014-02-28-God-leaf-tree

16 0.96223062 482 andrew gelman stats-2010-12-23-Capitalism as a form of voluntarism

17 0.95506996 1706 andrew gelman stats-2013-02-04-Too many MC’s not enough MIC’s, or What principles should govern attempts to summarize bivariate associations in large multivariate datasets?

18 0.95173413 1891 andrew gelman stats-2013-06-09-“Heterogeneity of variance in experimental studies: A challenge to conventional interpretations”

19 0.9512974 743 andrew gelman stats-2011-06-03-An argument that can’t possibly make sense

20 0.95052004 373 andrew gelman stats-2010-10-27-It’s better than being forwarded the latest works of you-know-who