andrew_gelman_stats andrew_gelman_stats-2013 andrew_gelman_stats-2013-1875 knowledge-graph by maker-knowledge-mining

1875 andrew gelman stats-2013-05-28-Simplify until your fake-data check works, then add complications until you can figure out where the problem is coming from


meta info for this blog

Source: html

Introduction: I received the following email: I am trying to develop a Bayesian model to represent the process through which individual consumers make online product rating decisions. In my model each individual faces a total of J product options, and for each product option (j) each individual (i) needs to make three sequential decisions: - First, he decides whether or not to consume a specific product option (j) (choice decision) - If he decides to consume product option j, then after consumption he decides whether or not to rate it (incidence decision) - If he decides to rate product j, then he finally decides what rating (k) to assign to it (evaluation decision) We model this decision sequence in terms of three equations. A binary response variable in the first equation represents the choice decision. Another binary response variable in the second equation represents the incidence decision, which is observable only when the first selection decision is 1. Finally, an ordered response variable in the third stage captures the extent of preference of individual i for product j.


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 I received the following email: I am trying to develop a Bayesian model to represent the process through which individual consumers make online product rating decisions. [sent-1, score-0.974]

2 A binary response variable in the first equation represents the choice decision. [sent-3, score-0.808]

3 Another binary response variable in the second equation represents the incidence decision that is observable only when first selection decision is 1. [sent-4, score-1.507]

4 Finally, an ordered response variable in the third stage captures the extent of preference of individual i for product j. [sent-5, score-1.241]

5 This ordered response (rating) is observed only when both first and second decisions are 1. [sent-6, score-0.523]

6 Each of these response variables in turn are dictated by a corresponding latent variable that is assumed to be linearly related to a set of product characteristics. [sent-7, score-1.008]

7 I have been able to implement the estimation algorithm in R. [sent-8, score-0.342]

8 However, when I tried to apply the algorithm to a simulated data set with known parameter values it failed to recover the parameters. [sent-9, score-0.309]

9 I was wondering if there is something wrong with the estimation method. [sent-10, score-0.181]

10 I am attaching a document outlining the model and the proposed estimation framework. [sent-11, score-0.643]

11 It would be of immense help if you could kindly have a look at the model and the proposed estimation strategy and suggest any improvements or modifications needed. [sent-12, score-0.642]

12 I replied: I don’t have time to read this, but just to give some general advice: if your fake-data check does not recover your model, I recommend you simplify your model. [sent-13, score-0.228]

13 Go simpler and simpler until you can get it to work, then from there you can try to identify what is the problem. [sent-14, score-0.232]
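The fake-data check discussed in this exchange can be sketched in a few lines. This is a minimal illustration only: it uses an ordinary linear model and least squares as a stand-in for the correspondent's three-equation Bayesian model, and all names and parameter values are hypothetical, not the actual algorithm from the email:

```python
import numpy as np

rng = np.random.default_rng(42)

# Step 1: pick known "true" parameter values.
true_beta = np.array([1.5, -2.0])
true_sigma = 0.5

# Step 2: simulate fake data from the model using those parameters.
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ true_beta + rng.normal(scale=true_sigma, size=n)

# Step 3: run the estimation procedure on the fake data.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma_hat = resid.std(ddof=X.shape[1])

# Step 4: check that the estimates recover the known truth
# (here, to within a tolerance loose enough for sampling noise).
print(beta_hat, sigma_hat)
assert np.allclose(beta_hat, true_beta, atol=0.05)
assert abs(sigma_hat - true_sigma) < 0.05
```

If a check like this fails, the advice above applies: strip the model down until the check passes, then add the dropped pieces back one at a time to locate the problem.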


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('product', 0.456), ('decides', 0.372), ('decision', 0.248), ('rating', 0.191), ('estimation', 0.181), ('consume', 0.168), ('option', 0.163), ('response', 0.162), ('variable', 0.157), ('incidence', 0.149), ('recover', 0.144), ('individual', 0.142), ('ordered', 0.134), ('equation', 0.122), ('simpler', 0.116), ('model', 0.114), ('binary', 0.108), ('proposed', 0.105), ('represents', 0.101), ('algorithm', 0.101), ('outlining', 0.096), ('decisions', 0.092), ('linearly', 0.091), ('attaching', 0.087), ('immense', 0.084), ('simplify', 0.084), ('dictated', 0.084), ('kindly', 0.084), ('choice', 0.083), ('sequential', 0.081), ('observable', 0.077), ('finally', 0.077), ('rate', 0.076), ('first', 0.075), ('modification', 0.074), ('faces', 0.072), ('consumers', 0.071), ('captures', 0.069), ('sequence', 0.067), ('simulated', 0.064), ('three', 0.062), ('preference', 0.061), ('assign', 0.061), ('stage', 0.06), ('document', 0.06), ('second', 0.06), ('implement', 0.06), ('consumption', 0.059), ('latent', 0.058), ('options', 0.057)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999994 1875 andrew gelman stats-2013-05-28-Simplify until your fake-data check works, then add complications until you can figure out where the problem is coming from


2 0.29504025 577 andrew gelman stats-2011-02-16-Annals of really really stupid spam

Introduction: This came in the inbox today: Dear Dr. Gelman, GenWay recently found your article titled “Multiple imputation for model checking: completed-data plots with missing and latent data.” (Biometrics. 2005 Mar;61(1):74-85.) and thought you might be interested in learning about our superior quality signaling proteins. GenWay prides itself on being a leader in customer service aiming to exceed your expectations with the quality and price of our products. With more than 60,000 reagents backed by our outstanding guarantee you are sure to find the products you have been searching for. Please feel free to visit the following resource pages: * Apoptosis Pathway (product list) * Adipocytokine (product list) * Cell Cycle Pathway (product list) * Jak STAT (product list) * GnRH (product list) * MAPK (product list) * mTOR (product list) * T Cell Receptor (product list) * TGF-beta (product list) * Wnt (product list) * View All Pathways

3 0.10637002 852 andrew gelman stats-2011-08-13-Checking your model using fake data

Introduction: Someone sent me the following email: I tried to do a logistic regression . . . I programmed the model in different ways and got different answers . . . can’t get the results to match . . . What am I doing wrong? . . . Here’s my code . . . I didn’t have the time to look at his code so I gave the following general response: One way to check things is to try simulating data from the fitted model, then fit your model again to the simulated data and see what happens. P.S. He followed my suggestion and responded a few days later: Yeah, that did the trick! I was treating a factor variable as a covariate! I love it when generic advice works out!

4 0.1053142 1972 andrew gelman stats-2013-08-07-When you’re planning on fitting a model, build up to it by fitting simpler models first. Then, once you have a model you like, check the hell out of it

Introduction: In response to my remarks on his online book, Think Bayes, Allen Downey wrote: I [Downey] have a question about one of your comments: My [Gelman's] main criticism with both books is that they talk a lot about inference but not so much about model building or model checking (recall the three steps of Bayesian data analysis). I think it’s ok for an introductory book to focus on inference, which of course is central to the data-analytic process—but I’d like them to at least mention that Bayesian ideas arise in model building and model checking as well. This sounds like something I agree with, and one of the things I tried to do in the book is to put modeling decisions front and center. But the word “modeling” is used in lots of ways, so I want to see if we are talking about the same thing. For example, in many chapters, I start with a simple model of the scenario, do some analysis, then check whether the model is good enough, and iterate. Here’s the discussion of modeling

5 0.10141382 780 andrew gelman stats-2011-06-27-Bridges between deterministic and probabilistic models for binary data

Introduction: For the analysis of binary data, various deterministic models have been proposed, which are generally simpler to fit and easier to understand than probabilistic models. We claim that corresponding to any deterministic model is an implicit stochastic model in which the deterministic model fits imperfectly, with errors occurring at random. In the context of binary data, we consider a model in which the probability of error depends on the model prediction. We show how to fit this model using a stochastic modification of deterministic optimization schemes. The advantages of fitting the stochastic model explicitly (rather than implicitly, by simply fitting a deterministic model and accepting the occurrence of errors) include quantification of uncertainty in the deterministic model’s parameter estimates, better estimation of the true model error rate, and the ability to check the fit of the model nontrivially. We illustrate this with a simple theoretical example of item response data and w

6 0.096487895 782 andrew gelman stats-2011-06-29-Putting together multinomial discrete regressions by combining simple logits

7 0.094430469 24 andrew gelman stats-2010-05-09-Special journal issue on statistical methods for the social sciences

8 0.094430201 759 andrew gelman stats-2011-06-11-“2 level logit with 2 REs & large sample. computational nightmare – please help”

9 0.09246821 2192 andrew gelman stats-2014-01-30-History is too important to be left to the history professors, Part 2

10 0.086365893 328 andrew gelman stats-2010-10-08-Displaying a fitted multilevel model

11 0.084753498 1392 andrew gelman stats-2012-06-26-Occam

12 0.083946303 216 andrew gelman stats-2010-08-18-More forecasting competitions

13 0.08159741 1462 andrew gelman stats-2012-08-18-Standardizing regression inputs

14 0.081522882 1703 andrew gelman stats-2013-02-02-Interaction-based feature selection and classification for high-dimensional biological data

15 0.079501197 2312 andrew gelman stats-2014-04-29-Ken Rice presents a unifying approach to statistical inference and hypothesis testing

16 0.078722969 1218 andrew gelman stats-2012-03-18-Check your missing-data imputations using cross-validation

17 0.078146793 744 andrew gelman stats-2011-06-03-Statistical methods for healthcare regulation: rating, screening and surveillance

18 0.078103386 888 andrew gelman stats-2011-09-03-A psychology researcher asks: Is Anova dead?

19 0.077672802 1409 andrew gelman stats-2012-07-08-Is linear regression unethical in that it gives more weight to cases that are far from the average?

20 0.077327773 758 andrew gelman stats-2011-06-11-Hey, good news! Your p-value just passed the 0.05 threshold!


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.125), (1, 0.083), (2, 0.026), (3, 0.014), (4, 0.039), (5, 0.02), (6, 0.008), (7, -0.03), (8, 0.059), (9, 0.05), (10, 0.002), (11, 0.012), (12, -0.028), (13, -0.009), (14, -0.044), (15, 0.009), (16, 0.053), (17, -0.035), (18, -0.03), (19, 0.02), (20, 0.018), (21, 0.04), (22, 0.042), (23, -0.047), (24, -0.031), (25, -0.001), (26, 0.03), (27, -0.047), (28, 0.015), (29, -0.026), (30, -0.056), (31, -0.01), (32, 0.002), (33, 0.024), (34, -0.011), (35, -0.06), (36, 0.002), (37, 0.018), (38, -0.01), (39, 0.034), (40, -0.007), (41, -0.006), (42, -0.04), (43, 0.011), (44, -0.003), (45, -0.017), (46, 0.01), (47, -0.003), (48, 0.038), (49, 0.051)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97052878 1875 andrew gelman stats-2013-05-28-Simplify until your fake-data check works, then add complications until you can figure out where the problem is coming from


2 0.78230858 782 andrew gelman stats-2011-06-29-Putting together multinomial discrete regressions by combining simple logits

Introduction: When predicting 0/1 data we can use logit (or probit or robit or some other robust model such as invlogit (0.01 + 0.98*X*beta)). Logit is simple enough and we can use bayesglm to regularize and avoid the problem of separation. What if there are more than 2 categories? If they’re ordered (1, 2, 3, etc), we can do ordered logit (and use bayespolr() to avoid separation). If the categories are unordered (vanilla, chocolate, strawberry), there are unordered multinomial logit and probit models out there. But it’s not so easy to fit these multinomial models in a multilevel setting (with coefficients that vary by group), especially if the computation is embedded in an iterative routine such as mi where you have real time constraints at each step. So this got me wondering whether we could kluge it with logits. Here’s the basic idea (in the ordered and unordered forms): - If you have a variable that goes 1, 2, 3, etc., set up a series of logits: 1 vs. 2,3,…; 2 vs. 3,…; and so forth
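The series-of-logits idea in the excerpt above can be sketched roughly as follows. This is a hypothetical continuation-ratio setup on simulated three-category data, using scikit-learn's plain LogisticRegression in place of bayesglm/bayespolr; the data-generating values are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Fake ordered outcome in {1, 2, 3} driven by a single predictor x.
n = 2000
x = rng.normal(size=n)
latent = 1.2 * x + rng.logistic(size=n)
y = np.digitize(latent, bins=[-1.0, 1.0]) + 1  # cutpoints -> categories 1, 2, 3

X = x.reshape(-1, 1)

# Logit 1: category 1 vs. {2, 3}, fit on everyone.
fit1 = LogisticRegression().fit(X, (y >= 2).astype(int))

# Logit 2: category 2 vs. 3, fit only on cases that passed the first split.
mask = y >= 2
fit2 = LogisticRegression().fit(X[mask], (y[mask] == 3).astype(int))

# With a common latent slope, both logit slopes should be positive
# and of broadly similar magnitude.
print(fit1.coef_[0, 0], fit2.coef_[0, 0])
```

Each logit in the series is an ordinary binary regression, which is what makes the kluge easy to embed in a multilevel or iterative setting.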

3 0.74596614 852 andrew gelman stats-2011-08-13-Checking your model using fake data


4 0.73871642 935 andrew gelman stats-2011-10-01-When should you worry about imputed data?

Introduction: Majid Ezzati writes: My research group is increasingly focusing on a series of problems that involve data that either have missingness or measurements that may have bias/error. We have at times developed our own approaches to imputation (as simple as interpolating a missing unit and as sophisticated as a problem-specific Bayesian hierarchical model) and at other times, other groups impute the data. The outputs are being used to investigate the basic associations between pairs of variables, Xs and Ys, in regressions; we may or may not interpret these as causal. I am contacting colleagues with relevant expertise to suggest good references on whether having imputed X and/or Y in a subsequent regression is correct or if it could somehow lead to biased/spurious associations. Thinking about this, we can have at least the following situations (these could all be Bayesian or not): 1) X and Y both measured (perhaps with error) 2) Y imputed using some data and a model and X measur

5 0.73859656 861 andrew gelman stats-2011-08-19-Will Stan work well with 40×40 matrices?

Introduction: Tomas Iesmantas writes: I’m dealing with high dimensional (40-50 parameters) hierarchical bayesian model applied to nonlinear Poisson regression problem. Now I’m using an adaptive version for the Metropolis adjusted Langevin algorithm with a truncated drift (Yves F. Atchade, 2003) to obtain samples from posterior. But this algorithm is not very efficient in my case, it needs several millions iterations as burn-in period. And simulation takes quite a long time, since algorithm has to work with 40×40 matrices. Maybe you know another MCMC algorithm which could take not so many burn-in samples and would be able to deal with nonlinear regression? In non-hierarchical nonlinear regression model adaptive metropolis algorithm is enough, but in hierarchical case I could use something more effective. My reply: Try fitting the model in Stan. If that doesn’t work, let me know.

6 0.73762631 1468 andrew gelman stats-2012-08-24-Multilevel modeling and instrumental variables

7 0.73206604 328 andrew gelman stats-2010-10-08-Displaying a fitted multilevel model

8 0.72137475 1735 andrew gelman stats-2013-02-24-F-f-f-fake data

9 0.71336752 1047 andrew gelman stats-2011-12-08-I Am Too Absolutely Heteroskedastic for This Probit Model

10 0.71336251 1462 andrew gelman stats-2012-08-18-Standardizing regression inputs

11 0.71016753 1999 andrew gelman stats-2013-08-27-Bayesian model averaging or fitting a larger model

12 0.70942867 1200 andrew gelman stats-2012-03-06-Some economists are skeptical about microfoundations

13 0.70531166 823 andrew gelman stats-2011-07-26-Including interactions or not

14 0.70364827 1374 andrew gelman stats-2012-06-11-Convergence Monitoring for Non-Identifiable and Non-Parametric Models

15 0.69795042 819 andrew gelman stats-2011-07-24-Don’t idealize “risk aversion”

16 0.69530308 1981 andrew gelman stats-2013-08-14-The robust beauty of improper linear models in decision making

17 0.69357997 729 andrew gelman stats-2011-05-24-Deviance as a difference

18 0.69342518 20 andrew gelman stats-2010-05-07-Bayesian hierarchical model for the prediction of soccer results

19 0.68545377 1395 andrew gelman stats-2012-06-27-Cross-validation (What is it good for?)

20 0.68539429 1431 andrew gelman stats-2012-07-27-Overfitting


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(9, 0.095), (15, 0.013), (16, 0.033), (24, 0.197), (30, 0.048), (31, 0.125), (53, 0.013), (58, 0.011), (61, 0.018), (68, 0.017), (76, 0.031), (84, 0.013), (86, 0.049), (87, 0.019), (89, 0.02), (99, 0.179)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.93657517 1875 andrew gelman stats-2013-05-28-Simplify until your fake-data check works, then add complications until you can figure out where the problem is coming from


2 0.87775689 2225 andrew gelman stats-2014-02-26-A good comment on one of my papers

Introduction: An anonymous reviewer wrote: I appreciate informal writing styles as a means of increasing accessibility. However, the informality here seems to decrease accessibility – partly because of the assumed knowledge of the reader for concepts and terms, and also for its wandering style. Many concepts are introduced without explanation and are not clearly and decisively linked in developing a narrative argument. I think the prose and argumentation would be much stronger if ideas were introduced and developed more deliberately and not assuming insider knowledge of the reader. Good point. I have an informal writing style and that often works well, even for technical papers. But sometimes an informal paper is harder to follow for readers without the background knowledge. Paradoxically, a more stilted style with lots of notation and many stops to make precise definitions, can be more readable for the less-than-expert audience.

3 0.87050563 2192 andrew gelman stats-2014-01-30-History is too important to be left to the history professors, Part 2

Introduction: Completely non-gay historian Niall Ferguson, a man who we can be sure would never be caught at a ballet or a poetry reading, informs us that the British decision to enter the first world war on the side of France and Belgium was “the biggest error in modern history.” Ummm, here are a few bigger errors: The German decision to invade Russia in 1941. The Japanese decision to attack America in 1941. Oh yeah , the German decision to invade Belgium in 1914. The Russian decision to invade Afghanistan in 1979 doesn’t look like such a great decision either. And it wasn’t so smart for Saddam Hussein to invade Kuwait, but maybe the countries involved were too small for this to count as “the biggest error in modern history.” It’s striking that, in considering the biggest error in modern history, Ferguson omits all these notorious acts of aggression (bombing Pearl Harbor, leading to the destruction of much of your country, that was pretty bad, huh?), and decides that the worst

4 0.85668504 2262 andrew gelman stats-2014-03-23-Win probabilities during a sporting event

Introduction: Todd Schneider writes: Apropos of your recent blog post about modeling score differential of basketball games , I thought you might enjoy a site I built, gambletron2000.com , that gathers real-time win probabilities from betting markets for most major sports (including NBA and college basketball). My original goal was to use the variance of changes in win probabilities to quantify which games were the most exciting, but I got a bit carried away and ended up pursuing a bunch of other ideas, which  you can read about in the full writeup here This particular passage from the anonymous someone in your post: My idea is for each timestep in a game (a second, 5 seconds, etc), use the Vegas line, the current score differential, who has the ball, and the number of possessions played already (to account for differences in pace) to create a point estimate probability of the home team winning. reminded me of a graph I made, which shows the mean-reverting tendency of N

5 0.85374367 1368 andrew gelman stats-2012-06-06-Question 27 of my final exam for Design and Analysis of Sample Surveys

Introduction: 27. Which of the following problems were identified with the Burnham et al. survey of Iraq mortality? (Indicate all that apply.) (a) The survey used cluster sampling, which is inappropriate for estimating individual outcomes such as death. (b) In their report, Burnham et al. did not identify their primary sampling units. (c) The second-stage sampling was not a probability sample. (d) Survey materials supplied by the authors are incomplete and inconsistent with published descriptions of the survey. Solution to question 26 From yesterday: 26. You have just graded an exam with 28 questions and 15 students. You fit a logistic item-response model estimating ability, difficulty, and discrimination parameters. Which of the following statements are basically true? (Indicate all that apply.) (a) If a question is answered correctly by students with very low and very high ability, but is missed by students in the middle, it will have a high value for its discrimination

6 0.84906757 1332 andrew gelman stats-2012-05-20-Problemen met het boek

7 0.84814292 599 andrew gelman stats-2011-03-03-Two interesting posts elsewhere on graphics

8 0.84564614 846 andrew gelman stats-2011-08-09-Default priors update?

9 0.84441835 1778 andrew gelman stats-2013-03-27-My talk at the University of Michigan today 4pm

10 0.84158164 1062 andrew gelman stats-2011-12-16-Mr. Pearson, meet Mr. Mandelbrot: Detecting Novel Associations in Large Data Sets

11 0.84140432 2365 andrew gelman stats-2014-06-09-I hate polynomials

12 0.83979809 953 andrew gelman stats-2011-10-11-Steve Jobs’s cancer and science-based medicine

13 0.83970332 2312 andrew gelman stats-2014-04-29-Ken Rice presents a unifying approach to statistical inference and hypothesis testing

14 0.83883786 779 andrew gelman stats-2011-06-25-Avoiding boundary estimates using a prior distribution as regularization

15 0.83732718 242 andrew gelman stats-2010-08-29-The Subtle Micro-Effects of Peacekeeping

16 0.83631814 2224 andrew gelman stats-2014-02-25-Basketball Stats: Don’t model the probability of win, model the expected score differential.

17 0.83555776 197 andrew gelman stats-2010-08-10-The last great essayist?

18 0.83484745 1367 andrew gelman stats-2012-06-05-Question 26 of my final exam for Design and Analysis of Sample Surveys

19 0.83481336 1455 andrew gelman stats-2012-08-12-Probabilistic screening to get an approximate self-weighted sample

20 0.83430672 1221 andrew gelman stats-2012-03-19-Whassup with deviance having a high posterior correlation with a parameter in the model?