andrew_gelman_stats-2011-726 knowledge-graph by maker-knowledge-mining

726 andrew gelman stats-2011-05-22-Handling multiple versions of an outcome variable


meta info for this blog

Source: html

Introduction: Jay Ulfelder asks: I have a question for you about what to do in a situation where you have two measures of your dependent variable and no prior reasons to strongly favor one over the other. Here’s what brings this up: I’m working on a project with Michael Ross where we’re modeling transitions to and from democracy in countries worldwide since 1960 to estimate the effects of oil income on the likelihood of those events’ occurrence. We’ve got a TSCS data set, and we’re using a discrete-time event history design, splitting the sample by regime type at the start of each year and then using multilevel logistic regression models with parametric measures of time at risk and random intercepts at the country and region levels. (We’re also checking for the usefulness of random slopes for oil wealth at one or the other level and then including them if they improve a model’s goodness of fit.) All of this is being done in Stata with the gllamm module. Our problem is that we have two plausible measures of those transition events.


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Jay Ulfelder asks: I have a question for you about what to do in a situation where you have two measures of your dependent variable and no prior reasons to strongly favor one over the other. [sent-1, score-0.588]

2 Here’s what brings this up: I’m working on a project with Michael Ross where we’re modeling transitions to and from democracy in countries worldwide since 1960 to estimate the effects of oil income on the likelihood of those events’ occurrence. [sent-2, score-0.352]

3 We’ve got a TSCS data set, and we’re using a discrete-time event history design, splitting the sample by regime type at the start of each year and then using multilevel logistic regression models with parametric measures of time at risk and random intercepts at the country and region levels. [sent-3, score-0.901]

4 (We’re also checking for the usefulness of random slopes for oil wealth at one or the other level and then including them if they improve a model’s goodness of fit. [sent-4, score-0.41]

5 ) All of this is being done in Stata with the gllamm module. [sent-5, score-0.097]

6 Our problem is that we have two plausible measures of those transition events. [sent-6, score-0.566]

7 Unsurprisingly, the results we get from the two DVs differ, sometimes not by much but in a few cases to a non-trivial degree. [sent-7, score-0.26]

8 I just don’t like that solution, though, because it sweeps under the rug some uncertainty that’s arguably as informative as the results from either version alone. [sent-10, score-0.466]

9 At the same time, it seems a little goofy just to toss both sets of results on the table and then shrug in cases where they diverge non-trivially. [sent-11, score-0.41]

10 Do you know of any elegant solutions to this problem? [sent-12, score-0.083]

11 I recall seeing a paper last year that used Bayesian methods to average across estimates from different versions of a dependent variable, but I don’t think that paper used multilevel models and am assuming the math required is much more complicated (i. [sent-13, score-0.41]

12 My reply: My quick suggestion would be to add the two measures and then use the sum as the outcome. [sent-16, score-0.625]

13 If it’s a continuous measure there’s no problem (although you’d want to prescale the measures so that they’re roughly on a common scale before you add them). [sent-17, score-0.534]
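As a concrete illustration of this prescale-then-add idea for the continuous case, here is a minimal Python sketch (the original project used Stata with gllamm; the names m1 and m2 are hypothetical stand-ins for the two versions of the outcome):

import numpy as np

def combined_outcome(m1, m2):
    # Standardize each version of the outcome before summing,
    # so neither version dominates just because of its scale.
    m1 = np.asarray(m1, dtype=float)
    m2 = np.asarray(m2, dtype=float)
    z1 = (m1 - m1.mean()) / m1.std()
    z2 = (m2 - m2.mean()) / m2.std()
    return z1 + z2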

14 If they are binary outcomes you can just fit an ordered logit. [sent-18, score-0.116]
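For the binary case, a hedged sketch of the same idea: sum the two 0/1 codings into an ordered outcome in {0, 1, 2} and fit an ordered logit on it. The original analysis was done in Stata with gllamm; this illustration instead uses Python's statsmodels on simulated data, with hypothetical variable names (dv_a, dv_b, oil_income), and it omits the multilevel structure (country and region intercepts) of the real model:

import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(0)
n = 2000
oil_income = rng.normal(size=n)                    # standardized predictor
latent = 0.5 * oil_income + rng.logistic(size=n)   # shared latent propensity
# Two noisy binary codings of the same underlying transition events:
dv_a = (latent + rng.normal(scale=0.5, size=n) > 0).astype(int)
dv_b = (latent + rng.normal(scale=0.5, size=n) > 0).astype(int)

# 0 = neither coding records an event, 1 = the codings disagree,
# 2 = both codings record an event.
y = pd.Series(dv_a + dv_b).astype(pd.CategoricalDtype([0, 1, 2], ordered=True))

model = OrderedModel(y, pd.DataFrame({"oil_income": oil_income}), distr="logit")
print(model.fit(method="bfgs", disp=False).summary())

The single slope on oil_income is what distinguishes this from running two separate logits: the ordered model treats "both codings record a transition" as stronger evidence than "only one does," rather than discarding the disagreement.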

15 Jay liked my suggestion but added: One hitch for our particular problem, though: because we’re estimating event history models, the alternate versions of the DV (which is binary) also come with alternate versions of a couple of the IVs: time at risk and counts of prior events. [sent-19, score-1.642]

16 I can’t see how we could accommodate those differences in the framework you propose. [sent-20, score-0.172]

17 Basically, we’ve got two alternate universes (or two alternate interpretations of the same universe), and the differences permeate both sides of the equation. [sent-21, score-0.996]

18 Sometimes I really wish I worked in the natural sciences… My reply: My suggestion would be to combine the predictors in some way as well. [sent-22, score-0.172]
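The post does not spell out how to combine the predictors. One simple reading, sketched below under that assumption, is to average the two alternate codings of each DV-dependent IV; the column names (time_at_risk_a, time_at_risk_b, prior_events_a, prior_events_b) are hypothetical:

import pandas as pd

def combine_alternate_ivs(df: pd.DataFrame) -> pd.DataFrame:
    # Average the two codings of each DV-dependent predictor so the
    # combined-outcome model sees a single version of each IV.
    out = df.copy()
    out["time_at_risk"] = df[["time_at_risk_a", "time_at_risk_b"]].mean(axis=1)
    out["prior_events"] = df[["prior_events_a", "prior_events_b"]].mean(axis=1)
    return out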


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('alternate', 0.34), ('measures', 0.256), ('versions', 0.193), ('suggestion', 0.172), ('oil', 0.152), ('results', 0.143), ('sensitivity', 0.142), ('version', 0.136), ('dependent', 0.131), ('jay', 0.129), ('two', 0.117), ('binary', 0.116), ('problem', 0.113), ('event', 0.112), ('dvs', 0.103), ('footnotes', 0.103), ('hitch', 0.103), ('worldwide', 0.103), ('transitions', 0.097), ('invariably', 0.097), ('gllamm', 0.097), ('shrug', 0.097), ('splitting', 0.097), ('sweeps', 0.097), ('risk', 0.096), ('solution', 0.093), ('ulfelder', 0.093), ('goodness', 0.093), ('history', 0.093), ('re', 0.092), ('dv', 0.09), ('accommodate', 0.09), ('rug', 0.09), ('confirming', 0.087), ('usefulness', 0.087), ('multilevel', 0.086), ('measure', 0.085), ('toss', 0.085), ('unsurprisingly', 0.085), ('goofy', 0.085), ('variable', 0.084), ('elegant', 0.083), ('intercepts', 0.083), ('differences', 0.082), ('ross', 0.08), ('transition', 0.08), ('add', 0.08), ('regime', 0.078), ('universe', 0.078), ('slopes', 0.078)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000001 726 andrew gelman stats-2011-05-22-Handling multiple versions of an outcome variable


2 0.15024662 1340 andrew gelman stats-2012-05-23-Question 13 of my final exam for Design and Analysis of Sample Surveys

Introduction: 13. A survey of American adults is conducted that includes too many women and not enough men in the sample. In the resulting weighting, each female respondent is given a weight of 1 and each male respondent is given a weight of 1.5. The sample includes 600 women and 380 men, of whom 400 women and 100 men respond Yes to a particular question of interest. Give an estimate and standard error for the proportion of American adults who would answer Yes to this question if asked. Solution to question 12 From yesterday : 12. A researcher fits a regression model predicting some political behavior given predictors for demographics and several measures of economic ideology. The coefficients for the ideology measures are not statistically significant, and the researcher creates a new measure, adding up the ideology questions and creating a common score, and then fits a new regression including the new score and removing the individual ideology questions from the model. Which of the follo

3 0.14723963 1431 andrew gelman stats-2012-07-27-Overfitting

Introduction: Ilya Esteban writes: In traditional machine learning and statistical learning techniques, you spend a lot of time selecting your input features, fiddling with model parameter values, etc., all of which leads to the problem of overfitting the data and producing overly optimistic estimates for how good the model really is. You can use techniques such as cross-validation and out-of-sample validation data to try to limit the damage, but they are imperfect solutions at best. While Bayesian models have the great advantage of not forcing you to manually select among the various weights and input features, you still often end up trying different priors and model structures (especially with hierarchical models), before coming up with a “final” model. When applying Bayesian modeling to real world data sets, how should you evaluate alternate priors and topologies for the model without falling into the same overfitting trap as you do with non-Bayesian models? If you try several different

4 0.14110342 888 andrew gelman stats-2011-09-03-A psychology researcher asks: Is Anova dead?

Introduction: A research psychologist writes in with a question that’s so long that I’ll put my answer first, then put the question itself below the fold. Here’s my reply: As I wrote in my Anova paper and in my book with Jennifer Hill, I do think that multilevel models can completely replace Anova. At the same time, I think the central idea of Anova should persist in our understanding of these models. To me the central idea of Anova is not F-tests or p-values or sums of squares, but rather the idea of predicting an outcome based on factors with discrete levels, and understanding these factors using variance components. The continuous or categorical response thing doesn’t really matter so much to me. I have no problem using a normal linear model for continuous outcomes (perhaps suitably transformed) and a logistic model for binary outcomes. I don’t want to throw away interactions just because they’re not statistically significant. I’d rather partially pool them toward zero using an inform

5 0.13384171 1900 andrew gelman stats-2013-06-15-Exploratory multilevel analysis when group-level variables are of importance

Introduction: Steve Miller writes: Much of what I do is cross-national analyses of survey data (largely World Values Survey). . . . My big question pertains to (what I would call) exploratory analysis of multilevel data, especially when the group-level predictors are of theoretical importance. A lot of what I do involves analyzing cross-national survey items of citizen attitudes, typically of political leadership. These survey items are usually yes/no responses, or four-part responses indicating a level of agreement (strongly agree, agree, disagree, strongly disagree) that can be condensed into a binary variable. I believe these can be explained by reference to country-level factors. Many of the group-level variables of interest are count variables with a modal value of 0, which can be quite messy. How would you recommend exploring the variation in the dependent variable as it could be explained by the group-level count variable of interest, before fitting the multilevel model itself? When

6 0.13067912 1337 andrew gelman stats-2012-05-22-Question 12 of my final exam for Design and Analysis of Sample Surveys

7 0.12874903 352 andrew gelman stats-2010-10-19-Analysis of survey data: Design based models vs. hierarchical modeling?

8 0.12464315 1395 andrew gelman stats-2012-06-27-Cross-validation (What is it good for?)

9 0.12034225 2251 andrew gelman stats-2014-03-17-In the best alternative histories, the real world is what’s ultimately real

10 0.11963192 383 andrew gelman stats-2010-10-31-Analyzing the entire population rather than a sample

11 0.11445213 1149 andrew gelman stats-2012-02-01-Philosophy of Bayesian statistics: my reactions to Cox and Mayo

12 0.11283118 1267 andrew gelman stats-2012-04-17-Hierarchical-multilevel modeling with “big data”

13 0.11254824 704 andrew gelman stats-2011-05-10-Multiple imputation and multilevel analysis

14 0.11216617 1248 andrew gelman stats-2012-04-06-17 groups, 6 group-level predictors: What to do?

15 0.11004138 2109 andrew gelman stats-2013-11-21-Hidden dangers of noninformative priors

16 0.10962461 759 andrew gelman stats-2011-06-11-“2 level logit with 2 REs & large sample. computational nightmare – please help”

17 0.10913169 269 andrew gelman stats-2010-09-10-R vs. Stata, or, Different ways to estimate multilevel models

18 0.10904559 2274 andrew gelman stats-2014-03-30-Adjudicating between alternative interpretations of a statistical interaction?

19 0.10835429 1941 andrew gelman stats-2013-07-16-Priors

20 0.10553323 2200 andrew gelman stats-2014-02-05-Prior distribution for a predicted probability


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.222), (1, 0.107), (2, 0.077), (3, -0.042), (4, 0.083), (5, 0.008), (6, 0.034), (7, -0.03), (8, 0.031), (9, 0.044), (10, 0.044), (11, -0.003), (12, 0.011), (13, 0.014), (14, 0.026), (15, -0.003), (16, 0.005), (17, -0.007), (18, -0.006), (19, 0.018), (20, -0.01), (21, 0.036), (22, 0.021), (23, 0.002), (24, -0.027), (25, -0.047), (26, -0.016), (27, -0.039), (28, -0.009), (29, 0.005), (30, -0.002), (31, 0.0), (32, 0.022), (33, -0.021), (34, 0.019), (35, -0.061), (36, -0.034), (37, 0.045), (38, 0.003), (39, 0.007), (40, -0.003), (41, 0.026), (42, 0.016), (43, -0.004), (44, -0.015), (45, -0.038), (46, 0.024), (47, 0.03), (48, -0.006), (49, 0.001)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97368103 726 andrew gelman stats-2011-05-22-Handling multiple versions of an outcome variable


2 0.85533857 759 andrew gelman stats-2011-06-11-“2 level logit with 2 REs & large sample. computational nightmare – please help”

Introduction: I received an email with the above title from Daniel Adkins, who elaborates: I [Adkins] am having a tough time with a dataset including 40K obs and 8K subjects. Trying to estimate a 2 level logit with random intercept and age slope and about 13 fixed covariates. I have tried several R packages (lme4, lme4a, glmmPQL, MCMCglmm) and stata xtmelogit and gllamm to no avail. xtmelogit crashes from insufficient memory. The R packages yield false convergences. A simpler model w/ random intercept only gives stable estimates in lme4 with a very large number of quadrature points (nAGQ>220). When I try this (nAGQ=221) with the random age term, it doesn’t make it through a single iteration in 72 hours (have tried both w/ and w/out RE correlation). I am using a power desktop that is top of the line compared to anything other than a cluster. Have tried start values for fixed effects in lme4 and that doesn’t help (couldn’t figure out how to specify RE starts). Do you have any advice? Should I move t

3 0.82495755 1294 andrew gelman stats-2012-05-01-Modeling y = a + b + c

Introduction: Brandon Behlendorf writes: I [Behlendorf] am replicating some previous research using OLS [he's talking about what we call "linear regression"---ed.] to regress a logged rate (to reduce skew) of Y on a number of predictors (Xs). Y is the count of a phenomenon divided by the population of the unit of the analysis. The problem that I am encountering is that Y is a composite count of a number of distinct phenomena [A+B+C], and these phenomena are not uniformly distributed across the sample. Most of the research in this area has conducted regressions either with Y or with individual phenomena [A or B or C] as the dependent variable. Yet it seems that if [A, B, C] are not uniformly distributed across the sample of units in the same proportion, then the use of Y would be biased, since as a count of [A+B+C] divided by the population, it would treat as equivalent units both [2+0.5+1.5] and [4+0+0]. My goal is trying to find a methodology which allows a researcher to regress Y on a

4 0.82350045 1966 andrew gelman stats-2013-08-03-Uncertainty in parameter estimates using multilevel models

Introduction: David Hsu writes: I have a (perhaps) simple question about uncertainty in parameter estimates using multilevel models — what is an appropriate threshold for measuring parameter uncertainty in a multilevel model? The reason why I ask is that I set out to do a crossed two-way model with two varying intercepts, similar to your flight simulator example in your 2007 book. The difference is that I have a lot of predictors specific to each cell (I think equivalent to airport and pilot in your example), and after modeling this in JAGS, I happily find that the predictors are much less important than the variability by cell (airport and pilot effects). Happily because this is what I am writing a paper about. However, I then went to check subsets of predictors using lm() and lmer(). I understand that they all use different estimation methods, but what I can’t figure out is why the errors on all of the coefficient estimates are *so* different. For example, using JAGS, and th

5 0.81526971 888 andrew gelman stats-2011-09-03-A psychology researcher asks: Is Anova dead?


6 0.81097239 269 andrew gelman stats-2010-09-10-R vs. Stata, or, Different ways to estimate multilevel models

7 0.81030542 704 andrew gelman stats-2011-05-10-Multiple imputation and multilevel analysis

8 0.80809224 246 andrew gelman stats-2010-08-31-Somewhat Bayesian multilevel modeling

9 0.80220789 2296 andrew gelman stats-2014-04-19-Index or indicator variables

10 0.80205238 1121 andrew gelman stats-2012-01-15-R-squared for multilevel models

11 0.80179799 1070 andrew gelman stats-2011-12-19-The scope for snooping

12 0.79818225 2086 andrew gelman stats-2013-11-03-How best to compare effects measured in two different time periods?

13 0.78939444 851 andrew gelman stats-2011-08-12-year + (1|year)

14 0.7850247 753 andrew gelman stats-2011-06-09-Allowing interaction terms to vary

15 0.78023225 1981 andrew gelman stats-2013-08-14-The robust beauty of improper linear models in decision making

16 0.7768693 1248 andrew gelman stats-2012-04-06-17 groups, 6 group-level predictors: What to do?

17 0.77300733 250 andrew gelman stats-2010-09-02-Blending results from two relatively independent multi-level models

18 0.765957 397 andrew gelman stats-2010-11-06-Multilevel quantile regression

19 0.76473659 417 andrew gelman stats-2010-11-17-Clutering and variance components

20 0.76222396 772 andrew gelman stats-2011-06-17-Graphical tools for understanding multilevel models


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(24, 0.089), (86, 0.033), (89, 0.012), (95, 0.027), (99, 0.717)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99956357 726 andrew gelman stats-2011-05-22-Handling multiple versions of an outcome variable


2 0.9978435 1431 andrew gelman stats-2012-07-27-Overfitting


3 0.99779183 589 andrew gelman stats-2011-02-24-On summarizing a noisy scatterplot with a single comparison of two points

Introduction: John Sides discusses how his scatterplot of unionization rates and budget deficits made it onto cable TV news: It’s also interesting to see how he [journalist Chris Hayes] chooses to explain a scatterplot — especially given the evidence that people don’t always understand scatterplots. He compares pairs of cases that don’t illustrate the basic hypothesis of Brooks, Scott Walker, et al. Obviously, such comparisons could be misleading, but given that there was no systematic relationship depicted in that graph, these particular comparisons are not. This idea–summarizing a bivariate pattern by comparing pairs of points–reminds me of a well-known statistical identity which I refer to in a paper with David Park: John Sides is certainly correct that if you can pick your pair of points, you can make extremely misleading comparisons. But if you pick every pair of points, and average over them appropriately, you end up with the least-squares regression slope. Pretty cool, and

4 0.99749148 1315 andrew gelman stats-2012-05-12-Question 2 of my final exam for Design and Analysis of Sample Surveys

Introduction: 2. Which of the following are useful goals in a pilot study? (Indicate all that apply.) (a) You can search for statistical significance, then from that decide what to look for in a confirmatory analysis of your full dataset. (b) You can see if you find statistical significance in a pre-chosen comparison of interest. (c) You can examine the direction (positive or negative, even if not statistically significant) of comparisons of interest. (d) With a small sample size, you cannot hope to learn anything conclusive, but you can get a crude estimate of effect size and standard deviation which will be useful in a power analysis to help you decide how large your full study needs to be. (e) You can talk with survey respondents and get a sense of how they perceived your questions. (f) You get a chance to learn about practical difficulties with sampling, nonresponse, and question wording. (g) You can check if your sample is approximately representative of your population. Soluti

5 0.9974516 1434 andrew gelman stats-2012-07-29-FindTheData.org

Introduction: I received the following (unsolicited) email: Hi Andrew, I work on the business development team of FindTheData.org, an unbiased comparison engine founded by Kevin O’Connor (founder and former CEO of DoubleClick) and backed by Kleiner Perkins with ~10M unique visitors per month. We are working with large online publishers including Golf Digest, Huffington Post, Under30CEO, and offer a variety of options to integrate our highly engaging content with your site.  I believe our un-biased and reliable data resources would be of interest to you and your readers. I’d like to set up a quick call to discuss similar partnership ideas with you and would greatly appreciate 10 minutes of your time. Please suggest a couple times that work best for you or let me know if you would like me to send some more information before you make time for a call. Looking forward to hearing from you, Jonny – JONNY KINTZELE Business Development, FindThe Data mobile: 619-307-097

6 0.99731421 521 andrew gelman stats-2011-01-17-“the Tea Party’s ire, directed at Democrats and Republicans alike”

7 0.99709821 772 andrew gelman stats-2011-06-17-Graphical tools for understanding multilevel models

8 0.99676687 1425 andrew gelman stats-2012-07-23-Examples of the use of hierarchical modeling to generalize to new settings

9 0.99620491 809 andrew gelman stats-2011-07-19-“One of the easiest ways to differentiate an economist from almost anyone else in society”

10 0.99603087 638 andrew gelman stats-2011-03-30-More on the correlation between statistical and political ideology

11 0.99434972 1813 andrew gelman stats-2013-04-19-Grad students: Participate in an online survey on statistics education

12 0.99407697 1248 andrew gelman stats-2012-04-06-17 groups, 6 group-level predictors: What to do?

13 0.9940418 507 andrew gelman stats-2011-01-07-Small world: MIT, asymptotic behavior of differential-difference equations, Susan Assmann, subgroup analysis, multilevel modeling

14 0.9937852 1483 andrew gelman stats-2012-09-04-“Bestselling Author Caught Posting Positive Reviews of His Own Work on Amazon”

15 0.99369776 1952 andrew gelman stats-2013-07-23-Christakis response to my comment on his comments on social science (or just skip to the P.P.P.S. at the end)

16 0.9936173 1096 andrew gelman stats-2012-01-02-Graphical communication for legal scholarship

17 0.99342555 1585 andrew gelman stats-2012-11-20-“I know you aren’t the plagiarism police, but . . .”

18 0.99342328 1670 andrew gelman stats-2013-01-13-More Bell Labs happy talk

19 0.99331015 23 andrew gelman stats-2010-05-09-Popper’s great, but don’t bother with his theory of probability

20 0.99319774 174 andrew gelman stats-2010-08-01-Literature and life