andrew_gelman_stats andrew_gelman_stats-2012 andrew_gelman_stats-2012-1294 knowledge-graph by maker-knowledge-mining

1294 andrew gelman stats-2012-05-01-Modeling y = a + b + c


meta info for this blog

Source: html

Introduction: Brandon Behlendorf writes: I [Behlendorf] am replicating some previous research using OLS [he's talking about what we call "linear regression"---ed.] to regress a logged rate (to reduce skew) of Y on a number of predictors (Xs). Y is the count of a phenomena divided by the population of the unit of the analysis. The problem that I am encountering is that Y is composite count of a number of distinct phenomena [A+B+C], and these phenomena are not uniformly distributed across the sample. Most of the research in this area has conducted regressions either with Y or with individual phenomena [A or B or C] as the dependent variable. Yet it seems that if [A, B, C] are not uniformly distributed across the sample of units in the same proportion, then the use of Y would be biased, since as a count of [A+B+C] divided by the population, it would treat as equivalent units both [2+0.5+1.5] and [4+0+0]. My goal is trying to find a methodology which allows a researcher to regress Y on a


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Brandon Behlendorf writes: I [Behlendorf] am replicating some previous research using OLS [he's talking about what we call "linear regression"---ed. [sent-1, score-0.082]

2 ] to regress a logged rate (to reduce skew) of Y on a number of predictors (Xs). [sent-2, score-0.262]

3 Y is the count of a phenomena divided by the population of the unit of the analysis. [sent-3, score-1.013]

4 The problem that I am encountering is that Y is composite count of a number of distinct phenomena [A+B+C], and these phenomena are not uniformly distributed across the sample. [sent-4, score-1.889]

5 Most of the research in this area has conducted regressions either with Y or with individual phenomena [A or B or C] as the dependent variable. [sent-5, score-0.654]

6 Yet it seems that if [A, B, C] are not uniformly distributed across the sample of units in the same proportion, then the use of Y would be biased, since as a count of [A+B+C] divided by the population, it would treat as equivalent units both [2+0.5+1.5] and [4+0+0]. [sent-6, score-1.181]

7 My goal is trying to find a methodology which allows a researcher to regress Y on a number of Xs, but which accounts for the uneven variation in the distributions of the individual phenomena [A+B+C] that constitute Y. [sent-9, score-1.043]

8 I have thought that it could be treated within a Structural Equation Model as multiple dependent variables, or through a process of joint estimation, but in essence I know the latent factor (Y) that one usually does not know when trying to measure through some sort of SEM or Rasch Model. [sent-10, score-0.356]

9 I have also considered weighting [A,B,C] by converting them into percentages of the total count of each phenomena within the sample (i.e. [sent-11, score-1.15]

10 (A1/sum A(1-100)) + (B1/sum B(1-100)) + (C1/sum C(1-100))), but the result lacks interpretational quality as to the overall relationship between Xs and Y. [sent-13, score-0.209]

11 My reply: First off, the reason for logging is to model a multiplicative relationship using an additive model. [sent-14, score-0.282]
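The point about logging can be made concrete with a minimal sketch (toy noiseless data, hypothetical values): a multiplicative relationship y = a·x^b becomes exactly additive on the log scale, log y = log a + b·log x, so ordinary least squares on the logs recovers the exponent:

```python
import math

# Multiplicative model y = a * x**b (noiseless toy data).
a_true, b_true = 2.0, 1.5
xs = [1.0, 2.0, 4.0, 8.0, 16.0]
ys = [a_true * x ** b_true for x in xs]

# On the log scale the model is linear: log y = log a + b * log x.
lx = [math.log(x) for x in xs]
ly = [math.log(y) for y in ys]

# Closed-form simple OLS on the logged data.
n = len(xs)
mx, my = sum(lx) / n, sum(ly) / n
slope = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
         / sum((u - mx) ** 2 for u in lx))
intercept = my - slope * mx

assert abs(slope - b_true) < 1e-9               # exponent b recovered
assert abs(math.exp(intercept) - a_true) < 1e-9  # multiplier a recovered
```

That is the reason to log: to turn a multiplicative story into an additive one, not to chase symmetry in the marginal distribution.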

12 Skewness is typically irrelevant (see the discussion of regression assumptions in chapter 3 or 4 of ARM). [sent-15, score-0.262]

13 Also, if y is a count, you might want to use an overdispersed Poisson regression as discussed in chapter 6. [sent-17, score-0.288]
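Why an overdispersed model for counts? A plain Poisson forces variance ≈ mean, and real count data usually violate that. The sketch below (an illustration of the phenomenon, not the ARM chapter 6 code) simulates a gamma-mixed Poisson, i.e. each unit gets its own rate, and shows the variance blowing past the mean:

```python
import math
import random

rng = random.Random(42)

def poisson(lam):
    """Draw one Poisson variate (Knuth's multiplication method)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while p > L:
        k += 1
        p *= rng.random()
    return k - 1

# Gamma-mixed Poisson: each unit's rate is drawn from a Gamma
# distribution (shape 2, scale 3), so rates vary across units.
counts = [poisson(rng.gammavariate(2.0, 3.0)) for _ in range(5000)]

mean = sum(counts) / len(counts)
var = sum((c - mean) ** 2 for c in counts) / len(counts)

# Overdispersion: a plain Poisson would give var close to mean,
# but the rate heterogeneity inflates the variance well beyond it.
assert var > 1.5 * mean
```

When the data look like this, an overdispersed (quasi-) Poisson or negative binomial regression absorbs the extra variance instead of understating the standard errors.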

14 Is it a sample size issue, that by combining a,b,c into y, you get more stable estimates? [sent-19, score-0.175]

15 If so, that’s ok, and you could always try weighted averages if that makes sense in your application. [sent-20, score-0.137]
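One way to read the weighted-average suggestion (a sketch under assumptions, with hypothetical numbers, not a prescription from the post): fit the model to a, b, c separately, then pool the three coefficient estimates with inverse-variance weights, which stabilizes the combined estimate relative to any single noisy component:

```python
# Hypothetical coefficient estimates from three separate regressions,
# one per component (A-only, B-only, C-only), with their standard errors.
estimates = [0.8, 1.1, 0.9]
ses = [0.40, 0.15, 0.25]

# Inverse-variance weighting: precise estimates get more weight.
weights = [1.0 / se ** 2 for se in ses]
pooled = sum(w * est for w, est in zip(weights, estimates)) / sum(weights)
pooled_se = (1.0 / sum(weights)) ** 0.5

# The pooled estimate lies between the inputs and is more precise
# than the best single component estimate.
assert min(estimates) <= pooled <= max(estimates)
assert pooled_se < min(ses)
```

Whether averaging like this makes sense depends, as the reply says, on whether a common coefficient across A, B, and C is plausible in the application.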


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('phenomena', 0.441), ('xs', 0.32), ('count', 0.295), ('behlendorf', 0.234), ('regress', 0.168), ('uniformly', 0.155), ('units', 0.141), ('distributed', 0.135), ('dependent', 0.135), ('divided', 0.131), ('relationship', 0.116), ('sample', 0.108), ('rasch', 0.107), ('brandon', 0.107), ('skewness', 0.107), ('regression', 0.102), ('uneven', 0.1), ('overdispersed', 0.096), ('skew', 0.096), ('number', 0.094), ('encountering', 0.093), ('ols', 0.093), ('lacks', 0.093), ('sem', 0.093), ('chapter', 0.09), ('converting', 0.09), ('multiplicative', 0.09), ('constitute', 0.088), ('composite', 0.086), ('replicating', 0.082), ('essence', 0.082), ('population', 0.08), ('individual', 0.078), ('additive', 0.076), ('across', 0.075), ('percentages', 0.075), ('within', 0.075), ('accounts', 0.074), ('distinct', 0.074), ('poisson', 0.073), ('structural', 0.073), ('weighted', 0.072), ('irrelevant', 0.07), ('equation', 0.067), ('stable', 0.067), ('separately', 0.067), ('unit', 0.066), ('weighting', 0.066), ('averages', 0.065), ('latent', 0.064)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999982 1294 andrew gelman stats-2012-05-01-Modeling y = a + b + c


2 0.13432054 1900 andrew gelman stats-2013-06-15-Exploratory multilevel analysis when group-level variables are of importance

Introduction: Steve Miller writes: Much of what I do is cross-national analyses of survey data (largely World Values Survey). . . . My big question pertains to (what I would call) exploratory analysis of multilevel data, especially when the group-level predictors are of theoretical importance. A lot of what I do involves analyzing cross-national survey items of citizen attitudes, typically of political leadership. These survey items are usually yes/no responses, or four-part responses indicating a level of agreement (strongly agree, agree, disagree, strongly disagree) that can be condensed into a binary variable. I believe these can be explained by reference to country-level factors. Much of the group-level variables of interest are count variables with a modal value of 0, which can be quite messy. How would you recommend exploring the variation in the dependent variable as it could be explained by the group-level count variable of interest, before fitting the multilevel model itself? When

3 0.11653974 2273 andrew gelman stats-2014-03-29-References (with code) for Bayesian hierarchical (multilevel) modeling and structural equation modeling

Introduction: A student writes: I am new to Bayesian methods. While I am reading your book, I have some questions for you. I am interested in doing Bayesian hierarchical (multi-level) linear regression (e.g., random-intercept model) and Bayesian structural equation modeling (SEM)—for causality. Do you happen to know if I could find some articles, where authors could provide data w/ R and/or BUGS codes that I could replicate them? My reply: For Bayesian hierarchical (multi-level) linear regression and causal inference, see my book with Jennifer Hill. For Bayesian structural equation modeling, try google and you’ll find some good stuff. Also, I recommend Stan (http://mc-stan.org/) rather than Bugs.

4 0.1081689 1981 andrew gelman stats-2013-08-14-The robust beauty of improper linear models in decision making

Introduction: Andreas Graefe writes (see here here here ): The usual procedure for developing linear models to predict any kind of target variable is to identify a subset of most important predictors and to estimate weights that provide the best possible solution for a given sample. The resulting “optimally” weighted linear composite is then used when predicting new data. This approach is useful in situations with large and reliable datasets and few predictor variables. However, a large body of analytical and empirical evidence since the 1970s shows that the weighting of variables is of little, if any, value in situations with small and noisy datasets and a large number of predictor variables. In such situations, including all relevant variables is more important than their weighting. These findings have yet to impact many fields. This study uses data from nine established U.S. election-forecasting models whose forecasts are regularly published in academic journals to demonstrate the value o

5 0.099115305 439 andrew gelman stats-2010-11-30-Of psychology research and investment tips

Introduction: A few days after “ Dramatic study shows participants are affected by psychological phenomena from the future ,” (see here ) the British Psychological Society follows up with “ Can psychology help combat pseudoscience? .” Somehow I’m reminded of that bit of financial advice which says, if you want to save some money, your best investment is to pay off your credit card bills.

6 0.097052522 972 andrew gelman stats-2011-10-25-How do you interpret standard errors from a regression fit to the entire population?

7 0.096728452 144 andrew gelman stats-2010-07-13-Hey! Here’s a referee report for you!

8 0.08888606 1967 andrew gelman stats-2013-08-04-What are the key assumptions of linear regression?

9 0.08827848 770 andrew gelman stats-2011-06-15-Still more Mr. P in public health

10 0.087481849 1392 andrew gelman stats-2012-06-26-Occam

11 0.085582256 2364 andrew gelman stats-2014-06-08-Regression and causality and variable ordering

12 0.084826618 476 andrew gelman stats-2010-12-19-Google’s word count statistics viewer

13 0.083272278 250 andrew gelman stats-2010-09-02-Blending results from two relatively independent multi-level models

14 0.081812918 352 andrew gelman stats-2010-10-19-Analysis of survey data: Design based models vs. hierarchical modeling?

15 0.079316162 1656 andrew gelman stats-2013-01-05-Understanding regression models and regression coefficients

16 0.079068013 820 andrew gelman stats-2011-07-25-Design of nonrandomized cluster sample study

17 0.078830585 888 andrew gelman stats-2011-09-03-A psychology researcher asks: Is Anova dead?

18 0.07791654 1972 andrew gelman stats-2013-08-07-When you’re planning on fitting a model, build up to it by fitting simpler models first. Then, once you have a model you like, check the hell out of it

19 0.076883532 1430 andrew gelman stats-2012-07-26-Some thoughts on survey weighting

20 0.075293668 1999 andrew gelman stats-2013-08-27-Bayesian model averaging or fitting a larger model


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.137), (1, 0.078), (2, 0.081), (3, -0.049), (4, 0.072), (5, 0.041), (6, 0.001), (7, -0.021), (8, 0.057), (9, 0.034), (10, 0.028), (11, -0.014), (12, 0.001), (13, 0.023), (14, 0.001), (15, 0.011), (16, -0.011), (17, -0.004), (18, 0.003), (19, 0.002), (20, -0.004), (21, 0.029), (22, 0.024), (23, -0.019), (24, -0.027), (25, -0.007), (26, 0.021), (27, -0.018), (28, -0.014), (29, 0.011), (30, 0.015), (31, 0.002), (32, 0.021), (33, 0.018), (34, -0.006), (35, 0.009), (36, 0.029), (37, 0.013), (38, -0.023), (39, 0.02), (40, 0.018), (41, -0.025), (42, 0.031), (43, -0.007), (44, 0.045), (45, -0.004), (46, 0.008), (47, 0.008), (48, 0.018), (49, 0.007)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97496724 1294 andrew gelman stats-2012-05-01-Modeling y = a + b + c


2 0.86802185 1218 andrew gelman stats-2012-03-18-Check your missing-data imputations using cross-validation

Introduction: Elena Grewal writes: I am currently using the iterative regression imputation model as implemented in the Stata ICE package. I am using data from a survey of about 90,000 students in 142 schools and my variable of interest is parent level of education. I want only this variable to be imputed with as little bias as possible as I am not using any other variable. So I scoured the survey for every variable I thought could possibly predict parent education. The main variable I found is parent occupation, which explains about 35% of the variance in parent education for the students with complete data on both. I then include the 20 other variables I found in the survey in a regression predicting parent education, which explains about 40% of the variance in parent education for students with complete data on all the variables. My question is this: many of the other variables I found have more missing values than the parent education variable, and also, although statistically significant

3 0.85593581 1900 andrew gelman stats-2013-06-15-Exploratory multilevel analysis when group-level variables are of importance

Introduction: Steve Miller writes: Much of what I do is cross-national analyses of survey data (largely World Values Survey). . . . My big question pertains to (what I would call) exploratory analysis of multilevel data, especially when the group-level predictors are of theoretical importance. A lot of what I do involves analyzing cross-national survey items of citizen attitudes, typically of political leadership. These survey items are usually yes/no responses, or four-part responses indicating a level of agreement (strongly agree, agree, disagree, strongly disagree) that can be condensed into a binary variable. I believe these can be explained by reference to country-level factors. Much of the group-level variables of interest are count variables with a modal value of 0, which can be quite messy. How would you recommend exploring the variation in the dependent variable as it could be explained by the group-level count variable of interest, before fitting the multilevel model itself? When

4 0.82581389 14 andrew gelman stats-2010-05-01-Imputing count data

Introduction: Guy asks: I am analyzing an original survey of farmers in Uganda. I am hoping to use a battery of welfare proxy variables to create a single welfare index using PCA. I have quick question which I hope you can find time to address: How do you recommend treating count data? (for example # of rooms, # of chickens, # of cows, # of radios)? In my dataset these variables are highly skewed with many responses at zero (which makes taking the natural log problematic). In the case of # of cows or chickens several obs have values in the hundreds. My response: Here’s what we do in our mi package in R. We split a variable into two parts: an indicator for whether it is positive, and the positive part. That is, y = u*v. Then u is binary and can be modeled using logisitc regression, and v can be modeled on the log scale. At the end you can round to the nearest integer if you want to avoid fractional values.

5 0.80885357 1967 andrew gelman stats-2013-08-04-What are the key assumptions of linear regression?

Introduction: Andy Cooper writes: A link to an article , “Four Assumptions Of Multiple Regression That Researchers Should Always Test”, has been making the rounds on Twitter. Their first rule is “Variables are Normally distributed.” And they seem to be talking about the independent variables – but then later bring in tests on the residuals (while admitting that the normally-distributed error assumption is a weak assumption). I thought we had long-since moved away from transforming our independent variables to make them normally distributed for statistical reasons (as opposed to standardizing them for interpretability, etc.) Am I missing something? I agree that leverage in a influence is important, but normality of the variables? The article is from 2002, so it might be dated, but given the popularity of the tweet, I thought I’d ask your opinion. My response: There’s some useful advice on that page but overall I think the advice was dated even in 2002. In section 3.6 of my book wit

6 0.80758703 627 andrew gelman stats-2011-03-24-How few respondents are reasonable to use when calculating the average by county?

7 0.79938954 1121 andrew gelman stats-2012-01-15-R-squared for multilevel models

8 0.79578131 257 andrew gelman stats-2010-09-04-Question about standard range for social science correlations

9 0.79280293 1981 andrew gelman stats-2013-08-14-The robust beauty of improper linear models in decision making

10 0.79270631 251 andrew gelman stats-2010-09-02-Interactions of predictors in a causal model

11 0.78619897 1094 andrew gelman stats-2011-12-31-Using factor analysis or principal components analysis or measurement-error models for biological measurements in archaeology?

12 0.77880448 753 andrew gelman stats-2011-06-09-Allowing interaction terms to vary

13 0.77841842 888 andrew gelman stats-2011-09-03-A psychology researcher asks: Is Anova dead?

14 0.77466166 86 andrew gelman stats-2010-06-14-“Too much data”?

15 0.77438027 1248 andrew gelman stats-2012-04-06-17 groups, 6 group-level predictors: What to do?

16 0.77224886 375 andrew gelman stats-2010-10-28-Matching for preprocessing data for causal inference

17 0.76670361 352 andrew gelman stats-2010-10-19-Analysis of survey data: Design based models vs. hierarchical modeling?

18 0.76361752 948 andrew gelman stats-2011-10-10-Combining data from many sources

19 0.75245678 1908 andrew gelman stats-2013-06-21-Interpreting interactions in discrete-data regression

20 0.75035936 770 andrew gelman stats-2011-06-15-Still more Mr. P in public health


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(2, 0.054), (15, 0.05), (16, 0.069), (21, 0.013), (24, 0.095), (53, 0.074), (57, 0.011), (83, 0.158), (84, 0.033), (85, 0.011), (89, 0.022), (95, 0.012), (98, 0.017), (99, 0.245)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.92601061 1307 andrew gelman stats-2012-05-07-The hare, the pineapple, and Ed Wegman

Introduction: Commenters here are occasionally bothered that I spend so much time attacking frauds and plagiarists. See, for example, here and here . Why go on and on about these losers, given that there are more important problems in the world such as war, pestilence, hunger, and graphs where the y-axis doesn’t go all the way down to zero? Part of the story is that I do research for a living so I resent people who devalue research through misattribution or fraud, in the same way that rich people don’t like counterfeiters. What really bugs me, though, is when cheaters get caught and still don’t admit it. People like Hauser, Wegman, Fischer, and Weick get under my skin because they have the chutzpah to just deny deny deny. The grainy time-stamped videotape with their hand in the cookie jar is right there, and they’ll still talk around the problem. Makes me want to scream. This happens all the time . All. Over. The. Place. Everybody makes mistakes, and just about everybody does thing

same-blog 2 0.92017817 1294 andrew gelman stats-2012-05-01-Modeling y = a + b + c


3 0.91534668 1312 andrew gelman stats-2012-05-11-Are our referencing errors undermining our scholarship and credibility? The case of expatriate failure rates

Introduction: Thomas Basbøll points to this ten-year-old article from Anne-Wil Harzing on the consequences of sloppy citations. Harzing tells the story of an unsupported claim that is contradicted by published data but has been presented as fact in a particular area of the academic literature. She writes that “high expatriate failure rates [with "expatriate failure" defined as "the expatriate returning home before his/her contractual period of employment abroad expires"] were in fact a myth created by massive misquotations and careless copying of references.” Many papers claimed an expatriate failure rate of 25-40% (according to Harzing, this is much higher than the actual rate as estimated from empirical data), with this overly-high rate supported by a complicated link of references leading to . . . no real data. Hartzing reports the following published claims: Harvey (1996: 103): `The rate of failure of expatriate managers relocating overseas from United States based MNCs has been estima

4 0.90172935 926 andrew gelman stats-2011-09-26-NYC

Introduction: Our downstairs neighbor hates us. She looks away from us when we see them on the street, if we’re coming into the building at the same time she doesn’t hold open the door, and if we’re in the elevator when it stops on her floor, she refuses to get on. On the other hand, if you’re a sociology professor in Chicago, one of your colleagues might try to run you over in a parking lot. So I guess I’m getting off easy.

5 0.89655995 645 andrew gelman stats-2011-04-04-Do you have any idea what you’re talking about?

Introduction: We all have opinions about the federal budget and how it should be spent. Infrequently, those opinions are informed by some knowledge about where the money actually goes. It turns out that most people don’t have a clue. What about you? Here, take this poll/quiz and then compare your answers to (1) what other people said, in a CNN poll that asked about these same items and (2) compare your answers to the real answers. Quiz is below the fold. The questions below are from a CNN poll. ======== Think about all the money that the federal government spent last year. I’m going to name a few federal programs and for each one, I’d like you to estimate what percentage of the federal government’s budget last year was spent on each of those programs. Medicare — the federal health program for the elderly Medicaid — the federal health program for the poor Social Security Military spending by the Department of Defense Aid to foreign countries for international development

6 0.89511496 1890 andrew gelman stats-2013-06-09-Frontiers of Science update

7 0.89071947 1977 andrew gelman stats-2013-08-11-Debutante Hill

8 0.88945782 1456 andrew gelman stats-2012-08-13-Macro, micro, and conflicts of interest

9 0.86570287 1923 andrew gelman stats-2013-07-03-Bayes pays!

10 0.85280949 1042 andrew gelman stats-2011-12-05-Timing is everything!

11 0.85239309 711 andrew gelman stats-2011-05-14-Steven Rhoads’s book, “The Economist’s View of the World”

12 0.85215384 649 andrew gelman stats-2011-04-05-Internal and external forecasting

13 0.8441608 248 andrew gelman stats-2010-09-01-Ratios where the numerator and denominator both change signs

14 0.8432349 1704 andrew gelman stats-2013-02-03-Heuristics for identifying ecological fallacies?

15 0.83999056 495 andrew gelman stats-2010-12-31-“Threshold earners” and economic inequality

16 0.83824044 1555 andrew gelman stats-2012-10-31-Social scientists who use medical analogies to explain causal inference are, I think, implicitly trying to borrow some of the scientific and cultural authority of that field for our own purposes

17 0.8381682 2125 andrew gelman stats-2013-12-05-What predicts whether a school district will participate in a large-scale evaluation?

18 0.83738935 446 andrew gelman stats-2010-12-03-Is 0.05 too strict as a p-value threshold?

19 0.83718365 354 andrew gelman stats-2010-10-19-There’s only one Amtrak

20 0.83710212 1905 andrew gelman stats-2013-06-18-There are no fat sprinters