2296 andrew gelman stats-2014-04-19-Index or indicator variables


Introduction: Someone who doesn’t want his name shared (for the perhaps reasonable reason that he’ll “one day not be confused, and would rather my confusion not live on online forever”) writes: I’m exploring HLMs and stan, using your book with Jennifer Hill as my field guide to this new territory. I think I have a generally clear grasp on the material, but wanted to be sure I haven’t gone astray. The problem I’m working on involves a multi-nation survey of students, and I’m especially interested in understanding the effects of country, religion, and sex, and the interactions among those factors (using IRT to estimate individual-level ability, then estimating individual, school, and country effects). Following the basic approach laid out in chapter 13 for such interactions between levels, I think I need to create a matrix of indicator variables for religion and sex. Elsewhere in the book, you recommend against indicator variables in favor of a single index variable. Am I right in thinking t


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Someone who doesn’t want his name shared (for the perhaps reasonable reason that he’ll “one day not be confused, and would rather my confusion not live on online forever”) writes: I’m exploring HLMs and stan, using your book with Jennifer Hill as my field guide to this new territory. [sent-1, score-0.253]

2 I think I have a generally clear grasp on the material, but wanted to be sure I haven’t gone astray. [sent-2, score-0.078]

3 Following the basic approach laid out in chapter 13 for such interactions between levels, I think I need to create a matrix of indicator variables for religion and sex. [sent-4, score-1.544]

4 Elsewhere in the book, you recommend against indicator variables in favor of a single index variable. [sent-5, score-0.625]

5 Am I right in thinking that this is purely a matter of convenience, and that the matrix formulation of chapter 13 requires indicator variables, but that the matrix of indicators or the vector of indices yield otherwise identical results? [sent-6, score-2.064]

6 I replied: Yes, models can be formulated equivalently in terms of index or indicator variables. [sent-8, score-0.635]
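The equivalence can be checked numerically. The following is a minimal sketch, not code from the post: the helper `indicator_matrix`, the `religion` codes, and the coefficient values are all made up for illustration. It shows that multiplying a matrix of indicators by a coefficient vector gives the same result as looking the coefficients up with an index variable.

```python
import numpy as np

def indicator_matrix(index, n_levels):
    """Return an (n_obs, n_levels) 0/1 matrix with exactly one 1 per row."""
    index = np.asarray(index)
    X = np.zeros((index.size, n_levels))
    X[np.arange(index.size), index] = 1.0
    return X

# Hypothetical data: 6 respondents, religion coded as an index 0..2.
religion = np.array([0, 2, 1, 1, 0, 2])
# One coefficient per level (made-up values).
beta = np.array([0.5, -1.0, 2.0])

X = indicator_matrix(religion, n_levels=3)
via_indicators = X @ beta    # matrix (indicator) formulation
via_index = beta[religion]   # index formulation

assert np.allclose(via_indicators, via_index)
```

The index form stores one integer per observation instead of a row of zeros and ones, which is presumably part of why it is the more convenient bookkeeping once a factor has many levels.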

7 If a discrete variable can take on a bunch of different possible values (for example, 50 states), it makes sense to use a multilevel model rather than to include indicators as predictors with unmodeled coefficients. [sent-9, score-0.657]
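To make the contrast with unmodeled coefficients concrete, here is a rough sketch of partial pooling for the simplest case: a varying-intercept model with no predictors, using the standard precision-weighted average of each group mean and the overall mean. It assumes the two standard deviations `sigma_y` and `sigma_alpha` are known, which in a real multilevel fit they are not; the function name and the toy data are hypothetical.

```python
import numpy as np

def partial_pool(y, group, n_groups, sigma_y, sigma_alpha):
    """Precision-weighted compromise between group means and the grand mean.

    Assumes sigma_y (within-group sd) and sigma_alpha (between-group sd)
    are known; a full multilevel model would estimate them from the data.
    """
    y = np.asarray(y, dtype=float)
    group = np.asarray(group)
    grand_mean = y.mean()
    n_j = np.bincount(group, minlength=n_groups)
    sum_j = np.bincount(group, weights=y, minlength=n_groups)
    # Groups with no data fall back to the grand mean.
    ybar_j = np.divide(sum_j, n_j,
                       out=np.full(n_groups, grand_mean),
                       where=n_j > 0)
    w_data = n_j / sigma_y**2
    w_prior = 1.0 / sigma_alpha**2
    return (w_data * ybar_j + w_prior * grand_mean) / (w_data + w_prior)

# Toy example: two groups with two observations each.
est = partial_pool([1.0, 1.0, 3.0, 3.0], [0, 0, 1, 1],
                   n_groups=2, sigma_y=1.0, sigma_alpha=1.0)
```

With a very large `sigma_alpha` the estimates approach the raw group means (the unmodeled-indicator answer), and with a very small one they collapse to the grand mean.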

8 If the variable takes on only two or three values, you can still do a multilevel model but really it would be better at that point to use informative priors for any variance parameters. [sent-10, score-0.227]

9 That’s a tactic we do not discuss in our book but which is easy to implement in Stan, and I’m hoping to do more of it in the future. [sent-11, score-0.212]

10 To which my correspondent wrote: The main difference that occurs to me as I work through implementing this is that the matrix of indicator variables loses information about what the underlying variable was. [sent-12, score-1.374]

11 So, for instance, if the matrix mixes an indicator for sex and n indicators for religion and m indicators for schools, we’d have Sigma_beta be an (m+n+1) × (m+n+1) matrix, when we really want a 3×3 matrix. [sent-13, score-1.854]
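The column count is easy to verify. This sketch uses hypothetical sizes (n = 4 religions, m = 5 schools) and made-up index vectors; it stacks one sex indicator with the religion and school indicators and compares the covariance dimension implied by the columns with the dimension implied by treating sex, religion, and school as three factors.

```python
import numpy as np

def one_hot(index, n_levels):
    """Expand an index vector into an (n_obs, n_levels) indicator matrix."""
    index = np.asarray(index)
    out = np.zeros((index.size, n_levels))
    out[np.arange(index.size), index] = 1.0
    return out

# Hypothetical sizes: n religions, m schools, and a binary sex indicator.
n, m = 4, 5
sex = np.array([0, 1, 1, 0, 1, 0])
religion = np.array([0, 3, 1, 2, 2, 0])
school = np.array([4, 0, 1, 3, 2, 2])

# One column for sex, n columns for religion, m columns for school.
X = np.hstack([sex[:, None].astype(float),
               one_hot(religion, n),
               one_hot(school, m)])

n_cols = X.shape[1]                      # m + n + 1
cov_shape_indicator = (n_cols, n_cols)   # covariance over every column
cov_shape_by_factor = (3, 3)             # covariance over the three factors
```

Once the indicators are stacked, nothing in `X` records which columns belong to which factor, which is the information loss the correspondent describes.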

12 I could set up the basic structure of Sigma_beta, separately estimate the diagonal elements with a series of multilevel loops by sex, religion, and school, and eschew the matrix formulation in the individual model. [sent-14, score-1.381]

13 So instead of $y \sim N(X_i B_{j[i]}, \sigma_y^2)$ it would be (roughly, I’m doing this on my phone): $y_i \sim N(\beta^{\text{sex}}_{i} + \beta^{\text{sex,country}}_{\text{country}[i]} + \beta^{\text{religion}}_{i} + \beta^{\text{religion,country}}_{i,\,\text{country}[i]} + \beta^{\text{school}}_{i} + \beta^{\text{school,country}}_{i,\,\text{country}[i]}, \sigma_y^2)$ And the group-level formulation unchanged. [sent-15, score-0.198]
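One possible reading of this expression, written with explicit index variables, is sketched below. This is an interpretation rather than the correspondent's actual model: it assumes each main effect is looked up by the observation's own level (for example `b["sex"][sex[i]]`) and each interaction by the pair (level, country). All array names, sizes, and coefficient values are hypothetical.

```python
import numpy as np

def linear_predictor(sex, religion, school, country, b):
    """Mean of y_i: main effects plus by-country interactions, via indexing."""
    return (
        b["sex"][sex]
        + b["sex_country"][sex, country]
        + b["religion"][religion]
        + b["religion_country"][religion, country]
        + b["school"][school]
        + b["school_country"][school, country]
    )

# Hypothetical sizes and made-up coefficients, for illustration only.
n_sex, n_religion, n_school, n_country = 2, 3, 4, 2
rng = np.random.default_rng(0)
b = {
    "sex": rng.normal(size=n_sex),
    "sex_country": rng.normal(size=(n_sex, n_country)),
    "religion": rng.normal(size=n_religion),
    "religion_country": rng.normal(size=(n_religion, n_country)),
    "school": rng.normal(size=n_school),
    "school_country": rng.normal(size=(n_school, n_country)),
}

# Index variables for four respondents.
sex = np.array([0, 1, 1, 0])
religion = np.array([2, 0, 1, 1])
school = np.array([3, 0, 2, 1])
country = np.array([0, 1, 1, 0])

mu = linear_predictor(sex, religion, school, country, b)
```

Each coefficient batch keeps its own identity here, so each could in principle be given its own group-level variance instead of sharing one large covariance matrix.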

14 Sigma_beta becomes a 3×3 matrix rather than an m+n+1 matrix, which seems both more reasonable and more computationally tractable. [sent-16, score-0.627]

15 My reply: Now I’m getting tangled in your notation. [sent-17, score-0.071]


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('matrix', 0.49), ('indicator', 0.381), ('indicators', 0.271), ('religion', 0.226), ('formulation', 0.198), ('sex', 0.141), ('variables', 0.134), ('country', 0.126), ('multilevel', 0.115), ('variable', 0.112), ('index', 0.11), ('interactions', 0.095), ('indices', 0.092), ('irt', 0.092), ('eschew', 0.092), ('hlms', 0.092), ('diagonal', 0.087), ('unmodeled', 0.083), ('tactic', 0.083), ('stan', 0.08), ('chapter', 0.078), ('grasp', 0.078), ('values', 0.076), ('loops', 0.076), ('basic', 0.074), ('mixes', 0.074), ('loses', 0.073), ('equivalently', 0.073), ('computationally', 0.073), ('book', 0.072), ('formulated', 0.071), ('tangled', 0.071), ('school', 0.071), ('forever', 0.069), ('individual', 0.068), ('laid', 0.066), ('elements', 0.066), ('implementing', 0.064), ('reasonable', 0.064), ('vector', 0.064), ('convenience', 0.062), ('correspondent', 0.06), ('occurs', 0.06), ('exploring', 0.06), ('separately', 0.058), ('estimate', 0.057), ('guide', 0.057), ('elsewhere', 0.057), ('implement', 0.057), ('effects', 0.057)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999988 2296 andrew gelman stats-2014-04-19-Index or indicator variables

2 0.20107111 2258 andrew gelman stats-2014-03-21-Random matrices in the news

Introduction: From 2010: Mark Buchanan wrote a cover article for the New Scientist on random matrices, a heretofore obscure area of probability theory that his headline writer characterizes as “the deep law that shapes our reality.” It’s interesting stuff, and he gets into some statistical applications at the end, so I’ll give you my take on it. But first, some background. About two hundred years ago, the mathematician/physicist Laplace discovered what is now called the central limit theorem, which is that, under certain conditions, the average of a large number of small random variables has an approximate normal (bell-shaped) distribution. A bit over 100 years ago, social scientists such as Galton applied this theorem to all sorts of biological and social phenomena. The central limit theorem, in its generality, is also important in the information that it indirectly conveys when it fails. For example, the distribution of the heights of adult men or women is nicely bell-shaped, but the

3 0.18694586 1753 andrew gelman stats-2013-03-06-Stan 1.2.0 and RStan 1.2.0

Introduction: Stan 1.2.0 and RStan 1.2.0 are now available for download. See: http://mc-stan.org/ Here are the highlights. Full Mass Matrix Estimation during Warmup Yuanjun Gao, a first-year grad student here at Columbia (!), built a regularized mass-matrix estimator. This helps for posteriors with high correlation among parameters and varying scales. We’re still testing this ourselves, so the estimation procedure may change in the future (don’t worry — it satisfies detailed balance as is, but we might be able to make it more computationally efficient in terms of time per effective sample). It’s not the default option. The major reason is the matrix operations required are expensive, raising the algorithm cost to , where is the average number of leapfrog steps, is the number of iterations, and is the number of parameters. Yuanjun did a great job with the Cholesky factorizations and implemented this about as efficiently as is possible. (His homework for Andrew’s class w

4 0.1498969 1900 andrew gelman stats-2013-06-15-Exploratory multilevel analysis when group-level variables are of importance

Introduction: Steve Miller writes: Much of what I do is cross-national analyses of survey data (largely World Values Survey). . . . My big question pertains to (what I would call) exploratory analysis of multilevel data, especially when the group-level predictors are of theoretical importance. A lot of what I do involves analyzing cross-national survey items of citizen attitudes, typically of political leadership. These survey items are usually yes/no responses, or four-part responses indicating a level of agreement (strongly agree, agree, disagree, strongly disagree) that can be condensed into a binary variable. I believe these can be explained by reference to country-level factors. Much of the group-level variables of interest are count variables with a modal value of 0, which can be quite messy. How would you recommend exploring the variation in the dependent variable as it could be explained by the group-level count variable of interest, before fitting the multilevel model itself? When

5 0.14304547 931 andrew gelman stats-2011-09-29-Hamiltonian Monte Carlo stories

Introduction: Tomas Iesmantas had asked me for advice on a regression problem with 50 parameters, and I’d recommended Hamiltonian Monte Carlo. A few weeks later he reported back: After trying several modifications (HMC for all parameters at once, HMC just for first level parameters and Riemann manifold Hamiltonian Monte Carlo method), I finally got it running with HMC just for first level parameters and for others using direct sampling, since conditional distributions turned out to have closed form. However, even in this case it is quite tricky, since I had to employ mass matrix and not just diagonal but at the beginning of algorithm generated it randomly (ensuring it is positive definite). Such random generation of mass matrix is quite blind step, but it proved to be quite helpful. Riemann manifold HMC is quite vagarious, or to be more specific, metric of manifold is very sensitive. In my model log-likelihood I had exponents and values of metrics matrix elements was very large and wh

6 0.13534459 1248 andrew gelman stats-2012-04-06-17 groups, 6 group-level predictors: What to do?

7 0.13220567 2290 andrew gelman stats-2014-04-14-On deck this week

8 0.12936866 2145 andrew gelman stats-2013-12-24-Estimating and summarizing inference for hierarchical variance parameters when the number of groups is small

9 0.12751548 288 andrew gelman stats-2010-09-21-Discussion of the paper by Girolami and Calderhead on Bayesian computation

10 0.12620592 352 andrew gelman stats-2010-10-19-Analysis of survey data: Design based models vs. hierarchical modeling?

11 0.12277327 14 andrew gelman stats-2010-05-01-Imputing count data

12 0.12158701 1627 andrew gelman stats-2012-12-17-Stan and RStan 1.1.0

13 0.11650057 553 andrew gelman stats-2011-02-03-is it possible to “overstratify” when assigning a treatment in a randomized control trial?

14 0.10838059 888 andrew gelman stats-2011-09-03-A psychology researcher asks: Is Anova dead?

15 0.10789315 1486 andrew gelman stats-2012-09-07-Prior distributions for regression coefficients

16 0.10781837 2161 andrew gelman stats-2014-01-07-My recent debugging experience

17 0.10630602 1339 andrew gelman stats-2012-05-23-Learning Differential Geometry for Hamiltonian Monte Carlo

18 0.10562734 1991 andrew gelman stats-2013-08-21-BDA3 table of contents (also a new paper on visualization)

19 0.1046141 99 andrew gelman stats-2010-06-19-Paired comparisons

20 0.1039653 1786 andrew gelman stats-2013-04-03-Hierarchical array priors for ANOVA decompositions


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.17), (1, 0.086), (2, 0.059), (3, 0.01), (4, 0.12), (5, 0.064), (6, 0.028), (7, -0.076), (8, 0.01), (9, 0.061), (10, -0.005), (11, 0.013), (12, -0.004), (13, -0.001), (14, 0.09), (15, -0.013), (16, -0.03), (17, 0.03), (18, 0.008), (19, 0.003), (20, -0.037), (21, 0.023), (22, -0.013), (23, 0.024), (24, 0.018), (25, -0.035), (26, 0.019), (27, 0.03), (28, -0.036), (29, 0.006), (30, 0.008), (31, 0.031), (32, 0.006), (33, 0.016), (34, 0.005), (35, -0.018), (36, 0.005), (37, 0.055), (38, -0.005), (39, 0.004), (40, -0.038), (41, -0.048), (42, 0.034), (43, -0.012), (44, -0.01), (45, 0.011), (46, 0.016), (47, -0.001), (48, -0.022), (49, -0.004)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.96664602 2296 andrew gelman stats-2014-04-19-Index or indicator variables

2 0.77734798 753 andrew gelman stats-2011-06-09-Allowing interaction terms to vary

Introduction: Zoltan Fazekas writes: I am a 2nd year graduate student in political science at the University of Vienna. In my empirical research I often employ multilevel modeling, and recently I came across a situation that kept me wondering for quite a while. As I did not find much on this in the literature and considering the topics that you work on and blog about, I figured I will try to contact you. The situation is as follows: in a linear multilevel model, there are two important individual level predictors (x1 and x2) and a set of controls. Let us assume that there is a theoretically grounded argument suggesting that an interaction between x1 and x2 should be included in the model (x1 * x2). Both x1 and x2 are let to vary randomly across groups. Would this directly imply that the coefficient of the interaction should also be left to vary across country? This is even more burning if there is no specific hypothesis on the variance of the conditional effect across countries. And then i

3 0.77530259 1121 andrew gelman stats-2012-01-15-R-squared for multilevel models

Introduction: Fred Schiff writes: I’m writing to you to ask about the “R-squared” approximation procedure you suggest in your 2004 book with Dr. Hill. [See also this paper with Pardoe---ed.] I’m a media sociologist at the University of Houston. I’ve been using HLM3 for about two years. Briefly about my data. It’s a content analysis of news stories with a continuous scale dependent variable, story prominence. I have 6090 news stories, 114 newspapers, and 59 newspaper group owners. All the Level-1, Level-2 and dependent variables have been standardized. Since the means were zero anyway, we left the variables uncentered. All the Level-3 ownership groups and characteristics are dichotomous scales that were left uncentered. PROBLEM: The single most important result I am looking for is to compare the strength of nine competing Level-1 variables in their ability to predict and explain the outcome variable, story prominence. We are trying to use the residuals to calculate a “R-squ

4 0.76068819 1248 andrew gelman stats-2012-04-06-17 groups, 6 group-level predictors: What to do?

Introduction: Yi-Chun Ou writes: I am using a multilevel model with three levels. I read that you wrote a book about multilevel models, and wonder if you can solve the following question. The data structure is like this: Level one: customer (8444 customers) Level two: company (90 companies) Level three: industry (17 industries) I use 6 level-three variables (i.e. industry characteristics) to explain the variance of the level-one effect across industries. The question here is whether there is an over-fitting problem since there are only 17 industries. I understand that this must be a problem for non-multilevel models, but is it also a problem for multilevel models? My reply: Yes, this could be a problem. I’d suggest combining some of your variables into a common score, or using only some of the variables, or using strong priors to control the inferences. This is an interesting and important area of statistics research, to do this sort of thing systematically. There’s lots o

5 0.75775719 2145 andrew gelman stats-2013-12-24-Estimating and summarizing inference for hierarchical variance parameters when the number of groups is small

Introduction: Chris Che-Castaldo writes: I am trying to compute variance components for a hierarchical model where the group level has two binary predictors and their interaction. When I model each of these three predictors as N(0, tau) the model will not converge, perhaps because the number of coefficients in each batch is so small (2 for the main effects and 4 for the interaction). Although I could simply leave all these as predictors as unmodeled fixed effects, the last sentence of section 21.2 on page 462 of Gelman and Hill (2007) suggests this would not be a wise course of action: For example, it is not clear how to define the (finite) standard deviation of variables that are included in interactions. I am curious – is there still no clear cut way to directly compute the finite standard deviation for binary unmodeled variables that are also part of an interaction as well as the interaction itself? My reply: I’d recommend including these in your model (it’s probably easiest to do so

6 0.72081214 1966 andrew gelman stats-2013-08-03-Uncertainty in parameter estimates using multilevel models

7 0.70809948 948 andrew gelman stats-2011-10-10-Combining data from many sources

8 0.70751464 14 andrew gelman stats-2010-05-01-Imputing count data

9 0.70291698 1815 andrew gelman stats-2013-04-20-Displaying inferences from complex models

10 0.69975388 704 andrew gelman stats-2011-05-10-Multiple imputation and multilevel analysis

11 0.69115192 1294 andrew gelman stats-2012-05-01-Modeling y = a + b + c

12 0.69092953 25 andrew gelman stats-2010-05-10-Two great tastes that taste great together

13 0.68863571 1786 andrew gelman stats-2013-04-03-Hierarchical array priors for ANOVA decompositions

14 0.68666571 1908 andrew gelman stats-2013-06-21-Interpreting interactions in discrete-data regression

15 0.68502438 851 andrew gelman stats-2011-08-12-year + (1|year)

16 0.68458819 397 andrew gelman stats-2010-11-06-Multilevel quantile regression

17 0.68360227 1218 andrew gelman stats-2012-03-18-Check your missing-data imputations using cross-validation

18 0.68337643 1462 andrew gelman stats-2012-08-18-Standardizing regression inputs

19 0.68315619 726 andrew gelman stats-2011-05-22-Handling multiple versions of an outcome variable

20 0.68281037 1686 andrew gelman stats-2013-01-21-Finite-population Anova calculations for models with interactions


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(2, 0.017), (7, 0.026), (15, 0.017), (16, 0.086), (21, 0.032), (23, 0.031), (24, 0.144), (36, 0.013), (44, 0.017), (61, 0.017), (74, 0.01), (76, 0.028), (84, 0.028), (90, 0.01), (91, 0.05), (96, 0.057), (97, 0.039), (99, 0.28)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97962451 2296 andrew gelman stats-2014-04-19-Index or indicator variables

2 0.95365334 2065 andrew gelman stats-2013-10-17-Cool dynamic demographic maps provide beautiful illustration of Chris Rock effect

Introduction: Robert Gonzalez reports on some beautiful graphs from John Nelson. Here’s Nelson: The sexes start out homogenous, go super segregated in the teen years, segregate for business in the twenty-somethings, and re-couple for co-habitation years. Then the lights fade into faint pockets of pink. I [Nelson] am using simple tract-level population/gender counts from the US Census Bureau. Because their tract boundaries extend into the water and vacant area, I used NYC’s Bytes of the Big Apple zoning shapes to clip the census tracts to residentially zoned areas, giving me a more realistic (and more recognizable) definition of populated areas. The census breaks out their population counts by gender for five-year age spans ranging from teeny tiny infants through esteemed 85+ year-olds. And here’s Gonzalez: Between ages 0 and 14, the entire map is more or less an evenly mixed purple landscape; newborns, children and adolescents, after all, can’t really choose where the

3 0.9512403 586 andrew gelman stats-2011-02-23-A statistical version of Arrow’s paradox

Introduction: Unfortunately, when we deal with scientists, statisticians are often put in a setting reminiscent of Arrow’s paradox, where we are asked to provide estimates that are informative and unbiased and confidence statements that are correct conditional on the data and also on the underlying true parameter. [It's not generally possible for an estimate to do all these things at the same time -- ed.] Larry Wasserman feels that scientists are truly frequentist, and Don Rubin has told me how he feels that scientists interpret all statistical estimates Bayesianly. I have no doubt that both Larry and Don are correct. Voters want lower taxes and more services, and scientists want both Bayesian and frequency coverage; as the saying goes, everybody wants to go to heaven but nobody wants to die.

4 0.95088822 787 andrew gelman stats-2011-07-05-Different goals, different looks: Infovis and the Chris Rock effect

Introduction: Seth writes: Here’s my candidate for bad graphic of the year: I [Seth] studied it and learned nothing. I have no idea how they assigned colors to locations. I already knew that there were more within-city calls than calls to individual distant locations, for example that there are more SF-SF calls than SF-LA calls. The researchers took a huge rich database and boiled it down to nothing (in terms of information value), and I have a funny feeling they don’t realize how awful this is and what a waste. I send it to you because it isn’t obvious how to do better, at least not obvious to them. My reply: My first reaction is to agree–I don’t get anything out of this graph either! But let me step back. I think it’s best to understand this using the framework of my paper with Antony Unwin, by thinking of the goals that are satisfied by different sorts of graphs. What does this graph convey? It doesn’t tell us much about phone calls, but it does tell us that some peop

5 0.9490096 2246 andrew gelman stats-2014-03-13-An Economist’s Guide to Visualizing Data

Introduction: Stephen Jenkins wrote: I was thinking that you and your blog readers might be interested in “An Economist’s Guide to Visualizing Data” by Jonathan Schwabish, in the most recent Journal of Economic Perspectives (which is the American Economic Association’s main “outreach” journal in some ways). I replied: Ooh, I hate this so much! This seems to represent a horrible example of economists not recognizing that outsiders can help them. We do much much better in political science. To which Jenkins wrote: Ha! I guessed as much, hence sent it. And I’ll now admit I was surprised that JEP took the piece without getting Schwabish to widen his reference points. To elaborate a bit: I agree with Schwabish’s general advice (“show the data,” “reduce the clutter,” and “integrate the text and the graph”). But then he illustrates with 8 before-and-after stories in which he shows an existing graph and then gives his improvements. My problem is that I don’t like most of his “afte

6 0.94794202 18 andrew gelman stats-2010-05-06-$63,000 worth of abusive research . . . or just a really stupid waste of time?

7 0.94766557 736 andrew gelman stats-2011-05-29-Response to “Why Tables Are Really Much Better Than Graphs”

8 0.94685268 1591 andrew gelman stats-2012-11-26-Politics as an escape hatch

9 0.94660193 1023 andrew gelman stats-2011-11-22-Going Beyond the Book: Towards Critical Reading in Statistics Teaching

10 0.94496787 807 andrew gelman stats-2011-07-17-Macro causality

11 0.94478643 1883 andrew gelman stats-2013-06-04-Interrogating p-values

12 0.94362158 226 andrew gelman stats-2010-08-23-More on those L.A. Times estimates of teacher effectiveness

13 0.94338167 319 andrew gelman stats-2010-10-04-“Who owns Congress”

14 0.94322139 788 andrew gelman stats-2011-07-06-Early stopping and penalized likelihood

15 0.94316173 2227 andrew gelman stats-2014-02-27-“What Can we Learn from the Many Labs Replication Project?”

16 0.94311631 2112 andrew gelman stats-2013-11-25-An interesting but flawed attempt to apply general forecasting principles to contextualize attitudes toward risks of global warming

17 0.94302964 53 andrew gelman stats-2010-05-26-Tumors, on the left, or on the right?

18 0.94268733 205 andrew gelman stats-2010-08-13-Arnold Zellner

19 0.94235742 770 andrew gelman stats-2011-06-15-Still more Mr. P in public health

20 0.94222522 690 andrew gelman stats-2011-05-01-Peter Huber’s reflections on data analysis