andrew_gelman_stats andrew_gelman_stats-2011 andrew_gelman_stats-2011-948 knowledge-graph by maker-knowledge-mining

948 andrew gelman stats-2011-10-10-Combining data from many sources


meta info for this blog

Source: html

Introduction: Mark Grote writes: I’d like to request general feedback and references for a problem of combining disparate data sources in a regression model. We’d like to model log crop yield as a function of environmental predictors, but the observations come from many data sources and are peculiarly structured. Among the issues are: 1. Measurement precision in predictors and outcome varies widely with data sources. Some observations are in very coarse units of measurement, due to rounding or even observer guesswork. 2. There are obvious clusters of observations arising from studies in which crop yields were monitored over successive years in spatially proximate communities. Thus some variables may be constant within clusters–this is true even for log yield, probably due to rounding of similar yields. 3. Cluster size and intra-cluster association structure (temporal, spatial or both) vary widely across the dataset. My [Grote's] intuition is that we can learn about central tendency


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Mark Grote writes: I’d like to request general feedback and references for a problem of combining disparate data sources in a regression model. [sent-1, score-0.465]

2 We’d like to model log crop yield as a function of environmental predictors, but the observations come from many data sources and are peculiarly structured. [sent-2, score-0.917]

3 Measurement precision in predictors and outcome varies widely with data sources. [sent-4, score-0.396]

4 Some observations are in very coarse units of measurement, due to rounding or even observer guesswork. [sent-5, score-0.861]

5 There are obvious clusters of observations arising from studies in which crop yields were monitored over successive years in spatially proximate communities. [sent-7, score-1.46]

6 Thus some variables may be constant within clusters–this is true even for log yield, probably due to rounding of similar yields. [sent-8, score-0.578]

7 Cluster size and intra-cluster association structure (temporal, spatial or both) vary widely across the dataset. [sent-10, score-0.326]

8 My [Grote's] intuition is that we can learn about central tendency even by fitting models that deal only superficially with sample structure (e. [sent-11, score-0.332]

9 But I wonder if we could do better, while still keeping the analysis relatively simple. [sent-14, score-0.069]

10 Although multi-level modeling might appeal, many of the clusters are singletons, which gives me [Grote] pause. [sent-15, score-0.52]

11 It’s no problem doing multilevel modeling when many (or even most) of the clusters are singletons. [sent-17, score-0.693]
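This claim can be illustrated with a minimal partial-pooling sketch (hypothetical numbers and a simple normal-normal model, not anything from the post): a singleton cluster is no problem because its estimate is simply shrunk more strongly toward the grand mean.

```python
# Empirical-Bayes partial pooling for cluster means (hypothetical example).
# A singleton cluster (n_j = 1) poses no difficulty: its estimate is just
# pulled strongly toward the grand mean, with the amount of shrinkage
# governed by the within- and between-cluster variances.

def pooled_estimate(ybar_j, n_j, mu, sigma2, tau2):
    """Posterior mean of a cluster effect under a normal-normal model."""
    precision_data = n_j / sigma2    # information from the cluster's data
    precision_prior = 1.0 / tau2     # information from the population
    return (precision_data * ybar_j + precision_prior * mu) / (
        precision_data + precision_prior
    )

mu, sigma2, tau2 = 5.0, 4.0, 1.0     # assumed grand mean and variances

big = pooled_estimate(ybar_j=7.0, n_j=40, mu=mu, sigma2=sigma2, tau2=tau2)
single = pooled_estimate(ybar_j=7.0, n_j=1, mu=mu, sigma2=sigma2, tau2=tau2)

print(round(big, 3), round(single, 3))  # 6.818 5.4
```

With the same observed cluster mean of 7.0, the 40-observation cluster stays near its data while the singleton is shrunk most of the way to the grand mean of 5.0.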

12 I don’t think robust standard errors will get you anywhere. [sent-19, score-0.308]

13 It sounds like you want a model with different error variances for different data points. [sent-21, score-0.084]

14 That’s easy enough to do in Bugs/Jags or if programming by hand, possibly doable in Stata’s multilevel modeling functions, not so easy to do in lmer in R without some additional programming. [sent-22, score-0.737]
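As a rough stand-in for the Bugs/Jags formulation, here is a sketch (in Python, with made-up data and per-observation variances treated as known) of a regression in which each data point has its own error variance, fit by precision-weighted least squares:

```python
# Regression with a different error variance for each observation
# (hypothetical data; two "data sources" with precise vs. coarse
# measurement, standing in for the situation described above).
import numpy as np

rng = np.random.default_rng(0)

n = 200
x = rng.uniform(0, 10, n)
sigma = np.where(np.arange(n) < 100, 0.5, 3.0)  # precise vs. coarse source
y = 1.0 + 2.0 * x + rng.normal(0, sigma)

# Known per-observation variances -> precision weights w_i = 1 / sigma_i^2.
X = np.column_stack([np.ones(n), x])
W = np.diag(1.0 / sigma**2)
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

print(beta)  # should be close to the true (1.0, 2.0)
```

In a full Bayesian treatment the per-observation variances would themselves be modeled rather than plugged in, which is where Bugs/Jags becomes the more natural tool.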


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('grote', 0.395), ('clusters', 0.395), ('crop', 0.24), ('observations', 0.197), ('rounding', 0.179), ('yield', 0.143), ('log', 0.141), ('robust', 0.137), ('programming', 0.133), ('widely', 0.128), ('modeling', 0.125), ('doable', 0.12), ('coarse', 0.12), ('disparate', 0.12), ('proximate', 0.12), ('spatially', 0.12), ('sources', 0.12), ('structure', 0.118), ('measurement', 0.118), ('predictors', 0.113), ('successive', 0.113), ('due', 0.112), ('temporal', 0.108), ('monitored', 0.108), ('observer', 0.101), ('multilevel', 0.1), ('errors', 0.096), ('lmer', 0.091), ('easy', 0.084), ('variances', 0.084), ('arising', 0.084), ('yields', 0.083), ('varies', 0.082), ('cluster', 0.08), ('spatial', 0.08), ('units', 0.079), ('feedback', 0.078), ('request', 0.076), ('stata', 0.076), ('environmental', 0.076), ('standard', 0.075), ('precision', 0.073), ('constant', 0.073), ('even', 0.073), ('appeal', 0.072), ('tendency', 0.072), ('combining', 0.071), ('keeping', 0.069), ('intuition', 0.069), ('functions', 0.066)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0 948 andrew gelman stats-2011-10-10-Combining data from many sources


2 0.18135644 107 andrew gelman stats-2010-06-24-PPS in Georgia

Introduction: Lucy Flynn writes: I’m working at a non-profit organization called CRRC in the Republic of Georgia. I’m having a methodological problem and I saw the syllabus for your sampling class online and thought I might be able to ask you about it? We do a lot of complex surveys nationwide; our typical sample design is as follows: - stratify by rural/urban/capital - sub-stratify the rural and urban strata into NE/NW/SE/SW geographic quadrants - select voting precincts as PSUs - select households as SSUs - select individual respondents as TSUs I’m relatively new here, and past practice has been to sample voting precincts with probability proportional to size. It’s desirable because it’s not logistically feasible for us to vary the number of interviews per precinct with precinct size, so it makes the selection probabilities for households more even across precinct sizes. However, I have a complex sampling textbook (Lohr 1999), and it explains how complex it is to calculate sel
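The arithmetic behind the design Flynn describes can be sketched with hypothetical numbers: sampling precincts with probability proportional to size (PPS) and then interviewing a fixed number of households per precinct gives every household approximately the same overall selection probability.

```python
# PPS sketch (hypothetical numbers, not CRRC's actual frame): selecting
# precincts with probability proportional to size, then a fixed number of
# households per precinct, equalizes household selection probabilities.
sizes = [120, 480, 900, 1000]       # households per precinct
total = sum(sizes)
m = 2                               # precincts to sample
k = 10                              # households interviewed per precinct

for size in sizes:
    p_precinct = m * size / total   # approximate PPS inclusion probability
    p_household = k / size          # within-precinct probability
    print(size, p_precinct * p_household)
```

Each product equals m * k / total regardless of precinct size, which is exactly why the fixed-interviews-per-precinct design is logistically attractive.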

3 0.14040053 1966 andrew gelman stats-2013-08-03-Uncertainty in parameter estimates using multilevel models

Introduction: David Hsu writes: I have a (perhaps) simple question about uncertainty in parameter estimates using multilevel models — what is an appropriate threshold for measuring parameter uncertainty in a multilevel model? The reason why I ask is that I set out to do a crossed two-way model with two varying intercepts, similar to your flight simulator example in your 2007 book. The difference is that I have a lot of predictors specific to each cell (I think equivalent to airport and pilot in your example), and I find after modeling this in JAGS, I happily find that the predictors are much less important than the variability by cell (airport and pilot effects). Happily because this is what I am writing a paper about. However, I then went to check subsets of predictors using lm() and lmer(). I understand that they all use different estimation methods, but what I can’t figure out is why the errors on all of the coefficient estimates are *so* different. For example, using JAGS, and th

4 0.13107619 1248 andrew gelman stats-2012-04-06-17 groups, 6 group-level predictors: What to do?

Introduction: Yi-Chun Ou writes: I am using a multilevel model with three levels. I read that you wrote a book about multilevel models, and wonder if you can solve the following question. The data structure is like this: Level one: customer (8444 customers) Level two: companies (90 companies) Level three: industry (17 industries) I use 6 level-three variables (i.e. industry characteristics) to explain the variance of the level-one effect across industries. The question here is whether there is an over-fitting problem since there are only 17 industries. I understand that this must be a problem for non-multilevel models, but is it also a problem for multilevel models? My reply: Yes, this could be a problem. I’d suggest combining some of your variables into a common score, or using only some of the variables, or using strong priors to control the inferences. This is an interesting and important area of statistics research, to do this sort of thing systematically. There’s lots o

5 0.12686406 295 andrew gelman stats-2010-09-25-Clusters with very small numbers of observations

Introduction: James O’Brien writes: How would you explain, to a “classically-trained” hypothesis-tester, that “It’s OK to fit a multilevel model even if some groups have only one observation each”? I [O'Brien] think I understand the logic and the statistical principles at work in this, but I’m having trouble being clear and persuasive. I also feel like I’m contending with some methodological conventional wisdom here. My reply: I’m so used to this idea that I find it difficult to defend it in some sort of general conceptual way. So let me retreat to a more functional defense, which is that multilevel modeling gives good estimates, especially when the number of observations per group is small. One way to see this in any particular example is through cross-validation. Another way is to consider the alternatives. If you try really hard you can come up with a “classical hypothesis testing” approach which will do as well as the multilevel model. It would just take a lot of work. I’d r

6 0.12638265 939 andrew gelman stats-2011-10-03-DBQQ rounding for labeling charts and communicating tolerances

7 0.112813 753 andrew gelman stats-2011-06-09-Allowing interaction terms to vary

8 0.11039933 773 andrew gelman stats-2011-06-18-Should we always be using the t and robit instead of the normal and logit?

9 0.086132564 1934 andrew gelman stats-2013-07-11-Yes, worry about generalizing from data to population. But multilevel modeling is the solution, not the problem

10 0.085711733 1506 andrew gelman stats-2012-09-21-Building a regression model . . . with only 27 data points

11 0.085010998 383 andrew gelman stats-2010-10-31-Analyzing the entire population rather than a sample

12 0.084087364 1737 andrew gelman stats-2013-02-25-Correlation of 1 . . . too good to be true?

13 0.08340621 154 andrew gelman stats-2010-07-18-Predictive checks for hierarchical models

14 0.081647828 255 andrew gelman stats-2010-09-04-How does multilevel modeling affect the estimate of the grand mean?

15 0.078874402 1267 andrew gelman stats-2012-04-17-Hierarchical-multilevel modeling with “big data”

16 0.078514598 397 andrew gelman stats-2010-11-06-Multilevel quantile regression

17 0.076165371 1763 andrew gelman stats-2013-03-14-Everyone’s trading bias for variance at some point, it’s just done at different places in the analyses

18 0.075261585 1425 andrew gelman stats-2012-07-23-Examples of the use of hierarchical modeling to generalize to new settings

19 0.074649729 417 andrew gelman stats-2010-11-17-Clutering and variance components

20 0.073268421 784 andrew gelman stats-2011-07-01-Weighting and prediction in sample surveys


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.138), (1, 0.088), (2, 0.047), (3, -0.036), (4, 0.087), (5, 0.04), (6, -0.024), (7, -0.04), (8, 0.051), (9, 0.07), (10, 0.022), (11, -0.004), (12, -0.006), (13, -0.008), (14, 0.01), (15, 0.003), (16, -0.034), (17, -0.001), (18, 0.011), (19, -0.017), (20, 0.003), (21, 0.004), (22, -0.005), (23, 0.017), (24, -0.035), (25, -0.03), (26, -0.009), (27, 0.009), (28, 0.001), (29, -0.014), (30, 0.002), (31, 0.019), (32, -0.002), (33, -0.004), (34, 0.018), (35, 0.03), (36, -0.011), (37, 0.006), (38, 0.002), (39, -0.005), (40, -0.031), (41, -0.02), (42, -0.018), (43, -0.024), (44, 0.002), (45, 0.034), (46, 0.003), (47, 0.013), (48, -0.033), (49, -0.032)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.95327336 948 andrew gelman stats-2011-10-10-Combining data from many sources


2 0.8879658 1248 andrew gelman stats-2012-04-06-17 groups, 6 group-level predictors: What to do?


3 0.84409314 1934 andrew gelman stats-2013-07-11-Yes, worry about generalizing from data to population. But multilevel modeling is the solution, not the problem

Introduction: A sociologist writes in: Samuel Lucas has just published a paper in Quality and Quantity arguing that anything less than a full probability sample of higher levels in HLMs yields biased and unusable results. If I follow him correctly, he is arguing that not only are the SEs too small, but the parameter estimates themselves are biased and we cannot say in advance whether the bias is positive or negative. Lucas has thrown down a big gauntlet, advising us to throw away our data unless the sample of macro units is right and ignore the published results that fail this standard. Extreme. Is there another conclusion to be drawn? Other advice to be given? A Bayesian path out of the valley? Here’s the abstract to Lucas’s paper: The multilevel model has become a staple of social research. I textually and formally explicate sample design features that, I contend, are required for unbiased estimation of macro-level multilevel model parameters and the use of tools for statistical infe

4 0.81988186 772 andrew gelman stats-2011-06-17-Graphical tools for understanding multilevel models

Introduction: There are a few things I want to do: 1. Understand a fitted model using tools such as average predictive comparisons , R-squared, and partial pooling factors . In defining these concepts, Iain and I came up with some clever tricks, including (but not limited to): - Separating the inputs and averaging over all possible values of the input not being altered (for average predictive comparisons); - Defining partial pooling without referring to a raw-data or maximum-likelihood or no-pooling estimate (these don’t necessarily exist when you’re fitting logistic regression with sparse data); - Defining an R-squared for each level of a multilevel model. The methods get pretty complicated, though, and they have some loose ends–in particular, for average predictive comparisons with continuous input variables. So now we want to implement these in R and put them into arm along with bglmer etc. 2. Setting up coefplot so it works more generally (that is, so the graphics look nice

5 0.80612677 2296 andrew gelman stats-2014-04-19-Index or indicator variables

Introduction: Someone who doesn’t want his name shared (for the perhaps reasonable reason that he’ll “one day not be confused, and would rather my confusion not live on online forever”) writes: I’m exploring HLMs and stan, using your book with Jennifer Hill as my field guide to this new territory. I think I have a generally clear grasp on the material, but wanted to be sure I haven’t gone astray. The problem I’m working on involves a multi-nation survey of students, and I’m especially interested in understanding the effects of country, religion, and sex, and the interactions among those factors (using IRT to estimate individual-level ability, then estimating individual, school, and country effects). Following the basic approach laid out in chapter 13 for such interactions between levels, I think I need to create a matrix of indicator variables for religion and sex. Elsewhere in the book, you recommend against indicator variables in favor of a single index variable. Am I right in thinking t
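The index-versus-indicator distinction can be shown with a toy example (hypothetical categories, not the survey in question): indicator coding expands a categorical variable into one 0/1 column per level, while index coding keeps a single integer per observation, as used for varying intercepts like alpha[religion[i]] in Stan or JAGS.

```python
# Index vs. indicator coding for a categorical predictor (toy example).
religions = ["a", "b", "c", "a"]

# Indicator (dummy) coding: one 0/1 column per category.
categories = sorted(set(religions))
indicators = [[int(r == c) for c in categories] for r in religions]

# Index coding: a single integer per observation, suitable for indexing
# a vector of varying intercepts in a multilevel model.
index = [categories.index(r) for r in religions]

print(indicators)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
print(index)       # [0, 1, 2, 0]
```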

6 0.80201131 2294 andrew gelman stats-2014-04-17-If you get to the point of asking, just do it. But some difficulties do arise . . .

7 0.78405893 1737 andrew gelman stats-2013-02-25-Correlation of 1 . . . too good to be true?

8 0.78030276 1267 andrew gelman stats-2012-04-17-Hierarchical-multilevel modeling with “big data”

9 0.77886134 25 andrew gelman stats-2010-05-10-Two great tastes that taste great together

10 0.77256817 753 andrew gelman stats-2011-06-09-Allowing interaction terms to vary

11 0.76851958 383 andrew gelman stats-2010-10-31-Analyzing the entire population rather than a sample

12 0.76802486 1468 andrew gelman stats-2012-08-24-Multilevel modeling and instrumental variables

13 0.76234347 2086 andrew gelman stats-2013-11-03-How best to compare effects measured in two different time periods?

14 0.75958067 704 andrew gelman stats-2011-05-10-Multiple imputation and multilevel analysis

15 0.75894696 1294 andrew gelman stats-2012-05-01-Modeling y = a + b + c

16 0.7556448 1814 andrew gelman stats-2013-04-20-A mess with which I am comfortable

17 0.75459319 255 andrew gelman stats-2010-09-04-How does multilevel modeling affect the estimate of the grand mean?

18 0.75189203 1966 andrew gelman stats-2013-08-03-Uncertainty in parameter estimates using multilevel models

19 0.75124973 1121 andrew gelman stats-2012-01-15-R-squared for multilevel models

20 0.74691725 1900 andrew gelman stats-2013-06-15-Exploratory multilevel analysis when group-level variables are of importance


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(6, 0.013), (9, 0.018), (15, 0.024), (16, 0.075), (21, 0.035), (24, 0.097), (52, 0.159), (56, 0.018), (66, 0.072), (77, 0.024), (79, 0.021), (85, 0.011), (86, 0.064), (89, 0.046), (99, 0.233)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.92949271 223 andrew gelman stats-2010-08-21-Statoverflow

Introduction: Skirant Vadali writes: I am writing to seek your help in building a community driven Q&A website tentatively called ‘Statistics Analysis’. I am neither a founder of this website nor do I have any financial stake in its success. By way of background to this website, please see Stackoverflow (http://stackoverflow.com/) and Mathoverflow (http://mathoverflow.net/). Stackoverflow is a Q&A website targeted at software developers and is designed to help them ask questions and get answers from other developers. Mathoverflow is a Q&A website targeted at research mathematicians and is designed to help them ask and answer questions from other mathematicians across the world. The success of both these sites in helping their respective communities is a strong indicator that sites designed along these lines are very useful. The company that runs Stackoverflow (who also host Mathoverflow.net) has recently decided to develop other community driven websites for various other topic are

2 0.92410511 1686 andrew gelman stats-2013-01-21-Finite-population Anova calculations for models with interactions

Introduction: Jim Thomson writes: I wonder if you could provide some clarification on the correct way to calculate the finite-population standard deviations for interaction terms in your Bayesian approach to ANOVA (as explained in your 2005 paper, and Gelman and Hill 2007). I understand that it is the SD of the constrained batch coefficients that is of interest, but in most WinBUGS examples I have seen, the SDs are all calculated directly as sd.fin<-sd(beta.main[]) for main effects and sd(beta.int[,]) for interaction effects, where beta.main and beta.int are the unconstrained coefficients, e.g. beta.int[i,j]~dnorm(0,tau). For main effects, I can see that it makes no difference, since the constrained value is calculated by subtracting the mean, and sd(B[]) = sd(B[]-mean(B[])). But the conventional sum-to-zero constraint for interaction terms in linear models is more complicated than subtracting the mean (there are only (n1-1)*(n2-1) free coefficients for an interaction b/w factors with n1 a
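Thomson's point about interactions can be checked numerically (a hypothetical 3×4 batch of coefficients, not his model): the conventional sum-to-zero constraint for an interaction is double-centering, not simple mean subtraction, so the finite-population SD of the constrained batch generally differs from the SD of the unconstrained draws.

```python
# Sum-to-zero constraint for an interaction batch (hypothetical draws).
import numpy as np

rng = np.random.default_rng(1)
n1, n2 = 3, 4
beta = rng.normal(0, 1, (n1, n2))   # unconstrained interaction coefficients

# Double-centering: subtract row and column means, add back the grand mean.
constrained = (
    beta
    - beta.mean(axis=0)                 # column means
    - beta.mean(axis=1, keepdims=True)  # row means
    + beta.mean()                       # grand mean
)

# Rows and columns of the constrained batch now sum to zero...
print(np.allclose(constrained.sum(axis=0), 0))  # True
print(np.allclose(constrained.sum(axis=1), 0))  # True
# ...and its finite-population SD differs from sd() of the raw draws,
# unlike the main-effects case where mean subtraction leaves sd unchanged.
print(round(constrained.std(), 3), round(beta.std(), 3))
```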

3 0.91241914 485 andrew gelman stats-2010-12-25-Unlogging

Introduction: Catherine Bueker writes: I [Bueker] am analyzing the effect of various contextual factors on the voter turnout of naturalized Latino citizens. I have included the natural log of the number of Spanish Language ads run in each state during the election cycle to predict voter turnout. I now want to calculate the predicted probabilities of turnout for those in states with 0 ads, 500 ads, 1000 ads, etc. The problem is that I do not know how to handle the beta coefficient of the LN(Spanish language ads). Is there some way to “unlog” the coefficient? My reply: Calculate these probabilities for specific values of predictors, then graph the predictions of interest. Also, you can average over the other inputs in your model to get summaries. See this article with Pardoe for further discussion.
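A sketch of this advice (hypothetical coefficients, not Bueker's fitted model): rather than transforming the coefficient itself, plug specific ad counts into the model and report the predicted probabilities. Note that log(0) is undefined, so the zero-ads case needs separate handling (for example, a distinct baseline or a log(ads + 1) predictor).

```python
# "Unlogging" by prediction (hypothetical logistic-regression coefficients):
# logit(p) = a + b * log(ads).  Instead of back-transforming b, compute
# predicted probabilities at the ad counts of interest (ads > 0).
import math

a, b = -1.0, 0.3

def predicted_prob(ads):
    """Predicted turnout probability at a given number of ads."""
    logit = a + b * math.log(ads)
    return 1.0 / (1.0 + math.exp(-logit))

for ads in (1, 500, 1000):
    print(ads, round(predicted_prob(ads), 3))
```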

4 0.90794629 1957 andrew gelman stats-2013-07-26-“The Inside Story Of The Harvard Dissertation That Became Too Racist For Heritage”

Introduction: Mark Palko points me to a news article by Zack Beauchamp on Jason Richwine, the recent Ph.D. graduate from Harvard’s policy school who left the conservative Heritage Foundation after it came out that his Ph.D. thesis was said to be all about the low IQ’s of Hispanic immigrants. Heritage and others apparently thought this association could discredit their anti-immigration-reform position. Richwine’s mentor Charles Murray was unhappy about the whole episode. Beauchamp’s article is worth reading in that it provides some interesting background, in particular by getting into the details of the Ph.D. review process. In a sense, Beauchamp is too harsh. Flawed Ph.D. theses get published all the time. I’d say that most Ph.D. theses I’ve seen are flawed: usually the plan is to get the papers into shape later, when submitting them to journals. If a student doesn’t go into academia, the thesis typically just sits there and is rarely followed up on. I don’t know the statistics o

same-blog 5 0.90689611 948 andrew gelman stats-2011-10-10-Combining data from many sources


6 0.88983744 1246 andrew gelman stats-2012-04-04-Data visualization panel at the New York Public Library this evening!

7 0.8897121 1301 andrew gelman stats-2012-05-05-Related to z-statistics

8 0.88906193 889 andrew gelman stats-2011-09-04-The acupuncture paradox

9 0.88384289 1020 andrew gelman stats-2011-11-20-No no no no no

10 0.88360691 104 andrew gelman stats-2010-06-22-Seeking balance

11 0.87361938 1531 andrew gelman stats-2012-10-12-Elderpedia

12 0.87334788 786 andrew gelman stats-2011-07-04-Questions about quantum computing

13 0.87065196 200 andrew gelman stats-2010-08-11-Separating national and state swings in voting and public opinion, or, How I avoided blogorific embarrassment: An agony in four acts

14 0.86243099 1369 andrew gelman stats-2012-06-06-Your conclusion is only as good as your data

15 0.85855031 914 andrew gelman stats-2011-09-16-meta-infographic

16 0.8557117 1322 andrew gelman stats-2012-05-15-Question 5 of my final exam for Design and Analysis of Sample Surveys

17 0.83381629 1588 andrew gelman stats-2012-11-23-No one knows what it’s like to be the bad man

18 0.83282435 546 andrew gelman stats-2011-01-31-Infovis vs. statistical graphics: My talk tomorrow (Tues) 1pm at Columbia

19 0.83254206 2057 andrew gelman stats-2013-10-10-Chris Chabris is irritated by Malcolm Gladwell

20 0.8310436 82 andrew gelman stats-2010-06-12-UnConMax – uncertainty consideration maxims 7 +-- 2