andrew gelman stats-2010-09-12-271: GLM – exposure
Source: html
Introduction: Bernard Phiri writes:

I am relatively new to GLM models; anyhow, I am currently using your book “Data Analysis Using Regression and Multilevel/Hierarchical Models” (pages 109-115). I am using a Poisson GLM to analyse an aerial census dataset of wild herbivores on a ranch in Kenya. In my analysis I have the following variables:

1. Outcome variable: count of wild herbivores sighted at a given location
2. Explanatory variable 1: vegetation type, i.e. the type of vegetation of the grid cell in which animals were sighted (the ranch is divided into 1x1 km grid cells)
3. Explanatory variable 2: animal species, e.g. eland, elephant, zebra, etc.
4. Exposure: proximity to water, i.e. distance (km) to the nearest water point

My questions are as follows:

1. Am I correct to include proximity to the water point as an offset? I notice that in the example in your book the offset is a count; does this matter?
2. By including proximity to water in the model as an exposure, am I correct to interpret this as “the chance of sighting an animal is the same for every kilometre away from the water point”?

My reply: I would use proximity to water as a predictor, not an offset, just as, in the wells example in chapter 5 of our book, we use proximity to the nearest well as a predictor.
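The distinction in the reply can be made concrete: on the log scale, an offset is simply a predictor whose coefficient is constrained to 1, which is why an offset suits a true exposure (area searched, time observed) but not a covariate such as distance to water. A minimal pure-Python sketch of this follows; the counts, distances, and coefficient values are hypothetical, invented for illustration (this is not the questioner's data or Gelman's code):

```python
import math

def poisson_loglik(counts, log_mu):
    """Poisson log-likelihood given per-observation log-means."""
    return sum(y * lm - math.exp(lm) - math.lgamma(y + 1)
               for y, lm in zip(counts, log_mu))

def log_mu_offset(b0, exposure):
    """Offset model: log E[y] = b0 + log(exposure); slope fixed at 1."""
    return [b0 + math.log(e) for e in exposure]

def log_mu_predictor(b0, b1, exposure):
    """Predictor model: log E[y] = b0 + b1*log(exposure); b1 is free."""
    return [b0 + b1 * math.log(e) for e in exposure]

# Hypothetical data: sighting counts and distance (km) to nearest water.
counts = [4, 2, 7, 1, 3]
dist_km = [0.5, 2.0, 0.3, 4.5, 1.2]

# An offset is exactly a log-scale predictor with its coefficient pinned at 1:
ll_offset = poisson_loglik(counts, log_mu_offset(0.8, dist_km))
ll_pred_b1_1 = poisson_loglik(counts, log_mu_predictor(0.8, 1.0, dist_km))
assert abs(ll_offset - ll_pred_b1_1) < 1e-9

# As a predictor, the slope is estimated; here sightings fall with distance,
# so a negative slope fits these (invented) data far better than the forced +1:
ll_pred_neg = poisson_loglik(counts, log_mu_predictor(0.8, -0.5, dist_km))
assert ll_pred_neg > ll_offset
```

The offset forces expected counts to scale proportionally with distance, which is the opposite of what one would expect for water-dependent herbivores; entering distance as an ordinary predictor lets the data choose the sign and size of the effect.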