
685 andrew gelman stats-2011-04-29-Data mining and allergies

meta info for this blog

Source: html

Introduction: With all this data floating around, there are some interesting analyses one can do. I came across “The Association of Tree Pollen Concentration Peaks and Allergy Medication Sales in New York City: 2003-2008” by Perry Sheffield. There they correlate pollen counts with anti-allergy medicine sales – and indeed find that two days after high pollen counts, the medicine sales are the highest. Of course, it would be interesting to play with the data to see *what* tree is actually causing the sales to increase the most. Perhaps this would help the arborists decide what trees to plant. At the moment they seem to be following a rather sexist approach to tree planting: Ogren says the city could solve the problem by planting only female trees, which don’t produce pollen like male trees do. City arborists shy away from females because many produce messy – or in the case of ginkgos, smelly – fruit that litters sidewalks. In Ogren’s opinion, that’s a mistake. He says the females only pro

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 With all this data floating around, there are some interesting analyses one can do. [sent-1, score-0.203]

2 I came across “The Association of Tree Pollen Concentration Peaks and Allergy Medication Sales in New York City: 2003-2008″ by Perry Sheffield . [sent-2, score-0.07]

3 There they correlate pollen counts with anti-allergy medicine sales – and indeed find that two days after high pollen counts, the medicine sales are the highest. [sent-3, score-2.252]

4 Of course, it would be interesting to play with the data to see *what* tree is actually causing the sales to increase the most. [sent-4, score-0.712]

5 Perhaps this would help the arborists what trees to plant. [sent-5, score-0.501]

6 At the moment they seem to be following a rather sexist approach to tree planting: Ogren says the city could solve the problem by planting only female trees, which don’t produce pollen like male trees do. [sent-6, score-2.102]

7 City arborists shy away from females because many produce messy – or in the case of ginkgos, smelly – fruit that litters sidewalks. [sent-7, score-1.059]

8 He says the females only produce fruit because they are pollinated by the males. [sent-9, score-0.699]

similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('pollen', 0.578), ('fruit', 0.275), ('sales', 0.264), ('trees', 0.236), ('arborists', 0.231), ('ogren', 0.231), ('planting', 0.231), ('tree', 0.2), ('females', 0.173), ('produce', 0.173), ('city', 0.169), ('counts', 0.13), ('medicine', 0.119), ('sexist', 0.105), ('allergies', 0.105), ('peaks', 0.095), ('correlate', 0.089), ('perry', 0.089), ('shy', 0.089), ('concentration', 0.085), ('males', 0.081), ('messy', 0.079), ('says', 0.078), ('floating', 0.074), ('causing', 0.07), ('male', 0.069), ('female', 0.066), ('moment', 0.057), ('interesting', 0.054), ('solve', 0.051), ('association', 0.051), ('play', 0.05), ('days', 0.046), ('analyses', 0.046), ('increase', 0.045), ('york', 0.044), ('opinion', 0.043), ('away', 0.039), ('across', 0.037), ('theory', 0.036), ('help', 0.034), ('indeed', 0.034), ('approach', 0.034), ('came', 0.033), ('high', 0.031), ('course', 0.03), ('around', 0.03), ('data', 0.029), ('following', 0.028), ('seem', 0.027)]
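
The sentence scores in the summary above look consistent with a simple rule: sum the tfidf weight of every token in the sentence. The sketch below is an assumption about how the scores were produced, not the tool's documented method. It assumes whitespace tokenization and case-sensitive exact matching, so a token with attached punctuation such as "around," gets no weight. The function name `sentence_score` is hypothetical; the weights are copied from the topN-words list.

```python
# Subset of the topN-words weights listed above (only the words needed here).
TFIDF = {
    "data": 0.029, "floating": 0.074, "around": 0.03,
    "interesting": 0.054, "analyses": 0.046,
    "help": 0.034, "arborists": 0.231, "trees": 0.236,
}

def sentence_score(sentence, weights):
    """Sum the tfidf weight of each whitespace-separated token.

    Assumption: tokens are matched exactly and case-sensitively, so
    'around,' (with the comma) does not match 'around'.
    """
    return sum(weights.get(token, 0.0) for token in sentence.split())

sent_1 = ("With all this data floating around, there are some "
          "interesting analyses one can do.")
sent_5 = "Perhaps this would help the arborists what trees to plant."

print(round(sentence_score(sent_1, TFIDF), 3))  # listed as 0.203 in the summary
print(round(sentence_score(sent_5, TFIDF), 3))  # listed as 0.501 in the summary
```

Under these assumptions the sketch reproduces the listed scores for sentences 1 and 5. A tokenizer that strips punctuation or lowercases words would give different totals.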

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000001 685 andrew gelman stats-2011-04-29-Data mining and allergies

2 0.098490648 1374 andrew gelman stats-2012-06-11-Convergence Monitoring for Non-Identifiable and Non-Parametric Models

Introduction: Becky Passonneau and colleagues at the Center for Computational Learning Systems (CCLS) at Columbia have been working on a project for ConEd (New York’s major electric utility) to rank structures based on vulnerability to secondary events (e.g., transformer explosions, cable meltdowns, electrical fires). They’ve been using the R implementation BayesTree of Chipman, George and McCulloch’s Bayesian Additive Regression Trees (BART). BART is a Bayesian non-parametric method that is non-identifiable in two ways. Firstly, it is an additive tree model with a fixed number of trees, the indexes of which aren’t identified (you get the same predictions in a model swapping the order of the trees). This is the same kind of non-identifiability you get with any mixture model (additive or interpolated) with an exchangeable prior on the mixture components. Secondly, the trees themselves have varying structure over samples in terms of number of nodes and their topology (depth, branching, etc

3 0.09767168 455 andrew gelman stats-2010-12-07-Some ideas on communicating risks to the general public

Introduction: Aleks points me to this research summary from Dan Goldstein. Good stuff. I’ve heard of a lot of this–I actually use some of it in my intro statistics course, when we show the students how they can express probability trees using frequencies–but it’s good to see it all in one place.

4 0.086983979 1614 andrew gelman stats-2012-12-09-The pretty picture is just the beginning of the data exploration. But the pretty picture is a great way to get started. Another example of how a puzzle can make a graph appealing

Introduction: Ben Hyde sends along this appealing image by Michael Paukner, which represents a nearly perfect distillation of “infographics”: Here are some of the comments on the linked page: Rather than redrawing the picture to make the lines more clear, I’d say: leave the graphic as is, and have a link to a set of statistical graphs that show where the different sorts of old trees are and what they look like. Let’s value the above image for its clean look and its clever Christmas-tree design, and once we have it, take advantage of viewers’ interest in the topic to show them more. P.S. See my comment below which I think further illuminates the appeal of this particular tree.

5 0.072149187 1057 andrew gelman stats-2011-12-14-Hey—I didn’t know that!

Introduction: From Wikipedia (via Jay Livingston ): Newsweek sells only about 40,000 newsstand copies compared with 1.5 million subscriptions. (Both figures are substantially lower than they were a decade ago.) The figures for Time are about double those of Newsweek, but the ratio of newsstand sales to subscriptions is about the same. I guess I’m not surprised that most of the sales are from subscriptions, but I’m surprised the fraction is so close to 100%.

6 0.068000548 1649 andrew gelman stats-2013-01-02-Back when 50 miles was a long way

7 0.060964048 413 andrew gelman stats-2010-11-14-Statistics of food consumption

8 0.049827762 1297 andrew gelman stats-2012-05-03-New New York data research organizations

9 0.048469447 1718 andrew gelman stats-2013-02-11-Toward a framework for automatic model building

10 0.046888091 289 andrew gelman stats-2010-09-21-“How segregated is your city?”: A story of why every graph, no matter how clear it seems to be, needs a caption to anchor the reader in some numbers

11 0.045507923 665 andrew gelman stats-2011-04-17-Yes, your wish shall be granted (in 25 years)

12 0.045444421 1744 andrew gelman stats-2013-03-01-Why big effects are more important than small effects

13 0.04404635 1942 andrew gelman stats-2013-07-17-“Stop and frisk” statistics

14 0.043447278 698 andrew gelman stats-2011-05-05-Shocking but not surprising

15 0.041378576 2159 andrew gelman stats-2014-01-04-“Dogs are sensitive to small variations of the Earth’s magnetic field”

16 0.040761158 1970 andrew gelman stats-2013-08-06-New words of 1917

17 0.038570486 2065 andrew gelman stats-2013-10-17-Cool dynamic demographic maps provide beautiful illustration of Chris Rock effect

18 0.037553106 1643 andrew gelman stats-2012-12-29-Sexism in science (as elsewhere)

19 0.03740776 1700 andrew gelman stats-2013-01-31-Snotty reviewers

20 0.037151914 634 andrew gelman stats-2011-03-29-A.I. is Whatever We Can’t Yet Automate
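
The simValue column is presumably a cosine similarity between tfidf vectors, although the report does not say so. Here is a minimal sketch under that assumption, with each post stored as a sparse word-to-weight dictionary. The weights for `this_post` come from the topN-words list; `other_post` and the function name `cosine` are hypothetical and only illustrate the calculation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)

# Top four weights of this post, copied from the list above.
this_post = {"pollen": 0.578, "fruit": 0.275, "sales": 0.264, "trees": 0.236}

# Hypothetical weights for another post; not taken from the report.
other_post = {"trees": 0.4, "bayesian": 0.5, "model": 0.3}

print(cosine(this_post, this_post))   # a post compared with itself
print(cosine(this_post, other_post))  # only 'trees' is shared
```

A self-similarity of 1.0000001, as listed for the same-blog row, is what one would expect if the original computation used single-precision floats.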

similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.055), (1, -0.009), (2, 0.003), (3, -0.008), (4, 0.013), (5, 0.002), (6, -0.001), (7, 0.017), (8, -0.005), (9, -0.004), (10, -0.014), (11, -0.004), (12, 0.011), (13, 0.002), (14, -0.005), (15, 0.014), (16, 0.017), (17, 0.014), (18, -0.003), (19, -0.018), (20, -0.014), (21, -0.005), (22, -0.01), (23, 0.001), (24, -0.008), (25, -0.002), (26, -0.011), (27, -0.006), (28, 0.022), (29, 0.003), (30, 0.003), (31, -0.016), (32, -0.011), (33, -0.021), (34, 0.019), (35, 0.006), (36, -0.002), (37, 0.001), (38, -0.004), (39, -0.005), (40, -0.011), (41, -0.003), (42, 0.003), (43, -0.012), (44, 0.01), (45, 0.011), (46, 0.019), (47, 0.012), (48, 0.004), (49, -0.004)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.92041391 685 andrew gelman stats-2011-04-29-Data mining and allergies

2 0.74243915 925 andrew gelman stats-2011-09-26-Ethnicity and Population Structure in Personal Naming Networks

Introduction: Aleks pointed me to this recent article by Pablo Mateos, Paul Longley, and David O’Sullivan on one of my favorite topics. The authors produced a potentially cool naming network of the city of Auckland New Zealand . I say “potentially cool” because I have such difficulty reading the article–I speak English, statistics, and a bit of political science and economics, but this one is written in heavy sociologese–that I can’t quite be sure what they’re doing. However, despite my (perhaps unfair) disdain for the particulars of their method, it’s probably good that they’re jumping in with this analysis. Others can take their data (and similar datasets from elsewhere) and do better. Ya gotta start somewhere, and the basic idea (to cluster first names that are associated with the same last names, and to cluster last names that are associated with the same first names) seems good. I have to admit, though, that I was amused by the following line, which, amazingly, led off the paper:

3 0.72419792 558 andrew gelman stats-2011-02-05-Fattening of the world and good use of the alpha channel

Introduction: In the spirit of Gapminder , Washington Post created an interactive scatterplot viewer that’s using alpha channel to tell apart overlapping fat dots better than sorting-by-circle-size Gapminder is using: Good news: the rate of fattening of the USA appears to be slowing down. Maybe because of high gas prices? But what’s happening with Oceania?

4 0.72074729 500 andrew gelman stats-2011-01-03-Bribing statistics

Introduction: I Paid a Bribe by Janaagraha, a Bangalore based not-for-profit, harnesses the collective energy of citizens and asks them to report on the nature, number, pattern, types, location, frequency and values of corruption activities. These reports would be used to argue for improving governance systems and procedures, tightening law enforcement and regulation and thereby reduce the scope for corruption. Here’s a presentation of data from the application: Transparency International could make something like this much more widely available around the world . While awareness is good, follow-up is even better. For example, it’s known that New York’s subway signal inspections were being falsified . Signal inspections are pretty serious stuff, as failures lead to disasters , such as the one in Washington. Nothing much happened after: the person responsible (making $163k a year) was merely reassigned .

5 0.71439427 1125 andrew gelman stats-2012-01-18-Beautiful Line Charts

Introduction: I stumbled across a chart that’s in my opinion the best way to express a comparison of quantities through time: It compares the new PC companies, such as Apple, to traditional PC companies like IBM and Compaq, but on the same scale. If you’d like to see how iPads and other novelties compare, see here . I’ve tried to use the same type of visualization in my old work on legal data visualization . It comes from a new market research firm Asymco that also produced a very clean income vs expenses visualization (click to enlarge): While the first figure is pure perfection, Tufte purists might find the second one too colorful. But to a busy person, color helps tell things apart: when I know that pink means interest, it takes a fraction of the second to assess the situation. We live in 2012, not in 1712 to have to think black and white. Finally, they have a few other interesting uses of interactive visualization, such as cellular-broadband infrastructure around

6 0.71369797 2065 andrew gelman stats-2013-10-17-Cool dynamic demographic maps provide beautiful illustration of Chris Rock effect

7 0.68996978 358 andrew gelman stats-2010-10-20-When Kerry Met Sally: Politics and Perceptions in the Demand for Movies

8 0.6873107 2030 andrew gelman stats-2013-09-19-Is coffee a killer? I don’t think the effect is as high as was estimated from the highest number that came out of a noisy study

9 0.6871894 513 andrew gelman stats-2011-01-12-“Tied for Warmest Year On Record”

10 0.68283784 1942 andrew gelman stats-2013-07-17-“Stop and frisk” statistics

11 0.68120688 179 andrew gelman stats-2010-08-03-An Olympic size swimming pool full of lithium water

12 0.6781795 2308 andrew gelman stats-2014-04-27-White stripes and dead armadillos

13 0.67642677 1189 andrew gelman stats-2012-02-28-Those darn physicists

14 0.67273068 12 andrew gelman stats-2010-04-30-More on problems with surveys estimating deaths in war zones

15 0.67255533 2026 andrew gelman stats-2013-09-16-He’s adult entertainer, Child educator, King of the crossfader, He’s the greatest of the greater, He’s a big bad wolf in your neighborhood, Not bad meaning bad but bad meaning good

16 0.67018396 2022 andrew gelman stats-2013-09-13-You heard it here first: Intense exercise can suppress appetite

17 0.6695171 1172 andrew gelman stats-2012-02-17-Rare name analysis and wealth convergence

18 0.66586506 1649 andrew gelman stats-2013-01-02-Back when 50 miles was a long way

19 0.66293192 489 andrew gelman stats-2010-12-28-Brow inflation

20 0.65870762 68 andrew gelman stats-2010-06-03-…pretty soon you’re talking real money.

similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(13, 0.014), (16, 0.036), (20, 0.014), (24, 0.036), (41, 0.289), (43, 0.016), (45, 0.021), (53, 0.089), (63, 0.057), (85, 0.039), (86, 0.011), (98, 0.016), (99, 0.227)]
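
Unlike the lsi weights, the lda weights are all positive and list only some topic ids. A short sketch of how one might read them follows: it ranks the listed topics and adds up their weights. Interpreting the missing remainder as topics that fell below a reporting cutoff is an assumption; the report does not state one.

```python
# (topicId, topicWeight) pairs copied from the lda list above.
lda_topics = [
    (13, 0.014), (16, 0.036), (20, 0.014), (24, 0.036), (41, 0.289),
    (43, 0.016), (45, 0.021), (53, 0.089), (63, 0.057), (85, 0.039),
    (86, 0.011), (98, 0.016), (99, 0.227),
]

# Rank topics by weight, largest first.
ranked = sorted(lda_topics, key=lambda pair: pair[1], reverse=True)

# Total weight of the topics that are listed at all.
listed_mass = sum(weight for _, weight in lda_topics)

print(ranked[:2])             # the two dominant topics
print(round(listed_mass, 3))  # share of weight covered by the listed topics
```

Topics 41 and 99 together carry more than half of the total weight, so similarity to other posts under this model is probably driven mostly by those two topics.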

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.87470746 685 andrew gelman stats-2011-04-29-Data mining and allergies

2 0.81164294 303 andrew gelman stats-2010-09-28-“Genomics” vs. genetics

Introduction: John Cook and Joseph Delaney point to an article by Yurii Aulchenko et al., who write: 54 loci showing strong statistical evidence for association to human height were described, providing us with potential genomic means of human height prediction. In a population-based study of 5748 people, we find that a 54-loci genomic profile explained 4-6% of the sex- and age-adjusted height variance, and had limited ability to discriminate tall/short people. . . . In a family-based study of 550 people, with both parents having height measurements, we find that the Galtonian mid-parental prediction method explained 40% of the sex- and age-adjusted height variance, and showed high discriminative accuracy. . . . The message is that the simple approach of predicting child’s height using a regression model given parents’ average height performs much better than the method they have based on combining 54 genes. They also find that, if you start with the prediction based on parents’ heigh

3 0.78425092 1626 andrew gelman stats-2012-12-16-The lamest, grudgingest, non-retraction retraction ever

Introduction: In politics we’re familiar with the non-apology apology (well described in Wikipedia as “a statement that has the form of an apology but does not express the expected contrition”). Here’s the scientific equivalent: the non-retraction retraction. Sanjay Srivastava points to an amusing yet barfable story of a pair of researchers who (inadvertently, I assume) made a data coding error and were eventually moved to issue a correction notice, but even then refused to fully admit their error. As Srivastava puts it, the story “ended up with Lew [Goldberg] and colleagues [Kibeom Lee and Michael Ashton] publishing a comment on an erratum – the only time I’ve ever heard of that happening in a scientific journal.” From the comment on the erratum: In their “erratum and addendum,” Anderson and Ones (this issue) explained that we had brought their attention to the “potential” of a “possible” misalignment and described the results computed from re-aligned data as being based on a “post-ho

4 0.75773615 1013 andrew gelman stats-2011-11-16-My talk at Math for America on Saturday

Introduction: Here’s what I’ll talk about for 3 hours : Statistics—Inside and Outside the Classroom (1) Of Beauty, Sex, and Power: Statistical Challenges in the Estimation of Small Effects . A silly example of the frequencies of boy and girl babies leads us to some important research involving the meaning of statistical significance. (2) Mathematics, Statistics, and Political Science . We explore the differences between mathematical and statistical thinking, developing the ideas using examples from my own research in political science. (3) Statistics Teaching Activities . For twenty years I have been collecting class-participation demonstrations in statistics and probability. Here are some of my favorites.

5 0.7535001 1669 andrew gelman stats-2013-01-12-The power of the puzzlegraph

Introduction: The Organisation for Economic Co-operation and Development reports that the following project from Krisztina Szucs and Mate Cziner has won their visualization challenge, “launched in September 2012 to solicit visualisations based on the OECD’s data-rich Education at a Glance report”: (The graph is interactive. Click on the above image and click again to see the full version.) From the press release: Entries from around the world focused on data related to the economic costs and return on investment in education . . . [The winning entry] takes a detailed look at public vs. private and men vs. women for selected countries . . . The judges were particularly impressed by the angled slope format of the visualisation, which encourages comparison between the upper-secondary and tertiary benefits of education. Szucs and Cziner were also lauded for their striking visual design, which draws users into exploring their piece [emphasis added]. I used boldface to highlight a p

6 0.75256389 454 andrew gelman stats-2010-12-07-Diabetes stops at the state line?

7 0.74240065 516 andrew gelman stats-2011-01-14-A new idea for a science core course based entirely on computer simulation

8 0.74210083 2185 andrew gelman stats-2014-01-25-Xihong Lin on sparsity and density

9 0.73013538 1214 andrew gelman stats-2012-03-15-Of forecasts and graph theory and characterizing a statistical method by the information it uses

10 0.72515213 1300 andrew gelman stats-2012-05-05-Recently in the sister blog

11 0.71878624 2204 andrew gelman stats-2014-02-09-Keli Liu and Xiao-Li Meng on Simpson’s paradox

12 0.71281958 1895 andrew gelman stats-2013-06-12-Peter Thiel is writing another book!

13 0.70261872 1816 andrew gelman stats-2013-04-21-Exponential increase in the number of stat majors

14 0.70015657 1583 andrew gelman stats-2012-11-19-I can’t read this interview with me

15 0.69256467 2202 andrew gelman stats-2014-02-07-Outrage of the week

16 0.6783576 1297 andrew gelman stats-2012-05-03-New New York data research organizations

17 0.6734789 2226 andrew gelman stats-2014-02-26-Econometrics, political science, epidemiology, etc.: Don’t model the probability of a discrete outcome, model the underlying continuous variable

18 0.67313492 2144 andrew gelman stats-2013-12-23-I hate this stuff

19 0.67031354 2311 andrew gelman stats-2014-04-29-Bayesian Uncertainty Quantification for Differential Equations!

20 0.66785753 778 andrew gelman stats-2011-06-24-New ideas on DIC from Martyn Plummer and Sumio Watanabe