
324 andrew gelman stats-2010-10-07-Contest for developing an R package recommendation system

meta info for this blog

Source: html

Introduction: After I spoke tonight at the NYC R meetup, John Myles White and Drew Conway told me about this competition they’re administering for developing a recommendation system for R packages. They seem to have already done some work laying out the network of R packages–which packages refer to which others, and so forth. I just hope they set up their system so that my own packages (“R2WinBUGS”, “r2jags”, “arm”, and “mi”) get recommended automatically. I really hate to think that there are people out there running regressions in R and not using display() and coefplot() to look at the output.

P.S. Ajay Shah asks what I mean by that last sentence. My quick answer is that it’s good to be able to visualize the coefficients and the uncertainty about them. The default options of print(), summary(), and plot() in R don’t do that:

- print() doesn’t give enough information
- summary() gives everything to a zillion decimal places and gives useless things like p-values
- plot() gives a bunch

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 After I spoke tonight at the NYC R meetup, John Myles White and Drew Conway told me about this competition they’re administering for developing a recommendation system for R packages. [sent-1, score-0.76]

2 They seem to have already done some work laying out the network of R packages–which packages refer to which others, and so forth. [sent-2, score-0.652]

3 I just hope they set up their system so that my own packages (“R2WinBUGS”, “r2jags”, “arm”, and “mi”) get recommended automatically. [sent-3, score-0.535]

4 I really hate to think that there are people out there running regressions in R and not using display() and coefplot() to look at the output. [sent-4, score-0.256]

5 Ajay Shah asks what I mean by that last sentence. [sent-7, score-0.082]

6 My quick answer is that it’s good to be able to visualize the coefficients and the uncertainty about them. [sent-8, score-0.361]

7 I like display() because it gives the useful information that’s in summary() but without the crap. [sent-10, score-0.427]

8 I like coefplot() too, but it still needs a bit of work to be generally useful. [sent-11, score-0.154]

9 And I’d also like to have a new function that automatically plots the data and fitted lines. [sent-12, score-0.619]

similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('packages', 0.313), ('coefplot', 0.294), ('gives', 0.265), ('print', 0.219), ('summary', 0.218), ('fitted', 0.189), ('plots', 0.187), ('display', 0.171), ('administering', 0.169), ('plot', 0.162), ('conway', 0.152), ('laying', 0.147), ('meetup', 0.139), ('mi', 0.136), ('decimal', 0.136), ('diagnostic', 0.133), ('visualize', 0.133), ('residual', 0.128), ('system', 0.128), ('drew', 0.123), ('zillion', 0.121), ('spoke', 0.109), ('nyc', 0.108), ('useless', 0.103), ('options', 0.101), ('competition', 0.1), ('arm', 0.099), ('network', 0.096), ('refer', 0.096), ('automatically', 0.096), ('regressions', 0.094), ('recommended', 0.094), ('recommendation', 0.093), ('developing', 0.093), ('information', 0.089), ('hate', 0.088), ('default', 0.088), ('coefficients', 0.085), ('white', 0.083), ('asks', 0.082), ('needs', 0.081), ('places', 0.08), ('lines', 0.077), ('running', 0.074), ('function', 0.074), ('uncertainty', 0.074), ('like', 0.073), ('quick', 0.069), ('told', 0.068), ('bunch', 0.067)]
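As a rough illustration of how per-word weights like those above could be produced, here is a minimal sketch. It assumes raw term frequency times log(N / df) with L2 normalization; the page does not say which tf-idf variant was actually used, and the function name `tfidf_top_words` and the three toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_top_words(docs, doc_index, top_n=5):
    """Return the top_n (word, weight) pairs for docs[doc_index].

    Assumed weighting: raw term frequency times log(N / df), with the
    resulting vector L2-normalized. This is an illustrative choice, not
    necessarily the variant used to build the list above.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    # Term frequency within the document of interest.
    tf = Counter(tokenized[doc_index])
    weights = {w: count * math.log(n_docs / df[w]) for w, count in tf.items()}
    # L2-normalize; fall back to 1.0 if every weight is zero.
    norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
    # Sort by descending weight, breaking ties alphabetically.
    ranked = sorted(((w, v / norm) for w, v in weights.items()),
                    key=lambda pair: (-pair[1], pair[0]))
    return ranked[:top_n]

# Hypothetical toy corpus, not the real blog posts.
docs = [
    "packages coefplot display packages",
    "plots lines plots graphs",
    "packages rcpp cran",
]
print(tfidf_top_words(docs, 0, top_n=3))
```

In this toy corpus "packages" occurs twice in the first document but also appears in another document, so the rarer words "coefplot" and "display" end up with higher weights.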

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000001 324 andrew gelman stats-2010-10-07-Contest for developing an R package recommendation system

2 0.18184896 252 andrew gelman stats-2010-09-02-R needs a good function to make line plots

Introduction: More and more I’m thinking that line plots are great. More specifically, two-way grids of line plots on common scales, with one, two, or three lines per plot (enough to show comparisons but not so many that you can’t tell the lines apart). Also dot plots, of the sort that have been masterfully used by Lax and Phillips to show comparisons and trends in support for gay rights. There’s a big step missing, though, and that is to be able to make these graphs as a default. We have to figure out the right way to structure the data so these graphs come naturally. Then when it’s all working, we can talk the Excel people into implementing our ideas. I’m not asking to be paid here; all our ideas are in the public domain and I’m happy for Microsoft or Google or whoever to copy us. P.S. Drew Conway writes: This could be accomplished with ggplot2 using various combinations of the grammar. If I am understanding what you mean by line plots, here are some examples with code. In fact,

3 0.18147771 772 andrew gelman stats-2011-06-17-Graphical tools for understanding multilevel models

Introduction: There are a few things I want to do:

1. Understand a fitted model using tools such as average predictive comparisons, R-squared, and partial pooling factors. In defining these concepts, Iain and I came up with some clever tricks, including (but not limited to):

- Separating the inputs and averaging over all possible values of the input not being altered (for average predictive comparisons);
- Defining partial pooling without referring to a raw-data or maximum-likelihood or no-pooling estimate (these don’t necessarily exist when you’re fitting logistic regression with sparse data);
- Defining an R-squared for each level of a multilevel model.

The methods get pretty complicated, though, and they have some loose ends–in particular, for average predictive comparisons with continuous input variables. So now we want to implement these in R and put them into arm along with bglmer etc.

2. Setting up coefplot so it works more generally (that is, so the graphics look nice

4 0.13288286 1736 andrew gelman stats-2013-02-24-Rcpp class in Sat 9 Mar in NYC

Introduction: Join Dirk Eddelbuettel for six hours of detailed and hands-on instructions and discussions around Rcpp, RInside, RcppArmadillo, RcppGSL and other packages . . . Rcpp has become the most widely-used language extension for R. Currently deployed by 103 CRAN packages and a further 10 BioConductor packages, it permits users and developers to pass “whole R objects” with ease between R and C++ . . . Morning session: “A Hands-on Introduction to R and C++” . . . Afternoon session: “Advanced R and C++ Topics” . . .

5 0.12480221 1131 andrew gelman stats-2012-01-20-Stan: A (Bayesian) Directed Graphical Model Compiler

Introduction: Here’s Bob’s talk from the NYC machine learning meetup. And here’s Stan himself:

6 0.12210771 929 andrew gelman stats-2011-09-27-Visual diagnostics for discrete-data regressions

7 0.12189828 878 andrew gelman stats-2011-08-29-Infovis, infographics, and data visualization: Where I’m coming from, and where I’d like to go

8 0.10918825 1403 andrew gelman stats-2012-07-02-Moving beyond hopeless graphics

9 0.10305014 1452 andrew gelman stats-2012-08-09-Visually weighting regression displays

10 0.10009903 855 andrew gelman stats-2011-08-16-Infovis and statgraphics update update

11 0.099950671 1609 andrew gelman stats-2012-12-06-Stephen Kosslyn’s principles of graphics and one more: There’s no need to cram everything into a single plot

12 0.099712268 1134 andrew gelman stats-2012-01-21-Lessons learned from a recent R package submission

13 0.099387631 328 andrew gelman stats-2010-10-08-Displaying a fitted multilevel model

14 0.097490869 1450 andrew gelman stats-2012-08-08-My upcoming talk for the data visualization meetup

15 0.096397199 574 andrew gelman stats-2011-02-14-“The best data visualizations should stand on their own”? I don’t think so.

16 0.094504789 61 andrew gelman stats-2010-05-31-A data visualization manifesto

17 0.091261201 1090 andrew gelman stats-2011-12-28-“. . . extending for dozens of pages”

18 0.091174655 144 andrew gelman stats-2010-07-13-Hey! Here’s a referee report for you!

19 0.09068387 1684 andrew gelman stats-2013-01-20-Ugly ugly ugly

20 0.089112699 1753 andrew gelman stats-2013-03-06-Stan 1.2.0 and RStan 1.2.0

similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.136), (1, 0.008), (2, -0.013), (3, 0.063), (4, 0.116), (5, -0.071), (6, -0.018), (7, -0.02), (8, -0.014), (9, 0.01), (10, -0.003), (11, -0.002), (12, -0.024), (13, -0.021), (14, -0.013), (15, -0.013), (16, -0.0), (17, -0.019), (18, 0.012), (19, -0.016), (20, 0.003), (21, 0.024), (22, 0.005), (23, 0.006), (24, -0.01), (25, -0.015), (26, 0.034), (27, -0.026), (28, 0.017), (29, 0.032), (30, 0.047), (31, 0.001), (32, 0.027), (33, -0.02), (34, 0.03), (35, -0.064), (36, 0.002), (37, 0.021), (38, -0.002), (39, -0.047), (40, 0.027), (41, -0.004), (42, -0.032), (43, 0.023), (44, -0.016), (45, -0.004), (46, -0.027), (47, 0.011), (48, 0.035), (49, -0.011)]
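The simValue scores in the similarity lists are presumably some similarity measure between topic-weight vectors like the one above. A minimal sketch, assuming plain cosine similarity over sparse (topicId, weight) pairs: the page does not state the actual measure, and the same-blog scores below 1 suggest the real pipeline differs in some detail. The `other_post` vector is hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors given as (topicId, weight) lists."""
    da, db = dict(a), dict(b)
    # Only topics present in both vectors contribute to the dot product.
    dot = sum(w * db.get(t, 0.0) for t, w in da.items())
    norm_a = math.sqrt(sum(w * w for w in da.values()))
    norm_b = math.sqrt(sum(w * w for w in db.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty or all-zero vectors
    return dot / (norm_a * norm_b)

# First five lsi weights for this post, copied from the list above.
this_post = [(0, 0.136), (1, 0.008), (2, -0.013), (3, 0.063), (4, 0.116)]
# Hypothetical weights for some other post, for illustration only.
other_post = [(0, 0.120), (3, 0.050), (4, 0.090), (5, -0.030)]

print(cosine_similarity(this_post, other_post))
```

The sparse-pair representation matches how the lda vector further down is printed, where topics with negligible weight are simply omitted.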

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.95295787 324 andrew gelman stats-2010-10-07-Contest for developing an R package recommendation system

2 0.84443355 252 andrew gelman stats-2010-09-02-R needs a good function to make line plots

3 0.76590174 1609 andrew gelman stats-2012-12-06-Stephen Kosslyn’s principles of graphics and one more: There’s no need to cram everything into a single plot

Introduction: Jerzy Wieczorek has an interesting review of the book Graph Design for the Eye and Mind by psychology researcher Stephen Kosslyn. I recommend you read all of Wieczorek’s review (and maybe Kosslyn’s book, but that I haven’t seen), but here I’ll just focus on one point. Here’s Wieczorek summarizing Kosslyn: p. 18-19: the horizontal axis should be for the variable with the “most important part of the data.” See Kosslyn’s Figure 1.6 and 1.7 below. Figure 1.6 clearly shows that one of the sex-by-income groups reacts to age differently than the other three groups do. Figure 1.7 uses sex as the x-axis variable, making it much harder to see this same effect in the data. As a statistician exploring the data, I might make several plots using different groupings… but for communicating my results to an audience, I would choose the one plot that shows the findings most clearly. Those who know me well (or who have read the title of this post) will guess my reaction, whic

4 0.76296747 1470 andrew gelman stats-2012-08-26-Graphs showing regression uncertainty: the code!

Introduction: After our discussion of visual displays of regression uncertainty, I asked Solomon Hsiang and Lucas Leeman to send me their code. Both of them replied. Solomon wrote: The matlab and stata functions I wrote, as well as the script that replicates my figures, are all posted on my website. Also, I just added options to the main matlab function (vwregress.m) to make it display the spaghetti plot (similar to what Lucas did, but a simple bootstrap) and the shaded CI that you suggested (see figs below). They’re good suggestions. Personally, I [Hsiang] like the shaded CI better, since I think that all the visual activity in the spaghetti plot is a little distracting and sometimes adds visual weight in places where I wouldn’t want it. But the option is there in case people like it. Solomon then followed up with: I just thought of this small adjustment to your filled CI idea that seems neat. Cartographers like map projections that conserve area. We can do som

5 0.75850552 61 andrew gelman stats-2010-05-31-A data visualization manifesto

Introduction: Details matter (at least, they do for me), but we don’t yet have a systematic way of going back and forth between the structure of a graph, its details, and the underlying questions that motivate our visualizations. (Cleveland, Wilkinson, and others have written a bit on how to formalize these connections, and I’ve thought about it too, but we have a ways to go.) I was thinking about this difficulty after reading an article on graphics by some computer scientists that was well-written but to me lacked a feeling for the linkages between substantive/statistical goals and graphical details. I have problems with these issues too, and my point here is not to criticize but to move the discussion forward. When thinking about visualization, how important are the details? Aleks pointed me to this article by Jeffrey Heer, Michael Bostock, and Vadim Ogievetsky, “A Tour through the Visualization Zoo: A survey of powerful visualization techniques, from the obvious to the obscure.” Th

6 0.74931931 672 andrew gelman stats-2011-04-20-The R code for those time-use graphs

7 0.74232304 1452 andrew gelman stats-2012-08-09-Visually weighting regression displays

8 0.73953766 372 andrew gelman stats-2010-10-27-A use for tables (really)

9 0.73858213 736 andrew gelman stats-2011-05-29-Response to “Why Tables Are Really Much Better Than Graphs”

10 0.73822713 1661 andrew gelman stats-2013-01-08-Software is as software does

11 0.73122448 266 andrew gelman stats-2010-09-09-The future of R

12 0.72595185 1807 andrew gelman stats-2013-04-17-Data problems, coding errors…what can be done?

13 0.72558922 1764 andrew gelman stats-2013-03-15-How do I make my graphs?

14 0.72140861 1154 andrew gelman stats-2012-02-04-“Turn a Boring Bar Graph into a 3D Masterpiece”

15 0.7202149 37 andrew gelman stats-2010-05-17-Is chartjunk really “more useful” than plain graphs? I don’t think so.

16 0.71928054 1684 andrew gelman stats-2013-01-20-Ugly ugly ugly

17 0.71184868 272 andrew gelman stats-2010-09-13-Ross Ihaka to R: Drop Dead

18 0.70827609 1403 andrew gelman stats-2012-07-02-Moving beyond hopeless graphics

19 0.70717221 2319 andrew gelman stats-2014-05-05-Can we make better graphs of global temperature history?

20 0.70406789 1116 andrew gelman stats-2012-01-13-Infographic on the economy

similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(0, 0.024), (9, 0.012), (16, 0.085), (24, 0.15), (31, 0.012), (55, 0.036), (57, 0.047), (58, 0.019), (65, 0.026), (79, 0.029), (86, 0.082), (88, 0.015), (90, 0.016), (94, 0.053), (99, 0.295)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.96558177 324 andrew gelman stats-2010-10-07-Contest for developing an R package recommendation system

2 0.96311295 582 andrew gelman stats-2011-02-20-Statisticians vs. everybody else

Introduction: Statisticians are literalists. When someone says that the U.K. boundary commission’s delay in redistricting gave the Tories an advantage equivalent to 10 percent of the vote, we’re the kind of person who looks it up and claims that the effect is less than 0.7 percent. When someone says, “Since 1968, with the single exception of the election of George W. Bush in 2000, Americans have chosen Republican presidents in times of perceived danger and Democrats in times of relative calm,” we’re like, Hey, really? And we go look that one up too. And when someone says that engineers have more sons and nurses have more daughters . . . well, let’s not go there. So, when I was pointed to this blog by Michael O’Hare making the following claim, in the context of K-12 education in the United States: My [O'Hare's] favorite examples of this junk [educational content with no workplace value] are spelling and pencil-and-paper algorithm arithmetic. These are absolutely critical for a clerk

3 0.95636189 781 andrew gelman stats-2011-06-28-The holes in my philosophy of Bayesian data analysis

Introduction: I’ve been writing a lot about my philosophy of Bayesian statistics and how it fits into Popper’s ideas about falsification and Kuhn’s ideas about scientific revolutions. Here’s my long, somewhat technical paper with Cosma Shalizi. Here’s our shorter overview for the volume on the philosophy of social science. Here’s my latest try (for an online symposium), focusing on the key issues. I’m pretty happy with my approach–the familiar idea that Bayesian data analysis iterates the three steps of model building, inference, and model checking–but it does have some unresolved (maybe unresolvable) problems. Here are a couple mentioned in the third of the above links. Consider a simple model with independent data y_1, y_2, ..., y_10 ~ N(θ,σ^2), with a prior distribution θ ~ N(0,10^2) and σ known and taking on some value of approximately 10. Inference about θ is straightforward, as is model checking, whether based on graphs or numerical summaries such as the sample variance and skewn

4 0.95517552 1971 andrew gelman stats-2013-08-07-I doubt they cheated

Introduction: Following up on my regression-discontinuity post from the other day, Brad DeLong writes: The feel (and I could well be wrong) is that at some point somebody said: “This is very important, but it won’t get published without a statistically significant headline finding. Torture the data via specification search until we find a statistically significant effect so that this can get published!” I think DeLong is mistaken here. But, before getting to this, here’s the graph: and here are the regression results: So, indeed it is that cubic term that takes the result into statistical significance. The reason I disagree with DeLong is that it’s my impression that, in econometrics and applied economics, it’s considered the safe, conservative choice in regression discontinuity to control for a high-degree polynomial. See the paper discussed a few years ago here, for example, where I criticized a pair of economists for using a fifth-degree specification and they replie

5 0.95471478 2182 andrew gelman stats-2014-01-22-Spell-checking example demonstrates key aspects of Bayesian data analysis

Introduction: One of the new examples for the third edition of Bayesian Data Analysis is a spell-checking story. Here it is (just start at 2/3 down on the first page, with “Spelling correction”). I like this example—it demonstrates the Bayesian algebra, also gives a sense of the way that probability models (both “likelihood” and “prior”) are constructed from existing assumptions and data. The models aren’t just specified as a mathematical exercise, they represent some statement about reality. And the problem is close enough to our experience that we can consider ways in which the model can be criticized and improved, all in a simple example that has only three possibilities.

6 0.95322549 1278 andrew gelman stats-2012-04-23-“Any old map will do” meets “God is in every leaf of every tree”

7 0.95312524 187 andrew gelman stats-2010-08-05-Update on state size and governors’ popularity

8 0.95291078 2058 andrew gelman stats-2013-10-11-Gladwell and Chabris, David and Goliath, and science writing as stone soup

9 0.95287174 1746 andrew gelman stats-2013-03-02-Fishing for cherries

10 0.95269173 1036 andrew gelman stats-2011-11-30-Stan uses Nuts!

11 0.95257503 2161 andrew gelman stats-2014-01-07-My recent debugging experience

12 0.95147491 35 andrew gelman stats-2010-05-16-Another update on the spam email study

13 0.95122045 1760 andrew gelman stats-2013-03-12-Misunderstanding the p-value

14 0.95089859 1983 andrew gelman stats-2013-08-15-More on AIC, WAIC, etc

15 0.95065725 1211 andrew gelman stats-2012-03-13-A personal bit of spam, just for me!

16 0.95053279 1266 andrew gelman stats-2012-04-16-Another day, another plagiarist

17 0.95013297 615 andrew gelman stats-2011-03-16-Chess vs. checkers

18 0.94981408 1117 andrew gelman stats-2012-01-13-What are the important issues in ethics and statistics? I’m looking for your input!

19 0.9497968 2055 andrew gelman stats-2013-10-08-A Bayesian approach for peer-review panels? and a speculation about Bruno Frey

20 0.94944167 1980 andrew gelman stats-2013-08-13-Test scores and grades predict job performance (but maybe not at Google)