andrew_gelman_stats andrew_gelman_stats-2011 andrew_gelman_stats-2011-569 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: At GetTheData, you can ask and answer data-related questions. Here’s a preview: I’m not sure a Q&A site is the best way to do this. My pipe dream is to create a taxonomy of variables and instances, and collect spreadsheets annotated this way. Imagine doing a search of type: “give me datasets, where an instance is a person, the variables are age, gender and weight” – and out would come datasets, each one tagged with the descriptions of the variables that were held constant for the whole dataset (person_type=student, location=Columbia, time_of_study=1/1/2009, study_type=longitudinal). It would even be possible to automatically convert one variable into another, if it was necessary (like age = time_of_measurement-time_of_birth). Maybe the dream of Semantic Web will actually be implemented for relatively structured statistical data rather than much fuzzier “knowledge”; just consider the difficulties of developing a universal Freebase. Wolfram|Alpha is perhaps currently clos
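The search described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not an existing tool: the `Dataset` class, the `DERIVATIONS` table, the `search` function, and the dataset names are all hypothetical, and the only details taken from the post are the example tags (person_type, location, time_of_study, study_type) and the rule age = time_of_measurement - time_of_birth.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Dataset:
    """A spreadsheet annotated with a (hypothetical) variable taxonomy."""
    name: str
    instance_type: str   # what one row represents, e.g. "person"
    variables: set       # names of the columns that vary across rows
    constants: dict = field(default_factory=dict)  # held fixed for the whole dataset


# Hypothetical derivation rules: target variable -> (required inputs, function).
DERIVATIONS = {
    "age": (
        {"time_of_measurement", "time_of_birth"},
        # Age in years, approximated as elapsed days / 365.25.
        lambda row: (row["time_of_measurement"] - row["time_of_birth"]).days / 365.25,
    ),
}


def available_variables(ds):
    """Variables stored in the dataset plus any derivable from them."""
    derived = {
        target
        for target, (inputs, _) in DERIVATIONS.items()
        if inputs <= ds.variables
    }
    return ds.variables | derived


def search(catalog, instance_type, wanted):
    """Datasets whose rows are `instance_type` and that can supply every wanted variable."""
    return [
        ds
        for ds in catalog
        if ds.instance_type == instance_type and wanted <= available_variables(ds)
    ]


# Made-up catalog entries, for illustration only.
catalog = [
    Dataset(
        "student_health", "person", {"age", "gender", "weight"},
        {"person_type": "student", "location": "Columbia",
         "time_of_study": "1/1/2009", "study_type": "longitudinal"},
    ),
    Dataset(
        "clinic_visits", "person",
        {"time_of_birth", "time_of_measurement", "gender", "weight"},
    ),
    Dataset("country_stats", "country", {"population", "gdp"}),
]

# "Give me datasets where an instance is a person and the variables are age, gender and weight."
matches = search(catalog, "person", {"age", "gender", "weight"})
match_names = [ds.name for ds in matches]

# Converting one variable into another for a single row.
row = {"time_of_birth": date(1990, 1, 1), "time_of_measurement": date(2009, 1, 1)}
age_years = DERIVATIONS["age"][1](row)
```

In this sketch the second dataset matches even though it stores no age column, because age can be derived from the two time variables it does have. A real taxonomy would also need agreed variable names and units, which is the hard part.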
sentIndex sentText sentNum sentScore
1 At GetTheData, you can ask and answer data-related questions. [sent-1, score-0.118]
2 Here’s a preview: I’m not sure a Q&A site is the best way to do this. [sent-2, score-0.172]
3 My pipe dream is to create a taxonomy of variables and instances, and collect spreadsheets annotated this way. [sent-3, score-1.347]
4 It would even be possible to automatically convert one variable into another, if it was necessary (like age = time_of_measurement-time_of_birth). [sent-5, score-0.379]
5 Maybe the dream of Semantic Web will actually be implemented for relatively structured statistical data rather than much fuzzier “knowledge”; just consider the difficulties of developing a universal Freebase. [sent-6, score-1.081]
6 I’ve talked about data tools before, as well as about Q&A sites. [sent-8, score-0.426]
wordName wordTfidf (topN-words)
[('variables', 0.253), ('dream', 0.242), ('datasets', 0.204), ('weight', 0.19), ('tagged', 0.173), ('annotated', 0.173), ('spreadsheets', 0.173), ('freebase', 0.163), ('taxonomy', 0.163), ('banana', 0.163), ('preview', 0.163), ('queries', 0.156), ('semantic', 0.156), ('upload', 0.156), ('age', 0.155), ('pipe', 0.146), ('descriptions', 0.129), ('wolfram', 0.129), ('convert', 0.126), ('closest', 0.126), ('sites', 0.124), ('longitudinal', 0.124), ('alpha', 0.121), ('data', 0.118), ('collect', 0.114), ('universal', 0.114), ('instances', 0.114), ('structured', 0.113), ('location', 0.109), ('implemented', 0.109), ('consider', 0.107), ('consumption', 0.106), ('constant', 0.105), ('gender', 0.103), ('automatically', 0.098), ('talked', 0.098), ('instance', 0.097), ('held', 0.097), ('developing', 0.095), ('relatively', 0.093), ('difficulties', 0.09), ('site', 0.09), ('dataset', 0.09), ('complicated', 0.087), ('countries', 0.086), ('tools', 0.086), ('create', 0.083), ('web', 0.083), ('sure', 0.082), ('comparing', 0.082)]
simIndex simValue blogId blogTitle
same-blog 1 1.0 569 andrew gelman stats-2011-02-12-Get the Data
2 0.11434186 1981 andrew gelman stats-2013-08-14-The robust beauty of improper linear models in decision making
Introduction: Andreas Graefe writes (see here, here, and here): The usual procedure for developing linear models to predict any kind of target variable is to identify a subset of most important predictors and to estimate weights that provide the best possible solution for a given sample. The resulting “optimally” weighted linear composite is then used when predicting new data. This approach is useful in situations with large and reliable datasets and few predictor variables. However, a large body of analytical and empirical evidence since the 1970s shows that the weighting of variables is of little, if any, value in situations with small and noisy datasets and a large number of predictor variables. In such situations, including all relevant variables is more important than their weighting. These findings have yet to impact many fields. This study uses data from nine established U.S. election-forecasting models whose forecasts are regularly published in academic journals to demonstrate the value o
3 0.097875148 192 andrew gelman stats-2010-08-08-Turning pages into data
Introduction: There is a lot of data on the web, meant to be looked at by people, but how do you turn it into a spreadsheet people could actually analyze statistically? The technique to turn web pages intended for people into structured data sets intended for computers is called “screen scraping.” It has just been made easier with a wiki/community http://scraperwiki.com/ . They provide libraries to extract information from PDF, Excel files, to automatically fill in forms and similar. Moreover, the community aspect of it should allow researchers doing similar things to get connected. It’s very good. Here’s an example of scraping road accident data or port of London ship arrivals. You can already find collections of structured data online, examples are Infochimps (“find the world’s data”), and Freebase (“An entity graph of people, places and things, built by a community that loves open data.”). There’s also a repository system for data, TheData (“An open-source application for pub
4 0.097789019 1905 andrew gelman stats-2013-06-18-There are no fat sprinters
Introduction: This post is by Phil. A little over three years ago I wrote a post about exercise and weight loss in which I described losing a fair amount of weight due to (I believe) an exercise regime, with no effort to change my diet; this contradicted the prediction of studies that had recently been released. The comment thread on that post is quite interesting: a lot of people had had similar experiences — losing weight, or keeping it off, with an exercise program that includes very short periods of exercise at maximal intensity — while other people expressed some skepticism about my claims. Some commenters said that I risked injury; others said it was too early to judge anything because my weight loss might not last. The people who predicted injury were right: running the curve during a 200m sprint a month or two after that post, I strained my Achilles tendon. Nothing really serious, but it did keep me off the track for a couple of months, and rather than go back to sprinting I switched t
5 0.097213849 1900 andrew gelman stats-2013-06-15-Exploratory multilevel analysis when group-level variables are of importance
Introduction: Steve Miller writes: Much of what I do is cross-national analyses of survey data (largely World Values Survey). . . . My big question pertains to (what I would call) exploratory analysis of multilevel data, especially when the group-level predictors are of theoretical importance. A lot of what I do involves analyzing cross-national survey items of citizen attitudes, typically of political leadership. These survey items are usually yes/no responses, or four-part responses indicating a level of agreement (strongly agree, agree, disagree, strongly disagree) that can be condensed into a binary variable. I believe these can be explained by reference to country-level factors. Much of the group-level variables of interest are count variables with a modal value of 0, which can be quite messy. How would you recommend exploring the variation in the dependent variable as it could be explained by the group-level count variable of interest, before fitting the multilevel model itself? When
6 0.090641357 1517 andrew gelman stats-2012-10-01-“On Inspiring Students and Being Human”
7 0.089916825 1447 andrew gelman stats-2012-08-07-Reproducible science FAIL (so far): What’s stoppin people from sharin data and code?
8 0.08991044 451 andrew gelman stats-2010-12-05-What do practitioners need to know about regression?
9 0.087496951 86 andrew gelman stats-2010-06-14-“Too much data”?
10 0.087431319 1198 andrew gelman stats-2012-03-05-A cloud with a silver lining
11 0.085886151 486 andrew gelman stats-2010-12-26-Age and happiness: The pattern isn’t as clear as you might think
12 0.085847355 735 andrew gelman stats-2011-05-28-New app for learning intro statistics
13 0.08527419 714 andrew gelman stats-2011-05-16-NYT Labs releases Openpaths, a utility for saving your iphone data
14 0.085067585 1248 andrew gelman stats-2012-04-06-17 groups, 6 group-level predictors: What to do?
15 0.084885702 1015 andrew gelman stats-2011-11-17-Good examples of lurking variables?
16 0.083706662 1121 andrew gelman stats-2012-01-15-R-squared for multilevel models
17 0.083529577 352 andrew gelman stats-2010-10-19-Analysis of survey data: Design based models vs. hierarchical modeling?
18 0.081897527 1788 andrew gelman stats-2013-04-04-When is there “hidden structure in data” to be discovered?
19 0.081888981 1681 andrew gelman stats-2013-01-19-Participate in a short survey about the weight of evidence provided by statistics
20 0.081869446 2041 andrew gelman stats-2013-09-27-Setting up Jitts online
topicId topicWeight
[(0, 0.136), (1, 0.014), (2, 0.013), (3, -0.027), (4, 0.081), (5, 0.03), (6, -0.02), (7, 0.004), (8, 0.016), (9, 0.041), (10, -0.002), (11, -0.008), (12, 0.019), (13, -0.01), (14, -0.009), (15, 0.04), (16, 0.043), (17, -0.034), (18, 0.007), (19, 0.001), (20, -0.008), (21, 0.023), (22, 0.0), (23, -0.017), (24, -0.003), (25, 0.016), (26, 0.047), (27, -0.046), (28, 0.014), (29, 0.042), (30, 0.037), (31, 0.005), (32, 0.019), (33, 0.045), (34, -0.025), (35, 0.031), (36, -0.01), (37, 0.032), (38, -0.066), (39, 0.016), (40, -0.01), (41, -0.047), (42, 0.04), (43, 0.045), (44, -0.033), (45, 0.04), (46, 0.046), (47, 0.006), (48, 0.013), (49, 0.002)]
simIndex simValue blogId blogTitle
same-blog 1 0.94334275 569 andrew gelman stats-2011-02-12-Get the Data
2 0.76832169 14 andrew gelman stats-2010-05-01-Imputing count data
Introduction: Guy asks: I am analyzing an original survey of farmers in Uganda. I am hoping to use a battery of welfare proxy variables to create a single welfare index using PCA. I have a quick question which I hope you can find time to address: How do you recommend treating count data? (for example # of rooms, # of chickens, # of cows, # of radios)? In my dataset these variables are highly skewed with many responses at zero (which makes taking the natural log problematic). In the case of # of cows or chickens several obs have values in the hundreds. My response: Here’s what we do in our mi package in R. We split a variable into two parts: an indicator for whether it is positive, and the positive part. That is, y = u*v. Then u is binary and can be modeled using logistic regression, and v can be modeled on the log scale. At the end you can round to the nearest integer if you want to avoid fractional values.
3 0.74406779 1218 andrew gelman stats-2012-03-18-Check your missing-data imputations using cross-validation
Introduction: Elena Grewal writes: I am currently using the iterative regression imputation model as implemented in the Stata ICE package. I am using data from a survey of about 90,000 students in 142 schools and my variable of interest is parent level of education. I want only this variable to be imputed with as little bias as possible as I am not using any other variable. So I scoured the survey for every variable I thought could possibly predict parent education. The main variable I found is parent occupation, which explains about 35% of the variance in parent education for the students with complete data on both. I then include the 20 other variables I found in the survey in a regression predicting parent education, which explains about 40% of the variance in parent education for students with complete data on all the variables. My question is this: many of the other variables I found have more missing values than the parent education variable, and also, although statistically significant
4 0.71644527 1121 andrew gelman stats-2012-01-15-R-squared for multilevel models
Introduction: Fred Schiff writes: I’m writing to you to ask about the “R-squared” approximation procedure you suggest in your 2004 book with Dr. Hill. [See also this paper with Pardoe---ed.] I’m a media sociologist at the University of Houston. I’ve been using HLM3 for about two years. Briefly about my data. It’s a content analysis of news stories with a continuous scale dependent variable, story prominence. I have 6090 news stories, 114 newspapers, and 59 newspaper group owners. All the Level-1, Level-2 and dependent variables have been standardized. Since the means were zero anyway, we left the variables uncentered. All the Level-3 ownership groups and characteristics are dichotomous scales that were left uncentered. PROBLEM: The single most important result I am looking for is to compare the strength of nine competing Level-1 variables in their ability to predict and explain the outcome variable, story prominence. We are trying to use the residuals to calculate a “R-squ
5 0.7051903 1330 andrew gelman stats-2012-05-19-Cross-validation to check missing-data imputation
Introduction: Aureliano Crameri writes: I have questions regarding one technique you and your colleagues described in your papers: the cross validation (Multiple Imputation with Diagnostics (mi) in R: Opening Windows into the Black Box, with reference to Gelman, King, and Liu, 1998). I think this is the technique I need for my purpose, but I am not sure I understand it right. I want to use the multiple imputation to estimate the outcome of psychotherapies based on longitudinal data. First I have to demonstrate that I am able to get unbiased estimates with the multiple imputation. The expected bias is the overestimation of the outcome of dropouts. I will test my imputation strategies by means of a series of simulations (delete values, impute, compare with the original). Due to the complexity of the statistical analyses I think I need at least 200 cases. Now I don’t have so many cases without any missings. My data have missing values in different variables. The proportion of missing values is
6 0.70125413 708 andrew gelman stats-2011-05-12-Improvement of 5 MPG: how many more auto deaths?
7 0.69406527 527 andrew gelman stats-2011-01-20-Cars vs. trucks
8 0.69252914 946 andrew gelman stats-2011-10-07-Analysis of Power Law of Participation
9 0.69093382 1940 andrew gelman stats-2013-07-16-A poll that throws away data???
11 0.67498672 1506 andrew gelman stats-2012-09-21-Building a regression model . . . with only 27 data points
12 0.67374504 2345 andrew gelman stats-2014-05-24-An interesting mosaic of a data programming course
13 0.6720196 1017 andrew gelman stats-2011-11-18-Lack of complete overlap
14 0.67150998 1509 andrew gelman stats-2012-09-24-Analyzing photon counts
15 0.66973668 799 andrew gelman stats-2011-07-13-Hypothesis testing with multiple imputations
16 0.66840583 192 andrew gelman stats-2010-08-08-Turning pages into data
17 0.66068071 86 andrew gelman stats-2010-06-14-“Too much data”?
18 0.65734369 1070 andrew gelman stats-2011-12-19-The scope for snooping
19 0.65486503 1434 andrew gelman stats-2012-07-29-FindTheData.org
20 0.65311289 118 andrew gelman stats-2010-06-30-Question & Answer Communities
topicId topicWeight
[(5, 0.015), (9, 0.036), (16, 0.037), (21, 0.015), (23, 0.013), (24, 0.153), (34, 0.039), (47, 0.012), (53, 0.05), (59, 0.018), (66, 0.01), (73, 0.013), (84, 0.045), (86, 0.035), (88, 0.187), (95, 0.01), (99, 0.222)]
simIndex simValue blogId blogTitle
1 0.93900585 1174 andrew gelman stats-2012-02-18-Not as ugly as you look
Introduction: Kaiser asks the interesting question: How do you measure what restaurants are “overrated”? You can’t just ask people, right? There’s some sort of social element here, that “overrated” implies that someone’s out there doing the rating.
same-blog 2 0.91146016 569 andrew gelman stats-2011-02-12-Get the Data
3 0.88661695 1098 andrew gelman stats-2012-01-04-Bayesian Page Rank?
Introduction: Loren Maxwell writes: I am trying to do some studies on the PageRank algorithm with applying a Bayesian technique. If you are not familiar with PageRank, it is the basis for how Google ranks their pages. It basically treats the internet as a large social network with each link conferring some value onto the page it links to. For example, if I had a webpage that had only one link to it, say from my friend’s webpage, then its PageRank would be dependent on my friend’s PageRank, presumably quite low. However, if the one link to my page was off the Google search page, then my PageRank would be quite high since there are undoubtedly millions of pages linking to Google and few pages that Google links to. The end result of the algorithm, however, is that all the PageRank values of the nodes in the network sum to one and the PageRank of a specific node is the probability that a “random surfer” will end up on that node. For example, in the attached spreadsheet, Column D shows e
4 0.87696886 1992 andrew gelman stats-2013-08-21-Workshop for Women in Machine Learning
Introduction: This might interest some of you: CALL FOR ABSTRACTS Workshop for Women in Machine Learning Co-located with NIPS 2013, Lake Tahoe, Nevada, USA December 5, 2013 http://www.wimlworkshop.org Deadline for abstract submissions: September 16, 2013 WORKSHOP DESCRIPTION The Workshop for Women in Machine Learning is a day-long event taking place on the first day of NIPS. The workshop aims to showcase the research of women in machine learning and to strengthen their community. The event brings together female faculty, graduate students, and research scientists for an opportunity to connect, exchange ideas, and learn from each other. Underrepresented minorities and undergraduates interested in pursuing machine learning research are encouraged to participate. While all presenters will be female, all genders are invited to attend. Scholarships will be provided to female students and postdoctoral attendees with accepted abstracts to partially offset travel costs. Workshop
5 0.87046009 290 andrew gelman stats-2010-09-22-Data Thief
Introduction: John Transue sends along a link to this software for extracting data from graphs. I haven’t tried it out but it could be useful to somebody out there?
6 0.86132121 629 andrew gelman stats-2011-03-26-Is it plausible that 1% of people pick a career based on their first name?
7 0.8587746 136 andrew gelman stats-2010-07-09-Using ranks as numbers
8 0.85426933 1403 andrew gelman stats-2012-07-02-Moving beyond hopeless graphics
9 0.85303652 603 andrew gelman stats-2011-03-07-Assumptions vs. conditions, part 2
10 0.84406424 400 andrew gelman stats-2010-11-08-Poli sci plagiarism update, and a note about the benefits of not caring
11 0.83940101 1507 andrew gelman stats-2012-09-22-Grade inflation: why weren’t the instructors all giving all A’s already??
12 0.83913642 784 andrew gelman stats-2011-07-01-Weighting and prediction in sample surveys
13 0.82869047 1087 andrew gelman stats-2011-12-27-“Keeping things unridiculous”: Berger, O’Hagan, and me on weakly informative priors
14 0.82815546 2351 andrew gelman stats-2014-05-28-Bayesian nonparametric weighted sampling inference
15 0.82718402 825 andrew gelman stats-2011-07-27-Grade inflation: why weren’t the instructors all giving all A’s already??
16 0.82410353 1414 andrew gelman stats-2012-07-12-Steven Pinker’s unconvincing debunking of group selection
18 0.81850684 2365 andrew gelman stats-2014-06-09-I hate polynomials
19 0.81633461 1930 andrew gelman stats-2013-07-09-Symposium Magazine
20 0.81403017 1866 andrew gelman stats-2013-05-21-Recently in the sister blog