andrew_gelman_stats andrew_gelman_stats-2010 andrew_gelman_stats-2010-136 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: David Shor writes: I’m dealing with a situation where I have two datasets: one that assigns each participant a discrete score out of five for a set of particular traits (dog behavior characteristics by breed), and another, from an independent source, that ranks each breed on each characteristic. It’s also possible to obtain the results of a survey in which experts were asked to rank 7 randomly picked breeds by characteristics. I’m interested in obtaining estimates for each trait, and intuitively it seems clear that the second and third datasets provide a lot of information. But it’s unclear how to incorporate them to infer latent variables, since only sample ranks are observed. This seems like a common problem; do you have any suggestions? My quick answer is that you can treat ranks as numbers (a point we make somewhere in Bayesian Data Analysis, I believe) and just fit an item-response model from there. Val Johnson wrote an article on this in JASA a few years ago, “Bayesian analysis of rank data with application to primate intelligence experiments.” He also did similar work calibrating college grades.
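A minimal sketch of the “treat ranks as numbers” idea, under my own assumptions (the breed names, the normal-quantile mapping of ranks, and the combine-by-averaging step are illustrative choices, not the post’s prescription; a full item-response model would estimate the latent traits jointly rather than averaging): convert each ranking to approximate latent-scale scores via normal quantiles, standardize each data source, and average whatever sources cover each breed.

```python
from statistics import NormalDist, mean

def ranks_to_scores(ranking):
    """Map a ranking (list ordered best to worst) to latent-scale scores:
    rank i of n gets the normal quantile of the midpoint plotting
    position (i + 0.5) / n, so the best-ranked item scores highest."""
    n = len(ranking)
    nd = NormalDist()
    return {item: nd.inv_cdf(1 - (i + 0.5) / n) for i, item in enumerate(ranking)}

def standardize(scores):
    """Center and scale a dict of numeric scores to mean 0, sd 1,
    so discrete 1-5 scores and rank-derived scores are comparable."""
    vals = list(scores.values())
    m = mean(vals)
    s = (sum((v - m) ** 2 for v in vals) / len(vals)) ** 0.5
    return {k: (v - m) / s for k, v in scores.items()}

def combine(*sources):
    """Average standardized scores across sources; breeds missing from
    a source (e.g. a 7-breed expert ranking) are simply skipped there."""
    pooled = {}
    for src in sources:
        for item, v in standardize(src).items():
            pooled.setdefault(item, []).append(v)
    return {item: mean(vs) for item, vs in pooled.items()}

# Hypothetical data in the shape Shor describes: discrete 1-5 scores
# from one source, a full ranking from another.
discrete_scores = {"collie": 5, "beagle": 3, "pug": 1}
full_ranking = ["collie", "beagle", "pug"]
estimates = combine(discrete_scores, ranks_to_scores(full_ranking))
```

This pooling step is only a stand-in for the model-based answer in the post: an item-response model would treat each observed rank or score as arising from noisy latent trait values and infer those latents with proper uncertainty.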