946 andrew gelman stats-2011-10-07-Analysis of Power Law of Participation


Meta information for this blog post

Source: html

Introduction: Rick Wash writes: A colleague at USC (Lian Jian) and I were recently discussing a statistical analysis issue that both of us have run into recently. We both mostly do research about how people use online interactive websites. One property that most of these systems have is known as the “powerlaw of participation” — the distribution of the number of contributions from each person follows a powerlaw. This means that a few people contribute a TON and many, many people are in the “long tail” and contribute very rarely. For example, Facebook posts and twitter posts both have this distribution, as do comments on blogs and many other forms of user contribution online. This distribution has proven to be a problem when we analyze individual behavior. The basic problem is that we’d like to account for the fact that we have repeated data from many users, but a large number of users only have 1 or 2 data points. For example, Lian recently analyzed data about monetary contributions
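
To make the "few heavy contributors, long tail" pattern concrete, here is a minimal Python sketch that summarises how contributions are spread across people. The function name, the dictionary keys, and the toy log are all hypothetical; nothing here comes from the datasets mentioned in the post.

```python
from collections import Counter


def participation_summary(user_ids):
    """Summarise how contributions are spread across contributors.

    user_ids: one entry per contribution, naming who made it.
    """
    per_user = Counter(user_ids)          # contributions per person
    n_users = len(per_user)
    n_contribs = sum(per_user.values())
    # People who contributed exactly once (the "long tail").
    singles = sum(1 for c in per_user.values() if c == 1)
    _, top_count = per_user.most_common(1)[0]
    return {
        "users": n_users,
        "contributions": n_contribs,
        "share_of_users_with_one": singles / n_users,
        "share_of_contributions_that_are_sole": singles / n_contribs,
        "share_from_top_contributor": top_count / n_contribs,
    }


# Hypothetical toy log: one heavy contributor and a tail of one-off users.
log = ["ann"] * 6 + ["bob"] * 2 + ["cat", "dan", "eve", "fay"]
summary = participation_summary(log)
```

Two different shares are reported on purpose: the fraction of people with a single contribution and the fraction of contributions that are someone's only one. They can differ a lot under a heavy-tailed distribution, so it helps to state which one is meant.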


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We both mostly do research about how people use online interactive websites. [sent-2, score-0.255]

2 One property that most of these systems have is known as the “powerlaw of participation” — the distribution of the number of contributions from each person follows a powerlaw. [sent-3, score-0.715]

3 This means that a few people contribute a TON and many, many people are in the “long tail” and contribute very rarely. [sent-4, score-0.532]

4 For example, Facebook posts and twitter posts both have this distribution, as do comments on blogs and many other forms of user contribution online. [sent-5, score-0.661]

5 This distribution has proven to be a problem when we analyze individual behavior. [sent-6, score-0.477]

6 The basic problem is that we’d like to account for the fact that we have repeated data from many users, but a large number of users only have 1 or 2 data points. [sent-7, score-0.751]

7 For example, Lian recently analyzed data about monetary contributions on the website Spot. [sent-8, score-0.663]

8 Us and in her dataset, over 70% of the contributions were the sole contribution of the contributor. [sent-9, score-0.773]

9 com found similar patterns, with large numbers of tags being used only once or twice. [sent-11, score-0.315]

10 How would you analyze this, taking advantage of the knowledge that some data points are the same individual? [sent-12, score-0.402]

11 Use a hierarchical model with contributions nested within people. [sent-14, score-0.739]

12 (AKA use a random effect for people) But this has problems when the majority of people only have exactly one data point? [sent-15, score-0.514]

13 Indeed, with a powerlaw, a large percentage of the data points come from the few high contributors. [sent-20, score-0.359]

14 Use a hierarchical model with contributions nested within people, but lump all of the “low contributors” into a single large category. [sent-22, score-0.938]

15 This is what I did for my analysis of delicious tagging data, but it was unsatisfying because the relationship between data points in that category is different than the relationship between data points in the other categories (which each represent one individual). [sent-23, score-0.959]

16 It’s fine that the majority of people have exactly one data point. [sent-26, score-0.424]

17 Here are some free ones: - A person-level predictor which is the total number of contributions from that person. [sent-29, score-0.8]

18 (Or maybe the logarithm or reciprocal of this total number will work better as a predictor in a linear model.) [sent-30, score-0.541]

19 - If the contributions are time-ordered, the reciprocal of the time ranking of the contribution (so if someone has 3 contributions, this predictor will be 1, 1/2, and 1/3 for his or her contributions). [sent-31, score-1.122]

20 This will catch if there is anything going on when people post a lot of times, if their first few posts are different. [sent-32, score-0.237]
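
The "free" predictors in sentences 17 to 19 can be sketched in a few lines of Python, together with the textbook partial-pooling weight that suggests why people with a single data point need not break a hierarchical model. This is an illustrative sketch, not the analysis from the post: the record layout, the function names `add_predictors` and `pooling_weight`, and the variance values are assumptions. The weight formula is the usual approximation for a varying-intercept model with data standard deviation `sigma_y` and between-person standard deviation `sigma_person`.

```python
import math
from collections import defaultdict


def add_predictors(records):
    """records: list of (person_id, timestamp) pairs, one per contribution.

    Returns one dict per contribution with the suggested predictors.
    """
    by_person = defaultdict(list)
    for person, when in records:
        by_person[person].append(when)

    rows = []
    for person, times in by_person.items():
        total = len(times)                    # person-level predictor
        for rank, when in enumerate(sorted(times), start=1):
            rows.append({
                "person": person,
                "time": when,
                "total": total,
                "log_total": math.log(total),
                "inv_total": 1.0 / total,
                # 1, 1/2, 1/3, ... in time order within each person
                "inv_time_rank": 1.0 / rank,
            })
    return rows


def pooling_weight(n, sigma_y, sigma_person):
    """Approximate weight a varying-intercept model puts on a person's own
    mean (versus the overall mean) when that person has n observations."""
    own = n / sigma_y ** 2
    prior = 1.0 / sigma_person ** 2
    return own / (own + prior)


# Hypothetical data: person "a" contributes three times, "b" once.
rows = add_predictors([("a", 3), ("a", 1), ("a", 2), ("b", 5)])
w_one = pooling_weight(1, sigma_y=1.0, sigma_person=1.0)
w_three = pooling_weight(3, sigma_y=1.0, sigma_person=1.0)
```

With equal variances assumed, a person with one contribution is pulled halfway toward the overall mean, while a person with three keeps more of their own average. The estimate for a one-off contributor is therefore shrunk rather than undefined.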


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('contributions', 0.445), ('lian', 0.306), ('contribution', 0.26), ('powerlaw', 0.204), ('tags', 0.204), ('reciprocal', 0.186), ('predictor', 0.16), ('analyze', 0.154), ('data', 0.151), ('nested', 0.149), ('posts', 0.135), ('large', 0.111), ('number', 0.109), ('individual', 0.102), ('people', 0.102), ('relationship', 0.101), ('users', 0.1), ('majority', 0.099), ('points', 0.097), ('distribution', 0.096), ('tagging', 0.093), ('usc', 0.093), ('use', 0.09), ('lump', 0.088), ('ton', 0.088), ('total', 0.086), ('unsatisfying', 0.084), ('delicious', 0.084), ('hierarchical', 0.08), ('wash', 0.079), ('worries', 0.076), ('nuanced', 0.076), ('contributors', 0.075), ('exactly', 0.072), ('aka', 0.072), ('thoughts', 0.072), ('ranking', 0.071), ('rick', 0.071), ('away', 0.069), ('sole', 0.068), ('many', 0.068), ('tail', 0.067), ('monetary', 0.067), ('within', 0.065), ('property', 0.065), ('facebook', 0.065), ('proven', 0.064), ('twitter', 0.063), ('interactive', 0.063), ('problem', 0.061)]
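
For readers unfamiliar with weights like those above, the following is a minimal sketch of one common tf-idf variant (term frequency times log inverse document frequency, then scaled to unit length). The exact weighting scheme and tokenisation used to produce the list above are not stated on this page, so this is an assumption for illustration only; the function names and toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {word: weight} dict per doc."""
    n_docs = len(docs)
    df = Counter()                      # number of docs containing each word
    for tokens in docs:
        df.update(set(tokens))

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        raw = {
            w: (c / len(tokens)) * math.log(n_docs / df[w])
            for w, c in counts.items()
        }
        # Scale each document's vector to unit Euclidean length.
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        weights.append({w: v / norm for w, v in raw.items()})
    return weights


def top_words(doc_weights, n):
    """Return the n highest-weighted (word, weight) pairs for one document."""
    return sorted(doc_weights.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical toy corpus of three tokenised documents.
docs = [
    ["contributions", "powerlaw", "data", "contributions"],
    ["data", "model", "prior"],
    ["data", "survey", "model"],
]
w = tfidf(docs)
best = top_words(w[0], 1)
```

In this toy corpus "data" appears in every document, so its weight is zero, while "contributions" is both frequent in the first document and absent elsewhere, so it ranks first. That is the same logic by which "contributions" and "lian" rank highly in the list above.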

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0 946 andrew gelman stats-2011-10-07-Analysis of Power Law of Participation

2 0.18327494 2228 andrew gelman stats-2014-02-28-Combining two of my interests

Introduction: Paul Alper writes: Hi Andrew (or Andy or even Gelman [17 of them]): Go to this link and have some fun with (useless? powerful?) data mining. As the authors say, it is addictive. Paul (no other way to spell it) Alper [215 of us] I’m reminded of this discussion from 2012, “Michael’s a Republican, Susan’s a Democrat.” As I wrote at the time: It’s no surprise that men give more to Republicans and women to Democrats, or that the average contribution to a Republican has a larger dollar value than the average contribution to a Democrat, nor perhaps should we be surprised that “Tom” splits his support between the two parties while “Thomas” is a strong Republican. Still, it’s fun to see the data. Overall, I think this graph understates contributions to Republicans because it doesn’t include those new super-pacs. But the new tool seems to be based on a different dataset, opinion polls rather than campaign contributions. Playing around a bit, I see a lot less variability

3 0.12460989 2151 andrew gelman stats-2013-12-27-Should statistics have a Nobel prize?

Introduction: Xiao-Li says yes: The most compelling reason for having highly visible awards in any field is to enhance its ability to attract future talent. Virtually all the media and public attention our profession received in recent years has been on the utility of statistics in all walks of life. We are extremely happy for and proud of this recognition—it is long overdue. However, the media and public have given much more attention to the Fields Medal than to the COPSS Award, even though the former has hardly been about direct or even indirect impact on everyday life. Why this difference? . . . these awards arouse media and public interest by featuring how ingenious the awardees are and how difficult the problems they solved, much like how conquering Everest bestows admiration not because the admirers care or even know much about Everest itself but because it represents the ultimate physical feat. In this sense, the biggest winner of the Fields Medal is mathematics itself: enticing the brig

4 0.10167672 888 andrew gelman stats-2011-09-03-A psychology researcher asks: Is Anova dead?

Introduction: A research psychologist writes in with a question that’s so long that I’ll put my answer first, then put the question itself below the fold. Here’s my reply: As I wrote in my Anova paper and in my book with Jennifer Hill, I do think that multilevel models can completely replace Anova. At the same time, I think the central idea of Anova should persist in our understanding of these models. To me the central idea of Anova is not F-tests or p-values or sums of squares, but rather the idea of predicting an outcome based on factors with discrete levels, and understanding these factors using variance components. The continuous or categorical response thing doesn’t really matter so much to me. I have no problem using a normal linear model for continuous outcomes (perhaps suitably transformed) and a logistic model for binary outcomes. I don’t want to throw away interactions just because they’re not statistically significant. I’d rather partially pool them toward zero using an inform

5 0.099749655 1196 andrew gelman stats-2012-03-04-Piss-poor monocausal social science

Introduction: Dan Kahan writes: Okay, have done due diligence here & can’t find the reference. It was in recent blog — and was more or less an aside — but you ripped into researchers (pretty sure econometricians, but this could be my memory adding to your account recollections it conjured from my own experience) who purport to make estimates or predictions based on multivariate regression in which the value of particular predictor is set at some level while others “held constant” etc., on ground that variance in that particular predictor independent of covariance in other model predictors is unrealistic. You made it sound, too, as if this were one of the pet peeves in your menagerie — leading me to think you had blasted into it before. Know what I’m talking about? Also — isn’t this really just a way of saying that the model is misspecified — at least if the goal is to try to make a valid & unbiased estimate of the impact of that particular predictor? The problem can’t be that one is usin

6 0.099578917 389 andrew gelman stats-2010-11-01-Why it can be rational to vote

7 0.099578917 1565 andrew gelman stats-2012-11-06-Why it can be rational to vote

8 0.096378557 1506 andrew gelman stats-2012-09-21-Building a regression model . . . with only 27 data points

9 0.093370438 746 andrew gelman stats-2011-06-05-An unexpected benefit of Arrow’s other theorem

10 0.093284637 902 andrew gelman stats-2011-09-12-The importance of style in academic writing

11 0.093090996 352 andrew gelman stats-2010-10-19-Analysis of survey data: Design based models vs. hierarchical modeling?

12 0.090045512 1695 andrew gelman stats-2013-01-28-Economists argue about Bayes

13 0.089395285 1447 andrew gelman stats-2012-08-07-Reproducible science FAIL (so far): What’s stoppin people from sharin data and code?

14 0.089098811 1425 andrew gelman stats-2012-07-23-Examples of the use of hierarchical modeling to generalize to new settings

15 0.08706516 2258 andrew gelman stats-2014-03-21-Random matrices in the news

16 0.085485756 1367 andrew gelman stats-2012-06-05-Question 26 of my final exam for Design and Analysis of Sample Surveys

17 0.081654586 945 andrew gelman stats-2011-10-06-W’man < W’pedia, again

18 0.080476187 1527 andrew gelman stats-2012-10-10-Another reason why you can get good inferences from a bad model

19 0.079779223 1981 andrew gelman stats-2013-08-14-The robust beauty of improper linear models in decision making

20 0.079561621 1823 andrew gelman stats-2013-04-24-The Tweets-Votes Curve


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.192), (1, 0.036), (2, 0.033), (3, -0.011), (4, 0.056), (5, 0.007), (6, -0.017), (7, -0.026), (8, -0.002), (9, 0.022), (10, 0.009), (11, -0.008), (12, 0.007), (13, -0.013), (14, -0.02), (15, 0.044), (16, 0.031), (17, -0.006), (18, 0.02), (19, -0.002), (20, -0.009), (21, 0.006), (22, -0.005), (23, 0.022), (24, -0.053), (25, -0.003), (26, 0.012), (27, 0.019), (28, 0.015), (29, 0.022), (30, 0.02), (31, -0.02), (32, 0.019), (33, 0.007), (34, 0.014), (35, 0.065), (36, 0.026), (37, 0.021), (38, -0.043), (39, 0.019), (40, 0.003), (41, -0.032), (42, -0.028), (43, -0.02), (44, -0.025), (45, 0.007), (46, 0.008), (47, -0.048), (48, -0.012), (49, -0.002)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.95982546 946 andrew gelman stats-2011-10-07-Analysis of Power Law of Participation

2 0.82238078 1212 andrew gelman stats-2012-03-14-Controversy about a ranking of philosophy departments, or How should we think about statistical results when we can’t see the raw data?

Introduction: Jeff Helzner writes: A friend of mine and I cited your open data article in our attempts to persuade a professor at another institution [Brian Leiter] into releasing the raw data from his influential rankings of philosophy departments. He is now claiming the national security response: . . . disclosing the reputational data would violate the terms on which the evaluators agreed to complete the surveys (did they even bother to read the description of the methodology, one wonders?). I [Helzner] do not find this to be a compelling reply in this case. In fact, I would say that when such data cannot be disclosed it reveals a flaw in the design of the survey. Experimental designs must be open so that others can run the experiment. Mathematical proofs must be open so that they can be reviewed by others. Likewise, it seems to me that the details of statistical argument should be open to inspection. Do you have any thoughts on this? Or do you know of any other leading statistici

3 0.81964236 1506 andrew gelman stats-2012-09-21-Building a regression model . . . with only 27 data points

Introduction: Dan Silitonga writes: I was wondering whether you would have any advice on building a regression model on a very small dataset. I’m in the midst of revamping the model to predict tax collections from unincorporated businesses. But I only have 27 data points, 27 years of annual data. Any advice would be much appreciated. My reply: This sounds tough, especially given that 27 years of annual data isn’t even 27 independent data points. I have various essentially orthogonal suggestions: 1 [added after seeing John Cook's comment below]. Do your best, making as many assumptions as you need. In a Bayesian context, this means that you’d use a strong and informative prior and let the data update it as appropriate. In a less formal setting, you’d start with a guess of a model and then alter it to the extent that your data contradict your original guess. 2. Get more data. Not by getting information on more years (I assume you can’t do that) but by breaking up the data you do

4 0.8003726 1383 andrew gelman stats-2012-06-18-Hierarchical modeling as a framework for extrapolation

Introduction: Phil recently posted on the challenge of extrapolation of inferences to new data. After telling the story of a colleague who flat-out refused to make predictions from his model of buildings to new data, Phil wrote, “This is an interesting problem because it is sort of outside the realm of statistics, and into some sort of meta-statistical area. How can you judge whether your results can be extrapolated to the ‘real world,’ if you cant get a real-world sample to compare to?” In reply, I wrote: I agree that this is an important and general problem, but I don’t think it is outside the realm of statistics! I think that one useful statistical framework here is multilevel modeling. Suppose you are applying a procedure to J cases and want to predict case J+1 (in this case, the cases are buildings and J=52). Let the parameters be theta_1,…,theta_{J+1}, with data y_1,…,y_{J+1}, and case-level predictors X_1,…,X_{J+1}. The question is how to generalize from (theta_1,…,theta_J) to theta_{

5 0.79477614 1289 andrew gelman stats-2012-04-29-We go to war with the data we have, not the data we want

Introduction: This post is by Phil. Psychologists perform experiments on Canadian undergraduate psychology students and draw conclusions that (they believe) apply to humans in general; they publish in Science. A drug company decides to embark on additional trials that will cost tens of millions of dollars based on the results of a careful double-blind study….whose patients are all volunteers from two hospitals. A movie studio holds 9 screenings of a new movie for volunteer viewers and, based on their survey responses, decides to spend another $8 million to re-shoot the ending. A researcher interested in the effect of ventilation on worker performance conducts a months-long study in which ventilation levels are varied and worker performance is monitored…in a single building. In almost all fields of research, most studies are based on convenience samples, or on random samples from a larger population that is itself a convenience sample. The paragraph above gives just a few examples. The benefit

6 0.79388684 569 andrew gelman stats-2011-02-12-Get the Data

7 0.79372591 1805 andrew gelman stats-2013-04-16-Memo to Reinhart and Rogoff: I think it’s best to admit your errors and go on from there

8 0.79261941 948 andrew gelman stats-2011-10-10-Combining data from many sources

9 0.78469485 154 andrew gelman stats-2010-07-18-Predictive checks for hierarchical models

10 0.77891117 1940 andrew gelman stats-2013-07-16-A poll that throws away data???

11 0.77661425 70 andrew gelman stats-2010-06-07-Mister P goes on a date

12 0.77463055 544 andrew gelman stats-2011-01-29-Splitting the data

13 0.77184129 211 andrew gelman stats-2010-08-17-Deducer update

14 0.77160639 527 andrew gelman stats-2011-01-20-Cars vs. trucks

15 0.76980805 1447 andrew gelman stats-2012-08-07-Reproducible science FAIL (so far): What’s stoppin people from sharin data and code?

16 0.76664954 1178 andrew gelman stats-2012-02-21-How many data points do you really have?

17 0.76102197 690 andrew gelman stats-2011-05-01-Peter Huber’s reflections on data analysis

18 0.75908387 2059 andrew gelman stats-2013-10-12-Visualization, “big data”, and EDA

19 0.75898194 1823 andrew gelman stats-2013-04-24-The Tweets-Votes Curve

20 0.75883234 215 andrew gelman stats-2010-08-18-DataMarket


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(13, 0.021), (15, 0.025), (16, 0.071), (21, 0.023), (24, 0.116), (42, 0.024), (63, 0.027), (73, 0.017), (75, 0.173), (86, 0.047), (95, 0.022), (99, 0.295)]
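
Topic-weight lists like the one above are often compared with cosine similarity. The sketch below shows that computation for sparse `(topicId, topicWeight)` pairs. Whether the simValue column on this page was computed this way is not stated, so treat it as an assumption; the listed values may come from a different measure or an approximate index, since this post's similarity to itself is listed below 1. The second vector, `other_post`, is hypothetical.

```python
import math


def cosine_sparse(a, b):
    """Cosine similarity of two sparse vectors given as (topic_id, weight)
    pairs. Topics missing from a vector are treated as weight 0."""
    da, db = dict(a), dict(b)
    dot = sum(w * db.get(t, 0.0) for t, w in da.items())
    na = math.sqrt(sum(w * w for w in da.values()))
    nb = math.sqrt(sum(w * w for w in db.values()))
    if na == 0.0 or nb == 0.0:
        return 0.0                      # convention for an all-zero vector
    return dot / (na * nb)


# Topic weights copied from the list above.
this_post = [(13, 0.021), (15, 0.025), (16, 0.071), (21, 0.023),
             (24, 0.116), (42, 0.024), (63, 0.027), (73, 0.017),
             (75, 0.173), (86, 0.047), (95, 0.022), (99, 0.295)]

# Hypothetical second post sharing only some topics with the first.
other_post = [(24, 0.2), (75, 0.1), (99, 0.3), (7, 0.4)]

self_sim = cosine_sparse(this_post, this_post)
cross_sim = cosine_sparse(this_post, other_post)
disjoint_sim = cosine_sparse([(1, 0.5)], [(2, 0.5)])
```

Working on dictionaries keyed by topic id avoids building dense vectors when most topic weights are zero, which matches the sparse format shown above.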

similar blogs list:

simIndex simValue blogId blogTitle

1 0.97178268 893 andrew gelman stats-2011-09-06-Julian Symons on Frances Newman

Introduction: “She was forty years old when she died. It is possible that her art might have developed to include a wider area of human experience, just as possible that the chilling climate of the thirties might have withered it altogether. But what she actually wrote was greatly talented. She deserves a place, although obviously not a foremost one, in any literary history of the years between the wars. The last letter she wrote, or rather dictated, to the printer of the Laforgue translations shows the invariable fastidiousness of her talent, a fastidiousness which is often infuriating but just as often impressive, and is in any case rare enough to be worth remembrance: To the Printer of Six Moral Tales This book is to be spelled and its words are to be hyphenated according to the usage of the Concise Oxford Dictionary. Page introduction continuously with the tales. Do not put brackets around the numbers of the pages. All the ‘todays’ and all the ‘tomorrows’ should be spelled w

2 0.96775103 522 andrew gelman stats-2011-01-18-Problems with Haiti elections?

Introduction: Mark Weisbrot points me to this report trashing a recent OAS report on Haiti’s elections. Weisbrot writes: The two simplest things that are wrong with the OAS analysis are: (1) By looking only at a sample of the tally sheets and not using any statistical test, they have no idea how many other tally sheets would also be thrown out by the same criteria that they used, and how that would change the result and (2) The missing/quarantined tally sheets are much greater in number than the ones that they threw out; our analysis indicates that if these votes had been counted, the result would go the other way. I have not had a chance to take a look at this myself but I’m posting it here so that experts on election irregularities can see this and give their judgments. P.S. Weisbrot updates: We [Weisbrot et al.] published our actual paper on the OAS Mission’s Report today. The press release is here and gives a very good summary of the major problems with the OAS Mission rep

3 0.96559155 1067 andrew gelman stats-2011-12-18-Christopher Hitchens was a Bayesian

Introduction: 1. We Bayesian statisticians like to say there are three kinds of statisticians: a. Bayesians; b. People who are Bayesians but don’t realize it (that is, they act in coherence with some unstated probability); c. Failed Bayesians (that is, people whose inference could be improved by some attention to coherence). So, if a statistician does great work, we are inclined to claim this person for the Bayesian cause, even if he or she vehemently denies any Bayesian leanings. 2. In his autobiography, Bertrand Russell tells the story of when he went to prison for opposing World War 1: I [Russell] was much cheered on my arrival by the warden at the gate, who had to take particulars about me. He asked my religion, and I replied ‘agnostic.’ He asked how to spell it, and remarked with a sigh: “Well, there are many religions, but I suppose they all worship the same God.” This remark kept me cheerful for about a week. 3. In an op-ed today, Ross Douthat argues that celebrated a

4 0.96301556 28 andrew gelman stats-2010-05-12-Alert: Incompetent colleague wastes time of hardworking Wolfram Research publicist

Introduction: Marty McKee at Wolfram Research appears to have a very very stupid colleague. McKee wrote to Christian Robert: Your article, “Evidence and Evolution: A review”, caught the attention of one of my colleagues, who thought that it could be developed into an interesting Demonstration to add to the Wolfram Demonstrations Project. As Christian points out, adapting his book review into a computer demonstration would be quite a feat! I wonder what McKee’s colleague could be thinking? I recommend that Wolfram fire McKee’s colleague immediately: what an idiot! P.S. I’m not actually sure that McKee was the author of this email; I’m guessing this was the case because this other very similar email was written under his name. P.P.S. To head off the inevitable comments: Yes, yes, I know this is no big deal and I shouldn’t get bent out of shape about it. But . . . Wolfram Research has contributed such great things to the world, that I hate to think of them wasting any money paying

5 0.9548673 1396 andrew gelman stats-2012-06-27-Recently in the sister blog

Introduction: If Paul Krugman is right and it’s 1931, what happens next? What’s with Niall Ferguson? Hey, this reminds me of the Democrats in the U.S. . . . Would President Romney contract the economy? Inconsistency with prior knowledge triggers children’s causal explanatory reasoning

same-blog 6 0.93619889 946 andrew gelman stats-2011-10-07-Analysis of Power Law of Participation

7 0.93397909 1003 andrew gelman stats-2011-11-11-$

8 0.93279546 1808 andrew gelman stats-2013-04-17-Excel-bashing

9 0.92956793 1309 andrew gelman stats-2012-05-09-The first version of my “inference from iterative simulation using parallel sequences” paper!

10 0.91000879 2034 andrew gelman stats-2013-09-23-My talk Tues 24 Sept at 12h30 at Université de Technologie de Compiègne

11 0.90429717 2157 andrew gelman stats-2014-01-02-2013

12 0.89816439 2081 andrew gelman stats-2013-10-29-My talk in Amsterdam tomorrow (Wed 29 Oct): Can we use Bayesian methods to resolve the current crisis of statistically-significant research findings that don’t hold up?

13 0.89050072 8 andrew gelman stats-2010-04-28-Advice to help the rich get richer

14 0.88139892 2235 andrew gelman stats-2014-03-06-How much time (if any) should we spend criticizing research that’s fraudulent, crappy, or just plain pointless?

15 0.88114715 2228 andrew gelman stats-2014-02-28-Combining two of my interests

16 0.87789261 967 andrew gelman stats-2011-10-20-Picking on Gregg Easterbrook

17 0.87586057 315 andrew gelman stats-2010-10-03-He doesn’t trust the fit . . . r=.999

18 0.8753581 2263 andrew gelman stats-2014-03-24-Empirical implications of Empirical Implications of Theoretical Models

19 0.87458843 1910 andrew gelman stats-2013-06-22-Struggles over the criticism of the “cannabis users and IQ change” paper

20 0.87384921 1750 andrew gelman stats-2013-03-05-Watership Down, thick description, applied statistics, immutability of stories, and playing tennis with a net