andrew_gelman_stats andrew_gelman_stats-2013 andrew_gelman_stats-2013-2136 knowledge-graph by maker-knowledge-mining

2136 andrew gelman stats-2013-12-16-Whither the “bet on sparsity principle” in a nonsparse world?


meta info for this blog

Source: html

Introduction: Rob Tibshirani writes: Hastie et al. (2001) coined the informal “Bet on Sparsity” principle. The l1 methods assume that the truth is sparse, in some basis. If the assumption holds true, then the parameters can be efficiently estimated using l1 penalties. If the assumption does not hold—so that the truth is dense—then no method will be able to recover the underlying model without a large amount of data per parameter. I’ve earlier expressed my full and sincere appreciation for Hastie and Tibshirani’s work in this area. Now I’d like to briefly comment on the above snippet. The question is, how do we think about the “bet on sparsity” principle in a world where the truth is dense? I’m thinking here of social science, where no effects are clean and no coefficient is zero (see page 960 of this article or various blog discussions in the past few years), where every contrast is meaningful—but some of these contrasts might be lost in the noise with any realistic size of data.
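
A minimal simulation sketch of the “bet on sparsity” tradeoff described above (my own illustration, not from the post), assuming Python with numpy and scikit-learn and a hypothetical setup of n = 100 observations and p = 200 coefficients. It fits an l1-penalized (lasso) regression under a sparse truth and under a dense truth and reports the relative error in recovering the coefficient vector; typically the sparse truth is recovered well and the dense truth is not.

# Sketch only: simulated data, not from the original post.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 100, 200
X = rng.standard_normal((n, p))

def relative_recovery_error(beta):
    # Fit an l1-penalized regression and return ||beta_hat - beta||^2 / ||beta||^2.
    y = X @ beta + rng.standard_normal(n)
    beta_hat = LassoCV(cv=5).fit(X, y).coef_
    return np.sum((beta_hat - beta) ** 2) / np.sum(beta ** 2)

beta_sparse = np.zeros(p)
beta_sparse[:5] = 2.0                      # sparse truth: 5 large effects, 195 exact zeros
beta_dense = rng.normal(0, 0.2, size=p)    # dense truth: every effect small but nonzero

print("relative error, sparse truth:", relative_recovery_error(beta_sparse))
print("relative error, dense truth :", relative_recovery_error(beta_dense))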


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The l1 methods assume that the truth is sparse, in some basis. [sent-3, score-0.245]

2 If the assumption holds true, then the parameters can be efficiently estimated using l1 penalties. [sent-4, score-0.269]

3 If the assumption does not hold—so that the truth is dense—then no method will be able to recover the underlying model without a large amount of data per parameter. [sent-5, score-0.827]

4 I’ve earlier expressed my full and sincere appreciation for Hastie and Tibshirani’s work in this area. [sent-6, score-0.244]

5 Now I’d like to briefly comment on the above snippet. [sent-7, score-0.068]

6 The question is, how do we think about the “bet on sparsity” principle in a world where the truth is dense? [sent-8, score-0.24]

7 I’m thinking here of social science, where no effects are clean and no coefficient is zero (see page 960 of this article or various blog discussions in the past few years), where every contrast is meaningful—but some of these contrasts might be lost in the noise with any realistic size of data. [sent-9, score-0.422]

8 I think there is a way out here, which is that in a dense setting we are not actually interested in “recovering the underlying model.” [sent-10, score-0.524]

9 The underlying model, such as it is, is a continuous mix of effects. [sent-11, score-0.277]

10 If there’s no discrete thing to recover, there’s no reason to worry that we can’t recover it! [sent-12, score-0.325]

11 I’m sure things are different in a field such as chemistry, where you can try to identify the key compounds that make up some substance. [sent-13, score-0.174]

12 The above quote and link come from Rob’s chapter, “In praise of sparsity and convexity,” in the Committee of Presidents of Statistical Societies volume. [sent-16, score-0.473]

13 My chapter, “How do we choose our default methods? [sent-17, score-0.06]

14 I do think it can often make sense to consider the decision-analytic reasons why it can make sense to go for sparsity: sparse models can be faster to compute, easier to understand, and yield more stable inferences. [sent-22, score-0.742]

15 (Sometimes people say that a sparse model is less likely to overfit but I don’t think that’s quite right, as you can also get rid of overfitting by using a strong regularizer. [sent-23, score-0.669]

16 But I think it is fair to say that a sparse model can yield more stable inferences, in that the inferences for the more complex model can be sensitive to the details of the regularizer or the prior distribution. [sent-24, score-1.085]
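
Sentences 14 through 16 above distinguish sparsity from regularization as ways of controlling overfitting. Here is a small sketch of that distinction (my own construction, assuming scikit-learn and simulated data with a dense truth): an unregularized least-squares fit overfits badly when the number of predictors is large relative to the sample size, while both a sparse l1 fit and a non-sparse but strongly regularized l2 fit keep the held-out error under control.

# Sketch only: simulated data with a dense truth (no coefficient exactly zero).
import numpy as np
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n, p = 150, 100
beta = rng.normal(0, 0.3, size=p)
X = rng.standard_normal((n, p))
y = X @ beta + rng.standard_normal(n)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

models = {
    "ols, no regularization": LinearRegression(),
    "lasso, l1 (sparse fit)": LassoCV(cv=5),
    "ridge, l2 (dense fit)  ": RidgeCV(alphas=np.logspace(-2, 3, 30)),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    mse = np.mean((model.predict(X_te) - y_te) ** 2)
    print(name, "held-out mse:", round(float(mse), 2))

The lasso fit is sparse and the ridge fit is dense, but both avoid the overfitting of the unregularized fit, which is the point of sentence 15; the two regularized fits can still differ in interpretability and in how sensitive the resulting inferences are to the choice of penalty, as in sentence 16.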


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('sparsity', 0.383), ('sparse', 0.321), ('dense', 0.304), ('recover', 0.261), ('hastie', 0.187), ('truth', 0.177), ('tibshirani', 0.174), ('rob', 0.171), ('underlying', 0.157), ('stable', 0.147), ('bet', 0.141), ('yield', 0.139), ('assumption', 0.117), ('compounds', 0.116), ('model', 0.115), ('inferences', 0.113), ('recovering', 0.105), ('coined', 0.105), ('chapter', 0.098), ('contrasts', 0.096), ('appreciation', 0.094), ('praise', 0.09), ('overfitting', 0.09), ('sincere', 0.086), ('presidents', 0.086), ('societies', 0.084), ('chemistry', 0.084), ('efficiently', 0.081), ('committee', 0.08), ('rid', 0.08), ('informal', 0.076), ('realistic', 0.075), ('sensitive', 0.072), ('faster', 0.072), ('meaningful', 0.071), ('holds', 0.071), ('briefly', 0.068), ('methods', 0.068), ('compute', 0.065), ('discrete', 0.064), ('noise', 0.064), ('expressed', 0.064), ('think', 0.063), ('clean', 0.063), ('coefficient', 0.062), ('lost', 0.062), ('mix', 0.061), ('default', 0.06), ('continuous', 0.059), ('identify', 0.058)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999982 2136 andrew gelman stats-2013-12-16-Whither the “bet on sparsity principle” in a nonsparse world?


2 0.43940607 2185 andrew gelman stats-2014-01-25-Xihong Lin on sparsity and density

Introduction: I pointed Xihong Lin to this post from last month regarding Hastie and Tibshirani’s “bet on sparsity principle.” I argued that, in the worlds in which I work, in social and environmental science, every contrast is meaningful, even if not all of them can be distinguished from noise given a particular dataset. That is, I claim that effects are dense but data can be sparse—and any apparent sparsity of effects is typically just an artifact of sparsity of data. But things might be different in other fields. Xihong had an interesting perspective in the application areas where she works: Sparsity and density both appear in genetic studies too. For example, ethnicity has effects across millions of genetic variants across the genome (dense). Disease associated genetic variants are sparse.

3 0.17912871 2317 andrew gelman stats-2014-05-04-Honored oldsters write about statistics

Introduction: The new book titled: Past, Present, and Future of Statistical Science is now available for download . The official description makes the book sound pretty stuffy: Past, Present, and Future of Statistical Science, commissioned by the Committee of Presidents of Statistical Societies (COPSS) to celebrate its 50th anniversary and the International Year of Statistics, will be published in April by Taylor & Francis/CRC Press. Through the contributions of a distinguished group of 50 statisticians, the book showcases the breadth and vibrancy of statistics, describes current challenges and new opportunities, highlights the exciting future of statistical science, and provides guidance for future statisticians. Contributors are past COPSS award honorees. But it actually has lots of good stuff, including the chapter by Tibshirani which I discussed last year (in the context of the “bet on sparsity principle”), and chapters by XL and other fun people. Also my own chapter, How do we choo

4 0.14966367 2109 andrew gelman stats-2013-11-21-Hidden dangers of noninformative priors

Introduction: Following up on Christian’s post [link fixed] on the topic, I’d like to offer a few thoughts of my own. In BDA, we express the idea that a noninformative prior is a placeholder: you can use the noninformative prior to get the analysis started, then if your posterior distribution is less informative than you would like, or if it does not make sense, you can go back and add prior information. Same thing for the data model (the “likelihood”), for that matter: it often makes sense to start with something simple and conventional and then go from there. So, in that sense, noninformative priors are no big deal, they’re just a way to get started. Just don’t take them too seriously. Traditionally in statistics we’ve worked with the paradigm of a single highly informative dataset with only weak external information. But if the data are sparse and prior information is strong, we have to think differently. And, when you increase the dimensionality of a problem, both these things hap

5 0.14289251 1769 andrew gelman stats-2013-03-18-Tibshirani announces new research result: A significance test for the lasso

Introduction: Lasso and me For a long time I was wrong about lasso. Lasso (“least absolute shrinkage and selection operator”) is a regularization procedure that shrinks regression coefficients toward zero, and in its basic form is equivalent to maximum penalized likelihood estimation with a penalty function that is proportional to the sum of the absolute values of the regression coefficients. I first heard about lasso from a talk that Trevor Hastie Rob Tibshirani gave at Berkeley in 1994 or 1995. He demonstrated that it shrunk regression coefficients to zero. I wasn’t impressed, first because it seemed like no big deal (if that’s the prior you use, that’s the shrinkage you get) and second because, from a Bayesian perspective, I don’t want to shrink things all the way to zero. In the sorts of social and environmental science problems I’ve worked on, just about nothing is zero. I’d like to control my noisy estimates but there’s nothing special about zero. At the end of the talk I stood
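
For reference, the penalized-likelihood description in this excerpt corresponds to the standard lasso optimization problem (textbook form, not quoted from the post): $\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \{ \tfrac{1}{2} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \}$, where $\lambda \ge 0$ controls the strength of the l1 penalty; larger values of $\lambda$ set more coefficients exactly to zero, which is the shrinkage-to-zero behavior described above.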

6 0.14021757 1319 andrew gelman stats-2012-05-14-I hate to get all Gerd Gigerenzer on you here, but . . .

7 0.11714284 811 andrew gelman stats-2011-07-20-Kind of Bayesian

8 0.10964648 327 andrew gelman stats-2010-10-07-There are never 70 distinct parameters

9 0.10543533 1972 andrew gelman stats-2013-08-07-When you’re planning on fitting a model, build up to it by fitting simpler models first. Then, once you have a model you like, check the hell out of it

10 0.10143348 1392 andrew gelman stats-2012-06-26-Occam

11 0.1002717 1877 andrew gelman stats-2013-05-30-Infill asymptotics and sprawl asymptotics

12 0.095931247 754 andrew gelman stats-2011-06-09-Difficulties with Bayesian model averaging

13 0.095615335 1946 andrew gelman stats-2013-07-19-Prior distributions on derived quantities rather than on parameters themselves

14 0.09428861 2079 andrew gelman stats-2013-10-27-Uncompressing the concept of compressed sensing

15 0.09405455 781 andrew gelman stats-2011-06-28-The holes in my philosophy of Bayesian data analysis

16 0.091720946 1247 andrew gelman stats-2012-04-05-More philosophy of Bayes

17 0.090955287 138 andrew gelman stats-2010-07-10-Creating a good wager based on probability estimates

18 0.088569812 1431 andrew gelman stats-2012-07-27-Overfitting

19 0.088067219 217 andrew gelman stats-2010-08-19-The “either-or” fallacy of believing in discrete models: an example of folk statistics

20 0.085469194 2133 andrew gelman stats-2013-12-13-Flexibility is good


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.165), (1, 0.087), (2, 0.002), (3, 0.019), (4, -0.016), (5, -0.017), (6, 0.008), (7, -0.01), (8, 0.037), (9, 0.035), (10, -0.016), (11, 0.015), (12, -0.006), (13, -0.02), (14, -0.042), (15, 0.011), (16, -0.006), (17, -0.005), (18, -0.031), (19, 0.014), (20, -0.005), (21, -0.069), (22, -0.026), (23, 0.01), (24, 0.018), (25, 0.021), (26, 0.015), (27, 0.024), (28, 0.022), (29, -0.014), (30, -0.028), (31, 0.015), (32, -0.01), (33, -0.025), (34, 0.056), (35, 0.003), (36, 0.023), (37, -0.032), (38, 0.017), (39, -0.008), (40, 0.012), (41, 0.018), (42, 0.023), (43, -0.002), (44, 0.009), (45, -0.036), (46, -0.047), (47, 0.025), (48, -0.032), (49, 0.054)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.95529389 2136 andrew gelman stats-2013-12-16-Whither the “bet on sparsity principle” in a nonsparse world?


2 0.78813106 1983 andrew gelman stats-2013-08-15-More on AIC, WAIC, etc

Introduction: Following up on our discussion from the other day, Angelika van der Linde sends along this paper from 2012 (link to journal here ). And Aki pulls out this great quote from Geisser and Eddy (1979): This discussion makes clear that in the nested case this method, as Akaike’s, is not consistent; i.e., even if $M_k$ is true, it will be rejected with probability $\alpha$ as $N\to\infty$. This point is also made by Schwarz (1978). However, from the point of view of prediction, this is of no great consequence. For large numbers of observations, a prediction based on the falsely assumed $M_k$, will not differ appreciably from one based on the true $M_k$. For example, if we assert that two normal populations have different means when in fact they have the same mean, then the use of the group mean as opposed to the grand mean for predicting a future observation results in predictors which are asymptotically equivalent and whose predictive variances are $\sigma^2[1 + (1/2n)]$ and $\si

3 0.7589308 1392 andrew gelman stats-2012-06-26-Occam

Introduction: Cosma Shalizi and Larry Wasserman discuss some papers from a conference on Ockham’s Razor. I don’t have anything new to add on this so let me link to past blog entries on the topic and repost the following from 2004 : A lot has been written in statistics about “parsimony”—that is, the desire to explain phenomena using fewer parameters–but I’ve never seen any good general justification for parsimony. (I don’t count “Occam’s Razor,” or “Ockham’s Razor,” or whatever, as a justification. You gotta do better than digging up a 700-year-old quote.) Maybe it’s because I work in social science, but my feeling is: if you can approximate reality with just a few parameters, fine. If you can use more parameters to fold in more information, that’s even better. In practice, I often use simple models—because they are less effort to fit and, especially, to understand. But I don’t kid myself that they’re better than more complicated efforts! My favorite quote on this comes from Rad

4 0.74833721 1395 andrew gelman stats-2012-06-27-Cross-validation (What is it good for?)

Introduction: I think cross-validation is a good way to estimate a model’s forecasting error but I don’t think it’s always such a great tool for comparing models. I mean, sure, if the differences are dramatic, ok. But you can easily have a few candidate models, and one model makes a lot more sense than the others (even from a purely predictive sense, I’m not talking about causality here). The difference between the models doesn’t show up in an xval measure of total error but in the patterns of the predictions. For a simple example, imagine using a linear model with positive slope to model a function that is constrained to be increasing. If the constraint isn’t in the model, the predicted/imputed series will sometimes be nonmonotonic. The effect on the prediction error can be so tiny as to be undetectable (or it might even increase avg prediction error to include the constraint); nonetheless, the predictions will be clearly nonsensical. That’s an extreme example but I think the general point h
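
A small sketch of the cross-validation point in this excerpt (my own, on simulated data): an unconstrained fit and a monotone fit can have nearly the same cross-validated error even though only the monotone fit respects the known shape of the underlying curve; the exact numbers will vary with the simulation seed.

# Sketch only: true curve is increasing; the unconstrained polynomial fit is often
# non-monotone even when its cross-validated error is close to the isotonic fit's.
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 1, 80))
y = np.log1p(5 * x) + rng.normal(0, 0.15, size=x.size)

def cv_mse(fit_predict):
    errs = []
    for tr, te in KFold(n_splits=5, shuffle=True, random_state=0).split(x):
        pred = fit_predict(x[tr], y[tr], x[te])
        errs.append(np.mean((pred - y[te]) ** 2))
    return float(np.mean(errs))

def poly_fit(x_tr, y_tr, x_new):
    return np.polyval(np.polyfit(x_tr, y_tr, 6), x_new)          # unconstrained

def iso_fit(x_tr, y_tr, x_new):
    return IsotonicRegression(out_of_bounds="clip").fit(x_tr, y_tr).predict(x_new)

grid = np.linspace(0, 1, 200)
print("cv mse, unconstrained polynomial:", round(cv_mse(poly_fit), 4))
print("cv mse, isotonic (monotone) fit :", round(cv_mse(iso_fit), 4))
print("polynomial fit monotone on a grid?", bool(np.all(np.diff(poly_fit(x, y, grid)) >= 0)))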

5 0.74582398 754 andrew gelman stats-2011-06-09-Difficulties with Bayesian model averaging

Introduction: In response to this article by Cosma Shalizi and myself on the philosophy of Bayesian statistics, David Hogg writes: I [Hogg] agree–even in physics and astronomy–that the models are not “True” in the God-like sense of being absolute reality (that is, I am not a realist); and I have argued (a philosophically very naive paper, but hey, I was new to all this) that for pretty fundamental reasons we could never arrive at the True (with a capital “T”) model of the Universe. The goal of inference is to find the “best” model, where “best” might have something to do with prediction, or explanation, or message length, or (horror!) our utility. Needless to say, most of my physics friends *are* realists, even in the face of “effective theories” as Newtonian mechanics is an effective theory of GR and GR is an effective theory of “quantum gravity” (this plays to your point, because if you think any theory is possibly an effective theory, how could you ever find Truth?). I also liked the i

6 0.74510968 1412 andrew gelman stats-2012-07-10-More questions on the contagion of obesity, height, etc.

7 0.73987985 1041 andrew gelman stats-2011-12-04-David MacKay and Occam’s Razor

8 0.73422164 552 andrew gelman stats-2011-02-03-Model Makers’ Hippocratic Oath

9 0.73348832 1739 andrew gelman stats-2013-02-26-An AI can build and try out statistical models using an open-ended generative grammar

10 0.7304436 217 andrew gelman stats-2010-08-19-The “either-or” fallacy of believing in discrete models: an example of folk statistics

11 0.72913378 1742 andrew gelman stats-2013-02-27-What is “explanation”?

12 0.72421068 1518 andrew gelman stats-2012-10-02-Fighting a losing battle

13 0.7168119 776 andrew gelman stats-2011-06-22-Deviance, DIC, AIC, cross-validation, etc

14 0.71484005 1406 andrew gelman stats-2012-07-05-Xiao-Li Meng and Xianchao Xie rethink asymptotics

15 0.71472317 421 andrew gelman stats-2010-11-19-Just chaid

16 0.71283448 496 andrew gelman stats-2011-01-01-Tukey’s philosophy

17 0.71275383 614 andrew gelman stats-2011-03-15-Induction within a model, deductive inference for model evaluation

18 0.70559591 1459 andrew gelman stats-2012-08-15-How I think about mixture models

19 0.70473754 2208 andrew gelman stats-2014-02-12-How to think about “identifiability” in Bayesian inference?

20 0.69983292 320 andrew gelman stats-2010-10-05-Does posterior predictive model checking fit with the operational subjective approach?


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(9, 0.046), (16, 0.04), (17, 0.059), (18, 0.048), (21, 0.078), (24, 0.152), (31, 0.012), (53, 0.019), (68, 0.033), (70, 0.012), (84, 0.052), (87, 0.037), (95, 0.044), (99, 0.265)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.96921968 2136 andrew gelman stats-2013-12-16-Whither the “bet on sparsity principle” in a nonsparse world?


2 0.93609542 2112 andrew gelman stats-2013-11-25-An interesting but flawed attempt to apply general forecasting principles to contextualize attitudes toward risks of global warming

Introduction: I came across a document [updated link here ], “Applying structured analogies to the global warming alarm movement,” by Kesten Green and Scott Armstrong. The general approach is appealing to me, but the execution seemed disturbingly flawed. Here’s how they introduce the project: The structured analogies procedure we [Green and Armstrong] used for this study was as follows: 1. Identify possible analogies by searching the literature and by asking experts with different viewpoints to nominate analogies to the target situation: alarm over dangerous manmade global warming. 2. Screen the possible analogies to ensure they meet the stated criteria and that the outcomes are known. 3. Code the relevant characteristics of the analogous situations. 4. Forecast target situation outcomes by using a predetermined mechanical rule to select the outcomes of the analogies. Here is how we posed the question to the experts: The Intergovernmental Panel on Climate Change and other organizat

3 0.93205923 2037 andrew gelman stats-2013-09-25-Classical probability does not apply to quantum systems (causal inference edition)

Introduction: James Robins, Tyler VanderWeele, and Richard Gill write: Neyman introduced a formal mathematical theory of counterfactual causation that now has become standard language in many quantitative disciplines, but not in physics. We use results on causal interaction and interference between treatments (derived under the Neyman theory) to give a simple new proof of a well-known result in quantum physics, namely, Bell’s inequality. Now the predictions of quantum mechanics and the results of experiment both violate Bell’s inequality. In the remainder of the talk, we review the implications for a counterfactual theory of causation. Assuming with Einstein that faster than light (supraluminal) communication is not possible, one can view the Neyman theory of counterfactuals as falsified by experiment. . . . Is it safe for a quantitative discipline to rely on a counterfactual approach to causation, when our best confirmed physical theory falsifies their existence? I haven’t seen the talk

4 0.93112195 1459 andrew gelman stats-2012-08-15-How I think about mixture models

Introduction: Larry Wasserman refers to finite mixture models as “beasts” and jokes that they “should be avoided at all costs.” I’ve thought a lot about mixture models, ever since using them in an analysis of voting patterns that was published in 1990. First off, I’d like to say that our model was useful so I’d prefer not to pay the cost of avoiding it. For a quick description of our mixture model and its context, see pp. 379-380 of my article in the Jim Berger volume. Actually, our case was particularly difficult because we were not even fitting a mixture model to data, we were fitting it to latent data and using the model to perform partial pooling. My difficulties in trying to fit this model inspired our discussion of mixture models in Bayesian Data Analysis (page 109 in the second edition, in the section on “Counterexamples to the theorems”). I agree with Larry that if you’re fitting a mixture model, it’s good to be aware of the problems that arise if you try to estimate

5 0.92816097 810 andrew gelman stats-2011-07-20-Adding more information can make the variance go up (depending on your model)

Introduction: Andy McKenzie writes: In their March 9 “counterpoint” in nature biotech to the prospect that we should try to integrate more sources of data in clinical practice (see “point” arguing for this), Isaac Kohane and David Margulies claim that, “Finally, how much better is our new knowledge than older knowledge? When is the incremental benefit of a genomic variant(s) or gene expression profile relative to a family history or classic histopathology insufficient and when does it add rather than subtract variance?” Perhaps I am mistaken (thus this email), but it seems that this claim runs contra to the definition of conditional probability. That is, if you have a hierarchical model, and the family history / classical histopathology already suggests a parameter estimate with some variance, how could the new genomic info possibly increase the variance of that parameter estimate? Surely the question is how much variance the new genomic info reduces and whether it therefore justifies t

6 0.92610395 486 andrew gelman stats-2010-12-26-Age and happiness: The pattern isn’t as clear as you might think

7 0.9259581 1136 andrew gelman stats-2012-01-23-Fight! (also a bit of reminiscence at the end)

8 0.92569989 2159 andrew gelman stats-2014-01-04-“Dogs are sensitive to small variations of the Earth’s magnetic field”

9 0.92550719 514 andrew gelman stats-2011-01-13-News coverage of statistical issues…how did I do?

10 0.92550701 2097 andrew gelman stats-2013-11-11-Why ask why? Forward causal inference and reverse causal questions

11 0.92495292 1230 andrew gelman stats-2012-03-26-Further thoughts on nonparametric correlation measures

12 0.92480552 789 andrew gelman stats-2011-07-07-Descriptive statistics, causal inference, and story time

13 0.92463648 1422 andrew gelman stats-2012-07-20-Likelihood thresholds and decisions

14 0.92308629 147 andrew gelman stats-2010-07-15-Quote of the day: statisticians and defaults

15 0.92222714 98 andrew gelman stats-2010-06-19-Further thoughts on happiness and life satisfaction research

16 0.92202735 1877 andrew gelman stats-2013-05-30-Infill asymptotics and sprawl asymptotics

17 0.92158198 1883 andrew gelman stats-2013-06-04-Interrogating p-values

18 0.92091501 670 andrew gelman stats-2011-04-20-Attractive but hard-to-read graph could be made much much better

19 0.92063975 1967 andrew gelman stats-2013-08-04-What are the key assumptions of linear regression?

20 0.92010069 2142 andrew gelman stats-2013-12-21-Chasing the noise