andrew_gelman_stats-2010-327 knowledge-graph by maker-knowledge-mining

327 andrew gelman stats-2010-10-07-There are never 70 distinct parameters


meta info for this blog

Source: html

Introduction: Sam Seaver writes: I’m a graduate student in computational biology, and I’m relatively new to advanced statistics, and am trying to teach myself how best to approach a problem I have. My dataset is a small sparse matrix of 150 cases and 70 predictors; it is sparse as in many zeros, not many ‘NA’s. Each case is a nutrient that is fed into an in silico organism, and its response is whether or not it stimulates growth; each predictor is one of 70 different pathways that the nutrient may or may not belong to. Because each nutrient belongs to only some of the pathways, there are many zeros in my matrix. My goal is to be able to use the pathways themselves to predict whether or not a nutrient could stimulate growth, so I wanted to compute regression coefficients for each pathway, which I could then apply to other nutrients for other species. There are quite a few singularities in the dataset (summary(glm) reports that 14 coefficients are not defined because of singularities).
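To make the failure mode concrete, here is a minimal sketch on simulated data (not Sam’s actual dataset): a sparse 0/1 design with one exact linear dependence forced in, which is what makes glm() report coefficients “not defined because of singularities.”

## Simulated stand-in for the nutrient-by-pathway problem.
set.seed(1)
n <- 150; p <- 70
X <- matrix(rbinom(n * p, 1, 0.1), n, p)   # sparse 0/1 pathway membership
colnames(X) <- paste0("pathway", 1:p)
X[, 2] <- X[, 1]                           # force an exact linear dependence
y <- rbinom(n, 1, plogis(-1 + X %*% rnorm(p, sd = 0.5)))
dat <- data.frame(y = y, X)

fit <- glm(y ~ ., family = binomial, data = dat)
## pathway2 is aliased with pathway1, so its coefficient comes back NA and
## summary(fit) flags it as "not defined because of singularities".
sum(is.na(coef(fit)))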


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Sam Seaver writes: I’m a graduate student in computational biology, and I’m relatively new to advanced statistics, and am trying to teach myself how best to approach a problem I have. [sent-1, score-0.23]

2 My dataset is a small sparse matrix of 150 cases and 70 predictors; it is sparse as in many zeros, not many ‘NA’s. [sent-2, score-0.578]

3 Each case is a nutrient that is fed into an in silico organism, and its response is whether or not it stimulates growth; each predictor is one of 70 different pathways that the nutrient may or may not belong to. [sent-3, score-1.828]

4 Because each nutrient belongs to only some of the pathways, there are many zeros in my matrix. [sent-4, score-0.844]

5 My goal is to be able to use the pathways themselves to predict whether or not a nutrient could stimulate growth, so I wanted to compute regression coefficients for each pathway, which I could then apply to other nutrients for other species. [sent-5, score-1.763]

6 So I was wondering if there are complementary and/or alternative methods to logistic regression that would give me a coefficient of a kind for each pathway? [sent-7, score-0.402]

7 My reply: If you have this kind of sparsity, I think you’ll need to add some prior information or structure to your model. [sent-8, score-0.255]

8 Our paper on bayesglm suggests a reasonable default prior, but it sounds to me like you’ll have to go further (a sketch of the bayesglm default appears just after this list). [sent-9, score-0.142]

9 To put it another way: give up the idea that you’re estimating 70 distinct parameters. [sent-10, score-0.13]

10 Instead, think of these coefficients as linked to each other in a complex web (the second sketch after this list shows one crude way to tie coefficients together). [sent-11, score-0.215]

11 More generally, I don’t think it ever makes sense to think of a problem as having a lot of loose parameters. [sent-12, score-0.08]

12 One of our major research problems now is to set up general models for structured parameters, going beyond simple exchangeability (the third sketch after this list gestures at one such structure). [sent-14, score-0.068]
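The paper mentioned in sentence 8 is Gelman, Jakulin, Pittau, and Su (2008), and its default is what the arm package’s bayesglm function fits: independent Student-t priors (by default Cauchy with center 0 and scale 2.5) on rescaled coefficients. A minimal sketch, reusing the simulated dat from the introduction’s example above:

## bayesglm from the arm package: glm plus the default weakly informative
## Cauchy(0, 2.5) priors on standardized coefficients.
library(arm)

fit_bayes <- bayesglm(y ~ ., family = binomial, data = dat)

## The prior makes the fit well-posed, so coefficients that glm() dropped
## as aliased are now estimated (and shrunk toward zero).
sum(is.na(coef(fit_bayes)))   # expect 0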
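One crude, concrete stand-in for “coefficients linked to each other” (sentence 10) is to shrink all 70 pathway coefficients by a single shared penalty chosen from the data, i.e., ridge-penalized logistic regression. This is only an illustration of pooling through a common scale, not the specific structured models the reply has in mind; it reuses the simulated X and dat from above.

## Ridge (alpha = 0) logistic regression: one shared penalty lambda ties
## all 70 coefficients together instead of leaving them as free parameters.
library(glmnet)

cv_fit <- cv.glmnet(X, dat$y, family = "binomial", alpha = 0)
coef(cv_fit, s = "lambda.min")   # pathway coefficients, shrunk toward zero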
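Going “beyond simple exchangeability” (sentence 12) could mean, for example, partially pooling pathway coefficients within higher-level pathway families. The Stan model below, run through rstan, is one hypothetical way to write that down: the seven-family grouping is invented purely for illustration, and the declarations use the current Stan array syntax (Stan 2.26+).

library(rstan)

stan_code <- "
data {
  int<lower=1> N;                       // nutrients
  int<lower=1> P;                       // pathways
  int<lower=1> G;                       // pathway families
  matrix[N, P] X;                       // 0/1 pathway membership
  array[N] int<lower=0, upper=1> y;     // growth response
  array[P] int<lower=1, upper=G> fam;   // family of each pathway
}
parameters {
  real alpha;
  vector[P] beta;
  vector[G] mu;                         // family-level means
  real<lower=0> tau;                    // within-family spread
}
model {
  alpha ~ normal(0, 2.5);
  mu ~ normal(0, 1);
  tau ~ normal(0, 1);                   // half-normal via the lower bound
  beta ~ normal(mu[fam], tau);          // pathways linked within families
  y ~ bernoulli_logit(alpha + X * beta);
}
"

fit_struct <- stan(
  model_code = stan_code,
  data = list(N = nrow(X), P = ncol(X), G = 7, X = X, y = dat$y,
              fam = rep(1:7, length.out = ncol(X))),
  chains = 2, iter = 1000
)
print(fit_struct, pars = c("tau", "mu"))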


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('pathways', 0.476), ('nutrient', 0.347), ('nutrients', 0.347), ('singularities', 0.232), ('pathway', 0.19), ('belong', 0.184), ('zeros', 0.184), ('coefficients', 0.16), ('sparse', 0.145), ('growth', 0.125), ('dataset', 0.11), ('organism', 0.105), ('seaver', 0.105), ('stimulates', 0.105), ('structure', 0.104), ('exchangeability', 0.095), ('apply', 0.093), ('complementary', 0.089), ('bayesglm', 0.087), ('sparsity', 0.087), ('may', 0.084), ('stimulate', 0.083), ('fed', 0.081), ('loose', 0.08), ('kind', 0.079), ('glm', 0.079), ('distinct', 0.074), ('prior', 0.072), ('thus', 0.071), ('empty', 0.071), ('sam', 0.07), ('structured', 0.068), ('regression', 0.067), ('remove', 0.065), ('advanced', 0.063), ('matrix', 0.062), ('biology', 0.06), ('predictor', 0.06), ('whether', 0.06), ('na', 0.059), ('compute', 0.059), ('many', 0.058), ('relatively', 0.057), ('coefficient', 0.057), ('give', 0.056), ('teach', 0.055), ('graduate', 0.055), ('default', 0.055), ('linked', 0.055), ('logistic', 0.054)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0 327 andrew gelman stats-2010-10-07-There are never 70 distinct parameters


2 0.11545284 2109 andrew gelman stats-2013-11-21-Hidden dangers of noninformative priors

Introduction: Following up on Christian’s post [link fixed] on the topic, I’d like to offer a few thoughts of my own. In BDA, we express the idea that a noninformative prior is a placeholder: you can use the noninformative prior to get the analysis started, then if your posterior distribution is less informative than you would like, or if it does not make sense, you can go back and add prior information. Same thing for the data model (the “likelihood”), for that matter: it often makes sense to start with something simple and conventional and then go from there. So, in that sense, noninformative priors are no big deal, they’re just a way to get started. Just don’t take them too seriously. Traditionally in statistics we’ve worked with the paradigm of a single highly informative dataset with only weak external information. But if the data are sparse and prior information is strong, we have to think differently. And, when you increase the dimensionality of a problem, both these things happen.

3 0.1102249 2356 andrew gelman stats-2014-06-02-On deck this week

Introduction: Mon: Why we hate stepwise regression Tues: Did you buy laundry detergent on their most recent trip to the store? Also comments on scientific publication and yet another suggestion to do a study that allows within-person comparisons Wed: All the Assumptions That Are My Life Thurs: Identifying pathways for managing multiple disturbances to limit plant invasions Fri: Statistically savvy journalism Sat: “Does researching casual marijuana use cause brain abnormalities?” Sun: Regression and causality and variable ordering

4 0.10964648 2136 andrew gelman stats-2013-12-16-Whither the “bet on sparsity principle” in a nonsparse world?

Introduction: Rob Tibshirani writes: Hastie et al. (2001) coined the informal “Bet on Sparsity” principle. The l1 methods assume that the truth is sparse, in some basis. If the assumption holds true, then the parameters can be efficiently estimated using l1 penalties. If the assumption does not hold—so that the truth is dense—then no method will be able to recover the underlying model without a large amount of data per parameter. I’ve earlier expressed my full and sincere appreciation for Hastie and Tibshirani’s work in this area. Now I’d like to briefly comment on the above snippet. The question is, how do we think about the “bet on sparsity” principle in a world where the truth is dense? I’m thinking here of social science, where no effects are clean and no coefficient is zero (see page 960 of this article or various blog discussions in the past few years), where every contrast is meaningful—but some of these contrasts might be lost in the noise with any realistic size of data.

5 0.098698266 2017 andrew gelman stats-2013-09-11-“Informative g-Priors for Logistic Regression”

Introduction: Tim Hanson sends along this paper (coauthored with Adam Branscum and Wesley Johnson): Eliciting information from experts for use in constructing prior distributions for logistic regression coefficients can be challenging. The task is especially difficult when the model contains many predictor variables, because the expert is asked to provide summary information about the probability of “success” for many subgroups of the population. Often, however, experts are confident only in their assessment of the population as a whole. This paper is about incorporating such overall, marginal or averaged, information easily into a logistic regression data analysis by using g-priors. We present a version of the g-prior such that the prior distribution on the probability of success can be set to closely match a beta distribution, when averaged over the set of predictors in a logistic regression. A simple data augmentation formulation that can be implemented in standard statistical software packages.

6 0.094722152 1656 andrew gelman stats-2013-01-05-Understanding regression models and regression coefficients

7 0.087246619 1695 andrew gelman stats-2013-01-28-Economists argue about Bayes

8 0.085147418 1941 andrew gelman stats-2013-07-16-Priors

9 0.082591653 1486 andrew gelman stats-2012-09-07-Prior distributions for regression coefficients

10 0.082521439 577 andrew gelman stats-2011-02-16-Annals of really really stupid spam

11 0.081072286 1465 andrew gelman stats-2012-08-21-D. Buggin

12 0.079343468 1155 andrew gelman stats-2012-02-05-What is a prior distribution?

13 0.07789842 1981 andrew gelman stats-2013-08-14-The robust beauty of improper linear models in decision making

14 0.07746955 2185 andrew gelman stats-2014-01-25-Xihong Lin on sparsity and density

15 0.074741416 1946 andrew gelman stats-2013-07-19-Prior distributions on derived quantities rather than on parameters themselves

16 0.074693829 2129 andrew gelman stats-2013-12-10-Cross-validation and Bayesian estimation of tuning parameters

17 0.074510507 2200 andrew gelman stats-2014-02-05-Prior distribution for a predicted probability

18 0.071482435 753 andrew gelman stats-2011-06-09-Allowing interaction terms to vary

19 0.070855968 1886 andrew gelman stats-2013-06-07-Robust logistic regression

20 0.070586145 1466 andrew gelman stats-2012-08-22-The scaled inverse Wishart prior distribution for a covariance matrix in a hierarchical model


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.129), (1, 0.071), (2, 0.014), (3, 0.01), (4, 0.035), (5, 0.005), (6, 0.041), (7, -0.016), (8, -0.033), (9, 0.058), (10, 0.018), (11, 0.022), (12, 0.021), (13, -0.018), (14, -0.003), (15, 0.001), (16, -0.026), (17, -0.004), (18, -0.003), (19, 0.003), (20, 0.008), (21, 0.024), (22, 0.0), (23, 0.019), (24, 0.003), (25, -0.002), (26, 0.042), (27, -0.024), (28, -0.012), (29, 0.003), (30, 0.037), (31, 0.021), (32, 0.02), (33, -0.007), (34, 0.012), (35, -0.055), (36, 0.008), (37, 0.017), (38, -0.024), (39, 0.013), (40, -0.007), (41, 0.007), (42, -0.003), (43, -0.005), (44, 0.027), (45, 0.032), (46, -0.023), (47, 0.001), (48, 0.021), (49, 0.018)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.95702684 327 andrew gelman stats-2010-10-07-There are never 70 distinct parameters


2 0.80373514 1486 andrew gelman stats-2012-09-07-Prior distributions for regression coefficients

Introduction: Eric Brown writes: I have come across a number of recommendations over the years about best practices for multilevel regression modeling. For example, the use of t-distributed priors for coefficients in logistic regression and standardizing input variables from one of your 2008 Annals of Applied Statistics papers; or recommendations for priors on variance parameters from your 2006 Bayesian Analysis paper. I understand that opinions on these vary among people in the field, but I was wondering if you have a reference that you point people to as a place to get started? I’ve tried looking through your blog posts but couldn’t find any summaries. For example, what are some examples of when I should use more than a two-level hierarchical model? Can I use a spike-slab coefficient model with a t-distributed prior for the slab rather than a normal? If I assume that my model is a priori wrong (but still useful), what are some recommended ways to choose how many interactions to use?

3 0.79008555 1466 andrew gelman stats-2012-08-22-The scaled inverse Wishart prior distribution for a covariance matrix in a hierarchical model

Introduction: Since we’re talking about the scaled inverse Wishart . . . here’s a recent message from Chris Chatham: I have been reading your book on Bayesian Hierarchical/Multilevel Modeling but have been struggling a bit with deciding whether to model my multivariate normal distribution using the scaled inverse Wishart approach you advocate, given the arguments at this blog post [entitled "Why an inverse-Wishart prior may not be such a good idea"]. My reply: We discuss this in our book. We know the inverse-Wishart has problems; that’s why we recommend the scaled inverse-Wishart, which is a more general class of models. Here’s an old blog post on the topic. And also of course there’s the description in our book. Chris pointed me to the following comment by Simon Barthelmé: Using the scaled inverse Wishart doesn’t change anything; the standard deviations of the individual coefficients and their covariance are still dependent. My answer would be to use a prior that models the standard deviations.

4 0.77536911 1849 andrew gelman stats-2013-05-09-Same old same old

Introduction: In an email I sent to a colleague who’s writing about lasso and Bayesian regression for R users: The one thing you might want to add, to fit with your pragmatic perspective, is to point out that these different methods are optimal under different assumptions about the data. However, these assumptions are never true (even in the rare cases where you have a believable prior, it won’t really follow the functional form assumed by bayesglm; even in the rare cases where you have a real loss function, it won’t really follow the mathematical form assumed by lasso, etc.), but these methods can still be useful and be given the interpretation of regularized estimates. Another thing that someone might naively think is that regularization is fine but “unbiased” is somehow the most honest. In practice, if you stick to “unbiased” methods such as least squares, you’ll restrict the number of variables you can include in your model. So in reality you suffer from omitted-variable bias.

5 0.76561099 2017 andrew gelman stats-2013-09-11-“Informative g-Priors for Logistic Regression”


6 0.76549274 2357 andrew gelman stats-2014-06-02-Why we hate stepwise regression

7 0.76087224 1769 andrew gelman stats-2013-03-18-Tibshirani announces new research result: A significance test for the lasso

8 0.74961179 1462 andrew gelman stats-2012-08-18-Standardizing regression inputs

9 0.74597031 833 andrew gelman stats-2011-07-31-Untunable Metropolis

10 0.74468869 770 andrew gelman stats-2011-06-15-Still more Mr. P in public health

11 0.74227875 1703 andrew gelman stats-2013-02-02-Interaction-based feature selection and classification for high-dimensional biological data

12 0.73555696 1094 andrew gelman stats-2011-12-31-Using factor analysis or principal components analysis or measurement-error models for biological measurements in archaeology?

13 0.73378485 938 andrew gelman stats-2011-10-03-Comparing prediction errors

14 0.73072201 1656 andrew gelman stats-2013-01-05-Understanding regression models and regression coefficients

15 0.73045385 1465 andrew gelman stats-2012-08-21-D. Buggin

16 0.71651876 1967 andrew gelman stats-2013-08-04-What are the key assumptions of linear regression?

17 0.71553242 451 andrew gelman stats-2010-12-05-What do practitioners need to know about regression?

18 0.71376169 268 andrew gelman stats-2010-09-10-Fighting Migraine with Multilevel Modeling

19 0.71215576 1908 andrew gelman stats-2013-06-21-Interpreting interactions in discrete-data regression

20 0.71051937 1946 andrew gelman stats-2013-07-19-Prior distributions on derived quantities rather than on parameters themselves


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(7, 0.012), (8, 0.013), (9, 0.062), (16, 0.049), (24, 0.132), (34, 0.016), (54, 0.021), (84, 0.022), (86, 0.029), (96, 0.28), (99, 0.223)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.96312785 169 andrew gelman stats-2010-07-29-Say again?

Introduction: “I believe that probability theory is the right tool for solving such problems,” says Andrew Gelman, professor of statistics at Columbia University in New York. But how often such tricky problems come up in real life, he could not say. Which sounds almost reassuring. OK, fine.

2 0.94440746 1306 andrew gelman stats-2012-05-07-Lists of Note and Letters of Note

Introduction: These (from Shaun Usher) are surprisingly good, especially since he appears to come up with new lists and letters pretty regularly. I suppose a lot of them get sent in from readers, but still. Here’s my favorite recent item, a letter sent to the Seattle Bureau of Prohibition in 1931: Dear Sir: My husband is in the habit of buying a quart of wiskey every other day from a Chinese bootlegger named Chin Waugh living at 317-16th near Alder street. We need this money for household expenses. Will you please have his place raided? He keeps a supply planted in the garden and a smaller quantity under the back steps for quick delivery. If you make the raid at 9:30 any morning you will be sure to get the goods and Chin also as he leaves the house at 10 o’clock and may clean up before he goes. Thanking you in advance, I remain yours truly, Mrs. Hillyer

3 0.88304883 1731 andrew gelman stats-2013-02-21-If a lottery is encouraging addictive gambling, don’t expand it!

Introduction: This story from Vivian Yee seems just horrible to me. First the background: Pronto Lotto’s real business takes place in the carpeted, hushed area where its most devoted customers watch video screens from a scattering of tall silver tables, hour after hour, day after day. The players — mostly men, about a dozen at any given time — come on their lunch breaks or after work to study the screens, which are programmed with the Quick Draw lottery game, and flash a new set of winning numbers every four minutes. They have helped make Pronto Lotto the top Quick Draw vendor in the state, selling $3.3 million worth of tickets last year, more than $1 million more than the second busiest location, a World Books shop in Penn Station. Some stay for just a few minutes. Others play for the length of a workday, repeatedly traversing the few yards between their seats and the cash register as they hand the next wager to a clerk with a dollar bill or two, and return to wait. “It’s like my job, 24/7.”

4 0.87607318 410 andrew gelman stats-2010-11-12-“The Wald method has been the subject of extensive criticism by statisticians for exaggerating results”

Introduction: Paul Nee sends in this amusing item: MELA Sciences claimed success in a clinical trial of its experimental skin cancer detection device only by altering the statistical method used to analyze the data in violation of an agreement with U.S. regulators, charges an independent healthcare analyst in a report issued last week. . . The BER report, however, relies on its own analysis to suggest that MELA struck out with FDA because the agency’s medical device reviewers discovered the MELAFind pivotal study failed to reach statistical significance despite the company’s claims to the contrary. And now here’s where it gets interesting: MELA claims that a phase III study of MELAFind met its primary endpoint by accurately detecting 112 of 114 eligible melanomas for a “sensitivity” rate of 98%. The lower confidence bound of the sensitivity analysis was 95.1%, which met the FDA’s standard for statistical significance in the study spelled out in a binding agreement with MELA, the company.

same-blog 5 0.86256945 327 andrew gelman stats-2010-10-07-There are never 70 distinct parameters


6 0.83184135 2153 andrew gelman stats-2013-12-29-“Statistics Done Wrong”

7 0.8091265 1023 andrew gelman stats-2011-11-22-Going Beyond the Book: Towards Critical Reading in Statistics Teaching

8 0.80199152 1118 andrew gelman stats-2012-01-14-A model rejection letter

9 0.79796898 319 andrew gelman stats-2010-10-04-“Who owns Congress”

10 0.78084707 1338 andrew gelman stats-2012-05-23-Advice on writing research articles

11 0.77856159 934 andrew gelman stats-2011-09-30-Nooooooooooooooooooo!

12 0.76985681 99 andrew gelman stats-2010-06-19-Paired comparisons

13 0.76231563 2023 andrew gelman stats-2013-09-14-On blogging

14 0.76211405 302 andrew gelman stats-2010-09-28-This is a link to a news article about a scientific paper

15 0.75839382 787 andrew gelman stats-2011-07-05-Different goals, different looks: Infovis and the Chris Rock effect

16 0.75129628 205 andrew gelman stats-2010-08-13-Arnold Zellner

17 0.74660784 405 andrew gelman stats-2010-11-10-Estimation from an out-of-date census

18 0.74458313 1405 andrew gelman stats-2012-07-04-“Titanic Thompson: The Man Who Would Bet on Everything”

19 0.73097521 2065 andrew gelman stats-2013-10-17-Cool dynamic demographic maps provide beautiful illustration of Chris Rock effect

20 0.72587293 2172 andrew gelman stats-2014-01-14-Advice on writing research articles