andrew_gelman_stats-2013-1703 knowledge-graph by maker-knowledge-mining

1703 andrew gelman stats-2013-02-02-Interaction-based feature selection and classification for high-dimensional biological data


meta info for this blog

Source: html

Introduction: Ilya Esteban writes: In your blog, your advice for performing regression in the presence of large numbers of correlated features has been to use composite scores and hierarchical modeling. Unfortunately, many problems don’t provide an obvious and unambiguous way of grouping features together (e.g. gene expression data). Are there any techniques that you would recommend that automatically pool correlated features together based on the data, without requiring the researcher to manually define composite scores or feature hierarchies? I don’t know the answer to this but I imagine something is possible . . . any ideas? In the meantime I’m reminded of this recent article by Shaw-Hwa Lo, Haitian Wang, Tian Zheng, and Inchi Hu: Recent high-throughput biological studies successfully identified thousands of risk factors associated with common human diseases. Most of these studies used a single-variable method, analyzing each variable individually. The risk factors so identified account for a small portion of disease heritability.
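
Regarding the open question above, one data-driven way to pool correlated features, sketched here purely as an illustration (nothing below comes from the post; the composite_scores helper and the 0.7 threshold are assumptions), is to cluster features on their correlation matrix and average within clusters:

# Sketch only: data-driven pooling of correlated features via hierarchical
# clustering on the correlation matrix. Names and threshold are illustrative.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def composite_scores(X, threshold=0.7):
    """X: (n_samples, n_features). Groups features whose absolute pairwise
    correlation exceeds `threshold`; returns one averaged composite score
    per group plus the group label of each original feature."""
    corr = np.corrcoef(X, rowvar=False)
    dist = 1.0 - np.abs(corr)          # correlated features -> small distance
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(Z, t=1.0 - threshold, criterion="distance")
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize before averaging
    composites = np.column_stack(
        [Xs[:, labels == k].mean(axis=1) for k in np.unique(labels)])
    return composites, labels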


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Ilya Esteban writes: In your blog, your advice for performing regression in the presence of large numbers of correlated features has been to use composite scores and hierarchical modeling. [sent-1, score-0.497]

2 Unfortunately, many problems don’t provide an obvious and unambiguous way of grouping features together (e.g. gene expression data). [sent-2, score-0.567]

3 Are there any techniques that you would recommend that automatically pool correlated features together based on the data, without requiring the researcher to manually define composite scores or feature hierarchies? [sent-5, score-0.956]

4 In the meantime I’m reminded of this recent article by Shaw-Hwa Lo, Haitian Wang, Tian Zheng, and Inchi Hu: Recent high-throughput biological studies successfully identified thousands of risk factors associated with common human diseases. [sent-10, score-0.577]

5 Most of these studies used a single-variable method, analyzing each variable individually. [sent-11, score-0.347]

6 The risk factors so identified account for a small portion of disease heritability. [sent-12, score-0.419]

7 Nowadays, there is a growing body of evidence suggesting gene–gene interactions as a possible reason for the missing heritability. [sent-13, score-0.298]

8 To address these challenges, the proposed method extracts different types of information from the data in several stages. [sent-16, score-0.083]

9 In the first stage, we select variables with high potential to form influential variable modules when combined with other variables. [sent-17, score-0.745]

10 In the second stage, we generate highly influential variable modules from variables selected in the first stage so that each variable interacts with others in the same module to produce a strong effect on the response Y. [sent-18, score-1.431]

11 The third stage combines classifiers, each constructed from one module, to form the classification rule. [sent-19, score-0.623]

12 These genetics problems are different from the social science and environmental health examples I work on. [sent-24, score-0.319]

13 In genetics there seem to be many true zeros—that is, you really are trying to find a bunch of needles in a haystack. [sent-25, score-0.267]

14 In my problems, nothing is really zero and we only set things to zero for computational convenience or to make our models more understandable. [sent-26, score-0.259]
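
The three stages quoted above (screen variables, grow interacting modules, combine per-module classifiers) suggest the following skeleton. This is a schematic sketch of that structure only, not the authors' algorithm (the paper describes its own variable-set search, which is not reproduced here); every helper name, the logistic-regression scoring, and the greedy module search are placeholder assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression

def stage1_screen(X, y, keep=100):
    # Stage 1 (placeholder criterion): keep variables with the largest
    # marginal correlation with the response.
    scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return list(np.argsort(scores)[::-1][:keep])

def joint_score(X, y, idx):
    # Placeholder joint-association score: training accuracy of a small
    # logistic model fit on the variable subset idx.
    clf = LogisticRegression(max_iter=1000).fit(X[:, idx], y)
    return clf.score(X[:, idx], y)

def stage2_modules(X, y, candidates, module_size=3, n_modules=5):
    # Stage 2 (placeholder search): grow each module greedily from a seed,
    # adding whichever candidate most improves the joint score, so that
    # variables in a module act together on y.
    modules = []
    for seed in candidates[:n_modules]:
        module = [seed]
        while len(module) < module_size:
            rest = [c for c in candidates if c not in module]
            best = max(rest, key=lambda c: joint_score(X, y, module + [c]))
            module.append(best)
        modules.append(module)
    return modules

def stage3_classify(X, y, modules, X_new):
    # Stage 3: one simple classifier per module, combined by majority vote.
    votes = np.zeros(len(X_new))
    for m in modules:
        clf = LogisticRegression(max_iter=1000).fit(X[:, m], y)
        votes += clf.predict(X_new[:, m])
    return (votes > len(modules) / 2).astype(int)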


similar blogs computed by the tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('stage', 0.287), ('gene', 0.26), ('modules', 0.251), ('variable', 0.186), ('composite', 0.184), ('module', 0.184), ('features', 0.176), ('genetics', 0.153), ('influential', 0.14), ('identified', 0.124), ('scores', 0.123), ('interactions', 0.118), ('correlated', 0.117), ('needles', 0.114), ('esteban', 0.114), ('manually', 0.114), ('interacts', 0.114), ('heritability', 0.108), ('lo', 0.108), ('ilya', 0.108), ('tian', 0.108), ('risk', 0.107), ('factors', 0.104), ('bart', 0.103), ('classifiers', 0.103), ('grouping', 0.103), ('unambiguous', 0.103), ('zeros', 0.099), ('hu', 0.099), ('zheng', 0.099), ('problems', 0.094), ('together', 0.091), ('zero', 0.091), ('combines', 0.09), ('form', 0.085), ('portion', 0.084), ('wang', 0.084), ('variables', 0.083), ('method', 0.083), ('biological', 0.083), ('classification', 0.082), ('successfully', 0.081), ('constructed', 0.079), ('studies', 0.078), ('requiring', 0.077), ('convenience', 0.077), ('pool', 0.074), ('presence', 0.073), ('environmental', 0.072), ('body', 0.072)]
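
The (wordName, wordTfidf) list above and the simValue columns below are consistent with a standard tfidf-plus-cosine-similarity pipeline. The page does not include the code that generated them; the following is a minimal assumed reconstruction using scikit-learn, with placeholder post text:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder corpus: blogId -> post text (a real pipeline would load all posts).
posts = {
    "1703": "interaction feature selection modules gene variables stage",
    "706":  "happiness gene life satisfaction",
    "1431": "overfitting cross-validation priors features",
}
vec = TfidfVectorizer(stop_words="english")
M = vec.fit_transform(posts.values())   # one tfidf row per post
sims = cosine_similarity(M)             # sims[i, j] plays the simValue role
# Top-weighted words for the first post, analogous to the list above:
row = M[0].toarray().ravel()
top = sorted(zip(vec.get_feature_names_out(), row), key=lambda t: -t[1])[:10]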

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.9999997 1703 andrew gelman stats-2013-02-02-Interaction-based feature selection and classification for high-dimensional biological data


2 0.18990554 706 andrew gelman stats-2011-05-11-The happiness gene: My bottom line (for now)

Introduction: I had a couple of email exchanges with Jan-Emmanuel De Neve and James Fowler, two of the authors of the article on the gene that is associated with life satisfaction which we blogged the other day. (Bruno Frey, the third author of the article in question, is out of town according to his email.) Fowler also commented directly on the blog. I won’t go through all the details, but now I have a better sense of what’s going on. (Thanks, Jan and James!) Here’s my current understanding: 1. The original manuscript was divided into two parts: an article by De Neve alone published in the Journal of Human Genetics, and an article by De Neve, Fowler, Frey, and Nicholas Christakis submitted to Econometrica. The latter paper repeats the analysis from the Adolescent Health survey and also replicates with data from the Framingham heart study (hence Christakis’s involvement). The Framingham study measures a slightly different gene and uses a slightly different life-satisfaction question com

3 0.11544867 1431 andrew gelman stats-2012-07-27-Overfitting

Introduction: Ilya Esteban writes: In traditional machine learning and statistical learning techniques, you spend a lot of time selecting your input features, fiddling with model parameter values, etc., all of which leads to the problem of overfitting the data and producing overly optimistic estimates for how good the model really is. You can use techniques such as cross-validation and out-of-sample validation data to try to limit the damage, but they are imperfect solutions at best. While Bayesian models have the great advantage of not forcing you to manually select among the various weights and input features, you still often end up trying different priors and model structures (especially with hierarchical models), before coming up with a “final” model. When applying Bayesian modeling to real world data sets, how should you evaluate alternate priors and topologies for the model without falling into the same overfitting trap as you do with non-Bayesian models? If you try several different

4 0.10431594 1523 andrew gelman stats-2012-10-06-Comparing people from two surveys, one of which is a simple random sample and one of which is not

Introduction: Juli writes: I’m helping a professor out with an analysis, and I was hoping that you might be able to point me to some relevant literature… She has two studies that have been completed already (so we can’t go back to the planning stage in terms of sampling, unfortunately). Both studies are based around the population of adults in LA who attended LA public high schools at some point, so that is the same for both studies. Study #1 uses random digit dialing, so I consider that one to be SRS. Study #2, however, is a convenience sample in which all participants were involved with one of eight community-based organizations (CBOs). Of course, both studies can be analyzed independently, but she was hoping for there to be some way to combine/compare the two studies. Specifically, I am working on looking at the civic engagement of the adults in both studies. In study #1, this means looking at factors such as involvement in student government. In study #2, this means looking at involv

5 0.097355947 702 andrew gelman stats-2011-05-09-“Discovered: the genetic secret of a happy life”

Introduction: I took the above headline from a news article in the (London) Independent by Jeremy Laurance reporting a study by Jan-Emmanuel De Neve, James Fowler, and Bruno Frey that reportedly just appeared in the Journal of Human Genetics. One of the pleasures of blogging is that I can go beyond the usual journalistic approaches to such a story: (a) puffing it, (b) debunking it, (c) reporting it completely flatly. Even convex combinations of (a), (b), (c) do not allow what I’d like to do, which is to explore the claims and follow wherever my exploration takes me. (And one of the pleasures of building my own audience is that I don’t need to endlessly explain background detail as was needed on a general-public site such as 538.) OK, back to the genetic secret of a happy life. Or, in the words of the authors of the study, a gene that “explains less than one percent of the variation in life satisfaction.” “The genetic secret” or “less than one percent of the variation”? Perhaps the secre

6 0.096493542 1578 andrew gelman stats-2012-11-15-Outta control political incorrectness

7 0.09513811 1910 andrew gelman stats-2013-06-22-Struggles over the criticism of the “cannabis users and IQ change” paper

8 0.094358236 1981 andrew gelman stats-2013-08-14-The robust beauty of improper linear models in decision making

9 0.091836333 1352 andrew gelman stats-2012-05-29-Question 19 of my final exam for Design and Analysis of Sample Surveys

10 0.087633073 1427 andrew gelman stats-2012-07-24-More from the sister blog

11 0.087396741 830 andrew gelman stats-2011-07-29-Introductory overview lectures at the Joint Statistical Meetings in Miami this coming week

12 0.086604461 1900 andrew gelman stats-2013-06-15-Exploratory multilevel analysis when group-level variables are of importance

13 0.086258464 1218 andrew gelman stats-2012-03-18-Check your missing-data imputations using cross-validation

14 0.086159334 1086 andrew gelman stats-2011-12-27-The most dangerous jobs in America

15 0.084309429 1374 andrew gelman stats-2012-06-11-Convergence Monitoring for Non-Identifiable and Non-Parametric Models

16 0.0841344 1486 andrew gelman stats-2012-09-07-Prior distributions for regression coefficients

17 0.083902672 888 andrew gelman stats-2011-09-03-A psychology researcher asks: Is Anova dead?

18 0.082175672 1695 andrew gelman stats-2013-01-28-Economists argue about Bayes

19 0.081767909 1506 andrew gelman stats-2012-09-21-Building a regression model . . . with only 27 data points

20 0.081522882 1875 andrew gelman stats-2013-05-28-Simplify until your fake-data check works, then add complications until you can figure out where the problem is coming from


similar blogs computed by the lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.167), (1, 0.046), (2, 0.028), (3, -0.071), (4, 0.053), (5, 0.012), (6, -0.012), (7, -0.016), (8, 0.002), (9, 0.063), (10, -0.004), (11, 0.011), (12, -0.004), (13, -0.004), (14, 0.011), (15, 0.027), (16, 0.015), (17, -0.024), (18, 0.005), (19, 0.01), (20, -0.016), (21, 0.046), (22, -0.014), (23, 0.009), (24, 0.004), (25, 0.032), (26, 0.047), (27, -0.006), (28, 0.003), (29, -0.025), (30, 0.002), (31, 0.035), (32, 0.039), (33, -0.004), (34, 0.023), (35, -0.029), (36, 0.017), (37, 0.032), (38, -0.002), (39, -0.005), (40, -0.026), (41, -0.011), (42, 0.013), (43, -0.004), (44, -0.01), (45, 0.011), (46, -0.013), (47, 0.01), (48, -0.015), (49, 0.039)]
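
The signed (topicId, topicWeight) pairs above are what a latent semantic indexing projection produces (unlike LDA weights, LSI weights can be negative). Here is a minimal gensim sketch of how such a similarity index is typically built; the tiny tokenized corpus and the num_topics value are placeholder assumptions:

from gensim import corpora, models, similarities

# Placeholder tokenized corpus (real pipeline: one token list per blog post).
texts = [["gene", "module", "selection"],
         ["happiness", "gene"],
         ["overfitting", "prior"]]
dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]
tfidf = models.TfidfModel(bow)
lsi = models.LsiModel(tfidf[bow], id2word=dictionary, num_topics=2)
print(lsi[tfidf[bow[0]]])            # [(topicId, topicWeight), ...] as above
index = similarities.MatrixSimilarity(lsi[tfidf[bow]])
print(index[lsi[tfidf[bow[0]]]])     # simValue-style cosine scores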

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.96851319 1703 andrew gelman stats-2013-02-02-Interaction-based feature selection and classification for high-dimensional biological data


2 0.7695626 1462 andrew gelman stats-2012-08-18-Standardizing regression inputs

Introduction: Andy Flies, Ph.D. candidate in zoology, writes: After reading your paper about scaling regression inputs by two standard deviations I found your blog post stating that you wished you had scaled by 1 sd and coded the binary inputs as -1 and 1. Here is my question: If you code the binary input as -1 and 1, do you then standardize it? This makes sense to me because the mean of the standardized input is then zero and the sd is 1, which is what the mean and sd are for all of the other standardized inputs. I know that if you code the binary input as 0 and 1 it should not be standardized. Also, I am not interested in the actual units (i.e. mg/ml) of my response variable and I would like to compare a couple of different response variables that are on different scales. Would it make sense to standardize the response variable also? My reply: No, I don’t standardize the binary input. The point of standardizing inputs is to make the coefs directly interpretable, but with binary i

3 0.76844728 14 andrew gelman stats-2010-05-01-Imputing count data

Introduction: Guy asks: I am analyzing an original survey of farmers in Uganda. I am hoping to use a battery of welfare proxy variables to create a single welfare index using PCA. I have a quick question which I hope you can find time to address: How do you recommend treating count data? (for example # of rooms, # of chickens, # of cows, # of radios)? In my dataset these variables are highly skewed with many responses at zero (which makes taking the natural log problematic). In the case of # of cows or chickens several obs have values in the hundreds. My response: Here’s what we do in our mi package in R. We split a variable into two parts: an indicator for whether it is positive, and the positive part. That is, y = u*v. Then u is binary and can be modeled using logistic regression, and v can be modeled on the log scale. At the end you can round to the nearest integer if you want to avoid fractional values.

4 0.76705611 888 andrew gelman stats-2011-09-03-A psychology researcher asks: Is Anova dead?

Introduction: A research psychologist writes in with a question that’s so long that I’ll put my answer first, then put the question itself below the fold. Here’s my reply: As I wrote in my Anova paper and in my book with Jennifer Hill, I do think that multilevel models can completely replace Anova. At the same time, I think the central idea of Anova should persist in our understanding of these models. To me the central idea of Anova is not F-tests or p-values or sums of squares, but rather the idea of predicting an outcome based on factors with discrete levels, and understanding these factors using variance components. The continuous or categorical response thing doesn’t really matter so much to me. I have no problem using a normal linear model for continuous outcomes (perhaps suitably transformed) and a logistic model for binary outcomes. I don’t want to throw away interactions just because they’re not statistically significant. I’d rather partially pool them toward zero using an inform

5 0.76559287 1910 andrew gelman stats-2013-06-22-Struggles over the criticism of the “cannabis users and IQ change” paper

Introduction: Ole Rogeberg points me to a discussion of a discussion of a paper: Did pre-release of my [Rogeberg's] PNAS paper on methodological problems with Meier et al’s 2012 paper on cannabis and IQ reduce the chances that it will have its intended effect? In my case, serious methodological issues related to causal inference from non-random observational data became framed as a conflict over conclusions, forcing the original research team to respond rapidly and insufficiently to my concerns, and prompting them to defend their conclusions and original paper in a way that makes a later, more comprehensive reanalysis of their data less likely. This fits with a recurring theme on this blog: the defensiveness of researchers who don’t want to admit they were wrong. Setting aside cases of outright fraud and plagiarism, I think the worst case remains that of psychologists Neil Anderson and Deniz Ones, who denied any problems even in the presence of a smoking gun of a graph revealing their data

6 0.76043022 1121 andrew gelman stats-2012-01-15-R-squared for multilevel models

7 0.75353563 527 andrew gelman stats-2011-01-20-Cars vs. trucks

8 0.74602896 257 andrew gelman stats-2010-09-04-Question about standard range for social science correlations

9 0.74240816 1663 andrew gelman stats-2013-01-09-The effects of fiscal consolidation

10 0.74184239 1908 andrew gelman stats-2013-06-21-Interpreting interactions in discrete-data regression

11 0.74182647 327 andrew gelman stats-2010-10-07-There are never 70 distinct parameters

12 0.74078774 1218 andrew gelman stats-2012-03-18-Check your missing-data imputations using cross-validation

13 0.7304818 553 andrew gelman stats-2011-02-03-is it possible to “overstratify” when assigning a treatment in a randomized control trial?

14 0.72933596 1656 andrew gelman stats-2013-01-05-Understanding regression models and regression coefficients

15 0.72158599 864 andrew gelman stats-2011-08-21-Going viral — not!

16 0.72041035 1094 andrew gelman stats-2011-12-31-Using factor analysis or principal components analysis or measurement-error models for biological measurements in archaeology?

17 0.71925366 1330 andrew gelman stats-2012-05-19-Cross-validation to check missing-data imputation

18 0.71672696 938 andrew gelman stats-2011-10-03-Comparing prediction errors

19 0.71556431 770 andrew gelman stats-2011-06-15-Still more Mr. P in public health

20 0.71365982 32 andrew gelman stats-2010-05-14-Causal inference in economics


similar blogs computed by the lda model

lda for this blog:

topicId topicWeight

[(9, 0.015), (16, 0.061), (19, 0.01), (21, 0.016), (24, 0.115), (45, 0.029), (52, 0.018), (55, 0.036), (56, 0.014), (62, 0.019), (63, 0.032), (82, 0.013), (86, 0.019), (95, 0.011), (99, 0.487)]
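
The lda weights above have the same (topicId, topicWeight) shape but are nonnegative and sum to roughly one, as a topic distribution from an LDA model would. A minimal gensim sketch under the same placeholder-corpus assumption as the lsi example:

from gensim import corpora, models

texts = [["gene", "module", "selection"],
         ["happiness", "gene"],
         ["overfitting", "prior"]]
dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]
lda = models.LdaModel(bow, id2word=dictionary, num_topics=2,
                      passes=10, random_state=0)
# Topic weights for one post, in the same (topicId, topicWeight) form as above:
print(lda.get_document_topics(bow[0]))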

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99657381 1703 andrew gelman stats-2013-02-02-Interaction-based feature selection and classification for high-dimensional biological data


2 0.99267417 2255 andrew gelman stats-2014-03-19-How Americans vote

Introduction: An interview with me from 2012: You’re a statistician and wrote a book,  Red State, Blue State, Rich State, Poor State , looking at why Americans vote the way they do. In an election year I think it would be a good time to revisit that question, not just for people in the US, but anyone around the world who wants to understand the realities – rather than the stereotypes – of how Americans vote. I regret the title I gave my book. I was too greedy. I wanted it to be an airport bestseller because I figured there were millions of people who are interested in politics and some subset of them are always looking at the statistics. It’s got a very grabby title and as a result people underestimated the content. They thought it was a popularisation of my work, or, at best, an expansion of an article we’d written. But it had tons of original material. If I’d given it a more serious, political science-y title, then all sorts of people would have wanted to read it, because they would

3 0.99138784 390 andrew gelman stats-2010-11-02-Fragment of statistical autobiography

Introduction: I studied math and physics at MIT. To be more precise, I started in math as default–ever since I was two years old, I’ve thought of myself as a mathematician, and I always did well in math class, so it seemed like a natural fit. But I was concerned. In high school I’d been in the U.S. Mathematical Olympiad training program, and there I’d met kids who were clearly much much better at math than I was. In retrospect, I don’t think I was as bad as I’d thought at the time: there were 24 kids in the program, and I was probably around #20, if that, but I think a lot of the other kids had more practice working on “math olympiad”-type problems. Maybe I was really something like the tenth-best in the group. Tenth-best or twentieth-best, whatever it was, I reached a crisis of confidence around my sophomore or junior year in college. At MIT, I started right off taking advanced math classes, and somewhere along the way I realized I wasn’t seeing the big picture. I was able to do the homework pr

4 0.99103999 757 andrew gelman stats-2011-06-10-Controversy over the Christakis-Fowler findings on the contagion of obesity

Introduction: Nicholas Christakis and James Fowler are famous for finding that obesity is contagious. Their claims, which have been received with both respect and skepticism (perhaps we need a new word for this: “respecticism”?) are based on analysis of data from the Framingham heart study, a large longitudinal public-health study that happened to have some social network data (for the odd reason that each participant was asked to provide the name of a friend who could help the researchers locate them if they were to move away during the study period). The short story is that if your close contact became obese, you were likely to become obese also. The long story is a debate about the reliability of this finding (that is, can it be explained by measurement error and sampling variability) and its causal implications. This sort of study is in my wheelhouse, as it were, but I have never looked at the Christakis-Fowler work in detail. Thus, my previous and current comments are more along the line

5 0.99030358 524 andrew gelman stats-2011-01-19-Data exploration and multiple comparisons

Introduction: Bill Harris writes: I’ve read your paper and presentation showing why you don’t usually worry about multiple comparisons. I see how that applies when you are comparing results across multiple settings (states, etc.). Does the same principle hold when you are exploring data to find interesting relationships? For example, you have some data, and you’re trying a series of models to see which gives you the most useful insight. Do you try your models on a subset of the data so you have another subset for confirmatory analysis later, or do you simply throw all the data against your models? My reply: I’d like to estimate all the relationships at once and use a multilevel model to do partial pooling to handle the multiplicity issues. That said, in practice, in my applied work I’m always bouncing back and forth between different hypotheses and different datasets, and often I learn a lot when next year’s data come in and I can modify my hypotheses. The trouble with the classical

6 0.99016666 1469 andrew gelman stats-2012-08-25-Ways of knowing

7 0.98981214 222 andrew gelman stats-2010-08-21-Estimating and reporting teacher effectivenss: Newspaper researchers do things that academic researchers never could

8 0.98968381 125 andrew gelman stats-2010-07-02-The moral of the story is, Don’t look yourself up on Google

9 0.98940104 1096 andrew gelman stats-2012-01-02-Graphical communication for legal scholarship

10 0.98935574 2279 andrew gelman stats-2014-04-02-Am I too negative?

11 0.98903215 750 andrew gelman stats-2011-06-07-Looking for a purpose in life: Update on that underworked and overpaid sociologist whose “main task as a university professor was self-cultivation”

12 0.98848462 1517 andrew gelman stats-2012-10-01-“On Inspiring Students and Being Human”

13 0.98799616 201 andrew gelman stats-2010-08-12-Are all rich people now liberals?

14 0.98788947 1095 andrew gelman stats-2012-01-01-Martin and Liu: Probabilistic inference based on consistency of model with data

15 0.98779833 604 andrew gelman stats-2011-03-08-More on the missing conservative psychology researchers

16 0.98764431 1740 andrew gelman stats-2013-02-26-“Is machine learning a subset of statistics?”

17 0.98762512 1585 andrew gelman stats-2012-11-20-“I know you aren’t the plagiarism police, but . . .”

18 0.98750705 2151 andrew gelman stats-2013-12-27-Should statistics have a Nobel prize?

19 0.9873482 472 andrew gelman stats-2010-12-17-So-called fixed and random effects

20 0.98732424 1640 andrew gelman stats-2012-12-26-What do people do wrong? WSJ columnist is looking for examples!