andrew_gelman_stats andrew_gelman_stats-2010 andrew_gelman_stats-2010-112 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Bill Harris writes with two interesting questions involving time series analysis:

I used to work in an organization that designed and made signal processing equipment. Antialiasing and windowing of time series was a big deal in performing analysis accurately. Now I’m in a place where I have to make inferences about human-scaled time series. It has dawned on me that the two are related. I’m not sure we often have data sampled at a rate at least twice the highest frequency present (not just the highest frequency of interest). The only articles I’ve seen about aliasing as applied to social science series are from Hinich or from related works. Box and Jenkins hint at it in section 13.3 of Time Series Analysis, but the analysis seems to be mostly heuristic. Yet I can imagine all sorts of time series subject to similar problems, from analyses of stock prices based on closing prices (mentioned in the latter article) to other economic series measured on a monthly basis to energy usage measured on an hourly or quarter-hourly basis. What do statisticians in the social sciences and economics do to deal with such problems? Now that I think about this, I see advantages to your stance of repeated regressions at subsequent intervals rather than moving to a full time series analysis.

At least when you’re transforming a time series with a discrete Fourier transform, the assumption is made that the time series is periodic. Because it’s rarely exactly periodic in the real world, the math will distort the signal. Windowing is a way of tapering the time series to zero at both ends, thus moving distortion products out of the band of interest. While aliasing is uncorrectable after sampling, windowing is done later. I don’t see attention to windowing in treating time series in the social sciences, either. I’m thinking back through my math to see if I can demonstrate whether the assumption of periodicity applies even if there is no Fourier transform in the picture. Do you see evidence of economists or statisticians applying windows to their time series?

I wonder if this could apply to the fitting of model parameters via MCSim or other tools that offer model parameter estimation for time series analysis. If you have undersampled data and you try to fit a model to that data, even MCMC integration won’t fit the data properly, and so you could get erroneous parameter estimates, or so it would seem. That’s not a problem with MCSim, but it would seem to be a problem with the analysis that prepares the data for MCSim.

For whatever reason, I’ve avoided time series modeling in most of my work. Also, classical time-series analysis hasn’t been so useful for me because that theory tends to focus on direct observations. For example, when we’re studying time trends in death penalty support by state, we have sample survey data that gives us estimates for each state and year–and, indeed, we’re fitting time series models to get good estimates–but issues of sampling frequencies seem a bit beside the point.
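Bill’s point about needing samples at a rate at least twice the highest frequency present can be made concrete in a few lines. Here is a minimal NumPy sketch (my illustration, not from the post; the 7 Hz and 10 Hz values are arbitrary): a 7 Hz cosine sampled at 10 Hz, below its Nyquist rate of 14 Hz, produces exactly the same samples as a 3 Hz cosine.

```python
import numpy as np

fs = 10.0               # sampling rate (Hz); Nyquist limit is fs/2 = 5 Hz
t = np.arange(30) / fs  # 3 seconds of sample times

# A 7 Hz cosine violates the Nyquist criterion at this sampling rate...
x_true = np.cos(2 * np.pi * 7.0 * t)
# ...and its samples coincide with those of a 3 Hz cosine (7 - 10 = -3 Hz alias).
x_alias = np.cos(2 * np.pi * 3.0 * t)

# Maximum difference between the two sampled series is numerically zero.
print(np.max(np.abs(x_true - x_alias)))
```

This is why aliasing is uncorrectable after sampling, as Bill notes below: once the samples coincide, the information that would distinguish the two frequencies is simply gone from the data.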
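The leakage that windowing addresses is just as easy to demonstrate. Another NumPy sketch of my own (not from the post; the 10.3-cycle signal and record length are illustrative): a sinusoid that does not complete a whole number of cycles in the record makes the DFT’s implicit periodic extension discontinuous at the ends, smearing energy across the spectrum, while a Hann taper to zero at both ends pushes that distortion far down outside the band around the peak.

```python
import numpy as np

n = 256
t = np.arange(n)
# 10.3 cycles in the record: the DFT's implicit periodic extension is
# discontinuous at the ends, so energy leaks across the whole spectrum.
x = np.sin(2 * np.pi * 10.3 * t / n)

spec_rect = np.abs(np.fft.rfft(x))                  # no taper (rectangular window)
spec_hann = np.abs(np.fft.rfft(x * np.hanning(n)))  # Hann taper to zero at both ends

# Compare leakage well away from the peak near bin 10: the taper
# suppresses it by orders of magnitude.
far = np.arange(40, n // 2)
print(spec_rect[far].max(), spec_hann[far].max())
```

The taper trades a slightly wider main lobe for drastically lower sidelobes, which is the sense in which windowing moves distortion products out of the band of interest.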
112 andrew gelman stats-2010-06-27-Sampling rate of human-scaled time series