andrew_gelman_stats andrew_gelman_stats-2011 andrew_gelman_stats-2011-752 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: I always thought predicting traffic for a particular day and time would be something easily predicted from historic data with regression. Google Maps now has this feature: It would be good to actually include season, holiday and similar information: the predictions would be better. I wonder if one can find this data easily, or if others have done this work before.
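To make the regression idea concrete, here is a minimal sketch, not anything Google Maps is known to do. It fits travel time by ordinary least squares twice, once with only time-of-day and day-of-week indicators and once with holiday and season indicators added, and compares the residual sums of squares. The data are synthetic and noise-free, and the variable names (`rush`, `weekend`, `holiday`, `winter`) and the effect sizes in `TRUE_BETA` are assumptions made up for illustration.

```python
from itertools import product


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


def rss(rows, y, beta):
    """Residual sum of squares of the fitted linear model."""
    return sum(
        (t - sum(b * v for b, v in zip(beta, r))) ** 2 for r, t in zip(rows, y)
    )


# Hypothetical effects, in minutes: intercept, rush hour, weekend, holiday, winter.
TRUE_BETA = [20.0, 15.0, -5.0, -8.0, 4.0]

# Synthetic data: one observation per combination of the four 0/1 indicators.
full_rows = [[1, rush, weekend, holiday, winter]
             for rush, weekend, holiday, winter in product([0, 1], repeat=4)]
travel_time = [sum(b * v for b, v in zip(TRUE_BETA, r)) for r in full_rows]

# Baseline model: intercept, rush hour, and weekend only.
base_rows = [r[:3] for r in full_rows]

beta_base = ols(base_rows, travel_time)
beta_full = ols(full_rows, travel_time)
rss_base = rss(base_rows, travel_time, beta_base)
rss_full = rss(full_rows, travel_time, beta_full)

print("baseline RSS:", rss_base)
print("with holiday and season RSS:", rss_full)
```

On real data the gain from the extra predictors would be smaller and would need to be checked on held-out days, since this toy example has no noise and the full model matches the data-generating process by construction.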
sentIndex sentText sentNum sentScore
1 I always thought predicting traffic for a particular day and time would be something easily predicted from historic data with regression. [sent-1, score-2.26]
2 Google Maps now has this feature: It would be good to actually include season, holiday and similar information: the predictions would be better. [sent-2, score-1.21]
3 I wonder if one can find this data easily, or if others have done this work before. [sent-3, score-0.701]
wordName wordTfidf (topN-words)
[('easily', 0.362), ('historic', 0.34), ('holiday', 0.317), ('traffic', 0.305), ('season', 0.284), ('maps', 0.233), ('feature', 0.215), ('predicted', 0.215), ('predicting', 0.211), ('predictions', 0.195), ('google', 0.182), ('wonder', 0.15), ('include', 0.143), ('similar', 0.135), ('would', 0.133), ('day', 0.132), ('done', 0.122), ('others', 0.114), ('data', 0.109), ('information', 0.106), ('particular', 0.105), ('always', 0.104), ('thought', 0.1), ('find', 0.097), ('actually', 0.088), ('something', 0.078), ('work', 0.069), ('good', 0.066), ('time', 0.066), ('one', 0.04)]
simIndex simValue blogId blogTitle
same-blog 1 1.0 752 andrew gelman stats-2011-06-08-Traffic Prediction
2 0.14523004 1109 andrew gelman stats-2012-01-09-Google correlate links statistics with minorities
Introduction: John Eppley asks what I make of this: Eppley is guessing the negative spikes are searches getting swamped by holiday season shoppers.
3 0.14107378 580 andrew gelman stats-2011-02-19-Weather visualization with WeatherSpark
Introduction: WeatherSpark: prediction and observation quantiles, historic data, multiple predictors, zoomable, draggable, colorful, wonderful. Via Jure Cuhalev.
4 0.1243095 1697 andrew gelman stats-2013-01-29-Where 36% of all boys end up nowadays
Introduction: My Take a Number feature appears in today’s Times. And here are the graphs that I wish they’d had space to include! Original story here.
Introduction: Sandeep Baliga writes: [In a recent study, Gilles Duranton and Matthew Turner write:] For interstate highways in metropolitan areas we [Duranton and Turner] find that VKT (vehicle kilometers traveled) increases one for one with interstate highways, confirming the ‘fundamental law of highway congestion.’ Provision of public transit also simply leads to the people taking public transport being replaced by drivers on the road. Therefore: These findings suggest that both road capacity expansions and extensions to public transit are not appropriate policies with which to combat traffic congestion. This leaves congestion pricing as the main candidate tool to curb traffic congestion. To which I reply: Sure, if your goal is to curb traffic congestion. But what sort of goal is that? Thinking like a microeconomist, my policy goal is to increase people’s utility. Sure, traffic congestion is annoying, but there must be some advantages to driving on that crowded road or pe
6 0.11763766 911 andrew gelman stats-2011-09-15-More data tools worth using from Google
7 0.1171422 563 andrew gelman stats-2011-02-07-Evaluating predictions of political events
8 0.10251337 492 andrew gelman stats-2010-12-30-That puzzle-solving feeling
9 0.099413678 1508 andrew gelman stats-2012-09-23-Speaking frankly
10 0.098209873 1649 andrew gelman stats-2013-01-02-Back when 50 miles was a long way
11 0.096212842 2308 andrew gelman stats-2014-04-27-White stripes and dead armadillos
12 0.088909313 1287 andrew gelman stats-2012-04-28-Understanding simulations in terms of predictive inference?
14 0.082679726 737 andrew gelman stats-2011-05-30-Memorial Day question
15 0.081928357 1167 andrew gelman stats-2012-02-14-Extra babies on Valentine’s Day, fewer on Halloween?
16 0.081136644 1980 andrew gelman stats-2013-08-13-Test scores and grades predict job performance (but maybe not at Google)
17 0.079057775 162 andrew gelman stats-2010-07-25-Darn that Lindsey Graham! (or, “Mr. P Predicts the Kagan vote”)
18 0.076663867 315 andrew gelman stats-2010-10-03-He doesn’t trust the fit . . . r=.999
19 0.076589808 207 andrew gelman stats-2010-08-14-Pourquoi Google search est devenu plus raisonnable?
20 0.075322084 2181 andrew gelman stats-2014-01-21-The Commissar for Traffic presents the latest Five-Year Plan
topicId topicWeight
[(0, 0.119), (1, -0.008), (2, -0.005), (3, 0.031), (4, 0.051), (5, -0.017), (6, 0.002), (7, -0.014), (8, 0.005), (9, 0.003), (10, 0.029), (11, -0.011), (12, 0.021), (13, -0.032), (14, -0.054), (15, 0.038), (16, 0.038), (17, -0.018), (18, 0.025), (19, -0.001), (20, -0.028), (21, 0.025), (22, -0.01), (23, 0.018), (24, -0.015), (25, 0.001), (26, 0.016), (27, -0.024), (28, -0.002), (29, 0.015), (30, 0.05), (31, -0.043), (32, 0.002), (33, -0.007), (34, 0.016), (35, 0.02), (36, -0.011), (37, -0.013), (38, 0.007), (39, -0.006), (40, 0.02), (41, -0.001), (42, 0.019), (43, 0.027), (44, -0.032), (45, 0.007), (46, 0.048), (47, -0.038), (48, 0.03), (49, -0.082)]
simIndex simValue blogId blogTitle
same-blog 1 0.94537038 752 andrew gelman stats-2011-06-08-Traffic Prediction
2 0.74171728 228 andrew gelman stats-2010-08-24-A new efficient lossless compression algorithm
Introduction: Frank Wood and Nick Bartlett write: Deplump works the same as all probabilistic lossless compressors. A datastream is fed one observation at a time into a predictor which emits both the data stream and predictions about what the next observation in the stream should be for every observation. An encoder takes this output and produces a compressed stream which can be piped over a network or to a file. A receiver then takes this stream and decompresses it by doing everything in reverse. In order to ensure that the decoder has the same information available to it that the encoder had when compressing the stream, the decoded datastream is both emitted and directed to another predictor. This second predictor’s job is to produce exactly the same predictions as the initial predictor so that the decoder has the same information at every step of the process as the encoder did. The difference between probabilistic lossless compressors is in the prediction engine, encoding and decoding bein
3 0.72315776 910 andrew gelman stats-2011-09-15-Google Refine
Introduction: Tools worth knowing about: Google Refine is a power tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase. A recent discussion on the Polmeth list about the ANES Cumulative File is a setting where I think Refine might help (admittedly 49760×951 is bigger than I’d really like to deal with in the browser with js… but on a subset yes). [I might write this example up later.] Go watch the screencast videos for Refine. Data-entry problems are rampant in stuff we all use — leading or trailing spaces; mixed decimal-indicators; different units or transformations used in the same column; mixed lettercase leading to false duplicates; that’s only the beginning. Refine certainly would help find duplicates, and it counts things for you too. Just counting rows is too much for researchers sometimes (see yesterday’s post)! Refine 2.0 adds some data-collection tools for
4 0.72070426 911 andrew gelman stats-2011-09-15-More data tools worth using from Google
Introduction: Speaking of open data and google tools, see this post from Revolution R: How to use a Google Spreadsheet as data in R.
5 0.7165947 118 andrew gelman stats-2010-06-30-Question & Answer Communities
Introduction: StackOverflow has been a popular community where software developers would help one another. Recently they raised some VC funding, and to make profits they are selling job postings and expanding the model to other areas. Metaoptimize LLC has started a similar website, using the open-source OSQA framework, for areas such as statistics and machine learning. Here’s a description: You and other data geeks can ask and answer questions on machine learning, natural language processing, artificial intelligence, text analysis, information retrieval, search, data mining, statistical modeling, and data visualization. Here you can ask and answer questions, comment and vote for the questions of others and their answers. Both questions and answers can be revised and improved. Questions can be tagged with the relevant keywords to simplify future access and organize the accumulated material. If you work very hard on your questions and answers, you will receive badges like “Guru”, “Studen
6 0.7163583 358 andrew gelman stats-2010-10-20-When Kerry Met Sally: Politics and Perceptions in the Demand for Movies
7 0.71551132 1559 andrew gelman stats-2012-11-02-The blog is back
8 0.70016551 1434 andrew gelman stats-2012-07-29-FindTheData.org
9 0.69890344 192 andrew gelman stats-2010-08-08-Turning pages into data
10 0.69456565 544 andrew gelman stats-2011-01-29-Splitting the data
11 0.69378263 253 andrew gelman stats-2010-09-03-Gladwell vs Pinker
12 0.68829161 1447 andrew gelman stats-2012-08-07-Reproducible science FAIL (so far): What’s stoppin people from sharin data and code?
13 0.68757546 1823 andrew gelman stats-2013-04-24-The Tweets-Votes Curve
14 0.68637002 724 andrew gelman stats-2011-05-21-New search engine for data & statistics
15 0.68147492 563 andrew gelman stats-2011-02-07-Evaluating predictions of political events
16 0.68068004 1357 andrew gelman stats-2012-06-01-Halloween-Valentine’s update
17 0.67866534 677 andrew gelman stats-2011-04-24-My NOAA story
18 0.67738068 1175 andrew gelman stats-2012-02-19-Factual – a new place to find data
19 0.67639291 569 andrew gelman stats-2011-02-12-Get the Data
topicId topicWeight
[(12, 0.136), (21, 0.06), (24, 0.275), (27, 0.051), (76, 0.054), (77, 0.052), (86, 0.038), (99, 0.177)]
simIndex simValue blogId blogTitle
1 0.88037789 241 andrew gelman stats-2010-08-29-Ethics and statistics in development research
Introduction: From Banerjee and Duflo, “The Experimental Approach to Development Economics,” Annual Review of Economics (2009): One issue with the explicit acknowledgment of randomization as a fair way to allocate the program is that implementers may find that the easiest way to present it to the community is to say that an expansion of the program is planned for the control areas in the future (especially when such is indeed the case, as in phased-in design). I can’t quite figure out whether Banerjee and Duflo are saying that they would lie and tell people that an expansion is planned when it isn’t, or whether they’re deploring that other people do it. I’m not bothered by a lot of the deception in experimental research–for example, I think the Milgram obedience experiment was just fine–but somehow the above deception bothers me. It just seems wrong to tell people that an expansion is planned if it’s not. P.S. Overall the article is pretty good. My only real problem with it is that
Introduction: Justin Kinney writes: Since your blog has discussed the “maximal information coefficient” (MIC) of Reshef et al., I figured you might want to see the critique that Gurinder Atwal and I have posted. In short, Reshef et al.’s central claim that MIC is “equitable” is incorrect. We [Kinney and Atwal] offer mathematical proof that the definition of “equitability” Reshef et al. propose is unsatisfiable—no nontrivial dependence measure, including MIC, has this property. Replicating the simulations in their paper with modestly larger data sets validates this finding. The heuristic notion of equitability, however, can be formalized instead as a self-consistency condition closely related to the Data Processing Inequality. Mutual information satisfies this new definition of equitability but MIC does not. We therefore propose that simply estimating mutual information will, in many cases, provide the sort of dependence measure Reshef et al. seek. For background, here are my two p
Introduction: In his new book, “What is Your Race? The Census and Our Flawed Efforts to Classify Americans,” former Census Bureau director Ken Prewitt recommends taking the race question off the decennial census: He recommends gradual changes, integrating the race and national origin questions while improving both. In particular, he would replace the main “race” question by a “race or origin” question, with the instruction to “Mark one or more” of the following boxes: “White,” “Black, African Am., or Negro,” “Hispanic, Latino, or Spanish origin,” “American Indian or Alaska Native,” “Asian”, “Native Hawaiian or Other Pacific Islander,” and “Some other race or origin.” Then the next question is to write in “specific race, origin, or enrolled or principal tribe.” Prewitt writes: His suggestion is to go with these questions in 2020 and 2030, then in 2040 “drop the race question and use only the national origin question.” He’s also relying on the American Community Survey to gather a lo
4 0.87742138 1092 andrew gelman stats-2011-12-29-More by Berger and me on weakly informative priors
Introduction: A couple days ago we discussed some remarks by Tony O’Hagan and Jim Berger on weakly informative priors. Jim followed up on Deborah Mayo’s blog with this: Objective Bayesian priors are often improper (i.e., have infinite total mass), but this is not a problem when they are developed correctly. But not every improper prior is satisfactory. For instance, the constant prior is known to be unsatisfactory in many situations. The ‘solution’ pseudo-Bayesians often use is to choose a constant prior over a large but bounded set (a ‘weakly informative’ prior), saying it is now proper and so all is well. This is not true; if the constant prior on the whole parameter space is bad, so will be the constant prior over the bounded set. The problem is, in part, that some people confuse proper priors with subjective priors and, having learned that true subjective priors are fine, incorrectly presume that weakly informative proper priors are fine. I have a few reactions to this: 1. I agree
5 0.8758595 482 andrew gelman stats-2010-12-23-Capitalism as a form of voluntarism
Introduction: Interesting discussion by Alex Tabarrok (following up on an article by Rebecca Solnit) on the continuum between voluntarism (or, more generally, non-cash transactions) and markets with monetary exchange. I just have a few comments of my own: 1. Solnit writes of “the iceberg economy,” which she characterizes as “based on gift economies, barter, mutual aid, and giving without hope of return . . . the relations between friends, between family members, the activities of volunteers or those who have chosen their vocation on principle rather than for profit.” I just wonder whether “barter” completely fits in here. Maybe it depends on context. Sometimes barter is an informal way of keeping track (you help me and I help you), but in settings of low liquidity I could imagine barter being simply an inefficient way of performing an economic transaction. 2. I am no expert on capitalism but my impression is that it’s not just about “competition and selfishness” but also is related to the
6 0.87568283 1479 andrew gelman stats-2012-09-01-Mothers and Moms
8 0.87477446 38 andrew gelman stats-2010-05-18-Breastfeeding, infant hyperbilirubinemia, statistical graphics, and modern medicine
9 0.87466472 938 andrew gelman stats-2011-10-03-Comparing prediction errors
10 0.87412024 1869 andrew gelman stats-2013-05-24-In which I side with Neyman over Fisher
11 0.8731091 1376 andrew gelman stats-2012-06-12-Simple graph WIN: the example of birthday frequencies
12 0.87184322 2231 andrew gelman stats-2014-03-03-Running into a Stan Reference by Accident
13 0.8716898 2143 andrew gelman stats-2013-12-22-The kluges of today are the textbook solutions of tomorrow.
14 0.87168723 643 andrew gelman stats-2011-04-02-So-called Bayesian hypothesis testing is just as bad as regular hypothesis testing
15 0.87087083 743 andrew gelman stats-2011-06-03-An argument that can’t possibly make sense
16 0.86905396 433 andrew gelman stats-2010-11-27-One way that psychology research is different than medical research
17 0.86877126 1757 andrew gelman stats-2013-03-11-My problem with the Lindley paradox
18 0.86871827 278 andrew gelman stats-2010-09-15-Advice that might make sense for individuals but is negative-sum overall
19 0.86864185 1999 andrew gelman stats-2013-08-27-Bayesian model averaging or fitting a larger model
20 0.86693096 1455 andrew gelman stats-2012-08-12-Probabilistic screening to get an approximate self-weighted sample