brendan_oconnor_ai brendan_oconnor_ai-2008 knowledge-graph by maker-knowledge-mining
1 brendan oconnor ai-2008-12-27-Facebook sentiment mining predicts presidential polls
Introduction: I’m a bit late blogging this, but here’s a messy, exciting (and statistically validated!) new online data source. My friend Roddy at Facebook wrote a post describing their sentiment analysis system, which can evaluate positive or negative sentiment toward a particular topic by looking at a large number of wall messages. (I’d link to it, but I can’t find the URL anymore; here’s the Lexicon, but that version only gives term frequencies, not sentiment.) How they constructed their sentiment detector is interesting. Starting with a list of positive and negative terms, they had a lexical acquisition step to gather many more candidate synonyms and misspellings, a necessity in this social media domain, where WordNet ain’t gonna come close! After manually filtering these candidates, they assess the sentiment toward a mention of a topic by looking for instances of these positive and negative words nearby, along with “negation heuristics” and a few other features. He describ
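To make the windowed lexicon idea concrete, here is a minimal sketch. The word lists, the negator list, the window size, and the function names are all hypothetical stand-ins, not Facebook's actual system, which the post says also used other features.

```python
import re

# Hypothetical seed lexicons; the real lists from the post are not available.
POSITIVE = {"love", "great", "awesome", "good"}
NEGATIVE = {"hate", "terrible", "awful", "bad"}
NEGATORS = {"not", "never", "no", "don't", "isn't"}


def tokenize(text):
    """Lowercase the text and split it into word tokens, keeping apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def topic_sentiment(message, topic, window=4):
    """Sum the polarity of sentiment words within `window` tokens of each
    mention of `topic`. A sentiment word directly preceded by a negator has
    its polarity flipped (a very crude negation heuristic)."""
    tokens = tokenize(message)
    score = 0
    for i, tok in enumerate(tokens):
        if tok != topic:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            word = tokens[j]
            polarity = (word in POSITIVE) - (word in NEGATIVE)
            if polarity and j > 0 and tokens[j - 1] in NEGATORS:
                polarity = -polarity
            score += polarity
    return score
```

Calling `topic_sentiment("I love Obama", "obama")` gives a positive score, while `topic_sentiment("Obama is not good", "obama")` comes out negative because of the negation flip.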
2 brendan oconnor ai-2008-12-18-Information cost and genocide
Introduction: In 1994, the Rwandan genocide claimed 800,000 lives. This genocide was remarkable for being very low-tech: lots of non-military, average people with machetes killing their neighbors. Romeo Dallaire, the leader of the small UN peacekeeping mission there, saw it coming and was convinced he could stop much of the violence if he had 5,000 international troops plus the authority to seize weapon caches and do other aggressive intervention operations. Famously, he made a plea to his superiors and was denied. (The genocide ended only when a rebel army managed a string of military victories and forcibly stopped the killing.) Kofi Annan forbade him from expanding his peacekeeping mandate because there was no international support; in particular, the U.S. was not on board. A recent article from The Economist explains, And the trickiest ch
3 brendan oconnor ai-2008-12-03-Statistics vs. Machine Learning, fight!
Introduction: 10/1/09 update: well, it’s been nearly a year, and I should say not everything in this rant is totally true, and I certainly believe much less of it now. Current take: Statistics, not machine learning, is the real deal, but unfortunately suffers from bad marketing. On the other hand, to the extent that bad marketing includes misguided undergraduate curriculums, there’s plenty of room to improve for everyone. So it’s pretty clear by now that statistics and machine learning aren’t very different fields. I was recently pointed to a very amusing comparison by the excellent statistician (and machine learning expert) Robert Tibshirani. Reproduced here: Glossary (Machine learning | Statistics): network, graphs | model; weights | parameters; learning | fitting; generalization | test set performance; supervised learning | regression/classification; unsupervised learning | density estimation, clustering; large grant = $1,000,000
4 brendan oconnor ai-2008-11-28-Calculating running variance in Python and C++
Introduction: It’s fairly obvious that an average can be calculated online, but interestingly, there’s also a way to calculate a running variance and standard deviation. Read all about it here. I’m playing around with the Netflix Prize data of 100 million movie ratings, and a huge problem is figuring out how to load and calculate everything in memory. I’m having success with NumPy, the numeric library for Python, because it compactly stores arrays with C/Fortran binary layouts. For example, 100 million 32-bit floats = 100M * 4 bytes = 400MB of memory, which is manageable. And it’s much easier to play around interactively in ipython / matplotlib rather than write C++ for everything. Unfortunately, the simple ways to calculate variance on an array of that size create wasteful intermediate data structures as long as the original array.
>>> mean( (x-mean(x)) ** 2 ) # two intermediate structures
>>> tmp=x-mean(x); tmp**=2; mean(tmp) # one intermediate structure
That’s an e
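For reference, one well-known way to compute a running variance in a single pass is Welford's method. The sketch below assumes that is the kind of algorithm meant; the excerpt is cut off before it names one, and the class and method names are made up for illustration.

```python
import math


class RunningStats:
    """Online mean and variance via Welford's method: one pass, O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the current mean

    def push(self, x):
        """Fold one new value into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        """Population variance, matching mean((x - mean(x)) ** 2)."""
        return self.m2 / self.n if self.n else 0.0

    def std(self):
        return math.sqrt(self.variance())


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.push(value)
```

No intermediate array is allocated, so memory use stays constant no matter how many ratings are streamed through. Divide `m2` by `n - 1` instead if the sample variance is wanted.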
5 brendan oconnor ai-2008-11-24-Python bindings to Google’s “AJAX” Search API
Introduction: I couldn’t find this anywhere on the web, so I threw together a quick Python binding for Google’s “AJAX” Search API (or rather, JSON-over-HTTP). (There are bindings out there for the old SOAP interface; I heard that was discontinued though.) Nothing fancy but it works for me. At: gist.github.com/28405
6 brendan oconnor ai-2008-11-21-The Wire: Mr. Nugget
Introduction: One of my favorite scenes of wisdom from The Wire : D: Nigga please. The man who invented them things, just some sad ass down at the basement of McDonald’s, thinkin’ of some shit to make some money for the real playas. POOT: Nah, man, that ain’t right. D: Fuck right. It ain’t about right, it’s about money. Now you think Ronald McDonald go down to that basement and say “Hey Mr. Nugget, you da bomb, we sellin’ chicken faster than you can tear the bone out, so I’m gonna write my clowney-ass name on this fat-ass check for you?”
7 brendan oconnor ai-2008-11-21-Netflix Prize
Introduction: Here’s a fascinating NYT article on the Netflix Prize for a better movie recommendation system. Tons of great stuff there; here’s a few highlights … First, a good unsupervised learning story: There’s a sort of unsettling, alien quality to their computers’ results. When the teams examine the ways that singular value decomposition is slotting movies into categories, sometimes it makes sense to them — as when the computer highlights what appears to be some essence of nerdiness in a bunch of sci-fi movies. But many categorizations are now so obscure that they cannot see the reasoning behind them. Possibly the algorithms are finding connections so deep and subconscious that customers themselves wouldn’t even recognize them. At one point, Chabbert showed me a list of movies that his algorithm had discovered share some ineffable similarity; it includes a historical movie, “Joan of Arc,” a wrestling video, “W.W.E.: SummerSlam 2004,” the comedy “It Had to Be You” and a version of Charle
8 brendan oconnor ai-2008-11-17-Correlations – cotton picking vs. 2008 Presidential votes
Introduction: From the neat blog Strange Maps — a map of the U.S. South, overlaying where cotton was picked in 1860 versus Presidential voting in 2008. The claim is that the causal pathway is through high African-American populations.
Introduction: This is a good idea: in a search engine’s query logs, look for outbreaks of queries like [[flu symptoms]] in a given region. I’ve heard (from Roddy) that this trick also works well on Facebook statuses (e.g. “Feeling crappy this morning, think I just got the flu”). Links: Google Uses Web Searches to Track Flu’s Spread – NYTimes.com; Google Flu Trends – google.org. For an example with a publicly available data feed, these queries work decently well on Twitter search: [[ flu -shot -google ]] (high recall) and [[ "muscle aches" flu -shot ]] (high precision). The “muscle aches” query is too sparse and the general query is too noisy, but you could imagine some more tricks to clean it up, then train a classifier, etc. With a bit more work it looks like geolocation information can be had out of the Twitter search API.
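The two queries can be imitated offline on any collection of status messages. This is only a rough sketch: the `matches` helper is hypothetical, and it approximates the include/exclude semantics of a search query rather than calling the real Twitter search API.

```python
import re


def normalize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def matches(text, required, excluded=()):
    """True if the text contains every required word or phrase (as whole
    words) and none of the excluded words."""
    tokens = normalize(text)
    padded = " " + " ".join(tokens) + " "
    has_required = all(
        " " + " ".join(normalize(phrase)) + " " in padded for phrase in required
    )
    has_excluded = any(term in tokens for term in excluded)
    return has_required and not has_excluded


def high_recall(text):
    """Rough equivalent of [[ flu -shot -google ]]."""
    return matches(text, ["flu"], ["shot", "google"])


def high_precision(text):
    """Rough equivalent of [[ "muscle aches" flu -shot ]]."""
    return matches(text, ["muscle aches", "flu"], ["shot"])
```

Messages that pass `high_recall` could then serve as noisy candidates for hand labeling before training a classifier.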
10 brendan oconnor ai-2008-11-05-Obama street celebrations in San Francisco
Introduction: In San Francisco, it’s no secret who everyone wanted to win in this election. Shortly after Obama’s victory speech last night, people started celebrating in the streets near my house in the Mission. At Valencia and 19th, a big party formed and ran for several hours into the night. People and kids were cheering, high-fiving, playing music, and having a good time: Shades of Burning Man : There were happy combinations of alcohol, police, art cars, fireworks, and the Extra Action Marching Band . (That clip was the tensest situation I saw; after that, the police just moved everyone out of the intersection and watched carefully.) I yelled “Yes we can!” and was answered “Yes we DID!” Strangers hugged me. I went home at 1 a.m. and the party was still going strong. Not the worst way to celebrate making history. This election was also big locally, including a tight three-way race for the local district chair
11 brendan oconnor ai-2008-10-17-Twitter graphs of the debate
Introduction: Fascinating, from the Twitter blog :
12 brendan oconnor ai-2008-10-16-Is religion the opiate of the elite?
Introduction: Andrew Gelman claims religion is the “opiate of the elite,” from this graph: He says: Religious attendance predicts Republican voting much more among the rich than the poor. This is a really interesting phenomenon — condition on wealth and see different effects of religion. But from looking at that graph, I saw the flipped interpretation — condition on religion, then see different effects of wealth (each line has a different slope). Only the religious become more Republican with greater wealth; secular voters don’t change their preferences when they get rich.
13 brendan oconnor ai-2008-10-15-Financial market theory on the Daily Show
Introduction: Deep insight of the moment: Volatility frequently occurs when everyone suddenly realizes the stock market is just a consensual mass delusion based on fictitious valuings of abstract assets. It’s like finding out Santa Claus is real because you catch him robbing your house. I wonder what a derivatives market is by that analogy. $596 trillion worth of hypothetical presents?
14 brendan oconnor ai-2008-10-12-The Universal Declaration of Human Rights Animated
Introduction: Link: The Universal Declaration of Human Rights Animated .
15 brendan oconnor ai-2008-10-11-It is accurate to determine a blog’s bias by what it links to
Introduction: Here’s a great project from Andy Baio and Joshua Schachter: they assessed the political biases of different blogs based on which articles they tend to link to. Using these political bias scores, they made a cool little Firefox extension that colors the names of different sources on the news aggregator site Memeorandum, like so: How they computed these biases is pretty neat. Their data source was the Memeorandum site itself, which shows a particular news story, then a list of different news sites that have written articles about the topic. Scraping out that data, Joshua constructed the adjacency matrix of sites vs. articles they linked to and ran good ol’ SVD on it, an algorithm that can be used to summarize the very high-dimensional article linking information in just several numbers (“components” or “dimensions”) for each news site. Basically, the algorithm groups together sites that tend to link to the same articles. It’s not exactly clustering though; rather, it project
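A toy version of the site-by-article SVD might look like the following. The link matrix is invented data, and centering each column before the decomposition is an assumption made for this sketch; the excerpt does not say how the original matrix was preprocessed.

```python
import numpy as np

# Hypothetical site-by-article link matrix: rows are news sites, columns are
# articles, and 1 means the site linked to that article.
links = np.array([
    [1, 1, 0, 0, 1],   # site A
    [1, 1, 0, 0, 1],   # site B
    [0, 0, 1, 1, 1],   # site C
    [0, 0, 1, 1, 1],   # site D
], dtype=float)


def site_components(matrix, k=1):
    """Return each site's coordinates on the top-k singular directions."""
    # Assumption: subtract per-article popularity so that an article linked
    # by every site carries no signal.
    centered = matrix - matrix.mean(axis=0)
    u, s, _vt = np.linalg.svd(centered, full_matrices=False)
    return u[:, :k] * s[:k]


# One number per site; the overall sign of the component is arbitrary.
scores = site_components(links)[:, 0]
```

Sites A and B land on one side of zero and sites C and D on the other, which is the kind of one-dimensional score that could drive the coloring.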
Introduction: MySpace and the Commission on the Presidential Debates put together a neat site, mydebates.org, which presents the candidates’ positions through various mini-polls and such. It even has a cool data exploration tool for the poll results … for example, here are two support maps, one for respondents over 65 and one for 18-24 year olds. Anyway, the site also takes submissions of questions for tonight’s debate. Apparently six million questions were submitted, and moderator Tom Brokaw will of course use only 10 or so. This raises a question: how were they selected? There’s no Digg-like social filtering or anything. You could imagine automatic methods to help narrow down the pool: Topic clustering? Quality ranking on syntax and vocabulary? Eric Fish suggested the obvious: probably someone picked 1000 randomly and sent them to Brokaw. I’d love to see a corpus of 6 million questions on U.S. political subjects, directed at only two different people. Anyone know anyon
17 brendan oconnor ai-2008-10-08-Blog move has landed
Introduction: We’re now live at a new location: anyall.org/blog . Good-bye, Blogger, it was sometimes nice knowing you. This blog is now on WordPress (perhaps behind the times ), which I’ve usually had good experiences with, e.g. for the Dolores Labs Blog . I also made the blog’s name more boring — the old one, “Social Science++”, was just too long and difficult to remember relative to how descriptive it was, and my interests have changed a little bit in any case. All the old posts have been imported, and I set up redirects for all posts. The RSS feed can’t be redirected though. (One small issue: comment authors’ urls and emails failed to get imported. I can fix it if I am given the info; if you want your old comments fixed, drop me a line.)
18 brendan oconnor ai-2008-09-30-PalinSpeak.com
Introduction: With my friend Doug, I just finished making a game, PalinSpeak.com, where you can chat with a Sarah Palin simulator. Check it out, it’s the best thing to hit the Internet since sliced bread. I’ll post more of the technical details (n-gram generation and query-answer matching, hurrah!) later…
19 brendan oconnor ai-2008-09-18-"Machine" translation-vision (Stanford AI courses online)
Introduction: The Stanford Engineering school has put up videos and course materials for several programming, AI, and optimization courses online. They did get some of the ones that are taught by excellent lecturers — e.g. introductory programming (the CS dept has craploads of money, so can afford to hire specialist lecturers, which results in very good courses), and Brad Osgood on the FFT (he’s just such a good lecturer). Main link , minor link. I was looking through the transcript of Chris Manning’s introductory lecture for CS224N, Natural Language Processing, last year. ( SEE link ; actual website link .) I took this same course years ago as a sophomore, and this part sounded familiar: So if you look at the early history of NLP, NLP essentially started in the 1950s. It started just after World War II in the beginning of the Cold War. And what NLP started off as is the field of machine translation , of can you use computers to translate automatically from one language to another l
20 brendan oconnor ai-2008-08-25-Fukuyama: Authoritarianism is still against history
Introduction: The latest on the world ideologies front – In the light of Russia’s Georgia adventures, there’s been lots of talk whether this represents a new rise of authoritarian Russia, which is presumably another nail in the coffin for U.S.-led liberal democratic hegemony in the world. Our “end of history” friend Francis Fukuyama just wrote an op-ed arguing that Russia and China are still not big threats to liberal democracy . There are some good points: Russia is behaving as an aggressive imperial power, but does not embrace a grand, exportable ideology with universal appeal. Similarly with China. They both still feel the need to pay lip service to democratic rituals and norms. Even Nicholas Kristof’s hilarious column chronicling his experience with China’s dubious protest registration system concludes that even a pale mockery of democracy is progress. I still like Azar Gat’s article which I wrote about last year, that Russia and China represent authoritarian capitalism, which will
21 brendan oconnor ai-2008-08-16-A better Obama vs McCain poll aggregation
22 brendan oconnor ai-2008-08-15-East vs West cultural psychology!
23 brendan oconnor ai-2008-07-04-Link: Today’s international organizations
24 brendan oconnor ai-2008-07-01-Bias correction sneak peek!
25 brendan oconnor ai-2008-06-18-Turker classifiers and binary classification threshold calibration
26 brendan oconnor ai-2008-06-17-Pairwise comparisons for relevance evaluation
27 brendan oconnor ai-2008-06-05-Clinton-Obama support visualization
28 brendan oconnor ai-2008-05-23-Sub-reddit for Systems Science and OR
29 brendan oconnor ai-2008-05-19-conplot – a console plotter
30 brendan oconnor ai-2008-05-13-The best natural language search commentary on the internet
32 brendan oconnor ai-2008-04-06-a regression slope is a weighted average of pairs’ slopes!
33 brendan oconnor ai-2008-04-02-Datawocky: More data usually beats better algorithms
34 brendan oconnor ai-2008-03-29-Allende’s cybernetic economy project
35 brendan oconnor ai-2008-03-24-Quick-R, the only decent R documentation on the internet
36 brendan oconnor ai-2008-03-20-Spending money on others makes you happy
37 brendan oconnor ai-2008-03-18-color name study i did
38 brendan oconnor ai-2008-03-10-PHD Comics: Humanities vs. Social Sciences
39 brendan oconnor ai-2008-03-06-data data data
40 brendan oconnor ai-2008-01-31-Food Fight
41 brendan oconnor ai-2008-01-27-Graphics! Atari Breakout and religious text NLP
42 brendan oconnor ai-2008-01-20-Moral psychology on Amazon Mechanical Turk
43 brendan oconnor ai-2008-01-07-Will the humanities save us?
44 brendan oconnor ai-2008-01-05-Indicators of a crackpot paper