hunch_net hunch_net-2010 hunch_net-2010-408 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Adventures in Data Land.
sentIndex sentText sentNum sentScore
wordName wordTfidf (topN-words)
[('land', 0.953), ('data', 0.304)]
simIndex simValue blogId blogTitle
same-blog 1 1.0 408 hunch net-2010-08-24-Alex Smola starts a blog
Introduction: Adventures in Data Land.
2 0.099274404 298 hunch net-2008-04-26-Eliminating the Birthday Paradox for Universal Features
Introduction: I want to expand on this post which describes one of the core tricks for making Vowpal Wabbit fast and easy to use when learning from text. The central trick is converting a word (or any other parseable quantity) into a number via a hash function. Kishore tells me this is a relatively old trick in NLP land, but it has some added advantages when doing online learning, because you can learn directly from the existing data without preprocessing the data to create features (destroying the online property) or using an expensive hashtable lookup (slowing things down). A central concern for this approach is collisions, which create a loss of information. If you use m features in an index space of size n, the birthday paradox suggests a collision if m > n^0.5, essentially because there are m^2 pairs. This is pretty bad, because it says that with a vocabulary of 10^5 features, you might need to have 10^10 entries in your table. It turns out that redundancy is gr
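A minimal sketch of the hashing trick in Python (illustrative only: the hash function, index size, and token handling here are stand-ins, not Vowpal Wabbit's actual implementation):

```python
import hashlib

def hashed_features(tokens, n_bits=18):
    # Map tokens into a sparse feature vector over an index space of size 2**n_bits.
    # Collisions simply add their counts together, which is the information loss
    # the birthday-paradox argument above is about.
    n = 1 << n_bits
    features = {}
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16) % n
        features[h] = features.get(h, 0.0) + 1.0
    return features

# No vocabulary dictionary or preprocessing pass is needed: any new token maps
# directly to an index, which is what preserves the online property.
print(hashed_features("the quick brown fox jumps over the lazy dog".split()))
```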
3 0.067844108 260 hunch net-2007-08-25-The Privacy Problem
Introduction: Machine Learning is rising in importance because data is being collected for all sorts of tasks where it either wasn’t previously collected, or for tasks that did not previously exist. While this is great for Machine Learning, it has a downside—the massive data collection which is so useful can also lead to substantial privacy problems. It’s important to understand that this is a much harder problem than many people appreciate. The AOL data release is a good example. To those doing machine learning, the following strategies might be obvious: Just delete any names or other obviously personally identifiable information. The logic here seems to be “if I can’t easily find the person then no one can”. That doesn’t work as demonstrated by the people who were found circumstantially from the AOL data. … then just hash all the search terms! The logic here is “if I can’t read it, then no one can”. It’s also trivially broken by a dictionary attack—just hash all the strings
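A toy sketch of why hashing the terms does not hide them (hypothetical Python, assuming plain unsalted hashes and a small candidate dictionary; real attacks just use much larger dictionaries):

```python
import hashlib

def h(s):
    return hashlib.sha256(s.encode("utf-8")).hexdigest()

# The "anonymized" release: search terms replaced by their hashes.
released = [h("pizza near me"), h("divorce lawyer"), h("pizza near me")]

# The attack: hash every candidate string and invert by table lookup.
candidates = ["weather", "pizza near me", "cheap flights", "divorce lawyer"]
lookup = {h(term): term for term in candidates}

print([lookup.get(x, "<unknown>") for x in released])
# -> ['pizza near me', 'divorce lawyer', 'pizza near me']
```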
4 0.064892083 106 hunch net-2005-09-04-Science in the Government
Introduction: I found the article on “Political Science” at the New York Times interesting. Essentially, the article is about allegations that the US government has been systematically distorting scientific views. With a petition by some 7000+ scientists alleging such behavior, this is clearly a significant concern. One thing not mentioned explicitly in this discussion is that there are fundamental cultural differences between academic research and the rest of the world. In academic research, careful, clear thought is valued. This value is achieved by both formal and informal mechanisms. One example of a formal mechanism is peer review. In contrast, in the land of politics, the basic value is agreement. It is only with some amount of agreement that a new law can be passed or other actions can be taken. Since Science (with a capital ‘S’) has accomplished many things, it can be a significant tool in persuading people. This makes it compelling for a politician to use science as a mec
5 0.060024109 136 hunch net-2005-12-07-Is the Google way the way for machine learning?
Introduction: Urs Hoelzle from Google gave an invited presentation at NIPS. In the presentation, he strongly advocates interacting with data in a particular scalable manner which is something like the following: Make a cluster of machines. Build a unified filesystem. (Google uses GFS, but NFS or other approaches work reasonably well for smaller clusters.) Interact with data via MapReduce. Creating a cluster of machines is, by this point, relatively straightforward. Unified filesystems are a little bit tricky—GFS is capable by design of essentially unlimited speed throughput to disk. NFS can bottleneck because all of the data has to move through one machine. Nevertheless, this may not be a limiting factor for smaller clusters. MapReduce is a programming paradigm. Essentially, it is a combination of a data element transform (map) and an aggregator/selector (reduce). These operations are highly parallelizable and the claim is that they support the forms of data interacti
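A toy, single-machine sketch of the paradigm in Python (the map, reduce, and shuffle here are illustrative stand-ins, not GFS or Google's MapReduce implementation):

```python
from collections import defaultdict

def map_fn(line):
    # map: each input record independently emits (key, value) pairs.
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # reduce: all values for one key are aggregated, independently per key.
    return key, sum(values)

records = ["the cat sat", "the dog sat", "the cat ran"]

# The shuffle step (grouping intermediate pairs by key) is what the framework
# does between the two phases; both phases parallelize trivially across machines.
grouped = defaultdict(list)
for record in records:
    for key, value in map_fn(record):
        grouped[key].append(value)

print(dict(reduce_fn(k, vs) for k, vs in grouped.items()))
# -> {'the': 3, 'cat': 2, 'sat': 2, 'dog': 1, 'ran': 1}
```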
6 0.059331313 444 hunch net-2011-09-07-KDD and MUCMD 2011
7 0.057768129 373 hunch net-2009-10-03-Static vs. Dynamic multiclass prediction
8 0.057680652 159 hunch net-2006-02-27-The Peekaboom Dataset
9 0.056719758 445 hunch net-2011-09-28-Somebody’s Eating Your Lunch
10 0.056634504 12 hunch net-2005-02-03-Learning Theory, by assumption
11 0.055149823 143 hunch net-2005-12-27-Automated Labeling
12 0.054894559 455 hunch net-2012-02-20-Berkeley Streaming Data Workshop
13 0.051690359 295 hunch net-2008-04-12-It Doesn’t Stop
14 0.041934494 471 hunch net-2012-08-24-Patterns for research in machine learning
15 0.037592471 61 hunch net-2005-04-25-Embeddings: what are they good for?
16 0.037483808 41 hunch net-2005-03-15-The State of Tight Bounds
17 0.035035729 3 hunch net-2005-01-24-The Humanloop Spectrum of Machine Learning
18 0.034751307 325 hunch net-2008-11-10-ICML Reviewing Criteria
19 0.034113608 263 hunch net-2007-09-18-It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)
20 0.033481576 165 hunch net-2006-03-23-The Approximation Argument
topicId topicWeight
[(0, 0.029), (1, 0.022), (2, -0.026), (3, -0.002), (4, 0.037), (5, -0.013), (6, -0.025), (7, 0.021), (8, 0.025), (9, -0.044), (10, -0.025), (11, 0.03), (12, 0.022), (13, 0.015), (14, -0.033), (15, -0.053), (16, -0.037), (17, 0.018), (18, -0.019), (19, -0.008), (20, 0.072), (21, -0.047), (22, 0.008), (23, -0.013), (24, 0.011), (25, -0.025), (26, 0.002), (27, 0.056), (28, 0.063), (29, -0.039), (30, 0.011), (31, -0.065), (32, -0.025), (33, 0.046), (34, 0.003), (35, -0.003), (36, 0.046), (37, -0.01), (38, 0.021), (39, -0.065), (40, 0.028), (41, 0.155), (42, 0.061), (43, -0.014), (44, -0.032), (45, 0.065), (46, -0.002), (47, -0.034), (48, -0.036), (49, -0.061)]
simIndex simValue blogId blogTitle
same-blog 1 0.97769797 408 hunch net-2010-08-24-Alex Smola starts a blog
Introduction: Adventures in Data Land.
2 0.61941594 61 hunch net-2005-04-25-Embeddings: what are they good for?
Introduction: I’ve been looking at some recent embeddings work, and am struck by how beautiful the theory and algorithms are. It also makes me wonder, what are embeddings good for? A few things immediately come to mind: (1) For visualization of high-dimensional data sets. In this case, one would like good algorithms for embedding specifically into 2- and 3-dimensional Euclidean spaces. (2) For nonparametric modeling. The usual nonparametric models (histograms, nearest neighbor) often require resources which are exponential in the dimension. So if the data actually lie close to some low-dimensional surface, it might be a good idea to first identify this surface and embed the data before applying the model. Incidentally, for applications like these, it’s important to have a functional mapping from high to low dimension, which some techniques do not yield up. (3) As a prelude to classifier learning. The hope here is presumably that learning will be easier in the low-dimensional space,
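A hedged sketch of point (3), with PCA standing in for a more sophisticated embedding method (numpy and scikit-learn are assumptions of this example, not something the post prescribes): project to a low-dimensional space that has a functional high-to-low mapping, then fit the classifier there.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Synthetic data: labels depend only on a low-dimensional structure that has
# been embedded in a 100-dimensional space.
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 3))            # the "true" low-dimensional surface
X = Z @ rng.normal(size=(3, 100))        # observed high-dimensional points
y = (Z[:, 0] + Z[:, 1] > 0).astype(int)

# Embed first, then learn: the classifier only ever sees the 3-d coordinates.
model = make_pipeline(PCA(n_components=3), LogisticRegression())
model.fit(X, y)
print(model.score(X, y))
```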
3 0.60948282 260 hunch net-2007-08-25-The Privacy Problem
Introduction: Machine Learning is rising in importance because data is being collected for all sorts of tasks where it either wasn’t previously collected, or for tasks that did not previously exist. While this is great for Machine Learning, it has a downside—the massive data collection which is so useful can also lead to substantial privacy problems. It’s important to understand that this is a much harder problem than many people appreciate. The AOL data release is a good example. To those doing machine learning, the following strategies might be obvious: Just delete any names or other obviously personally identifiable information. The logic here seems to be “if I can’t easily find the person then no one can”. That doesn’t work as demonstrated by the people who were found circumstantially from the AOL data. … then just hash all the search terms! The logic here is “if I can’t read it, then no one can”. It’s also trivially broken by a dictionary attack—just hash all the strings
4 0.57678783 444 hunch net-2011-09-07-KDD and MUCMD 2011
Introduction: At KDD I enjoyed Stephen Boyd’s invited talk about optimization quite a bit. However, the most interesting talk for me was David Haussler’s. His talk started out with a formidable load of biological complexity. About half-way through you start wondering, “can this be used to help with cancer?” And at the end he connects it directly to use with a call to arms for the audience: cure cancer. The core thesis here is that cancer is a complex set of diseases which can be disentangled via genetic assays, allowing attacks on the specific signature of individual cancers. However, the data quantity and complex dependencies within the data require systematic and relatively automatic prediction and analysis algorithms of the kind that we are best familiar with. Some of the papers which interested me are: Kai-Wei Chang and Dan Roth, Selective Block Minimization for Faster Convergence of Limited Memory Large-Scale Linear Models, which is about effectively using a hard-example
5 0.53121519 455 hunch net-2012-02-20-Berkeley Streaming Data Workshop
Introduction: The From Data to Knowledge workshop, May 7-11 at Berkeley, should be of interest to the many people encountering streaming data in different disciplines. It’s run by a group of astronomers who encounter streaming data all the time. I met Josh Bloom recently, and he is broadly interested in a workshop covering all aspects of Machine Learning on streaming data. The hope here is that techniques developed in one area turn out to be useful in another, which seems quite plausible. Particularly if you are in the Bay Area, consider checking it out.
6 0.48787117 390 hunch net-2010-03-12-Netflix Challenge 2 Canceled
7 0.48707345 136 hunch net-2005-12-07-Is the Google way the way for machine learning?
8 0.46520686 143 hunch net-2005-12-27-Automated Labeling
9 0.456155 159 hunch net-2006-02-27-The Peekaboom Dataset
10 0.44447073 155 hunch net-2006-02-07-Pittsburgh Mind Reading Competition
11 0.41403034 298 hunch net-2008-04-26-Eliminating the Birthday Paradox for Universal Features
12 0.40303406 471 hunch net-2012-08-24-Patterns for research in machine learning
13 0.39114559 475 hunch net-2012-10-26-ML Symposium and Strata-Hadoop World
14 0.38858411 217 hunch net-2006-11-06-Data Linkage Problems
15 0.3733007 325 hunch net-2008-11-10-ICML Reviewing Criteria
16 0.3647024 263 hunch net-2007-09-18-It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)
17 0.36352918 224 hunch net-2006-12-12-Interesting Papers at NIPS 2006
18 0.36004254 476 hunch net-2012-12-29-Simons Institute Big Data Program
19 0.35992882 369 hunch net-2009-08-27-New York Area Machine Learning Events
20 0.34161669 150 hunch net-2006-01-23-On Coding via Mutual Information & Bayes Nets
topicId topicWeight
[(44, 0.561)]
simIndex simValue blogId blogTitle
same-blog 1 0.75506884 408 hunch net-2010-08-24-Alex Smola starts a blog
Introduction: Adventures in Data Land.
2 0.71228319 266 hunch net-2007-10-15-NIPS workshops extended to 3 days
Introduction: (Unofficially, at least.) The Deep Learning Workshop is being held the afternoon before the rest of the workshops in Vancouver, BC. Separate registration is needed, and open. What’s happening fundamentally here is that there are too many interesting workshops to fit into 2 days. Perhaps we can get it officially expanded to 3 days next year.
3 0.11350931 225 hunch net-2007-01-02-Retrospective
Introduction: It’s been almost two years since this blog began. In that time, I’ve learned enough to shift my expectations in several ways. Initially, the idea was for a general purpose ML blog where different people could contribute posts. What has actually happened is most posts come from me, with a few guest posts that I greatly value. There are a few reasons I see for this. Overload. A couple years ago, I had not fully appreciated just how busy life gets for a researcher. Making a post is not simply a matter of getting to it, but rather of prioritizing between {writing a grant, finishing an overdue review, writing a paper, teaching a class, writing a program, etc…}. This is a substantial transition away from what life as a graduate student is like. At some point the question is not “when will I get to it?” but rather “will I get to it?” and the answer starts to become “no” most of the time. Feedback failure. This blog currently receives about 3K unique visitors per day from
4 0.061086856 140 hunch net-2005-12-14-More NIPS Papers II
Introduction: I thought this was a very good NIPS with many excellent papers. The following are a few NIPS papers which I liked and I hope to study more carefully when I get the chance. The list is not exhaustive and in no particular order… Preconditioner Approximations for Probabilistic Graphical Models. Pradeep Ravikumar and John Lafferty. I thought the use of preconditioner methods from solving linear systems in the context of approximate inference was novel and interesting. The results look good and I’d like to understand the limitations. Rodeo: Sparse nonparametric regression in high dimensions. John Lafferty and Larry Wasserman. A very interesting approach to feature selection in nonparametric regression from a frequentist framework. The use of lengthscale variables in each dimension reminds me a lot of ‘Automatic Relevance Determination’ in Gaussian process regression — it would be interesting to compare Rodeo to ARD in GPs. Interpolating between types and tokens by estimating
5 0.0 1 hunch net-2005-01-19-Why I decided to run a weblog.
Introduction: I have decided to run a weblog on machine learning and learning theory research. Here are some reasons: 1) Weblogs enable new functionality: Public comment on papers. No mechanism for this exists at conferences and most journals. I have encountered it once for a science paper. Some communities have mailing lists supporting this, but not machine learning or learning theory. I have often read papers and found myself wishing there was some method to consider other’s questions and read the replies. Conference shortlists. One of the most common conversations at a conference is “what did you find interesting?” There is no explicit mechanism for sharing this information at conferences, and it’s easy to imagine that it would be handy to do so. Evaluation and comment on research directions. Papers are almost exclusively about new research, rather than evaluation (and consideration) of research directions. This last role is satisfied by funding agencies to some extent, but
6 0.0 2 hunch net-2005-01-24-Holy grails of machine learning?
7 0.0 3 hunch net-2005-01-24-The Humanloop Spectrum of Machine Learning
8 0.0 4 hunch net-2005-01-26-Summer Schools
9 0.0 5 hunch net-2005-01-26-Watchword: Probability
10 0.0 6 hunch net-2005-01-27-Learning Complete Problems
11 0.0 7 hunch net-2005-01-31-Watchword: Assumption
12 0.0 8 hunch net-2005-02-01-NIPS: Online Bayes
13 0.0 9 hunch net-2005-02-01-Watchword: Loss
14 0.0 10 hunch net-2005-02-02-Kolmogorov Complexity and Googling
15 0.0 11 hunch net-2005-02-02-Paper Deadlines
16 0.0 12 hunch net-2005-02-03-Learning Theory, by assumption
17 0.0 13 hunch net-2005-02-04-JMLG
18 0.0 14 hunch net-2005-02-07-The State of the Reduction
19 0.0 15 hunch net-2005-02-08-Some Links
20 0.0 16 hunch net-2005-02-09-Intuitions from applied learning