hunch_net-2007-260 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Machine Learning is rising in importance because data is being collected for all sorts of tasks where it either wasn’t previously collected, or for tasks that did not previously exist. While this is great for Machine Learning, it has a downside—the massive data collection which is so useful can also lead to substantial privacy problems. It’s important to understand that this is a much harder problem than many people appreciate. The AOL data release is a good example. To those doing machine learning, the following strategies might seem obvious: Just delete any names or other obviously personally identifiable information. The logic here seems to be “if I can’t easily find the person, then no one can”. That doesn’t work, as demonstrated by the people who were found circumstantially from the AOL data. … then just hash all the search terms! The logic here is “if I can’t read it, then no one can”. It’s also trivially broken by a dictionary attack—just hash all the strings that might be in the data and check to see if they are in the data.
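The dictionary attack on hashed search terms can be sketched in a few lines of Python. The choice of SHA-256 and the candidate word list are illustrative assumptions, not details from the AOL release:

```python
import hashlib


def dictionary_attack(hashed_terms, candidate_terms):
    """Recover hashed search terms by hashing every guessable string
    and checking for membership in the released data."""
    # Precompute a hash -> plaintext table over all candidate strings.
    rainbow = {hashlib.sha256(t.encode()).hexdigest(): t
               for t in candidate_terms}
    # Any released hash that appears in the table is broken.
    return {h: rainbow[h] for h in hashed_terms if h in rainbow}


# Toy demonstration: a hashed "anonymized" release and an attacker's word list.
released = [hashlib.sha256(t.encode()).hexdigest()
            for t in ["pizza", "divorce lawyer", "xkqjz"]]
guesses = ["pizza", "divorce lawyer", "weather"]
recovered = dictionary_attack(released, guesses)
```

Any term common enough to appear in a dictionary or query log falls immediately; only the rare string `"xkqjz"` survives, which is exactly why hashing alone is not anonymization.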
Selected sentences from the post:

1. Machine Learning is rising in importance because data is being collected for all sorts of tasks where it either wasn’t previously collected, or for tasks that did not previously exist.
2. While this is great for Machine Learning, it has a downside—the massive data collection which is so useful can also lead to substantial privacy problems.
3. It’s also trivially broken by a dictionary attack—just hash all the strings that might be in the data and check to see if they are in the data.
4. If 10 terms appear with known relative frequencies in public data, then finding 10 encrypted terms with the same relative frequencies might give you very good evidence for what those terms are.
5. Many internet companies run on advertising, so eliminating the ability to do targeted advertising would eliminate the ability of these companies to exist.
6. However, this is not simply an interest burst—the long-term trend of increasing data collection implies this problem will repeatedly come up over the indefinite future.
7. The privacy problem breaks into at least two parts.
8. The ability to collect and analyze large quantities of data, which many large organizations now have or are constructing, increases their power relative to ordinary people.
9. The cultural-norm privacy problem is sometimes solvable by creating an opt-in or opt-out protocol.
10. None of this is helpful for cameras (where no interface exists) or monetary transactions (where the transaction itself determines whether or not some item is shipped).
11. The power-balance privacy problem is much more difficult.
12. At some point, we may end up with cameras and storage devices so small, cheap, and portable that forbidding their use is essentially absurd.
13. As technology improves, it’s reasonable to expect cameras just about anywhere people are in public.
14. Some legislation and good engineering could make these cameras available to anyone.
15. This would involve a substantial shift in cultural norms—essentially, people would always be in potential public view when not at home.
16. This directly collides with the “privacy as a cultural norm” privacy problem.
17. The hardness of the privacy problem mentioned at the beginning of the post implies difficult tradeoffs.
18. If you have cultural-norm privacy concerns, then you really don’t appreciate method (3) for power-balance privacy concerns.
19. If you value privacy greatly and the default action is taken, then you prefer monopolistic marketplaces.
20. All of the above is even murkier because what can be done with data is not fully known, nor is what can be done in a privacy-sensitive way.
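The frequency-matching attack described above can be sketched directly: rank the opaque tokens by how often they occur, rank the known public terms by their published frequencies, and pair them up. The specific frequencies below are made-up illustrative numbers:

```python
from collections import Counter


def frequency_match(encrypted_tokens, public_frequencies):
    """Guess plaintexts for deterministically-encrypted tokens by aligning
    their observed relative frequencies with known public frequencies."""
    counts = Counter(encrypted_tokens)
    # Rank both sides from most to least frequent and pair them positionally.
    ranked_tokens = [t for t, _ in counts.most_common()]
    ranked_words = sorted(public_frequencies,
                          key=public_frequencies.get, reverse=True)
    return dict(zip(ranked_tokens, ranked_words))


# Toy example: token "a" dominates, matching the most common public term.
tokens = ["a", "a", "a", "b", "b", "c"]
public = {"weather": 0.5, "news": 0.33, "maps": 0.17}
guesses = frequency_match(tokens, public)  # {"a": "weather", "b": "news", "c": "maps"}
```

No key is broken here; the side channel is the frequency distribution itself, which deterministic encryption or hashing leaves fully intact.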
simIndex simValue blogId blogTitle
same-blog 1 0.99999982 260 hunch net-2007-08-25-The Privacy Problem
2 0.29511073 390 hunch net-2010-03-12-Netflix Challenge 2 Canceled
Introduction: The second Netflix prize is canceled due to privacy problems. I continue to believe my original assessment of this paper, that the privacy break was somewhat overstated. I still haven’t seen any serious privacy failures on the scale of the AOL search log release. I expect privacy concerns to continue to be a big issue when dealing with data releases by companies or governments. The theory of maintaining privacy while using data is improving, but it is not yet in a state where the limits of what’s possible are clear, let alone how to achieve these limits in a manner friendly to a prediction competition.
3 0.15757462 423 hunch net-2011-02-02-User preferences for search engines
Introduction: I want to comment on the “Bing copies Google” discussion here, here, and here, because there are data-related issues which the general public may not understand, and some of the framing seems substantially misleading to me. As a not-distant-outsider, let me mention the sources of bias I may have. I work at Yahoo!, which has started using Bing. This might predispose me towards Bing, but on the other hand I’m still at Yahoo!, and have been using Linux exclusively as an OS for many years, including even a couple minor kernel patches. And, on the gripping hand, I’ve spent quite a bit of time thinking about the basic principles of incorporating user feedback in machine learning. Also note, this post is not related to official Yahoo! policy, it’s just my personal view. The issue: Google engineers inserted synthetic responses to synthetic queries on google.com, then executed the synthetic searches on google.com using Internet Explorer with the Bing toolbar and later
4 0.13317721 406 hunch net-2010-08-22-KDD 2010
Introduction: There were several papers that seemed fairly interesting at KDD this year. The ones that caught my attention are: Xin Jin, Mingyang Zhang, Nan Zhang, and Gautam Das, Versatile Publishing For Privacy Preservation. This paper provides a conservative method for safely determining which data is publishable from any complete source of information (for example, a hospital) such that it does not violate privacy rules stated in natural language. It is not differentially private, so no external sources of join information can exist. However, it is a mechanism for publishing data rather than (say) the output of a learning algorithm. Arik Friedman and Assaf Schuster, Data Mining with Differential Privacy. This paper shows how to create effective differentially private decision trees. Progress in differentially private datamining is pretty impressive, as it was defined in 2006. David Chan, Rong Ge, Ori Gershony, Tim Hesterberg, Diane Lambert, Evaluating Online Ad Camp
5 0.11510109 132 hunch net-2005-11-26-The Design of an Optimal Research Environment
Introduction: How do you create an optimal environment for research? Here are some essential ingredients that I see. Stability. University-based research is relatively good at this. On any particular day, researchers face choices in what they will work on. A very common tradeoff is between: easy small, difficult big. For researchers without stability, the ‘easy small’ option wins. This is often “ok”—a series of incremental improvements on the state of the art can add up to something very beneficial. However, it misses one of the big potentials of research: finding entirely new and better ways of doing things. Stability comes in many forms. The prototypical example is tenure at a university—a tenured professor is almost impossible to fire, which means that the professor has the freedom to consider far-horizon activities. An iron-clad guarantee of a paycheck is not necessary—industrial research labs have succeeded well with research positions of indefinite duration. Atnt rese
6 0.097962365 200 hunch net-2006-08-03-AOL’s data drop
7 0.094224446 156 hunch net-2006-02-11-Yahoo’s Learning Problems.
8 0.092578225 420 hunch net-2010-12-26-NIPS 2010
9 0.087541804 138 hunch net-2005-12-09-Some NIPS papers
10 0.086987257 120 hunch net-2005-10-10-Predictive Search is Coming
11 0.086857088 277 hunch net-2007-12-12-Workshop Summary—Principles of Learning Problem Design
12 0.086749904 430 hunch net-2011-04-11-The Heritage Health Prize
13 0.085599296 332 hunch net-2008-12-23-Use of Learning Theory
14 0.085179515 148 hunch net-2006-01-13-Benchmarks for RL
15 0.082534611 452 hunch net-2012-01-04-Why ICML? and the summer conferences
16 0.079695165 159 hunch net-2006-02-27-The Peekaboom Dataset
17 0.079336137 296 hunch net-2008-04-21-The Science 2.0 article
18 0.07877095 343 hunch net-2009-02-18-Decision by Vetocracy
19 0.078732178 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms
20 0.07863766 400 hunch net-2010-06-13-The Good News on Exploration and Learning
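The differentially private data mining mentioned in the KDD 2010 entry above rests on a simple primitive: answer queries with calibrated noise rather than exactly. A minimal sketch of the Laplace mechanism for a counting query follows; this is an illustration of the general idea, not the decision-tree method from that paper, and the query log is invented:

```python
import random


def dp_count(records, predicate, epsilon=0.1):
    """Answer a counting query with epsilon-differential privacy by adding
    Laplace noise of scale 1/epsilon (a count has sensitivity 1)."""
    true_count = sum(1 for r in records if predicate(r))
    # The difference of two Exponential(epsilon) draws is Laplace(scale=1/epsilon).
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise


# Toy example: count searches mentioning "divorce" without exposing any user.
logs = ["divorce lawyer", "pizza near me", "divorce forms", "weather"]
noisy = dp_count(logs, lambda q: "divorce" in q, epsilon=0.5)  # roughly 2, varies per run
```

Because the noise depends only on epsilon and not on the data, adding or removing any single user's record changes the answer distribution by at most a factor of e^epsilon, which is the formal guarantee that makes such releases defensible.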