hunch_net hunch_net-2007 hunch_net-2007-275 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: A couple of security researchers claim to have cracked the Netflix dataset. The claims of success appear somewhat overstated to me, but the method of attack is valid and could plausibly be substantially improved so as to reveal the movie preferences of a small fraction of Netflix users. The basic idea is to use a heuristic similarity function between ratings in a public database (from IMDB) and an anonymized database (Netflix) to link ratings in the private database to public identities (in IMDB). They claim to have linked two of a few dozen IMDB users to anonymized Netflix users. The claims seem a bit inflated to me, because (a) knowing the IMDB identity isn’t equivalent to knowing the person and (b) the claims of statistical significance are with respect to a model of the world the researchers themselves created (rather than the real world). Overall, this is another example showing that complete privacy is hard. It may be worth remembering that there are some substantial benefits from the Netflix challenge as well—we (as a society) have learned something about how to do collaborative filtering, which is useful beyond just recommending movies.
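To make the linkage idea concrete, here is a minimal sketch of this style of attack on toy data. The similarity kernel, the agreement tolerance, and the runner-up margin test below are illustrative assumptions of mine, not the researchers' actual scoring rule.

```python
def similarity(anon_ratings, public_ratings, tol=1):
    """Heuristic similarity: count movies rated in both databases where
    the two ratings agree to within tol stars (an illustrative kernel)."""
    shared = set(anon_ratings) & set(public_ratings)
    return sum(1 for m in shared
               if abs(anon_ratings[m] - public_ratings[m]) <= tol)

def link(anon_db, public_db, margin=2):
    """Link each anonymized user to the best-matching public identity,
    but only when the best score beats the runner-up by a margin
    (a crude confidence test; the threshold here is made up)."""
    links = {}
    for anon_id, anon_ratings in anon_db.items():
        scores = sorted((similarity(anon_ratings, pub), pub_id)
                        for pub_id, pub in public_db.items())
        if len(scores) >= 2 and scores[-1][0] - scores[-2][0] >= margin:
            links[anon_id] = scores[-1][1]
    return links

# Toy databases: {user: {movie: stars}}
netflix_anon = {"u1": {"A": 5, "B": 1, "C": 4}}
imdb_public = {"alice": {"A": 5, "B": 1, "C": 4}, "bob": {"A": 2, "D": 5}}
print(link(netflix_anon, imdb_public))  # {'u1': 'alice'}
```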
sentIndex sentText sentNum sentScore
1 A couple of security researchers claim to have cracked the Netflix dataset. [sent-1, score-0.865]
2 The claims of success appear somewhat overstated to me, but the method of attack is valid and could plausibly be substantially improved so as to reveal the movie preferences of a small fraction of Netflix users. [sent-2, score-1.123]
3 The basic idea is to use a heuristic similarity function between ratings in a public database (from IMDB) and an anonymized database (Netflix) to link ratings in the private database to public identities (in IMDB). [sent-3, score-2.038]
4 They claim to have linked two of a few dozen IMDB users to anonymized Netflix users. [sent-4, score-0.881]
5 The claims seem a bit inflated to me, because (a) knowing the IMDB identity isn’t equivalent to knowing the person and (b) the claims of statistical significance are with respect to a model of the world the researchers themselves created (rather than the real world). [sent-5, score-1.354]
6 Overall, this is another example showing that complete privacy is hard. [sent-6, score-0.209]
7 It may be worth remembering that there are some substantial benefits from the Netflix challenge as well—we (as a society) have learned something about how to do collaborative filtering, which is useful beyond just recommending movies. [sent-7, score-0.548]
wordName wordTfidf (topN-words)
[('imdb', 0.515), ('netflix', 0.37), ('database', 0.275), ('claims', 0.265), ('anonymized', 0.212), ('ratings', 0.2), ('knowing', 0.154), ('created', 0.124), ('public', 0.116), ('claim', 0.116), ('preferences', 0.114), ('recommending', 0.114), ('linked', 0.106), ('cracked', 0.106), ('identities', 0.106), ('movie', 0.106), ('reveal', 0.106), ('significance', 0.106), ('security', 0.095), ('similarity', 0.092), ('slashdot', 0.092), ('valid', 0.092), ('identity', 0.092), ('collaborative', 0.092), ('link', 0.088), ('society', 0.086), ('private', 0.083), ('attack', 0.081), ('filtering', 0.079), ('users', 0.077), ('benefits', 0.075), ('privacy', 0.074), ('showing', 0.07), ('beyond', 0.068), ('equivalent', 0.067), ('statistical', 0.067), ('couple', 0.066), ('fraction', 0.066), ('improved', 0.065), ('complete', 0.065), ('plausibly', 0.064), ('worth', 0.06), ('person', 0.06), ('challenge', 0.06), ('dataset', 0.058), ('overall', 0.057), ('somewhat', 0.056), ('success', 0.055), ('researchers', 0.054), ('appear', 0.053)]
simIndex simValue blogId blogTitle
same-blog 1 0.99999976 275 hunch net-2007-11-29-The Netflix Crack
2 0.16493101 301 hunch net-2008-05-23-Three levels of addressing the Netflix Prize
Introduction: In October 2006, the online movie renter, Netflix, announced the Netflix Prize contest. They published a comprehensive dataset including more than 100 million movie ratings, which were performed by about 480,000 real customers on 17,770 movies. Competitors in the challenge are required to estimate a few million ratings. To win the “grand prize,” they need to deliver a 10% improvement in the prediction error compared with the results of Cinematch, Netflix’s proprietary recommender system. Best current results deliver 9.12% improvement, which is quite close to the 10% goal, yet painfully distant. The Netflix Prize breathed new life and excitement into recommender systems research. The competition allowed the wide research community to access a large-scale, real-life dataset. Beyond this, the competition changed the rules of the game. Claiming that your nice idea could outperform some mediocre algorithms on some toy dataset is no longer acceptable. Now researcher
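As a quick sanity check on these numbers, the contest's improvement metric is just 1 − RMSE/RMSE_Cinematch. The snippet below assumes the commonly cited Cinematch quiz-set RMSE of about 0.9514.

```python
def improvement(rmse, baseline=0.9514):
    """Fractional RMSE improvement over a baseline
    (Cinematch's quiz-set RMSE is commonly cited as ~0.9514)."""
    return 1.0 - rmse / baseline

print(0.9514 * 0.9)         # ~0.8563: the RMSE needed for the 10% grand prize
print(improvement(0.8646))  # ~0.0912: roughly the 9.12% best result quoted above
```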
3 0.14873905 239 hunch net-2007-04-18-$50K Spock Challenge
Introduction: Apparently, the company Spock is setting up a $50k entity resolution challenge. $50k is much less than the Netflix challenge, but it’s effectively the same as Netflix until someone reaches 10%. It’s also nice that the Spock challenge has a short duration. The (visible) test set is of size 25k and the training set has size 75k.
4 0.14548792 23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models
Introduction: This is a paper by Yann LeCun and Fu Jie Huang published at AISTATS 2005. I found this paper very difficult to read, but it does make a point about a computational shortcut. This paper takes for granted that the method of solving a problem is gradient descent on parameters. Given this assumption, the question arises: Do you want to do gradient descent on a probabilistic model or something else? All (conditional) probabilistic models have the form p(y|x) = f(x,y)/Z(x) where Z(x) = sum_y f(x,y) (the paper calls -log f(x,y) an “energy”). If f is parameterized by some w, the gradient has a term for Z(x), and hence for every value of y. The paper claims that such models can be optimized for classification purposes using only the correct y and the other y’ ≠ y which maximizes f(x,y’). This can even be done on unnormalizable models. The paper further claims that this can be done with an approximate maximum. These claims are plausible based on experiments.
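A minimal numpy sketch of the flavor of loss this enables, assuming the per-label energies E(x,y) = -log f(x,y) are already computed: the loss and gradient touch only the correct label and the most offending incorrect one, so no sum over all y (i.e., no Z(x)) is needed. The margin form here is one representative of the loss families the paper analyzes, not its exact objective.

```python
import numpy as np

def margin_loss_grad(energies, y_true, margin=1.0):
    """Hinge-style loss computed from only the correct label and the most
    offending incorrect one, avoiding the sum over all y in Z(x).
    energies[y] plays the role of E(x,y) = -log f(x,y).
    Returns (loss, dL/dE): the gradient is nonzero in at most two
    coordinates, which is the computational shortcut."""
    e = np.asarray(energies, dtype=float)
    others = np.delete(np.arange(len(e)), y_true)
    y_bad = others[np.argmin(e[others])]       # most offending wrong label
    loss = max(0.0, margin + e[y_true] - e[y_bad])
    grad = np.zeros_like(e)
    if loss > 0:
        grad[y_true], grad[y_bad] = 1.0, -1.0  # descent pushes E(x,y) down, E(x,y') up
    return loss, grad

# 4 labels, true label 2: only labels 1 and 2 receive gradient.
print(margin_loss_grad([2.0, 0.5, 1.0, 3.0], y_true=2))
```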
5 0.13639243 371 hunch net-2009-09-21-Netflix finishes (and starts)
Introduction: I attended the Netflix prize ceremony this morning. The press conference part is covered fine elsewhere, with the basic outcome being that BellKor’s Pragmatic Chaos won over The Ensemble by 15-20 minutes, because they were tied in performance on the ultimate holdout set. I’m sure the individual participants will have many chances to speak about the solution. One of these is Bell at the NYAS ML symposium on Nov. 6. Several additional details may interest ML people. The degree of overfitting exhibited by the difference in performance on the leaderboard test set and the ultimate holdout set was small but decisive, at .02 to .03%. A tie was possible, because the rules cut off measurements below the fourth digit based on significance concerns. In actuality, of course, the scores do differ before rounding, but everyone I spoke to claimed not to know how. The complete dataset has been released on UCI, so each team could compute their own score to whatever accuracy
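A tiny illustration of how the fourth-digit cutoff produces ties; the two RMSE values below are made up.

```python
# Two submissions whose RMSEs differ only past the fourth digit
# report identically under the contest's rounding rule.
a, b = 0.856704, 0.856744
print(round(a, 4), round(b, 4), round(a, 4) == round(b, 4))  # 0.8567 0.8567 True
```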
6 0.13113762 362 hunch net-2009-06-26-Netflix nearly done
7 0.12306108 364 hunch net-2009-07-11-Interesting papers at KDD
8 0.098898366 390 hunch net-2010-03-12-Netflix Challenge 2 Canceled
9 0.080913201 211 hunch net-2006-10-02-$1M Netflix prediction contest
10 0.067136906 377 hunch net-2009-11-09-NYAS ML Symposium this year.
11 0.064090848 400 hunch net-2010-06-13-The Good News on Exploration and Learning
12 0.063059844 423 hunch net-2011-02-02-User preferences for search engines
13 0.062750414 418 hunch net-2010-12-02-Traffic Prediction Problem
14 0.059992932 271 hunch net-2007-11-05-CMU wins DARPA Urban Challenge
15 0.059950009 389 hunch net-2010-02-26-Yahoo! ML events
16 0.055125061 356 hunch net-2009-05-24-2009 ICML discussion site
17 0.053820983 188 hunch net-2006-06-30-ICML papers
18 0.052813735 430 hunch net-2011-04-11-The Heritage Health Prize
19 0.052336451 297 hunch net-2008-04-22-Taking the next step
20 0.051778153 406 hunch net-2010-08-22-KDD 2010
topicId topicWeight
[(0, 0.096), (1, -0.0), (2, -0.021), (3, 0.035), (4, -0.028), (5, 0.026), (6, -0.077), (7, 0.004), (8, -0.03), (9, -0.042), (10, -0.063), (11, 0.202), (12, -0.092), (13, 0.021), (14, -0.03), (15, 0.013), (16, -0.047), (17, 0.027), (18, 0.097), (19, -0.03), (20, -0.173), (21, -0.044), (22, -0.013), (23, -0.023), (24, -0.032), (25, -0.056), (26, -0.003), (27, -0.035), (28, -0.015), (29, 0.022), (30, 0.014), (31, -0.017), (32, -0.075), (33, 0.057), (34, -0.02), (35, -0.105), (36, 0.024), (37, -0.094), (38, 0.007), (39, -0.013), (40, 0.023), (41, -0.023), (42, -0.055), (43, 0.004), (44, -0.078), (45, -0.109), (46, 0.092), (47, -0.084), (48, -0.014), (49, 0.043)]
simIndex simValue blogId blogTitle
same-blog 1 0.98860395 275 hunch net-2007-11-29-The Netflix Crack
2 0.62320054 301 hunch net-2008-05-23-Three levels of addressing the Netflix Prize
3 0.58241904 364 hunch net-2009-07-11-Interesting papers at KDD
Introduction: I attended KDD this year. The conference has always had a strong grounding in what works based on the KDDcup, but it has developed a halo of workshops on various subjects. It seems that KDD has become a place where the economy meets machine learning in a stronger sense than at many other conferences. There were several papers that other people might like to take a look at. Yehuda Koren, Collaborative Filtering with Temporal Dynamics. This paper describes how to incorporate temporal dynamics into a couple of collaborative filtering approaches. This was also a best paper award. D. Sculley, Robert Malkin, Sugato Basu, Roberto J. Bayardo, Predicting Bounce Rates in Sponsored Search Advertisements. The basic claim of this paper is that the probability people immediately leave (“bounce”) after clicking on an advertisement is predictable. Frank McSherry and Ilya Mironov, Differentially Private Recommender Systems: Building Privacy into the Netflix Prize Contenders
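For the temporal-dynamics idea, here is a minimal sketch of one standard way to make a rating baseline time-aware, in the spirit of Koren's paper; the functional form and constants below are illustrative guesses rather than the paper's exact model.

```python
import numpy as np

def time_aware_baseline(mu, b_u, alpha_u, b_i, b_i_bin, t, t_u_mean,
                        t_min, t_max, beta=0.4):
    """Rating baseline with temporal dynamics: the user bias drifts with
    a signed, damped deviation from that user's mean rating date, and
    the item bias shifts per coarse time bin."""
    dev = np.sign(t - t_u_mean) * abs(t - t_u_mean) ** beta
    n_bins = len(b_i_bin)
    bin_idx = min(int(n_bins * (t - t_min) / (t_max - t_min)), n_bins - 1)
    return mu + b_u + alpha_u * dev + b_i + b_i_bin[bin_idx]

# Toy call: an item whose popularity jumped in the last of 30 time bins.
b_i_bin = np.zeros(30); b_i_bin[-1] = 0.2
print(time_aware_baseline(3.6, 0.1, 0.01, -0.3, b_i_bin,
                          t=2000, t_u_mean=1800, t_min=0, t_max=2000))
```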
4 0.56636655 239 hunch net-2007-04-18-$50K Spock Challenge
5 0.54986686 362 hunch net-2009-06-26-Netflix nearly done
Introduction: A $1M qualifying result was achieved on the public Netflix test set by a 3-way ensemble team. This is just in time for Yehuda’s presentation at KDD, which I’m sure will be one of the best attended ever. This isn’t quite over—there are a few days for another super-conglomerate team to come together and there is some small chance that the performance is nonrepresentative of the final test set, but I expect not. Regardless of the final outcome, the biggest lesson for ML from the Netflix contest has been the formidable performance edge of ensemble methods.
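The simplest version of that ensembling lesson is a least-squares linear blend of several predictors fit on holdout data; a sketch with synthetic data follows (the actual winning blends were far more elaborate).

```python
import numpy as np

def blend_weights(preds, truth):
    """Least-squares linear blend of several predictors on a holdout set:
    the most basic instance of ensembling."""
    X = np.vstack(preds).T  # (n_examples, n_models)
    w, *_ = np.linalg.lstsq(X, truth, rcond=None)
    return w

def rmse(pred, truth):
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

# Synthetic holdout: two predictors with partly independent errors.
rng = np.random.default_rng(0)
truth = rng.uniform(1, 5, 1000)
p1 = truth + rng.normal(0, 0.6, 1000)
p2 = truth + rng.normal(0, 0.6, 1000)
w = blend_weights([p1, p2], truth)
blend = np.vstack([p1, p2]).T @ w
# The blend's RMSE comes out noticeably below either predictor's alone.
print(rmse(p1, truth), rmse(p2, truth), rmse(blend, truth))
```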
6 0.53848308 390 hunch net-2010-03-12-Netflix Challenge 2 Canceled
7 0.52801132 211 hunch net-2006-10-02-$1M Netflix prediction contest
8 0.49974549 430 hunch net-2011-04-11-The Heritage Health Prize
9 0.47595662 336 hunch net-2009-01-19-Netflix prize within epsilon
10 0.46881869 371 hunch net-2009-09-21-Netflix finishes (and starts)
11 0.42573422 119 hunch net-2005-10-08-We have a winner
12 0.4177669 291 hunch net-2008-03-07-Spock Challenge Winners
13 0.34961915 418 hunch net-2010-12-02-Traffic Prediction Problem
14 0.34730756 406 hunch net-2010-08-22-KDD 2010
15 0.33960319 135 hunch net-2005-12-04-Watchword: model
16 0.33865443 260 hunch net-2007-08-25-The Privacy Problem
17 0.32307541 423 hunch net-2011-02-02-User preferences for search engines
18 0.32113612 271 hunch net-2007-11-05-CMU wins DARPA Urban Challenge
19 0.32098401 367 hunch net-2009-08-16-Centmail comments
20 0.32059804 23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models
topicId topicWeight
[(6, 0.385), (27, 0.191), (42, 0.033), (53, 0.022), (55, 0.054), (94, 0.05), (95, 0.141)]
simIndex simValue blogId blogTitle
same-blog 1 0.88914829 275 hunch net-2007-11-29-The Netflix Crack
2 0.61810637 309 hunch net-2008-07-10-Interesting papers, ICML 2008
Introduction: Here are some papers from ICML 2008 that I found interesting. Risi Kondor and Karsten Borgwardt, The Skew Spectrum of Graphs. This paper is about a new family of functions on graphs which is invariant under node label permutation. They show that these quantities appear to yield good features for learning. Sanjoy Dasgupta and Daniel Hsu, Hierarchical Sampling for Active Learning. This is the first published practical consistent active learning algorithm. The abstract is also pretty impressive. Lihong Li, Michael Littman, and Thomas Walsh, Knows What It Knows: A Framework For Self-Aware Learning. This is an attempt to create learning algorithms that know when they err (other work includes Vovk). It’s not yet clear to me what the right model for feature-dependent confidence intervals is. Novi Quadrianto, Alex Smola, Tiberio Caetano, and Quoc Viet Le, Estimating Labels from Label Proportions. This is an example of learning in a speciali
3 0.54645926 483 hunch net-2013-06-10-The Large Scale Learning class notes
Introduction: The large scale machine learning class I taught with Yann LeCun has finished. As I expected, it took quite a bit of time. We had about 25 people attending in person on average and 400 regularly watching the recorded lectures, which is substantially more sustained interest than I expected for an advanced ML class. We also had some fun with class projects—I’m hopeful that several will eventually turn into papers. I expect there are a number of professors interested in lecturing on this and related topics. Everyone will have their personal taste in subjects of course, but hopefully there will be some convergence to common course materials as well. To help with this, I am making the sources to my presentations available. Feel free to use/improve/embellish/ridicule/etc… in the pursuit of the perfect course.
4 0.53685576 373 hunch net-2009-10-03-Static vs. Dynamic multiclass prediction
Introduction: I have had interesting discussions about the distinction between static vs. dynamic classes with Kishore and Hal. The distinction arises in multiclass prediction settings. A static set of classes is given by a set of labels {1,…,k} and the goal is generally to choose the most likely label given features. The static approach is the one that we typically analyze and think about in machine learning. The dynamic setting is one that is often used in practice. The basic idea is that the number of classes is not fixed, varying on a per-example basis. These different classes are generally defined by a choice of features. The distinction between these two settings, as far as theory goes, appears to be very substantial. For example, in the static setting, in learning reductions land, we have techniques now for robust O(log(k)) time prediction in many multiclass setting variants. In the dynamic setting, the best techniques known are O(k), and furthermore this exponential
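A minimal sketch of where the O(log(k)) static-setting claim comes from: arrange the k labels as leaves of a balanced tree and route each example through one binary decision per level, as in filter-tree-style reductions. The node_predict interface below is a hypothetical stand-in for trained binary classifiers.

```python
def tree_predict(node_predict, x, lo, hi):
    """Predict one of labels lo..hi-1 using O(log k) binary decisions:
    each internal node's classifier routes x to the left or right half
    of the remaining label range. node_predict(lo, mid, hi, x) is an
    assumed interface returning True to go right."""
    while hi - lo > 1:
        mid = (lo + hi) // 2
        lo, hi = (mid, hi) if node_predict(lo, mid, hi, x) else (lo, mid)
    return lo

# Toy "classifiers": route toward the label encoded in x itself.
# 8 labels require only 3 binary decisions.
print(tree_predict(lambda lo, mid, hi, x: x >= mid, x=5, lo=0, hi=8))  # 5
```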
5 0.53402334 344 hunch net-2009-02-22-Effective Research Funding
Introduction: With a worldwide recession on, my impression is that the carnage in research has not been as severe as might be feared, at least in the United States. I know of two notable negative impacts: It’s quite difficult to get a job this year, as many companies and universities simply aren’t hiring. This is particularly tough on graduating students. Perhaps 10% of IBM Research was fired. In contrast, around the time of the dot com bust, AT&T Research and Lucent had one or several 50% size firings wiping out much of the remainder of Bell Labs, triggering a notable diaspora for the respected machine learning group there. As the recession progresses, we may easily see more firings as companies in particular reach a point where they can no longer support research. There are a couple of positives to the recession as well. Both the implosion of Wall Street (which siphoned off smart people) and the general difficulty of getting a job coming out of an undergraduate education s
6 0.53176516 127 hunch net-2005-11-02-Progress in Active Learning
7 0.51365405 456 hunch net-2012-02-24-ICML+50%
8 0.50668919 389 hunch net-2010-02-26-Yahoo! ML events
9 0.49983209 360 hunch net-2009-06-15-In Active Learning, the question changes
10 0.49687952 105 hunch net-2005-08-23-(Dis)similarities between academia and open source programmers
11 0.4937253 132 hunch net-2005-11-26-The Design of an Optimal Research Environment
12 0.49267161 36 hunch net-2005-03-05-Funding Research
13 0.49159336 30 hunch net-2005-02-25-Why Papers?
14 0.48967034 478 hunch net-2013-01-07-NYU Large Scale Machine Learning Class
15 0.48482695 220 hunch net-2006-11-27-Continuizing Solutions
16 0.48334569 406 hunch net-2010-08-22-KDD 2010
17 0.48326999 464 hunch net-2012-05-03-Microsoft Research, New York City
18 0.47906008 267 hunch net-2007-10-17-Online as the new adjective
19 0.47774753 57 hunch net-2005-04-16-Which Assumptions are Reasonable?
20 0.47745737 371 hunch net-2009-09-21-Netflix finishes (and starts)