hunch_net hunch_net-2010 hunch_net-2010-390 knowledge-graph by maker-knowledge-mining

390 hunch net-2010-03-12-Netflix Challenge 2 Canceled


meta info for this blog

Source: html

Introduction: The second Netflix prize is canceled due to privacy problems. I continue to believe my original assessment of this paper, that the privacy break was somewhat overstated. I still haven’t seen any serious privacy failures on the scale of the AOL search log release. I expect privacy concerns to continue to be a big issue when dealing with data releases by companies or governments. The theory of maintaining privacy while using data is improving, but it is not yet in a state where the limits of what’s possible are clear, let alone how to achieve these limits in a manner friendly to a prediction competition.
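The "theory of maintaining privacy while using data" the post refers to includes differential privacy, one of the formal tools developed for exactly this problem. The sketch below is purely illustrative, not anything from the post: the dataset, function names, and epsilon value are all hypothetical, and it shows only the simplest instance (a noisy counting query via the Laplace mechanism).

```python
import math
import random

# Illustrative sketch of the Laplace mechanism from differential privacy.
# All names and data here are hypothetical, chosen for the example.

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sampling of a Laplace(0, scale) random variate.
    u = random.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon: float) -> float:
    # A counting query has sensitivity 1: adding or removing one
    # person's record changes the count by at most 1, so Laplace noise
    # with scale 1/epsilon yields epsilon-differential privacy.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Toy "ratings" dataset: (user, rating) pairs.
ratings = [("user1", 5), ("user2", 3), ("user3", 5)]
noisy = private_count(ratings, lambda r: r[1] == 5, epsilon=0.5)
print(noisy)  # noisy count near the true value of 2
```

The tension the post points at is visible even here: the noise that protects individuals also degrades the answers, which is awkward for a competition scored on prediction accuracy.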


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The second Netflix prize is canceled due to privacy problems. [sent-1, score-0.946]

2 I continue to believe my original assessment of this paper, that the privacy break was somewhat overstated. [sent-2, score-1.498]

3 I still haven’t seen any serious privacy failures on the scale of the AOL search log release. [sent-3, score-1.461]

4 I expect privacy concerns to continue to be a big issue when dealing with data releases by companies or governments. [sent-4, score-1.597]

5 The theory of maintaining privacy while using data is improving, but it is not yet in a state where the limits of what’s possible are clear let alone how to achieve these limits in a manner friendly to a prediction competition. [sent-5, score-2.294]


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('privacy', 0.632), ('continue', 0.248), ('limits', 0.239), ('assessment', 0.195), ('aol', 0.181), ('maintaining', 0.163), ('alone', 0.157), ('failures', 0.151), ('break', 0.138), ('release', 0.135), ('prize', 0.129), ('concerns', 0.126), ('netflix', 0.126), ('companies', 0.124), ('dealing', 0.124), ('manner', 0.12), ('original', 0.118), ('competition', 0.116), ('improving', 0.112), ('search', 0.104), ('data', 0.1), ('haven', 0.098), ('somewhat', 0.096), ('issue', 0.096), ('serious', 0.096), ('achieve', 0.091), ('state', 0.091), ('let', 0.09), ('scale', 0.089), ('log', 0.089), ('seen', 0.084), ('clear', 0.082), ('still', 0.081), ('second', 0.077), ('expect', 0.075), ('big', 0.072), ('believe', 0.071), ('due', 0.065), ('yet', 0.063), ('prediction', 0.06), ('possible', 0.06), ('theory', 0.059), ('using', 0.048), ('paper', 0.047), ('problems', 0.043)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000001 390 hunch net-2010-03-12-Netflix Challenge 2 Canceled


2 0.29511073 260 hunch net-2007-08-25-The Privacy Problem

Introduction: Machine Learning is rising in importance because data is being collected for all sorts of tasks where it either wasn’t previously collected, or for tasks that did not previously exist. While this is great for Machine Learning, it has a downside—the massive data collection which is so useful can also lead to substantial privacy problems. It’s important to understand that this is a much harder problem than many people appreciate. The AOL data release is a good example. To those doing machine learning, the following strategies might be obvious: Just delete any names or other obviously personally identifiable information. The logic here seems to be “if I can’t easily find the person then no one can”. That doesn’t work as demonstrated by the people who were found circumstantially from the AOL data. … then just hash all the search terms! The logic here is “if I can’t read it, then no one can”. It’s also trivially broken by a dictionary attack—just hash all the strings
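The dictionary-attack point above can be made concrete in a few lines. This is an illustrative toy, assuming hypothetical queries and helper names: hashing search terms fails because an attacker can hash every plausible candidate string and look the released hashes up.

```python
import hashlib

# Toy demonstration of a dictionary attack on a hashed search log.
# The queries and names below are hypothetical, for illustration only.

def hash_term(term: str) -> str:
    return hashlib.sha256(term.encode("utf-8")).hexdigest()

# The "anonymized" log as released: only hashes of the original queries.
released_log = [hash_term(q) for q in ["cheap flights", "knee pain"]]

# The attacker's dictionary of candidate queries (in practice, millions
# of common strings scraped from the web or other query logs).
dictionary = ["weather", "cheap flights", "knee pain", "netflix prize"]
lookup = {hash_term(t): t for t in dictionary}

# Recover the original queries by simple lookup.
recovered = [lookup.get(h) for h in released_log]
print(recovered)  # → ['cheap flights', 'knee pain']
```

Since hashing is deterministic, "can't read it" is not the same as "can't match it": any term in the attacker's dictionary is recovered exactly.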

3 0.19041228 423 hunch net-2011-02-02-User preferences for search engines

Introduction: I want to comment on the “Bing copies Google” discussion here, here, and here, because there are data-related issues which the general public may not understand, and some of the framing seems substantially misleading to me. As a not-distant-outsider, let me mention the sources of bias I may have. I work at Yahoo!, which has started using Bing. This might predispose me towards Bing, but on the other hand I’m still at Yahoo!, and have been using Linux exclusively as an OS for many years, including even a couple minor kernel patches. And, on the gripping hand, I’ve spent quite a bit of time thinking about the basic principles of incorporating user feedback in machine learning. Also note, this post is not related to official Yahoo! policy, it’s just my personal view. The issue: Google engineers inserted synthetic responses to synthetic queries on google.com, then executed the synthetic searches on google.com using Internet Explorer with the Bing toolbar and later

4 0.16120297 406 hunch net-2010-08-22-KDD 2010

Introduction: There were several papers that seemed fairly interesting at KDD this year. The ones that caught my attention are: Xin Jin, Mingyang Zhang, Nan Zhang, and Gautam Das, Versatile Publishing For Privacy Preservation. This paper provides a conservative method for safely determining which data is publishable from any complete source of information (for example, a hospital) such that it does not violate privacy rules stated in a natural language. It is not differentially private, so no external sources of join information can exist. However, it is a mechanism for publishing data rather than (say) the output of a learning algorithm. Arik Friedman and Assaf Schuster, Data Mining with Differential Privacy. This paper shows how to create effective differentially private decision trees. Progress in differentially private datamining is pretty impressive, as it was defined in 2006. David Chan, Rong Ge, Ori Gershony, Tim Hesterberg, Diane Lambert, Evaluating Online Ad Camp

5 0.14283222 200 hunch net-2006-08-03-AOL’s data drop

Introduction: AOL has released several large search engine related datasets. This looks like a pretty impressive data release, and it is a big opportunity for people everywhere to worry about search engine related learning problems, if they want.

6 0.13640821 364 hunch net-2009-07-11-Interesting papers at KDD

7 0.13018453 430 hunch net-2011-04-11-The Heritage Health Prize

8 0.10855101 420 hunch net-2010-12-26-NIPS 2010

9 0.10263103 336 hunch net-2009-01-19-Netflix prize within epsilon

10 0.098898366 275 hunch net-2007-11-29-The Netflix Crack

11 0.093376003 125 hunch net-2005-10-20-Machine Learning in the News

12 0.087886147 371 hunch net-2009-09-21-Netflix finishes (and starts)

13 0.078579754 169 hunch net-2006-04-05-What is state?

14 0.071888186 268 hunch net-2007-10-19-Second Annual Reinforcement Learning Competition

15 0.070693545 120 hunch net-2005-10-10-Predictive Search is Coming

16 0.069625244 389 hunch net-2010-02-26-Yahoo! ML events

17 0.067300275 332 hunch net-2008-12-23-Use of Learning Theory

18 0.065159552 456 hunch net-2012-02-24-ICML+50%

19 0.064583391 377 hunch net-2009-11-09-NYAS ML Symposium this year.

20 0.063023776 301 hunch net-2008-05-23-Three levels of addressing the Netflix Prize


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.12), (1, 0.006), (2, -0.038), (3, 0.013), (4, -0.002), (5, 0.008), (6, -0.095), (7, 0.024), (8, -0.03), (9, -0.07), (10, -0.083), (11, 0.22), (12, -0.07), (13, 0.088), (14, -0.126), (15, 0.016), (16, -0.036), (17, -0.023), (18, 0.134), (19, -0.108), (20, -0.023), (21, -0.065), (22, 0.017), (23, 0.035), (24, -0.112), (25, 0.105), (26, 0.056), (27, 0.106), (28, 0.02), (29, 0.059), (30, -0.012), (31, -0.083), (32, -0.123), (33, -0.01), (34, -0.092), (35, -0.102), (36, 0.139), (37, -0.009), (38, 0.072), (39, -0.013), (40, 0.012), (41, 0.133), (42, 0.016), (43, -0.079), (44, -0.027), (45, 0.012), (46, -0.016), (47, -0.14), (48, -0.041), (49, -0.016)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99165738 390 hunch net-2010-03-12-Netflix Challenge 2 Canceled


2 0.7349934 260 hunch net-2007-08-25-The Privacy Problem


3 0.67069888 423 hunch net-2011-02-02-User preferences for search engines


4 0.6700213 200 hunch net-2006-08-03-AOL’s data drop


5 0.58848292 406 hunch net-2010-08-22-KDD 2010


6 0.55824745 364 hunch net-2009-07-11-Interesting papers at KDD

7 0.54771334 275 hunch net-2007-11-29-The Netflix Crack

8 0.49440706 430 hunch net-2011-04-11-The Heritage Health Prize

9 0.48454252 408 hunch net-2010-08-24-Alex Smola starts a blog

10 0.46749201 125 hunch net-2005-10-20-Machine Learning in the News

11 0.42797968 217 hunch net-2006-11-06-Data Linkage Problems

12 0.40075481 336 hunch net-2009-01-19-Netflix prize within epsilon

13 0.39066803 155 hunch net-2006-02-07-Pittsburgh Mind Reading Competition

14 0.38070938 420 hunch net-2010-12-26-NIPS 2010

15 0.37372175 444 hunch net-2011-09-07-KDD and MUCMD 2011

16 0.35657647 312 hunch net-2008-08-04-Electoralmarkets.com

17 0.34710526 120 hunch net-2005-10-10-Predictive Search is Coming

18 0.34554699 301 hunch net-2008-05-23-Three levels of addressing the Netflix Prize

19 0.34430823 371 hunch net-2009-09-21-Netflix finishes (and starts)

20 0.34329024 156 hunch net-2006-02-11-Yahoo’s Learning Problems.


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(27, 0.125), (55, 0.033), (94, 0.09), (95, 0.601)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.98819435 390 hunch net-2010-03-12-Netflix Challenge 2 Canceled


2 0.97274977 479 hunch net-2013-01-31-Remote large scale learning class participation

Introduction: Yann and I have arranged so that people who are interested in our large scale machine learning class and not able to attend in person can follow along via two methods. Videos will be posted with about a 1 day delay on techtalks. This is a side-by-side capture of video+slides from Weyond. We are experimenting with Piazza as a discussion forum. Anyone is welcome to subscribe to Piazza and ask questions there, where I will be monitoring things. update2: Sign up here. The first lecture is up now, including the revised version of the slides which fixes a few typos and rounds out references.

3 0.97219616 319 hunch net-2008-10-01-NIPS 2008 workshop on ‘Learning over Empirical Hypothesis Spaces’

Introduction: This workshop asks how far we can push the theoretical boundary of using data in the design of learning machines. Can we express our classification rule in terms of the sample, or do we have to stick to a core assumption of classical statistical learning theory, namely that the hypothesis space is to be defined independently of the sample? This workshop is particularly interested in, but not restricted to, the ‘luckiness framework’ and the recently introduced notion of ‘compatibility functions’ in a semi-supervised learning context (more information can be found at http://www.kuleuven.be/wehys).

4 0.9322927 30 hunch net-2005-02-25-Why Papers?

Introduction: Makc asked a good question in comments: “Why bother to make a paper, at all?” There are several reasons for writing papers which may not be immediately obvious to people not in academia. The basic idea is that papers have considerably more utility than the obvious “present an idea”. Papers are formalized units of work. Academics (especially young ones) are often judged on the number of papers they produce. Papers have a formalized method of citing and crediting others: the bibliography. Academics (especially older ones) are often judged on the number of citations they receive. Papers enable a “more fair” anonymous review. Conferences receive many papers, from which a subset are selected. Discussion forums are inherently not anonymous for anyone who wants to build a reputation for good work. Papers are an excuse to meet your friends. Papers are the content of conferences, but much of what you do is talk to friends about interesting problems while there. Sometimes yo

5 0.90330648 389 hunch net-2010-02-26-Yahoo! ML events

Introduction: Yahoo! is sponsoring two machine learning events that might interest people. The Key Scientific Challenges program (due March 5) for Machine Learning and Statistics offers $5K (plus bonuses) for graduate students working on a core problem of interest to Y! If you are already working on one of these problems, there is no reason not to submit, and if you aren’t you might want to think about it for next year, as I am confident they all press the boundary of the possible in Machine Learning. There are 7 days left. The Learning to Rank challenge (due May 31) offers an $8K first prize for the best ranking algorithm on a real (and really used) dataset for search ranking, with presentations at an ICML workshop. Unlike the Netflix competition, there are prizes for 2nd, 3rd, and 4th place, perhaps avoiding the heartbreak the ensemble encountered. If you think you know how to rank, you should give it a try, and we might all learn something. There are 3 months left.

6 0.85582876 127 hunch net-2005-11-02-Progress in Active Learning

7 0.85326374 456 hunch net-2012-02-24-ICML+50%

8 0.78680855 344 hunch net-2009-02-22-Effective Research Funding

9 0.76327199 373 hunch net-2009-10-03-Static vs. Dynamic multiclass prediction

10 0.70216674 462 hunch net-2012-04-20-Both new: STOC workshops and NEML

11 0.66207588 234 hunch net-2007-02-22-Create Your Own ICML Workshop

12 0.6508795 105 hunch net-2005-08-23-(Dis)similarities between academia and open source programmers

13 0.57206607 7 hunch net-2005-01-31-Watchword: Assumption

14 0.51786166 445 hunch net-2011-09-28-Somebody’s Eating Your Lunch

15 0.51638925 455 hunch net-2012-02-20-Berkeley Streaming Data Workshop

16 0.5090189 275 hunch net-2007-11-29-The Netflix Crack

17 0.4899863 464 hunch net-2012-05-03-Microsoft Research, New York City

18 0.48615605 290 hunch net-2008-02-27-The Stats Handicap

19 0.48536554 36 hunch net-2005-03-05-Funding Research

20 0.48515642 301 hunch net-2008-05-23-Three levels of addressing the Netflix Prize