hunch_net-2009-364: Interesting papers at KDD (hunch.net, 2009-07-11)
Source: html
Introduction: I attended KDD this year. The conference has always had a strong grounding in what works, based on the KDDcup, but it has developed a halo of workshops on various subjects. It seems that KDD has become a place where the economy meets machine learning in a stronger sense than at many other conferences. There were several papers that other people might like to take a look at.

Yehuda Koren, Collaborative Filtering with Temporal Dynamics. This paper describes how to incorporate temporal dynamics into a couple of collaborative filtering approaches. It also won a best paper award.
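As a rough illustration of what incorporating temporal dynamics into a collaborative filtering baseline can look like (a generic sketch, not the paper's actual model; the drift form, the beta value, and all names are assumptions), here is a predictor whose user bias is allowed to drift over time:

```python
# Minimal sketch of a time-aware baseline predictor for collaborative filtering.
# Not the paper's model; the drift form and parameter names are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class TemporalBaseline:
    mu: float                                            # global mean rating
    user_bias: dict = field(default_factory=dict)        # b_u
    item_bias: dict = field(default_factory=dict)        # b_i
    user_drift: dict = field(default_factory=dict)       # alpha_u, per-user drift rate
    user_mean_time: dict = field(default_factory=dict)   # t_u, mean rating time per user
    beta: float = 0.4                                     # how quickly the drift saturates

    def time_deviation(self, user, t):
        """Signed, sublinear deviation of time t from the user's mean rating time."""
        dt = t - self.user_mean_time.get(user, t)
        return (1.0 if dt >= 0 else -1.0) * abs(dt) ** self.beta

    def predict(self, user, item, t):
        """Baseline prediction with a user bias that drifts over time."""
        return (self.mu
                + self.user_bias.get(user, 0.0)
                + self.user_drift.get(user, 0.0) * self.time_deviation(user, t)
                + self.item_bias.get(item, 0.0))

# Example: a user whose ratings trend upward, queried at a time later than t_u.
model = TemporalBaseline(mu=3.6,
                         user_bias={"u1": 0.2}, item_bias={"i1": -0.1},
                         user_drift={"u1": 0.05}, user_mean_time={"u1": 100.0})
print(model.predict("u1", "i1", t=180.0))
```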
D. Sculley, Robert Malkin, Sugato Basu, Roberto J. Bayardo, Predicting Bounce Rates in Sponsored Search Advertisements. The basic claim of this paper is that the probability that people immediately leave (“bounce”) after clicking on an advertisement is predictable.
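To make the claim concrete, bounce-rate prediction can be framed as ordinary binary classification over features of the ad, query, and landing page. The sketch below (the features, toy data, and the choice of logistic regression are my own illustrative assumptions, not the paper's method) estimates a bounce probability for a new click:

```python
# Toy framing of bounce-rate prediction as binary classification.
# Features, data, and model choice are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [ad relevance score, landing-page load time (s), query/ad word overlap]
X = np.array([[0.9, 1.2, 0.8],
              [0.2, 4.5, 0.1],
              [0.7, 2.0, 0.6],
              [0.1, 5.0, 0.0],
              [0.8, 1.0, 0.9],
              [0.3, 3.5, 0.2]])
y = np.array([0, 1, 0, 1, 0, 1])  # 1 = the user bounced after clicking

model = LogisticRegression().fit(X, y)
# Predicted probability of a bounce for a new ad impression.
print(model.predict_proba([[0.5, 2.5, 0.4]])[0, 1])
```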
Frank McSherry and Ilya Mironov, Differentially Private Recommender Systems: Building Privacy into the Netflix Prize Contenders. The basic claim here is that it is possible to beat the baseline system in Netflix and preserve a nontrivial amount of user privacy. It’s the first demonstration I’ve seen of this sort, and it’s particularly impressive that they used a strong algorithm-independent definition of privacy, which Cynthia Dwork first stated.
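For reference, the algorithm-independent definition in question is differential privacy. A standard statement (the paper may use a relaxed (ε, δ) variant) is:

```latex
% epsilon-differential privacy, as first stated by Dwork and coauthors.
A randomized mechanism $M$ is $\varepsilon$-differentially private if, for every
pair of datasets $D, D'$ differing in the data of a single user, and for every
set of outputs $S \subseteq \mathrm{Range}(M)$,
\[
  \Pr[M(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[M(D') \in S] .
\]
```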
KDD also experimented this year with crowdvine, which was interesting. Compared to Mark Reid’s efforts with ICML, they managed to get substantially more participation. The biggest drawback I found was that the papers themselves were not integrated into the website.