200 hunch net-2006-08-03-AOL’s data drop

meta info for this blog

Source: html

Introduction: AOL has released several large search engine related datasets. This looks like a pretty impressive data release, and it is a big opportunity for people everywhere to worry about search engine related learning problems, if they want.

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 AOL has released several large search engine related datasets. [sent-1, score-1.492]

2 This looks like a pretty impressive data release, and it is a big opportunity for people everywhere to worry about search engine related learning problems, if they want. [sent-2, score-2.763]

similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('engine', 0.527), ('search', 0.352), ('aol', 0.305), ('everywhere', 0.274), ('worry', 0.254), ('related', 0.25), ('release', 0.227), ('released', 0.222), ('impressive', 0.222), ('looks', 0.198), ('opportunity', 0.195), ('pretty', 0.15), ('big', 0.121), ('want', 0.096), ('data', 0.084), ('large', 0.078), ('problems', 0.072), ('like', 0.065), ('several', 0.063), ('people', 0.051), ('learning', 0.02)]
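The page does not say how these weights were produced. As a rough illustration only, the sketch below computes L2-normalized TF-IDF weights for a toy corpus and compares documents by cosine similarity. The function names `tfidf_weights` and `cosine`, the whitespace tokenizer, and the smoothed IDF formula are all assumptions, not the site's actual pipeline.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {word: weight} dict per document, L2-normalized.

    Assumed scheme: raw term count times a smoothed inverse document
    frequency, log((1 + N) / (1 + df)) + 1.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    result = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {
            word: count * (math.log((1 + n_docs) / (1 + df[word])) + 1.0)
            for word, count in tf.items()
        }
        # Normalize so that cosine similarity reduces to a dot product.
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        result.append({word: v / norm for word, v in raw.items()})
    return result


def cosine(a, b):
    """Dot product of two L2-normalized sparse weight dicts."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


# Hypothetical toy corpus, not the real blog posts.
docs = [
    "aol search engine data",
    "search engine ranking",
    "learning theory",
]
weights = tfidf_weights(docs)
print(sorted(weights[0].items(), key=lambda kv: -kv[1]))
print(cosine(weights[0], weights[1]), cosine(weights[0], weights[2]))
```

Words that appear in fewer documents get a larger weight, which is consistent with rarer terms such as 'aol' and 'everywhere' ranking far above 'learning' in the list above.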

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0 200 hunch net-2006-08-03-AOL’s data drop

2 0.1919615 423 hunch net-2011-02-02-User preferences for search engines

Introduction: I want to comment on the “Bing copies Google” discussion here, here, and here, because there are data-related issues which the general public may not understand, and some of the framing seems substantially misleading to me. As a not-distant-outsider, let me mention the sources of bias I may have. I work at Yahoo!, which has started using Bing. This might predispose me towards Bing, but on the other hand I’m still at Yahoo!, and have been using Linux exclusively as an OS for many years, including even a couple minor kernel patches. And, on the gripping hand, I’ve spent quite a bit of time thinking about the basic principles of incorporating user feedback in machine learning. Also note, this post is not related to official Yahoo! policy, it’s just my personal view. The issue: Google engineers inserted synthetic responses to synthetic queries on google.com, then executed the synthetic searches on google.com using Internet Explorer with the Bing toolbar and later

3 0.16348863 156 hunch net-2006-02-11-Yahoo’s Learning Problems.

Introduction: I just visited Yahoo Research which has several fundamental learning problems near to (or beyond) the set of problems we know how to solve well. Here are 3 of them. Ranking: This is the canonical problem of all search engines. It is made extra difficult for several reasons. There is relatively little “good” supervised learning data and a great deal of data with some signal (such as click through rates). The learning must occur in a partially adversarial environment. Many people very actively attempt to place themselves at the top of rankings. It is not even quite clear whether the problem should be posed as ‘ranking’ or as ‘regression’ which is then used to produce a ranking. Collaborative filtering: Yahoo has a large number of recommendation systems for music, movies, etc… In these sorts of systems, users specify how they liked a set of things, and then the system can (hopefully) find some more examples of things they might like by reasoning across multiple

4 0.15537947 120 hunch net-2005-10-10-Predictive Search is Coming

Introduction: “Search” is the other branch of AI research which has been successful. Concrete examples include Deep Blue which beat the world chess champion and Chinook the champion checkers program. A set of core search techniques exist including A*, alpha-beta pruning, and others that can be applied to any of many different search problems. Given this, it may be surprising to learn that there has been relatively little successful work on combining prediction and search. Given also that humans typically solve search problems using a number of predictive heuristics to narrow in on a solution, we might be surprised again. However, the big successful search-based systems have typically not used “smart” search algorithms. Instead they have optimized for very fast search. This is not for lack of trying… many people have tried to synthesize search and prediction to various degrees of success. For example, Knightcap achieves good-but-not-stellar chess playing performance, and TD-gammon

5 0.14283222 390 hunch net-2010-03-12-Netflix Challenge 2 Canceled

Introduction: The second Netflix prize is canceled due to privacy problems. I continue to believe my original assessment of this paper, that the privacy break was somewhat overstated. I still haven’t seen any serious privacy failures on the scale of the AOL search log release. I expect privacy concerns to continue to be a big issue when dealing with data releases by companies or governments. The theory of maintaining privacy while using data is improving, but it is not yet in a state where the limits of what’s possible are clear let alone how to achieve these limits in a manner friendly to a prediction competition.

6 0.12191123 284 hunch net-2008-01-18-Datasets

7 0.10297895 345 hunch net-2009-03-08-Prediction Science

8 0.097962365 260 hunch net-2007-08-25-The Privacy Problem

9 0.096672006 418 hunch net-2010-12-02-Traffic Prediction Problem

10 0.083711043 92 hunch net-2005-07-11-AAAI blog

11 0.076240852 178 hunch net-2006-05-08-Big machine learning

12 0.075248294 264 hunch net-2007-09-30-NIPS workshops are out.

13 0.059065904 269 hunch net-2007-10-24-Contextual Bandits

14 0.057657845 424 hunch net-2011-02-17-What does Watson mean?

15 0.056404576 378 hunch net-2009-11-15-The Other Online Learning

16 0.055055402 93 hunch net-2005-07-13-“Sister Conference” presentations

17 0.053094659 215 hunch net-2006-10-22-Exemplar programming

18 0.052955128 406 hunch net-2010-08-22-KDD 2010

19 0.052243423 58 hunch net-2005-04-21-Dynamic Programming Generalizations and Their Use

20 0.05054082 436 hunch net-2011-06-22-Ultra LDA

similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.084), (1, 0.003), (2, -0.078), (3, 0.024), (4, -0.009), (5, -0.007), (6, -0.031), (7, 0.002), (8, -0.001), (9, -0.022), (10, -0.074), (11, 0.14), (12, -0.031), (13, 0.085), (14, -0.138), (15, 0.038), (16, -0.04), (17, 0.004), (18, 0.072), (19, -0.096), (20, 0.133), (21, -0.031), (22, 0.027), (23, 0.038), (24, -0.011), (25, 0.147), (26, 0.095), (27, 0.03), (28, -0.014), (29, 0.051), (30, -0.026), (31, 0.04), (32, -0.093), (33, -0.021), (34, 0.013), (35, -0.038), (36, 0.037), (37, 0.013), (38, 0.067), (39, 0.061), (40, 0.0), (41, 0.066), (42, -0.07), (43, -0.085), (44, 0.13), (45, -0.001), (46, -0.053), (47, 0.004), (48, -0.033), (49, -0.038)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97233593 200 hunch net-2006-08-03-AOL’s data drop

2 0.76938975 423 hunch net-2011-02-02-User preferences for search engines

3 0.66801697 390 hunch net-2010-03-12-Netflix Challenge 2 Canceled

4 0.64776152 156 hunch net-2006-02-11-Yahoo’s Learning Problems.

5 0.56311893 178 hunch net-2006-05-08-Big machine learning

Introduction: According to the New York Times, Yahoo is releasing Project Panama shortly. Project Panama is about better predicting which advertisements are relevant to a search, implying a higher click through rate, implying larger income for Yahoo. There are two things that seem interesting here: A significant portion of that improved accuracy is almost certainly machine learning at work. The quantitative effect is huge: the estimate in the article is $600*10^6. Google already has such improvements and Microsoft Search is surely working on them, which suggests this is (perhaps) a $10^9 per year machine learning problem. The exact methodology under use is unlikely to be publicly discussed in the near future because of the competitive environment. Hopefully we’ll have some public “war stories” at some point in the future when this information becomes less sensitive. For now, it’s reassuring to simply note that machine learning is having a big impact.

6 0.53752536 120 hunch net-2005-10-10-Predictive Search is Coming

7 0.53668767 260 hunch net-2007-08-25-The Privacy Problem

8 0.49634403 364 hunch net-2009-07-11-Interesting papers at KDD

9 0.48953658 312 hunch net-2008-08-04-Electoralmarkets.com

10 0.46938974 406 hunch net-2010-08-22-KDD 2010

11 0.43290374 420 hunch net-2010-12-26-NIPS 2010

12 0.41716412 418 hunch net-2010-12-02-Traffic Prediction Problem

13 0.38775173 345 hunch net-2009-03-08-Prediction Science

14 0.37634468 10 hunch net-2005-02-02-Kolmogorov Complexity and Googling

15 0.34521252 125 hunch net-2005-10-20-Machine Learning in the News

16 0.3346861 444 hunch net-2011-09-07-KDD and MUCMD 2011

17 0.31432009 446 hunch net-2011-10-03-Monday announcements

18 0.29555047 58 hunch net-2005-04-21-Dynamic Programming Generalizations and Their Use

19 0.28699574 464 hunch net-2012-05-03-Microsoft Research, New York City

20 0.28490007 264 hunch net-2007-09-30-NIPS workshops are out.

similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(27, 0.166), (84, 0.448), (94, 0.185)]
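The topic vector above is sparse: only the topics with non-negligible weight are listed as `(topicId, topicWeight)` pairs. The page does not state which similarity measure produced the `simValue` column, so the sketch below assumes plain cosine similarity over such sparse vectors. The function name `sparse_cosine` and the second vector are hypothetical.

```python
import math


def sparse_cosine(u, v):
    """Cosine similarity of two sparse vectors given as (id, weight) pairs.

    Ids missing from a vector are treated as weight 0.
    """
    u_map, v_map = dict(u), dict(v)
    dot = sum(w * v_map.get(topic, 0.0) for topic, w in u_map.items())
    norm_u = math.sqrt(sum(w * w for w in u_map.values()))
    norm_v = math.sqrt(sum(w * w for w in v_map.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # Similarity with an empty vector is defined as 0 here.
    return dot / (norm_u * norm_v)


# Topic vector listed above for this post.
this_post = [(27, 0.166), (84, 0.448), (94, 0.185)]
# Hypothetical vector for another post, made up for illustration.
other_post = [(27, 0.30), (84, 0.10), (51, 0.40)]

print(sparse_cosine(this_post, other_post))
```

Under exact cosine similarity a post compared with itself would score 1.0, whereas the same-blog rows in the lsi and lda lists score slightly below 1, so the site's computation likely differs in some detail from this sketch.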

similar blogs list:

simIndex simValue blogId blogTitle

1 0.90686375 467 hunch net-2012-06-15-Normal Deviate and the UCSC Machine Learning Summer School

Introduction: Larry Wasserman has started the Normal Deviate blog which I added to the blogroll on the right. Manfred Warmuth points out the UCSC machine learning summer school running July 9-20 which may be of particular interest to those in silicon valley.

same-blog 2 0.89402938 200 hunch net-2006-08-03-AOL’s data drop

3 0.867544 383 hunch net-2009-12-09-Inherent Uncertainty

Introduction: I’d like to point out Inherent Uncertainty, which I’ve added to the ML blog post scanner on the right. My understanding from Jake is that the intention is to have a multiauthor blog which is more specialized towards learning theory/game theory than this one. Nevertheless, several of the posts seem to be of wider interest.

4 0.7892676 121 hunch net-2005-10-12-The unrealized potential of the research lab

Introduction: I attended the IBM research 60th anniversary. IBM research is, by any reasonable account, the industrial research lab which has managed to bring the most value to its parent company over the long term. This can be seen by simply counting the survivors: IBM research is the only older research lab which has not gone through a period of massive firing. (Note that there are also new research labs.) Despite this impressive record, IBM research has failed, by far, to achieve its potential. Examples which came up in this meeting include: It took about a decade to produce DRAM after it was invented in the lab. (In fact, Intel produced it first.) Relational databases and SQL were invented and then languished. It was only under external competition that IBM released its own relational database. Why didn’t IBM grow an Oracle division? An early lead in IP networking hardware did not result in IBM growing a Cisco division. Why not? And remember … IBM research is a s

5 0.75031483 411 hunch net-2010-09-21-Regretting the dead

Introduction: Nikos pointed out this New York Times article about poor clinical design killing people. For those of us who study learning from exploration information this is a reminder that low regret algorithms are particularly important, as regret in clinical trials is measured by patient deaths. Two obvious improvements on the experimental design are: With reasonable record keeping of existing outcomes for the standard treatments, there is no need to explicitly assign people to a control group with the standard treatment, as that approach is effectively explored with great certainty. Asserting otherwise would imply that the nature of effective treatments for cancer has changed between now and a year ago, which denies the value of any clinical trial. An optimal experimental design will smoothly phase between exploration and exploitation as evidence for a new treatment shows that it can be effective. This is old tech, for example in the EXP3.P algorithm (page 12 aka 59) although

6 0.69171494 219 hunch net-2006-11-22-Explicit Randomization in Learning algorithms

7 0.62434363 142 hunch net-2005-12-22-Yes, I am applying

8 0.55235791 337 hunch net-2009-01-21-Nearly all natural problems require nonlinearity

9 0.48069525 81 hunch net-2005-06-13-Wikis for Summer Schools and Workshops

10 0.47331706 120 hunch net-2005-10-10-Predictive Search is Coming

11 0.47313514 346 hunch net-2009-03-18-Parallel ML primitives

12 0.47220087 35 hunch net-2005-03-04-The Big O and Constants in Learning

13 0.47071254 276 hunch net-2007-12-10-Learning Track of International Planning Competition

14 0.46208224 115 hunch net-2005-09-26-Prediction Bounds as the Mathematics of Science

15 0.45661744 42 hunch net-2005-03-17-Going all the Way, Sometimes

16 0.44871068 314 hunch net-2008-08-24-Mass Customized Medicine in the Future?

17 0.44083217 229 hunch net-2007-01-26-Parallel Machine Learning Problems

18 0.44003165 221 hunch net-2006-12-04-Structural Problems in NIPS Decision Making

19 0.4351711 136 hunch net-2005-12-07-Is the Google way the way for machine learning?

20 0.42934439 281 hunch net-2007-12-21-Vowpal Wabbit Code Release