
223 hunch net-2006-12-06-The Spam Problem

meta infos for this blog

Source: html

Introduction: The New York Times has an article on the growth of spam. Interesting facts include: 9/10 of all email is spam, spam source identification is nearly useless due to botnet spam senders, and image-based spam (emails which consist of an image only) is on the rise. Estimates of the cost of spam are almost certainly far too low, because they do not account for the cost in time lost by people. The image-based spam which is currently penetrating many filters should be catchable with a more sophisticated application of machine learning technology. For the spam I see, the rendered images come in only a few formats, which would be easy to recognize via a support vector machine (with RBF kernel), neural network, or even nearest-neighbor architecture. The mechanics of setting this up to run efficiently are the only real challenge. This is the next step in the spam war. The response to this system is to make the image-based spam even more random. We should (essentially) expect to see
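To make the nearest-neighbor idea concrete, here is a minimal sketch rather than a real filter. It assumes each rendered image has already been reduced to a short vector of pixel intensities; the 2x2 "images", the labels, and the `gamma` value are all made up for illustration. The RBF kernel is used here only as a similarity score for nearest-neighbor lookup, not inside a support vector machine.

```python
import math

def rbf_similarity(a, b, gamma=0.5):
    """RBF kernel exp(-gamma * ||a - b||^2) between two equal-length pixel vectors."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)

def nearest_neighbor_label(query, labeled_examples, gamma=0.5):
    """Return the label of the labeled image most similar to `query`."""
    best_label, best_sim = None, -1.0
    for pixels, label in labeled_examples:
        sim = rbf_similarity(query, pixels, gamma)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label

# Hypothetical 2x2 rendered images, flattened to 4 intensities in [0, 1].
training = [
    ([1.0, 1.0, 0.0, 0.0], "spam"),  # made-up spam layout
    ([0.9, 1.0, 0.1, 0.0], "spam"),
    ([0.0, 0.0, 1.0, 1.0], "ham"),   # made-up legitimate image
    ([0.1, 0.0, 0.9, 1.0], "ham"),
]

print(nearest_neighbor_label([0.95, 0.9, 0.05, 0.0], training))
```

A real system would need a feature extraction step that is robust to the randomization spammers add, which is exactly where the post expects the difficulty to move.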

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 The New York Times has an article on the growth of spam. [sent-1, score-0.888]

2 Interesting facts include: 9/10 of all email is spam, spam source identification is nearly useless due to botnet spam senders, and image-based spam (emails which consist of an image only) is on the rise. [sent-2, score-3.141]

3 Estimates of the cost of spam are almost certainly far too low, because they do not account for the cost in time lost by people. [sent-3, score-1.0]

4 The image-based spam which is currently penetrating many filters should be catchable with a more sophisticated application of machine learning technology. [sent-4, score-1.15]

5 For the spam I see, the rendered images come in only a few formats, which would be easy to recognize via a support vector machine (with RBF kernel), neural network, or even nearest-neighbor architecture. [sent-5, score-0.943]

6 The mechanics of setting this up to run efficiently are the only real challenge. [sent-6, score-0.105]

7 The response to this system is to make the image-based spam even more random. [sent-8, score-1.108]

8 We should (essentially) expect to see Captcha spam, and our inability to recognize captcha spam should persist as long as the vision problem is not solved. [sent-9, score-1.11]

9 This hopefully degrades the value of spam to the spammers, but it may not reduce the value of spam to zero. [sent-10, score-1.672]

10 One simple economic solution is to transfer a small amount from a first-time sender to the receiver (10 cents? [sent-12, score-0.334]

11 If the receiver classifies the email as spam then the charge repeats on the next receipt, and otherwise it goes away. [sent-14, score-1.388]

12 There are several difficulties with this approach: How do you change a huge system in heavy use which no one controls? [sent-15, score-0.129]

13 For example, we could extend the mail protocol to include a payment system (using the “X-” [sent-18, score-0.441]

14 lines) and use the existence of a payment as a feature in existing spam-or-not prediction systems. [sent-19, score-0.382]

15 Over time, this feature may become the most useful feature, encouraging every legitimate email user to offer a small payment with the first email to a recipient. [sent-20, score-0.962]
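The payment idea in sentences 10 to 15 can be sketched in a few lines. This is an illustration under stated assumptions, not a protocol proposal: the header name `X-Payment-Cents`, the `ReceiverLedger` class, and the reading of "goes away" as "no further charge for that sender" are all hypothetical. Only the 10 cent figure and the use of an "X-" header come from the post.

```python
from email.message import EmailMessage

PAYMENT_HEADER = "X-Payment-Cents"  # hypothetical header name, not a standard
CHARGE_CENTS = 10                   # the "10 cents?" figure floated in the post

def payment_feature(msg):
    """Binary feature for a spam classifier: 1 if the message offers the charge."""
    try:
        offered = int(str(msg.get(PAYMENT_HEADER, "0")))
    except ValueError:
        offered = 0
    return 1 if offered >= CHARGE_CENTS else 0

class ReceiverLedger:
    """Tracks which senders still owe a payment with their next message."""

    def __init__(self):
        self.cleared = set()      # senders whose last message was judged not-spam
        self.collected_cents = 0

    def charge_due(self, sender):
        return 0 if sender in self.cleared else CHARGE_CENTS

    def record(self, msg, judged_spam):
        sender = str(msg["From"])
        if self.charge_due(sender) and payment_feature(msg):
            self.collected_cents += CHARGE_CENTS
        if judged_spam:
            # Assumption: a spam verdict means the charge applies again next time.
            self.cleared.discard(sender)
        else:
            # Assumption: a not-spam verdict makes the charge "go away".
            self.cleared.add(sender)

def make_msg(sender, cents=None):
    msg = EmailMessage()
    msg["From"] = sender
    if cents is not None:
        msg[PAYMENT_HEADER] = str(cents)
    return msg

ledger = ReceiverLedger()
ledger.record(make_msg("alice@example.com", 10), judged_spam=False)
ledger.record(make_msg("alice@example.com"), judged_spam=False)
ledger.record(make_msg("bulk@example.net", 10), judged_spam=True)
print(ledger.charge_due("alice@example.com"), ledger.charge_due("bulk@example.net"))
```

Note that nothing here rejects unpaid mail. In keeping with the post, `payment_feature` is just one more input to an existing spam-or-not predictor, so the scheme can be adopted gradually.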

similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('spam', 0.786), ('payment', 0.236), ('image', 0.185), ('email', 0.166), ('captcha', 0.158), ('receiver', 0.14), ('recognize', 0.105), ('feature', 0.09), ('classifies', 0.07), ('extend', 0.07), ('cents', 0.07), ('consist', 0.07), ('rbf', 0.07), ('verifiable', 0.07), ('based', 0.069), ('system', 0.068), ('include', 0.067), ('charge', 0.065), ('emails', 0.065), ('cost', 0.064), ('mechanics', 0.061), ('inability', 0.061), ('estimates', 0.061), ('heavy', 0.061), ('next', 0.059), ('offer', 0.058), ('identification', 0.058), ('encouraging', 0.058), ('repeats', 0.058), ('controls', 0.056), ('existence', 0.056), ('sophisticated', 0.056), ('growth', 0.056), ('lists', 0.054), ('mailing', 0.054), ('formats', 0.054), ('filters', 0.054), ('images', 0.052), ('economic', 0.052), ('transfer', 0.052), ('value', 0.05), ('useless', 0.05), ('user', 0.05), ('lines', 0.05), ('small', 0.048), ('article', 0.046), ('goes', 0.044), ('efficiently', 0.044), ('lost', 0.044), ('time', 0.042)]
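The page does not say which tf-idf variant produced the weights above. As a rough illustration, the following sketch uses one common form (term frequency times the log of inverse document frequency) on made-up token lists; the function name `tfidf` and the toy documents are assumptions, and real pipelines usually add smoothing and normalization.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {word: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return weights

# Made-up token lists standing in for three blog posts.
docs = [
    ["spam", "spam", "payment", "email"],
    ["review", "paper", "email"],
    ["image", "label", "email"],
]

weights = tfidf(docs)
top_word = max(weights[0], key=weights[0].get)
print(top_word)
```

A word that appears in every document, like "email" in this toy corpus, gets weight zero, which is why very common words do not dominate lists like the one above.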

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000004 223 hunch net-2006-12-06-The Spam Problem

2 0.27777895 367 hunch net-2009-08-16-Centmail comments

Introduction: Centmail is a scheme which makes charity donations have a secondary value, as a stamp for email. When discussed on newscientist, slashdot, and others, some of the comments make the academic review process appear thoughtful. Some prominent fallacies are: Costing money fallacy. Some commenters appear to believe the system charges money per email. Instead, the basic idea is that users get an extra benefit from donations to a charity and participation is strictly voluntary. The solution to this fallacy is simply reading the details. Single solution fallacy. Some commenters seem to think this is proposed as a complete solution to spam, and since not everyone will opt to participate, it won’t work. But a complete solution is not at all necessary or even possible given the flag-day problem. Deployed machine learning systems for fighting spam are great at taking advantage of a partial solution. The solution to this fallacy is learning about machine learning. In the

3 0.18899602 132 hunch net-2005-11-26-The Design of an Optimal Research Environment

Introduction: How do you create an optimal environment for research? Here are some essential ingredients that I see. Stability. University-based research is relatively good at this. On any particular day, researchers face choices in what they will work on. A very common tradeoff is between: easy small difficult big For researchers without stability, the ‘easy small’ option wins. This is often “ok”—a series of incremental improvements on the state of the art can add up to something very beneficial. However, it misses one of the big potentials of research: finding entirely new and better ways of doing things. Stability comes in many forms. The prototypical example is tenure at a university—a tenured professor is almost impossible to fire, which means that the professor has the freedom to consider far horizon activities. An iron-clad guarantee of a paycheck is not necessary—industrial research labs have succeeded well with research positions of indefinite duration. Atnt rese

4 0.16365603 25 hunch net-2005-02-20-At One Month

Introduction: This is near the one month point, so it seems appropriate to consider meta-issues for the moment. The number of posts is a bit over 20. The number of people speaking up in discussions is about 10. The number of people viewing the site is somewhat more than 100. I am (naturally) dissatisfied with many things. Many of the potential uses haven’t been realized. This is partly a matter of opportunity (no conferences in the last month), partly a matter of will (no open problems because it’s hard to give them up), and partly a matter of tradition. In academia, there is a strong tradition of trying to get everything perfectly right before presentation. This is somewhat contradictory to the nature of making many posts, and it’s definitely contradictory to the idea of doing “public research”. If that sort of idea is to pay off, it must be significantly more successful than previous methods. In an effort to continue experimenting, I’m going to use the next week as “open problems we

5 0.1266275 297 hunch net-2008-04-22-Taking the next step

Introduction: At the last ICML, Tom Dietterich asked me to look into systems for commenting on papers. I’ve been slow getting to this, but it’s relevant now. The essential observation is that we now have many tools for online collaboration, but they are not yet much used in academic research. If we can find the right way to use them, then perhaps great things might happen, with extra kudos to the first conference that manages to really create an online community. Various conferences have been poking at this. For example, UAI has set up a wiki, COLT has started using Joomla, with some dynamic content, and AAAI has been setting up a “student blog”. Similarly, Dinoj Surendran set up a twiki for the Chicago Machine Learning Summer School, which was quite useful for coordinating events and other things. I believe the most important thing is a willingness to experiment. A good place to start seems to be enhancing existing conference websites. For example, the ICML 2007 papers pag

6 0.10315313 347 hunch net-2009-03-26-Machine Learning is too easy

7 0.091107845 20 hunch net-2005-02-15-ESPgame and image labeling

8 0.086426474 444 hunch net-2011-09-07-KDD and MUCMD 2011

9 0.069833413 159 hunch net-2006-02-27-The Peekaboom Dataset

10 0.063494496 401 hunch net-2010-06-20-2010 ICML discussion site

11 0.057952624 195 hunch net-2006-07-12-Who is having visa problems reaching US conferences?

12 0.054639608 260 hunch net-2007-08-25-The Privacy Problem

13 0.054174762 201 hunch net-2006-08-07-The Call of the Deep

14 0.050317552 265 hunch net-2007-10-14-NIPS workshp: Learning Problem Design

15 0.049756292 349 hunch net-2009-04-21-Interesting Presentations at Snowbird

16 0.04917042 218 hunch net-2006-11-20-Context and the calculation misperception

17 0.04741523 152 hunch net-2006-01-30-Should the Input Representation be a Vector?

18 0.044979908 207 hunch net-2006-09-12-Incentive Compatible Reviewing

19 0.043223165 193 hunch net-2006-07-09-The Stock Prediction Machine Learning Problem

20 0.042016711 102 hunch net-2005-08-11-Why Manifold-Based Dimension Reduction Techniques?
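The page does not state how `simValue` is computed. A same-blog score of 1.0000004 is consistent with cosine similarity between tf-idf vectors, up to floating-point rounding, so the sketch below assumes that measure. The first vector reuses three weights from the tf-idf list for this post; the other two vectors are invented for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as {term: weight}."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Three weights taken from the tf-idf list for this post.
spam_post = {"spam": 0.786, "payment": 0.236, "email": 0.166}
# Invented vectors standing in for two other posts.
stamp_post = {"spam": 0.4, "email": 0.5, "charity": 0.6}
review_post = {"review": 0.7, "paper": 0.5}

candidates = {"stamp_post": stamp_post, "review_post": review_post}
ranked = sorted(
    candidates,
    key=lambda name: cosine_similarity(spam_post, candidates[name]),
    reverse=True,
)
print(ranked)
```

Posts that share no weighted terms score exactly zero under this measure, so only overlapping vocabulary contributes to a ranking like the one above.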

similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.114), (1, 0.001), (2, -0.05), (3, 0.052), (4, -0.021), (5, 0.011), (6, -0.018), (7, -0.001), (8, 0.004), (9, -0.018), (10, -0.095), (11, -0.031), (12, -0.055), (13, 0.011), (14, 0.009), (15, 0.01), (16, -0.115), (17, -0.067), (18, -0.024), (19, 0.149), (20, -0.062), (21, -0.015), (22, -0.014), (23, -0.041), (24, 0.031), (25, 0.049), (26, 0.032), (27, 0.042), (28, 0.029), (29, -0.086), (30, 0.017), (31, 0.005), (32, -0.075), (33, 0.029), (34, 0.031), (35, 0.083), (36, 0.044), (37, -0.07), (38, -0.013), (39, -0.056), (40, -0.007), (41, -0.015), (42, -0.02), (43, -0.022), (44, -0.066), (45, -0.162), (46, -0.096), (47, 0.024), (48, -0.104), (49, -0.04)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.96573192 223 hunch net-2006-12-06-The Spam Problem

2 0.8200649 367 hunch net-2009-08-16-Centmail comments

3 0.58818495 25 hunch net-2005-02-20-At One Month

4 0.48444048 20 hunch net-2005-02-15-ESPgame and image labeling

Introduction: Luis von Ahn has been running the espgame for a while now. The espgame provides a picture to two randomly paired people across the web, and asks them to agree on a label. It hasn’t managed to label the web yet, but it has produced a large dataset of (image, label) pairs. I organized the dataset so you could explore the implied bipartite graph (requires much bandwidth). Relative to other image datasets, this one is quite large—67000 images, 358,000 labels (average of 5/image with variation from 1 to 19), and 22,000 unique labels (one every 3 images). The dataset is also very ‘natural’, consisting of images spidered from the internet. The multiple label characteristic is intriguing because ‘learning to learn’ and metalearning techniques may be applicable. The ‘natural’ quality means that this dataset varies greatly in difficulty from easy (predicting “red”) to hard (predicting “funny”) and potentially more rewarding to tackle. The open problem here is, of course, to make

5 0.4831962 297 hunch net-2008-04-22-Taking the next step

6 0.48100859 195 hunch net-2006-07-12-Who is having visa problems reaching US conferences?

7 0.46269169 159 hunch net-2006-02-27-The Peekaboom Dataset

8 0.42262223 354 hunch net-2009-05-17-Server Update

9 0.41554987 399 hunch net-2010-05-20-Google Predict

10 0.40230802 142 hunch net-2005-12-22-Yes , I am applying

11 0.39487159 99 hunch net-2005-08-01-Peekaboom

12 0.39094701 69 hunch net-2005-05-11-Visa Casualties

13 0.38584027 444 hunch net-2011-09-07-KDD and MUCMD 2011

14 0.38497406 151 hunch net-2006-01-25-1 year

15 0.38284734 152 hunch net-2006-01-30-Should the Input Representation be a Vector?

16 0.3819803 370 hunch net-2009-09-18-Necessary and Sufficient Research

17 0.37917811 401 hunch net-2010-06-20-2010 ICML discussion site

18 0.37511468 132 hunch net-2005-11-26-The Design of an Optimal Research Environment

19 0.37255776 137 hunch net-2005-12-09-Machine Learning Thoughts

20 0.3644127 358 hunch net-2009-06-01-Multitask Poisoning

similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(1, 0.023), (3, 0.046), (10, 0.027), (27, 0.181), (31, 0.327), (38, 0.018), (48, 0.013), (53, 0.047), (55, 0.1), (94, 0.064), (95, 0.018)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.97441757 119 hunch net-2005-10-08-We have a winner

Introduction: The DARPA Grand Challenge is a big contest for autonomous robot vehicle driving. It was run once in 2004 for the first time and all teams did badly. This year was notably different with the Stanford and CMU teams successfully completing the course. A number of details are here and wikipedia has continuing coverage. A formal winner hasn’t been declared yet although Stanford completed the course quickest. The Stanford and CMU teams deserve a large round of applause as they have strongly demonstrated the feasibility of autonomous vehicles. The good news for machine learning is that the Stanford team (at least) is using some machine learning techniques.

same-blog 2 0.85314739 223 hunch net-2006-12-06-The Spam Problem

3 0.56161702 484 hunch net-2013-06-16-Representative Reviewing

Introduction: When thinking about how best to review papers, it seems helpful to have some conception of what good reviewing is. As far as I can tell, this is almost always only discussed in the specific context of a paper (i.e. your rejected paper), or at most an area (i.e. what a “good paper” looks like for that area) rather than general principles. Neither individual papers nor areas are sufficiently general for a large conference—every paper differs in the details, and what if you want to build a new area and/or cross areas? An unavoidable reason for reviewing is that the community of research is too large. In particular, it is not possible for a researcher to read every paper which someone thinks might be of interest. This reason for reviewing exists independent of constraints on rooms or scheduling formats of individual conferences. Indeed, history suggests that physical constraints are relatively meaningless over the long term — growing conferences simply use more rooms and/or change fo

4 0.55751836 320 hunch net-2008-10-14-Who is Responsible for a Bad Review?

Introduction: Although I’m greatly interested in machine learning, I think it must be admitted that there is a large amount of low quality logic being used in reviews. The problem is bad enough that sometimes I wonder if the Byzantine generals limit has been exceeded. For example, I’ve seen recent reviews where the given reasons for rejecting are: [NIPS] Theorem A is uninteresting because Theorem B is uninteresting. [UAI] When you learn by memorization, the problem addressed is trivial. [NIPS] The proof is in the appendix. [NIPS] This has been done before. (… but not giving any relevant citations) Just for the record I want to point out what’s wrong with these reviews. A future world in which such reasons never come up again would be great, but I’m sure these errors will be committed many times more in the future. This is nonsense. A theorem should be evaluated based on its merits, rather than the merits of another theorem. Learning by memorization requires an expon

5 0.55747372 40 hunch net-2005-03-13-Avoiding Bad Reviewing

Introduction: If we accept that bad reviewing often occurs and want to fix it, the question is “how”? Reviewing is done by paper writers just like yourself, so a good proxy for this question is asking “How can I be a better reviewer?” Here are a few things I’ve learned by trial (and error), as a paper writer, and as a reviewer. The secret ingredient is careful thought. There is no good substitution for a deep and careful understanding. Avoid reviewing papers that you feel competitive about. You almost certainly will be asked to review papers that feel competitive if you work on subjects of common interest. But, the feeling of competition can easily lead to bad judgement. If you feel biased for some other reason, then you should avoid reviewing. For example… Feeling angry or threatened by a paper is a form of bias. See above. Double blind yourself (avoid looking at the name even in a single-blind situation). The significant effect of a name you recognize is making you pay close a

6 0.5572775 343 hunch net-2009-02-18-Decision by Vetocracy

7 0.55482411 437 hunch net-2011-07-10-ICML 2011 and the future

8 0.548823 51 hunch net-2005-04-01-The Producer-Consumer Model of Research

9 0.54721922 225 hunch net-2007-01-02-Retrospective

10 0.54653376 95 hunch net-2005-07-14-What Learning Theory might do

11 0.54586565 461 hunch net-2012-04-09-ICML author feedback is open

12 0.54579395 96 hunch net-2005-07-21-Six Months

13 0.54550385 463 hunch net-2012-05-02-ICML: Behind the Scenes

14 0.54486245 194 hunch net-2006-07-11-New Models

15 0.54456973 98 hunch net-2005-07-27-Not goal metrics

16 0.54442966 207 hunch net-2006-09-12-Incentive Compatible Reviewing

17 0.54434681 183 hunch net-2006-06-14-Explorations of Exploration

18 0.54432023 5 hunch net-2005-01-26-Watchword: Probability

19 0.54349947 454 hunch net-2012-01-30-ICML Posters and Scope

20 0.54347771 132 hunch net-2005-11-26-The Design of an Optimal Research Environment