nips nips2004 nips2004-199 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kumar Chellapilla, Patrice Y. Simard
Abstract: Machine learning is often used to automatically solve human tasks. In this paper, we look for tasks where machine learning algorithms are not as good as humans with the hope of gaining insight into their current limitations. We studied various Human Interactive Proofs (HIPs) on the market, because they are systems designed to tell computers and humans apart by posing challenges presumably too hard for computers. We found that most HIPs are pure recognition tasks which can easily be broken using machine learning. The harder HIPs use a combination of segmentation and recognition tasks. From this observation, we found that building segmentation tasks is the most effective way to confuse machine learning algorithms. This has enabled us to build effective HIPs (which we deployed in MSN Passport), as well as design challenging segmentation tasks for machine learning algorithms. 1 In t rod u ct i on The OCR problem for high resolution printed text has virtually been solved 10 years ago [1]. On the other hand, cursive handwriting recognition today is still too poor for most people to rely on. Is there a fundamental difference between these two seemingly similar problems? To shed more light on this question, we study problems that have been designed to be difficult for computers. The hope is that we will get some insight on what the stumbling blocks are for machine learning and devise appropriate tests to further understand their similarities and differences. Work on distinguishing computers from humans traces back to the original Turing Test [2] which asks that a human distinguish between another human and a machine by asking questions of both. Recent interest has turned to developing systems that allow a computer to distinguish between another computer and a human. These systems enable the construction of automatic filters that can be used to prevent automated scripts from utilizing services intended for humans [4]. Such systems have been termed Human Interactive Proofs (HIPs) [3] or Completely Automated Public Turing Tests to Tell Computers and Humans Apart (CAPTCHAs) [4]. An overview of the work in this area can be found in [5]. Construction of HIPs that are of practical value is difficult because it is not sufficient to develop challenges at which humans are somewhat more successful than machines. This is because the cost of failure for an automatic attacker is minimal compared to the cost of failure for humans. Ideally a HIP should be solved by humans more than 80% of the time, while an automatic script with reasonable resource use should succeed less than 0.01% of the time. This latter ratio (1 in 10,000) is a function of the cost of an automatic trial divided by the cost of having a human perform the attack. This constraint of generating tasks that are failed 99.99% of the time by all automated algorithms has generated various solutions which can easily be sampled on the internet. Seven different HIPs, namely, Mailblocks, MSN (before April 28th, 2004), Ticketmaster, Yahoo, Yahoo v2 (after Sept’04), Register, and Google, will be given as examples in the next section. We will show in Section 3 that machinelearning-based attacks are far more successful than 1 in 10,000. Yet, some of these HIPs are harder than others and could be made even harder by identifying the recognition and segmentation parts, and emphasizing the latter. Section 4 presents examples of more difficult HIPs which are much more respectable challenges for machine learning, and yet surprisingly easy for humans. The final section discusses a (known) weakness of machine learning algorithms and suggests designing simple artificial datasets for studying this weakness. 2 Exa mp les o f H I Ps The HIPs explored in this paper are made of characters (or symbols) rendered to an image and presented to the user. Solving the HIP requires identifying all characters in the correct order. The following HIPs can be sampled from the web: Mailblocks: While signing up for free email service with (www.mailblocks.com), you will find HIP challenges of the type: mailblocks MSN: While signing up for free e-mail with MSN Hotmail (www.hotmail.com), you will find HIP challenges of the type: Register.com: While requesting a whois lookup for a domain at www.register.com, you will HIP challenges of the type: Yahoo!/EZ-Gimpy (CMU): While signing up for free e-mail service with Yahoo! (www.yahoo.com), you will receive HIP challenges of the type: Yahoo! (version 2): Starting in August 2004, Yahoo! introduced their second generation HIP. Three examples are presented below: Ticketmaster: While looking for concert tickets at www.ticketmaster.com, you will receive HIP challenges of the type: Google/Gmail: While signing up for free e-mail with Gmail at www.google.com, one will receive HIP challenges of the type: While solutions to Yahoo HIPs are common English words, those for ticketmaster and Google do not necessarily belong to the English dictionary. They appear to have been created using a phonetic generator [8]. 3 Usi n g ma ch i n e lea rn i n g t o b rea k H IP s Breaking HIPs is not new. Mori and Malik [7] have successfully broken the EZGimpy (92% success) and Gimpy (33% success) HIPs from CMU. Our approach aims at an automatic process for solving multiple HIPs with minimum human intervention, using machine learning. In this paper, our main goal is to learn more about the common strengths and weaknesses of these HIPs rather than to prove that we can break any one HIP in particular with the highest possible success rate. We have results for six different HIPs: EZ-Gimpy/Yahoo, Yahoo v2, mailblocks, register, ticketmaster, and Google. To simplify our study, we will not be using language models in our attempt to break HIPs. For example, there are only about 600 words in the EZ-Gimpy dictionary [7], which means that a random guess attack would get a success rate of 1 in 600 (more than enough to break the HIP, i.e., greater than 0.01% success). HIPs become harder when no language model is used. Similarly, when a HIP uses a language model to generate challenges, success rate of attacks can be significantly improved by incorporating the language model. Further, since the language model is not common to all HIPs studied, it was not used in this paper. Our generic method for breaking all of these HIPs is to write a custom algorithm to locate the characters, and then use machine learning for recognition. Surprisingly, segmentation, or finding the characters, is simple for many HIPs which makes the process of breaking the HIP particularly easy. Gimpy uses a single constant predictable color (black) for letters even though the background color changes. We quickly realized that once the segmentation problem is solved, solving the HIP becomes a pure recognition problem, and it can trivially be solved using machine learning. Our recognition engine is based on neural networks [6][9]. It yielded a 0.4% error rate on the MNIST database, uses little memory, and is very fast for recognition (important for breaking HIPs). For each HIP, we have a segmentation step, followed by a recognition step. It should be stressed that we are not trying to solve every HIP of a given type i.e., our goal is not 100% success rate, but something efficient that can achieve much better than 0.01%. In each of the following experiments, 2500 HIPs were hand labeled and used as follows (a) recognition (1600 for training, 200 for validation, and 200 for testing), and (b) segmentation (500 for testing segmentation). For each of the five HIPs, a convolution neural network, identical to the one described in [6], was trained and tested on gray level character images centered on the guessed character positions (see below). The trained neural network became the recognizer. 3.1 M a i l b l oc k s To solve the HIP, we select the red channel, binarize and erode it, extract the largest connected components (CCs), and breakup CCs that are too large into two or three adjacent CCs. Further, vertically overlapping half character size CCs are merged. The resulting rough segmentation works most of the time. Here is an example: For instance, in the example above, the NN would be trained, and tested on the following images: … The end-to-end success rate is 88.8% for segmentation, 95.9% for recognition (given correct segmentation), and (0.888)*(0.959)7 = 66.2% total. Note that most of the errors come from segmentation, even though this is where all the custom programming was invested. 3.2 Register The procedure to solve HIPs is very similar. The image was smoothed, binarized, and the largest 5 connected components were identified. Two examples are presented below: The end-to-end success rate is 95.4% for segmentation, 87.1% for recognition (given correct segmentation), and (0.954)*(0.871)5 = 47.8% total. 3.3 Y a h oo/ E Z - G i mp y Unlike the mailblocks and register HIPs, the Yahoo/EZ-Gimpy HIPs are richer in that a variety of backgrounds and clutter are possible. Though some amount of text warping is present, the text color, size, and font have low variability. Three simple segmentation algorithms were designed with associated rules to identify which algorithm to use. The goal was to keep these simple yet effective: a) No mesh: Convert to grayscale image, threshold to black and white, select large CCs with sizes close to HIP char sizes. One example: b) Black mesh: Convert to grayscale image, threshold to black and white, remove vertical and horizontal line pixels that don’t have neighboring pixels, select large CCs with sizes close to HIP char sizes. One example: c) White mesh: Convert to grayscale image, threshold to black and white, add black pixels (in white line locations) if there exist neighboring pixels, select large CCs with sizes close to HIP char sizes. One example: Tests for black and white meshes were performed to determine which segmentation algorithm to use. The end-to-end success rate was 56.2% for segmentation (38.2% came from a), 11.8% from b), and 6.2% from c), 90.3% for recognition (given correct segmentation), and (0.562)*(0.903)4.8 = 34.4% total. The average length of a Yahoo HIP solution is 4.8 characters. 3.4 T i c k e t ma s t e r The procedure that solved the Yahoo HIP is fairly successful at solving some of the ticket master HIPs. These HIPs are characterized by cris-crossing lines at random angles clustered around 0, 45, 90, and 135 degrees. A multipronged attack as in the Yahoo case (section 3.3) has potential. In the interests of simplicity, a single attack was developed: Convert to grayscale, threshold to black and white, up-sample image, dilate first then erode, select large CCs with sizes close to HIP char sizes. One example: The dilate-erode combination causes the lines to be removed (along with any thin objects) but retains solid thick characters. This single attack is successful in achieving an end-to-end success rate of 16.6% for segmentation, the recognition rate was 82.3% (in spite of interfering lines), and (0.166)*(0.823)6.23 = 4.9% total. The average HIP solution length is 6.23 characters. 3.5 Y a h oo ve r s i on 2 The second generation HIP from Yahoo had several changes: a) it did not use words from a dictionary or even use a phonetic generator, b) it uses only black and white colors, c) uses both letters and digits, and d) uses connected lines and arcs as clutter. The HIP is somewhat similar to the MSN/Passport HIP which does not use a dictionary, uses two colors, uses letters and digits, and background and foreground arcs as clutter. Unlike the MSN/Passport HIP, several different fonts are used. A single segmentation attack was developed: Remove 6 pixel border, up-sample, dilate first then erode, select large CCs with sizes close to HIP char sizes. The attack is practically identical to that used for the ticketmaster HIP with different preprocessing stages and slightly modified parameters. Two examples: This single attack is successful in achieving an end-to-end success rate of 58.4% for segmentation, the recognition rate was 95.2%, and (0.584)*(0.952)5 = 45.7% total. The average HIP solution length is 5 characters. 3.6 G oog l e / G M a i l The Google HIP is unique in that it uses only image warp as a means of distorting the characters. Similar to the MSN/Passport and Yahoo version 2 HIPs, it is also two color. The HIP characters are arranged closed to one another (they often touch) and follow a curved baseline. The following very simple attack was used to segment Google HIPs: Convert to grayscale, up-sample, threshold and separate connected components. a) b) This very simple attack gives an end-to-end success rate of 10.2% for segmentation, the recognition rate was 89.3%, giving (0.102)*(0.893)6.5 = 4.89% total probability of breaking a HIP. Average Google HIP solution length is 6.5 characters. This can be significantly improved upon by judicious use of dilate-erode attack. A direct application doesn’t do as well as it did on the ticketmaster and yahoo HIPs (because of the shear and warp of the baseline of the word). More successful and complicated attacks might estimate and counter the shear and warp of the baseline to achieve better success rates. 4 Lesso n s lea rn ed f ro m b rea ki n g H IPs From the previous section, it is clear that most of the errors come from incorrect segmentations, even though most of the development time is spent devising custom segmentation schemes. This observation raises the following questions: Why is segmentation a hard problem? Can we devise harder HIPs and datasets? Can we build an automatic segmentor? Can we compare classification algorithms based on how useful they are for segmentation? 4.1 T h e s e g me n t a t i on p r ob l e m As a review, segmentation is difficult for the following reasons: 1. Segmentation is computationally expensive. In order to find valid patterns, a recognizer must attempt recognition at many different candidate locations. 2. The segmentation function is complex. To segment successfully, the system must learn to identify which patterns are valid among the set of all possible valid and non-valid patterns. This task is intrinsically more difficult than classification because the space of input is considerably larger. Unlike the space of valid patterns, the space of non-valid patterns is typically too vast to sample. This is a problem for many learning algorithms which yield too many false positives when presented non-valid patterns. 3. Identifying valid characters among a set of valid and invalid candidates is a combinatorial problem. For example, correctly identifying which 8 characters among 20 candidates (assuming 12 false positives), has a 1 in 125,970 (20 choose 8) chances of success by random guessing. 4.2 B ui l d i n g b e t te r / h a r de r H I P s We can use what we have learned to build better HIPs. For instance the HIP below was designed to make segmentation difficult and a similar version has been deployed by MSN Passport for hotmail registrations (www.hotmail.com): The idea is that the additional arcs are themselves good candidates for false characters. The previous segmentation attacks would fail on this HIP. Furthermore, simple change of fonts, distortions, or arc types would require extensive work for the attacker to adjust to. We believe HIPs that emphasize the segmentation problem, such as the above example, are much stronger than the HIPs we examined in this paper, which rely on recognition being difficult. Pushing this to the extreme, we can easily generate the following HIPs: Despite the apparent difficulty of these HIPs, humans are surprisingly good at solving these, indicating that humans are far better than computers at segmentation. This approach of adding several competing false positives can in principle be used to automatically create difficult segmentation problems or benchmarks to test classification algorithms. 4.3 B ui l d i n g a n a ut o ma t i c s e g me n t or To build an automatic segmentor, we could use the following procedure. Label characters based on their correct position and train a recognizer. Apply the trained recognizer at all locations in the HIP image. Collect all candidate characters identified with high confidence by the recognizer. Compute the probability of each combination of candidates (going from left to right), and output the solution string with the highest probability. This is better illustrated with an example. Consider the following HIP (to the right). The trained neural network has these maps (warm colors indicate recognition) that show that K, Y, and so on are correctly identified. However, the maps for 7 and 9 show several false positives. In general, we would get the following color coded map for all the different candidates: HIP K Y B 7 9 With a threshold of 0.5 on the network’s outputs, the map obtained is: We note that there are several false positives for each true positive. The number of false positives per true positive character was found to be between 1 and 4, giving a 1 in C(16,8) = 12,870 to 1 in C(32,8) = 10,518,300 random chance of guessing the correct segmentation for the HIP characters. These numbers can be improved upon by constraining solution strings to flow sequentially from left to right and by restricting overlap. For each combination, we compute a probability by multiplying the 8 probabilities of the classifier for each position. The combination with the highest probability is the one proposed by the classifier. We do not have results for such an automatic segmentor at this time. It is interesting to note that with such a method a classifier that is robust to false positives would do far better than one that is not. This suggests another axis for comparing classifiers. 5 Con clu si on In this paper, we have successfully applied machine learning to the problem of solving HIPs. We have learned that decomposing the HIP problem into segmentation and recognition greatly simplifies analysis. Recognition on even unprocessed images (given segmentation is a solved) can be done automatically using neural networks. Segmentation, on the other hand, is the difficulty differentiator between weaker and stronger HIPs and requires custom intervention for each HIP. We have used this observation to design new HIPs and new tests for machine learning algorithms with the hope of improving them. A c k n ow l e d ge me n t s We would like to acknowledge Chau Luu and Eric Meltzer for their help with labeling and segmenting various HIPs. We would also like to acknowledge Josh Benaloh and Cem Paya for stimulating discussions on HIP security. References [1] Baird HS (1992), “Anatomy of a versatile page reader,” IEEE Pro., v.80, pp. 1059-1065. [2] Turing AM (1950), “Computing Machinery and Intelligence,” Mind, 59:236, pp. 433-460. [3] First Workshop on Human Interactive Proofs, Palo Alto, CA, January 2002. [4] Von Ahn L, Blum M, and Langford J, The Captcha Project. http://www.captcha.net [5] Baird HS and Popat K (2002) “Human Interactive Proofs and Document Image Analysis,” Proc. IAPR 2002 Workshop on Document Analysis Systerms, Princeton, NJ. [6] Simard PY, Steinkraus D, and Platt J, (2003) “Best Practice for Convolutional Neural Networks Applied to Visual Document Analysis,” in International Conference on Document Analysis and Recognition (ICDAR), pp. 958-962, IEEE Computer Society, Los Alamitos. [7] Mori G, Malik J (2003), “Recognizing Objects in Adversarial Clutter: Breaking a Visual CAPTCHA,” Proc. of the Computer Vision and Pattern Recognition (CVPR) Conference, IEEE Computer Society, vol.1, pages:I-134 - I-141, June 18-20, 2003 [8] Chew, M. and Baird, H. S. (2003), “BaffleText: a Human Interactive Proof,” Proc., 10th IS&T;/SPIE Document Recognition & Retrieval Conf., Santa Clara, CA, Jan. 22. [9] LeCun Y, Bottou L, Bengio Y, and Haffner P, “Gradient-based learning applied to document recognition,’ Proceedings of the IEEE, Nov. 1998.
Reference: text
sentIndex sentText sentNum sentScore
1 com Abstract Machine learning is often used to automatically solve human tasks. [sent-4, score-0.043]
2 In this paper, we look for tasks where machine learning algorithms are not as good as humans with the hope of gaining insight into their current limitations. [sent-5, score-0.071]
3 We studied various Human Interactive Proofs (HIPs) on the market, because they are systems designed to tell computers and humans apart by posing challenges presumably too hard for computers. [sent-6, score-0.236]
4 We found that most HIPs are pure recognition tasks which can easily be broken using machine learning. [sent-7, score-0.104]
5 The harder HIPs use a combination of segmentation and recognition tasks. [sent-8, score-0.323]
6 From this observation, we found that building segmentation tasks is the most effective way to confuse machine learning algorithms. [sent-9, score-0.197]
7 This has enabled us to build effective HIPs (which we deployed in MSN Passport), as well as design challenging segmentation tasks for machine learning algorithms. [sent-10, score-0.243]
8 On the other hand, cursive handwriting recognition today is still too poor for most people to rely on. [sent-12, score-0.085]
9 To shed more light on this question, we study problems that have been designed to be difficult for computers. [sent-14, score-0.092]
10 The hope is that we will get some insight on what the stumbling blocks are for machine learning and devise appropriate tests to further understand their similarities and differences. [sent-15, score-0.043]
11 Work on distinguishing computers from humans traces back to the original Turing Test [2] which asks that a human distinguish between another human and a machine by asking questions of both. [sent-16, score-0.196]
12 These systems enable the construction of automatic filters that can be used to prevent automated scripts from utilizing services intended for humans [4]. [sent-18, score-0.142]
13 Construction of HIPs that are of practical value is difficult because it is not sufficient to develop challenges at which humans are somewhat more successful than machines. [sent-21, score-0.258]
14 This is because the cost of failure for an automatic attacker is minimal compared to the cost of failure for humans. [sent-22, score-0.071]
15 Ideally a HIP should be solved by humans more than 80% of the time, while an automatic script with reasonable resource use should succeed less than 0. [sent-23, score-0.14]
16 This latter ratio (1 in 10,000) is a function of the cost of an automatic trial divided by the cost of having a human perform the attack. [sent-25, score-0.088]
17 We will show in Section 3 that machinelearning-based attacks are far more successful than 1 in 10,000. [sent-29, score-0.083]
18 Yet, some of these HIPs are harder than others and could be made even harder by identifying the recognition and segmentation parts, and emphasizing the latter. [sent-30, score-0.392]
19 Section 4 presents examples of more difficult HIPs which are much more respectable challenges for machine learning, and yet surprisingly easy for humans. [sent-31, score-0.157]
20 2 Exa mp les o f H I Ps The HIPs explored in this paper are made of characters (or symbols) rendered to an image and presented to the user. [sent-33, score-0.139]
21 Solving the HIP requires identifying all characters in the correct order. [sent-34, score-0.123]
22 The following HIPs can be sampled from the web: Mailblocks: While signing up for free email service with (www. [sent-35, score-0.103]
23 com), you will find HIP challenges of the type: mailblocks MSN: While signing up for free e-mail with MSN Hotmail (www. [sent-37, score-0.266]
24 com), you will find HIP challenges of the type: Register. [sent-39, score-0.111]
25 /EZ-Gimpy (CMU): While signing up for free e-mail service with Yahoo! [sent-43, score-0.103]
26 com), you will receive HIP challenges of the type: Yahoo! [sent-46, score-0.109]
27 com, you will receive HIP challenges of the type: Google/Gmail: While signing up for free e-mail with Gmail at www. [sent-51, score-0.188]
28 com, one will receive HIP challenges of the type: While solutions to Yahoo HIPs are common English words, those for ticketmaster and Google do not necessarily belong to the English dictionary. [sent-53, score-0.2]
29 They appear to have been created using a phonetic generator [8]. [sent-54, score-0.052]
30 3 Usi n g ma ch i n e lea rn i n g t o b rea k H IP s Breaking HIPs is not new. [sent-55, score-0.06]
31 Our approach aims at an automatic process for solving multiple HIPs with minimum human intervention, using machine learning. [sent-57, score-0.108]
32 In this paper, our main goal is to learn more about the common strengths and weaknesses of these HIPs rather than to prove that we can break any one HIP in particular with the highest possible success rate. [sent-58, score-0.115]
33 To simplify our study, we will not be using language models in our attempt to break HIPs. [sent-60, score-0.061]
34 For example, there are only about 600 words in the EZ-Gimpy dictionary [7], which means that a random guess attack would get a success rate of 1 in 600 (more than enough to break the HIP, i. [sent-61, score-0.277]
35 HIPs become harder when no language model is used. [sent-65, score-0.07]
36 Similarly, when a HIP uses a language model to generate challenges, success rate of attacks can be significantly improved by incorporating the language model. [sent-66, score-0.246]
37 Our generic method for breaking all of these HIPs is to write a custom algorithm to locate the characters, and then use machine learning for recognition. [sent-68, score-0.097]
38 Surprisingly, segmentation, or finding the characters, is simple for many HIPs which makes the process of breaking the HIP particularly easy. [sent-69, score-0.052]
39 Gimpy uses a single constant predictable color (black) for letters even though the background color changes. [sent-70, score-0.067]
40 We quickly realized that once the segmentation problem is solved, solving the HIP becomes a pure recognition problem, and it can trivially be solved using machine learning. [sent-71, score-0.326]
41 Our recognition engine is based on neural networks [6][9]. [sent-72, score-0.085]
42 4% error rate on the MNIST database, uses little memory, and is very fast for recognition (important for breaking HIPs). [sent-74, score-0.189]
43 For each HIP, we have a segmentation step, followed by a recognition step. [sent-75, score-0.282]
44 , our goal is not 100% success rate, but something efficient that can achieve much better than 0. [sent-78, score-0.083]
45 In each of the following experiments, 2500 HIPs were hand labeled and used as follows (a) recognition (1600 for training, 200 for validation, and 200 for testing), and (b) segmentation (500 for testing segmentation). [sent-80, score-0.282]
46 For each of the five HIPs, a convolution neural network, identical to the one described in [6], was trained and tested on gray level character images centered on the guessed character positions (see below). [sent-81, score-0.07]
47 1 M a i l b l oc k s To solve the HIP, we select the red channel, binarize and erode it, extract the largest connected components (CCs), and breakup CCs that are too large into two or three adjacent CCs. [sent-84, score-0.071]
48 The resulting rough segmentation works most of the time. [sent-86, score-0.197]
49 Here is an example: For instance, in the example above, the NN would be trained, and tested on the following images: … The end-to-end success rate is 88. [sent-87, score-0.112]
50 Note that most of the errors come from segmentation, even though this is where all the custom programming was invested. [sent-93, score-0.045]
51 Two examples are presented below: The end-to-end success rate is 95. [sent-97, score-0.112]
52 3 Y a h oo/ E Z - G i mp y Unlike the mailblocks and register HIPs, the Yahoo/EZ-Gimpy HIPs are richer in that a variety of backgrounds and clutter are possible. [sent-104, score-0.146]
53 Three simple segmentation algorithms were designed with associated rules to identify which algorithm to use. [sent-106, score-0.218]
54 The goal was to keep these simple yet effective: a) No mesh: Convert to grayscale image, threshold to black and white, select large CCs with sizes close to HIP char sizes. [sent-107, score-0.237]
55 One example: b) Black mesh: Convert to grayscale image, threshold to black and white, remove vertical and horizontal line pixels that don’t have neighboring pixels, select large CCs with sizes close to HIP char sizes. [sent-108, score-0.257]
56 One example: c) White mesh: Convert to grayscale image, threshold to black and white, add black pixels (in white line locations) if there exist neighboring pixels, select large CCs with sizes close to HIP char sizes. [sent-109, score-0.343]
57 One example: Tests for black and white meshes were performed to determine which segmentation algorithm to use. [sent-110, score-0.283]
58 4 T i c k e t ma s t e r The procedure that solved the Yahoo HIP is fairly successful at solving some of the ticket master HIPs. [sent-124, score-0.074]
59 A multipronged attack as in the Yahoo case (section 3. [sent-126, score-0.101]
60 In the interests of simplicity, a single attack was developed: Convert to grayscale, threshold to black and white, up-sample image, dilate first then erode, select large CCs with sizes close to HIP char sizes. [sent-128, score-0.325]
61 This single attack is successful in achieving an end-to-end success rate of 16. [sent-130, score-0.243]
62 5 Y a h oo ve r s i on 2 The second generation HIP from Yahoo had several changes: a) it did not use words from a dictionary or even use a phonetic generator, b) it uses only black and white colors, c) uses both letters and digits, and d) uses connected lines and arcs as clutter. [sent-140, score-0.242]
63 The HIP is somewhat similar to the MSN/Passport HIP which does not use a dictionary, uses two colors, uses letters and digits, and background and foreground arcs as clutter. [sent-141, score-0.075]
64 A single segmentation attack was developed: Remove 6 pixel border, up-sample, dilate first then erode, select large CCs with sizes close to HIP char sizes. [sent-143, score-0.456]
65 The attack is practically identical to that used for the ticketmaster HIP with different preprocessing stages and slightly modified parameters. [sent-144, score-0.192]
66 Two examples: This single attack is successful in achieving an end-to-end success rate of 58. [sent-145, score-0.243]
67 6 G oog l e / G M a i l The Google HIP is unique in that it uses only image warp as a means of distorting the characters. [sent-153, score-0.079]
68 The HIP characters are arranged closed to one another (they often touch) and follow a curved baseline. [sent-155, score-0.095]
69 The following very simple attack was used to segment Google HIPs: Convert to grayscale, up-sample, threshold and separate connected components. [sent-156, score-0.124]
70 a) b) This very simple attack gives an end-to-end success rate of 10. [sent-157, score-0.213]
71 A direct application doesn’t do as well as it did on the ticketmaster and yahoo HIPs (because of the shear and warp of the baseline of the word). [sent-167, score-0.296]
72 More successful and complicated attacks might estimate and counter the shear and warp of the baseline to achieve better success rates. [sent-168, score-0.23]
73 4 Lesso n s lea rn ed f ro m b rea ki n g H IPs From the previous section, it is clear that most of the errors come from incorrect segmentations, even though most of the development time is spent devising custom segmentation schemes. [sent-169, score-0.302]
74 This observation raises the following questions: Why is segmentation a hard problem? [sent-170, score-0.197]
75 1 T h e s e g me n t a t i on p r ob l e m As a review, segmentation is difficult for the following reasons: 1. [sent-175, score-0.268]
76 In order to find valid patterns, a recognizer must attempt recognition at many different candidate locations. [sent-177, score-0.169]
77 To segment successfully, the system must learn to identify which patterns are valid among the set of all possible valid and non-valid patterns. [sent-180, score-0.066]
78 This task is intrinsically more difficult than classification because the space of input is considerably larger. [sent-181, score-0.1]
79 This is a problem for many learning algorithms which yield too many false positives when presented non-valid patterns. [sent-183, score-0.099]
80 Identifying valid characters among a set of valid and invalid candidates is a combinatorial problem. [sent-185, score-0.197]
81 For example, correctly identifying which 8 characters among 20 candidates (assuming 12 false positives), has a 1 in 125,970 (20 choose 8) chances of success by random guessing. [sent-186, score-0.291]
82 For instance the HIP below was designed to make segmentation difficult and a similar version has been deployed by MSN Passport for hotmail registrations (www. [sent-189, score-0.345]
83 com): The idea is that the additional arcs are themselves good candidates for false characters. [sent-191, score-0.114]
84 The previous segmentation attacks would fail on this HIP. [sent-192, score-0.25]
85 We believe HIPs that emphasize the segmentation problem, such as the above example, are much stronger than the HIPs we examined in this paper, which rely on recognition being difficult. [sent-194, score-0.282]
86 Pushing this to the extreme, we can easily generate the following HIPs: Despite the apparent difficulty of these HIPs, humans are surprisingly good at solving these, indicating that humans are far better than computers at segmentation. [sent-195, score-0.223]
87 This approach of adding several competing false positives can in principle be used to automatically create difficult segmentation problems or benchmarks to test classification algorithms. [sent-196, score-0.396]
88 3 B ui l d i n g a n a ut o ma t i c s e g me n t or To build an automatic segmentor, we could use the following procedure. [sent-198, score-0.065]
89 Label characters based on their correct position and train a recognizer. [sent-199, score-0.095]
90 Collect all candidate characters identified with high confidence by the recognizer. [sent-201, score-0.095]
91 However, the maps for 7 and 9 show several false positives. [sent-206, score-0.049]
92 In general, we would get the following color coded map for all the different candidates: HIP K Y B 7 9 With a threshold of 0. [sent-207, score-0.045]
93 5 on the network’s outputs, the map obtained is: We note that there are several false positives for each true positive. [sent-208, score-0.099]
94 The number of false positives per true positive character was found to be between 1 and 4, giving a 1 in C(16,8) = 12,870 to 1 in C(32,8) = 10,518,300 random chance of guessing the correct segmentation for the HIP characters. [sent-209, score-0.331]
95 We do not have results for such an automatic segmentor at this time. [sent-213, score-0.091]
96 It is interesting to note that with such a method a classifier that is robust to false positives would do far better than one that is not. [sent-214, score-0.121]
97 We have learned that decomposing the HIP problem into segmentation and recognition greatly simplifies analysis. [sent-217, score-0.282]
98 Recognition on even unprocessed images (given segmentation is a solved) can be done automatically using neural networks. [sent-218, score-0.197]
99 Segmentation, on the other hand, is the difficulty differentiator between weaker and stronger HIPs and requires custom intervention for each HIP. [sent-219, score-0.091]
100 [9] LeCun Y, Bottou L, Bengio Y, and Haffner P, “Gradient-based learning applied to document recognition,’ Proceedings of the IEEE, Nov. [sent-246, score-0.04]
wordName wordTfidf (topN-words)
[('hip', 0.608), ('hips', 0.593), ('segmentation', 0.197), ('yahoo', 0.141), ('ccs', 0.122), ('attack', 0.101), ('characters', 0.095), ('ticketmaster', 0.091), ('challenges', 0.086), ('recognition', 0.085), ('success', 0.083), ('char', 0.076), ('mailblocks', 0.076), ('msn', 0.076), ('humans', 0.071), ('difficult', 0.071), ('signing', 0.061), ('interactive', 0.056), ('google', 0.053), ('attacks', 0.053), ('breaking', 0.052), ('positives', 0.05), ('false', 0.049), ('register', 0.048), ('convert', 0.048), ('erode', 0.046), ('segmentor', 0.046), ('custom', 0.045), ('automatic', 0.045), ('black', 0.043), ('human', 0.043), ('grayscale', 0.043), ('white', 0.043), ('harder', 0.041), ('document', 0.04), ('turing', 0.04), ('computers', 0.039), ('candidates', 0.036), ('baird', 0.036), ('proofs', 0.035), ('character', 0.035), ('warp', 0.034), ('valid', 0.033), ('microsoft', 0.032), ('colors', 0.032), ('dictionary', 0.032), ('mesh', 0.032), ('break', 0.032), ('captcha', 0.03), ('dilate', 0.03), ('fonts', 0.03), ('gimpy', 0.03), ('hotmail', 0.03), ('lea', 0.03), ('passport', 0.03), ('patrice', 0.03), ('rea', 0.03), ('redmond', 0.03), ('shear', 0.03), ('successful', 0.03), ('classification', 0.029), ('arcs', 0.029), ('rate', 0.029), ('language', 0.029), ('identifying', 0.028), ('sizes', 0.027), ('generator', 0.026), ('deployed', 0.026), ('phonetic', 0.026), ('attacker', 0.026), ('recognizer', 0.026), ('automated', 0.026), ('find', 0.025), ('select', 0.025), ('solved', 0.024), ('mori', 0.024), ('intervention', 0.024), ('service', 0.024), ('simard', 0.024), ('tests', 0.024), ('receive', 0.023), ('threshold', 0.023), ('uses', 0.023), ('mp', 0.022), ('classifier', 0.022), ('hs', 0.022), ('difficulty', 0.022), ('color', 0.022), ('image', 0.022), ('type', 0.021), ('designed', 0.021), ('solving', 0.02), ('build', 0.02), ('pixels', 0.02), ('english', 0.019), ('broken', 0.019), ('tell', 0.019), ('devise', 0.019), ('wa', 0.019), ('free', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 199 nips-2004-Using Machine Learning to Break Visual Human Interaction Proofs (HIPs)
Author: Kumar Chellapilla, Patrice Y. Simard
Abstract: Machine learning is often used to automatically solve human tasks. In this paper, we look for tasks where machine learning algorithms are not as good as humans with the hope of gaining insight into their current limitations. We studied various Human Interactive Proofs (HIPs) on the market, because they are systems designed to tell computers and humans apart by posing challenges presumably too hard for computers. We found that most HIPs are pure recognition tasks which can easily be broken using machine learning. The harder HIPs use a combination of segmentation and recognition tasks. From this observation, we found that building segmentation tasks is the most effective way to confuse machine learning algorithms. This has enabled us to build effective HIPs (which we deployed in MSN Passport), as well as design challenging segmentation tasks for machine learning algorithms. 1 In t rod u ct i on The OCR problem for high resolution printed text has virtually been solved 10 years ago [1]. On the other hand, cursive handwriting recognition today is still too poor for most people to rely on. Is there a fundamental difference between these two seemingly similar problems? To shed more light on this question, we study problems that have been designed to be difficult for computers. The hope is that we will get some insight on what the stumbling blocks are for machine learning and devise appropriate tests to further understand their similarities and differences. Work on distinguishing computers from humans traces back to the original Turing Test [2] which asks that a human distinguish between another human and a machine by asking questions of both. Recent interest has turned to developing systems that allow a computer to distinguish between another computer and a human. These systems enable the construction of automatic filters that can be used to prevent automated scripts from utilizing services intended for humans [4]. Such systems have been termed Human Interactive Proofs (HIPs) [3] or Completely Automated Public Turing Tests to Tell Computers and Humans Apart (CAPTCHAs) [4]. An overview of the work in this area can be found in [5]. Construction of HIPs that are of practical value is difficult because it is not sufficient to develop challenges at which humans are somewhat more successful than machines. This is because the cost of failure for an automatic attacker is minimal compared to the cost of failure for humans. Ideally a HIP should be solved by humans more than 80% of the time, while an automatic script with reasonable resource use should succeed less than 0.01% of the time. This latter ratio (1 in 10,000) is a function of the cost of an automatic trial divided by the cost of having a human perform the attack. This constraint of generating tasks that are failed 99.99% of the time by all automated algorithms has generated various solutions which can easily be sampled on the internet. Seven different HIPs, namely, Mailblocks, MSN (before April 28th, 2004), Ticketmaster, Yahoo, Yahoo v2 (after Sept’04), Register, and Google, will be given as examples in the next section. We will show in Section 3 that machinelearning-based attacks are far more successful than 1 in 10,000. Yet, some of these HIPs are harder than others and could be made even harder by identifying the recognition and segmentation parts, and emphasizing the latter. Section 4 presents examples of more difficult HIPs which are much more respectable challenges for machine learning, and yet surprisingly easy for humans. The final section discusses a (known) weakness of machine learning algorithms and suggests designing simple artificial datasets for studying this weakness. 2 Exa mp les o f H I Ps The HIPs explored in this paper are made of characters (or symbols) rendered to an image and presented to the user. Solving the HIP requires identifying all characters in the correct order. The following HIPs can be sampled from the web: Mailblocks: While signing up for free email service with (www.mailblocks.com), you will find HIP challenges of the type: mailblocks MSN: While signing up for free e-mail with MSN Hotmail (www.hotmail.com), you will find HIP challenges of the type: Register.com: While requesting a whois lookup for a domain at www.register.com, you will HIP challenges of the type: Yahoo!/EZ-Gimpy (CMU): While signing up for free e-mail service with Yahoo! (www.yahoo.com), you will receive HIP challenges of the type: Yahoo! (version 2): Starting in August 2004, Yahoo! introduced their second generation HIP. Three examples are presented below: Ticketmaster: While looking for concert tickets at www.ticketmaster.com, you will receive HIP challenges of the type: Google/Gmail: While signing up for free e-mail with Gmail at www.google.com, one will receive HIP challenges of the type: While solutions to Yahoo HIPs are common English words, those for ticketmaster and Google do not necessarily belong to the English dictionary. They appear to have been created using a phonetic generator [8]. 3 Usi n g ma ch i n e lea rn i n g t o b rea k H IP s Breaking HIPs is not new. Mori and Malik [7] have successfully broken the EZGimpy (92% success) and Gimpy (33% success) HIPs from CMU. Our approach aims at an automatic process for solving multiple HIPs with minimum human intervention, using machine learning. In this paper, our main goal is to learn more about the common strengths and weaknesses of these HIPs rather than to prove that we can break any one HIP in particular with the highest possible success rate. We have results for six different HIPs: EZ-Gimpy/Yahoo, Yahoo v2, mailblocks, register, ticketmaster, and Google. To simplify our study, we will not be using language models in our attempt to break HIPs. For example, there are only about 600 words in the EZ-Gimpy dictionary [7], which means that a random guess attack would get a success rate of 1 in 600 (more than enough to break the HIP, i.e., greater than 0.01% success). HIPs become harder when no language model is used. Similarly, when a HIP uses a language model to generate challenges, success rate of attacks can be significantly improved by incorporating the language model. Further, since the language model is not common to all HIPs studied, it was not used in this paper. Our generic method for breaking all of these HIPs is to write a custom algorithm to locate the characters, and then use machine learning for recognition. Surprisingly, segmentation, or finding the characters, is simple for many HIPs which makes the process of breaking the HIP particularly easy. Gimpy uses a single constant predictable color (black) for letters even though the background color changes. We quickly realized that once the segmentation problem is solved, solving the HIP becomes a pure recognition problem, and it can trivially be solved using machine learning. Our recognition engine is based on neural networks [6][9]. It yielded a 0.4% error rate on the MNIST database, uses little memory, and is very fast for recognition (important for breaking HIPs). For each HIP, we have a segmentation step, followed by a recognition step. It should be stressed that we are not trying to solve every HIP of a given type i.e., our goal is not 100% success rate, but something efficient that can achieve much better than 0.01%. In each of the following experiments, 2500 HIPs were hand labeled and used as follows (a) recognition (1600 for training, 200 for validation, and 200 for testing), and (b) segmentation (500 for testing segmentation). For each of the five HIPs, a convolution neural network, identical to the one described in [6], was trained and tested on gray level character images centered on the guessed character positions (see below). The trained neural network became the recognizer. 3.1 M a i l b l oc k s To solve the HIP, we select the red channel, binarize and erode it, extract the largest connected components (CCs), and breakup CCs that are too large into two or three adjacent CCs. Further, vertically overlapping half character size CCs are merged. The resulting rough segmentation works most of the time. Here is an example: For instance, in the example above, the NN would be trained, and tested on the following images: … The end-to-end success rate is 88.8% for segmentation, 95.9% for recognition (given correct segmentation), and (0.888)*(0.959)7 = 66.2% total. Note that most of the errors come from segmentation, even though this is where all the custom programming was invested. 3.2 Register The procedure to solve HIPs is very similar. The image was smoothed, binarized, and the largest 5 connected components were identified. Two examples are presented below: The end-to-end success rate is 95.4% for segmentation, 87.1% for recognition (given correct segmentation), and (0.954)*(0.871)5 = 47.8% total. 3.3 Y a h oo/ E Z - G i mp y Unlike the mailblocks and register HIPs, the Yahoo/EZ-Gimpy HIPs are richer in that a variety of backgrounds and clutter are possible. Though some amount of text warping is present, the text color, size, and font have low variability. Three simple segmentation algorithms were designed with associated rules to identify which algorithm to use. The goal was to keep these simple yet effective: a) No mesh: Convert to grayscale image, threshold to black and white, select large CCs with sizes close to HIP char sizes. One example: b) Black mesh: Convert to grayscale image, threshold to black and white, remove vertical and horizontal line pixels that don’t have neighboring pixels, select large CCs with sizes close to HIP char sizes. One example: c) White mesh: Convert to grayscale image, threshold to black and white, add black pixels (in white line locations) if there exist neighboring pixels, select large CCs with sizes close to HIP char sizes. One example: Tests for black and white meshes were performed to determine which segmentation algorithm to use. The end-to-end success rate was 56.2% for segmentation (38.2% came from a), 11.8% from b), and 6.2% from c), 90.3% for recognition (given correct segmentation), and (0.562)*(0.903)4.8 = 34.4% total. The average length of a Yahoo HIP solution is 4.8 characters. 3.4 T i c k e t ma s t e r The procedure that solved the Yahoo HIP is fairly successful at solving some of the ticket master HIPs. These HIPs are characterized by cris-crossing lines at random angles clustered around 0, 45, 90, and 135 degrees. A multipronged attack as in the Yahoo case (section 3.3) has potential. In the interests of simplicity, a single attack was developed: Convert to grayscale, threshold to black and white, up-sample image, dilate first then erode, select large CCs with sizes close to HIP char sizes. One example: The dilate-erode combination causes the lines to be removed (along with any thin objects) but retains solid thick characters. This single attack is successful in achieving an end-to-end success rate of 16.6% for segmentation, the recognition rate was 82.3% (in spite of interfering lines), and (0.166)*(0.823)6.23 = 4.9% total. The average HIP solution length is 6.23 characters. 3.5 Y a h oo ve r s i on 2 The second generation HIP from Yahoo had several changes: a) it did not use words from a dictionary or even use a phonetic generator, b) it uses only black and white colors, c) uses both letters and digits, and d) uses connected lines and arcs as clutter. The HIP is somewhat similar to the MSN/Passport HIP which does not use a dictionary, uses two colors, uses letters and digits, and background and foreground arcs as clutter. Unlike the MSN/Passport HIP, several different fonts are used. A single segmentation attack was developed: Remove 6 pixel border, up-sample, dilate first then erode, select large CCs with sizes close to HIP char sizes. The attack is practically identical to that used for the ticketmaster HIP with different preprocessing stages and slightly modified parameters. Two examples: This single attack is successful in achieving an end-to-end success rate of 58.4% for segmentation, the recognition rate was 95.2%, and (0.584)*(0.952)5 = 45.7% total. The average HIP solution length is 5 characters. 3.6 G oog l e / G M a i l The Google HIP is unique in that it uses only image warp as a means of distorting the characters. Similar to the MSN/Passport and Yahoo version 2 HIPs, it is also two color. The HIP characters are arranged closed to one another (they often touch) and follow a curved baseline. The following very simple attack was used to segment Google HIPs: Convert to grayscale, up-sample, threshold and separate connected components. a) b) This very simple attack gives an end-to-end success rate of 10.2% for segmentation, the recognition rate was 89.3%, giving (0.102)*(0.893)6.5 = 4.89% total probability of breaking a HIP. Average Google HIP solution length is 6.5 characters. This can be significantly improved upon by judicious use of dilate-erode attack. A direct application doesn’t do as well as it did on the ticketmaster and yahoo HIPs (because of the shear and warp of the baseline of the word). More successful and complicated attacks might estimate and counter the shear and warp of the baseline to achieve better success rates. 4 Lesso n s lea rn ed f ro m b rea ki n g H IPs From the previous section, it is clear that most of the errors come from incorrect segmentations, even though most of the development time is spent devising custom segmentation schemes. This observation raises the following questions: Why is segmentation a hard problem? Can we devise harder HIPs and datasets? Can we build an automatic segmentor? Can we compare classification algorithms based on how useful they are for segmentation? 4.1 T h e s e g me n t a t i on p r ob l e m As a review, segmentation is difficult for the following reasons: 1. Segmentation is computationally expensive. In order to find valid patterns, a recognizer must attempt recognition at many different candidate locations. 2. The segmentation function is complex. To segment successfully, the system must learn to identify which patterns are valid among the set of all possible valid and non-valid patterns. This task is intrinsically more difficult than classification because the space of input is considerably larger. Unlike the space of valid patterns, the space of non-valid patterns is typically too vast to sample. This is a problem for many learning algorithms which yield too many false positives when presented non-valid patterns. 3. Identifying valid characters among a set of valid and invalid candidates is a combinatorial problem. For example, correctly identifying which 8 characters among 20 candidates (assuming 12 false positives), has a 1 in 125,970 (20 choose 8) chances of success by random guessing. 4.2 B ui l d i n g b e t te r / h a r de r H I P s We can use what we have learned to build better HIPs. For instance the HIP below was designed to make segmentation difficult and a similar version has been deployed by MSN Passport for hotmail registrations (www.hotmail.com): The idea is that the additional arcs are themselves good candidates for false characters. The previous segmentation attacks would fail on this HIP. Furthermore, simple change of fonts, distortions, or arc types would require extensive work for the attacker to adjust to. We believe HIPs that emphasize the segmentation problem, such as the above example, are much stronger than the HIPs we examined in this paper, which rely on recognition being difficult. Pushing this to the extreme, we can easily generate the following HIPs: Despite the apparent difficulty of these HIPs, humans are surprisingly good at solving these, indicating that humans are far better than computers at segmentation. This approach of adding several competing false positives can in principle be used to automatically create difficult segmentation problems or benchmarks to test classification algorithms. 4.3 B ui l d i n g a n a ut o ma t i c s e g me n t or To build an automatic segmentor, we could use the following procedure. Label characters based on their correct position and train a recognizer. Apply the trained recognizer at all locations in the HIP image. Collect all candidate characters identified with high confidence by the recognizer. Compute the probability of each combination of candidates (going from left to right), and output the solution string with the highest probability. This is better illustrated with an example. Consider the following HIP (to the right). The trained neural network has these maps (warm colors indicate recognition) that show that K, Y, and so on are correctly identified. However, the maps for 7 and 9 show several false positives. In general, we would get the following color coded map for all the different candidates: HIP K Y B 7 9 With a threshold of 0.5 on the network’s outputs, the map obtained is: We note that there are several false positives for each true positive. The number of false positives per true positive character was found to be between 1 and 4, giving a 1 in C(16,8) = 12,870 to 1 in C(32,8) = 10,518,300 random chance of guessing the correct segmentation for the HIP characters. These numbers can be improved upon by constraining solution strings to flow sequentially from left to right and by restricting overlap. For each combination, we compute a probability by multiplying the 8 probabilities of the classifier for each position. The combination with the highest probability is the one proposed by the classifier. We do not have results for such an automatic segmentor at this time. It is interesting to note that with such a method a classifier that is robust to false positives would do far better than one that is not. This suggests another axis for comparing classifiers. 5 Con clu si on In this paper, we have successfully applied machine learning to the problem of solving HIPs. We have learned that decomposing the HIP problem into segmentation and recognition greatly simplifies analysis. Recognition on even unprocessed images (given segmentation is a solved) can be done automatically using neural networks. Segmentation, on the other hand, is the difficulty differentiator between weaker and stronger HIPs and requires custom intervention for each HIP. We have used this observation to design new HIPs and new tests for machine learning algorithms with the hope of improving them. A c k n ow l e d ge me n t s We would like to acknowledge Chau Luu and Eric Meltzer for their help with labeling and segmenting various HIPs. We would also like to acknowledge Josh Benaloh and Cem Paya for stimulating discussions on HIP security. References [1] Baird HS (1992), “Anatomy of a versatile page reader,” IEEE Pro., v.80, pp. 1059-1065. [2] Turing AM (1950), “Computing Machinery and Intelligence,” Mind, 59:236, pp. 433-460. [3] First Workshop on Human Interactive Proofs, Palo Alto, CA, January 2002. [4] Von Ahn L, Blum M, and Langford J, The Captcha Project. http://www.captcha.net [5] Baird HS and Popat K (2002) “Human Interactive Proofs and Document Image Analysis,” Proc. IAPR 2002 Workshop on Document Analysis Systerms, Princeton, NJ. [6] Simard PY, Steinkraus D, and Platt J, (2003) “Best Practice for Convolutional Neural Networks Applied to Visual Document Analysis,” in International Conference on Document Analysis and Recognition (ICDAR), pp. 958-962, IEEE Computer Society, Los Alamitos. [7] Mori G, Malik J (2003), “Recognizing Objects in Adversarial Clutter: Breaking a Visual CAPTCHA,” Proc. of the Computer Vision and Pattern Recognition (CVPR) Conference, IEEE Computer Society, vol.1, pages:I-134 - I-141, June 18-20, 2003 [8] Chew, M. and Baird, H. S. (2003), “BaffleText: a Human Interactive Proof,” Proc., 10th IS&T;/SPIE Document Recognition & Retrieval Conf., Santa Clara, CA, Jan. 22. [9] LeCun Y, Bottou L, Bengio Y, and Haffner P, “Gradient-based learning applied to document recognition,’ Proceedings of the IEEE, Nov. 1998.
2 0.052034136 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition
Author: Le Lu, Gregory D. Hager, Laurent Younes
Abstract: Visual action recognition is an important problem in computer vision. In this paper, we propose a new method to probabilistically model and recognize actions of articulated objects, such as hand or body gestures, in image sequences. Our method consists of three levels of representation. At the low level, we first extract a feature vector invariant to scale and in-plane rotation by using the Fourier transform of a circular spatial histogram. Then, spectral partitioning [20] is utilized to obtain an initial clustering; this clustering is then refined using a temporal smoothness constraint. Gaussian mixture model (GMM) based clustering and density estimation in the subspace of linear discriminant analysis (LDA) are then applied to thousands of image feature vectors to obtain an intermediate level representation. Finally, at the high level we build a temporal multiresolution histogram model for each action by aggregating the clustering weights of sampled images belonging to that action. We discuss how this high level representation can be extended to achieve temporal scaling invariance and to include Bi-gram or Multi-gram transition information. Both image clustering and action recognition/segmentation results are given to show the validity of our three tiered representation.
3 0.043071616 180 nips-2004-Synchronization of neural networks by mutual learning and its application to cryptography
Author: Einat Klein, Rachel Mislovaty, Ido Kanter, Andreas Ruttor, Wolfgang Kinzel
Abstract: Two neural networks that are trained on their mutual output synchronize to an identical time dependant weight vector. This novel phenomenon can be used for creation of a secure cryptographic secret-key using a public channel. Several models for this cryptographic system have been suggested, and have been tested for their security under different sophisticated attack strategies. The most promising models are networks that involve chaos synchronization. The synchronization process of mutual learning is described analytically using statistical physics methods.
4 0.041886445 107 nips-2004-Making Latin Manuscripts Searchable using gHMM's
Author: Jaety Edwards, Yee W. Teh, Roger Bock, Michael Maire, Grace Vesom, David A. Forsyth
Abstract: We describe a method that can make a scanned, handwritten mediaeval latin manuscript accessible to full text search. A generalized HMM is fitted, using transcribed latin to obtain a transition model and one example each of 22 letters to obtain an emission model. We show results for unigram, bigram and trigram models. Our method transcribes 25 pages of a manuscript of Terence with fair accuracy (75% of letters correctly transcribed). Search results are very strong; we use examples of variant spellings to demonstrate that the search respects the ink of the document. Furthermore, our model produces fair searches on a document from which we obtained no training data. 1. Intoduction There are many large corpora of handwritten scanned documents, and their number is growing rapidly. Collections range from the complete works of Mark Twain to thousands of pages of zoological notes spanning two centuries. Large scale analyses of such corpora is currently very difficult, because handwriting recognition works poorly. Recently, Rath and Manmatha have demonstrated that one can use small bodies of aligned material as supervised data to train a word spotting mechanism [7]. The result can make scanned handwritten documents searchable. Current techniques assume a closed vocabulary — one can search only for words in the training set — and search for instances of whole words. This approach is particularly unattractive for an inflected language, because individual words can take so many forms that one is unlikely to see all in the training set. Furthermore, one would like the method used to require very little aligned training data, so that it is possible to process documents written by different scribes with little overhead. Mediaeval Latin manuscripts are a natural first corpus for studying this problem, because there are many scanned manuscripts and because the handwriting is relatively regular. We expect the primary user need to be search over a large body of documents — to allow comparisons between documents — rather than transcription of a particular document (which is usually relatively easy to do by hand). Desirable features for a system are: First, that it use little or no aligned training data (an ideal, which we believe may be attainable, is an unsupervised learning system). Second, that one can search the document for an arbitrary string (rather than, say, only complete words that appear in the training data). This would allow a user to determine whether a document contains curious or distinctive spellings, for example (figure 7). We show that, using a statistical model based on a generalized HMM, we can search a medieval manuscript with considerable accuracy, using only one instance each of each letter in the manuscript to train the method (22 instances in total; Latin has no j, k, w, or z). Furthermore, our method allows fairly accurate transcription of the manuscript. We train our system on 22 glyphs taken from a a 12th century latin manuscript of Terence’s Comedies (obtained from a repository of over 80 scanned medieval works maintained by Oxford University [1]). We evaluate searches using a considerable portion of this manuscript aligned by hand; we then show that fair search results are available on a different manuscript (MS. Auct. D. 2. 16, Latin Gospels with beast-headed evangelist portraits made at Landvennec, Brittany, late 9th or early 10th century, from [1]) without change of letter templates. 1.1. Previous Work Handwriting recognition is a traditional problem, too well studied to review in detail here (see [6]). Typically, online handwriting recognition (where strokes can be recorded) works better than offline handwriting recognition. Handwritten digits can now be recognized with high accuracy [2, 5]. Handwritten amounts can be read with fair accuracy, which is significantly improved if one segments the amount into digits at the same time as one recognizes it [4, 5]. Recently several authors have proposed new techniques for search and translation in this unrestricted setting. Manmatha et al [7] introduce the technique of “word spotting,” which segments text into word images, rectifies the word images, and then uses an aligned training set to learn correspondences between rectified word images and strings. The method is not suitable for a heavily inflected language, because words take so many forms. In an inflected language, the natural unit to match to is a subset of a word, rather than a whole word, implying that one should segment the text into blocks — which may be smaller than words — while recognizing. Vinciarelli et al [8] introduce a method for line by line recognition based around an HMM and quite similar to techniques used in the speech recognition community. Their method uses a window that slides along the text to obtain features; this has the difficulty that the same window is in some places too small (and so uninformative) and in others too big (and so spans more than one letter, and is confusing). Their method requires a substantial body of aligned training data, which makes it impractical for our applications. Close in spirit to our work is the approach to machine translation of Koehn and Knight [3]. They demonstrate that the statistics of unaligned corpora may provide as powerful constraints for training models as aligned bitexts. 2. The Model Our models for both search and transcription are based on the generalized HMM and differ only in their choice of transition model. In an HMM, each hidden node ct emits a single evidence node xt . In a generalized HMM, we allow each ct to emit a series of x’s whose length is itself a random variable. In our model, the hidden nodes correspond to letters and each xt is a single column of pixels. Allowing letters to emit sets of columns lets us accomodate letter templates of variable width. In particular, this means that we can unify segmenting ink into letters and recognizing blocks of ink; figure 3 shows an example of how useful this is. 2.1. Generating a line of text Our hidden state consists of a character label c, width w and vertical position y. The statespace of c contains the characters ‘a’-‘z’, a space ‘ ’, and a special end state Ω. Let T c be the template associated with character c, Tch , Tcw be respectively the height and width of that template, and m be the height of the image. Figure 1: Left, a full page of our manuscript, a 12’th century manuscript of Terence’s Comedies obtained from [1]. Top right, a set of lines from a page from that document and bottom right, some words in higher resolution. Note: (a) the richness of page layout; (b) the clear spacing of the lines; (c) the relatively regular handwriting. Figure 2: Left, the 22 instances, one per letter, used to train our emission model. These templates are extracted by hand from the Terence document. Right, the five image channels for a single letter. Beginning at image column 1 (and assuming a dummy space before the first character), • • • • choose character c ∼ p(c|c−1...−n ) (an n-gram letter model) choose length w ∼ Uniform(Tcw − k, Tcw + k) (for some small k) choose vertical position y ∼ Uniform(1, m − Tch ) z,y and Tch now define a bounding box b of pixels. Let i and j be indexed from the top left of that bounding box. – draw pixel (i, j) ∼ N (Tcij , σcij ) for each pixel in b – draw all pixels above and below b from background gaussian N (µ0 , σ0 ) (See 2.2 for greater detail on pixel emission model) • move to column w + 1 and repeat until we enter the end state Ω. Inference on a gHMM is a relatively straighforward business of dynamic programming. We have used unigram, bigram and trigram models, with each transition model fitted using an electronic version of Caesar’s Gallic Wars, obtained from http://www.thelatinlibrary.com. We do not believe that the choice of author should significantly affect the fitted transition model — which is at the level of characters — but have not experimented with this point. The important matter is the emission model. 2.2. The Emission Model Our emission model is as follows: Given the character c and width w, we generate a template of the required length. Each pixel in this template becomes the mean of a gaussian which generates the corresponding pixel in the image. This template has a separate mean image for each pixel channel. The channels are assumed independent given the means. We train the model by cutting out by hand a single instance of each letter from our corpus (figure 2). This forms the central portion of the template. Pixels above and below this Model Perfect transcription unigram bigram trigram matching chars 21019 14603 15572 15788 substitutions 0 5487 4597 4410 insertions 0 534 541 507 deletions 0 773 718 695 Table 1: Edit distance between our transcribed Terence and the editor’s version. Note the trigram model produces significantly fewer letter errors than the unigram model, but that the error rate is still a substantial 25%. central box are generated from a single gaussian used to model background pixels (basically white pixels). We add a third variable yt to our hidden state indicating the vertical position of the central box. However, since we are uninterested in actually recovering this variable, during inference we sum it out of the model. The width of a character is constrained to be close to the width (tw ) of our hand cut example by setting p(w|c) = 0 for w < tw − k and w > tw + k. Here k is a small, user defined integer. Within this range, p(w|c) is distributed uniformly, larger templates are created by appending pixels from the background model to the template and smaller ones by simply removing the right k-most columns of the hand cut example. For features, we generate five image representations, shown in figure 2. The first is a grayscale version of the original color image. The second and third are generated by convolving the grayscale image with a vertical derivative of gaussian filter, separating the positive and negative components of this response, and smoothing each of these gradient images separately. The fourth and fifth are generated similarly but with a horizontal derivative of gaussian filter. We have experimented with different weightings of these 5 channels. In practice we use the gray scale channel and the horizontal gradient channels. We emphasize the horizontal pieces since these seem the more discriminative. 2.3. Transcription For transcription, we model letters as coming from an n-gram language model, with no dependencies between words. Thus, the probability of a letter depends on the k letters before it, where k = n unless this would cross a word boundary in which case the history terminates at this boundary. We chose not to model word to word transition probabilities since, unlike in English, word order in Latin is highly arbitrary. This transition model is fit from a corpus of ascii encoded latin. We have experimented with unigram (i.e. uniform transition probabilities), bigram and trigram letter models. We can perform transcription by fitting the maximum likelihood path through any given line. Some results of this technique are shown in figure 3. 2.4. Search For search, we rank lines by the probability that they contain our search word. We set up a finite state machine like that in figure 4. In this figure, ‘bg’ represents our background model for that portion of the line not generated by our search word. We can use any of the n-gram letter models described for transcription as the transition model for ‘bg’. The probability that the line contains the search word is the probability that this FSM takes path 1. We use this FSM as the transition model for our gHMM, and output the posterior probability of the two arrows leading into the end state. 1 and 2 are user defined weights, but in practice the algorithm does not appear to be particular sensitive to the choice of these parameters. The results presented here use the unigram model. Editorial translation Orator ad vos venio ornatu prologi: unigram b u rt o r a d u o s u em o o r n a t u p r o l o g r b u rt o r a d v o s v em o o r u a t u p r o l o g r fo r a t o r a d v o s v en i o o r n a t u p r o l o g i bigram trigram Figure 3: We transcribe the text by finding the maximum likelihood path through the gHMM. The top line shows the standard version of the line (obtained by consensus among editors who have consulted various manuscripts; we obtained this information in electronic form from http://www.thelatinlibrary.com). Below, we show the line as segmented and transcribed by unigram, bigram and trigram models; the unigram and bigram models transcribe one word as “vemo”, but the stronger trigram model forces the two letters to be segmented and correctly transcribes the word as “venio”, illustrating the considerable benefit to be obtained by segmenting only at recognition time. 1 − ε1 Path 1 1 − ε2 a b bg ε1 Ω bg Path 2 ε2 Figure 4: The finite state machine to search for the word ‘ab.’ ‘bg’ is a place holder for the larger finite state machine defined by our language model’s transition matrix. 3. Results Figure 1 shows a page from our collection. This is a scanned 12th century manuscript of Terence’s Comedies, obtained from the collection at [1]. In preprocessing, we extract individual lines of text by rotating the image to various degrees and projecting the sum of the pixel values onto the y-axis. We choose the orientation whose projection vector has the lowest entropy, and then segment lines by cutting at minima of this projection. Transcription is not our primary task, but methods that produce good transcriptions are going to support good searches. The gHMM can produce a surprisingly good transcription, given how little training data is used to train the emission model. We aligned an editors version of Terence with 25 pages from the manuscript by hand, and computed the edit distance between the transcribed text and the aligned text; as table 1 indicates, approximately 75% of letters are read correctly. Search results are strong. We show results for two documents. The first set of results refers to the edition of Terence’s Comedies, from which we took the 22 letter instances. In particular, for any given search term, our process ranks the complete set of lines. We used a hand alignment of the manuscript to determine which lines contained each term; figure 5 shows an overview of searches performed using every word that appears in the 50 100 150 200 250 300 350 400 450 500 550 Figure 5: Our search ranks 587 manuscript lines, with higher ranking lines more likely to contain the relevant term. This figure shows complete search results for each term that appears more than three times in the 587 lines. Each row represents the ranked search results for a term, and a black mark appears if the search term is actually in the line; a successful search will therefore appear as a row which is wholly dark to the left, and then wholly light. All 587 lines are represented. More common terms are represented by lower rows. More detailed results appear in figure 5 and figure 6; this summary figure suggests almost all searches are highly successful. document more than three times, in particular, showing which of the ranked set of lines actually contained the search term. For almost every search, the term appears mainly in the lines with higher rank. Figure 6 contains more detailed information for a smaller set of words. We do not score the position of a word in a line (for practical reasons). Figure 7 demonstrates (a) that our search respects the ink of the document and (b) that for the Terence document, word positions are accurately estimated. The spelling of mediaeval documents is typically cleaned up by editors; in our manuscript, the scribe reliably spells “michi” for the standard “mihi”. A search on “michi” produces many instances; a search on “mihi” produces none, because the ink doesn’t have any. Notice this phenomenon also in the bottom right line of figure 7, the scribe writes “habet, ut consumat nunc cum nichil obsint doli” and the editor gives “habet, ut consumat nunc quom nil obsint doli.” Figure 8 shows that searches on short strings produce many words containing that string as one would wish. 4. Discussion We have shown that it is possible to make at least some handwritten mediaeval manuscripts accessible to full text search, without requiring an aligned text or much supervisory data. Our documents have very regular letters, and letter frequencies — which can be obtained from transcribed Latin — appear to provide so powerful a cue that relatively little detailed information about letter shapes is required. Linking letter segmentation and recognition has thoroughly beneficial effects. This suggests that the pool of manuscripts that can be made accessible in this way is large. In particular, we have used our method, trained on 22 instances of letters from one document, to search another document. Figure 9 shows the results from two searches of our second document (MS. Auct. D. 2. 16, Latin Gospels with beast-headed evangelist portraits made at Landvennec, Brittany, late 9th or early 10th century, from [1]). No information from this document was used in training at all; but letter 1tu arbitror pater etiam nisi factum primum siet vero illi inter hic michi ibi qui tu ibi michi 0.9 0.8 0.7 qui hic 0.6 inter 0.5 illi 0.4 siet 0.3 vero 0.2 nisi 0.1 50 100 150 200 250 300 350 400 450 500 550 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Figure 6: On the left, search results for selected words (indicated on the leftmost column). Each row represents the ranked search results for a term, and a black mark appears if the search term is actually in the line; a successful search will therefore appear as a row which is wholly dark to the left, and then wholly light. Note only the top 300 results are represented, and that lines containing the search term are almost always at or close to the top of the search results (black marks to the left). On the right, we plot precision against recall for a set of different words by taking the top 10, 20, ... lines returned from the search, and checking them against the aligned manuscript. Note that, once all cases have been found, if the size of the pool is increased the precision will fall with 100% recall; many words work well, with most of the first 20 or so lines returned containing the search term. shapes are sufficiently well shared that the search is still useful. All this suggests that one might be able to use EM to link three processes: one that clusters to determine letter shapes; one that segments letters; and one that imposes a language model. Such a system might be able to make handwritten Latin searchable with no training data. References [1] Early Manuscripts at Oxford University. Bodleian library ms. auct. f. 2.13. http://image.ox.ac.uk/. [2] Serge Belongie, Jitendra Malik, and Jan Puzicha. Shape matching and object recognition using shape contexts. IEEE T. Pattern Analysis and Machine Intelligence, 24(4):509–522, 2002. [3] Philipp Koehn and Kevin Knight. Estimating word translation probabilities from unrelated monolingual corpora. In Proc. of the 17th National Conf. on AI, pages 711–715. AAAI Press / The MIT Press, 2000. [4] Y. LeCun, L. Bottou, and Y. Bengio. Reading checks with graph transformer networks. In International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 151–154, Munich, 1997. IEEE. [5] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. [6] R. Plamondon and S.N. Srihari. Online and off-line handwriting recognition: a comprehensive survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):63–84, 2000. [7] T. M. Rath and R. Manmatha. Word image matching using dynamic time warping. In Proc. of the Conf. on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 521–527, 2003. [8] Alessandro Vinciarelli, Samy Bengio, and Horst Bunke. Offline recognition of unconstrained handwritten texts using hmms and statistical language models. IEEE Trans. Pattern Anal. Mach. Intell., 26(6):709–720, 2004. michi: Spe incerta certum mihi laborem sustuli, mihi: Faciuntne intellegendo ut nil intellegant? michi: Nonnumquam conlacrumabat. placuit tum id mihi. mihi: Placuit: despondi. hic nuptiis dictust dies. michi: Sto exspectans siquid mi imperent. venit una,
5 0.041439589 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach
Author: Francis R. Bach, Michael I. Jordan
Abstract: We present an algorithm to perform blind, one-microphone speech separation. Our algorithm separates mixtures of speech without modeling individual speakers. Instead, we formulate the problem of speech separation as a problem in segmenting the spectrogram of the signal into two or more disjoint sets. We build feature sets for our segmenter using classical cues from speech psychophysics. We then combine these features into parameterized affinity matrices. We also take advantage of the fact that we can generate training examples for segmentation by artificially superposing separately-recorded signals. Thus the parameters of the affinity matrices can be tuned using recent work on learning spectral clustering [1]. This yields an adaptive, speech-specific segmentation algorithm that can successfully separate one-microphone speech mixtures. 1
6 0.040529758 162 nips-2004-Semi-Markov Conditional Random Fields for Information Extraction
7 0.036874905 192 nips-2004-The power of feature clustering: An application to object detection
8 0.03677322 61 nips-2004-Efficient Out-of-Sample Extension of Dominant-Set Clusters
9 0.032991193 47 nips-2004-Contextual Models for Object Detection Using Boosted Random Fields
10 0.032852545 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution
11 0.031130139 186 nips-2004-The Correlated Correspondence Algorithm for Unsupervised Registration of Nonrigid Surfaces
12 0.030956648 40 nips-2004-Common-Frame Model for Object Recognition
13 0.028433962 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering
14 0.028031956 161 nips-2004-Self-Tuning Spectral Clustering
15 0.027428199 205 nips-2004-Who's In the Picture
16 0.027404916 14 nips-2004-A Topographic Support Vector Machine: Classification Using Local Label Configurations
17 0.026929289 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
18 0.025151659 182 nips-2004-Synergistic Face Detection and Pose Estimation with Energy-Based Models
19 0.024400439 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics
20 0.023648905 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection
topicId topicWeight
[(0, -0.078), (1, 0.007), (2, -0.016), (3, -0.069), (4, 0.021), (5, 0.027), (6, 0.025), (7, 0.003), (8, -0.023), (9, -0.022), (10, -0.057), (11, 0.012), (12, -0.064), (13, 0.016), (14, -0.043), (15, -0.002), (16, 0.017), (17, -0.001), (18, 0.007), (19, -0.027), (20, 0.044), (21, 0.001), (22, -0.034), (23, -0.048), (24, 0.005), (25, -0.017), (26, -0.004), (27, 0.002), (28, -0.01), (29, 0.045), (30, 0.033), (31, 0.03), (32, 0.017), (33, 0.02), (34, 0.088), (35, -0.088), (36, -0.017), (37, -0.04), (38, 0.042), (39, -0.035), (40, -0.031), (41, -0.023), (42, -0.071), (43, 0.011), (44, -0.017), (45, 0.054), (46, 0.039), (47, 0.061), (48, 0.047), (49, -0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.92862207 199 nips-2004-Using Machine Learning to Break Visual Human Interaction Proofs (HIPs)
Author: Kumar Chellapilla, Patrice Y. Simard
Abstract: Machine learning is often used to automatically solve human tasks. In this paper, we look for tasks where machine learning algorithms are not as good as humans with the hope of gaining insight into their current limitations. We studied various Human Interactive Proofs (HIPs) on the market, because they are systems designed to tell computers and humans apart by posing challenges presumably too hard for computers. We found that most HIPs are pure recognition tasks which can easily be broken using machine learning. The harder HIPs use a combination of segmentation and recognition tasks. From this observation, we found that building segmentation tasks is the most effective way to confuse machine learning algorithms. This has enabled us to build effective HIPs (which we deployed in MSN Passport), as well as design challenging segmentation tasks for machine learning algorithms. 1 In t rod u ct i on The OCR problem for high resolution printed text has virtually been solved 10 years ago [1]. On the other hand, cursive handwriting recognition today is still too poor for most people to rely on. Is there a fundamental difference between these two seemingly similar problems? To shed more light on this question, we study problems that have been designed to be difficult for computers. The hope is that we will get some insight on what the stumbling blocks are for machine learning and devise appropriate tests to further understand their similarities and differences. Work on distinguishing computers from humans traces back to the original Turing Test [2] which asks that a human distinguish between another human and a machine by asking questions of both. Recent interest has turned to developing systems that allow a computer to distinguish between another computer and a human. These systems enable the construction of automatic filters that can be used to prevent automated scripts from utilizing services intended for humans [4]. Such systems have been termed Human Interactive Proofs (HIPs) [3] or Completely Automated Public Turing Tests to Tell Computers and Humans Apart (CAPTCHAs) [4]. An overview of the work in this area can be found in [5]. Construction of HIPs that are of practical value is difficult because it is not sufficient to develop challenges at which humans are somewhat more successful than machines. This is because the cost of failure for an automatic attacker is minimal compared to the cost of failure for humans. Ideally a HIP should be solved by humans more than 80% of the time, while an automatic script with reasonable resource use should succeed less than 0.01% of the time. This latter ratio (1 in 10,000) is a function of the cost of an automatic trial divided by the cost of having a human perform the attack. This constraint of generating tasks that are failed 99.99% of the time by all automated algorithms has generated various solutions which can easily be sampled on the internet. Seven different HIPs, namely, Mailblocks, MSN (before April 28th, 2004), Ticketmaster, Yahoo, Yahoo v2 (after Sept’04), Register, and Google, will be given as examples in the next section. We will show in Section 3 that machinelearning-based attacks are far more successful than 1 in 10,000. Yet, some of these HIPs are harder than others and could be made even harder by identifying the recognition and segmentation parts, and emphasizing the latter. Section 4 presents examples of more difficult HIPs which are much more respectable challenges for machine learning, and yet surprisingly easy for humans. The final section discusses a (known) weakness of machine learning algorithms and suggests designing simple artificial datasets for studying this weakness. 2 Exa mp les o f H I Ps The HIPs explored in this paper are made of characters (or symbols) rendered to an image and presented to the user. Solving the HIP requires identifying all characters in the correct order. The following HIPs can be sampled from the web: Mailblocks: While signing up for free email service with (www.mailblocks.com), you will find HIP challenges of the type: mailblocks MSN: While signing up for free e-mail with MSN Hotmail (www.hotmail.com), you will find HIP challenges of the type: Register.com: While requesting a whois lookup for a domain at www.register.com, you will HIP challenges of the type: Yahoo!/EZ-Gimpy (CMU): While signing up for free e-mail service with Yahoo! (www.yahoo.com), you will receive HIP challenges of the type: Yahoo! (version 2): Starting in August 2004, Yahoo! introduced their second generation HIP. Three examples are presented below: Ticketmaster: While looking for concert tickets at www.ticketmaster.com, you will receive HIP challenges of the type: Google/Gmail: While signing up for free e-mail with Gmail at www.google.com, one will receive HIP challenges of the type: While solutions to Yahoo HIPs are common English words, those for ticketmaster and Google do not necessarily belong to the English dictionary. They appear to have been created using a phonetic generator [8]. 3 Usi n g ma ch i n e lea rn i n g t o b rea k H IP s Breaking HIPs is not new. Mori and Malik [7] have successfully broken the EZGimpy (92% success) and Gimpy (33% success) HIPs from CMU. Our approach aims at an automatic process for solving multiple HIPs with minimum human intervention, using machine learning. In this paper, our main goal is to learn more about the common strengths and weaknesses of these HIPs rather than to prove that we can break any one HIP in particular with the highest possible success rate. We have results for six different HIPs: EZ-Gimpy/Yahoo, Yahoo v2, mailblocks, register, ticketmaster, and Google. To simplify our study, we will not be using language models in our attempt to break HIPs. For example, there are only about 600 words in the EZ-Gimpy dictionary [7], which means that a random guess attack would get a success rate of 1 in 600 (more than enough to break the HIP, i.e., greater than 0.01% success). HIPs become harder when no language model is used. Similarly, when a HIP uses a language model to generate challenges, success rate of attacks can be significantly improved by incorporating the language model. Further, since the language model is not common to all HIPs studied, it was not used in this paper. Our generic method for breaking all of these HIPs is to write a custom algorithm to locate the characters, and then use machine learning for recognition. Surprisingly, segmentation, or finding the characters, is simple for many HIPs which makes the process of breaking the HIP particularly easy. Gimpy uses a single constant predictable color (black) for letters even though the background color changes. We quickly realized that once the segmentation problem is solved, solving the HIP becomes a pure recognition problem, and it can trivially be solved using machine learning. Our recognition engine is based on neural networks [6][9]. It yielded a 0.4% error rate on the MNIST database, uses little memory, and is very fast for recognition (important for breaking HIPs). For each HIP, we have a segmentation step, followed by a recognition step. It should be stressed that we are not trying to solve every HIP of a given type i.e., our goal is not 100% success rate, but something efficient that can achieve much better than 0.01%. In each of the following experiments, 2500 HIPs were hand labeled and used as follows (a) recognition (1600 for training, 200 for validation, and 200 for testing), and (b) segmentation (500 for testing segmentation). For each of the five HIPs, a convolution neural network, identical to the one described in [6], was trained and tested on gray level character images centered on the guessed character positions (see below). The trained neural network became the recognizer. 3.1 M a i l b l oc k s To solve the HIP, we select the red channel, binarize and erode it, extract the largest connected components (CCs), and breakup CCs that are too large into two or three adjacent CCs. Further, vertically overlapping half character size CCs are merged. The resulting rough segmentation works most of the time. Here is an example: For instance, in the example above, the NN would be trained, and tested on the following images: … The end-to-end success rate is 88.8% for segmentation, 95.9% for recognition (given correct segmentation), and (0.888)*(0.959)7 = 66.2% total. Note that most of the errors come from segmentation, even though this is where all the custom programming was invested. 3.2 Register The procedure to solve HIPs is very similar. The image was smoothed, binarized, and the largest 5 connected components were identified. Two examples are presented below: The end-to-end success rate is 95.4% for segmentation, 87.1% for recognition (given correct segmentation), and (0.954)*(0.871)5 = 47.8% total. 3.3 Y a h oo/ E Z - G i mp y Unlike the mailblocks and register HIPs, the Yahoo/EZ-Gimpy HIPs are richer in that a variety of backgrounds and clutter are possible. Though some amount of text warping is present, the text color, size, and font have low variability. Three simple segmentation algorithms were designed with associated rules to identify which algorithm to use. The goal was to keep these simple yet effective: a) No mesh: Convert to grayscale image, threshold to black and white, select large CCs with sizes close to HIP char sizes. One example: b) Black mesh: Convert to grayscale image, threshold to black and white, remove vertical and horizontal line pixels that don’t have neighboring pixels, select large CCs with sizes close to HIP char sizes. One example: c) White mesh: Convert to grayscale image, threshold to black and white, add black pixels (in white line locations) if there exist neighboring pixels, select large CCs with sizes close to HIP char sizes. One example: Tests for black and white meshes were performed to determine which segmentation algorithm to use. The end-to-end success rate was 56.2% for segmentation (38.2% came from a), 11.8% from b), and 6.2% from c), 90.3% for recognition (given correct segmentation), and (0.562)*(0.903)4.8 = 34.4% total. The average length of a Yahoo HIP solution is 4.8 characters. 3.4 T i c k e t ma s t e r The procedure that solved the Yahoo HIP is fairly successful at solving some of the ticket master HIPs. These HIPs are characterized by cris-crossing lines at random angles clustered around 0, 45, 90, and 135 degrees. A multipronged attack as in the Yahoo case (section 3.3) has potential. In the interests of simplicity, a single attack was developed: Convert to grayscale, threshold to black and white, up-sample image, dilate first then erode, select large CCs with sizes close to HIP char sizes. One example: The dilate-erode combination causes the lines to be removed (along with any thin objects) but retains solid thick characters. This single attack is successful in achieving an end-to-end success rate of 16.6% for segmentation, the recognition rate was 82.3% (in spite of interfering lines), and (0.166)*(0.823)6.23 = 4.9% total. The average HIP solution length is 6.23 characters. 3.5 Y a h oo ve r s i on 2 The second generation HIP from Yahoo had several changes: a) it did not use words from a dictionary or even use a phonetic generator, b) it uses only black and white colors, c) uses both letters and digits, and d) uses connected lines and arcs as clutter. The HIP is somewhat similar to the MSN/Passport HIP which does not use a dictionary, uses two colors, uses letters and digits, and background and foreground arcs as clutter. Unlike the MSN/Passport HIP, several different fonts are used. A single segmentation attack was developed: Remove 6 pixel border, up-sample, dilate first then erode, select large CCs with sizes close to HIP char sizes. The attack is practically identical to that used for the ticketmaster HIP with different preprocessing stages and slightly modified parameters. Two examples: This single attack is successful in achieving an end-to-end success rate of 58.4% for segmentation, the recognition rate was 95.2%, and (0.584)*(0.952)5 = 45.7% total. The average HIP solution length is 5 characters. 3.6 G oog l e / G M a i l The Google HIP is unique in that it uses only image warp as a means of distorting the characters. Similar to the MSN/Passport and Yahoo version 2 HIPs, it is also two color. The HIP characters are arranged closed to one another (they often touch) and follow a curved baseline. The following very simple attack was used to segment Google HIPs: Convert to grayscale, up-sample, threshold and separate connected components. a) b) This very simple attack gives an end-to-end success rate of 10.2% for segmentation, the recognition rate was 89.3%, giving (0.102)*(0.893)6.5 = 4.89% total probability of breaking a HIP. Average Google HIP solution length is 6.5 characters. This can be significantly improved upon by judicious use of dilate-erode attack. A direct application doesn’t do as well as it did on the ticketmaster and yahoo HIPs (because of the shear and warp of the baseline of the word). More successful and complicated attacks might estimate and counter the shear and warp of the baseline to achieve better success rates. 4 Lesso n s lea rn ed f ro m b rea ki n g H IPs From the previous section, it is clear that most of the errors come from incorrect segmentations, even though most of the development time is spent devising custom segmentation schemes. This observation raises the following questions: Why is segmentation a hard problem? Can we devise harder HIPs and datasets? Can we build an automatic segmentor? Can we compare classification algorithms based on how useful they are for segmentation? 4.1 T h e s e g me n t a t i on p r ob l e m As a review, segmentation is difficult for the following reasons: 1. Segmentation is computationally expensive. In order to find valid patterns, a recognizer must attempt recognition at many different candidate locations. 2. The segmentation function is complex. To segment successfully, the system must learn to identify which patterns are valid among the set of all possible valid and non-valid patterns. This task is intrinsically more difficult than classification because the space of input is considerably larger. Unlike the space of valid patterns, the space of non-valid patterns is typically too vast to sample. This is a problem for many learning algorithms which yield too many false positives when presented non-valid patterns. 3. Identifying valid characters among a set of valid and invalid candidates is a combinatorial problem. For example, correctly identifying which 8 characters among 20 candidates (assuming 12 false positives), has a 1 in 125,970 (20 choose 8) chances of success by random guessing. 4.2 B ui l d i n g b e t te r / h a r de r H I P s We can use what we have learned to build better HIPs. For instance the HIP below was designed to make segmentation difficult and a similar version has been deployed by MSN Passport for hotmail registrations (www.hotmail.com): The idea is that the additional arcs are themselves good candidates for false characters. The previous segmentation attacks would fail on this HIP. Furthermore, simple change of fonts, distortions, or arc types would require extensive work for the attacker to adjust to. We believe HIPs that emphasize the segmentation problem, such as the above example, are much stronger than the HIPs we examined in this paper, which rely on recognition being difficult. Pushing this to the extreme, we can easily generate the following HIPs: Despite the apparent difficulty of these HIPs, humans are surprisingly good at solving these, indicating that humans are far better than computers at segmentation. This approach of adding several competing false positives can in principle be used to automatically create difficult segmentation problems or benchmarks to test classification algorithms. 4.3 B ui l d i n g a n a ut o ma t i c s e g me n t or To build an automatic segmentor, we could use the following procedure. Label characters based on their correct position and train a recognizer. Apply the trained recognizer at all locations in the HIP image. Collect all candidate characters identified with high confidence by the recognizer. Compute the probability of each combination of candidates (going from left to right), and output the solution string with the highest probability. This is better illustrated with an example. Consider the following HIP (to the right). The trained neural network has these maps (warm colors indicate recognition) that show that K, Y, and so on are correctly identified. However, the maps for 7 and 9 show several false positives. In general, we would get the following color coded map for all the different candidates: HIP K Y B 7 9 With a threshold of 0.5 on the network’s outputs, the map obtained is: We note that there are several false positives for each true positive. The number of false positives per true positive character was found to be between 1 and 4, giving a 1 in C(16,8) = 12,870 to 1 in C(32,8) = 10,518,300 random chance of guessing the correct segmentation for the HIP characters. These numbers can be improved upon by constraining solution strings to flow sequentially from left to right and by restricting overlap. For each combination, we compute a probability by multiplying the 8 probabilities of the classifier for each position. The combination with the highest probability is the one proposed by the classifier. We do not have results for such an automatic segmentor at this time. It is interesting to note that with such a method a classifier that is robust to false positives would do far better than one that is not. This suggests another axis for comparing classifiers. 5 Con clu si on In this paper, we have successfully applied machine learning to the problem of solving HIPs. We have learned that decomposing the HIP problem into segmentation and recognition greatly simplifies analysis. Recognition on even unprocessed images (given segmentation is a solved) can be done automatically using neural networks. Segmentation, on the other hand, is the difficulty differentiator between weaker and stronger HIPs and requires custom intervention for each HIP. We have used this observation to design new HIPs and new tests for machine learning algorithms with the hope of improving them. A c k n ow l e d ge me n t s We would like to acknowledge Chau Luu and Eric Meltzer for their help with labeling and segmenting various HIPs. We would also like to acknowledge Josh Benaloh and Cem Paya for stimulating discussions on HIP security. References [1] Baird HS (1992), “Anatomy of a versatile page reader,” IEEE Pro., v.80, pp. 1059-1065. [2] Turing AM (1950), “Computing Machinery and Intelligence,” Mind, 59:236, pp. 433-460. [3] First Workshop on Human Interactive Proofs, Palo Alto, CA, January 2002. [4] Von Ahn L, Blum M, and Langford J, The Captcha Project. http://www.captcha.net [5] Baird HS and Popat K (2002) “Human Interactive Proofs and Document Image Analysis,” Proc. IAPR 2002 Workshop on Document Analysis Systerms, Princeton, NJ. [6] Simard PY, Steinkraus D, and Platt J, (2003) “Best Practice for Convolutional Neural Networks Applied to Visual Document Analysis,” in International Conference on Document Analysis and Recognition (ICDAR), pp. 958-962, IEEE Computer Society, Los Alamitos. [7] Mori G, Malik J (2003), “Recognizing Objects in Adversarial Clutter: Breaking a Visual CAPTCHA,” Proc. of the Computer Vision and Pattern Recognition (CVPR) Conference, IEEE Computer Society, vol.1, pages:I-134 - I-141, June 18-20, 2003 [8] Chew, M. and Baird, H. S. (2003), “BaffleText: a Human Interactive Proof,” Proc., 10th IS&T;/SPIE Document Recognition & Retrieval Conf., Santa Clara, CA, Jan. 22. [9] LeCun Y, Bottou L, Bengio Y, and Haffner P, “Gradient-based learning applied to document recognition,’ Proceedings of the IEEE, Nov. 1998.
2 0.55443901 205 nips-2004-Who's In the Picture
Author: Tamara L. Berg, Alexander C. Berg, Jaety Edwards, David A. Forsyth
Abstract: The context in which a name appears in a caption provides powerful cues as to who is depicted in the associated image. We obtain 44,773 face images, using a face detector, from approximately half a million captioned news images and automatically link names, obtained using a named entity recognizer, with these faces. A simple clustering method can produce fair results. We improve these results significantly by combining the clustering process with a model of the probability that an individual is depicted given its context. Once the labeling procedure is over, we have an accurately labeled set of faces, an appearance model for each individual depicted, and a natural language model that can produce accurate results on captions in isolation. 1
3 0.51076293 192 nips-2004-The power of feature clustering: An application to object detection
Author: Shai Avidan, Moshe Butman
Abstract: We give a fast rejection scheme that is based on image segments and demonstrate it on the canonical example of face detection. However, instead of focusing on the detection step we focus on the rejection step and show that our method is simple and fast to be learned, thus making it an excellent pre-processing step to accelerate standard machine learning classifiers, such as neural-networks, Bayes classifiers or SVM. We decompose a collection of face images into regions of pixels with similar behavior over the image set. The relationships between the mean and variance of image segments are used to form a cascade of rejectors that can reject over 99.8% of image patches, thus only a small fraction of the image patches must be passed to a full-scale classifier. Moreover, the training time for our method is much less than an hour, on a standard PC. The shape of the features (i.e. image segments) we use is data-driven, they are very cheap to compute and they form a very low dimensional feature space in which exhaustive search for the best features is tractable. 1
4 0.49740595 61 nips-2004-Efficient Out-of-Sample Extension of Dominant-Set Clusters
Author: Massimiliano Pavan, Marcello Pelillo
Abstract: Dominant sets are a new graph-theoretic concept that has proven to be relevant in pairwise data clustering problems, such as image segmentation. They generalize the notion of a maximal clique to edgeweighted graphs and have intriguing, non-trivial connections to continuous quadratic optimization and spectral-based grouping. We address the problem of grouping out-of-sample examples after the clustering process has taken place. This may serve either to drastically reduce the computational burden associated to the processing of very large data sets, or to efficiently deal with dynamic situations whereby data sets need to be updated continually. We show that the very notion of a dominant set offers a simple and efficient way of doing this. Numerical experiments on various grouping problems show the effectiveness of the approach. 1
5 0.48655528 182 nips-2004-Synergistic Face Detection and Pose Estimation with Energy-Based Models
Author: Margarita Osadchy, Matthew L. Miller, Yann L. Cun
Abstract: We describe a novel method for real-time, simultaneous multi-view face detection and facial pose estimation. The method employs a convolutional network to map face images to points on a manifold, parametrized by pose, and non-face images to points far from that manifold. This network is trained by optimizing a loss function of three variables: image, pose, and face/non-face label. We test the resulting system, in a single configuration, on three standard data sets – one for frontal pose, one for rotated faces, and one for profiles – and find that its performance on each set is comparable to previous multi-view face detectors that can only handle one form of pose variation. We also show experimentally that the system’s accuracy on both face detection and pose estimation is improved by training for the two tasks together.
6 0.45233411 107 nips-2004-Making Latin Manuscripts Searchable using gHMM's
7 0.39069489 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition
8 0.38823479 180 nips-2004-Synchronization of neural networks by mutual learning and its application to cryptography
9 0.37519309 68 nips-2004-Face Detection --- Efficient and Rank Deficient
10 0.36784458 43 nips-2004-Conditional Models of Identity Uncertainty with Application to Noun Coreference
11 0.36507186 120 nips-2004-Modeling Conversational Dynamics as a Mixed-Memory Markov Process
12 0.36202365 51 nips-2004-Detecting Significant Multidimensional Spatial Clusters
13 0.36172292 191 nips-2004-The Variational Ising Classifier (VIC) Algorithm for Coherently Contaminated Data
14 0.36078703 162 nips-2004-Semi-Markov Conditional Random Fields for Information Extraction
15 0.35953692 193 nips-2004-Theories of Access Consciousness
16 0.35537484 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection
17 0.3512935 109 nips-2004-Mass Meta-analysis in Talairach Space
18 0.34472251 53 nips-2004-Discriminant Saliency for Visual Recognition from Cluttered Scenes
19 0.33756158 122 nips-2004-Modelling Uncertainty in the Game of Go
20 0.32813469 171 nips-2004-Solitaire: Man Versus Machine
topicId topicWeight
[(13, 0.047), (15, 0.109), (17, 0.014), (26, 0.031), (31, 0.015), (33, 0.119), (35, 0.02), (50, 0.464), (51, 0.017), (53, 0.01), (81, 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.86863232 199 nips-2004-Using Machine Learning to Break Visual Human Interaction Proofs (HIPs)
Author: Kumar Chellapilla, Patrice Y. Simard
Abstract: Machine learning is often used to automatically solve human tasks. In this paper, we look for tasks where machine learning algorithms are not as good as humans with the hope of gaining insight into their current limitations. We studied various Human Interactive Proofs (HIPs) on the market, because they are systems designed to tell computers and humans apart by posing challenges presumably too hard for computers. We found that most HIPs are pure recognition tasks which can easily be broken using machine learning. The harder HIPs use a combination of segmentation and recognition tasks. From this observation, we found that building segmentation tasks is the most effective way to confuse machine learning algorithms. This has enabled us to build effective HIPs (which we deployed in MSN Passport), as well as design challenging segmentation tasks for machine learning algorithms. 1 In t rod u ct i on The OCR problem for high resolution printed text has virtually been solved 10 years ago [1]. On the other hand, cursive handwriting recognition today is still too poor for most people to rely on. Is there a fundamental difference between these two seemingly similar problems? To shed more light on this question, we study problems that have been designed to be difficult for computers. The hope is that we will get some insight on what the stumbling blocks are for machine learning and devise appropriate tests to further understand their similarities and differences. Work on distinguishing computers from humans traces back to the original Turing Test [2] which asks that a human distinguish between another human and a machine by asking questions of both. Recent interest has turned to developing systems that allow a computer to distinguish between another computer and a human. These systems enable the construction of automatic filters that can be used to prevent automated scripts from utilizing services intended for humans [4]. Such systems have been termed Human Interactive Proofs (HIPs) [3] or Completely Automated Public Turing Tests to Tell Computers and Humans Apart (CAPTCHAs) [4]. An overview of the work in this area can be found in [5]. Construction of HIPs that are of practical value is difficult because it is not sufficient to develop challenges at which humans are somewhat more successful than machines. This is because the cost of failure for an automatic attacker is minimal compared to the cost of failure for humans. Ideally a HIP should be solved by humans more than 80% of the time, while an automatic script with reasonable resource use should succeed less than 0.01% of the time. This latter ratio (1 in 10,000) is a function of the cost of an automatic trial divided by the cost of having a human perform the attack. This constraint of generating tasks that are failed 99.99% of the time by all automated algorithms has generated various solutions which can easily be sampled on the internet. Seven different HIPs, namely, Mailblocks, MSN (before April 28th, 2004), Ticketmaster, Yahoo, Yahoo v2 (after Sept’04), Register, and Google, will be given as examples in the next section. We will show in Section 3 that machinelearning-based attacks are far more successful than 1 in 10,000. Yet, some of these HIPs are harder than others and could be made even harder by identifying the recognition and segmentation parts, and emphasizing the latter. Section 4 presents examples of more difficult HIPs which are much more respectable challenges for machine learning, and yet surprisingly easy for humans. The final section discusses a (known) weakness of machine learning algorithms and suggests designing simple artificial datasets for studying this weakness. 2 Exa mp les o f H I Ps The HIPs explored in this paper are made of characters (or symbols) rendered to an image and presented to the user. Solving the HIP requires identifying all characters in the correct order. The following HIPs can be sampled from the web: Mailblocks: While signing up for free email service with (www.mailblocks.com), you will find HIP challenges of the type: mailblocks MSN: While signing up for free e-mail with MSN Hotmail (www.hotmail.com), you will find HIP challenges of the type: Register.com: While requesting a whois lookup for a domain at www.register.com, you will HIP challenges of the type: Yahoo!/EZ-Gimpy (CMU): While signing up for free e-mail service with Yahoo! (www.yahoo.com), you will receive HIP challenges of the type: Yahoo! (version 2): Starting in August 2004, Yahoo! introduced their second generation HIP. Three examples are presented below: Ticketmaster: While looking for concert tickets at www.ticketmaster.com, you will receive HIP challenges of the type: Google/Gmail: While signing up for free e-mail with Gmail at www.google.com, one will receive HIP challenges of the type: While solutions to Yahoo HIPs are common English words, those for ticketmaster and Google do not necessarily belong to the English dictionary. They appear to have been created using a phonetic generator [8]. 3 Usi n g ma ch i n e lea rn i n g t o b rea k H IP s Breaking HIPs is not new. Mori and Malik [7] have successfully broken the EZGimpy (92% success) and Gimpy (33% success) HIPs from CMU. Our approach aims at an automatic process for solving multiple HIPs with minimum human intervention, using machine learning. In this paper, our main goal is to learn more about the common strengths and weaknesses of these HIPs rather than to prove that we can break any one HIP in particular with the highest possible success rate. We have results for six different HIPs: EZ-Gimpy/Yahoo, Yahoo v2, mailblocks, register, ticketmaster, and Google. To simplify our study, we will not be using language models in our attempt to break HIPs. For example, there are only about 600 words in the EZ-Gimpy dictionary [7], which means that a random guess attack would get a success rate of 1 in 600 (more than enough to break the HIP, i.e., greater than 0.01% success). HIPs become harder when no language model is used. Similarly, when a HIP uses a language model to generate challenges, success rate of attacks can be significantly improved by incorporating the language model. Further, since the language model is not common to all HIPs studied, it was not used in this paper. Our generic method for breaking all of these HIPs is to write a custom algorithm to locate the characters, and then use machine learning for recognition. Surprisingly, segmentation, or finding the characters, is simple for many HIPs which makes the process of breaking the HIP particularly easy. Gimpy uses a single constant predictable color (black) for letters even though the background color changes. We quickly realized that once the segmentation problem is solved, solving the HIP becomes a pure recognition problem, and it can trivially be solved using machine learning. Our recognition engine is based on neural networks [6][9]. It yielded a 0.4% error rate on the MNIST database, uses little memory, and is very fast for recognition (important for breaking HIPs). For each HIP, we have a segmentation step, followed by a recognition step. It should be stressed that we are not trying to solve every HIP of a given type i.e., our goal is not 100% success rate, but something efficient that can achieve much better than 0.01%. In each of the following experiments, 2500 HIPs were hand labeled and used as follows (a) recognition (1600 for training, 200 for validation, and 200 for testing), and (b) segmentation (500 for testing segmentation). For each of the five HIPs, a convolution neural network, identical to the one described in [6], was trained and tested on gray level character images centered on the guessed character positions (see below). The trained neural network became the recognizer. 3.1 M a i l b l oc k s To solve the HIP, we select the red channel, binarize and erode it, extract the largest connected components (CCs), and breakup CCs that are too large into two or three adjacent CCs. Further, vertically overlapping half character size CCs are merged. The resulting rough segmentation works most of the time. Here is an example: For instance, in the example above, the NN would be trained, and tested on the following images: … The end-to-end success rate is 88.8% for segmentation, 95.9% for recognition (given correct segmentation), and (0.888)*(0.959)7 = 66.2% total. Note that most of the errors come from segmentation, even though this is where all the custom programming was invested. 3.2 Register The procedure to solve HIPs is very similar. The image was smoothed, binarized, and the largest 5 connected components were identified. Two examples are presented below: The end-to-end success rate is 95.4% for segmentation, 87.1% for recognition (given correct segmentation), and (0.954)*(0.871)5 = 47.8% total. 3.3 Y a h oo/ E Z - G i mp y Unlike the mailblocks and register HIPs, the Yahoo/EZ-Gimpy HIPs are richer in that a variety of backgrounds and clutter are possible. Though some amount of text warping is present, the text color, size, and font have low variability. Three simple segmentation algorithms were designed with associated rules to identify which algorithm to use. The goal was to keep these simple yet effective: a) No mesh: Convert to grayscale image, threshold to black and white, select large CCs with sizes close to HIP char sizes. One example: b) Black mesh: Convert to grayscale image, threshold to black and white, remove vertical and horizontal line pixels that don’t have neighboring pixels, select large CCs with sizes close to HIP char sizes. One example: c) White mesh: Convert to grayscale image, threshold to black and white, add black pixels (in white line locations) if there exist neighboring pixels, select large CCs with sizes close to HIP char sizes. One example: Tests for black and white meshes were performed to determine which segmentation algorithm to use. The end-to-end success rate was 56.2% for segmentation (38.2% came from a), 11.8% from b), and 6.2% from c), 90.3% for recognition (given correct segmentation), and (0.562)*(0.903)4.8 = 34.4% total. The average length of a Yahoo HIP solution is 4.8 characters. 3.4 T i c k e t ma s t e r The procedure that solved the Yahoo HIP is fairly successful at solving some of the ticket master HIPs. These HIPs are characterized by cris-crossing lines at random angles clustered around 0, 45, 90, and 135 degrees. A multipronged attack as in the Yahoo case (section 3.3) has potential. In the interests of simplicity, a single attack was developed: Convert to grayscale, threshold to black and white, up-sample image, dilate first then erode, select large CCs with sizes close to HIP char sizes. One example: The dilate-erode combination causes the lines to be removed (along with any thin objects) but retains solid thick characters. This single attack is successful in achieving an end-to-end success rate of 16.6% for segmentation, the recognition rate was 82.3% (in spite of interfering lines), and (0.166)*(0.823)6.23 = 4.9% total. The average HIP solution length is 6.23 characters. 3.5 Y a h oo ve r s i on 2 The second generation HIP from Yahoo had several changes: a) it did not use words from a dictionary or even use a phonetic generator, b) it uses only black and white colors, c) uses both letters and digits, and d) uses connected lines and arcs as clutter. The HIP is somewhat similar to the MSN/Passport HIP which does not use a dictionary, uses two colors, uses letters and digits, and background and foreground arcs as clutter. Unlike the MSN/Passport HIP, several different fonts are used. A single segmentation attack was developed: Remove 6 pixel border, up-sample, dilate first then erode, select large CCs with sizes close to HIP char sizes. The attack is practically identical to that used for the ticketmaster HIP with different preprocessing stages and slightly modified parameters. Two examples: This single attack is successful in achieving an end-to-end success rate of 58.4% for segmentation, the recognition rate was 95.2%, and (0.584)*(0.952)5 = 45.7% total. The average HIP solution length is 5 characters. 3.6 G oog l e / G M a i l The Google HIP is unique in that it uses only image warp as a means of distorting the characters. Similar to the MSN/Passport and Yahoo version 2 HIPs, it is also two color. The HIP characters are arranged closed to one another (they often touch) and follow a curved baseline. The following very simple attack was used to segment Google HIPs: Convert to grayscale, up-sample, threshold and separate connected components. a) b) This very simple attack gives an end-to-end success rate of 10.2% for segmentation, the recognition rate was 89.3%, giving (0.102)*(0.893)6.5 = 4.89% total probability of breaking a HIP. Average Google HIP solution length is 6.5 characters. This can be significantly improved upon by judicious use of dilate-erode attack. A direct application doesn’t do as well as it did on the ticketmaster and yahoo HIPs (because of the shear and warp of the baseline of the word). More successful and complicated attacks might estimate and counter the shear and warp of the baseline to achieve better success rates. 4 Lesso n s lea rn ed f ro m b rea ki n g H IPs From the previous section, it is clear that most of the errors come from incorrect segmentations, even though most of the development time is spent devising custom segmentation schemes. This observation raises the following questions: Why is segmentation a hard problem? Can we devise harder HIPs and datasets? Can we build an automatic segmentor? Can we compare classification algorithms based on how useful they are for segmentation? 4.1 T h e s e g me n t a t i on p r ob l e m As a review, segmentation is difficult for the following reasons: 1. Segmentation is computationally expensive. In order to find valid patterns, a recognizer must attempt recognition at many different candidate locations. 2. The segmentation function is complex. To segment successfully, the system must learn to identify which patterns are valid among the set of all possible valid and non-valid patterns. This task is intrinsically more difficult than classification because the space of input is considerably larger. Unlike the space of valid patterns, the space of non-valid patterns is typically too vast to sample. This is a problem for many learning algorithms which yield too many false positives when presented non-valid patterns. 3. Identifying valid characters among a set of valid and invalid candidates is a combinatorial problem. For example, correctly identifying which 8 characters among 20 candidates (assuming 12 false positives), has a 1 in 125,970 (20 choose 8) chances of success by random guessing. 4.2 B ui l d i n g b e t te r / h a r de r H I P s We can use what we have learned to build better HIPs. For instance the HIP below was designed to make segmentation difficult and a similar version has been deployed by MSN Passport for hotmail registrations (www.hotmail.com): The idea is that the additional arcs are themselves good candidates for false characters. The previous segmentation attacks would fail on this HIP. Furthermore, simple change of fonts, distortions, or arc types would require extensive work for the attacker to adjust to. We believe HIPs that emphasize the segmentation problem, such as the above example, are much stronger than the HIPs we examined in this paper, which rely on recognition being difficult. Pushing this to the extreme, we can easily generate the following HIPs: Despite the apparent difficulty of these HIPs, humans are surprisingly good at solving these, indicating that humans are far better than computers at segmentation. This approach of adding several competing false positives can in principle be used to automatically create difficult segmentation problems or benchmarks to test classification algorithms. 4.3 B ui l d i n g a n a ut o ma t i c s e g me n t or To build an automatic segmentor, we could use the following procedure. Label characters based on their correct position and train a recognizer. Apply the trained recognizer at all locations in the HIP image. Collect all candidate characters identified with high confidence by the recognizer. Compute the probability of each combination of candidates (going from left to right), and output the solution string with the highest probability. This is better illustrated with an example. Consider the following HIP (to the right). The trained neural network has these maps (warm colors indicate recognition) that show that K, Y, and so on are correctly identified. However, the maps for 7 and 9 show several false positives. In general, we would get the following color coded map for all the different candidates: HIP K Y B 7 9 With a threshold of 0.5 on the network’s outputs, the map obtained is: We note that there are several false positives for each true positive. The number of false positives per true positive character was found to be between 1 and 4, giving a 1 in C(16,8) = 12,870 to 1 in C(32,8) = 10,518,300 random chance of guessing the correct segmentation for the HIP characters. These numbers can be improved upon by constraining solution strings to flow sequentially from left to right and by restricting overlap. For each combination, we compute a probability by multiplying the 8 probabilities of the classifier for each position. The combination with the highest probability is the one proposed by the classifier. We do not have results for such an automatic segmentor at this time. It is interesting to note that with such a method a classifier that is robust to false positives would do far better than one that is not. This suggests another axis for comparing classifiers. 5 Con clu si on In this paper, we have successfully applied machine learning to the problem of solving HIPs. We have learned that decomposing the HIP problem into segmentation and recognition greatly simplifies analysis. Recognition on even unprocessed images (given segmentation is a solved) can be done automatically using neural networks. Segmentation, on the other hand, is the difficulty differentiator between weaker and stronger HIPs and requires custom intervention for each HIP. We have used this observation to design new HIPs and new tests for machine learning algorithms with the hope of improving them. A c k n ow l e d ge me n t s We would like to acknowledge Chau Luu and Eric Meltzer for their help with labeling and segmenting various HIPs. We would also like to acknowledge Josh Benaloh and Cem Paya for stimulating discussions on HIP security. References [1] Baird HS (1992), “Anatomy of a versatile page reader,” IEEE Pro., v.80, pp. 1059-1065. [2] Turing AM (1950), “Computing Machinery and Intelligence,” Mind, 59:236, pp. 433-460. [3] First Workshop on Human Interactive Proofs, Palo Alto, CA, January 2002. [4] Von Ahn L, Blum M, and Langford J, The Captcha Project. http://www.captcha.net [5] Baird HS and Popat K (2002) “Human Interactive Proofs and Document Image Analysis,” Proc. IAPR 2002 Workshop on Document Analysis Systerms, Princeton, NJ. [6] Simard PY, Steinkraus D, and Platt J, (2003) “Best Practice for Convolutional Neural Networks Applied to Visual Document Analysis,” in International Conference on Document Analysis and Recognition (ICDAR), pp. 958-962, IEEE Computer Society, Los Alamitos. [7] Mori G, Malik J (2003), “Recognizing Objects in Adversarial Clutter: Breaking a Visual CAPTCHA,” Proc. of the Computer Vision and Pattern Recognition (CVPR) Conference, IEEE Computer Society, vol.1, pages:I-134 - I-141, June 18-20, 2003 [8] Chew, M. and Baird, H. S. (2003), “BaffleText: a Human Interactive Proof,” Proc., 10th IS&T;/SPIE Document Recognition & Retrieval Conf., Santa Clara, CA, Jan. 22. [9] LeCun Y, Bottou L, Bengio Y, and Haffner P, “Gradient-based learning applied to document recognition,’ Proceedings of the IEEE, Nov. 1998.
2 0.8060084 136 nips-2004-On Semi-Supervised Classification
Author: Balaji Krishnapuram, David Williams, Ya Xue, Lawrence Carin, Mário Figueiredo, Alexander J. Hartemink
Abstract: A graph-based prior is proposed for parametric semi-supervised classification. The prior utilizes both labelled and unlabelled data; it also integrates features from multiple views of a given sample (e.g., multiple sensors), thus implementing a Bayesian form of co-training. An EM algorithm for training the classifier automatically adjusts the tradeoff between the contributions of: (a) the labelled data; (b) the unlabelled data; and (c) the co-training information. Active label query selection is performed using a mutual information based criterion that explicitly uses the unlabelled data and the co-training information. Encouraging results are presented on public benchmarks and on measured data from single and multiple sensors. 1
3 0.78204256 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning
Author: Liam Paninski
Abstract: Log-concavity is an important property in the context of optimization, Laplace approximation, and sampling; Bayesian methods based on Gaussian process priors have become quite popular recently for classification, regression, density estimation, and point process intensity estimation. Here we prove that the predictive densities corresponding to each of these applications are log-concave, given any observed data. We also prove that the likelihood is log-concave in the hyperparameters controlling the mean function of the Gaussian prior in the density and point process intensity estimation cases, and the mean, covariance, and observation noise parameters in the classification and regression cases; this result leads to a useful parameterization of these hyperparameters, indicating a suitably large class of priors for which the corresponding maximum a posteriori problem is log-concave.
4 0.74491704 49 nips-2004-Density Level Detection is Classification
Author: Ingo Steinwart, Don Hush, Clint Scovel
Abstract: We show that anomaly detection can be interpreted as a binary classification problem. Using this interpretation we propose a support vector machine (SVM) for anomaly detection. We then present some theoretical results which include consistency and learning rates. Finally, we experimentally compare our SVM with the standard one-class SVM. 1
5 0.73039269 29 nips-2004-Beat Tracking the Graphical Model Way
Author: Dustin Lang, Nando D. Freitas
Abstract: We present a graphical model for beat tracking in recorded music. Using a probabilistic graphical model allows us to incorporate local information and global smoothness constraints in a principled manner. We evaluate our model on a set of varied and difficult examples, and achieve impressive results. By using a fast dual-tree algorithm for graphical model inference, our system runs in less time than the duration of the music being processed. 1
6 0.51062483 69 nips-2004-Fast Rates to Bayes for Kernel Machines
7 0.50808799 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
8 0.4705193 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms
9 0.47040722 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms
10 0.46713844 168 nips-2004-Semigroup Kernels on Finite Sets
11 0.46494675 158 nips-2004-Sampling Methods for Unsupervised Learning
12 0.46334469 103 nips-2004-Limits of Spectral Clustering
13 0.46076623 65 nips-2004-Exploration-Exploitation Tradeoffs for Experts Algorithms in Reactive Environments
14 0.4587087 201 nips-2004-Using the Equivalent Kernel to Understand Gaussian Process Regression
15 0.45602837 84 nips-2004-Inference, Attention, and Decision in a Bayesian Neural Architecture
16 0.45550588 154 nips-2004-Resolving Perceptual Aliasing In The Presence Of Noisy Sensors
17 0.45185202 156 nips-2004-Result Analysis of the NIPS 2003 Feature Selection Challenge
18 0.45026565 175 nips-2004-Stable adaptive control with online learning
19 0.44190672 165 nips-2004-Semi-supervised Learning on Directed Graphs
20 0.4371601 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods