nips-2008-41: Breaking Audio CAPTCHAs
Author: Jennifer Tam, Jiri Simsa, Sean Hyde, Luis von Ahn
Abstract: CAPTCHAs are computer-generated tests that humans can pass but current computer systems cannot. CAPTCHAs provide a method for automatically distinguishing a human from a computer program, and therefore can protect Web services from abuse by so-called "bots." Most CAPTCHAs consist of distorted images, usually text, for which a user must provide some description. Unfortunately, visual CAPTCHAs limit access to the millions of visually impaired people using the Web. Audio CAPTCHAs were created to solve this accessibility issue; however, the security of audio CAPTCHAs was never formally tested. Some visual CAPTCHAs have been broken using machine learning techniques, and we propose using similar ideas to test the security of audio CAPTCHAs. Audio CAPTCHAs are generally composed of a set of words to be identified, layered on top of noise. We analyzed the security of current audio CAPTCHAs from popular Web sites by using AdaBoost, SVM, and k-NN, and achieved correct solutions for test samples with accuracy up to 71%. Such accuracy is enough to consider these CAPTCHAs broken. Training several different machine learning algorithms on different types of audio CAPTCHAs allowed us to analyze the strengths and weaknesses of the algorithms so that we could suggest a design for a more robust audio CAPTCHA.

1 Introduction

CAPTCHAs [1] are automated tests designed to tell computers and humans apart by presenting users with a problem that humans can solve but current computer programs cannot. Because CAPTCHAs can distinguish between humans and computers with high probability, they are used for many different security applications: they prevent bots from voting continuously in online polls, automatically registering for millions of spam email accounts, automatically purchasing tickets to buy out an event, etc. Once a CAPTCHA is broken (i.e., computer programs can successfully pass the test), bots can impersonate humans and gain access to services that they should not. Therefore, it is important for CAPTCHAs to be secure.

To pass the typical visual CAPTCHA, a user must correctly type the characters displayed in an image of distorted text. Many visual CAPTCHAs have been broken with machine learning techniques [2]-[3], though some remain secure against such attacks. Because visually impaired users who surf the Web using screen-reading programs cannot see this type of CAPTCHA, audio CAPTCHAs were created. Typical audio CAPTCHAs consist of one or several speakers saying letters or digits at randomly spaced intervals. A user must correctly identify the digits or characters spoken in the audio file to pass the CAPTCHA. To make this test difficult for current computer systems, specifically automatic speech recognition (ASR) programs, background noise is injected into the audio files.

Since no official evaluation of existing audio CAPTCHAs has been reported, we tested the security of audio CAPTCHAs used by many popular Web sites by running machine learning experiments designed to break them. In the next section, we provide an overview of the literature related to our project. Section 3 describes our methods for creating training data, and section 4 describes how we create classifiers that can recognize letters, digits, and noise. In section 5, we discuss how we evaluated our methods on widely used audio CAPTCHAs and we give our results. In particular, we show that the audio CAPTCHAs used by sites such as Google and Digg are susceptible to machine learning attacks.
Section 6 describes the proposed design of a new, more secure audio CAPTCHA based on our findings.

2 Literature review

To break the audio CAPTCHAs, we derive features from the CAPTCHA audio and use several machine learning techniques to perform ASR on segments of the CAPTCHA. There are many popular techniques for extracting features from speech. The three techniques we use are mel-frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP), and relative spectral transform-PLP (RASTA-PLP). MFCC is one of the most popular speech feature representations used. Similar to a fast Fourier transform (FFT), MFCC transforms an audio file into frequency bands, but (unlike FFT) MFCC uses mel-frequency bands, which are better for approximating the range of frequencies humans hear. PLP was designed to extract speaker-independent features from speech [4]. Therefore, by using PLP and a variant such as RASTA-PLP, we were able to train our classifiers to recognize letters and digits independently of who spoke them. Since many different people recorded the digits used in one of the types of audio CAPTCHAs we tested, PLP and RASTA-PLP were needed to extract the features that were most useful for solving them.

In [4]-[5], the authors conducted experiments on recognizing isolated digits in the presence of noise using both PLP and RASTA-PLP. However, the noise used consisted of telephone or microphone static caused by recording in different locations. The audio CAPTCHAs we use contain this type of noise, as well as added vocal noise and/or music, which is supposed to make the automated recognition process much harder.

The authors of [3] emphasize how many visual CAPTCHAs can be broken by successfully splitting the task into two smaller tasks: segmentation and recognition. We follow a similar approach in that we first automatically split the audio into segments, and then we classify these segments as noise or words.

In early March 2008, concurrent to our work, the blog of Wintercore Labs [6] claimed to have successfully broken the Google audio CAPTCHA. After reading their Web article and viewing the video of how they solve the CAPTCHAs, we are unconvinced that the process is entirely automatic, and it is unclear what their exact pass rate is. Because we are unable to find any formal technical analysis of this program, we can neither be sure of its accuracy nor the extent of its automation.
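As a concrete illustration of the first of these feature representations, the sketch below computes twelve MFCCs for one fixed-size audio segment. It is only a minimal sketch: it uses the open-source Python package librosa rather than the Matlab tools the authors reference in section 4, and the sample rate and the decision to flatten the per-frame coefficients are illustrative assumptions.

    import numpy as np
    import librosa  # open-source audio feature library (assumption: not the authors' toolchain)

    def mfcc_features(segment, sample_rate=8000, n_mfcc=12):
        """Turn one fixed-size audio segment into a fixed-length MFCC feature vector.

        Because every segment contains the same number of samples, the number of
        analysis frames is constant, so flattening the coefficient matrix yields
        feature vectors of equal length across segments.
        """
        segment = np.asarray(segment, dtype=float)
        coeffs = librosa.feature.mfcc(y=segment, sr=sample_rate, n_mfcc=n_mfcc)
        return coeffs.flatten()  # shape: (n_mfcc * n_frames,)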
3 Creation of training data

Since automated programs can attempt to pass a CAPTCHA repeatedly, a CAPTCHA is essentially broken when a program can pass it a non-trivial fraction of the time; e.g., a 5% pass rate is enough. Our approach to breaking the audio CAPTCHAs began by first splitting the audio files into segments of noise or words; for our experiments, the words were spoken letters or digits. We used manual transcriptions of the audio CAPTCHAs to get information regarding the location of each spoken word within the audio file, which allowed us to label our segments accurately. We gathered 1,000 audio CAPTCHAs from each of the following Web sites: google.com, digg.com, and an older version of the audio CAPTCHA on recaptcha.net. Each of the CAPTCHAs was annotated with the letter/digit location information provided by the manual transcriptions. For each type of CAPTCHA, we randomly selected 900 samples for training and used the remaining 100 for testing.

Using the digit/letter location information provided in the manual CAPTCHA transcriptions, each training CAPTCHA is divided into segments of noise, the letters a-z, or the digits 0-9, and labeled as such. We ignore the annotation information of the CAPTCHAs we use for testing, and therefore we cannot identify the size of those segments. Instead, each test CAPTCHA is divided into a number of fixed-size segments, and the segments with the highest energy peaks are then classified using machine learning techniques (Figure 1). Since the size of a feature vector extracted from a segment generally depends on the size of the segment, using fixed-size segments allows each segment to be described with a feature vector of the same length. We chose the window size by listening to a few training segments and adjusting it to ensure that a segment contained an entire digit or letter. There are undoubtedly better ways of selecting the window size; however, we were still able to break the three CAPTCHAs we tested with our method.

Figure 1: A test audio CAPTCHA with the fixed-size segments containing the highest energy peaks highlighted.

The information provided in the manual transcriptions of the audio CAPTCHAs contains a list of the time intervals within which words are spoken. However, these intervals are of variable size, and the word might be spoken anywhere within the interval. To provide fixed-size segments for training, we developed the following heuristic. First, divide each file into variable-size segments using the time intervals provided and label each segment accordingly. Then, within each segment, detect the highest energy peak and return its fixed-size neighborhood labeled with the current segment's label. This heuristic achieved nearly perfect labeling accuracy for the training set. Rare mistakes occurred when the highest energy peak of a digit or letter segment corresponded to noise rather than to the digit or letter.

To summarize this section, an audio file is transformed into a set of fixed-size segments labeled as noise, a digit between 0 and 9, or a letter between a and z. These segments are then used for training. Classifiers are trained for one type of CAPTCHA at a time.
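The labeling heuristic just described can be summarized in a few lines. The sketch below is an illustrative rendering, not the authors' code: it assumes the audio is a numpy array of samples, that the transcription supplies (start, end, label) intervals in sample units, and that the sample of maximum absolute amplitude is an adequate stand-in for the "highest energy peak."

    import numpy as np

    def fixed_size_training_segments(audio, intervals, window):
        """Return (segment, label) pairs, each segment exactly `window` samples long.

        `intervals` is a list of (start, end, label) tuples taken from the manual
        transcription; each variable-size interval is reduced to the fixed-size
        neighborhood of its highest-energy sample.
        """
        segments = []
        for start, end, label in intervals:
            chunk = audio[start:end]
            peak = start + int(np.argmax(np.abs(chunk)))  # loudest sample in the interval
            lo = max(0, peak - window // 2)
            hi = lo + window
            if hi > len(audio):          # keep the window inside the file
                hi = len(audio)
                lo = hi - window
            segments.append((audio[lo:hi], label))
        return segments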
4 Classifier construction

From the training data, we extracted five sets of features using twelve MFCCs and twelfth-order spectral (SPEC) and cepstral (CEPS) coefficients from PLP and RASTA-PLP. The Matlab functions for extracting these features were provided online at [7] and as part of the Voicebox package. We use AdaBoost, SVM, and k-NN algorithms to implement automated digit and letter recognition. We detail our implementation of each algorithm in the following subsections.

4.1 AdaBoost

Using decision stumps as weak classifiers for AdaBoost, anywhere from 11 to 37 ensemble classifiers are built; the number of classifiers depends on which type of CAPTCHA we are solving. Each classifier trains on all the segments associated with that type of CAPTCHA, and for the purpose of building a single classifier, segments are labeled as either -1 (negative example) or +1 (positive example). Using cross-validation, we chose to run AdaBoost for 50 iterations. A segment can then be classified as a particular letter, digit, or noise according to the ensemble classifier that outputs the number closest to 1.
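The following sketch shows one way to realize this one-versus-rest ensemble. It substitutes scikit-learn's AdaBoostClassifier (whose default weak learner is a decision stump) for the authors' implementation, and it uses the classifier's decision_function score in place of "the number closest to 1"; both substitutions are assumptions made for illustration.

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier

    def train_one_vs_rest(X, y, n_estimators=50):
        """Train one binary AdaBoost ensemble per label (each digit, letter, or noise)."""
        ensembles = {}
        for label in np.unique(y):
            target = np.where(y == label, 1, -1)  # +1 for this label, -1 for everything else
            clf = AdaBoostClassifier(n_estimators=n_estimators)  # decision stumps by default
            clf.fit(X, target)
            ensembles[label] = clf
        return ensembles

    def classify_segment(ensembles, x):
        """Assign the label whose ensemble is most confident that x is a positive example."""
        scores = {label: clf.decision_function(x.reshape(1, -1))[0]
                  for label, clf in ensembles.items()}
        return max(scores, key=scores.get)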
4.2 Support vector machine

To conduct digit recognition with SVM, we used the C++ implementation of libSVM [8], version 2.85, with C-SVC and an RBF kernel. First, all feature values are scaled to the range of -1 to 1, as suggested by [8]. The scale parameters are stored so that test samples can be scaled accordingly. Then, a single multiclass classifier is created for each set of features using all the segments for a particular type of CAPTCHA. We use cross-validation and grid search to discover the optimal slack penalty (C=32) and kernel parameter (γ=0.011).

4.3 k-nearest neighbor (k-NN)

We use k-NN as our final method for classifying digits. For each type of CAPTCHA, five different classifiers are created by using all of the training data and the five sets of features associated with that particular type of CAPTCHA. Again we use cross-validation to discover the optimal parameter, in this case k=1. We use Euclidean distance as our distance metric.

5 Assessment of current audio CAPTCHAs

Our method for solving CAPTCHAs iteratively extracts an audio segment from a CAPTCHA, inputs the segment to one of our digit or letter recognizers, and outputs the label for that segment. We continue this process until the maximum solution size is reached or there are no unlabeled segments left. Some of the CAPTCHAs we evaluated have solutions that vary in length; our method ensures that we get solutions of varying length that are never longer than the maximum solution length. A segment to be classified is identified by taking the neighborhood of the highest energy peak of an as yet unlabeled part of the CAPTCHA. Once a prediction of the solution to the CAPTCHA is computed, it is compared to the true solution. Given that at least one of the audio CAPTCHAs allows users to make a mistake in one of the digits (e.g., reCAPTCHA), we compute the pass rate for each of the different types of CAPTCHAs under the following "one mistake" passing conditions, any one of which counts as a pass:

• The prediction matches the true solution exactly.
• Inserting one digit into the prediction would make it match the solution exactly.
• Replacing one digit in the prediction would make it match the solution exactly.
• Removing one digit from the prediction would make it match the solution exactly.

However, since we are only sure that these conditions apply to reCAPTCHA audio CAPTCHAs, we also calculate the percentage of exact solution matches in our results for each type of audio CAPTCHA. These results are described in the following subsections.
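Before turning to the individual results, note that the "one mistake" conditions above are equivalent to accepting any prediction within edit (Levenshtein) distance one of the true solution. The check below is a plain illustration of that test, not the authors' evaluation code; for example, it accepts a prediction that drops the last digit of an eight-digit solution but rejects one with two wrong digits.

    def passes_one_mistake(prediction, truth):
        """True if prediction equals truth, or a single insertion, substitution,
        or deletion would make the two strings equal."""
        if prediction == truth:
            return True
        if len(prediction) == len(truth):
            # One substitution: exactly one differing position.
            return sum(a != b for a, b in zip(prediction, truth)) == 1
        if abs(len(prediction) - len(truth)) == 1:
            # One insertion or removal: deleting one character from the longer
            # string must reproduce the shorter one.
            longer = max(prediction, truth, key=len)
            shorter = min(prediction, truth, key=len)
            return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))
        return False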
5.1 Google

Google audio CAPTCHAs consist of one speaker saying random digits 0-9, the phrase "once again," followed by the exact same recorded sequence of digits originally presented. The background noise consists of human voices speaking backwards at varying volumes. A solution can range in length from five to eight words. We set our classifier to find the 12 loudest segments and classify these segments as digits or noise. Because the phrase "once again" marks the halfway point of the CAPTCHA, we preprocessed the audio to serve only one half of the CAPTCHA to our classifiers. It is important to note, however, that the classifiers were always able to identify the segment containing "once again," and these segments were identified before all other segments. Therefore, if necessary, we could have had our system cut the file in half after first labeling this segment.

For AdaBoost, we create 12 classifiers: one classifier for each digit, one for noise, and one for the phrase "once again." Our results (Table 1) show that at best we achieved a 90% pass rate using the "one mistake" passing conditions and a 66% exact solution match rate. Using SVM and the "one mistake" passing conditions, at best we achieve a 92% pass rate and a 67% exact solution match. For k-NN, the best "one mistake" pass rate is 62% and the best exact solution match rate is 26%.

Table 1: Google audio CAPTCHA results. The maximum exact-match accuracy of 67% was achieved by SVM. Each cell gives the "one mistake" pass rate followed by the exact match rate.

Features Used    AdaBoost       SVM            k-NN
MFCC             88% / 61%      92% / 67%      30% / 1%
PLPSPEC          90% / 66%      90% / 67%      60% / 26%
PLPCEPS          90% / 66%      92% / 67%      62% / 23%
RASTAPLPSPEC     88% / 48%      90% / 61%      29% / 1%
RASTAPLPCEPS     90% / 63%      92% / 67%      33% / 2%

5.2 Digg

Digg CAPTCHAs also consist of one speaker, in this case saying a random combination of letters and digits. The background noise consists of static or what sounds like trickling water and is not continuous throughout the entire file. We noticed in our training data that the following characters were never present in a solution: 0, 1, 2, 5, 7, 9, i, o, z. Since the Digg audio CAPTCHA is also the verbal transcription of the visual CAPTCHA, we believe that these characters are excluded to avoid confusion between digits and letters that are similar in appearance. The solution length varies between three and six words.

Using AdaBoost, we create 28 classifiers: one classifier for each digit or letter that appears in our training data and one classifier for noise. Perhaps because we had fewer segments to train with and there was a far higher proportion of noise segments, AdaBoost failed to produce any correct solutions. We believe that the overwhelming number of negative training examples versus the small number of positive training samples used to create each decision stump severely affected AdaBoost's ability to classify audio segments correctly. A histogram of the training samples is provided in Figure 2 to illustrate the amount of training data available for each character. When using SVM, the best feature set passed with 96% using the "one mistake" passing conditions and passed with 71% when matching the solution exactly. For k-NN, the best feature set produced a 90% "one mistake" pass rate and a 49% exact solution match. Full results can be found in Table 2.

Table 2: Digg audio CAPTCHA results. The maximum exact-match accuracy of 71% was achieved by SVM. Each cell gives the "one mistake" pass rate followed by the exact match rate; AdaBoost produced no correct solutions.

Features Used    AdaBoost    SVM            k-NN
MFCC             - / -       96% / 71%      89% / 49%
PLPSPEC          - / -       94% / 65%      90% / 47%
PLPCEPS          - / -       96% / 71%      64% / 17%
RASTAPLPSPEC     - / -       17% / 3%       67% / 17%
RASTAPLPCEPS     - / -       96% / 71%      82% / 34%

Figure 2: Digg CAPTCHA training data distribution (number of training segments for noise and for each digit and letter that appears in solutions).

5.3 reCAPTCHA

The older version of reCAPTCHA's audio CAPTCHAs we tested consists of several speakers who speak random digits. The background noise consists of human voices speaking backwards at varying volumes. The solution is always eight digits long. For AdaBoost, we create 11 classifiers: one classifier for each digit and one classifier for noise. Because we know that the reCAPTCHA passing conditions are the "one mistake" passing conditions, SVM produces our best pass rate of 58%. Our best exact match rate is 45% (Table 3).

Table 3: reCAPTCHA audio CAPTCHA results. The maximum exact-match accuracy of 45% was achieved by SVM. Each cell gives the "one mistake" pass rate followed by the exact match rate.

Features Used    AdaBoost      SVM           k-NN
MFCC             18% / 6%      56% / 43%     22% / 11%
PLPSPEC          27% / 10%     58% / 39%     43% / 25%
PLPCEPS          23% / 10%     56% / 45%     29% / 14%
RASTAPLPSPEC     9% / 3%       36% / 18%     24% / 4%
RASTAPLPCEPS     9% / 3%       46% / 30%     32% / 12%

6 Properties of weak versus strong CAPTCHAs

From our results, we note that the easiest CAPTCHAs to break were from Digg. Google had the next strongest CAPTCHAs, followed by the strongest from reCAPTCHA. Although the Digg CAPTCHAs have the largest vocabulary, giving us less training data per label, the same woman recorded them all. More importantly, the same type of noise is used throughout the entire CAPTCHA. The noise sounds like running water and static, which sounds very different from the human voice and does not produce the same energy spikes needed to locate segments, therefore making segmentation quite easy.

The CAPTCHAs from Google and reCAPTCHA used other human voices for background noise, making segmentation much more difficult. Although Google used a smaller vocabulary than Digg and also only used one speaker, Google's background noise made the CAPTCHA more difficult to solve. After listening to a few of Google's CAPTCHAs, we noticed that although the background noise consisted of human voices, the same background noise was repeated. reCAPTCHA had noise similar to Google's, but drawn from a larger selection, making it harder to learn. reCAPTCHA also has the longest solution length, making it more difficult to get perfectly correct. Finally, reCAPTCHA used many different speakers, making it the strongest CAPTCHA of the three we tested. In conclusion, an audio CAPTCHA that consists of a finite vocabulary and background noise should have multiple speakers and noise similar to the speakers.

7 Recommendations for creating stronger audio CAPTCHAs

Due to our success in solving audio CAPTCHAs, we have decided to start developing new audio CAPTCHAs that our methods, and machine learning methods in general, will be less likely to solve. From our experiments, we note that CAPTCHAs containing longer solutions and multiple speakers tend to be more difficult to solve. Also, because our methods depend on the amount of training data we have, a large vocabulary makes it more difficult to collect enough training data. Already since obtaining these results, reCAPTCHA.net has updated their audio CAPTCHA to contain more distortions and a larger vocabulary: the digits 0 through 99.

In designing a new audio CAPTCHA, we are also concerned with the human pass rate. The current human pass rate for the reCAPTCHA audio CAPTCHAs is only 70%. To develop an audio CAPTCHA with an improved human pass rate, we plan to take advantage of the human mind's ability to understand distorted audio through context clues. By listening to a phrase instead of random isolated words, humans are better able to decipher distorted utterances because they are familiar with the phrase or can use contextual clues to decipher the distorted audio. Using this idea, the audio for our new audio CAPTCHA will be taken from old-time radio programs in which the poor quality of the audio makes transcription by ASR systems difficult. Users will be presented with an audio clip consisting of a 4-6 word phrase. Half of the CAPTCHA consists of words whose transcription is known, which validate a user as human, while the other half of the words still need to be transcribed.
This is the same idea behind the visual reCAPTCHA that is currently digitizing text on which OCR fails. We expect that this new audio CAPTCHA will be more secure than the current version and easier for humans to pass. Initial experiments using this idea show this to be true [9].

8 Conclusion

We have succeeded in "breaking" three different types of widely used audio CAPTCHAs, even though these were developed with the purpose of defeating attacks by machine learning techniques. We believe our results can be improved by selecting optimal segment sizes, but that is unnecessary given our already high success rate. For our experiments, segment sizes were not chosen in a special way, which occasionally yielded results in which a segment contained only half of a word, causing our prediction to contain that particular word twice. We also believe that the AdaBoost results can be improved, particularly for the Digg audio CAPTCHAs, by ensuring that the number of negative training samples is closer to the number of positive training samples. We have shown that our approach is successful and can be used with many different audio CAPTCHAs that contain small finite vocabularies.

Acknowledgments

This work was partially supported by generous gifts from the Heinz Endowment, by an equipment grant from Intel Corporation, and by the Army Research Office through grant number DAAD19-02-1-0389 to CyLab at Carnegie Mellon University. Luis von Ahn was partially supported by a Microsoft Research New Faculty Fellowship and a MacArthur Fellowship. Jennifer Tam was partially supported by a Google Anita Borg Scholarship.

References

[1] L. von Ahn, M. Blum, and J. Langford. "Telling Humans and Computers Apart Automatically," Communications of the ACM, vol. 47, no. 2, pp. 57-60, Feb. 2004.
[2] G. Mori and J. Malik. "Recognizing Objects in Adversarial Clutter: Breaking a Visual CAPTCHA," in Computer Vision and Pattern Recognition (CVPR'03), June 2003.
[3] K. Chellapilla and P. Simard. "Using Machine Learning to Break Visual Human Interaction Proofs (HIPs)," Advances in Neural Information Processing Systems 17 (NIPS 2004), MIT Press.
[4] H. Hermansky. "Perceptual Linear Predictive (PLP) Analysis of Speech," J. Acoust. Soc. Am., vol. 87, no. 4, pp. 1738-1752, Apr. 1990.
[5] H. Hermansky, N. Morgan, A. Bayya, and P. Kohn. "RASTA-PLP Speech Analysis Technique," in Proc. IEEE Int'l Conf. Acoustics, Speech & Signal Processing, vol. 1, pp. 121-124, San Francisco, 1992.
[6] R. Santamarta. "Breaking Gmail's Audio Captcha," http://blog.wintercore.com/?p=11, 2008.
[7] D. Ellis. "PLP and RASTA (and MFCC, and inversion) in Matlab using melfcc.m and invmelfcc.m," http://www.ee.columbia.edu/~dpwe/resources/matlab/rastamat/, 2006.
[8] C. Chang and C. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
[9] A. Schlaikjer. "A Dual-Use Speech CAPTCHA: Aiding Visually Impaired Web Users while Providing Transcriptions of Audio Streams," Technical Report CMU-LTI-07-014, Carnegie Mellon University, November 2007.
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract CAP TCHAs are computer-generated tests that humans can pass but current computer systems cannot. [sent-10, score-0.124]
2 Audio CAP TCHAs were created to solve this accessibility issue; however, the security of audio CAP TCHAs was never formally tested. [sent-14, score-0.397]
3 Some visual CAP TCHAs have been broken using machine learning techniques, and we propose using similar ideas to test the security of audio CAP TCHAs. [sent-15, score-0.462]
4 We analyzed the security of current audio CAP TCHAs from popular Web sites by using AdaBoost, SVM, and k-NN, and achieved correct solutions for test samples with accuracy up to 71%. [sent-17, score-0.418]
5 Training several different machine learning algorithms on different types of audio CAP TCHAs allowed us to analyze the strengths and weaknesses of the algorithms so that we could suggest a design for a more robust audio CAPTCHA. [sent-19, score-0.72]
6 1 Int rod uct i o n CAP TCHAs [1] are automated tests designed to tell computers and humans apart by presenting users with a problem that humans can solve but current computer programs cannot. [sent-20, score-0.187]
7 , computer programs can successfully pass the test), bots can impersonate humans and gain access to services that they should not. [sent-24, score-0.176]
8 To pass the typical visual CAP TCHA, a user must correctly type the characters displayed in an image of distorted text. [sent-26, score-0.191]
9 Because visually impaired users who surf the Web using screen-reading programs cannot see this type of CAPTCHA, audio CAP TCHAs were created. [sent-28, score-0.463]
10 Typical audio CAP TCHAs consist of one or several speakers saying letters or digits at randomly spaced intervals. [sent-29, score-0.541]
11 A user must correctly identify the digits or characters spoken in the audio file to pass the CAP TCHA. [sent-30, score-0.619]
12 To make this test difficult for current computer systems, specifically automatic speech recognition (ASR) programs, background noise is injected into the audio files. [sent-31, score-0.501]
13 Since no official evaluation of existing audio CAP TCHAs has been reported, we tested the security of audio CAP TCHAs used by many popular Web sites by running machine learning experiments designed to break them. [sent-32, score-0.804]
14 Section 3 describes our methods for creating training data, and section 4 describes how we create classifiers that can recognize letters, digits, and noise. [sent-34, score-0.168]
15 In section 5, we discuss how we evaluated our methods on widely used audio CAP TCHAs and we give our results. [sent-35, score-0.36]
16 In particular, we show that the audio CAP TCHAs used by sites such as Google and Digg are susceptible to machine learning attacks. [sent-36, score-0.381]
17 Section 6 mentions the proposed design of a new more secure audio CAP TCHA based on our findings. [sent-37, score-0.392]
18 2 Lit erat u r e r ev i ew To break the audio CAP TCHAs, we derive features from the CAP TCHA audio and use several machine learning techniques to perform ASR on segments of the CAPTCHA. [sent-38, score-0.867]
19 Similar to a fast Fourier transform (FF T), MFCC transforms an audio file into frequency bands, but (unlike FF T) MFCC uses mel-frequency bands, which are better for approximating the range of frequencies humans hear. [sent-42, score-0.457]
20 Therefore, by using PLP and a variant such as RAS TA-PL P, we were able to train our classifiers to recognize letters and digits independently of who spoke them. [sent-44, score-0.234]
21 Since many different people recorded the digits used in one of the types of audio CAP TCHAs we tested, PLP and RAS TA-PLP were needed to extract the features that were most useful for solving them. [sent-45, score-0.433]
22 In [4]-[5], the authors conducted experiments on recognizing isolated digits in the presence of noise using both PLP and RAS TA-PL P. [sent-46, score-0.117]
23 The audio CAP TCHAs we use contain this type of noise, as well as added vocal noise and/or music, which is supposed to make the automated recognition process much harder. [sent-48, score-0.448]
24 We follow a similar approach in that we first automatically split the audio into segments, and then we classify these segments as noise or words. [sent-50, score-0.525]
25 In early March 2008, concurrent to our work, the blog of Wintercore Labs [6] claimed to have successfully broken the Google audio CAP TCHA. [sent-51, score-0.401]
26 After reading their Web article and viewing the video of how they solve the CAP TCHAs, we are unconvinced that the process is entirely automatic, and it is unclear what their exact pass rate is. [sent-52, score-0.131]
27 3 Cr e at i o n of tra i n i n g dat a Since automated programs can attempt to pass a CAPTCHA repeatedly, a CAPTCHA is essentially broken when a program can pass it more than a non-trivial fraction of the time; e. [sent-54, score-0.253]
28 Our approach to breaking the audio CAP TCHAs began by first splitting the audio files into segments of noise or words: for our experiments, the words were spoken letters or digits. [sent-57, score-0.969]
29 We used manual transcriptions of the audio CAP TCHAs to get information regarding the location of each spoken word within the audio file. [sent-58, score-0.824]
30 We were able to label our segments accurately by using this information. [sent-59, score-0.121]
31 We gathered 1,000 audio CAP TCHAs from each of the following Web sites: google. [sent-60, score-0.36]
32 com, and an older version of the audio CAP TCHA in recaptcha. [sent-62, score-0.36]
33 Using the digit/letter location information provided in the manual CAP TCHA transcriptions, each training CAP TCHA is divided into segments of noise, the letters a-z, or the digits 0-9, and labeled as such. [sent-66, score-0.267]
34 The segments with the highest energy peaks are then classified using machine learning techniques (Figure 1). [sent-69, score-0.184]
35 Since the size of a feature vector extracted from a segment generally depends on the size of the segment, using fixed-size segments allows each segment to be described with a feature vector of the same length. [sent-70, score-0.279]
36 We chose the window size by listening to a few training segments and adjusted accordingly to ensure that the segment contained the entire digit/letter. [sent-71, score-0.248]
37 Figure 1: A test audio CAP TCHA with the fixed-size segments containing the highest energy peaks highlighted. [sent-73, score-0.519]
38 The information provided in the manual transcriptions of the audio CAP TCHAs contains a list of the time intervals within which words are spoken. [sent-74, score-0.418]
39 To provide fixedsize segments for training, we developed the following heuristic. [sent-76, score-0.121]
40 First, divide each file into variable-size segments using the time intervals provided and label each segment accordingly. [sent-77, score-0.253]
41 Rare mistakes occurred when the highest energy peak of a digit or letter segment corresponded to noise rather than to a digit or letter. [sent-80, score-0.347]
42 To summarize this subsection, an audio file is transformed into a set of fixed-size segments labeled as noise, a digit between 0 and 9, or a letter between a and z. [sent-81, score-0.636]
43 We use AdaBoost, SVM, and k-NN algorithms digit and letter recognition. [sent-85, score-0.102]
44 The at [7] and as part of the to implement automated each algorithm in the AdaBoost Using decision stumps as weak classifiers for AdaBoost, anywhere from 11 to 37 ensemble classifiers are built. [sent-89, score-0.279]
45 The number of classifiers built depends on which type of CAPTCHA we are solving. [sent-90, score-0.152]
46 Each classifier trains on all the segments associated with that type of CAP TCHA, and for the purpose of building a single classifier, segments are labeled by either -1 (negative example) or +1 (positive example). [sent-91, score-0.344]
47 A segment can then be classified as a particular letter, digit, or noise according to the ensemble classifier that outputs the number closest to 1. [sent-93, score-0.227]
48 Then, a single multiclass classifier is created for each set of features using all the segments for a particular type of CAPTCHA. [sent-99, score-0.223]
49 For each type of CAP TCHA, five different classifiers are created by using all of the training data and the five sets of features associated with that particular type of CAP TCHA. [sent-104, score-0.263]
50 5 Ass e s sm e n t of cu rre n t a ud i o CAPTCHAs Our method for solving CAP TCHAs iteratively extracts an audio segment from a CAP TCHA, inputs the segment to one of our digit or letter recognizers, and outputs the label for that segment. [sent-107, score-0.62]
51 We continue this process until the maximum solution size is reached or there are no unlabeled segments left. [sent-108, score-0.141]
52 A segment to be classified is identified by taking the neighborhood of the highest energy peak of an as yet unlabeled part of the CAP TCHA. [sent-111, score-0.183]
53 Given that at least one of the audio CAP TCHAs allows users to make a mistake in one of the digits (e. [sent-113, score-0.569]
54 , reCAPTCHA), we compute the pass rate for each of the different types of CAPTCHAs with all of the following conditions: • The prediction matches the true solution exactly. [sent-115, score-0.117]
55 • Inserting one digit into the prediction would make it match the solution exactly. [sent-116, score-0.136]
56 • Replacing one digit in the prediction would make it match the solution exactly. [sent-117, score-0.136]
57 • Removing one digit from the prediction would make it match the solution exactly. [sent-118, score-0.136]
58 However, since we are only sure that these conditions apply to reCAPTCHA audio CAP TCHAs, we also calculate the percentage of exact solution matches in our results for each type of audio CAP TCHA. [sent-119, score-0.797]
59 1 Goog le Google audio CAP TCHAs consist of one speaker saying random digits 0-9, the phrase “once again,” followed by the exact same recorded sequence of digits originally presented. [sent-122, score-0.631]
60 The background noise consists of human voices speaking backwards at varying volumes. [sent-123, score-0.141]
61 We set our classifier to find the 12 loudest segments and classify these segments as digits or noise. [sent-125, score-0.394]
62 Because the phrase “once again” marks the halfway point of the CAPTCHA, we preprocessed the audio to only serve this half of the CAP TCHA to our classifiers. [sent-126, score-0.409]
63 It is important to note, however, that the classifiers were always able to identify the segment containing “once again,” and these segments were identified before all other segments. [sent-127, score-0.354]
64 For AdaBoost, we create 12 classifiers: one classifier for each digit, one for noise, and one for the phrase “once again. [sent-129, score-0.13]
65 ” Our results ( Table 1) show that at best we achieved a 90% pass rate using the “one mistake” passing conditions and a 66% exact solution match rate. [sent-130, score-0.224]
66 Using SVM and the “one mistake” passing conditions, at best we achieve a 92% pass rate and a 67% exact solution match. [sent-131, score-0.176]
67 For k-NN, the “one mistake” pass rate is 62% and the exact solution match rate is 26%. [sent-132, score-0.216]
68 Table 1: Google audio CAP TCHA results: Maximum 67% accuracy was achieved by SVM. [sent-133, score-0.36]
69 Classifiers Used AdaBoost SVM k-NN one mistake exact match one mistake exact match 88% 61% 92% 67% 30% 1% PLPSPEC 90% 66% 90% 67% 60% 26% PLPCEPS 90% 66% 92% 67% 62% 23% RAS TAPLPSPEC 88% 48% 90% 61% 29% 1% RAS TAPLPCEPS 5 . [sent-134, score-0.382]
70 2 exact match MFCC Features Used One mistake 90% 63% 92% 67% 33% 2% Digg Digg CAP TCHAs also consist of one speaker, in this case saying a random combination of letters and digits. [sent-135, score-0.263]
71 The background noise consists of static or what sounds like trickling water and is not continuous throughout the entire file. [sent-136, score-0.111]
72 Since the Digg audio CAPTCHA is also the verbal transcription of the visual CAP TCHA, we believe that these characters are excluded to avoid confusion between digits and letters that are similar in appearance. [sent-138, score-0.514]
73 Using AdaBoost, we create 28 classifiers: one classifier for each digit or letter that appears in our training data and one classifier for noise. [sent-140, score-0.299]
74 Perhaps because we had fewer segments to train with and there was a far higher proportion of noise segments, AdaBoost failed to produce any correct solutions. [sent-141, score-0.165]
75 We believe that the overwhelming number of negative training examples versus the small number of positive training samples used to create each decision stump severely affected AdaBoost’s ability to classify audio segments correctly. [sent-142, score-0.54]
76 For k-NN, the best feature set produced a 90% “one mistake” pass rate and a 49% exact solution match. [sent-145, score-0.151]
77 Table 2: Digg audio CAP TCHA results: Maximum 71% accuracy was achieved by SVM. [sent-147, score-0.36]
78 3 reC A P T C H A The older version of reCAPTCHA’s audio CAP TCHAs we tested consist of several speakers who speak random digits. [sent-150, score-0.415]
79 The background noise consists of human voices speaking backwards at varying volumes. [sent-151, score-0.141]
80 For AdaBoost, we create 11 classifiers: one classifier for each digit and one classifier for noise. [sent-153, score-0.245]
81 Because we know that the reCAPTCHA passing conditions are the “one mistake” passing conditions, SVM produces our best pass rate of 58%. [sent-154, score-0.147]
82 Table 3: reCAPTCHA audio CAP TCHA results: Maximum 45% accuracy was achieved by SVM. [sent-156, score-0.36]
83 The noise sounds like running water and static which sounds very different from the human voice and does not produce the same energy spikes needed to locate segments, therefore making segmentation quite easy. [sent-161, score-0.158]
84 Although Google used a smaller vocabulary than Digg and also only used one speaker, Google’s background noise made the CAP TCHA more difficult to solve. [sent-163, score-0.141]
85 After listening to a few of Google’s CAP TCHAs, we noticed that although the background noise consisted of human voices, the same background noise was repeated. [sent-164, score-0.198]
86 In conclusion, an audio CAP TCHA that consists of a finite vocabulary and background noise should have multiple speakers and noise similar to the speakers. [sent-168, score-0.539]
87 7 Recomm e n d at i o n s f or creat i n g st ro n g e r aud i o CAPTCHAs Due to our success in solving audio CAP TCHAs, we have decided to start developing new audio CAP TCHAs that our methods, and machine learning methods in general, will be less likely to solve. [sent-169, score-0.741]
88 Also, because our methods depend on the amount of training data we have, having a large vocabulary would make it more difficult to collect enough training data. [sent-171, score-0.11]
89 net has updated their audio CAP TCHA to contain more distortions and a larger vocabulary: the digits 0 through 99. [sent-173, score-0.433]
90 In designing a new audio CAP TCHA we are also concerned with the human pass rate. [sent-174, score-0.468]
91 The current human pass rate for the reCAPTCHA audio CAP TCHAs is only 70%. [sent-175, score-0.485]
92 To develop an audio CAP TCHA with an improved human pass rate, we plan to take advantage of the human mind’s ability to understand distorted audio through context clues. [sent-176, score-0.895]
93 By listening to a phrase instead of to random isolated words, humans are better able to decipher distorted utterances because they are familiar with the phrase or can use contextual clues to decipher the distorted audio. [sent-177, score-0.256]
94 Using this idea, the audio for our new audio CAP TCHA will be taken from old-time radio programs in which the poor quality of the audio makes transcription by ASR systems difficult. [sent-178, score-1.111]
95 Users will be presented with an audio clip consisting of a 4-6 word phrase. [sent-179, score-0.378]
96 We expect that this new audio CAP TCHA will be more secure than the current version and easier for humans to pass. [sent-182, score-0.436]
97 8 Co n c l u s i o n We have succeeded in “breaking” three different types of widely used audio CAP TCHAs, even though these were developed with the purpose of defeating attacks by machine learning techniques. [sent-184, score-0.36]
98 For our experiments, segment sizes were not chosen in a special way; occasionally yielding results in which a segment only contained half of a word, causing our prediction to contain that particular word twice. [sent-186, score-0.193]
99 We also believe that the AdaBoost results can be improved, particularly for the Digg audio CAP TCHAs, by ensuring that the number of negative training samples is closer to the number of positive training samples. [sent-187, score-0.4]
100 We have shown that our approach is successful and can be used with many different audio CAP TCHAs that contain small finite vocabularies. [sent-188, score-0.36]
wordName wordTfidf (topN-words)
[('cap', 0.532), ('tchas', 0.517), ('audio', 0.36), ('tcha', 0.306), ('classifiers', 0.129), ('captcha', 0.127), ('segments', 0.121), ('ras', 0.116), ('mistake', 0.109), ('digg', 0.095), ('adaboost', 0.092), ('recaptcha', 0.084), ('pass', 0.08), ('classifier', 0.079), ('segment', 0.079), ('captchas', 0.074), ('mfcc', 0.074), ('digits', 0.073), ('digit', 0.068), ('google', 0.058), ('plp', 0.055), ('file', 0.053), ('match', 0.048), ('humans', 0.044), ('noise', 0.044), ('difficult', 0.042), ('voices', 0.042), ('broken', 0.041), ('distorted', 0.039), ('security', 0.037), ('transcriptions', 0.037), ('speakers', 0.036), ('exact', 0.034), ('letter', 0.034), ('five', 0.034), ('forbes', 0.034), ('letters', 0.032), ('phrase', 0.032), ('plpceps', 0.032), ('plpspec', 0.032), ('secure', 0.032), ('taplpceps', 0.032), ('taplpspec', 0.032), ('programs', 0.031), ('web', 0.03), ('ave', 0.03), ('spoken', 0.028), ('carnegie', 0.028), ('mellon', 0.028), ('speech', 0.028), ('human', 0.028), ('ahn', 0.028), ('asr', 0.028), ('listening', 0.028), ('recap', 0.028), ('svm', 0.028), ('vocabulary', 0.028), ('background', 0.027), ('users', 0.027), ('break', 0.026), ('classified', 0.025), ('identified', 0.025), ('passing', 0.025), ('characters', 0.025), ('breaking', 0.024), ('sounds', 0.024), ('visual', 0.024), ('type', 0.023), ('impaired', 0.022), ('energy', 0.022), ('sites', 0.021), ('saying', 0.021), ('aud', 0.021), ('bots', 0.021), ('coefficients', 0.021), ('decipher', 0.021), ('hermansk', 0.021), ('rocessing', 0.021), ('manual', 0.021), ('automated', 0.021), ('computers', 0.02), ('solution', 0.02), ('pittsburgh', 0.02), ('training', 0.02), ('strongest', 0.02), ('create', 0.019), ('speaker', 0.019), ('consist', 0.019), ('tam', 0.018), ('von', 0.018), ('word', 0.018), ('half', 0.017), ('rate', 0.017), ('cepstral', 0.017), ('luis', 0.017), ('static', 0.016), ('peak', 0.016), ('ff', 0.016), ('int', 0.016), ('highest', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999994 41 nips-2008-Breaking Audio CAPTCHAs
Author: Jennifer Tam, Jiri Simsa, Sean Hyde, Luis V. Ahn
Abstract: CAP TCHAs are computer-generated tests that humans can pass but current computer systems cannot. CAP TCHAs provide a method for automatically distinguishing a human from a computer program, and therefore can protect Web services from abuse by so-called “bots.” Most CAP TCHAs consist of distorted images, usually text, for which a user must provide some description. Unfortunately, visual CAP TCHAs limit access to the millions of visually impaired people using the Web. Audio CAP TCHAs were created to solve this accessibility issue; however, the security of audio CAP TCHAs was never formally tested. Some visual CAP TCHAs have been broken using machine learning techniques, and we propose using similar ideas to test the security of audio CAP TCHAs. Audio CAP TCHAs are generally composed of a set of words to be identified, layered on top of noise. We analyzed the security of current audio CAP TCHAs from popular Web sites by using AdaBoost, SVM, and k-NN, and achieved correct solutions for test samples with accuracy up to 71%. Such accuracy is enough to consider these CAPTCHAs broken. Training several different machine learning algorithms on different types of audio CAP TCHAs allowed us to analyze the strengths and weaknesses of the algorithms so that we could suggest a design for a more robust audio CAPTCHA. 1 Int rod uct i o n CAP TCHAs [1] are automated tests designed to tell computers and humans apart by presenting users with a problem that humans can solve but current computer programs cannot. Because CAPTCHAs can distinguish between humans and computers with high probability, they are used for many different security applications: they prevent bots from voting continuously in online polls, automatically registering for millions of spam email accounts, automatically purchasing tickets to buy out an event, etc. Once a CAP TCHA is broken (i.e., computer programs can successfully pass the test), bots can impersonate humans and gain access to services that they should not. Therefore, it is important for CAP TCHAs to be secure. To pass the typical visual CAP TCHA, a user must correctly type the characters displayed in an image of distorted text. Many visual CAP TCHAs have been broken with machine learning techniques [2]-[3], though some remain secure against such attacks. Because visually impaired users who surf the Web using screen-reading programs cannot see this type of CAPTCHA, audio CAP TCHAs were created. Typical audio CAP TCHAs consist of one or several speakers saying letters or digits at randomly spaced intervals. A user must correctly identify the digits or characters spoken in the audio file to pass the CAP TCHA. To make this test difficult for current computer systems, specifically automatic speech recognition (ASR) programs, background noise is injected into the audio files. Since no official evaluation of existing audio CAP TCHAs has been reported, we tested the security of audio CAP TCHAs used by many popular Web sites by running machine learning experiments designed to break them. In the next section, we provide an overview of the literature related to our project. Section 3 describes our methods for creating training data, and section 4 describes how we create classifiers that can recognize letters, digits, and noise. In section 5, we discuss how we evaluated our methods on widely used audio CAP TCHAs and we give our results. In particular, we show that the audio CAP TCHAs used by sites such as Google and Digg are susceptible to machine learning attacks. 
Section 6 mentions the proposed design of a new more secure audio CAP TCHA based on our findings. 2 Lit erat u r e r ev i ew To break the audio CAP TCHAs, we derive features from the CAP TCHA audio and use several machine learning techniques to perform ASR on segments of the CAPTCHA. There are many popular techniques for extracting features from speech. The three techniques we use are mel-frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP), and relative spectral transform-PLP (RAS TA-PLP). MFCC is one of the most popular speech feature representations used. Similar to a fast Fourier transform (FF T), MFCC transforms an audio file into frequency bands, but (unlike FF T) MFCC uses mel-frequency bands, which are better for approximating the range of frequencies humans hear. PLP was designed to extract speaker-independent features from speech [4]. Therefore, by using PLP and a variant such as RAS TA-PL P, we were able to train our classifiers to recognize letters and digits independently of who spoke them. Since many different people recorded the digits used in one of the types of audio CAP TCHAs we tested, PLP and RAS TA-PLP were needed to extract the features that were most useful for solving them. In [4]-[5], the authors conducted experiments on recognizing isolated digits in the presence of noise using both PLP and RAS TA-PL P. However, the noise used consisted of telephone or microphone static caused by recording in different locations. The audio CAP TCHAs we use contain this type of noise, as well as added vocal noise and/or music, which is supposed to make the automated recognition process much harder. The authors of [3] emphasize how many visual CAP TCHAs can be broken by successfully splitting the task into two smaller tasks: segmentation and recognition. We follow a similar approach in that we first automatically split the audio into segments, and then we classify these segments as noise or words. In early March 2008, concurrent to our work, the blog of Wintercore Labs [6] claimed to have successfully broken the Google audio CAP TCHA. After reading their Web article and viewing the video of how they solve the CAP TCHAs, we are unconvinced that the process is entirely automatic, and it is unclear what their exact pass rate is. Because we are unable to find any formal technical analysis of this program, we can neither be sure of its accuracy nor the extent of its automation. 3 Cr e at i o n of tra i n i n g dat a Since automated programs can attempt to pass a CAPTCHA repeatedly, a CAPTCHA is essentially broken when a program can pass it more than a non-trivial fraction of the time; e.g., a 5% pass rate is enough. Our approach to breaking the audio CAP TCHAs began by first splitting the audio files into segments of noise or words: for our experiments, the words were spoken letters or digits. We used manual transcriptions of the audio CAP TCHAs to get information regarding the location of each spoken word within the audio file. We were able to label our segments accurately by using this information. We gathered 1,000 audio CAP TCHAs from each of the following Web sites: google.com, digg.com, and an older version of the audio CAP TCHA in recaptcha.net. Each of the CAP TCHAs was annotated with the information regarding letter/digit locations provided by the manual transcriptions. For each type of CAPTCHA, we randomly selected 900 samples for training and used the remaining 100 for testing. 
Using the digit/letter location information provided in the manual CAP TCHA transcriptions, each training CAP TCHA is divided into segments of noise, the letters a-z, or the digits 0-9, and labeled as such. We ignore the annotation information of the CAP TCHAs we use for testing, and therefore we cannot identify the size of those segments. Instead, each test CAP TCHA is divided into a number of fixed-size segments. The segments with the highest energy peaks are then classified using machine learning techniques (Figure 1). Since the size of a feature vector extracted from a segment generally depends on the size of the segment, using fixed-size segments allows each segment to be described with a feature vector of the same length. We chose the window size by listening to a few training segments and adjusted accordingly to ensure that the segment contained the entire digit/letter. There is undoubtedly a more optimal way of selecting the window size, however, we were still able to break the three CAP TCHAs we tested with our method. Figure 1: A test audio CAP TCHA with the fixed-size segments containing the highest energy peaks highlighted. The information provided in the manual transcriptions of the audio CAP TCHAs contains a list of the time intervals within which words are spoken. However, these intervals are of variable size and the word might be spoken anywhere within this interval. To provide fixedsize segments for training, we developed the following heuristic. First, divide each file into variable-size segments using the time intervals provided and label each segment accordingly. Then, within each segment, detect the highest energy peak and return its fixed-size neighborhood labeled with the current segment’s label. This heuristic achieved nearly perfect labeling accuracy for the training set. Rare mistakes occurred when the highest energy peak of a digit or letter segment corresponded to noise rather than to a digit or letter. To summarize this subsection, an audio file is transformed into a set of fixed-size segments labeled as noise, a digit between 0 and 9, or a letter between a and z. These segments are then used for training. Classifiers are trained for one type of CAPTCHA at a time. 4 C l a s s i f i e r con s t ru ct i o n From the training data we extracted five sets of features using twelve MFCCs and twelfth- order spectral (SPEC) and cepstral (CEPS) coefficients from PLP Matlab functions for extracting these features were provided online Voicebox package. We use AdaBoost, SVM, and k-NN algorithms digit and letter recognition. We detail our implementation of following subsections. 4 .1 and RAS TA-PL P. The at [7] and as part of the to implement automated each algorithm in the AdaBoost Using decision stumps as weak classifiers for AdaBoost, anywhere from 11 to 37 ensemble classifiers are built. The number of classifiers built depends on which type of CAPTCHA we are solving. Each classifier trains on all the segments associated with that type of CAP TCHA, and for the purpose of building a single classifier, segments are labeled by either -1 (negative example) or +1 (positive example). Using cross-validation, we choose to use 50 iterations for our AdaBoost algorithm. A segment can then be classified as a particular letter, digit, or noise according to the ensemble classifier that outputs the number closest to 1. 4 .2 S u p p o rt v e ct o r m a c h i n e To conduct digit recognition with SVM, we used the C++ implementations of libSVM [8] version 2.85 with C-SMV and RBF kernel. 
First, all feature values are scaled to the range of -1 to 1 as suggested by [8]. The scale parameters are stored so that test samples can be scaled accordingly. Then, a single multiclass classifier is created for each set of features using all the segments for a particular type of CAPTCHA. We use cross-validation and grid search to discover the optimal slack penalty (C=32) and kernel parameter (γ=0.011). 4 .3 k - n e a re st n e i g h b o r ( k - N N ) We use k-NN as our final method for classifying digits. For each type of CAP TCHA, five different classifiers are created by using all of the training data and the five sets of features associated with that particular type of CAP TCHA. Again we use cross-validation to discover the optimal parameter, in this case k=1. We use Euclidian distance as our distance metric. 5 Ass e s sm e n t of cu rre n t a ud i o CAPTCHAs Our method for solving CAP TCHAs iteratively extracts an audio segment from a CAP TCHA, inputs the segment to one of our digit or letter recognizers, and outputs the label for that segment. We continue this process until the maximum solution size is reached or there are no unlabeled segments left. Some of the CAPTCHAs we evaluated have solutions that vary in length. Our method ensures that we get solutions of varying length that are never longer than the maximum solution length. A segment to be classified is identified by taking the neighborhood of the highest energy peak of an as yet unlabeled part of the CAP TCHA. Once a prediction of the solution to the CAPTCHA is computed, it is compared to the true solution. Given that at least one of the audio CAP TCHAs allows users to make a mistake in one of the digits (e.g., reCAPTCHA), we compute the pass rate for each of the different types of CAPTCHAs with all of the following conditions: • The prediction matches the true solution exactly. • Inserting one digit into the prediction would make it match the solution exactly. • Replacing one digit in the prediction would make it match the solution exactly. • Removing one digit from the prediction would make it match the solution exactly. However, since we are only sure that these conditions apply to reCAPTCHA audio CAP TCHAs, we also calculate the percentage of exact solution matches in our results for each type of audio CAP TCHA. These results are described in the following subsections. 5 .1 Goog le Google audio CAP TCHAs consist of one speaker saying random digits 0-9, the phrase “once again,” followed by the exact same recorded sequence of digits originally presented. The background noise consists of human voices speaking backwards at varying volumes. A solution can range in length from five to eight words. We set our classifier to find the 12 loudest segments and classify these segments as digits or noise. Because the phrase “once again” marks the halfway point of the CAPTCHA, we preprocessed the audio to only serve this half of the CAP TCHA to our classifiers. It is important to note, however, that the classifiers were always able to identify the segment containing “once again,” and these segments were identified before all other segments. Therefore, if necessary, we could have had our system cut the file in half after first labeling this segment. For AdaBoost, we create 12 classifiers: one classifier for each digit, one for noise, and one for the phrase “once again.” Our results ( Table 1) show that at best we achieved a 90% pass rate using the “one mistake” passing conditions and a 66% exact solution match rate. 
5.1 Google

Google audio CAPTCHAs consist of one speaker saying random digits 0-9, the phrase "once again," followed by the exact same recorded sequence of digits originally presented. The background noise consists of human voices speaking backwards at varying volumes. A solution can range in length from five to eight words. We set our classifier to find the 12 loudest segments and classify these segments as digits or noise. Because the phrase "once again" marks the halfway point of the CAPTCHA, we preprocessed the audio so that only the first half of the CAPTCHA was served to our classifiers. It is important to note, however, that the classifiers were always able to identify the segment containing "once again," and these segments were identified before all other segments. Therefore, if necessary, we could have had our system cut the file in half after first labeling this segment. For AdaBoost, we create 12 classifiers: one classifier for each digit, one for noise, and one for the phrase "once again." Our results (Table 1) show that at best we achieved a 90% pass rate using the "one mistake" passing conditions and a 66% exact solution match rate. Using SVM and the "one mistake" passing conditions, at best we achieve a 92% pass rate and a 67% exact solution match. For k-NN, the "one mistake" pass rate is 62% and the exact solution match rate is 26%.

Table 1: Google audio CAPTCHA results. Maximum 67% accuracy (exact solution match) was achieved by SVM. Each cell lists the "one mistake" pass rate / exact match rate.

Features Used    AdaBoost     SVM          k-NN
MFCC             88% / 61%    92% / 67%    30% / 1%
PLPSPEC          90% / 66%    90% / 67%    60% / 26%
PLPCEPS          90% / 66%    92% / 67%    62% / 23%
RASTAPLPSPEC     88% / 48%    90% / 61%    29% / 1%
RASTAPLPCEPS     90% / 63%    92% / 67%    33% / 2%

5.2 Digg

Digg CAPTCHAs also consist of one speaker, in this case saying a random combination of letters and digits. The background noise consists of static or what sounds like trickling water, and it is not continuous throughout the entire file. We noticed in our training data that the following characters were never present in a solution: 0, 1, 2, 5, 7, 9, i, o, z. Since the Digg audio CAPTCHA is also the verbal transcription of the visual CAPTCHA, we believe that these characters are excluded to avoid confusion between digits and letters that are similar in appearance. The solution length varies between three and six words. Using AdaBoost, we create 28 classifiers: one classifier for each digit or letter that appears in our training data and one classifier for noise. Perhaps because we had fewer segments to train with and there was a far higher proportion of noise segments, AdaBoost failed to produce any correct solutions. We believe that the overwhelming number of negative training examples versus the small number of positive training samples used to create each decision stump severely affected AdaBoost's ability to classify audio segments correctly. A histogram of the training samples is provided in Figure 2 to illustrate the amount of training data available for each character. When using SVM, the best feature set passed with 96% under the "one mistake" passing conditions and with 71% when matching the solution exactly. For k-NN, the best feature set produced a 90% "one mistake" pass rate and a 49% exact solution match. Full results can be found in Table 2.

Table 2: Digg audio CAPTCHA results. Maximum 71% accuracy (exact solution match) was achieved by SVM. Each cell lists the "one mistake" pass rate / exact match rate; AdaBoost produced no correct solutions.

Features Used    AdaBoost    SVM          k-NN
MFCC             - / -       96% / 71%    89% / 49%
PLPSPEC          - / -       94% / 65%    90% / 47%
PLPCEPS          - / -       96% / 71%    64% / 17%
RASTAPLPSPEC     - / -       17% / 3%     67% / 17%
RASTAPLPCEPS     - / -       96% / 71%    82% / 34%

Figure 2: Digg CAPTCHA training data distribution. (Histogram of the number of training segments per label; y-axis: # of Segments, ranging from 0 to 1000; x-axis: Segment Label, covering noise, the digits 3, 4, 6, and 8, and the letters a-y excluding i and o.)
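One simple way to address the imbalance described above, which the conclusion in Section 8 also suggests, is to downsample the negative examples when assembling the training set for each per-character ensemble. The sketch below is our illustration, not a step the authors report taking; the 3:1 negative-to-positive ratio is an arbitrary assumption.

```python
# Hedged illustration of rebalancing a per-character training set by
# downsampling negatives. `features` is a 2-D NumPy array (segments x dims).
import numpy as np

def balanced_binary_set(features, labels, target, neg_per_pos=3, seed=0):
    """Keep all positives for `target` and a random subset of negatives."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    pos = np.flatnonzero(labels == target)
    neg = np.flatnonzero(labels != target)
    keep_neg = rng.choice(neg, size=min(len(neg), neg_per_pos * len(pos)),
                          replace=False)
    idx = np.concatenate([pos, keep_neg])
    y = np.where(labels[idx] == target, 1, -1)
    return features[idx], y
```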
5.3 reCAPTCHA

The older version of reCAPTCHA's audio CAPTCHAs we tested consists of several speakers who speak random digits. The background noise consists of human voices speaking backwards at varying volumes. The solution is always eight digits long. For AdaBoost, we create 11 classifiers: one classifier for each digit and one classifier for noise. Because we know that reCAPTCHA uses the "one mistake" passing conditions, we report that pass rate: SVM produces our best pass rate of 58%. Our best exact match rate is 45% (Table 3).

Table 3: reCAPTCHA audio CAPTCHA results. Maximum 45% accuracy (exact solution match) was achieved by SVM. Each cell lists the "one mistake" pass rate / exact match rate.

Features Used    AdaBoost    SVM          k-NN
MFCC             18% / 6%    56% / 43%    22% / 11%
PLPSPEC          27% / 10%   58% / 39%    43% / 25%
PLPCEPS          23% / 10%   56% / 45%    29% / 14%
RASTAPLPSPEC     9% / 3%     36% / 18%    24% / 4%
RASTAPLPCEPS     9% / 3%     46% / 30%    32% / 12%

6 Properties of weak versus strong CAPTCHAs

From our results, we note that the easiest CAPTCHAs to break were from Digg; Google's were the next strongest, and reCAPTCHA's were the strongest of the three. Although the Digg CAPTCHAs have the largest vocabulary, giving us less training data per label, the same woman recorded them all. More importantly, the same type of noise is used throughout the entire CAPTCHA. The noise sounds like running water and static, which is very different from the human voice and does not produce the energy spikes needed to locate segments, making segmentation quite easy. The CAPTCHAs from Google and reCAPTCHA used other human voices for background noise, making segmentation much more difficult. Although Google used a smaller vocabulary than Digg and also used only one speaker, Google's background noise made the CAPTCHA more difficult to solve. After listening to a few of Google's CAPTCHAs, we noticed that although the background noise consisted of human voices, the same background noise was repeated. reCAPTCHA had noise similar to Google's, but drew from a larger selection of noise, making it harder to learn. reCAPTCHA also has the longest solution length, making it more difficult to get perfectly correct. Finally, reCAPTCHA used many different speakers, making it the strongest CAPTCHA of the three we tested. In conclusion, an audio CAPTCHA that consists of a finite vocabulary and background noise should have multiple speakers and noise similar to the speakers.

7 Recommendations for creating stronger audio CAPTCHAs

Due to our success in solving audio CAPTCHAs, we have decided to start developing new audio CAPTCHAs that our methods, and machine learning methods in general, will be less likely to solve. From our experiments, we note that CAPTCHAs containing longer solutions and multiple speakers tend to be more difficult to solve. Also, because our methods depend on the amount of training data we have, a large vocabulary would make it more difficult to collect enough training data. Since obtaining these results, reCAPTCHA.net has already updated its audio CAPTCHA to contain more distortions and a larger vocabulary: the digits 0 through 99. In designing a new audio CAPTCHA we are also concerned with the human pass rate. The current human pass rate for the reCAPTCHA audio CAPTCHAs is only 70%. To develop an audio CAPTCHA with an improved human pass rate, we plan to take advantage of the human mind's ability to understand distorted audio through context clues. By listening to a phrase instead of to random isolated words, humans are better able to decipher distorted utterances because they are familiar with the phrase or can use contextual clues to decipher the distorted audio. Using this idea, the audio for our new audio CAPTCHA will be taken from old-time radio programs in which the poor quality of the audio makes transcription by ASR systems difficult. Users will be presented with an audio clip consisting of a 4-6 word phrase. Half of the CAPTCHA consists of words whose transcription is known, which validates a user as human, while the other half of the words still needs to be transcribed.
This is the same idea behind the visual reCAPTCHA, which is currently digitizing text on which OCR fails. We expect that this new audio CAPTCHA will be more secure than the current version and easier for humans to pass. Initial experiments using this idea show this to be true [9].

8 Conclusion

We have succeeded in "breaking" three different types of widely used audio CAPTCHAs, even though these were developed with the purpose of defeating attacks by machine learning techniques. We believe our results can be improved by selecting optimal segment sizes, but that is unnecessary given our already high success rate. For our experiments, segment sizes were not chosen in a special way, which occasionally yielded results in which a segment contained only half of a word, causing our prediction to contain that particular word twice. We also believe that the AdaBoost results can be improved, particularly for the Digg audio CAPTCHAs, by ensuring that the number of negative training samples is closer to the number of positive training samples. We have shown that our approach is successful and can be used with many different audio CAPTCHAs that contain small finite vocabularies.

Acknowledgments

This work was partially supported by generous gifts from the Heinz Endowment, by an equipment grant from Intel Corporation, and by the Army Research Office through grant number DAAD19-02-1-0389 to CyLab at Carnegie Mellon University. Luis von Ahn was partially supported by a Microsoft Research New Faculty Fellowship and a MacArthur Fellowship. Jennifer Tam was partially supported by a Google Anita Borg Scholarship.

References

[1] L. von Ahn, M. Blum, and J. Langford. "Telling Humans and Computers Apart Automatically," Communications of the ACM, vol. 47, no. 2, pp. 57-60, Feb. 2004.
[2] G. Mori and J. Malik. "Recognizing Objects in Adversarial Clutter: Breaking a Visual CAPTCHA," in Computer Vision and Pattern Recognition (CVPR'03), June 2003.
[3] K. Chellapilla and P. Simard. "Using Machine Learning to Break Visual Human Interaction Proofs (HIPs)," Advances in Neural Information Processing Systems 17 (NIPS 2004), MIT Press.
[4] H. Hermansky. "Perceptual Linear Predictive (PLP) Analysis of Speech," J. Acoust. Soc. Am., vol. 87, no. 4, pp. 1738-1752, Apr. 1990.
[5] H. Hermansky, N. Morgan, A. Bayya, and P. Kohn. "RASTA-PLP Speech Analysis Technique," in Proc. IEEE Int'l Conf. Acoustics, Speech & Signal Processing, vol. 1, pp. 121-124, San Francisco, 1992.
[6] R. Santamarta. "Breaking Gmail's Audio Captcha," http://blog.wintercore.com/?p=11, 2008.
[7] D. Ellis. "PLP and RASTA (and MFCC, and inversion) in Matlab using melfcc.m and invmelfcc.m," http://www.ee.columbia.edu/~dpwe/resources/matlab/rastamat/, 2006.
[8] C. Chang and C. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
[9] A. Schlaikjer. "A Dual-Use Speech CAPTCHA: Aiding Visually Impaired Web Users while Providing Transcriptions of Audio Streams," Technical Report CMU-LTI-07-014, Carnegie Mellon University, November 2007.
2 0.066257358 147 nips-2008-Multiscale Random Fields with Application to Contour Grouping
Author: Longin J. Latecki, Chengen Lu, Marc Sobel, Xiang Bai
Abstract: We introduce a new interpretation of multiscale random fields (MSRFs) that admits efficient optimization in the framework of regular (single level) random fields (RFs). It is based on a new operator, called append, that combines sets of random variables (RVs) to single RVs. We assume that a MSRF can be decomposed into disjoint trees that link RVs at different pyramid levels. The append operator is then applied to map RVs in each tree structure to a single RV. We demonstrate the usefulness of the proposed approach on a challenging task involving grouping contours of target shapes in images. It provides a natural representation of multiscale contour models, which is needed in order to cope with unstable contour decompositions. The append operator allows us to find optimal image segment labels using the classical framework of relaxation labeling. Alternative methods like Markov Chain Monte Carlo (MCMC) could also be used.
3 0.051719576 208 nips-2008-Shared Segmentation of Natural Scenes Using Dependent Pitman-Yor Processes
Author: Erik B. Sudderth, Michael I. Jordan
Abstract: We develop a statistical framework for the simultaneous, unsupervised segmentation and discovery of visual object categories from image databases. Examining a large set of manually segmented scenes, we show that object frequencies and segment sizes both follow power law distributions, which are well modeled by the Pitman–Yor (PY) process. This nonparametric prior distribution leads to learning algorithms which discover an unknown set of objects, and segmentation methods which automatically adapt their resolution to each image. Generalizing previous applications of PY processes, we use Gaussian processes to discover spatially contiguous segments which respect image boundaries. Using a novel family of variational approximations, our approach produces segmentations which compare favorably to state-of-the-art methods, while simultaneously discovering categories shared among natural scenes. 1
4 0.044642907 250 nips-2008-Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning
Author: Ali Rahimi, Benjamin Recht
Abstract: Randomized neural networks are immortalized in this AI Koan: In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6. “What are you doing?” asked Minsky. “I am training a randomly wired neural net to play tic-tac-toe,” Sussman replied. “Why is the net wired randomly?” asked Minsky. Sussman replied, “I do not want it to have any preconceptions of how to play.” Minsky then shut his eyes. “Why do you close your eyes?” Sussman asked his teacher. “So that the room will be empty,” replied Minsky. At that moment, Sussman was enlightened. We analyze shallow random networks with the help of concentration of measure inequalities. Specifically, we consider architectures that compute a weighted sum of their inputs after passing them through a bank of arbitrary randomized nonlinearities. We identify conditions under which these networks exhibit good classification performance, and bound their test error in terms of the size of the dataset and the number of random nonlinearities. 1
5 0.043427862 36 nips-2008-Beyond Novelty Detection: Incongruent Events, when General and Specific Classifiers Disagree
Author: Daphna Weinshall, Hynek Hermansky, Alon Zweig, Jie Luo, Holly Jimison, Frank Ohl, Misha Pavel
Abstract: Unexpected stimuli are a challenge to any machine learning algorithm. Here we identify distinct types of unexpected events, focusing on ’incongruent events’ when ’general level’ and ’specific level’ classifiers give conflicting predictions. We define a formal framework for the representation and processing of incongruent events: starting from the notion of label hierarchy, we show how partial order on labels can be deduced from such hierarchies. For each event, we compute its probability in different ways, based on adjacent levels (according to the partial order) in the label hierarchy. An incongruent event is an event where the probability computed based on some more specific level (in accordance with the partial order) is much smaller than the probability computed based on some more general level, leading to conflicting predictions. We derive algorithms to detect incongruent events from different types of hierarchies, corresponding to class membership or part membership. Respectively, we show promising results with real data on two specific problems: Out Of Vocabulary words in speech recognition, and the identification of a new sub-class (e.g., the face of a new individual) in audio-visual facial object recognition.
6 0.035646494 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features
7 0.035337605 62 nips-2008-Differentiable Sparse Coding
8 0.034339875 246 nips-2008-Unsupervised Learning of Visual Sense Models for Polysemous Words
9 0.033523954 142 nips-2008-Multi-Level Active Prediction of Useful Image Annotations for Recognition
10 0.033320896 26 nips-2008-Analyzing human feature learning as nonparametric Bayesian inference
11 0.030314026 101 nips-2008-Human Active Learning
12 0.028128698 196 nips-2008-Relative Margin Machines
13 0.027117217 169 nips-2008-Online Models for Content Optimization
14 0.025581442 241 nips-2008-Transfer Learning by Distribution Matching for Targeted Advertising
15 0.024913207 4 nips-2008-A Scalable Hierarchical Distributed Language Model
16 0.024183048 52 nips-2008-Correlated Bigram LSA for Unsupervised Language Model Adaptation
17 0.023931213 42 nips-2008-Cascaded Classification Models: Combining Models for Holistic Scene Understanding
18 0.023635235 111 nips-2008-Kernel Change-point Analysis
19 0.02351444 56 nips-2008-Deep Learning with Kernel Regularization for Visual Recognition
20 0.022765469 179 nips-2008-Phase transitions for high-dimensional joint support recovery
topicId topicWeight
[(0, -0.07), (1, -0.028), (2, 0.024), (3, -0.034), (4, -0.024), (5, -0.004), (6, -0.006), (7, -0.025), (8, 0.014), (9, 0.015), (10, 0.001), (11, 0.015), (12, -0.017), (13, -0.054), (14, -0.019), (15, 0.037), (16, -0.008), (17, -0.001), (18, 0.001), (19, 0.017), (20, 0.006), (21, 0.035), (22, 0.053), (23, 0.001), (24, 0.006), (25, 0.055), (26, 0.026), (27, -0.021), (28, -0.05), (29, 0.016), (30, -0.0), (31, 0.019), (32, 0.013), (33, -0.012), (34, -0.027), (35, 0.0), (36, 0.051), (37, -0.028), (38, 0.065), (39, -0.022), (40, 0.01), (41, -0.002), (42, 0.048), (43, -0.029), (44, 0.077), (45, -0.022), (46, -0.038), (47, -0.031), (48, -0.013), (49, 0.101)]
simIndex simValue paperId paperTitle
same-paper 1 0.92980862 41 nips-2008-Breaking Audio CAPTCHAs
Author: Jennifer Tam, Jiri Simsa, Sean Hyde, Luis V. Ahn
Abstract: CAP TCHAs are computer-generated tests that humans can pass but current computer systems cannot. CAP TCHAs provide a method for automatically distinguishing a human from a computer program, and therefore can protect Web services from abuse by so-called “bots.” Most CAP TCHAs consist of distorted images, usually text, for which a user must provide some description. Unfortunately, visual CAP TCHAs limit access to the millions of visually impaired people using the Web. Audio CAP TCHAs were created to solve this accessibility issue; however, the security of audio CAP TCHAs was never formally tested. Some visual CAP TCHAs have been broken using machine learning techniques, and we propose using similar ideas to test the security of audio CAP TCHAs. Audio CAP TCHAs are generally composed of a set of words to be identified, layered on top of noise. We analyzed the security of current audio CAP TCHAs from popular Web sites by using AdaBoost, SVM, and k-NN, and achieved correct solutions for test samples with accuracy up to 71%. Such accuracy is enough to consider these CAPTCHAs broken. Training several different machine learning algorithms on different types of audio CAP TCHAs allowed us to analyze the strengths and weaknesses of the algorithms so that we could suggest a design for a more robust audio CAPTCHA. 1 Int rod uct i o n CAP TCHAs [1] are automated tests designed to tell computers and humans apart by presenting users with a problem that humans can solve but current computer programs cannot. Because CAPTCHAs can distinguish between humans and computers with high probability, they are used for many different security applications: they prevent bots from voting continuously in online polls, automatically registering for millions of spam email accounts, automatically purchasing tickets to buy out an event, etc. Once a CAP TCHA is broken (i.e., computer programs can successfully pass the test), bots can impersonate humans and gain access to services that they should not. Therefore, it is important for CAP TCHAs to be secure. To pass the typical visual CAP TCHA, a user must correctly type the characters displayed in an image of distorted text. Many visual CAP TCHAs have been broken with machine learning techniques [2]-[3], though some remain secure against such attacks. Because visually impaired users who surf the Web using screen-reading programs cannot see this type of CAPTCHA, audio CAP TCHAs were created. Typical audio CAP TCHAs consist of one or several speakers saying letters or digits at randomly spaced intervals. A user must correctly identify the digits or characters spoken in the audio file to pass the CAP TCHA. To make this test difficult for current computer systems, specifically automatic speech recognition (ASR) programs, background noise is injected into the audio files. Since no official evaluation of existing audio CAP TCHAs has been reported, we tested the security of audio CAP TCHAs used by many popular Web sites by running machine learning experiments designed to break them. In the next section, we provide an overview of the literature related to our project. Section 3 describes our methods for creating training data, and section 4 describes how we create classifiers that can recognize letters, digits, and noise. In section 5, we discuss how we evaluated our methods on widely used audio CAP TCHAs and we give our results. In particular, we show that the audio CAP TCHAs used by sites such as Google and Digg are susceptible to machine learning attacks. 
Section 6 mentions the proposed design of a new more secure audio CAP TCHA based on our findings. 2 Lit erat u r e r ev i ew To break the audio CAP TCHAs, we derive features from the CAP TCHA audio and use several machine learning techniques to perform ASR on segments of the CAPTCHA. There are many popular techniques for extracting features from speech. The three techniques we use are mel-frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP), and relative spectral transform-PLP (RAS TA-PLP). MFCC is one of the most popular speech feature representations used. Similar to a fast Fourier transform (FF T), MFCC transforms an audio file into frequency bands, but (unlike FF T) MFCC uses mel-frequency bands, which are better for approximating the range of frequencies humans hear. PLP was designed to extract speaker-independent features from speech [4]. Therefore, by using PLP and a variant such as RAS TA-PL P, we were able to train our classifiers to recognize letters and digits independently of who spoke them. Since many different people recorded the digits used in one of the types of audio CAP TCHAs we tested, PLP and RAS TA-PLP were needed to extract the features that were most useful for solving them. In [4]-[5], the authors conducted experiments on recognizing isolated digits in the presence of noise using both PLP and RAS TA-PL P. However, the noise used consisted of telephone or microphone static caused by recording in different locations. The audio CAP TCHAs we use contain this type of noise, as well as added vocal noise and/or music, which is supposed to make the automated recognition process much harder. The authors of [3] emphasize how many visual CAP TCHAs can be broken by successfully splitting the task into two smaller tasks: segmentation and recognition. We follow a similar approach in that we first automatically split the audio into segments, and then we classify these segments as noise or words. In early March 2008, concurrent to our work, the blog of Wintercore Labs [6] claimed to have successfully broken the Google audio CAP TCHA. After reading their Web article and viewing the video of how they solve the CAP TCHAs, we are unconvinced that the process is entirely automatic, and it is unclear what their exact pass rate is. Because we are unable to find any formal technical analysis of this program, we can neither be sure of its accuracy nor the extent of its automation. 3 Cr e at i o n of tra i n i n g dat a Since automated programs can attempt to pass a CAPTCHA repeatedly, a CAPTCHA is essentially broken when a program can pass it more than a non-trivial fraction of the time; e.g., a 5% pass rate is enough. Our approach to breaking the audio CAP TCHAs began by first splitting the audio files into segments of noise or words: for our experiments, the words were spoken letters or digits. We used manual transcriptions of the audio CAP TCHAs to get information regarding the location of each spoken word within the audio file. We were able to label our segments accurately by using this information. We gathered 1,000 audio CAP TCHAs from each of the following Web sites: google.com, digg.com, and an older version of the audio CAP TCHA in recaptcha.net. Each of the CAP TCHAs was annotated with the information regarding letter/digit locations provided by the manual transcriptions. For each type of CAPTCHA, we randomly selected 900 samples for training and used the remaining 100 for testing. 
Using the digit/letter location information provided in the manual CAP TCHA transcriptions, each training CAP TCHA is divided into segments of noise, the letters a-z, or the digits 0-9, and labeled as such. We ignore the annotation information of the CAP TCHAs we use for testing, and therefore we cannot identify the size of those segments. Instead, each test CAP TCHA is divided into a number of fixed-size segments. The segments with the highest energy peaks are then classified using machine learning techniques (Figure 1). Since the size of a feature vector extracted from a segment generally depends on the size of the segment, using fixed-size segments allows each segment to be described with a feature vector of the same length. We chose the window size by listening to a few training segments and adjusted accordingly to ensure that the segment contained the entire digit/letter. There is undoubtedly a more optimal way of selecting the window size, however, we were still able to break the three CAP TCHAs we tested with our method. Figure 1: A test audio CAP TCHA with the fixed-size segments containing the highest energy peaks highlighted. The information provided in the manual transcriptions of the audio CAP TCHAs contains a list of the time intervals within which words are spoken. However, these intervals are of variable size and the word might be spoken anywhere within this interval. To provide fixedsize segments for training, we developed the following heuristic. First, divide each file into variable-size segments using the time intervals provided and label each segment accordingly. Then, within each segment, detect the highest energy peak and return its fixed-size neighborhood labeled with the current segment’s label. This heuristic achieved nearly perfect labeling accuracy for the training set. Rare mistakes occurred when the highest energy peak of a digit or letter segment corresponded to noise rather than to a digit or letter. To summarize this subsection, an audio file is transformed into a set of fixed-size segments labeled as noise, a digit between 0 and 9, or a letter between a and z. These segments are then used for training. Classifiers are trained for one type of CAPTCHA at a time. 4 C l a s s i f i e r con s t ru ct i o n From the training data we extracted five sets of features using twelve MFCCs and twelfth- order spectral (SPEC) and cepstral (CEPS) coefficients from PLP Matlab functions for extracting these features were provided online Voicebox package. We use AdaBoost, SVM, and k-NN algorithms digit and letter recognition. We detail our implementation of following subsections. 4 .1 and RAS TA-PL P. The at [7] and as part of the to implement automated each algorithm in the AdaBoost Using decision stumps as weak classifiers for AdaBoost, anywhere from 11 to 37 ensemble classifiers are built. The number of classifiers built depends on which type of CAPTCHA we are solving. Each classifier trains on all the segments associated with that type of CAP TCHA, and for the purpose of building a single classifier, segments are labeled by either -1 (negative example) or +1 (positive example). Using cross-validation, we choose to use 50 iterations for our AdaBoost algorithm. A segment can then be classified as a particular letter, digit, or noise according to the ensemble classifier that outputs the number closest to 1. 4 .2 S u p p o rt v e ct o r m a c h i n e To conduct digit recognition with SVM, we used the C++ implementations of libSVM [8] version 2.85 with C-SMV and RBF kernel. 
First, all feature values are scaled to the range of -1 to 1 as suggested by [8]. The scale parameters are stored so that test samples can be scaled accordingly. Then, a single multiclass classifier is created for each set of features using all the segments for a particular type of CAPTCHA. We use cross-validation and grid search to discover the optimal slack penalty (C=32) and kernel parameter (γ=0.011). 4 .3 k - n e a re st n e i g h b o r ( k - N N ) We use k-NN as our final method for classifying digits. For each type of CAP TCHA, five different classifiers are created by using all of the training data and the five sets of features associated with that particular type of CAP TCHA. Again we use cross-validation to discover the optimal parameter, in this case k=1. We use Euclidian distance as our distance metric. 5 Ass e s sm e n t of cu rre n t a ud i o CAPTCHAs Our method for solving CAP TCHAs iteratively extracts an audio segment from a CAP TCHA, inputs the segment to one of our digit or letter recognizers, and outputs the label for that segment. We continue this process until the maximum solution size is reached or there are no unlabeled segments left. Some of the CAPTCHAs we evaluated have solutions that vary in length. Our method ensures that we get solutions of varying length that are never longer than the maximum solution length. A segment to be classified is identified by taking the neighborhood of the highest energy peak of an as yet unlabeled part of the CAP TCHA. Once a prediction of the solution to the CAPTCHA is computed, it is compared to the true solution. Given that at least one of the audio CAP TCHAs allows users to make a mistake in one of the digits (e.g., reCAPTCHA), we compute the pass rate for each of the different types of CAPTCHAs with all of the following conditions: • The prediction matches the true solution exactly. • Inserting one digit into the prediction would make it match the solution exactly. • Replacing one digit in the prediction would make it match the solution exactly. • Removing one digit from the prediction would make it match the solution exactly. However, since we are only sure that these conditions apply to reCAPTCHA audio CAP TCHAs, we also calculate the percentage of exact solution matches in our results for each type of audio CAP TCHA. These results are described in the following subsections. 5 .1 Goog le Google audio CAP TCHAs consist of one speaker saying random digits 0-9, the phrase “once again,” followed by the exact same recorded sequence of digits originally presented. The background noise consists of human voices speaking backwards at varying volumes. A solution can range in length from five to eight words. We set our classifier to find the 12 loudest segments and classify these segments as digits or noise. Because the phrase “once again” marks the halfway point of the CAPTCHA, we preprocessed the audio to only serve this half of the CAP TCHA to our classifiers. It is important to note, however, that the classifiers were always able to identify the segment containing “once again,” and these segments were identified before all other segments. Therefore, if necessary, we could have had our system cut the file in half after first labeling this segment. For AdaBoost, we create 12 classifiers: one classifier for each digit, one for noise, and one for the phrase “once again.” Our results ( Table 1) show that at best we achieved a 90% pass rate using the “one mistake” passing conditions and a 66% exact solution match rate. 
Using SVM and the “one mistake” passing conditions, at best we achieve a 92% pass rate and a 67% exact solution match. For k-NN, the “one mistake” pass rate is 62% and the exact solution match rate is 26%. Table 1: Google audio CAP TCHA results: Maximum 67% accuracy was achieved by SVM. Classifiers Used AdaBoost SVM k-NN one mistake exact match one mistake exact match 88% 61% 92% 67% 30% 1% PLPSPEC 90% 66% 90% 67% 60% 26% PLPCEPS 90% 66% 92% 67% 62% 23% RAS TAPLPSPEC 88% 48% 90% 61% 29% 1% RAS TAPLPCEPS 5 .2 exact match MFCC Features Used One mistake 90% 63% 92% 67% 33% 2% Digg Digg CAP TCHAs also consist of one speaker, in this case saying a random combination of letters and digits. The background noise consists of static or what sounds like trickling water and is not continuous throughout the entire file. We noticed in our training data that the following characters were never present in a solution: 0, 1, 2, 5, 7, 9, i, o, z. Since the Digg audio CAPTCHA is also the verbal transcription of the visual CAP TCHA, we believe that these characters are excluded to avoid confusion between digits and letters that are similar in appearance. The solution length varies between three and six words. Using AdaBoost, we create 28 classifiers: one classifier for each digit or letter that appears in our training data and one classifier for noise. Perhaps because we had fewer segments to train with and there was a far higher proportion of noise segments, AdaBoost failed to produce any correct solutions. We believe that the overwhelming number of negative training examples versus the small number of positive training samples used to create each decision stump severely affected AdaBoost’s ability to classify audio segments correctly. A histogram of the training samples is provided in Figure 2 to illustrate the amount of training data available for each character. When using SVM, the best feature set passed with 96% using “one mistake” passing conditions and passed with 71% when matching the solution exactly. For k-NN, the best feature set produced a 90% “one mistake” pass rate and a 49% exact solution match. Full results can be found in Table 2. Table 2: Digg audio CAP TCHA results: Maximum 71% accuracy was achieved by SVM. Classifiers Used AdaBoost SVM k-NN exact match one mistake exact match one mistake exact match MFCC - - 96% 71% 89% 49% PLPSPEC - - 94% 65% 90% 47% PLPCEPS - - 96% 71% 64% 17% RAS TAPLPSPEC - - 17% 3% 67% 17% RAS TAPLPCEPS Features Used one mistake - - 96% 71% 82% 34% 1000 900 800 # of Segments 700 600 500 400 300 200 100 y x v w t u r s q p n m l j k h f g e d c b a 8 6 4 3 noise 0 Segment Label Figure 2: Digg CAP TCHA training data distribution. 5 .3 reC A P T C H A The older version of reCAPTCHA’s audio CAP TCHAs we tested consist of several speakers who speak random digits. The background noise consists of human voices speaking backwards at varying volumes. The solution is always eight digits long. For AdaBoost, we create 11 classifiers: one classifier for each digit and one classifier for noise. Because we know that the reCAPTCHA passing conditions are the “one mistake” passing conditions, SVM produces our best pass rate of 58%. Our best exact match rate is 45% ( Table 3). Table 3: reCAPTCHA audio CAP TCHA results: Maximum 45% accuracy was achieved by SVM. 
Classifiers Used AdaBoost SVM k-NN one mistake exact match one mistake exact match 18% 6% 56% 43% 22% 11% PLPSPEC 27% 10% 58% 39% 43% 25% PLPCEPS 23% 10% 56% 45% 29% 14% RAS TAPLPSPEC 9% 3% 36% 18% 24% 4% RAS TAPLPCEPS 6 exact match MFCC Features Used one mistake 9% 3% 46% 30% 32% 12% Prop ert i e s of w ea k ver s u s st ro n g CAPT CHAs From our results, we note that the easiest CAP TCHAs to break were from Digg. Google had the next strongest CAP TCHAs followed by the strongest from reCAPTCHA. Although the Digg CAP TCHAs have the largest vocabulary, giving us less training data per label, the same woman recorded them all. More importantly, the same type of noise is used throughout the entire CAPTCHA. The noise sounds like running water and static which sounds very different from the human voice and does not produce the same energy spikes needed to locate segments, therefore making segmentation quite easy. The CAP TCHAs from Google and reCAPTCHA used other human voices for background noise, making segmentation much more difficult. Although Google used a smaller vocabulary than Digg and also only used one speaker, Google’s background noise made the CAP TCHA more difficult to solve. After listening to a few of Google’s CAP TCHAs, we noticed that although the background noise consisted of human voices, the same background noise was repeated. reCAP TCHA had similar noise to Google, but they had a larger selection of noise thus making it harder to learn. reCAP TCHA also has the longest solution length making it more difficult to get perfectly correct. Finally, reCAPTCHA used many different speakers causing it to be the strongest CAP TCHA of the three we tested. In conclusion, an audio CAP TCHA that consists of a finite vocabulary and background noise should have multiple speakers and noise similar to the speakers. 7 Recomm e n d at i o n s f or creat i n g st ro n g e r aud i o CAPTCHAs Due to our success in solving audio CAP TCHAs, we have decided to start developing new audio CAP TCHAs that our methods, and machine learning methods in general, will be less likely to solve. From our experiments, we note that CAP TCHAs containing longer solutions and multiple speakers tend to be more difficult to solve. Also, because our methods depend on the amount of training data we have, having a large vocabulary would make it more difficult to collect enough training data. Already since obtaining these results, reCAPTCHA.net has updated their audio CAP TCHA to contain more distortions and a larger vocabulary: the digits 0 through 99. In designing a new audio CAP TCHA we are also concerned with the human pass rate. The current human pass rate for the reCAPTCHA audio CAP TCHAs is only 70%. To develop an audio CAP TCHA with an improved human pass rate, we plan to take advantage of the human mind’s ability to understand distorted audio through context clues. By listening to a phrase instead of to random isolated words, humans are better able to decipher distorted utterances because they are familiar with the phrase or can use contextual clues to decipher the distorted audio. Using this idea, the audio for our new audio CAP TCHA will be taken from old-time radio programs in which the poor quality of the audio makes transcription by ASR systems difficult. Users will be presented with an audio clip consisting of a 4-6 word phrase. Half of the CAPTCHA consists of words, which validate a user to be human, while the other half of the words need to be transcribed. 
This is the same idea behind the visual reCAP TCHA that is currently digitizing text on which OCR fails. We expect that this new audio CAP TCHA will be more secure than the current version and easier for humans to pass. Initial experiments using this idea show this to be true [9]. 8 Co n c l u s i o n We have succeeded in “breaking” three different types of widely used audio CAP TCHAs, even though these were developed with the purpose of defeating attacks by machine learning techniques. We believe our results can be improved by selecting optimal segment sizes, but that is unnecessary given our already high success rate. For our experiments, segment sizes were not chosen in a special way; occasionally yielding results in which a segment only contained half of a word, causing our prediction to contain that particular word twice. We also believe that the AdaBoost results can be improved, particularly for the Digg audio CAP TCHAs, by ensuring that the number of negative training samples is closer to the number of positive training samples. We have shown that our approach is successful and can be used with many different audio CAP TCHAs that contain small finite vocabularies. A ck n o w l e d g m e n t s This work was partially supported by generous gifts from the Heinz Endowment, by an equipment grant from Intel Corporation, and by the Army Research Office through grant number DAAD19-02-1-0389 to CyLab at Carnegie Mellon University. Luis von Ahn was partially supported by a Microsoft Research New Faculty Fellowship and a MacArthur Fellowship. Jennifer Tam was partially supported by a Google Anita Borg Scholarship. R e f e re n c e s [1] L. von Ahn, M. Blum, and J. Langford. “Telling Humans and Computers Apart Automatically,” Communication s of the ACM, vol. 47, no. 2, pp. 57-60, Feb. 2004. [2] G. Mori and J. Malik. “Recognizing Objects in Adversarial Clutter: Breaking a Visual CAPTCHA,” In Computer Vision and Pattern Recognition CVPR'03, June 2003. [3] K. Chellapilla, and P. Simard, “ U sing Machine Learning to Break Visual Human Interactio n Proofs (HIP s),” Advances in Neural Information P rocessing Systems 17, Neural Info rmatio n P rocessing Systems (NIPS'2004), MIT Press. [4] H. Hermansk y, “ Perceptual Linear Predictive (PL P) Analysis of Speech,” J. Acoust. Soc. Am., vol. 87, no. 4, pp. 1738-1752, Apr. 1990. [5] H. Hermansk y, N. Morgan, A. Bayya, and P. Kohn. “RASTA-PL P Speech Analysi s Technique,” In P roc. IEEE Int’l Conf. Acoustics, Speech & Signal Processing, vol. 1, pp. 121124, San Francisco, 1992. [6] R. Santamarta. “Breaking Gmail ’s Audio Captcha,” http://blog.wintercore.com/?p=11, 2008. [7] D. Ell is. “ P L P and RASTA (and MFCC, and inversion) in Matlab using melfcc.m and invmelfcc.m,” http:/ /ww w.ee.columbia.edu/~dpwe/resources/matlab/rastamat/, 2006. [8] C. Chang and C. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http: //ww w.csie.ntu.edu.tw/~cjlin/libsvm [9] A. Schlaikjer. “ A Dual-Use Speech CA PTCHA: Aiding Visually Impaired Web Users while Providing Transcriptions of Audio Streams,” Technical Report CMU-LTI-07-014, Carnegie Mellon Universi t y. November 2007.
2 0.47693554 101 nips-2008-Human Active Learning
Author: Rui M. Castro, Charles Kalish, Robert Nowak, Ruichen Qian, Tim Rogers, Xiaojin Zhu
Abstract: We investigate a topic at the interface of machine learning and cognitive science. Human active learning, where learners can actively query the world for information, is contrasted with passive learning from random examples. Furthermore, we compare human active learning performance with predictions from statistical learning theory. We conduct a series of human category learning experiments inspired by a machine learning task for which active and passive learning error bounds are well understood, and dramatically distinct. Our results indicate that humans are capable of actively selecting informative queries, and in doing so learn better and faster than if they are given random training data, as predicted by learning theory. However, the improvement over passive learning is not as dramatic as that achieved by machine active learning algorithms. To the best of our knowledge, this is the first quantitative study comparing human category learning in active versus passive settings. 1
3 0.46497393 111 nips-2008-Kernel Change-point Analysis
Author: Zaïd Harchaoui, Eric Moulines, Francis R. Bach
Abstract: We introduce a kernel-based method for change-point analysis within a sequence of temporal observations. Change-point analysis of an unlabelled sample of observations consists in, first, testing whether a change in the distribution occurs within the sample, and second, if a change occurs, estimating the change-point instant after which the distribution of the observations switches from one distribution to another different distribution. We propose a test statistic based upon the maximum kernel Fisher discriminant ratio as a measure of homogeneity between segments. We derive its limiting distribution under the null hypothesis (no change occurs), and establish the consistency under the alternative hypothesis (a change occurs). This allows to build a statistical hypothesis testing procedure for testing the presence of a change-point, with a prescribed false-alarm probability and detection probability tending to one in the large-sample setting. If a change actually occurs, the test statistic also yields an estimator of the change-point location. Promising experimental results in temporal segmentation of mental tasks from BCI data and pop song indexation are presented. 1
4 0.45690748 169 nips-2008-Online Models for Content Optimization
Author: Deepak Agarwal, Bee-chung Chen, Pradheep Elango, Nitin Motgi, Seung-taek Park, Raghu Ramakrishnan, Scott Roy, Joe Zachariah
Abstract: We describe a new content publishing system that selects articles to serve to a user, choosing from an editorially programmed pool that is frequently refreshed. It is now deployed on a major Yahoo! portal, and selects articles to serve to hundreds of millions of user visits per day, significantly increasing the number of user clicks over the original manual approach, in which editors periodically selected articles to display. Some of the challenges we face include a dynamic content pool, short article lifetimes, non-stationary click-through rates, and extremely high traffic volumes. The fundamental problem we must solve is to quickly identify which items are popular (perhaps within different user segments), and to exploit them while they remain current. We must also explore the underlying pool constantly to identify promising alternatives, quickly discarding poor performers. Our approach is based on tracking per article performance in near real time through online models. We describe the characteristics and constraints of our application setting, discuss our design choices, and show the importance and effectiveness of coupling online models with a randomization procedure. We discuss the challenges encountered in a production online content-publishing environment and highlight issues that deserve careful attention. Our analysis of this application also suggests a number of future research avenues. 1
5 0.43159333 142 nips-2008-Multi-Level Active Prediction of Useful Image Annotations for Recognition
Author: Sudheendra Vijayanarasimhan, Kristen Grauman
Abstract: We introduce a framework for actively learning visual categories from a mixture of weakly and strongly labeled image examples. We propose to allow the categorylearner to strategically choose what annotations it receives—based on both the expected reduction in uncertainty as well as the relative costs of obtaining each annotation. We construct a multiple-instance discriminative classifier based on the initial training data. Then all remaining unlabeled and weakly labeled examples are surveyed to actively determine which annotation ought to be requested next. After each request, the current classifier is incrementally updated. Unlike previous work, our approach accounts for the fact that the optimal use of manual annotation may call for a combination of labels at multiple levels of granularity (e.g., a full segmentation on some images and a present/absent flag on others). As a result, it is possible to learn more accurate category models with a lower total expenditure of manual annotation effort. 1
6 0.40350401 147 nips-2008-Multiscale Random Fields with Application to Contour Grouping
7 0.40028697 36 nips-2008-Beyond Novelty Detection: Incongruent Events, when General and Specific Classifiers Disagree
8 0.38163489 67 nips-2008-Effects of Stimulus Type and of Error-Correcting Code Design on BCI Speller Performance
9 0.37831381 246 nips-2008-Unsupervised Learning of Visual Sense Models for Polysemous Words
10 0.37455636 219 nips-2008-Spectral Hashing
11 0.36526227 186 nips-2008-Probabilistic detection of short events, with application to critical care monitoring
12 0.3608698 250 nips-2008-Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning
13 0.36063203 241 nips-2008-Transfer Learning by Distribution Matching for Targeted Advertising
14 0.35757905 208 nips-2008-Shared Segmentation of Natural Scenes Using Dependent Pitman-Yor Processes
15 0.35057038 19 nips-2008-An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis
16 0.34443587 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost
17 0.34241387 4 nips-2008-A Scalable Hierarchical Distributed Language Model
18 0.33385274 218 nips-2008-Spectral Clustering with Perturbed Data
19 0.33322936 173 nips-2008-Optimization on a Budget: A Reinforcement Learning Approach
20 0.33306986 91 nips-2008-Generative and Discriminative Learning with Unknown Labeling Bias
topicId topicWeight
[(6, 0.039), (7, 0.051), (12, 0.043), (20, 0.431), (28, 0.076), (32, 0.016), (57, 0.036), (63, 0.031), (71, 0.023), (77, 0.046), (83, 0.052), (91, 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.76665199 41 nips-2008-Breaking Audio CAPTCHAs
Author: Jennifer Tam, Jiri Simsa, Sean Hyde, Luis V. Ahn
Abstract: CAP TCHAs are computer-generated tests that humans can pass but current computer systems cannot. CAP TCHAs provide a method for automatically distinguishing a human from a computer program, and therefore can protect Web services from abuse by so-called “bots.” Most CAP TCHAs consist of distorted images, usually text, for which a user must provide some description. Unfortunately, visual CAP TCHAs limit access to the millions of visually impaired people using the Web. Audio CAP TCHAs were created to solve this accessibility issue; however, the security of audio CAP TCHAs was never formally tested. Some visual CAP TCHAs have been broken using machine learning techniques, and we propose using similar ideas to test the security of audio CAP TCHAs. Audio CAP TCHAs are generally composed of a set of words to be identified, layered on top of noise. We analyzed the security of current audio CAP TCHAs from popular Web sites by using AdaBoost, SVM, and k-NN, and achieved correct solutions for test samples with accuracy up to 71%. Such accuracy is enough to consider these CAPTCHAs broken. Training several different machine learning algorithms on different types of audio CAP TCHAs allowed us to analyze the strengths and weaknesses of the algorithms so that we could suggest a design for a more robust audio CAPTCHA. 1 Int rod uct i o n CAP TCHAs [1] are automated tests designed to tell computers and humans apart by presenting users with a problem that humans can solve but current computer programs cannot. Because CAPTCHAs can distinguish between humans and computers with high probability, they are used for many different security applications: they prevent bots from voting continuously in online polls, automatically registering for millions of spam email accounts, automatically purchasing tickets to buy out an event, etc. Once a CAP TCHA is broken (i.e., computer programs can successfully pass the test), bots can impersonate humans and gain access to services that they should not. Therefore, it is important for CAP TCHAs to be secure. To pass the typical visual CAP TCHA, a user must correctly type the characters displayed in an image of distorted text. Many visual CAP TCHAs have been broken with machine learning techniques [2]-[3], though some remain secure against such attacks. Because visually impaired users who surf the Web using screen-reading programs cannot see this type of CAPTCHA, audio CAP TCHAs were created. Typical audio CAP TCHAs consist of one or several speakers saying letters or digits at randomly spaced intervals. A user must correctly identify the digits or characters spoken in the audio file to pass the CAP TCHA. To make this test difficult for current computer systems, specifically automatic speech recognition (ASR) programs, background noise is injected into the audio files. Since no official evaluation of existing audio CAP TCHAs has been reported, we tested the security of audio CAP TCHAs used by many popular Web sites by running machine learning experiments designed to break them. In the next section, we provide an overview of the literature related to our project. Section 3 describes our methods for creating training data, and section 4 describes how we create classifiers that can recognize letters, digits, and noise. In section 5, we discuss how we evaluated our methods on widely used audio CAP TCHAs and we give our results. In particular, we show that the audio CAP TCHAs used by sites such as Google and Digg are susceptible to machine learning attacks. 
Section 6 mentions the proposed design of a new more secure audio CAP TCHA based on our findings. 2 Lit erat u r e r ev i ew To break the audio CAP TCHAs, we derive features from the CAP TCHA audio and use several machine learning techniques to perform ASR on segments of the CAPTCHA. There are many popular techniques for extracting features from speech. The three techniques we use are mel-frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP), and relative spectral transform-PLP (RAS TA-PLP). MFCC is one of the most popular speech feature representations used. Similar to a fast Fourier transform (FF T), MFCC transforms an audio file into frequency bands, but (unlike FF T) MFCC uses mel-frequency bands, which are better for approximating the range of frequencies humans hear. PLP was designed to extract speaker-independent features from speech [4]. Therefore, by using PLP and a variant such as RAS TA-PL P, we were able to train our classifiers to recognize letters and digits independently of who spoke them. Since many different people recorded the digits used in one of the types of audio CAP TCHAs we tested, PLP and RAS TA-PLP were needed to extract the features that were most useful for solving them. In [4]-[5], the authors conducted experiments on recognizing isolated digits in the presence of noise using both PLP and RAS TA-PL P. However, the noise used consisted of telephone or microphone static caused by recording in different locations. The audio CAP TCHAs we use contain this type of noise, as well as added vocal noise and/or music, which is supposed to make the automated recognition process much harder. The authors of [3] emphasize how many visual CAP TCHAs can be broken by successfully splitting the task into two smaller tasks: segmentation and recognition. We follow a similar approach in that we first automatically split the audio into segments, and then we classify these segments as noise or words. In early March 2008, concurrent to our work, the blog of Wintercore Labs [6] claimed to have successfully broken the Google audio CAP TCHA. After reading their Web article and viewing the video of how they solve the CAP TCHAs, we are unconvinced that the process is entirely automatic, and it is unclear what their exact pass rate is. Because we are unable to find any formal technical analysis of this program, we can neither be sure of its accuracy nor the extent of its automation. 3 Cr e at i o n of tra i n i n g dat a Since automated programs can attempt to pass a CAPTCHA repeatedly, a CAPTCHA is essentially broken when a program can pass it more than a non-trivial fraction of the time; e.g., a 5% pass rate is enough. Our approach to breaking the audio CAP TCHAs began by first splitting the audio files into segments of noise or words: for our experiments, the words were spoken letters or digits. We used manual transcriptions of the audio CAP TCHAs to get information regarding the location of each spoken word within the audio file. We were able to label our segments accurately by using this information. We gathered 1,000 audio CAP TCHAs from each of the following Web sites: google.com, digg.com, and an older version of the audio CAP TCHA in recaptcha.net. Each of the CAP TCHAs was annotated with the information regarding letter/digit locations provided by the manual transcriptions. For each type of CAPTCHA, we randomly selected 900 samples for training and used the remaining 100 for testing. 
Using the digit/letter location information provided in the manual CAP TCHA transcriptions, each training CAP TCHA is divided into segments of noise, the letters a-z, or the digits 0-9, and labeled as such. We ignore the annotation information of the CAP TCHAs we use for testing, and therefore we cannot identify the size of those segments. Instead, each test CAP TCHA is divided into a number of fixed-size segments. The segments with the highest energy peaks are then classified using machine learning techniques (Figure 1). Since the size of a feature vector extracted from a segment generally depends on the size of the segment, using fixed-size segments allows each segment to be described with a feature vector of the same length. We chose the window size by listening to a few training segments and adjusted accordingly to ensure that the segment contained the entire digit/letter. There is undoubtedly a more optimal way of selecting the window size, however, we were still able to break the three CAP TCHAs we tested with our method. Figure 1: A test audio CAP TCHA with the fixed-size segments containing the highest energy peaks highlighted. The information provided in the manual transcriptions of the audio CAP TCHAs contains a list of the time intervals within which words are spoken. However, these intervals are of variable size and the word might be spoken anywhere within this interval. To provide fixedsize segments for training, we developed the following heuristic. First, divide each file into variable-size segments using the time intervals provided and label each segment accordingly. Then, within each segment, detect the highest energy peak and return its fixed-size neighborhood labeled with the current segment’s label. This heuristic achieved nearly perfect labeling accuracy for the training set. Rare mistakes occurred when the highest energy peak of a digit or letter segment corresponded to noise rather than to a digit or letter. To summarize this subsection, an audio file is transformed into a set of fixed-size segments labeled as noise, a digit between 0 and 9, or a letter between a and z. These segments are then used for training. Classifiers are trained for one type of CAPTCHA at a time. 4 C l a s s i f i e r con s t ru ct i o n From the training data we extracted five sets of features using twelve MFCCs and twelfth- order spectral (SPEC) and cepstral (CEPS) coefficients from PLP Matlab functions for extracting these features were provided online Voicebox package. We use AdaBoost, SVM, and k-NN algorithms digit and letter recognition. We detail our implementation of following subsections. 4 .1 and RAS TA-PL P. The at [7] and as part of the to implement automated each algorithm in the AdaBoost Using decision stumps as weak classifiers for AdaBoost, anywhere from 11 to 37 ensemble classifiers are built. The number of classifiers built depends on which type of CAPTCHA we are solving. Each classifier trains on all the segments associated with that type of CAP TCHA, and for the purpose of building a single classifier, segments are labeled by either -1 (negative example) or +1 (positive example). Using cross-validation, we choose to use 50 iterations for our AdaBoost algorithm. A segment can then be classified as a particular letter, digit, or noise according to the ensemble classifier that outputs the number closest to 1. 4 .2 S u p p o rt v e ct o r m a c h i n e To conduct digit recognition with SVM, we used the C++ implementations of libSVM [8] version 2.85 with C-SMV and RBF kernel. 
4.2 Support vector machine

To conduct digit and letter recognition with SVM, we used the C++ implementation of libSVM [8], version 2.85, with C-SVC and an RBF kernel. First, all feature values are scaled to the range of -1 to 1, as suggested by [8]; the scale parameters are stored so that test samples can be scaled accordingly. Then, a single multiclass classifier is created for each set of features using all the segments for a particular type of CAPTCHA. We use cross-validation and grid search to discover the optimal slack penalty (C=32) and kernel parameter (γ=0.011).

4.3 k-nearest neighbor (k-NN)

We use k-NN as our final method for classifying digits. For each type of CAPTCHA, five different classifiers are created using all of the training data and the five sets of features associated with that type of CAPTCHA. Again we use cross-validation to discover the optimal parameter, in this case k=1, and we use Euclidean distance as our distance metric.

5 Assessment of current audio CAPTCHAs

Our method for solving CAPTCHAs iteratively extracts an audio segment from a CAPTCHA, inputs the segment to one of our digit or letter recognizers, and outputs the label for that segment. We continue this process until the maximum solution size is reached or there are no unlabeled segments left. Some of the CAPTCHAs we evaluated have solutions that vary in length; our method produces solutions of varying length that are never longer than the maximum solution length. A segment to be classified is identified by taking the neighborhood of the highest energy peak of an as yet unlabeled part of the CAPTCHA.

Once a prediction of the solution to a CAPTCHA is computed, it is compared to the true solution. Given that at least one of the audio CAPTCHAs allows users to make a mistake in one of the digits (e.g., reCAPTCHA), we compute the pass rate for each type of CAPTCHA under all of the following conditions:

• The prediction matches the true solution exactly.
• Inserting one digit into the prediction would make it match the solution exactly.
• Replacing one digit in the prediction would make it match the solution exactly.
• Removing one digit from the prediction would make it match the solution exactly.

However, since we are only sure that these conditions apply to the reCAPTCHA audio CAPTCHAs, we also report the percentage of exact solution matches for each type of audio CAPTCHA. These results are described in the following subsections.
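Taken together, the four "one mistake" conditions above amount to accepting any prediction within Levenshtein (edit) distance one of the true solution. A minimal sketch of that check, which is not the evaluation code used for the experiments, is:

```python
# "One mistake" passing condition: the prediction passes if it matches the true
# solution exactly or is within one insertion, deletion, or substitution of it.
def edit_distance(a, b):
    """Standard dynamic-programming Levenshtein distance."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]


def passes_one_mistake(prediction, solution):
    return edit_distance(prediction, solution) <= 1


# e.g. passes_one_mistake("2580147", "25801473") -> True (one digit missing)
```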
5.1 Google

Google audio CAPTCHAs consist of one speaker saying random digits 0-9, the phrase "once again," and then the exact same recorded sequence of digits presented a second time. The background noise consists of human voices speaking backwards at varying volumes. A solution can range in length from five to eight words. We set our classifier to find the 12 loudest segments and classify these segments as digits or noise. Because the phrase "once again" marks the halfway point of the CAPTCHA, we preprocessed the audio so that only one half of each CAPTCHA was served to our classifiers. It is worth noting, however, that the classifiers were always able to identify the segment containing "once again," and these segments were identified before all other segments; therefore, if necessary, we could have had our system cut the file in half after first labeling this segment.

For AdaBoost, we create 12 classifiers: one for each digit, one for noise, and one for the phrase "once again." Our results (Table 1) show that at best we achieved a 90% pass rate under the "one mistake" passing conditions and a 66% exact solution match rate. Using SVM and the "one mistake" passing conditions, at best we achieve a 92% pass rate and a 67% exact solution match rate. For k-NN, the best "one mistake" pass rate is 62% and the best exact solution match rate is 26%.

Table 1: Google audio CAPTCHA results, reported as "one mistake" pass rate / exact solution match rate for each classifier. The maximum exact-match accuracy of 67% was achieved by SVM.

Features Used    AdaBoost     SVM          k-NN
MFCC             88% / 61%    92% / 67%    30% / 1%
PLPSPEC          90% / 66%    90% / 67%    60% / 26%
PLPCEPS          90% / 66%    92% / 67%    62% / 23%
RASTAPLPSPEC     88% / 48%    90% / 61%    29% / 1%
RASTAPLPCEPS     90% / 63%    92% / 67%    33% / 2%

5.2 Digg

Digg CAPTCHAs also consist of one speaker, in this case saying a random combination of letters and digits. The background noise consists of static or what sounds like trickling water, and it is not continuous throughout the entire file. We noticed in our training data that the following characters were never present in a solution: 0, 1, 2, 5, 7, 9, i, o, z. Since the Digg audio CAPTCHA is a spoken version of the corresponding visual CAPTCHA, we believe these characters are excluded to avoid confusion between digits and letters that are similar in appearance. The solution length varies between three and six words.

Using AdaBoost, we create 28 classifiers: one for each digit or letter that appears in our training data and one for noise. Perhaps because we had fewer segments to train with and a far higher proportion of noise segments, AdaBoost failed to produce any correct solutions. We believe that the overwhelming number of negative training examples relative to the small number of positive training examples used to create each decision stump severely affected AdaBoost's ability to classify audio segments correctly. A histogram of the training samples is provided in Figure 2 to illustrate the amount of training data available for each character.

When using SVM, the best feature set passed 96% of the CAPTCHAs under the "one mistake" passing conditions and 71% when matching the solution exactly. For k-NN, the best feature set produced a 90% "one mistake" pass rate and a 49% exact solution match rate. Full results can be found in Table 2.

Table 2: Digg audio CAPTCHA results, reported as "one mistake" pass rate / exact solution match rate for each classifier. The maximum exact-match accuracy of 71% was achieved by SVM; AdaBoost produced no correct solutions.

Features Used    AdaBoost     SVM          k-NN
MFCC             - / -        96% / 71%    89% / 49%
PLPSPEC          - / -        94% / 65%    90% / 47%
PLPCEPS          - / -        96% / 71%    64% / 17%
RASTAPLPSPEC     - / -        17% / 3%     67% / 17%
RASTAPLPCEPS     - / -        96% / 71%    82% / 34%

Figure 2: Digg CAPTCHA training data distribution: the number of training segments (from 0 to 1,000) available for each segment label (noise, the digits 3, 4, 6, and 8, and the letters a-y excluding i, o, and z).

5.3 reCAPTCHA

The older version of reCAPTCHA's audio CAPTCHAs that we tested consists of several speakers who speak random digits. The background noise consists of human voices speaking backwards at varying volumes. The solution is always eight digits long. For AdaBoost, we create 11 classifiers: one for each digit and one for noise. Because the actual reCAPTCHA passing conditions are the "one mistake" passing conditions, those rates are the most relevant: SVM produces our best pass rate of 58%, and our best exact match rate is 45% (Table 3).
Table 3: reCAPTCHA audio CAPTCHA results, reported as "one mistake" pass rate / exact solution match rate for each classifier. The maximum exact-match accuracy of 45% was achieved by SVM.

Features Used    AdaBoost     SVM          k-NN
MFCC             18% / 6%     56% / 43%    22% / 11%
PLPSPEC          27% / 10%    58% / 39%    43% / 25%
PLPCEPS          23% / 10%    56% / 45%    29% / 14%
RASTAPLPSPEC     9% / 3%      36% / 18%    24% / 4%
RASTAPLPCEPS     9% / 3%      46% / 30%    32% / 12%

6 Properties of weak versus strong CAPTCHAs

From our results, we note that the easiest CAPTCHAs to break were from Digg, followed by Google's; reCAPTCHA's were the strongest. Although the Digg CAPTCHAs have the largest vocabulary, giving us less training data per label, the same woman recorded all of them. More importantly, the same type of noise is used throughout the entire CAPTCHA. That noise sounds like running water and static, which is very different from a human voice and does not produce the energy spikes needed to locate segments; this makes segmentation quite easy. The CAPTCHAs from Google and reCAPTCHA used other human voices as background noise, making segmentation much more difficult.

Although Google used a smaller vocabulary than Digg and also only one speaker, Google's background noise made the CAPTCHA more difficult to solve. After listening to a few of Google's CAPTCHAs, we noticed that although the background noise consisted of human voices, the same background noise was repeated. reCAPTCHA had noise similar to Google's, but drawn from a larger selection, which makes it harder to learn. reCAPTCHA also has the longest solution length, making it more difficult to get an exactly correct answer. Finally, reCAPTCHA used many different speakers, making it the strongest of the three CAPTCHAs we tested. In conclusion, an audio CAPTCHA that consists of a finite vocabulary and background noise should use multiple speakers and noise that is similar to the speakers.

7 Recommendations for creating stronger audio CAPTCHAs

Because of our success in solving audio CAPTCHAs, we have begun developing new audio CAPTCHAs that our methods, and machine learning methods in general, will be less likely to solve. From our experiments, we note that CAPTCHAs containing longer solutions and multiple speakers tend to be more difficult to solve. Also, because our methods depend on the amount of training data available, a large vocabulary makes it more difficult to collect enough training data. Since we obtained these results, recaptcha.net has already updated its audio CAPTCHA to contain more distortions and a larger vocabulary: the digits 0 through 99.

In designing a new audio CAPTCHA we are also concerned with the human pass rate; the current human pass rate for the reCAPTCHA audio CAPTCHAs is only 70%. To develop an audio CAPTCHA with a higher human pass rate, we plan to take advantage of the human mind's ability to understand distorted audio through context clues. By listening to a phrase instead of random isolated words, humans are better able to decipher distorted utterances because they are familiar with the phrase or can use contextual clues. Using this idea, the audio for our new CAPTCHA will be taken from old-time radio programs in which the poor quality of the audio makes transcription by ASR systems difficult. Users will be presented with an audio clip consisting of a 4-6 word phrase. Half of the words serve to validate the user as human, while the other half still need to be transcribed. This is the same idea behind the visual reCAPTCHA, which is currently digitizing text on which OCR fails. We expect this new audio CAPTCHA to be both more secure than the current version and easier for humans to pass; initial experiments using this idea support this expectation [9].
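The grading rule for such a dual-use response is not specified here; one plausible sketch, under the assumptions that the user's transcription of the known half is checked word by word with tolerance for a single miss and that the unknown half is harvested as a new transcription, is the following (all names and thresholds are hypothetical):

```python
# Hypothetical grading logic for the proposed phrase-based audio CAPTCHA:
# the user transcribes the whole 4-6 word phrase, the words with known ground
# truth validate the user, and the remaining words are collected as data.
def grade_response(user_words, known_words):
    """known_words: dict mapping word position -> correct word for the
    validated half; other positions have no ground truth yet."""
    correct = sum(1 for i, w in known_words.items()
                  if i < len(user_words) and user_words[i].lower() == w.lower())
    passed = correct >= max(1, len(known_words) - 1)  # tolerate one miss (assumption)
    harvested = {i: w for i, w in enumerate(user_words) if i not in known_words}
    return passed, harvested


# e.g. grade_response(["tune", "in", "next", "week"], {0: "tune", 1: "in"})
# -> (True, {2: "next", 3: "week"})
```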
8 Conclusion

We have succeeded in "breaking" three different types of widely used audio CAPTCHAs, even though these were developed with the purpose of defeating attacks by machine learning techniques. We believe our results could be improved by selecting segment sizes more carefully, but this is unnecessary given our already high success rate. For our experiments, segment sizes were not chosen in any special way, which occasionally produced segments containing only half of a word and caused our prediction to contain that word twice. We also believe that the AdaBoost results can be improved, particularly for the Digg audio CAPTCHAs, by making the number of negative training samples closer to the number of positive training samples. We have shown that our approach is successful and can be used with many different audio CAPTCHAs that contain small finite vocabularies.

Acknowledgments

This work was partially supported by generous gifts from the Heinz Endowment, by an equipment grant from Intel Corporation, and by the Army Research Office through grant number DAAD19-02-1-0389 to CyLab at Carnegie Mellon University. Luis von Ahn was partially supported by a Microsoft Research New Faculty Fellowship and a MacArthur Fellowship. Jennifer Tam was partially supported by a Google Anita Borg Scholarship.

References

[1] L. von Ahn, M. Blum, and J. Langford. "Telling Humans and Computers Apart Automatically," Communications of the ACM, vol. 47, no. 2, pp. 57-60, Feb. 2004.
[2] G. Mori and J. Malik. "Recognizing Objects in Adversarial Clutter: Breaking a Visual CAPTCHA," in Computer Vision and Pattern Recognition (CVPR'03), June 2003.
[3] K. Chellapilla and P. Simard. "Using Machine Learning to Break Visual Human Interaction Proofs (HIPs)," in Advances in Neural Information Processing Systems 17 (NIPS 2004), MIT Press.
[4] H. Hermansky. "Perceptual Linear Predictive (PLP) Analysis of Speech," J. Acoust. Soc. Am., vol. 87, no. 4, pp. 1738-1752, Apr. 1990.
[5] H. Hermansky, N. Morgan, A. Bayya, and P. Kohn. "RASTA-PLP Speech Analysis Technique," in Proc. IEEE Int'l Conf. Acoustics, Speech & Signal Processing, vol. 1, pp. 121-124, San Francisco, 1992.
[6] R. Santamarta. "Breaking Gmail's Audio Captcha," http://blog.wintercore.com/?p=11, 2008.
[7] D. Ellis. "PLP and RASTA (and MFCC, and inversion) in Matlab using melfcc.m and invmelfcc.m," http://www.ee.columbia.edu/~dpwe/resources/matlab/rastamat/, 2006.
[8] C. Chang and C. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
[9] A. Schlaikjer. "A Dual-Use Speech CAPTCHA: Aiding Visually Impaired Web Users while Providing Transcriptions of Audio Streams," Technical Report CMU-LTI-07-014, Carnegie Mellon University, Nov. 2007.