nips nips2008 nips2008-158 knowledge-graph by maker-knowledge-mining

158 nips-2008-Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks

Source: pdf

Author: Alex Graves, Jürgen Schmidhuber

Abstract: Ofﬂine handwriting recognition—the automatic transcription of images of handwritten text—is a challenging task that combines computer vision with sequence learning. In most systems the two elements are handled separately, with sophisticated preprocessing techniques used to extract the image features and sequential models such as HMMs used to provide the transcriptions. By combining two recent innovations in neural networks—multidimensional recurrent neural networks and connectionist temporal classiﬁcation—this paper introduces a globally trained ofﬂine handwriting recogniser that takes raw pixel data as input. Unlike competing systems, it does not require any alphabet speciﬁc preprocessing, and can therefore be used unchanged for any language. Evidence of its generality and power is provided by data from a recent international Arabic recognition competition, where it outperformed all entries (91.4% accuracy compared to 87.2% for the competition winner) despite the fact that neither author understands a word of Arabic. 1

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 ch Abstract Ofﬂine handwriting recognition—the automatic transcription of images of handwritten text—is a challenging task that combines computer vision with sequence learning. [sent-4, score-0.484]

2 In most systems the two elements are handled separately, with sophisticated preprocessing techniques used to extract the image features and sequential models such as HMMs used to provide the transcriptions. [sent-5, score-0.159]

3 By combining two recent innovations in neural networks—multidimensional recurrent neural networks and connectionist temporal classiﬁcation—this paper introduces a globally trained ofﬂine handwriting recogniser that takes raw pixel data as input. [sent-6, score-1.208]

4 Unlike competing systems, it does not require any alphabet speciﬁc preprocessing, and can therefore be used unchanged for any language. [sent-7, score-0.063]

5 Evidence of its generality and power is provided by data from a recent international Arabic recognition competition, where it outperformed all entries (91. [sent-8, score-0.107]

6 2% for the competition winner) despite the fact that neither author understands a word of Arabic. [sent-10, score-0.104]

7 1 Introduction Ofﬂine handwriting recognition is generally observed to be harder than online handwriting recognition [14]. [sent-11, score-0.998]

8 In the online case, features can be extracted from both the pen trajectory and the resulting image, whereas in the ofﬂine case only the image is available. [sent-12, score-0.24]

9 Nonetheless, the standard recognition process is essentially the same: a sequence of features are extracted from the data, then matched to a sequence of labels (usually characters or sub-character strokes) using either a hidden Markov model (HMM) [9] or an HMM-neural network hybrid [10]. [sent-13, score-0.496]

10 The main drawback of this approach is that the input features must meet the stringent independence assumptions imposed by HMMs (these assumptions are somewhat relaxed in the case of hybrid systems, but long-range input dependencies are still problematic). [sent-14, score-0.281]

11 In practice this means the features must be redesigned for every alphabet, and, to a lesser extent, for every language. [sent-15, score-0.166]

12 For example it would be impossible to use the same system to recognise both English and Arabic. [sent-16, score-0.087]

13 Following our recent success in transcribing raw online handwriting data with recurrent networks [6], we wanted to build an ofﬂine recognition system that would work on raw pixels. [sent-17, score-1.255]

14 As well as being alphabet-independent, such a system would have the advantage of being globally trainable, with the image features optimised along with the classiﬁer. [sent-18, score-0.314]

15 The online case was relatively straightforward, since the input data formed a 1D sequence that could be fed directly to a recurrent network. [sent-19, score-0.583]

16 The long short-term memory (LSTM) network architecture [8, 3] was chosen for its ability to access long-range context, and the connectionist temporal classiﬁcation [5] output layer allowed the network to transcribe the data with no prior segmentation. [sent-20, score-0.541]

17 The ofﬂine case, however, is more challenging, since the input is no longer one-dimensional. [sent-21, score-0.054]

18 A naive approach would be to present the images to the network one vertical line at a time, thereby transforming them into 1D sequences. [sent-22, score-0.172]

19 However such a system would be unable to handle distor1 Figure 1: Two dimensional MDRNN. [sent-23, score-0.079]

20 The thick lines show connections to the current point (i, j). [sent-24, score-0.172]

21 The connections within the hidden layer plane are recurrent. [sent-25, score-0.434]

22 The dashed lines show the scanning strips along which previous points were visited, starting at the top left corner. [sent-26, score-0.436]

23 tions along the vertical axis; for example the same image shifted up by one pixel would appear completely different. [sent-27, score-0.261]

24 A more ﬂexible solution is offered by multidimensional recurrent neural networks (MDRNNs) [7]. [sent-28, score-0.807]

25 MDRNNs, which are a special case of directed acyclic graph networks [1], generalise standard RNNs by providing recurrent connections along all spatio-temporal dimensions present in the data. [sent-29, score-0.848]

26 These connections make MDRNNs robust to local distortions along any combination of input dimensions (e. [sent-30, score-0.361]

27 image rotations and shears, which mix vertical and horizontal displacements) and allow them to model multidimensional context in a ﬂexible way. [sent-32, score-0.398]

28 We use multidimensional LSTM because it is able to access long-range context. [sent-33, score-0.298]

29 The problem remains, though, of how to transform two-dimensional images into one-dimensional label sequences. [sent-34, score-0.051]

30 Our solution is to pass the data through a hierarchy of MDRNN layers, with blocks of activations gathered together after each level. [sent-35, score-0.273]

31 The heights of the blocks are chosen to incrementally collapse the 2D images onto 1D sequences, which can then be labelled by the output layer. [sent-36, score-0.212]

32 Such hierarchical structures are common in computer vision [15], because they allow complex features to be built up in stages. [sent-37, score-0.049]

33 In particular our multilayered structure is similar to that used by convolution networks [11], although it should be noted that because convolution networks are not recurrent, they cannot be used for cursive handwriting recognition without presegmented inputs. [sent-38, score-0.867]

34 2 Method The three components of our recognition system are: (1) multidimensional recurrent neural networks, and multidimensional LSTM in particular; (2) the connectionist temporal classiﬁcation output layer; and (3) the hierarchical structure. [sent-40, score-1.192]

35 1 Multidimensional Recurrent Neural Networks The basic idea of multidimensional recurrent neural networks (MDRNNs) [7] is to replace the single recurrent connection found in standard recurrent networks with as many connections as there are spatio-temporal dimensions in the data. [sent-43, score-1.951]

36 These connections allow the network to create a ﬂexible internal representation of surrounding context, which is robust to localised distortions. [sent-44, score-0.302]

37 An MDRNN hidden layer scans through the input in 1D strips, storing its activations in a buffer. [sent-45, score-0.502]

38 The strips are ordered in such a way that at every point the layer has already visited the points one step back along every dimension. [sent-46, score-0.518]

39 The hidden activations at these previous points are fed to the current point through recurrent connections, along with the input. [sent-47, score-0.733]

40 One such layer is sufﬁcient to give the network access to all context against the direction of scanning from the current point (e. [sent-50, score-0.528]

41 However we usually want surrounding context in all directions. [sent-54, score-0.139]

42 The canonical 1D solution is bidi2 rectional recurrent networks [16], where two separate hidden layers scan through the input forwards and backwards. [sent-56, score-0.878]

43 The generalisation of bidirectional networks to n dimensions requires 2n hidden layers, starting in every corner of the n dimensional hypercube and scanning in opposite directions. [sent-57, score-0.718]

44 For example, a 2D network has four layers, one starting in the top left and scanning down and right, one starting in the bottom left and scanning up and right, etc. [sent-58, score-0.532]

45 All the hidden layers are connected to a single output layer, which therefore receives information about all surrounding context. [sent-59, score-0.393]

46 The error gradient of an MDRNN can be calculated with an n-dimensional extension of backpropagation through time. [sent-60, score-0.034]

47 As in the 1D case, the data is processed in the reverse order of the forward pass, with each hidden layer receiving both the output derivatives and its own n ‘future’ derivatives at every timestep. [sent-61, score-0.529]

48 Let ap and bp be respectively the input and activation of unit j at point p = (p1 , . [sent-62, score-0.325]

49 , pn ) in an nj j dimensional input sequence x with dimensions (D1 , . [sent-65, score-0.278]

50 Let wij and wij be respectively the weight of the feedforward d connection from unit i to unit j and the recurrent connection from i to j along dimension d. [sent-82, score-0.805]

51 Let θh be the activation function of hidden unit h, and for some unit j and some differentiable objective p ∂O function O let δj = ∂ap . [sent-83, score-0.272]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('recurrent', 0.391), ('handwriting', 0.366), ('multidimensional', 0.24), ('mdrnn', 0.209), ('mdrnns', 0.209), ('scanning', 0.185), ('layer', 0.178), ('ine', 0.174), ('lstm', 0.157), ('pd', 0.148), ('networks', 0.139), ('connections', 0.138), ('strips', 0.137), ('ap', 0.136), ('layers', 0.13), ('hidden', 0.118), ('recognition', 0.107), ('activations', 0.106), ('graves', 0.105), ('whh', 0.105), ('surrounding', 0.098), ('pass', 0.094), ('munich', 0.091), ('connectionist', 0.084), ('pn', 0.084), ('dimensions', 0.068), ('along', 0.066), ('network', 0.066), ('forward', 0.065), ('hmms', 0.065), ('backward', 0.063), ('alphabet', 0.063), ('tu', 0.063), ('image', 0.062), ('raw', 0.06), ('access', 0.058), ('convolution', 0.058), ('competition', 0.058), ('vertical', 0.055), ('visited', 0.055), ('connection', 0.054), ('input', 0.054), ('hybrid', 0.053), ('unit', 0.053), ('units', 0.053), ('fed', 0.052), ('exible', 0.052), ('online', 0.052), ('images', 0.051), ('globally', 0.05), ('wij', 0.05), ('features', 0.049), ('preprocessing', 0.048), ('activation', 0.048), ('starting', 0.048), ('output', 0.047), ('forwards', 0.046), ('optimised', 0.046), ('bx', 0.046), ('generalise', 0.046), ('arabic', 0.046), ('recognise', 0.046), ('scans', 0.046), ('jurgen', 0.046), ('understands', 0.046), ('germany', 0.044), ('temporal', 0.042), ('bidirectional', 0.042), ('trainable', 0.042), ('pen', 0.042), ('pixel', 0.041), ('every', 0.041), ('system', 0.041), ('context', 0.041), ('derivatives', 0.04), ('blocks', 0.04), ('wanted', 0.039), ('heights', 0.039), ('generalisation', 0.039), ('dimensional', 0.038), ('stringent', 0.037), ('tions', 0.037), ('offered', 0.037), ('bh', 0.037), ('lesser', 0.035), ('innovations', 0.035), ('winner', 0.035), ('distortions', 0.035), ('switzerland', 0.035), ('collapse', 0.035), ('extracted', 0.035), ('sequence', 0.034), ('backpropagation', 0.034), ('meet', 0.034), ('bp', 0.034), ('feedforward', 0.034), ('thick', 0.034), ('challenging', 0.033), ('gathered', 0.033), ('dn', 0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 158 nips-2008-Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks

Author: Alex Graves, Jürgen Schmidhuber

2 0.14543581 58 nips-2008-Dependence of Orientation Tuning on Recurrent Excitation and Inhibition in a Network Model of V1

Author: Klaus Wimmer, Marcel Stimberg, Robert Martin, Lars Schwabe, Jorge Mariño, James Schummers, David C. Lyon, Mriganka Sur, Klaus Obermayer

Abstract: The computational role of the local recurrent network in primary visual cortex is still a matter of debate. To address this issue, we analyze intracellular recording data of cat V1, which combine measuring the tuning of a range of neuronal properties with a precise localization of the recording sites in the orientation preference map. For the analysis, we consider a network model of Hodgkin-Huxley type neurons arranged according to a biologically plausible two-dimensional topographic orientation preference map. We then systematically vary the strength of the recurrent excitation and inhibition relative to the strength of the afferent input. Each parametrization gives rise to a different model instance for which the tuning of model neurons at different locations of the orientation map is compared to the experimentally measured orientation tuning of membrane potential, spike output, excitatory, and inhibitory conductances. A quantitative analysis shows that the data provides strong evidence for a network model in which the afferent input is dominated by strong, balanced contributions of recurrent excitation and inhibition. This recurrent regime is close to a regime of “instability”, where strong, self-sustained activity of the network occurs. The ﬁring rate of neurons in the best-ﬁtting network is particularly sensitive to small modulations of model parameters, which could be one of the functional beneﬁts of a network operating in this particular regime. 1

3 0.12366334 118 nips-2008-Learning Transformational Invariants from Natural Movies

Author: Charles Cadieu, Bruno A. Olshausen

Abstract: We describe a hierarchical, probabilistic model that learns to extract complex motion from movies of the natural environment. The model consists of two hidden layers: the ﬁrst layer produces a sparse representation of the image that is expressed in terms of local amplitude and phase variables. The second layer learns the higher-order structure among the time-varying phase variables. After training on natural movies, the top layer units discover the structure of phase-shifts within the ﬁrst layer. We show that the top layer units encode transformational invariants: they are selective for the speed and direction of a moving pattern, but are invariant to its spatial structure (orientation/spatial-frequency). The diversity of units in both the intermediate and top layers of the model provides a set of testable predictions for representations that might be found in V1 and MT. In addition, the model demonstrates how feedback from higher levels can inﬂuence representations at lower levels as a by-product of inference in a graphical model. 1

4 0.12341802 148 nips-2008-Natural Image Denoising with Convolutional Networks

Author: Viren Jain, Sebastian Seung

Abstract: We present an approach to low-level vision that combines two main ideas: the use of convolutional networks as an image processing architecture and an unsupervised learning procedure that synthesizes training samples from speciﬁc noise models. We demonstrate this approach on the challenging problem of natural image denoising. Using a test set with a hundred natural images, we ﬁnd that convolutional networks provide comparable and in some cases superior performance to state of the art wavelet and Markov random ﬁeld (MRF) methods. Moreover, we ﬁnd that a convolutional network offers similar performance in the blind denoising setting as compared to other techniques in the non-blind setting. We also show how convolutional networks are mathematically related to MRF approaches by presenting a mean ﬁeld theory for an MRF specially designed for image denoising. Although these approaches are related, convolutional networks avoid computational difﬁculties in MRF approaches that arise from probabilistic learning and inference. This makes it possible to learn image processing architectures that have a high degree of representational power (we train models with over 15,000 parameters), but whose computational expense is signiﬁcantly less than that associated with inference in MRF approaches with even hundreds of parameters. 1 Background Low-level image processing tasks include edge detection, interpolation, and deconvolution. These tasks are useful both in themselves, and as a front-end for high-level visual tasks like object recognition. This paper focuses on the task of denoising, deﬁned as the recovery of an underlying image from an observation that has been subjected to Gaussian noise. One approach to image denoising is to transform an image from pixel intensities into another representation where statistical regularities are more easily captured. For example, the Gaussian scale mixture (GSM) model introduced by Portilla and colleagues is based on a multiscale wavelet decomposition that provides an effective description of local image statistics [1, 2]. Another approach is to try and capture statistical regularities of pixel intensities directly using Markov random ﬁelds (MRFs) to deﬁne a prior over the image space. Initial work used handdesigned settings of the parameters, but recently there has been increasing success in learning the parameters of such models from databases of natural images [3, 4, 5, 6, 7, 8]. Prior models can be used for tasks such as image denoising by augmenting the prior with a noise model. Alternatively, an MRF can be used to model the probability distribution of the clean image conditioned on the noisy image. This conditional random ﬁeld (CRF) approach is said to be discriminative, in contrast to the generative MRF approach. Several researchers have shown that the CRF approach can outperform generative learning on various image restoration and labeling tasks [9, 10]. CRFs have recently been applied to the problem of image denoising as well [5]. 1 The present work is most closely related to the CRF approach. Indeed, certain special cases of convolutional networks can be seen as performing maximum likelihood inference on a CRF [11]. The advantage of the convolutional network approach is that it avoids a general difﬁculty with applying MRF-based methods to image analysis: the computational expense associated with both parameter estimation and inference in probabilistic models. For example, naive methods of learning MRFbased models involve calculation of the partition function, a normalization factor that is generally intractable for realistic models and image dimensions. As a result, a great deal of research has been devoted to approximate MRF learning and inference techniques that meliorate computational difﬁculties, generally at the cost of either representational power or theoretical guarantees [12, 13]. Convolutional networks largely avoid these difﬁculties by posing the computational task within the statistical framework of regression rather than density estimation. Regression is a more tractable computation and therefore permits models with greater representational power than methods based on density estimation. This claim will be argued for with empirical results on the denoising problem, as well as mathematical connections between MRF and convolutional network approaches. 2 Convolutional Networks Convolutional networks have been extensively applied to visual object recognition using architectures that accept an image as input and, through alternating layers of convolution and subsampling, produce one or more output values that are thresholded to yield binary predictions regarding object identity [14, 15]. In contrast, we study networks that accept an image as input and produce an entire image as output. Previous work has used such architectures to produce images with binary targets in image restoration problems for specialized microscopy data [11, 16]. Here we show that similar architectures can also be used to produce images with the analog ﬂuctuations found in the intensity distributions of natural images. Network Dynamics and Architecture A convolutional network is an alternating sequence of linear ﬁltering and nonlinear transformation operations. The input and output layers include one or more images, while intermediate layers contain “hidden

5 0.077574082 56 nips-2008-Deep Learning with Kernel Regularization for Visual Recognition

Author: Kai Yu, Wei Xu, Yihong Gong

Abstract: In this paper we aim to train deep neural networks for rapid visual recognition. The task is highly challenging, largely due to the lack of a meaningful regularizer on the functions realized by the networks. We propose a novel regularization method that takes advantage of kernel methods, where an oracle kernel function represents prior knowledge about the recognition task of interest. We derive an efﬁcient algorithm using stochastic gradient descent, and demonstrate encouraging results on a wide range of recognition tasks, in terms of both accuracy and speed. 1

6 0.07576783 160 nips-2008-On Computational Power and the Order-Chaos Phase Transition in Reservoir Computing

7 0.071139589 153 nips-2008-Nonlinear causal discovery with additive noise models

8 0.065136544 103 nips-2008-Implicit Mixtures of Restricted Boltzmann Machines

9 0.062831521 191 nips-2008-Recursive Segmentation and Recognition Templates for 2D Parsing

10 0.060855739 77 nips-2008-Evaluating probabilities under high-dimensional latent variable models

11 0.060239572 240 nips-2008-Tracking Changing Stimuli in Continuous Attractor Neural Networks

12 0.058805332 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data

13 0.056811169 231 nips-2008-Temporal Dynamics of Cognitive Control

14 0.053338304 230 nips-2008-Temporal Difference Based Actor Critic Learning - Convergence and Neural Implementation

15 0.050597381 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models

16 0.049167994 246 nips-2008-Unsupervised Learning of Visual Sense Models for Polysemous Words

17 0.048425525 42 nips-2008-Cascaded Classification Models: Combining Models for Holistic Scene Understanding

18 0.047233988 62 nips-2008-Differentiable Sparse Coding

19 0.046926793 152 nips-2008-Non-stationary dynamic Bayesian networks

20 0.046870925 61 nips-2008-Diffeomorphic Dimensionality Reduction

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.14), (1, -0.023), (2, 0.102), (3, -0.029), (4, 0.013), (5, 0.0), (6, -0.054), (7, -0.038), (8, 0.062), (9, -0.014), (10, -0.009), (11, 0.105), (12, -0.032), (13, 0.042), (14, -0.025), (15, -0.193), (16, -0.032), (17, -0.109), (18, 0.011), (19, -0.14), (20, -0.139), (21, 0.064), (22, -0.01), (23, 0.013), (24, 0.078), (25, 0.053), (26, -0.073), (27, 0.036), (28, -0.0), (29, 0.016), (30, 0.019), (31, 0.085), (32, -0.115), (33, 0.091), (34, -0.001), (35, 0.044), (36, 0.033), (37, 0.049), (38, 0.195), (39, -0.022), (40, -0.066), (41, -0.006), (42, -0.032), (43, 0.094), (44, -0.073), (45, -0.004), (46, -0.017), (47, -0.037), (48, -0.088), (49, -0.091)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97007167 158 nips-2008-Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks

Author: Alex Graves, Jürgen Schmidhuber

2 0.71590513 148 nips-2008-Natural Image Denoising with Convolutional Networks

Author: Viren Jain, Sebastian Seung

3 0.67282742 160 nips-2008-On Computational Power and the Order-Chaos Phase Transition in Reservoir Computing

Author: Benjamin Schrauwen, Lars Buesing, Robert A. Legenstein

Abstract: Randomly connected recurrent neural circuits have proven to be very powerful models for online computations when a trained memoryless readout function is appended. Such Reservoir Computing (RC) systems are commonly used in two ﬂavors: with analog or binary (spiking) neurons in the recurrent circuits. Previous work showed a fundamental difference between these two incarnations of the RC idea. The performance of a RC system built from binary neurons seems to depend strongly on the network connectivity structure. In networks of analog neurons such dependency has not been observed. In this article we investigate this apparent dichotomy in terms of the in-degree of the circuit nodes. Our analyses based amongst others on the Lyapunov exponent reveal that the phase transition between ordered and chaotic network behavior of binary circuits qualitatively differs from the one in analog circuits. This explains the observed decreased computational performance of binary circuits of high node in-degree. Furthermore, a novel mean-ﬁeld predictor for computational performance is introduced and shown to accurately predict the numerically obtained results. 1

4 0.59588522 58 nips-2008-Dependence of Orientation Tuning on Recurrent Excitation and Inhibition in a Network Model of V1

Author: Klaus Wimmer, Marcel Stimberg, Robert Martin, Lars Schwabe, Jorge Mariño, James Schummers, David C. Lyon, Mriganka Sur, Klaus Obermayer

5 0.59381324 3 nips-2008-A Massively Parallel Digital Learning Processor

Author: Hans P. Graf, Srihari Cadambi, Venkata Jakkula, Murugan Sankaradass, Eric Cosatto, Srimat Chakradhar, Igor Dourdanovic

Abstract: We present a new, massively parallel architecture for accelerating machine learning algorithms, based on arrays of vector processing elements (VPEs) with variable-resolution arithmetic. Groups of VPEs operate in SIMD (single instruction multiple data) mode, and each group is connected to an independent memory bank. The memory bandwidth thus scales with the number of VPEs, while the main data flows are local, keeping power dissipation low. With 256 VPEs, implemented on two FPGAs (field programmable gate array) chips, we obtain a sustained speed of 19 GMACS (billion multiplyaccumulate per sec.) for SVM training, and 86 GMACS for SVM classification. This performance is more than an order of magnitude higher than that of any FPGA implementation reported so far. The speed on one FPGA is similar to the fastest speeds published on a Graphics Processor for the MNIST problem, despite a clock rate that is an order of magnitude lower. Tests with Convolutional Neural Networks show similar compute performances. This massively parallel architecture is particularly attractive for embedded applications, where low power dissipation is critical. 1 I n trod u cti on Machine learning demands higher and higher compute-performance, but serial processors are not improving that much anymore - at least not as quickly as they used to. Mainstream processor development is moving to multi-core systems, using shared memory technology to hide the parallel nature of the processors. But shared memory technology does not scale to hundreds or thousands of cores. In order to reach such levels of parallelization alternative approaches have to be developed. Massively parallel general-purpose computers had limited success so far, because of difficulties programming these machines, and they remain a niche market, mostly in highperformance computing. Yet processors specialized for certain application domains, such as graphics processors or routing processors 1, have been parallelized to several hundred cores and are successful mass products. They improve performance over general-purpose processors by focusing on a few key algorithmic elements, yet still maintain enough flexibility that they can be programmed for a variety of applications. We explore in this paper if a similar approach can lead to efficient machine learning processors. 1 e.g. Nvidia, Quadro FX 5600 graphics processor; Cisco, CRS-1 routing processor Several processors optimized for machine learning, in particular for neural networks, were developed during the 1980’s and 90’s. Examples are the Synapse-1 architecture [1], or the Connectionist Network Supercomputer, CNS1 [2]. Recently there has been less activity in this field, but some accelerators are sold today for specific applications, such as the Axeon [3] processor for power train control of cars. Beside digital processors a large number of analog circuits were built, emulating neural network structures. Extremely high performance with low power dissipation is achievable, see e.g. [4][5], but these networks have little flexibility. SVM implementations on FPGA have been demonstrated in recent years [6-8], yet reached only low compute-performances. All machine learning processors had only limited success so far, indicating how difficult it is to find a good combination of performance, flexibility, price and ease of use. An important consideration is that many applications of machine learning, such as video analysis, data mining, or personalization of services, show the most promise in embedded systems. Embedded learning requires high compute performance while dissipating little power, a combination that is difficult to achieve, and so far required application specific IC (ASIC). Our aim is to develop architectures that meet the requirements for embedded learning, but are programmable and therefore can be used in a wide range of applications. With the goal of analyzing different architectures we designed a development and testing environment where the parallel computation is mapped onto FPGA’s. Initially this system was intended only for experimentation, but its performance is so high that this platform is useful in its own right as accelerator for high-performance systems. While the experiments shown here emphasize high performance, the architecture has been designed from the start for low power dissipation. The main features for achieving this goal are: low-resolution arithmetic, keeping the main data flow local, low operating frequencies, and a modular design, so that unused parts can be powered down dynamically. All results shown here are from the test platform; migration to lowpower FPGA or chip designs are done in a later stage. 2 Al gori th ms - A ri th meti c - A rch i te ctu re For a substantial improvement over a general purpose processor, the algorithms, the arithmetic units, as well as the architecture have to be optimized simultaneously. This is not just an exercise in hardware design, but algorithms and their software implementations have to be developed concurrently. Most machine learning algorithms have not been developed with parallelization in mind. Therefore, we first need to find good parallel versions, identify their performance bottlenecks, and then extract common computational patterns that can be mapped into accelerator hardware. 2.1 Algorithms Characteristic for machine learning is that large amounts of data need to be processed, often with predictable data access patterns and no dependency between operations over large segments of the computation. This is why data-parallelization can often provide good accelerations on multi-core chips, clusters of machines, or even on loosely coupled networks of machines. Using MapReduce, speedups linear with the number of processors have been reported in [9] for several machine learning algorithms. Up to 16 cores were tested, and simulations indicate good scaling to more processors in some cases. Many algorithms, such as KNN, K-means clustering, LVQ, and Neural Networks can be reduced to forms where the computation is dominated by vector-matrix multiplications, which are easily parallelizable. For Convolutional Neural Networks (CNN) the data flow can be complex, yet the core of the computation is a convolution, an operation which has been studied extensively for parallel implementations. For Support Vector Machines (SVM), several parallel algorithms were described, but most saturate quickly for more than 16 processors. Scaling to larger numbers of processors has been demonstrated, applying MapReduce on a graphics processor with 128 cores [10]. Another implementation on a cluster of 48 dual-core machines (with 384 MMX units) [11] scales even super-linearly, and, according to simulations, scales to thousands of cores. Based on this analysis it is clear that vector-matrix and matrix-matrix multiplications for large vector dimensionalities and large numbers of vectors must be handled efficiently. Yet this alone is not sufficient since data access patterns vary greatly between algorithms. We analyze this here in more detail for SVM and CNN. These algorithms were chosen, because they are widely used for industrial applications and cover a broad range of computation, I/O, and memory requirements. The characteristics of the SVM training are summarized in Table 1. We use an approach similar to the one described in [11] to split different parts of the computation between a host CPU and the FPGA accelerator. For large dimensions d of the vectors the calculation of the columns of the kernel matrix dominates by far. This is needed to update the gradients, and in the present implementation, only this part is mapped onto the FPGA. If the dimensionality d is smaller than around 100, operations 2 and 5 can become bottlenecks and should also be mapped onto the accelerator. Challenging is that for each kernel computation a new data vector has to be loaded 4 into the processor, leading to very high I/O requirements. We consider here dimensions of 10 - 10 5 7 and numbers of training data of 10 - 10 , resulting easily in Gigabytes that need to be transferred to the processors at each iteration. 1 2 3 4 5 6 Operation Initialize all αx, Gx Do Find working set αi, αj Update αi, αj Get 2 columns of kernel matrix Update gradients Gx While not converged Computation 2n IO 2n Unit CPU I I I I I * 2n I*2 I * (2d+2dn) I*n CPU CPU FPGA CPU * * * * 2n 10 2nd n Table 1: Compute- and IO-requirements of each step for SVM training (SMO algorithm). n: number of training data; d: dimension of the vectors; G: gradients; α: support vector factors; I: number of iterations. The last column indicates whether the execution happens on the host CPU or the accelerator FPGA. It is assumed that the kernel computation requires a dot product between vectors (e.g. rbf, polynomial, tanh kernels). Neural network algorithms are essentially sequences of vector-matrix multiplications, but networks with special connectivity patterns, such as convolutional networks have very different IO characteristics than fully connected networks. Table 2 shows the computation and IO requirements for scanning several convolution kernels over one input plane. A full network requires multiple of these operations for one layer, with nonlinearities between layers. We map all operations onto the FPGA accelerator, since intermediate results are re-used right away. The most significant 2 difference to between the SVM and CNN is the Compute/IO ratio: SVM: ~ 1; CNN: ~ L*k > 100. Therefore the requirements for these two algorithms are very different, and handling both cases efficiently is quite a challenge for an architecture design. Operation Load L kernels For all input pixels Shift in new pixel Multiply kernels Shift out result 1 2 3 4 Computation IO 2 L* k n* m 2 n*m*L*k n*m Unit FPGA FPGA FPGA FPGA FPGA Table 2: Compute- and IO-requirements for CNN computation (forward pass), where l kernels of size k*k are scanned simultaneously over an input plane of size n*m. This is representative for implementations with kernel unrolling (kernel pixels processed in parallel). Internal shifts, computation of the non-linearity, and border effects not shown. 2.2 Arithmetic Hardware can be built much more compactly and runs with lower power dissipation, if it uses fixed-point instead of floating-point operations. Fortunately, many learning algorithms tolerate a low resolution in most of the computations. This has been investigated extensively for neural networks [12][13], but less so for other learning algorithms. Learning from data is inherently a noisy process, because we see only a sparse sampling of the true probability distributions. A different type of noise is introduced in gradient descent algorithms, when only a few training data are used at a time to move the optimization forward iteratively. This noise is particularly pronounced for stochastic gradient descent. There is no point in representing noisy variables with high resolution, and it is therefore a property inherent to many algorithms that low-resolution computation can be used. It is important, not to confuse this tolerance to low resolution with the resolution required to avoid numeric instabilities. Some of the computations have to be performed with a high resolution, in particular for variables that are updated incrementally. They maintain the state of the optimization and may change in very small steps. But usually by far the largest part of the computation can be executed at a low resolution. Key is that the hardware is flexible enough and can take advantage of reduced resolution while handling high resolution where necessary. Problem Adult Forest MNIST NORB Kernel: Float Obj. f. # SV 31,930.77 11,486 653,170.7 49,333 4,960.13 6,172 1,243.71 3,077 F-score 77.58 98.29 99.12 93.34 Kernel: 16 bit fixed point Obj. f. # SV F-score 31,930.1 11,490 77.63 652,758 49,299 98.28 4,959.64 6,166 99.11 1,244.76 3,154 93.26 F-sc. (4b in) NA NA 99.11 92.78 Table 3: Comparison of the results of SVM training when the kernels are represented with floating point numbers (32 or 64 bits) (left half) and with 16 bit fixed point (right half). The last column shows the results when the resolution of the training data is reduced from 8 bit to 4 bit. For NORB this reduces the accuracy; all other differences in accuracy are not significant. All are two class problems: Adult: n=32,562, d=122; Forest: n=522,000, d=54 (2 against the rest); MNIST: n=60,000, d=784 (odd–even); NORB: n=48,560, d=5,184. We developed a simulator that allows running the training algorithms with various resolutions in each of the variables. A few examples for SVM training are shown in Table 3. Reducing the resolution of the kernel values from double or float to 16 bit fixed point representations does not affect the accuracy for any of the problems. Therefore all the multiplications in the dot products for the kernel computation can be done in low resolutions (4–16 bit in the factors), but the accumulator needs sufficient resolution to avoid over/under flow (48 bit). Once the calculation of the kernel value is completed, it can be reduced to 16 bit. A low resolution of 16 bit is also tolerable for the α values, but a high resolution is required for the gradients (double). For Neural Networks, including CNN, several studies have confirmed that states and gradients can be kept at low resolutions (<16 bit), but the weights must be maintained at a high resolution (float) (see e.g. [12]). In our own evaluations 24 bits in the weights tend to be sufficient. Once the network is trained, for the classification low resolutions can be used for the weights as well (<16 bit). 2.3 A rc h i t e c t u re Figure 1: Left: Schematic of the architecture with the main data flows; on one FPGA 128 VPE are configured into four SIMD groups; L-S: Load-store units. Right: Picture of an FPGA board; in our experiments one or two of them are used, connected via PCI bus to a host CPU. Based on the analysis above, it is clear that the architecture must be optimized for processing massive amounts of data with relatively low precision. Most of the time, data access patterns are predictable and data are processed in blocks that can be stored contiguously. This type of computation is well suited for vector processing, and simple vector processing elements (VPE) with fixed-point arithmetic can handle the operations. Since typically large blocks of data are processed with the same operation, groups of VPE can work in SIMD (single instruction multiple data) mode. Algorithms must then be segmented to map the highvolume, low precision parts onto the vector accelerators and parts requiring high precision arithmetic onto the CPU. The most important design decision is the organization of the memory. Most memory accesses are done in large blocks, so that the data can be streamed, making complex caching unnecessary. This is fortunate, since the amounts of data to be loaded onto the processor are so large that conventional caching strategies would be overwhelmed anyway. Because the blocks tend to be large, a high data bandwidth is crucial, but latency for starting a block transfer is less critical. Therefore we can use regular DDR memories and still get high IO rates. This led to the design shown schematically in Figure 1, where independent memory banks are connected via separate IO ports for each group of 32 VPE. By connecting multiple of the units shown in Figure 1 to a CPU, this architecture scales to larger numbers of VPE. Parallel data IO and parallel memory access scale simultaneously with the number of parallel cores, and we therefore refer to this as the P3 (P-cube) architecture. Notice also that the main data flow is only local between a group of VPE and its own memory block. Avoiding movements of data over long distances is crucial for low power dissipation. How far this architecture can reasonably scale with one CPU depends on the algorithms, the amount of data and the vector dimensionality (see below). A few hundred VPE per CPU have provided good accelerations in all our tests, and much higher numbers are possible with multi-core CPUs and faster CPU-FPGA connections. 3 I mp l e men tati on of th e P 3 A rch i t ectu re This architecture fits surprisingly well onto some of the recent FPGA chips that are available with several hundred Digital Signal Processors (DSP) units and over 1,000 IO pins for data transfers. The boards used here contain each one Xilinx Virtex 5 LX330T-2 FPGA coupled to 4 independent DDR2 SDRAM with a total of 1GB, and 2 independent 4MB SSRAM memory banks (commercial board from AlphaData). One FPGA chip contains 192 DSP with a maximum speed of 550MHz, which corresponds to a theoretical compute-performance of 105.6 GMACS (18 bit and 25 bit operands). There is a total of 14 Mbit of on-chip memory, and the chip incorporates 960 pins for data IO. Due to routing overhead, not all DSP units can be used and the actual clock frequencies tend to be considerably lower than what is advertised for such chips (typically 230MHz or less for our designs). Nevertheless, we obtain high performances because we can use a large number of DSP units for executing the main computation. The main architecture features are: • Parallel processing (on one chip): 128 VPE (hardware DSP) are divided into 4 blocks of 32, each group controlled by one sequencer with a vector instruction set. • Custom Precision: Data are represented with 1 to 16 bit resolution. Higher resolutions are possible by operating multiple DSP as one processor. • Overlapping Computation and Communication: CPU-FPGA communication is overlapped with the FPGA computation. • Overlap Memory Operations with Computation: All loads and stores from the FPGA to off-chip memory are performed concurrently with computations. • High Off-chip Memory Bandwidth: 6 independent data ports, each 32 bits wide, access banked memories concurrently (12GB/s per chip). • • Streaming Data Flow, Simple Access Patterns: Load/store units are tailored for streaming input and output data, and for simple, bursty access patterns. Caching is done under application control with dual-port memory on chip. Load/store with (de)compression: For an increase of effective IO bandwidth the load/store units provide compression and decompression in hardware. Figure 2 shows the configuration of the VPEs for vector dot product computation used for SVM training and classification. For training, the main computation is the calculation of one column of the kernel matrix. One vector is pre-fetched and stored in on-chip memory. All other vectors are streamed in from off-chip memory banks 1-4. Since this is a regular and predictable access pattern, we can utilize burst-mode, achieving a throughput of close to one memory word per cycle. But the speed is nevertheless IO bound. When several vectors can be stored on-chip, as is the case for classification, then the speed becomes compute-bound. Figure 2: Architecture for vector dot-product computation. The left side shows a high-level schematic with the main data flow. The data are streamed from memory banks 1-4 to the VPE arrays, while memory banks 5 and 6, alternatively receive results or stream them back to the host. The right side shows how a group of VPE is pipelined to improve clock speed. The operation for SVM training on the FPGA corresponds to a vector-matrix multiplication and the one for classification to a matrix-matrix multiplication. Therefore the configuration of Figure 2 is useful for many other algorithms as well, where operations with large vectors and matrices are needed, such as Neural Networks. We implemented a specialized configuration for Convolutional Neural Networks, for more efficiency and lower power dissipation. The VPE are daisy-chained and operate as systolic array. In this way we can take advantage of the high computation to IO ratio (Table 2) to reduce the data transfers from memory. 4 E val u ati on s We evaluated SVM training and classification with the NORB and MNIST problems, the latter with up to 2 million training samples (data from [11]). Both are benchmarks with vectors of high dimensionality, representative for applications in image and video analysis. The computation is split between CPU and FPGA as indicated by Table 1. The DDR2 memory banks are clocked at 230MHz, providing double that rate for data transfers. The data may be compressed to save IO bandwidth. On the FPGA they are decompressed first and distributed to the VPE. In our case, a 32 bit word contains eight 4-bit vector components. Four 32 bit words are needed to feed all 32 VPEs of a group; therefore clocking the VPE faster than 115MHz does not improve performance. A VPE executes a multiplication plus add operation in one clock cycle, resulting in a theoretical maximum of 14.7 GMACS per chip. The sustained compute-rate is lower, about 9.4 GMACS, due to overhead (see Table 4). The computation on the host CPU overlaps with that on the FPGA, and has no effect on the speed in the experiments shown here. For the classification the VPE can be clocked higher, at 230 MHz. By using 4-bit operands we can execute 2 multiply-accumulates simultaneously on one DSP, resulting in speed that is more than four times higher and a sustained 43.0 GMACS limited by the number and speed of the VPE. Adding a second FPGA card doubles the speed, showing little saturation effects yet, but for more FPGA per CPU there will be saturation (see Fig. 3). The compute speed in GMACS obtained for NORB is almost identical. # 60k 2M Iterations 8,000 266,900 CPU time 754s -- speed 0.5 -- CPU+MMX time speed 240 s 1.57 531,534 s 1.58 CPU+FPGA time speed 40 s 9.42 88,589 s 9.48 CPU+2 FPGA time speed 21 s 17.9 48,723 s 17.2 Table 4: Training times and average compute speed for SVM training. Systems tested: CPU, Opteron, 2.2GHz; CPU using MMX; CPU with one FPGA; CPU with two FPGA boards. Results are shown for training sizes of 60k and 2M samples. Compute speed is in GMACS (just kernel computations). Training algorithm: SMO with second order working set selection. Parallelizations of SVM training have been reported recently for a GPU [10] and for a cluster [11], both using the MNIST data. In [10] different bounds for stopping were used than here and in [11]. Nevertheless, a comparison of the compute performance is possible, because based on the number of iterations we can compute the average GMACS for the kernel computations. As can be seen in Table 5 a single FPGA is similar in speed to a GPU with 128 stream processors, despite a clock rate that is about 5.5 times lower for I/O and 11 times lower for the VPE. The cluster with 384 MMX units is about 6 times faster than one FPGA with 128 VPE, but dissipates about two orders of magnitude more electric power. For the FPGA this calculation includes only the computation of the kernel values while the part on the CPU is neglected. This is justified for this study, because the rest of the calculations can be mapped on the FPGA as well and will increase the power dissipation only minimally. Number Clock Operand Power Average of cores speed type dissipation compute speed CPU (Opteron) 1 2.2 GHz float 40 W 0.5 GMACS GPU (from [10]) 128 1.35 GHz float 80 W 7.4 GMACS Cluster (from [11]) 384 1.6 GHz byte > 1 kW 54 GMACS FPGA 128 0.12 GHz 4 bit nibble 9W 9.4 GMACS Table 5: Comparison of performances for SVM training (MNIST data). GPU: Nvidia 8800 GTX. Cluster: 48 dual core CPU (Athlon), 384 MMX units. The GPU was training with 60k samples ([10], table 2, second order), the cluster trained with 2 million samples. Processor Figure 3: Acceleration of SVM training as a function of the number of VPE. MNIST n: 2,000,000, d=784; NORB: n=48,560, d=5,184. The points for 128 and 256 VPE are experimental, the higher ones are simulations. Curves MNIST, NORB: Multiple FPGA are attached to one CPU. Curve MNIST C: Each FPGA is attached to a separate host CPU. Scaling of the acceleration with the number of VPEs is shown in Figure 3. The reference speed is that of one FPGA attached to a CPU. The evaluation has been done experimentally for 128 and 256 VPEs, and beyond that with a simulator. The onset of saturation depends on the dimensionality of the vectors, but to a much lesser extent on the number of training vectors (up to the limit of the memory on the FPGA card). MNIST saturates for more than two FPGAs because then the CPU and FPGA computation times become comparable. For the larger vectors of NORB (d=5,184) this saturation starts to be noticeable for more than 4 FPGA. Alternatively, a system can be scaled by grouping multiple CPU, each with one attached FPGA accelerator. Then the scaling follows a linear or even super-linear acceleration (MNIST C) to several thousand VPE. If the CPUs are working in a cluster arrangement, the scaling is similar to the one described in [11]. For convolutional neural networks, the architecture of Figure 2 is modified to allow a block of VPE to operate as systolic array. In this way convolutions can be implemented with minimal data movements. In addition to the convolution, also sub-sampling and non-linear functions plus the logistics to handle multiple layers with arbitrary numbers of kernels in each layer are done on the FPGA. Four separate blocks of such convolvers are packed onto one FPGA, using 100 VPE. Clocked at 115MHz, this architecture provides a maximum of 11.5 GMACS. Including all the overhead the sustained speed is about 10 GMACS. 5 Con cl u s i on s By systematically exploiting characteristic properties of machine learning algorithms, we developed a new massively parallel processor architecture that is very efficient and can be scaled to thousands of processing elements. The implementation demonstrated here is more than an order of magnitude higher in performance than previous FPGA implementations of SVM or CNN. For the MNIST problem it is comparable to the fastest GPU implementations reported so far. These results underline the importance of flexibility over raw compute-speed for massively parallel systems. The flexibility of the FPGA allows more efficient routing and packing of the data and the use of computations with the lowest resolution an algorithm permits. The results of Table 5 indicate the potential of this architecture for low-power operation in embedded applications. R e f e re n c e s [1] Ramacher, et al. (1995) Synapse-1: A high-speed general purpose parallel neurocomputer system. In Proc. 9th Intl. Symposium on Parallel Processing (IPPS'95), pp. 774-781. [2] Asanovic, K., Beck, Feldman, J., Morgan, N. & Wawrzynek, J. (1994) A Supercomputer for Neural Computation, Proc. IEEE Intl. Joint Conference on Neural Networks, pp. 5-9, Orlando, Florida. [3] Neil, P., (2005) Combining hardware with a powerful automotive MCU for powertrain applications. In Industrial Embedded Resource Guide, p. 88. [4] Korekado, et al. (2003) A Convolutional Neural Network VLSI for Image Recognition Using Merged/Mixed Analog-Digital Architecture, in Proc. 7th KES 2003, Oxford, pp 169-176. [5] Murasaki, M., Arima, Y. & Shinohara, H. (1993) A 20 Tera-CPS Analog Neural Network Board. In Proc. Int. Joint Conf. Neural Networks, pp. 3027 – 3030. [6] Pedersen, R., Schoeberl, M. (2006), An Embedded Support Vector Machine, WISE 2006. [7] Dey, S., Kedia, M. Agarwal, N., Basu, A., Embedded Support Vector Machine: Architectural Enhancements and Evaluation, in Proc 20th Int. Conf. VLSI Design. [8] Anguita, D., Boni, A., Ridella, S., (2003) A Digital Architecture for Support Vector Machines: Theory, Algorithm, and FPGA Implementation, IEEE Trans. Neural Networks, 14/5, pp.993-1009. [9] Chu, C., Kim, S., Lin, Y., Yu, Y., Bradski, G., Ng, A. & Olukotun, K. (2007) Map-Reduce for Machine Learning on Multicore, Advances in Neural Information Processing Systems 19, MIT Press. [10] Catanzaro, B., Sundaram, N., & Keutzer, K. (2008) Fast Support Vector Machine Training and Classification on Graphics Processors, Proc. 25th Int. Conf. Machine Learning, pp 104-111. [11] Durdanovic, I., Cosatto, E. & Graf, H. (2007) Large Scale Parallel SVM Implementation. In L. Bottou, O. Chapelle, D. DeCoste, J. Weston (eds.), Large Scale Kernel Machines, pp. 105-138, MIT Press. [12] Simard, P & Graf, H. (1994) Backpropagation without Multiplication. In J. Cowan, G. Tesauro, J. Alspector, (eds.), Neural Information Processing Systems 6, pp. 232 – 239, Morgan Kaufmann. [13] Savich, A., Moussa, M., Areibi, S., (2007) The Impact of Arithmetic Representation on Implementing MLP-BP on FPGAs: A Study, IEEE Trans. Neural Networks, 18/1, pp. 240-252.

6 0.51984322 118 nips-2008-Learning Transformational Invariants from Natural Movies

7 0.49299598 27 nips-2008-Artificial Olfactory Brain for Mixture Identification

8 0.47393847 152 nips-2008-Non-stationary dynamic Bayesian networks

9 0.46044207 204 nips-2008-Self-organization using synaptic plasticity

10 0.45830506 240 nips-2008-Tracking Changing Stimuli in Continuous Attractor Neural Networks

11 0.40930203 43 nips-2008-Cell Assemblies in Large Sparse Inhibitory Networks of Biologically Realistic Spiking Neurons

12 0.40804294 56 nips-2008-Deep Learning with Kernel Regularization for Visual Recognition

13 0.36831129 237 nips-2008-The Recurrent Temporal Restricted Boltzmann Machine

14 0.36429513 77 nips-2008-Evaluating probabilities under high-dimensional latent variable models

15 0.35472545 103 nips-2008-Implicit Mixtures of Restricted Boltzmann Machines

16 0.34965315 30 nips-2008-Bayesian Experimental Design of Magnetic Resonance Imaging Sequences

17 0.3301371 124 nips-2008-Load and Attentional Bayes

18 0.31844652 38 nips-2008-Bio-inspired Real Time Sensory Map Realignment in a Robotic Barn Owl

19 0.31539947 66 nips-2008-Dynamic visual attention: searching for coding length increments

20 0.30954611 115 nips-2008-Learning Bounded Treewidth Bayesian Networks

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(6, 0.068), (7, 0.058), (12, 0.033), (20, 0.319), (21, 0.017), (28, 0.121), (57, 0.15), (59, 0.025), (63, 0.019), (71, 0.012), (77, 0.063), (83, 0.039)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.87466794 41 nips-2008-Breaking Audio CAPTCHAs

Author: Jennifer Tam, Jiri Simsa, Sean Hyde, Luis V. Ahn

Abstract: CAP TCHAs are computer-generated tests that humans can pass but current computer systems cannot. CAP TCHAs provide a method for automatically distinguishing a human from a computer program, and therefore can protect Web services from abuse by so-called “bots.” Most CAP TCHAs consist of distorted images, usually text, for which a user must provide some description. Unfortunately, visual CAP TCHAs limit access to the millions of visually impaired people using the Web. Audio CAP TCHAs were created to solve this accessibility issue; however, the security of audio CAP TCHAs was never formally tested. Some visual CAP TCHAs have been broken using machine learning techniques, and we propose using similar ideas to test the security of audio CAP TCHAs. Audio CAP TCHAs are generally composed of a set of words to be identified, layered on top of noise. We analyzed the security of current audio CAP TCHAs from popular Web sites by using AdaBoost, SVM, and k-NN, and achieved correct solutions for test samples with accuracy up to 71%. Such accuracy is enough to consider these CAPTCHAs broken. Training several different machine learning algorithms on different types of audio CAP TCHAs allowed us to analyze the strengths and weaknesses of the algorithms so that we could suggest a design for a more robust audio CAPTCHA. 1 Int rod uct i o n CAP TCHAs [1] are automated tests designed to tell computers and humans apart by presenting users with a problem that humans can solve but current computer programs cannot. Because CAPTCHAs can distinguish between humans and computers with high probability, they are used for many different security applications: they prevent bots from voting continuously in online polls, automatically registering for millions of spam email accounts, automatically purchasing tickets to buy out an event, etc. Once a CAP TCHA is broken (i.e., computer programs can successfully pass the test), bots can impersonate humans and gain access to services that they should not. Therefore, it is important for CAP TCHAs to be secure. To pass the typical visual CAP TCHA, a user must correctly type the characters displayed in an image of distorted text. Many visual CAP TCHAs have been broken with machine learning techniques [2]-[3], though some remain secure against such attacks. Because visually impaired users who surf the Web using screen-reading programs cannot see this type of CAPTCHA, audio CAP TCHAs were created. Typical audio CAP TCHAs consist of one or several speakers saying letters or digits at randomly spaced intervals. A user must correctly identify the digits or characters spoken in the audio file to pass the CAP TCHA. To make this test difficult for current computer systems, specifically automatic speech recognition (ASR) programs, background noise is injected into the audio files. Since no official evaluation of existing audio CAP TCHAs has been reported, we tested the security of audio CAP TCHAs used by many popular Web sites by running machine learning experiments designed to break them. In the next section, we provide an overview of the literature related to our project. Section 3 describes our methods for creating training data, and section 4 describes how we create classifiers that can recognize letters, digits, and noise. In section 5, we discuss how we evaluated our methods on widely used audio CAP TCHAs and we give our results. In particular, we show that the audio CAP TCHAs used by sites such as Google and Digg are susceptible to machine learning attacks. Section 6 mentions the proposed design of a new more secure audio CAP TCHA based on our findings. 2 Lit erat u r e r ev i ew To break the audio CAP TCHAs, we derive features from the CAP TCHA audio and use several machine learning techniques to perform ASR on segments of the CAPTCHA. There are many popular techniques for extracting features from speech. The three techniques we use are mel-frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP), and relative spectral transform-PLP (RAS TA-PLP). MFCC is one of the most popular speech feature representations used. Similar to a fast Fourier transform (FF T), MFCC transforms an audio file into frequency bands, but (unlike FF T) MFCC uses mel-frequency bands, which are better for approximating the range of frequencies humans hear. PLP was designed to extract speaker-independent features from speech [4]. Therefore, by using PLP and a variant such as RAS TA-PL P, we were able to train our classifiers to recognize letters and digits independently of who spoke them. Since many different people recorded the digits used in one of the types of audio CAP TCHAs we tested, PLP and RAS TA-PLP were needed to extract the features that were most useful for solving them. In [4]-[5], the authors conducted experiments on recognizing isolated digits in the presence of noise using both PLP and RAS TA-PL P. However, the noise used consisted of telephone or microphone static caused by recording in different locations. The audio CAP TCHAs we use contain this type of noise, as well as added vocal noise and/or music, which is supposed to make the automated recognition process much harder. The authors of [3] emphasize how many visual CAP TCHAs can be broken by successfully splitting the task into two smaller tasks: segmentation and recognition. We follow a similar approach in that we first automatically split the audio into segments, and then we classify these segments as noise or words. In early March 2008, concurrent to our work, the blog of Wintercore Labs [6] claimed to have successfully broken the Google audio CAP TCHA. After reading their Web article and viewing the video of how they solve the CAP TCHAs, we are unconvinced that the process is entirely automatic, and it is unclear what their exact pass rate is. Because we are unable to find any formal technical analysis of this program, we can neither be sure of its accuracy nor the extent of its automation. 3 Cr e at i o n of tra i n i n g dat a Since automated programs can attempt to pass a CAPTCHA repeatedly, a CAPTCHA is essentially broken when a program can pass it more than a non-trivial fraction of the time; e.g., a 5% pass rate is enough. Our approach to breaking the audio CAP TCHAs began by first splitting the audio files into segments of noise or words: for our experiments, the words were spoken letters or digits. We used manual transcriptions of the audio CAP TCHAs to get information regarding the location of each spoken word within the audio file. We were able to label our segments accurately by using this information. We gathered 1,000 audio CAP TCHAs from each of the following Web sites: google.com, digg.com, and an older version of the audio CAP TCHA in recaptcha.net. Each of the CAP TCHAs was annotated with the information regarding letter/digit locations provided by the manual transcriptions. For each type of CAPTCHA, we randomly selected 900 samples for training and used the remaining 100 for testing. Using the digit/letter location information provided in the manual CAP TCHA transcriptions, each training CAP TCHA is divided into segments of noise, the letters a-z, or the digits 0-9, and labeled as such. We ignore the annotation information of the CAP TCHAs we use for testing, and therefore we cannot identify the size of those segments. Instead, each test CAP TCHA is divided into a number of fixed-size segments. The segments with the highest energy peaks are then classified using machine learning techniques (Figure 1). Since the size of a feature vector extracted from a segment generally depends on the size of the segment, using fixed-size segments allows each segment to be described with a feature vector of the same length. We chose the window size by listening to a few training segments and adjusted accordingly to ensure that the segment contained the entire digit/letter. There is undoubtedly a more optimal way of selecting the window size, however, we were still able to break the three CAP TCHAs we tested with our method. Figure 1: A test audio CAP TCHA with the fixed-size segments containing the highest energy peaks highlighted. The information provided in the manual transcriptions of the audio CAP TCHAs contains a list of the time intervals within which words are spoken. However, these intervals are of variable size and the word might be spoken anywhere within this interval. To provide fixedsize segments for training, we developed the following heuristic. First, divide each file into variable-size segments using the time intervals provided and label each segment accordingly. Then, within each segment, detect the highest energy peak and return its fixed-size neighborhood labeled with the current segment’s label. This heuristic achieved nearly perfect labeling accuracy for the training set. Rare mistakes occurred when the highest energy peak of a digit or letter segment corresponded to noise rather than to a digit or letter. To summarize this subsection, an audio file is transformed into a set of fixed-size segments labeled as noise, a digit between 0 and 9, or a letter between a and z. These segments are then used for training. Classifiers are trained for one type of CAPTCHA at a time. 4 C l a s s i f i e r con s t ru ct i o n From the training data we extracted five sets of features using twelve MFCCs and twelfth- order spectral (SPEC) and cepstral (CEPS) coefficients from PLP Matlab functions for extracting these features were provided online Voicebox package. We use AdaBoost, SVM, and k-NN algorithms digit and letter recognition. We detail our implementation of following subsections. 4 .1 and RAS TA-PL P. The at [7] and as part of the to implement automated each algorithm in the AdaBoost Using decision stumps as weak classifiers for AdaBoost, anywhere from 11 to 37 ensemble classifiers are built. The number of classifiers built depends on which type of CAPTCHA we are solving. Each classifier trains on all the segments associated with that type of CAP TCHA, and for the purpose of building a single classifier, segments are labeled by either -1 (negative example) or +1 (positive example). Using cross-validation, we choose to use 50 iterations for our AdaBoost algorithm. A segment can then be classified as a particular letter, digit, or noise according to the ensemble classifier that outputs the number closest to 1. 4 .2 S u p p o rt v e ct o r m a c h i n e To conduct digit recognition with SVM, we used the C++ implementations of libSVM [8] version 2.85 with C-SMV and RBF kernel. First, all feature values are scaled to the range of -1 to 1 as suggested by [8]. The scale parameters are stored so that test samples can be scaled accordingly. Then, a single multiclass classifier is created for each set of features using all the segments for a particular type of CAPTCHA. We use cross-validation and grid search to discover the optimal slack penalty (C=32) and kernel parameter (γ=0.011). 4 .3 k - n e a re st n e i g h b o r ( k - N N ) We use k-NN as our final method for classifying digits. For each type of CAP TCHA, five different classifiers are created by using all of the training data and the five sets of features associated with that particular type of CAP TCHA. Again we use cross-validation to discover the optimal parameter, in this case k=1. We use Euclidian distance as our distance metric. 5 Ass e s sm e n t of cu rre n t a ud i o CAPTCHAs Our method for solving CAP TCHAs iteratively extracts an audio segment from a CAP TCHA, inputs the segment to one of our digit or letter recognizers, and outputs the label for that segment. We continue this process until the maximum solution size is reached or there are no unlabeled segments left. Some of the CAPTCHAs we evaluated have solutions that vary in length. Our method ensures that we get solutions of varying length that are never longer than the maximum solution length. A segment to be classified is identified by taking the neighborhood of the highest energy peak of an as yet unlabeled part of the CAP TCHA. Once a prediction of the solution to the CAPTCHA is computed, it is compared to the true solution. Given that at least one of the audio CAP TCHAs allows users to make a mistake in one of the digits (e.g., reCAPTCHA), we compute the pass rate for each of the different types of CAPTCHAs with all of the following conditions: • The prediction matches the true solution exactly. • Inserting one digit into the prediction would make it match the solution exactly. • Replacing one digit in the prediction would make it match the solution exactly. • Removing one digit from the prediction would make it match the solution exactly. However, since we are only sure that these conditions apply to reCAPTCHA audio CAP TCHAs, we also calculate the percentage of exact solution matches in our results for each type of audio CAP TCHA. These results are described in the following subsections. 5 .1 Goog le Google audio CAP TCHAs consist of one speaker saying random digits 0-9, the phrase “once again,” followed by the exact same recorded sequence of digits originally presented. The background noise consists of human voices speaking backwards at varying volumes. A solution can range in length from five to eight words. We set our classifier to find the 12 loudest segments and classify these segments as digits or noise. Because the phrase “once again” marks the halfway point of the CAPTCHA, we preprocessed the audio to only serve this half of the CAP TCHA to our classifiers. It is important to note, however, that the classifiers were always able to identify the segment containing “once again,” and these segments were identified before all other segments. Therefore, if necessary, we could have had our system cut the file in half after first labeling this segment. For AdaBoost, we create 12 classifiers: one classifier for each digit, one for noise, and one for the phrase “once again.” Our results ( Table 1) show that at best we achieved a 90% pass rate using the “one mistake” passing conditions and a 66% exact solution match rate. Using SVM and the “one mistake” passing conditions, at best we achieve a 92% pass rate and a 67% exact solution match. For k-NN, the “one mistake” pass rate is 62% and the exact solution match rate is 26%. Table 1: Google audio CAP TCHA results: Maximum 67% accuracy was achieved by SVM. Classifiers Used AdaBoost SVM k-NN one mistake exact match one mistake exact match 88% 61% 92% 67% 30% 1% PLPSPEC 90% 66% 90% 67% 60% 26% PLPCEPS 90% 66% 92% 67% 62% 23% RAS TAPLPSPEC 88% 48% 90% 61% 29% 1% RAS TAPLPCEPS 5 .2 exact match MFCC Features Used One mistake 90% 63% 92% 67% 33% 2% Digg Digg CAP TCHAs also consist of one speaker, in this case saying a random combination of letters and digits. The background noise consists of static or what sounds like trickling water and is not continuous throughout the entire file. We noticed in our training data that the following characters were never present in a solution: 0, 1, 2, 5, 7, 9, i, o, z. Since the Digg audio CAPTCHA is also the verbal transcription of the visual CAP TCHA, we believe that these characters are excluded to avoid confusion between digits and letters that are similar in appearance. The solution length varies between three and six words. Using AdaBoost, we create 28 classifiers: one classifier for each digit or letter that appears in our training data and one classifier for noise. Perhaps because we had fewer segments to train with and there was a far higher proportion of noise segments, AdaBoost failed to produce any correct solutions. We believe that the overwhelming number of negative training examples versus the small number of positive training samples used to create each decision stump severely affected AdaBoost’s ability to classify audio segments correctly. A histogram of the training samples is provided in Figure 2 to illustrate the amount of training data available for each character. When using SVM, the best feature set passed with 96% using “one mistake” passing conditions and passed with 71% when matching the solution exactly. For k-NN, the best feature set produced a 90% “one mistake” pass rate and a 49% exact solution match. Full results can be found in Table 2. Table 2: Digg audio CAP TCHA results: Maximum 71% accuracy was achieved by SVM. Classifiers Used AdaBoost SVM k-NN exact match one mistake exact match one mistake exact match MFCC - - 96% 71% 89% 49% PLPSPEC - - 94% 65% 90% 47% PLPCEPS - - 96% 71% 64% 17% RAS TAPLPSPEC - - 17% 3% 67% 17% RAS TAPLPCEPS Features Used one mistake - - 96% 71% 82% 34% 1000 900 800 # of Segments 700 600 500 400 300 200 100 y x v w t u r s q p n m l j k h f g e d c b a 8 6 4 3 noise 0 Segment Label Figure 2: Digg CAP TCHA training data distribution. 5 .3 reC A P T C H A The older version of reCAPTCHA’s audio CAP TCHAs we tested consist of several speakers who speak random digits. The background noise consists of human voices speaking backwards at varying volumes. The solution is always eight digits long. For AdaBoost, we create 11 classifiers: one classifier for each digit and one classifier for noise. Because we know that the reCAPTCHA passing conditions are the “one mistake” passing conditions, SVM produces our best pass rate of 58%. Our best exact match rate is 45% ( Table 3). Table 3: reCAPTCHA audio CAP TCHA results: Maximum 45% accuracy was achieved by SVM. Classifiers Used AdaBoost SVM k-NN one mistake exact match one mistake exact match 18% 6% 56% 43% 22% 11% PLPSPEC 27% 10% 58% 39% 43% 25% PLPCEPS 23% 10% 56% 45% 29% 14% RAS TAPLPSPEC 9% 3% 36% 18% 24% 4% RAS TAPLPCEPS 6 exact match MFCC Features Used one mistake 9% 3% 46% 30% 32% 12% Prop ert i e s of w ea k ver s u s st ro n g CAPT CHAs From our results, we note that the easiest CAP TCHAs to break were from Digg. Google had the next strongest CAP TCHAs followed by the strongest from reCAPTCHA. Although the Digg CAP TCHAs have the largest vocabulary, giving us less training data per label, the same woman recorded them all. More importantly, the same type of noise is used throughout the entire CAPTCHA. The noise sounds like running water and static which sounds very different from the human voice and does not produce the same energy spikes needed to locate segments, therefore making segmentation quite easy. The CAP TCHAs from Google and reCAPTCHA used other human voices for background noise, making segmentation much more difficult. Although Google used a smaller vocabulary than Digg and also only used one speaker, Google’s background noise made the CAP TCHA more difficult to solve. After listening to a few of Google’s CAP TCHAs, we noticed that although the background noise consisted of human voices, the same background noise was repeated. reCAP TCHA had similar noise to Google, but they had a larger selection of noise thus making it harder to learn. reCAP TCHA also has the longest solution length making it more difficult to get perfectly correct. Finally, reCAPTCHA used many different speakers causing it to be the strongest CAP TCHA of the three we tested. In conclusion, an audio CAP TCHA that consists of a finite vocabulary and background noise should have multiple speakers and noise similar to the speakers. 7 Recomm e n d at i o n s f or creat i n g st ro n g e r aud i o CAPTCHAs Due to our success in solving audio CAP TCHAs, we have decided to start developing new audio CAP TCHAs that our methods, and machine learning methods in general, will be less likely to solve. From our experiments, we note that CAP TCHAs containing longer solutions and multiple speakers tend to be more difficult to solve. Also, because our methods depend on the amount of training data we have, having a large vocabulary would make it more difficult to collect enough training data. Already since obtaining these results, reCAPTCHA.net has updated their audio CAP TCHA to contain more distortions and a larger vocabulary: the digits 0 through 99. In designing a new audio CAP TCHA we are also concerned with the human pass rate. The current human pass rate for the reCAPTCHA audio CAP TCHAs is only 70%. To develop an audio CAP TCHA with an improved human pass rate, we plan to take advantage of the human mind’s ability to understand distorted audio through context clues. By listening to a phrase instead of to random isolated words, humans are better able to decipher distorted utterances because they are familiar with the phrase or can use contextual clues to decipher the distorted audio. Using this idea, the audio for our new audio CAP TCHA will be taken from old-time radio programs in which the poor quality of the audio makes transcription by ASR systems difficult. Users will be presented with an audio clip consisting of a 4-6 word phrase. Half of the CAPTCHA consists of words, which validate a user to be human, while the other half of the words need to be transcribed. This is the same idea behind the visual reCAP TCHA that is currently digitizing text on which OCR fails. We expect that this new audio CAP TCHA will be more secure than the current version and easier for humans to pass. Initial experiments using this idea show this to be true [9]. 8 Co n c l u s i o n We have succeeded in “breaking” three different types of widely used audio CAP TCHAs, even though these were developed with the purpose of defeating attacks by machine learning techniques. We believe our results can be improved by selecting optimal segment sizes, but that is unnecessary given our already high success rate. For our experiments, segment sizes were not chosen in a special way; occasionally yielding results in which a segment only contained half of a word, causing our prediction to contain that particular word twice. We also believe that the AdaBoost results can be improved, particularly for the Digg audio CAP TCHAs, by ensuring that the number of negative training samples is closer to the number of positive training samples. We have shown that our approach is successful and can be used with many different audio CAP TCHAs that contain small finite vocabularies. A ck n o w l e d g m e n t s This work was partially supported by generous gifts from the Heinz Endowment, by an equipment grant from Intel Corporation, and by the Army Research Office through grant number DAAD19-02-1-0389 to CyLab at Carnegie Mellon University. Luis von Ahn was partially supported by a Microsoft Research New Faculty Fellowship and a MacArthur Fellowship. Jennifer Tam was partially supported by a Google Anita Borg Scholarship. R e f e re n c e s [1] L. von Ahn, M. Blum, and J. Langford. “Telling Humans and Computers Apart Automatically,” Communication s of the ACM, vol. 47, no. 2, pp. 57-60, Feb. 2004. [2] G. Mori and J. Malik. “Recognizing Objects in Adversarial Clutter: Breaking a Visual CAPTCHA,” In Computer Vision and Pattern Recognition CVPR'03, June 2003. [3] K. Chellapilla, and P. Simard, “ U sing Machine Learning to Break Visual Human Interactio n Proofs (HIP s),” Advances in Neural Information P rocessing Systems 17, Neural Info rmatio n P rocessing Systems (NIPS'2004), MIT Press. [4] H. Hermansk y, “ Perceptual Linear Predictive (PL P) Analysis of Speech,” J. Acoust. Soc. Am., vol. 87, no. 4, pp. 1738-1752, Apr. 1990. [5] H. Hermansk y, N. Morgan, A. Bayya, and P. Kohn. “RASTA-PL P Speech Analysi s Technique,” In P roc. IEEE Int’l Conf. Acoustics, Speech & Signal Processing, vol. 1, pp. 121124, San Francisco, 1992. [6] R. Santamarta. “Breaking Gmail ’s Audio Captcha,” http://blog.wintercore.com/?p=11, 2008. [7] D. Ell is. “ P L P and RASTA (and MFCC, and inversion) in Matlab using melfcc.m and invmelfcc.m,” http:/ /ww w.ee.columbia.edu/~dpwe/resources/matlab/rastamat/, 2006. [8] C. Chang and C. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http: //ww w.csie.ntu.edu.tw/~cjlin/libsvm [9] A. Schlaikjer. “ A Dual-Use Speech CA PTCHA: Aiding Visually Impaired Web Users while Providing Transcriptions of Audio Streams,” Technical Report CMU-LTI-07-014, Carnegie Mellon Universi t y. November 2007.

same-paper 2 0.78361446 158 nips-2008-Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks

Author: Alex Graves, Jürgen Schmidhuber

3 0.56546557 27 nips-2008-Artificial Olfactory Brain for Mixture Identification

Author: Mehmet K. Muezzinoglu, Alexander Vergara, Ramon Huerta, Thomas Nowotny, Nikolai Rulkov, Henry Abarbanel, Allen Selverston, Mikhail Rabinovich

Abstract: The odor transduction process has a large time constant and is susceptible to various types of noise. Therefore, the olfactory code at the sensor/receptor level is in general a slow and highly variable indicator of the input odor in both natural and artiﬁcial situations. Insects overcome this problem by using a neuronal device in their Antennal Lobe (AL), which transforms the identity code of olfactory receptors to a spatio-temporal code. This transformation improves the decision of the Mushroom Bodies (MBs), the subsequent classiﬁer, in both speed and accuracy. Here we propose a rate model based on two intrinsic mechanisms in the insect AL, namely integration and inhibition. Then we present a MB classiﬁer model that resembles the sparse and random structure of insect MB. A local Hebbian learning procedure governs the plasticity in the model. These formulations not only help to understand the signal conditioning and classiﬁcation methods of insect olfactory systems, but also can be leveraged in synthetic problems. Among them, we consider here the discrimination of odor mixtures from pure odors. We show on a set of records from metal-oxide gas sensors that the cascade of these two new models facilitates fast and accurate discrimination of even highly imbalanced mixtures from pure odors. 1

4 0.55656981 80 nips-2008-Extended Grassmann Kernels for Subspace-Based Learning

Author: Jihun Hamm, Daniel D. Lee

Abstract: Subspace-based learning problems involve data whose elements are linear subspaces of a vector space. To handle such data structures, Grassmann kernels have been proposed and used previously. In this paper, we analyze the relationship between Grassmann kernels and probabilistic similarity measures. Firstly, we show that the KL distance in the limit yields the Projection kernel on the Grassmann manifold, whereas the Bhattacharyya kernel becomes trivial in the limit and is suboptimal for subspace-based problems. Secondly, based on our analysis of the KL distance, we propose extensions of the Projection kernel which can be extended to the set of afﬁne as well as scaled subspaces. We demonstrate the advantages of these extended kernels for classiﬁcation and recognition tasks with Support Vector Machines and Kernel Discriminant Analysis using synthetic and real image databases. 1

5 0.55376709 100 nips-2008-How memory biases affect information transmission: A rational analysis of serial reproduction

Author: Jing Xu, Thomas L. Griffiths

Abstract: Many human interactions involve pieces of information being passed from one person to another, raising the question of how this process of information transmission is affected by the capacities of the agents involved. In the 1930s, Sir Frederic Bartlett explored the inﬂuence of memory biases in “serial reproduction” of information, in which one person’s reconstruction of a stimulus from memory becomes the stimulus seen by the next person. These experiments were done using relatively uncontrolled stimuli such as pictures and stories, but suggested that serial reproduction would transform information in a way that reﬂected the biases inherent in memory. We formally analyze serial reproduction using a Bayesian model of reconstruction from memory, giving a general result characterizing the effect of memory biases on information transmission. We then test the predictions of this account in two experiments using simple one-dimensional stimuli. Our results provide theoretical and empirical justiﬁcation for the idea that serial reproduction reﬂects memory biases. 1

6 0.55181926 236 nips-2008-The Mondrian Process

7 0.54726303 208 nips-2008-Shared Segmentation of Natural Scenes Using Dependent Pitman-Yor Processes

8 0.54327065 191 nips-2008-Recursive Segmentation and Recognition Templates for 2D Parsing

9 0.54307598 148 nips-2008-Natural Image Denoising with Convolutional Networks

10 0.54120648 233 nips-2008-The Gaussian Process Density Sampler

11 0.53103226 200 nips-2008-Robust Kernel Principal Component Analysis

12 0.52752525 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data

13 0.52287489 234 nips-2008-The Infinite Factorial Hidden Markov Model

14 0.52204347 192 nips-2008-Reducing statistical dependencies in natural signals using radial Gaussianization

15 0.52192253 216 nips-2008-Sparse probabilistic projections

16 0.5189907 118 nips-2008-Learning Transformational Invariants from Natural Movies

17 0.51809931 66 nips-2008-Dynamic visual attention: searching for coding length increments

18 0.51718211 62 nips-2008-Differentiable Sparse Coding

19 0.51684815 197 nips-2008-Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation

20 0.51494288 232 nips-2008-The Conjoint Effect of Divisive Normalization and Orientation Selectivity on Redundancy Reduction