nips nips2008 nips2008-148 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Viren Jain, Sebastian Seung
Abstract: We present an approach to low-level vision that combines two main ideas: the use of convolutional networks as an image processing architecture and an unsupervised learning procedure that synthesizes training samples from specific noise models. We demonstrate this approach on the challenging problem of natural image denoising. Using a test set with a hundred natural images, we find that convolutional networks provide comparable and in some cases superior performance to state of the art wavelet and Markov random field (MRF) methods. Moreover, we find that a convolutional network offers similar performance in the blind denoising setting as compared to other techniques in the non-blind setting. We also show how convolutional networks are mathematically related to MRF approaches by presenting a mean field theory for an MRF specially designed for image denoising. Although these approaches are related, convolutional networks avoid computational difficulties in MRF approaches that arise from probabilistic learning and inference. This makes it possible to learn image processing architectures that have a high degree of representational power (we train models with over 15,000 parameters), but whose computational expense is significantly less than that associated with inference in MRF approaches with even hundreds of parameters. 1 Background Low-level image processing tasks include edge detection, interpolation, and deconvolution. These tasks are useful both in themselves, and as a front-end for high-level visual tasks like object recognition. This paper focuses on the task of denoising, defined as the recovery of an underlying image from an observation that has been subjected to Gaussian noise. One approach to image denoising is to transform an image from pixel intensities into another representation where statistical regularities are more easily captured. For example, the Gaussian scale mixture (GSM) model introduced by Portilla and colleagues is based on a multiscale wavelet decomposition that provides an effective description of local image statistics [1, 2]. Another approach is to try and capture statistical regularities of pixel intensities directly using Markov random fields (MRFs) to define a prior over the image space. Initial work used handdesigned settings of the parameters, but recently there has been increasing success in learning the parameters of such models from databases of natural images [3, 4, 5, 6, 7, 8]. Prior models can be used for tasks such as image denoising by augmenting the prior with a noise model. Alternatively, an MRF can be used to model the probability distribution of the clean image conditioned on the noisy image. This conditional random field (CRF) approach is said to be discriminative, in contrast to the generative MRF approach. Several researchers have shown that the CRF approach can outperform generative learning on various image restoration and labeling tasks [9, 10]. CRFs have recently been applied to the problem of image denoising as well [5]. 1 The present work is most closely related to the CRF approach. Indeed, certain special cases of convolutional networks can be seen as performing maximum likelihood inference on a CRF [11]. The advantage of the convolutional network approach is that it avoids a general difficulty with applying MRF-based methods to image analysis: the computational expense associated with both parameter estimation and inference in probabilistic models. For example, naive methods of learning MRFbased models involve calculation of the partition function, a normalization factor that is generally intractable for realistic models and image dimensions. As a result, a great deal of research has been devoted to approximate MRF learning and inference techniques that meliorate computational difficulties, generally at the cost of either representational power or theoretical guarantees [12, 13]. Convolutional networks largely avoid these difficulties by posing the computational task within the statistical framework of regression rather than density estimation. Regression is a more tractable computation and therefore permits models with greater representational power than methods based on density estimation. This claim will be argued for with empirical results on the denoising problem, as well as mathematical connections between MRF and convolutional network approaches. 2 Convolutional Networks Convolutional networks have been extensively applied to visual object recognition using architectures that accept an image as input and, through alternating layers of convolution and subsampling, produce one or more output values that are thresholded to yield binary predictions regarding object identity [14, 15]. In contrast, we study networks that accept an image as input and produce an entire image as output. Previous work has used such architectures to produce images with binary targets in image restoration problems for specialized microscopy data [11, 16]. Here we show that similar architectures can also be used to produce images with the analog fluctuations found in the intensity distributions of natural images. Network Dynamics and Architecture A convolutional network is an alternating sequence of linear filtering and nonlinear transformation operations. The input and output layers include one or more images, while intermediate layers contain “hidden
Reference: text
sentIndex sentText sentNum sentScore
1 We demonstrate this approach on the challenging problem of natural image denoising. [sent-3, score-0.299]
2 Using a test set with a hundred natural images, we find that convolutional networks provide comparable and in some cases superior performance to state of the art wavelet and Markov random field (MRF) methods. [sent-4, score-0.783]
3 Moreover, we find that a convolutional network offers similar performance in the blind denoising setting as compared to other techniques in the non-blind setting. [sent-5, score-1.077]
4 We also show how convolutional networks are mathematically related to MRF approaches by presenting a mean field theory for an MRF specially designed for image denoising. [sent-6, score-0.97]
5 Although these approaches are related, convolutional networks avoid computational difficulties in MRF approaches that arise from probabilistic learning and inference. [sent-7, score-0.677]
6 1 Background Low-level image processing tasks include edge detection, interpolation, and deconvolution. [sent-9, score-0.284]
7 This paper focuses on the task of denoising, defined as the recovery of an underlying image from an observation that has been subjected to Gaussian noise. [sent-11, score-0.322]
8 One approach to image denoising is to transform an image from pixel intensities into another representation where statistical regularities are more easily captured. [sent-12, score-0.927]
9 For example, the Gaussian scale mixture (GSM) model introduced by Portilla and colleagues is based on a multiscale wavelet decomposition that provides an effective description of local image statistics [1, 2]. [sent-13, score-0.325]
10 Another approach is to try and capture statistical regularities of pixel intensities directly using Markov random fields (MRFs) to define a prior over the image space. [sent-14, score-0.364]
11 Prior models can be used for tasks such as image denoising by augmenting the prior with a noise model. [sent-16, score-0.734]
12 Alternatively, an MRF can be used to model the probability distribution of the clean image conditioned on the noisy image. [sent-17, score-0.314]
13 Several researchers have shown that the CRF approach can outperform generative learning on various image restoration and labeling tasks [9, 10]. [sent-19, score-0.356]
14 CRFs have recently been applied to the problem of image denoising as well [5]. [sent-20, score-0.585]
15 Indeed, certain special cases of convolutional networks can be seen as performing maximum likelihood inference on a CRF [11]. [sent-22, score-0.713]
16 The advantage of the convolutional network approach is that it avoids a general difficulty with applying MRF-based methods to image analysis: the computational expense associated with both parameter estimation and inference in probabilistic models. [sent-23, score-0.991]
17 For example, naive methods of learning MRFbased models involve calculation of the partition function, a normalization factor that is generally intractable for realistic models and image dimensions. [sent-24, score-0.259]
18 This claim will be argued for with empirical results on the denoising problem, as well as mathematical connections between MRF and convolutional network approaches. [sent-28, score-1.048]
19 In contrast, we study networks that accept an image as input and produce an entire image as output. [sent-30, score-0.692]
20 Previous work has used such architectures to produce images with binary targets in image restoration problems for specialized microscopy data [11, 16]. [sent-31, score-0.59]
21 Here we show that similar architectures can also be used to produce images with the analog fluctuations found in the intensity distributions of natural images. [sent-32, score-0.282]
22 Network Dynamics and Architecture A convolutional network is an alternating sequence of linear filtering and nonlinear transformation operations. [sent-33, score-0.696]
23 The input and output layers include one or more images, while intermediate layers contain “hidden" units with images called feature maps that are the internal computations of the algorithm. [sent-34, score-0.435]
24 The activity of feature map a in layer k is given by wk,ab ⊗ Ik−1,b − θk,a Ik,a = f (1) b where Ik−1,b are feature maps that provide input to Ik,a , and ⊗ denotes the convolution operation. [sent-35, score-0.273]
25 We restrict our experiments to monochrome images and hence the networks contain a single image in the input layer. [sent-37, score-0.533]
26 It is straightforward to extend this approach to color images by assuming an input layer with multiple images (e. [sent-38, score-0.422]
27 We also explicitly encode the border of the image by padding an area surrounding the image with values of −1. [sent-42, score-0.518]
28 Learning to Denoise Parameter learning can be performed with a modification of the backpropagation algorithm for feedfoward neural networks that takes into account the weight-sharing structure of convolutional networks [14]. [sent-43, score-0.805]
29 However, several issues have to be addressed in order to learn the architecture in Figure 1 for the task of natural image denoising. [sent-44, score-0.459]
30 Firstly, the image denoising task must be formulated as a learning problem in order to train the convolutional network. [sent-45, score-1.201]
31 Since we assume access to a database of only clean, noiseless images, we implicitly specify the desired image processing task by integrating a noise process into the training procedure. [sent-46, score-0.456]
32 In particular, we assume a noise process n(x) that operates on an image xi drawn from a distribution of natural images X. [sent-47, score-0.508]
33 If we consider the entire convolutional network to be some function 2 Architecture of CN1 and CN2 I1,1 I3,1 I4,1 I1,2 I2,2 I3,2 I4,2 . [sent-48, score-0.719]
34 I1,24 input image I2,1 I2,24 I3,24 I4,24 output image Figure 1: Architecture of convolutional network used for denoising. [sent-60, score-1.214]
35 The network has 4 hidden layers and 24 feature maps in each hidden layer. [sent-61, score-0.466]
36 In layers 2, 3, and 4, each feature map is connected to 8 randomly chosen feature maps in the previous layer. [sent-62, score-0.225]
37 Each arrow represents a single convolution associated with a 5 × 5 filter, and hence this network has 15,697 free parameters and requires 624 convolutions to process its forward pass. [sent-63, score-0.235]
38 Fφ with free parameters φ, then the parameter estimation problem is to minimize the reconstruction error of the images subject to the noise process: minφ i (xi − Fφ (n(xi )))2 ). [sent-64, score-0.237]
39 Using a localized image patch violates the independence assumption in stochastic online learning, but combining the gradient from six separate images yields a 6 × 6 × 6 cube that in practice is a sufficient approximation of the gradient to be effective. [sent-70, score-0.476]
40 This procedure starts by training a network with a single hidden layer. [sent-76, score-0.254]
41 After thirty epochs, the weights from the first hidden layer are copied to a new network with two hidden layers; the weights connecting the hidden layer to the output layer are discarded. [sent-77, score-0.803]
42 The two hidden layer network is optimized for another thirty epochs, and the procedure is repeated for N layers. [sent-78, score-0.375]
43 Finally, when learning networks with two or more hidden layers it was important to use a very small learning rate for the final layer (0. [sent-79, score-0.442]
44 Implementation Convolutional network inference and learning can be implemented in just a few lines of MATLAB code using multi-dimensional convolution and cross-correlation routines. [sent-82, score-0.24]
45 3 Experiments We derive training and test sets for our experiments from natural images in the Berkeley segmentation database, which has been previously used to study denoising [20, 4]. [sent-84, score-0.547]
46 We restrict our experiments to the case of monochrome images; color images in the Berkeley dataset are converted to grayscale by averaging the color channels. [sent-85, score-0.225]
47 CN1 and CNBlind are learned using the same forty image training set as the Field of Experts model (FoE). [sent-89, score-0.33]
48 PSNR has been widely used to evaluate denoising performance [1, 4, 2, 5, 6, 7]. [sent-94, score-0.326]
49 Denoising with known noise conditions In this task it is assumed that images have been subjected to Gaussian noise of known variance. [sent-95, score-0.349]
50 We use this noise model during the training process and learn a five-layer network for each noise level. [sent-96, score-0.35]
51 Both the Bayes Least Squares-Gaussian Scale Mixture (BLS-GSM) and Field of Experts (FoE) method also optimize the denoising process based on a specified noise level. [sent-97, score-0.403]
52 In another set of networks, called CN2, this training set is augmented by an additional sixty images from the Berkeley database. [sent-100, score-0.22]
53 The architecture of these networks is shown in Fig. [sent-101, score-0.218]
54 We find that the convolutional network has the highest average PSNR using either training set, although by a margin that is within statistical insignificance when standard error is computed from the distribution of PSNR values of the entire image. [sent-108, score-0.768]
55 Blind denoising In this task it is assumed that images have been subjected to Gaussian noise of unknown variance. [sent-110, score-0.598]
56 We train a single six-layer network network we refer to as CNBlind by randomly varying the amount of noise added to each example in the training process, in the range of σ = [0, 100] . [sent-112, score-0.392]
57 During inference, the noise level is unknown and only the image is provided as input. [sent-113, score-0.336]
58 We find that a convolutional network trained for blind denoising performs well even compared to the other methods under non-blind conditions. [sent-119, score-1.077]
59 25 Figure 3: Denoising results on an image from the test set. [sent-126, score-0.259]
60 The noisy image was generated by adding Gaussian noise with σ = 50 to the clean image. [sent-127, score-0.391]
61 Non-blind denoising results for the BLS-GSM, FoE, and convolutional network methods are shown. [sent-128, score-1.022]
62 4 Relationship between MRF and Convolutional Network Approaches In the introduction, we claim that convolutional networks have similar or even greater representational power compared to MRFs. [sent-133, score-0.782]
63 To support this claim, we will show that special cases of convolutional networks correspond to mean field inference for an MRF. [sent-134, score-0.713]
64 This does not rigorously prove that convolutional networks have representational power greater than or equal to MRFs, since mean field inference is an approximation. [sent-135, score-0.792]
65 Previous work has pointed out that the Field of Experts MRF can be interpreted as a convolutional network (see [21]) and that MRFs with an Ising-like prior can be related to convolutional networks (see [11]). [sent-137, score-1.395]
66 Here, we analyze a different MRF that is specially designed for image denoising and show that it is closely related to the convolutional network in Figure 1. [sent-138, score-1.315]
67 Note that by symmetry we have wi−j = wj−i , 5 Layer 1 Layer 2 Figure 4: Filters learned for the first 2 hidden layers of network CNBlind. [sent-140, score-0.352]
68 The second hidden layer has 192 filters (24 feature maps 8 filters per map). [sent-141, score-0.267]
69 Hence, P (v h) constitutes an undirected graphical model which can be conceptualized as having separate layers for the visible and hidden variables. [sent-144, score-0.258]
70 There are no intralayer interactions in the visible layer and convolutional structure (instead of full connectivity) in the intralayer interactions between hidden variables and interlayer interactions between the visible and hidden layer. [sent-145, score-1.112]
71 We can use this model for denoising by fixing the visible variables to the noisy image, computing the most likely hidden variables h by MAP inference, and regarding the conditional expectation of P (v h ) as the denoised image. [sent-148, score-0.523]
72 1, we find that this is equivalent to a convolutional network in which each hidden layer has the same weights and each feature map directly receives input from the image. [sent-152, score-0.957]
73 These results suggest that certain convolutional networks can be interpreted as performing approximate inference on MRF models designed for denoising. [sent-153, score-0.713]
74 In practice, the convolutional network architectures we train are not exactly related to such MRF models because the weights of each hidden layer are not constrained to be the same, nor is the image an input to any feature map except those in the first layer. [sent-154, score-1.3]
75 An interesting question for future research is how these additional architectural constraints would affect performance of the convolutional network approach. [sent-155, score-0.696]
76 5 Discussion Prior versus learned structure Before learning, the convolutional network has little structure specialized to natural images. [sent-157, score-0.796]
77 In contrast, the GSM model uses a multi-scale wavelet representation that is known for its suitability in 6 representing natural image statistics. [sent-158, score-0.344]
78 Moreover, inference in the FoE model uses a procedure similar to non-linear diffusion methods, which have been previously used for natural image processing without learning. [sent-159, score-0.335]
79 Random parameter settings of the convolutional networks do not produce any clearly useful computation. [sent-161, score-0.732]
80 If the parameters of CN2 are randomized in just the last layer, denoising performance for the image in Fig. [sent-162, score-0.585]
81 The advantage of this approach is that it may lead to more accurate performance, and can be applied to novel forms of imagery that have very different statistics than natural images or any previously studied dataset (an example of this is the specialized image restoration problem studied in [11]). [sent-172, score-0.541]
82 Network architecture and using more image context The amount of image context the convolutional network uses to produce an output value for a specific image location is determined by the number of layers in the network and size of filter in each layer. [sent-173, score-1.862]
83 For example, the 5 and 6-layer networks explored here respectively use a 20 × 20 and 24 × 24 image patch. [sent-174, score-0.362]
84 It is surprising that despite this major difference, the convolutional network approach still provides good performance. [sent-176, score-0.696]
85 One explanation could be that the scale of objects in the chosen image dataset may allow for most relevant information to be captured in a relatively small field of view. [sent-177, score-0.259]
86 Nonetheless, it is of interest for denoising as well as other applications to increase the amount of context used by the network. [sent-178, score-0.326]
87 Adding additional machinery in the network architecture may work better. [sent-180, score-0.237]
88 Integrating the operations of sub-sampling and super-sampling would allow a network to process the image at multiple scales, while still being entirely amenable to gradient learning. [sent-181, score-0.412]
89 Computational efficiency With many free parameters, convolutional networks may seem like a computationally expensive image processing architecture. [sent-182, score-0.964]
90 1) requires only 624 image convolutions to process an image. [sent-184, score-0.302]
91 We also report wall-clock speed by denoising a 512 × 512 pixel image on a 2. [sent-187, score-0.623]
92 It is true, however, that training the convolutional network architecture requires substantial computation. [sent-203, score-0.86]
93 As gradient learning can require many thousands of updates to converge, training the denoising networks required a parallel implementation that utilized a dozen processors for a week. [sent-204, score-0.538]
94 Learning more complex image transformations and generalized image attractors models In this work we have explored an image processing task which can be easily formulated as a learning problem by synthesizing training examples from abundantly available noiseless natural images. [sent-206, score-0.957]
95 That said, a major virtue of the image prior approach is the ability to easily reuse a single image model in novel situations by simply augmenting the prior with the appropriate observation model. [sent-210, score-0.587]
96 This is possible because the image prior and the observation model are decoupled. [sent-211, score-0.281]
97 Convolutional networks forgo probabilistic modeling and, as developed here, focus on specific image to image transformations as a regression problem. [sent-213, score-0.621]
98 It will be interesting to combine the two approaches to learn models that are “unnormalized priors” in the sense of energy-based image attractors; regression can then be used as a tool for unsupervised learning by capturing dependencies between variables within the same distribution [22]. [sent-214, score-0.284]
99 Image denoising using scale mixtures of Gaussians in the wavelet domain. [sent-224, score-0.371]
100 Fields of Experts: a framework for learning image priors. [sent-243, score-0.259]
wordName wordTfidf (topN-words)
[('convolutional', 0.574), ('foe', 0.347), ('denoising', 0.326), ('image', 0.259), ('psnr', 0.251), ('mrf', 0.176), ('images', 0.132), ('layer', 0.131), ('layers', 0.125), ('network', 0.122), ('architecture', 0.115), ('networks', 0.103), ('hidden', 0.083), ('ha', 0.078), ('cnblind', 0.077), ('gsm', 0.077), ('noise', 0.077), ('restoration', 0.072), ('architectures', 0.062), ('lters', 0.061), ('representational', 0.056), ('clean', 0.055), ('blind', 0.055), ('visible', 0.05), ('training', 0.049), ('wa', 0.048), ('attractors', 0.046), ('experts', 0.045), ('wavelet', 0.045), ('eld', 0.045), ('convolutions', 0.043), ('subjected', 0.043), ('denoised', 0.043), ('convolution', 0.042), ('field', 0.042), ('crf', 0.042), ('natural', 0.04), ('code', 0.04), ('mrfs', 0.039), ('bls', 0.039), ('hb', 0.039), ('intralayer', 0.039), ('monochrome', 0.039), ('portilla', 0.039), ('sixty', 0.039), ('thirty', 0.039), ('specialized', 0.038), ('pixel', 0.038), ('cvpr', 0.038), ('inference', 0.036), ('elds', 0.036), ('lecun', 0.035), ('turaga', 0.034), ('tappen', 0.034), ('wab', 0.034), ('specially', 0.034), ('adelson', 0.031), ('filters', 0.031), ('gradient', 0.031), ('maps', 0.031), ('fields', 0.03), ('vi', 0.029), ('thousands', 0.029), ('settings', 0.028), ('free', 0.028), ('culties', 0.028), ('color', 0.027), ('howard', 0.027), ('berkeley', 0.027), ('produce', 0.027), ('roth', 0.026), ('claim', 0.026), ('database', 0.026), ('learn', 0.025), ('backpropagation', 0.025), ('noiseless', 0.025), ('augmenting', 0.025), ('map', 0.025), ('tasks', 0.025), ('epochs', 0.024), ('intensities', 0.023), ('entire', 0.023), ('power', 0.023), ('inef', 0.023), ('patch', 0.023), ('nips', 0.022), ('learned', 0.022), ('regularities', 0.022), ('prior', 0.022), ('feature', 0.022), ('train', 0.022), ('severe', 0.021), ('analog', 0.021), ('hundred', 0.021), ('multiscale', 0.021), ('regarding', 0.021), ('accept', 0.021), ('interactions', 0.021), ('task', 0.02), ('deep', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 148 nips-2008-Natural Image Denoising with Convolutional Networks
Author: Viren Jain, Sebastian Seung
Abstract: We present an approach to low-level vision that combines two main ideas: the use of convolutional networks as an image processing architecture and an unsupervised learning procedure that synthesizes training samples from specific noise models. We demonstrate this approach on the challenging problem of natural image denoising. Using a test set with a hundred natural images, we find that convolutional networks provide comparable and in some cases superior performance to state of the art wavelet and Markov random field (MRF) methods. Moreover, we find that a convolutional network offers similar performance in the blind denoising setting as compared to other techniques in the non-blind setting. We also show how convolutional networks are mathematically related to MRF approaches by presenting a mean field theory for an MRF specially designed for image denoising. Although these approaches are related, convolutional networks avoid computational difficulties in MRF approaches that arise from probabilistic learning and inference. This makes it possible to learn image processing architectures that have a high degree of representational power (we train models with over 15,000 parameters), but whose computational expense is significantly less than that associated with inference in MRF approaches with even hundreds of parameters. 1 Background Low-level image processing tasks include edge detection, interpolation, and deconvolution. These tasks are useful both in themselves, and as a front-end for high-level visual tasks like object recognition. This paper focuses on the task of denoising, defined as the recovery of an underlying image from an observation that has been subjected to Gaussian noise. One approach to image denoising is to transform an image from pixel intensities into another representation where statistical regularities are more easily captured. For example, the Gaussian scale mixture (GSM) model introduced by Portilla and colleagues is based on a multiscale wavelet decomposition that provides an effective description of local image statistics [1, 2]. Another approach is to try and capture statistical regularities of pixel intensities directly using Markov random fields (MRFs) to define a prior over the image space. Initial work used handdesigned settings of the parameters, but recently there has been increasing success in learning the parameters of such models from databases of natural images [3, 4, 5, 6, 7, 8]. Prior models can be used for tasks such as image denoising by augmenting the prior with a noise model. Alternatively, an MRF can be used to model the probability distribution of the clean image conditioned on the noisy image. This conditional random field (CRF) approach is said to be discriminative, in contrast to the generative MRF approach. Several researchers have shown that the CRF approach can outperform generative learning on various image restoration and labeling tasks [9, 10]. CRFs have recently been applied to the problem of image denoising as well [5]. 1 The present work is most closely related to the CRF approach. Indeed, certain special cases of convolutional networks can be seen as performing maximum likelihood inference on a CRF [11]. The advantage of the convolutional network approach is that it avoids a general difficulty with applying MRF-based methods to image analysis: the computational expense associated with both parameter estimation and inference in probabilistic models. For example, naive methods of learning MRFbased models involve calculation of the partition function, a normalization factor that is generally intractable for realistic models and image dimensions. As a result, a great deal of research has been devoted to approximate MRF learning and inference techniques that meliorate computational difficulties, generally at the cost of either representational power or theoretical guarantees [12, 13]. Convolutional networks largely avoid these difficulties by posing the computational task within the statistical framework of regression rather than density estimation. Regression is a more tractable computation and therefore permits models with greater representational power than methods based on density estimation. This claim will be argued for with empirical results on the denoising problem, as well as mathematical connections between MRF and convolutional network approaches. 2 Convolutional Networks Convolutional networks have been extensively applied to visual object recognition using architectures that accept an image as input and, through alternating layers of convolution and subsampling, produce one or more output values that are thresholded to yield binary predictions regarding object identity [14, 15]. In contrast, we study networks that accept an image as input and produce an entire image as output. Previous work has used such architectures to produce images with binary targets in image restoration problems for specialized microscopy data [11, 16]. Here we show that similar architectures can also be used to produce images with the analog fluctuations found in the intensity distributions of natural images. Network Dynamics and Architecture A convolutional network is an alternating sequence of linear filtering and nonlinear transformation operations. The input and output layers include one or more images, while intermediate layers contain “hidden
2 0.14379662 118 nips-2008-Learning Transformational Invariants from Natural Movies
Author: Charles Cadieu, Bruno A. Olshausen
Abstract: We describe a hierarchical, probabilistic model that learns to extract complex motion from movies of the natural environment. The model consists of two hidden layers: the first layer produces a sparse representation of the image that is expressed in terms of local amplitude and phase variables. The second layer learns the higher-order structure among the time-varying phase variables. After training on natural movies, the top layer units discover the structure of phase-shifts within the first layer. We show that the top layer units encode transformational invariants: they are selective for the speed and direction of a moving pattern, but are invariant to its spatial structure (orientation/spatial-frequency). The diversity of units in both the intermediate and top layers of the model provides a set of testable predictions for representations that might be found in V1 and MT. In addition, the model demonstrates how feedback from higher levels can influence representations at lower levels as a by-product of inference in a graphical model. 1
3 0.13549665 191 nips-2008-Recursive Segmentation and Recognition Templates for 2D Parsing
Author: Leo Zhu, Yuanhao Chen, Yuan Lin, Chenxi Lin, Alan L. Yuille
Abstract: Language and image understanding are two major goals of artificial intelligence which can both be conceptually formulated in terms of parsing the input signal into a hierarchical representation. Natural language researchers have made great progress by exploiting the 1D structure of language to design efficient polynomialtime parsing algorithms. By contrast, the two-dimensional nature of images makes it much harder to design efficient image parsers and the form of the hierarchical representations is also unclear. Attempts to adapt representations and algorithms from natural language have only been partially successful. In this paper, we propose a Hierarchical Image Model (HIM) for 2D image parsing which outputs image segmentation and object recognition. This HIM is represented by recursive segmentation and recognition templates in multiple layers and has advantages for representation, inference, and learning. Firstly, the HIM has a coarse-to-fine representation which is capable of capturing long-range dependency and exploiting different levels of contextual information. Secondly, the structure of the HIM allows us to design a rapid inference algorithm, based on dynamic programming, which enables us to parse the image rapidly in polynomial time. Thirdly, we can learn the HIM efficiently in a discriminative manner from a labeled dataset. We demonstrate that HIM outperforms other state-of-the-art methods by evaluation on the challenging public MSRC image dataset. Finally, we sketch how the HIM architecture can be extended to model more complex image phenomena. 1
4 0.12736462 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
Author: Xuming He, Richard S. Zemel
Abstract: Extensive labeled data for image annotation systems, which learn to assign class labels to image regions, is difficult to obtain. We explore a hybrid model framework for utilizing partially labeled data that integrates a generative topic model for image appearance with discriminative label prediction. We propose three alternative formulations for imposing a spatial smoothness prior on the image labels. Tests of the new models and some baseline approaches on three real image datasets demonstrate the effectiveness of incorporating the latent structure. 1
5 0.12341802 158 nips-2008-Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks
Author: Alex Graves, Jürgen Schmidhuber
Abstract: Offline handwriting recognition—the automatic transcription of images of handwritten text—is a challenging task that combines computer vision with sequence learning. In most systems the two elements are handled separately, with sophisticated preprocessing techniques used to extract the image features and sequential models such as HMMs used to provide the transcriptions. By combining two recent innovations in neural networks—multidimensional recurrent neural networks and connectionist temporal classification—this paper introduces a globally trained offline handwriting recogniser that takes raw pixel data as input. Unlike competing systems, it does not require any alphabet specific preprocessing, and can therefore be used unchanged for any language. Evidence of its generality and power is provided by data from a recent international Arabic recognition competition, where it outperformed all entries (91.4% accuracy compared to 87.2% for the competition winner) despite the fact that neither author understands a word of Arabic. 1
6 0.1116423 56 nips-2008-Deep Learning with Kernel Regularization for Visual Recognition
7 0.095219791 42 nips-2008-Cascaded Classification Models: Combining Models for Holistic Scene Understanding
8 0.091113038 246 nips-2008-Unsupervised Learning of Visual Sense Models for Polysemous Words
9 0.086385369 208 nips-2008-Shared Segmentation of Natural Scenes Using Dependent Pitman-Yor Processes
10 0.086309962 3 nips-2008-A Massively Parallel Digital Learning Processor
11 0.082657896 200 nips-2008-Robust Kernel Principal Component Analysis
12 0.0803615 119 nips-2008-Learning a discriminative hidden part model for human action recognition
13 0.080085687 103 nips-2008-Implicit Mixtures of Restricted Boltzmann Machines
14 0.07824865 95 nips-2008-Grouping Contours Via a Related Image
15 0.077388562 62 nips-2008-Differentiable Sparse Coding
16 0.074180692 77 nips-2008-Evaluating probabilities under high-dimensional latent variable models
17 0.070241787 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features
18 0.06857422 207 nips-2008-Shape-Based Object Localization for Descriptive Classification
19 0.066390812 142 nips-2008-Multi-Level Active Prediction of Useful Image Annotations for Recognition
20 0.064639479 226 nips-2008-Supervised Dictionary Learning
topicId topicWeight
[(0, -0.182), (1, -0.086), (2, 0.134), (3, -0.101), (4, 0.037), (5, 0.009), (6, -0.113), (7, -0.1), (8, 0.035), (9, 0.026), (10, 0.007), (11, 0.006), (12, 0.052), (13, 0.013), (14, -0.069), (15, -0.191), (16, 0.0), (17, -0.079), (18, 0.014), (19, -0.099), (20, -0.052), (21, -0.024), (22, 0.036), (23, -0.001), (24, 0.01), (25, 0.072), (26, -0.031), (27, 0.072), (28, -0.062), (29, -0.032), (30, -0.02), (31, 0.02), (32, -0.071), (33, 0.072), (34, -0.019), (35, 0.039), (36, 0.009), (37, -0.043), (38, 0.186), (39, -0.013), (40, -0.027), (41, 0.043), (42, 0.06), (43, 0.083), (44, -0.03), (45, -0.054), (46, 0.029), (47, 0.054), (48, 0.033), (49, -0.134)]
simIndex simValue paperId paperTitle
same-paper 1 0.95059866 148 nips-2008-Natural Image Denoising with Convolutional Networks
Author: Viren Jain, Sebastian Seung
Abstract: We present an approach to low-level vision that combines two main ideas: the use of convolutional networks as an image processing architecture and an unsupervised learning procedure that synthesizes training samples from specific noise models. We demonstrate this approach on the challenging problem of natural image denoising. Using a test set with a hundred natural images, we find that convolutional networks provide comparable and in some cases superior performance to state of the art wavelet and Markov random field (MRF) methods. Moreover, we find that a convolutional network offers similar performance in the blind denoising setting as compared to other techniques in the non-blind setting. We also show how convolutional networks are mathematically related to MRF approaches by presenting a mean field theory for an MRF specially designed for image denoising. Although these approaches are related, convolutional networks avoid computational difficulties in MRF approaches that arise from probabilistic learning and inference. This makes it possible to learn image processing architectures that have a high degree of representational power (we train models with over 15,000 parameters), but whose computational expense is significantly less than that associated with inference in MRF approaches with even hundreds of parameters. 1 Background Low-level image processing tasks include edge detection, interpolation, and deconvolution. These tasks are useful both in themselves, and as a front-end for high-level visual tasks like object recognition. This paper focuses on the task of denoising, defined as the recovery of an underlying image from an observation that has been subjected to Gaussian noise. One approach to image denoising is to transform an image from pixel intensities into another representation where statistical regularities are more easily captured. For example, the Gaussian scale mixture (GSM) model introduced by Portilla and colleagues is based on a multiscale wavelet decomposition that provides an effective description of local image statistics [1, 2]. Another approach is to try and capture statistical regularities of pixel intensities directly using Markov random fields (MRFs) to define a prior over the image space. Initial work used handdesigned settings of the parameters, but recently there has been increasing success in learning the parameters of such models from databases of natural images [3, 4, 5, 6, 7, 8]. Prior models can be used for tasks such as image denoising by augmenting the prior with a noise model. Alternatively, an MRF can be used to model the probability distribution of the clean image conditioned on the noisy image. This conditional random field (CRF) approach is said to be discriminative, in contrast to the generative MRF approach. Several researchers have shown that the CRF approach can outperform generative learning on various image restoration and labeling tasks [9, 10]. CRFs have recently been applied to the problem of image denoising as well [5]. 1 The present work is most closely related to the CRF approach. Indeed, certain special cases of convolutional networks can be seen as performing maximum likelihood inference on a CRF [11]. The advantage of the convolutional network approach is that it avoids a general difficulty with applying MRF-based methods to image analysis: the computational expense associated with both parameter estimation and inference in probabilistic models. For example, naive methods of learning MRFbased models involve calculation of the partition function, a normalization factor that is generally intractable for realistic models and image dimensions. As a result, a great deal of research has been devoted to approximate MRF learning and inference techniques that meliorate computational difficulties, generally at the cost of either representational power or theoretical guarantees [12, 13]. Convolutional networks largely avoid these difficulties by posing the computational task within the statistical framework of regression rather than density estimation. Regression is a more tractable computation and therefore permits models with greater representational power than methods based on density estimation. This claim will be argued for with empirical results on the denoising problem, as well as mathematical connections between MRF and convolutional network approaches. 2 Convolutional Networks Convolutional networks have been extensively applied to visual object recognition using architectures that accept an image as input and, through alternating layers of convolution and subsampling, produce one or more output values that are thresholded to yield binary predictions regarding object identity [14, 15]. In contrast, we study networks that accept an image as input and produce an entire image as output. Previous work has used such architectures to produce images with binary targets in image restoration problems for specialized microscopy data [11, 16]. Here we show that similar architectures can also be used to produce images with the analog fluctuations found in the intensity distributions of natural images. Network Dynamics and Architecture A convolutional network is an alternating sequence of linear filtering and nonlinear transformation operations. The input and output layers include one or more images, while intermediate layers contain “hidden
2 0.7860024 158 nips-2008-Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks
Author: Alex Graves, Jürgen Schmidhuber
Abstract: Offline handwriting recognition—the automatic transcription of images of handwritten text—is a challenging task that combines computer vision with sequence learning. In most systems the two elements are handled separately, with sophisticated preprocessing techniques used to extract the image features and sequential models such as HMMs used to provide the transcriptions. By combining two recent innovations in neural networks—multidimensional recurrent neural networks and connectionist temporal classification—this paper introduces a globally trained offline handwriting recogniser that takes raw pixel data as input. Unlike competing systems, it does not require any alphabet specific preprocessing, and can therefore be used unchanged for any language. Evidence of its generality and power is provided by data from a recent international Arabic recognition competition, where it outperformed all entries (91.4% accuracy compared to 87.2% for the competition winner) despite the fact that neither author understands a word of Arabic. 1
3 0.62304378 3 nips-2008-A Massively Parallel Digital Learning Processor
Author: Hans P. Graf, Srihari Cadambi, Venkata Jakkula, Murugan Sankaradass, Eric Cosatto, Srimat Chakradhar, Igor Dourdanovic
Abstract: We present a new, massively parallel architecture for accelerating machine learning algorithms, based on arrays of vector processing elements (VPEs) with variable-resolution arithmetic. Groups of VPEs operate in SIMD (single instruction multiple data) mode, and each group is connected to an independent memory bank. The memory bandwidth thus scales with the number of VPEs, while the main data flows are local, keeping power dissipation low. With 256 VPEs, implemented on two FPGAs (field programmable gate array) chips, we obtain a sustained speed of 19 GMACS (billion multiplyaccumulate per sec.) for SVM training, and 86 GMACS for SVM classification. This performance is more than an order of magnitude higher than that of any FPGA implementation reported so far. The speed on one FPGA is similar to the fastest speeds published on a Graphics Processor for the MNIST problem, despite a clock rate that is an order of magnitude lower. Tests with Convolutional Neural Networks show similar compute performances. This massively parallel architecture is particularly attractive for embedded applications, where low power dissipation is critical. 1 I n trod u cti on Machine learning demands higher and higher compute-performance, but serial processors are not improving that much anymore - at least not as quickly as they used to. Mainstream processor development is moving to multi-core systems, using shared memory technology to hide the parallel nature of the processors. But shared memory technology does not scale to hundreds or thousands of cores. In order to reach such levels of parallelization alternative approaches have to be developed. Massively parallel general-purpose computers had limited success so far, because of difficulties programming these machines, and they remain a niche market, mostly in highperformance computing. Yet processors specialized for certain application domains, such as graphics processors or routing processors 1, have been parallelized to several hundred cores and are successful mass products. They improve performance over general-purpose processors by focusing on a few key algorithmic elements, yet still maintain enough flexibility that they can be programmed for a variety of applications. We explore in this paper if a similar approach can lead to efficient machine learning processors. 1 e.g. Nvidia, Quadro FX 5600 graphics processor; Cisco, CRS-1 routing processor Several processors optimized for machine learning, in particular for neural networks, were developed during the 1980’s and 90’s. Examples are the Synapse-1 architecture [1], or the Connectionist Network Supercomputer, CNS1 [2]. Recently there has been less activity in this field, but some accelerators are sold today for specific applications, such as the Axeon [3] processor for power train control of cars. Beside digital processors a large number of analog circuits were built, emulating neural network structures. Extremely high performance with low power dissipation is achievable, see e.g. [4][5], but these networks have little flexibility. SVM implementations on FPGA have been demonstrated in recent years [6-8], yet reached only low compute-performances. All machine learning processors had only limited success so far, indicating how difficult it is to find a good combination of performance, flexibility, price and ease of use. An important consideration is that many applications of machine learning, such as video analysis, data mining, or personalization of services, show the most promise in embedded systems. Embedded learning requires high compute performance while dissipating little power, a combination that is difficult to achieve, and so far required application specific IC (ASIC). Our aim is to develop architectures that meet the requirements for embedded learning, but are programmable and therefore can be used in a wide range of applications. With the goal of analyzing different architectures we designed a development and testing environment where the parallel computation is mapped onto FPGA’s. Initially this system was intended only for experimentation, but its performance is so high that this platform is useful in its own right as accelerator for high-performance systems. While the experiments shown here emphasize high performance, the architecture has been designed from the start for low power dissipation. The main features for achieving this goal are: low-resolution arithmetic, keeping the main data flow local, low operating frequencies, and a modular design, so that unused parts can be powered down dynamically. All results shown here are from the test platform; migration to lowpower FPGA or chip designs are done in a later stage. 2 Al gori th ms - A ri th meti c - A rch i te ctu re For a substantial improvement over a general purpose processor, the algorithms, the arithmetic units, as well as the architecture have to be optimized simultaneously. This is not just an exercise in hardware design, but algorithms and their software implementations have to be developed concurrently. Most machine learning algorithms have not been developed with parallelization in mind. Therefore, we first need to find good parallel versions, identify their performance bottlenecks, and then extract common computational patterns that can be mapped into accelerator hardware. 2.1 Algorithms Characteristic for machine learning is that large amounts of data need to be processed, often with predictable data access patterns and no dependency between operations over large segments of the computation. This is why data-parallelization can often provide good accelerations on multi-core chips, clusters of machines, or even on loosely coupled networks of machines. Using MapReduce, speedups linear with the number of processors have been reported in [9] for several machine learning algorithms. Up to 16 cores were tested, and simulations indicate good scaling to more processors in some cases. Many algorithms, such as KNN, K-means clustering, LVQ, and Neural Networks can be reduced to forms where the computation is dominated by vector-matrix multiplications, which are easily parallelizable. For Convolutional Neural Networks (CNN) the data flow can be complex, yet the core of the computation is a convolution, an operation which has been studied extensively for parallel implementations. For Support Vector Machines (SVM), several parallel algorithms were described, but most saturate quickly for more than 16 processors. Scaling to larger numbers of processors has been demonstrated, applying MapReduce on a graphics processor with 128 cores [10]. Another implementation on a cluster of 48 dual-core machines (with 384 MMX units) [11] scales even super-linearly, and, according to simulations, scales to thousands of cores. Based on this analysis it is clear that vector-matrix and matrix-matrix multiplications for large vector dimensionalities and large numbers of vectors must be handled efficiently. Yet this alone is not sufficient since data access patterns vary greatly between algorithms. We analyze this here in more detail for SVM and CNN. These algorithms were chosen, because they are widely used for industrial applications and cover a broad range of computation, I/O, and memory requirements. The characteristics of the SVM training are summarized in Table 1. We use an approach similar to the one described in [11] to split different parts of the computation between a host CPU and the FPGA accelerator. For large dimensions d of the vectors the calculation of the columns of the kernel matrix dominates by far. This is needed to update the gradients, and in the present implementation, only this part is mapped onto the FPGA. If the dimensionality d is smaller than around 100, operations 2 and 5 can become bottlenecks and should also be mapped onto the accelerator. Challenging is that for each kernel computation a new data vector has to be loaded 4 into the processor, leading to very high I/O requirements. We consider here dimensions of 10 - 10 5 7 and numbers of training data of 10 - 10 , resulting easily in Gigabytes that need to be transferred to the processors at each iteration. 1 2 3 4 5 6 Operation Initialize all αx, Gx Do Find working set αi, αj Update αi, αj Get 2 columns of kernel matrix Update gradients Gx While not converged Computation 2n IO 2n Unit CPU I I I I I * 2n I*2 I * (2d+2dn) I*n CPU CPU FPGA CPU * * * * 2n 10 2nd n Table 1: Compute- and IO-requirements of each step for SVM training (SMO algorithm). n: number of training data; d: dimension of the vectors; G: gradients; α: support vector factors; I: number of iterations. The last column indicates whether the execution happens on the host CPU or the accelerator FPGA. It is assumed that the kernel computation requires a dot product between vectors (e.g. rbf, polynomial, tanh kernels). Neural network algorithms are essentially sequences of vector-matrix multiplications, but networks with special connectivity patterns, such as convolutional networks have very different IO characteristics than fully connected networks. Table 2 shows the computation and IO requirements for scanning several convolution kernels over one input plane. A full network requires multiple of these operations for one layer, with nonlinearities between layers. We map all operations onto the FPGA accelerator, since intermediate results are re-used right away. The most significant 2 difference to between the SVM and CNN is the Compute/IO ratio: SVM: ~ 1; CNN: ~ L*k > 100. Therefore the requirements for these two algorithms are very different, and handling both cases efficiently is quite a challenge for an architecture design. Operation Load L kernels For all input pixels Shift in new pixel Multiply kernels Shift out result 1 2 3 4 Computation IO 2 L* k n* m 2 n*m*L*k n*m Unit FPGA FPGA FPGA FPGA FPGA Table 2: Compute- and IO-requirements for CNN computation (forward pass), where l kernels of size k*k are scanned simultaneously over an input plane of size n*m. This is representative for implementations with kernel unrolling (kernel pixels processed in parallel). Internal shifts, computation of the non-linearity, and border effects not shown. 2.2 Arithmetic Hardware can be built much more compactly and runs with lower power dissipation, if it uses fixed-point instead of floating-point operations. Fortunately, many learning algorithms tolerate a low resolution in most of the computations. This has been investigated extensively for neural networks [12][13], but less so for other learning algorithms. Learning from data is inherently a noisy process, because we see only a sparse sampling of the true probability distributions. A different type of noise is introduced in gradient descent algorithms, when only a few training data are used at a time to move the optimization forward iteratively. This noise is particularly pronounced for stochastic gradient descent. There is no point in representing noisy variables with high resolution, and it is therefore a property inherent to many algorithms that low-resolution computation can be used. It is important, not to confuse this tolerance to low resolution with the resolution required to avoid numeric instabilities. Some of the computations have to be performed with a high resolution, in particular for variables that are updated incrementally. They maintain the state of the optimization and may change in very small steps. But usually by far the largest part of the computation can be executed at a low resolution. Key is that the hardware is flexible enough and can take advantage of reduced resolution while handling high resolution where necessary. Problem Adult Forest MNIST NORB Kernel: Float Obj. f. # SV 31,930.77 11,486 653,170.7 49,333 4,960.13 6,172 1,243.71 3,077 F-score 77.58 98.29 99.12 93.34 Kernel: 16 bit fixed point Obj. f. # SV F-score 31,930.1 11,490 77.63 652,758 49,299 98.28 4,959.64 6,166 99.11 1,244.76 3,154 93.26 F-sc. (4b in) NA NA 99.11 92.78 Table 3: Comparison of the results of SVM training when the kernels are represented with floating point numbers (32 or 64 bits) (left half) and with 16 bit fixed point (right half). The last column shows the results when the resolution of the training data is reduced from 8 bit to 4 bit. For NORB this reduces the accuracy; all other differences in accuracy are not significant. All are two class problems: Adult: n=32,562, d=122; Forest: n=522,000, d=54 (2 against the rest); MNIST: n=60,000, d=784 (odd–even); NORB: n=48,560, d=5,184. We developed a simulator that allows running the training algorithms with various resolutions in each of the variables. A few examples for SVM training are shown in Table 3. Reducing the resolution of the kernel values from double or float to 16 bit fixed point representations does not affect the accuracy for any of the problems. Therefore all the multiplications in the dot products for the kernel computation can be done in low resolutions (4–16 bit in the factors), but the accumulator needs sufficient resolution to avoid over/under flow (48 bit). Once the calculation of the kernel value is completed, it can be reduced to 16 bit. A low resolution of 16 bit is also tolerable for the α values, but a high resolution is required for the gradients (double). For Neural Networks, including CNN, several studies have confirmed that states and gradients can be kept at low resolutions (<16 bit), but the weights must be maintained at a high resolution (float) (see e.g. [12]). In our own evaluations 24 bits in the weights tend to be sufficient. Once the network is trained, for the classification low resolutions can be used for the weights as well (<16 bit). 2.3 A rc h i t e c t u re Figure 1: Left: Schematic of the architecture with the main data flows; on one FPGA 128 VPE are configured into four SIMD groups; L-S: Load-store units. Right: Picture of an FPGA board; in our experiments one or two of them are used, connected via PCI bus to a host CPU. Based on the analysis above, it is clear that the architecture must be optimized for processing massive amounts of data with relatively low precision. Most of the time, data access patterns are predictable and data are processed in blocks that can be stored contiguously. This type of computation is well suited for vector processing, and simple vector processing elements (VPE) with fixed-point arithmetic can handle the operations. Since typically large blocks of data are processed with the same operation, groups of VPE can work in SIMD (single instruction multiple data) mode. Algorithms must then be segmented to map the highvolume, low precision parts onto the vector accelerators and parts requiring high precision arithmetic onto the CPU. The most important design decision is the organization of the memory. Most memory accesses are done in large blocks, so that the data can be streamed, making complex caching unnecessary. This is fortunate, since the amounts of data to be loaded onto the processor are so large that conventional caching strategies would be overwhelmed anyway. Because the blocks tend to be large, a high data bandwidth is crucial, but latency for starting a block transfer is less critical. Therefore we can use regular DDR memories and still get high IO rates. This led to the design shown schematically in Figure 1, where independent memory banks are connected via separate IO ports for each group of 32 VPE. By connecting multiple of the units shown in Figure 1 to a CPU, this architecture scales to larger numbers of VPE. Parallel data IO and parallel memory access scale simultaneously with the number of parallel cores, and we therefore refer to this as the P3 (P-cube) architecture. Notice also that the main data flow is only local between a group of VPE and its own memory block. Avoiding movements of data over long distances is crucial for low power dissipation. How far this architecture can reasonably scale with one CPU depends on the algorithms, the amount of data and the vector dimensionality (see below). A few hundred VPE per CPU have provided good accelerations in all our tests, and much higher numbers are possible with multi-core CPUs and faster CPU-FPGA connections. 3 I mp l e men tati on of th e P 3 A rch i t ectu re This architecture fits surprisingly well onto some of the recent FPGA chips that are available with several hundred Digital Signal Processors (DSP) units and over 1,000 IO pins for data transfers. The boards used here contain each one Xilinx Virtex 5 LX330T-2 FPGA coupled to 4 independent DDR2 SDRAM with a total of 1GB, and 2 independent 4MB SSRAM memory banks (commercial board from AlphaData). One FPGA chip contains 192 DSP with a maximum speed of 550MHz, which corresponds to a theoretical compute-performance of 105.6 GMACS (18 bit and 25 bit operands). There is a total of 14 Mbit of on-chip memory, and the chip incorporates 960 pins for data IO. Due to routing overhead, not all DSP units can be used and the actual clock frequencies tend to be considerably lower than what is advertised for such chips (typically 230MHz or less for our designs). Nevertheless, we obtain high performances because we can use a large number of DSP units for executing the main computation. The main architecture features are: • Parallel processing (on one chip): 128 VPE (hardware DSP) are divided into 4 blocks of 32, each group controlled by one sequencer with a vector instruction set. • Custom Precision: Data are represented with 1 to 16 bit resolution. Higher resolutions are possible by operating multiple DSP as one processor. • Overlapping Computation and Communication: CPU-FPGA communication is overlapped with the FPGA computation. • Overlap Memory Operations with Computation: All loads and stores from the FPGA to off-chip memory are performed concurrently with computations. • High Off-chip Memory Bandwidth: 6 independent data ports, each 32 bits wide, access banked memories concurrently (12GB/s per chip). • • Streaming Data Flow, Simple Access Patterns: Load/store units are tailored for streaming input and output data, and for simple, bursty access patterns. Caching is done under application control with dual-port memory on chip. Load/store with (de)compression: For an increase of effective IO bandwidth the load/store units provide compression and decompression in hardware. Figure 2 shows the configuration of the VPEs for vector dot product computation used for SVM training and classification. For training, the main computation is the calculation of one column of the kernel matrix. One vector is pre-fetched and stored in on-chip memory. All other vectors are streamed in from off-chip memory banks 1-4. Since this is a regular and predictable access pattern, we can utilize burst-mode, achieving a throughput of close to one memory word per cycle. But the speed is nevertheless IO bound. When several vectors can be stored on-chip, as is the case for classification, then the speed becomes compute-bound. Figure 2: Architecture for vector dot-product computation. The left side shows a high-level schematic with the main data flow. The data are streamed from memory banks 1-4 to the VPE arrays, while memory banks 5 and 6, alternatively receive results or stream them back to the host. The right side shows how a group of VPE is pipelined to improve clock speed. The operation for SVM training on the FPGA corresponds to a vector-matrix multiplication and the one for classification to a matrix-matrix multiplication. Therefore the configuration of Figure 2 is useful for many other algorithms as well, where operations with large vectors and matrices are needed, such as Neural Networks. We implemented a specialized configuration for Convolutional Neural Networks, for more efficiency and lower power dissipation. The VPE are daisy-chained and operate as systolic array. In this way we can take advantage of the high computation to IO ratio (Table 2) to reduce the data transfers from memory. 4 E val u ati on s We evaluated SVM training and classification with the NORB and MNIST problems, the latter with up to 2 million training samples (data from [11]). Both are benchmarks with vectors of high dimensionality, representative for applications in image and video analysis. The computation is split between CPU and FPGA as indicated by Table 1. The DDR2 memory banks are clocked at 230MHz, providing double that rate for data transfers. The data may be compressed to save IO bandwidth. On the FPGA they are decompressed first and distributed to the VPE. In our case, a 32 bit word contains eight 4-bit vector components. Four 32 bit words are needed to feed all 32 VPEs of a group; therefore clocking the VPE faster than 115MHz does not improve performance. A VPE executes a multiplication plus add operation in one clock cycle, resulting in a theoretical maximum of 14.7 GMACS per chip. The sustained compute-rate is lower, about 9.4 GMACS, due to overhead (see Table 4). The computation on the host CPU overlaps with that on the FPGA, and has no effect on the speed in the experiments shown here. For the classification the VPE can be clocked higher, at 230 MHz. By using 4-bit operands we can execute 2 multiply-accumulates simultaneously on one DSP, resulting in speed that is more than four times higher and a sustained 43.0 GMACS limited by the number and speed of the VPE. Adding a second FPGA card doubles the speed, showing little saturation effects yet, but for more FPGA per CPU there will be saturation (see Fig. 3). The compute speed in GMACS obtained for NORB is almost identical. # 60k 2M Iterations 8,000 266,900 CPU time 754s -- speed 0.5 -- CPU+MMX time speed 240 s 1.57 531,534 s 1.58 CPU+FPGA time speed 40 s 9.42 88,589 s 9.48 CPU+2 FPGA time speed 21 s 17.9 48,723 s 17.2 Table 4: Training times and average compute speed for SVM training. Systems tested: CPU, Opteron, 2.2GHz; CPU using MMX; CPU with one FPGA; CPU with two FPGA boards. Results are shown for training sizes of 60k and 2M samples. Compute speed is in GMACS (just kernel computations). Training algorithm: SMO with second order working set selection. Parallelizations of SVM training have been reported recently for a GPU [10] and for a cluster [11], both using the MNIST data. In [10] different bounds for stopping were used than here and in [11]. Nevertheless, a comparison of the compute performance is possible, because based on the number of iterations we can compute the average GMACS for the kernel computations. As can be seen in Table 5 a single FPGA is similar in speed to a GPU with 128 stream processors, despite a clock rate that is about 5.5 times lower for I/O and 11 times lower for the VPE. The cluster with 384 MMX units is about 6 times faster than one FPGA with 128 VPE, but dissipates about two orders of magnitude more electric power. For the FPGA this calculation includes only the computation of the kernel values while the part on the CPU is neglected. This is justified for this study, because the rest of the calculations can be mapped on the FPGA as well and will increase the power dissipation only minimally. Number Clock Operand Power Average of cores speed type dissipation compute speed CPU (Opteron) 1 2.2 GHz float 40 W 0.5 GMACS GPU (from [10]) 128 1.35 GHz float 80 W 7.4 GMACS Cluster (from [11]) 384 1.6 GHz byte > 1 kW 54 GMACS FPGA 128 0.12 GHz 4 bit nibble 9W 9.4 GMACS Table 5: Comparison of performances for SVM training (MNIST data). GPU: Nvidia 8800 GTX. Cluster: 48 dual core CPU (Athlon), 384 MMX units. The GPU was training with 60k samples ([10], table 2, second order), the cluster trained with 2 million samples. Processor Figure 3: Acceleration of SVM training as a function of the number of VPE. MNIST n: 2,000,000, d=784; NORB: n=48,560, d=5,184. The points for 128 and 256 VPE are experimental, the higher ones are simulations. Curves MNIST, NORB: Multiple FPGA are attached to one CPU. Curve MNIST C: Each FPGA is attached to a separate host CPU. Scaling of the acceleration with the number of VPEs is shown in Figure 3. The reference speed is that of one FPGA attached to a CPU. The evaluation has been done experimentally for 128 and 256 VPEs, and beyond that with a simulator. The onset of saturation depends on the dimensionality of the vectors, but to a much lesser extent on the number of training vectors (up to the limit of the memory on the FPGA card). MNIST saturates for more than two FPGAs because then the CPU and FPGA computation times become comparable. For the larger vectors of NORB (d=5,184) this saturation starts to be noticeable for more than 4 FPGA. Alternatively, a system can be scaled by grouping multiple CPU, each with one attached FPGA accelerator. Then the scaling follows a linear or even super-linear acceleration (MNIST C) to several thousand VPE. If the CPUs are working in a cluster arrangement, the scaling is similar to the one described in [11]. For convolutional neural networks, the architecture of Figure 2 is modified to allow a block of VPE to operate as systolic array. In this way convolutions can be implemented with minimal data movements. In addition to the convolution, also sub-sampling and non-linear functions plus the logistics to handle multiple layers with arbitrary numbers of kernels in each layer are done on the FPGA. Four separate blocks of such convolvers are packed onto one FPGA, using 100 VPE. Clocked at 115MHz, this architecture provides a maximum of 11.5 GMACS. Including all the overhead the sustained speed is about 10 GMACS. 5 Con cl u s i on s By systematically exploiting characteristic properties of machine learning algorithms, we developed a new massively parallel processor architecture that is very efficient and can be scaled to thousands of processing elements. The implementation demonstrated here is more than an order of magnitude higher in performance than previous FPGA implementations of SVM or CNN. For the MNIST problem it is comparable to the fastest GPU implementations reported so far. These results underline the importance of flexibility over raw compute-speed for massively parallel systems. The flexibility of the FPGA allows more efficient routing and packing of the data and the use of computations with the lowest resolution an algorithm permits. The results of Table 5 indicate the potential of this architecture for low-power operation in embedded applications. R e f e re n c e s [1] Ramacher, et al. (1995) Synapse-1: A high-speed general purpose parallel neurocomputer system. In Proc. 9th Intl. Symposium on Parallel Processing (IPPS'95), pp. 774-781. [2] Asanovic, K., Beck, Feldman, J., Morgan, N. & Wawrzynek, J. (1994) A Supercomputer for Neural Computation, Proc. IEEE Intl. Joint Conference on Neural Networks, pp. 5-9, Orlando, Florida. [3] Neil, P., (2005) Combining hardware with a powerful automotive MCU for powertrain applications. In Industrial Embedded Resource Guide, p. 88. [4] Korekado, et al. (2003) A Convolutional Neural Network VLSI for Image Recognition Using Merged/Mixed Analog-Digital Architecture, in Proc. 7th KES 2003, Oxford, pp 169-176. [5] Murasaki, M., Arima, Y. & Shinohara, H. (1993) A 20 Tera-CPS Analog Neural Network Board. In Proc. Int. Joint Conf. Neural Networks, pp. 3027 – 3030. [6] Pedersen, R., Schoeberl, M. (2006), An Embedded Support Vector Machine, WISE 2006. [7] Dey, S., Kedia, M. Agarwal, N., Basu, A., Embedded Support Vector Machine: Architectural Enhancements and Evaluation, in Proc 20th Int. Conf. VLSI Design. [8] Anguita, D., Boni, A., Ridella, S., (2003) A Digital Architecture for Support Vector Machines: Theory, Algorithm, and FPGA Implementation, IEEE Trans. Neural Networks, 14/5, pp.993-1009. [9] Chu, C., Kim, S., Lin, Y., Yu, Y., Bradski, G., Ng, A. & Olukotun, K. (2007) Map-Reduce for Machine Learning on Multicore, Advances in Neural Information Processing Systems 19, MIT Press. [10] Catanzaro, B., Sundaram, N., & Keutzer, K. (2008) Fast Support Vector Machine Training and Classification on Graphics Processors, Proc. 25th Int. Conf. Machine Learning, pp 104-111. [11] Durdanovic, I., Cosatto, E. & Graf, H. (2007) Large Scale Parallel SVM Implementation. In L. Bottou, O. Chapelle, D. DeCoste, J. Weston (eds.), Large Scale Kernel Machines, pp. 105-138, MIT Press. [12] Simard, P & Graf, H. (1994) Backpropagation without Multiplication. In J. Cowan, G. Tesauro, J. Alspector, (eds.), Neural Information Processing Systems 6, pp. 232 – 239, Morgan Kaufmann. [13] Savich, A., Moussa, M., Areibi, S., (2007) The Impact of Arithmetic Representation on Implementing MLP-BP on FPGAs: A Study, IEEE Trans. Neural Networks, 18/1, pp. 240-252.
4 0.60431194 118 nips-2008-Learning Transformational Invariants from Natural Movies
Author: Charles Cadieu, Bruno A. Olshausen
Abstract: We describe a hierarchical, probabilistic model that learns to extract complex motion from movies of the natural environment. The model consists of two hidden layers: the first layer produces a sparse representation of the image that is expressed in terms of local amplitude and phase variables. The second layer learns the higher-order structure among the time-varying phase variables. After training on natural movies, the top layer units discover the structure of phase-shifts within the first layer. We show that the top layer units encode transformational invariants: they are selective for the speed and direction of a moving pattern, but are invariant to its spatial structure (orientation/spatial-frequency). The diversity of units in both the intermediate and top layers of the model provides a set of testable predictions for representations that might be found in V1 and MT. In addition, the model demonstrates how feedback from higher levels can influence representations at lower levels as a by-product of inference in a graphical model. 1
5 0.55001497 191 nips-2008-Recursive Segmentation and Recognition Templates for 2D Parsing
Author: Leo Zhu, Yuanhao Chen, Yuan Lin, Chenxi Lin, Alan L. Yuille
Abstract: Language and image understanding are two major goals of artificial intelligence which can both be conceptually formulated in terms of parsing the input signal into a hierarchical representation. Natural language researchers have made great progress by exploiting the 1D structure of language to design efficient polynomialtime parsing algorithms. By contrast, the two-dimensional nature of images makes it much harder to design efficient image parsers and the form of the hierarchical representations is also unclear. Attempts to adapt representations and algorithms from natural language have only been partially successful. In this paper, we propose a Hierarchical Image Model (HIM) for 2D image parsing which outputs image segmentation and object recognition. This HIM is represented by recursive segmentation and recognition templates in multiple layers and has advantages for representation, inference, and learning. Firstly, the HIM has a coarse-to-fine representation which is capable of capturing long-range dependency and exploiting different levels of contextual information. Secondly, the structure of the HIM allows us to design a rapid inference algorithm, based on dynamic programming, which enables us to parse the image rapidly in polynomial time. Thirdly, we can learn the HIM efficiently in a discriminative manner from a labeled dataset. We demonstrate that HIM outperforms other state-of-the-art methods by evaluation on the challenging public MSRC image dataset. Finally, we sketch how the HIM architecture can be extended to model more complex image phenomena. 1
6 0.52766216 56 nips-2008-Deep Learning with Kernel Regularization for Visual Recognition
7 0.52538282 42 nips-2008-Cascaded Classification Models: Combining Models for Holistic Scene Understanding
8 0.52428097 246 nips-2008-Unsupervised Learning of Visual Sense Models for Polysemous Words
9 0.52246445 66 nips-2008-Dynamic visual attention: searching for coding length increments
10 0.51608253 160 nips-2008-On Computational Power and the Order-Chaos Phase Transition in Reservoir Computing
11 0.50995642 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
12 0.49128282 30 nips-2008-Bayesian Experimental Design of Magnetic Resonance Imaging Sequences
13 0.48490697 208 nips-2008-Shared Segmentation of Natural Scenes Using Dependent Pitman-Yor Processes
14 0.48137364 103 nips-2008-Implicit Mixtures of Restricted Boltzmann Machines
15 0.45826021 119 nips-2008-Learning a discriminative hidden part model for human action recognition
16 0.44074643 147 nips-2008-Multiscale Random Fields with Application to Contour Grouping
17 0.42760393 95 nips-2008-Grouping Contours Via a Related Image
18 0.42587823 207 nips-2008-Shape-Based Object Localization for Descriptive Classification
19 0.42485675 77 nips-2008-Evaluating probabilities under high-dimensional latent variable models
20 0.41815165 27 nips-2008-Artificial Olfactory Brain for Mixture Identification
topicId topicWeight
[(6, 0.044), (7, 0.059), (12, 0.018), (28, 0.115), (57, 0.562), (59, 0.01), (63, 0.021), (77, 0.018), (83, 0.054)]
simIndex simValue paperId paperTitle
1 0.94308978 191 nips-2008-Recursive Segmentation and Recognition Templates for 2D Parsing
Author: Leo Zhu, Yuanhao Chen, Yuan Lin, Chenxi Lin, Alan L. Yuille
Abstract: Language and image understanding are two major goals of artificial intelligence which can both be conceptually formulated in terms of parsing the input signal into a hierarchical representation. Natural language researchers have made great progress by exploiting the 1D structure of language to design efficient polynomialtime parsing algorithms. By contrast, the two-dimensional nature of images makes it much harder to design efficient image parsers and the form of the hierarchical representations is also unclear. Attempts to adapt representations and algorithms from natural language have only been partially successful. In this paper, we propose a Hierarchical Image Model (HIM) for 2D image parsing which outputs image segmentation and object recognition. This HIM is represented by recursive segmentation and recognition templates in multiple layers and has advantages for representation, inference, and learning. Firstly, the HIM has a coarse-to-fine representation which is capable of capturing long-range dependency and exploiting different levels of contextual information. Secondly, the structure of the HIM allows us to design a rapid inference algorithm, based on dynamic programming, which enables us to parse the image rapidly in polynomial time. Thirdly, we can learn the HIM efficiently in a discriminative manner from a labeled dataset. We demonstrate that HIM outperforms other state-of-the-art methods by evaluation on the challenging public MSRC image dataset. Finally, we sketch how the HIM architecture can be extended to model more complex image phenomena. 1
2 0.92947721 233 nips-2008-The Gaussian Process Density Sampler
Author: Iain Murray, David MacKay, Ryan P. Adams
Abstract: We present the Gaussian Process Density Sampler (GPDS), an exchangeable generative model for use in nonparametric Bayesian density estimation. Samples drawn from the GPDS are consistent with exact, independent samples from a fixed density function that is a transformation of a function drawn from a Gaussian process prior. Our formulation allows us to infer an unknown density from data using Markov chain Monte Carlo, which gives samples from the posterior distribution over density functions and from the predictive distribution on data space. We can also infer the hyperparameters of the Gaussian process. We compare this density modeling technique to several existing techniques on a toy problem and a skullreconstruction task. 1
same-paper 3 0.92579043 148 nips-2008-Natural Image Denoising with Convolutional Networks
Author: Viren Jain, Sebastian Seung
Abstract: We present an approach to low-level vision that combines two main ideas: the use of convolutional networks as an image processing architecture and an unsupervised learning procedure that synthesizes training samples from specific noise models. We demonstrate this approach on the challenging problem of natural image denoising. Using a test set with a hundred natural images, we find that convolutional networks provide comparable and in some cases superior performance to state of the art wavelet and Markov random field (MRF) methods. Moreover, we find that a convolutional network offers similar performance in the blind denoising setting as compared to other techniques in the non-blind setting. We also show how convolutional networks are mathematically related to MRF approaches by presenting a mean field theory for an MRF specially designed for image denoising. Although these approaches are related, convolutional networks avoid computational difficulties in MRF approaches that arise from probabilistic learning and inference. This makes it possible to learn image processing architectures that have a high degree of representational power (we train models with over 15,000 parameters), but whose computational expense is significantly less than that associated with inference in MRF approaches with even hundreds of parameters. 1 Background Low-level image processing tasks include edge detection, interpolation, and deconvolution. These tasks are useful both in themselves, and as a front-end for high-level visual tasks like object recognition. This paper focuses on the task of denoising, defined as the recovery of an underlying image from an observation that has been subjected to Gaussian noise. One approach to image denoising is to transform an image from pixel intensities into another representation where statistical regularities are more easily captured. For example, the Gaussian scale mixture (GSM) model introduced by Portilla and colleagues is based on a multiscale wavelet decomposition that provides an effective description of local image statistics [1, 2]. Another approach is to try and capture statistical regularities of pixel intensities directly using Markov random fields (MRFs) to define a prior over the image space. Initial work used handdesigned settings of the parameters, but recently there has been increasing success in learning the parameters of such models from databases of natural images [3, 4, 5, 6, 7, 8]. Prior models can be used for tasks such as image denoising by augmenting the prior with a noise model. Alternatively, an MRF can be used to model the probability distribution of the clean image conditioned on the noisy image. This conditional random field (CRF) approach is said to be discriminative, in contrast to the generative MRF approach. Several researchers have shown that the CRF approach can outperform generative learning on various image restoration and labeling tasks [9, 10]. CRFs have recently been applied to the problem of image denoising as well [5]. 1 The present work is most closely related to the CRF approach. Indeed, certain special cases of convolutional networks can be seen as performing maximum likelihood inference on a CRF [11]. The advantage of the convolutional network approach is that it avoids a general difficulty with applying MRF-based methods to image analysis: the computational expense associated with both parameter estimation and inference in probabilistic models. For example, naive methods of learning MRFbased models involve calculation of the partition function, a normalization factor that is generally intractable for realistic models and image dimensions. As a result, a great deal of research has been devoted to approximate MRF learning and inference techniques that meliorate computational difficulties, generally at the cost of either representational power or theoretical guarantees [12, 13]. Convolutional networks largely avoid these difficulties by posing the computational task within the statistical framework of regression rather than density estimation. Regression is a more tractable computation and therefore permits models with greater representational power than methods based on density estimation. This claim will be argued for with empirical results on the denoising problem, as well as mathematical connections between MRF and convolutional network approaches. 2 Convolutional Networks Convolutional networks have been extensively applied to visual object recognition using architectures that accept an image as input and, through alternating layers of convolution and subsampling, produce one or more output values that are thresholded to yield binary predictions regarding object identity [14, 15]. In contrast, we study networks that accept an image as input and produce an entire image as output. Previous work has used such architectures to produce images with binary targets in image restoration problems for specialized microscopy data [11, 16]. Here we show that similar architectures can also be used to produce images with the analog fluctuations found in the intensity distributions of natural images. Network Dynamics and Architecture A convolutional network is an alternating sequence of linear filtering and nonlinear transformation operations. The input and output layers include one or more images, while intermediate layers contain “hidden
4 0.90758276 236 nips-2008-The Mondrian Process
Author: Daniel M. Roy, Yee W. Teh
Abstract: We describe a novel class of distributions, called Mondrian processes, which can be interpreted as probability distributions over kd-tree data structures. Mondrian processes are multidimensional generalizations of Poisson processes and this connection allows us to construct multidimensional generalizations of the stickbreaking process described by Sethuraman (1994), recovering the Dirichlet process in one dimension. After introducing the Aldous-Hoover representation for jointly and separately exchangeable arrays, we show how the process can be used as a nonparametric prior distribution in Bayesian models of relational data. 1
5 0.89581656 80 nips-2008-Extended Grassmann Kernels for Subspace-Based Learning
Author: Jihun Hamm, Daniel D. Lee
Abstract: Subspace-based learning problems involve data whose elements are linear subspaces of a vector space. To handle such data structures, Grassmann kernels have been proposed and used previously. In this paper, we analyze the relationship between Grassmann kernels and probabilistic similarity measures. Firstly, we show that the KL distance in the limit yields the Projection kernel on the Grassmann manifold, whereas the Bhattacharyya kernel becomes trivial in the limit and is suboptimal for subspace-based problems. Secondly, based on our analysis of the KL distance, we propose extensions of the Projection kernel which can be extended to the set of affine as well as scaled subspaces. We demonstrate the advantages of these extended kernels for classification and recognition tasks with Support Vector Machines and Kernel Discriminant Analysis using synthetic and real image databases. 1
6 0.7972461 100 nips-2008-How memory biases affect information transmission: A rational analysis of serial reproduction
7 0.77440816 27 nips-2008-Artificial Olfactory Brain for Mixture Identification
8 0.7199775 208 nips-2008-Shared Segmentation of Natural Scenes Using Dependent Pitman-Yor Processes
9 0.67568564 158 nips-2008-Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks
10 0.63675183 234 nips-2008-The Infinite Factorial Hidden Markov Model
11 0.62248045 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
12 0.62070781 35 nips-2008-Bayesian Synchronous Grammar Induction
13 0.60396832 200 nips-2008-Robust Kernel Principal Component Analysis
14 0.59180248 232 nips-2008-The Conjoint Effect of Divisive Normalization and Orientation Selectivity on Redundancy Reduction
15 0.58967459 66 nips-2008-Dynamic visual attention: searching for coding length increments
16 0.58631849 127 nips-2008-Logistic Normal Priors for Unsupervised Probabilistic Grammar Induction
17 0.58478808 192 nips-2008-Reducing statistical dependencies in natural signals using radial Gaussianization
18 0.58289248 42 nips-2008-Cascaded Classification Models: Combining Models for Holistic Scene Understanding
19 0.58022648 197 nips-2008-Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation
20 0.57108295 229 nips-2008-Syntactic Topic Models