Author: Yoshua Bengio, James S. Bergstra

Abstract: We introduce a new type of neural network activation function based on recent physiological rate models for complex cells in visual area V1. A single-hidden-layer neural network with this activation function achieves 1.50% error on MNIST. We also introduce an existing criterion for learning slow, decorrelated features as a pretraining strategy for image models. This pretraining strategy results in orientation-selective features, similar to the receptive fields of complex cells. With this pretraining, the same single-hidden-layer model achieves 1.34% error, even though the pretraining sample distribution is very different from the fine-tuning distribution. To implement this pretraining strategy, we derive a fast algorithm for online learning of decorrelated features such that each iteration of the algorithm runs in linear time with respect to the number of features.

1 We introduce a new type of neural network activation function based on recent physiological rate models for complex cells in visual area V1. [sent-5, score-0.572]

2 We also introduce an existing criterion for learning slow, decorrelated features as a pretraining strategy for image models. [sent-8, score-0.782]

3 This pretraining strategy results in orientation-selective features, similar to the receptive fields of complex cells. [sent-9, score-0.603]

4 With this pretraining, the same single-hidden-layer model achieves 1.34% error, even though the pretraining sample distribution is very different from the fine-tuning distribution. [sent-11, score-0.443]

5 To implement this pretraining strategy, we derive a fast algorithm for online learning of decorrelated features such that each iteration of the algorithm runs in linear time with respect to the number of features. [sent-12, score-0.608]

6 1 Introduction Visual area V1 is the first area of cortex devoted to handling visual input in the human visual system (Dayan & Abbott, 2001). [sent-13, score-0.258]

7 One convenient simplification in the study of cell behaviour is to ignore the timing of individual spikes, and to look instead at their frequency. [sent-14, score-0.153]

8 Some cells in V1 are described well by a linear filter that has been rectified to be non-negative and perhaps bounded. [sent-15, score-0.154]

9 These so-called simple cells are similar to sigmoidal activation functions: their activity (firing frequency) is greater as an image stimulus looks more like some particular linear filter. [sent-16, score-0.494]

10 However, these simple cells are a minority in visual area V1 and the characterization of the remaining cells there (and even beyond in visual areas V2, V4, MT, and so on) is a very active area of ongoing research. [sent-17, score-0.566]

11 This non-linear response has been modeled by quadrature pairs (Adelson & Bergen, 1985; Dayan & Abbott, 2001): pairs of linear filters with the property that the sum of their squared responses is constant for an input image with particular spatial frequency and orientation (i. [sent-20, score-0.16]

12 More recently, it has been argued that V1 cells exhibit a range of behaviour that blurs distinctions between simple and complex cells and between energy models and max-pooling models (Rust et al. [sent-24, score-0.495]

13 Another theme in neural modeling is that cells do not react to single images; they react to image sequences. [sent-26, score-0.320]

14 It is a gross approximation to suppose that each cell implements a function from image to activity level. [sent-27, score-0.18]

15 The principle of identifying slowly moving/changing factors in temporal/spatial data has been investigated by many (Becker & Hinton, 1993; Wiskott & Sejnowski, 2002; Hurri & Hyvärinen, 2003; Körding et al. [sent-30, score-0.176]

16 , 2004; Cadieu & Olshausen, 2009) as a principle for finding useful representations of images, and as an explanation for why V1 simple and complex cells behave the way they do. [sent-31, score-0.274]

17 However, models that have been pretrained with appropriate unsupervised learning procedures (such as RBMs and various forms of auto-encoders) generalize better (Hinton et al. [sent-35, score-0.241]

18 Recent work in the pretraining of neural networks has taken a generative modeling perspective. [sent-44, score-0.443]

19 Reconstruction error is an even poorer approximation of the maximum likelihood gradient, and sometimes works better than CD (with additional twists like sparsity or the denoising of (Vincent et al. [sent-48, score-0.172]

20 The temporal coherence and decorrelation criterion is an alternative to training generative models such as RBMs or auto-encoder variants. [sent-50, score-0.502]

21 , 2009) demonstrated that a slowness criterion regularizing the top-most internal layer of a deep convolutional network during supervised learning helps their model to generalize better. [sent-52, score-0.497]

22 Our model is similar in spirit to pre-training with the semi-supervised embedding criterion at each level (Weston et al. [sent-53, score-0.163]

23 , 2009), but differs in the use of decorrelation as a mechanism for preventing trivial solutions to a slowness criterion. [sent-55, score-0.303]

24 Whereas RBMs and denoising autoencoders are defined for general input distributions, the temporal coherence and decorrelation criterion makes sense only in the context of data with slowly-changing temporal or spatial structure, such as images, video, and sound. [sent-56, score-0.639]

25 In the same way that simple cell models were the inspiration for sigmoidal activation units in artificial neural networks and validated simple cell models, we investigate in artificial neural network classifiers the value of complex cell models. [sent-57, score-0.773]

26 This paper builds on these results by showing that the principle of temporal coherence is useful for finding initial conditions for the hidden layer of a neural network that biases it towards better generalization in object recognition. [sent-58, score-0.362]

27 We introduce temporal coherence and decorrelation as a pretraining algorithm. [sent-59, score-0.791]

28 In order for this criterion to be useful in the context of large models, we derive a fast online algorithm for decorrelating units and maximizing temporal coherence. [sent-61, score-0.233]

29 1 Algorithm Slow, decorrelated feature learning algorithm (Körding et al. [sent-63, score-0.24]

30 , 2004) introduced a principle (and training criterion) to explain the formation of complex cell receptive fields. [sent-64, score-0.359]

31 They based their analysis on the complex-cell model of (Adelson & Bergen, 1985), which describes a complex cell as a pair of half-rectified linear filters whose outputs are squared and added together and then a square root is applied to that sum. [sent-65, score-0.238]

32 Suppose x is an input image and we have F complex cells h1 , . [sent-66, score-0.307]

33 This makes the criterion too slow for use with large datasets. [sent-74, score-0.16]

34 At the same time, the size of the covariance matrix is quadratic in the number of features, so it is computationally expensive (perhaps prohibitively) to apply the criterion to train large numbers of features. [sent-75, score-0.156]

35 One way to apply the criterion to large or infinite datasets is by estimating the covariance (and variance) from consecutive minibatches of N movie frames. [sent-79, score-0.21]

36 $\bar{h}_i(t) = \rho \bar{h}_i(t-1) + (1-\rho) h_i(t)$. For good results, $\rho$ should be chosen so that the estimates change very slowly. [sent-82, score-0.227]
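
The running-average update above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `update_running_mean` and the default `rho=0.99` are assumptions chosen only for the example.

```python
def update_running_mean(prev_mean, h_t, rho=0.99):
    """One step of the running average: rho * prev_mean + (1 - rho) * h_t.

    prev_mean and h_t are lists with one entry per feature.
    rho close to 1 makes the estimate change slowly (rho=0.99 is an assumed default).
    """
    return [rho * m + (1.0 - rho) * h for m, h in zip(prev_mean, h_t)]


# Hypothetical usage: two features observed over three identical frames.
mean = [0.0, 0.0]
for frame in ([1.0, 2.0], [1.0, 2.0], [1.0, 2.0]):
    mean = update_running_mean(mean, frame, rho=0.5)
```

With a small `rho` such as 0.5 the estimate moves quickly toward the observed values; a value near 1 gives the slowly changing estimate that the text recommends.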

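37 $$L(t) = \frac{\alpha}{N^2}\left( \left| Z(t) Z(t)^{\top} \right|^2 - \sum_{i=1}^{F} \Big( \sum_{\tau=1}^{N} z_i(\tau)^2 \Big)^2 \right) + \frac{1}{N-1} \sum_{\tau=1}^{N-1} \sum_{i=1}^{F} \left( z_i(\tau) - z_i(\tau-1) \right)^2 \quad (3)$$ The time complexity of computing $L(t)$ using Equation 3 from $Z(t)$ is $O(N \cdot N \cdot F)$. [sent-95, score-0.248]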
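
To illustrate why Equation 3 is linear in the number of features, the sketch below computes the cost through the N x N matrix of frame-by-frame inner products and checks it against a direct sum over feature pairs. It is a sketch under stated assumptions: `Z` is taken to be a list of N frames, each holding F already-normalized feature values, and the function names are hypothetical.

```python
def slowness(Z):
    """Mean squared change of every feature between consecutive frames."""
    N, F = len(Z), len(Z[0])
    total = sum((Z[t][i] - Z[t - 1][i]) ** 2 for t in range(1, N) for i in range(F))
    return total / (N - 1)


def gram_cost(Z, alpha):
    """Cost via the N x N Gram matrix of frames: O(N * N * F) work."""
    N, F = len(Z), len(Z[0])
    frob_sq = 0.0
    for s in range(N):
        for t in range(N):
            g = sum(Z[s][i] * Z[t][i] for i in range(F))
            frob_sq += g * g
    # Remove the diagonal (i == j) contribution: sum_i (sum_tau z_i(tau)^2)^2.
    diag_sq = sum(sum(Z[t][i] ** 2 for t in range(N)) ** 2 for i in range(F))
    return alpha / N ** 2 * (frob_sq - diag_sq) + slowness(Z)


def naive_cost(Z, alpha):
    """Same cost by summing over all feature pairs i != j: O(N * F * F) work."""
    N, F = len(Z), len(Z[0])
    off_diag = 0.0
    for i in range(F):
        for j in range(F):
            if i != j:
                c = sum(Z[t][i] * Z[t][j] for t in range(N))
                off_diag += c * c
    return alpha / N ** 2 * off_diag + slowness(Z)


Z_example = [[1.0, 0.5, -1.0], [0.0, 2.0, 1.0]]  # N = 2 frames, F = 3 features
cost_fast = gram_cost(Z_example, alpha=0.5)
cost_slow = naive_cost(Z_example, alpha=0.5)
```

The two functions agree because the squared entries of the N x N product and of the F x F product sum to the same value; only the amount of work differs when F is much larger than N.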

38 2 Complex-cell activation function Recently, (Rust et al. [sent-100, score-0.226]

39 , 2005) have argued that existing models, such as that of (Adelson & Bergen, 1985) cannot account for the variety of behaviour found in visual area V1. [sent-101, score-0.174]

40 Some complex cells behave like simple cells to some extent and vice versa; there is a continuous range of simple to complex cells. [sent-102, score-0.47]

41 They put forward a similar but more involved expression that can capture the simple and complex cells as special cases, but ultimately parameterizes a larger class of cell-response functions (Eq. [sent-103, score-0.235]

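42 $$a + \frac{\beta \left( \sqrt{\max(0, w^{\top}x)^2 + \sum_{i=1}^{I} (u^{(i)\top} x)^2} \right)^{\zeta} - \delta \left( \sqrt{\sum_{j=1}^{J} (v^{(j)\top} x)^2} \right)^{\zeta}}{1 + \gamma \left( \sqrt{\max(0, w^{\top}x)^2 + \sum_{i=1}^{I} (u^{(i)\top} x)^2} \right)^{\zeta} + \epsilon \left( \sqrt{\sum_{j=1}^{J} (v^{(j)\top} x)^2} \right)^{\zeta}} \quad (4)$$ The numerator in Eq. 4 describes the difference between an excitation term and a shunting inhibition term. [sent-105, score-0.173]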

43 Parameters w, u(i) , v (j) have the same shape as the input image x, and can be thought of as image filters like the first layer of a neural network or the codebook of a sparse-coding model. [sent-107, score-0.252]

44 The parameters a, β, δ, γ, ε, ζ are scalars that control the range and shape of the activation function, given all the filter responses. [sent-108, score-0.165]

45 The range of this activation function (as a function of x) is a connected set on the (−1, 1) interval. [sent-119, score-0.165]

46 If the inhibition term is always 0 for example, then the activation function will be non-negative. [sent-121, score-0.226]

47 Each hidden unit had one linear filter w, a bias b, two quadratic excitatory filters u1 , u2 and two quadratic inhibitory filters v1 , v2 . [sent-126, score-0.233]

48 The computational cost of evaluating each unit was thus five times the cost of evaluating a normal sigmoidal activation function of the form tanh(w x + b). [sent-127, score-0.268]
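
A unit of this kind can be sketched as follows. This is an illustration under several assumptions rather than the authors' code: it assumes the activation is a ratio of (excitation minus shunting inhibition) over a normalizing denominator in the style of Rust et al. (2005), that the bias `b` is added inside the rectified linear term, and that the scalar defaults (`a=0`, all other scalars equal to 1) are placeholders. The name `complex_cell` and the symbol `eps` are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def complex_cell(x, w, b, us, vs, a=0.0, beta=1.0, delta=1.0,
                 gamma=1.0, eps=1.0, zeta=1.0):
    """Assumed complex-cell activation with one linear filter w,
    quadratic excitatory filters us and quadratic inhibitory filters vs."""
    linear = max(0.0, dot(w, x) + b)  # half-rectified linear response (bias placement assumed)
    excite = math.sqrt(linear ** 2 + sum(dot(u, x) ** 2 for u in us)) ** zeta
    inhibit = math.sqrt(sum(dot(v, x) ** 2 for v in vs)) ** zeta
    return a + (beta * excite - delta * inhibit) / (1.0 + gamma * excite + eps * inhibit)


# Hypothetical 3-pixel input with two excitatory and two inhibitory filters.
x = [0.5, -1.0, 2.0]
w = [1.0, 0.0, -0.5]
us = [[0.2, 0.1, 0.0], [0.0, 0.3, 0.1]]
vs = [[1.0, 1.0, 1.0], [0.0, -2.0, 0.5]]
y = complex_cell(x, w, 0.1, us, vs)
```

Each unit evaluates five inner products with the input (one for `w`, two for `us`, two for `vs`), which matches the stated cost of roughly five times a tanh unit. With the placeholder scalars the output stays strictly between -1 and 1, and it is non-negative when the inhibitory filters are all zero.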

49 Figure 1: Four of the three hundred activation functions learned by training our model from random initialization to perform classification. [sent-136, score-0.257]

50 Bottom row: the red and blue channels are the two quadratic filters of the shunting inhibition term. [sent-138, score-0.177]

51 2 Pretraining with natural movies Under the hypothesis that the matched Gabor functions (see Fig. [sent-141, score-0.207]

52 1) allowed our model to generalize better across slight translations of the image, we appealed to a pretraining process to initialize our model with values better than random noise. [sent-142, score-0.443]

53 We pretrained the hidden layer according to the online version of the cost in Eq. [sent-143, score-0.205]

54 3, using movies (MIXED-movies) made by sliding a 28 x 28 pixel window across large photographs. [sent-144, score-0.291]

55 Each of these movies was short (just four frames long) and ten movies were used in each minibatch (N = 40). [sent-145, score-0.597]

56 The sliding initial position was sampled uniformly from image coordinates. [sent-149, score-0.156]
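
The construction of such a movie can be sketched as below. The sketch assumes a constant, non-negative window velocity and restricts the uniformly sampled start position so that every frame stays inside the photograph; neither detail is given in the text, and the name `sliding_window_movie` is hypothetical.

```python
import random


def sliding_window_movie(image, size=28, n_frames=4, velocity=(1, 1), rng=random):
    """Cut n_frames square patches from a 2-D image (list of rows).

    velocity is the (rows, columns) shift per frame; it is assumed
    constant and non-negative for this sketch.
    """
    height, width = len(image), len(image[0])
    dy, dx = velocity
    # Largest start position for which the last frame still fits.
    max_y = height - size - dy * (n_frames - 1)
    max_x = width - size - dx * (n_frames - 1)
    y0 = rng.randint(0, max_y)  # uniform start position (inclusive bounds)
    x0 = rng.randint(0, max_x)
    movie = []
    for k in range(n_frames):
        y, x = y0 + k * dy, x0 + k * dx
        movie.append([row[x:x + size] for row in image[y:y + size]])
    return movie


# Hypothetical 40 x 50 "photograph" whose pixel values encode their position.
photo = [[r * 50 + c for c in range(50)] for r in range(40)]
movie = sliding_window_movie(photo, rng=random.Random(0))
```

Ten such four-frame movies concatenated give the minibatch of N = 40 frames described above.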

57 The shunting inhibition filters (v1 , v2 ) learned after five hundred thousand movies (fifty thousand iterations of stochastic gradient descent) are shown in Figure 2. [sent-155, score-0.47]

58 The filters shown in Figure 2 have fairly global receptive fields, but smaller, more local receptive fields were obtained by applying ℓ1 weight-penalization during pretraining. [sent-157, score-0.158]

59 The α parameter that balances decorrelation and slowness was chosen manually on the basis of the trained filters. [sent-158, score-0.303]

60 The excitatory filters learned similar Gabor pairs but the receptive fields tended to be both smaller (more localized) and lower-frequency. [sent-160, score-0.163]

61 3 Pretraining with MNIST movies We also tried pretraining with videos whose frames follow a similar distribution to the images used for fine-tuning and testing. [sent-165, score-0.818]

62 We created MNIST movies by sampling an image from the training set, and moving around (translating it) according to a Brownian motion. [sent-166, score-0.369]

63 Changes in that velocity between each Figure 2: Filters from some of the units of the model, pretrained on small sliding image patches from two large images. [sent-169, score-0.316]

64 Furthermore, the digit image in each frame was modified according to a randomly chosen elastic deformation, as in (Loosli et al. [sent-176, score-0.179]

65 As before, movies of four frames were created in this way and training was conducted on minibatches of ten movies (N = 4 ∗ 10 = 40). [sent-178, score-0.635]
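
One way to generate the translation part of such a movie is sketched below. It assumes that the velocity performs a Gaussian random walk and that positions are rounded to whole pixels; the step size `step_std`, the zero fill value, and both function names are assumptions made for illustration. The elastic deformation step is not shown.

```python
import random


def brownian_offsets(n_frames=4, step_std=1.0, rng=random):
    """Per-frame (dy, dx) offsets; the velocity follows a Gaussian random walk."""
    y = x = vy = vx = 0.0
    offsets = []
    for _ in range(n_frames):
        offsets.append((round(y), round(x)))
        vy += rng.gauss(0.0, step_std)  # random change in velocity between frames
        vx += rng.gauss(0.0, step_std)
        y += vy
        x += vx
    return offsets


def translate(image, dy, dx, fill=0.0):
    """Shift a 2-D image (list of rows) by (dy, dx), padding with fill."""
    h, w = len(image), len(image[0])
    return [[image[r - dy][c - dx] if 0 <= r - dy < h and 0 <= c - dx < w else fill
             for c in range(w)] for r in range(h)]


# Hypothetical 2 x 2 "digit" moved through four frames.
digit = [[1.0, 2.0], [3.0, 4.0]]
offsets = brownian_offsets(rng=random.Random(0))
frames = [translate(digit, dy, dx) for dy, dx in offsets]
```

The first frame is always the untranslated image, and later frames drift away from it as the accumulated velocity grows.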

66 Unlike the MNIST frames in MIXED-movies, the frames of MNIST-movies contain a single digit that is approximately centered. [sent-179, score-0.308]

67 The activation functions learned by minimizing Equation 3 on these MNIST movies were qualitatively different from the activation functions learned from the MIXED movies. [sent-180, score-0.617]

68 The inhibitory weights (v1 , v2 ) learned from MNIST movies are shown in Figure 3. [sent-181, score-0.291]

69 Figure 3: Filters of our model, pretrained on movies of centered MNIST training images subjected to Brownian translation. [sent-189, score-0.402]

70 Table 1: Generalization error (% error) from 100 labeled MNIST examples after pretraining on MIXED-movies and MNIST-movies. [sent-193, score-0.443]

71 Pre-training Dataset Number of pretraining iterations (×10^4) 0 1 2 3 4 5 MIXED-movies MNIST-movies 4 23. [sent-194, score-0.443]

72 A single-hidden layer neural network of sigmoidal units can achieve 1. [sent-207, score-0.268]

73 A single-hidden layer sigmoidal neural network pretrained as a denoising auto-encoder (and then fine-tuned) can achieve 1. [sent-210, score-0.39]

74 4%; our pretraining strategy allows our single-layer model to be better than Gaussian SVMs (Decoste & Schölkopf, 2002). [sent-215, score-0.443]

75 In future work, we will explore strategies for combining these methods with our decorrelation criterion to train deep networks of models with quadratic input interactions. [sent-224, score-0.48]

76 The value of pretraining is evident right away: after two unsupervised passes over the MNIST training data (100K movies and 10K iterations), the weights have been initialized better. [sent-232, score-0.779]

77 Further pretraining offers a diminishing marginal return, although after ten unsupervised passes through the training data (500K movies) there is no evidence of over-pretraining. [sent-236, score-0.61]

78 A larger test set would be required to make a strong conclusion about a downward trend in test set scores for larger numbers of pretraining iterations. [sent-240, score-0.443]

79 2 Slowness in normalized features encourages binary activations Somewhat counter-intuitively, the slowness criterion requires movement in the features h. [sent-243, score-0.359]

80 Suppose a feature hi has activation levels that are normally distributed around 0. [sent-244, score-0.392]

81 2, but the activation at each frame of a movie is independent of previous frames. [sent-246, score-0.262]

82 Since the feature has a small variance, the normalized feature zi will oscillate in the same way, but with unit variance. [sent-247, score-0.186]

83 This will cause zi (t) − zi (t − 1) to be relatively high, and for our slowness criterion not to be well satisfied. [sent-248, score-0.483]

84 In this way the lack of variance in hi can actually make for a relatively fast normalized feature zi rather than a slow one. [sent-249, score-0.409]
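
A short worked calculation makes this concrete. It assumes, as in the example above, that the activations of $h_i$ are independent across frames with mean $\mu$ and variance $\sigma^2$, and that normalization subtracts the mean and divides by the standard deviation:

```latex
\[
z_i(t) = \frac{h_i(t) - \mu}{\sigma}, \qquad
\mathbb{E}\!\left[ \big( z_i(t) - z_i(t-1) \big)^2 \right]
= \operatorname{Var}\big(z_i(t)\big) + \operatorname{Var}\big(z_i(t-1)\big)
  - 2 \operatorname{Cov}\big(z_i(t), z_i(t-1)\big)
= 1 + 1 - 0 = 2 .
\]
```

The expected squared change is 2 no matter how small $\sigma$ is, so shrinking the variance of $h_i$ does not reduce the slowness penalty on $z_i$. A feature that stays constant within each movie, by contrast, contributes zero.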

85 However, if hi has activation levels that are normally distributed around . [sent-250, score-0.392]

86 9 for other image sequences, the marginal variance in hi will be larger. [sent-254, score-0.299]

87 In this sense, the slowness objective can be maximally satisfied by features hi (t) that take near-minimum and near-maximum values for most movies, and never transition from a near-minimum to a near-maximum value during a movie. [sent-258, score-0.422]

88 Perhaps this is one of the roles of saccades in the visual system: to suspend the normal objective of temporal coherence during a rapid widespread change of activation levels. [sent-260, score-0.41]

89 3 Eigenvalue interpretation of decorrelation term What does our unsupervised cost mean? [sent-262, score-0.247]

90 One way of thinking about the decorrelation term (first term in Eq. [sent-263, score-0.17]

91 $$\sum_{i \neq j} \frac{\mathrm{Cov}_t(h_i, h_j)^2}{\mathrm{Var}(h_i)\,\mathrm{Var}(h_j)} = 2 \sum_{i=1}^{F-1} \sum_{j=i+1}^{F} \mathrm{Cov}_t(z_i, z_j)^2 = \left( \sum_{i=1}^{F} \sum_{j=1}^{F} \mathrm{Cov}_t(z_i, z_j)^2 \right) - F$$ If we use $C$ to denote the matrix whose $i, j$ entry is $\mathrm{Cov}_t(z_i, z_j)$, and we use $U \Lambda U^{\top}$ to denote the eigen-decomposition of $C$, then we can transform this sum over i! [sent-267, score-0.183]

92 1 as penalizing the squared eigenvalues of the covariance matrix between features in a normalized feature space (z as opposed to h), or as minimizing the squared eigenvalues of the correlation matrix between features h. [sent-270, score-0.222]
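
The link between the summed squared covariances and the eigenvalues can be written out in one line. The step below uses only that $C$ is symmetric with orthogonal $U$, and that each normalized feature has unit variance so that $C_{ii} = 1$; the notation $\lambda_k$ for the diagonal entries of $\Lambda$ is introduced here for illustration:

```latex
\[
\sum_{i=1}^{F} \sum_{j=1}^{F} C_{ij}^2
= \operatorname{tr}\!\left( C C^{\top} \right)
= \operatorname{tr}\!\left( U \Lambda U^{\top} U \Lambda U^{\top} \right)
= \operatorname{tr}\!\left( \Lambda^2 \right)
= \sum_{k=1}^{F} \lambda_k^2 ,
\qquad\text{so}\qquad
\sum_{i \neq j} C_{ij}^2 = \sum_{k=1}^{F} \lambda_k^2 - F .
\]
```

Because the eigenvalues of $C$ sum to its trace $F$, the sum of their squares is smallest when all of them equal 1, which is the fully decorrelated case.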

93 5 Conclusion We have presented an activation function for use in neural networks that is a simplification of a recent rate model of visual area V1 complex cells. [sent-271, score-0.375]

94 Temporal coherence and decorrelation has been put forward as a principle for explaining the functional behaviour of visual area V1 complex cells. [sent-273, score-0.568]

95 Pretraining our model with this unsupervised criterion yields even lower generalization error: better than Gaussian SVMs, and competitive with deep denoising auto-encoders and 3-layer deep belief networks. [sent-275, score-0.563]

96 Slow feature analysis yields a rich repertoire of complex cell properties. [sent-299, score-0.189]

97 The difficulty of training deep architectures and the effect of unsupervised pre-training. [sent-325, score-0.337]

98 Computational diversity in complex cells of cat primary visual cortex. [sent-332, score-0.302]

99 How are complex cell properties adapted to the statistics of natural stimuli? [sent-353, score-0.189]

100 An empirical evaluation of deep architectures on problems with many factors of variation. [sent-368, score-0.208]

