jmlr jmlr2013 jmlr2013-101 jmlr2013-101-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Markus Thom, Günther Palm
Abstract: Sparseness is a useful regularizer for learning in a wide range of applications, in particular in neural networks. This paper proposes a model targeted at classification tasks, where sparse activity and sparse connectivity are used to enhance classification capabilities. The tool for achieving this is a sparseness-enforcing projection operator which finds the closest vector with a pre-defined sparseness for any given vector. In the theoretical part of this paper, a comprehensive theory for such a projection is developed. It is shown that the projection is differentiable almost everywhere and can thus be implemented as a smooth neuronal transfer function. The entire model can hence be tuned end-to-end using gradient-based methods. Experiments on the MNIST database of handwritten digits show that classification performance can be boosted by sparse activity or sparse connectivity. With a combination of both, performance can be significantly better than with classical non-sparse approaches.
Keywords: supervised learning, sparseness projection, sparse activity, sparse connectivity
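To make the projection described in the abstract concrete, the following Python sketch implements the non-negative special case using the alternating-projection scheme of Hoyer (2004), which is cited in the reference list below. The function names, the NumPy dependency, and the restriction to non-negative inputs are assumptions of this illustration, not part of the paper itself; the operator analysed by Thom and Palm generalizes this construction and additionally establishes its almost-everywhere differentiability.

import numpy as np

def hoyer_sparseness(x):
    """Hoyer (2004) sparseness in [0, 1]: 0 for a uniform vector, 1 for a 1-sparse vector."""
    n = x.size
    l1, l2 = np.abs(x).sum(), np.linalg.norm(x)
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1.0)

def project_to_sparseness(x, target_sparseness):
    """Sketch: find a non-negative vector close to x with the prescribed Hoyer sparseness.

    Follows the alternating-projection scheme of Hoyer (2004), assuming x is
    non-negative; this is an illustration, not the paper's exact operator.
    """
    n = x.size
    l2 = np.linalg.norm(x)                       # keep the Euclidean length of x
    l1 = l2 * (np.sqrt(n) - target_sparseness * (np.sqrt(n) - 1.0))
    s = x + (l1 - x.sum()) / n                   # project onto the hyperplane sum(s) = l1
    zeroed = np.zeros(n, dtype=bool)
    while True:
        # centre of the current face of the simplex (zeroed coordinates stay at zero)
        m = np.where(zeroed, 0.0, l1 / (n - zeroed.sum()))
        w = s - m
        # choose alpha >= 0 with ||m + alpha * w||_2 = l2 (positive root of a quadratic)
        a, b, c = w @ w, 2.0 * (m @ w), m @ m - l2 ** 2
        alpha = (-b + np.sqrt(max(b * b - 4.0 * a * c, 0.0))) / (2.0 * a)
        s = m + alpha * w
        if (s >= 0.0).all():
            return s
        zeroed |= s < 0.0                        # clip negative entries ...
        s = np.where(zeroed, 0.0, s)
        s[~zeroed] -= (s.sum() - l1) / (n - zeroed.sum())  # ... and restore the L1 mass

# Example: project a random 784-dimensional vector (one flattened 28x28 MNIST-sized
# input) to a Hoyer sparseness of 0.8; the seed and dimensions are arbitrary.
rng = np.random.default_rng(0)
v = project_to_sparseness(rng.random(784), 0.8)
print(round(hoyer_sparseness(v), 3))             # approximately 0.8

The target L1 norm is derived from the desired sparseness while the L2 norm of the input is preserved, so the returned vector lies on the intersection of an L1 hyperplane and an L2 hypersphere; because each step is an explicit algebraic operation, the mapping is differentiable almost everywhere, which is the property the paper exploits to use the projection as a neuronal transfer function.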
References:
L. Acion, J. J. Peterson, S. Temple, and S. Arndt. Probabilistic index: An intuitive non-parametric approach to measuring the size of treatment effects. Statistics in Medicine, 25(4):591–602, 2006.
E. B. Baum, J. Moody, and F. Wilczek. Internal representations for associative memory. Biological Cybernetics, 59(4–5):217–228, 1988.
D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 2nd edition, 1999.
C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, 1995.
T. Blumensath and M. E. Davies. A simple, efficient and near optimal algorithm for compressed sensing. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 3357–3360, 2009.
L. Bottou and Y. LeCun. Large scale online learning. In Advances in Neural Information Processing Systems 16, pages 217–224, 2004.
D. M. Bradley and J. A. Bagnell. Differentiable sparse coding. In Advances in Neural Information Processing Systems 21, pages 113–120, 2009.
Y. Chen and X. Ye. Projection onto a simplex. Technical Report arXiv:1101.6081v2, University of Florida, 2011.
D. C. Cireşan, U. Meier, L. M. Gambardella, and J. Schmidhuber. Deep, big, simple neural nets for handwritten digit recognition. Neural Computation, 22(12):3207–3220, 2010.
D. C. Cireşan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3642–3649, 2012.
G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2(4):303–314, 1989.
D. DeCoste and B. Schölkopf. Training invariant support vector machines. Machine Learning, 46(1–3):161–190, 2002.
J. Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.
F. Deutsch. Best Approximation in Inner Product Spaces. Springer, 2001.
D. L. Donoho. For most large underdetermined systems of linear equations the minimal ℓ1-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59(6):797–829, 2006.
F. Downton. The estimation of Pr(Y < X) in the normal case. Technometrics, 15(3):551–558, 1973.
J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of the International Conference on Machine Learning, pages 272–279, 2008.
R. A. Dunne and N. A. Campbell. On the pairing of the softmax activation and cross-entropy penalty functions and the derivation of the softmax activation function. In Proceedings of the Australasian Conference on Neural Networks, pages 181–185, 1997.
R. L. Dykstra. An algorithm for restricted least squares regression. Journal of the American Statistical Association, 78(384):837–842, 1983.
R. L. Dykstra and J. P. Boyle. An algorithm for least squares projections onto the intersection of translated, convex cones. Journal of Statistical Planning and Inference, 15(3):391–399, 1987.
J. Friedman, T. Hastie, and R. Tibshirani. A note on the group lasso and a sparse group lasso. Technical Report arXiv:1001.0736v1, Stanford University, 2010.
K. Funahashi. On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3):183–192, 1989.
S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1–58, 1992.
X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pages 315–323, 2011.
K. Gregor and Y. LeCun. Learning fast approximations of sparse coding. In Proceedings of the International Conference on Machine Learning, pages 399–406, 2010.
R. J. Grissom. Probability of the superior outcome of one treatment over another. Journal of Applied Psychology, 79(2):314–316, 1994.
M. Heiler and C. Schnörr. Learning sparse representations by non-negative matrix factorization and sequential cone programming. Journal of Machine Learning Research, 7:1385–1407, 2006.
G. E. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
J.-B. Hiriart-Urruty. At what points is the projection mapping differentiable? The American Mathematical Monthly, 89(7):456–458, 1982.
Y. Hochberg and A. C. Tamhane. Multiple Comparison Procedures. John Wiley & Sons, Inc., 1987.
K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6):417–441, 1933.
P. O. Hoyer. Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5:1457–1469, 2004.
D. H. Hubel and T. N. Wiesel. Receptive fields of single neurones in the cat's striate cortex. Journal of Physiology, 148(3):574–591, 1959.
N. Hurley and S. Rickard. Comparing measures of sparsity. IEEE Transactions on Information Theory, 55(10):4723–4741, 2009.
A. Hyvärinen, P. O. Hoyer, and E. Oja. Sparse code shrinkage: Denoising by nonlinear maximum likelihood estimation. In Advances in Neural Information Processing Systems 11, pages 473–478, 1999.
J. Karbowski. How does connectivity between cortical areas depend on brain size? Implications for efficient computation. Journal of Computational Neuroscience, 15(3):347–356, 2003.
T. Kohonen. Correlation matrix memories. IEEE Transactions on Computers, C-21(4):353–359, 1972.
W. H. Kruskal and W. A. Wallis. Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47(260):583–621, 1952.
A. J. Laub. Matrix Analysis for Scientists and Engineers. Society for Industrial and Applied Mathematics, 2004.
S. B. Laughlin and T. J. Sejnowski. Communication in neuronal networks. Science, 301(5641):1870–1874, 2003.
Y. LeCun and C. Cortes. The MNIST database of handwritten digits, 1998. Available electronically at http://yann.lecun.com/exdb/mnist.
Y. LeCun, J. S. Denker, and S. A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems 2, pages 598–605, 1990.
D. D. Lee and H. S. Seung. Learning the parts of objects by nonnegative matrix factorization. Nature, 401(6755):788–791, 1999.
H. Levene. Robust tests for equality of variances. In Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling, pages 278–292. Stanford University Press, 1960.
S. Z. Li, X. Hou, H. Zhang, and Q. Cheng. Learning spatially localized, parts-based representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 207–212, 2001.
J. Liu and J. Ye. Efficient Euclidean projections in linear time. In Proceedings of the International Conference on Machine Learning, pages 657–664, 2009.
J. Liu and J. Ye. Efficient ℓ1/ℓq norm regularization. Technical Report arXiv:1009.4766v1, Arizona State University, 2010.
J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Supervised dictionary learning. In Advances in Neural Information Processing Systems 21, pages 1033–1040, 2009.
J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19–60, 2010.
H. Markram, J. Lübke, M. Frotscher, A. Roth, and B. Sakmann. Physiology and anatomy of synaptic connections between thick tufted pyramidal neurones in the developing rat neocortex. Journal of Physiology, 500(2):409–440, 1997.
A. Mason, A. Nicoll, and K. Stratford. Synaptic transmission between individual pyramidal neurons of the rat visual cortex in vitro. Journal of Neuroscience, 11(1):72–84, 1991.
C. Michelot. A finite algorithm for finding the projection of a point onto the canonical simplex of R^n. Journal of Optimization Theory and Applications, 50(1):195–200, 1986.
J. Mutch and D. G. Lowe. Multiclass object recognition with sparse, localized features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11–18, 2006.
B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227–234, 1995.
H. Neudecker. Some theorems on matrix differentiation with special reference to Kronecker matrix products. Journal of the American Statistical Association, 64(327):953–963, 1969.
B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, 1996.
B. A. Olshausen and D. J. Field. Sparse coding of sensory inputs. Current Opinion in Neurobiology, 14(4):481–487, 2004.
P. Paatero and U. Tapper. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5(2):111–126, 1994.
G. Palm. On associative memory. Biological Cybernetics, 36(1):19–31, 1980.
J. Pizarro, E. Guerrero, and P. L. Galindo. Multiple comparison procedures applied to model selection. Neurocomputing, 48(1–4):155–173, 2002.
A. Quattoni, X. Carreras, M. Collins, and T. Darrell. An efficient projection for ℓ1,∞ regularization. In Proceedings of the International Conference on Machine Learning, pages 857–864, 2009.
M. Ranzato, Y. Boureau, and Y. LeCun. Sparse feature learning for deep belief networks. In Advances in Neural Information Processing Systems 20, pages 1185–1192, 2008.
M. Rehn and F. T. Sommer. A network that uses few active neurones to code visual input predicts the diverse shapes of cortical receptive fields. Journal of Computational Neuroscience, 22(2):135–146, 2007.
J. L. Rodgers and W. A. Nicewander. Thirteen ways to look at the correlation coefficient. The American Statistician, 42(1):59–66, 1988.
D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
B. Schölkopf. Support Vector Learning. PhD thesis, Technische Universität Berlin, 1997.
S. S. Shapiro and M. B. Wilk. An analysis of variance test for normality (complete samples). Biometrika, 52(3–4):591–611, 1965.
P. Y. Simard, D. Steinkraus, and J. C. Platt. Best practices for convolutional neural networks applied to visual document analysis. In Proceedings of the International Conference on Document Analysis and Recognition, pages 958–962, 2003.
S. Sra. Fast projections onto mixed-norm balls with applications. Data Mining and Knowledge Discovery, 25(2):358–377, 2012.
F. J. Theis and T. Tanaka. Sparseness by iterative projections onto spheres. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 709–712, 2006.
F. J. Theis, K. Stadlthanner, and T. Tanaka. First results on uniqueness of sparse non-negative matrix factorization. In Proceedings of the European Signal Processing Conference, pages 1672–1675, 2005.
M. Thom, R. Schweiger, and G. Palm. Supervised matrix factorization with sparseness constraints and fast inference. In Proceedings of the International Joint Conference on Neural Networks, pages 973–979, 2011a.
M. Thom, R. Schweiger, and G. Palm. Training of sparsely connected MLPs. In Lecture Notes in Computer Science, volume 6835, pages 356–365, 2011b.
E. van den Berg, M. Schmidt, M. P. Friedlander, and K. Murphy. Group sparsity via linear-time projection. Technical Report TR-2008-09, Department of Computer Science, University of British Columbia, 2008.
W. J. Vetter. Derivative operations on matrices. IEEE Transactions on Automatic Control, 15(2):241–244, 1970.
W. E. Vinje and J. L. Gallant. Sparse coding and decorrelation in primary visual cortex during natural vision. Science, 287(5456):1273–1276, 2000.
J. von Neumann. Functional Operators, Volume II: The Geometry of Orthogonal Spaces. Princeton University Press, 1950.
J. Weston, A. Elisseeff, B. Schölkopf, and M. Tipping. Use of the zero-norm with linear models and kernel methods. Journal of Machine Learning Research, 3:1439–1461, 2003.
D. R. Wilson and T. R. Martinez. The general inefficiency of batch training for gradient descent learning. Neural Networks, 16(11):1429–1451, 2003.
H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.