
251 nips-2013-Predicting Parameters in Deep Learning


Source: pdf

Author: Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, Nando de Freitas

Abstract: We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
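The abstract describes learning only a small subset of each feature's weights and predicting the remaining entries from them. Below is a minimal NumPy sketch of that idea, assuming a kernel ridge interpolator over input pixel coordinates as the predictor; the function names, kernel choice, and hyperparameters are illustrative assumptions, not the paper's exact construction.

```python
# Sketch (assumption): learn a weight column only at a few positions and
# predict the rest by kernel ridge regression over input coordinates.
import numpy as np

def predict_column(coords, known_idx, known_vals, lengthscale=2.0, ridge=1e-3):
    """Predict a full weight column from values learned at known_idx positions.

    coords: (n_in, d) spatial coordinates of the input units (e.g. pixel (x, y)).
    known_idx: indices of the entries that were actually learned.
    known_vals: the learned weight values at those indices.
    """
    def rbf(a, b):
        # Squared-exponential kernel: nearby inputs get similar weights.
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * lengthscale ** 2))

    K = rbf(coords[known_idx], coords[known_idx])           # kernel among learned entries
    alpha = np.linalg.solve(K + ridge * np.eye(len(known_idx)), known_vals)
    return rbf(coords, coords[known_idx]) @ alpha           # interpolate to all entries

# Toy example: a 28x28 input layer, one feature, ~5% of its weights "learned".
rng = np.random.default_rng(0)
xs, ys = np.meshgrid(np.arange(28), np.arange(28))
coords = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)  # (784, 2)

known_idx = rng.choice(784, size=40, replace=False)   # the small learned subset
known_vals = rng.standard_normal(40)                  # stand-in for trained values
full_column = predict_column(coords, known_idx, known_vals)
print(full_column.shape)  # (784,) -- whole column reconstructed from 40 values
```

The smoothness prior encoded by the kernel is what makes prediction from a few entries plausible for spatially structured features; any other predictor that exploits redundancy across a column could be substituted.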


reference text

[1] Y. Bengio. Deep learning of representations: Looking forward. Technical Report arXiv:1305.0445, Université de Montréal, 2013.

[2] D. Cireşan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In IEEE Computer Vision and Pattern Recognition, pages 3642–3649, 2012.

[3] D. Cireşan, U. Meier, and J. Masci. High-performance neural networks for visual object classification. arXiv:1102.0183, 2011.

[4] A. Coates, A. Karpathy, and A. Ng. Emergence of object-selective features in unsupervised feature learning. In Advances in Neural Information Processing Systems, pages 2690–2698, 2012.

[5] A. Coates and A. Y. Ng. Selecting receptive fields in deep networks. In Advances in Neural Information Processing Systems, pages 2528–2536, 2011.

[6] A. Coates, A. Y. Ng, and H. Lee. An analysis of single-layer networks in unsupervised feature learning. In Artificial Intelligence and Statistics, 2011.

[7] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1232–1240, 2012.

[8] L. Deng, D. Yu, and J. Platt. Scalable stacking and learning for building deep architectures. In International Conference on Acoustics, Speech, and Signal Processing, pages 2133–2136, 2012.

[9] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In International Conference on Machine Learning, 2013.

[10] K. Gregor and Y. LeCun. Emergence of complex-like cells in a temporal product network with local receptive fields. arXiv:1006.0448, 2010.

[11] C. Gülçehre and Y. Bengio. Knowledge matters: Importance of prior information for optimization. In International Conference on Learning Representations, 2013.

[12] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.

[13] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1106–1114, 2012.

[14] K. Lang and G. Hinton. Dimensionality reduction and prior knowledge in E-set recognition. In Advances in Neural Information Processing Systems, 1990.

[15] Q. V. Le, A. Karpenko, J. Ngiam, and A. Y. Ng. ICA with reconstruction cost for efficient overcomplete feature learning. In Advances in Neural Information Processing Systems, pages 1017–1025, 2011.

[16] Q. V. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, and A. Ng. Building high-level features using large scale unsupervised learning. In International Conference on Machine Learning, 2012.

[17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[18] Y. LeCun, J. S. Denker, S. Solla, R. E. Howard, and L. D. Jackel. Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598–605, 1990.

[19] K.-F. Lee and H.-W. Hon. Speaker-independent phone recognition using hidden Markov models. IEEE Transactions on Acoustics, Speech and Signal Processing, 37(11):1641–1648, 1989.

[20] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning, pages 807–814, 2010.

[21] M. Ranzato, A. Krizhevsky, and G. E. Hinton. Factored 3-way restricted Boltzmann machines for modeling natural images. In Artificial Intelligence and Statistics, 2010.

[22] R. Rigamonti, A. Sironi, V. Lepetit, and P. Fua. Learning separable filters. In IEEE Computer Vision and Pattern Recognition, 2013.

[23] R. Rubinstein, M. Zibulevsky, and M. Elad. Double sparsity: Learning sparse dictionaries for sparse signal approximation. IEEE Transactions on Signal Processing, 58:1553–1564, 2010.

[24] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

[25] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, New York, NY, USA, 2004.

[26] K. Swersky, M. Ranzato, D. Buchman, B. Marlin, and N. de Freitas. On autoencoders and score matching for energy based models. In International Conference on Machine Learning, pages 1201–1208, 2011.

[27] P. Vincent and Y. Bengio. A neural support vector network architecture with adaptive kernels. In International Joint Conference on Neural Networks, pages 187–192, 2000.