
251 nips-2013-Predicting Parameters in Deep Learning


Source: pdf

Author: Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, Nando de Freitas

Abstract: We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
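The abstract describes learning only a small subset of each feature's weights and predicting the remaining entries from them. Below is a minimal NumPy sketch of that idea, assuming a kernel ridge interpolator over input pixel coordinates as the predictor; the function names, kernel choice, and hyperparameters are illustrative assumptions, not the paper's exact construction.

```python
# Sketch (assumption): learn a weight column only at a few positions and
# predict the rest by kernel ridge regression over input coordinates.
import numpy as np

def predict_column(coords, known_idx, known_vals, lengthscale=2.0, ridge=1e-3):
    """Predict a full weight column from values learned at known_idx positions.

    coords: (n_in, d) spatial coordinates of the input units (e.g. pixel (x, y)).
    known_idx: indices of the entries that were actually learned.
    known_vals: the learned weight values at those indices.
    """
    def rbf(a, b):
        # Squared-exponential kernel: nearby inputs get similar weights.
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * lengthscale ** 2))

    K = rbf(coords[known_idx], coords[known_idx])           # kernel among learned entries
    alpha = np.linalg.solve(K + ridge * np.eye(len(known_idx)), known_vals)
    return rbf(coords, coords[known_idx]) @ alpha           # interpolate to all entries

# Toy example: a 28x28 input layer, one feature, ~5% of its weights "learned".
rng = np.random.default_rng(0)
xs, ys = np.meshgrid(np.arange(28), np.arange(28))
coords = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)  # (784, 2)

known_idx = rng.choice(784, size=40, replace=False)   # the small learned subset
known_vals = rng.standard_normal(40)                  # stand-in for trained values
full_column = predict_column(coords, known_idx, known_vals)
print(full_column.shape)  # (784,) -- whole column reconstructed from 40 values
```

The smoothness prior encoded by the kernel is what makes prediction from a few entries plausible for spatially structured features; any other predictor that exploits redundancy across a column could be substituted.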


reference text

[1] Y. Bengio. Deep learning of representations: Looking forward. Technical Report arXiv:1305.0445, Université de Montréal, 2013.

[2] D. Cireşan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In IEEE Computer Vision and Pattern Recognition, pages 3642–3649, 2012.

[3] D. Cireşan, U. Meier, and J. Masci. High-performance neural networks for visual object classification. arXiv:1102.0183, 2011.

[4] A. Coates, A. Karpathy, and A. Ng. Emergence of object-selective features in unsupervised feature learning. In Advances in Neural Information Processing Systems, pages 2690–2698, 2012.

[5] A. Coates and A. Y. Ng. Selecting receptive fields in deep networks. In Advances in Neural Information Processing Systems, pages 2528–2536, 2011.

[6] A. Coates, A. Y. Ng, and H. Lee. An analysis of single-layer networks in unsupervised feature learning. In Artificial Intelligence and Statistics, 2011.

[7] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1232–1240, 2012.

[8] L. Deng, D. Yu, and J. Platt. Scalable stacking and learning for building deep architectures. In International Conference on Acoustics, Speech, and Signal Processing, pages 2133–2136, 2012.

[9] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In International Conference on Machine Learning, 2013.

[10] K. Gregor and Y. LeCun. Emergence of complex-like cells in a temporal product network with local receptive fields. arXiv:1006.0448, 2010.

[11] C. Gülçehre and Y. Bengio. Knowledge matters: Importance of prior information for optimization. In International Conference on Learning Representations, 2013.

[12] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.

[13] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1106–1114, 2012.

[14] K. Lang and G. Hinton. Dimensionality reduction and prior knowledge in E-set recognition. In Advances in Neural Information Processing Systems, 1990.

[15] Q. V. Le, A. Karpenko, J. Ngiam, and A. Y. Ng. ICA with reconstruction cost for efficient overcomplete feature learning. In Advances in Neural Information Processing Systems, pages 1017–1025, 2011.

[16] Q. V. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, and A. Ng. Building high-level features using large scale unsupervised learning. In International Conference on Machine Learning, 2012.

[17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[18] Y. LeCun, J. S. Denker, S. Solla, R. E. Howard, and L. D. Jackel. Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598–605, 1990.

[19] K.-F. Lee and H.-W. Hon. Speaker-independent phone recognition using hidden Markov models. IEEE Transactions on Acoustics, Speech and Signal Processing, 37(11):1641–1648, 1989.

[20] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning, pages 807–814, 2010.

[21] M. Ranzato, A. Krizhevsky, and G. E. Hinton. Factored 3-way restricted Boltzmann machines for modeling natural images. In Artificial Intelligence and Statistics, 2010.

[22] R. Rigamonti, A. Sironi, V. Lepetit, and P. Fua. Learning separable filters. In IEEE Computer Vision and Pattern Recognition, 2013.

[23] R. Rubinstein, M. Zibulevsky, and M. Elad. Double sparsity: Learning sparse dictionaries for sparse signal approximation. IEEE Transactions on Signal Processing, 58:1553–1564, 2010.

[24] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

[25] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, New York, NY, USA, 2004.

[26] K. Swersky, M. Ranzato, D. Buchman, B. Marlin, and N. de Freitas. On autoencoders and score matching for energy based models. In International Conference on Machine Learning, pages 1201–1208, 2011.

[27] P. Vincent and Y. Bengio. A neural support vector network architecture with adaptive kernels. In International Joint Conference on Neural Networks, pages 187–192, 2000.