nips2013-75-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Özlem Aslan, Hao Cheng, Xinhua Zhang, Dale Schuurmans
Abstract: Latent variable prediction models, such as multi-layer networks, impose auxiliary latent variables between inputs and outputs to allow automatic inference of implicit features useful for prediction. Unfortunately, such models are difficult to train because inference over latent variables must be performed concurrently with parameter optimization—creating a highly non-convex problem. Instead of proposing another local training method, we develop a convex relaxation of hidden-layer conditional models that admits global training. Our approach extends current convex modeling approaches to handle two nested nonlinearities separated by a non-trivial adaptive latent layer. The resulting methods are able to acquire two-layer models that cannot be represented by any single-layer model over the same features, while improving training quality over local heuristics.
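To make the non-convexity concern concrete, the following is a minimal sketch (assuming Python with numpy; the toy data, layer sizes, squared loss, and plain gradient descent are illustrative assumptions, not the paper's convex relaxation) of the standard local training of a one-hidden-layer conditional model, in which the latent layer and both weight matrices are fitted jointly:

# Illustrative sketch, NOT the paper's method: a one-hidden-layer model
# y ~ W2 * sigma(W1 * x) trained locally by gradient descent on squared loss.
# Jointly optimizing both layers through the nonlinearity is non-convex,
# which is the difficulty the paper's convex relaxation is designed to avoid.
import numpy as np

rng = np.random.default_rng(0)

def sigma(z):
    # logistic transfer for the latent (hidden) layer
    return 1.0 / (1.0 + np.exp(-z))

# Toy data (assumed for illustration): 200 points, 5 inputs, 3 outputs.
X = rng.standard_normal((200, 5))
Y = np.tanh(X @ rng.standard_normal((5, 3)))

d_in, d_hid, d_out = 5, 10, 3
W1 = 0.1 * rng.standard_normal((d_in, d_hid))   # input-to-hidden weights
W2 = 0.1 * rng.standard_normal((d_hid, d_out))  # hidden-to-output weights

lr = 0.1
for step in range(500):
    H = sigma(X @ W1)   # latent layer, inferred implicitly from current W1
    P = H @ W2          # predictions
    R = P - Y           # residuals
    loss = 0.5 * np.mean(np.sum(R ** 2, axis=1))
    # Backpropagate through the two nested (linear/nonlinear) stages.
    G2 = H.T @ R / len(X)
    GH = R @ W2.T * H * (1 - H)
    G1 = X.T @ GH / len(X)
    W2 -= lr * G2
    W1 -= lr * G1

print("final squared-error loss:", loss)

Because the loss couples W1 and W2 through the nonlinearity, this joint objective is non-convex and gradient descent only reaches local solutions; the paper instead relaxes such two-layer conditional models so that global training becomes a convex problem.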
[1] Q. Le, M. Ranzato, R. Monga, M. Devin, G. Corrado, K. Chen, J. Dean, and A. Ng. Building high-level features using large scale unsupervised learning. In ICML, 2012.
[2] N. Srivastava and R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. In NIPS, 2012.
[3] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2:1–127, 2009.
[4] G. Hinton. Learning multiple layers of representations. Trends in Cognitive Sciences, 11:428–434, 2007.
[5] G. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural Comput., 18(7), 2006.
[6] N. Lawrence. Probabilistic non-linear principal component analysis. JMLR, 6:1783–1816, 2005.
[7] A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh. Clustering with Bregman divergences. JMLR, 6:1705–1749, 2005.
[8] M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans. on Image Processing, 15:3736–3745, 2006.
[9] P. Comon. Independent component analysis, a new concept? Signal Processing, 36(3):287–314, 1994.
[10] M. Carreira-Perpiñán and Z. Lu. Dimensionality reduction by unsupervised regression. In CVPR, 2010.
[11] N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method. In Allerton Conf., 1999.
[12] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In ICML, 2011.
[13] K. Swersky, M. Ranzato, D. Buchman, B. Marlin, and N. de Freitas. On autoencoders and score matching for energy based models. In ICML, 2011.
[14] Y. LeCun. Who is afraid of non-convex loss functions? http://videolectures.net/eml07_lecun_wia/, 2007.
[15] Y. Bengio, N. Le Roux, P. Vincent, and O. Delalleau. Convex neural networks. In NIPS, 2005.
[16] S. Nowozin and G. Bakir. A decoupled approach to exemplar-based unsupervised learning. In ICML, 2008.
[17] D. Bradley and J. Bagnell. Convex coding. In UAI, 2009.
[18] A. Joulin and F. Bach. A convex relaxation for weakly supervised classifiers. In ICML, 2012.
[19] A. Joulin, F. Bach, and J. Ponce. Efficient optimization for discriminative latent class models. In NIPS, 2010.
[20] Y. Guo and D. Schuurmans. Convex relaxations of latent variable training. In NIPS 20, 2007.
[21] A. Goldberg, X. Zhu, B. Recht, J. Xu, and R. Nowak. Transduction with matrix completion: Three birds with one stone. In NIPS 23, 2010.
[22] E. Candes, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? arXiv:0912.3599, 2009.
[23] X. Zhang, Y. Yu, and D. Schuurmans. Accelerated training for matrix-norm regularization: A boosting approach. In NIPS 25, 2012.
[24] A. Anandkumar, D. Hsu, and S. Kakade. A method of moments for mixture models and hidden Markov models. In COLT, 2012.
[25] D. Hsu and S. Kakade. Learning mixtures of spherical Gaussians: Moment methods and spectral decompositions. In Innovations in Theoretical Computer Science (ITCS), 2013.
[26] Y. Cho and L. Saul. Large margin classification in infinite neural networks. Neural Comput., 22, 2010.
[27] R. Neal. Connectionist learning of belief networks. Artificial Intelligence, 56(1):71–113, 1992.
[28] G. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. JMAA, 33:82–95, 1971.
[29] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2:265–292, 2001.
[30] J. Fuernkranz, E. Huellermeier, E. Mencia, and K. Brinker. Multilabel classification via calibrated label ranking. Machine Learning, 73(2):133–153, 2008.
[31] Y. Guo and D. Schuurmans. Adaptive large margin training for multilabel classification. In AAAI, 2011.
[32] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3), 2008.
[33] Y. Nesterov and A. Nemirovskii. Interior-Point Polynomial Algorithms in Convex Programming. SIAM, 1994.
[34] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge U. Press, 2004.
[35] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–123, 2010.
[36] S. Laue. A hybrid algorithm for convex semidefinite optimization. In ICML, 2012.
[37] O. Chapelle. Training a support vector machine in the primal. Neural Comput., 19(5):1155–1178, 2007.
[38] T. Joachims. Transductive inference for text classification using support vector machines. In ICML, 1999.
[39] V. Sindhwani and S. Keerthi. Large scale semi-supervised linear SVMs. In SIGIR, 2006.
[40] http://olivier.chapelle.cc/ssl-book/benchmarks.html.
[41] http://archive.ics.uci.edu/ml/datasets.
[42] http://www.cs.toronto.edu/~kriz/cifar.html.