
186 nips-2012-Learning as MAP Inference in Discrete Graphical Models


Source: pdf

Author: Xianghang Liu, James Petterson, Tibério S. Caetano

Abstract: We present a new formulation for binary classification. Instead of relying on convex losses and regularizers, as in SVMs, logistic regression and boosting, or on non-convex but continuous formulations, such as those encountered in neural networks and deep belief networks, our framework entails a non-convex but discrete formulation, where estimation amounts to finding a MAP configuration in a graphical model whose potential functions are low-dimensional discrete surrogates for the misclassification loss. We argue that such a discrete formulation can naturally account for a number of issues that are typically encountered in either the convex or the continuous non-convex approaches, or both. By reducing the learning problem to a MAP inference problem, we can immediately translate the guarantees available for many inference settings to the learning problem itself. We demonstrate empirically, in a number of experiments, that this approach is promising in dealing with issues such as severe label noise, while still retaining global optimality guarantees. Due to the discrete nature of the formulation, it also allows for direct regularization through cardinality-based penalties, such as the ℓ0 pseudo-norm, thus providing the ability to perform feature selection and to trade off interpretability and predictability in a principled manner. We also outline a number of open problems arising from the formulation.
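
To make the formulation concrete, the following is a minimal sketch (Python/NumPy) of one way such a model could be set up; it is an illustration under assumptions of our own, not the authors' construction. Weights are quantized to a small grid, each pairwise potential counts the misclassifications of the corresponding two-feature sub-predictor as a low-dimensional surrogate for the 0/1 loss, an ℓ0-style penalty enters as per-variable cardinality terms, and the MAP configuration is found by brute-force enumeration; a real implementation would instead use a MAP solver such as graph cuts, max-product, or an LP relaxation, as in several of the references below. The grid size, noise level, and penalty weight are arbitrary choices for the toy example.

# Hedged sketch: binary classification recast as MAP inference over a
# pairwise discrete graphical model (illustrative assumptions, not the
# authors' exact construction).
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n examples, d features, labels in {-1, +1}, with 20% label noise.
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = np.sign(X @ w_true)
y[rng.random(n) < 0.2] *= -1

grid = np.linspace(-2.0, 2.0, 9)   # discrete states for each weight variable
l0_penalty = 5.0                   # weight of the cardinality (ell_0) term

# Pairwise potentials: psi[(i, j)][a, b] counts how many training points the
# two-feature predictor sign(grid[a]*x_i + grid[b]*x_j) misclassifies. These
# counts act as low-dimensional discrete surrogates for the 0/1 loss.
pairs = list(itertools.combinations(range(d), 2))
psi = {}
for i, j in pairs:
    margins = (y[:, None, None] *
               (X[:, i, None, None] * grid[None, :, None] +
                X[:, j, None, None] * grid[None, None, :]))
    psi[(i, j)] = (margins <= 0).sum(axis=0)

def energy(assignment):
    """Sum of pairwise surrogate losses plus a cardinality penalty."""
    e = sum(psi[(i, j)][assignment[i], assignment[j]] for i, j in pairs)
    e += l0_penalty * sum(grid[a] != 0.0 for a in assignment)
    return e

# MAP by exhaustive enumeration: feasible only because d and the grid are tiny.
best = min(itertools.product(range(len(grid)), repeat=d), key=energy)
w_map = grid[list(best)]
print("MAP weights:", w_map,
      "training 0/1 error:", np.mean(np.sign(X @ w_map) != y))

On this toy problem the search is over 9^3 = 729 configurations; the point of the formulation is that, for suitable potential structures, such a search can be replaced by MAP inference procedures with the optimality guarantees surveyed in the references below.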


reference text

[1] P. M. Long and R. A. Servedio, “Random classification noise defeats all convex potential boosters,” Machine Learning, vol. 78, no. 3, pp. 287–304, 2010.

[2] P. M. Long and R. A. Servedio, “Learning large-margin halfspaces with more malicious noise,” in NIPS, 2011.

[3] S. M. Aji and R. J. McEliece, “The generalized distributive law,” IEEE Trans. Inform. Theory, vol. 46, no. 2, pp. 325–343, 2000.

[4] B. Korte and J. Vygen, Combinatorial Optimization: Theory and Algorithms. Springer Publishing Company, Incorporated, 4th ed., 2007.

[5] V. Kolmogorov and R. Zabih, “What energy functions can be minimized via graph cuts?,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, pp. 147–159, 2004.

[6] A. Globerson and T. S. Jaakkola, “Approximate inference using planar graph decomposition,” in Advances in Neural Information Processing Systems 19 (B. Schölkopf, J. Platt, and T. Hoffman, eds.), pp. 473–480, Cambridge, MA: MIT Press, 2007.

[7] T. Jebara, “Perfect graphs and graphical modeling,” to appear in Tractability, Cambridge University Press, 2012.

[8] V. V. Vazirani, Approximation Algorithms. Springer, 2004.

[9] D. Sontag, Approximate Inference in Graphical Models using LP Relaxations. PhD thesis, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2010.

[10] P. Zhao and B. Yu, “On model selection consistency of Lasso,” J. Mach. Learn. Res., vol. 7, pp. 2541–2563, Dec. 2006.

[11] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski, “Structured sparsity through convex optimization,” Technical report, HAL 00621245-v2, to appear in Statistical Science, 2012.

[12] J. Huang, T. Zhang, and D. Metaxas, “Learning with structured sparsity,” in Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, (New York, NY, USA), pp. 417–424, ACM, 2009.

[13] F. R. Bach, “Structured sparsity-inducing norms through submodular functions,” in NIPS, pp. 118–126, 2010.

[14] D. McAllester and J. Keshet, “Generalization bounds and consistency for latent structural probit and ramp loss,” in Advances in Neural Information Processing Systems 24 (J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, eds.), pp. 2205–2212, 2011.

[15] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

[16] M. J. Wainwright and M. I. Jordan, Graphical Models, Exponential Families, and Variational Inference. Hanover, MA, USA: Now Publishers Inc., 2008.

[17] B. Potetz, “Estimating the Bayes point using linear knapsack problems,” in ICML, pp. 257–264, 2011.

[18] M.-F. Balcan, A. Blum, and S. Vempala, “Kernels as features: On kernels, margins, and low-dimensional mappings,” Machine Learning, vol. 65, no. 1, pp. 79–94, 2006.

[19] Q. Shi, C. Chen, R. Hill, and A. van den Hengel, “Is margin preserved after random projection?,” in ICML, 2012.

[20] S. Dasgupta and A. Gupta, “An elementary proof of a theorem of Johnson and Lindenstrauss,” Random Struct. Algorithms, vol. 22, pp. 60–65, Jan. 2003.

[21] A. Frank and A. Asuncion, “UCI machine learning repository,” 2010.

[22] O. Meshi and A. Globerson, “An alternating direction method for dual MAP LP relaxation,” in Proceedings of the 2011 European Conference on Machine Learning and Knowledge Discovery in Databases, Part II, ECML PKDD’11, (Berlin, Heidelberg), pp. 470–483, Springer-Verlag, 2011.

[23] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends in Machine Learning, vol. 3, 2011.

[24] T. Jebara, J. Wang, and S. Chang, “Graph construction and b-matching for semi-supervised learning,” in ICML, 2009.