jmlr jmlr2012 jmlr2012-70 jmlr2012-70-reference knowledge-graph by maker-knowledge-mining

70 jmlr-2012-Multi-Assignment Clustering for Boolean Data

Source: pdf

Author: Mario Frank, Andreas P. Streich, David Basin, Joachim M. Buhmann

Abstract: We propose a probabilistic model for clustering Boolean data where an object can be simultaneously assigned to multiple clusters. By explicitly modeling the underlying generative process that combines the individual source emissions, highly structured data are expressed with substantially fewer clusters compared to single-assignment clustering. As a consequence, such a model provides robust parameter estimators even when the number of samples is low. We extend the model with different noise processes and demonstrate that maximum-likelihood estimation with multiple assignments consistently infers source parameters more accurately than single-assignment clustering. Our model is primarily motivated by the task of role mining for role-based access control, where users of a system are assigned one or more roles. In experiments with real-world access-control data, our model exhibits better generalization performance than state-of-the-art approaches. Keywords: clustering, multi-assignments, overlapping clusters, Boolean data, role mining, latent feature models

reference text

Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules. In Jorge B. Bocca, Matthias Jarke, and Carlo Zaniolo, editors, Proc. 20th Int. Conf. Very Large Data Bases, VLDB, pages 487–499. Morgan Kaufmann, 1994. Rakesh Agrawal, Tomasz Imieli´ ski, and Arun Swami. Mining association rules between sets of n items in large databases. Int Conf on Management of Data, 22(2):207–216, 1993. Eugene L. Allgower and Kurt Georg. Simplicial and continuation methods for approximations, ﬁxed points and solutions to systems of equations. SIAM Review, 22:28–85, 1980. Charles E. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2(6):1152–1174, November 1974. Radim Belohlavek and Vilem Vychodil. Discovery of optimal factors in binary data via a novel method of matrix decomposition. J. Comput. Syst. Sci., 76(1):3–20, 2010. Joachim M. Buhmann and Hans Kühnel. Vector quantization with complexity costs. In IEEE Trans on Information Theory, volume 39, pages 1133–1145. IEEE, 1993. Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, 2nd ed. MIT Press, 2001. Alina Ene, William Horne, Nikola Milosavljevic, Prasad Rao, Robert Schreiber, and Robert E. Tarjan. Fast exact and heuristic methods for role minimization problems. In SACMAT ’08: Proceeding of the 13th ACM Symposium on Access Control Models and Technologies, pages 1–10, 2008. 487 F RANK , S TREICH , BASIN AND B UHMANN Thomas S. Ferguson. A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1 (2):209–230, 1973. David F. Ferraiolo, Ravi Sandhu, Serban Gavrila, D. Richard Kuhn, and Ramaswamy Chandramouli. Proposed NIST standard for role-based access control. ACM Trans. Inf. Syst. Secur., 4 (3):224–274, 2001. Mario Frank, David Basin, and Joachim M. Buhmann. A class of probabilistic models for role engineering. In CCS ’08: Proceedings of the 15th ACM Conference on Computer and Communications Security, pages 299–310, New York, NY, USA, 2008. ACM. Mario Frank, Joachim M. Buhmann, and David Basin. On the deﬁnition of role mining. In SACMAT ’10: Proceeding of the 15th ACM Symposium on Access Control Models and Technologies, pages 35–44, New York, NY, USA, 2010. ACM. Mario Frank, Morteza Chehreghani, and Joachim M. Buhmann. The minimum transfer cost principle for model-order selection. In ECML PKDD ’11: Machine Learning and Knowledge Discovery in Databases, pages 423–438. Springer Berlin / Heidelberg, 2011. Bernhard Ganter and Rudolf Wille. Springer, 1999. Formal Concept Analysis - Mathematical Foundations. Zoubin Ghahramani, Thomas L. Grifﬁths, and Peter Sollich. Bayesian nonparametric latent feature models. Bayesian Statistics 8. Oxford University Press, pages 201–225, 2007. James F. Gimpel. The minimization of spatially-multiplexed character sets. Communications of the ACM, 17(6):315–318, 1974. Thomas L. Grifﬁths and Zoubin Ghahramani. The indian buffet process: An introduction and review. Journal of Machine Learning Research, 12:1185–1224, 2011. Jiawei Han, Jian Pei, and Yiwen Yin. Mining frequent patterns without candidate generation. In SIGMOD ’00: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 1–12, New York, NY, USA, 2000. ACM. Katherine A. Heller and Zoubin Ghahramani. A nonparametric bayesian approach to modeling overlapping clusters. In Eleventh International Conference on Artiﬁcial Intelligence and Statistics (AISTATS-2007), pages 297–304, 2007. Tommi S. Jaakkola and Michael I. Jordan. Variational probabilistic inference and the qmr-dt network. Journal of Artiﬁcial Intelligence Research, 10(1):291–322, 1999. Ata Kabán and Ella Bingham. Factorisation and denoising of 0-1 data: A variational approach. Neurocomputing, 71(10-12):2291–2308, 2008. Charles Kemp, Joshua B. Tenenbaum, Thomas L. Griffths, Takeshi Yamada, and Naonori Ueda. Learning systems of concepts with an inﬁnite relational model. In Nat Conf on Artiﬁcial Intelligence, pages 763–770, 2006. Ales Keprt and Václav Snásel. Binary factor analysis with help of formal concepts. In Proc. of CLA 2004, pages 90–101, 2004. 488 M ULTI -A SSIGNMENT C LUSTERING FOR B OOLEAN DATA Martin Kuhlmann, Dalia Shohat, and Gerhard Schimpf. Role mining — revealing business roles for security administration using data mining technology. In SACMAT’03: Proceeding of the 8th ACM Symp on Access Control Models and Technologies, pages 179–186, New York, NY, USA, 2003. ACM. Harold W. Kuhn. The hungarian method for the assignment problem. In 50 Years of Integer Programming 1958-2008, pages 29–47. Springer Berlin Heidelberg, 2010. Pauli Miettinen, Taneli Mielikäinen, Aris Gionis, Gautam Das, and Heikki Mannila. The Discrete Basis Problem. In Proc. of Principles and Practice of Knowledge Discovery in Databases, pages 335–346. Springer, 2006. Ian Molloy, Hong Chen, Tiancheng Li, Qihua Wang, Ninghui Li, Elisa Bertino, Seraphin Calo, and Jorge Lobo. Mining roles with semantic meanings. In SACMAT ’08: Proceeding of the 13th ACM Symposium on Access Control Models and Technologies, pages 21–30, 2008. Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, September 1988. Kenneth Rose. Deterministic annealing for clustering, compression, classiﬁcation, regression, and related optimization problems. In Proc. of the IEEE, pages 2210–2239, 1998. Larry J. Stockmeyer. The set basis problem is NP-complete. Report RC5431, IBM Watson Research, 1975. Andreas P. Streich, Mario Frank, David Basin, and Joachim M. Buhmann. Multi-assignment clustering for Boolean data. In ICML ’09: Proceedings of the 26th Annual International Conference on Machine Learning, pages 969–976, New York, NY, USA, 2009. ACM. Jaideep Vaidya, Vijay Atluri, and Qi Guo. The Role Mining Problem: Finding a minimal descriptive set of roles. In SACMAT ’07: Proceeding of the 12th ACM Symposium on Access Control Models and Technologies, pages 175–184. ACM Press, 2007. Tomáš Šingliar and Miloš Hauskrecht. Noisy-or component analysis and its application to link analysis. Journal of Machine Learning Research, 7:2189–2213, 2006. Frank Wood, Thomas L. Grifﬁths, and Zoubin Ghahramani. A non-parametric Bayesian method for inferring hidden causes. In Conference on Uncertainty in Artiﬁcial Intelligence, pages 536–543. AUAI Press, 2006. 489