jmlr jmlr2008 jmlr2008-22 jmlr2008-22-reference knowledge-graph by maker-knowledge-mining

22 jmlr-2008-Closed Sets for Labeled Data

Source: pdf

Author: Gemma C. Garriga, Petra Kralj, Nada Lavrač

Abstract: Closed sets have been proven successful in the context of compacted data representation for association rule learning. However, their use is mainly descriptive, dealing only with unlabeled data. This paper shows that when considering labeled data, closed sets can be adapted for classiﬁcation and discrimination purposes by conveniently contrasting covering properties on positive and negative examples. We formally prove that these sets characterize the space of relevant combinations of features for discriminating the target class. In practice, identifying relevant/irrelevant combinations of features through closed sets is useful in many applications: to compact emerging patterns of typical descriptive mining applications, to reduce the number of essential rules in classiﬁcation, and to efﬁciently learn subgroup descriptions, as demonstrated in real-life subgroup discovery experiments on a high dimensional microarray data set. Keywords: rule relevancy, closed sets, ROC space, emerging patterns, essential rules, subgroup discovery

reference text

R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A.I. Verkamo. Fast discovery of association rules. Advances in Knowledge Discovery and Data Mining, pages 307–328, 1996. J.L. Balc´ zar and J. Baixeries. Discrete deterministic datamining as knowledge compilation. In a SIAM Int. Workshop on Discrete Mathematics and Data Mining, 2003. E. Baralis and S. Chiusano. Essential classiﬁcation rule sets. ACM Trans. Database Syst., 29(4): 635–674, 2004. Y. Bastide, N. Pasquier, R. Taouil, G. Stumme, and L. Lakhal. Mining minimal non-redundant association rules using frequent closed itemsets. Lecture Notes in Computer Science, 1861:972– 986, 2000a. Y. Bastide, R. Taouil, N. Pasquier, G. Stumme, and L. Lakhal. Mining frequent patterns with counting inference. SIGKDD Explor. Newsl., 2(2):66–75, 2000b. S.D. Bay and M.J. Pazzani. Detecting group differences: Mining contrast sets. Data Min. Knowl. Discov., 5(3):213–246, 2001. ISSN 1384-5810. 577 ˇ G ARRIGA , K RALJ AND L AVRA C J.F. Boulicaut, A. Bykowski, and C. Rigotti. Free-sets: A condensed representation of boolean data for the approximation of frequency queries. Data Min. Knowl. Discov., 7(1):5–22, 2003. ISSN 1384-5810. S. Brin, R. Motwani, J.D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proceedings ACM SIGMOD Int. Conference on Management of Data, pages 255–264, 1997. T. Calders and B. Goethals. Minimal k-free representations of frequent sets. In Proceedings of the 7th European Conference on Principles and Knowledge Discovery in Data mining, pages 71–82, 2003. C. Carpineto and G. Romano. Concept Data Analysis. Theory and Applications. Wiley, 2004. H.C. Causton, J. Quackenbush, and A. Brazma. Microarray Gene Expression Data Analysis: A Beginner’s Guide. Blackwell Publishing, Oxford, United Kingdom, 2003. P. Clark and T. Niblett. The CN2 induction algorithm. Machine Learning, 3(4):261–283, 1989. W. W. Cohen. Fast effective rule induction. In Proceedings of the 12th International Conference on Machine Learning, pages 115–123, 1995. B. Cr´ milleux and J. F. Boulicaut. Simplest rules characterizing classes generated by delta-free e sets. In Proceedings of the 22nd Annual International Conference Knowledge Based Systems and Applied Artiﬁcial Intelligence, pages 33–46, 2002. G. Dong and J. Li. Efﬁcient mining of emerging patterns: discovering trends and differences. In Proceedings of the 5th Int. Conference on Knowledge discovery and data mining, pages 43–52, 1999. G. Dong, X. Zhang, L. Wong, and J. Li. CAEP: classiﬁcation by aggregating emerging patterns. In Proceedings of the 2nd In. Conference on Discovery Science, pages 30–42, 1999. D. Gamberger and N. Lavraˇ . Expert-guided subgroup discovery: Methodology and application. c Journal of Artiﬁcial Intelligence Research, 17:501–527, 2002. ˇ D. Gamberger, N. Lavraˇ , F. Zelezn´ , and J. Tolar. Induction of comprehensible models for gene c y expression datasets by subgroup discovery methodology. Journal of Biomedical Informatics, 37 (4):269–284, 2004. B. Ganter and R. Wille. Formal Concept Analysis. Mathematical Foundations. Springer, 1998. G.C. Garriga, P. Kralj, and N.Lavraˇ . Closed sets for labeled data. In Proceedings of the 10th Int. c Conference on Principles and Knowledge Discovery on Databases, pages 163–174, 2006. B. Goethals and M. Zaki. Advances in frequent itemset mining implementations: report on FIMI’03. SIGKDD Explor. Newsl., 6(1):109–117, 2004. J. Han and J. Pei. Mining frequent patterns by pattern-growth: methodology and implications. SIGKDD Explor. Newsl., 2(2):14–20, 2000. 578 C LOSED S ETS FOR L ABELED DATA V. Jovanoski and N. Lavraˇ . Classiﬁcation rule learning with APRIORI-C. In Proceedings of c the10th Portuguese Conference on Artiﬁcial Intelligence on Progress in Artiﬁcial Intelligence, Knowledge Extraction, Multi-agent Systems, Logic Programming and Constraint Solving (EPIA ’01), pages 44–51. Springer-Verlag, 2001. B. Kavˇek and N. Lavraˇ . APRIORI-SD: Adapting association rule learning to subgroup discovery. s c Applied Artiﬁcial Intelligence, To appear, 2006. D. Koller and M. Sahami. Toward optimal feature selection. In Proceedings of the 13th Int. Conference on Machine Learning, pages 284–292, 1996. P. Kralj, A. Grubeˇiˇ , K. Gruden N. Toplak, N. Lavraˇ , and G.C. Garriga. Application of closed sc c itemset mining for class labeled data in functional genomics. Informatica Medica Slovenica, 2006. N. Lavraˇ and D. Gamberger. Relevancy in constraint-based subgroup discovery. Constraint-Based c Mining and Inductive databases, 3848:243–266, 2005. N. Lavraˇ , D. Gamberger, and V. Jovanoski. A study of relevance for learning in deductive c databases. Journal of Logic Programming, 40(2/3):215–249, 1999. N. Lavraˇ , B. Kavˇek, P. Flach, and L. Todorovski. Subgroup discovery with CN2-SD. Journal of c s Machine Learning Research, 5:153–188, 2004. J. Li, G. Dong, and K. Ramamohanarao. Instance-based classiﬁcation by emerging patterns. In Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, pages 191–200, 2000. B. Liu, W. Hsu, and Y. Ma. Integrating classiﬁcation and association rule mining. In Proceedings of the 4th Int. Conference on Knowledge Discovery and Data Mining, pages 571–574, 1998. H. Liu and H. Motoda. Feature Selection for Knowledge Discovery and Data Mining. Kluwer Academic Publishers, 1998. M. Molla, M. Waddell, D. Page, and J. Shavlik. Using machine learning to design and interpret gene-expression microarrays. AI Magazine, 25(1):23–44, 2004. G. Parmigiani, E.S. Garrett, R.A. Irizarry, and S.L. Zeger, editors. The Analysis of Gene Expression Data: Methods and Software. Springer-Verlag, New York, 2003. N. Pasquier, Y. Bastide, R. Taouil L., and Lakhal. Closed set based discovery of small covers for association rules. Networking and Information Systems, 3(2):349–377, 2001. J.L. Pfaltz. Closure lattices. Discrete Mathematics, 154:217–236, 1996. J.L. Pfaltz and C.M. Taylor. Scientiﬁc knowledge discovery through iterative transformations of concept lattices. In SIAM Int. Workshop on Discrete Mathematics and Data Mining, pages 65– 74, 2002. F.J. Provost and T. Fawcett. Robust classiﬁcation for imprecise environments. Machine Learning, 42(3):203–231, 2001. 579 ˇ G ARRIGA , K RALJ AND L AVRA C A. Soulet, B. Cr´ milleux, and F. Rioult. Condensed representation of eps and patterns quantiﬁed e by frequency-based measures. In Proceedings of Knowledge Discovery in Inductive Databases Workshop, pages 173–190, 2004. T.P. Speed, editor. Statistical Analysis of Gene Expression Microarray Data. Chapman & Hall/CRC, Boca Raton, 2003. L. Taiz and E. Zeiger. Plant Physiology. Sinauer Associates, second edition (372:374) edition, 1998. P-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2005. R. Taouil, Y. Bastide, N. Pasquier, and L. Lakhal. Mining bases for association rules using closed sets. In Proceedings of the 16th Int. Conference on Data Engineering, page 307. IEEE Computer Society, 2000. T. Uno, T. Asai, Y. Uchida, and H. Arimura. An efﬁcient algorithm for enumerating closed patterns in transaction databases. In Discovery Science, pages 16–31, 2004. I.H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2005. M. Zaki. Generating non-redundant association rules. In Proceedings of the 6th Int. Conference on Knowledge Discovery and Data Mining, pages 34–43, 2000a. M. Zaki. Mining non-redundant association rules. Data Mining and Knowledge Discovery: An International Journal, 4(3):223–248, 2004. M. Zaki. Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering, 12(3):372–390, 2000b. M. Zaki and M. Ogihara. Theoretical foundations of association rules. In SIGMOD-DMKD Int. Workshop on Research Issues in Data Mining and Knowledge Discovery, 1998. J. Zhang, E. Bloedorn, L. Rosen, and D. Venese. Learning rules from highly unbalanced data sets. In Proceedings of the 4th. IEEE Int. Conference on Data Mining (ICDM’04), pages 571–574, 2004. 580