
Hit Miss Networks with Applications to Instance Selection


Source: pdf

Author: Elena Marchiori

Abstract: In supervised learning, a training set of labeled instances is used by a learning algorithm to generate a model (classifier) that is subsequently employed to decide the class label of new instances (generalization). Characteristics of the training set, such as its size and the presence of noisy instances, influence the learning algorithm and affect generalization performance. This paper introduces a new network-based representation of a training set, called a hit miss network (HMN), which provides a compact description of the nearest neighbor relation over pairs of instances from each pair of classes. We show that structural properties of HMNs correspond to properties of training points related to the one nearest neighbor (1-NN) decision rule, such as being a border or central point. This motivates us to use HMNs for improving the performance of a 1-NN classifier by removing instances from the training set (instance selection). We introduce three new HMN-based algorithms for instance selection: HMN-C, which removes instances without affecting the accuracy of 1-NN on the original training set; HMN-E, which performs a more aggressive storage reduction; and HMN-EI, which applies HMN-E iteratively. Their performance is assessed on 22 data sets with different characteristics, such as input dimension, cardinality, class balance, number of classes, noise content, and presence of redundant variables. Results of experiments on these data sets show that the accuracy of the 1-NN classifier increases significantly when HMN-EI is applied. Comparison with state-of-the-art editing algorithms for instance selection on these data sets indicates that HMN-EI achieves the best generalization performance, with no significant difference in storage requirements. In general, these results indicate that HMNs provide a powerful graph-based representation of a training set, which can be successfully applied for noise and redundancy reduction in instance-based learning. Keywords: graph-based training set representation
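
The abstract's definition suggests a simple construction: every training instance points, for each class, to its nearest neighbor within that class, and an edge counts as a "hit" when that class matches the instance's own label and as a "miss" otherwise. Incoming miss edges then concentrate on points that other classes "see" as their nearest neighbor, i.e., points near the decision boundary. The Python sketch below illustrates one plausible reading of this definition; the function name build_hmn, the degree arrays, and the final keep rule are illustrative assumptions for this sketch, not the paper's HMN-C/E/EI algorithms.

import numpy as np

def build_hmn(X, y):
    """Hit/miss in-degrees per instance, under one reading of the HMN definition.

    X : (n, d) array of instances; y : (n,) array of class labels.
    For each instance i and each class c, a directed edge goes from i to its
    nearest neighbor j among instances of class c (excluding i itself);
    the edge is a hit if c == y[i], a miss otherwise.
    """
    n = len(X)
    hit_in = np.zeros(n, dtype=int)   # incoming hit edges per instance
    miss_in = np.zeros(n, dtype=int)  # incoming miss edges per instance
    for i in range(n):
        for c in np.unique(y):
            # candidate targets: instances of class c other than i itself
            idx = np.flatnonzero((y == c) & (np.arange(n) != i))
            if idx.size == 0:
                continue
            dists = np.linalg.norm(X[idx] - X[i], axis=1)
            j = idx[np.argmin(dists)]  # nearest neighbor of i in class c
            if c == y[i]:
                hit_in[j] += 1         # i "hits" j
            else:
                miss_in[j] += 1        # i "misses" j
    return hit_in, miss_in

# Toy data: two clusters, with the last point mislabeled as class 1 but
# sitting inside class 0's region (a noisy instance).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [5.0, 5.0], [5.0, 6.0], [0.5, 0.5]])
y = np.array([0, 0, 0, 1, 1, 1])
hit_in, miss_in = build_hmn(X, y)

# Purely hypothetical filter (NOT the paper's HMN-E): drop instances whose
# incoming misses outnumber incoming hits. The mislabeled point (index 5)
# accumulates misses and is flagged; some genuine border points may be too.
keep = hit_in >= miss_in
print(hit_in, miss_in, keep)

Under this reading, a point with many incoming hits and no misses is surrounded by its own class (central), while a point whose in-degree is dominated by misses lies where other classes look for their nearest neighbor (border or noise); this is the structural correspondence the abstract refers to, and the paper's HMN-C, HMN-E, and HMN-EI rules refine it into concrete selection criteria.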


References

D.W. Aha, D. Kibler, and M.K. Albert. Instance-based learning algorithms. Machine Learning, 6:37–66, 1991.
F. Angiulli. Fast nearest neighbor condensation for large data sets classification. IEEE Transactions on Knowledge and Data Engineering, 19(11):1450–1464, 2007.
V. Barnett. The ordering of multivariate data. Journal of the Royal Statistical Society, Series A, 139(3):318–355, 1976.
B. Bhattacharya and D. Kaller. Reference set thinning for the k-nearest neighbor decision rule. In Proceedings of the 14th International Conference on Pattern Recognition, volume 1, pages 238–243, 1998.
B. Bhattacharya, K. Mukherjee, and G. Toussaint. Geometric decision rules for instance-based learning problems. In Proceedings of the 1st International Conference on Pattern Recognition and Machine Intelligence (PReMI'05), volume 3776 of LNCS, pages 60–69. Springer, 2005.
B.K. Bhattacharya. Application of computational geometry to pattern recognition problems. PhD thesis, School of Computing Science, Simon Fraser University, Technical Report TR 82-3, 1982.
H. Brighton and C. Mellish. On the consistency of information filters for lazy learning algorithms. In PKDD '99: Proceedings of the Third European Conference on Principles of Data Mining and Knowledge Discovery, pages 283–288. Springer-Verlag, 1999.
H. Brighton and C. Mellish. Advances in instance selection for instance-based learning algorithms. Data Mining and Knowledge Discovery, 6:153–172, 2002.
R.M. Cameron-Jones. Instance selection by encoding length heuristic with random mutation hill climbing. In Proceedings of the Eighth Australian Joint Conference on Artificial Intelligence, pages 99–106, 1995.
O. Chapelle and A. Zien. Semi-supervised classification by low density separation. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pages 57–64, 2005.
T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13:21–27, 1967.
B.V. Dasarathy. Minimal consistent set (MCS) identification for optimal nearest neighbor decision systems design. IEEE Transactions on Systems, Man, and Cybernetics, 24(3):511–517, 1994.
J. Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.
J. DeVinney and C.E. Priebe. The use of domination number of a random proximity catch digraph for testing spatial patterns of segregation and association. Statistics and Probability Letters, 73(1):37–50, 2005.
J. DeVinney and C.E. Priebe. A new family of proximity graphs: Class cover catch digraphs. Discrete Applied Mathematics, 154(14):1975–1982, 2006.
D.J. Marchette, E.J. Wegman, and C.E. Priebe. Fast algorithms for classification using class cover catch digraphs. Handbook of Statistics, 24:331–358, 2005.
S.N. Dorogovtsev and J.F.F. Mendes. Evolution of Networks: From Biological Nets to the Internet and WWW. Oxford University Press, 2003.
P.J. Grother, G.T. Candela, and J.L. Blue. Fast implementation of nearest neighbor classifiers. Pattern Recognition, 30:459–465, 1997.
P.E. Hart. The condensed nearest neighbor rule. IEEE Transactions on Information Theory, 14:515–516, 1968.
N. Jankowski and M. Grochowski. Comparison of instances selection algorithms I: Algorithms survey. In Artificial Intelligence and Soft Computing, pages 598–603. Springer, 2004a.
N. Jankowski and M. Grochowski. Comparison of instances selection algorithms II: Results and comments. In Artificial Intelligence and Soft Computing, pages 580–585. Springer, 2004b.
J.W. Jaromczyk and G.T. Toussaint. Relative neighborhood graphs and their relatives. Proceedings of the IEEE, 80:1502–1517, 1992.
D.J. Marchette, C.E. Priebe, D.A. Socolinsky, and J.G. DeVinney. Classification using class cover catch digraphs. Journal of Classification, 20(1):3–23, 2003.
K. Mukherjee. Application of the Gabriel graph to instance-based learning. M.Sc. project, School of Computing Science, Simon Fraser University, 2004.
E. Pekalska, R.P.W. Duin, and P. Paclík. Prototype selection for dissimilarity-based classifiers. Pattern Recognition, 39(2):189–208, 2006.
G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Machine Learning, 42(3):287–320, 2001.
G.L. Ritter, H.B. Woodruff, S.R. Lowry, and T.L. Isenhour. An algorithm for a selective nearest neighbor decision rule. IEEE Transactions on Information Theory, 21(6):665–669, 1975.
J.S. Sánchez, F. Pla, and F.J. Ferri. Prototype selection for the nearest neighbour rule through proximity graphs. Pattern Recognition Letters, 18:507–513, 1997.
J. Sankaranarayanan, H. Samet, and A. Varshney. A fast all nearest neighbor algorithm for applications involving large point-clouds. Computers & Graphics, 31(2):157–174, 2007.
H. Shin and S. Cho. Neighborhood property based pattern selection for support vector machines. Neural Computation, 19:816–855, 2007.
I. Tomek. An experiment with the edited nearest-neighbor rule. IEEE Transactions on Systems, Man, and Cybernetics, 6(6):448–452, 1976.
G.T. Toussaint. Proximity graphs for nearest neighbor decision rules: Recent progress. In Interface-2002, 34th Symposium on Computing and Statistics, pages 83–106, 2002.
G.T. Toussaint, B.K. Bhattacharya, and R.S. Poulsen. The application of Voronoi diagrams to nonparametric decision rules. In Proceedings of the 16th Symposium on Computer Science and Statistics, pages 97–108, 1984.
A. Vezhnevets and O. Barinova. Avoiding boosting overfitting by removing "confusing" samples. In Proceedings of the 18th European Conference on Machine Learning (ECML), volume 4701 of LNCS, pages 430–441. Springer, 2007.
D.L. Wilson. Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics, 2:408–420, 1972.
D.R. Wilson and T.R. Martinez. Instance pruning techniques. In Proceedings of the 14th International Conference on Machine Learning, pages 403–411. Morgan Kaufmann, 1997.
D.R. Wilson and T.R. Martinez. Reduction techniques for instance-based learning algorithms. Machine Learning, 38(3):257–286, 2000.
D.A. Zighed, S. Lallich, and F. Muhlenbach. Separability index in supervised learning. In PKDD '02: Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery, pages 475–487. Springer-Verlag, 2002.