jmlr jmlr2013 jmlr2013-42 jmlr2013-42-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Edward McFowland III, Skyler Speakman, Daniel B. Neill
Abstract: We propose Fast Generalized Subset Scan (FGSS), a new method for detecting anomalous patterns in general categorical data sets. We frame the pattern detection problem as a search over subsets of data records and attributes, maximizing a nonparametric scan statistic over all such subsets. We prove that the nonparametric scan statistics possess a novel property that allows for efficient optimization over the exponentially many subsets of the data without an exhaustive search, enabling FGSS to scale to massive and high-dimensional data sets. We evaluate the performance of FGSS in three real-world application domains (customs monitoring, disease surveillance, and network intrusion detection), and demonstrate that FGSS can successfully detect and characterize relevant patterns in each domain. As compared to three other recently proposed detection algorithms, FGSS substantially decreased run time and improved detection power for massive multivariate data sets.
Keywords: pattern detection, anomaly detection, knowledge discovery, Bayesian networks, scan statistics
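The abstract's central idea, maximizing a nonparametric scan statistic over subsets without enumerating them, can be illustrated with a small sketch. This is not the authors' implementation. It assumes that each record already has one p-value per attribute, that the attribute subset is fixed to all attributes, and that the score is a higher-criticism-style statistic in the spirit of Donoho and Jin (2004). The function names, the strict `<` threshold, and the grid of alpha values are all hypothetical choices made for illustration. For a fixed alpha and a fixed subset size, this score grows with the number of significant p-values, so only the top-k records ranked by that count need to be scored.

```python
import numpy as np


def higher_criticism(n_alpha, n, alpha):
    """Higher-criticism-style score: excess of significant p-values over
    the n * alpha expected under the null, in standard-deviation units."""
    return (n_alpha - n * alpha) / np.sqrt(n * alpha * (1.0 - alpha))


def best_record_subset(pvals, alphas=(0.01, 0.05, 0.1)):
    """Search record subsets of a (records x attributes) p-value matrix.

    Assumption: all attributes are kept, so only records are selected.
    For each alpha, records are ranked by how many of their p-values fall
    below alpha, and only the n nested top-k subsets are scored instead of
    all 2^n - 1 non-empty subsets.
    """
    pvals = np.asarray(pvals, dtype=float)
    n_records, n_attrs = pvals.shape
    best_score, best_rows, best_alpha = -np.inf, (), None
    for alpha in alphas:
        # Priority of a record: its number of significant p-values.
        counts = (pvals < alpha).sum(axis=1)
        order = np.argsort(-counts, kind="stable")
        # Cumulative significant counts for the top-1, top-2, ... subsets.
        cum_significant = np.cumsum(counts[order])
        # Total number of p-values in each top-k subset.
        totals = np.arange(1, n_records + 1) * n_attrs
        scores = higher_criticism(cum_significant, totals, alpha)
        k = int(np.argmax(scores))
        if scores[k] > best_score:
            best_score = float(scores[k])
            best_rows = tuple(sorted(int(i) for i in order[: k + 1]))
            best_alpha = alpha
    return best_score, best_rows, best_alpha


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p = rng.uniform(size=(8, 3))
    p[:3] = 0.001  # plant three records with very small p-values
    print(best_record_subset(p))
```

With n records and a grid of A alpha values, the sketch scores n * A candidate subsets rather than A * (2^n - 1). The full method described in the abstract also searches over attribute subsets, which this sketch leaves out.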
References:
KDD Cup, 1999. URL http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
R. H. Berk and D. H. Jones. Goodness-of-fit test statistics that dominate the Kolmogorov statistics. Z. Wahrsch. Verw. Gebiete, 47:47–59, 1979.
D. H. Chau, S. Pandit, and C. Faloutsos. Detecting fraudulent personalities in networks of online auctioneers. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, pages 103–114, 2006.
K. Das. Detecting patterns of anomalies. Technical Report CMU-ML-09-101, PhD thesis, Carnegie Mellon University, Department of Machine Learning, 2009.
K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proceedings of the 13th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 220–229, 2007.
K. Das, J. Schneider, and D. B. Neill. Anomaly pattern detection in categorical datasets. In Proceedings of the 14th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2008.
A. C. Davison and D. V. Hinkley. Bootstrap Methods and Their Applications. Cambridge University Press, 1997.
D. Donoho and J. Jin. Higher criticism for detecting sparse heterogeneous mixtures. Annals of Statistics, 32(3):962–994, 2004.
W. R. Hogan, G. F. Cooper, G. L. Wallstrom, M. M. Wagner, and J.-M. Depinay. The Bayesian aerosol release detector: An algorithm for detecting and characterizing outbreaks caused by an atmospheric release of Bacillus anthracis. Statistics in Medicine, 26:5225–5252, 2007.
M. Kulldorff. A spatial scan statistic. Communications in Statistics: Theory and Methods, 26(6):1481–1496, 1997.
M. Kulldorff and N. Nagarwalla. Spatial disease clusters: detection and inference. Statistics in Medicine, 14:799–810, 1995.
A. W. Moore and W.-K. Wong. Optimal reinsertion: A new search operator for accelerated and more accurate Bayesian network structure learning. In Proceedings of the 20th International Conference on Machine Learning, pages 552–559. AAAI Press, 2003.
D. B. Neill. Expectation-based scan statistics for monitoring spatial time series data. International Journal of Forecasting, 25:498–517, 2009.
D. B. Neill. Fast Bayesian scan statistics for multivariate event detection and visualization. Statistics in Medicine, 30(5):455–469, 2011.
D. B. Neill. Fast subset scan for spatial pattern detection. Journal of the Royal Statistical Society (Series B: Statistical Methodology), 74(2):337–360, 2012.
D. B. Neill and G. F. Cooper. A multivariate Bayesian scan statistic for early event detection and characterization. Machine Learning, 79:261–282, 2010.
D. B. Neill and J. Lingwall. A nonparametric scan statistic for multivariate disease surveillance. Advances in Disease Surveillance, 4:106, 2007.
D. B. Neill, A. W. Moore, M. R. Sabhnani, and K. Daniel. Detection of emerging space-time clusters. In Proceedings of the 11th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2005.
D. B. Neill, G. F. Cooper, K. Das, X. Jiang, and J. Schneider. Bayesian network scan statistics for multivariate pattern detection. In J. Glaz, V. Pozdnyakov, and S. Wallenstein, editors, Scan Statistics: Methods and Applications, 2008.