nips nips2004 nips2004-51 nips2004-51-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Daniel B. Neill, Andrew W. Moore, Francisco Pereira, Tom M. Mitchell
Abstract: Assume a uniform, multidimensional grid of bivariate data, where each cell of the grid has a count ci and a baseline bi . Our goal is to find spatial regions (d-dimensional rectangles) where the ci are significantly higher than expected given bi . We focus on two applications: detection of clusters of disease cases from epidemiological data (emergency department visits, over-the-counter drug sales), and discovery of regions of increased brain activity corresponding to given cognitive tasks (from fMRI data). Each of these problems can be solved using a spatial scan statistic (Kulldorff, 1997), where we compute the maximum of a likelihood ratio statistic over all spatial regions, and find the significance of this region by randomization. However, computing the scan statistic for all spatial regions is generally computationally infeasible, so we introduce a novel fast spatial scan algorithm, generalizing the 2D scan algorithm of (Neill and Moore, 2004) to arbitrary dimensions. Our new multidimensional multiresolution algorithm allows us to find spatial clusters up to 1400x faster than the naive spatial scan, without any loss of accuracy. 1
[1] M. Kulldorff. 1997. A spatial scan statistic. Communications in Statistics: Theory and Methods 26(6), 1481-1496.
[2] M. Kulldorff. 1999. Spatial scan statistics: models, calculations, and applications. In Glaz and Balakrishnan, eds. Scan Statistics and Applications. Birkhauser: Boston, 303-322.
[3] D. B. Neill and A. W. Moore. 2003. A fast multi-resolution method for detection of significant spatial disease clusters. In Advances in Neural Information Processing Systems 16.
[4] D. B. Neill and A. W. Moore. 2004. Rapid detection of significant spatial clusters. To be published in Proc. 10th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining.
[5] J. L. Bentley. 1975. Multidimensional binary search trees used for associative searching. Comm. ACM 18, 509-517.
[6] R. A. Finkel and J. L. Bentley. 1974. Quadtrees: a data structure for retrieval on composite keys. Acta Informatica 4, 1-9.
[7] S. Openshaw, et al. 1988. Investigation of leukemia clusters by use of a geographical analysis machine. Lancet 1, 272-273.
[8] L. A. Waller, et al. 1994. Spatial analysis to detect disease clusters. In N. Lange, ed. Case Studies in Biometry. Wiley, 3-23.
[9] T. Mitchell et al. 2003. Learning to detect cognitive states from brain images. Machine Learning, in press.
[10] M. Perone Pacifico et al. 2003. False discovery rates for random fields. Carnegie Mellon University Dept. of Statistics, Technical Report 771.
[11] K. Worsley et al. 2003. Detecting activation in fMRI data. Stat. Meth. in Medical Research 12, 401-418.
[12] R. Agrawal, et al. 1998. Automatic subspace clustering of high dimensional data for data mining applications. Proc. ACM-SIGMOD Intl. Conference on Management of Data, 94-105.
[13] J. H. Friedman and N. I. Fisher. 1999. Bump hunting in high dimensional data. Statistics and Computing 9, 123-143.
[14] S. Goil, et al. 1999. MAFIA: efficient and scalable subspace clustering for very large data sets. Northwestern University, Technical Report CPDC-TR-9906-010.
[15] W. Wang, et al. 1997. STING: a statistical information grid approach to spatial data mining. Proc. 23rd Conference on Very Large Databases, 186-195.
[16] M. Kulldorff. 1998. Evaluating cluster alarms: a space-time scan statistic and brain cancer in Los Alamos. Am. J. Public Health 88, 1377-1380. 3 In a longer run on a different subject, where we iterate the scan statistic to pick out multiple significant regions, we found significant clusters in Broca’s and Wernicke’s areas in addition to the visual cortex. This makes sense given the nature of the experimental task; however, more data is needed before we can draw conclusive cross-subject comparisons.