nips nips2000 nips2000-148 nips2000-148-reference knowledge-graph by maker-knowledge-mining

148 nips-2000-`N-Body' Problems in Statistical Learning

Source: pdf

Author: Alexander G. Gray, Andrew W. Moore

Abstract: We present efficient algorithms for all-point-pairs problems , or 'Nbody '-like problems , which are ubiquitous in statistical learning. We focus on six examples, including nearest-neighbor classification, kernel density estimation, outlier detection , and the two-point correlation. These include any problem which abstractly requires a comparison of each of the N points in a dataset with each other point and would naively be solved using N 2 distance computations. In practice N is often large enough to make this infeasible. We present a suite of new geometric t echniques which are applicable in principle to any 'N-body ' computation including large-scale mixtures of Gaussians, RBF neural networks, and HMM 's. Our algorithms exhibit favorable asymptotic scaling and are empirically several orders of magnitude faster than the naive computation, even for small datasets. We are aware of no exact algorithms for these problems which are more efficient either empirically or theoretically. In addition, our framework yields simple and elegant algorithms. It also permits two important generalizations beyond the standard all-point-pairs problems, which are more difficult. These are represented by our final examples, the multiple two-point correlation and the notorious n-point correlation. 1

reference text

[1] J. BaInes and P. Hut . A Hierarchical O(NlogN) Force-Calculation Algorithm. Nature, 324 , 1986.

[2] K . Deng and A. W. Moore. Multiresolution instance-based learning. In Proceedings of the Twelfth International Joint Conference on Artificial Intelligence, pages 1233- 1239, San Francisco, 1995. Morgan Kaufmann.

[3] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An algorithm for finding best matches in logarithmic expected time. A CM Transa ctions on Mathematical Software, 3(3):209- 226, September 1977.

[4] L. Greengard and V. Rokhlin. A Fast Algorithm for Particle Simulations. Journal of Computational Physics, 73, 1987.

[5] D. E. Knuth. Sor路ting and Searching. Addison Wesley , 1973.

[6] A. W. M oore. Very fast mixt ure-model-based clustering using multiresolution kd-trees. In M. Kearns and D. Cohn, editors, Advan ces in Neural Information Processing Systems 10, pages 543- 549, San Francisco, April 1999. Morgan Kaufmann.

[7] A. W. Moore. The Anchors Hierarchy: Using the triangle inequality to survive high dimensional data. In Twelfth Conference on Un certainty in A rtificial Intelligence (to appear) . AAAI Press, 2000.

[8] A. W. Moore and M. S. Lee. Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets. Journal of Artificial Intelligen ce Research, 8, March 1998.

[9] D. Pelleg and A. W. Moore . Accelerating Exact k-means Algorithms with Geometric Reasoning. In Proceedings of the Fifth International Conference on Knowledge Discove ry and Data Mining. AAAI Press, 1999.

[10] F. P . Preparata and M. Shamos. Computational Geometry. Springer-Verlag, 1985.

[11] A. Szalay. Personal Communication. 2000.

[12] I. Szapudi. A New Method for Calculating Counts in Cells. The Astrophysical Journal, 1997. [l3] I. Szapudi, S. Colombi, and F. BeInardeau. Cosmic Statistics of Statistics. N otices of the Royal Astronomical Society, 1999. Monthly

[14] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. In Proceedings of the Fifteenth ACM SIGACTSIGMOD-SIGART Symposium on Principles of Database Systems : PODS 1996. Assn for Computing Machinery, 1996.