jmlr jmlr2006 jmlr2006-63 jmlr2006-63-reference knowledge-graph by maker-knowledge-mining

63 jmlr-2006-New Algorithms for Efficient High-Dimensional Nonparametric Classification

Source: pdf

Author: Ting Liu, Andrew W. Moore, Alexander Gray

Abstract: This paper is about non-approximate acceleration of high-dimensional nonparametric operations such as k nearest neighbor classiﬁers. We attempt to exploit the fact that even if we want exact answers to nonparametric queries, we usually do not need to explicitly ﬁnd the data points close to the query, but merely need to answer questions about the properties of that set of data points. This offers a small amount of computational leeway, and we investigate how much that leeway can be exploited. This is applicable to many algorithms in nonparametric statistics, memory-based learning and kernel-based learning. But for clarity, this paper concentrates on pure k-NN classiﬁcation. We introduce new ball-tree algorithms that on real-world data sets give accelerations from 2-fold to 100-fold compared against highly optimized traditional ball-tree-based k-NN . These results include data sets with up to 106 dimensions and 105 records, and demonstrate non-trivial speed-ups while giving exact answers. keywords: ball-tree, k-NN classiﬁcation

reference text

S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An optimal algorithm for approximate nearest neighbor searching ﬁxed dimensions. Journal of the ACM, 45(6):891–923, 1998. URL citeseer.ist.psu.edu/arya94optimal.html. J. Baxter and P. Bartlett. The Canonical Distortion Measure in Feature Space and 1-NN Classiﬁcation. In Advances in Neural Information Processing Systems 10. Morgan Kaufmann, 1998. S. D. Bay. UCI KDD Archive [http://kdd.ics.uci.edu]. Irvine, CA: University of California, Dept of Information and Computer Science, 1999. C. Cardie and N. Howe. Improving minority class prediction using case-speciﬁc feature weights. In Proceedings of 14th International Conference on Machine Learning, pages 57–65. Morgan Kaufmann, 1997. URL citeseer.nj.nec.com/cardie97improving.html. C. L. Chang. Finding prototypes for nearest neighbor classiﬁers. IEEE Trans. Computers, C-23 (11):1179–1184, November 1974. P. Ciaccia, M. Patella, and P. Zezula. M-tree: An efﬁcient access method for similarity search in metric spaces. In Proceedings of the 23rd VLDB International Conference, September 1997. K. Clarkson. Nearest Neighbor Searching in Metric Spaces: Experimental Results for sb(S). 2002. , S. Cost and S. Salzberg. A Weighted Nearest Neighbour Algorithm for Learning with Symbolic Features. Machine Learning, 10:57–67, 1993. T. M. Cover and P. E. Hart. Nearest neighbor pattern classiﬁcation. IEEE Trans. Information Theory, IT-13,no.1:21–27, 1967. S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391–407, 1990. K. Deng and A. W. Moore. Multiresolution Instance-based Learning. In Proceedings of the Twelfth International Joint Conference on Artiﬁcial Intelligence, pages 1233–1239, San Francisco, 1995. Morgan Kaufmann. L. Devroye and T. J. Wagner. Nearest neighbor methods in discrimination, volume 2. P.R. Krishnaiah and L. N. Kanal, eds., North-Holland, 1982. A. Djouadi and E. Bouktache. A fast algorithm for the nearest-neighbor classiﬁer. IEEE Trans. Pattern Analysis and Machine Intelligence, 19(3):277–282, March 1997. N. R. Draper and H. Smith. Applied Regression Analysis, 2nd ed. John Wiley, New York, 1981. R. O. Duda and P. E. Hart. Pattern Classiﬁcation and Scene Analysis. John Wiley & Sons, 1973. C. Faloutsos and D. W. Oard. A survey of information retrieval and ﬁltering methods. Technical Report CS-TR-3514, Carnegie Mellon University Computer Science Department, 1995. 1155 L IU , M OORE AND G RAY C. Faloutsos, R. Barber, M. Flickner, J. Hafner, W. Niblack, D. Petkovic, and William Equitz. Efﬁcient and effective querying by image content. Journal of Intelligent Information Systems, 3 (3/4):231–262, 1994. T. Fawcett and F. J. Provost. Adaptive fraud detection. Data Mining and Knowledge Discovery, 1 (3):291–316, 1997. URL citeseer.nj.nec.com/fawcett97adaptive.html. F. P. Fisher and E. A. Patrick. A preprocessing algorithm for nearest neighbor decision rules. In Proc. Nat’l Electronic Conf., volume 26, pages 481–485, December 1970. M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele, and P. Yanker. Query by image and video content: the qbic system. IEEE Computer, 28:23–32, 1995. J. H. Friedman, J. L. Bentley, and R. A. Finkel. An algorithm for ﬁnding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3(3):209–226, September 1977. K. Fukunaga and P.M. Narendra. A Branch and Bound Algorithm for Computing K-Nearest Neighbors. IEEE Trans. Computers, C-24(7):750–753, 1975. G. W. Gates. The reduced nearest neighbor rule. IEEE Trans. Information Theory, IT-18(5):431– 433, May 1972. A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. In Proceedings of the 25th VLDB Conference, 1999. A. Gray and A. W. Moore. N-Body Problems in Statistical Learning. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, editors, Advances in Neural Information Processing Systems 13. MIT Press, 2001. A. Guttman. R-trees: A dynamic index structure for spatial searching. In Proceedings of the Third ACM SIGACT-SIGMOD Symposium on Principles of Database Systems. Assn for Computing Machinery, April 1984. J. M. Hammersley. The Distribution of Distances in a Hypersphere. Annals of Mathematical Statistics, 21:447–452, 1950. P. E. Hart. The condensed nearest neighbor rule. IEEE Trans. Information Theory, IT-14(5):515– 516, May 1968. T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbor classiﬁcation. IEEE Trans. Pattern Analysis and Machine Intelligence, 18(6):607–615, June 1996. P. Indyk. On approximate nearest neighbors under l∞ norm. Journal of Computer and System Sciences, 63(4), 2001. CMU informedia digital video library project. The trec-2001 video trackorganized by nist shot boundary task, 2001. V. Koivune and S. Kassam. Nearest neighbor ﬁlters for multivariate data. In IEEE Workshop on Nonlinear Signal and Image Processing, 1995. 1156 N EW A LGORITHMS FOR E FFICIENT H IGH -D IMENSIONAL N ONPARAMETRIC C LASSIFICATION P. Komarek and A. W. Moore. Fast robust logistic regression for large sparse datasets with binary outputs. In Artiﬁcial Intelligence and Statistics, 2003. C. Kruegel and G. Vigna. Anomaly detection of web-based attacks. In Proceedings of the 10th ACM conference on Computer and communications security table of contents, pages 251–261, 2003. E. Lee and S. Chae. Fast design of reduced-complexity nearest-neighbor classiﬁers using triangular inequality. IEEE Trans. Pattern Analysis and Machine Intelligence, 20(5):562–566, May 1998. T. Liu, A. W. Moore, A. Gray, and K. Yang. An investigation of practical approximate nearest neighbor algorithms. In Proceedings of Neural Information Processing Systems, 2004a. T. Liu, K. Yang, and A. Moore. The ioc algorithm: Efﬁcient many-class non-parametric classiﬁcation for high-dimensional data. In Proceedings of the conference on Knowledge Discovery in Databases (KDD), 2004b. S. Maneewongvatana and D. M. Mount. The analysis of a probabilistic approach to nearest neighbor searching. In In Proceedings of WADS 2001, 2001. A. W. Moore. The Anchors Hierarchy: Using the Triangle Inequality to Survive High-Dimensional Data. In Twelfth Conference on Uncertainty in Artiﬁcial Intelligence. AAAI Press, 2000. S. Omachi and H. Aso. A fast algorithm for a k-nn classiﬁer based on branch and bound method and computational quantity estimation. In In Systems and Computers in Japan, vol.31, no.6, pp.1-9, 2000. S. M. Omohundro. Bumptrees for Efﬁcient Function, Constraint, and Classiﬁcation Learning. In R. P. Lippmann, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems 3. Morgan Kaufmann, 1991. S. M. Omohundro. Efﬁcient Algorithms with Neural Network Behaviour. Journal of Complex Systems, 1(2):273–347, 1987. A. M. Palau and R. R. Snapp. The labeled cell classiﬁer: A fast approximation to k nearest neighbors. In Proceedings of the 14th International Conference on Pattern Recognition, 1998. E. P. D. Pednault, B. K. Rosen, and C. Apte. Handling imbalanced data sets in insurance risk modeling, 2000. D. Pelleg and A. W. Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining. ACM, 1999. A. Pentland, R. Picard, and S. Sclaroff. Photobook: Content-based manipulation of image databases, 1994. URL citeseer.ist.psu.edu/pentland95photobook.html. F. P. Preparata and M. Shamos. Computational Geometry. Springer-Verlag, 1985. Y. Qi, A. Hauptman, and T. Liu. Supervised classiﬁcation for video shot segmentation. In Proceedings of IEEE International Conference on Multimedia and Expo, 2003. 1157 L IU , M OORE AND G RAY G. L. Ritter, H. B. Woodruff, S. R. Lowry, and T. L. Isenhour. An algorithm for a selective nearest neighbor decision rule. IEEE Trans. Information Theory, IT-21(11):665–669, November 1975. G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, New York, NY, 1983. I. K. Sethi. A fast algorithm for recognizing nearest neighbors. IEEE Trans. Systems, Man, and Cybernetics, SMC-11(3):245–248, March 1981. A. Smeulders and R. Jain, editors. Image Databases and Multi-media Search. World Scientiﬁc Publishing Company, 1996. S. Stolfo, W. Fan, W. Lee, A. Prodromidis, and P. Chan. Credit card fraud detection using metalearning: Issues and initial results, 1997. URL citeseer.nj.nec.com/stolfo97credit.html. Y. Hamamoto S. Uchimura and S. Tomita. A bootstrap technique for nearest neighbor classiﬁer design. IEEE Trans. Pattern Analysis and Machine Intelligence, 19(1):73–79, 1997. J. K. Uhlmann. Satisfying general proximity/similarity queries with metric trees. Information Processing Letters, 40:175–179, 1991. K. Woods, K. Bowyer, and W. P. Kegelmeyer Jr. Combination of multiple classiﬁers using local accuracy estimates. IEEE Trans. Pattern Analysis and Machine Intelligence, 19(4):405–410, 1997. B. Zhang and S. Srihari. Fast k-nearest neighbor classiﬁcation using cluster-based trees. IEEE Trans. Pattern Analysis and Machine Intelligence, 26(4):525–528, April 2004. T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An Efﬁcient Data Clustering Method for Very Large Databases. In Proceedings of the Fifteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems : PODS 1996. ACM, 1996. W. Zheng and A. Tropsha. A Novel Variable Selection QSAR Approach based on the K-Nearest Neighbor Principle. J. Chem. Inf.Comput. Sci., 40(1):185–194, 2000. 1158