nips nips2013 nips2013-261 nips2013-261-reference knowledge-graph by maker-knowledge-mining

261 nips-2013-Rapid Distance-Based Outlier Detection via Sampling

Source: pdf

Author: Mahito Sugiyama, Karsten Borgwardt

Abstract: Distance-based approaches to outlier detection are popular in data mining, as they do not require to model the underlying probability distribution, which is particularly challenging for high-dimensional data. We present an empirical comparison of various approaches to distance-based outlier detection across a large number of datasets. We report the surprising observation that a simple, sampling-based scheme outperforms state-of-the-art techniques in terms of both efﬁciency and effectiveness. To better understand this phenomenon, we provide a theoretical analysis why the sampling-based approach outperforms alternative methods based on k-nearest neighbor search. 1

reference text

[1] Aggarwal, C. C. Outlier Analysis. Springer, 2013.

[2] Bache, K. and Lichman, M. UCI machine learning repository, 2013.

[3] Bakar, Z. A., Mohemad, R., Ahmad, A., and Deris, M. M. A comparative study for outlier detection techniques in data mining. In Proceedings of IEEE International Conference on Cybernetics and Intelligent Systems, 1–6, 2006.

[4] Bay, S. D. and Schwabacher, M. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 29–38, 2003.

[5] Berchtold, S., Keim, D. A., and Kriegel, H.-P. The X-tree: An index structure for high-dimensional data. In Proceedings of the 22th International Conference on Very Large Data Bases, 28–39, 1996.

[6] Bhaduri, K., Matthews, B. L., and Giannella, C. R. Algorithms for speeding up distance-based outlier detection. In Proceedings of the 17th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 859–867, 2011.

[7] Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. LOF: Identifying density-based local outliers. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 93–104, 2000.

[8] Caputo, B., Sim, K., Furesjo, F., and Smola, A. Appearance-based object recognition using SVMs: Which kernel should I use? In Proceedings of NIPS Workshop on Statistical Methods for Computational Experiments in Visual Processing and Computer Vision, 2002.

[9] de Vries, T., Chawla, S., and Houle, M. E. Density-preserving projections for large-scale local anomaly detection. Knowledge and Information Systems, 32(1):25–52, 2012.

[10] Hawkins, D. Identiﬁcation of Outliers. Chapman and Hall, 1980.

[11] Knorr, E. M. and Ng, R. T. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24rd International Conference on Very Large Data Bases, 392–403, 1998.

[12] Knorr, E. M., Ng, R. T., and Tucakov, V. Distance-based outliers: algorithms and applications. The VLDB Journal, 8(3):237–253, 2000.

[13] Kriegel, H.-P., Kr¨ ger, P., and Zimak, A. Outlier detection techniques. Tutorial at 16th ACM SIGKDD o Conference on Knowledge Discovery and Data Mining, 2010.

[14] Kriegel, H.-P., Schubert, M., and Zimek, A. Angle-based outlier detection in high-dimensional data. In Proceeding of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 444–452, 2008.

[15] Liu, F. T., Ting, K. M., and Zhou, Z. H. Isolation-based anomaly detection. ACM Transactions on Knowledge Discovery from Data, 6(1):3:1–3:39, 2012.

[16] Orair, G. H., Teixeira, C. H. C., Wang, Y., Meira Jr., W., and Parthasarathy, S. Distance-based outlier detection: consolidation and renewed bearing. PVLDB, 3(2):1469–1480, 2010.

[17] Pham, N. and Pagh, R. A near-linear time approximation algorithm for angle-based outlier detection in high-dimensional data. In Proceedings of the 18th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 877–885, 2012.

[18] Ramaswamy, S., Rastogi, R., and Shim, K. Efﬁcient algorithms for mining outliers from large data sets. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 427–438, 2000.

[19] Sch¨ lkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., and Williamson, R. C. Estimating the support o of a high-dimensional distribution. Neural computation, 13(7):1443–1471, 2001.

[20] Weber, R., Schek, H.-J., and Blott, S. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of the International Conference on Very Large Data Bases, 194–205, 1998.

[21] Wu, M. and Jermaine, C. Outlier detection by sampling with accuracy guarantees. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 767–772, 2006.

[22] Yamanishi, K., Takeuchi, J., Williams, G., and Milne, P. On-line unsupervised outlier detection using ﬁnite mixtures with discounting learning algorithms. Data Mining and Knowledge Discovery, 8(3):275– 300, 2004.

[23] Yu, D., Sheikholeslami, G., and Zhang, A. FindOut: Finding outliers in very large datasets. Knowledge and Information Systems, 4(4):387–412, 2002.

[24] Zimek, A., Schubert, E., and Kriegel, H.-P. A survey on unsupervised outlier detection in highdimensional numerical data. Statistical Analysis and Data Mining, 5(5):363—387, 2012. 9