nips nips2013 nips2013-261 nips2013-261-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mahito Sugiyama, Karsten Borgwardt
Abstract: Distance-based approaches to outlier detection are popular in data mining, as they do not require to model the underlying probability distribution, which is particularly challenging for high-dimensional data. We present an empirical comparison of various approaches to distance-based outlier detection across a large number of datasets. We report the surprising observation that a simple, sampling-based scheme outperforms state-of-the-art techniques in terms of both efficiency and effectiveness. To better understand this phenomenon, we provide a theoretical analysis why the sampling-based approach outperforms alternative methods based on k-nearest neighbor search. 1
[1] Aggarwal, C. C. Outlier Analysis. Springer, 2013.
[2] Bache, K. and Lichman, M. UCI machine learning repository, 2013.
[3] Bakar, Z. A., Mohemad, R., Ahmad, A., and Deris, M. M. A comparative study for outlier detection techniques in data mining. In Proceedings of IEEE International Conference on Cybernetics and Intelligent Systems, 1–6, 2006.
[4] Bay, S. D. and Schwabacher, M. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 29–38, 2003.
[5] Berchtold, S., Keim, D. A., and Kriegel, H.-P. The X-tree: An index structure for high-dimensional data. In Proceedings of the 22th International Conference on Very Large Data Bases, 28–39, 1996.
[6] Bhaduri, K., Matthews, B. L., and Giannella, C. R. Algorithms for speeding up distance-based outlier detection. In Proceedings of the 17th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 859–867, 2011.
[7] Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. LOF: Identifying density-based local outliers. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 93–104, 2000.
[8] Caputo, B., Sim, K., Furesjo, F., and Smola, A. Appearance-based object recognition using SVMs: Which kernel should I use? In Proceedings of NIPS Workshop on Statistical Methods for Computational Experiments in Visual Processing and Computer Vision, 2002.
[9] de Vries, T., Chawla, S., and Houle, M. E. Density-preserving projections for large-scale local anomaly detection. Knowledge and Information Systems, 32(1):25–52, 2012.
[10] Hawkins, D. Identification of Outliers. Chapman and Hall, 1980.
[11] Knorr, E. M. and Ng, R. T. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24rd International Conference on Very Large Data Bases, 392–403, 1998.
[12] Knorr, E. M., Ng, R. T., and Tucakov, V. Distance-based outliers: algorithms and applications. The VLDB Journal, 8(3):237–253, 2000.
[13] Kriegel, H.-P., Kr¨ ger, P., and Zimak, A. Outlier detection techniques. Tutorial at 16th ACM SIGKDD o Conference on Knowledge Discovery and Data Mining, 2010.
[14] Kriegel, H.-P., Schubert, M., and Zimek, A. Angle-based outlier detection in high-dimensional data. In Proceeding of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 444–452, 2008.
[15] Liu, F. T., Ting, K. M., and Zhou, Z. H. Isolation-based anomaly detection. ACM Transactions on Knowledge Discovery from Data, 6(1):3:1–3:39, 2012.
[16] Orair, G. H., Teixeira, C. H. C., Wang, Y., Meira Jr., W., and Parthasarathy, S. Distance-based outlier detection: consolidation and renewed bearing. PVLDB, 3(2):1469–1480, 2010.
[17] Pham, N. and Pagh, R. A near-linear time approximation algorithm for angle-based outlier detection in high-dimensional data. In Proceedings of the 18th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 877–885, 2012.
[18] Ramaswamy, S., Rastogi, R., and Shim, K. Efficient algorithms for mining outliers from large data sets. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 427–438, 2000.
[19] Sch¨ lkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., and Williamson, R. C. Estimating the support o of a high-dimensional distribution. Neural computation, 13(7):1443–1471, 2001.
[20] Weber, R., Schek, H.-J., and Blott, S. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of the International Conference on Very Large Data Bases, 194–205, 1998.
[21] Wu, M. and Jermaine, C. Outlier detection by sampling with accuracy guarantees. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 767–772, 2006.
[22] Yamanishi, K., Takeuchi, J., Williams, G., and Milne, P. On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms. Data Mining and Knowledge Discovery, 8(3):275– 300, 2004.
[23] Yu, D., Sheikholeslami, G., and Zhang, A. FindOut: Finding outliers in very large datasets. Knowledge and Information Systems, 4(4):387–412, 2002.
[24] Zimek, A., Schubert, E., and Kriegel, H.-P. A survey on unsupervised outlier detection in highdimensional numerical data. Statistical Analysis and Data Mining, 5(5):363—387, 2012. 9