jmlr jmlr2010 jmlr2010-102 jmlr2010-102-reference knowledge-graph by maker-knowledge-mining

102 jmlr-2010-Semi-Supervised Novelty Detection

Source: pdf

Author: Gilles Blanchard, Gyemin Lee, Clayton Scott

Abstract: A common setting for novelty detection assumes that labeled examples from the nominal class are available, but that labeled examples of novelties are unavailable. The standard (inductive) approach is to declare novelties where the nominal density is low, which reduces the problem to density level set estimation. In this paper, we consider the setting where an unlabeled and possibly contaminated sample is also available at learning time. We argue that novelty detection in this semi-supervised setting is naturally solved by a general reduction to a binary classiﬁcation problem. In particular, a detector with a desired false positive rate can be achieved through a reduction to Neyman-Pearson classiﬁcation. Unlike the inductive approach, semi-supervised novelty detection (SSND) yields detectors that are optimal (e.g., statistically consistent) regardless of the distribution on novelties. Therefore, in novelty detection, unlabeled data have a substantial impact on the theoretical properties of the decision rule. We validate the practical utility of SSND with an extensive experimental study. We also show that SSND provides distribution-free, learning-theoretic solutions to two well known problems in hypothesis testing. First, our results provide a general solution to the general two-sample problem, that is, the problem of determining whether two random samples arise from the same distribution. Second, a specialization of SSND coincides with the standard p-value approach to multiple testing under the so-called random effects model. Unlike standard rejection regions based on thresholded p-values, the general SSND framework allows for adaptation to arbitrary alternative distributions in multiple dimensions. Keywords: semi-supervised learning, novelty detection, Neyman-Pearson classiﬁcation, learning reduction, two-sample problem, multiple testing

reference text

S. Ben-David, T. Lu, and D. P´ l. Does unlabeled data provably help? Worst-case analysis of the a sample complexity of semi-supervised learning. In R. Servedio and T. Zhang, editors, Proc. 21st Annual Conference on Learning Theory (COLT), pages 33–44, Helsinki, 2008. A. Beygelzimer, V. Dani, T. Hayes, J. Langford, and B. Zadrozny. Error-limiting reductions between classiﬁcation tasks. In L. De Raedt and S. Wrobel, editors, Proceedings of the 22nd International Machine Learning Conference (ICML). ACM Press, 2005. A. Cannon, J. Howse, D. Hush, and C. Scovel. Learning with the Neyman-Pearson and min-max criteria. Technical Report LA-UR 02-2951, Los Alamos National Laboratory, 2002. C. C. Chang and C. J. Lin. LIBSVM : A library for support vector machines. http://www.csie.ntu.edu.tw/ cjlin/libsvm, 2001. Z. Chi. On the performance of FDR control: Constraints and a partial solution. Ann. Stat., 35(4): 1409–1431, 2007. Z. Chi. False discovery rate control with multivariate p-values. Electronic Journal of Statistics, 2: 368–411, 2008. J. Demˇar. Statistical comparisons of classiﬁers over multiple data sets. J. Machine Learning s Research, 7:1–30, 2006. F. Denis. PAC learning from positive statistical queries. In Proc. 9th Int. Conf. on Algorithmic Learning Theory (ALT), pages 112–126, Otzenhausen, Germany, 1998. F. Denis, R. Gilleron, and F. Letouzey. Learning from positive and unlabeled examples. Theoretical Computer Science, 348(1):70–83, 2005. L. Devroye. Any discrimination rule can have an arbitrarily bad probability of error for ﬁnite sample size. IEEE Trans. Patt. Anal. Mach. Intell., 4:154–157, 1982. B. Efron, R. Tibshirani, J.D. Storey, and V. Tusher. Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association, 96:1151–1160, 2001. R. El-Yaniv and M. Nisenson. Optimal single-class classiﬁcation strategies. In B. Sch¨ lkopf, J. Platt, o and T. Hoffman, editors, Adv. in Neural Inform. Proc. Systems 19. MIT Press, Cambridge, MA, 2007. C. Elkan and K. Noto. Learning classiﬁers from only positive and unlabeled data. In Proceeding of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD08), pages 213–220, 2008. 3007 B LANCHARD , L EE AND S COTT T. Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27:861–874, 2006. C. Genovese and L. Wasserman. A stochastic process approach to false discovery control. Annals of Statistics, 32(3):1035–1061, 2004. D. Ghosh. Assessing signiﬁcance of peptide spectrum matches in proteomics: A multiple testing approach. Statistics in Biosciences, 1:199–213, 2009. D. Ghosh and A. M. Chinnaiyan. Genomic outlier proﬁle analysis: Mixture models, null hypotheses, and nonparametric estimation. Biostatistics, 10:60–69, 2009. A. Gretton, K. M. Borgwardt, M. Rasch, B. Sch¨ lkopf, and A. J. Smola. A kernel method for the o two-sample problem. In B. Sch¨ lkopf, J. Platt, and T. Hoffman, editors, Advances in Neural o Information Processing Systems 19, pages 513–520. MIT Press, Cambridge, MA, 2007. A. Hero. Geometric entropy minimization for anomaly detection and localization. In B. Sch¨ lkopf, o J. Platt, and T. Hoffman, editors, Adv. in Neural Inform. Proc. Systems 19. MIT Press, Cambridge, MA, 2007. J. Lafferty and L. Wasserman. Statistical analysis of semi-supervised regression. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 801–808. MIT Press, Cambridge, MA, 2008. W. S. Lee and B. Liu. Learning with positive and unlabeled examples using weighted logistic regression. In Proc. 20th Int. Conf. on Machine Learning (ICML), pages 448–455, Washington, DC, 2003. E. Lehmann. Testing Statistical Hypotheses. Wiley, New York, 1986. B. Liu, W. S. Lee, P. S. Yu, and X. Li. Partially supervised classiﬁcation of text documents. In Proc. 19th Int. Conf. Machine Learning (ICML), pages 387–394, Sydney, Australia, 2002. B. Liu, Y. Dai, X. Li, W. S. Lee, and P. S. Yu. Building text classiﬁers using positive and unlabeled examples. In Proc. 3rd IEEE Int. Conf. on Data Mining (ICDM), pages 179–188, Melbourne, FL, 2003. K.-R. M¨ ller, S. Mika, G. R¨ tsch, K. Tsuda, and B. Sch¨ lkopf. An introduction to kernel-based u a o learning algorithms. IEEE Transactions on Neural Networks, 12:181–201, 2001. P. Rigollet. Generalization error bounds in semi-supervised classiﬁcation under the cluster assumption. J. Machine Learning Research, 8:1369–1392, 2007. B. Sch¨ lkopf, J. Platt, J. Shawe-Taylor, A. Smola, and R. Williamson. Estimating the support of a o high-dimensional distribution. Neural Computation, 13(7):1443–1472, 2001. C. Scott and G. Blanchard. Novelty detection: Unlabeled data deﬁnitely help. In D. van Dyk and M. Welling, editors, Proceedings of the Twelfth International Conference on Artiﬁcial Intelligence and Statistics (AISTATS) 2009, pages 464–471, Clearwater Beach, Florida, 2009. JMLR: W&CP; 5. 3008 S EMI -S UPERVISED N OVELTY D ETECTION C. Scott and R. Nowak. A Neyman-Pearson approach to statistical learning. IEEE Trans. Inform. Theory, 51(11):3806–3819, 2005. C. Scott and R. Nowak. Learning minimum volume sets. J. Machine Learning Res., 7:665–704, 2006. A. Singh, R. Nowak, and X. Zhu. Unlabeled data: Now it helps, now it doesn’t. Proc. Neural Information Processing Systems 21 – NIPS ’08, 2009. A. Smola, L. Song, and C. H. Teo. Relative novelty detection. In D. van Dyk and M. Welling, editors, Proceedings of the Twelfth International Conference on Artiﬁcial Intelligence and Statistics (AISTATS) 2009, pages 536–543, Clearwater Beach, Florida, 2009. JMLR: W&CP; 5. D. Steinberg and N. S. Cardell. Estimating logistic regression models when the dependent variable has no variance. Communications in Statistics - Theory and Methods, 21:423–450, 1992. I. Steinwart, D. Hush, and C. Scovel. A classiﬁcation framework for anomaly detection. J. Machine Learning Research, 6:211–232, 2005. J. D. Storey. The positive false discovery rate: A Bayesian interpretation and the q-value. Annals of Statistics, 31:6:2013–2035, 2003. W. Sun and T. Cai. Oracle and adaptive compound decision rules for false discovery rate control. J. Amer. Statist. Assoc., 102(479):901–912, 2007. V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998. R. Vert and J.-P. Vert. Consistency and convergence rates of one-class SVM and related algorithms. J. Machine Learning Research, pages 817–854, 2006. G. Ward, T. Hastie, S. Barry, J. Elith, and J. R. Leathwick. Presence-only data and the EM algorithm. Biometrics, 65:554–564, 2009. B. Zhang and W. Zuo. Learning from positive and unlabeled examples: A survey. In Proceeding of the IEEE International Symposium on Information Processing (ISIP), pages 650–654, 2008. D. Zhang and W. S. Lee. A simple probabilistic approach to learning from positive and unlabeled examples. In Proc. 5th Annual UK Workshop on Comp. Intell. (UKCI), London, UK, 2005. 3009