jmlr jmlr2010 jmlr2010-102 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Gilles Blanchard, Gyemin Lee, Clayton Scott
Abstract: A common setting for novelty detection assumes that labeled examples from the nominal class are available, but that labeled examples of novelties are unavailable. The standard (inductive) approach is to declare novelties where the nominal density is low, which reduces the problem to density level set estimation. In this paper, we consider the setting where an unlabeled and possibly contaminated sample is also available at learning time. We argue that novelty detection in this semi-supervised setting is naturally solved by a general reduction to a binary classification problem. In particular, a detector with a desired false positive rate can be achieved through a reduction to Neyman-Pearson classification. Unlike the inductive approach, semi-supervised novelty detection (SSND) yields detectors that are optimal (e.g., statistically consistent) regardless of the distribution on novelties. Therefore, in novelty detection, unlabeled data have a substantial impact on the theoretical properties of the decision rule. We validate the practical utility of SSND with an extensive experimental study. We also show that SSND provides distribution-free, learning-theoretic solutions to two well known problems in hypothesis testing. First, our results provide a general solution to the general two-sample problem, that is, the problem of determining whether two random samples arise from the same distribution. Second, a specialization of SSND coincides with the standard p-value approach to multiple testing under the so-called random effects model. Unlike standard rejection regions based on thresholded p-values, the general SSND framework allows for adaptation to arbitrary alternative distributions in multiple dimensions. Keywords: semi-supervised learning, novelty detection, Neyman-Pearson classification, learning reduction, two-sample problem, multiple testing
Reference: text
sentIndex sentText sentNum sentScore
1 The standard (inductive) approach is to declare novelties where the nominal density is low, which reduces the problem to density level set estimation. [sent-7, score-0.508]
2 In this paper, we consider the setting where an unlabeled and possibly contaminated sample is also available at learning time. [sent-8, score-0.267]
3 We argue that novelty detection in this semi-supervised setting is naturally solved by a general reduction to a binary classification problem. [sent-9, score-0.451]
4 Unlike the inductive approach, semi-supervised novelty detection (SSND) yields detectors that are optimal (e. [sent-11, score-0.592]
5 Therefore, in novelty detection, unlabeled data have a substantial impact on the theoretical properties of the decision rule. [sent-14, score-0.515]
6 Keywords: semi-supervised learning, novelty detection, Neyman-Pearson classification, learning reduction, two-sample problem, multiple testing 1. [sent-20, score-0.414]
7 Introduction Several recent works in the machine learning literature have addressed the issue of novelty detection. [sent-21, score-0.351]
8 The basic task is to build a decision rule that distinguishes nominal from novel patterns. [sent-22, score-0.232]
9 , xm ∈ X of nominal patterns, obtained, for example, from a ∗. [sent-26, score-0.282]
10 The standard approach has been to estimate a level set of the nominal density (Sch¨ lkopf et al. [sent-31, score-0.261]
11 We refer to this approach as inductive novelty detection. [sent-34, score-0.502]
12 In this paper we incorporate unlabeled data into novelty detection, and argue that this framework offers substantial advantages over the inductive approach. [sent-35, score-0.666]
13 In particular, we assume that in addition to the nominal data, we also have access to an unlabeled sample xm+1 , . [sent-36, score-0.396]
14 , xm+n consisting potentially of both nominal and novel data. [sent-39, score-0.232]
15 , m + n is paired with an unobserved label yi ∈ {0, 1} indicating its status as nominal (yi = 0) or novel (yi = 1), and that (xm+1 , ym+1 ), . [sent-43, score-0.232]
16 Our emphasis here is on semi-supervised novelty detection (SSND), where the goal is to construct a general detector that could classify an arbitrary test point. [sent-54, score-0.486]
17 In particular, we argue that SSND can be addressed by applying a NP classification algorithm, treating the nominal and unlabeled samples as the two classes. [sent-61, score-0.396]
18 Even though a sample from P1 is not available, we argue that our approach can effectively adapt to any novelty distribution P1 , in contrast to the inductive approach which is only optimal in certain extremely unlikely scenarios. [sent-62, score-0.502]
19 , 2002; Scott and Nowak, 2005) and thereby deduce generalization error bounds, consistency, and rates of convergence for novelty detection. [sent-65, score-0.384]
20 SSND is particularly suited to situations where the novelties occupy regions where the nominal density is high. [sent-67, score-0.479]
21 If a single novelty lies in a region of high nominal density, it will appear nominal. [sent-68, score-0.583]
22 However, if many such novelties are present, the unlabeled data will be more concentrated than one would expect from just the nominal component, and the presence of novelties can be detected. [sent-69, score-0.832]
23 We emphasize that we do not assume 2974 S EMI -S UPERVISED N OVELTY D ETECTION that novelties are rare, that is, that π is very small, as in anomaly detection. [sent-71, score-0.246]
24 We present a hybrid approach that automatically reverts to the inductive approach when π = 0, while preserving the benefits of the NP reduction when π > 0. [sent-74, score-0.272]
25 Related Work Inductive novelty detection: Described in the introduction, this problem is also known as one-class classification (Sch¨ lkopf et al. [sent-85, score-0.351]
26 o The standard approach has been to assume that novelties are outliers with respect to the nominal distribution, and to build a novelty detector by estimating a level set of the nominal density (Scott and Nowak, 2006; Vert and Vert, 2006; El-Yaniv and Nisenson, 2007; Hero, 2007). [sent-87, score-1.132]
27 As we discuss below, density level set estimation is equivalent to assuming that novelties are uniformly distributed on the support of P0 . [sent-88, score-0.247]
28 (2005), inductive novelty detection is reduced to classification of P0 against P1 , wherein P1 can be arbitrary. [sent-91, score-0.567]
29 sample from P1 is assumed to be available in addition to the nominal data. [sent-95, score-0.232]
30 In contrast, the semi-supervised approach optimally adapts to P1 , where only an unlabeled contaminated sample is available besides the nominal data. [sent-96, score-0.499]
31 In contrast, we argue that for novelty detection, unlabeled data are essential for these desirable theoretical properties to hold. [sent-103, score-0.515]
32 Learning from positive and unlabeled examples: Classification of an unlabeled sample given data from one class has been addressed previously, but with certain key differences from our work. [sent-104, score-0.328]
33 This body of work is often termed learning from “positive” and unlabeled examples (LPUE), al2975 B LANCHARD , L EE AND S COTT though in our context we tend to think of nominal examples as negative. [sent-105, score-0.396]
34 While some ideas are common with the present work (such as classifying the nominal sample against the contaminated sample as a proxy for the ultimate goal), our point of view is relatively different and based on statistical learning theory. [sent-114, score-0.335]
35 In practice, the authors take this sample to be a contaminated sample consisting of both nominal and novel measurements, so the setting is the same as ours. [sent-136, score-0.335]
36 The former are the tests we would like to have, and the latter are tests we can estimate by treating the nominal and unlabeled samples as labeled training data for a binary classification problem. [sent-149, score-0.396]
37 In other words, we can discriminate P0 from P1 by discriminating between the nominal and unlabeled distributions. [sent-159, score-0.396]
38 In particular, estimating the nominal level set {x : h0 (x) ≥ λ} is equivalent to thresholding 1/h0 (x) at 1/λ. [sent-162, score-0.232]
39 This assumption that P1 is uniform on the support of P0 is therefore implicitly adopted by a majority of works on novelty detection. [sent-164, score-0.38]
40 This can be paralleled with the result that inductive novelty detection can be reduced to classification against uniform data (Steinwart et al. [sent-249, score-0.596]
41 The inductive approach to novelty detection performs density level set estimation. [sent-277, score-0.596]
42 Furthermore, as we saw in Section 3, density level sets are optimal decision regions for testing the nominal distribution against a uniform distribution. [sent-278, score-0.33]
43 Therefore, level set estimation can be achieved by generating an artificial uniform sample and performing weighted binary classification against the nominal data (this idea has been developed in more detail by Steinwart et al. [sent-279, score-0.261]
44 Our approach is to sprinkle a vanishingly small proportion of uniformly distributed data among the unlabeled data, and then implement SSND using NP classification on this modified data. [sent-281, score-0.236]
45 The proportion of operative novelties in the modified unlabeled sample is ˜ ˜ π := π(1− pn )+ pn . [sent-299, score-0.601]
46 The distribution of operative novelties is P1 := π(1−pn ) P1 + p˜n P2 , and the overall ˜ π π ˜ ˜ ˜ 1,α ˜ ˜˜ ˜ distribution of the modified unlabeled data is PX := πP1 + (1 − π)P0 . [sent-300, score-0.403]
47 Theoretically, it could be combined with specific, analyzable algorithms for Neyman-Pearson classification to yield novelty detectors with performance guarantees, as was illustrated in Section 4. [sent-319, score-0.376]
48 Practically, any algorithm for Neyman-Pearson classification that generally works well in practice can be applied in the hybrid framework to produce novelty detectors that perform well for values of π that are zero or very near zero. [sent-321, score-0.462]
49 We also remark that this hybrid procedure could be applied with any prior distribution on novelties besides uniform. [sent-323, score-0.304]
50 In addition, the hybrid approach could also be practically useful when n is small, assuming the artificial points are appended to the unlabeled sample. [sent-324, score-0.273]
51 Estimating π and Testing for π = 0 In the previous sections, our main goal was to find a good classifier function for the purpose of novelty detection. [sent-326, score-0.351]
52 Besides the detector itself, it is often relevant to the user to have an estimate or bound on the proportion π of novelties in the contaminated distribution PX . [sent-327, score-0.491]
53 To see this, consider the idealized case where we have an infinite amount of nominal and contaminated data, so that we have perfect knowledge of P0 and PX . [sent-334, score-0.335]
54 For the estimation of the proportion of novelties, however, it makes sense to define π as the minimal proportion of novelties that can explain the difference between P0 and PX . [sent-338, score-0.362]
55 We call P1 a proper novelty distribution with respect to P0 if there exists no decomposition of the form P1 = (1 − γ)Q + γP0 where Q is some probability distribution and 0 < γ ≤ 1 . [sent-340, score-0.384]
56 This defines a proper novelty distribution P1 as one that cannot be confounded with P0 ; it cannot be represented as a (nontrivial) mixture of P0 with another distribution. [sent-341, score-0.384]
57 The next result establishes a canonical decomposition of the contaminated distribution into a mixture of nominal data and proper novelties. [sent-342, score-0.368]
58 If PX = P0 , there is a unique π∗ ∈ (0, 1] and P1 such that the decomposition PX = (1 − π∗ )P0 + π∗ P1 holds, and such that P1 is a proper novelty distribution with respect to P0 . [sent-345, score-0.384]
59 For the rest of this section we assume for simplicity of notation that π and P1 are the proportion and distribution of proper novelties of PX with respect to P0 . [sent-348, score-0.323]
60 The results to come are also informative for improper novelty distributions, in the following sense: if P1 is not a proper novelty distribution and the decomposition PX = (1 − π)P0 + πP1 holds, then (7) entails that π > π∗ . [sent-349, score-0.735]
61 It follows that a lower bound on π∗ (either deterministic or valid with a certain confidence), as will be derived in the coming sections, is always also a valid lower confidence bound on π when non-proper novelties are considered. [sent-350, score-0.274]
62 We first treat the population case and optimal novelty detection over the set of all possible classifiers. [sent-354, score-0.416]
63 We define a distribution-free upper confidence bound π+ (δ) to be a function of the observed data such that, for any P0 , any proper novelty distribution P1 , and any novelty proportion π ≤ 1, we have π+ (δ) ≥ π with probability 1 − δ over the draw of the two samples. [sent-395, score-0.835]
64 The reason is that the novel distribution can be arbitrarily hard to distinguish from the nominal distribution. [sent-397, score-0.232]
65 It is possible to detect with some certainty that there is a non-zero proportion of novelties in the contaminated data (see Corollary 11 below), but we can never be sure that there are no novelties. [sent-398, score-0.393]
66 We will say that the nominal distribution P0 is weakly diffuse if for any γ > 0 there exists a set A such that 0 < P0 (A) < γ . [sent-400, score-0.261]
67 4 Testing for π = 0 The lower confidence bound on π can also be used as a test for π = 0, that is, a test for whether there are any novelties in the test data: Corollary 11 Let F be a set of classifiers. [sent-415, score-0.246]
68 The nominal distribution P0 is known to be exactly uniform on [0, 1] (equivalently, the nominal distribution is uniform and the nominal sample has infinite size); 3. [sent-458, score-0.754]
69 The class of novelty detectors considered is the set of intervals of the form [0,t],t ∈ [0, 1] . [sent-459, score-0.376]
70 Experiments Despite previous work on learning with positive and unlabeled examples (LPUE), as discussed in Section 2, the efficacy of our proposed learning reduction, compared to the method of inductive novelty detection, has not been empirically demonstrated. [sent-505, score-0.666]
71 To assess the impact of unlabeled data on novelty detection, we applied our framework to some data sets which are common benchmarks for binary classification. [sent-507, score-0.515]
72 Thus, the average (across permutations) nominal sample size m is (1 − πbase )Ntrain . [sent-534, score-0.232]
73 The negative examples from the training set were taken to form the nominal sample, and the positive training examples were not used at all in the experiments. [sent-543, score-0.232]
74 Thus, the average (across permutations) nominal sample size m is (1 − πbase )Ntrain . [sent-546, score-0.232]
75 The KDE plug-in rule can be implemented efficiently, and there is a natural inductive counterpart for comparison, the thresholded KDE based on the nominal sample. [sent-554, score-0.383]
76 The learning methods are the inductive approach, our proposed learning reduction, and three versions of the hybrid approach. [sent-564, score-0.237]
77 1, in which a uniform sample of size 100pn % of the unlabeled sample size is appended to the unlabeled data. [sent-568, score-0.38]
78 We implemented the inductive novelty detector using a thresholded kernel density estimate (KDE) with Gaussian kernel, and SSND using a plug-in KDE classifier. [sent-570, score-0.601]
79 Letting kσ denote a Gaussian kernel with bandwidth σ, the inductive novelty detector at density level λ is 1 1 if m ∑m kσ0 (x, xi ) > λ i=1 f (x) = 0 otherwise, and the SSND classifier at density ratio λ is f (x) = 1 1 if ( 1 ∑m+n kσX (x, xi ))/( m ∑m kσ0 (x, xi )) > λ i=1 n i=m+1 0 otherwise. [sent-573, score-0.63]
80 As π increases, the best performing hybrid has a correspondingly smaller amount of auxiliary uniform data appended to the unlabeled sample. [sent-631, score-0.302]
81 564 Table 2: AUC values for five novelty detection algorithms in the semi-supervised setting. [sent-1013, score-0.416]
82 604 Table 3: AUC values for five novelty detection algorithms in the transductive setting. [sent-1300, score-0.472]
83 This makes sense, because it is essentially trying to classify between two realizations of the nominal distribution. [sent-1372, score-0.255]
84 This trend suggests that as dimension increases, the assumption implicit in the inductive approach (that novelties are uniform where they overlap the support of the nominal distribution) breaks down. [sent-1383, score-0.63]
85 The top graph shows ROCs for a two-dimensional data set where the two classes are fairly well separated, meaning the novelties lie in the tails of the nominal class, and π = 0. [sent-1385, score-0.45]
86 Conclusions We have shown that semi-supervised novelty detection reduces to Neyman-Pearson classification. [sent-1446, score-0.416]
87 Our approach optimally adapts to the unknown novelty distribution, unlike inductive approaches, which operate as if novelties are uniformly distributed. [sent-1449, score-0.72]
88 We also introduced a hybrid method that has the properties of SSND when π > 0, and effectively reverts to the inductive method when π = 0. [sent-1450, score-0.237]
89 Our analysis strongly suggests that in novelty detection, unlike traditional binary classification, unlabeled data are essential for attaining optimal performance in terms of tight bounds, consistency, and rates of convergence. [sent-1451, score-0.515]
90 Furthermore, estimating the novelty proportion π can become arbitrarily difficult as the nominal and novel distributions become increasingly similar. [sent-1455, score-0.655]
91 Hence the above implies that γ = 0 for any such decomposition, that is, P1 must be a proper novelty distribution wrt. [sent-1509, score-0.384]
92 Hence for α > π∗ the corresponding Q is not a proper novelty distribution, and the only valid decomposition of PX into P0 and a proper novelty distribution is the one established previously. [sent-1516, score-0.768]
93 3 Lemma Used in Proof of Theorem 6 For the proof of Theorem 6 we made use of the following auxiliary result: Lemma 13 Assume P1 is a proper novelty distribution wrt. [sent-1518, score-0.384]
94 Hence with probability 1 − cδ = 1 − c(mn)−2 , we have ∗ ∗ R0 ( fk ) ≤ R0 ( fk ) + εm ≤ R0 ( f ∗ ) + 2γ + εm , and also ∗ ∗ RX ( fk ) ≤ RX ( fk ) + εn ≤ RX ( f ∗ ) + 2γ + εn . [sent-1536, score-0.316]
95 ⊗m ⊗n Let P0 , P1 , δ, π be given by the non-triviality assumption and P := P0 ⊗PX denote correspondingly the joint distribution of nominal and contaminated data. [sent-1547, score-0.335]
96 Since it has its support strictly included in the support of P0 , it is a proper novelty distribution with respect to P0 . [sent-1551, score-0.384]
97 Therefore, since P1 is also a proper novelty distribution with respect to P0 , 1 so is P1 := (1 − π) P0x∈A P0 + πP1 . [sent-1552, score-0.384]
98 (A) Now consider the novelty detection problem with nominal distribution P0 , novelty distribution P1 , and novelty proportion π = 1, so that PX = (1 − π)P0 + πP1 = P1 . [sent-1553, score-1.422]
99 Finally, define the modified ⊗m ⊗n joint distribution on nominal and contaminated data P = P0 ⊗ PX . [sent-1554, score-0.335]
100 Let us consider any generating distribution P ∈ P (m, n) for which π = 0, and the nominal distribution P0 is weakly diffuse (it is possible to find such a P0 since X is infinite). [sent-1566, score-0.261]
wordName wordTfidf (topN-words)
[('ssnd', 0.395), ('px', 0.395), ('novelty', 0.351), ('rx', 0.266), ('nominal', 0.232), ('novelties', 0.218), ('unlabeled', 0.164), ('inductive', 0.151), ('cott', 0.148), ('lanchard', 0.148), ('ovelty', 0.148), ('upervised', 0.148), ('etection', 0.127), ('emi', 0.114), ('contaminated', 0.103), ('hybrid', 0.086), ('pxy', 0.082), ('fk', 0.079), ('scott', 0.076), ('fdp', 0.074), ('ee', 0.074), ('proportion', 0.072), ('detector', 0.07), ('classi', 0.068), ('false', 0.067), ('hi', 0.066), ('detection', 0.065), ('lx', 0.064), ('pn', 0.063), ('lpue', 0.058), ('mfdr', 0.058), ('tnd', 0.058), ('transductive', 0.056), ('np', 0.055), ('auc', 0.053), ('splice', 0.051), ('null', 0.051), ('xm', 0.05), ('roc', 0.05), ('dence', 0.049), ('er', 0.046), ('kde', 0.041), ('mushrooms', 0.041), ('ntrain', 0.041), ('waveform', 0.041), ('gn', 0.041), ('banana', 0.041), ('fdr', 0.041), ('testing', 0.04), ('inf', 0.037), ('ers', 0.037), ('precision', 0.037), ('hx', 0.035), ('reduction', 0.035), ('nowak', 0.034), ('fn', 0.033), ('proper', 0.033), ('nemenyi', 0.033), ('rocs', 0.033), ('deduce', 0.033), ('steinwart', 0.033), ('hypothesis', 0.032), ('density', 0.029), ('uniform', 0.029), ('ionosphere', 0.029), ('diffuse', 0.029), ('py', 0.029), ('sonar', 0.029), ('twonorm', 0.029), ('bound', 0.028), ('anomaly', 0.028), ('fpr', 0.028), ('adult', 0.028), ('rejection', 0.028), ('thyroid', 0.027), ('titanic', 0.027), ('vert', 0.027), ('ringnorm', 0.027), ('ranks', 0.027), ('theorem', 0.027), ('randomized', 0.026), ('blanchard', 0.026), ('detectors', 0.025), ('cannon', 0.025), ('gyemin', 0.025), ('ntest', 0.025), ('diabetes', 0.024), ('multiple', 0.023), ('appended', 0.023), ('paradigms', 0.023), ('realizations', 0.023), ('universal', 0.022), ('web', 0.022), ('constraint', 0.022), ('liu', 0.021), ('measurable', 0.021), ('mum', 0.021), ('tables', 0.021), ('fnr', 0.021), ('operative', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 102 jmlr-2010-Semi-Supervised Novelty Detection
Author: Gilles Blanchard, Gyemin Lee, Clayton Scott
Abstract: A common setting for novelty detection assumes that labeled examples from the nominal class are available, but that labeled examples of novelties are unavailable. The standard (inductive) approach is to declare novelties where the nominal density is low, which reduces the problem to density level set estimation. In this paper, we consider the setting where an unlabeled and possibly contaminated sample is also available at learning time. We argue that novelty detection in this semi-supervised setting is naturally solved by a general reduction to a binary classification problem. In particular, a detector with a desired false positive rate can be achieved through a reduction to Neyman-Pearson classification. Unlike the inductive approach, semi-supervised novelty detection (SSND) yields detectors that are optimal (e.g., statistically consistent) regardless of the distribution on novelties. Therefore, in novelty detection, unlabeled data have a substantial impact on the theoretical properties of the decision rule. We validate the practical utility of SSND with an extensive experimental study. We also show that SSND provides distribution-free, learning-theoretic solutions to two well known problems in hypothesis testing. First, our results provide a general solution to the general two-sample problem, that is, the problem of determining whether two random samples arise from the same distribution. Second, a specialization of SSND coincides with the standard p-value approach to multiple testing under the so-called random effects model. Unlike standard rejection regions based on thresholded p-values, the general SSND framework allows for adaptation to arbitrary alternative distributions in multiple dimensions. Keywords: semi-supervised learning, novelty detection, Neyman-Pearson classification, learning reduction, two-sample problem, multiple testing
2 0.12688248 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
Author: Shiliang Sun, John Shawe-Taylor
Abstract: In this paper, we propose a general framework for sparse semi-supervised learning, which concerns using a small portion of unlabeled data and a few labeled data to represent target functions and thus has the merit of accelerating function evaluations when predicting the output of a new example. This framework makes use of Fenchel-Legendre conjugates to rewrite a convex insensitive loss involving a regularization with unlabeled data, and is applicable to a family of semi-supervised learning methods such as multi-view co-regularized least squares and single-view Laplacian support vector machines (SVMs). As an instantiation of this framework, we propose sparse multi-view SVMs which use a squared ε-insensitive loss. The resultant optimization is an inf-sup problem and the optimal solutions have arguably saddle-point properties. We present a globally optimal iterative algorithm to optimize the problem. We give the margin bound on the generalization error of the sparse multi-view SVMs, and derive the empirical Rademacher complexity for the induced function class. Experiments on artificial and real-world data show their effectiveness. We further give a sequential training approach to show their possibility and potential for uses in large-scale problems and provide encouraging experimental results indicating the efficacy of the margin bound and empirical Rademacher complexity on characterizing the roles of unlabeled data for semi-supervised learning. Keywords: semi-supervised learning, Fenchel-Legendre conjugate, representer theorem, multiview regularization, support vector machine, statistical learning theory
3 0.11467189 90 jmlr-2010-Permutation Tests for Studying Classifier Performance
Author: Markus Ojala, Gemma C. Garriga
Abstract: We explore the framework of permutation-based p-values for assessing the performance of classifiers. In this paper we study two simple permutation tests. The first test assess whether the classifier has found a real class structure in the data; the corresponding null distribution is estimated by permuting the labels in the data. This test has been used extensively in classification problems in computational biology. The second test studies whether the classifier is exploiting the dependency between the features in classification; the corresponding null distribution is estimated by permuting the features within classes, inspired by restricted randomization techniques traditionally used in statistics. This new test can serve to identify descriptive features which can be valuable information in improving the classifier performance. We study the properties of these tests and present an extensive empirical evaluation on real and synthetic data. Our analysis shows that studying the classifier performance via permutation tests is effective. In particular, the restricted permutation test clearly reveals whether the classifier exploits the interdependency between the features in the data. Keywords: classification, labeled data, permutation tests, restricted randomization, significance testing
4 0.086619705 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data
Author: Gideon S. Mann, Andrew McCallum
Abstract: In this paper, we present an overview of generalized expectation criteria (GE), a simple, robust, scalable method for semi-supervised training using weakly-labeled data. GE fits model parameters by favoring models that match certain expectation constraints, such as marginal label distributions, on the unlabeled data. This paper shows how to apply generalized expectation criteria to two classes of parametric models: maximum entropy models and conditional random fields. Experimental results demonstrate accuracy improvements over supervised training and a number of other stateof-the-art semi-supervised learning methods for these models. Keywords: generalized expectation criteria, semi-supervised learning, logistic regression, conditional random fields
5 0.078501679 40 jmlr-2010-Fast and Scalable Local Kernel Machines
Author: Nicola Segata, Enrico Blanzieri
Abstract: A computationally efficient approach to local learning with kernel methods is presented. The Fast Local Kernel Support Vector Machine (FaLK-SVM) trains a set of local SVMs on redundant neighbourhoods in the training set and an appropriate model for each query point is selected at testing time according to a proximity strategy. Supported by a recent result by Zakai and Ritov (2009) relating consistency and localizability, our approach achieves high classification accuracies by dividing the separation function in local optimisation problems that can be handled very efficiently from the computational viewpoint. The introduction of a fast local model selection further speeds-up the learning process. Learning and complexity bounds are derived for FaLK-SVM, and the empirical evaluation of the approach (with data sets up to 3 million points) showed that it is much faster and more accurate and scalable than state-of-the-art accurate and approximated SVM solvers at least for non high-dimensional data sets. More generally, we show that locality can be an important factor to sensibly speed-up learning approaches and kernel methods, differently from other recent techniques that tend to dismiss local information in order to improve scalability. Keywords: locality, kernel methods, local learning algorithms, support vector machines, instancebased learning
6 0.070969254 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
7 0.058088318 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
8 0.050739523 83 jmlr-2010-On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation
9 0.050570954 22 jmlr-2010-Classification Using Geometric Level Sets
10 0.047307134 27 jmlr-2010-Consistent Nonparametric Tests of Independence
12 0.044318546 51 jmlr-2010-Importance Sampling for Continuous Time Bayesian Networks
13 0.043262817 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models
14 0.042217754 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression
15 0.041818872 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
16 0.040939577 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization
17 0.038679991 60 jmlr-2010-Learnability, Stability and Uniform Convergence
18 0.03838335 15 jmlr-2010-Approximate Tree Kernels
19 0.037930809 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds
20 0.037157405 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
topicId topicWeight
[(0, -0.193), (1, 0.02), (2, -0.012), (3, 0.122), (4, -0.043), (5, 0.075), (6, -0.018), (7, 0.183), (8, 0.001), (9, -0.104), (10, 0.02), (11, -0.126), (12, 0.106), (13, 0.092), (14, 0.002), (15, 0.107), (16, -0.104), (17, 0.007), (18, -0.077), (19, 0.047), (20, -0.054), (21, 0.149), (22, -0.008), (23, 0.022), (24, -0.015), (25, 0.02), (26, 0.086), (27, -0.015), (28, -0.049), (29, -0.056), (30, -0.162), (31, 0.019), (32, -0.056), (33, -0.047), (34, 0.102), (35, 0.297), (36, 0.195), (37, -0.051), (38, 0.127), (39, 0.089), (40, -0.095), (41, -0.024), (42, -0.002), (43, -0.044), (44, -0.066), (45, -0.148), (46, -0.086), (47, 0.136), (48, -0.01), (49, 0.126)]
simIndex simValue paperId paperTitle
same-paper 1 0.90708351 102 jmlr-2010-Semi-Supervised Novelty Detection
Author: Gilles Blanchard, Gyemin Lee, Clayton Scott
Abstract: A common setting for novelty detection assumes that labeled examples from the nominal class are available, but that labeled examples of novelties are unavailable. The standard (inductive) approach is to declare novelties where the nominal density is low, which reduces the problem to density level set estimation. In this paper, we consider the setting where an unlabeled and possibly contaminated sample is also available at learning time. We argue that novelty detection in this semi-supervised setting is naturally solved by a general reduction to a binary classification problem. In particular, a detector with a desired false positive rate can be achieved through a reduction to Neyman-Pearson classification. Unlike the inductive approach, semi-supervised novelty detection (SSND) yields detectors that are optimal (e.g., statistically consistent) regardless of the distribution on novelties. Therefore, in novelty detection, unlabeled data have a substantial impact on the theoretical properties of the decision rule. We validate the practical utility of SSND with an extensive experimental study. We also show that SSND provides distribution-free, learning-theoretic solutions to two well known problems in hypothesis testing. First, our results provide a general solution to the general two-sample problem, that is, the problem of determining whether two random samples arise from the same distribution. Second, a specialization of SSND coincides with the standard p-value approach to multiple testing under the so-called random effects model. Unlike standard rejection regions based on thresholded p-values, the general SSND framework allows for adaptation to arbitrary alternative distributions in multiple dimensions. Keywords: semi-supervised learning, novelty detection, Neyman-Pearson classification, learning reduction, two-sample problem, multiple testing
2 0.57943326 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
Author: Shiliang Sun, John Shawe-Taylor
Abstract: In this paper, we propose a general framework for sparse semi-supervised learning, which concerns using a small portion of unlabeled data and a few labeled data to represent target functions and thus has the merit of accelerating function evaluations when predicting the output of a new example. This framework makes use of Fenchel-Legendre conjugates to rewrite a convex insensitive loss involving a regularization with unlabeled data, and is applicable to a family of semi-supervised learning methods such as multi-view co-regularized least squares and single-view Laplacian support vector machines (SVMs). As an instantiation of this framework, we propose sparse multi-view SVMs which use a squared ε-insensitive loss. The resultant optimization is an inf-sup problem and the optimal solutions have arguably saddle-point properties. We present a globally optimal iterative algorithm to optimize the problem. We give the margin bound on the generalization error of the sparse multi-view SVMs, and derive the empirical Rademacher complexity for the induced function class. Experiments on artificial and real-world data show their effectiveness. We further give a sequential training approach to show their possibility and potential for uses in large-scale problems and provide encouraging experimental results indicating the efficacy of the margin bound and empirical Rademacher complexity on characterizing the roles of unlabeled data for semi-supervised learning. Keywords: semi-supervised learning, Fenchel-Legendre conjugate, representer theorem, multiview regularization, support vector machine, statistical learning theory
3 0.55092335 90 jmlr-2010-Permutation Tests for Studying Classifier Performance
Author: Markus Ojala, Gemma C. Garriga
Abstract: We explore the framework of permutation-based p-values for assessing the performance of classifiers. In this paper we study two simple permutation tests. The first test assess whether the classifier has found a real class structure in the data; the corresponding null distribution is estimated by permuting the labels in the data. This test has been used extensively in classification problems in computational biology. The second test studies whether the classifier is exploiting the dependency between the features in classification; the corresponding null distribution is estimated by permuting the features within classes, inspired by restricted randomization techniques traditionally used in statistics. This new test can serve to identify descriptive features which can be valuable information in improving the classifier performance. We study the properties of these tests and present an extensive empirical evaluation on real and synthetic data. Our analysis shows that studying the classifier performance via permutation tests is effective. In particular, the restricted permutation test clearly reveals whether the classifier exploits the interdependency between the features in the data. Keywords: classification, labeled data, permutation tests, restricted randomization, significance testing
4 0.42347682 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data
Author: Gideon S. Mann, Andrew McCallum
Abstract: In this paper, we present an overview of generalized expectation criteria (GE), a simple, robust, scalable method for semi-supervised training using weakly-labeled data. GE fits model parameters by favoring models that match certain expectation constraints, such as marginal label distributions, on the unlabeled data. This paper shows how to apply generalized expectation criteria to two classes of parametric models: maximum entropy models and conditional random fields. Experimental results demonstrate accuracy improvements over supervised training and a number of other stateof-the-art semi-supervised learning methods for these models. Keywords: generalized expectation criteria, semi-supervised learning, logistic regression, conditional random fields
5 0.40189421 22 jmlr-2010-Classification Using Geometric Level Sets
Author: Kush R. Varshney, Alan S. Willsky
Abstract: A variational level set method is developed for the supervised classification problem. Nonlinear classifier decision boundaries are obtained by minimizing an energy functional that is composed of an empirical risk term with a margin-based loss and a geometric regularization term new to machine learning: the surface area of the decision boundary. This geometric level set classifier is analyzed in terms of consistency and complexity through the calculation of its ε-entropy. For multicategory classification, an efficient scheme is developed using a logarithmic number of decision functions in the number of classes rather than the typical linear number of decision functions. Geometric level set classification yields performance results on benchmark data sets that are competitive with well-established methods. Keywords: level set methods, nonlinear classification, geometric regularization, consistency, complexity
6 0.36659881 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
7 0.32592091 27 jmlr-2010-Consistent Nonparametric Tests of Independence
8 0.29969618 40 jmlr-2010-Fast and Scalable Local Kernel Machines
9 0.28937069 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression
10 0.27285519 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
11 0.27232382 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
12 0.26503578 71 jmlr-2010-Matched Gene Selection and Committee Classifier for Molecular Classification of Heterogeneous Diseases
13 0.26489791 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
14 0.26173535 83 jmlr-2010-On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation
16 0.21564752 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
17 0.21336491 23 jmlr-2010-Classification with Incomplete Data Using Dirichlet Process Priors
18 0.20932947 54 jmlr-2010-Information Retrieval Perspective to Nonlinear Dimensionality Reduction for Data Visualization
19 0.20698071 60 jmlr-2010-Learnability, Stability and Uniform Convergence
20 0.19964348 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization
topicId topicWeight
[(1, 0.019), (3, 0.041), (4, 0.019), (8, 0.017), (15, 0.012), (21, 0.021), (24, 0.011), (32, 0.07), (33, 0.015), (36, 0.033), (37, 0.064), (72, 0.357), (75, 0.126), (81, 0.015), (85, 0.096), (96, 0.011), (97, 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.70245093 102 jmlr-2010-Semi-Supervised Novelty Detection
Author: Gilles Blanchard, Gyemin Lee, Clayton Scott
Abstract: A common setting for novelty detection assumes that labeled examples from the nominal class are available, but that labeled examples of novelties are unavailable. The standard (inductive) approach is to declare novelties where the nominal density is low, which reduces the problem to density level set estimation. In this paper, we consider the setting where an unlabeled and possibly contaminated sample is also available at learning time. We argue that novelty detection in this semi-supervised setting is naturally solved by a general reduction to a binary classification problem. In particular, a detector with a desired false positive rate can be achieved through a reduction to Neyman-Pearson classification. Unlike the inductive approach, semi-supervised novelty detection (SSND) yields detectors that are optimal (e.g., statistically consistent) regardless of the distribution on novelties. Therefore, in novelty detection, unlabeled data have a substantial impact on the theoretical properties of the decision rule. We validate the practical utility of SSND with an extensive experimental study. We also show that SSND provides distribution-free, learning-theoretic solutions to two well known problems in hypothesis testing. First, our results provide a general solution to the general two-sample problem, that is, the problem of determining whether two random samples arise from the same distribution. Second, a specialization of SSND coincides with the standard p-value approach to multiple testing under the so-called random effects model. Unlike standard rejection regions based on thresholded p-values, the general SSND framework allows for adaptation to arbitrary alternative distributions in multiple dimensions. Keywords: semi-supervised learning, novelty detection, Neyman-Pearson classification, learning reduction, two-sample problem, multiple testing
2 0.53871834 51 jmlr-2010-Importance Sampling for Continuous Time Bayesian Networks
Author: Yu Fan, Jing Xu, Christian R. Shelton
Abstract: A continuous time Bayesian network (CTBN) uses a structured representation to describe a dynamic system with a finite number of states which evolves in continuous time. Exact inference in a CTBN is often intractable as the state space of the dynamic system grows exponentially with the number of variables. In this paper, we first present an approximate inference algorithm based on importance sampling. We then extend it to continuous-time particle filtering and smoothing algorithms. These three algorithms can estimate the expectation of any function of a trajectory, conditioned on any evidence set constraining the values of subsets of the variables over subsets of the time line. We present experimental results on both synthetic networks and a network learned from a real data set on people’s life history events. We show the accuracy as well as the time efficiency of our algorithms, and compare them to other approximate algorithms: expectation propagation and Gibbs sampling. Keywords: continuous time Bayesian networks, importance sampling, approximate inference, filtering, smoothing
3 0.45453411 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
Author: Pinar Donmez, Guy Lebanon, Krishnakumar Balasubramanian
Abstract: Estimating the error rates of classifiers or regression models is a fundamental task in machine learning which has thus far been studied exclusively using supervised learning techniques. We propose a novel unsupervised framework for estimating these error rates using only unlabeled data and mild assumptions. We prove consistency results for the framework and demonstrate its practical applicability on both synthetic and real world data. Keywords: classification and regression, maximum likelihood, latent variable models
4 0.45262492 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
Author: Shiliang Sun, John Shawe-Taylor
Abstract: In this paper, we propose a general framework for sparse semi-supervised learning, which concerns using a small portion of unlabeled data and a few labeled data to represent target functions and thus has the merit of accelerating function evaluations when predicting the output of a new example. This framework makes use of Fenchel-Legendre conjugates to rewrite a convex insensitive loss involving a regularization with unlabeled data, and is applicable to a family of semi-supervised learning methods such as multi-view co-regularized least squares and single-view Laplacian support vector machines (SVMs). As an instantiation of this framework, we propose sparse multi-view SVMs which use a squared ε-insensitive loss. The resultant optimization is an inf-sup problem and the optimal solutions have arguably saddle-point properties. We present a globally optimal iterative algorithm to optimize the problem. We give the margin bound on the generalization error of the sparse multi-view SVMs, and derive the empirical Rademacher complexity for the induced function class. Experiments on artificial and real-world data show their effectiveness. We further give a sequential training approach to show their possibility and potential for uses in large-scale problems and provide encouraging experimental results indicating the efficacy of the margin bound and empirical Rademacher complexity on characterizing the roles of unlabeled data for semi-supervised learning. Keywords: semi-supervised learning, Fenchel-Legendre conjugate, representer theorem, multiview regularization, support vector machine, statistical learning theory
5 0.45258626 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
Author: Pannagadatta K. Shivaswamy, Tony Jebara
Abstract: Leading classification methods such as support vector machines (SVMs) and their counterparts achieve strong generalization performance by maximizing the margin of separation between data classes. While the maximum margin approach has achieved promising performance, this article identifies its sensitivity to affine transformations of the data and to directions with large data spread. Maximum margin solutions may be misled by the spread of data and preferentially separate classes along large spread directions. This article corrects these weaknesses by measuring margin not in the absolute sense but rather only relative to the spread of data in any projection direction. Maximum relative margin corresponds to a data-dependent regularization on the classification function while maximum absolute margin corresponds to an ℓ2 norm constraint on the classification function. Interestingly, the proposed improvements only require simple extensions to existing maximum margin formulations and preserve the computational efficiency of SVMs. Through the maximization of relative margin, surprising performance gains are achieved on real-world problems such as digit, text classification and on several other benchmark data sets. In addition, risk bounds are derived for the new formulation based on Rademacher averages. Keywords: support vector machines, kernel methods, large margin, Rademacher complexity
6 0.45148122 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
7 0.45084411 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
8 0.44788915 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
9 0.4471181 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
10 0.44419479 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking
11 0.4436484 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems
12 0.44209751 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
13 0.44055754 9 jmlr-2010-An Efficient Explanation of Individual Classifications using Game Theory
15 0.44024915 109 jmlr-2010-Stochastic Composite Likelihood
16 0.44010827 22 jmlr-2010-Classification Using Geometric Level Sets
17 0.43991694 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
18 0.43494135 11 jmlr-2010-An Investigation of Missing Data Methods for Classification Trees Applied to Binary Response Data
20 0.43390635 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices