nips nips2004 nips2004-49 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ingo Steinwart, Don Hush, Clint Scovel
Abstract: We show that anomaly detection can be interpreted as a binary classification problem. Using this interpretation we propose a support vector machine (SVM) for anomaly detection. We then present some theoretical results which include consistency and learning rates. Finally, we experimentally compare our SVM with the standard one-class SVM. 1
Reference: text
sentIndex sentText sentNum sentScore
1 gov Abstract We show that anomaly detection can be interpreted as a binary classification problem. [sent-2, score-0.245]
2 Using this interpretation we propose a support vector machine (SVM) for anomaly detection. [sent-3, score-0.189]
3 1 Introduction One of the most common ways to define anomalies is by saying that anomalies are not concentrated (see e. [sent-6, score-0.242]
4 Furthermore, to describe the concentration of Q we need a known reference distribution µ on X. [sent-10, score-0.082]
5 Let us assume that Q has a density h with respect to µ. [sent-11, score-0.16]
6 Then, the sets {h > ρ}, ρ > 0, describe the concentration of Q. [sent-12, score-0.118]
7 Consequently, to define anomalies in terms of the concentration we only have to fix a threshold level ρ > 0, so that an x ∈ X is considered to be anomalous whenever x ∈ {h ≤ ρ}. [sent-13, score-0.285]
8 Therefore our goal is to find the density level set {h ≤ ρ}, or equivalently, the ρ-level set {h > ρ}. [sent-14, score-0.17]
9 Finding density level sets is an old problem in statistics which also has some interesting applications (see e. [sent-17, score-0.206]
10 it always finds the level set of interest asymptotically, and b) learns with fast rates under realistic assumptions on h and µ. [sent-23, score-0.103]
11 Then, a density level detection algorithm constructs a function fT : X → R such that the set {fT > 0} is an estimate of the ρ-level set {h > ρ} of interest. [sent-36, score-0.238]
12 [6, 7] and the references therein) for measurable functions f : X → R is Sµ,h,ρ (f ) := µ {f > 0} {h > ρ} , where denotes the symmetric difference. [sent-40, score-0.176]
13 Obviously, the smaller Sµ,h,ρ (f ) is, the more {f > 0} coincides with the ρ-level set of h, and a function f minimizes Sµ,h,ρ if and only if {f > 0} is µ-almost surely identical to {h > ρ}. [sent-41, score-0.085]
14 Furthermore, for a sequence of functions fn : X → R with Sµ,h,ρ (fn ) → 0 we easily see that sign fn → 1{h>ρ} both µ-almost and Q-almost surely if 1A denotes the indicator function of a set A. [sent-42, score-0.411]
15 2 Detecting density levels is a classification problem In this section we show how the density level detection (DLD) problem can be formulated as a binary classification problem. [sent-44, score-0.382]
16 To this end we write Y := {−1, 1} and define: Definition 2. [sent-45, score-0.102]
17 Then the probability measure Q s µ on X × Y is defined by Q s µ (A) := sEx∼Q 1A (x, 1) + (1 − s)Ex∼µ 1A (x, −1) for all measurable A ⊂ X × Y . [sent-47, score-0.227]
18 Obviously, the measure P := Q s µ can be associated with a binary classification problem in which positive samples are drawn from Q and negative samples are drawn from µ. [sent-49, score-0.221]
19 Inspired by this interpretation let us recall that the binary classification risk for a measurable function f : X → R and a distribution P on X × Y is defined by RP (f ) = P {(x, y) : sign f (x) = y} , where we define sign t := 1 if t > 0 and sign t = −1 otherwise. [sent-50, score-0.615]
20 Furthermore, we denote the Bayes risk of P by RP := inf{RP (f ) f : X → R measurable}. [sent-51, score-0.128]
21 We will show that learning with respect to Sµ,h,ρ is equivalent to learning with respect to RP (. [sent-52, score-0.106]
22 To this end we begin with the following easy to prove but fundamental proposition: Proposition 2. [sent-54, score-0.09]
23 2 Let µ and Q be probability measures on X such that Q has a density h with respect to µ, and let s ∈ (0, 1). [sent-55, score-0.246]
24 sh(x) + 1 − s Note that the above formula for PX implies that the µ-zero sets of X are exactly the PX zero sets of X. [sent-60, score-0.072]
25 2 the latter is µ-almost surely equivalent to η(x) := P (y = 1|x) > 1/2 and hence µ({η > 1/2} {h > ρ}) = 0. [sent-69, score-0.096]
26 ) can serve as a performance measure as the following theorem shows: Theorem 2. [sent-72, score-0.104]
27 3 Let µ and Q be distributions on X such that Q has a density h with respect to 1 µ. [sent-73, score-0.16]
28 Then for all sequences (fn ) of measurable functions fn : X → R the following are equivalent: i) SP (fn ) → 0. [sent-76, score-0.322]
29 In particular, for measurable f : X → R we have SP (f ) = 0 if and only if RP (f ) = RP . [sent-78, score-0.176]
30 Since µ({h > ρ} 1 }) = 0 it is easy to see that the classification risk of fn can be computed by 2 RP (fn ) = RP + En {η > |2η − 1|dPX . [sent-80, score-0.303]
31 This implies that the measures |2η − 1|dPX and PX are absolutely continuous with respect to each other. [sent-82, score-0.13]
32 2 that PX and µ are absolutely continuous with respect to each other. [sent-84, score-0.085]
33 Now, the assertion follows from SP (fn ) = µ(En ). [sent-85, score-0.086]
34 ) as a performance measure for the DLD problem one can alternatively use the classification risk RP (. [sent-88, score-0.179]
35 Therefore, we will establish some basic properties of this performance measure in the following. [sent-90, score-0.101]
36 To this end we write I(y, t) := 1(−∞,0] (yt), y ∈ Y and t ∈ R, for the standard classification loss function. [sent-91, score-0.102]
37 Then for all measurable f : X → R we have RP (f ) = 1 1+ρ 1 ρ EQ I(1, sign f ) + Eµ I(−1, sign f ) . [sent-95, score-0.31]
38 1+ρ 1+ρ It is interesting that the classification risk RP (. [sent-96, score-0.128]
39 ) is strongly connected with another approach for the DLD problem which is based on the so-called excess mass (see e. [sent-97, score-0.249]
40 To be more precise let us first recall that the excess mass of a measurable function f : X → R is defined by EP (f ) := Q({f > 0}) − ρµ({f > 0}) , where Q, ρ and µ have the usual meaning. [sent-100, score-0.509]
41 Then for all measurable f : X → R we have 1 1+ρ EP (f ) = 1 − (1 + ρ)RP (f ) . [sent-106, score-0.176]
42 If Q is an empirical measure based on a training set T in the definition of EP (. [sent-107, score-0.102]
43 ) we obtain the empirical excess mass which we denote by ET (. [sent-108, score-0.3]
44 Then given a function class F the (empirical) excess mass approach chooses a function fT ∈ F which maximizes ET (. [sent-110, score-0.249]
45 Since the above proposition shows ET (f ) = 1 − 1 n n i=1 I(1, sign f (xi )) − ρEµ I(−1, sign f ) . [sent-112, score-0.254]
46 we see that this approach is actually a type of empirical risk minimization (ERM). [sent-113, score-0.179]
47 In the above mentioned papers the analysis of the excess mass approach needs an additional assumption on the behaviour of h around the level ρ. [sent-114, score-0.301]
48 6 Let µ be a distribution on X and h : X → [0, ∞) be a measurable function with hdµ = 1, i. [sent-118, score-0.176]
49 956] used (2) for the analysis of a DLD method which is based on a localized version of the empirical excess mass approach. [sent-128, score-0.3]
50 7 Let ρ > 0 and µ and Q be probability measures on X such that Q has a 1 density h with respect to µ. [sent-142, score-0.205]
51 Then we have i) If h is bounded there is a c > 0 such that for all measurable f : X → R we have RP (f ) − RP ≤ c SP (f ) . [sent-144, score-0.176]
52 ii) If h has ρ-exponent q there is a c > 0 such that for all measurable f : X → R we have SP (f ) ≤ c RP (f ) − RP q 1+q . [sent-145, score-0.176]
53 Sketch of the proof: The first assertion directly follows from (1) and Proposition 2. [sent-146, score-0.086]
54 3 A support vector machine for density level detection One of the benefits of interpreting the DLD problem as a classification problem is that we can construct an SVM for the DLD problem. [sent-158, score-0.315]
55 To this end let k : X × X → R be a positive definite kernel with reproducing kernel Hilbert space (RKHS) H. [sent-159, score-0.152]
56 Furthermore, let µ be a known probability measure on X and l : Y × R → [0, ∞) be the hinge loss function, i. [sent-160, score-0.12]
57 Although the measure µ is known, almost always the expectation Ex∼µ l(−1, f (x)) can be only numerically approximated by using finitely many function evaluations of f . [sent-168, score-0.08]
58 Furthermore it is also closely related to some heuristic methods for anomaly detection that are based on artificial samples (see [9] for more information). [sent-177, score-0.267]
59 For simplicity we will only state results for the Gaussian RBF kernel with width 1/σ, i. [sent-179, score-0.067]
60 We begin with a consistency result with respect to the performance measure RP (. [sent-183, score-0.156]
61 3 this is equivalent to consistency with respect to SP (. [sent-186, score-0.103]
62 Furthermore, let ρ > 0, and µ and Q be distributions on X such that Q has a 1 density h with respect to µ. [sent-189, score-0.201]
63 Sketch of the proof: Let us introduce the shorthand ν = Q ⊗ µ for the product measure of Q and µ. [sent-192, score-0.103]
64 Moreover, for a measurable function f : X → R we define the function g ◦ f : X × X → R by 1 ρ g ◦ f (x, x ) := l(1, f (x)) + l(−1, f (x )) , x, x ∈ X. [sent-193, score-0.176]
65 Analogously, we see ET ⊗T g ◦ f = ET s T l ◦ f if T ⊗ T denotes the product measure of the empirical measures based on T and T . [sent-196, score-0.147]
66 Now, using Hoeffding’s inequality for ν it is easy to establish a concentration inequality in the sense of [13, Lem. [sent-197, score-0.161]
67 In general, we cannot obtain convergence rates in the above theorem without assuming specific conditions on h, ρ, and µ. [sent-201, score-0.104]
68 To this end we write d(x, {h > ρ}) if x ∈ {h < ρ} τx := d(x, {h < ρ}) if x ∈ {h ≥ ρ} , where d(x, A) denotes the Euclidian distance between x and a set A. [sent-203, score-0.102]
69 2 Let µ be a distribution on X ⊂ Rd and h : X → [0, ∞) be a measurable function with hdµ = 1, i. [sent-205, score-0.176]
70 Since {h > ρ} and {h ≤ ρ} are the classes which have to be discriminated when interpreting the DLD problem as a classification problem it is easy to check by Proposition 2. [sent-209, score-0.068]
71 The latter is a sufficient condition for P to have geometric noise exponent α in the sense of [8]. [sent-211, score-0.097]
72 For fixed ρ > 0 assume that the density h has both ρ-exponent 0 < q ≤ ∞ and geometric ρ-exponent 0 < α < ∞. [sent-215, score-0.152]
73 7 we immediately obtain rates with respect to the performance measure SP (. [sent-222, score-0.144]
74 It turns out that these rates are very similar to those in [5] and [6] for the empirical excess mass approach. [sent-224, score-0.351]
75 4 Experiments We present experimental results for anomaly detection problems where the set X is a subset of Rd . [sent-225, score-0.219]
76 These algorithms are compared based on their risk RP (f ). [sent-227, score-0.128]
77 The data in each problem is partitioned into three pairs of sets; the training sets (T, T ), the validation sets (V, V ) and the test sets (W, W ). [sent-228, score-0.179]
78 The sets T , V and W contain samples drawn from Q and the sets T ,V and W contain samples drawn from µ. [sent-229, score-0.216]
79 The training and validation sets are used to design f and the test sets are used to estimate its performance by computing an empirical version of RP (f ) that we denote R(W,W ) (f ). [sent-230, score-0.194]
80 The first learning algorithm is the density level detection support vector machine (DLD– SVM) with Gaussian RBF kernel described in the previous section. [sent-231, score-0.313]
81 The regularization parameters λ and σ 2 are chosen to (approximately) minimize the empirical risk R(V,V ) (f ) on the validation sets. [sent-233, score-0.25]
82 This is accomplished by employing a grid search over λ and a combined grid/iterative search over σ 2 . [sent-234, score-0.17]
83 In particular, for each fixed grid value of λ we seek a minimizer over σ 2 by evaluating the validation risk at a coarse grid of σ 2 values and then performing a Golden search over the interval defined by the two σ 2 values on either side of the coarse grid minimum. [sent-235, score-0.613]
84 As the overall search proceeds the (λ, σ 2 ) pair with the smallest validation risk is retained. [sent-236, score-0.239]
85 The second learning algorithm is the one–class support vector machine (1CLASS–SVM) introduced by Sch¨ lkopf et al. [sent-237, score-0.099]
86 In particular both ν and σ 2 are chosen to (approximately) minimize the validation risk using the search procedure described above for the DLD–SVM where the grid search for λ is replaced by a Golden search (over [0, 1]) for ν. [sent-244, score-0.409]
87 Data for the second experiment are identical to the first except that the vector (0, 0, 0, 0, 0, 0, 0, 0, 0, 1) is added to the samples of x with probability 0. [sent-251, score-0.075]
88 The training and validation set sizes are |T | = 1000, |T | = 2000, |V | = 500, and |V | = 2000. [sent-254, score-0.114]
89 The λ grid for the DLD–SVM method consists of 15 values ranging from 10−7 to 1 and the coarse σ 2 grid for the DLD–SVM and 1CLASS–SVM methods consists of 9 values that range from 10−3 to 102 . [sent-256, score-0.272]
90 Figure 1(a) plots the risk R(W,W ) versus ρ for the two learning algorithms. [sent-258, score-0.128]
91 Comparisons for larger sizes of |T | and |V | yield similar results, but at smaller sample sizes the superiority of DLD–SVM is even more pronounced. [sent-261, score-0.086]
92 This is significant because ρ 1 appears to have little utility in the general anomaly detection problem since it defines anomalies in regions where the concentration of Q is much larger than the concentration of µ, which is contrary to our premise that anomalies are not concentrated. [sent-262, score-0.599]
93 The training, validation and test set sizes are |T | = 4000, |T | = 10000, |V | = 2000, |V | = 100, 000, |W | = 5664 and |W | = 100, 000. [sent-269, score-0.114]
94 grid for the DLD–SVM method consists of 7 values ranging from 10−7 to 10−1 and the coarse σ 2 grid for the DLD–SVM and 1CLASS–SVM methods consists of 6 values that range from 10−3 to 102 . [sent-286, score-0.272]
95 Figure 1(b) plots the risk R(W,W ) versus ρ for the two learning algorithms. [sent-290, score-0.128]
96 Estimation of a convex density contour in 2 dimensions. [sent-312, score-0.145]
97 Measuring mass concentrations and estimating density contour clusters—an excess mass aproach. [sent-320, score-0.496]
98 Learning distributions by their density levels: a paradigm for learning without a teacher. [sent-334, score-0.118]
99 Learning rates for support vector machines for density level detection. [sent-375, score-0.259]
100 Consistency of support vector machines and other regularized kernel machines. [sent-379, score-0.075]
wordName wordTfidf (topN-words)
[('dld', 0.571), ('rp', 0.413), ('svm', 0.184), ('measurable', 0.176), ('sp', 0.167), ('anomaly', 0.151), ('excess', 0.147), ('fn', 0.146), ('px', 0.142), ('risk', 0.128), ('proposition', 0.12), ('density', 0.118), ('anomalies', 0.108), ('ft', 0.106), ('mass', 0.102), ('grid', 0.09), ('ep', 0.087), ('hd', 0.086), ('assertion', 0.086), ('concentration', 0.082), ('hush', 0.075), ('validation', 0.071), ('sh', 0.069), ('ex', 0.068), ('furthermore', 0.068), ('detection', 0.068), ('sign', 0.067), ('write', 0.065), ('dpx', 0.065), ('rbf', 0.061), ('alamos', 0.059), ('scovel', 0.059), ('tl', 0.055), ('theorem', 0.053), ('classi', 0.052), ('shorthand', 0.052), ('surely', 0.052), ('level', 0.052), ('coarse', 0.052), ('measure', 0.051), ('rates', 0.051), ('empirical', 0.051), ('establish', 0.05), ('ctq', 0.05), ('dq', 0.05), ('tr', 0.049), ('samples', 0.048), ('rd', 0.046), ('measures', 0.045), ('sizes', 0.043), ('cation', 0.043), ('anomalous', 0.043), ('absolutely', 0.043), ('erm', 0.043), ('recall', 0.043), ('respect', 0.042), ('los', 0.042), ('let', 0.041), ('en', 0.041), ('exponent', 0.041), ('search', 0.04), ('ranging', 0.04), ('interpreting', 0.039), ('golden', 0.039), ('consistency', 0.039), ('support', 0.038), ('kernel', 0.037), ('end', 0.037), ('steinwart', 0.037), ('tsybakov', 0.037), ('sets', 0.036), ('obviously', 0.035), ('euclidian', 0.035), ('geometric', 0.034), ('psfrag', 0.033), ('coincides', 0.033), ('lkopf', 0.032), ('replacements', 0.032), ('laboratory', 0.031), ('libsvm', 0.03), ('width', 0.03), ('et', 0.029), ('easy', 0.029), ('proof', 0.029), ('therein', 0.029), ('evaluations', 0.029), ('hinge', 0.028), ('contour', 0.027), ('experiment', 0.027), ('sch', 0.027), ('binary', 0.026), ('concentrated', 0.026), ('bt', 0.025), ('national', 0.025), ('begin', 0.024), ('drawn', 0.024), ('sketch', 0.023), ('latter', 0.022), ('ne', 0.022), ('equivalent', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 49 nips-2004-Density Level Detection is Classification
Author: Ingo Steinwart, Don Hush, Clint Scovel
Abstract: We show that anomaly detection can be interpreted as a binary classification problem. Using this interpretation we propose a support vector machine (SVM) for anomaly detection. We then present some theoretical results which include consistency and learning rates. Finally, we experimentally compare our SVM with the standard one-class SVM. 1
2 0.30154544 69 nips-2004-Fast Rates to Bayes for Kernel Machines
Author: Ingo Steinwart, Clint Scovel
Abstract: We establish learning rates to the Bayes risk for support vector machines (SVMs) with hinge loss. In particular, for SVMs with Gaussian RBF kernels we propose a geometric condition for distributions which can be used to determine approximation properties of these kernels. Finally, we compare our methods with a recent paper of G. Blanchard et al.. 1
3 0.12355797 34 nips-2004-Breaking SVM Complexity with Cross-Training
Author: Léon Bottou, Jason Weston, Gökhan H. Bakir
Abstract: We propose to selectively remove examples from the training set using probabilistic estimates related to editing algorithms (Devijver and Kittler, 1982). This heuristic procedure aims at creating a separable distribution of training examples with minimal impact on the position of the decision boundary. It breaks the linear dependency between the number of SVs and the number of training examples, and sharply reduces the complexity of SVMs during both the training and prediction stages. 1
4 0.11507528 166 nips-2004-Semi-supervised Learning via Gaussian Processes
Author: Neil D. Lawrence, Michael I. Jordan
Abstract: We present a probabilistic approach to learning a Gaussian Process classifier in the presence of unlabeled data. Our approach involves a “null category noise model” (NCNM) inspired by ordered categorical noise models. The noise model reflects an assumption that the data density is lower between the class-conditional densities. We illustrate our approach on a toy problem and present comparative results for the semi-supervised classification of handwritten digits. 1
5 0.10288729 144 nips-2004-Parallel Support Vector Machines: The Cascade SVM
Author: Hans P. Graf, Eric Cosatto, Léon Bottou, Igor Dourdanovic, Vladimir Vapnik
Abstract: We describe an algorithm for support vector machines (SVM) that can be parallelized efficiently and scales to very large problems with hundreds of thousands of training vectors. Instead of analyzing the whole training set in one optimization step, the data are split into subsets and optimized separately with multiple SVMs. The partial results are combined and filtered again in a ‘Cascade’ of SVMs, until the global optimum is reached. The Cascade SVM can be spread over multiple processors with minimal communication overhead and requires far less memory, since the kernel matrices are much smaller than for a regular SVM. Convergence to the global optimum is guaranteed with multiple passes through the Cascade, but already a single pass provides good generalization. A single pass is 5x – 10x faster than a regular SVM for problems of 100,000 vectors when implemented on a single processor. Parallel implementations on a cluster of 16 processors were tested with over 1 million vectors (2-class problems), converging in a day or two, while a regular SVM never converged in over a week. 1
6 0.096351206 168 nips-2004-Semigroup Kernels on Finite Sets
7 0.093324602 137 nips-2004-On the Adaptive Properties of Decision Trees
8 0.089099325 139 nips-2004-Optimal Aggregation of Classifiers and Boosting Maps in Functional Magnetic Resonance Imaging
9 0.087563217 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
10 0.085130706 15 nips-2004-Active Learning for Anomaly and Rare-Category Detection
11 0.083049856 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
12 0.079185672 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants
13 0.07347814 92 nips-2004-Kernel Methods for Implicit Surface Modeling
14 0.073023625 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound
15 0.068716101 41 nips-2004-Comparing Beliefs, Surveys, and Random Walks
16 0.063922524 96 nips-2004-Learning, Regularization and Ill-Posed Inverse Problems
17 0.060249176 143 nips-2004-PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data
18 0.058190241 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning
19 0.05709755 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices
20 0.055559095 68 nips-2004-Face Detection --- Efficient and Rank Deficient
topicId topicWeight
[(0, -0.191), (1, 0.079), (2, -0.02), (3, 0.144), (4, -0.034), (5, 0.048), (6, 0.079), (7, -0.085), (8, 0.086), (9, -0.113), (10, -0.064), (11, 0.055), (12, 0.093), (13, -0.054), (14, -0.135), (15, -0.326), (16, 0.009), (17, 0.129), (18, -0.076), (19, 0.093), (20, -0.236), (21, 0.068), (22, 0.057), (23, 0.019), (24, -0.087), (25, 0.005), (26, 0.014), (27, -0.089), (28, -0.014), (29, -0.0), (30, -0.033), (31, 0.103), (32, 0.034), (33, -0.067), (34, 0.163), (35, 0.007), (36, 0.014), (37, -0.001), (38, -0.017), (39, 0.096), (40, 0.025), (41, -0.004), (42, 0.008), (43, -0.003), (44, 0.039), (45, -0.016), (46, 0.016), (47, 0.035), (48, 0.004), (49, -0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.94774783 49 nips-2004-Density Level Detection is Classification
Author: Ingo Steinwart, Don Hush, Clint Scovel
Abstract: We show that anomaly detection can be interpreted as a binary classification problem. Using this interpretation we propose a support vector machine (SVM) for anomaly detection. We then present some theoretical results which include consistency and learning rates. Finally, we experimentally compare our SVM with the standard one-class SVM. 1
2 0.90651059 69 nips-2004-Fast Rates to Bayes for Kernel Machines
Author: Ingo Steinwart, Clint Scovel
Abstract: We establish learning rates to the Bayes risk for support vector machines (SVMs) with hinge loss. In particular, for SVMs with Gaussian RBF kernels we propose a geometric condition for distributions which can be used to determine approximation properties of these kernels. Finally, we compare our methods with a recent paper of G. Blanchard et al.. 1
3 0.73003042 137 nips-2004-On the Adaptive Properties of Decision Trees
Author: Clayton Scott, Robert Nowak
Abstract: Decision trees are surprisingly adaptive in three important respects: They automatically (1) adapt to favorable conditions near the Bayes decision boundary; (2) focus on data distributed on lower dimensional manifolds; (3) reject irrelevant features. In this paper we examine a decision tree based on dyadic splits that adapts to each of these conditions to achieve minimax optimal rates of convergence. The proposed classifier is the first known to achieve these optimal rates while being practical and implementable. 1
4 0.51797009 34 nips-2004-Breaking SVM Complexity with Cross-Training
Author: Léon Bottou, Jason Weston, Gökhan H. Bakir
Abstract: We propose to selectively remove examples from the training set using probabilistic estimates related to editing algorithms (Devijver and Kittler, 1982). This heuristic procedure aims at creating a separable distribution of training examples with minimal impact on the position of the decision boundary. It breaks the linear dependency between the number of SVs and the number of training examples, and sharply reduces the complexity of SVMs during both the training and prediction stages. 1
5 0.48838583 166 nips-2004-Semi-supervised Learning via Gaussian Processes
Author: Neil D. Lawrence, Michael I. Jordan
Abstract: We present a probabilistic approach to learning a Gaussian Process classifier in the presence of unlabeled data. Our approach involves a “null category noise model” (NCNM) inspired by ordered categorical noise models. The noise model reflects an assumption that the data density is lower between the class-conditional densities. We illustrate our approach on a toy problem and present comparative results for the semi-supervised classification of handwritten digits. 1
6 0.48448363 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
7 0.43127385 144 nips-2004-Parallel Support Vector Machines: The Cascade SVM
8 0.37414217 96 nips-2004-Learning, Regularization and Ill-Posed Inverse Problems
9 0.36279851 143 nips-2004-PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data
10 0.35524923 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound
11 0.34924203 139 nips-2004-Optimal Aggregation of Classifiers and Boosting Maps in Functional Magnetic Resonance Imaging
12 0.34315205 168 nips-2004-Semigroup Kernels on Finite Sets
13 0.33600241 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants
14 0.33440819 106 nips-2004-Machine Learning Applied to Perception: Decision Images for Gender Classification
15 0.33366624 68 nips-2004-Face Detection --- Efficient and Rank Deficient
16 0.32014632 6 nips-2004-A Hidden Markov Model for de Novo Peptide Sequencing
17 0.30853665 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms
18 0.29313168 92 nips-2004-Kernel Methods for Implicit Surface Modeling
19 0.28701699 15 nips-2004-Active Learning for Anomaly and Rare-Category Detection
20 0.28345916 94 nips-2004-Kernels for Multi--task Learning
topicId topicWeight
[(13, 0.056), (15, 0.104), (26, 0.063), (31, 0.028), (33, 0.199), (39, 0.029), (50, 0.422)]
simIndex simValue paperId paperTitle
1 0.92590064 199 nips-2004-Using Machine Learning to Break Visual Human Interaction Proofs (HIPs)
Author: Kumar Chellapilla, Patrice Y. Simard
Abstract: Machine learning is often used to automatically solve human tasks. In this paper, we look for tasks where machine learning algorithms are not as good as humans with the hope of gaining insight into their current limitations. We studied various Human Interactive Proofs (HIPs) on the market, because they are systems designed to tell computers and humans apart by posing challenges presumably too hard for computers. We found that most HIPs are pure recognition tasks which can easily be broken using machine learning. The harder HIPs use a combination of segmentation and recognition tasks. From this observation, we found that building segmentation tasks is the most effective way to confuse machine learning algorithms. This has enabled us to build effective HIPs (which we deployed in MSN Passport), as well as design challenging segmentation tasks for machine learning algorithms. 1 In t rod u ct i on The OCR problem for high resolution printed text has virtually been solved 10 years ago [1]. On the other hand, cursive handwriting recognition today is still too poor for most people to rely on. Is there a fundamental difference between these two seemingly similar problems? To shed more light on this question, we study problems that have been designed to be difficult for computers. The hope is that we will get some insight on what the stumbling blocks are for machine learning and devise appropriate tests to further understand their similarities and differences. Work on distinguishing computers from humans traces back to the original Turing Test [2] which asks that a human distinguish between another human and a machine by asking questions of both. Recent interest has turned to developing systems that allow a computer to distinguish between another computer and a human. These systems enable the construction of automatic filters that can be used to prevent automated scripts from utilizing services intended for humans [4]. Such systems have been termed Human Interactive Proofs (HIPs) [3] or Completely Automated Public Turing Tests to Tell Computers and Humans Apart (CAPTCHAs) [4]. An overview of the work in this area can be found in [5]. Construction of HIPs that are of practical value is difficult because it is not sufficient to develop challenges at which humans are somewhat more successful than machines. This is because the cost of failure for an automatic attacker is minimal compared to the cost of failure for humans. Ideally a HIP should be solved by humans more than 80% of the time, while an automatic script with reasonable resource use should succeed less than 0.01% of the time. This latter ratio (1 in 10,000) is a function of the cost of an automatic trial divided by the cost of having a human perform the attack. This constraint of generating tasks that are failed 99.99% of the time by all automated algorithms has generated various solutions which can easily be sampled on the internet. Seven different HIPs, namely, Mailblocks, MSN (before April 28th, 2004), Ticketmaster, Yahoo, Yahoo v2 (after Sept’04), Register, and Google, will be given as examples in the next section. We will show in Section 3 that machinelearning-based attacks are far more successful than 1 in 10,000. Yet, some of these HIPs are harder than others and could be made even harder by identifying the recognition and segmentation parts, and emphasizing the latter. Section 4 presents examples of more difficult HIPs which are much more respectable challenges for machine learning, and yet surprisingly easy for humans. The final section discusses a (known) weakness of machine learning algorithms and suggests designing simple artificial datasets for studying this weakness. 2 Exa mp les o f H I Ps The HIPs explored in this paper are made of characters (or symbols) rendered to an image and presented to the user. Solving the HIP requires identifying all characters in the correct order. The following HIPs can be sampled from the web: Mailblocks: While signing up for free email service with (www.mailblocks.com), you will find HIP challenges of the type: mailblocks MSN: While signing up for free e-mail with MSN Hotmail (www.hotmail.com), you will find HIP challenges of the type: Register.com: While requesting a whois lookup for a domain at www.register.com, you will HIP challenges of the type: Yahoo!/EZ-Gimpy (CMU): While signing up for free e-mail service with Yahoo! (www.yahoo.com), you will receive HIP challenges of the type: Yahoo! (version 2): Starting in August 2004, Yahoo! introduced their second generation HIP. Three examples are presented below: Ticketmaster: While looking for concert tickets at www.ticketmaster.com, you will receive HIP challenges of the type: Google/Gmail: While signing up for free e-mail with Gmail at www.google.com, one will receive HIP challenges of the type: While solutions to Yahoo HIPs are common English words, those for ticketmaster and Google do not necessarily belong to the English dictionary. They appear to have been created using a phonetic generator [8]. 3 Usi n g ma ch i n e lea rn i n g t o b rea k H IP s Breaking HIPs is not new. Mori and Malik [7] have successfully broken the EZGimpy (92% success) and Gimpy (33% success) HIPs from CMU. Our approach aims at an automatic process for solving multiple HIPs with minimum human intervention, using machine learning. In this paper, our main goal is to learn more about the common strengths and weaknesses of these HIPs rather than to prove that we can break any one HIP in particular with the highest possible success rate. We have results for six different HIPs: EZ-Gimpy/Yahoo, Yahoo v2, mailblocks, register, ticketmaster, and Google. To simplify our study, we will not be using language models in our attempt to break HIPs. For example, there are only about 600 words in the EZ-Gimpy dictionary [7], which means that a random guess attack would get a success rate of 1 in 600 (more than enough to break the HIP, i.e., greater than 0.01% success). HIPs become harder when no language model is used. Similarly, when a HIP uses a language model to generate challenges, success rate of attacks can be significantly improved by incorporating the language model. Further, since the language model is not common to all HIPs studied, it was not used in this paper. Our generic method for breaking all of these HIPs is to write a custom algorithm to locate the characters, and then use machine learning for recognition. Surprisingly, segmentation, or finding the characters, is simple for many HIPs which makes the process of breaking the HIP particularly easy. Gimpy uses a single constant predictable color (black) for letters even though the background color changes. We quickly realized that once the segmentation problem is solved, solving the HIP becomes a pure recognition problem, and it can trivially be solved using machine learning. Our recognition engine is based on neural networks [6][9]. It yielded a 0.4% error rate on the MNIST database, uses little memory, and is very fast for recognition (important for breaking HIPs). For each HIP, we have a segmentation step, followed by a recognition step. It should be stressed that we are not trying to solve every HIP of a given type i.e., our goal is not 100% success rate, but something efficient that can achieve much better than 0.01%. In each of the following experiments, 2500 HIPs were hand labeled and used as follows (a) recognition (1600 for training, 200 for validation, and 200 for testing), and (b) segmentation (500 for testing segmentation). For each of the five HIPs, a convolution neural network, identical to the one described in [6], was trained and tested on gray level character images centered on the guessed character positions (see below). The trained neural network became the recognizer. 3.1 M a i l b l oc k s To solve the HIP, we select the red channel, binarize and erode it, extract the largest connected components (CCs), and breakup CCs that are too large into two or three adjacent CCs. Further, vertically overlapping half character size CCs are merged. The resulting rough segmentation works most of the time. Here is an example: For instance, in the example above, the NN would be trained, and tested on the following images: … The end-to-end success rate is 88.8% for segmentation, 95.9% for recognition (given correct segmentation), and (0.888)*(0.959)7 = 66.2% total. Note that most of the errors come from segmentation, even though this is where all the custom programming was invested. 3.2 Register The procedure to solve HIPs is very similar. The image was smoothed, binarized, and the largest 5 connected components were identified. Two examples are presented below: The end-to-end success rate is 95.4% for segmentation, 87.1% for recognition (given correct segmentation), and (0.954)*(0.871)5 = 47.8% total. 3.3 Y a h oo/ E Z - G i mp y Unlike the mailblocks and register HIPs, the Yahoo/EZ-Gimpy HIPs are richer in that a variety of backgrounds and clutter are possible. Though some amount of text warping is present, the text color, size, and font have low variability. Three simple segmentation algorithms were designed with associated rules to identify which algorithm to use. The goal was to keep these simple yet effective: a) No mesh: Convert to grayscale image, threshold to black and white, select large CCs with sizes close to HIP char sizes. One example: b) Black mesh: Convert to grayscale image, threshold to black and white, remove vertical and horizontal line pixels that don’t have neighboring pixels, select large CCs with sizes close to HIP char sizes. One example: c) White mesh: Convert to grayscale image, threshold to black and white, add black pixels (in white line locations) if there exist neighboring pixels, select large CCs with sizes close to HIP char sizes. One example: Tests for black and white meshes were performed to determine which segmentation algorithm to use. The end-to-end success rate was 56.2% for segmentation (38.2% came from a), 11.8% from b), and 6.2% from c), 90.3% for recognition (given correct segmentation), and (0.562)*(0.903)4.8 = 34.4% total. The average length of a Yahoo HIP solution is 4.8 characters. 3.4 T i c k e t ma s t e r The procedure that solved the Yahoo HIP is fairly successful at solving some of the ticket master HIPs. These HIPs are characterized by cris-crossing lines at random angles clustered around 0, 45, 90, and 135 degrees. A multipronged attack as in the Yahoo case (section 3.3) has potential. In the interests of simplicity, a single attack was developed: Convert to grayscale, threshold to black and white, up-sample image, dilate first then erode, select large CCs with sizes close to HIP char sizes. One example: The dilate-erode combination causes the lines to be removed (along with any thin objects) but retains solid thick characters. This single attack is successful in achieving an end-to-end success rate of 16.6% for segmentation, the recognition rate was 82.3% (in spite of interfering lines), and (0.166)*(0.823)6.23 = 4.9% total. The average HIP solution length is 6.23 characters. 3.5 Y a h oo ve r s i on 2 The second generation HIP from Yahoo had several changes: a) it did not use words from a dictionary or even use a phonetic generator, b) it uses only black and white colors, c) uses both letters and digits, and d) uses connected lines and arcs as clutter. The HIP is somewhat similar to the MSN/Passport HIP which does not use a dictionary, uses two colors, uses letters and digits, and background and foreground arcs as clutter. Unlike the MSN/Passport HIP, several different fonts are used. A single segmentation attack was developed: Remove 6 pixel border, up-sample, dilate first then erode, select large CCs with sizes close to HIP char sizes. The attack is practically identical to that used for the ticketmaster HIP with different preprocessing stages and slightly modified parameters. Two examples: This single attack is successful in achieving an end-to-end success rate of 58.4% for segmentation, the recognition rate was 95.2%, and (0.584)*(0.952)5 = 45.7% total. The average HIP solution length is 5 characters. 3.6 G oog l e / G M a i l The Google HIP is unique in that it uses only image warp as a means of distorting the characters. Similar to the MSN/Passport and Yahoo version 2 HIPs, it is also two color. The HIP characters are arranged closed to one another (they often touch) and follow a curved baseline. The following very simple attack was used to segment Google HIPs: Convert to grayscale, up-sample, threshold and separate connected components. a) b) This very simple attack gives an end-to-end success rate of 10.2% for segmentation, the recognition rate was 89.3%, giving (0.102)*(0.893)6.5 = 4.89% total probability of breaking a HIP. Average Google HIP solution length is 6.5 characters. This can be significantly improved upon by judicious use of dilate-erode attack. A direct application doesn’t do as well as it did on the ticketmaster and yahoo HIPs (because of the shear and warp of the baseline of the word). More successful and complicated attacks might estimate and counter the shear and warp of the baseline to achieve better success rates. 4 Lesso n s lea rn ed f ro m b rea ki n g H IPs From the previous section, it is clear that most of the errors come from incorrect segmentations, even though most of the development time is spent devising custom segmentation schemes. This observation raises the following questions: Why is segmentation a hard problem? Can we devise harder HIPs and datasets? Can we build an automatic segmentor? Can we compare classification algorithms based on how useful they are for segmentation? 4.1 T h e s e g me n t a t i on p r ob l e m As a review, segmentation is difficult for the following reasons: 1. Segmentation is computationally expensive. In order to find valid patterns, a recognizer must attempt recognition at many different candidate locations. 2. The segmentation function is complex. To segment successfully, the system must learn to identify which patterns are valid among the set of all possible valid and non-valid patterns. This task is intrinsically more difficult than classification because the space of input is considerably larger. Unlike the space of valid patterns, the space of non-valid patterns is typically too vast to sample. This is a problem for many learning algorithms which yield too many false positives when presented non-valid patterns. 3. Identifying valid characters among a set of valid and invalid candidates is a combinatorial problem. For example, correctly identifying which 8 characters among 20 candidates (assuming 12 false positives), has a 1 in 125,970 (20 choose 8) chances of success by random guessing. 4.2 B ui l d i n g b e t te r / h a r de r H I P s We can use what we have learned to build better HIPs. For instance the HIP below was designed to make segmentation difficult and a similar version has been deployed by MSN Passport for hotmail registrations (www.hotmail.com): The idea is that the additional arcs are themselves good candidates for false characters. The previous segmentation attacks would fail on this HIP. Furthermore, simple change of fonts, distortions, or arc types would require extensive work for the attacker to adjust to. We believe HIPs that emphasize the segmentation problem, such as the above example, are much stronger than the HIPs we examined in this paper, which rely on recognition being difficult. Pushing this to the extreme, we can easily generate the following HIPs: Despite the apparent difficulty of these HIPs, humans are surprisingly good at solving these, indicating that humans are far better than computers at segmentation. This approach of adding several competing false positives can in principle be used to automatically create difficult segmentation problems or benchmarks to test classification algorithms. 4.3 B ui l d i n g a n a ut o ma t i c s e g me n t or To build an automatic segmentor, we could use the following procedure. Label characters based on their correct position and train a recognizer. Apply the trained recognizer at all locations in the HIP image. Collect all candidate characters identified with high confidence by the recognizer. Compute the probability of each combination of candidates (going from left to right), and output the solution string with the highest probability. This is better illustrated with an example. Consider the following HIP (to the right). The trained neural network has these maps (warm colors indicate recognition) that show that K, Y, and so on are correctly identified. However, the maps for 7 and 9 show several false positives. In general, we would get the following color coded map for all the different candidates: HIP K Y B 7 9 With a threshold of 0.5 on the network’s outputs, the map obtained is: We note that there are several false positives for each true positive. The number of false positives per true positive character was found to be between 1 and 4, giving a 1 in C(16,8) = 12,870 to 1 in C(32,8) = 10,518,300 random chance of guessing the correct segmentation for the HIP characters. These numbers can be improved upon by constraining solution strings to flow sequentially from left to right and by restricting overlap. For each combination, we compute a probability by multiplying the 8 probabilities of the classifier for each position. The combination with the highest probability is the one proposed by the classifier. We do not have results for such an automatic segmentor at this time. It is interesting to note that with such a method a classifier that is robust to false positives would do far better than one that is not. This suggests another axis for comparing classifiers. 5 Con clu si on In this paper, we have successfully applied machine learning to the problem of solving HIPs. We have learned that decomposing the HIP problem into segmentation and recognition greatly simplifies analysis. Recognition on even unprocessed images (given segmentation is a solved) can be done automatically using neural networks. Segmentation, on the other hand, is the difficulty differentiator between weaker and stronger HIPs and requires custom intervention for each HIP. We have used this observation to design new HIPs and new tests for machine learning algorithms with the hope of improving them. A c k n ow l e d ge me n t s We would like to acknowledge Chau Luu and Eric Meltzer for their help with labeling and segmenting various HIPs. We would also like to acknowledge Josh Benaloh and Cem Paya for stimulating discussions on HIP security. References [1] Baird HS (1992), “Anatomy of a versatile page reader,” IEEE Pro., v.80, pp. 1059-1065. [2] Turing AM (1950), “Computing Machinery and Intelligence,” Mind, 59:236, pp. 433-460. [3] First Workshop on Human Interactive Proofs, Palo Alto, CA, January 2002. [4] Von Ahn L, Blum M, and Langford J, The Captcha Project. http://www.captcha.net [5] Baird HS and Popat K (2002) “Human Interactive Proofs and Document Image Analysis,” Proc. IAPR 2002 Workshop on Document Analysis Systerms, Princeton, NJ. [6] Simard PY, Steinkraus D, and Platt J, (2003) “Best Practice for Convolutional Neural Networks Applied to Visual Document Analysis,” in International Conference on Document Analysis and Recognition (ICDAR), pp. 958-962, IEEE Computer Society, Los Alamitos. [7] Mori G, Malik J (2003), “Recognizing Objects in Adversarial Clutter: Breaking a Visual CAPTCHA,” Proc. of the Computer Vision and Pattern Recognition (CVPR) Conference, IEEE Computer Society, vol.1, pages:I-134 - I-141, June 18-20, 2003 [8] Chew, M. and Baird, H. S. (2003), “BaffleText: a Human Interactive Proof,” Proc., 10th IS&T;/SPIE Document Recognition & Retrieval Conf., Santa Clara, CA, Jan. 22. [9] LeCun Y, Bottou L, Bengio Y, and Haffner P, “Gradient-based learning applied to document recognition,’ Proceedings of the IEEE, Nov. 1998.
2 0.88903737 136 nips-2004-On Semi-Supervised Classification
Author: Balaji Krishnapuram, David Williams, Ya Xue, Lawrence Carin, Mário Figueiredo, Alexander J. Hartemink
Abstract: A graph-based prior is proposed for parametric semi-supervised classification. The prior utilizes both labelled and unlabelled data; it also integrates features from multiple views of a given sample (e.g., multiple sensors), thus implementing a Bayesian form of co-training. An EM algorithm for training the classifier automatically adjusts the tradeoff between the contributions of: (a) the labelled data; (b) the unlabelled data; and (c) the co-training information. Active label query selection is performed using a mutual information based criterion that explicitly uses the unlabelled data and the co-training information. Encouraging results are presented on public benchmarks and on measured data from single and multiple sensors. 1
3 0.87033713 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning
Author: Liam Paninski
Abstract: Log-concavity is an important property in the context of optimization, Laplace approximation, and sampling; Bayesian methods based on Gaussian process priors have become quite popular recently for classification, regression, density estimation, and point process intensity estimation. Here we prove that the predictive densities corresponding to each of these applications are log-concave, given any observed data. We also prove that the likelihood is log-concave in the hyperparameters controlling the mean function of the Gaussian prior in the density and point process intensity estimation cases, and the mean, covariance, and observation noise parameters in the classification and regression cases; this result leads to a useful parameterization of these hyperparameters, indicating a suitably large class of priors for which the corresponding maximum a posteriori problem is log-concave.
same-paper 4 0.85527319 49 nips-2004-Density Level Detection is Classification
Author: Ingo Steinwart, Don Hush, Clint Scovel
Abstract: We show that anomaly detection can be interpreted as a binary classification problem. Using this interpretation we propose a support vector machine (SVM) for anomaly detection. We then present some theoretical results which include consistency and learning rates. Finally, we experimentally compare our SVM with the standard one-class SVM. 1
5 0.83660465 29 nips-2004-Beat Tracking the Graphical Model Way
Author: Dustin Lang, Nando D. Freitas
Abstract: We present a graphical model for beat tracking in recorded music. Using a probabilistic graphical model allows us to incorporate local information and global smoothness constraints in a principled manner. We evaluate our model on a set of varied and difficult examples, and achieve impressive results. By using a fast dual-tree algorithm for graphical model inference, our system runs in less time than the duration of the music being processed. 1
6 0.65232396 69 nips-2004-Fast Rates to Bayes for Kernel Machines
7 0.6327225 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
8 0.61509973 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms
9 0.6095379 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms
10 0.60184312 103 nips-2004-Limits of Spectral Clustering
11 0.60087079 158 nips-2004-Sampling Methods for Unsupervised Learning
12 0.59537631 65 nips-2004-Exploration-Exploitation Tradeoffs for Experts Algorithms in Reactive Environments
13 0.59519964 175 nips-2004-Stable adaptive control with online learning
14 0.5881204 48 nips-2004-Convergence and No-Regret in Multiagent Learning
15 0.58503258 156 nips-2004-Result Analysis of the NIPS 2003 Feature Selection Challenge
16 0.58116078 154 nips-2004-Resolving Perceptual Aliasing In The Presence Of Noisy Sensors
17 0.57852781 168 nips-2004-Semigroup Kernels on Finite Sets
18 0.57761669 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss
19 0.57594031 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices
20 0.57512397 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data