jmlr jmlr2005 jmlr2005-3 knowledge-graph by maker-knowledge-mining

3 jmlr-2005-A Classification Framework for Anomaly Detection


Source: pdf

Author: Ingo Steinwart, Don Hush, Clint Scovel

Abstract: One way to describe anomalies is by saying that anomalies are not concentrated. This leads to the problem of finding level sets for the data generating density. We interpret this learning problem as a binary classification problem and compare the corresponding classification risk with the standard performance measure for the density level problem. In particular it turns out that the empirical classification risk can serve as an empirical performance measure for the anomaly detection problem. This allows us to compare different anomaly detection algorithms empirically, i.e. with the help of a test set. Furthermore, by the above interpretation we can give a strong justification for the well-known heuristic of artificially sampling “labeled” samples, provided that the sampling plan is well chosen. In particular this enables us to propose a support vector machine (SVM) for anomaly detection for which we can easily establish universal consistency. Finally, we report some experiments which compare our SVM to other commonly used methods including the standard one-class SVM. Keywords: unsupervised learning, anomaly detection, density levels, classification, SVMs

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Modeling, Algorithms and Informatics Group, CCS-3, Los Alamos National Laboratory, Los Alamos, NM 87545, USA. Editor: Bernhard Schölkopf. Abstract: One way to describe anomalies is by saying that anomalies are not concentrated. [sent-4, score-0.316]

2 We interpret this learning problem as a binary classification problem and compare the corresponding classification risk with the standard performance measure for the density level problem. [sent-6, score-0.282]

3 In particular it turns out that the empirical classification risk can serve as an empirical performance measure for the anomaly detection problem. [sent-7, score-0.727]

4 This allows us to compare different anomaly detection algorithms empirically, i. [sent-8, score-0.522]

5 In particular this enables us to propose a support vector machine (SVM) for anomaly detection for which we can easily establish universal consistency. [sent-12, score-0.563]

6 Keywords: unsupervised learning, anomaly detection, density levels, classification, SVMs [sent-14, score-0.378]

7 Introduction: Anomaly (or novelty) detection aims to detect anomalous observations from a system. [sent-15, score-0.342]

8 This anomaly detection learning problem has many important applications including the detection of e. [sent-18, score-0.769]

9 It is important to note that a typical feature of these applications is that only unlabeled samples are available, and hence one has to make some a priori assumptions on anomalies in order to be able to distinguish between normal and anomalous future observations. [sent-28, score-0.33]

10 One of the most common ways to define anomalies is by saying that anomalies are not concentrated (see e. [sent-29, score-0.316]

11 Obviously, the density level sets {h > ρ}, ρ > 0, describe the concentration of Q. [sent-34, score-0.165]

12 Therefore to define anomalies in terms of the concentration one only has to fix a threshold level ρ > 0 so that a sample x ∈ X is considered to be anomalous whenever h(x) ≤ ρ. [sent-35, score-0.315]

13 We emphasize that given the data-generating distribution Q the choice of µ determines the density h, and consequently anomalies are actually modeled by both µ and ρ. [sent-37, score-0.287]

14 Gaussian mixtures, Parzen windows and k-nearest neighbors density estimates) and therefore for these algorithms defining anomalies is restricted to the choice of ρ. [sent-40, score-0.29]
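The following is a minimal sketch of the classical plug-in approach referred to above: estimate the density and threshold it at ρ. It assumes µ is (proportional to) Lebesgue measure, so that a Parzen-window style kernel density estimate can play the role of h; the helper name and the default bandwidth are illustrative, not from the paper.

```python
# Plug-in density level detection: flag x as anomalous whenever h_hat(x) <= rho.
# Assumption: mu is (proportional to) Lebesgue measure, so a KDE estimates h.
import numpy as np
from scipy.stats import gaussian_kde

def anomalies_by_density_level(X_train, X_test, rho):
    h_hat = gaussian_kde(X_train.T)      # Parzen-window style density estimate
    return h_hat(X_test.T) <= rho        # True where x is declared anomalous
```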

15 In particular, this is true if we consider a modification of the anomaly detection problem where µ is not known but can be sampled from. [sent-42, score-0.522]

16 Finding level sets of an unknown density is also a well known problem in statistics which has some important applications different from anomaly detection. [sent-44, score-0.408]

17 Unfortunately, the algorithms considered in these articles cannot be used for the anomaly detection problem since the imposed assumptions on h are often tailored to the above applications and are in general unrealistic for anomalies. [sent-50, score-0.522]

18 One of the main problems of anomaly detection—or more precisely density level detection—is the lack of an empirical performance measure which allows us to compare the generalization performance of different algorithms by test samples. [sent-51, score-0.471]

19 By interpreting the density level detection problem as binary classification with respect to an appropriate measure, we show that the corresponding empirical classification risk can serve as such an empirical performance measure for anomaly detection. [sent-52, score-0.897]

20 Furthermore, we compare the excess classification risk with the standard performance measure for the density level detection problem. [sent-53, score-0.608]

21 , 2003) for anomaly detection is to generate a labeled data set by assigning one label to the original unlabeled data and another label to a set of artificially generated data, and then apply a binary classification algorithm. [sent-60, score-0.557]

22 By interpreting the density level detection problem as a binary classification problem we can show that this heuristic can be strongly justified provided that the sampling plan for the artificial samples is chosen in a certain way and the classification algorithm used is well adapted to this plan. [sent-61, score-0.532]
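To make the heuristic concrete, here is a minimal sketch of the reduction (not the paper's exact DLD-SVM): the unlabeled data from Q are labeled +1, artificial samples drawn from the reference measure µ are labeled −1, and the two parts are weighted by 1/(1+ρ) and ρ/(1+ρ) as in the risk decomposition discussed below (Proposition 5). The uniform choice of µ, the RBF kernel and the hyperparameter values are illustrative assumptions.

```python
# Sketch of the artificial-sampling reduction to weighted binary classification.
# Assumptions (not from the paper): mu is uniform on the bounding box of the data,
# and a standard soft-margin RBF SVM stands in for the DLD-SVM; C and n_mu are illustrative.
import numpy as np
from sklearn.svm import SVC

def dld_via_classification(X_q, rho, n_mu=2000, seed=0):
    """X_q: unlabeled sample from Q; rho: density level threshold."""
    rng = np.random.default_rng(seed)
    lo, hi = X_q.min(axis=0), X_q.max(axis=0)
    X_mu = rng.uniform(lo, hi, size=(n_mu, X_q.shape[1]))   # artificial sample from mu

    X = np.vstack([X_q, X_mu])
    y = np.hstack([np.ones(len(X_q)), -np.ones(n_mu)])      # Q gets label +1, mu gets label -1

    # Weight the two parts as in the classification risk: 1/(1+rho) on Q, rho/(1+rho) on mu.
    w = np.where(y == 1, 1.0 / ((1 + rho) * len(X_q)), rho / ((1 + rho) * n_mu))

    clf = SVC(kernel="rbf", C=10.0, gamma="scale")
    clf.fit(X, y, sample_weight=w)
    return clf   # clf.predict(x) == -1 flags x as anomalous, i.e. estimates {h <= rho}
```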

23 Detecting Density Levels is a Classification Problem: We begin by rigorously defining the density level detection (DLD) problem. [sent-66, score-0.38]

24 For the density level detection problem and related tasks this is a common assumption (see e. [sent-73, score-0.38]

25 Furthermore, for a sequence of functions fn : X → R with Sµ,h,ρ ( fn ) → 0 we easily see that sign fn (x) → 1{h>ρ} (x) for µ-almost all x ∈ X, and since Q is absolutely continuous with respect to µ the same convergence holds Q-almost surely. [sent-91, score-0.306]

26 Then the probability measure Q s µ on X ×Y is defined by Q s µ (A) := sEx∼Q 1A (x, 1) + (1 − s)Ex∼µ 1A (x, −1) for all measurable subsets A ⊂ X ×Y . [sent-110, score-0.141]

27 Inspired by this interpretation let us recall that the binary classification risk for a measurable function f : X → R and a distribution P on X ×Y is defined by RP ( f ) = P {(x, y) : sign f (x) = y} , where we define signt := 1 if t > 0 and signt = −1 otherwise. [sent-114, score-0.284]

28 Furthermore, the Bayes risk RP of P is the smallest possible classification risk with respect to P, i. [sent-115, score-0.228]

29 Then for all sequences ( fn ) of measurable functions fn : X → R the following are equivalent: i) SP ( fn ) → 0. [sent-146, score-0.304]

30 Since by Corollary 3 we know µ({h > ρ} △ {η > 1/2}) = 0 it is easy to see that the classification risk of fn can be computed by RP(fn) = RP + ∫_{En} |2η − 1| dPX. [sent-150, score-0.246]

31 Therefore, the assertion follows from SP ( fn ) = µ(En ). [sent-156, score-0.151]

32 Theorem 4 shows that instead of using SP as a performance measure for the density level detection problem one can alternatively use the classification risk RP(·). [sent-157, score-0.529]

33 Let s := 1/(1+ρ) and define P := Q s µ. Then for all measurable f : X → R we have RP(f) = (1/(1+ρ)) EQ I(1, sign f) + (ρ/(1+ρ)) Eµ I(−1, sign f). [sent-163, score-0.234]

34 Proof: The first assertion directly follows from RP(f) = P({(x, y) : sign f(x) ≠ y}) = P({(x, 1) : sign f(x) = −1}) + P({(x, −1) : sign f(x) = 1}) = s Q({sign f = −1}) + (1 − s) µ({sign f = 1}) = s EQ I(1, sign f) + (1 − s) Eµ I(−1, sign f). [sent-165, score-0.405]

35 As described at the beginning of this section our main goal is to find a performance measure for the density level detection problem which has an empirical counterpart. [sent-168, score-0.443]

36 Given T = (x1, . . . , xn) ∈ X^n and a measurable function f : X → R we define RT(f) := (1/((1+ρ)n)) ∑_{i=1}^{n} I(1, sign f(xi)) + (ρ/(1+ρ)) Eµ I(−1, sign f). [sent-174, score-0.234]

37 If we identify T with the corresponding empirical measure it is easy to see that RT(f) is the classification risk with respect to the measure T s µ for s := 1/(1+ρ). [sent-175, score-0.212]

38 RP(·) is strongly connected with another approach for the density level detection problem which is based on the so-called excess mass (see e. [sent-180, score-0.515]

39 To be more precise let us first recall that the excess mass of a measurable function f : X → R is defined by EP ( f ) := Q({ f > 0}) − ρµ({ f > 0}) , where Q, ρ and µ have the usual meaning. [sent-183, score-0.241]

40 By the above proposition we have ET(f) = 1 − (1/n) ∑_{i=1}^{n} I(1, sign f(xi)) − ρ Eµ I(−1, sign f) = 1 − (1 + ρ) RT(f)   (2)   for all measurable f : X → R. [sent-193, score-0.287]

41 Now, given a class F of measurable functions from X to R the (empirical) excess mass approach considered e. [sent-194, score-0.241]

42 By equation (2) we see that this approach is actually a type of empirical risk minimization (ERM). [sent-198, score-0.142]

43 956) used (3) for the analysis of a density level detection method which is based on a localized version of the empirical excess mass approach. [sent-213, score-0.543]

44 Mammen and Tsybakov, 1999; Tsybakov, 2004; Steinwart and Scovel, 2004) as the following proposition proved in the appendix shows: Proposition 9 Let µ and Q be distributions on X such that Q has a density h with respect to µ. [sent-216, score-0.156]

45 However it appears that only the “RP ( fn ) → RP ⇒ SP ( fn ) → 0” assertion of Theorem 4 holds instead of equivalence. [sent-245, score-0.217]

46 Therefore we immediately obtain the following corollary. Corollary 13: Under the assumptions of Theorem 12 both the DLD-SVM with offset and without offset are universally consistent with respect to SP(·). [sent-322, score-0.153]

47 This gives a strong justification for the well-known heuristic of adding artificial samples to anomaly detection problems with unlabeled data. [sent-328, score-0.625]

48 Experiments: We present experimental results for anomaly detection problems where the set X is a subset of Rd. [sent-331, score-0.522]

49 Based on Definition 6 we define the empirical risk of f with respect to (S, S′) to be R(S,S′)(f) = (1/((1+ρ)|S|)) ∑_{x∈S} I(1, sign f(x)) + (ρ/((1+ρ)|S′|)) ∑_{x∈S′} I(−1, sign f(x)). [sent-335, score-0.27]

50 the rate at which samples will be labeled anomalous by f), and the quantity (1/|S′|) ∑_{x∈S′} I(−1, sign f(x)) is an estimate of µ({f > 0}) which we call the volume of the predicted normal set. [sent-340, score-0.201]
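As a sketch (hypothetical helper, assuming f > 0 marks the predicted normal set, with test sets S from Q and S′ from µ as above), the alarm rate, the volume of the predicted normal set, and the empirical risk R(S,S′)(f) can be computed together; the risk equals alarm_rate/(1+ρ) + ρ·volume/(1+ρ), which makes the trade-off described next explicit.

```python
import numpy as np

def alarm_rate_volume_risk(f, S, S_prime, rho):
    """Alarm rate = fraction of the Q-test set S predicted anomalous (sign f = -1);
    volume = fraction of the mu-test set S' predicted normal (sign f = +1),
    i.e. an estimate of mu({f > 0}); the empirical risk combines the two."""
    sgn_S = np.where(f(S) > 0, 1, -1)          # sign convention: sign t = 1 if t > 0, else -1
    sgn_Sp = np.where(f(S_prime) > 0, 1, -1)
    alarm_rate = float(np.mean(sgn_S == -1))   # estimate of Q({f <= 0})
    volume = float(np.mean(sgn_Sp == 1))       # estimate of mu({f > 0})
    risk = alarm_rate / (1 + rho) + rho * volume / (1 + rho)
    return alarm_rate, volume, risk
```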

51 Also, from the expression for the risk in Proposition 5 it is clear that for any two functions with the same alarm rate we prefer the function with the smaller volume and vice versa. [sent-344, score-0.18]

52 We consider three different anomaly detection problems: two are synthetic and one is an application in cybersecurity. [sent-347, score-0.522]

53 In each case we define a problem instance to be a triplet consisting of samples from Q, samples from µ, and a value for the density level ρ. [sent-348, score-0.217]

54 (2001) and the others (including the Parzen windows method) are based on some of the most common parametric and non-parametric statistical methods for density-based anomaly detection in Rd. [sent-351, score-0.551]

55 The core procedures are run on the training sets and the values of the free parameters are chosen to minimize the empirical risk (8) on the validation sets. [sent-358, score-0.169]

56 The regularization parameters λ and σ2 are chosen to (approximately) minimize the empirical risk R(V,V′)(f) on the validation sets. [sent-363, score-0.169]

57 In particular, for each value of λ from a fixed grid we seek a minimizer over σ2 by evaluating the validation risk at a coarse grid of σ2 values and then performing a Golden search over the interval defined by the two σ2 values on either side of the coarse grid minimum. [sent-365, score-0.309]

58 As the overall search proceeds the (λ, σ2) pair with the smallest validation risk is retained. [sent-366, score-0.141]
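A sketch of that model-selection loop under stated assumptions: `validation_risk(lam, sigma2)` is a placeholder for training the DLD-SVM at (λ, σ2) and returning the validation risk R(V,V′), and running the Golden search over log σ2 is an implementation choice not spelled out in the text.

```python
import numpy as np

def golden_section_min(obj, lo, hi, tol=1e-2, max_iter=30):
    """Golden(-section) search for a minimizer of a roughly unimodal objective on [lo, hi]."""
    invphi = (np.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    fc, fd = obj(c), obj(d)
    for _ in range(max_iter):
        if b - a < tol:
            break
        if fc < fd:                          # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - invphi * (b - a)
            fc = obj(c)
        else:                                # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + invphi * (b - a)
            fd = obj(d)
    return 0.5 * (a + b)

def select_lambda_sigma(validation_risk, lambdas, sigma2_grid):
    """For each lambda on a fixed grid: coarse grid over sigma^2, then Golden search
    between the two grid points bracketing the coarse minimum; keep the best pair."""
    best_lam, best_sigma2, best_risk = None, None, np.inf
    logs = np.log(np.asarray(sigma2_grid, dtype=float))
    for lam in lambdas:
        coarse = [validation_risk(lam, float(np.exp(s))) for s in logs]
        i = int(np.argmin(coarse))
        lo, hi = logs[max(i - 1, 0)], logs[min(i + 1, len(logs) - 1)]
        s_star = golden_section_min(lambda s: validation_risk(lam, float(np.exp(s))), lo, hi)
        r_star = validation_risk(lam, float(np.exp(s_star)))
        if r_star < best_risk:
            best_lam, best_sigma2, best_risk = lam, float(np.exp(s_star)), r_star
    return best_lam, best_sigma2, best_risk
```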

59 In particular both ν and σ2 are chosen to (approximately) minimize the validation risk using the search procedure described above for the DLD–SVM where the grid search for λ is replaced by a Golden search (over [0, 1]) for ν. [sent-376, score-0.197]
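For comparison, here is a sketch of tuning the standard one-class SVM with the same empirical performance measure, simplified to a plain grid over ν and the kernel width instead of the Golden searches described above; `validation_risk` is a placeholder that evaluates R(V,V′) for a fitted decision function.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def tune_one_class_svm(S_train, validation_risk, nus, gammas):
    """Fit a one-class SVM on the Q-training sample for each (nu, gamma) and keep the
    model whose decision function has the smallest validation risk R_(V,V')."""
    best_model, best_risk = None, np.inf
    for nu in nus:
        for gamma in gammas:
            model = OneClassSVM(nu=nu, gamma=gamma).fit(S_train)
            risk = validation_risk(model.decision_function)   # f > 0 means "normal"
            if risk < best_risk:
                best_model, best_risk = model, risk
    return best_model, best_risk
```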

60 The GML algorithm produces a function f = g − t where t is an offset and g is a Gaussian probability density function whose mean and inverse covariance are determined from maximum likelihood estimates formed from the training data T (see e. [sent-377, score-0.166]

61 Once the parameters of g are determined the offset t is chosen to minimize the training risk R(T,T′). [sent-382, score-0.177]
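A rough sketch of this baseline under simplifying assumptions: a single full-covariance Gaussian g is fit by maximum likelihood (the regularization of the inverse covariance via λ mentioned later is omitted here), and the offset t is chosen by scanning candidate thresholds to minimize the training risk, with the µ-expectation replaced by the training sample from µ. Function and variable names are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gml(X_q_train, X_mu_train, rho):
    """Fit g on the Q-training data, then choose the offset t minimizing the
    (approximate) training risk of f = g - t; returns (g, t)."""
    mean = X_q_train.mean(axis=0)
    cov = np.cov(X_q_train, rowvar=False)            # unregularized, MLE-style covariance
    g = multivariate_normal(mean=mean, cov=cov)

    dens_q = g.pdf(X_q_train)
    dens_mu = g.pdf(X_mu_train)

    best_t, best_risk = None, np.inf
    candidates = np.unique(np.concatenate(([0.0], dens_q, dens_mu)))
    for t in candidates:                             # risk is piecewise constant in t
        alarm = np.mean(dens_q <= t)                 # Q-samples flagged anomalous (f <= 0)
        volume = np.mean(dens_mu > t)                # mu-samples predicted normal (f > 0)
        risk = alarm / (1 + rho) + rho * volume / (1 + rho)
        if risk < best_risk:
            best_t, best_risk = t, risk
    return g, best_t                                 # f(x) = g.pdf(x) - best_t
```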

62 The regularization parameter λ is chosen to (approximately) minimize the validation risk by searching a fixed grid of λ values. [sent-383, score-0.197]

63 [Table of experiment settings — number of Q samples: 1000 (train), 500 (validate), 100,000 (test); number of µ samples: 2000 (train), 2000 (validate), 100,000 (test); the table also lists the λ grid (DLD–SVM/GML/MGML) and the σ2 grid (DLD–SVM/1CLASS–SVM).] [sent-388, score-0.196]

64 for all inverse covariance estimates and both λ and K are chosen to (approximately) minimize the validation risk by searching a fixed grid of (λ, K) values. [sent-408, score-0.197]

65 Figures 1(a) and 1(c) plot the empirical risk R(W,W′) versus ρ while Figures 1(b) and 1(d) plot the corresponding performance curves. [sent-427, score-0.142]

66 These results are significant because values of ρ substantially larger than one appear to have little utility here since they yield alarm rates that do not conform to our notion that anomalies are rare events. [sent-434, score-0.224]

67 In addition ρ ≫ 1 appears to have little utility in the general anomaly detection problem since it defines anomalies in regions where the concentration of Q is much larger than the concentration of µ, which is contrary to our premise that anomalies are not concentrated. [sent-435, score-0.902]

68 0.005 to 50 and the results are summarized by the empirical risk curve in Figure 2(a) and the corresponding performance curve in Figure 2(b). [sent-490, score-0.142]

69 The empirical risk values for DLD–SVM and MGML are nearly identical except for ρ = 0. [sent-491, score-0.142]

70 [Table of experiment settings — number of Q samples: 4000 (train), 2000 (validate), 5664 (test); number of µ samples: 10,000 (train), 100,000 (validate), 100,000 (test); the table also lists the λ grid (DLD–SVM/GML/MGML) and the σ2 grid (DLD–SVM/1CLASS–SVM).] [sent-493, score-0.196]

71 Except for this case the empirical risk values for DLD–SVM and MGML are much better than those of 1CLASS–SVM and GML at nearly all values of ρ. [sent-514, score-0.142]

72 The performance curves confirm the superiority of DLD–SVM and MGML, but also reveal differences not easily seen in the empirical risk curves. [sent-515, score-0.169]

73 For example, all four methods produced some solutions with identical performance estimates for different values of ρ which is reflected by the fact that the performance curves show fewer points than the corresponding empirical risk curves. [sent-516, score-0.169]

74 Discussion: A review of the literature on anomaly detection suggests that there are many ways to characterize anomalies (see e. [sent-533, score-0.68]

75 In this work we assumed that anomalies are not concentrated. [sent-536, score-0.158]

76 This assumption can be specified by choosing a reference measure µ which determines a density and a level value ρ. [sent-537, score-0.194]

77 The density then quantifies the degree of concentration and the density level ρ establishes a threshold on the degree that determines anomalies. [sent-538, score-0.294]

78 In practice the user chooses µ and ρ to capture some notion of anomaly that he deems relevant to the application. [sent-540, score-0.275]

79 This paper advances the existing state of “density based” anomaly detection in the following ways. [sent-541, score-0.522]

80 Therefore we accommodate a larger class of anomaly detection problems. [sent-543, score-0.522]

81 Consequently, it has been difficult to compare different methods for anomaly detection in practice. [sent-549, score-0.522]

82 We have demonstrated this approach, which is a rigorous variant of a well-known heuristic for anomaly detection, in the formulation of the DLD-SVM. [sent-554, score-0.548]

83 These advances have created a situation in which much of the knowledge on classification can now be used for anomaly detection. [sent-555, score-0.275]

84 Consequently, we expect substantial advances in anomaly detection in the future. [sent-556, score-0.522]

85 Finally let us consider a different learning scenario in which anomaly detection methods are also commonly employed. [sent-557, score-0.522]

86 Hidden classification problems occur, for example, in network intrusion detection, where it is impractical to obtain labels. [sent-561, score-0.275]

87 This means, for example, that when an anomaly detection method is used to produce a classifier f for an HCP, its anomaly detection performance RP(f) with P := Q s µ and s := 1/(1+ρ) may be very different from its hidden classification performance Rν(f). [sent-573, score-1.044]

88 Another consequence of the above considerations is that the common practice of measuring the performance of anomaly detection algorithms on (hidden) binary classification problems is problematic. [sent-579, score-0.522]

89 Indeed, the obtained classification errors depend on the model error and thus they provide an inadequate description of how well the algorithms solve the anomaly detection problem. [sent-580, score-0.522]

90 In conclusion, although there are clear similarities between the use of the DLD formalism for anomaly detection and its use for the HCP, there is also an important difference. [sent-584, score-0.522]

91 In the first case the specification of µ and ρ determines the definition of anomalies and therefore there is no model error, whereas in the second case the model error is determined by the choice of µ and ρ. [sent-585, score-0.184]

92 2) gives a sufficient condition for the existence of a regular conditional probability: Theorem 17 If Y is a Polish space then a regular conditional probability P( . [sent-614, score-0.178]

93 Therefore we find {|2η − 1| ≤ t} ⊂ {((1−t)/(1+t))ρ ≤ h ≤ ((1+t)/(1−t))ρ}, and hence for all sufficiently small t > 0 we obtain PX(|2η − 1| ≤ t) ≤ PX((1 − tl)ρ ≤ h ≤ (1 + tr)ρ) ≤ PX(|h − ρ| ≤ tr ρ). [sent-644, score-0.182]

94 Applications of probability density estimation to the detection of abnormal conditions in engineering. [sent-704, score-0.35]

95 Using artificial anomalies to detect unknown and known network intrusions. [sent-728, score-0.158]

96 Support vector novelty detection applied to jet engine vibration spectra. [sent-755, score-0.345]

97 The use of novelty detection techniques for monitoring high-integrity plant. [sent-773, score-0.307]

98 Network intrusion and fault detection: a statistical anomaly approach. [sent-786, score-0.303]

99 Measuring mass concentrations and estimating density contour clusters—an excess mass approach. [sent-819, score-0.294]

100 Novelty detection for the identification of masses in mammograms. [sent-882, score-0.247]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('dld', 0.478), ('rp', 0.323), ('anomaly', 0.275), ('detection', 0.247), ('mgml', 0.189), ('gml', 0.164), ('anomalies', 0.158), ('px', 0.158), ('sp', 0.15), ('tsybakov', 0.14), ('covel', 0.138), ('teinwart', 0.138), ('etection', 0.126), ('nomaly', 0.126), ('ramework', 0.126), ('sh', 0.116), ('ush', 0.116), ('risk', 0.114), ('measurable', 0.106), ('density', 0.103), ('anomalous', 0.095), ('svm', 0.092), ('destination', 0.088), ('assertion', 0.085), ('excess', 0.079), ('lassification', 0.079), ('hcp', 0.076), ('steinwart', 0.066), ('fn', 0.066), ('alarm', 0.066), ('sign', 0.064), ('hd', 0.063), ('offset', 0.063), ('novelty', 0.06), ('grid', 0.056), ('mass', 0.056), ('proposition', 0.053), ('polonik', 0.053), ('bytes', 0.05), ('hayton', 0.05), ('markou', 0.05), ('sawitzki', 0.05), ('sq', 0.05), ('tarassenko', 0.05), ('theiler', 0.05), ('regular', 0.049), ('plan', 0.047), ('absolutely', 0.044), ('dp', 0.043), ('king', 0.042), ('mammen', 0.042), ('samples', 0.042), ('universal', 0.041), ('conditional', 0.04), ('exponent', 0.039), ('session', 0.039), ('ft', 0.039), ('dx', 0.038), ('classi', 0.038), ('dpx', 0.038), ('gov', 0.038), ('hartigan', 0.038), ('ingo', 0.038), ('jet', 0.038), ('lanl', 0.038), ('interpreting', 0.037), ('tr', 0.037), ('measure', 0.035), ('unlabeled', 0.035), ('tl', 0.034), ('en', 0.034), ('ep', 0.033), ('concentration', 0.032), ('level', 0.03), ('windows', 0.029), ('ex', 0.029), ('consistency', 0.029), ('empirical', 0.028), ('fan', 0.028), ('intrusion', 0.028), ('parzen', 0.028), ('sessions', 0.028), ('traf', 0.028), ('furthermore', 0.028), ('corollary', 0.027), ('validation', 0.027), ('curves', 0.027), ('ller', 0.026), ('determines', 0.026), ('heuristic', 0.026), ('alamos', 0.025), ('announced', 0.025), ('cuevas', 0.025), ('cybersecurity', 0.025), ('desforges', 0.025), ('gonz', 0.025), ('lez', 0.025), ('manikopoulos', 0.025), ('nairac', 0.025), ('packet', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 3 jmlr-2005-A Classification Framework for Anomaly Detection

Author: Ingo Steinwart, Don Hush, Clint Scovel

Abstract: One way to describe anomalies is by saying that anomalies are not concentrated. This leads to the problem of finding level sets for the data generating density. We interpret this learning problem as a binary classification problem and compare the corresponding classification risk with the standard performance measure for the density level problem. In particular it turns out that the empirical classification risk can serve as an empirical performance measure for the anomaly detection problem. This allows us to compare different anomaly detection algorithms empirically, i.e. with the help of a test set. Furthermore, by the above interpretation we can give a strong justification for the well-known heuristic of artificially sampling “labeled” samples, provided that the sampling plan is well chosen. In particular this enables us to propose a support vector machine (SVM) for anomaly detection for which we can easily establish universal consistency. Finally, we report some experiments which compare our SVM to other commonly used methods including the standard one-class SVM. Keywords: unsupervised learning, anomaly detection, density levels, classification, SVMs

2 0.076587476 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching

Author: Aapo Hyvärinen

Abstract: One often wants to estimate statistical models where the probability density function is known only up to a multiplicative normalization constant. Typically, one then has to resort to Markov Chain Monte Carlo methods, or approximations of the normalization constant. Here, we propose that such models can be estimated by minimizing the expected squared distance between the gradient of the log-density given by the model and the gradient of the log-density of the observed data. While the estimation of the gradient of log-density function is, in principle, a very difficult non-parametric problem, we prove a surprising result that gives a simple formula for this objective function. The density function of the observed data does not appear in this formula, which simplifies to a sample average of a sum of some derivatives of the log-density given by the model. The validity of the method is demonstrated on multivariate Gaussian and independent component analysis models, and by estimating an overcomplete filter set for natural image data. Keywords: statistical estimation, non-normalized densities, pseudo-likelihood, Markov chain Monte Carlo, contrastive divergence

3 0.074405946 23 jmlr-2005-Convergence Theorems for Generalized Alternating Minimization Procedures

Author: Asela Gunawardana, William Byrne

Abstract: The EM algorithm is widely used to develop iterative parameter estimation procedures for statistical models. In cases where these procedures strictly follow the EM formulation, the convergence properties of the estimation procedures are well understood. In some instances there are practical reasons to develop procedures that do not strictly fall within the EM framework. We study EM variants in which the E-step is not performed exactly, either to obtain improved rates of convergence, or due to approximations needed to compute statistics under a model family over which E-steps cannot be realized. Since these variants are not EM procedures, the standard (G)EM convergence results do not apply to them. We present an information geometric framework for describing such algorithms and analyzing their convergence properties. We apply this framework to analyze the convergence properties of incremental EM and variational EM. For incremental EM, we discuss conditions under these algorithms converge in likelihood. For variational EM, we show how the E-step approximation prevents convergence to local maxima in likelihood. Keywords: EM, variational EM, incremental EM, convergence, information geometry

4 0.071198076 72 jmlr-2005-What's Strange About Recent Events (WSARE): An Algorithm for the Early Detection of Disease Outbreaks

Author: Weng-Keen Wong, Andrew Moore, Gregory Cooper, Michael Wagner

Abstract: Traditional biosurveillance algorithms detect disease outbreaks by looking for peaks in a univariate time series of health-care data. Current health-care surveillance data, however, are no longer simply univariate data streams. Instead, a wealth of spatial, temporal, demographic and symptomatic information is available. We present an early disease outbreak detection algorithm called What’s Strange About Recent Events (WSARE), which uses a multivariate approach to improve its timeliness of detection. WSARE employs a rule-based technique that compares recent health-care data against data from a baseline distribution and finds subgroups of the recent data whose proportions have changed the most from the baseline data. In addition, health-care data also pose difficulties for surveillance algorithms because of inherent temporal trends such as seasonal effects and day of week variations. WSARE approaches this problem using a Bayesian network to produce a baseline distribution that accounts for these temporal trends. The algorithm itself incorporates a wide range of ideas, including association rules, Bayesian networks, hypothesis testing and permutation tests to produce a detection algorithm that is careful to evaluate the significance of the alarms that it raises. Keywords: anomaly detection, syndromic surveillance, biosurveillance, Bayesian networks, applications

5 0.066645555 53 jmlr-2005-Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application

Author: Joseph F. Murray, Gordon F. Hughes, Kenneth Kreutz-Delgado

Abstract: We compare machine learning methods applied to a difficult real-world problem: predicting computer hard-drive failure using attributes monitored internally by individual drives. The problem is one of detecting rare events in a time series of noisy and nonparametrically-distributed data. We develop a new algorithm based on the multiple-instance learning framework and the naive Bayesian classifier (mi-NB) which is specifically designed for the low false-alarm case, and is shown to have promising performance. Other methods compared are support vector machines (SVMs), unsupervised clustering, and non-parametric statistical tests (rank-sum and reverse arrangements). The failure-prediction performance of the SVM, rank-sum and mi-NB algorithm is considerably better than the threshold method currently implemented in drives, while maintaining low false alarm rates. Our results suggest that nonparametric statistical tests should be considered for learning problems involving detecting rare events in time series data. An appendix details the calculation of rank-sum significance probabilities in the case of discrete, tied observations, and we give new recommendations about when the exact calculation should be used instead of the commonly-used normal approximation. These normal approximations may be particularly inaccurate for rare event problems like hard drive failures. Keywords: hard drive failure prediction, rank-sum test, support vector machines (SVM), exact nonparametric statistics, multiple instance naive-Bayes

6 0.060046822 41 jmlr-2005-Kernel Methods for Measuring Independence

7 0.05763571 49 jmlr-2005-Learning the Kernel with Hyperkernels     (Kernel Machines Section)

8 0.054264855 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets

9 0.04618784 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification

10 0.045611884 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data

11 0.041701552 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior

12 0.040340614 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables

13 0.039891109 73 jmlr-2005-Working Set Selection Using Second Order Information for Training Support Vector Machines

14 0.036283188 16 jmlr-2005-Asymptotics in Empirical Risk Minimization

15 0.035630517 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines

16 0.034844317 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning

17 0.034315035 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods

18 0.034249853 64 jmlr-2005-Semigroup Kernels on Measures

19 0.034083635 47 jmlr-2005-Learning from Examples as an Inverse Problem

20 0.032683335 9 jmlr-2005-Active Learning to Recognize Multiple Types of Plankton


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.196), (1, 0.08), (2, 0.019), (3, -0.035), (4, 0.058), (5, -0.015), (6, -0.008), (7, -0.37), (8, -0.194), (9, -0.057), (10, -0.189), (11, -0.065), (12, -0.071), (13, -0.194), (14, -0.093), (15, -0.01), (16, -0.07), (17, 0.045), (18, -0.01), (19, -0.08), (20, -0.147), (21, 0.034), (22, -0.113), (23, -0.032), (24, -0.082), (25, -0.009), (26, -0.15), (27, -0.021), (28, 0.069), (29, 0.085), (30, 0.036), (31, 0.032), (32, -0.123), (33, -0.038), (34, 0.001), (35, -0.109), (36, -0.072), (37, 0.029), (38, -0.016), (39, -0.111), (40, 0.091), (41, -0.252), (42, -0.106), (43, -0.034), (44, -0.113), (45, -0.133), (46, -0.107), (47, -0.32), (48, 0.002), (49, -0.288)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95146388 3 jmlr-2005-A Classification Framework for Anomaly Detection

Author: Ingo Steinwart, Don Hush, Clint Scovel

Abstract: One way to describe anomalies is by saying that anomalies are not concentrated. This leads to the problem of finding level sets for the data generating density. We interpret this learning problem as a binary classification problem and compare the corresponding classification risk with the standard performance measure for the density level problem. In particular it turns out that the empirical classification risk can serve as an empirical performance measure for the anomaly detection problem. This allows us to compare different anomaly detection algorithms empirically, i.e. with the help of a test set. Furthermore, by the above interpretation we can give a strong justification for the well-known heuristic of artificially sampling “labeled” samples, provided that the sampling plan is well chosen. In particular this enables us to propose a support vector machine (SVM) for anomaly detection for which we can easily establish universal consistency. Finally, we report some experiments which compare our SVM to other commonly used methods including the standard one-class SVM. Keywords: unsupervised learning, anomaly detection, density levels, classification, SVMs

2 0.28146866 23 jmlr-2005-Convergence Theorems for Generalized Alternating Minimization Procedures

Author: Asela Gunawardana, William Byrne

Abstract: The EM algorithm is widely used to develop iterative parameter estimation procedures for statistical models. In cases where these procedures strictly follow the EM formulation, the convergence properties of the estimation procedures are well understood. In some instances there are practical reasons to develop procedures that do not strictly fall within the EM framework. We study EM variants in which the E-step is not performed exactly, either to obtain improved rates of convergence, or due to approximations needed to compute statistics under a model family over which E-steps cannot be realized. Since these variants are not EM procedures, the standard (G)EM convergence results do not apply to them. We present an information geometric framework for describing such algorithms and analyzing their convergence properties. We apply this framework to analyze the convergence properties of incremental EM and variational EM. For incremental EM, we discuss conditions under these algorithms converge in likelihood. For variational EM, we show how the E-step approximation prevents convergence to local maxima in likelihood. Keywords: EM, variational EM, incremental EM, convergence, information geometry

3 0.26117566 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets

Author: Ivor W. Tsang, James T. Kwok, Pak-Ming Cheung

Abstract: O(m3 ) Standard SVM training has time and O(m2 ) space complexities, where m is the training set size. It is thus computationally infeasible on very large data sets. By observing that practical SVM implementations only approximate the optimal solution by an iterative strategy, we scale up kernel methods by exploiting such “approximateness” in this paper. We first show that many kernel methods can be equivalently formulated as minimum enclosing ball (MEB) problems in computational geometry. Then, by adopting an efficient approximate MEB algorithm, we obtain provably approximately optimal solutions with the idea of core sets. Our proposed Core Vector Machine (CVM) algorithm can be used with nonlinear kernels and has a time complexity that is linear in m and a space complexity that is independent of m. Experiments on large toy and realworld data sets demonstrate that the CVM is as accurate as existing SVM implementations, but is much faster and can handle much larger data sets than existing scale-up methods. For example, CVM with the Gaussian kernel produces superior results on the KDDCUP-99 intrusion detection data, which has about five million training patterns, in only 1.4 seconds on a 3.2GHz Pentium–4 PC. Keywords: kernel methods, approximation algorithm, minimum enclosing ball, core set, scalability

4 0.23949853 49 jmlr-2005-Learning the Kernel with Hyperkernels     (Kernel Machines Section)

Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson

Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming

5 0.2343553 72 jmlr-2005-What's Strange About Recent Events (WSARE): An Algorithm for the Early Detection of Disease Outbreaks

Author: Weng-Keen Wong, Andrew Moore, Gregory Cooper, Michael Wagner

Abstract: Traditional biosurveillance algorithms detect disease outbreaks by looking for peaks in a univariate time series of health-care data. Current health-care surveillance data, however, are no longer simply univariate data streams. Instead, a wealth of spatial, temporal, demographic and symptomatic information is available. We present an early disease outbreak detection algorithm called What’s Strange About Recent Events (WSARE), which uses a multivariate approach to improve its timeliness of detection. WSARE employs a rule-based technique that compares recent health-care data against data from a baseline distribution and finds subgroups of the recent data whose proportions have changed the most from the baseline data. In addition, health-care data also pose difficulties for surveillance algorithms because of inherent temporal trends such as seasonal effects and day of week variations. WSARE approaches this problem using a Bayesian network to produce a baseline distribution that accounts for these temporal trends. The algorithm itself incorporates a wide range of ideas, including association rules, Bayesian networks, hypothesis testing and permutation tests to produce a detection algorithm that is careful to evaluate the significance of the alarms that it raises. Keywords: anomaly detection, syndromic surveillance, biosurveillance, Bayesian networks, applications

6 0.21298641 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data

7 0.21158741 53 jmlr-2005-Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application

8 0.18545221 16 jmlr-2005-Asymptotics in Empirical Risk Minimization

9 0.18526343 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching

10 0.17431366 47 jmlr-2005-Learning from Examples as an Inverse Problem

11 0.16815002 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables

12 0.15641285 57 jmlr-2005-Multiclass Boosting for Weak Classifiers

13 0.15201381 64 jmlr-2005-Semigroup Kernels on Measures

14 0.15070547 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines

15 0.13928394 41 jmlr-2005-Kernel Methods for Measuring Independence

16 0.13799319 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior

17 0.13701035 51 jmlr-2005-Local Propagation in Conditional Gaussian Bayesian Networks

18 0.13294597 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach

19 0.13137582 40 jmlr-2005-Inner Product Spaces for Bayesian Networks

20 0.12965196 9 jmlr-2005-Active Learning to Recognize Multiple Types of Plankton


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.427), (13, 0.021), (17, 0.035), (19, 0.027), (36, 0.038), (37, 0.034), (43, 0.05), (47, 0.023), (52, 0.073), (59, 0.027), (70, 0.035), (88, 0.083), (90, 0.028), (94, 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.63510722 3 jmlr-2005-A Classification Framework for Anomaly Detection

Author: Ingo Steinwart, Don Hush, Clint Scovel

Abstract: One way to describe anomalies is by saying that anomalies are not concentrated. This leads to the problem of finding level sets for the data generating density. We interpret this learning problem as a binary classification problem and compare the corresponding classification risk with the standard performance measure for the density level problem. In particular it turns out that the empirical classification risk can serve as an empirical performance measure for the anomaly detection problem. This allows us to compare different anomaly detection algorithms empirically, i.e. with the help of a test set. Furthermore, by the above interpretation we can give a strong justification for the well-known heuristic of artificially sampling “labeled” samples, provided that the sampling plan is well chosen. In particular this enables us to propose a support vector machine (SVM) for anomaly detection for which we can easily establish universal consistency. Finally, we report some experiments which compare our SVM to other commonly used methods including the standard one-class SVM. Keywords: unsupervised learning, anomaly detection, density levels, classification, SVMs

2 0.30325183 49 jmlr-2005-Learning the Kernel with Hyperkernels     (Kernel Machines Section)

Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson

Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming

3 0.29875922 71 jmlr-2005-Variational Message Passing

Author: John Winn, Christopher M. Bishop

Abstract: Bayesian inference is now widely established as one of the principal foundations for machine learning. In practice, exact inference is rarely possible, and so a variety of approximation techniques have been developed, one of the most widely used being a deterministic framework called variational inference. In this paper we introduce Variational Message Passing (VMP), a general purpose algorithm for applying variational inference to Bayesian Networks. Like belief propagation, VMP proceeds by sending messages between nodes in the network and updating posterior beliefs using local operations at each node. Each such update increases a lower bound on the log evidence (unless already at a local maximum). In contrast to belief propagation, VMP can be applied to a very general class of conjugate-exponential models because it uses a factorised variational approximation. Furthermore, by introducing additional variational parameters, VMP can be applied to models containing non-conjugate distributions. The VMP framework also allows the lower bound to be evaluated, and this can be used both for model comparison and for detection of convergence. Variational message passing has been implemented in the form of a general purpose inference engine called VIBES (‘Variational Inference for BayEsian networkS’) which allows models to be specified graphically and then solved variationally without recourse to coding. Keywords: Bayesian networks, variational inference, message passing

4 0.29599014 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning

Author: Antoine Bordes, Seyda Ertekin, Jason Weston, Léon Bottou

Abstract: Very high dimensional learning systems become theoretically possible when training examples are abundant. The computing cost then becomes the limiting factor. Any efficient learning algorithm should at least take a brief look at each example. But should all examples be given equal attention? This contribution proposes an empirical answer. We first present an online SVM algorithm based on this premise. LASVM yields competitive misclassification rates after a single pass over the training examples, outspeeding state-of-the-art SVM solvers. Then we show how active example selection can yield faster training, higher accuracies, and simpler models, using only a fraction of the training example labels.

5 0.29537264 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching

Author: Aapo Hyvärinen

Abstract: One often wants to estimate statistical models where the probability density function is known only up to a multiplicative normalization constant. Typically, one then has to resort to Markov Chain Monte Carlo methods, or approximations of the normalization constant. Here, we propose that such models can be estimated by minimizing the expected squared distance between the gradient of the log-density given by the model and the gradient of the log-density of the observed data. While the estimation of the gradient of log-density function is, in principle, a very difficult non-parametric problem, we prove a surprising result that gives a simple formula for this objective function. The density function of the observed data does not appear in this formula, which simplifies to a sample average of a sum of some derivatives of the log-density given by the model. The validity of the method is demonstrated on multivariate Gaussian and independent component analysis models, and by estimating an overcomplete filter set for natural image data. Keywords: statistical estimation, non-normalized densities, pseudo-likelihood, Markov chain Monte Carlo, contrastive divergence

6 0.29489136 39 jmlr-2005-Information Bottleneck for Gaussian Variables

7 0.29470089 36 jmlr-2005-Gaussian Processes for Ordinal Regression

8 0.29419604 46 jmlr-2005-Learning a Mahalanobis Metric from Equivalence Constraints

9 0.29414722 64 jmlr-2005-Semigroup Kernels on Measures

10 0.29257408 44 jmlr-2005-Learning Module Networks

11 0.28860435 20 jmlr-2005-Clustering with Bregman Divergences

12 0.28844291 11 jmlr-2005-Algorithmic Stability and Meta-Learning

13 0.28800881 48 jmlr-2005-Learning the Kernel Function via Regularization

14 0.28734013 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods

15 0.28687397 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning

16 0.28680798 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions

17 0.2833904 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels

18 0.28333658 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors

19 0.28008795 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial

20 0.27814439 54 jmlr-2005-Managing Diversity in Regression Ensembles