nips nips2006 nips2006-85 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Alfred O. Hero
Abstract: We introduce a novel adaptive non-parametric anomaly detection approach, called GEM, that is based on the minimal covering properties of K-point entropic graphs when constructed on N training samples from a nominal probability distribution. Such graphs have the property that as N → ∞ their span recovers the entropy minimizing set that supports at least ρ = K/N (100)% of the mass of the Lebesgue part of the distribution. When a test sample falls outside of the entropy minimizing set an anomaly can be declared at a statistical level of significance α = 1 − ρ. A method for implementing this non-parametric anomaly detector is proposed that approximates this minimum entropy set by the influence region of a K-point entropic graph built on the training data. By implementing an incremental leave-one-out k-nearest neighbor graph on resampled subsets of the training data GEM can efficiently detect outliers at a given level of significance and compute their empirical p-values. We illustrate GEM for several simulated and real data sets in high dimensional feature spaces. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Geometric entropy minimization (GEM) for anomaly detection and localization Alfred O. Hero, III, University of Michigan, Ann Arbor, MI 48109-2122, hero@umich.edu [sent-1, score-0.559]
2 Abstract We introduce a novel adaptive non-parametric anomaly detection approach, called GEM, that is based on the minimal covering properties of K-point entropic graphs when constructed on N training samples from a nominal probability distribution. [sent-2, score-1.013]
3 Such graphs have the property that as N → ∞ their span recovers the entropy minimizing set that supports at least ρ = K/N (100)% of the mass of the Lebesgue part of the distribution. [sent-3, score-0.104]
4 When a test sample falls outside of the entropy minimizing set an anomaly can be declared at a statistical level of significance α = 1 − ρ. [sent-4, score-0.689]
5 A method for implementing this non-parametric anomaly detector is proposed that approximates this minimum entropy set by the influence region of a K-point entropic graph built on the training data. [sent-5, score-0.738]
6 By implementing an incremental leave-one-out k-nearest neighbor graph on resampled subsets of the training data GEM can efficiently detect outliers at a given level of significance and compute their empirical p-values. [sent-6, score-0.305]
7 1 Introduction Anomaly detection and localization are important but notoriously difficult problems. [sent-8, score-0.134]
8 In such problems it is crucial to identify a nominal or baseline feature distribution with respect to which statistically significant deviations can be reliably detected. [sent-9, score-0.27]
9 However, in most applications there is seldom enough information to specify the nominal density accurately, especially in high dimensional feature spaces for which the baseline shifts over time. [sent-10, score-0.337]
10 In such cases standard methods that involve estimation of the multivariate feature density from a fixed training sample are inapplicable (high dimension) or unreliable (shifting baseline). [sent-11, score-0.197]
11 In this paper we propose an adaptive non-parametric method that is based on a class of entropic graphs [1] called K-point minimal spanning trees [2] and overcomes the limitations of high dimensional feature spaces and baseline shift. [sent-12, score-0.24]
12 Thus we call this approach to anomaly detection the geometric entropy minimization (GEM) method. [sent-15, score-0.559]
13 Several approaches to anomaly detection have been previously proposed. [sent-16, score-0.512]
14 These methods fall under the statistical nomenclature of the classical slippage problem [3] and have been applied to detecting abrupt changes in dynamical systems, image segmentation, and general fault detection applications [4]. [sent-18, score-0.217]
15 The main drawback of these algorithms is that they rely on a family of parametrically defined nominal (no-fault) distributions. [sent-19, score-0.25]
16 An alternative to parametric methods of anomaly detection is the class of novelty detection algorithms, which includes the GEM approach described herein. [sent-20, score-0.685]
17 Scholkopf and Smola introduced a kernel-based novelty detection scheme that relies on unsupervised support vector machines (SVM) [5]. [sent-21, score-0.173]
18 The single-class minimax probability machine of Lanckriet et al. [6] derives minimax linear decision regions that are robust to unknown anomalous densities. [sent-22, score-0.157]
19 More closely related to our GEM approach is that of Scott and Nowak [7] who derive multiscale approximations of minimum-volume-sets to estimate a particular level set of the unknown nominal multivariate density from training samples. [sent-23, score-0.453]
20 (1) Unlike the MPM method of Lanckriet et al. [6] the GEM anomaly detector is not restricted to linear or even convex decision regions. [sent-26, score-0.54]
21 This translates to higher power for a specified false alarm level. [sent-27, score-0.175]
22 (4) Like the method of Scott and Nowak [7] GEM is completely non-parametric, learning the structure of the nominal distribution without assumptions of linearity, smoothness or continuity of the level set boundaries. [sent-30, score-0.314]
23 (5) Like Scott and Nowak’s method, GEM is provably optimal - indeed uniformly most powerful at a specified level - for the case that the anomaly density is a mixture of the nominal and a uniform density. [sent-31, score-0.853]
24 GEM also adapts to changes in local dimensionality of the support of the nominal density. [sent-34, score-0.25]
25 We introduce an incremental Leave-one-out (L1O) kNNG as a particularly versatile and fast anomaly detector in the GEM class. [sent-35, score-0.545]
26 Despite the similarity in nomenclature, the L1O kNNG is different from k nearest neighbor (kNN) anomaly detection of [9]. [sent-36, score-0.537]
27 The kNN anomaly detector is based on thresholding the distance from the test point to the k-th nearest neighbor. [sent-37, score-0.566]
28 The L1O kNNG detector computes the change in the topology of the entire kNN graph due to the addition of a test sample and does not use a decision threshold. [sent-38, score-0.271]
29 Furthermore, the parent GEM anomaly detection methodology has proven theoretical properties, e.g., asymptotic optimality against uniform mixture alternatives. [sent-39, score-0.512]
30 We introduce the statistical framework for anomaly detection in the next section. [sent-42, score-0.512]
31 Given a new sample X the objective is to declare X to be a ”nominal” sample consistent with Xn or an ”anomalous” sample that is significantly different from Xn . [sent-50, score-0.22]
32 Assume that Xn is an independent and identically distributed (i.i.d.) sample from a multivariate density f0 (x) supported on the unit d-dimensional cube [0, 1]d . [sent-56, score-0.153]
33 Anomaly detection can be formulated as testing the hypotheses H0 : f = f0 versus H1 : f ≠ f0 at a prescribed level α of significance, P (declare H1 |H0 ) ≤ α. [sent-58, score-0.298]
34 The minimum-volume-set of level α is defined as a set Ωα in [0, 1]d which minimizes the volume |Ωα | = ∫Ωα dx subject to the constraint ∫Ωα f0 (x) dx ≥ 1 − α. [sent-59, score-0.111]
35 The minimum-entropy-set of level α is defined as a set Λα in [0, 1]d which minimizes the Rényi entropy Hν (Λα ) = (1 − ν)−1 ln ∫Λα f ν (x) dx subject to the constraint ∫Λα f0 (x) dx ≥ 1 − α. [sent-60, score-0.129]
36 The test ”decide anomaly if X ∉ Ωα ” is equivalent to implementing the test function φ(x) = 1 if x ∉ Ωα and φ(x) = 0 otherwise. [sent-63, score-0.509]
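For readability, the minimum-volume-set test described in the last few sentences can be collected into a single display; this is only a restatement of the definitions above, with no new quantities introduced:

```latex
\Omega_\alpha = \arg\min_{A \subseteq [0,1]^d}\Big\{\,|A| \;:\; \int_A f_0(x)\,dx \ge 1-\alpha \Big\},
\qquad
\phi(x) = \mathbf{1}\{x \notin \Omega_\alpha\},
\qquad
P(\text{declare } H_1 \mid H_0) = \int_{\Omega_\alpha^{c}} f_0(x)\,dx \;\le\; \alpha .
```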
37 This test has a strong optimality property: when f0 is Lebesgue continuous it is a uniformly most powerful (UMP) test of level α for testing anomalies that follow a uniform mixture distribution. [sent-66, score-0.447]
38 Specifically, let X have density f (x) = (1 − ε)f0 (x) + εU (x) where U (x) is the uniform density over [0, 1]d and ε ∈ [0, 1]. [sent-67, score-0.157]
39 Consider testing the hypotheses H0 : ε = 0 (1) versus H1 : ε > 0 (2). Proposition 1 Assume that under H0 the random vector X has a Lebesgue continuous density f0 and that Z = f0 (X) is also a continuous random variable. [sent-68, score-0.167]
40 Then the level-set test of level α is uniformly most powerful for testing (2). [sent-69, score-0.17]
41 Second, when only a training sample from f0 is available and f0 is unknown the level sets have to be learned from the training data. [sent-75, score-0.21]
42 These methods can be divided into two main approaches: (1) density estimation followed by plug-in estimation of Ωα via variational methods; and (2) direct estimation of the level set using function approximation and non-parametric estimation. [sent-77, score-0.131]
43 3 GEM and entropic graphs GEM is a method that directly estimates the critical region for detecting anomalies using minimum coverings of subsets of points in a nominal training sample. [sent-84, score-0.701]
44 These coverings are obtained by constructing minimal graphs, e.g., K-point minimal spanning trees (K-MST) or K-point k-nearest neighbor graphs (kNNG), spanning K of the n training points. [sent-85, score-0.102]
45 Points not covered by these K-point minimal graphs are identified as tail events and allow one to adaptively set a p-value for the detector. [sent-88, score-0.248]
46 Among all of the MST’s spanning these sets, the K-MST is defined as the one having minimal length min over Xn,K ⊂ Xn of LMST(Xn,K). [sent-95, score-0.124]
47 The K-MST thus specifies the minimal subset of K points in addition to specifying the minimum length. [sent-96, score-0.109]
48 This subset of points, which we call a minimal graph covering of Xn of size K, can be viewed as capturing the densest region of Xn . [sent-97, score-0.114]
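As a sketch of the graph quantities just used, writing LMST(X) for the length of a minimal spanning tree of a point set X, the K-MST criterion can be displayed as below. This shows the unweighted Euclidean case; the entropic-graph length of [2] may additionally raise each edge length to a power, so treat this as a schematic rather than the paper's exact functional:

```latex
L_{\mathrm{MST}}(X) \;=\; \min_{T\ \text{spanning}\ X}\ \sum_{e \in T} |e| ,
\qquad
L_{K\text{-}\mathrm{MST}}(X_n) \;=\; \min_{X_{n,K} \subset X_n,\ |X_{n,K}| = K}\ L_{\mathrm{MST}}(X_{n,K}),
```

where the first minimum is over all spanning trees of X, |e| is the Euclidean length of edge e, and the minimizing subset X*n,K is the K-point minimal graph covering referred to above.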
49 If Xn is an i.i.d. sample from a multivariate density f (x) and if limK,n→∞ K/n = ρ and a greedy version of the K-MST is implemented, this set converges a.s. to the minimum entropy set containing at least a fraction ρ of the mass of f. [sent-101, score-0.208]
50 As the minimum entropy set and the minimum volume set are identical, this suggests the following minimum-volume-set anomaly detection algorithm, which we call the ”K-MST anomaly detector. [sent-106, score-1.023]
51 ” K-MST anomaly detection algorithm. [1] Process training sample: Given a level of significance α and a training sample Xn = {X1 , . . . , Xn }, run the K-MST with K = n(1 − α) and store the minimal set of points X∗n,K. [sent-107, score-0.722]
52 [2] Process test sample: Given a test sample X, run the K-MST on the merged training-test sample Xn+1 = Xn ∪ {X} and store the minimal set of points X∗n+1,K . [sent-111, score-0.293]
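As an illustration of the two-step detector, here is a minimal Python sketch. It is not the paper's greedy K-MST construction: it substitutes a much simpler heuristic (keep the K points with the smallest total distance to their k nearest neighbors) for the minimal K-point covering, takes K = (1 − α)(n + 1) on the merged sample, and declares an anomaly when the test point is not captured by the covering. All function names and parameter choices are illustrative assumptions, not the paper's.

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_dist_sums(points, k=5):
    """Sum of distances from each point to its k nearest neighbors (self excluded)."""
    d = cdist(points, points)
    np.fill_diagonal(d, np.inf)
    return np.sort(d, axis=1)[:, :k].sum(axis=1)

def k_point_cover(points, K, k=5):
    """Heuristic K-point covering: indices of the K points with the smallest kNN distance sums."""
    return set(np.argsort(knn_dist_sums(points, k))[:K])

def gem_detect(train, x_test, alpha=0.1, k=5):
    """Merge the test point with the training sample, build the K-point covering with
    K = (1 - alpha)(n + 1), and declare an anomaly if the test point is not captured."""
    merged = np.vstack([train, x_test])
    K = int(np.floor((1.0 - alpha) * merged.shape[0]))
    return (merged.shape[0] - 1) not in k_point_cover(merged, K, k)

# Toy usage: nominal bivariate Gaussian training sample, two candidate test points.
rng = np.random.default_rng(0)
train = rng.normal(size=(200, 2))
print(gem_detect(train, np.array([[4.0, 4.0]]), alpha=0.1))   # far from the mass: likely True
print(gem_detect(train, np.array([[0.1, -0.2]]), alpha=0.1))  # central point: likely False
```

The same skeleton applies if the heuristic covering is replaced by a true greedy K-MST or K-point kNNG routine.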
53 When the density f0 generating the training sample is Lebesgue continuous, it follows from [2, Theorem 2] that as K, n → ∞ the K-MST anomaly detector has false alarm probability that converges to α = 1 − K/n and power that converges to that of the minimum-volume-set test of level α. [sent-116, score-0.985]
54 When the density f0 is not Lebesgue continuous some optimality properties of the K-MST anomaly detector still hold. [sent-117, score-0.603]
55 Let this nominal density have the decomposition f0 = λ0 + δ0 , where λ0 is Lebesgue continuous and δ0 is singular. [sent-118, score-0.341]
56 Then, according to [2, Theorem 2], the K-MST anomaly detector will have false alarm probability that converges to (1 − ψ)α, where ψ is the mass of the singular component of f0 , and it is a uniformly most powerful test for anomalies in the Lebesgue continuous component λ0 of f0. [sent-119, score-0.933]
57 We have the following result that justifies the use of such an approximation as an anomaly detector of level α = 1 − ρ, where ρ = K/n: Proposition 2 Let X∗n,K be the set of points in Xn that results from any approximation to the K-point kNNG that satisfies the property of [2, Defn. [sent-150, score-0.576]
58 Figures 1-2 illustrate the use of the K-point kNNG as an anomaly detection algorithm. [sent-157, score-0.512]
59 (Figure 1 panel titles: ρ=0.9, K=180; Bivariate Gaussian mixture density.) Figure 1: Left: level sets of the nominal bivariate mixture density used to illustrate the K-point kNNG anomaly detection algorithms. [sent-159, score-1.124]
60 Right: K-point kNNG over N=200 random training samples drawn from the nominal bivariate mixture at left. [sent-160, score-0.436]
61 (Figure 2 panel titles: ρ=0.9, K=180.) Figure 2: Left: The test point ’*’ is declared anomalous at level α = 0.1 as it is not captured by the K-point kNNG (K=180) constructed over the combined test sample and the training samples drawn from the nominal bivariate mixture shown in Fig. 1. [sent-165, score-0.268] [sent-166, score-0.544]
63 Right: A different test point ’*’ is declared non-anomalous as it is captured by this K-point kNNG. [sent-168, score-0.142]
64 3 Leave-one-out kNNG (L1O-kNNG) The theoretical equivalence between the K-point kNNG and the level set anomaly detector motivates a low complexity anomaly detection scheme, which we call the leave-one-out kNNG, discussed in this section and adopted for the experiments below. [sent-170, score-1.067]
65 As before, assume a single test sample X = Xn+1 and a training sample Xn . [sent-171, score-0.21]
66 To determine the kNNG over the combined sample Xn+1 = Xn ∪ {Xn+1 } one can execute the following algorithm: L1O kNNG anomaly detection algorithm 1. [sent-173, score-0.57]
67 Declare the test sample Xn+1 an anomaly if Xn+1 = Xo . [sent-187, score-0.486]
68 This algorithm will detect anomalies with a false alarm level of approximately 1/(n+1). [sent-188, score-0.372]
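The 1/(n + 1) figure follows from a short exchangeability argument, added here for completeness (it assumes f0 is continuous so that ties occur with probability zero):

```latex
P\big(X_{n+1} = X_o \mid H_0\big) \;=\; \frac{1}{n+1},
```

since under H0 the n + 1 points are i.i.d. from f0 and the most-outlying-point statistic is a symmetric function of them, so each point is equally likely to be Xo.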
69 Thus larger sizes n of the training sample will correspond to more stringent false alarm constraints. [sent-189, score-0.249]
70 In particular, let n vary from k to n and define n∗ as the minimum value of n for which Xi is declared an anomaly. [sent-191, score-0.124]
71 η(Xn+1 ) = ∆n+1 LkNN / maxi ∆i LkNN. (3) The coefficient η(Xn+1 ) = 1 when the test point Xn+1 is declared an anomaly. [sent-194, score-0.142]
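A minimal sketch of the leave-one-out computation is given below. It assumes the kNNG length is the sum, over all points, of the distances to their k nearest neighbors, and that ∆i LkNN is the drop in that length when point i is deleted; the paper's exact length functional, its incremental (rather than brute-force) update, the p-value computation, and the resampling of the training set are not reproduced here. Function names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist

def knng_length(points, k=5):
    """Total kNNG length: sum over all points of the distances to their k nearest neighbors."""
    d = cdist(points, points)
    np.fill_diagonal(d, np.inf)
    return np.sort(d, axis=1)[:, :k].sum()

def l1o_influence(points, k=5):
    """Delta_i L_kNN for each point i: change in kNNG length when point i is left out."""
    full = knng_length(points, k)
    return np.array([full - knng_length(np.delete(points, i, axis=0), k)
                     for i in range(points.shape[0])])

def l1o_knng_detect(train, x_test, k=5):
    """Declare x_test an anomaly iff it is the most outlying point of the merged sample,
    and return the relative influence coefficient eta of equation (3)."""
    merged = np.vstack([train, x_test])
    delta = l1o_influence(merged, k)
    eta = delta[-1] / delta.max()
    return eta >= 1.0, eta

rng = np.random.default_rng(1)
train = rng.normal(size=(300, 2))
print(l1o_knng_detect(train, np.array([[5.0, 5.0]]), k=5))   # outlier: expect (True, 1.0)
print(l1o_knng_detect(train, np.array([[0.0, 0.1]]), k=5))   # central: expect (False, eta < 1)
```

A brute-force recomputation like this costs one full kNN pass per deletion; the incremental update of the L1O kNNG is what keeps the method fast in practice.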
72 For example, the experiments below have shown that the above algorithm can find and determine the p-value of 10 outliers among 1000 test samples in a few seconds on a Dell 2GHz processor running Matlab 7. [sent-196, score-0.108]
73 We show a few representative experiments for simple Gaussian and Gaussian mixture nominal densities f0 . [sent-199, score-0.298]
74 (Figure 3 panel titles: iteration 574, p-value 0.001; iteration 791.) [sent-213, score-0.19] [sent-214, score-0.19]
76 The boxes on the peaks of the curve correspond to positions of detected anomalies and the heights of the boxes are equal to one minus the computed p-value. [sent-220, score-0.263]
77 The parameter ρ is equal to 1 − α, where α is the user defined false alarm rate. [sent-223, score-0.147]
78 Right: the resampled nominal distribution (”•”) and anomalous points detected (”*”) at the iterations indicated at left. [sent-224, score-0.421]
79 First we illustrate the L1O kNNG algorithm for detection of non-uniformly distributed anomalies from training samples following a bivariate Gaussian nominal density. [sent-225, score-0.683]
80 The test sample consisted of a mixture of this nominal and a zero mean 2D Gaussian with correlation coefficient 0. [sent-228, score-0.447]
81 In Fig. 3 the results of a simulation with a training sample of 2000 samples and 1000 test samples are shown. [sent-232, score-0.154]
82 Fig. 3 (left) is a plot of the relative influence curve (3) over the test samples as compared to the most outlying point in the (resampled) training sample. [sent-234, score-0.191]
83 When the relative influence curve is equal to 1 the corresponding test sample is the most outlying point and is declared an anomaly. [sent-235, score-0.271]
84 The false alarm level was set to 0.001, and therefore one would expect an average of only one false alarm at this level of significance. [sent-238, score-0.211]
85 In Fig. 3 the detected anomalies (asterisks) are shown along with the training sample (dots) used to grow the L1O kNNG for that particular iteration - note that to protect against bias the training sample is resampled at each iteration. [sent-240, score-0.489]
86 Next we compare the performance of the L1O kNNG detector to that of the UMP test for the hypotheses (2). [sent-241, score-0.182]
87 For this simple case the minimum volume set of level α is a disk centered at the origin with radius √(2σ 2 ln 1/α), and the power of the UMP test can be computed in closed form: β = (1 − ε)α + ε(1 − 2πσ 2 ln 1/α). [sent-245, score-0.182]
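The closed form quoted above follows in one line once the level and the disk area are substituted; this short derivation is added for completeness and assumes the disk Ωα lies inside the support of the uniform component:

```latex
\beta(\varepsilon)
= P\big(X \notin \Omega_\alpha\big)
= (1-\varepsilon)\,P_{f_0}\!\big(X \notin \Omega_\alpha\big)
  + \varepsilon\,P_U\!\big(X \notin \Omega_\alpha\big)
= (1-\varepsilon)\,\alpha + \varepsilon\big(1 - \pi r^{2}\big),
\qquad r^{2} = 2\sigma^{2}\ln(1/\alpha).
```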
88 We implemented the GEM anomaly detector with the incremental leave-one-out kNNG using k = 5. [sent-246, score-0.521]
89 The training set consisted of 1000 samples from f0 and the test set consisted of 1000 samples from the mixture of a uniform density and f0 with parameter ranging from 0 to 0. [sent-247, score-0.33]
90 Figure 4: ROC curves for the leave-one-out kNNG anomaly detector described in Sec. [sent-275, score-0.51]
91 The labeled ”clairvoyant” curve is the ROC of the UMP anomaly detector. [sent-278, score-0.403]
92 The test sample is a mixture of this 2D Gaussian and a 2D uniform-[0, 1]2 density. [sent-280, score-0.156]
93 5 Conclusions A new and versatile anomaly detection method has been introduced that uses geometric entropy minimization (GEM) to extract minimal set coverings that can be used to detect anomalies from a set of training samples. [sent-282, score-0.89]
94 This method can be implemented through the K-point minimal spanning tree (MST) or the K-point k-nearest neighbor graph (kNNG). [sent-283, score-0.157]
95 We illustrated the L1O kNNG method on simulated data containing anomalies and showed that it comes close to achieving the optimal performance of the UMP detector for testing the nominal against a uniform mixture with unknown mixing parameter. [sent-285, score-0.628]
96 As the L1O kNNG computes p-values on detected anomalies it can be easily extended to account for false discovery rate constraints. [sent-286, score-0.253]
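One standard way to act on such a false discovery rate constraint is the Benjamini-Hochberg step-up procedure applied to the empirical p-values; the paper does not specify which FDR procedure it has in mind, so the sketch below is only an example of how the detected p-values could be post-processed.

```python
import numpy as np

def benjamini_hochberg(pvals, fdr=0.05):
    """Return a boolean mask of discoveries controlling the FDR at the given level."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    thresh = fdr * (np.arange(1, m + 1) / m)
    passed = p[order] <= thresh
    keep = np.zeros(m, dtype=bool)
    if passed.any():
        kmax = np.max(np.nonzero(passed)[0])   # largest index meeting the step-up criterion
        keep[order[:kmax + 1]] = True
    return keep

print(benjamini_hochberg([0.001, 0.04, 0.2, 0.8], fdr=0.05))  # e.g. [ True False False False]
```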
97 By using a sliding window, the methodology derived in this paper is easily extendible to on-line applications and has been applied to non-parametric intruder detection using our Crossbow sensor network testbed (reported elsewhere). [sent-287, score-0.134]
98 Gorman, “Applications of entropic spanning graphs,” IEEE Signal Processing Magazine, vol. [sent-293, score-0.105]
99 Jordan, “Robust novelty detection with single-class mpm,” in Advances in Neural Information Processing Systems (NIPS), vol. [sent-333, score-0.173]
100 Kumar, “A comparative study of anomaly detection schemes in network intrusion detection,” in SIAM Conference on data mining, 2003. [sent-345, score-0.512]
wordName wordTfidf (topN-words)
[('knng', 0.616), ('anomaly', 0.378), ('gem', 0.308), ('nominal', 0.25), ('xn', 0.184), ('anomalies', 0.161), ('pvalue', 0.154), ('lkn', 0.139), ('detection', 0.134), ('detector', 0.113), ('mst', 0.107), ('knn', 0.098), ('alarm', 0.094), ('declared', 0.092), ('ump', 0.092), ('lebesgue', 0.082), ('bivariate', 0.068), ('density', 0.067), ('level', 0.064), ('anomalous', 0.062), ('entropic', 0.061), ('sample', 0.058), ('minimal', 0.056), ('false', 0.053), ('test', 0.05), ('resampled', 0.049), ('nowak', 0.049), ('hero', 0.049), ('mixture', 0.048), ('entropy', 0.047), ('coverings', 0.046), ('declare', 0.046), ('outlying', 0.046), ('spanning', 0.044), ('training', 0.044), ('ei', 0.043), ('clairvoyant', 0.04), ('ravi', 0.04), ('novelty', 0.039), ('detected', 0.039), ('xi', 0.038), ('graphs', 0.038), ('greedy', 0.037), ('cance', 0.037), ('iteration', 0.036), ('scott', 0.036), ('testing', 0.033), ('minimum', 0.032), ('graph', 0.032), ('outliers', 0.032), ('implementing', 0.031), ('etal', 0.031), ('nomenclature', 0.031), ('lm', 0.031), ('incremental', 0.03), ('lanckriet', 0.028), ('power', 0.028), ('multivariate', 0.028), ('proposition', 0.027), ('roc', 0.027), ('logn', 0.027), ('xo', 0.027), ('michel', 0.027), ('mpm', 0.027), ('limn', 0.027), ('abrupt', 0.027), ('samples', 0.026), ('covering', 0.026), ('curve', 0.025), ('coef', 0.025), ('dx', 0.025), ('detecting', 0.025), ('nearest', 0.025), ('fo', 0.024), ('scholkopf', 0.024), ('versatile', 0.024), ('length', 0.024), ('continuous', 0.024), ('edges', 0.024), ('powerful', 0.023), ('uniform', 0.023), ('minimax', 0.023), ('consisted', 0.023), ('subsets', 0.023), ('volume', 0.022), ('overcomes', 0.021), ('points', 0.021), ('decide', 0.021), ('optimality', 0.021), ('uence', 0.02), ('baseline', 0.02), ('curves', 0.019), ('hypotheses', 0.019), ('boxes', 0.019), ('mass', 0.019), ('decision', 0.018), ('converges', 0.018), ('correlation', 0.018), ('ln', 0.018), ('gaussian', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 85 nips-2006-Geometric entropy minimization (GEM) for anomaly detection and localization
Author: Alfred O. Hero
Abstract: We introduce a novel adaptive non-parametric anomaly detection approach, called GEM, that is based on the minimal covering properties of K-point entropic graphs when constructed on N training samples from a nominal probability distribution. Such graphs have the property that as N → ∞ their span recovers the entropy minimizing set that supports at least ρ = K/N (100)% of the mass of the Lebesgue part of the distribution. When a test sample falls outside of the entropy minimizing set an anomaly can be declared at a statistical level of significance α = 1 − ρ. A method for implementing this non-parametric anomaly detector is proposed that approximates this minimum entropy set by the influence region of a K-point entropic graph built on the training data. By implementing an incremental leave-one-out k-nearest neighbor graph on resampled subsets of the training data GEM can efficiently detect outliers at a given level of significance and compute their empirical p-values. We illustrate GEM for several simulated and real data sets in high dimensional feature spaces. 1
2 0.22406155 96 nips-2006-In-Network PCA and Anomaly Detection
Author: Ling Huang, Xuanlong Nguyen, Minos Garofalakis, Michael I. Jordan, Anthony Joseph, Nina Taft
Abstract: We consider the problem of network anomaly detection in large distributed systems. In this setting, Principal Component Analysis (PCA) has been proposed as a method for discovering anomalies by continuously tracking the projection of the data onto a residual subspace. This method was shown to work well empirically in highly aggregated networks, that is, those with a limited number of large nodes and at coarse time scales. This approach, however, has scalability limitations. To overcome these limitations, we develop a PCA-based anomaly detector in which adaptive local data filters send to a coordinator just enough data to enable accurate global detection. Our method is based on a stochastic matrix perturbation analysis that characterizes the tradeoff between the accuracy of anomaly detection and the amount of data communicated over the network.
3 0.13313752 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
Author: Dong S. Cheng, Vittorio Murino, Mário Figueiredo
Abstract: This paper proposes a new approach to model-based clustering under prior knowledge. The proposed formulation can be interpreted from two different angles: as penalized logistic regression, where the class labels are only indirectly observed (via the probability density of each class); as finite mixture learning under a grouping prior. To estimate the parameters of the proposed model, we derive a (generalized) EM algorithm with a closed-form E-step, in contrast with other recent approaches to semi-supervised probabilistic clustering which require Gibbs sampling or suboptimal shortcuts. We show that our approach is ideally suited for image segmentation: it avoids the combinatorial nature Markov random field priors, and opens the door to more sophisticated spatial priors (e.g., wavelet-based) in a simple and computationally efficient way. Finally, we extend our formulation to work in unsupervised, semi-supervised, or discriminative modes. 1
4 0.10456688 191 nips-2006-The Robustness-Performance Tradeoff in Markov Decision Processes
Author: Huan Xu, Shie Mannor
Abstract: Computation of a satisfactory control policy for a Markov decision process when the parameters of the model are not exactly known is a problem encountered in many practical applications. The traditional robust approach is based on a worstcase analysis and may lead to an overly conservative policy. In this paper we consider the tradeoff between nominal performance and the worst case performance over all possible models. Based on parametric linear programming, we propose a method that computes the whole set of Pareto efficient policies in the performancerobustness plane when only the reward parameters are subject to uncertainty. In the more general case when the transition probabilities are also subject to error, we show that the strategy with the “optimal” tradeoff might be non-Markovian and hence is in general not tractable. 1
5 0.075322807 57 nips-2006-Conditional mean field
Author: Peter Carbonetto, Nando D. Freitas
Abstract: Despite all the attention paid to variational methods based on sum-product message passing (loopy belief propagation, tree-reweighted sum-product), these methods are still bound to inference on a small set of probabilistic models. Mean field approximations have been applied to a broader set of problems, but the solutions are often poor. We propose a new class of conditionally-specified variational approximations based on mean field theory. While not usable on their own, combined with sequential Monte Carlo they produce guaranteed improvements over conventional mean field. Moreover, experiments on a well-studied problem— inferring the stable configurations of the Ising spin glass—show that the solutions can be significantly better than those obtained using sum-product-based methods. 1
6 0.064536825 105 nips-2006-Large Margin Component Analysis
7 0.056983896 50 nips-2006-Chained Boosting
8 0.056540847 55 nips-2006-Computation of Similarity Measures for Sequential Data using Generalized Suffix Trees
9 0.053893983 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure
10 0.053030066 66 nips-2006-Detecting Humans via Their Pose
11 0.051893786 173 nips-2006-Shifting, One-Inclusion Mistake Bounds and Tight Multiclass Expected Risk Bounds
12 0.046625059 154 nips-2006-Optimal Change-Detection and Spiking Neurons
13 0.045952111 20 nips-2006-Active learning for misspecified generalized linear models
14 0.040565029 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
15 0.039746907 131 nips-2006-Mixture Regression for Covariate Shift
16 0.036253046 35 nips-2006-Approximate inference using planar graph decomposition
17 0.03490052 151 nips-2006-On the Relation Between Low Density Separation, Spectral Clustering and Graph Cuts
18 0.034677584 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions
19 0.034431376 67 nips-2006-Differential Entropic Clustering of Multivariate Gaussians
20 0.034119517 175 nips-2006-Simplifying Mixture Models through Function Approximation
topicId topicWeight
[(0, -0.133), (1, 0.032), (2, -0.0), (3, -0.011), (4, 0.022), (5, 0.015), (6, 0.004), (7, 0.002), (8, 0.043), (9, 0.026), (10, 0.044), (11, -0.01), (12, 0.068), (13, 0.078), (14, 0.036), (15, -0.016), (16, 0.008), (17, -0.013), (18, 0.038), (19, 0.022), (20, 0.06), (21, 0.054), (22, 0.105), (23, -0.018), (24, 0.087), (25, -0.286), (26, -0.194), (27, 0.294), (28, -0.218), (29, 0.014), (30, -0.305), (31, 0.044), (32, 0.004), (33, -0.029), (34, 0.07), (35, 0.09), (36, 0.132), (37, -0.022), (38, 0.044), (39, 0.08), (40, 0.049), (41, -0.073), (42, 0.008), (43, 0.044), (44, 0.009), (45, 0.159), (46, -0.068), (47, -0.044), (48, 0.016), (49, -0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.93834829 85 nips-2006-Geometric entropy minimization (GEM) for anomaly detection and localization
Author: Alfred O. Hero
Abstract: We introduce a novel adaptive non-parametric anomaly detection approach, called GEM, that is based on the minimal covering properties of K-point entropic graphs when constructed on N training samples from a nominal probability distribution. Such graphs have the property that as N → ∞ their span recovers the entropy minimizing set that supports at least ρ = K/N (100)% of the mass of the Lebesgue part of the distribution. When a test sample falls outside of the entropy minimizing set an anomaly can be declared at a statistical level of significance α = 1 − ρ. A method for implementing this non-parametric anomaly detector is proposed that approximates this minimum entropy set by the influence region of a K-point entropic graph built on the training data. By implementing an incremental leave-one-out k-nearest neighbor graph on resampled subsets of the training data GEM can efficiently detect outliers at a given level of significance and compute their empirical p-values. We illustrate GEM for several simulated and real data sets in high dimensional feature spaces. 1
2 0.7946167 96 nips-2006-In-Network PCA and Anomaly Detection
Author: Ling Huang, Xuanlong Nguyen, Minos Garofalakis, Michael I. Jordan, Anthony Joseph, Nina Taft
Abstract: We consider the problem of network anomaly detection in large distributed systems. In this setting, Principal Component Analysis (PCA) has been proposed as a method for discovering anomalies by continuously tracking the projection of the data onto a residual subspace. This method was shown to work well empirically in highly aggregated networks, that is, those with a limited number of large nodes and at coarse time scales. This approach, however, has scalability limitations. To overcome these limitations, we develop a PCA-based anomaly detector in which adaptive local data filters send to a coordinator just enough data to enable accurate global detection. Our method is based on a stochastic matrix perturbation analysis that characterizes the tradeoff between the accuracy of anomaly detection and the amount of data communicated over the network.
3 0.3307128 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
Author: Dong S. Cheng, Vittorio Murino, Mário Figueiredo
Abstract: This paper proposes a new approach to model-based clustering under prior knowledge. The proposed formulation can be interpreted from two different angles: as penalized logistic regression, where the class labels are only indirectly observed (via the probability density of each class); as finite mixture learning under a grouping prior. To estimate the parameters of the proposed model, we derive a (generalized) EM algorithm with a closed-form E-step, in contrast with other recent approaches to semi-supervised probabilistic clustering which require Gibbs sampling or suboptimal shortcuts. We show that our approach is ideally suited for image segmentation: it avoids the combinatorial nature Markov random field priors, and opens the door to more sophisticated spatial priors (e.g., wavelet-based) in a simple and computationally efficient way. Finally, we extend our formulation to work in unsupervised, semi-supervised, or discriminative modes. 1
4 0.31493658 50 nips-2006-Chained Boosting
Author: Christian R. Shelton, Wesley Huie, Kin F. Kan
Abstract: We describe a method to learn to make sequential stopping decisions, such as those made along a processing pipeline. We envision a scenario in which a series of decisions must be made as to whether to continue to process. Further processing costs time and resources, but may add value. Our goal is to create, based on historic data, a series of decision rules (one at each stage in the pipeline) that decide, based on information gathered up to that point, whether to continue processing the part. We demonstrate how our framework encompasses problems from manufacturing to vision processing. We derive a quadratic (in the number of decisions) bound on testing performance and provide empirical results on object detection. 1
5 0.29387882 174 nips-2006-Similarity by Composition
Author: Oren Boiman, Michal Irani
Abstract: We propose a new approach for measuring similarity between two signals, which is applicable to many machine learning tasks, and to many signal types. We say that a signal S1 is “similar” to a signal S2 if it is “easy” to compose S1 from few large contiguous chunks of S2 . Obviously, if we use small enough pieces, then any signal can be composed of any other. Therefore, the larger those pieces are, the more similar S1 is to S2 . This induces a local similarity score at every point in the signal, based on the size of its supported surrounding region. These local scores can in turn be accumulated in a principled information-theoretic way into a global similarity score of the entire S1 to S2 . “Similarity by Composition” can be applied between pairs of signals, between groups of signals, and also between different portions of the same signal. It can therefore be employed in a wide variety of machine learning problems (clustering, classification, retrieval, segmentation, attention, saliency, labelling, etc.), and can be applied to a wide range of signal types (images, video, audio, biological data, etc.) We show a few such examples. 1
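As a rough illustration of the composition idea in this abstract, the toy Python sketch below scores a discrete 1-D sequence S1 against S2 by the size of the largest chunk covering each position that also occurs in S2, then accumulates the chunk sizes into a global score. It is a naive stand-in under simplifying assumptions, not the authors' probabilistic, information-theoretic formulation, and the brute-force search is only intended for short sequences.

# Toy illustration of the composition idea for discrete 1-D sequences:
# at each position of S1, find the longest contiguous chunk containing
# that position that also occurs verbatim in S2, and accumulate the chunk
# sizes into a global score. This is a naive stand-in for the paper's
# information-theoretic formulation, not the actual method.
import math

def longest_chunk_at(s1, i, s2):
    """Length of the longest substring of s1 covering position i that occurs in s2."""
    best = 1 if s1[i] in s2 else 0
    for start in range(i + 1):
        for end in range(i + 1, len(s1) + 1):
            if end - start > best and s1[start:end] in s2:
                best = end - start
    return best

def composition_score(s1, s2):
    local = [longest_chunk_at(s1, i, s2) for i in range(len(s1))]
    # Larger supporting chunks contribute more to the global similarity.
    return local, sum(math.log(1 + size) for size in local)

local, global_score = composition_score("abcxabcd", "zzabcdzz")
print(local, round(global_score, 3))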
6 0.27659777 191 nips-2006-The Robustness-Performance Tradeoff in Markov Decision Processes
7 0.27569613 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure
8 0.25293273 105 nips-2006-Large Margin Component Analysis
9 0.24870455 155 nips-2006-Optimal Single-Class Classification Strategies
10 0.24067886 57 nips-2006-Conditional mean field
11 0.23942994 55 nips-2006-Computation of Similarity Measures for Sequential Data using Generalized Suffix Trees
12 0.23569036 5 nips-2006-A Kernel Method for the Two-Sample-Problem
13 0.22014305 73 nips-2006-Efficient Methods for Privacy Preserving Face Detection
14 0.21766837 154 nips-2006-Optimal Change-Detection and Spiking Neurons
15 0.21002562 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions
16 0.20572631 66 nips-2006-Detecting Humans via Their Pose
17 0.20386843 69 nips-2006-Distributed Inference in Dynamical Systems
18 0.20217465 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation
19 0.20176293 129 nips-2006-Map-Reduce for Machine Learning on Multicore
20 0.20155285 153 nips-2006-Online Clustering of Moving Hyperplanes
topicId topicWeight
[(1, 0.105), (2, 0.308), (3, 0.013), (7, 0.064), (9, 0.034), (22, 0.049), (44, 0.151), (57, 0.069), (65, 0.048), (69, 0.025)]
simIndex simValue paperId paperTitle
1 0.85288745 125 nips-2006-Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning
Author: Peter Auer, Ronald Ortner
Abstract: We present a learning algorithm for undiscounted reinforcement learning. Our interest lies in bounds for the algorithm’s online performance after some finite number of steps. In the spirit of similar methods already successfully applied for the exploration-exploitation tradeoff in multi-armed bandit problems, we use upper confidence bounds to show that our UCRL algorithm achieves logarithmic online regret in the number of steps taken with respect to an optimal policy. 1 1.1
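The upper-confidence-bound principle this abstract builds on is easiest to see in the multi-armed bandit setting it cites as inspiration. The sketch below implements the classic UCB1 index; it is not the UCRL algorithm itself, which extends the same optimism-under-uncertainty idea to Markov decision processes, and the Bernoulli arms are hypothetical.

# Sketch of the classic UCB1 index for multi-armed bandits, the
# exploration-exploitation device the abstract points to as inspiration.
import math
import random

def ucb1(pull, n_arms, horizon):
    counts = [0] * n_arms
    means = [0.0] * n_arms
    total_reward = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1                      # play each arm once to initialize
        else:
            # Optimism: empirical mean plus a confidence radius.
            arm = max(range(n_arms),
                      key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]
        total_reward += r
    return total_reward, counts

random.seed(0)
true_means = [0.2, 0.5, 0.8]                 # hypothetical Bernoulli arms
reward, counts = ucb1(lambda a: float(random.random() < true_means[a]), 3, 5000)
print(reward, counts)                        # most pulls should go to the best arm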
2 0.81040341 27 nips-2006-An EM Algorithm for Localizing Multiple Sound Sources in Reverberant Environments
Author: Michael I. Mandel, Daniel P. Ellis, Tony Jebara
Abstract: We present a method for localizing and separating sound sources in stereo recordings that is robust to reverberation and does not make any assumptions about the source statistics. The method consists of a probabilistic model of binaural multisource recordings and an expectation maximization algorithm for finding the maximum likelihood parameters of that model. These parameters include distributions over delays and assignments of time-frequency regions to sources. We evaluate this method against two comparable algorithms on simulations of simultaneous speech from two or three sources. Our method outperforms the others in anechoic conditions and performs as well as the better of the two in the presence of reverberation. 1
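To show the E-step/M-step structure such an algorithm relies on, here is a small EM sketch that fits a two-source Gaussian mixture to hypothetical per-frame delay estimates. It is only schematic: the paper's model works with binaural time-frequency observations, assigns time-frequency regions to sources, and handles reverberation, none of which is modeled here.

# Minimal EM for a two-source mixture of Gaussians over hypothetical
# per-frame delay estimates; it illustrates the E-step/M-step structure only.
import math
import random

def em_delays(delays, iters=50):
    mu = [min(delays), max(delays)]          # crude initialization
    sigma = [1.0, 1.0]
    priors = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each source for each delay sample.
        resp = []
        for d in delays:
            w = [priors[k] / (sigma[k] * math.sqrt(2 * math.pi))
                 * math.exp(-0.5 * ((d - mu[k]) / sigma[k]) ** 2) for k in range(2)]
            s = sum(w)
            resp.append([wk / s for wk in w])
        # M-step: re-estimate mixing weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            priors[k] = nk / len(delays)
            mu[k] = sum(r[k] * d for r, d in zip(resp, delays)) / nk
            var = sum(r[k] * (d - mu[k]) ** 2 for r, d in zip(resp, delays)) / nk
            sigma[k] = math.sqrt(max(var, 1e-6))
    return mu, sigma, priors

random.seed(1)
delays = [random.gauss(-0.3, 0.05) for _ in range(200)] + \
         [random.gauss(0.4, 0.05) for _ in range(200)]
print(em_delays(delays))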
same-paper 3 0.77619004 85 nips-2006-Geometric entropy minimization (GEM) for anomaly detection and localization
Author: Alfred O. Hero
Abstract: We introduce a novel adaptive non-parametric anomaly detection approach, called GEM, that is based on the minimal covering properties of K-point entropic graphs when constructed on N training samples from a nominal probability distribution. Such graphs have the property that as N → ∞ their span recovers the entropy minimizing set that supports at least ρ = K/N (100)% of the mass of the Lebesgue part of the distribution. When a test sample falls outside of the entropy minimizing set an anomaly can be declared at a statistical level of significance α = 1 − ρ. A method for implementing this non-parametric anomaly detector is proposed that approximates this minimum entropy set by the influence region of a K-point entropic graph built on the training data. By implementing an incremental leave-one-out k-nearest neighbor graph on resampled subsets of the training data GEM can efficiently detect outliers at a given level of significance and compute their empirical p-values. We illustrate GEM for several simulated and real data sets in high dimensional feature spaces. 1
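A simplified way to see the detector in this abstract in action is the k-nearest-neighbor surrogate sketched below: score a test point by its mean distance to its k nearest training points and convert the score into an empirical p-value against the leave-one-out scores of the nominal sample. The real construction uses K-point entropic graphs rather than this plain k-NN score, so treat the sketch as illustrative only; the data and parameter choices are hypothetical.

# Simplified k-NN surrogate for the anomaly detector described above:
# score a test point by its mean distance to its k nearest training points
# and compare against the leave-one-out scores of the nominal sample.
import random

def knn_score(x, data, k):
    dists = sorted(sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5 for y in data)
    return sum(dists[:k]) / k                # mean distance to k nearest points

def empirical_p_value(x, train, k=5):
    # Leave-one-out scores of the nominal training points.
    loo = [knn_score(train[i], train[:i] + train[i + 1:], k) for i in range(len(train))]
    s = knn_score(x, train, k)
    return (1 + sum(1 for v in loo if v >= s)) / (1 + len(loo))

random.seed(2)
train = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(200)]
print(empirical_p_value([0.1, -0.2], train))   # typical point: large p-value
print(empirical_p_value([4.0, 4.0], train))    # outlier: small p-value
# Declare an anomaly when the p-value falls below the significance level alpha.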
4 0.73190671 174 nips-2006-Similarity by Composition
Author: Oren Boiman, Michal Irani
Abstract: We propose a new approach for measuring similarity between two signals, which is applicable to many machine learning tasks, and to many signal types. We say that a signal S1 is “similar” to a signal S2 if it is “easy” to compose S1 from few large contiguous chunks of S2 . Obviously, if we use small enough pieces, then any signal can be composed of any other. Therefore, the larger those pieces are, the more similar S1 is to S2 . This induces a local similarity score at every point in the signal, based on the size of its supported surrounding region. These local scores can in turn be accumulated in a principled information-theoretic way into a global similarity score of the entire S1 to S2 . “Similarity by Composition” can be applied between pairs of signals, between groups of signals, and also between different portions of the same signal. It can therefore be employed in a wide variety of machine learning problems (clustering, classification, retrieval, segmentation, attention, saliency, labelling, etc.), and can be applied to a wide range of signal types (images, video, audio, biological data, etc.) We show a few such examples. 1
5 0.62188387 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics
Author: Peter L. Bartlett, Ambuj Tewari
Abstract: We consider methods that try to find a good policy for a Markov decision process by choosing one from a given class. The policy is chosen based on its empirical performance in simulations. We are interested in conditions on the complexity of the policy class that ensure the success of such simulation based policy search methods. We show that under bounds on the amount of computation involved in computing policies, transition dynamics and rewards, uniform convergence of empirical estimates to true value functions occurs. Previously, such results were derived by assuming boundedness of pseudodimension and Lipschitz continuity. These assumptions and ours are both stronger than the usual combinatorial complexity measures. We show, via minimax inequalities, that this is essential: boundedness of pseudodimension or fat-shattering dimension alone is not sufficient.
6 0.57410491 139 nips-2006-Multi-dynamic Bayesian Networks
7 0.57028228 57 nips-2006-Conditional mean field
8 0.55944902 38 nips-2006-Automated Hierarchy Discovery for Planning in Partially Observable Environments
9 0.55518866 96 nips-2006-In-Network PCA and Anomaly Detection
10 0.55336553 116 nips-2006-Learning from Multiple Sources
11 0.54726034 159 nips-2006-Parameter Expanded Variational Bayesian Methods
12 0.54209781 175 nips-2006-Simplifying Mixture Models through Function Approximation
13 0.54191279 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
14 0.54156661 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures
15 0.54083973 20 nips-2006-Active learning for misspecified generalized linear models
16 0.53768104 117 nips-2006-Learning on Graph with Laplacian Regularization
17 0.53749061 124 nips-2006-Linearly-solvable Markov decision problems
18 0.53720546 65 nips-2006-Denoising and Dimension Reduction in Feature Space
19 0.53532261 193 nips-2006-Tighter PAC-Bayes Bounds
20 0.53437203 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure