nips nips2012 nips2012-223 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ko-jen Hsiao, Kevin Xu, Jeff Calder, Alfred O. Hero
Abstract: We consider the problem of identifying patterns in a data set that exhibit anomalous behavior, often referred to as anomaly detection. In most anomaly detection algorithms, the dissimilarity between data samples is calculated by a single criterion, such as Euclidean distance. However, in many cases there may not exist a single dissimilarity measure that captures all possible anomalous patterns. In such a case, multiple criteria can be defined, and one can test for anomalies by scalarizing the multiple criteria using a linear combination of them. If the importance of the different criteria are not known in advance, the algorithm may need to be executed multiple times with different choices of weights in the linear combination. In this paper, we introduce a novel non-parametric multi-criteria anomaly detection method using Pareto depth analysis (PDA). PDA uses the concept of Pareto optimality to detect anomalies under multiple criteria without having to run an algorithm multiple times with different choices of weights. The proposed PDA approach scales linearly in the number of criteria and is provably better than linear combinations of the criteria.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We consider the problem of identifying patterns in a data set that exhibit anomalous behavior, often referred to as anomaly detection. [sent-4, score-0.504]
2 In most anomaly detection algorithms, the dissimilarity between data samples is calculated by a single criterion, such as Euclidean distance. [sent-5, score-0.608]
3 However, in many cases there may not exist a single dissimilarity measure that captures all possible anomalous patterns. [sent-6, score-0.228]
4 In such a case, multiple criteria can be defined, and one can test for anomalies by scalarizing the multiple criteria using a linear combination of them. [sent-7, score-0.321]
5 In this paper, we introduce a novel non-parametric multi-criteria anomaly detection method using Pareto depth analysis (PDA). [sent-9, score-0.533]
6 PDA uses the concept of Pareto optimality to detect anomalies under multiple criteria without having to run an algorithm multiple times with different choices of weights. [sent-10, score-0.25]
7 Many methods for anomaly detection have been developed using both parametric and non-parametric approaches. [sent-13, score-0.485]
8 For complex high-dimensional data, multiple dissimilarity measures corresponding to different criteria may be required to detect certain types of anomalies. [sent-15, score-0.227]
9 Multiple criteria, such as dissimilarity in object speeds or trajectory shapes, can be used to detect a greater range of anomalies than any single criterion. [sent-17, score-0.203]
10 In order to perform anomaly detection using these multiple criteria, one could first combine the dissimilarities using a linear combination. [sent-18, score-0.573]
11 Furthermore, when the weights are changed, the anomaly detection algorithm needs to be re-executed using the new weights. [sent-21, score-0.514]
12 In this paper we propose a novel non-parametric multi-criteria anomaly detection approach using Pareto depth analysis (PDA). [sent-22, score-0.533]
13 Hence, PDA is able to detect anomalies under multiple combinations of the criteria without explicitly forming these combinations. [sent-27, score-0.215]
14 Center: Dyads for the training samples (black dots) along with the first 20 Pareto fronts (green lines) under two criteria: |∆x| and |∆y|. [sent-29, score-0.311]
15 The Pareto fronts induce a partial ordering on the set of dyads. [sent-30, score-0.265]
16 Dyads associated with the test sample marked by the red circle concentrate around shallow fronts (near the lower left of the figure). [sent-31, score-0.331]
17 The PDA approach involves creating dyads corresponding to dissimilarities between pairs of data samples under all of the dissimilarity measures. [sent-33, score-0.586]
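As an illustration of this dyad construction, here is a minimal sketch (ours, not the authors' code), assuming samples live in a Python sequence and each criterion is a callable dissimilarity; the name make_dyads is hypothetical.

```python
import numpy as np

def make_dyads(X, dissimilarities):
    # X: length-N sequence of samples; dissimilarities: K callables d_l(x_i, x_j).
    # Returns an (N*(N-1)/2, K) array of dyads and the corresponding index pairs.
    N = len(X)
    pairs = [(i, j) for i in range(N) for j in range(i + 1, N)]
    D = np.array([[d(X[i], X[j]) for d in dissimilarities] for i, j in pairs])
    return D, pairs
```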
18 The first Pareto front (depth one) is the set of non-dominated dyads. [sent-35, score-0.178]
19 The second Pareto front (depth two) is obtained by removing these non-dominated dyads, i.e., [sent-36, score-0.178]
20 peeling off the first front, and recomputing the first Pareto front of those remaining. [sent-38, score-0.178]
21 In this way, each dyad is assigned to a Pareto front at some depth (see Fig. 1). [sent-40, score-0.359]
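The peeling procedure just described is simple to sketch. The following non-dominated-sorting loop (our illustration; pareto_peel is a hypothetical name) assigns every point its Pareto front depth:

```python
import numpy as np

def pareto_peel(D):
    # D: (M, K) array; returns depth[m] = index of the Pareto front containing row m.
    depth = np.zeros(len(D), dtype=int)
    remaining = np.arange(len(D))
    front = 1
    while remaining.size > 0:
        nd = []
        for idx in remaining:
            p = D[idx]
            # idx is dominated if some remaining point is <= p everywhere and < p somewhere
            dominators = np.all(D[remaining] <= p, axis=1) & np.any(D[remaining] < p, axis=1)
            if not dominators.any():
                nd.append(idx)
        depth[nd] = front
        remaining = np.setdiff1d(remaining, nd)
        front += 1
    return depth
```

This brute-force version costs O(M^2) comparisons per front; faster algorithms exist, but the sketch follows the peeling definition directly.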
22 Nominal and anomalous samples are located near different Pareto front depths; thus computing the front depths of the dyads corresponding to a test sample can discriminate between nominal and anomalous samples. [sent-42, score-1.23]
23 Under the assumption that the multi-criteria dyads can be modeled as realizations from a smooth K-dimensional density, we provide a mathematical analysis of the behavior of the first Pareto front. [sent-44, score-0.408]
24 Furthermore, this theoretical prediction is experimentally validated by comparing PDA to several state-of-the-art anomaly detection algorithms in two experiments involving both synthetic and real data sets. [sent-46, score-0.497]
25 In Section 3 we provide an introduction to Pareto fronts and present a theoretical analysis of the properties of the first Pareto front. [sent-49, score-0.265]
26 Section 4 relates Pareto fronts to the multi-criteria anomaly detection problem, which leads to the PDA anomaly detection algorithm. [sent-50, score-1.235]
27 These methods typically formulate machine learning problems as multi-objective optimization problems where finding even the first Pareto front is quite difficult. [sent-53, score-0.178]
28 These methods differ from our use of Pareto optimality because we consider multiple Pareto fronts created from a finite set of items, so we do not need to employ sophisticated methods in order to find these fronts. [sent-54, score-0.313]
29 Hero and Fleury [4] introduced a method for gene ranking using Pareto fronts that is related to our approach. [sent-55, score-0.265]
30 The method ranks genes, in order of interest to a biologist, by creating Pareto fronts of the data samples, i.e. [sent-56, score-0.265]
31 In this paper, we consider Pareto fronts of dyads, which correspond to dissimilarities between pairs of data samples rather than the samples themselves, and use the distribution of dyads in Pareto fronts to perform multi-criteria anomaly detection rather than ranking. [sent-59, score-1.534]
32 A similar area is that of multiple kernel learning [8], which is typically applied to supervised learning problems, unlike the unsupervised anomaly detection setting we consider. [sent-66, score-0.506]
33 Finally, many other anomaly detection methods have previously been proposed. [sent-67, score-0.485]
34 [2] both provide extensive surveys of different anomaly detection methods and applications. [sent-69, score-0.497]
35 Byers and Raftery [9] proposed to use the distance between a sample and its kth-nearest neighbor as the anomaly score for the sample; similarly, Angiulli and Pizzuti [10] and Eskin et al. [sent-71, score-0.42]
36 [12] used an anomaly score based on the local density of the k nearest neighbors of a sample. [sent-74, score-0.461]
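For reference, a minimal single-criterion score of this family, the distance to the kth nearest neighbor, might be sketched as follows (Euclidean distance assumed; this is not the exact implementation of any of [9]-[12]):

```python
import numpy as np

def kth_nn_score(x, X_train, k):
    # Anomaly score = distance from test point x to its k-th nearest training sample.
    dists = np.sort(np.linalg.norm(X_train - x, axis=1))
    return dists[k - 1]
```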
37 Hero [13] and Sricharan and Hero [14] introduced non-parametric adaptive anomaly detection methods using geometric entropy minimization, based on random k-point minimal spanning trees and bipartite k-nearest neighbor (k-NN) graphs, respectively. [sent-75, score-0.485]
38 Zhao and Saligrama [15] proposed an anomaly detection algorithm k-LPE using local p-value estimation (LPE) based on a k-NN graph. [sent-76, score-0.485]
39 These k-NN anomaly detection schemes only depend on the data through the pairs of data points (dyads) that define the edges in the k-NN graphs. [sent-77, score-0.511]
40 All of the aforementioned methods are designed for single-criteria anomaly detection. [sent-78, score-0.371]
41 In the multicriteria setting, the single-criteria algorithms must be executed multiple times with different weights, unlike the PDA anomaly detection algorithm that we propose in Section 4. [sent-79, score-0.526]
42 Different choices of (nonnegative) weights in the linear combination could result in different minimizers; a set of items that are minimizers under some linear combination can then be created by using a grid search over the weights, for example. [sent-94, score-0.181]
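To make the grid-search construction concrete, here is a small sketch of ours for K = 2: sweep nonnegative weights summing to one and record which item minimizes each linear combination.

```python
import numpy as np

def scalarization_minimizers(D, num_weights=100):
    # D: (M, 2) array of items under two criteria.
    # Returns the set of indices minimizing some convex combination of the criteria.
    mins = set()
    for w in np.linspace(0.0, 1.0, num_weights):
        mins.add(int(np.argmin(w * D[:, 0] + (1.0 - w) * D[:, 1])))
    return mins
```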
43 The second Pareto front can be constructed by finding items that are not strictly dominated by any of the remaining items, which are members of the set S \ F1 . [sent-101, score-0.251]
44 More generally, define the ith Pareto front by Fi = Pareto front of the set S \ (F1 ∪ · · · ∪ Fi−1). [sent-102, score-0.356]
45 For convenience, we say that a Pareto front Fi is deeper than Fj if i > j. [sent-103, score-0.194]
46 3.1 Mathematical properties of Pareto fronts. The distribution of the number of points on the first Pareto front was first studied by Barndorff-Nielsen and Sobel in their seminal work [17]. [sent-105, score-0.469]
47 We will be concerned here with properties of the first Pareto front that are relevant to the PDA anomaly detection algorithm and thus have not yet been considered in the literature. [sent-107, score-0.663]
48 For a measurable set A ⊂ Rd, we denote by FA the points on the first Pareto front of Y1, . . . , Yn that belong to A. [sent-115, score-0.204]
49 Although this is a common motivation for Pareto methods, there are, to the best of our knowledge, no results in the literature regarding how many points on the Pareto front are missed by scalarization. [sent-130, score-0.204]
50 In the context of this paper, if some Pareto-optimal points are not identified, then the anomaly score (defined in Section 4.2) will be inflated. [sent-138, score-0.428]
51 Hence the size of F \ L is a measure of how much the anomaly score is inflated and the degree to which Pareto methods will outperform linear scalarization. [sent-140, score-0.414]
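Combining the two sketches above gives a quick toy check (ours, not an experiment from the paper) of how many first-front points every linear scalarization misses:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.uniform(size=(500, 2))                 # toy dyads in the unit square
F = set(np.flatnonzero(pareto_peel(D) == 1))   # first Pareto front
L = scalarization_minimizers(D)                # points reachable by some weight
print(f"|F| = {len(F)}, missed by scalarization |F \\ L| = {len(F - L)}")
```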
52 Figure 2: Left: Non-convexities in the Pareto front induced by the geometry of the domain Ω (Theorem 1). [sent-189, score-0.198]
53 4 Multi-criteria anomaly detection. Assume that a training set XN = {X1, . . . , XN} is available. [sent-205, score-0.503]
54 Given a test sample X, the objective of anomaly detection is to declare X to be an anomaly if X is significantly different from samples in XN . [sent-209, score-0.934]
55 Denote the dissimilarity between Xi and Xj computed using the measure corresponding to the lth criterion by dl (i, j). [sent-212, score-0.18]
56 For convenience, denote the set of all dyads by D and the space of all dyads RK+ by D. [sent-225, score-0.816]
57 By the definition of strict dominance in Section 3, a dyad Dij strictly dominates another dyad Di∗j∗ if dl(i, j) ≤ dl(i∗, j∗) for all l ∈ {1, . . . , K} and dl(i, j) < dl(i∗, j∗) for some l. [sent-226, score-0.396]
58 The first Pareto front F1 corresponds to the set of dyads from D that are not strictly dominated by any other dyads from D. [sent-230, score-1.034]
59 The second Pareto front F2 corresponds to the set of dyads from D \ F1 that are not strictly dominated by any other dyads from D \ F1 , and so on, as defined in Section 3. [sent-231, score-1.034]
60 Recall that we refer to Fi as a deeper front than Fj if i > j. [sent-232, score-0.194]
61 4.1 Pareto fronts of dyads. For each sample Xn, there are N − 1 dyads corresponding to its connections with the other N − 1 samples. [sent-234, score-1.099]
62 Define the set of N − 1 dyads associated with Xn by Dn . [sent-235, score-0.408]
63 If most dyads in Dn are located at shallow Pareto fronts, then the dissimilarities between Xn and the other N − 1 samples are small under some combination of the criteria. [sent-236, score-0.549]
64 This is the basic idea of the proposed multi-criteria anomaly detection method using PDA. [sent-238, score-0.485]
65 We construct the Pareto fronts F1, . . . , FM of the dyads from the training set, where M is the number of fronts required so that each dyad is a member of some front. [sent-242, score-1.089]
66 When a test sample X is obtained, we create new dyads corresponding to connections between X and training samples, as illustrated in Figure 1. [sent-243, score-0.478]
67 Similar to many other anomaly detection methods, we connect each test sample to its k nearest neighbors. [sent-244, score-0.558]
68 We create s = Σ_{i=1}^{K} ki new dyads, which we denote by the set Dnew. [sent-246, score-0.528]
69 Algorithm 1 (PDA anomaly detection algorithm), training phase: 1: for l = 1 → K do; 2: Calculate pairwise dissimilarities dl(i, j) between all training samples Xi and Xj; 3: Create dyads Dij = [d1(i, j), . . . , dK(i, j)]. [sent-247, score-0.557]
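Assembling the training steps into runnable form, reusing the make_dyads and pareto_peel sketches from earlier (again our illustration, not the authors' code):

```python
def pda_train(X_train, dissimilarities):
    # Training phase: form all dyads, then peel Pareto fronts once.
    D_train, pairs = make_dyads(X_train, dissimilarities)
    depths = pareto_peel(D_train)   # front index for every training dyad
    return D_train, depths
```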
70 In other words, we create a dyad between X and Xj if Xj is among the ki nearest neighbors (footnote 1) of X in any criterion i. [sent-254, score-0.25]
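The test-phase dyad construction can be sketched as follows; for simplicity a single k is used for every criterion, and, per footnote 1, a neighbor selected under several criteria contributes one dyad copy per criterion. All names are hypothetical.

```python
import numpy as np

def make_test_dyads(x, X_train, dissimilarities, k):
    # dist[j, l] = dissimilarity between test sample x and training sample j
    # under criterion l; keep the k nearest per criterion, duplicates included.
    dist = np.array([[d(x, xj) for d in dissimilarities] for xj in X_train])
    new_dyads = []
    for l in range(dist.shape[1]):
        for j in np.argsort(dist[:, l])[:k]:
            new_dyads.append(dist[j])   # the full K-dimensional dyad for neighbor j
    return np.array(new_dyads)
```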
71 4.2 Anomaly detection using depths of dyads. In k-NN based anomaly detection algorithms such as those mentioned in Section 2, the anomaly score is a function of the k nearest neighbors to a test sample. [sent-262, score-1.519]
72 With multiple criteria, one could define an anomaly score by scalarization. [sent-263, score-0.423]
73 From the probabilistic properties of Pareto fronts discussed in Section 3.1, [sent-264, score-0.265]
74 This motivates us to develop a multi-criteria anomaly score using Pareto fronts. [sent-266, score-0.402]
75 We start with the observation from Figure 1 that dyads corresponding to a nominal test sample are typically located near shallower fronts than dyads corresponding to an anomalous test sample. [sent-267, score-1.371]
76 Each test sample is associated with s new dyads, where the ith dyad Dinew has depth ei. [sent-268, score-0.254]
77 For each test sample X, we define the anomaly score v(X) to be the mean of the ei ’s, which corresponds to the average depth of the s dyads associated with X. [sent-269, score-0.931]
78 Thus the anomaly score can be easily computed and compared to the decision threshold σ using the test v(X) = (1/s) Σ_{i=1}^{s} ei ≷ σ: declare H1 (anomaly) if v(X) > σ, and H0 otherwise. [sent-270, score-0.457]
79 Pseudocode for the PDA anomaly detector is shown in Algorithm 1. [sent-271, score-0.371]
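A sketch of the scoring step (ours, reusing make_test_dyads from the previous sketch): each new dyad's front depth is approximated as one plus the deepest training front containing a dyad that strictly dominates it, and the score is the mean of these depths. Comparing v(X) to σ then implements the test above.

```python
import numpy as np

def pda_score(x, X_train, dissimilarities, k, D_train, depths):
    new_dyads = make_test_dyads(x, X_train, dissimilarities, k)
    e = []
    for p in new_dyads:
        # training dyads that strictly dominate the new dyad p
        dom = np.all(D_train <= p, axis=1) & np.any(D_train < p, axis=1)
        e.append(1 + (depths[dom].max() if dom.any() else 0))
    return float(np.mean(e))   # v(X); declare an anomaly when v(X) > sigma
```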
80 Both the training time and the … (Footnote 1: If a training sample is one of the ki nearest neighbors in multiple criteria, then multiple copies of the dyad corresponding to the connection between the test sample and the training sample are created.) [sent-273, score-0.387]
81 dyads as well, and it is supported by experimental results presented in Section 2 of the supplementary material. [sent-282, score-0.408]
82 To handle multiple criteria, other anomaly detection methods, such as the ones mentioned in Section 2, need to be re-executed multiple times using different (non-negative) linear combinations of the K criteria. [sent-316, score-0.568]
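Schematically, that baseline workflow re-runs a detector once per weight vector; in the sketch below (K = 2), run_single_criterion_detector is a hypothetical stand-in for any of the Section 2 methods.

```python
import numpy as np

def sweep_weights(dissimilarity_matrices, run_single_criterion_detector, grid_size=11):
    # dissimilarity_matrices: list of K (N, N) arrays, one per criterion (K = 2 here).
    results = {}
    for w in np.linspace(0.0, 1.0, grid_size):
        combined = w * dissimilarity_matrices[0] + (1.0 - w) * dissimilarity_matrices[1]
        results[w] = run_single_criterion_detector(combined)  # hypothetical callable
    return results
```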
83 5 Experiments We compare the PDA method with four other nearest neighbor-based single-criterion anomaly detection algorithms mentioned in Section 2. [sent-320, score-0.522]
84 For these methods, we use linear combinations of the criteria with different weights selected by grid search to compare performance with PDA. [sent-321, score-0.184]
85 The anomalous samples are located just outside of this hypercube. [sent-325, score-0.184]
86 Each class differs from the nominal distribution in one of the four dimensions; the distribution in the anomalous dimension is uniform on [1, 1. [sent-327, score-0.197]
87 We draw 300 training samples from the nominal distribution followed by 100 test samples from a mixture of the nominal and anomalous distributions with a 0. [sent-329, score-0.353]
88 If the criteria are combined using linear combinations, the combined dissimilarity measure reduces to weighted squared Euclidean distance. [sent-332, score-0.196]
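As a small check of our own: with per-coordinate squared differences as the criteria, a nonnegative weight vector w yields exactly a weighted squared Euclidean distance.

```python
import numpy as np

def weighted_sq_euclidean(x, y, w):
    # Linear combination of the K = len(w) per-dimension squared differences.
    return float(np.dot(w, (np.asarray(x) - np.asarray(y)) ** 2))
```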
89 Right: A subset of the dyads for the training samples along with the first 100 Pareto fronts. [sent-368, score-0.454]
90 The fronts are highly non-convex, partially explaining the superior performance of PDA. [sent-369, score-0.265]
91 We use two criteria for computing the dissimilarity between trajectories. [sent-370, score-0.184]
92 For a more detailed comparison, the ROC curve for PDA and the attainable region for k-LPE (the region between the ROC curves corresponding to weights resulting in the best and worst AUCs) are shown in Figure 3 along with the first 100 Pareto fronts for PDA. [sent-383, score-0.327]
93 6 Conclusion In this paper we proposed a new multi-criteria anomaly detection method. [sent-386, score-0.485]
94 The proposed method uses Pareto depth analysis to compute the anomaly score of a test sample by examining the Pareto front depths of dyads corresponding to the test sample. [sent-387, score-1.123]
95 Dyads corresponding to an anomalous sample tended to be located at deeper fronts compared to dyads corresponding to a nominal sample. [sent-388, score-0.927]
96 Instead of choosing a specific weighting or performing a grid search on the weights for different dissimilarity measures, the proposed method can efficiently detect anomalies in a manner that scales linearly in the number of criteria. [sent-389, score-0.237]
97 We also thank Daniel DeWoskin for suggesting a fast algorithm for computing Pareto fronts in two criteria. [sent-393, score-0.265]
98 A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data. [sent-458, score-0.371]
99 Geometric entropy minimization (GEM) for anomaly detection and localization. [sent-474, score-0.485]
100 Anomaly detection with score functions based on nearest neighbor graphs. [sent-485, score-0.182]
wordName wordTfidf (topN-words)
[('pareto', 0.578), ('dyads', 0.408), ('anomaly', 0.371), ('pda', 0.337), ('fronts', 0.265), ('front', 0.178), ('anomalous', 0.133), ('dyad', 0.133), ('detection', 0.114), ('dissimilarity', 0.095), ('criteria', 0.089), ('nominal', 0.064), ('dissimilarities', 0.055), ('anomalies', 0.054), ('scalarization', 0.051), ('hero', 0.05), ('dl', 0.048), ('depth', 0.048), ('auc', 0.046), ('lpe', 0.041), ('item', 0.038), ('criterion', 0.037), ('fi', 0.037), ('nearest', 0.037), ('ei', 0.037), ('di', 0.036), ('dij', 0.036), ('pedestrian', 0.036), ('depths', 0.033), ('items', 0.033), ('trajectory', 0.032), ('score', 0.031), ('lof', 0.031), ('combinations', 0.029), ('weights', 0.029), ('nb', 0.028), ('trajectories', 0.028), ('samples', 0.028), ('aucs', 0.027), ('optimality', 0.027), ('ki', 0.027), ('points', 0.026), ('grid', 0.025), ('fl', 0.024), ('located', 0.023), ('roc', 0.022), ('neighbors', 0.022), ('strictly', 0.022), ('detect', 0.022), ('multiple', 0.021), ('yn', 0.021), ('attainable', 0.021), ('angiulli', 0.02), ('baryshnikov', 0.02), ('breunig', 0.02), ('chandola', 0.02), ('fleury', 0.02), ('multicriteria', 0.02), ('nbl', 0.02), ('pizzuti', 0.02), ('sobel', 0.02), ('yukich', 0.02), ('geometry', 0.02), ('minimizers', 0.02), ('rd', 0.02), ('training', 0.018), ('dominated', 0.018), ('shallow', 0.018), ('test', 0.018), ('byers', 0.018), ('hodge', 0.018), ('sricharan', 0.018), ('saligrama', 0.018), ('eskin', 0.018), ('sample', 0.018), ('combination', 0.017), ('views', 0.016), ('xn', 0.016), ('create', 0.016), ('choices', 0.016), ('near', 0.016), ('deeper', 0.016), ('ated', 0.016), ('raftery', 0.015), ('postponed', 0.015), ('yl', 0.015), ('fj', 0.014), ('declare', 0.014), ('median', 0.014), ('walking', 0.014), ('outlier', 0.013), ('linear', 0.012), ('curve', 0.012), ('surveys', 0.012), ('fth', 0.012), ('dominates', 0.012), ('concentrate', 0.012), ('theorem', 0.012), ('scales', 0.012), ('validated', 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 223 nips-2012-Multi-criteria Anomaly Detection using Pareto Depth Analysis
Author: Ko-jen Hsiao, Kevin Xu, Jeff Calder, Alfred O. Hero
Abstract: We consider the problem of identifying patterns in a data set that exhibit anomalous behavior, often referred to as anomaly detection. In most anomaly detection algorithms, the dissimilarity between data samples is calculated by a single criterion, such as Euclidean distance. However, in many cases there may not exist a single dissimilarity measure that captures all possible anomalous patterns. In such a case, multiple criteria can be defined, and one can test for anomalies by scalarizing the multiple criteria using a linear combination of them. If the importance of the different criteria are not known in advance, the algorithm may need to be executed multiple times with different choices of weights in the linear combination. In this paper, we introduce a novel non-parametric multi-criteria anomaly detection method using Pareto depth analysis (PDA). PDA uses the concept of Pareto optimality to detect anomalies under multiple criteria without having to run an algorithm multiple times with different choices of weights. The proposed PDA approach scales linearly in the number of criteria and is provably better than linear combinations of the criteria.
2 0.06913162 119 nips-2012-Entropy Estimations Using Correlated Symmetric Stable Random Projections
Author: Ping Li, Cun-hui Zhang
Abstract: Methods for efficiently estimating Shannon entropy of data streams have important applications in learning, data mining, and network anomaly detections (e.g., the DDoS attacks). For nonnegative data streams, the method of Compressed Counting (CC) [11, 13] based on maximally-skewed stable random projections can provide accurate estimates of the Shannon entropy using small storage. However, CC is no longer applicable when entries of data streams can be below zero, which is a common scenario when comparing two streams. In this paper, we propose an algorithm for entropy estimation in general data streams which allow negative entries. In our method, the Shannon entropy is approximated by the finite difference of two correlated frequency moments estimated from correlated samples of symmetric stable random variables. Interestingly, the estimator for the moment we recommend for entropy estimation barely has bounded variance itself, whereas the common geometric mean estimator (which has bounded higher-order moments) is not sufficient for entropy estimation. Our experiments confirm that this method is able to well approximate the Shannon entropy using small storage.
3 0.049840901 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization
Author: Du Tran, Junsong Yuan
Abstract: Structured output learning has been successfully applied to object localization, where the mapping between an image and an object bounding box can be well captured. Its extension to action localization in videos, however, is much more challenging, because we need to predict the locations of the action patterns both spatially and temporally, i.e., identifying a sequence of bounding boxes that track the action in video. The problem becomes intractable due to the exponentially large size of the structured video space where actions could occur. We propose a novel structured learning approach for spatio-temporal action localization. The mapping between a video and a spatio-temporal action trajectory is learned. The intractable inference and learning problems are addressed by leveraging an efficient Max-Path search method, thus making it feasible to optimize the model over the whole structured space. Experiments on two challenging benchmark datasets show that our proposed method outperforms the state-of-the-art methods. 1
4 0.04670272 97 nips-2012-Diffusion Decision Making for Adaptive k-Nearest Neighbor Classification
Author: Yung-kyun Noh, Frank Park, Daniel D. Lee
Abstract: This paper sheds light on some fundamental connections of the diffusion decision making model of neuroscience and cognitive psychology with k-nearest neighbor classification. We show that conventional k-nearest neighbor classification can be viewed as a special problem of the diffusion decision model in the asymptotic situation. By applying the optimal strategy associated with the diffusion decision model, an adaptive rule is developed for determining appropriate values of k in knearest neighbor classification. Making use of the sequential probability ratio test (SPRT) and Bayesian analysis, we propose five different criteria for adaptively acquiring nearest neighbors. Experiments with both synthetic and real datasets demonstrate the effectiveness of our classification criteria. 1
5 0.04652613 344 nips-2012-Timely Object Recognition
Author: Sergey Karayev, Tobias Baumgartner, Mario Fritz, Trevor Darrell
Abstract: In a large visual multi-class detection framework, the timeliness of results can be crucial. Our method for timely multi-class detection aims to give the best possible performance at any single point after a start time; it is terminated at a deadline time. Toward this goal, we formulate a dynamic, closed-loop policy that infers the contents of the image in order to decide which detector to deploy next. In contrast to previous work, our method significantly diverges from the predominant greedy strategies, and is able to learn to take actions with deferred values. We evaluate our method with a novel timeliness measure, computed as the area under an Average Precision vs. Time curve. Experiments are conducted on the PASCAL VOC object detection dataset. If execution is stopped when only half the detectors have been run, our method obtains 66% better AP than a random ordering, and 14% better performance than an intelligent baseline. On the timeliness measure, our method obtains at least 11% better performance. Our method is easily extensible, as it treats detectors and classifiers as black boxes and learns from execution traces using reinforcement learning. 1
6 0.038580839 133 nips-2012-Finding Exemplars from Pairwise Dissimilarities via Simultaneous Sparse Recovery
7 0.036192603 303 nips-2012-Searching for objects driven by context
8 0.033736113 40 nips-2012-Analyzing 3D Objects in Cluttered Images
9 0.033701144 106 nips-2012-Dynamical And-Or Graph Learning for Object Shape Modeling and Detection
10 0.033407483 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback
11 0.032215454 81 nips-2012-Context-Sensitive Decision Forests for Object Detection
12 0.032097388 272 nips-2012-Practical Bayesian Optimization of Machine Learning Algorithms
13 0.030787801 1 nips-2012-3D Object Detection and Viewpoint Estimation with a Deformable 3D Cuboid Model
14 0.029598033 145 nips-2012-Gradient Weights help Nonparametric Regressors
15 0.029516205 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed
16 0.028953113 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation
17 0.02823597 101 nips-2012-Discriminatively Trained Sparse Code Gradients for Contour Detection
18 0.02816214 201 nips-2012-Localizing 3D cuboids in single-view images
19 0.027728904 258 nips-2012-Online L1-Dictionary Learning with Application to Novel Document Detection
20 0.027528541 30 nips-2012-Accuracy at the Top
topicId topicWeight
[(0, 0.077), (1, -0.002), (2, -0.022), (3, -0.009), (4, 0.022), (5, -0.041), (6, 0.007), (7, 0.017), (8, -0.001), (9, 0.013), (10, -0.04), (11, -0.007), (12, 0.02), (13, -0.046), (14, 0.021), (15, 0.038), (16, 0.032), (17, -0.008), (18, -0.003), (19, -0.001), (20, 0.033), (21, -0.007), (22, 0.068), (23, -0.006), (24, -0.027), (25, 0.012), (26, 0.054), (27, -0.004), (28, 0.035), (29, -0.045), (30, 0.011), (31, 0.033), (32, 0.018), (33, 0.014), (34, -0.023), (35, 0.026), (36, 0.038), (37, 0.09), (38, -0.018), (39, 0.011), (40, -0.07), (41, 0.018), (42, -0.044), (43, -0.03), (44, 0.013), (45, -0.029), (46, -0.031), (47, 0.007), (48, 0.005), (49, 0.002)]
simIndex simValue paperId paperTitle
same-paper 1 0.85420758 223 nips-2012-Multi-criteria Anomaly Detection using Pareto Depth Analysis
Author: Ko-jen Hsiao, Kevin Xu, Jeff Calder, Alfred O. Hero
Abstract: We consider the problem of identifying patterns in a data set that exhibit anomalous behavior, often referred to as anomaly detection. In most anomaly detection algorithms, the dissimilarity between data samples is calculated by a single criterion, such as Euclidean distance. However, in many cases there may not exist a single dissimilarity measure that captures all possible anomalous patterns. In such a case, multiple criteria can be defined, and one can test for anomalies by scalarizing the multiple criteria using a linear combination of them. If the importance of the different criteria are not known in advance, the algorithm may need to be executed multiple times with different choices of weights in the linear combination. In this paper, we introduce a novel non-parametric multi-criteria anomaly detection method using Pareto depth analysis (PDA). PDA uses the concept of Pareto optimality to detect anomalies under multiple criteria without having to run an algorithm multiple times with different choices of weights. The proposed PDA approach scales linearly in the number of criteria and is provably better than linear combinations of the criteria.
2 0.60179412 119 nips-2012-Entropy Estimations Using Correlated Symmetric Stable Random Projections
Author: Ping Li, Cun-hui Zhang
Abstract: Methods for efficiently estimating Shannon entropy of data streams have important applications in learning, data mining, and network anomaly detections (e.g., the DDoS attacks). For nonnegative data streams, the method of Compressed Counting (CC) [11, 13] based on maximally-skewed stable random projections can provide accurate estimates of the Shannon entropy using small storage. However, CC is no longer applicable when entries of data streams can be below zero, which is a common scenario when comparing two streams. In this paper, we propose an algorithm for entropy estimation in general data streams which allow negative entries. In our method, the Shannon entropy is approximated by the finite difference of two correlated frequency moments estimated from correlated samples of symmetric stable random variables. Interestingly, the estimator for the moment we recommend for entropy estimation barely has bounded variance itself, whereas the common geometric mean estimator (which has bounded higher-order moments) is not sufficient for entropy estimation. Our experiments confirm that this method is able to well approximate the Shannon entropy using small storage.
3 0.53980684 189 nips-2012-Learning from the Wisdom of Crowds by Minimax Entropy
Author: Dengyong Zhou, Sumit Basu, Yi Mao, John C. Platt
Abstract: An important way to make large training sets is to gather noisy labels from crowds of nonexperts. We propose a minimax entropy principle to improve the quality of these labels. Our method assumes that labels are generated by a probability distribution over workers, items, and labels. By maximizing the entropy of this distribution, the method naturally infers item confusability and worker expertise. We infer the ground truth by minimizing the entropy of this distribution, which we show minimizes the Kullback-Leibler (KL) divergence between the probability distribution and the unknown truth. We show that a simple coordinate descent scheme can optimize minimax entropy. Empirically, our results are substantially better than previously published methods for the same problem. 1
4 0.4867616 95 nips-2012-Density-Difference Estimation
Author: Masashi Sugiyama, Takafumi Kanamori, Taiji Suzuki, Marthinus D. Plessis, Song Liu, Ichiro Takeuchi
Abstract: We address the problem of estimating the difference between two probability densities. A naive approach is a two-step procedure of first estimating two densities separately and then computing their difference. However, such a two-step procedure does not necessarily work well because the first step is performed without regard to the second step and thus a small estimation error incurred in the first stage can cause a big error in the second stage. In this paper, we propose a single-shot procedure for directly estimating the density difference without separately estimating two densities. We derive a non-parametric finite-sample error bound for the proposed single-shot density-difference estimator and show that it achieves the optimal convergence rate. We then show how the proposed density-difference estimator can be utilized in L2 -distance approximation. Finally, we experimentally demonstrate the usefulness of the proposed method in robust distribution comparison such as class-prior estimation and change-point detection.
5 0.48292762 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation
Author: Kumar Sricharan, Alfred O. Hero
Abstract: The problem of estimation of entropy functionals of probability densities has received much attention in the information theory, machine learning and statistics communities. Kernel density plug-in estimators are simple, easy to implement and widely used for estimation of entropy. However, for large feature dimension d, kernel plug-in estimators suffer from the curse of dimensionality: the MSE rate of convergence is glacially slow - of order O(T −γ/d ), where T is the number of samples, and γ > 0 is a rate parameter. In this paper, it is shown that for sufficiently smooth densities, an ensemble of kernel plug-in estimators can be combined via a weighted convex combination, such that the resulting weighted estimator has a superior parametric MSE rate of convergence of order O(T −1 ). Furthermore, it is shown that these optimal weights can be determined by solving a convex optimization problem which does not require training data or knowledge of the underlying density, and therefore can be performed offline. This novel result is remarkable in that, while each of the individual kernel plug-in estimators belonging to the ensemble suffer from the curse of dimensionality, by appropriate ensemble averaging we can achieve parametric convergence rates. 1
6 0.47981003 1 nips-2012-3D Object Detection and Viewpoint Estimation with a Deformable 3D Cuboid Model
7 0.47771674 201 nips-2012-Localizing 3D cuboids in single-view images
8 0.46427915 175 nips-2012-Learning High-Density Regions for a Generalized Kolmogorov-Smirnov Test in High-Dimensional Data
9 0.46180913 40 nips-2012-Analyzing 3D Objects in Cluttered Images
10 0.4592984 57 nips-2012-Bayesian estimation of discrete entropy with mixtures of stick-breaking priors
11 0.45241061 303 nips-2012-Searching for objects driven by context
12 0.44561735 279 nips-2012-Projection Retrieval for Classification
13 0.43077838 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration
14 0.4238832 123 nips-2012-Exponential Concentration for Mutual Information Estimation with Application to Forests
15 0.42366666 106 nips-2012-Dynamical And-Or Graph Learning for Object Shape Modeling and Detection
16 0.42157149 30 nips-2012-Accuracy at the Top
17 0.41888157 49 nips-2012-Automatic Feature Induction for Stagewise Collaborative Filtering
18 0.41850927 357 nips-2012-Unsupervised Template Learning for Fine-Grained Object Recognition
19 0.4183822 344 nips-2012-Timely Object Recognition
20 0.41623834 59 nips-2012-Bayesian nonparametric models for bipartite graphs
topicId topicWeight
[(0, 0.036), (1, 0.373), (11, 0.01), (17, 0.012), (21, 0.021), (38, 0.105), (42, 0.033), (49, 0.013), (55, 0.016), (74, 0.066), (76, 0.105), (80, 0.048), (92, 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.68416953 223 nips-2012-Multi-criteria Anomaly Detection using Pareto Depth Analysis
Author: Ko-jen Hsiao, Kevin Xu, Jeff Calder, Alfred O. Hero
Abstract: We consider the problem of identifying patterns in a data set that exhibit anomalous behavior, often referred to as anomaly detection. In most anomaly detection algorithms, the dissimilarity between data samples is calculated by a single criterion, such as Euclidean distance. However, in many cases there may not exist a single dissimilarity measure that captures all possible anomalous patterns. In such a case, multiple criteria can be defined, and one can test for anomalies by scalarizing the multiple criteria using a linear combination of them. If the importance of the different criteria are not known in advance, the algorithm may need to be executed multiple times with different choices of weights in the linear combination. In this paper, we introduce a novel non-parametric multi-criteria anomaly detection method using Pareto depth analysis (PDA). PDA uses the concept of Pareto optimality to detect anomalies under multiple criteria without having to run an algorithm multiple times with different choices of weights. The proposed PDA approach scales linearly in the number of criteria and is provably better than linear combinations of the criteria.
2 0.66040879 266 nips-2012-Patient Risk Stratification for Hospital-Associated C. diff as a Time-Series Classification Task
Author: Jenna Wiens, Eric Horvitz, John V. Guttag
Abstract: A patient’s risk for adverse events is affected by temporal processes including the nature and timing of diagnostic and therapeutic activities, and the overall evolution of the patient’s pathophysiology over time. Yet many investigators ignore this temporal aspect when modeling patient outcomes, considering only the patient’s current or aggregate state. In this paper, we represent patient risk as a time series. In doing so, patient risk stratification becomes a time-series classification task. The task differs from most applications of time-series analysis, like speech processing, since the time series itself must first be extracted. Thus, we begin by defining and extracting approximate risk processes, the evolving approximate daily risk of a patient. Once obtained, we use these signals to explore different approaches to time-series classification with the goal of identifying high-risk patterns. We apply the classification to the specific task of identifying patients at risk of testing positive for hospital acquired Clostridium difficile. We achieve an area under the receiver operating characteristic curve of 0.79 on a held-out set of several hundred patients. Our two-stage approach to risk stratification outperforms classifiers that consider only a patient’s current state (p<0.05). 1
3 0.56460297 43 nips-2012-Approximate Message Passing with Consistent Parameter Estimation and Applications to Sparse Learning
Author: Ulugbek Kamilov, Sundeep Rangan, Michael Unser, Alyson K. Fletcher
Abstract: We consider the estimation of an i.i.d. vector x ∈ Rn from measurements y ∈ Rm obtained by a general cascade model consisting of a known linear transform followed by a probabilistic componentwise (possibly nonlinear) measurement channel. We present a method, called adaptive generalized approximate message passing (Adaptive GAMP), that enables joint learning of the statistics of the prior and measurement channel along with estimation of the unknown vector x. Our method can be applied to a large class of learning problems including the learning of sparse priors in compressed sensing or identification of linear-nonlinear cascade models in dynamical systems and neural spiking processes. We prove that for large i.i.d. Gaussian transform matrices the asymptotic componentwise behavior of the adaptive GAMP algorithm is predicted by a simple set of scalar state evolution equations. This analysis shows that the adaptive GAMP method can yield asymptotically consistent parameter estimates, which implies that the algorithm achieves a reconstruction quality equivalent to the oracle algorithm that knows the correct parameter values. The adaptive GAMP methodology thus provides a systematic, general and computationally efficient method applicable to a large range of complex linear-nonlinear models with provable guarantees. 1
4 0.52976227 61 nips-2012-Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence
Author: Victor Gabillon, Mohammad Ghavamzadeh, Alessandro Lazaric
Abstract: We study the problem of identifying the best arm(s) in the stochastic multi-armed bandit setting. This problem has been studied in the literature from two different perspectives: fixed budget and fixed confidence. We propose a unifying approach that leads to a meta-algorithm called unified gap-based exploration (UGapE), with a common structure and similar theoretical analysis for these two settings. We prove a performance bound for the two versions of the algorithm showing that the two problems are characterized by the same notion of complexity. We also show how the UGapE algorithm as well as its theoretical analysis can be extended to take into account the variance of the arms and to multiple bandits. Finally, we evaluate the performance of UGapE and compare it with a number of existing fixed budget and fixed confidence algorithms. 1
5 0.5231207 188 nips-2012-Learning from Distributions via Support Measure Machines
Author: Krikamol Muandet, Kenji Fukumizu, Francesco Dinuzzo, Bernhard Schölkopf
Abstract: This paper presents a kernel-based discriminative learning framework on probability measures. Rather than relying on large collections of vectorial training examples, our framework learns using a collection of probability distributions that have been constructed to meaningfully represent training data. By representing these probability distributions as mean embeddings in the reproducing kernel Hilbert space (RKHS), we are able to apply many standard kernel-based learning techniques in straightforward fashion. To accomplish this, we construct a generalization of the support vector machine (SVM) called a support measure machine (SMM). Our analyses of SMMs provides several insights into their relationship to traditional SVMs. Based on such insights, we propose a flexible SVM (FlexSVM) that places different kernel functions on each training example. Experimental results on both synthetic and real-world data demonstrate the effectiveness of our proposed framework. 1
6 0.50889558 34 nips-2012-Active Learning of Multi-Index Function Models
7 0.4773052 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization
8 0.44306201 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance
9 0.42640087 273 nips-2012-Predicting Action Content On-Line and in Real Time before Action Onset – an Intracranial Human Study
10 0.42496136 260 nips-2012-Online Sum-Product Computation Over Trees
11 0.42311996 176 nips-2012-Learning Image Descriptors with the Boosting-Trick
12 0.4212718 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves
13 0.42112359 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models
14 0.42098016 69 nips-2012-Clustering Sparse Graphs
15 0.42063543 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity
16 0.42004415 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration
17 0.41973948 112 nips-2012-Efficient Spike-Coding with Multiplicative Adaptation in a Spike Response Model
18 0.41954792 148 nips-2012-Hamming Distance Metric Learning
19 0.41899616 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation
20 0.41895813 111 nips-2012-Efficient Sampling for Bipartite Matching Problems