nips nips2004 nips2004-15 knowledge-graph by maker-knowledge-mining

15 nips-2004-Active Learning for Anomaly and Rare-Category Detection


Source: pdf

Author: Dan Pelleg, Andrew W. Moore

Abstract: We introduce a novel active-learning scenario in which a user wants to work with a learning algorithm to identify useful anomalies. These are distinguished from the traditional statistical definition of anomalies as outliers or merely ill-modeled points. Our distinction is that the usefulness of anomalies is categorized subjectively by the user. We make two additional assumptions. First, there exist extremely few useful anomalies to be hunted down within a massive dataset. Second, both useful and useless anomalies may sometimes exist within tiny classes of similar anomalies. The challenge is thus to identify “rare category” records in an unlabeled noisy set with help (in the form of class labels) from a human expert who has a small budget of datapoints that they are prepared to categorize. We propose a technique to meet this challenge, which assumes a mixture model fit to the data, but otherwise makes no assumptions on the particular form of the mixture components. This property promises wide applicability in real-life scenarios and for various statistical models. We give an overview of several alternative methods, highlighting their strengths and weaknesses, and conclude with a detailed empirical analysis. We show that our method can quickly zoom in on an anomaly set containing a few tens of points in a dataset of hundreds of thousands. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We introduce a novel active-learning scenario in which a user wants to work with a learning algorithm to identify useful anomalies. [sent-5, score-0.163]

2 These are distinguished from the traditional statistical definition of anomalies as outliers or merely ill-modeled points. [sent-6, score-0.403]

3 Our distinction is that the usefulness of anomalies is categorized subjectively by the user. [sent-7, score-0.403]

4 First, there exist extremely few useful anomalies to be hunted down within a massive dataset. [sent-9, score-0.436]

5 Second, both useful and useless anomalies may sometimes exist within tiny classes of similar anomalies. [sent-10, score-0.54]

6 The challenge is thus to identify “rare category” records in an unlabeled noisy set with help (in the form of class labels) from a human expert who has a small budget of datapoints that they are prepared to categorize. [sent-11, score-0.48]

7 We propose a technique to meet this challenge, which assumes a mixture model fit to the data, but otherwise makes no assumptions on the particular form of the mixture components. [sent-12, score-0.158]

8 We show that our method can quickly zoom in on an anomaly set containing a few tens of points in a dataset of hundreds of thousands. [sent-15, score-0.216]

9 [1 Introduction] We begin with an example of a rare-category-detection problem: an astronomer needs to sift through a large set of sky survey images, each of which comes with many numerical parameters. [sent-16, score-0.148]

10 The remainder are anomalies, but 99% of these anomalies are uninteresting, and only 1% of them (0. [sent-19, score-0.428]

11 The first type of anomalies, called “boring anomalies”, consists of records which are strange for uninteresting reasons, such as sensor faults or problems in the image-processing software. [sent-21, score-0.224]

12 The useful anomalies are extraordinary objects which are worthy of further research. [sent-22, score-0.454]

13 Two rare categories of “boring” anomalies in our test astrophysics data are shown in Figure 1. [sent-29, score-0.612]

14 At this point, objects flagged as “anomalous” can still be almost entirely of the uninteresting class of anomalies. [sent-33, score-0.206]

15 The computational and statistical question is then how to use feedback from the human user to iteratively reorder the queue of anomalies shown to them, so as to increase the chance that they will soon see an anomaly of a whole new category. [sent-34, score-0.827]

16 Each round starts with the teacher labeling a small number of examples. [sent-37, score-0.226]

17 The learner then identifies a small number of input records (“hints”) which are important in the sense that obtaining labels for them would help it improve the model. [sent-39, score-0.314]

18 These are shown to the teacher (in our scenario, a human expert) for labeling, and the cycle repeats. [sent-40, score-0.195]

19 It may seem too demanding to ask the human expert to give class labels instead of a simple “interesting” or “boring” flag. [sent-42, score-0.324]

20 For example, in the astronomical data we have seen a user place most objects into previously-known categories: point sources, low-surface-brightness galaxies, etc. [sent-44, score-0.147]

21 This also holds for the negative examples: it is frustrating to have to label all anomalies as “bad” without being able to explain why. [sent-45, score-0.427]

22 As far as the algorithm is concerned, the label set can be utterly changed by the user from one round to another. [sent-48, score-0.201]

23 Our tools allow that: the labels are unconstrained and the user can add, refine, and delete classes at will. [sent-49, score-0.295]

24 Our work differs from traditional applications of active learning in that we assume the distribution of class sizes to be extremely skewed. [sent-51, score-0.207]

25 Generally in active learning, it is believed that, right from the start, examples from each class need to be presented to the oracle [1, 2, 3]. [sent-53, score-0.174]

26 But in datasets with the rare categories property, this no longer holds, and much of our effort is an attempt to remedy the situation. [sent-55, score-0.173]

27 More precisely, we allow multiple queries and labels in each learning round; the traditional presentation has just one. [sent-59, score-0.234]

28 Figure 2: Underlying data distribution for the example (a); behavior of the lowlik method (b–f). [sent-60, score-0.39]

29 The anomalous points according to lowlik, given the model in (b), are shown in (c). [sent-63, score-0.258]

30 Given labels for the points in (c), the model in (d) is fitted. [sent-64, score-0.227]

31 Given the new model, anomalous points according to lowlik are flagged (e). [sent-65, score-0.648]

32 Given labels for the points in (c) and (e), this is the new fitted model (f). [sent-66, score-0.227]

33 In any case, we need to be able to identify query points in the presence of noise. [sent-71, score-0.196]

34 This is not just a bonus feature: points which the model considers noisy could very well be the key to improvement if presented to the oracle. [sent-72, score-0.17]

35 [2 Overview of Hint Selection Methods] In this section we survey several proposed methods for active learning as they apply to our setting. [sent-74, score-0.152]

36 One is an X-shaped distribution, from which 2000 points are drawn. [sent-80, score-0.131]
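A small synthetic-data sketch of this setup may help; the 2000 points come from the text, while the arm geometry, the rare circle class, and all other parameters are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    # X-shaped class: two elongated diagonal arms, 1000 points each (2000 total).
    t = rng.uniform(-3, 3, size=(1000, 1))
    arm1 = np.hstack([t, t]) + rng.normal(scale=0.1, size=(1000, 2))
    arm2 = np.hstack([t, -t]) + rng.normal(scale=0.1, size=(1000, 2))
    # Small rare class: a tight circle of points, offset from the origin.
    angles = rng.uniform(0, 2 * np.pi, size=50)
    circle = np.column_stack([np.cos(angles), np.sin(angles)]) * 0.3 + [2.0, 0.0]
    data = np.vstack([arm1, arm2, circle])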

37 Before each M step we clamp the class membership values for the hinted records to match the hints (i.e., the responsibilities are fixed to the given labels). [sent-84, score-0.616]
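A minimal NumPy sketch of the clamping step; resp holds the E-step membership values z_i^c, and hints is a hypothetical mapping from hinted record index to its labeled component:

    import numpy as np

    def clamp_responsibilities(resp, hints):
        # resp  : (n_points, n_components) array of membership values z_i^c
        # hints : dict mapping point index -> labeled component index
        resp = resp.copy()
        for i, c in hints.items():
            resp[i, :] = 0.0  # a hinted record belongs to its labeled component only
            resp[i, c] = 1.0
        return resp

    # Before each M step: resp = clamp_responsibilities(resp, hints)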

38 Given fully labeled data, our learner would perfectly predict class membership for this data (although it would be a poor generative model): one Gaussian centered on the circle, and another spherical Gaussian with high variance centered on the X. [sent-87, score-0.227]

39 On the first iteration (when unsupervised) the algorithm will naturally use the two Gaussians to model the data as in Figure 2(b), with one Gaussian for each of the arms of the “X”, and the points in the circle represented as members of one of them. [sent-96, score-0.261]

40 Choosing Points with Low Likelihood: A rather intuitive approach is to select as hints the points which the model performs worst on. [sent-99, score-0.449]

41 This can be viewed as model variance minimization [4] or as selection of points furthest away from any labeled points [5]. [sent-100, score-0.512]

[Figure 3: Behavior of the ambig (a–c) and interleave (d–e) methods.]

42 The unsupervised model and the points which ambig flags as anomalous, given this model (a). [sent-101, score-0.401]

43 The model learned using labels for these points is (b), along with the point it flags. [sent-102, score-0.227]


45 We do this by ranking each point in order of increasing model likelihood, and choosing the most anomalous items. [sent-105, score-0.153]
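A minimal sketch of the lowlik ranking, using scikit-learn's GaussianMixture as a stand-in for the paper's mixture model (which makes no assumption about component form):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def lowlik_hints(model, X, budget):
        # Per-point log-likelihood under the mixture; the lowest values are the
        # points the model performs worst on, i.e. the most anomalous items.
        loglik = model.score_samples(X)
        return np.argsort(loglik)[:budget]

    X = np.random.randn(2000, 2)
    model = GaussianMixture(n_components=2, random_state=0).fit(X)
    hints = lowlik_hints(model, X, budget=50)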

46 Each subsequent drawing shows a model which EM converged to after including the new labels, and the hints it chooses under a particular scheme (here, the scheme we call lowlik). [sent-108, score-0.318]

47 These hints affect the model shown for the next round. [sent-109, score-0.318]

48 In any event, none of the points in the circle is flagged. [sent-115, score-0.186]

49 Only after obtaining labels for all of the “outlier” points (that is, those on the extremes of the distribution) will this approach go far enough down the list to hit a point in the circle. [sent-118, score-0.349]

50 Choosing Ambiguous Points: Another popular approach is to choose the points which the learner is least certain about. [sent-120, score-0.179]

51 This way, the top of the list will have the objects which are “owned” by multiple components. [sent-125, score-0.173]
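One way to realize this ranking, sketched under the assumption that ambiguity is the gap between the two largest membership values (the extracted text does not pin down the exact score):

    import numpy as np

    def ambig_hints(resp, budget):
        # resp: (n_points, n_components) memberships, e.g. model.predict_proba(X).
        # A small gap between the two largest values means several components
        # "compete" for the point, so it ranks near the top of the list.
        top2 = np.sort(resp, axis=1)[:, -2:]
        margin = top2[:, 1] - top2[:, 0]
        return np.argsort(margin)[:budget]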

52 As expected, points on the decision boundaries between classes are chosen. [sent-127, score-0.234]

53 It may help modeling the points very close to the boundaries, but it does not improve generalization accuracy in the general case. [sent-131, score-0.162]

54 Indeed, we see that if we repeatedly apply this criterion we end up asking for labels for a great number of points in close proximity, to very little effect on the overall model. [sent-132, score-0.227]

55 Combining Unlikely and Ambiguous Points: Our next candidate is a hybrid method which tries to combine the hints from the two previous methods. [sent-134, score-0.318]

56 Recall they both produce a ranked list of all the points. [sent-135, score-0.17]

57 We merge the lists into another ranked list in the following way. [sent-136, score-0.317]

58 When all elements are taken, the output list is a ranked list as required. [sent-139, score-0.292]

59 We now pick the top items from this list for hints. [sent-140, score-0.175]
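The extracted summary elides the exact merge rule; a natural reading is to alternate between the two rankings while skipping points already taken, as in this sketch:

    def merge_ranked(lowlik_list, ambig_list):
        # Alternate between the two ranked lists, skipping duplicates, until
        # all elements are taken; the output is itself a ranked list.
        out, seen = [], set()
        for a, b in zip(lowlik_list, ambig_list):
            for idx in (a, b):
                if idx not in seen:
                    seen.add(idx)
                    out.append(idx)
        return out

    # hints = merge_ranked(lowlik_ranking, ambig_ranking)[:budget]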

60 As expected we get a good mix of points in both hint sets (not shown). [sent-141, score-0.221]

61 The key insight is that our group of anomalies was, in fact, reasonably ordinary when analyzed on a global scale. [sent-147, score-0.403]

62 In other words, the mixture density of the region we chose for the group of anomalies is not sufficiently low for them to rank high on the hint list. [sent-148, score-0.572]

63 Our idea is that if we restrict the focus to match the “point of view” of just one component, these anomalies will become more apparent. [sent-154, score-0.403]

64 The hope is that given this restricted view, anomalies that do not fit the component’s own model will stand out. [sent-156, score-0.403]

65 For each component $c$ we create a list of all the points for which $c = \arg\max_{c'} z_i^{c'}$, ranked by $z_i^{c}$. [sent-159, score-0.411]
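A sketch of this per-component ranking; sorting in increasing order of $z_i^{c}$ is an assumption here, so that the least-typical members of each component surface first:

    import numpy as np

    def per_component_lists(resp):
        # resp: (n_points, n_components) membership values z_i^c.
        # For each component c, collect the points it "owns"
        # (c = argmax over c' of z_i^c') and rank them by z_i^c.
        owner = resp.argmax(axis=1)
        lists = []
        for c in range(resp.shape[1]):
            owned = np.where(owner == c)[0]
            lists.append(owned[np.argsort(resp[owned, c])])
        return lists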

66 Most of the points are along the major axes of the two elongated Gaussians, but two of the points are inside the small circle. [sent-166, score-0.262]

67 Correct labels for even just these two points result in perfect classification in the next EM run. [sent-167, score-0.227]

68 This modification lets it nominate hints more often than any other component. [sent-170, score-0.367]

69 In terms of list merging, we take one element from each of the lists of standard components, and then several elements from the list produced for the background component. [sent-171, score-0.35]

70 In other words, if there are N components (excluding uniform), then the first cycle of hint nomination will result in 20 + N hints, 20 of which come from the uniform component. [sent-173, score-0.167]
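Put together, one nomination cycle might look like the following sketch; the 20-hint quota for the uniform background component comes from the text, while the helper names are assumptions:

    def nominate_hints(component_lists, background_list, bg_slots=20):
        # One element from each standard component's list, then bg_slots
        # elements from the background (uniform) component's list:
        # bg_slots + N hints in total for N standard components.
        hints = [lst[0] for lst in component_lists if len(lst) > 0]
        hints.extend(background_list[:bg_slots])
        return hints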

71 The class size distribution is a geometric series: the largest class owns half of the data, and each subsequent class is half the size of the previous one. [sent-177, score-0.21]
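Such a class-size distribution can be generated as follows (the rounding scheme is an assumption):

    import numpy as np

    def geometric_class_sizes(n_classes, n_points):
        # Largest class owns half the data; each subsequent class is half
        # the size of the previous one, renormalized to sum to n_points.
        weights = 0.5 ** np.arange(1, n_classes + 1)
        weights /= weights.sum()
        return np.round(weights * n_points).astype(int)

    print(geometric_class_sizes(5, 1000))  # -> [516 258 129  65  32]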

72 Figure 4: Learning curves for simulated data drawn from a mixture of dependency trees (left), and for the SHUTTLE set (right). [sent-193, score-0.77]

73 The Y axis shows the fraction of classes represented in queries sent to the teacher. [sent-194, score-0.16]

74 [Figure 5 legend: lowlik, mix-ambig-lowlik, random, ambig, interleave.] [sent-209, score-1.852]

75 Figure 5: Learning curves for the ABALONE (left) and KDD (right) sets. [sent-212, score-0.636]

76 There are ten tree classes and a uniform background component. [sent-218, score-0.14]

77 Only the results for 15 dimensions and 100 noisy points are shown as they are representative of the other experiments. [sent-220, score-0.17]

78 In each round of learning, the learner queries the teacher with a list of 50 points for labeling, and has access to all the queries and replies submitted previously. [sent-221, score-0.641]
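The round protocol can be sketched as below; refit and teacher are hypothetical stand-ins for the semi-supervised EM fit and the human expert, and select_hints is any of the ranking methods above:

    def run_rounds(model, X, teacher, select_hints, n_rounds, batch=50):
        labels = {}  # all queries and replies submitted so far
        for _ in range(n_rounds):
            model = model.refit(X, labels)  # hypothetical: EM with clamped hints
            ranking = select_hints(model, X, batch + len(labels))
            fresh = [i for i in ranking if i not in labels][:batch]
            for i in fresh:
                labels[i] = teacher(i)  # query the teacher for a class label
        return model, labels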

79 Also included are results for random, a baseline method that chooses hints at random. [sent-225, score-0.344]

80 Our scoring function is driven by our application, and estimates the amount of effort the teacher has to expend before being presented with representatives of every single class. [sent-226, score-0.167]

81 The assumption is that the teacher can generalize from a single example (or a very few examples) to an entire class, and the valuable information is concentrated in the first queried member of each class. [sent-227, score-0.169]

82 More precisely, if there are n classes, then the score under this metric is 1/n times the number of classes represented in the query set. [sent-228, score-0.168]
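A one-line sketch of this class-discovery score:

    def class_discovery_score(queried_labels, n_classes):
        # 1/n times the number of distinct classes represented in the query set.
        return len(set(queried_labels)) / n_classes

    # class_discovery_score(["a", "b", "a", "c"], n_classes=4) -> 0.75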

83 The best performer so far is interleave, taking five rounds or fewer to reveal all of the classes, including the very rare ones. [sent-230, score-0.161]

84 We can also see that ambig performs worse than random. [sent-232, score-0.244]

85 This can be explained by the fact that ambig only chooses points that already have several existing components “competing” for them. [sent-233, score-0.414]

86 Figure 6: Learning curves for the EDSGC (left) and SDSS (right) sets. [sent-248, score-0.636]

87 We were concerned that the poor performance of lowlik was just a consequence of our choice of metric. [sent-258, score-0.39]

88 These points are genuine anomalies, so it is possible that lowlik is being penalized unfairly for focusing on the noise points. [sent-261, score-0.521]

89 Counting the number of noise points (i.e., points drawn from the uniform background component) found by each algorithm, we discovered that lowlik actually scores worse than interleave on this metric as well. [sent-264, score-0.904]

90 We see that it takes the interleave algorithm five rounds to spot all classes, whereas the next best is lowlik, with 11 rounds. [sent-269, score-0.216]

91 In Figure 5 we see that lowlik performs uncharacteristically poorly. [sent-274, score-0.39]

92 Another surprise is that the combination of lowlik and ambig outperforms them both. [sent-275, score-0.634]

93 It also matches interleave in performance, and this is the only case where we have seen it do so. [sent-276, score-0.268]

94 The class labels relate to the shape and size of the sky object. [sent-278, score-0.217]

95 We see in Figure 6 that for the purpose of class discovery, we can do a good job in a small number of rounds: here, a human would have had to label just 250 objects before being presented with a member of the smallest class, comprising just 24 records out of a set of 1.4 million. [sent-279, score-0.456]

96 [4 Conclusion] We have shown that some of the popular methods for active learning perform poorly in realistic active-learning scenarios where classes are imbalanced. [sent-281, score-0.256]

97 By contrast, our proposed methods work well in the presence of noisy data and extremely rare classes and anomalies. [sent-283, score-0.279]

98 Our simulations show that a human user only needs to label one or two hundred examples before being presented with very rare anomalies in huge data sets. [sent-284, score-0.71]

99 Consequently, we expect our results to apply to many different kinds of component models, including cases where the components are not dependency trees, or are not even all drawn from the same distribution. [sent-287, score-0.146]

100 Our application presents multiple indicators to help a user spot anomalous data, as well as controls for labeling points and adding classes. [sent-289, score-0.482]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('anomalies', 0.403), ('lowlik', 0.39), ('hints', 0.318), ('interleave', 0.268), ('ambig', 0.244), ('records', 0.139), ('points', 0.131), ('anomalous', 0.127), ('list', 0.122), ('teacher', 0.106), ('rare', 0.104), ('active', 0.104), ('classes', 0.103), ('abalone', 0.098), ('shuttle', 0.098), ('user', 0.096), ('labels', 0.096), ('hint', 0.09), ('anomaly', 0.085), ('uninteresting', 0.085), ('round', 0.081), ('mixture', 0.079), ('discovered', 0.078), ('astrophysics', 0.073), ('boring', 0.073), ('edsgc', 0.073), ('class', 0.07), ('lists', 0.069), ('expert', 0.067), ('query', 0.065), ('flagged', 0.064), ('kdd', 0.064), ('spot', 0.058), ('queries', 0.057), ('rounds', 0.057), ('dependency', 0.055), ('circle', 0.055), ('merge', 0.054), ('component', 0.052), ('objects', 0.051), ('sky', 0.051), ('human', 0.051), ('unlabeled', 0.049), ('arms', 0.049), ('astronomer', 0.049), ('diffraction', 0.049), ('hinted', 0.049), ('nominate', 0.049), ('owned', 0.049), ('sdss', 0.049), ('scenarios', 0.049), ('twentieth', 0.049), ('learner', 0.048), ('ranked', 0.048), ('survey', 0.048), ('labeled', 0.045), ('ask', 0.04), ('membership', 0.04), ('components', 0.039), ('submitted', 0.039), ('noisy', 0.039), ('artifact', 0.039), ('labeling', 0.039), ('scenario', 0.038), ('cycle', 0.038), ('background', 0.037), ('effort', 0.037), ('em', 0.036), ('dan', 0.036), ('queried', 0.036), ('sloan', 0.036), ('david', 0.035), ('datapoints', 0.034), ('useless', 0.034), ('extremely', 0.033), ('cohen', 0.032), ('ambiguity', 0.032), ('cohn', 0.032), ('hundred', 0.032), ('categories', 0.032), ('help', 0.031), ('item', 0.031), ('ambiguous', 0.03), ('zi', 0.029), ('wants', 0.029), ('pick', 0.028), ('gaussians', 0.028), ('member', 0.027), ('members', 0.026), ('choosing', 0.026), ('unsupervised', 0.026), ('databases', 0.025), ('items', 0.025), ('remainder', 0.025), ('precisely', 0.024), ('smallest', 0.024), ('label', 0.024), ('random', 0.024), ('driven', 0.024), ('another', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000008 15 nips-2004-Active Learning for Anomaly and Rare-Category Detection

Author: Dan Pelleg, Andrew W. Moore

Abstract: We introduce a novel active-learning scenario in which a user wants to work with a learning algorithm to identify useful anomalies. These are distinguished from the traditional statistical definition of anomalies as outliers or merely ill-modeled points. Our distinction is that the usefulness of anomalies is categorized subjectively by the user. We make two additional assumptions. First, there exist extremely few useful anomalies to be hunted down within a massive dataset. Second, both useful and useless anomalies may sometimes exist within tiny classes of similar anomalies. The challenge is thus to identify “rare category” records in an unlabeled noisy set with help (in the form of class labels) from a human expert who has a small budget of datapoints that they are prepared to categorize. We propose a technique to meet this challenge, which assumes a mixture model fit to the data, but otherwise makes no assumptions on the particular form of the mixture components. This property promises wide applicability in real-life scenarios and for various statistical models. We give an overview of several alternative methods, highlighting their strengths and weaknesses, and conclude with a detailed empirical analysis. We show that our method can quickly zoom in on an anomaly set containing a few tens of points in a dataset of hundreds of thousands. 1

2 0.16182096 23 nips-2004-Analysis of a greedy active learning strategy

Author: Sanjoy Dasgupta

Abstract: We abstract out the core search problem of active learning schemes, to better understand the extent to which adaptive labeling can improve sample complexity. We give various upper and lower bounds on the number of labels which need to be queried, and we prove that a popular greedy active learning rule is approximately as good as any other strategy for minimizing this number of labels. 1

3 0.091121465 164 nips-2004-Semi-supervised Learning by Entropy Minimization

Author: Yves Grandvalet, Yoshua Bengio

Abstract: We consider the semi-supervised learning problem, where a decision rule is to be learned from labeled and unlabeled data. In this framework, we motivate minimum entropy regularization, which enables to incorporate unlabeled data in the standard supervised learning. Our approach includes other approaches to the semi-supervised problem as particular or limiting cases. A series of experiments illustrates that the proposed solution benefits from unlabeled data. The method challenges mixture models when the data are sampled from the distribution class spanned by the generative model. The performances are definitely in favor of minimum entropy regularization when generative models are misspecified, and the weighting of unlabeled data provides robustness to the violation of the “cluster assumption”. Finally, we also illustrate that the method can also be far superior to manifold learning in high dimension spaces. 1

4 0.087902457 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

Author: Aharon Bar-hillel, Adam Spiro, Eran Stark

Abstract: Spike sorting involves clustering spike trains recorded by a microelectrode according to the source neuron. It is a complicated problem, which requires a lot of human labor, partly due to the non-stationary nature of the data. We propose an automated technique for the clustering of non-stationary Gaussian sources in a Bayesian framework. At a first search stage, data is divided into short time frames and candidate descriptions of the data as a mixture of Gaussians are computed for each frame. At a second stage transition probabilities between candidate mixtures are computed, and a globally optimal clustering is found as the MAP solution of the resulting probabilistic model. Transition probabilities are computed using local stationarity assumptions and are based on a Gaussian version of the Jensen-Shannon divergence. The method was applied to several recordings. The performance appeared almost indistinguishable from humans in a wide range of scenarios, including movement, merges, and splits of clusters. 1

5 0.085130706 49 nips-2004-Density Level Detection is Classification

Author: Ingo Steinwart, Don Hush, Clint Scovel

Abstract: We show that anomaly detection can be interpreted as a binary classification problem. Using this interpretation we propose a support vector machine (SVM) for anomaly detection. We then present some theoretical results which include consistency and learning rates. Finally, we experimentally compare our SVM with the standard one-class SVM. 1

6 0.075473502 98 nips-2004-Learning Gaussian Process Kernels via Hierarchical Bayes

7 0.073318876 145 nips-2004-Parametric Embedding for Class Visualization

8 0.073055163 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

9 0.072758354 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics

10 0.066475831 54 nips-2004-Distributed Information Regularization on Graphs

11 0.064887784 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms

12 0.062720113 85 nips-2004-Instance-Based Relevance Feedback for Image Retrieval

13 0.062631488 168 nips-2004-Semigroup Kernels on Finite Sets

14 0.061777987 136 nips-2004-On Semi-Supervised Classification

15 0.058326878 125 nips-2004-Multiple Relational Embedding

16 0.056915913 163 nips-2004-Semi-parametric Exponential Family PCA

17 0.056072071 92 nips-2004-Kernel Methods for Implicit Surface Modeling

18 0.055665761 156 nips-2004-Result Analysis of the NIPS 2003 Feature Selection Challenge

19 0.053126231 158 nips-2004-Sampling Methods for Unsupervised Learning

20 0.05297666 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.183), (1, 0.057), (2, -0.025), (3, -0.027), (4, 0.008), (5, 0.096), (6, 0.001), (7, 0.065), (8, 0.057), (9, -0.037), (10, 0.008), (11, 0.135), (12, 0.001), (13, -0.076), (14, -0.048), (15, 0.01), (16, 0.117), (17, -0.024), (18, 0.035), (19, -0.093), (20, -0.009), (21, 0.182), (22, 0.1), (23, 0.139), (24, 0.004), (25, -0.026), (26, -0.036), (27, 0.067), (28, 0.026), (29, 0.02), (30, -0.024), (31, 0.054), (32, 0.041), (33, -0.089), (34, 0.059), (35, 0.02), (36, 0.014), (37, -0.005), (38, -0.031), (39, 0.024), (40, -0.098), (41, 0.029), (42, 0.02), (43, -0.008), (44, -0.061), (45, -0.092), (46, 0.084), (47, 0.035), (48, -0.062), (49, 0.083)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93341708 15 nips-2004-Active Learning for Anomaly and Rare-Category Detection

Author: Dan Pelleg, Andrew W. Moore

Abstract: We introduce a novel active-learning scenario in which a user wants to work with a learning algorithm to identify useful anomalies. These are distinguished from the traditional statistical definition of anomalies as outliers or merely ill-modeled points. Our distinction is that the usefulness of anomalies is categorized subjectively by the user. We make two additional assumptions. First, there exist extremely few useful anomalies to be hunted down within a massive dataset. Second, both useful and useless anomalies may sometimes exist within tiny classes of similar anomalies. The challenge is thus to identify “rare category” records in an unlabeled noisy set with help (in the form of class labels) from a human expert who has a small budget of datapoints that they are prepared to categorize. We propose a technique to meet this challenge, which assumes a mixture model fit to the data, but otherwise makes no assumptions on the particular form of the mixture components. This property promises wide applicability in real-life scenarios and for various statistical models. We give an overview of several alternative methods, highlighting their strengths and weaknesses, and conclude with a detailed empirical analysis. We show that our method can quickly zoom in on an anomaly set containing a few tens of points in a dataset of hundreds of thousands. 1

2 0.77481592 23 nips-2004-Analysis of a greedy active learning strategy

Author: Sanjoy Dasgupta

Abstract: We abstract out the core search problem of active learning schemes, to better understand the extent to which adaptive labeling can improve sample complexity. We give various upper and lower bounds on the number of labels which need to be queried, and we prove that a popular greedy active learning rule is approximately as good as any other strategy for minimizing this number of labels. 1

3 0.58605516 164 nips-2004-Semi-supervised Learning by Entropy Minimization

Author: Yves Grandvalet, Yoshua Bengio

Abstract: We consider the semi-supervised learning problem, where a decision rule is to be learned from labeled and unlabeled data. In this framework, we motivate minimum entropy regularization, which enables to incorporate unlabeled data in the standard supervised learning. Our approach includes other approaches to the semi-supervised problem as particular or limiting cases. A series of experiments illustrates that the proposed solution benefits from unlabeled data. The method challenges mixture models when the data are sampled from the distribution class spanned by the generative model. The performances are definitely in favor of minimum entropy regularization when generative models are misspecified, and the weighting of unlabeled data provides robustness to the violation of the “cluster assumption”. Finally, we also illustrate that the method can also be far superior to manifold learning in high dimension spaces. 1

4 0.58305031 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms

Author: Omid Madani, David M. Pennock, Gary W. Flake

Abstract: In the context of binary classification, we define disagreement as a measure of how often two independently-trained models differ in their classification of unlabeled data. We explore the use of disagreement for error estimation and model selection. We call the procedure co-validation, since the two models effectively (in)validate one another by comparing results on unlabeled data, which we assume is relatively cheap and plentiful compared to labeled data. We show that per-instance disagreement is an unbiased estimate of the variance of error for that instance. We also show that disagreement provides a lower bound on the prediction (generalization) error, and a tight upper bound on the “variance of prediction error”, or the variance of the average error across instances, where variance is measured across training sets. We present experimental results on several data sets exploring co-validation for error estimation and model selection. The procedure is especially effective in active learning settings, where training sets are not drawn at random and cross validation overestimates error. 1

5 0.54607338 166 nips-2004-Semi-supervised Learning via Gaussian Processes

Author: Neil D. Lawrence, Michael I. Jordan

Abstract: We present a probabilistic approach to learning a Gaussian Process classifier in the presence of unlabeled data. Our approach involves a “null category noise model” (NCNM) inspired by ordered categorical noise models. The noise model reflects an assumption that the data density is lower between the class-conditional densities. We illustrate our approach on a toy problem and present comparative results for the semi-supervised classification of handwritten digits. 1

6 0.51280117 158 nips-2004-Sampling Methods for Unsupervised Learning

7 0.4878493 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

8 0.45650008 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms

9 0.44771832 136 nips-2004-On Semi-Supervised Classification

10 0.44316134 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

11 0.43834937 54 nips-2004-Distributed Information Regularization on Graphs

12 0.41985205 75 nips-2004-Heuristics for Ordering Cue Search in Decision Making

13 0.40741012 109 nips-2004-Mass Meta-analysis in Talairach Space

14 0.40419486 111 nips-2004-Maximal Margin Labeling for Multi-Topic Text Categorization

15 0.38658518 163 nips-2004-Semi-parametric Exponential Family PCA

16 0.3755044 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering

17 0.36918753 40 nips-2004-Common-Frame Model for Object Recognition

18 0.36224735 85 nips-2004-Instance-Based Relevance Feedback for Image Retrieval

19 0.36073744 127 nips-2004-Neighbourhood Components Analysis

20 0.34954667 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.076), (15, 0.119), (17, 0.012), (25, 0.013), (26, 0.043), (31, 0.053), (32, 0.024), (33, 0.183), (35, 0.021), (39, 0.015), (50, 0.047), (87, 0.015), (96, 0.288)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.79628432 15 nips-2004-Active Learning for Anomaly and Rare-Category Detection

Author: Dan Pelleg, Andrew W. Moore

Abstract: We introduce a novel active-learning scenario in which a user wants to work with a learning algorithm to identify useful anomalies. These are distinguished from the traditional statistical definition of anomalies as outliers or merely ill-modeled points. Our distinction is that the usefulness of anomalies is categorized subjectively by the user. We make two additional assumptions. First, there exist extremely few useful anomalies to be hunted down within a massive dataset. Second, both useful and useless anomalies may sometimes exist within tiny classes of similar anomalies. The challenge is thus to identify “rare category” records in an unlabeled noisy set with help (in the form of class labels) from a human expert who has a small budget of datapoints that they are prepared to categorize. We propose a technique to meet this challenge, which assumes a mixture model fit to the data, but otherwise makes no assumptions on the particular form of the mixture components. This property promises wide applicability in real-life scenarios and for various statistical models. We give an overview of several alternative methods, highlighting their strengths and weaknesses, and conclude with a detailed empirical analysis. We show that our method can quickly zoom in on an anomaly set containing a few tens of points in a dataset of hundreds of thousands. 1

2 0.77912384 56 nips-2004-Dynamic Bayesian Networks for Brain-Computer Interfaces

Author: Pradeep Shenoy, Rajesh P. Rao

Abstract: We describe an approach to building brain-computer interfaces (BCI) based on graphical models for probabilistic inference and learning. We show how a dynamic Bayesian network (DBN) can be used to infer probability distributions over brain- and body-states during planning and execution of actions. The DBN is learned directly from observed data and allows measured signals such as EEG and EMG to be interpreted in terms of internal states such as intent to move, preparatory activity, and movement execution. Unlike traditional classification-based approaches to BCI, the proposed approach (1) allows continuous tracking and prediction of internal states over time, and (2) generates control signals based on an entire probability distribution over states rather than binary yes/no decisions. We present preliminary results of brain- and body-state estimation using simultaneous EEG and EMG signals recorded during a self-paced left/right hand movement task. 1

3 0.63929111 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer

Abstract: Prediction suffix trees (PST) provide a popular and effective tool for tasks such as compression, classification, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efficient algorithm that neither relies on a-priori assumptions on the shape or maximal depth of the target PST nor does it require any parameters. To our knowledge, this is the first provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any fixed PST determined in hindsight. 1

4 0.63927925 23 nips-2004-Analysis of a greedy active learning strategy

Author: Sanjoy Dasgupta

Abstract: We abstract out the core search problem of active learning schemes, to better understand the extent to which adaptive labeling can improve sample complexity. We give various upper and lower bounds on the number of labels which need to be queried, and we prove that a popular greedy active learning rule is approximately as good as any other strategy for minimizing this number of labels. 1

5 0.63566875 69 nips-2004-Fast Rates to Bayes for Kernel Machines

Author: Ingo Steinwart, Clint Scovel

Abstract: We establish learning rates to the Bayes risk for support vector machines (SVMs) with hinge loss. In particular, for SVMs with Gaussian RBF kernels we propose a geometric condition for distributions which can be used to determine approximation properties of these kernels. Finally, we compare our methods with a recent paper of G. Blanchard et al.. 1

6 0.63506228 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

7 0.63375169 77 nips-2004-Hierarchical Clustering of a Mixture Model

8 0.63353455 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

9 0.63334674 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection

10 0.63241553 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering

11 0.63149261 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

12 0.63144976 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification

13 0.63089716 131 nips-2004-Non-Local Manifold Tangent Learning

14 0.63021892 127 nips-2004-Neighbourhood Components Analysis

15 0.62948674 54 nips-2004-Distributed Information Regularization on Graphs

16 0.62944049 102 nips-2004-Learning first-order Markov models for control

17 0.62841076 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach

18 0.6278677 207 nips-2004-ℓ₀-norm Minimization for Basis Selection

19 0.62777114 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images

20 0.62753606 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve