nips nips2000 nips2000-148 knowledge-graph by maker-knowledge-mining

148 nips-2000-`N-Body' Problems in Statistical Learning


Source: pdf

Author: Alexander G. Gray, Andrew W. Moore

Abstract: We present efficient algorithms for all-point-pairs problems, or 'N-body'-like problems, which are ubiquitous in statistical learning. We focus on six examples, including nearest-neighbor classification, kernel density estimation, outlier detection, and the two-point correlation. These include any problem which abstractly requires a comparison of each of the N points in a dataset with each other point and would naively be solved using N^2 distance computations. In practice N is often large enough to make this infeasible. We present a suite of new geometric techniques which are applicable in principle to any 'N-body' computation, including large-scale mixtures of Gaussians, RBF neural networks, and HMMs. Our algorithms exhibit favorable asymptotic scaling and are empirically several orders of magnitude faster than the naive computation, even for small datasets. We are aware of no exact algorithms for these problems which are more efficient either empirically or theoretically. In addition, our framework yields simple and elegant algorithms. It also permits two important generalizations beyond the standard all-point-pairs problems, which are more difficult. These are represented by our final examples, the multiple two-point correlation and the notorious n-point correlation.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We present efficient algorithms for all-point-pairs problems, or 'N-body'-like problems, which are ubiquitous in statistical learning. [sent-12, score-0.189]

2 We focus on six examples, including nearest-neighbor classification, kernel density estimation, outlier detection, and the two-point correlation. [sent-13, score-0.212]

3 These include any problem which abstractly requires a comparison of each of the N points in a dataset with each other point and would naively be solved using N^2 distance computations. [sent-14, score-0.297]

4 We present a suite of new geometric techniques which are applicable in principle to any 'N-body' computation, including large-scale mixtures of Gaussians, RBF neural networks, and HMMs. [sent-16, score-0.146]

5 Our algorithms exhibit favorable asymptotic scaling and are empirically several orders of magnitude faster than the naive computation, even for small datasets. [sent-17, score-0.187]

6 We are aware of no exact algorithms for these problems which are more efficient either empirically or theoretically. [sent-18, score-0.19]

7 These are represented by our final examples, the multiple two-point correlation and the notorious n-point correlation. [sent-21, score-0.121]

8 1 Introduction: This paper is about accelerating a wide class of statistical methods that are naively quadratic in the number of datapoints. [sent-22, score-0.171]

9 We introduce a family of dual kd-tree traversal algorithms for these problems. [sent-23, score-0.139]

10 We describe in detail a dual-tree algorithm for calculating the two-point correlation, the simplest case of the problems we consider; for the five other statistical problems we consider, we show only performance results for lack of space. [sent-25, score-0.174]

11 In the general case, when we are computing distances between two different datasets having sizes N1 and N2, as in nearest-neighbor classification with separate training and test sets, say, the cost is O(N1 N2). [sent-26, score-0.117]

12 The sizes and positions of the disks show the node counts and centroids. [sent-31, score-0.242]

13 The ellipses and rectangles show the covariances and bounding boxes. [sent-32, score-0.141]

14 (c) The rectangles show the nodes pruned during a RangeSearch for one (depicted) query and radius. [sent-33, score-0.324]

15 (d) More pruning is possible using RangeCount instead of RangeSearch. [sent-34, score-0.163]

16 The last of our examples, the n-point correlation, illustrates a generalization from all-point-pairs problems to all-n-tuples problems, which are much harder (naively O(N^n)). [sent-35, score-0.099]

17 For all the examples, we believe there exist no exact algorithms which are faster either empirically or theoretically, nor any approximate algorithms that are faster while providing guarantees of acceptably high accuracy (as ours do). [sent-36, score-0.295]

18 For n-tuple N-body problems in particular, this type of algorithm design appears to have surpassed the existing computational barriers. [sent-37, score-0.106]

19 In addition, all the algorithms in this paper can be compactly defined and are easy to implement. [sent-38, score-0.053]

20 We proceed by viewing these statistical problems as geometric problems, exploiting the data's hyperstructure. [sent-40, score-0.209]

21 Each algorithm utilizes Multiresolution kd-trees, providing a geometric partitioning of the data space which is used to reason about entire chunks of the data simultaneously. [sent-41, score-0.28]

22 A kd-tree [3] records a d-dimensional data set containing N records. [sent-43, score-0.084]

23 Each node represents a set of data points by their bounding box. [sent-44, score-0.385]

24 Non-leaf nodes have two children, obtained by splitting the widest dimension of the parent's bounding box. [sent-45, score-0.128]

25 For the purposes of this paper, nodes are split until they contain only one point, where they become leaves. [sent-46, score-0.069]

26 An mrkd-tree [2, 6] is a conventional kd-tree decorated, at each node, with extra statistics about the node's data, such as their count, centroid, and covariance. [sent-47, score-0.205]

27 They are an instance of the idea of cached sufficient statistics [8] and are quite efficient in practice. [sent-48, score-0.115]
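
Sentences 22 through 27 describe the mrkd-tree in enough detail to sketch it. The following Python fragment is a minimal illustration of that description, not the authors' implementation: each node caches its count, centroid, and bounding box (covariance omitted for brevity), nodes split on their widest dimension (a median split is used here for simplicity), and leaves hold a single point.

import numpy as np

class Node:
    def __init__(self, points):
        self.points = points                    # (m, d) array of the points this node owns
        self.count = len(points)                # cached sufficient statistic
        self.centroid = points.mean(axis=0)     # cached sufficient statistic
        self.mins = points.min(axis=0)          # bounding box, lower corner
        self.maxs = points.max(axis=0)          # bounding box, upper corner
        self.left = self.right = None

def build_mrkd_tree(points):
    node = Node(points)
    if node.count <= 1:                         # leaves contain only one point
        return node
    dim = int(np.argmax(node.maxs - node.mins)) # split the widest dimension
    order = np.argsort(points[:, dim])
    mid = node.count // 2
    node.left = build_mrkd_tree(points[order[:mid]])
    node.right = build_mrkd_tree(points[order[mid:]])
    return node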

28 2 The 2-point correlation function: The two-point correlation is a spatial statistic which is of fundamental importance in many natural sciences, in particular astrophysics and biology. [sent-50, score-0.242]

29 It is easily defined as the number of pairs of points in a dataset which lie within a given radius r of each other. [sent-52, score-0.303]

30 The most naive approach is to simply compare each datum to each other one, incrementing a count if the distance between them is less than r. [sent-55, score-0.305]

31 This has O(N^2) cost, unacceptably high for problems of practical interest. [sent-56, score-0.1]

32 Although we have not needed to do so, they can be modified to become disk-resident for data sets with billions of records, and they can be efficiently updated incrementally. [sent-58, score-0.048]

33 The idea of binning is simply to divide the data space into a fine grid defining a set of bins, perform the quadratic algorithm on the bins as if they were individual data, then multiply by the bin sizes as appropriate to get an estimate of the total count. [sent-62, score-0.418]

34 The idea of gridding is to divide the data space into a coarse grid, perform the quadratic algorithm within each bin, and sum the results over all bins to get an estimate of the total count. [sent-63, score-0.374]
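
Sentences 33 and 34 describe the binning and gridding baselines. The sketch below is a rough, hypothetical rendering of the binning estimate (the grid width h and the occupancy-product weighting are assumptions, not details taken from the paper): occupied bins are treated as single points at their centres, and each qualifying bin pair contributes the product of its occupancies.

import numpy as np
from collections import Counter

def two_point_binned(points, r, h):
    # assign every point to a grid cell of side h and count occupancies
    bins = Counter(map(tuple, np.floor(points / h).astype(int)))
    keys = list(bins)
    centres = (np.array(keys) + 0.5) * h
    total = 0
    for i in range(len(keys)):
        for j in range(i, len(keys)):
            if np.linalg.norm(centres[i] - centres[j]) < r:
                ni, nj = bins[keys[i]], bins[keys[j]]
                total += ni * (ni - 1) // 2 if i == j else ni * nj
    return total    # an estimate of the pair count, not the exact value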

35 An approach to the two-point correlation computation that has been taken is to treat it as a range-searching problem [5, 10], since kd-trees have been historically almost synonymous with range-searching. [sent-67, score-0.121]

36 The idea is that we will make each datapoint in turn a query point and then execute a range search of the kd-tree to find all other points within distance r of the query. [sent-68, score-0.327]

37 A search is a depth-first traversal of the kd-tree, always checking the minimum possible distance d_min between the query and the hyper-rectangle surrounding the current node. [sent-69, score-0.466]

38 If d_min > r there is no point in visiting the node's children, and computation is saved. [sent-70, score-0.032]

39 The range searching avoids computing most of the distances between pairs of points further than r apart, which is a considerable saving if r is small. [sent-72, score-0.259]
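
Sentences 36 through 39 describe the classical single-tree range search. The fragment below is a minimal sketch, assuming the Node layout from the mrkd-tree sketch above (fields points, count, mins, maxs, left, right); d_min is the closest possible distance from the query to a node's bounding rectangle.

import numpy as np

def d_min(q, node):
    # per-dimension gap between q and the axis-parallel bounding box
    gap = np.maximum(node.mins - q, 0) + np.maximum(q - node.maxs, 0)
    return np.linalg.norm(gap)

def range_search(q, node, r):
    if d_min(q, node) > r:          # the whole box is out of range: prune
        return []
    if node.left is None:           # leaf: test the point(s) directly
        return [x for x in node.points if np.linalg.norm(q - x) < r]
    return range_search(q, node.left, r) + range_search(q, node.right, r)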

40 Better geometric approaches: new algorithms. Single-tree search (Range-Counting Algorithm). [sent-77, score-0.252]

41 A straightforward extension can exploit the fact that, unlike conventional use of range searching, these statistics frequently don't need to retrieve all the points in the radius but merely to count them. [sent-78, score-0.35]

42 The mrkd-tree has, in each node, the count of the number of data it contains: the simplest kind of cached sufficient statistic. [sent-79, score-0.268]

43 At a given node, if the distance between the query and the farthest point of the bounding box of the data in the node is smaller than the radius r, clearly every datum in the node is within range of the query. [sent-80, score-0.894]

44 We can then simply add the node's stored count to the total count. [sent-81, score-0.385]

45 (Note that both exclusion and subsumption are simple computations because the geometric regions are always axis-parallel rectangles.) [sent-83, score-0.218]
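
Sentences 41 through 45 describe the range-counting refinement. Here is a sketch, reusing the Node layout and d_min helper from the sketches above: d_max bounds the distance to the farthest corner of a node's box, so a node can be excluded (d_min > r) or subsumed (d_max < r, adding the cached count) without visiting its individual points.

import numpy as np

def d_max(q, node):
    # distance from q to the farthest corner of the axis-parallel box
    span = np.maximum(node.maxs - q, q - node.mins)
    return np.linalg.norm(span)

def range_count(q, node, r):
    if d_min(q, node) > r:
        return 0                    # exclusion prune
    if d_max(q, node) < r:
        return node.count           # subsumption prune: use the cached count
    if node.left is None:
        return sum(np.linalg.norm(q - x) < r for x in node.points)
    return range_count(q, node.left, r) + range_count(q, node.right, r)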

46 This paper introduces new single-tree algorithms for most of our examples, though it is not our main focus. [sent-84, score-0.053]

47 The idea is to consider the query points in chunks as well, as defined by nodes in a kd-tree. [sent-87, score-0.43]

48 In the general case where the query points are different from the data being queried, a separate kd-tree is built for the query points; otherwise a query node and a data node are simply pointers into the same kd-tree. [sent-88, score-1.186]

49 Dual-tree search can be thought of as a simultaneous traversal of two trees, instead of iterating over the query points in an outer loop and only exploiting single-tree-search in the inner loop. [sent-89, score-0.538]

50 Dual-tree search is based on node-node comparisons while Single-tree search was based on point-node comparisons. [sent-90, score-0.192]

51 Pseudocode for a recursive procedure called TwoPoint() is shown in Figure 2. [sent-91, score-0.061]

52 It counts the number of pairs of points (x_q ∈ QNODE, x_d ∈ DNODE) such that |x_q - x_d| < r. [sent-92, score-0.294]

53 Before doing any real work, the procedure checks whether it can perform an exclusion pruning (in which case the call terminates, returning 0) or subsumption pruning (in which case the call terminates, returning the product of the number of points in the two nodes). [sent-93, score-0.735]

54 If neither of these prunes occur, then depending on whether QNODE and/or DNODE are leaves, the corresponding recursive calls are made. [sent-94, score-0.145]
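
Sentences 47 through 54 describe the dual-tree TwoPoint recursion; the actual pseudocode (Figure 2) does not appear on this page, so the fragment below is a reconstruction from the description rather than the authors' listing. It reuses the Node layout from the earlier sketches, replaces the point-node bounds with node-node bounds, and for a single dataset counts ordered pairs (the rank prune discussed below is one way to count each unordered pair once).

import numpy as np

def node_d_min(a, b):
    # closest possible distance between the two bounding boxes
    gap = np.maximum(np.maximum(a.mins - b.maxs, b.mins - a.maxs), 0)
    return np.linalg.norm(gap)

def node_d_max(a, b):
    # farthest possible distance between the two bounding boxes
    span = np.maximum(a.maxs - b.mins, b.maxs - a.mins)
    return np.linalg.norm(span)

def two_point(qnode, dnode, r):
    if node_d_min(qnode, dnode) > r:
        return 0                                    # exclusion prune
    if node_d_max(qnode, dnode) < r:
        return qnode.count * dnode.count            # subsumption prune
    if qnode.left is None and dnode.left is None:   # both leaves: brute force
        return sum(np.linalg.norm(xq - xd) < r
                   for xq in qnode.points for xd in dnode.points)
    qs = [qnode] if qnode.left is None else [qnode.left, qnode.right]
    ds = [dnode] if dnode.left is None else [dnode.left, dnode.right]
    return sum(two_point(q, d, r) for q in qs for d in ds)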

55 Subsumption can also be exploited when other aggregate statistics, such as centroids or covariances of sets of points in a range, are required [2, 14, 9]. [sent-95, score-0.145]

56 Importantly, both kinds of prunings can now apply to many query points at once, instead of each nearby query point rediscovering the same prune during the single-tree search. [sent-98, score-0.62]

57 First, if r is so large that all pairs of points are counted, then the Single-tree search will perform O(N) operations, where each query point immediately prunes at the root, while Dual-tree search will perform O(1) operations. [sent-100, score-0.769]

58 Second, if r is so small that no pairs of points are counted, Single-tree search will run to one leaf for each query, meaning total work O(N log N), whereas Dual-tree search will visit each leaf once, meaning O(N) work. [sent-101, score-0.517]

59 So far, we have discussed two operations which cut short the need to traverse the tree further: exclusion and subsumption. [sent-104, score-0.065]

60 Another form of pruning is to eliminate node-node comparisons which have been performed already in the reverse order. [sent-105, score-0.163]

61 This can be done [11] simply by (virtually) ranking the datapoints according to their position in a depth-first traversal of the tree, then recording for each node the minimum and maximum ranks of the points it owns, and pruning whenever QNODE's maximum rank is less than DNODE's minimum rank. [sent-106, score-0.66]
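
Sentence 61's rank prune is a one-line addition to the recursion sketched above; min_rank and max_rank are assumed to be cached on each node alongside the other statistics, and the check follows the convention stated in the text.

def rank_pruned(qnode, dnode):
    # skip node pairs already handled in the reverse order
    return qnode.max_rank < dnode.min_rank

# inside two_point(qnode, dnode, r), before computing any distance bounds:
#     if rank_pruned(qnode, dnode):
#         return 0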

62 This is useful for all-pairs problems, but becomes essential for all-n-tuples problems. [sent-107, score-0.068]

63 This kind of pruning is not practical for Single-tree search. [sent-108, score-0.163]

64 Figure 3 shows the performance of a two-point correlation algorithm using all the aforementioned pruning methods. [sent-109, score-0.322]

65 Most often in practice, the two-point is computed for many successive radii so that a curve can be plotted, indicating the clumpiness on different scales. [sent-111, score-0.151]

66 It is possible to perform a single, faster computation for all the radii simultaneously, by taking advantage of the nesting structure of the ordered radii, with an algorithm which recursively narrows the radii which still need to be considered. (We'll summarize the asymptotic analysis briefly.) [sent-113, score-0.584]

67 Disappointingly, for 2-point, this performance is asymptotically the same cost as Single-tree. [sent-115, score-0.032]

68 Furthermore, if we can accept an approximate answer, the cost can be reduced further. [sent-117, score-0.031]

69 In that case the cost becomes independent of N. [sent-119, score-0.032]

70 (Figure 3 table rows: runtimes for the nearest-neighbor and outlier algorithms on datasets of 10,000 to 300,000 points.) [sent-135, score-0.529]

71 Figure 3: Our experiments timed our algorithms on large astronomical datasets of current scientific interest, consisting of x-y positions of sky objects from the Sloan Digital Sky Survey. [sent-141, score-0.963]

72 The larger runtimes for the quadratic algorithm were estimated based on those for smaller datasets. [sent-143, score-0.252]

73 The dual kd-tree method is about a factor of 2 faster than the single kd-tree method, and both are 3 orders of magnitude faster than the quadratic method for a medium-sized dataset of 300,000 points. [sent-144, score-0.222]

74 Figure 4: (a) Runtimes for multiple 2-point correlation with an increasing number of radii, and the speedup factor compared to 1,000 separate Dual-tree 2-point correlations. [sent-159, score-0.251]

75 (b) Runtimes for kernel density estimation with decreasing levels of approximation, controlled by the parameter ε, and speedup over quadratic. [sent-160, score-0.173]

76 The radii which still need to be considered are narrowed based on the current closest and farthest distances between the nodes. [sent-161, score-0.088]
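
Sentences 65, 66, and 76 describe the multiple-radius computation. The fragment below is one interpretation of that description, not the paper's algorithm verbatim: a window of still-undecided sorted radii is carried down the dual-tree recursion, and the node-node bounds node_d_min and node_d_max from the earlier sketch subsume or exclude whole ranges of radii at once.

import bisect
import numpy as np

def two_point_multi(qnode, dnode, radii, counts, lo=0, hi=None):
    # radii is sorted ascending; counts[k] accumulates pairs with distance < radii[k];
    # radii[lo:hi] are the radii still undecided for this node pair.
    if hi is None:
        hi = len(radii)
    d_lo, d_hi = node_d_min(qnode, dnode), node_d_max(qnode, dnode)
    sub = bisect.bisect_right(radii, d_hi, lo, hi)   # radii > d_hi: every pair counts
    for k in range(sub, hi):
        counts[k] += qnode.count * dnode.count
    act = bisect.bisect_right(radii, d_lo, lo, sub)  # radii <= d_lo: no pair counts
    if act >= sub:
        return
    if qnode.left is None and dnode.left is None:    # both leaves: settle directly
        for xq in qnode.points:
            for xd in dnode.points:
                dist = np.linalg.norm(xq - xd)
                for k in range(act, sub):
                    if dist < radii[k]:
                        counts[k] += 1
        return
    qs = [qnode] if qnode.left is None else [qnode.left, qnode.right]
    ds = [dnode] if dnode.left is None else [dnode.left, dnode.right]
    for q in qs:
        for dn in ds:
            two_point_multi(q, dn, radii, counts, act, sub)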

77 The results in Figure 4 confirm that the algorithm quickly focuses on the radii of relevance: for 150,000 data, computing 1,000 2-point correlations took only 7 times as long as computing one. [sent-163, score-0.261]

78 A fourth major type of pruning opportunity is approximation. [sent-165, score-0.163]

79 This is often needed in all-point-pairs computations which involve computing some real-valued function f(x, y) between every pair of points x and y. [sent-166, score-0.149]

80 An example is kernel density estimation with an infinite-tailed kernel such as a Gaussian, in which every training point has some non-zero (though perhaps infinitesimal) contribution to the density at each test point. [sent-167, score-0.118]

81 For each query point x_q we need to accumulate K Σ_i w(|x_q - x_i|), where K is a normalizing constant and w is a weighting function (which we will need to assume is monotonic). [sent-168, score-0.237]

82 A recursive call of the Dual-tree implementation has the following job: for x_q ∈ QNODE, compute the contribution to x_q's summed weights that are due to all points in DNODE. [sent-169, score-0.212]

83 Once again, before doing any real work we use simple rectangle geometry to compute the shortest and furthest possible distances between any (x_q, x_d) pair. [sent-170, score-0.049]

84 This bounds the minimum and maximum possible values of K w(|x_q - x_d|). [sent-171, score-0.031]

85 If these bounds are tight enough (according to an approximation parameter ε) we prune by simply distributing the midpoint weight to all the points in QNODE. [sent-172, score-0.21]
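
Sentences 80 through 85 describe the approximation prune for kernel density estimation. The sketch below keeps the same Node assumptions plus an assumed field indices recording which points a node owns; the absolute tolerance test on the weight bounds is an illustrative stand-in for the paper's criterion, and a Gaussian is one choice of monotonically decreasing weighting function.

import numpy as np

def kde_dual(qnode, dnode, w, eps, sums):
    # sums[i] accumulates the unnormalized density at query point i
    w_hi = w(node_d_min(qnode, dnode))    # w is decreasing, so d_min gives the upper bound
    w_lo = w(node_d_max(qnode, dnode))
    if w_hi - w_lo <= eps:                # bounds tight enough: approximation prune
        mid = 0.5 * (w_hi + w_lo)
        for i in qnode.indices:           # credit every query point in QNODE at once
            sums[i] += dnode.count * mid
        return
    if qnode.left is None and dnode.left is None:
        for i, xq in zip(qnode.indices, qnode.points):
            for xd in dnode.points:
                sums[i] += w(np.linalg.norm(xq - xd))
        return
    qs = [qnode] if qnode.left is None else [qnode.left, qnode.right]
    ds = [dnode] if dnode.left is None else [dnode.left, dnode.right]
    for q in qs:
        for dn in ds:
            kde_dual(q, dn, w, eps, sums)

# example weighting function for bandwidth h:  lambda dist: np.exp(-0.5 * (dist / h) ** 2)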

86 Figure 5: (a) Runtimes for approximate n-point correlation with ε = 0. [sent-173, score-0.121]

87 (c) Runtimes for exact n-point, run on 2000 datapoints of galaxies in d-dimensional color space. [sent-177, score-0.121]

88 4 The n-point correlation, for n > 2: The n-point correlation is the generalization of the 2-point correlation, which counts the number of n-tuples of points lying within radius r of each other, or more generally, between some r_min and r_max. [sent-178, score-0.401]

89 Even for algorithms, such as 2-point, that return exact counts, bounded approximation is possible. [sent-182, score-0.162]

90 Suppose the true value of the 2-point function is V*, but that we can tolerate a fractional error of ε: we'll accept any value V such that |V - V*| < εV*. [sent-183, score-0.031]

91 5 Outlier detection, nearest neighbors, and other problems: One of the main intents of this paper is to point out the broad applicability of this type of algorithm within statistical learning. [sent-185, score-0.271]

92 Figure 3 shows performance results for our outlier detection and nearest neighbors algorithms. [sent-186, score-0.337]

93 Figure 6 lists many N-body problems which are clear candidates for acceleration in future work. [sent-187, score-0.111]

94 The n-point correlation is useful for detailed characterizations of mass distributions (including galaxies and biomasses). [sent-188, score-0.171]

95 For example, the three-point correlation, which measures the number of triplets of points meeting the specified geometric constraints, can distinguish between two distributions that have the same 2-point correlations but differ in their degree of "stripiness" versus "spottiness". [sent-190, score-0.216]

96 In our nearest-neighbors algorithm we consider the problem of finding, for each query point, its single nearest neighbor among the data points. [sent-191, score-0.554]

97 The methods are easily generalized to the case of finding the k nearest neighbors, as in k-NN classification and locally weighted regression. [sent-193, score-0.095]

98 Outlier detection is one of the most common statistical operations encountered in data analysis. [sent-194, score-0.1]

99 We present here a natural operation which might be used directly for outlier detection, or within another procedure: for each of the points, find the number of other points that are within distance r of it; those having zero neighbors within r are defined as outliers. [sent-196, score-0.465]
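
The operation in sentence 99 reduces to a per-point range count. A tiny sketch using the single-tree range_count from the earlier fragment (the timings in Figure 3 use the dual-tree formulation); the count includes the point itself, hence the comparison with 1.

def find_outliers(points, root, r):
    return [x for x in points if range_count(x, root, r) <= 1]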

100 An algorithm for finding best matches in logarithmic expected time. [sent-221, score-0.038]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('qnode', 0.376), ('dnode', 0.301), ('twopoint', 0.276), ('optional', 0.226), ('yes', 0.221), ('query', 0.205), ('node', 0.165), ('pruning', 0.163), ('radii', 0.151), ('runtimes', 0.151), ('count', 0.145), ('speedup', 0.13), ('correlation', 0.121), ('outlier', 0.117), ('points', 0.113), ('geometric', 0.103), ('search', 0.096), ('nearest', 0.095), ('traversal', 0.086), ('counts', 0.077), ('cached', 0.075), ('leftchild', 0.075), ('notleaf', 0.075), ('rightchild', 0.075), ('xq', 0.075), ('le', 0.073), ('neighbors', 0.073), ('batch', 0.073), ('af', 0.073), ('nodes', 0.069), ('problems', 0.068), ('attributes', 0.065), ('naively', 0.065), ('exclusion', 0.065), ('prune', 0.065), ('multiresolution', 0.065), ('quadratic', 0.063), ('outliers', 0.061), ('pairs', 0.061), ('recursive', 0.061), ('faster', 0.06), ('bounding', 0.059), ('leaf', 0.054), ('algorithms', 0.053), ('radius', 0.052), ('detection', 0.052), ('arning', 0.05), ('binning', 0.05), ('counted', 0.05), ('eturn', 0.05), ('galaxies', 0.05), ('nond', 0.05), ('prunes', 0.05), ('rectangles', 0.05), ('subsumption', 0.05), ('distances', 0.049), ('distance', 0.048), ('data', 0.048), ('moore', 0.046), ('bins', 0.046), ('kernel', 0.043), ('acceleration', 0.043), ('terminates', 0.043), ('chunks', 0.043), ('xdl', 0.043), ('echniques', 0.043), ('datum', 0.043), ('sky', 0.043), ('accelerating', 0.043), ('total', 0.043), ('statistics', 0.04), ('dataset', 0.039), ('datapoints', 0.039), ('farthest', 0.039), ('astronomical', 0.039), ('twelfth', 0.039), ('algorithm', 0.038), ('within', 0.038), ('exploiting', 0.038), ('call', 0.038), ('naive', 0.037), ('empirically', 0.037), ('children', 0.036), ('aaai', 0.036), ('records', 0.036), ('returning', 0.036), ('computing', 0.036), ('clustering', 0.036), ('calls', 0.034), ('grid', 0.034), ('perform', 0.033), ('exact', 0.032), ('covariances', 0.032), ('simply', 0.032), ('point', 0.032), ('cost', 0.032), ('minimum', 0.031), ('accept', 0.031), ('harder', 0.031), ('divide', 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 148 nips-2000-`N-Body' Problems in Statistical Learning

Author: Alexander G. Gray, Andrew W. Moore

Abstract: We present efficient algorithms for all-point-pairs problems, or 'N-body'-like problems, which are ubiquitous in statistical learning. We focus on six examples, including nearest-neighbor classification, kernel density estimation, outlier detection, and the two-point correlation. These include any problem which abstractly requires a comparison of each of the N points in a dataset with each other point and would naively be solved using N^2 distance computations. In practice N is often large enough to make this infeasible. We present a suite of new geometric techniques which are applicable in principle to any 'N-body' computation, including large-scale mixtures of Gaussians, RBF neural networks, and HMMs. Our algorithms exhibit favorable asymptotic scaling and are empirically several orders of magnitude faster than the naive computation, even for small datasets. We are aware of no exact algorithms for these problems which are more efficient either empirically or theoretically. In addition, our framework yields simple and elegant algorithms. It also permits two important generalizations beyond the standard all-point-pairs problems, which are more difficult. These are represented by our final examples, the multiple two-point correlation and the notorious n-point correlation.

2 0.14261691 17 nips-2000-Active Learning for Parameter Estimation in Bayesian Networks

Author: Simon Tong, Daphne Koller

Abstract: Bayesian networks are graphical representations of probability distributions. In virtually all of the work on learning these networks, the assumption is that we are presented with a data set consisting of randomly generated instances from the underlying distribution. In many situations, however, we also have the option of active learning, where we have the possibility of guiding the sampling process by querying for certain types of samples. This paper addresses the problem of estimating the parameters of Bayesian networks in an active learning setting. We provide a theoretical framework for this problem, and an algorithm that chooses which active learning queries to generate based on the model learned so far. We present experimental results showing that our active learning algorithm can significantly reduce the need for training data in many situations.

3 0.088462383 1 nips-2000-APRICODD: Approximate Policy Construction Using Decision Diagrams

Author: Robert St-Aubin, Jesse Hoey, Craig Boutilier

Abstract: We propose a method of approximate dynamic programming for Markov decision processes (MDPs) using algebraic decision diagrams (ADDs). We produce near-optimal value functions and policies with much lower time and space requirements than exact dynamic programming. Our method reduces the sizes of the intermediate value functions generated during value iteration by replacing the values at the terminals of the ADD with ranges of values. Our method is demonstrated on a class of large MDPs (with up to 34 billion states), and we compare the results with the optimal value functions.

4 0.082573511 140 nips-2000-Tree-Based Modeling and Estimation of Gaussian Processes on Graphs with Cycles

Author: Martin J. Wainwright, Erik B. Sudderth, Alan S. Willsky

Abstract: We present the embedded trees algorithm, an iterative technique for estimation of Gaussian processes defined on arbitrary graphs. By exactly solving a series of modified problems on embedded spanning trees, it computes the conditional means with an efficiency comparable to or better than other techniques. Unlike other methods, the embedded trees algorithm also computes exact error covariances. The error covariance computation is most efficient for graphs in which removing a small number of edges reveals an embedded tree. In this context, we demonstrate that sparse loopy graphs can provide a significant increase in modeling power relative to trees, with only a minor increase in estimation complexity. 1

5 0.076056078 4 nips-2000-A Linear Programming Approach to Novelty Detection

Author: Colin Campbell, Kristin P. Bennett

Abstract: Novelty detection involves modeling the normal behaviour of a system hence enabling detection of any divergence from normality. It has potential applications in many areas such as detection of machine damage or highlighting abnormal features in medical data. One approach is to build a hypothesis estimating the support of the normal data i. e. constructing a function which is positive in the region where the data is located and negative elsewhere. Recently kernel methods have been proposed for estimating the support of a distribution and they have performed well in practice - training involves solution of a quadratic programming problem. In this paper we propose a simpler kernel method for estimating the support based on linear programming. The method is easy to implement and can learn large datasets rapidly. We demonstrate the method on medical and fault detection datasets. 1 Introduction. An important classification task is the ability to distinguish b etween new instances similar to m embers of the training set and all other instances that can occur. For example, we may want to learn the normal running behaviour of a machine and highlight any significant divergence from normality which may indicate onset of damage or faults. This issue is a generic problem in many fields. For example, an abnormal event or feature in medical diagnostic data typically leads to further investigation. Novel events can be highlighted by constructing a real-valued density estimation function. However, here we will consider the simpler task of modelling the support of a data distribution i.e. creating a binary-valued function which is positive in those regions of input space where the data predominantly lies and negative elsewhere. Recently kernel methods have been applied to this problem [4]. In this approach data is implicitly mapped to a high-dimensional space called feature space [13]. Suppose the data points in input space are X i (with i = 1, . . . , m) and the mapping is Xi --+ ¢;(Xi) then in the span of {¢;(Xi)}, we can expand a vector w = Lj cr.j¢;(Xj). Hence we can define separating hyperplanes in feature space by w . ¢;(x;) + b = O. We will refer to w . ¢;(Xi) + b as the margin which will be positive on one side of the separating hyperplane and negative on the other. Thus we can also define a decision function: (1) where z is a new data point. The data appears in the form of an inner product in feature space so we can implicitly define feature space by our choice of kernel function: (2) A number of choices for the kernel are possible, for example, RBF kernels: (3) With the given kernel the decision function is therefore given by: (4) One approach to novelty detection is to find a hypersphere in feature space with a minimal radius R and centre a which contains most of the data: novel test points lie outside the boundary of this hypersphere [3 , 12] . This approach to novelty detection was proposed by Tax and Duin [10] and successfully used on real life applications [11] . The effect of outliers is reduced by using slack variables to allow for datapoints outside the sphere and the task is to minimise the volume of the sphere and number of datapoints outside i.e. e i mIll s.t. [R2 + oX L i ei 1 (Xi - a) . (Xi - a) S R2 + e ei i, ~ a (5) Since the data appears in the form of inner products kernel substitution can be applied and the learning task can be reduced to a quadratic programming problem. An alternative approach has been developed by Scholkopf et al. [7]. 
Suppose we restricted our attention to RBF kernels (3) then the data lies on the surface of a hypersphere in feature space since ¢;(x) . ¢;(x) = K(x , x) = l. The objective is therefore to separate off the surface region constaining data from the region containing no data. This is achieved by constructing a hyperplane which is maximally distant from the origin with all datapoints lying on the opposite side from the origin and such that the margin is positive. The learning task in dual form involves minimisation of: mIll s.t. W(cr.) = t L7,'k=l cr.icr.jK(Xi, Xj) a S cr.i S C, L::1 cr.i = l. (6) However, the origin plays a special role in this model. As the authors point out [9] this is a disadvantage since the origin effectively acts as a prior for where the class of abnormal instances is assumed to lie. In this paper we avoid this problem: rather than repelling the hyperplane away from an arbitrary point outside the data distribution we instead try and attract the hyperplane towards the centre of the data distribution. In this paper we will outline a new algorithm for novelty detection which can be easily implemented using linear programming (LP) techniques. As we illustrate in section 3 it performs well in practice on datasets involving the detection of abnormalities in medical data and fault detection in condition monitoring. 2 The Algorithm For the hard margin case (see Figure 1) the objective is to find a surface in input space which wraps around the data clusters: anything outside this surface is viewed as abnormal. This surface is defined as the level set, J(z) = 0, of some nonlinear function. In feature space, J(z) = L; O'.;K(z, x;) + b, this corresponds to a hyperplane which is pulled onto the mapped datapoints with the restriction that the margin always remains positive or zero. We make the fit of this nonlinear function or hyperplane as tight as possible by minimizing the mean value of the output of the function, i.e., Li J(x;). This is achieved by minimising: (7) subject to: m LO'.jK(x;,Xj) + b 2:: 0 (8) j=l m L 0'.; = 1, 0'.; 2:: 0 (9) ;=1 The bias b is just treated as an additional parameter in the minimisation process though unrestricted in sign. The added constraints (9) on 0'. bound the class of models to be considered - we don't want to consider simple linear rescalings of the model. These constraints amount to a choice of scale for the weight vector normal to the hyperplane in feature space and hence do not impose a restriction on the model. Also, these constraints ensure that the problem is well-posed and that an optimal solution with 0'. i- 0 exists. Other constraints on the class of functions are possible, e.g. 110'.111 = 1 with no restriction on the sign of O'.i. Many real-life datasets contain noise and outliers. To handle these we can introduce a soft margin in analogy to the usual approach used with support vector machines. In this case we minimise: (10) subject to: m LO:jJ{(Xi , Xj)+b~-ei' ei~O (11) j=l and constraints (9). The parameter). controls the extent of margin errors (larger ). means fewer outliers are ignored: ). -+ 00 corresponds to the hard margin limit). The above problem can be easily solved for problems with thousands of points using standard simplex or interior point algorithms for linear programming. With the addition of column generation techniques, these same approaches can be adopted for very large problems in which the kernel matrix exceeds the capacity of main memory. 
Column generation algorithms incrementally add and drop columns each corresponding to a single kernel function until optimality is reached. Such approaches have been successfully applied to other support vector problems [6 , 2]. Basic simplex algorithms were sufficient for the problems considered in this paper, so we defer a listing of the code for column generation to a later paper together with experiments on large datasets [1]. 3 Experiments Artificial datasets. Before considering experiments on real-life data we will first illustrate the performance of the algorithm on some artificial datasets. In Figure 1 the algorithm places a boundary around two data clusters in input space: a hard margin was used with RBF kernels and (J

6 0.075757109 41 nips-2000-Discovering Hidden Variables: A Structure-Based Approach

7 0.067213476 23 nips-2000-An Adaptive Metric Machine for Pattern Classification

8 0.066265225 38 nips-2000-Data Clustering by Markovian Relaxation and the Information Bottleneck Method

9 0.060542852 114 nips-2000-Second Order Approximations for Probability Models

10 0.059279203 103 nips-2000-Probabilistic Semantic Video Indexing

11 0.05616878 130 nips-2000-Text Classification using String Kernels

12 0.055973481 18 nips-2000-Active Support Vector Machine Classification

13 0.053393964 12 nips-2000-A Support Vector Method for Clustering

14 0.053292025 120 nips-2000-Sparse Greedy Gaussian Process Regression

15 0.05133795 133 nips-2000-The Kernel Gibbs Sampler

16 0.051326849 54 nips-2000-Feature Selection for SVMs

17 0.050252989 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting

18 0.049735479 134 nips-2000-The Kernel Trick for Distances

19 0.049223065 26 nips-2000-Automated State Abstraction for Options using the U-Tree Algorithm

20 0.048609242 44 nips-2000-Efficient Learning of Linear Perceptrons


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.186), (1, 0.039), (2, 0.058), (3, -0.028), (4, -0.001), (5, 0.055), (6, 0.052), (7, 0.025), (8, -0.063), (9, 0.166), (10, -0.02), (11, 0.13), (12, 0.028), (13, -0.033), (14, 0.081), (15, 0.004), (16, -0.039), (17, 0.081), (18, -0.104), (19, 0.101), (20, -0.096), (21, -0.061), (22, 0.002), (23, -0.041), (24, 0.032), (25, 0.034), (26, 0.093), (27, 0.13), (28, -0.018), (29, -0.15), (30, 0.108), (31, -0.175), (32, -0.11), (33, 0.154), (34, 0.157), (35, -0.126), (36, -0.188), (37, -0.103), (38, 0.074), (39, -0.078), (40, -0.194), (41, -0.12), (42, -0.024), (43, 0.024), (44, 0.039), (45, 0.014), (46, 0.004), (47, -0.023), (48, 0.034), (49, -0.04)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94481528 148 nips-2000-`N-Body' Problems in Statistical Learning

Author: Alexander G. Gray, Andrew W. Moore

Abstract: We present efficient algorithms for all-point-pairs problems , or 'Nbody '-like problems , which are ubiquitous in statistical learning. We focus on six examples, including nearest-neighbor classification, kernel density estimation, outlier detection , and the two-point correlation. These include any problem which abstractly requires a comparison of each of the N points in a dataset with each other point and would naively be solved using N 2 distance computations. In practice N is often large enough to make this infeasible. We present a suite of new geometric t echniques which are applicable in principle to any 'N-body ' computation including large-scale mixtures of Gaussians, RBF neural networks, and HMM 's. Our algorithms exhibit favorable asymptotic scaling and are empirically several orders of magnitude faster than the naive computation, even for small datasets. We are aware of no exact algorithms for these problems which are more efficient either empirically or theoretically. In addition, our framework yields simple and elegant algorithms. It also permits two important generalizations beyond the standard all-point-pairs problems, which are more difficult. These are represented by our final examples, the multiple two-point correlation and the notorious n-point correlation. 1

2 0.58763123 17 nips-2000-Active Learning for Parameter Estimation in Bayesian Networks

Author: Simon Tong, Daphne Koller

Abstract: Bayesian networks are graphical representations of probability distributions. In virtually all of the work on learning these networks, the assumption is that we are presented with a data set consisting of randomly generated instances from the underlying distribution. In many situations, however, we also have the option of active learning, where we have the possibility of guiding the sampling process by querying for certain types of samples. This paper addresses the problem of estimating the parameters of Bayesian networks in an active learning setting. We provide a theoretical framework for this problem, and an algorithm that chooses which active learning queries to generate based on the model learned so far. We present experimental results showing that our active learning algorithm can significantly reduce the need for training data in many situations.

3 0.48167941 1 nips-2000-APRICODD: Approximate Policy Construction Using Decision Diagrams

Author: Robert St-Aubin, Jesse Hoey, Craig Boutilier

Abstract: We propose a method of approximate dynamic programming for Markov decision processes (MDPs) using algebraic decision diagrams (ADDs). We produce near-optimal value functions and policies with much lower time and space requirements than exact dynamic programming. Our method reduces the sizes of the intermediate value functions generated during value iteration by replacing the values at the terminals of the ADD with ranges of values. Our method is demonstrated on a class of large MDPs (with up to 34 billion states), and we compare the results with the optimal value functions.

4 0.44325066 23 nips-2000-An Adaptive Metric Machine for Pattern Classification

Author: Carlotta Domeniconi, Jing Peng, Dimitrios Gunopulos

Abstract: Nearest neighbor classification assumes locally constant class conditional probabilities. This assumption becomes invalid in high dimensions with finite samples due to the curse of dimensionality. Severe bias can be introduced under these conditions when using the nearest neighbor rule. We propose a locally adaptive nearest neighbor classification method to try to minimize bias. We use a Chi-squared distance analysis to compute a flexible metric for producing neighborhoods that are elongated along less relevant feature dimensions and constricted along most influential ones. As a result, the class conditional probabilities tend to be smoother in the modified neighborhoods, whereby better classification performance can be achieved. The efficacy of our method is validated and compared against other techniques using a variety of real world data. 1

5 0.41159451 140 nips-2000-Tree-Based Modeling and Estimation of Gaussian Processes on Graphs with Cycles

Author: Martin J. Wainwright, Erik B. Sudderth, Alan S. Willsky

Abstract: We present the embedded trees algorithm, an iterative technique for estimation of Gaussian processes defined on arbitrary graphs. By exactly solving a series of modified problems on embedded spanning trees, it computes the conditional means with an efficiency comparable to or better than other techniques. Unlike other methods, the embedded trees algorithm also computes exact error covariances. The error covariance computation is most efficient for graphs in which removing a small number of edges reveals an embedded tree. In this context, we demonstrate that sparse loopy graphs can provide a significant increase in modeling power relative to trees, with only a minor increase in estimation complexity. 1

6 0.41093671 41 nips-2000-Discovering Hidden Variables: A Structure-Based Approach

7 0.41075948 93 nips-2000-On Iterative Krylov-Dogleg Trust-Region Steps for Solving Neural Networks Nonlinear Least Squares Problems

8 0.40325058 44 nips-2000-Efficient Learning of Linear Perceptrons

9 0.39321682 18 nips-2000-Active Support Vector Machine Classification

10 0.31421405 38 nips-2000-Data Clustering by Markovian Relaxation and the Information Bottleneck Method

11 0.30353406 4 nips-2000-A Linear Programming Approach to Novelty Detection

12 0.27606598 48 nips-2000-Exact Solutions to Time-Dependent MDPs

13 0.27568403 60 nips-2000-Gaussianization

14 0.25805759 139 nips-2000-The Use of MDL to Select among Computational Models of Cognition

15 0.25467965 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling

16 0.24672894 29 nips-2000-Bayes Networks on Ice: Robotic Search for Antarctic Meteorites

17 0.23713902 21 nips-2000-Algorithmic Stability and Generalization Performance

18 0.23687212 120 nips-2000-Sparse Greedy Gaussian Process Regression

19 0.23460492 114 nips-2000-Second Order Approximations for Probability Models

20 0.23453234 56 nips-2000-Foundations for a Circuit Complexity Theory of Sensory Processing


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.018), (17, 0.096), (33, 0.532), (54, 0.019), (55, 0.02), (62, 0.037), (65, 0.012), (67, 0.034), (75, 0.017), (76, 0.028), (81, 0.02), (90, 0.048), (97, 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.94237429 65 nips-2000-Higher-Order Statistical Properties Arising from the Non-Stationarity of Natural Signals

Author: Lucas C. Parra, Clay Spence, Paul Sajda

Abstract: We present evidence that several higher-order statistical properties of natural images and signals can be explained by a stochastic model which simply varies scale of an otherwise stationary Gaussian process. We discuss two interesting consequences. The first is that a variety of natural signals can be related through a common model of spherically invariant random processes, which have the attractive property that the joint densities can be constructed from the one dimensional marginal. The second is that in some cases the non-stationarity assumption and only second order methods can be explicitly exploited to find a linear basis that is equivalent to independent components obtained with higher-order methods. This is demonstrated on spectro-temporal components of speech. 1

same-paper 2 0.93797708 148 nips-2000-`N-Body' Problems in Statistical Learning

Author: Alexander G. Gray, Andrew W. Moore

Abstract: We present efficient algorithms for all-point-pairs problems , or 'Nbody '-like problems , which are ubiquitous in statistical learning. We focus on six examples, including nearest-neighbor classification, kernel density estimation, outlier detection , and the two-point correlation. These include any problem which abstractly requires a comparison of each of the N points in a dataset with each other point and would naively be solved using N 2 distance computations. In practice N is often large enough to make this infeasible. We present a suite of new geometric t echniques which are applicable in principle to any 'N-body ' computation including large-scale mixtures of Gaussians, RBF neural networks, and HMM 's. Our algorithms exhibit favorable asymptotic scaling and are empirically several orders of magnitude faster than the naive computation, even for small datasets. We are aware of no exact algorithms for these problems which are more efficient either empirically or theoretically. In addition, our framework yields simple and elegant algorithms. It also permits two important generalizations beyond the standard all-point-pairs problems, which are more difficult. These are represented by our final examples, the multiple two-point correlation and the notorious n-point correlation. 1

3 0.93219906 58 nips-2000-From Margin to Sparsity

Author: Thore Graepel, Ralf Herbrich, Robert C. Williamson

Abstract: We present an improvement of Novikoff's perceptron convergence theorem. Reinterpreting this mistake bound as a margin dependent sparsity guarantee allows us to give a PAC-style generalisation error bound for the classifier learned by the perceptron learning algorithm. The bound value crucially depends on the margin a support vector machine would achieve on the same data set using the same kernel. Ironically, the bound yields better guarantees than are currently available for the support vector solution itself. 1

4 0.58346111 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

Author: Claudio Gentile

Abstract: A new incremental learning algorithm is described which approximates the maximal margin hyperplane w.r.t. norm p ~ 2 for a set of linearly separable data. Our algorithm, called ALMAp (Approximate Large Margin algorithm w.r.t. norm p), takes 0 ((P~21;;2) corrections to separate the data with p-norm margin larger than (1 - 0:) ,,(, where,,( is the p-norm margin of the data and X is a bound on the p-norm of the instances. ALMAp avoids quadratic (or higher-order) programming methods. It is very easy to implement and is as fast as on-line algorithms, such as Rosenblatt's perceptron. We report on some experiments comparing ALMAp to two incremental algorithms: Perceptron and Li and Long's ROMMA. Our algorithm seems to perform quite better than both. The accuracy levels achieved by ALMAp are slightly inferior to those obtained by Support vector Machines (SVMs). On the other hand, ALMAp is quite faster and easier to implement than standard SVMs training algorithms.

5 0.56430542 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

Author: Ralf Herbrich, Thore Graepel

Abstract: We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC- Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets. 1

6 0.51262355 131 nips-2000-The Early Word Catches the Weights

7 0.49739483 37 nips-2000-Convergence of Large Margin Separable Linear Classification

8 0.48596939 91 nips-2000-Noise Suppression Based on Neurophysiologically-motivated SNR Estimation for Robust Speech Recognition

9 0.48532423 75 nips-2000-Large Scale Bayes Point Machines

10 0.47781157 123 nips-2000-Speech Denoising and Dereverberation Using Probabilistic Models

11 0.47622684 21 nips-2000-Algorithmic Stability and Generalization Performance

12 0.46958399 111 nips-2000-Regularized Winnow Methods

13 0.46499705 94 nips-2000-On Reversing Jensen's Inequality

14 0.45769751 36 nips-2000-Constrained Independent Component Analysis

15 0.45554712 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting

16 0.45468742 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers

17 0.45128372 97 nips-2000-Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping

18 0.45047149 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks

19 0.4437609 89 nips-2000-Natural Sound Statistics and Divisive Normalization in the Auditory System

20 0.43374133 44 nips-2000-Efficient Learning of Linear Perceptrons