nips nips2004 nips2004-51 knowledge-graph by maker-knowledge-mining

51 nips-2004-Detecting Significant Multidimensional Spatial Clusters


Source: pdf

Author: Daniel B. Neill, Andrew W. Moore, Francisco Pereira, Tom M. Mitchell

Abstract: Assume a uniform, multidimensional grid of bivariate data, where each cell of the grid has a count ci and a baseline bi . Our goal is to find spatial regions (d-dimensional rectangles) where the ci are significantly higher than expected given bi . We focus on two applications: detection of clusters of disease cases from epidemiological data (emergency department visits, over-the-counter drug sales), and discovery of regions of increased brain activity corresponding to given cognitive tasks (from fMRI data). Each of these problems can be solved using a spatial scan statistic (Kulldorff, 1997), where we compute the maximum of a likelihood ratio statistic over all spatial regions, and find the significance of this region by randomization. However, computing the scan statistic for all spatial regions is generally computationally infeasible, so we introduce a novel fast spatial scan algorithm, generalizing the 2D scan algorithm of (Neill and Moore, 2004) to arbitrary dimensions. Our new multidimensional multiresolution algorithm allows us to find spatial clusters up to 1400x faster than the naive spatial scan, without any loss of accuracy. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: Assume a uniform, multidimensional grid of bivariate data, where each cell of the grid has a count ci and a baseline bi . [sent-6, score-0.746]

2 Our goal is to find spatial regions (d-dimensional rectangles) where the ci are significantly higher than expected given bi . [sent-7, score-0.529]

3 We focus on two applications: detection of clusters of disease cases from epidemiological data (emergency department visits, over-the-counter drug sales), and discovery of regions of increased brain activity corresponding to given cognitive tasks (from fMRI data). [sent-8, score-0.612]

4 Each of these problems can be solved using a spatial scan statistic (Kulldorff, 1997), where we compute the maximum of a likelihood ratio statistic over all spatial regions, and find the significance of this region by randomization. [sent-9, score-1.712]

5 However, computing the scan statistic for all spatial regions is generally computationally infeasible, so we introduce a novel fast spatial scan algorithm, generalizing the 2D scan algorithm of (Neill and Moore, 2004) to arbitrary dimensions. [sent-10, score-2.271]

6 Our new multidimensional multiresolution algorithm allows us to find spatial clusters up to 1400x faster than the naive spatial scan, without any loss of accuracy. [sent-11, score-0.649]

7 This is particularly important in epidemiological applications, where a rise in the number of disease cases in a region may or may not be indicative of an emerging epidemic. [sent-14, score-0.521]

8 More generally, we are interested in spatial data mining problems where the goal is detection of overdensities: spatial regions with high counts relative to some underlying baseline. [sent-17, score-0.651]

9 In the brain imaging datasets, our count is the total fMRI activation in a given set of voxels under the experimental condition, while our baseline is the total activation in that set of voxels under the null or control condition. [sent-21, score-0.651]

10 For the fMRI data, we have three spatial dimensions; for the epidemiological data, we have two spatial dimensions but also use several other quantities (time, patients’ age and gender) as “pseudo-spatial” dimensions; this is discussed in more detail below. [sent-23, score-0.687]

11 Each cell si ∈ G (where i is a d-dimensional vector) is associated with a count ci and a baseline bi . [sent-28, score-0.479]

12 Our goal is to search over all d-dimensional rectangular regions S ⊆ G, and find regions where the total count C(S) = ∑S ci is higher than expected, given the baseline B(S) = ∑S bi . [sent-29, score-0.879]

13 As is necessary in the scan statistics framework, we focus on finding the single, most significant region; the method can be iterated (removing each significant cluster once it is found) to find multiple significant regions. [sent-31, score-0.523]

14 Likelihood ratio statistics: Our basic model assumes that counts ci are generated by an inhomogeneous Poisson process with mean qbi , where q (the underlying ratio of count to baseline) may vary spatially. [sent-33, score-0.315]

15 To do so, for a given region S, we assume that q = qin uniformly for cells si ∈ S, and q = qout uniformly for cells si ∈ G − S. [sent-35, score-0.652]

16 If ε = 0, this is equivalent to the classical spatial scan statistic [1-2]: we are testing for regions where qin is greater than qout . [sent-37, score-1.294]

17 For example, a 10% increase in disease cases in some region may not be interesting to epidemiologists, even if the underlying population is large enough to conclude that this is a “real” (statistically significant) increase in q. [sent-39, score-0.394]

18 By increasing ε, we can focus the scan statistic on regions with larger ratios of count to baseline. [sent-40, score-1.003]

19 For example, we can use the scan statistic with ε = 0.25 to test for regions where qin is more than 25% higher than qout . [sent-41, score-0.697] [sent-42, score-0.344]

21 Following Kulldorff [1], our spatial scan statistic is the maximum, over all regions S, of the ratio of the likelihoods under the alternative and null hypotheses. [sent-43, score-1.209]

22 Then the scan statistic Dε,max is equal to the maximum of Dε (S) over all spatial regions (d-dimensional rectangles) under consideration. [sent-45, score-1.094]
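
For concreteness, here is a minimal Python sketch of the classical (ε = 0) region score, assuming the standard Kulldorff Poisson likelihood-ratio form with maximum-likelihood rates qin = C/B inside the region and qout = (Ctot − C)/(Btot − B) outside; the function name and the ε > 0 generalization are not taken from the paper.

    import math

    def kulldorff_llr(c, b, c_tot, b_tot):
        # Log of the classical Poisson scan statistic D(S) for a region with
        # count c and baseline b, given grid totals c_tot and b_tot.
        c_out, b_out = c_tot - c, b_tot - b
        if b <= 0 or b_out <= 0 or c / b <= c_out / b_out:
            return 0.0  # only overdensities (q_in > q_out) score above zero
        llr = -c_tot * math.log(c_tot / b_tot)
        if c > 0:
            llr += c * math.log(c / b)
        if c_out > 0:
            llr += c_out * math.log(c_out / b_out)
        return llr

Dε,max then corresponds to maximizing such a score over all d-dimensional rectangles.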

23 We note that our statistical and computational methods are not limited to the Poisson model given here; any model of null and alternative hypotheses such that the resulting statistic D(S) satisfies the conditions given in [4] can be used for the fast spatial scan. [sent-46, score-0.584]

24 Randomization testing: Once we have found the highest scoring region S∗ = arg maxS D(S) of grid G, we must still determine the statistical significance of this region. [sent-48, score-0.427]

25 More precisely, we pick ci ∼ Po(q bi ), where q = qin = (1 + ε) Ctot / (Btot + εB(S∗ )) for si ∈ S∗ , and q = qout = Ctot / (Btot + εB(S∗ )) for si ∈ G − S∗ . [sent-51, score-0.416]

26 The number of replicas G′ with Dmax (G′ ) ≥ Dmax (G), divided by the total number of replications R, gives us the p-value for our most significant region S∗ . [sent-52, score-0.384]

27 If this p-value is less than the chosen significance level α, we can conclude that the discovered region is statistically significant at level α. [sent-55, score-0.334]
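
A minimal sketch of this Monte Carlo test, assuming a helper find_dmax(counts, baselines) that returns the best region score of a grid (the helper and argument names are illustrative, not the paper's code):

    import numpy as np

    def randomization_pvalue(find_dmax, baselines, in_region, q_in, q_out,
                             d_observed, R=100, seed=0):
        # Draw R replica grids under the null model (ci ~ Poisson(q * bi)),
        # rescan each, and report the fraction whose best score reaches the
        # score Dmax(G) of the original grid.
        rng = np.random.default_rng(seed)
        means = np.where(in_region, q_in, q_out) * baselines
        beats = 0
        for _ in range(R):
            replica_counts = rng.poisson(means)
            # In the fast scan, searching a replica can stop early as soon as
            # any region beats d_observed (see the pruning discussion below).
            if find_dmax(replica_counts, baselines) >= d_observed:
                beats += 1
        return beats / R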

28 The naive spatial scan: The simplest method of finding Dmax is to compute D(S) for all rectangular regions of sizes k1 × k2 × · · · × kd . [sent-57, score-0.992]

29 Since there are a total of ∏_{j=1}^d (Nj − kj + 1) regions of each size, there are a total of O(∏_{j=1}^d Nj^2) regions to examine. [sent-61, score-0.372]

30 We can compute D(S) for any region S in constant time, by first finding the count C(S) and baseline B(S), then computing D. [sent-62, score-0.559]
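
A sketch of the naive scan in two dimensions (the d-dimensional case is analogous); score(c, b) is any region score, e.g. a closure over kulldorff_llr above with the grid totals fixed:

    import numpy as np

    def naive_scan_2d(counts, baselines, score):
        # Score every axis-aligned rectangle: O(N1^2 * N2^2) rectangles, each
        # evaluated in constant time via cumulative-sum (integral-image) arrays.
        N1, N2 = counts.shape
        C = np.zeros((N1 + 1, N2 + 1)); C[1:, 1:] = counts.cumsum(0).cumsum(1)
        B = np.zeros((N1 + 1, N2 + 1)); B[1:, 1:] = baselines.cumsum(0).cumsum(1)
        best, best_region = -np.inf, None
        for i1 in range(N1):
            for i2 in range(i1 + 1, N1 + 1):
                for j1 in range(N2):
                    for j2 in range(j1 + 1, N2 + 1):
                        c = C[i2, j2] - C[i1, j2] - C[i2, j1] + C[i1, j1]
                        b = B[i2, j2] - B[i1, j2] - B[i2, j1] + B[i1, j1]
                        d = score(c, b)
                        if d > best:
                            best, best_region = d, (i1, i2, j1, j2)
        return best, best_region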

31 When the size of the grid is large, as is the case for the epidemiological and fMRI datasets we are considering, this naive approach is computationally infeasible. [sent-65, score-0.383]

32 This reduces the complexity to O(∏_{j=1}^d Nj log Nj) in cases where the most significant region S∗ has a sufficiently high ratio of count to baseline, and (as we show in Section 3) typically results in tens to thousands of times speedup over the naive approach. [sent-67, score-0.502]

33 We note that this fast spatial scan algorithm is exact (always finds the correct value of Dmax and the corresponding region S∗ ); the speedup results from the observation that we do not need to search a given set of regions if we can prove that none of them have score > Dmax . [sent-68, score-1.31]

34 Thus we use a top-down, branch-and-bound approach: we maintain the current maximum score of the regions we have searched so far, calculate upper bounds on the scores of subregions contained in a given region, and prune regions whose upper bounds are less than the current value of Dmax . [sent-69, score-0.817]

35 When searching a replica grid, we care only whether Dmax of the replica grid is greater than Dmax (G). [sent-70, score-0.321]

36 Thus we can use Dmax of the original grid for pruning on the replicas, and can stop searching a replica if we find a region with score > Dmax (G). [sent-71, score-0.637]

37 The overlap-kd tree differs from the standard kd-tree and quadtree in that adjacent regions overlap: rather than splitting the region in half along each dimension, each child contains more than half the area of the parent region. [sent-79, score-0.534]

38 An old trick makes it possible to compute the count and baseline of any rectangular region in time constant in N: we first form a d-dimensional array of the cumulative counts, then compute each region’s count by adding/subtracting at most 2^d cumulative counts. [sent-81, score-0.739]
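
A sketch of that trick in arbitrary dimension d: one pass builds the cumulative array (with a zero border), after which any box sum is an alternating sum over its 2^d corners:

    import itertools
    import numpy as np

    def cumulative_array(a):
        # Cumulative sums along every axis, with a zero border prepended on
        # each axis so that empty prefixes need no special-casing.
        cum = np.pad(a, [(1, 0)] * a.ndim).astype(float)
        for axis in range(a.ndim):
            cum = cum.cumsum(axis=axis)
        return cum

    def box_sum(cum, lo, hi):
        # Sum of cells with lo[j] <= index_j < hi[j] in every dimension j, via
        # inclusion-exclusion over the 2^d corners (time constant in the grid size).
        d = len(lo)
        total = 0.0
        for corner in itertools.product((0, 1), repeat=d):
            idx = tuple(hi[j] if corner[j] else lo[j] for j in range(d))
            total += (-1) ** (d - sum(corner)) * cum[idx]
        return total

For example, box_sum(cumulative_array(counts), (x1, y1, z1), (x2, y2, z2)) returns C(S) for the box [x1, x2) × [y1, y2) × [z1, z2).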

39 Note that because of the exponential dependence on d, these techniques suffer from the “curse of dimensionality”: neither the naive spatial scan nor the fast spatial scan discussed below is appropriate for very high dimensional datasets. [sent-82, score-1.019]

40 Note that there is a region SC common to all of these children; we call this region the center of S. [sent-98, score-0.598]

41 When we partition region S in this manner, it can be proved that any subregion of S either a) is contained entirely in (at least) one of the children S1 , . . . , S2d , or b) contains the center region SC . [sent-99, score-0.4]

42 Figure 1: Overlap-multires partitioning of region S (for d = 2). [sent-104, score-0.338]

43 There may be a large number of such “outer regions,” but since we know that each such region contains the center, we can place very tight bounds on the score of these regions, often allowing us to prune most or all of them. [sent-113, score-0.444]

44 From the overlap-search pseudocode: for all S' such that S' is contained in S and contains S_C, call base-case-search(S'). The fractions fi are selected based on the current sizes ki of the region being searched: if ki = 2^m , then fi = 3/4, and if ki = 3 × 2^m , then fi = 2/3. [sent-118, score-0.444]
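
The following sketch shows one plausible reading of this partition (an interpretation of the description above, not the paper's exact construction): a region has 2d overlapping children, each keeping a fraction fj of the parent along a single dimension j, anchored at either end, and the full extent along every other dimension; the center SC is the part shared by all children.

    def overlap_children(lo, hi):
        # Children of the box [lo, hi): for each dimension j, one child keeps the
        # lower f_j fraction and one keeps the upper f_j fraction of that dimension.
        # Assumes each side length k is either 2^m (f = 3/4) or 3 * 2^m (f = 2/3).
        d = len(lo)
        children = []
        for j in range(d):
            k = hi[j] - lo[j]
            f = 3 * k // 4 if (k & (k - 1)) == 0 else 2 * k // 3
            children.append((lo, tuple(lo[i] + f if i == j else hi[i] for i in range(d))))
            children.append((tuple(hi[i] - f if i == j else lo[i] for i in range(d)), hi))
        return children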

45 For simplicity, we assume that all Ni are powers of two, and thus all region sizes ki will fall into one of these two cases. [sent-119, score-0.332]

46 Each node represents a gridded region (denoted by a thick rectangle) of the entire dataset (thin square and dots). [sent-123, score-0.475]

47 First, for every rectangular region S ⊆ G, either S is a gridded region (contained in the overlap-kd tree), or there exists a unique gridded region S′ such that S is an outer region of S′ (i.e., S is contained in S′ , and contains the center region of S′ ). [sent-125, score-0.45] [sent-127, score-0.345]

49 This means that, if overlap-search is called exactly once for each gridded region (footnote 2), and no pruning is done, then base-case-search will be called exactly once for every rectangular region S ⊆ G. [sent-128, score-0.548]

50 In practice, we will prune many regions, so base-case-search will be called at most once for every rectangular region, and every region will be either searched or pruned. [sent-129, score-0.463]

51 The second nice property of our overlap-kd tree is that the total number of gridded regions is O(∏_{j=1}^d Nj log Nj). [sent-130, score-0.377]

52 Footnote 2: As in [4], we use “lazy expansion” to ensure that gridded regions are not searched multiple times. [sent-133, score-0.328]

53 Score bounds and pruning: We now consider which regions can be pruned (discarded without searching) during our multiresolution search procedure. [sent-135, score-0.366]

54 First, given some region S, we must calculate an upper bound on the scores D(S′ ) for regions S′ ⊂ S. [sent-136, score-0.522]

55 More precisely, we are interested in two upper bounds: a bound on the score of all subregions S′ ⊂ S, and a bound on the score of the outer subregions of S (those regions contained in S and containing its center SC ). [sent-137, score-0.616]

56 If the first bound is less than or equal to Dmax , we can prune region S completely; we do not need to search any (gridded or outer) subregion of S. [sent-138, score-0.469]

57 If only the second bound is less than or equal to Dmax , we do not need to search the outer subregions of S, but we must recursively call overlap-search on the gridded children of S. [sent-139, score-0.446]
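
The resulting control flow can be sketched as below; bounds(S), outer_subregions(S), gridded_children(S), and score(S) are assumed helpers standing in for the quantities described above, not functions from the paper.

    def overlap_search(S, d_max, score, bounds, gridded_children, outer_subregions):
        # One branch-and-bound step of the multiresolution search (a sketch).
        all_bound, outer_bound = bounds(S)   # upper bounds on D over subregions of S
        if all_bound <= d_max:
            return d_max                     # prune S and everything inside it
        if outer_bound > d_max:
            # base-case-search: score every region contained in S that contains S_C
            for S_outer in outer_subregions(S):
                d_max = max(d_max, score(S_outer))
        for child in gridded_children(S):    # recurse on the overlapping children
            d_max = overlap_search(child, d_max, score, bounds,
                                   gridded_children, outer_subregions)
        return d_max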

58 We also know the count C and baseline B of region S, and the count ccenter and baseline bcenter of region SC . [sent-142, score-1.15]

59 To find an upper bound on D(S′ ), we must calculate the values of cin and bin which maximize D subject to the given constraints: (cin − ccenter )/(bin − bcenter ) ≤ dinc , cin /bin ≤ dmax , (C − cin )/(B − bin ) ≥ dmin , and bmin ≤ bin ≤ bmax . [sent-144, score-0.718]

60 The solution to this maximization problem is derived in [4], and (since scores are based only on count and baseline rather than the size and shape of the region) it applies directly to the multidimensional case. [sent-145, score-0.321]

61 Related work: Our work builds most directly on the results of Kulldorff [1], who presents the two-dimensional spatial scan framework and the classical (ε = 0) likelihood ratio statistic. [sent-149, score-0.754]

62 Our major extensions in the present work are twofold: the d-dimensional fast spatial scan, and the generalized likelihood ratio statistics Dε . [sent-151, score-0.356]

63 Our technique is exact (in that it calculates the maximum of the likelihood ratio statistic over all hyper-rectangular spatial regions), and uses a powerful statistical test to determine significance. [sent-154, score-0.505]

64 Detection of elongated regions is extremely important in both epidemiology (because of the need to detect windborne or waterborne pathogens) and brain imaging (because of the “folded sheet” structure of the brain); the present work, as well as [4], allows detection of such clusters. [sent-158, score-0.437]

65 Results: We now describe results of our fast spatial scan algorithm on three sets of real-world data: two sets of epidemiological data (from emergency department visits and over-the-counter drug sales), and one set of fMRI data. [sent-159, score-0.946]

66 First, the extension of scan statistics from two-dimensional to d-dimensional datasets dramatically increases the scope of problems for which these techniques can be used. [sent-161, score-0.531]

67 In addition to datasets with more than two spatial dimensions (for example, the fMRI data, which consists of a 3D picture of the brain), we can also examine data with a temporal component (as in the OTC dataset), or where we wish to take demographic information into account (as in the ED dataset). [sent-162, score-0.327]

68 Second, in all of these datasets, the use of the broader class of likelihood ratio statistics Dε (instead of only the classical scan statistic ε = 0) allows us to focus our search on smaller, denser regions rather than slight (but statistically significant) increases over a large area. [sent-163, score-1.139]

69 Third, as our results here will demonstrate, the fast spatial scan gains huge performance improvements over the naive approach, making the use of the scan statistic feasible in these large, real-world datasets. [sent-164, score-1.505]

70 Thus we map records into a four-dimensional grid, consisting of two spatial dimensions (longitude, latitude) and two “pseudo-spatial” dimensions (patient gender and age decile). [sent-168, score-0.489]
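
A minimal sketch of that mapping, assuming each record is already encoded as a numeric 4-vector (longitude, latitude, gender code, age decile); the helper name is illustrative:

    import numpy as np

    def records_to_grid(records, edges):
        # records: (n, 4) array of (longitude, latitude, gender, age_decile);
        # edges: list of 4 arrays of bin edges, one per grid dimension.
        counts, _ = np.histogramdd(records, bins=edges)
        return counts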

71 For example, if a disease primarily strikes male infants, we might find a cluster with gender = male and age decile = 0 in some spatial region, and this cluster may not be detectable from the combined data. [sent-171, score-0.627]

72 We used the Dε scan statistic with values of ε ranging from 0 to 1. [sent-176, score-0.697]

73 For the classical scan statistic (ε = 0), we found a region of size 35 × 34 × 2 × 8; thus the most significant region was spatially localized but cut across all genders and age groups. [sent-178, score-1.436]

74 The region had C = 3570 and B = 6409, a far higher ratio of count to baseline than that of the grid as a whole. [sent-179, score-0.299]

75 This was confirmed by the algorithm, which found the region statistically significant (p-value 0/100). [sent-181, score-0.334]

76 With the three other values of ε, the algorithm found almost the same region (35 × 33 × 2 × 8, C = 3566, B = 6390) and again found it statistically significant (p-value 0/100). [sent-182, score-0.334]

77 In this case, our goal was to see if the spatial distribution of sales in a given week (February 7-14, 2004) was significantly different from the spatial distribution of sales during the previous week, and to identify a significant cluster of increased sales if one existed. [sent-190, score-0.988]

78 Compare the space-time scan statistic of [16], which also uses time as a third dimension. [sent-227, score-0.697]

79 The count of a cell was taken to be the number of sales in that spatial region on that day; to adjust for day-of-week effects, the baseline of a cell was taken to be the number of sales in that spatial region on the day one week prior. [sent-229, score-1.725]

80 For this dataset, the classical scan statistic (ε = 0) found a region of size 123 × 76 from February 7-11. [sent-233, score-1.038]

81 Because the ratio of count to baseline inside this region was only slightly higher than outside it, this region would not be interesting to an epidemiologist. [sent-236, score-0.299]

82 Nevertheless, the region was found to be significant (p-value 0/100) because of the large total baseline. [sent-237, score-0.299]

83 Thus, in this case, the classical scan statistic finds a large region of very slight overdensity rather than a smaller, denser region, and so is not as useful for detecting epidemics. [sent-238, score-1.079]

84 For ε = 0.5, the scan statistic found a much more interesting region: a 4 × 1 region on February 9 where C = 882 and B = 240. [sent-241, score-0.996]

85 For ε = 1, the region found was almost the same, consisting of three of these four cells, with C = 825 and B = 190. [sent-244, score-0.299]

86 Again the region was found to be significant (p-value 0/100). [sent-245, score-0.299]

87 The fast scan statistic took between six seconds and four minutes per replication, plus ten minutes to search the original grid, thus obtaining speedups of 48-1400x on the OTC dataset. [sent-247, score-0.95]

88 Each snapshot consists of a 64 × 64 × 16 grid of voxels, with a reading of fMRI activation for the subset of the voxels where brain activity is occurring. [sent-251, score-0.365]

89 In this case, the count of a cell is the fMRI activation for that voxel under the experimental condition, and the baseline is the activation for that voxel under the null condition. [sent-252, score-0.525]

90 For the classical scan statistic (ε = 0) our algorithm found a 23 × 20 × 11 region, and again found this region significant (p-value 0/100). [sent-256, score-1.038]

91 However, this is another example where the classical scan statistic finds a region which is large (about 1/4 of the entire brain) and only slightly increased in count relative to baseline. [sent-257, score-1.038]

92 The fast scan statistic took between 13 seconds and six minutes per replication, thus obtaining speedups of 7-148x on the fMRI dataset. [sent-276, score-0.849]

93 Thus we have demonstrated (through tests on a variety of real-world datasets) that the fast multidimensional spatial scan statistic has significant performance advantages over the naive approach, resulting in speedups up to 1400x without any loss of accuracy. [sent-277, score-1.158]

94 This makes it feasible to apply scan statistics in a variety of application domains, including the spatial and spatio-temporal detection of disease epidemics (taking demographic information into account), as well as the detection of regions of increased brain activity in fMRI data. [sent-278, score-1.212]

95 The generalized likelihood ratio statistics presented here are a first step toward this: by adjusting the parameter ε, we can “tune” the statistic to detect smaller and denser, or larger but less dense, regions as desired, and our statistical significance test is adjusted accordingly. [sent-280, score-0.569]

96 We believe that the combination of fast computational algorithms and more powerful statistical tests presented here will enable the multidimensional spatial scan statistic to be useful in these and many other applications. [sent-281, score-1.031]

97 A fast multi-resolution method for detection of significant spatial disease clusters. [sent-300, score-0.411]

98 STING: a statistical information grid approach to spatial data mining. [sent-383, score-0.339]

99 Evaluating cluster alarms: a space-time scan statistic and brain cancer in Los Alamos. [sent-389, score-0.817]

100 Footnote 3: In a longer run on a different subject, where we iterate the scan statistic to pick out multiple significant regions, we found significant clusters in Broca’s and Wernicke’s areas in addition to the visual cortex. [sent-393, score-0.742]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('scan', 0.452), ('region', 0.299), ('statistic', 0.245), ('spatial', 0.211), ('dmax', 0.211), ('fmri', 0.201), ('regions', 0.186), ('qout', 0.158), ('sales', 0.158), ('gridded', 0.142), ('baseline', 0.14), ('grid', 0.128), ('ctot', 0.127), ('epidemiological', 0.127), ('count', 0.12), ('gender', 0.101), ('age', 0.099), ('subregions', 0.096), ('btot', 0.095), ('qin', 0.095), ('disease', 0.095), ('cant', 0.092), ('naive', 0.083), ('brain', 0.083), ('signi', 0.076), ('replica', 0.075), ('voxels', 0.075), ('bi', 0.069), ('sc', 0.068), ('patient', 0.066), ('null', 0.066), ('outer', 0.065), ('cin', 0.063), ('neill', 0.063), ('po', 0.063), ('ci', 0.063), ('fast', 0.062), ('multidimensional', 0.061), ('rectangular', 0.06), ('prune', 0.06), ('detect', 0.055), ('search', 0.055), ('latitude', 0.055), ('subregion', 0.055), ('week', 0.055), ('children', 0.055), ('bin', 0.054), ('cance', 0.052), ('si', 0.05), ('ratio', 0.049), ('tree', 0.049), ('cough', 0.047), ('decile', 0.047), ('emergency', 0.047), ('kulldorff', 0.047), ('otc', 0.047), ('replicas', 0.047), ('replication', 0.047), ('zip', 0.047), ('visits', 0.047), ('pruning', 0.047), ('contained', 0.046), ('minutes', 0.046), ('activation', 0.046), ('datasets', 0.045), ('clusters', 0.045), ('score', 0.045), ('searched', 0.044), ('speedups', 0.044), ('detection', 0.043), ('searching', 0.043), ('classical', 0.042), ('longitude', 0.041), ('baselines', 0.041), ('denser', 0.041), ('bounds', 0.04), ('partitioning', 0.039), ('dimensions', 0.039), ('multiresolution', 0.038), ('elongated', 0.038), ('replications', 0.038), ('upper', 0.037), ('cell', 0.037), ('cluster', 0.037), ('voxel', 0.035), ('statistically', 0.035), ('statistics', 0.034), ('dataset', 0.034), ('activity', 0.033), ('patients', 0.033), ('recursively', 0.033), ('ki', 0.033), ('bcenter', 0.032), ('bmax', 0.032), ('bmin', 0.032), ('demographic', 0.032), ('dinc', 0.032), ('dmin', 0.032), ('epidemiologists', 0.032), ('epidemiology', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 51 nips-2004-Detecting Significant Multidimensional Spatial Clusters

Author: Daniel B. Neill, Andrew W. Moore, Francisco Pereira, Tom M. Mitchell

Abstract: Assume a uniform, multidimensional grid of bivariate data, where each cell of the grid has a count ci and a baseline bi . Our goal is to find spatial regions (d-dimensional rectangles) where the ci are significantly higher than expected given bi . We focus on two applications: detection of clusters of disease cases from epidemiological data (emergency department visits, over-the-counter drug sales), and discovery of regions of increased brain activity corresponding to given cognitive tasks (from fMRI data). Each of these problems can be solved using a spatial scan statistic (Kulldorff, 1997), where we compute the maximum of a likelihood ratio statistic over all spatial regions, and find the significance of this region by randomization. However, computing the scan statistic for all spatial regions is generally computationally infeasible, so we introduce a novel fast spatial scan algorithm, generalizing the 2D scan algorithm of (Neill and Moore, 2004) to arbitrary dimensions. Our new multidimensional multiresolution algorithm allows us to find spatial clusters up to 1400x faster than the naive spatial scan, without any loss of accuracy. 1

2 0.10583965 54 nips-2004-Distributed Information Regularization on Graphs

Author: Adrian Corduneanu, Tommi S. Jaakkola

Abstract: We provide a principle for semi-supervised learning based on optimizing the rate of communicating labels for unlabeled points with side information. The side information is expressed in terms of identities of sets of points or regions with the purpose of biasing the labels in each region to be the same. The resulting regularization objective is convex, has a unique solution, and the solution can be found with a pair of local propagation operations on graphs induced by the regions. We analyze the properties of the algorithm and demonstrate its performance on document classification tasks. 1

3 0.10530588 186 nips-2004-The Correlated Correspondence Algorithm for Unsupervised Registration of Nonrigid Surfaces

Author: Dragomir Anguelov, Praveen Srinivasan, Hoi-cheung Pang, Daphne Koller, Sebastian Thrun, James Davis

Abstract: We present an unsupervised algorithm for registering 3D surface scans of an object undergoing significant deformations. Our algorithm does not need markers, nor does it assume prior knowledge about object shape, the dynamics of its deformation, or scan alignment. The algorithm registers two meshes by optimizing a joint probabilistic model over all point-topoint correspondences between them. This model enforces preservation of local mesh geometry, as well as more global constraints that capture the preservation of geodesic distance between corresponding point pairs. The algorithm applies even when one of the meshes is an incomplete range scan; thus, it can be used to automatically fill in the remaining surfaces for this partial scan, even if those surfaces were previously only seen in a different configuration. We evaluate the algorithm on several real-world datasets, where we demonstrate good results in the presence of significant movement of articulated parts and non-rigid surface deformation. Finally, we show that the output of the algorithm can be used for compelling computer graphics tasks such as interpolation between two scans of a non-rigid object and automatic recovery of articulated object models. 1

4 0.085258007 139 nips-2004-Optimal Aggregation of Classifiers and Boosting Maps in Functional Magnetic Resonance Imaging

Author: Vladimir Koltchinskii, Manel Martínez-ramón, Stefan Posse

Abstract: We study a method of optimal data-driven aggregation of classifiers in a convex combination and establish tight upper bounds on its excess risk with respect to a convex loss function under the assumption that the solution of optimal aggregation problem is sparse. We use a boosting type algorithm of optimal aggregation to develop aggregate classifiers of activation patterns in fMRI based on locally trained SVM classifiers. The aggregation coefficients are then used to design a ”boosting map” of the brain needed to identify the regions with most significant impact on classification. 1

5 0.076074168 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms

Author: Ting Liu, Andrew W. Moore, Ke Yang, Alexander G. Gray

Abstract: This paper concerns approximate nearest neighbor searching algorithms, which have become increasingly important, especially in high dimensional perception areas such as computer vision, with dozens of publications in recent years. Much of this enthusiasm is due to a successful new approximate nearest neighbor approach called Locality Sensitive Hashing (LSH). In this paper we ask the question: can earlier spatial data structure approaches to exact nearest neighbor, such as metric trees, be altered to provide approximate answers to proximity queries and if so, how? We introduce a new kind of metric tree that allows overlap: certain datapoints may appear in both the children of a parent. We also introduce new approximate k-NN search algorithms on this structure. We show why these structures should be able to exploit the same randomprojection-based approximations that LSH enjoys, but with a simpler algorithm and perhaps with greater efficiency. We then provide a detailed empirical evaluation on five large, high dimensional datasets which show up to 31-fold accelerations over LSH. This result holds true throughout the spectrum of approximation levels. 1

6 0.067010462 201 nips-2004-Using the Equivalent Kernel to Understand Gaussian Process Regression

7 0.057452813 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment

8 0.056060128 117 nips-2004-Methods Towards Invasive Human Brain Computer Interfaces

9 0.054303687 109 nips-2004-Mass Meta-analysis in Talairach Space

10 0.050947718 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition

11 0.048563167 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

12 0.047335088 166 nips-2004-Semi-supervised Learning via Gaussian Processes

13 0.04723594 23 nips-2004-Analysis of a greedy active learning strategy

14 0.047196869 108 nips-2004-Markov Networks for Detecting Overalpping Elements in Sequence Data

15 0.046254594 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection

16 0.043374598 163 nips-2004-Semi-parametric Exponential Family PCA

17 0.039527208 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform

18 0.038536154 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

19 0.038467627 161 nips-2004-Self-Tuning Spectral Clustering

20 0.03845229 62 nips-2004-Euclidean Embedding of Co-Occurrence Data


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.143), (1, 0.015), (2, -0.017), (3, -0.044), (4, 0.016), (5, 0.082), (6, 0.022), (7, 0.057), (8, -0.003), (9, -0.043), (10, -0.03), (11, 0.019), (12, -0.022), (13, -0.033), (14, 0.033), (15, -0.052), (16, 0.064), (17, -0.073), (18, -0.025), (19, 0.028), (20, 0.005), (21, 0.064), (22, -0.037), (23, 0.079), (24, 0.058), (25, 0.02), (26, -0.032), (27, -0.108), (28, -0.013), (29, 0.115), (30, -0.044), (31, -0.004), (32, -0.068), (33, 0.132), (34, 0.105), (35, 0.035), (36, -0.071), (37, -0.041), (38, 0.112), (39, -0.233), (40, 0.008), (41, 0.193), (42, -0.003), (43, -0.205), (44, -0.002), (45, -0.044), (46, 0.001), (47, 0.064), (48, 0.086), (49, -0.099)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97607374 51 nips-2004-Detecting Significant Multidimensional Spatial Clusters

Author: Daniel B. Neill, Andrew W. Moore, Francisco Pereira, Tom M. Mitchell

Abstract: Assume a uniform, multidimensional grid of bivariate data, where each cell of the grid has a count ci and a baseline bi . Our goal is to find spatial regions (d-dimensional rectangles) where the ci are significantly higher than expected given bi . We focus on two applications: detection of clusters of disease cases from epidemiological data (emergency department visits, over-the-counter drug sales), and discovery of regions of increased brain activity corresponding to given cognitive tasks (from fMRI data). Each of these problems can be solved using a spatial scan statistic (Kulldorff, 1997), where we compute the maximum of a likelihood ratio statistic over all spatial regions, and find the significance of this region by randomization. However, computing the scan statistic for all spatial regions is generally computationally infeasible, so we introduce a novel fast spatial scan algorithm, generalizing the 2D scan algorithm of (Neill and Moore, 2004) to arbitrary dimensions. Our new multidimensional multiresolution algorithm allows us to find spatial clusters up to 1400x faster than the naive spatial scan, without any loss of accuracy. 1

2 0.59118485 109 nips-2004-Mass Meta-analysis in Talairach Space

Author: Finn \. Nielsen

Abstract: We provide a method for mass meta-analysis in a neuroinformatics database containing stereotaxic Talairach coordinates from neuroimaging experiments. Database labels are used to group the individual experiments, e.g., according to cognitive function, and the consistent pattern of the experiments within the groups are determined. The method voxelizes each group of experiments via a kernel density estimation, forming probability density volumes. The values in the probability density volumes are compared to null-hypothesis distributions generated by resamplings from the entire unlabeled set of experiments, and the distances to the nullhypotheses are used to sort the voxels across groups of experiments. This allows for mass meta-analysis, with the construction of a list with the most prominent associations between brain areas and group labels. Furthermore, the method can be used for functional labeling of voxels. 1

3 0.4352361 186 nips-2004-The Correlated Correspondence Algorithm for Unsupervised Registration of Nonrigid Surfaces

Author: Dragomir Anguelov, Praveen Srinivasan, Hoi-cheung Pang, Daphne Koller, Sebastian Thrun, James Davis

Abstract: We present an unsupervised algorithm for registering 3D surface scans of an object undergoing significant deformations. Our algorithm does not need markers, nor does it assume prior knowledge about object shape, the dynamics of its deformation, or scan alignment. The algorithm registers two meshes by optimizing a joint probabilistic model over all point-topoint correspondences between them. This model enforces preservation of local mesh geometry, as well as more global constraints that capture the preservation of geodesic distance between corresponding point pairs. The algorithm applies even when one of the meshes is an incomplete range scan; thus, it can be used to automatically fill in the remaining surfaces for this partial scan, even if those surfaces were previously only seen in a different configuration. We evaluate the algorithm on several real-world datasets, where we demonstrate good results in the presence of significant movement of articulated parts and non-rigid surface deformation. Finally, we show that the output of the algorithm can be used for compelling computer graphics tasks such as interpolation between two scans of a non-rigid object and automatic recovery of articulated object models. 1

4 0.43428195 54 nips-2004-Distributed Information Regularization on Graphs

Author: Adrian Corduneanu, Tommi S. Jaakkola

Abstract: We provide a principle for semi-supervised learning based on optimizing the rate of communicating labels for unlabeled points with side information. The side information is expressed in terms of identities of sets of points or regions with the purpose of biasing the labels in each region to be the same. The resulting regularization objective is convex, has a unique solution, and the solution can be found with a pair of local propagation operations on graphs induced by the regions. We analyze the properties of the algorithm and demonstrate its performance on document classification tasks. 1

5 0.40268889 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms

Author: Ting Liu, Andrew W. Moore, Ke Yang, Alexander G. Gray

Abstract: This paper concerns approximate nearest neighbor searching algorithms, which have become increasingly important, especially in high dimensional perception areas such as computer vision, with dozens of publications in recent years. Much of this enthusiasm is due to a successful new approximate nearest neighbor approach called Locality Sensitive Hashing (LSH). In this paper we ask the question: can earlier spatial data structure approaches to exact nearest neighbor, such as metric trees, be altered to provide approximate answers to proximity queries and if so, how? We introduce a new kind of metric tree that allows overlap: certain datapoints may appear in both the children of a parent. We also introduce new approximate k-NN search algorithms on this structure. We show why these structures should be able to exploit the same randomprojection-based approximations that LSH enjoys, but with a simpler algorithm and perhaps with greater efficiency. We then provide a detailed empirical evaluation on five large, high dimensional datasets which show up to 31-fold accelerations over LSH. This result holds true throughout the spectrum of approximation levels. 1

6 0.36213687 139 nips-2004-Optimal Aggregation of Classifiers and Boosting Maps in Functional Magnetic Resonance Imaging

7 0.34271544 199 nips-2004-Using Machine Learning to Break Visual Human Interaction Proofs (HIPs)

8 0.33569837 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition

9 0.33053738 108 nips-2004-Markov Networks for Detecting Overalpping Elements in Sequence Data

10 0.32323304 201 nips-2004-Using the Equivalent Kernel to Understand Gaussian Process Regression

11 0.31543392 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform

12 0.31342787 155 nips-2004-Responding to Modalities with Different Latencies

13 0.31041265 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment

14 0.30591002 75 nips-2004-Heuristics for Ordering Cue Search in Decision Making

15 0.30575651 137 nips-2004-On the Adaptive Properties of Decision Trees

16 0.28537744 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection

17 0.27862999 146 nips-2004-Pictorial Structures for Molecular Modeling: Interpreting Density Maps

18 0.26779905 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension

19 0.26435 61 nips-2004-Efficient Out-of-Sample Extension of Dominant-Set Clusters

20 0.26033443 52 nips-2004-Discrete profile alignment via constrained information bottleneck


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.068), (15, 0.132), (25, 0.02), (26, 0.035), (31, 0.028), (33, 0.21), (35, 0.027), (39, 0.016), (41, 0.314), (50, 0.037)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.90172809 146 nips-2004-Pictorial Structures for Molecular Modeling: Interpreting Density Maps

Author: Frank Dimaio, George Phillips, Jude W. Shavlik

Abstract: X-ray crystallography is currently the most common way protein structures are elucidated. One of the most time-consuming steps in the crystallographic process is interpretation of the electron density map, a task that involves finding patterns in a three-dimensional picture of a protein. This paper describes DEFT (DEFormable Template), an algorithm using pictorial structures to build a flexible protein model from the protein's amino-acid sequence. Matching this pictorial structure into the density map is a way of automating density-map interpretation. Also described are several extensions to the pictorial structure matching algorithm necessary for this automated interpretation. DEFT is tested on a set of density maps ranging from 2 to 4Å resolution, producing rootmean-squared errors ranging from 1.38 to 1.84Å. 1 In trod u ction An important question in molecular biology is what is the structure of a particular protein? Knowledge of a protein’s unique conformation provides insight into the mechanisms by which a protein acts. However, no algorithm exists that accurately maps sequence to structure, and one is forced to use

same-paper 2 0.81954312 51 nips-2004-Detecting Significant Multidimensional Spatial Clusters

Author: Daniel B. Neill, Andrew W. Moore, Francisco Pereira, Tom M. Mitchell

Abstract: Assume a uniform, multidimensional grid of bivariate data, where each cell of the grid has a count ci and a baseline bi . Our goal is to find spatial regions (d-dimensional rectangles) where the ci are significantly higher than expected given bi . We focus on two applications: detection of clusters of disease cases from epidemiological data (emergency department visits, over-the-counter drug sales), and discovery of regions of increased brain activity corresponding to given cognitive tasks (from fMRI data). Each of these problems can be solved using a spatial scan statistic (Kulldorff, 1997), where we compute the maximum of a likelihood ratio statistic over all spatial regions, and find the significance of this region by randomization. However, computing the scan statistic for all spatial regions is generally computationally infeasible, so we introduce a novel fast spatial scan algorithm, generalizing the 2D scan algorithm of (Neill and Moore, 2004) to arbitrary dimensions. Our new multidimensional multiresolution algorithm allows us to find spatial clusters up to 1400x faster than the naive spatial scan, without any loss of accuracy. 1

3 0.63974559 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

Author: Xiaojin Zhu, Jaz Kandola, Zoubin Ghahramani, John D. Lafferty

Abstract: We present an algorithm based on convex optimization for constructing kernels for semi-supervised learning. The kernel matrices are derived from the spectral decomposition of graph Laplacians, and combine labeled and unlabeled data in a systematic fashion. Unlike previous work using diffusion kernels and Gaussian random field kernels, a nonparametric kernel approach is presented that incorporates order constraints during optimization. This results in flexible kernels and avoids the need to choose among different parametric forms. Our approach relies on a quadratically constrained quadratic program (QCQP), and is computationally feasible for large datasets. We evaluate the kernels on real datasets using support vector machines, with encouraging results. 1

4 0.63695186 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering

Author: Zhengdong Lu, Todd K. Leen

Abstract: While clustering is usually an unsupervised operation, there are circumstances in which we believe (with varying degrees of certainty) that items A and B should be assigned to the same cluster, while items A and C should not. We would like such pairwise relations to influence cluster assignments of out-of-sample data in a manner consistent with the prior knowledge expressed in the training set. Our starting point is probabilistic clustering based on Gaussian mixture models (GMM) of the data distribution. We express clustering preferences in the prior distribution over assignments of data points to clusters. This prior penalizes cluster assignments according to the degree with which they violate the preferences. We fit the model parameters with EM. Experiments on a variety of data sets show that PPC can consistently improve clustering results.

5 0.63631344 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach

Author: Francis R. Bach, Michael I. Jordan

Abstract: We present an algorithm to perform blind, one-microphone speech separation. Our algorithm separates mixtures of speech without modeling individual speakers. Instead, we formulate the problem of speech separation as a problem in segmenting the spectrogram of the signal into two or more disjoint sets. We build feature sets for our segmenter using classical cues from speech psychophysics. We then combine these features into parameterized affinity matrices. We also take advantage of the fact that we can generate training examples for segmentation by artificially superposing separately-recorded signals. Thus the parameters of the affinity matrices can be tuned using recent work on learning spectral clustering [1]. This yields an adaptive, speech-specific segmentation algorithm that can successfully separate one-microphone speech mixtures. 1

6 0.63505179 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection

7 0.63446504 77 nips-2004-Hierarchical Clustering of a Mixture Model

8 0.63393688 44 nips-2004-Conditional Random Fields for Object Recognition

9 0.63300532 179 nips-2004-Surface Reconstruction using Learned Shape Models

10 0.63249546 161 nips-2004-Self-Tuning Spectral Clustering

11 0.63217086 2 nips-2004-A Direct Formulation for Sparse PCA Using Semidefinite Programming

12 0.63150704 127 nips-2004-Neighbourhood Components Analysis

13 0.63137871 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

14 0.6312688 61 nips-2004-Efficient Out-of-Sample Extension of Dominant-Set Clusters

15 0.6307593 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

16 0.63047165 207 nips-2004-ℓ₀-norm Minimization for Basis Selection

17 0.62982988 53 nips-2004-Discriminant Saliency for Visual Recognition from Cluttered Scenes

18 0.62907404 166 nips-2004-Semi-supervised Learning via Gaussian Processes

19 0.62898958 99 nips-2004-Learning Hyper-Features for Visual Identification

20 0.6286118 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees