nips nips2005 nips2005-59 knowledge-graph by maker-knowledge-mining

59 nips-2005-Dual-Tree Fast Gauss Transforms


Source: pdf

Author: Dongryeol Lee, Andrew W. Moore, Alexander G. Gray

Abstract: In previous work we presented an efficient approach to computing kernel summations which arise in many machine learning methods such as kernel density estimation. This approach, dual-tree recursion with finitedifference approximation, generalized existing methods for similar problems arising in computational physics in two ways appropriate for statistical problems: toward distribution sensitivity and general dimension, partly by avoiding series expansions. While this proved to be the fastest practical method for multivariate kernel density estimation at the optimal bandwidth, it is much less efficient at larger-than-optimal bandwidths. In this work, we explore the extent to which the dual-tree approach can be integrated with multipole-like Hermite expansions in order to achieve reasonable efficiency across all bandwidth scales, though only for low dimensionalities. In the process, we derive and demonstrate the first truly hierarchical fast Gauss transforms, effectively combining the best tools from discrete algorithms and continuous approximation theory. 1 Fast Gaussian Summation Kernel summations are fundamental in both statistics/learning and computational physics. NR e This paper will focus on the common form G(xq ) = −||xq −xr ||2 2h2 i.e. where the ker- r=1 nel is the Gaussian kernel with scaling parameter, or bandwidth h, there are NR reference points xr , and we desire the sum for NQ different query points xq . Such kernel summations appear in a wide array of statistical/learning methods [5], perhaps most obviously in kernel density estimation [11], the most widely used distribution-free method for the fundamental task of density estimation, which will be our main example. Understanding kernel summation algorithms from a recently developed unified perspective [5] begins with the picture of Figure 1, then separately considers the discrete and continuous aspects. Discrete/geometric aspect. In terms of discrete algorithmic structure, the dual-tree framework of [5], in the context of kernel summation, generalizes all of the well-known algorithms. 1 It was applied to the problem of kernel density estimation in [7] using a simple 1 These include the Barnes-Hut algorithm [2], the Fast Multipole Method [8], Appel’s algorithm [1], and the WSPD [4]: the dual-tree method is a node-node algorithm (considers query regions rather than points), is fully recursive, can use distribution-sensitive data structures such as kd-trees, and is bichromatic (can specialize for differing query and reference sets). Figure 1: The basic idea is to approximate the kernel sum contribution of some subset of the reference points XR , lying in some compact region of space R with centroid xR , to a query point. In more efficient schemes a query region is considered, i.e. the approximate contribution is made to an entire subset of the query points XQ lying in some region of space Q, with centroid xQ . finite-difference approximation, which is tantamount to a centroid approximation. Partially by avoiding series expansions, which depend explicitly on the dimension, the result was the fastest such algorithm for general dimension, when operating at the optimal bandwidth. Unfortunately, when performing cross-validation to determine the (initially unknown) optimal bandwidth, both suboptimally small and large bandwidths must be evaluated. The finite-difference-based dual-tree method tends to be efficient at or below the optimal bandwidth, and at very large bandwidths, but for intermediately-large bandwidths it suffers. Continuous/approximation aspect. This motivates investigating a multipole-like series approximation which is appropriate for the Gaussian kernel, as introduced by [9], which can be shown the generalize the centroid approximation. We define the Hermite functions 2 hn (t) by hn (t) = e−t Hn (t), where the Hermite polynomials Hn (t) are defined by the 2 2 Rodrigues formula: Hn (t) = (−1)n et Dn e−t , t ∈ R1 . After scaling and shifting the argument t appropriately, then taking the product of univariate functions for each dimension, we obtain the multivariate Hermite expansion NR G(xq ) = e −||xq −xr ||2 2h2 NR = r=1 r=1 α≥0 1 α! xr − xR √ 2h2 α hα xq − xR √ 2h2 (1) where we’ve adopted the usual multi-index notation as in [9]. This can be re-written as NR G(xq ) = e r=1 −||xq −xr ||2 2h2 NR = r=1 α≥0 1 hα α! xr − xQ √ 2h2 xq − xQ √ 2h2 α (2) to express the sum as a Taylor (local) expansion about a nearby representative centroid xQ in the query region. We will be using both types of expansions simultaneously. Since series approximations only hold locally, Greengard and Rokhlin [8] showed that it is useful to think in terms of a set of three ‘translation operators’ for converting between expansions centered at different points, in order to create their celebrated hierarchical algorithm. This was done in the context of the Coulombic kernel, but the Gaussian kernel has importantly different mathematical properties. The original Fast Gauss Transform (FGT) [9] was based on a flat grid, and thus provided only one operator (“H2L” of the next section), with an associated error bound (which was unfortunately incorrect). The Improved Fast Gauss Transform (IFGT) [14] was based on a flat set of clusters and provided no operators with a rearranged series approximation, which intended to be more favorable in higher dimensions but had an incorrect error bound. We will show the derivations of all the translation operators and associated error bounds needed to obtain, for the first time, a hierarchical algorithm for the Gaussian kernel. 2 Translation Operators and Error Bounds The first operator converts a multipole expansion of a reference node to form a local expansion centered at the centroid of the query node, and is our main approximation workhorse. Lemma 2.1. Hermite-to-local (H2L) translation operator for Gaussian kernel (as presented in Lemma 2.2 in [9, 10]): Given a reference node XR , a query node XQ , and the Hermite expansion centered at a centroid xR of XR : G(xq ) = Aα hα α≥0 xq −xR √ 2h2 , the Taylor expansion of the Hermite expansion at the centroid xQ of the query node XQ is given by G(xq ) = Bβ β≥0 xq −xQ √ 2h2 β where Bβ = (−1)|β| β! Aα hα+β α≥0 xQ −xR √ 2h2 . Proof. (sketch) The proof consists of replacing the Hermite function portion of the expansion with its Taylor series. NR Note that we can rewrite G(xq ) = α≥0 r=1 1 α! xr −xR √ 2h2 α hα xq −xR √ 2h2 by interchanging the summation order, such that the term in the brackets depends only on the reference points, and can thus be computed indepedent of any query location – we will call such terms Hermite moments. The next operator allows the efficient pre-computation of the Hermite moments in the reference tree in a bottom-up fashion from its children. Lemma 2.2. Hermite-to-Hermite (H2H) translation operator for Gaussian kernel: Given the Hermite expansion centered at a centroid xR′ in a reference node XR′ : xq −x G(xq ) = A′ hα √2hR′ , this same Hermite expansion shifted to a new locaα 2 α≥0 tion xR of the parent node of XR is given by G(xq ) = Aγ hγ γ≥0 Aγ = 0≤α≤γ 1 ′ (γ−α)! Aα xR′ −xR √ 2h2 xq −xR √ 2h2 where γ−α . Proof. We simply replace the Hermite function part of the expansion by a new Taylor series, as follows: « x q − x R′ √ 2h2 α≥0 „ « X ′ X 1 „ x R − x R′ « β xq − xR √ √ = Aα (−1)|β| hα+β β! 2h2 2h2 α≥0 β≥0 „ «β « „ X X ′ 1 x R − x R′ xq − xR |β| √ √ (−1) hα+β = Aα β! 2h2 2h2 α≥0 β≥0 „ «β „ « X X ′ 1 x R′ − x R xq − xR √ √ Aα = hα+β β! 2h2 2h2 α≥0 β≥0 3 2 «γ−α „ « „ X X 1 x R′ − x R q ′ 5 hγ x√− xR 4 √ = Aα (γ − α)! 2h2 2h2 γ≥0 0≤α≤γ G(xq ) = where γ = α + β. X A′ hα α „ The next operator acts as a “clean-up” routine in a hierarchical algorithm. Since we can approximate at different scales in the query tree, we must somehow combine all the approximations at the end of the computation. By performing a breadth-first traversal of the query tree, the L2L operator shifts a node’s local expansion to the centroid of each child. Lemma 2.3. Local-to-local (L2L) translation operator for Gaussian kernel: Given a Taylor expansion centered at a centroid xQ′ of a query node XQ′ : G(xq ) = xq −xQ′ √ 2h2 Bβ β≥0 β , the Taylor expansion obtained by shift- ing this expansion to the new centroid xQ of the child node XQ is G(xq ) = α≥0 β≥α β! α!(β−α)! Bβ β−α xQ −xQ′ √ 2h2 xq −xQ √ 2h2 α . Proof. Applying the multinomial theorem to to expand about the new center xQ yields: G(xq ) = X Bβ β≥0 = „ XX β≥0 α≤β xq − xQ′ √ 2h2 Bβ «β β! α!(β − α)! „ xQ − xQ′ √ 2h2 «β−α „ xq − xQ √ 2h2 «α . whose summation order can be interchanged to achieve the result. Because the Hermite and the Taylor expansion are truncated after taking pD terms, we incur an error in approximation. The original error bounds for the Gaussian kernel in [9, 10] were wrong and corrections were shown in [3]. Here, we will present all necessary three error bounds incurred in performing translation operators. We note that these error bounds place limits on the size of the query node and the reference node. 2 Lemma 2.4. Error Bound for Truncating an Hermite Expansion (as presented in [3]): Suppose we are given an Hermite expansion of a reference node XR about its centroid xR : G(xq ) = Aα hα α≥0 xq −xR √ 2h2 NR where Aα = r=1 1 α! xr −xR √ 2h2 α . For any query point xq , the error due to truncating the series after the first pD term is |ǫM (p)| ≤ rp )k rp √ p! NR (1−r)D D−1 k=0 D k (1 − D−k where ∀xr ∈ XR satisfies ||xr − xR ||∞ < rh for r < 1. Proof. (sketch) We expand the Hermite expansion as a product of one-dimensional Hermite functions, and utilize a bound on one-dimensional Hermite functions due to [13]: n −x2 1 2 √ 2 e 2 , n ≥ 0, x ∈ R1 . n! |hn (x)| ≤ n! Lemma 2.5. Error Bound for Truncating a Taylor Expansion Converted from an Hermite Expansion of Infinite Order: Suppose we are given the following Taylor expansion about the centroid xQ of a query node G(xq ) = Bβ β≥0 2 xq −xQ √ 2h2 β where `Strainn[12] proposed the interesting idea of using Stirling’s formula (for any non-negative integer ´ ≤ n!) to lift the node size constraint; one might imagine that this could allow approxin: n+1 e mation of larger regions in a tree-based algorithm. Unfortunately, the error bounds developed in [12] were also incorrect. We have derived the three necessary corrected error bounds based on the techniques in [3]. However, due to space, and because using these bounds actually degraded performance slightly, we do not include those lemmas here. (−1)|β| β! Bβ = Aα hα+β α≥0 xQ −xR √ 2h2 and Aα ’s are the coefficients of the Hermite ex- pansion centered at the reference node centroid xR . Then, truncating the series after pD terms satisfies the error bound |ǫL (p)| ≤ NR (1−r)D D−1 k=0 D k (1 − rp )k rp √ p! D−k where ||xq − xQ ||∞ < rh for r < 1, ∀xq ∈ XQ . Proof. Taylor expansion of the Hermite function yields e −||xq −xr ||2 2h2 Use e „ «„ «β X (−1)|β| X 1 „ xr − xR «α xq − xQ xQ − xR √ √ √ hα+β = β! α! 2h2 2h2 2h2 α≥0 β≥0 «α „ «„ «β „ X (−1)|β| X 1 xR − xr xQ − xR xq − xQ |α| √ √ √ = (−1) hα+β β! α! 2h2 2h2 2h2 β≥0 α≥0 «„ «β „ X (−1)|β| xq − xQ xQ − xr √ √ = hβ β! 2h2 2h2 β≥0 −||xq −xr ||2 2h2 D = i=1 (up (xqi , xri , xQi ) + vp (xqi , xri , xQi )) for 1 ≤ i ≤ D, where «„ «n „ X (−1)ni xqi − xQi i xQi − xri √ √ hni ni ! 2h2 2h2 ni =0 „ «„ «ni ∞ X (−1)ni xqi − xQi xQi − xri √ √ hni vp (xqi , xri , xQi ) = . ni ! 2h2 2h2 ni =p p−1 up (xqi , xri , xQi ) = 1−r p 1−r These univariate functions respectively satisfy up (xqi , xri , xQi ) ≤ 1 rp vp (xqi , xri , xQi ) ≤ √p! 1−r , for 1 ≤ i ≤ D, achieving the multivariate bound. and Lemma 2.6. Error Bound for Truncating a Taylor Expansion Converted from an Already Truncated Hermite Expansion: A truncated Hermite expansion centered about xq −xR the centroid xR of a reference node G(xq ) = Aα hα √2h2 has the following α < rh, and a reference node XR for which ||xr − xR ||∞ < rh for r < 1 , ∀xq ∈ XQ , ∀xr ∈ XR . 2 Proof. We define upi = up (xqi , xri , xQi , xRi ), vpi = vp (xqi , xri , xQi , xRi ), wpi = wp (xqi , xri , xQi , xRi ) for 1 ≤ i ≤ D: upi = „ «„ «ni p−1 X X (−1)ni p−1 1 „ xR − xr «nj xqi − xQi xQi − xRi i i √ √ √ (−1)nj hni +nj ni ! n =0 nj ! 2h2 2h2 2h2 ni =0 j vpi = „ «„ «n ∞ X (−1)ni X 1 „ xR − xr «nj xQi − xRi xqi − xQi i i i √ √ √ (−1)nj hni +nj ni ! n =p nj ! 2h2 2h2 2h2 ni =0 j p−1 wpi = „ «„ «n ∞ ∞ X (−1)ni X 1 „ xR − xr «nj xQi − xRi xqi − xQi i i i √ √ √ (−1)nj hni +nj ni ! n =0 nj ! 2h2 2h2 2h2 ni =p j Note that e −||xq −xr ||2 2h2 D = i=1 (upi + vpi + wpi ) for 1 ≤ i ≤ D. Using the bound for Hermite functions and the property of geometric series, we obtain the following upper bounds: p−1 p−1 upi ≤ X X (2r)ni (2r)nj = ni =0 nj =0 „ 1 − (2r)p ) 1 − 2r «2 „ «„ « p−1 ∞ 1 X X 1 1 − (2r)p (2r)p vpi ≤ √ (2r)ni (2r)nj = √ 1 − 2r 1 − 2r p! n =0 n =p p! i 1 wpi ≤ √ p! j ∞ ∞ X X ni =p nj 1 (2r)ni (2r)nj = √ p! =0 „ 1 1 − 2r «„ (2r)p 1 − 2r « Therefore, ˛ ˛ ! «D−k „ D D−1 ˛ −||xq −xr ||2 ˛ Y X D ((2r)p )(2 − (2r)p ) ˛ ˛ −2D 2 2h √ − upi ˛ ≤ (1 − 2r) ((1 − (2r)p )2 )k ˛e ˛ ˛ k p! i=1 k=0 ˛ ˛ ˛ « ˛ „ „ « D−1 “ ” X D X ˛ xq − xQ β ˛ ((2r)p )(2 − (2r)p ) D−k NR p 2 k ˛≤ ˛G(xq ) − √ ((1 − (2r) ) ) √ Cβ ˛ ˛ 2D (1 − 2r) k p! 2h2 ˛ ˛ k=0 β < 2h, pDH = the smallest p ≥ 1 such that NR (1−r)D D−1 k=0 D k (1 − rp )k rp √ p! D−k < ǫGmin . Q if Q.maxside < 2h, pDL = the smallest p ≥ 1 such that NR (1−r)D D−1 k=0 D k (1 − rp )k rp √ p! D−k < ǫGmin . Q if max(Q.maxside,R.maxside) < h, pH2L = the smallest p ≥ 1 such that NR (1−2r)2D D−1 k=0 D k ((1 − (2r)p )2 )k ((2r)p )(2−(2r)p ) √ p! D−k < ǫGmin . Q cDH = pD NQ . cDL = pD NR . cH2L = DpD+1 . cDirect = DNQ NR . DH DL H2L if no Hermite coefficient of order pDH exists for XR , Compute it. cDH = cDH + pD NR . DH if no Hermite coefficient of order pH2L exists for XR , Compute it. cH2L = cH2L + pD NR . H2L c = min(cDH , cDL , cH2L , cDirect ). if c = cDH < ∞, (Direct Hermite) Evaluate each xq at the Hermite series of order pDH centered about xR of XR using Equation 1. if c = cDL < ∞, (Direct Local) Accumulate each xr ∈ XR as the Taylor series of order pDL about the center xQ of XQ using Equation 2. if c = cH2L < ∞, (Hermite-to-Local) Convert the Hermite series of order pH2L centered about xR of XR to the Taylor series of the same order centered about xQ of XQ using Lemma 2.1. if c = cDirect , Update Gmin and Gmax in Q and all its children. return. if leaf(Q) and leaf(R), Perform the naive algorithm on every pair of points in Q and R. else DFGT(Q.left, R.left). DFGT(Q.left, R.right). DFGT(Q.right, R.left). DFGT(Q.right, R.right). ˛ ˛ ˛b ˛ For the FGT, note that the algorithm only ensures: ˛G(xq ) − Gtrue (xq )˛ ≤ τ . Therefore, we first set τ = ǫ, halving τ until the error tolerance ǫ was met. For the IFGT, which has multiple parameters that must be tweaked simultaneously, an automatic scheme was created, based on the recommendations given in the paper and software documentation: For D = 2, use p = 8; for D = 3, √ use p = 6; set ρx = 2.5; start with K = N and double K until the error tolerance is met. When this failed to meet the tolerance, we resorted to additional trial and error by hand. The costs of parameter selection for these methods in both computer and human time is not included in the table. 4 Algorithm \ scale Naive FGT IFGT DFD DFGT DFGTH Naive FGT IFGT DFD DFGT DFGTH Naive FGT IFGT DFD DFGT DFGTH Naive FGT IFGT DFD DFGT DFGTH Naive FGT IFGT DFD DFGT DFGTH 0.001 0.01 0.1 1 10 100 sj2-50000-2 (astronomy: positions), D = 2, N = 50000, h∗ = 0.00139506 301.696 301.696 301.696 301.696 301.696 301.696 out of RAM out of RAM out of RAM 3.892312 2.01846 0.319538 > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive 0.837724 1.087066 1.658592 6.018158 62.077669 151.590062 0.849935 1.11567 4.599235 72.435177 18.450387 2.777454 0.846294 1.10654 1.683913 6.265131 5.063365 1.036626 ∗ = 0.0016911 colors50k (astronomy: colors), D = 2, N = 50000, h 301.696 301.696 301.696 301.696 301.696 301.696 out of RAM out of RAM out of RAM > 2×Naive > 2×Naive 0.475281 > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive 1.095838 1.469454 2.802112 30.294007 280.633106 81.373053 1.099828 1.983888 29.231309 285.719266 12.886239 5.336602 1.081216 1.47692 2.855083 24.598749 7.142465 1.78648 ∗ edsgc-radec-rnd (astronomy: angles), D = 2, N = 50000, h = 0.00466204 301.696 301.696 301.696 301.696 301.696 301.696 out of RAM out of RAM out of RAM 2.859245 1.768738 0.210799 > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive 0.812462 1.083528 1.682261 5.860172 63.849361 357.099354 0.84023 1.120015 4.346061 73.036687 21.652047 3.424304 0.821672 1.104545 1.737799 6.037217 5.7398 1.883216 ∗ mockgalaxy-D-1M-rnd (cosmology: positions), D = 3, N = 50000, h = 0.000768201 354.868751 354.868751 354.868751 354.868751 354.868751 354.868751 out of RAM out of RAM out of RAM out of RAM > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive 0.70054 0.701547 0.761524 0.843451 1.086608 42.022605 0.73007 0.733638 0.799711 0.999316 50.619588 125.059911 0.724004 0.719951 0.789002 0.877564 1.265064 22.6106 ∗ bio5-rnd (biology: drug activity), D = 5, N = 50000, h = 0.000567161 364.439228 364.439228 364.439228 364.439228 364.439228 364.439228 out of RAM out of RAM out of RAM out of RAM out of RAM out of RAM > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive 2.249868 2.4958865 4.70948 12.065697 94.345003 412.39142 > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive 1000 301.696 0.183616 7.576783 1.551019 2.532401 0.68471 301.696 0.114430 7.55986 3.604753 3.5638 0.627554 301.696 0.059664 7.585585 0.743045 1.977302 0.436596 354.868751 > 2×Naive > 2×Naive 383.12048 109.353701 87.488392 364.439228 out of RAM > 2×Naive 107.675935 > 2×Naive > 2×Naive Discussion. The experiments indicate that the DFGTH method is able to achieve reasonable performance across all bandwidth scales. Unfortunately none of the series approximation-based methods do well on the 5-dimensional data, as expected, highlighting the main weakness of the approach presented. Pursuing corrections to the error bounds necessary to use the intriguing series form of [14] may allow an increase in dimensionality. References [1] A. W. Appel. An Efficient Program for Many-Body Simulations. SIAM Journal on Scientific and Statistical Computing, 6(1):85–103, 1985. [2] J. Barnes and P. Hut. A Hierarchical O(N logN ) Force-Calculation Algorithm. Nature, 324, 1986. [3] B. Baxter and G. Roussos. A new error estimate of the fast gauss transform. SIAM Journal on Scientific Computing, 24(1):257–259, 2002. [4] P. Callahan and S. Kosaraju. A decomposition of multidimensional point sets with applications to k-nearest-neighbors and n-body potential fields. Journal of the ACM, 62(1):67–90, January 1995. [5] A. Gray and A. W. Moore. N-Body Problems in Statistical Learning. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13 (December 2000). MIT Press, 2001. [6] A. G. Gray. Bringing Tractability to Generalized N-Body Problems in Statistical and Scientific Computation. PhD thesis, Carnegie Mellon University, 2003. [7] A. G. Gray and A. W. Moore. Rapid Evaluation of Multiple Density Models. In Artificial Intelligence and Statistics 2003, 2003. [8] L. Greengard and V. Rokhlin. A Fast Algorithm for Particle Simulations. Journal of Computational Physics, 73, 1987. [9] L. Greengard and J. Strain. The fast gauss transform. SIAM Journal on Scientific and Statistical Computing, 12(1):79–94, 1991. [10] L. Greengard and X. Sun. A new version of the fast gauss transform. Documenta Mathematica, Extra Volume ICM(III):575– 584, 1998. [11] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, 1986. [12] J. Strain. The fast gauss transform with variable scales. SIAM Journal on Scientific and Statistical Computing, 12:1131– 1139, 1991. [13] O. Sz´ sz. On the relative extrema of the hermite orthogonal functions. J. Indian Math. Soc., 15:129–134, 1951. a [14] C. Yang, R. Duraiswami, N. A. Gumerov, and L. Davis. Improved fast gauss transform and efficient kernel density estimation. International Conference on Computer Vision, 2003.

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract In previous work we presented an efficient approach to computing kernel summations which arise in many machine learning methods such as kernel density estimation. [sent-9, score-0.103]

2 This approach, dual-tree recursion with finitedifference approximation, generalized existing methods for similar problems arising in computational physics in two ways appropriate for statistical problems: toward distribution sensitivity and general dimension, partly by avoiding series expansions. [sent-10, score-0.051]

3 While this proved to be the fastest practical method for multivariate kernel density estimation at the optimal bandwidth, it is much less efficient at larger-than-optimal bandwidths. [sent-11, score-0.07]

4 In this work, we explore the extent to which the dual-tree approach can be integrated with multipole-like Hermite expansions in order to achieve reasonable efficiency across all bandwidth scales, though only for low dimensionalities. [sent-12, score-0.049]

5 In the process, we derive and demonstrate the first truly hierarchical fast Gauss transforms, effectively combining the best tools from discrete algorithms and continuous approximation theory. [sent-13, score-0.037]

6 1 Fast Gaussian Summation Kernel summations are fundamental in both statistics/learning and computational physics. [sent-14, score-0.022]

7 where the ker- r=1 nel is the Gaussian kernel with scaling parameter, or bandwidth h, there are NR reference points xr , and we desire the sum for NQ different query points xq . [sent-17, score-1.329]

8 Such kernel summations appear in a wide array of statistical/learning methods [5], perhaps most obviously in kernel density estimation [11], the most widely used distribution-free method for the fundamental task of density estimation, which will be our main example. [sent-18, score-0.118]

9 Understanding kernel summation algorithms from a recently developed unified perspective [5] begins with the picture of Figure 1, then separately considers the discrete and continuous aspects. [sent-19, score-0.062]

10 In terms of discrete algorithmic structure, the dual-tree framework of [5], in the context of kernel summation, generalizes all of the well-known algorithms. [sent-21, score-0.033]

11 Figure 1: The basic idea is to approximate the kernel sum contribution of some subset of the reference points XR , lying in some compact region of space R with centroid xR , to a query point. [sent-23, score-0.285]

12 In more efficient schemes a query region is considered, i. [sent-24, score-0.082]

13 the approximate contribution is made to an entire subset of the query points XQ lying in some region of space Q, with centroid xQ . [sent-26, score-0.202]

14 finite-difference approximation, which is tantamount to a centroid approximation. [sent-27, score-0.101]

15 Partially by avoiding series expansions, which depend explicitly on the dimension, the result was the fastest such algorithm for general dimension, when operating at the optimal bandwidth. [sent-28, score-0.055]

16 Unfortunately, when performing cross-validation to determine the (initially unknown) optimal bandwidth, both suboptimally small and large bandwidths must be evaluated. [sent-29, score-0.032]

17 The finite-difference-based dual-tree method tends to be efficient at or below the optimal bandwidth, and at very large bandwidths, but for intermediately-large bandwidths it suffers. [sent-30, score-0.024]

18 This motivates investigating a multipole-like series approximation which is appropriate for the Gaussian kernel, as introduced by [9], which can be shown the generalize the centroid approximation. [sent-32, score-0.132]

19 We define the Hermite functions 2 hn (t) by hn (t) = e−t Hn (t), where the Hermite polynomials Hn (t) are defined by the 2 2 Rodrigues formula: Hn (t) = (−1)n et Dn e−t , t ∈ R1 . [sent-33, score-0.062]

20 After scaling and shifting the argument t appropriately, then taking the product of univariate functions for each dimension, we obtain the multivariate Hermite expansion NR G(xq ) = e −||xq −xr ||2 2h2 NR = r=1 r=1 α≥0 1 α! [sent-34, score-0.132]

21 xr − xR √ 2h2 α hα xq − xR √ 2h2 (1) where we’ve adopted the usual multi-index notation as in [9]. [sent-35, score-1.119]

22 xr − xQ √ 2h2 xq − xQ √ 2h2 α (2) to express the sum as a Taylor (local) expansion about a nearby representative centroid xQ in the query region. [sent-37, score-1.412]

23 We will be using both types of expansions simultaneously. [sent-38, score-0.022]

24 Since series approximations only hold locally, Greengard and Rokhlin [8] showed that it is useful to think in terms of a set of three ‘translation operators’ for converting between expansions centered at different points, in order to create their celebrated hierarchical algorithm. [sent-39, score-0.099]

25 This was done in the context of the Coulombic kernel, but the Gaussian kernel has importantly different mathematical properties. [sent-40, score-0.033]

26 The original Fast Gauss Transform (FGT) [9] was based on a flat grid, and thus provided only one operator (“H2L” of the next section), with an associated error bound (which was unfortunately incorrect). [sent-41, score-0.069]

27 The Improved Fast Gauss Transform (IFGT) [14] was based on a flat set of clusters and provided no operators with a rearranged series approximation, which intended to be more favorable in higher dimensions but had an incorrect error bound. [sent-42, score-0.086]

28 We will show the derivations of all the translation operators and associated error bounds needed to obtain, for the first time, a hierarchical algorithm for the Gaussian kernel. [sent-43, score-0.111]

29 2 Translation Operators and Error Bounds The first operator converts a multipole expansion of a reference node to form a local expansion centered at the centroid of the query node, and is our main approximation workhorse. [sent-44, score-0.572]

30 Hermite-to-local (H2L) translation operator for Gaussian kernel (as presented in Lemma 2. [sent-47, score-0.093]

31 (sketch) The proof consists of replacing the Hermite function portion of the expansion with its Taylor series. [sent-51, score-0.11]

32 xr −xR √ 2h2 α hα xq −xR √ 2h2 by interchanging the summation order, such that the term in the brackets depends only on the reference points, and can thus be computed indepedent of any query location – we will call such terms Hermite moments. [sent-53, score-1.272]

33 The next operator allows the efficient pre-computation of the Hermite moments in the reference tree in a bottom-up fashion from its children. [sent-54, score-0.085]

34 We simply replace the Hermite function part of the expansion by a new Taylor series, as follows: « x q − x R′ √ 2h2 α≥0 „ « X ′ X 1 „ x R − x R′ « β xq − xR √ √ = Aα (−1)|β| hα+β β! [sent-60, score-0.773]

35 2h2 2h2 α≥0 β≥0 „ «β « „ X X ′ 1 x R − x R′ xq − xR |β| √ √ (−1) hα+β = Aα β! [sent-61, score-0.663]

36 2h2 2h2 α≥0 β≥0 „ «β „ « X X ′ 1 x R′ − x R xq − xR √ √ Aα = hα+β β! [sent-62, score-0.663]

37 X A′ hα α „ The next operator acts as a “clean-up” routine in a hierarchical algorithm. [sent-65, score-0.042]

38 Since we can approximate at different scales in the query tree, we must somehow combine all the approximations at the end of the computation. [sent-66, score-0.082]

39 By performing a breadth-first traversal of the query tree, the L2L operator shifts a node’s local expansion to the centroid of each child. [sent-67, score-0.319]

40 Applying the multinomial theorem to to expand about the new center xQ yields: G(xq ) = X Bβ β≥0 = „ XX β≥0 α≤β xq − xQ′ √ 2h2 Bβ «β β! [sent-75, score-0.673]

41 whose summation order can be interchanged to achieve the result. [sent-79, score-0.029]

42 Because the Hermite and the Taylor expansion are truncated after taking pD terms, we incur an error in approximation. [sent-80, score-0.142]

43 The original error bounds for the Gaussian kernel in [9, 10] were wrong and corrections were shown in [3]. [sent-81, score-0.086]

44 Here, we will present all necessary three error bounds incurred in performing translation operators. [sent-82, score-0.074]

45 We note that these error bounds place limits on the size of the query node and the reference node. [sent-83, score-0.221]

46 Error Bound for Truncating an Hermite Expansion (as presented in [3]): Suppose we are given an Hermite expansion of a reference node XR about its centroid xR : G(xq ) = Aα hα α≥0 xq −xR √ 2h2 NR where Aα = r=1 1 α! [sent-86, score-0.973]

47 For any query point xq , the error due to truncating the series after the first pD term is |ǫM (p)| ≤ rp )k rp √ p! [sent-88, score-0.952]

48 NR (1−r)D D−1 k=0 D k (1 − D−k where ∀xr ∈ XR satisfies ||xr − xR ||∞ < rh for r < 1. [sent-89, score-0.032]

49 (sketch) We expand the Hermite expansion as a product of one-dimensional Hermite functions, and utilize a bound on one-dimensional Hermite functions due to [13]: n −x2 1 2 √ 2 e 2 , n ≥ 0, x ∈ R1 . [sent-91, score-0.133]

50 ) to lift the node size constraint; one might imagine that this could allow approxin: n+1 e mation of larger regions in a tree-based algorithm. [sent-97, score-0.049]

51 Unfortunately, the error bounds developed in [12] were also incorrect. [sent-98, score-0.04]

52 We have derived the three necessary corrected error bounds based on the techniques in [3]. [sent-99, score-0.04]

53 However, due to space, and because using these bounds actually degraded performance slightly, we do not include those lemmas here. [sent-100, score-0.023]

54 Bβ = Aα hα+β α≥0 xQ −xR √ 2h2 and Aα ’s are the coefficients of the Hermite ex- pansion centered at the reference node centroid xR . [sent-102, score-0.23]

55 Then, truncating the series after pD terms satisfies the error bound |ǫL (p)| ≤ NR (1−r)D D−1 k=0 D k (1 − rp )k rp √ p! [sent-103, score-0.22]

56 D−k where ||xq − xQ ||∞ < rh for r < 1, ∀xq ∈ XQ . [sent-104, score-0.032]

57 Taylor expansion of the Hermite function yields e −||xq −xr ||2 2h2 Use e „ «„ «β X (−1)|β| X 1 „ xr − xR «α xq − xQ xQ − xR √ √ √ hα+β = β! [sent-106, score-1.229]

58 2h2 2h2 2h2 α≥0 β≥0 «α „ «„ «β „ X (−1)|β| X 1 xR − xr xQ − xR xq − xQ |α| √ √ √ = (−1) hα+β β! [sent-108, score-1.119]

59 2h2 2h2 2h2 β≥0 α≥0 «„ «β „ X (−1)|β| xq − xQ xQ − xr √ √ = hβ β! [sent-110, score-1.119]

60 2h2 2h2 β≥0 −||xq −xr ||2 2h2 D = i=1 (up (xqi , xri , xQi ) + vp (xqi , xri , xQi )) for 1 ≤ i ≤ D, where «„ «n „ X (−1)ni xqi − xQi i xQi − xri √ √ hni ni ! [sent-111, score-0.884]

61 2h2 2h2 ni =0 „ «„ «ni ∞ X (−1)ni xqi − xQi xQi − xri √ √ hni vp (xqi , xri , xQi ) = . [sent-112, score-0.73]

62 2h2 2h2 ni =p p−1 up (xqi , xri , xQi ) = 1−r p 1−r These univariate functions respectively satisfy up (xqi , xri , xQi ) ≤ 1 rp vp (xqi , xri , xQi ) ≤ √p! [sent-114, score-0.67]

63 1−r , for 1 ≤ i ≤ D, achieving the multivariate bound. [sent-115, score-0.01]

64 We define upi = up (xqi , xri , xQi , xRi ), vpi = vp (xqi , xri , xQi , xRi ), wpi = wp (xqi , xri , xQi , xRi ) for 1 ≤ i ≤ D: upi = „ «„ «ni p−1 X X (−1)ni p−1 1 „ xR − xr «nj xqi − xQi xQi − xRi i i √ √ √ (−1)nj hni +nj ni ! [sent-120, score-1.498]

65 2h2 2h2 2h2 ni =0 j vpi = „ «„ «n ∞ X (−1)ni X 1 „ xR − xr «nj xQi − xRi xqi − xQi i i i √ √ √ (−1)nj hni +nj ni ! [sent-122, score-0.995]

66 2h2 2h2 2h2 ni =0 j p−1 wpi = „ «„ «n ∞ ∞ X (−1)ni X 1 „ xR − xr «nj xQi − xRi xqi − xQi i i i √ √ √ (−1)nj hni +nj ni ! [sent-124, score-0.991]

67 2h2 2h2 2h2 ni =p j Note that e −||xq −xr ||2 2h2 D = i=1 (upi + vpi + wpi ) for 1 ≤ i ≤ D. [sent-126, score-0.178]

68 Using the bound for Hermite functions and the property of geometric series, we obtain the following upper bounds: p−1 p−1 upi ≤ X X (2r)ni (2r)nj = ni =0 nj =0 „ 1 − (2r)p ) 1 − 2r «2 „ «„ « p−1 ∞ 1 X X 1 1 − (2r)p (2r)p vpi ≤ √ (2r)ni (2r)nj = √ 1 − 2r 1 − 2r p! [sent-127, score-0.276]

69 «D−k „ D D−1 ˛ −||xq −xr ||2 ˛ Y X D ((2r)p )(2 − (2r)p ) ˛ ˛ −2D 2 2h √ − upi ˛ ≤ (1 − 2r) ((1 − (2r)p )2 )k ˛e ˛ ˛ k p! [sent-132, score-0.045]

70 i=1 k=0 ˛ ˛ ˛ « ˛ „ „ « D−1 “ ” X D X ˛ xq − xQ β ˛ ((2r)p )(2 − (2r)p ) D−k NR p 2 k ˛≤ ˛G(xq ) − √ ((1 − (2r) ) ) √ Cβ ˛ ˛ 2D (1 − 2r) k p! [sent-133, score-0.663]

71 2h2 ˛ ˛ k=0 β < 2h, pDH = the smallest p ≥ 1 such that NR (1−r)D D−1 k=0 D k (1 − rp )k rp √ p! [sent-134, score-0.123]

72 maxside < 2h, pDL = the smallest p ≥ 1 such that NR (1−r)D D−1 k=0 D k (1 − rp )k rp √ p! [sent-137, score-0.123]

73 maxside) < h, pH2L = the smallest p ≥ 1 such that NR (1−2r)2D D−1 k=0 D k ((1 − (2r)p )2 )k ((2r)p )(2−(2r)p ) √ p! [sent-141, score-0.009]

74 if c = cDH < ∞, (Direct Hermite) Evaluate each xq at the Hermite series of order pDH centered about xR of XR using Equation 1. [sent-152, score-0.724]

75 if c = cDL < ∞, (Direct Local) Accumulate each xr ∈ XR as the Taylor series of order pDL about the center xQ of XQ using Equation 2. [sent-153, score-0.487]

76 if c = cH2L < ∞, (Hermite-to-Local) Convert the Hermite series of order pH2L centered about xR of XR to the Taylor series of the same order centered about xQ of XQ using Lemma 2. [sent-154, score-0.122]

77 if leaf(Q) and leaf(R), Perform the naive algorithm on every pair of points in Q and R. [sent-158, score-0.235]

78 Therefore, we first set τ = ǫ, halving τ until the error tolerance ǫ was met. [sent-172, score-0.04]

79 For the IFGT, which has multiple parameters that must be tweaked simultaneously, an automatic scheme was created, based on the recommendations given in the paper and software documentation: For D = 2, use p = 8; for D = 3, √ use p = 6; set ρx = 2. [sent-173, score-0.008]

80 5; start with K = N and double K until the error tolerance is met. [sent-174, score-0.032]

81 When this failed to meet the tolerance, we resorted to additional trial and error by hand. [sent-175, score-0.017]

82 The experiments indicate that the DFGTH method is able to achieve reasonable performance across all bandwidth scales. [sent-325, score-0.027]

83 Unfortunately none of the series approximation-based methods do well on the 5-dimensional data, as expected, highlighting the main weakness of the approach presented. [sent-326, score-0.039]

84 Pursuing corrections to the error bounds necessary to use the intriguing series form of [14] may allow an increase in dimensionality. [sent-327, score-0.084]

85 A new error estimate of the fast gauss transform. [sent-341, score-0.09]

86 On the relative extrema of the hermite orthogonal functions. [sent-398, score-0.281]

87 Improved fast gauss transform and efficient kernel density estimation. [sent-409, score-0.135]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('xq', 0.663), ('xr', 0.456), ('hermite', 0.273), ('xqi', 0.238), ('naive', 0.226), ('xri', 0.154), ('ram', 0.121), ('expansion', 0.11), ('ni', 0.11), ('nr', 0.108), ('centroid', 0.101), ('query', 0.082), ('dfgt', 0.082), ('nj', 0.072), ('taylor', 0.065), ('rp', 0.057), ('dfgth', 0.055), ('gauss', 0.052), ('fgt', 0.05), ('ifgt', 0.05), ('reference', 0.05), ('node', 0.049), ('cdh', 0.045), ('dfd', 0.045), ('hni', 0.045), ('truncating', 0.045), ('upi', 0.045), ('gmin', 0.036), ('vpi', 0.036), ('pd', 0.034), ('translation', 0.034), ('kernel', 0.033), ('greengard', 0.032), ('rh', 0.032), ('wpi', 0.032), ('series', 0.031), ('hn', 0.031), ('centered', 0.03), ('vp', 0.029), ('bandwidth', 0.027), ('cdirect', 0.027), ('cdl', 0.027), ('pdh', 0.027), ('operator', 0.026), ('lemma', 0.025), ('astronomy', 0.024), ('bandwidths', 0.024), ('bounds', 0.023), ('expansions', 0.022), ('summations', 0.022), ('fast', 0.021), ('summation', 0.021), ('operators', 0.021), ('nq', 0.018), ('pdl', 0.018), ('error', 0.017), ('scienti', 0.017), ('carnegie', 0.017), ('mellon', 0.016), ('hierarchical', 0.016), ('siam', 0.016), ('density', 0.015), ('tolerance', 0.015), ('truncated', 0.015), ('multipole', 0.014), ('transform', 0.014), ('corrections', 0.013), ('dh', 0.013), ('bound', 0.013), ('unfortunately', 0.013), ('fastest', 0.012), ('leaf', 0.012), ('avoiding', 0.012), ('univariate', 0.012), ('expand', 0.01), ('converted', 0.01), ('lying', 0.01), ('multivariate', 0.01), ('gray', 0.01), ('gaussian', 0.01), ('points', 0.009), ('smallest', 0.009), ('tree', 0.009), ('sketch', 0.009), ('incorrect', 0.009), ('physics', 0.008), ('considers', 0.008), ('cosmology', 0.008), ('extrema', 0.008), ('interchanged', 0.008), ('recommendations', 0.008), ('documentation', 0.008), ('gumerov', 0.008), ('halving', 0.008), ('highlighting', 0.008), ('indian', 0.008), ('mathematica', 0.008), ('pursuing', 0.008), ('rearranged', 0.008), ('suboptimally', 0.008)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 59 nips-2005-Dual-Tree Fast Gauss Transforms

Author: Dongryeol Lee, Andrew W. Moore, Alexander G. Gray

Abstract: In previous work we presented an efficient approach to computing kernel summations which arise in many machine learning methods such as kernel density estimation. This approach, dual-tree recursion with finitedifference approximation, generalized existing methods for similar problems arising in computational physics in two ways appropriate for statistical problems: toward distribution sensitivity and general dimension, partly by avoiding series expansions. While this proved to be the fastest practical method for multivariate kernel density estimation at the optimal bandwidth, it is much less efficient at larger-than-optimal bandwidths. In this work, we explore the extent to which the dual-tree approach can be integrated with multipole-like Hermite expansions in order to achieve reasonable efficiency across all bandwidth scales, though only for low dimensionalities. In the process, we derive and demonstrate the first truly hierarchical fast Gauss transforms, effectively combining the best tools from discrete algorithms and continuous approximation theory. 1 Fast Gaussian Summation Kernel summations are fundamental in both statistics/learning and computational physics. NR e This paper will focus on the common form G(xq ) = −||xq −xr ||2 2h2 i.e. where the ker- r=1 nel is the Gaussian kernel with scaling parameter, or bandwidth h, there are NR reference points xr , and we desire the sum for NQ different query points xq . Such kernel summations appear in a wide array of statistical/learning methods [5], perhaps most obviously in kernel density estimation [11], the most widely used distribution-free method for the fundamental task of density estimation, which will be our main example. Understanding kernel summation algorithms from a recently developed unified perspective [5] begins with the picture of Figure 1, then separately considers the discrete and continuous aspects. Discrete/geometric aspect. In terms of discrete algorithmic structure, the dual-tree framework of [5], in the context of kernel summation, generalizes all of the well-known algorithms. 1 It was applied to the problem of kernel density estimation in [7] using a simple 1 These include the Barnes-Hut algorithm [2], the Fast Multipole Method [8], Appel’s algorithm [1], and the WSPD [4]: the dual-tree method is a node-node algorithm (considers query regions rather than points), is fully recursive, can use distribution-sensitive data structures such as kd-trees, and is bichromatic (can specialize for differing query and reference sets). Figure 1: The basic idea is to approximate the kernel sum contribution of some subset of the reference points XR , lying in some compact region of space R with centroid xR , to a query point. In more efficient schemes a query region is considered, i.e. the approximate contribution is made to an entire subset of the query points XQ lying in some region of space Q, with centroid xQ . finite-difference approximation, which is tantamount to a centroid approximation. Partially by avoiding series expansions, which depend explicitly on the dimension, the result was the fastest such algorithm for general dimension, when operating at the optimal bandwidth. Unfortunately, when performing cross-validation to determine the (initially unknown) optimal bandwidth, both suboptimally small and large bandwidths must be evaluated. The finite-difference-based dual-tree method tends to be efficient at or below the optimal bandwidth, and at very large bandwidths, but for intermediately-large bandwidths it suffers. Continuous/approximation aspect. This motivates investigating a multipole-like series approximation which is appropriate for the Gaussian kernel, as introduced by [9], which can be shown the generalize the centroid approximation. We define the Hermite functions 2 hn (t) by hn (t) = e−t Hn (t), where the Hermite polynomials Hn (t) are defined by the 2 2 Rodrigues formula: Hn (t) = (−1)n et Dn e−t , t ∈ R1 . After scaling and shifting the argument t appropriately, then taking the product of univariate functions for each dimension, we obtain the multivariate Hermite expansion NR G(xq ) = e −||xq −xr ||2 2h2 NR = r=1 r=1 α≥0 1 α! xr − xR √ 2h2 α hα xq − xR √ 2h2 (1) where we’ve adopted the usual multi-index notation as in [9]. This can be re-written as NR G(xq ) = e r=1 −||xq −xr ||2 2h2 NR = r=1 α≥0 1 hα α! xr − xQ √ 2h2 xq − xQ √ 2h2 α (2) to express the sum as a Taylor (local) expansion about a nearby representative centroid xQ in the query region. We will be using both types of expansions simultaneously. Since series approximations only hold locally, Greengard and Rokhlin [8] showed that it is useful to think in terms of a set of three ‘translation operators’ for converting between expansions centered at different points, in order to create their celebrated hierarchical algorithm. This was done in the context of the Coulombic kernel, but the Gaussian kernel has importantly different mathematical properties. The original Fast Gauss Transform (FGT) [9] was based on a flat grid, and thus provided only one operator (“H2L” of the next section), with an associated error bound (which was unfortunately incorrect). The Improved Fast Gauss Transform (IFGT) [14] was based on a flat set of clusters and provided no operators with a rearranged series approximation, which intended to be more favorable in higher dimensions but had an incorrect error bound. We will show the derivations of all the translation operators and associated error bounds needed to obtain, for the first time, a hierarchical algorithm for the Gaussian kernel. 2 Translation Operators and Error Bounds The first operator converts a multipole expansion of a reference node to form a local expansion centered at the centroid of the query node, and is our main approximation workhorse. Lemma 2.1. Hermite-to-local (H2L) translation operator for Gaussian kernel (as presented in Lemma 2.2 in [9, 10]): Given a reference node XR , a query node XQ , and the Hermite expansion centered at a centroid xR of XR : G(xq ) = Aα hα α≥0 xq −xR √ 2h2 , the Taylor expansion of the Hermite expansion at the centroid xQ of the query node XQ is given by G(xq ) = Bβ β≥0 xq −xQ √ 2h2 β where Bβ = (−1)|β| β! Aα hα+β α≥0 xQ −xR √ 2h2 . Proof. (sketch) The proof consists of replacing the Hermite function portion of the expansion with its Taylor series. NR Note that we can rewrite G(xq ) = α≥0 r=1 1 α! xr −xR √ 2h2 α hα xq −xR √ 2h2 by interchanging the summation order, such that the term in the brackets depends only on the reference points, and can thus be computed indepedent of any query location – we will call such terms Hermite moments. The next operator allows the efficient pre-computation of the Hermite moments in the reference tree in a bottom-up fashion from its children. Lemma 2.2. Hermite-to-Hermite (H2H) translation operator for Gaussian kernel: Given the Hermite expansion centered at a centroid xR′ in a reference node XR′ : xq −x G(xq ) = A′ hα √2hR′ , this same Hermite expansion shifted to a new locaα 2 α≥0 tion xR of the parent node of XR is given by G(xq ) = Aγ hγ γ≥0 Aγ = 0≤α≤γ 1 ′ (γ−α)! Aα xR′ −xR √ 2h2 xq −xR √ 2h2 where γ−α . Proof. We simply replace the Hermite function part of the expansion by a new Taylor series, as follows: « x q − x R′ √ 2h2 α≥0 „ « X ′ X 1 „ x R − x R′ « β xq − xR √ √ = Aα (−1)|β| hα+β β! 2h2 2h2 α≥0 β≥0 „ «β « „ X X ′ 1 x R − x R′ xq − xR |β| √ √ (−1) hα+β = Aα β! 2h2 2h2 α≥0 β≥0 „ «β „ « X X ′ 1 x R′ − x R xq − xR √ √ Aα = hα+β β! 2h2 2h2 α≥0 β≥0 3 2 «γ−α „ « „ X X 1 x R′ − x R q ′ 5 hγ x√− xR 4 √ = Aα (γ − α)! 2h2 2h2 γ≥0 0≤α≤γ G(xq ) = where γ = α + β. X A′ hα α „ The next operator acts as a “clean-up” routine in a hierarchical algorithm. Since we can approximate at different scales in the query tree, we must somehow combine all the approximations at the end of the computation. By performing a breadth-first traversal of the query tree, the L2L operator shifts a node’s local expansion to the centroid of each child. Lemma 2.3. Local-to-local (L2L) translation operator for Gaussian kernel: Given a Taylor expansion centered at a centroid xQ′ of a query node XQ′ : G(xq ) = xq −xQ′ √ 2h2 Bβ β≥0 β , the Taylor expansion obtained by shift- ing this expansion to the new centroid xQ of the child node XQ is G(xq ) = α≥0 β≥α β! α!(β−α)! Bβ β−α xQ −xQ′ √ 2h2 xq −xQ √ 2h2 α . Proof. Applying the multinomial theorem to to expand about the new center xQ yields: G(xq ) = X Bβ β≥0 = „ XX β≥0 α≤β xq − xQ′ √ 2h2 Bβ «β β! α!(β − α)! „ xQ − xQ′ √ 2h2 «β−α „ xq − xQ √ 2h2 «α . whose summation order can be interchanged to achieve the result. Because the Hermite and the Taylor expansion are truncated after taking pD terms, we incur an error in approximation. The original error bounds for the Gaussian kernel in [9, 10] were wrong and corrections were shown in [3]. Here, we will present all necessary three error bounds incurred in performing translation operators. We note that these error bounds place limits on the size of the query node and the reference node. 2 Lemma 2.4. Error Bound for Truncating an Hermite Expansion (as presented in [3]): Suppose we are given an Hermite expansion of a reference node XR about its centroid xR : G(xq ) = Aα hα α≥0 xq −xR √ 2h2 NR where Aα = r=1 1 α! xr −xR √ 2h2 α . For any query point xq , the error due to truncating the series after the first pD term is |ǫM (p)| ≤ rp )k rp √ p! NR (1−r)D D−1 k=0 D k (1 − D−k where ∀xr ∈ XR satisfies ||xr − xR ||∞ < rh for r < 1. Proof. (sketch) We expand the Hermite expansion as a product of one-dimensional Hermite functions, and utilize a bound on one-dimensional Hermite functions due to [13]: n −x2 1 2 √ 2 e 2 , n ≥ 0, x ∈ R1 . n! |hn (x)| ≤ n! Lemma 2.5. Error Bound for Truncating a Taylor Expansion Converted from an Hermite Expansion of Infinite Order: Suppose we are given the following Taylor expansion about the centroid xQ of a query node G(xq ) = Bβ β≥0 2 xq −xQ √ 2h2 β where `Strainn[12] proposed the interesting idea of using Stirling’s formula (for any non-negative integer ´ ≤ n!) to lift the node size constraint; one might imagine that this could allow approxin: n+1 e mation of larger regions in a tree-based algorithm. Unfortunately, the error bounds developed in [12] were also incorrect. We have derived the three necessary corrected error bounds based on the techniques in [3]. However, due to space, and because using these bounds actually degraded performance slightly, we do not include those lemmas here. (−1)|β| β! Bβ = Aα hα+β α≥0 xQ −xR √ 2h2 and Aα ’s are the coefficients of the Hermite ex- pansion centered at the reference node centroid xR . Then, truncating the series after pD terms satisfies the error bound |ǫL (p)| ≤ NR (1−r)D D−1 k=0 D k (1 − rp )k rp √ p! D−k where ||xq − xQ ||∞ < rh for r < 1, ∀xq ∈ XQ . Proof. Taylor expansion of the Hermite function yields e −||xq −xr ||2 2h2 Use e „ «„ «β X (−1)|β| X 1 „ xr − xR «α xq − xQ xQ − xR √ √ √ hα+β = β! α! 2h2 2h2 2h2 α≥0 β≥0 «α „ «„ «β „ X (−1)|β| X 1 xR − xr xQ − xR xq − xQ |α| √ √ √ = (−1) hα+β β! α! 2h2 2h2 2h2 β≥0 α≥0 «„ «β „ X (−1)|β| xq − xQ xQ − xr √ √ = hβ β! 2h2 2h2 β≥0 −||xq −xr ||2 2h2 D = i=1 (up (xqi , xri , xQi ) + vp (xqi , xri , xQi )) for 1 ≤ i ≤ D, where «„ «n „ X (−1)ni xqi − xQi i xQi − xri √ √ hni ni ! 2h2 2h2 ni =0 „ «„ «ni ∞ X (−1)ni xqi − xQi xQi − xri √ √ hni vp (xqi , xri , xQi ) = . ni ! 2h2 2h2 ni =p p−1 up (xqi , xri , xQi ) = 1−r p 1−r These univariate functions respectively satisfy up (xqi , xri , xQi ) ≤ 1 rp vp (xqi , xri , xQi ) ≤ √p! 1−r , for 1 ≤ i ≤ D, achieving the multivariate bound. and Lemma 2.6. Error Bound for Truncating a Taylor Expansion Converted from an Already Truncated Hermite Expansion: A truncated Hermite expansion centered about xq −xR the centroid xR of a reference node G(xq ) = Aα hα √2h2 has the following α < rh, and a reference node XR for which ||xr − xR ||∞ < rh for r < 1 , ∀xq ∈ XQ , ∀xr ∈ XR . 2 Proof. We define upi = up (xqi , xri , xQi , xRi ), vpi = vp (xqi , xri , xQi , xRi ), wpi = wp (xqi , xri , xQi , xRi ) for 1 ≤ i ≤ D: upi = „ «„ «ni p−1 X X (−1)ni p−1 1 „ xR − xr «nj xqi − xQi xQi − xRi i i √ √ √ (−1)nj hni +nj ni ! n =0 nj ! 2h2 2h2 2h2 ni =0 j vpi = „ «„ «n ∞ X (−1)ni X 1 „ xR − xr «nj xQi − xRi xqi − xQi i i i √ √ √ (−1)nj hni +nj ni ! n =p nj ! 2h2 2h2 2h2 ni =0 j p−1 wpi = „ «„ «n ∞ ∞ X (−1)ni X 1 „ xR − xr «nj xQi − xRi xqi − xQi i i i √ √ √ (−1)nj hni +nj ni ! n =0 nj ! 2h2 2h2 2h2 ni =p j Note that e −||xq −xr ||2 2h2 D = i=1 (upi + vpi + wpi ) for 1 ≤ i ≤ D. Using the bound for Hermite functions and the property of geometric series, we obtain the following upper bounds: p−1 p−1 upi ≤ X X (2r)ni (2r)nj = ni =0 nj =0 „ 1 − (2r)p ) 1 − 2r «2 „ «„ « p−1 ∞ 1 X X 1 1 − (2r)p (2r)p vpi ≤ √ (2r)ni (2r)nj = √ 1 − 2r 1 − 2r p! n =0 n =p p! i 1 wpi ≤ √ p! j ∞ ∞ X X ni =p nj 1 (2r)ni (2r)nj = √ p! =0 „ 1 1 − 2r «„ (2r)p 1 − 2r « Therefore, ˛ ˛ ! «D−k „ D D−1 ˛ −||xq −xr ||2 ˛ Y X D ((2r)p )(2 − (2r)p ) ˛ ˛ −2D 2 2h √ − upi ˛ ≤ (1 − 2r) ((1 − (2r)p )2 )k ˛e ˛ ˛ k p! i=1 k=0 ˛ ˛ ˛ « ˛ „ „ « D−1 “ ” X D X ˛ xq − xQ β ˛ ((2r)p )(2 − (2r)p ) D−k NR p 2 k ˛≤ ˛G(xq ) − √ ((1 − (2r) ) ) √ Cβ ˛ ˛ 2D (1 − 2r) k p! 2h2 ˛ ˛ k=0 β < 2h, pDH = the smallest p ≥ 1 such that NR (1−r)D D−1 k=0 D k (1 − rp )k rp √ p! D−k < ǫGmin . Q if Q.maxside < 2h, pDL = the smallest p ≥ 1 such that NR (1−r)D D−1 k=0 D k (1 − rp )k rp √ p! D−k < ǫGmin . Q if max(Q.maxside,R.maxside) < h, pH2L = the smallest p ≥ 1 such that NR (1−2r)2D D−1 k=0 D k ((1 − (2r)p )2 )k ((2r)p )(2−(2r)p ) √ p! D−k < ǫGmin . Q cDH = pD NQ . cDL = pD NR . cH2L = DpD+1 . cDirect = DNQ NR . DH DL H2L if no Hermite coefficient of order pDH exists for XR , Compute it. cDH = cDH + pD NR . DH if no Hermite coefficient of order pH2L exists for XR , Compute it. cH2L = cH2L + pD NR . H2L c = min(cDH , cDL , cH2L , cDirect ). if c = cDH < ∞, (Direct Hermite) Evaluate each xq at the Hermite series of order pDH centered about xR of XR using Equation 1. if c = cDL < ∞, (Direct Local) Accumulate each xr ∈ XR as the Taylor series of order pDL about the center xQ of XQ using Equation 2. if c = cH2L < ∞, (Hermite-to-Local) Convert the Hermite series of order pH2L centered about xR of XR to the Taylor series of the same order centered about xQ of XQ using Lemma 2.1. if c = cDirect , Update Gmin and Gmax in Q and all its children. return. if leaf(Q) and leaf(R), Perform the naive algorithm on every pair of points in Q and R. else DFGT(Q.left, R.left). DFGT(Q.left, R.right). DFGT(Q.right, R.left). DFGT(Q.right, R.right). ˛ ˛ ˛b ˛ For the FGT, note that the algorithm only ensures: ˛G(xq ) − Gtrue (xq )˛ ≤ τ . Therefore, we first set τ = ǫ, halving τ until the error tolerance ǫ was met. For the IFGT, which has multiple parameters that must be tweaked simultaneously, an automatic scheme was created, based on the recommendations given in the paper and software documentation: For D = 2, use p = 8; for D = 3, √ use p = 6; set ρx = 2.5; start with K = N and double K until the error tolerance is met. When this failed to meet the tolerance, we resorted to additional trial and error by hand. The costs of parameter selection for these methods in both computer and human time is not included in the table. 4 Algorithm \ scale Naive FGT IFGT DFD DFGT DFGTH Naive FGT IFGT DFD DFGT DFGTH Naive FGT IFGT DFD DFGT DFGTH Naive FGT IFGT DFD DFGT DFGTH Naive FGT IFGT DFD DFGT DFGTH 0.001 0.01 0.1 1 10 100 sj2-50000-2 (astronomy: positions), D = 2, N = 50000, h∗ = 0.00139506 301.696 301.696 301.696 301.696 301.696 301.696 out of RAM out of RAM out of RAM 3.892312 2.01846 0.319538 > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive 0.837724 1.087066 1.658592 6.018158 62.077669 151.590062 0.849935 1.11567 4.599235 72.435177 18.450387 2.777454 0.846294 1.10654 1.683913 6.265131 5.063365 1.036626 ∗ = 0.0016911 colors50k (astronomy: colors), D = 2, N = 50000, h 301.696 301.696 301.696 301.696 301.696 301.696 out of RAM out of RAM out of RAM > 2×Naive > 2×Naive 0.475281 > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive 1.095838 1.469454 2.802112 30.294007 280.633106 81.373053 1.099828 1.983888 29.231309 285.719266 12.886239 5.336602 1.081216 1.47692 2.855083 24.598749 7.142465 1.78648 ∗ edsgc-radec-rnd (astronomy: angles), D = 2, N = 50000, h = 0.00466204 301.696 301.696 301.696 301.696 301.696 301.696 out of RAM out of RAM out of RAM 2.859245 1.768738 0.210799 > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive 0.812462 1.083528 1.682261 5.860172 63.849361 357.099354 0.84023 1.120015 4.346061 73.036687 21.652047 3.424304 0.821672 1.104545 1.737799 6.037217 5.7398 1.883216 ∗ mockgalaxy-D-1M-rnd (cosmology: positions), D = 3, N = 50000, h = 0.000768201 354.868751 354.868751 354.868751 354.868751 354.868751 354.868751 out of RAM out of RAM out of RAM out of RAM > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive 0.70054 0.701547 0.761524 0.843451 1.086608 42.022605 0.73007 0.733638 0.799711 0.999316 50.619588 125.059911 0.724004 0.719951 0.789002 0.877564 1.265064 22.6106 ∗ bio5-rnd (biology: drug activity), D = 5, N = 50000, h = 0.000567161 364.439228 364.439228 364.439228 364.439228 364.439228 364.439228 out of RAM out of RAM out of RAM out of RAM out of RAM out of RAM > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive 2.249868 2.4958865 4.70948 12.065697 94.345003 412.39142 > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive 1000 301.696 0.183616 7.576783 1.551019 2.532401 0.68471 301.696 0.114430 7.55986 3.604753 3.5638 0.627554 301.696 0.059664 7.585585 0.743045 1.977302 0.436596 354.868751 > 2×Naive > 2×Naive 383.12048 109.353701 87.488392 364.439228 out of RAM > 2×Naive 107.675935 > 2×Naive > 2×Naive Discussion. The experiments indicate that the DFGTH method is able to achieve reasonable performance across all bandwidth scales. Unfortunately none of the series approximation-based methods do well on the 5-dimensional data, as expected, highlighting the main weakness of the approach presented. Pursuing corrections to the error bounds necessary to use the intriguing series form of [14] may allow an increase in dimensionality. References [1] A. W. Appel. An Efficient Program for Many-Body Simulations. SIAM Journal on Scientific and Statistical Computing, 6(1):85–103, 1985. [2] J. Barnes and P. Hut. A Hierarchical O(N logN ) Force-Calculation Algorithm. Nature, 324, 1986. [3] B. Baxter and G. Roussos. A new error estimate of the fast gauss transform. SIAM Journal on Scientific Computing, 24(1):257–259, 2002. [4] P. Callahan and S. Kosaraju. A decomposition of multidimensional point sets with applications to k-nearest-neighbors and n-body potential fields. Journal of the ACM, 62(1):67–90, January 1995. [5] A. Gray and A. W. Moore. N-Body Problems in Statistical Learning. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13 (December 2000). MIT Press, 2001. [6] A. G. Gray. Bringing Tractability to Generalized N-Body Problems in Statistical and Scientific Computation. PhD thesis, Carnegie Mellon University, 2003. [7] A. G. Gray and A. W. Moore. Rapid Evaluation of Multiple Density Models. In Artificial Intelligence and Statistics 2003, 2003. [8] L. Greengard and V. Rokhlin. A Fast Algorithm for Particle Simulations. Journal of Computational Physics, 73, 1987. [9] L. Greengard and J. Strain. The fast gauss transform. SIAM Journal on Scientific and Statistical Computing, 12(1):79–94, 1991. [10] L. Greengard and X. Sun. A new version of the fast gauss transform. Documenta Mathematica, Extra Volume ICM(III):575– 584, 1998. [11] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, 1986. [12] J. Strain. The fast gauss transform with variable scales. SIAM Journal on Scientific and Statistical Computing, 12:1131– 1139, 1991. [13] O. Sz´ sz. On the relative extrema of the hermite orthogonal functions. J. Indian Math. Soc., 15:129–134, 1951. a [14] C. Yang, R. Duraiswami, N. A. Gumerov, and L. Davis. Improved fast gauss transform and efficient kernel density estimation. International Conference on Computer Vision, 2003.

2 0.089534268 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts

Author: Edward Meeds, Simon Osindero

Abstract: We present an infinite mixture model in which each component comprises a multivariate Gaussian distribution over an input space, and a Gaussian Process model over an output space. Our model is neatly able to deal with non-stationary covariance functions, discontinuities, multimodality and overlapping output signals. The work is similar to that by Rasmussen and Ghahramani [1]; however, we use a full generative model over input and output space rather than just a conditional model. This allows us to deal with incomplete data, to perform inference over inverse functional mappings as well as for regression, and also leads to a more powerful and consistent Bayesian specification of the effective ‘gating network’ for the different experts. 1

3 0.08233463 114 nips-2005-Learning Rankings via Convex Hull Separation

Author: Glenn Fung, Rómer Rosales, Balaji Krishnapuram

Abstract: We propose efficient algorithms for learning ranking functions from order constraints between sets—i.e. classes—of training samples. Our algorithms may be used for maximizing the generalized Wilcoxon Mann Whitney statistic that accounts for the partial ordering of the classes: special cases include maximizing the area under the ROC curve for binary classification and its generalization for ordinal regression. Experiments on public benchmarks indicate that: (a) the proposed algorithm is at least as accurate as the current state-of-the-art; (b) computationally, it is several orders of magnitude faster and—unlike current methods—it is easily able to handle even large datasets with over 20,000 samples. 1

4 0.046070337 33 nips-2005-Bayesian Sets

Author: Zoubin Ghahramani, Katherine A. Heller

Abstract: Inspired by “Google™ Sets”, we consider the problem of retrieving items from a concept or cluster, given a query consisting of a few items from that cluster. We formulate this as a Bayesian inference problem and describe a very simple algorithm for solving it. Our algorithm uses a modelbased concept of a cluster and ranks items using a score which evaluates the marginal probability that each item belongs to a cluster containing the query items. For exponential family models with conjugate priors this marginal probability is a simple function of sufficient statistics. We focus on sparse binary data and show that our score can be evaluated exactly using a single sparse matrix multiplication, making it possible to apply our algorithm to very large datasets. We evaluate our algorithm on three datasets: retrieving movies from EachMovie, finding completions of author sets from the NIPS dataset, and finding completions of sets of words appearing in the Grolier encyclopedia. We compare to Google™ Sets and show that Bayesian Sets gives very reasonable set completions. 1

5 0.041234318 195 nips-2005-Transfer learning for text classification

Author: Chuong B. Do, Andrew Y. Ng

Abstract: Linear text classification algorithms work by computing an inner product between a test document vector and a parameter vector. In many such algorithms, including naive Bayes and most TFIDF variants, the parameters are determined by some simple, closed-form, function of training set statistics; we call this mapping mapping from statistics to parameters, the parameter function. Much research in text classification over the last few decades has consisted of manual efforts to identify better parameter functions. In this paper, we propose an algorithm for automatically learning this function from related classification problems. The parameter function found by our algorithm then defines a new learning algorithm for text classification, which we can apply to novel classification tasks. We find that our learned classifier outperforms existing methods on a variety of multiclass text classification tasks. 1

6 0.038696837 117 nips-2005-Learning from Data of Variable Quality

7 0.036245052 168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions

8 0.036186371 69 nips-2005-Fast Gaussian Process Regression using KD-Trees

9 0.033864498 71 nips-2005-Fast Krylov Methods for N-Body Learning

10 0.028613016 31 nips-2005-Asymptotics of Gaussian Regularized Least Squares

11 0.028418237 70 nips-2005-Fast Information Value for Graphical Models

12 0.027795542 104 nips-2005-Laplacian Score for Feature Selection

13 0.026349707 160 nips-2005-Query by Committee Made Real

14 0.025299372 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning

15 0.022184059 44 nips-2005-Computing the Solution Path for the Regularized Support Vector Regression

16 0.020057306 200 nips-2005-Variable KD-Tree Algorithms for Spatial Pattern Search and Discovery

17 0.019954167 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators

18 0.01951698 147 nips-2005-On the Convergence of Eigenspaces in Kernel Principal Component Analysis

19 0.018151676 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach

20 0.018141111 121 nips-2005-Location-based activity recognition


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.059), (1, 0.027), (2, -0.014), (3, -0.021), (4, 0.014), (5, -0.002), (6, -0.019), (7, 0.036), (8, 0.032), (9, 0.047), (10, 0.013), (11, 0.022), (12, 0.048), (13, 0.039), (14, -0.026), (15, 0.047), (16, 0.013), (17, -0.055), (18, -0.005), (19, 0.0), (20, -0.019), (21, -0.03), (22, 0.043), (23, -0.032), (24, -0.025), (25, -0.001), (26, -0.01), (27, 0.146), (28, -0.067), (29, -0.021), (30, -0.06), (31, 0.023), (32, 0.064), (33, 0.027), (34, -0.077), (35, 0.087), (36, -0.027), (37, -0.128), (38, -0.03), (39, 0.089), (40, -0.016), (41, 0.208), (42, -0.107), (43, -0.036), (44, -0.053), (45, -0.04), (46, -0.014), (47, 0.022), (48, -0.143), (49, 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97482121 59 nips-2005-Dual-Tree Fast Gauss Transforms

Author: Dongryeol Lee, Andrew W. Moore, Alexander G. Gray

Abstract: In previous work we presented an efficient approach to computing kernel summations which arise in many machine learning methods such as kernel density estimation. This approach, dual-tree recursion with finitedifference approximation, generalized existing methods for similar problems arising in computational physics in two ways appropriate for statistical problems: toward distribution sensitivity and general dimension, partly by avoiding series expansions. While this proved to be the fastest practical method for multivariate kernel density estimation at the optimal bandwidth, it is much less efficient at larger-than-optimal bandwidths. In this work, we explore the extent to which the dual-tree approach can be integrated with multipole-like Hermite expansions in order to achieve reasonable efficiency across all bandwidth scales, though only for low dimensionalities. In the process, we derive and demonstrate the first truly hierarchical fast Gauss transforms, effectively combining the best tools from discrete algorithms and continuous approximation theory. 1 Fast Gaussian Summation Kernel summations are fundamental in both statistics/learning and computational physics. NR e This paper will focus on the common form G(xq ) = −||xq −xr ||2 2h2 i.e. where the ker- r=1 nel is the Gaussian kernel with scaling parameter, or bandwidth h, there are NR reference points xr , and we desire the sum for NQ different query points xq . Such kernel summations appear in a wide array of statistical/learning methods [5], perhaps most obviously in kernel density estimation [11], the most widely used distribution-free method for the fundamental task of density estimation, which will be our main example. Understanding kernel summation algorithms from a recently developed unified perspective [5] begins with the picture of Figure 1, then separately considers the discrete and continuous aspects. Discrete/geometric aspect. In terms of discrete algorithmic structure, the dual-tree framework of [5], in the context of kernel summation, generalizes all of the well-known algorithms. 1 It was applied to the problem of kernel density estimation in [7] using a simple 1 These include the Barnes-Hut algorithm [2], the Fast Multipole Method [8], Appel’s algorithm [1], and the WSPD [4]: the dual-tree method is a node-node algorithm (considers query regions rather than points), is fully recursive, can use distribution-sensitive data structures such as kd-trees, and is bichromatic (can specialize for differing query and reference sets). Figure 1: The basic idea is to approximate the kernel sum contribution of some subset of the reference points XR , lying in some compact region of space R with centroid xR , to a query point. In more efficient schemes a query region is considered, i.e. the approximate contribution is made to an entire subset of the query points XQ lying in some region of space Q, with centroid xQ . finite-difference approximation, which is tantamount to a centroid approximation. Partially by avoiding series expansions, which depend explicitly on the dimension, the result was the fastest such algorithm for general dimension, when operating at the optimal bandwidth. Unfortunately, when performing cross-validation to determine the (initially unknown) optimal bandwidth, both suboptimally small and large bandwidths must be evaluated. The finite-difference-based dual-tree method tends to be efficient at or below the optimal bandwidth, and at very large bandwidths, but for intermediately-large bandwidths it suffers. Continuous/approximation aspect. This motivates investigating a multipole-like series approximation which is appropriate for the Gaussian kernel, as introduced by [9], which can be shown the generalize the centroid approximation. We define the Hermite functions 2 hn (t) by hn (t) = e−t Hn (t), where the Hermite polynomials Hn (t) are defined by the 2 2 Rodrigues formula: Hn (t) = (−1)n et Dn e−t , t ∈ R1 . After scaling and shifting the argument t appropriately, then taking the product of univariate functions for each dimension, we obtain the multivariate Hermite expansion NR G(xq ) = e −||xq −xr ||2 2h2 NR = r=1 r=1 α≥0 1 α! xr − xR √ 2h2 α hα xq − xR √ 2h2 (1) where we’ve adopted the usual multi-index notation as in [9]. This can be re-written as NR G(xq ) = e r=1 −||xq −xr ||2 2h2 NR = r=1 α≥0 1 hα α! xr − xQ √ 2h2 xq − xQ √ 2h2 α (2) to express the sum as a Taylor (local) expansion about a nearby representative centroid xQ in the query region. We will be using both types of expansions simultaneously. Since series approximations only hold locally, Greengard and Rokhlin [8] showed that it is useful to think in terms of a set of three ‘translation operators’ for converting between expansions centered at different points, in order to create their celebrated hierarchical algorithm. This was done in the context of the Coulombic kernel, but the Gaussian kernel has importantly different mathematical properties. The original Fast Gauss Transform (FGT) [9] was based on a flat grid, and thus provided only one operator (“H2L” of the next section), with an associated error bound (which was unfortunately incorrect). The Improved Fast Gauss Transform (IFGT) [14] was based on a flat set of clusters and provided no operators with a rearranged series approximation, which intended to be more favorable in higher dimensions but had an incorrect error bound. We will show the derivations of all the translation operators and associated error bounds needed to obtain, for the first time, a hierarchical algorithm for the Gaussian kernel. 2 Translation Operators and Error Bounds The first operator converts a multipole expansion of a reference node to form a local expansion centered at the centroid of the query node, and is our main approximation workhorse. Lemma 2.1. Hermite-to-local (H2L) translation operator for Gaussian kernel (as presented in Lemma 2.2 in [9, 10]): Given a reference node XR , a query node XQ , and the Hermite expansion centered at a centroid xR of XR : G(xq ) = Aα hα α≥0 xq −xR √ 2h2 , the Taylor expansion of the Hermite expansion at the centroid xQ of the query node XQ is given by G(xq ) = Bβ β≥0 xq −xQ √ 2h2 β where Bβ = (−1)|β| β! Aα hα+β α≥0 xQ −xR √ 2h2 . Proof. (sketch) The proof consists of replacing the Hermite function portion of the expansion with its Taylor series. NR Note that we can rewrite G(xq ) = α≥0 r=1 1 α! xr −xR √ 2h2 α hα xq −xR √ 2h2 by interchanging the summation order, such that the term in the brackets depends only on the reference points, and can thus be computed indepedent of any query location – we will call such terms Hermite moments. The next operator allows the efficient pre-computation of the Hermite moments in the reference tree in a bottom-up fashion from its children. Lemma 2.2. Hermite-to-Hermite (H2H) translation operator for Gaussian kernel: Given the Hermite expansion centered at a centroid xR′ in a reference node XR′ : xq −x G(xq ) = A′ hα √2hR′ , this same Hermite expansion shifted to a new locaα 2 α≥0 tion xR of the parent node of XR is given by G(xq ) = Aγ hγ γ≥0 Aγ = 0≤α≤γ 1 ′ (γ−α)! Aα xR′ −xR √ 2h2 xq −xR √ 2h2 where γ−α . Proof. We simply replace the Hermite function part of the expansion by a new Taylor series, as follows: « x q − x R′ √ 2h2 α≥0 „ « X ′ X 1 „ x R − x R′ « β xq − xR √ √ = Aα (−1)|β| hα+β β! 2h2 2h2 α≥0 β≥0 „ «β « „ X X ′ 1 x R − x R′ xq − xR |β| √ √ (−1) hα+β = Aα β! 2h2 2h2 α≥0 β≥0 „ «β „ « X X ′ 1 x R′ − x R xq − xR √ √ Aα = hα+β β! 2h2 2h2 α≥0 β≥0 3 2 «γ−α „ « „ X X 1 x R′ − x R q ′ 5 hγ x√− xR 4 √ = Aα (γ − α)! 2h2 2h2 γ≥0 0≤α≤γ G(xq ) = where γ = α + β. X A′ hα α „ The next operator acts as a “clean-up” routine in a hierarchical algorithm. Since we can approximate at different scales in the query tree, we must somehow combine all the approximations at the end of the computation. By performing a breadth-first traversal of the query tree, the L2L operator shifts a node’s local expansion to the centroid of each child. Lemma 2.3. Local-to-local (L2L) translation operator for Gaussian kernel: Given a Taylor expansion centered at a centroid xQ′ of a query node XQ′ : G(xq ) = xq −xQ′ √ 2h2 Bβ β≥0 β , the Taylor expansion obtained by shift- ing this expansion to the new centroid xQ of the child node XQ is G(xq ) = α≥0 β≥α β! α!(β−α)! Bβ β−α xQ −xQ′ √ 2h2 xq −xQ √ 2h2 α . Proof. Applying the multinomial theorem to to expand about the new center xQ yields: G(xq ) = X Bβ β≥0 = „ XX β≥0 α≤β xq − xQ′ √ 2h2 Bβ «β β! α!(β − α)! „ xQ − xQ′ √ 2h2 «β−α „ xq − xQ √ 2h2 «α . whose summation order can be interchanged to achieve the result. Because the Hermite and the Taylor expansion are truncated after taking pD terms, we incur an error in approximation. The original error bounds for the Gaussian kernel in [9, 10] were wrong and corrections were shown in [3]. Here, we will present all necessary three error bounds incurred in performing translation operators. We note that these error bounds place limits on the size of the query node and the reference node. 2 Lemma 2.4. Error Bound for Truncating an Hermite Expansion (as presented in [3]): Suppose we are given an Hermite expansion of a reference node XR about its centroid xR : G(xq ) = Aα hα α≥0 xq −xR √ 2h2 NR where Aα = r=1 1 α! xr −xR √ 2h2 α . For any query point xq , the error due to truncating the series after the first pD term is |ǫM (p)| ≤ rp )k rp √ p! NR (1−r)D D−1 k=0 D k (1 − D−k where ∀xr ∈ XR satisfies ||xr − xR ||∞ < rh for r < 1. Proof. (sketch) We expand the Hermite expansion as a product of one-dimensional Hermite functions, and utilize a bound on one-dimensional Hermite functions due to [13]: n −x2 1 2 √ 2 e 2 , n ≥ 0, x ∈ R1 . n! |hn (x)| ≤ n! Lemma 2.5. Error Bound for Truncating a Taylor Expansion Converted from an Hermite Expansion of Infinite Order: Suppose we are given the following Taylor expansion about the centroid xQ of a query node G(xq ) = Bβ β≥0 2 xq −xQ √ 2h2 β where `Strainn[12] proposed the interesting idea of using Stirling’s formula (for any non-negative integer ´ ≤ n!) to lift the node size constraint; one might imagine that this could allow approxin: n+1 e mation of larger regions in a tree-based algorithm. Unfortunately, the error bounds developed in [12] were also incorrect. We have derived the three necessary corrected error bounds based on the techniques in [3]. However, due to space, and because using these bounds actually degraded performance slightly, we do not include those lemmas here. (−1)|β| β! Bβ = Aα hα+β α≥0 xQ −xR √ 2h2 and Aα ’s are the coefficients of the Hermite ex- pansion centered at the reference node centroid xR . Then, truncating the series after pD terms satisfies the error bound |ǫL (p)| ≤ NR (1−r)D D−1 k=0 D k (1 − rp )k rp √ p! D−k where ||xq − xQ ||∞ < rh for r < 1, ∀xq ∈ XQ . Proof. Taylor expansion of the Hermite function yields e −||xq −xr ||2 2h2 Use e „ «„ «β X (−1)|β| X 1 „ xr − xR «α xq − xQ xQ − xR √ √ √ hα+β = β! α! 2h2 2h2 2h2 α≥0 β≥0 «α „ «„ «β „ X (−1)|β| X 1 xR − xr xQ − xR xq − xQ |α| √ √ √ = (−1) hα+β β! α! 2h2 2h2 2h2 β≥0 α≥0 «„ «β „ X (−1)|β| xq − xQ xQ − xr √ √ = hβ β! 2h2 2h2 β≥0 −||xq −xr ||2 2h2 D = i=1 (up (xqi , xri , xQi ) + vp (xqi , xri , xQi )) for 1 ≤ i ≤ D, where «„ «n „ X (−1)ni xqi − xQi i xQi − xri √ √ hni ni ! 2h2 2h2 ni =0 „ «„ «ni ∞ X (−1)ni xqi − xQi xQi − xri √ √ hni vp (xqi , xri , xQi ) = . ni ! 2h2 2h2 ni =p p−1 up (xqi , xri , xQi ) = 1−r p 1−r These univariate functions respectively satisfy up (xqi , xri , xQi ) ≤ 1 rp vp (xqi , xri , xQi ) ≤ √p! 1−r , for 1 ≤ i ≤ D, achieving the multivariate bound. and Lemma 2.6. Error Bound for Truncating a Taylor Expansion Converted from an Already Truncated Hermite Expansion: A truncated Hermite expansion centered about xq −xR the centroid xR of a reference node G(xq ) = Aα hα √2h2 has the following α < rh, and a reference node XR for which ||xr − xR ||∞ < rh for r < 1 , ∀xq ∈ XQ , ∀xr ∈ XR . 2 Proof. We define upi = up (xqi , xri , xQi , xRi ), vpi = vp (xqi , xri , xQi , xRi ), wpi = wp (xqi , xri , xQi , xRi ) for 1 ≤ i ≤ D: upi = „ «„ «ni p−1 X X (−1)ni p−1 1 „ xR − xr «nj xqi − xQi xQi − xRi i i √ √ √ (−1)nj hni +nj ni ! n =0 nj ! 2h2 2h2 2h2 ni =0 j vpi = „ «„ «n ∞ X (−1)ni X 1 „ xR − xr «nj xQi − xRi xqi − xQi i i i √ √ √ (−1)nj hni +nj ni ! n =p nj ! 2h2 2h2 2h2 ni =0 j p−1 wpi = „ «„ «n ∞ ∞ X (−1)ni X 1 „ xR − xr «nj xQi − xRi xqi − xQi i i i √ √ √ (−1)nj hni +nj ni ! n =0 nj ! 2h2 2h2 2h2 ni =p j Note that e −||xq −xr ||2 2h2 D = i=1 (upi + vpi + wpi ) for 1 ≤ i ≤ D. Using the bound for Hermite functions and the property of geometric series, we obtain the following upper bounds: p−1 p−1 upi ≤ X X (2r)ni (2r)nj = ni =0 nj =0 „ 1 − (2r)p ) 1 − 2r «2 „ «„ « p−1 ∞ 1 X X 1 1 − (2r)p (2r)p vpi ≤ √ (2r)ni (2r)nj = √ 1 − 2r 1 − 2r p! n =0 n =p p! i 1 wpi ≤ √ p! j ∞ ∞ X X ni =p nj 1 (2r)ni (2r)nj = √ p! =0 „ 1 1 − 2r «„ (2r)p 1 − 2r « Therefore, ˛ ˛ ! «D−k „ D D−1 ˛ −||xq −xr ||2 ˛ Y X D ((2r)p )(2 − (2r)p ) ˛ ˛ −2D 2 2h √ − upi ˛ ≤ (1 − 2r) ((1 − (2r)p )2 )k ˛e ˛ ˛ k p! i=1 k=0 ˛ ˛ ˛ « ˛ „ „ « D−1 “ ” X D X ˛ xq − xQ β ˛ ((2r)p )(2 − (2r)p ) D−k NR p 2 k ˛≤ ˛G(xq ) − √ ((1 − (2r) ) ) √ Cβ ˛ ˛ 2D (1 − 2r) k p! 2h2 ˛ ˛ k=0 β < 2h, pDH = the smallest p ≥ 1 such that NR (1−r)D D−1 k=0 D k (1 − rp )k rp √ p! D−k < ǫGmin . Q if Q.maxside < 2h, pDL = the smallest p ≥ 1 such that NR (1−r)D D−1 k=0 D k (1 − rp )k rp √ p! D−k < ǫGmin . Q if max(Q.maxside,R.maxside) < h, pH2L = the smallest p ≥ 1 such that NR (1−2r)2D D−1 k=0 D k ((1 − (2r)p )2 )k ((2r)p )(2−(2r)p ) √ p! D−k < ǫGmin . Q cDH = pD NQ . cDL = pD NR . cH2L = DpD+1 . cDirect = DNQ NR . DH DL H2L if no Hermite coefficient of order pDH exists for XR , Compute it. cDH = cDH + pD NR . DH if no Hermite coefficient of order pH2L exists for XR , Compute it. cH2L = cH2L + pD NR . H2L c = min(cDH , cDL , cH2L , cDirect ). if c = cDH < ∞, (Direct Hermite) Evaluate each xq at the Hermite series of order pDH centered about xR of XR using Equation 1. if c = cDL < ∞, (Direct Local) Accumulate each xr ∈ XR as the Taylor series of order pDL about the center xQ of XQ using Equation 2. if c = cH2L < ∞, (Hermite-to-Local) Convert the Hermite series of order pH2L centered about xR of XR to the Taylor series of the same order centered about xQ of XQ using Lemma 2.1. if c = cDirect , Update Gmin and Gmax in Q and all its children. return. if leaf(Q) and leaf(R), Perform the naive algorithm on every pair of points in Q and R. else DFGT(Q.left, R.left). DFGT(Q.left, R.right). DFGT(Q.right, R.left). DFGT(Q.right, R.right). ˛ ˛ ˛b ˛ For the FGT, note that the algorithm only ensures: ˛G(xq ) − Gtrue (xq )˛ ≤ τ . Therefore, we first set τ = ǫ, halving τ until the error tolerance ǫ was met. For the IFGT, which has multiple parameters that must be tweaked simultaneously, an automatic scheme was created, based on the recommendations given in the paper and software documentation: For D = 2, use p = 8; for D = 3, √ use p = 6; set ρx = 2.5; start with K = N and double K until the error tolerance is met. When this failed to meet the tolerance, we resorted to additional trial and error by hand. The costs of parameter selection for these methods in both computer and human time is not included in the table. 4 Algorithm \ scale Naive FGT IFGT DFD DFGT DFGTH Naive FGT IFGT DFD DFGT DFGTH Naive FGT IFGT DFD DFGT DFGTH Naive FGT IFGT DFD DFGT DFGTH Naive FGT IFGT DFD DFGT DFGTH 0.001 0.01 0.1 1 10 100 sj2-50000-2 (astronomy: positions), D = 2, N = 50000, h∗ = 0.00139506 301.696 301.696 301.696 301.696 301.696 301.696 out of RAM out of RAM out of RAM 3.892312 2.01846 0.319538 > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive 0.837724 1.087066 1.658592 6.018158 62.077669 151.590062 0.849935 1.11567 4.599235 72.435177 18.450387 2.777454 0.846294 1.10654 1.683913 6.265131 5.063365 1.036626 ∗ = 0.0016911 colors50k (astronomy: colors), D = 2, N = 50000, h 301.696 301.696 301.696 301.696 301.696 301.696 out of RAM out of RAM out of RAM > 2×Naive > 2×Naive 0.475281 > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive 1.095838 1.469454 2.802112 30.294007 280.633106 81.373053 1.099828 1.983888 29.231309 285.719266 12.886239 5.336602 1.081216 1.47692 2.855083 24.598749 7.142465 1.78648 ∗ edsgc-radec-rnd (astronomy: angles), D = 2, N = 50000, h = 0.00466204 301.696 301.696 301.696 301.696 301.696 301.696 out of RAM out of RAM out of RAM 2.859245 1.768738 0.210799 > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive 0.812462 1.083528 1.682261 5.860172 63.849361 357.099354 0.84023 1.120015 4.346061 73.036687 21.652047 3.424304 0.821672 1.104545 1.737799 6.037217 5.7398 1.883216 ∗ mockgalaxy-D-1M-rnd (cosmology: positions), D = 3, N = 50000, h = 0.000768201 354.868751 354.868751 354.868751 354.868751 354.868751 354.868751 out of RAM out of RAM out of RAM out of RAM > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive 0.70054 0.701547 0.761524 0.843451 1.086608 42.022605 0.73007 0.733638 0.799711 0.999316 50.619588 125.059911 0.724004 0.719951 0.789002 0.877564 1.265064 22.6106 ∗ bio5-rnd (biology: drug activity), D = 5, N = 50000, h = 0.000567161 364.439228 364.439228 364.439228 364.439228 364.439228 364.439228 out of RAM out of RAM out of RAM out of RAM out of RAM out of RAM > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive 2.249868 2.4958865 4.70948 12.065697 94.345003 412.39142 > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive 1000 301.696 0.183616 7.576783 1.551019 2.532401 0.68471 301.696 0.114430 7.55986 3.604753 3.5638 0.627554 301.696 0.059664 7.585585 0.743045 1.977302 0.436596 354.868751 > 2×Naive > 2×Naive 383.12048 109.353701 87.488392 364.439228 out of RAM > 2×Naive 107.675935 > 2×Naive > 2×Naive Discussion. The experiments indicate that the DFGTH method is able to achieve reasonable performance across all bandwidth scales. Unfortunately none of the series approximation-based methods do well on the 5-dimensional data, as expected, highlighting the main weakness of the approach presented. Pursuing corrections to the error bounds necessary to use the intriguing series form of [14] may allow an increase in dimensionality. References [1] A. W. Appel. An Efficient Program for Many-Body Simulations. SIAM Journal on Scientific and Statistical Computing, 6(1):85–103, 1985. [2] J. Barnes and P. Hut. A Hierarchical O(N logN ) Force-Calculation Algorithm. Nature, 324, 1986. [3] B. Baxter and G. Roussos. A new error estimate of the fast gauss transform. SIAM Journal on Scientific Computing, 24(1):257–259, 2002. [4] P. Callahan and S. Kosaraju. A decomposition of multidimensional point sets with applications to k-nearest-neighbors and n-body potential fields. Journal of the ACM, 62(1):67–90, January 1995. [5] A. Gray and A. W. Moore. N-Body Problems in Statistical Learning. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13 (December 2000). MIT Press, 2001. [6] A. G. Gray. Bringing Tractability to Generalized N-Body Problems in Statistical and Scientific Computation. PhD thesis, Carnegie Mellon University, 2003. [7] A. G. Gray and A. W. Moore. Rapid Evaluation of Multiple Density Models. In Artificial Intelligence and Statistics 2003, 2003. [8] L. Greengard and V. Rokhlin. A Fast Algorithm for Particle Simulations. Journal of Computational Physics, 73, 1987. [9] L. Greengard and J. Strain. The fast gauss transform. SIAM Journal on Scientific and Statistical Computing, 12(1):79–94, 1991. [10] L. Greengard and X. Sun. A new version of the fast gauss transform. Documenta Mathematica, Extra Volume ICM(III):575– 584, 1998. [11] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, 1986. [12] J. Strain. The fast gauss transform with variable scales. SIAM Journal on Scientific and Statistical Computing, 12:1131– 1139, 1991. [13] O. Sz´ sz. On the relative extrema of the hermite orthogonal functions. J. Indian Math. Soc., 15:129–134, 1951. a [14] C. Yang, R. Duraiswami, N. A. Gumerov, and L. Davis. Improved fast gauss transform and efficient kernel density estimation. International Conference on Computer Vision, 2003.

2 0.42715371 71 nips-2005-Fast Krylov Methods for N-Body Learning

Author: Nando D. Freitas, Yang Wang, Maryam Mahdaviani, Dustin Lang

Abstract: This paper addresses the issue of numerical computation in machine learning domains based on similarity metrics, such as kernel methods, spectral techniques and Gaussian processes. It presents a general solution strategy based on Krylov subspace iteration and fast N-body learning methods. The experiments show significant gains in computation and storage on datasets arising in image segmentation, object detection and dimensionality reduction. The paper also presents theoretical bounds on the stability of these methods.

3 0.40976906 33 nips-2005-Bayesian Sets

Author: Zoubin Ghahramani, Katherine A. Heller

Abstract: Inspired by “Google™ Sets”, we consider the problem of retrieving items from a concept or cluster, given a query consisting of a few items from that cluster. We formulate this as a Bayesian inference problem and describe a very simple algorithm for solving it. Our algorithm uses a modelbased concept of a cluster and ranks items using a score which evaluates the marginal probability that each item belongs to a cluster containing the query items. For exponential family models with conjugate priors this marginal probability is a simple function of sufficient statistics. We focus on sparse binary data and show that our score can be evaluated exactly using a single sparse matrix multiplication, making it possible to apply our algorithm to very large datasets. We evaluate our algorithm on three datasets: retrieving movies from EachMovie, finding completions of author sets from the NIPS dataset, and finding completions of sets of words appearing in the Grolier encyclopedia. We compare to Google™ Sets and show that Bayesian Sets gives very reasonable set completions. 1

4 0.39658991 114 nips-2005-Learning Rankings via Convex Hull Separation

Author: Glenn Fung, Rómer Rosales, Balaji Krishnapuram

Abstract: We propose efficient algorithms for learning ranking functions from order constraints between sets—i.e. classes—of training samples. Our algorithms may be used for maximizing the generalized Wilcoxon Mann Whitney statistic that accounts for the partial ordering of the classes: special cases include maximizing the area under the ROC curve for binary classification and its generalization for ordinal regression. Experiments on public benchmarks indicate that: (a) the proposed algorithm is at least as accurate as the current state-of-the-art; (b) computationally, it is several orders of magnitude faster and—unlike current methods—it is easily able to handle even large datasets with over 20,000 samples. 1

5 0.39298186 117 nips-2005-Learning from Data of Variable Quality

Author: Koby Crammer, Michael Kearns, Jennifer Wortman

Abstract: We initiate the study of learning from multiple sources of limited data, each of which may be corrupted at a different rate. We develop a complete theory of which data sources should be used for two fundamental problems: estimating the bias of a coin, and learning a classifier in the presence of label noise. In both cases, efficient algorithms are provided for computing the optimal subset of data. 1

6 0.38176277 31 nips-2005-Asymptotics of Gaussian Regularized Least Squares

7 0.38143417 69 nips-2005-Fast Gaussian Process Regression using KD-Trees

8 0.33225021 104 nips-2005-Laplacian Score for Feature Selection

9 0.32680437 86 nips-2005-Generalized Nonnegative Matrix Approximations with Bregman Divergences

10 0.31639427 120 nips-2005-Learning vehicular dynamics, with application to modeling helicopters

11 0.30754527 70 nips-2005-Fast Information Value for Graphical Models

12 0.29731119 85 nips-2005-Generalization to Unseen Cases

13 0.29648143 83 nips-2005-Generalization error bounds for classifiers trained with interdependent data

14 0.28929996 130 nips-2005-Modeling Neuronal Interactivity using Dynamic Bayesian Networks

15 0.27633461 1 nips-2005-AER Building Blocks for Multi-Layer Multi-Chip Neuromorphic Vision Systems

16 0.23232244 84 nips-2005-Generalization in Clustering with Unobserved Features

17 0.2281495 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts

18 0.22724476 112 nips-2005-Learning Minimum Volume Sets

19 0.21156918 51 nips-2005-Correcting sample selection bias in maximum entropy density estimation

20 0.19816312 140 nips-2005-Nonparametric inference of prior probabilities from Bayes-optimal behavior


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.044), (10, 0.024), (18, 0.012), (27, 0.015), (31, 0.019), (34, 0.044), (55, 0.037), (63, 0.497), (69, 0.03), (73, 0.06), (88, 0.032), (91, 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.89571714 59 nips-2005-Dual-Tree Fast Gauss Transforms

Author: Dongryeol Lee, Andrew W. Moore, Alexander G. Gray

Abstract: In previous work we presented an efficient approach to computing kernel summations which arise in many machine learning methods such as kernel density estimation. This approach, dual-tree recursion with finitedifference approximation, generalized existing methods for similar problems arising in computational physics in two ways appropriate for statistical problems: toward distribution sensitivity and general dimension, partly by avoiding series expansions. While this proved to be the fastest practical method for multivariate kernel density estimation at the optimal bandwidth, it is much less efficient at larger-than-optimal bandwidths. In this work, we explore the extent to which the dual-tree approach can be integrated with multipole-like Hermite expansions in order to achieve reasonable efficiency across all bandwidth scales, though only for low dimensionalities. In the process, we derive and demonstrate the first truly hierarchical fast Gauss transforms, effectively combining the best tools from discrete algorithms and continuous approximation theory. 1 Fast Gaussian Summation Kernel summations are fundamental in both statistics/learning and computational physics. NR e This paper will focus on the common form G(xq ) = −||xq −xr ||2 2h2 i.e. where the ker- r=1 nel is the Gaussian kernel with scaling parameter, or bandwidth h, there are NR reference points xr , and we desire the sum for NQ different query points xq . Such kernel summations appear in a wide array of statistical/learning methods [5], perhaps most obviously in kernel density estimation [11], the most widely used distribution-free method for the fundamental task of density estimation, which will be our main example. Understanding kernel summation algorithms from a recently developed unified perspective [5] begins with the picture of Figure 1, then separately considers the discrete and continuous aspects. Discrete/geometric aspect. In terms of discrete algorithmic structure, the dual-tree framework of [5], in the context of kernel summation, generalizes all of the well-known algorithms. 1 It was applied to the problem of kernel density estimation in [7] using a simple 1 These include the Barnes-Hut algorithm [2], the Fast Multipole Method [8], Appel’s algorithm [1], and the WSPD [4]: the dual-tree method is a node-node algorithm (considers query regions rather than points), is fully recursive, can use distribution-sensitive data structures such as kd-trees, and is bichromatic (can specialize for differing query and reference sets). Figure 1: The basic idea is to approximate the kernel sum contribution of some subset of the reference points XR , lying in some compact region of space R with centroid xR , to a query point. In more efficient schemes a query region is considered, i.e. the approximate contribution is made to an entire subset of the query points XQ lying in some region of space Q, with centroid xQ . finite-difference approximation, which is tantamount to a centroid approximation. Partially by avoiding series expansions, which depend explicitly on the dimension, the result was the fastest such algorithm for general dimension, when operating at the optimal bandwidth. Unfortunately, when performing cross-validation to determine the (initially unknown) optimal bandwidth, both suboptimally small and large bandwidths must be evaluated. The finite-difference-based dual-tree method tends to be efficient at or below the optimal bandwidth, and at very large bandwidths, but for intermediately-large bandwidths it suffers. Continuous/approximation aspect. This motivates investigating a multipole-like series approximation which is appropriate for the Gaussian kernel, as introduced by [9], which can be shown the generalize the centroid approximation. We define the Hermite functions 2 hn (t) by hn (t) = e−t Hn (t), where the Hermite polynomials Hn (t) are defined by the 2 2 Rodrigues formula: Hn (t) = (−1)n et Dn e−t , t ∈ R1 . After scaling and shifting the argument t appropriately, then taking the product of univariate functions for each dimension, we obtain the multivariate Hermite expansion NR G(xq ) = e −||xq −xr ||2 2h2 NR = r=1 r=1 α≥0 1 α! xr − xR √ 2h2 α hα xq − xR √ 2h2 (1) where we’ve adopted the usual multi-index notation as in [9]. This can be re-written as NR G(xq ) = e r=1 −||xq −xr ||2 2h2 NR = r=1 α≥0 1 hα α! xr − xQ √ 2h2 xq − xQ √ 2h2 α (2) to express the sum as a Taylor (local) expansion about a nearby representative centroid xQ in the query region. We will be using both types of expansions simultaneously. Since series approximations only hold locally, Greengard and Rokhlin [8] showed that it is useful to think in terms of a set of three ‘translation operators’ for converting between expansions centered at different points, in order to create their celebrated hierarchical algorithm. This was done in the context of the Coulombic kernel, but the Gaussian kernel has importantly different mathematical properties. The original Fast Gauss Transform (FGT) [9] was based on a flat grid, and thus provided only one operator (“H2L” of the next section), with an associated error bound (which was unfortunately incorrect). The Improved Fast Gauss Transform (IFGT) [14] was based on a flat set of clusters and provided no operators with a rearranged series approximation, which intended to be more favorable in higher dimensions but had an incorrect error bound. We will show the derivations of all the translation operators and associated error bounds needed to obtain, for the first time, a hierarchical algorithm for the Gaussian kernel. 2 Translation Operators and Error Bounds The first operator converts a multipole expansion of a reference node to form a local expansion centered at the centroid of the query node, and is our main approximation workhorse. Lemma 2.1. Hermite-to-local (H2L) translation operator for Gaussian kernel (as presented in Lemma 2.2 in [9, 10]): Given a reference node XR , a query node XQ , and the Hermite expansion centered at a centroid xR of XR : G(xq ) = Aα hα α≥0 xq −xR √ 2h2 , the Taylor expansion of the Hermite expansion at the centroid xQ of the query node XQ is given by G(xq ) = Bβ β≥0 xq −xQ √ 2h2 β where Bβ = (−1)|β| β! Aα hα+β α≥0 xQ −xR √ 2h2 . Proof. (sketch) The proof consists of replacing the Hermite function portion of the expansion with its Taylor series. NR Note that we can rewrite G(xq ) = α≥0 r=1 1 α! xr −xR √ 2h2 α hα xq −xR √ 2h2 by interchanging the summation order, such that the term in the brackets depends only on the reference points, and can thus be computed indepedent of any query location – we will call such terms Hermite moments. The next operator allows the efficient pre-computation of the Hermite moments in the reference tree in a bottom-up fashion from its children. Lemma 2.2. Hermite-to-Hermite (H2H) translation operator for Gaussian kernel: Given the Hermite expansion centered at a centroid xR′ in a reference node XR′ : xq −x G(xq ) = A′ hα √2hR′ , this same Hermite expansion shifted to a new locaα 2 α≥0 tion xR of the parent node of XR is given by G(xq ) = Aγ hγ γ≥0 Aγ = 0≤α≤γ 1 ′ (γ−α)! Aα xR′ −xR √ 2h2 xq −xR √ 2h2 where γ−α . Proof. We simply replace the Hermite function part of the expansion by a new Taylor series, as follows: « x q − x R′ √ 2h2 α≥0 „ « X ′ X 1 „ x R − x R′ « β xq − xR √ √ = Aα (−1)|β| hα+β β! 2h2 2h2 α≥0 β≥0 „ «β « „ X X ′ 1 x R − x R′ xq − xR |β| √ √ (−1) hα+β = Aα β! 2h2 2h2 α≥0 β≥0 „ «β „ « X X ′ 1 x R′ − x R xq − xR √ √ Aα = hα+β β! 2h2 2h2 α≥0 β≥0 3 2 «γ−α „ « „ X X 1 x R′ − x R q ′ 5 hγ x√− xR 4 √ = Aα (γ − α)! 2h2 2h2 γ≥0 0≤α≤γ G(xq ) = where γ = α + β. X A′ hα α „ The next operator acts as a “clean-up” routine in a hierarchical algorithm. Since we can approximate at different scales in the query tree, we must somehow combine all the approximations at the end of the computation. By performing a breadth-first traversal of the query tree, the L2L operator shifts a node’s local expansion to the centroid of each child. Lemma 2.3. Local-to-local (L2L) translation operator for Gaussian kernel: Given a Taylor expansion centered at a centroid xQ′ of a query node XQ′ : G(xq ) = xq −xQ′ √ 2h2 Bβ β≥0 β , the Taylor expansion obtained by shift- ing this expansion to the new centroid xQ of the child node XQ is G(xq ) = α≥0 β≥α β! α!(β−α)! Bβ β−α xQ −xQ′ √ 2h2 xq −xQ √ 2h2 α . Proof. Applying the multinomial theorem to to expand about the new center xQ yields: G(xq ) = X Bβ β≥0 = „ XX β≥0 α≤β xq − xQ′ √ 2h2 Bβ «β β! α!(β − α)! „ xQ − xQ′ √ 2h2 «β−α „ xq − xQ √ 2h2 «α . whose summation order can be interchanged to achieve the result. Because the Hermite and the Taylor expansion are truncated after taking pD terms, we incur an error in approximation. The original error bounds for the Gaussian kernel in [9, 10] were wrong and corrections were shown in [3]. Here, we will present all necessary three error bounds incurred in performing translation operators. We note that these error bounds place limits on the size of the query node and the reference node. 2 Lemma 2.4. Error Bound for Truncating an Hermite Expansion (as presented in [3]): Suppose we are given an Hermite expansion of a reference node XR about its centroid xR : G(xq ) = Aα hα α≥0 xq −xR √ 2h2 NR where Aα = r=1 1 α! xr −xR √ 2h2 α . For any query point xq , the error due to truncating the series after the first pD term is |ǫM (p)| ≤ rp )k rp √ p! NR (1−r)D D−1 k=0 D k (1 − D−k where ∀xr ∈ XR satisfies ||xr − xR ||∞ < rh for r < 1. Proof. (sketch) We expand the Hermite expansion as a product of one-dimensional Hermite functions, and utilize a bound on one-dimensional Hermite functions due to [13]: n −x2 1 2 √ 2 e 2 , n ≥ 0, x ∈ R1 . n! |hn (x)| ≤ n! Lemma 2.5. Error Bound for Truncating a Taylor Expansion Converted from an Hermite Expansion of Infinite Order: Suppose we are given the following Taylor expansion about the centroid xQ of a query node G(xq ) = Bβ β≥0 2 xq −xQ √ 2h2 β where `Strainn[12] proposed the interesting idea of using Stirling’s formula (for any non-negative integer ´ ≤ n!) to lift the node size constraint; one might imagine that this could allow approxin: n+1 e mation of larger regions in a tree-based algorithm. Unfortunately, the error bounds developed in [12] were also incorrect. We have derived the three necessary corrected error bounds based on the techniques in [3]. However, due to space, and because using these bounds actually degraded performance slightly, we do not include those lemmas here. (−1)|β| β! Bβ = Aα hα+β α≥0 xQ −xR √ 2h2 and Aα ’s are the coefficients of the Hermite ex- pansion centered at the reference node centroid xR . Then, truncating the series after pD terms satisfies the error bound |ǫL (p)| ≤ NR (1−r)D D−1 k=0 D k (1 − rp )k rp √ p! D−k where ||xq − xQ ||∞ < rh for r < 1, ∀xq ∈ XQ . Proof. Taylor expansion of the Hermite function yields e −||xq −xr ||2 2h2 Use e „ «„ «β X (−1)|β| X 1 „ xr − xR «α xq − xQ xQ − xR √ √ √ hα+β = β! α! 2h2 2h2 2h2 α≥0 β≥0 «α „ «„ «β „ X (−1)|β| X 1 xR − xr xQ − xR xq − xQ |α| √ √ √ = (−1) hα+β β! α! 2h2 2h2 2h2 β≥0 α≥0 «„ «β „ X (−1)|β| xq − xQ xQ − xr √ √ = hβ β! 2h2 2h2 β≥0 −||xq −xr ||2 2h2 D = i=1 (up (xqi , xri , xQi ) + vp (xqi , xri , xQi )) for 1 ≤ i ≤ D, where «„ «n „ X (−1)ni xqi − xQi i xQi − xri √ √ hni ni ! 2h2 2h2 ni =0 „ «„ «ni ∞ X (−1)ni xqi − xQi xQi − xri √ √ hni vp (xqi , xri , xQi ) = . ni ! 2h2 2h2 ni =p p−1 up (xqi , xri , xQi ) = 1−r p 1−r These univariate functions respectively satisfy up (xqi , xri , xQi ) ≤ 1 rp vp (xqi , xri , xQi ) ≤ √p! 1−r , for 1 ≤ i ≤ D, achieving the multivariate bound. and Lemma 2.6. Error Bound for Truncating a Taylor Expansion Converted from an Already Truncated Hermite Expansion: A truncated Hermite expansion centered about xq −xR the centroid xR of a reference node G(xq ) = Aα hα √2h2 has the following α < rh, and a reference node XR for which ||xr − xR ||∞ < rh for r < 1 , ∀xq ∈ XQ , ∀xr ∈ XR . 2 Proof. We define upi = up (xqi , xri , xQi , xRi ), vpi = vp (xqi , xri , xQi , xRi ), wpi = wp (xqi , xri , xQi , xRi ) for 1 ≤ i ≤ D: upi = „ «„ «ni p−1 X X (−1)ni p−1 1 „ xR − xr «nj xqi − xQi xQi − xRi i i √ √ √ (−1)nj hni +nj ni ! n =0 nj ! 2h2 2h2 2h2 ni =0 j vpi = „ «„ «n ∞ X (−1)ni X 1 „ xR − xr «nj xQi − xRi xqi − xQi i i i √ √ √ (−1)nj hni +nj ni ! n =p nj ! 2h2 2h2 2h2 ni =0 j p−1 wpi = „ «„ «n ∞ ∞ X (−1)ni X 1 „ xR − xr «nj xQi − xRi xqi − xQi i i i √ √ √ (−1)nj hni +nj ni ! n =0 nj ! 2h2 2h2 2h2 ni =p j Note that e −||xq −xr ||2 2h2 D = i=1 (upi + vpi + wpi ) for 1 ≤ i ≤ D. Using the bound for Hermite functions and the property of geometric series, we obtain the following upper bounds: p−1 p−1 upi ≤ X X (2r)ni (2r)nj = ni =0 nj =0 „ 1 − (2r)p ) 1 − 2r «2 „ «„ « p−1 ∞ 1 X X 1 1 − (2r)p (2r)p vpi ≤ √ (2r)ni (2r)nj = √ 1 − 2r 1 − 2r p! n =0 n =p p! i 1 wpi ≤ √ p! j ∞ ∞ X X ni =p nj 1 (2r)ni (2r)nj = √ p! =0 „ 1 1 − 2r «„ (2r)p 1 − 2r « Therefore, ˛ ˛ ! «D−k „ D D−1 ˛ −||xq −xr ||2 ˛ Y X D ((2r)p )(2 − (2r)p ) ˛ ˛ −2D 2 2h √ − upi ˛ ≤ (1 − 2r) ((1 − (2r)p )2 )k ˛e ˛ ˛ k p! i=1 k=0 ˛ ˛ ˛ « ˛ „ „ « D−1 “ ” X D X ˛ xq − xQ β ˛ ((2r)p )(2 − (2r)p ) D−k NR p 2 k ˛≤ ˛G(xq ) − √ ((1 − (2r) ) ) √ Cβ ˛ ˛ 2D (1 − 2r) k p! 2h2 ˛ ˛ k=0 β < 2h, pDH = the smallest p ≥ 1 such that NR (1−r)D D−1 k=0 D k (1 − rp )k rp √ p! D−k < ǫGmin . Q if Q.maxside < 2h, pDL = the smallest p ≥ 1 such that NR (1−r)D D−1 k=0 D k (1 − rp )k rp √ p! D−k < ǫGmin . Q if max(Q.maxside,R.maxside) < h, pH2L = the smallest p ≥ 1 such that NR (1−2r)2D D−1 k=0 D k ((1 − (2r)p )2 )k ((2r)p )(2−(2r)p ) √ p! D−k < ǫGmin . Q cDH = pD NQ . cDL = pD NR . cH2L = DpD+1 . cDirect = DNQ NR . DH DL H2L if no Hermite coefficient of order pDH exists for XR , Compute it. cDH = cDH + pD NR . DH if no Hermite coefficient of order pH2L exists for XR , Compute it. cH2L = cH2L + pD NR . H2L c = min(cDH , cDL , cH2L , cDirect ). if c = cDH < ∞, (Direct Hermite) Evaluate each xq at the Hermite series of order pDH centered about xR of XR using Equation 1. if c = cDL < ∞, (Direct Local) Accumulate each xr ∈ XR as the Taylor series of order pDL about the center xQ of XQ using Equation 2. if c = cH2L < ∞, (Hermite-to-Local) Convert the Hermite series of order pH2L centered about xR of XR to the Taylor series of the same order centered about xQ of XQ using Lemma 2.1. if c = cDirect , Update Gmin and Gmax in Q and all its children. return. if leaf(Q) and leaf(R), Perform the naive algorithm on every pair of points in Q and R. else DFGT(Q.left, R.left). DFGT(Q.left, R.right). DFGT(Q.right, R.left). DFGT(Q.right, R.right). ˛ ˛ ˛b ˛ For the FGT, note that the algorithm only ensures: ˛G(xq ) − Gtrue (xq )˛ ≤ τ . Therefore, we first set τ = ǫ, halving τ until the error tolerance ǫ was met. For the IFGT, which has multiple parameters that must be tweaked simultaneously, an automatic scheme was created, based on the recommendations given in the paper and software documentation: For D = 2, use p = 8; for D = 3, √ use p = 6; set ρx = 2.5; start with K = N and double K until the error tolerance is met. When this failed to meet the tolerance, we resorted to additional trial and error by hand. The costs of parameter selection for these methods in both computer and human time is not included in the table. 4 Algorithm \ scale Naive FGT IFGT DFD DFGT DFGTH Naive FGT IFGT DFD DFGT DFGTH Naive FGT IFGT DFD DFGT DFGTH Naive FGT IFGT DFD DFGT DFGTH Naive FGT IFGT DFD DFGT DFGTH 0.001 0.01 0.1 1 10 100 sj2-50000-2 (astronomy: positions), D = 2, N = 50000, h∗ = 0.00139506 301.696 301.696 301.696 301.696 301.696 301.696 out of RAM out of RAM out of RAM 3.892312 2.01846 0.319538 > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive 0.837724 1.087066 1.658592 6.018158 62.077669 151.590062 0.849935 1.11567 4.599235 72.435177 18.450387 2.777454 0.846294 1.10654 1.683913 6.265131 5.063365 1.036626 ∗ = 0.0016911 colors50k (astronomy: colors), D = 2, N = 50000, h 301.696 301.696 301.696 301.696 301.696 301.696 out of RAM out of RAM out of RAM > 2×Naive > 2×Naive 0.475281 > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive 1.095838 1.469454 2.802112 30.294007 280.633106 81.373053 1.099828 1.983888 29.231309 285.719266 12.886239 5.336602 1.081216 1.47692 2.855083 24.598749 7.142465 1.78648 ∗ edsgc-radec-rnd (astronomy: angles), D = 2, N = 50000, h = 0.00466204 301.696 301.696 301.696 301.696 301.696 301.696 out of RAM out of RAM out of RAM 2.859245 1.768738 0.210799 > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive 0.812462 1.083528 1.682261 5.860172 63.849361 357.099354 0.84023 1.120015 4.346061 73.036687 21.652047 3.424304 0.821672 1.104545 1.737799 6.037217 5.7398 1.883216 ∗ mockgalaxy-D-1M-rnd (cosmology: positions), D = 3, N = 50000, h = 0.000768201 354.868751 354.868751 354.868751 354.868751 354.868751 354.868751 out of RAM out of RAM out of RAM out of RAM > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive 0.70054 0.701547 0.761524 0.843451 1.086608 42.022605 0.73007 0.733638 0.799711 0.999316 50.619588 125.059911 0.724004 0.719951 0.789002 0.877564 1.265064 22.6106 ∗ bio5-rnd (biology: drug activity), D = 5, N = 50000, h = 0.000567161 364.439228 364.439228 364.439228 364.439228 364.439228 364.439228 out of RAM out of RAM out of RAM out of RAM out of RAM out of RAM > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive 2.249868 2.4958865 4.70948 12.065697 94.345003 412.39142 > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive 1000 301.696 0.183616 7.576783 1.551019 2.532401 0.68471 301.696 0.114430 7.55986 3.604753 3.5638 0.627554 301.696 0.059664 7.585585 0.743045 1.977302 0.436596 354.868751 > 2×Naive > 2×Naive 383.12048 109.353701 87.488392 364.439228 out of RAM > 2×Naive 107.675935 > 2×Naive > 2×Naive Discussion. The experiments indicate that the DFGTH method is able to achieve reasonable performance across all bandwidth scales. Unfortunately none of the series approximation-based methods do well on the 5-dimensional data, as expected, highlighting the main weakness of the approach presented. Pursuing corrections to the error bounds necessary to use the intriguing series form of [14] may allow an increase in dimensionality. References [1] A. W. Appel. An Efficient Program for Many-Body Simulations. SIAM Journal on Scientific and Statistical Computing, 6(1):85–103, 1985. [2] J. Barnes and P. Hut. A Hierarchical O(N logN ) Force-Calculation Algorithm. Nature, 324, 1986. [3] B. Baxter and G. Roussos. A new error estimate of the fast gauss transform. SIAM Journal on Scientific Computing, 24(1):257–259, 2002. [4] P. Callahan and S. Kosaraju. A decomposition of multidimensional point sets with applications to k-nearest-neighbors and n-body potential fields. Journal of the ACM, 62(1):67–90, January 1995. [5] A. Gray and A. W. Moore. N-Body Problems in Statistical Learning. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13 (December 2000). MIT Press, 2001. [6] A. G. Gray. Bringing Tractability to Generalized N-Body Problems in Statistical and Scientific Computation. PhD thesis, Carnegie Mellon University, 2003. [7] A. G. Gray and A. W. Moore. Rapid Evaluation of Multiple Density Models. In Artificial Intelligence and Statistics 2003, 2003. [8] L. Greengard and V. Rokhlin. A Fast Algorithm for Particle Simulations. Journal of Computational Physics, 73, 1987. [9] L. Greengard and J. Strain. The fast gauss transform. SIAM Journal on Scientific and Statistical Computing, 12(1):79–94, 1991. [10] L. Greengard and X. Sun. A new version of the fast gauss transform. Documenta Mathematica, Extra Volume ICM(III):575– 584, 1998. [11] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, 1986. [12] J. Strain. The fast gauss transform with variable scales. SIAM Journal on Scientific and Statistical Computing, 12:1131– 1139, 1991. [13] O. Sz´ sz. On the relative extrema of the hermite orthogonal functions. J. Indian Math. Soc., 15:129–134, 1951. a [14] C. Yang, R. Duraiswami, N. A. Gumerov, and L. Davis. Improved fast gauss transform and efficient kernel density estimation. International Conference on Computer Vision, 2003.

2 0.45742199 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity

Author: Afsheen Afshar, Gopal Santhanam, Stephen I. Ryu, Maneesh Sahani, Byron M. Yu, Krishna V. Shenoy

Abstract: Spiking activity from neurophysiological experiments often exhibits dynamics beyond that driven by external stimulation, presumably reflecting the extensive recurrence of neural circuitry. Characterizing these dynamics may reveal important features of neural computation, particularly during internally-driven cognitive operations. For example, the activity of premotor cortex (PMd) neurons during an instructed delay period separating movement-target specification and a movementinitiation cue is believed to be involved in motor planning. We show that the dynamics underlying this activity can be captured by a lowdimensional non-linear dynamical systems model, with underlying recurrent structure and stochastic point-process output. We present and validate latent variable methods that simultaneously estimate the system parameters and the trial-by-trial dynamical trajectories. These methods are applied to characterize the dynamics in PMd data recorded from a chronically-implanted 96-electrode array while monkeys perform delayed-reach tasks. 1

3 0.20319109 198 nips-2005-Using ``epitomes'' to model genetic diversity: Rational design of HIV vaccine cocktails

Author: Nebojsa Jojic, Vladimir Jojic, Christopher Meek, David Heckerman, Brendan J. Frey

Abstract: We introduce a new model of genetic diversity which summarizes a large input dataset into an epitome, a short sequence or a small set of short sequences of probability distributions capturing many overlapping subsequences from the dataset. The epitome as a representation has already been used in modeling real-valued signals, such as images and audio. The discrete sequence model we introduce in this paper targets applications in genetics, from multiple alignment to recombination and mutation inference. In our experiments, we concentrate on modeling the diversity of HIV where the epitome emerges as a natural model for producing relatively small vaccines covering a large number of immune system targets known as epitopes. Our experiments show that the epitome includes more epitopes than other vaccine designs of similar length, including cocktails of consensus strains, phylogenetic tree centers, and observed strains. We also discuss epitome designs that take into account uncertainty about Tcell cross reactivity and epitope presentation. In our experiments, we find that vaccine optimization is fairly robust to these uncertainties. 1

4 0.20102306 102 nips-2005-Kernelized Infomax Clustering

Author: David Barber, Felix V. Agakov

Abstract: We propose a simple information-theoretic approach to soft clustering based on maximizing the mutual information I(x, y) between the unknown cluster labels y and the training patterns x with respect to parameters of specifically constrained encoding distributions. The constraints are chosen such that patterns are likely to be clustered similarly if they lie close to specific unknown vectors in the feature space. The method may be conveniently applied to learning the optimal affinity matrix, which corresponds to learning parameters of the kernelized encoder. The procedure does not require computations of eigenvalues of the Gram matrices, which makes it potentially attractive for clustering large data sets. 1

5 0.19709314 177 nips-2005-Size Regularized Cut for Data Clustering

Author: Yixin Chen, Ya Zhang, Xiang Ji

Abstract: We present a novel spectral clustering method that enables users to incorporate prior knowledge of the size of clusters into the clustering process. The cost function, which is named size regularized cut (SRcut), is defined as the sum of the inter-cluster similarity and a regularization term measuring the relative size of two clusters. Finding a partition of the data set to minimize SRcut is proved to be NP-complete. An approximation algorithm is proposed to solve a relaxed version of the optimization problem as an eigenvalue problem. Evaluations over different data sets demonstrate that the method is not sensitive to outliers and performs better than normalized cut. 1

6 0.19678521 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity

7 0.19640994 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning

8 0.1956055 104 nips-2005-Laplacian Score for Feature Selection

9 0.19543354 55 nips-2005-Describing Visual Scenes using Transformed Dirichlet Processes

10 0.19501808 84 nips-2005-Generalization in Clustering with Unobserved Features

11 0.19497159 71 nips-2005-Fast Krylov Methods for N-Body Learning

12 0.19293821 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning

13 0.19124408 178 nips-2005-Soft Clustering on Graphs

14 0.19074504 112 nips-2005-Learning Minimum Volume Sets

15 0.18936209 13 nips-2005-A Probabilistic Approach for Optimizing Spectral Clustering

16 0.18797456 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators

17 0.18796389 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models

18 0.18724805 98 nips-2005-Infinite latent feature models and the Indian buffet process

19 0.18709984 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction

20 0.18622871 58 nips-2005-Divergences, surrogate loss functions and experimental design