nips nips2008 nips2008-188 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Michael P. Holmes, Jr. Isbell, Charles Lee, Alexander G. Gray
Abstract: The Singular Value Decomposition is a key operation in many machine learning methods. Its computational cost, however, makes it unscalable and impractical for applications involving large datasets or real-time responsiveness, which are becoming increasingly common. We present a new method, QUIC-SVD, for fast approximation of the whole-matrix SVD based on a new sampling mechanism called the cosine tree. Our empirical tests show speedups of several orders of magnitude over exact SVD. Such scalability should enable QUIC-SVD to accelerate and enable a wide array of SVD-based methods and applications. 1
Reference: text
sentIndex sentText sentNum sentScore
1 We present a new method, QUIC-SVD, for fast approximation of the whole-matrix SVD based on a new sampling mechanism called the cosine tree. [sent-8, score-0.422]
2 We present a new method, QUIC-SVD, for fast, sample-based SVD approximation with automatic relative error control. [sent-16, score-0.18]
3 This algorithm is based on a new type of data partitioning tree, the cosine tree, that shows excellent ability to home in on the subspaces needed for good SVD approximation. [sent-17, score-0.329]
4 We demonstrate several-orderof-magnitude speedups on medium-sized datasets, and verify that approximation error is properly controlled. [sent-18, score-0.205]
5 Since the columns of V ∈ Om×n are an orthonormal basis, we sometimes use expressions such as “the subspace V ” to refer to the subspace spanned by the columns of V . [sent-22, score-0.668]
6 Throughout this paper we assume m ≥ n, such that sampling rows gives bigger speedup than sampling columns. [sent-23, score-0.309]
7 1 Algorithm 1 Optimal approximate SVD within a row subspace V . [sent-26, score-0.303]
8 E XTRACT SVD Input: target matrix A ∈ Rm×n , subspace basis V ∈ On×k Output: U , Σ, V , the SVD of the best approximation to A within the subspace spanned by V ’s columns 1. [sent-27, score-0.704]
9 Equivalently, we can write the SVD as a weighted sum of rank-one outer products: A = ρ T i=1 σi ui vi , where ui and vi represent the ith columns of U and V . [sent-36, score-0.169]
10 The columns ui and vi are referred to as the left and right singular vectors, while the weights σi are the singular values. [sent-37, score-0.259]
11 The optimal k-rank approximation, in the sense of minimizing the squared error ||A − A||2 , is the k-rank truncation of the SVD: F k T σi ui vi = Uk Σk Vk . [sent-44, score-0.226]
12 Ak = (2) i=1 Ak is the projection of A’s rows onto the subspace spanned by the top k right singular vectors, i. [sent-45, score-0.553]
13 The optimality of Ak implies that the columns of Vk span the subspace of dimension at most k in which the squared error of A’s row-wise projection is minimized. [sent-48, score-0.557]
14 This leads us to a formulation of SVD approximation in which we seek to find a subspace in which A’s projection has sufficiently low error, then perform the SVD of A in that subspace. [sent-49, score-0.362]
15 If the subspace is substantially lower in rank/dimension than A, the SVD of the projection can be computed significantly faster than the SVD of the original A (quadratically so, as we will have decreased the n in O(mn2 )). [sent-50, score-0.329]
16 An important procedure we will require is the extraction of the best approximate SVD within a given subspace V . [sent-51, score-0.296]
17 Given a target matrix A and a row subspace basis stored in the columns of V , E XTRACT SVD has the following properties: 1. [sent-55, score-0.415]
18 , the extracted SVD reconstructs exactly to the projection of A’s rows onto the subspace spanned by V . [sent-60, score-0.503]
19 The runtime of the procedure is O(kmn), where k is the rank of V . [sent-64, score-0.138]
20 As this SVD extraction will constitute the last and most expensive step of our algorithm, we therefore require a subspace discovery method that finds a subspace of sufficient quality with as low a rank k as possible. [sent-65, score-0.633]
21 QUIC-SVD Previous LRMA Methods Iterative buildup, fast empirical error control One-off computation, loose error bound Adaptive sample size minimization Fixed a priori sample size (loose) Cosine tree sampling Various sampling schemes geometric structure of a matrix to efficiently derive compact (i. [sent-68, score-0.636]
22 A recent vein of work in the theory and algorithms community has focused on using sampling to solve the problem of low-rank matrix approximation (LRMA). [sent-72, score-0.151]
23 Each LRMA algorithm has a way of sampling to build up a subspace in which the matrix projection has bounded error. [sent-77, score-0.447]
24 Three main LRMA sampling techniques have emerged,1 and we will discuss each from the perspective of iteratively sampling a row, updating a subspace so it spans the new row, and continuing until the subspace captures the input matrix to within a desired error threshold. [sent-79, score-0.827]
25 , rank-compactness) is for each sampled row to represent well the rows that are not yet well represented in the subspace. [sent-84, score-0.107]
26 It is essentially an importance sampling scheme for the squared error objective. [sent-88, score-0.246]
27 Both of these lead to wasted samples and needless inflation of the subspace rank. [sent-92, score-0.288]
28 Introduced by Deshpande and Vempala [2], RLS modifies the LS probabilities after each subspace update by setting pi = ||A(i) − ΠV (A(i) )||2 /||A − F ΠV (A)||2 , where ΠV represents projection onto the current subspace V . [sent-94, score-0.629]
29 In particular, our cosine tree sampling method can be viewed as combining the representativeness of RP sampling with the adaptivity of RLS, which explains its empirically dominant rank efficiency. [sent-106, score-0.59]
30 CTN ODE Input: A ∈ Rm×n Output: cosine tree node containing the rows of A 1. [sent-109, score-0.457]
31 return N CTN ODE S PLIT Input: cosine tree node N Output: left and right children obtained by cosine-splitting of N 1. [sent-115, score-0.433]
32 nRows (a) if cmax − ci ≤ ci − cmin , Al ← (b) else Ar ← Al N. [sent-124, score-0.148]
33 In contrast to the error bounds of previous methods, which are stated in terms of the unknown low-rank Ak , our error bound is in terms of the known A. [sent-128, score-0.244]
34 This enables us to use a fast, empirical Monte Carlo technique to determine with high confidence when we have achieved the error target, and therefore to terminate with as few samples and as compact a subspace as possible. [sent-129, score-0.367]
35 Minimizing subspace rank is crucial for speed, as the final SVD extraction is greatly slowed by excess rank when the input matrix is large. [sent-130, score-0.488]
36 We use an iterative subspace buildup as described in the previous section, with sampling governed by a new spatial partitioning structure we call the cosine tree. [sent-131, score-0.65]
37 Cosine trees are designed to leverage the geometrical structure of a matrix and a partial subspace in order to quickly home in on good representative samples from the regions least well represented. [sent-132, score-0.427]
38 Key to the efficiency of our algorithm is an efficient error checking scheme, which we accomplish by Monte Carlo error estimation at judiciously chosen stages. [sent-133, score-0.302]
39 The ideal subspace discovery algorithm would oracularly choose as samples the singular vectors vi . [sent-136, score-0.407]
40 Each vi is precisely the direction that, added to the subspace spanned by the previous singular vectors, will maximally decrease residual error over all rows of the matrix. [sent-137, score-0.65]
41 This intuition is the guiding idea for cosine trees. [sent-138, score-0.274]
42 Starting with a root node, which contains all points (rows), we take its centroid as a representative to include in our subspace span, and randomly sample a point to serve as the pivot for splitting. [sent-140, score-0.386]
43 We sample the pivot from the basic LS distribution, that being the cheapest source of information as to sample importance. [sent-141, score-0.098]
44 The two groups are assigned to two child nodes, which are placed in a 4 Algorithm 3 Monte Carlo estimation of the squared error of a matrix projection onto a subspace. [sent-143, score-0.309]
45 of each sampled row’s projection onto V (a) wgtM agSq[i] = 1 pS (i) ||S(i) V ||2 // pS(i) is prob. [sent-153, score-0.113]
46 return ||A||2 − magSqLB F Algorithm 4 QUIC-SVD: fast whole-matrix approximate SVD with relative error control. [sent-156, score-0.216]
47 return E XTRACT SVD(A, V ) queue prioritized by the residual error of each node. [sent-172, score-0.21]
48 By prioritizing expansion by the residual error of the frontier nodes, sampling is always focused on the areas with maximum potential for error reduction. [sent-176, score-0.365]
49 Since cosine-based splitting guides the nodes toward groupings with higher parallelism, the residual magnitude of each node is increasingly likely to be well captured along the direction of the node centroid. [sent-177, score-0.167]
50 Expanding the subspace in the direction of the highest-priority node centroid is therefore a good guess as to the direction that will maximally reduce residual error. [sent-178, score-0.403]
51 Thus, cosine tree sampling approximates the ideal of oracularly sampling the true singular vectors. [sent-179, score-0.622]
52 Algorithm 4, QUIC-SVD (QUantized Iterative Cosine tree)2 , specifies a way to leverage cosine trees in the construction of an approximate SVD while providing a strong probabilistic error guarantee. [sent-182, score-0.464]
53 The algorithm builds a subspace by expanding a cosine tree as described above, checking residual error after each expansion. [sent-183, score-0.883]
54 Once the residual error is sufficiently low, we return the SVD of the projection into the subspace. [sent-184, score-0.281]
55 Note that exact error checking would require an expensive O(k 2 mn) total cost, where k is the final subspace rank, so we instead use a Monte Carlo error estimate as specified in Algorithm 3. [sent-185, score-0.589]
56 We also employ Algorithm 3 for the error estimates used in node prioritization. [sent-186, score-0.161]
57 With Monte Carlo instead of exact error computations, the total cost for error checking decreases to O(k 2 n log m), a significant practical reduction. [sent-187, score-0.331]
58 2 Quantized alludes to each node being represented by a single point that is added to the subspace basis. [sent-188, score-0.31]
59 5 The other main contributions to runtime are: 1) k cosine tree node splits for a total of O(kmn), 2) O(k) single-vector Gram-Schmidt orthonormalizations at O(km) each for a total of O(k 2 m), and 3) final SVD extraction at O(kmn). [sent-189, score-0.523]
60 Total runtime is therefore O(kmn), with the final projection onto the subspace being the costliest step since the O(kmn) from node splitting is a very loose worst-case bound. [sent-190, score-0.518]
61 F From Lemma 1 we know that E XTRACT SVD returns an SVD that reconstructs to A’s projection onto V (i. [sent-196, score-0.14]
62 Thus, we have only to show that mcSqErr in the terminal iteration is an upper bound on the error ||A − A||2 with probability at least 1 − δ. [sent-199, score-0.135]
63 Note that intermediate F error checks do not affect the success probability, since they only ever tell us to continue expanding the subspace, which is never a failure. [sent-200, score-0.135]
64 Though the QUIC-SVD procedure specified in Algorithm 4 provides a strong error guarantee, in practice its error checking routine is overconservative and is invoked more frequently than necessary. [sent-208, score-0.302]
65 For practical usage, we therefore approximate the strict error checking of Algorithm 4 by making three modifications: 1. [sent-209, score-0.193]
66 At each error check, estimate mcSqErr with several repeated Monte Carlo evaluations (i. [sent-212, score-0.139]
67 In each iteration, use a linear extrapolation from past decreases in error to estimate the number of additional node splits required to achieve the error target. [sent-216, score-0.301]
68 Perform this projected number of splits before checking error again, thus eliminating needless intermediate error checks. [sent-217, score-0.363]
69 Although these modifications forfeit the strict guarantee of Theorem 1, they are principled approximations that more aggressively accelerate the computation while still keeping error well under control (this will be demonstrated empirically). [sent-218, score-0.136]
70 Under such a symmetric distribution, the probability that a single evaluation of mcSqErr will exceed the true error is 0. [sent-220, score-0.109]
71 The probability that, in a series of x evaluations, at least one of them will exceed the true error is approximately 1 − 0. [sent-222, score-0.109]
72 The probability that at least one of our mcSqErr evaluations results in an upper bound on the true error (i. [sent-224, score-0.165]
73 , the probability that our error check is correct) thus goes quickly to 1. [sent-226, score-0.109]
74 Change 3 exploits that fact that the rate at which error decreases is typically monotonically nonincreasing. [sent-232, score-0.109]
75 Thus, extrapolating the rate of error decrease from past error evaluations yields a conservative estimate of the number of splits required to achieve the error target. [sent-233, score-0.388]
76 4 Performance We report the results of two sets of experiments, one comparing the sample efficiency of cosine trees to previous LRMA sampling methods, and the other evaluating the composite speed and error performance of QUIC-SVD. [sent-236, score-0.548]
77 Due to space considerations we give results for only two datasets, and 6 madelon 0. [sent-237, score-0.186]
78 03 relative squared error relative squared error 0. [sent-239, score-0.4]
79 005 0 0 0 10 20 30 subspace rank (a) madelon kernel (2000 40 50 60 0 2000) 50 100 150 200 subspace rank (b) declaration (4656 250 300 350 400 3923) Figure 1: Relative squared error vs. [sent-249, score-1.204]
80 LS is length-squared, RLS is residual length-squared, RP is random projection, and CT is cosine tree. [sent-251, score-0.337]
81 Because the runtime of our algorithm is O(kmn), where k is the final dimension of the projection subspace, it is critical that we use a sampling method that achieves the error target with the minimum possible subspace rank k. [sent-255, score-0.697]
82 We therefore compare our cosine tree sampling method to the previous sampling methods proposed in the LRMA literature. [sent-256, score-0.511]
83 Figure 1 shows results for the various sampling methods on two matrices, one a 2000 2000 Gaussian kernel matrix produced by the Madelon dataset from the NIPS 2003 Workshop on Feature Extraction (madelon kernel), and the other a 4656 3923 scan of the US Declaration of Independence (declaration). [sent-257, score-0.148]
84 Plotted is the relative squared error of the input matrix’s projection onto the subspaces generated by each method at each subspace rank. [sent-258, score-0.599]
85 Also shown is the optimal error produced by the exact SVD at each rank. [sent-259, score-0.138]
86 Both graphs show cosine trees dominating the other methods in terms of rank efficiency. [sent-260, score-0.407]
87 It is particularly interesting how closely the cosine tree error can track that of the exact SVD. [sent-262, score-0.481]
88 This would seem to give some justification to the principle of grouping points according to their degree of mutual parallelism, and validates our use of cosine trees as the sampling mechanism for QUIC-SVD. [sent-263, score-0.412]
89 In the second set of experiments we evaluate the runtime and error performance of QUIC-SVD. [sent-265, score-0.168]
90 Figure 2 shows results for the madelon kernel and declaration matrices. [sent-266, score-0.368]
91 On the top row we show how speedup over exact SVD varies with the target error . [sent-267, score-0.299]
92 Speedups range from 831 at = 0 0025 to over 3,600 at = 0 023 for madelon kernel, and from 118 at = 0 01 to nearly 20,000 at = 0 03 for declaration. [sent-268, score-0.186]
93 On the bottom row we show the actual error of the algorithm in comparison to the target error. [sent-269, score-0.191]
94 While the actual error is most often slightly above the target, it nevertheless hugs the target line quite closely, never exceeding the target by more than 10%. [sent-270, score-0.183]
95 Overall, the several-order-of-magnitude speedups and controlled error shown by QUIC-SVD would seem to make it an attractive option for any algorithm computing costly SVDs. [sent-271, score-0.172]
96 5 Conclusion We have presented a fast approximate SVD algorithm, QUIC-SVD, and demonstrated severalorder-of-magnitude speedups with controlled error on medium-sized datasets. [sent-272, score-0.203]
97 The new version is greatly simplified, and features even greater speed with a deterministic error guarantee. [sent-276, score-0.109]
98 More work is needed to explore the SVD-using methods to which QUIC-SVD can be applied, particularly with an eye to how the introduction of controlled error in the SVD will 7 madelon declaration 4,000 2. [sent-277, score-0.447]
99 035 (b) speedup - declaration madelon declaration 1 1 actual error target error actual error target error 0. [sent-290, score-1.079]
100 035 epsilon (c) relative error - madelon kernel (d) relative error - declaration Figure 2: Speedup and actual relative error vs. [sent-310, score-0.863]
wordName wordTfidf (topN-words)
[('svd', 0.61), ('cosine', 0.274), ('subspace', 0.258), ('av', 0.205), ('lrma', 0.202), ('madelon', 0.186), ('mcsqerr', 0.186), ('declaration', 0.152), ('mcs', 0.135), ('rror', 0.135), ('ls', 0.12), ('error', 0.109), ('ctn', 0.101), ('kmn', 0.101), ('xtract', 0.084), ('sampling', 0.084), ('checking', 0.084), ('rls', 0.084), ('speedup', 0.079), ('rank', 0.079), ('singular', 0.077), ('projection', 0.071), ('tree', 0.069), ('ode', 0.068), ('carlo', 0.068), ('monte', 0.066), ('residual', 0.063), ('speedups', 0.063), ('nc', 0.063), ('rows', 0.062), ('runtime', 0.059), ('epsilon', 0.054), ('trees', 0.054), ('squared', 0.053), ('node', 0.052), ('agsq', 0.051), ('wgtm', 0.051), ('ci', 0.047), ('row', 0.045), ('pivot', 0.044), ('spanned', 0.043), ('onto', 0.042), ('columns', 0.041), ('rp', 0.041), ('rm', 0.039), ('relative', 0.038), ('vi', 0.038), ('return', 0.038), ('extraction', 0.038), ('target', 0.037), ('ak', 0.036), ('loose', 0.036), ('ar', 0.036), ('kannan', 0.036), ('buildup', 0.034), ('deshpande', 0.034), ('errc', 0.034), ('friedland', 0.034), ('frieze', 0.034), ('isbell', 0.034), ('magsqlb', 0.034), ('mgs', 0.034), ('nroot', 0.034), ('oracularly', 0.034), ('parallelism', 0.034), ('plit', 0.034), ('sqerr', 0.034), ('unscalable', 0.034), ('matrix', 0.034), ('approximation', 0.033), ('fast', 0.031), ('splits', 0.031), ('evaluations', 0.03), ('kernel', 0.03), ('needless', 0.03), ('achlioptas', 0.03), ('centroid', 0.03), ('vempala', 0.03), ('al', 0.029), ('exact', 0.029), ('distinctions', 0.029), ('impractical', 0.029), ('vk', 0.028), ('subspaces', 0.028), ('sample', 0.027), ('leverage', 0.027), ('orthonormal', 0.027), ('reconstructs', 0.027), ('accelerate', 0.027), ('cmax', 0.027), ('om', 0.027), ('home', 0.027), ('cmin', 0.027), ('quantized', 0.027), ('representative', 0.027), ('bound', 0.026), ('ui', 0.026), ('expanding', 0.026), ('drineas', 0.025), ('span', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 188 nips-2008-QUIC-SVD: Fast SVD Using Cosine Trees
Author: Michael P. Holmes, Jr. Isbell, Charles Lee, Alexander G. Gray
Abstract: The Singular Value Decomposition is a key operation in many machine learning methods. Its computational cost, however, makes it unscalable and impractical for applications involving large datasets or real-time responsiveness, which are becoming increasingly common. We present a new method, QUIC-SVD, for fast approximation of the whole-matrix SVD based on a new sampling mechanism called the cosine tree. Our empirical tests show speedups of several orders of magnitude over exact SVD. Such scalability should enable QUIC-SVD to accelerate and enable a wide array of SVD-based methods and applications. 1
2 0.10749519 83 nips-2008-Fast High-dimensional Kernel Summations Using the Monte Carlo Multipole Method
Author: Dongryeol Lee, Alexander G. Gray
Abstract: We propose a new fast Gaussian summation algorithm for high-dimensional datasets with high accuracy. First, we extend the original fast multipole-type methods to use approximation schemes with both hard and probabilistic error. Second, we utilize a new data structure called subspace tree which maps each data point in the node to its lower dimensional mapping as determined by any linear dimension reduction method such as PCA. This new data structure is suitable for reducing the cost of each pairwise distance computation, the most dominant cost in many kernel methods. Our algorithm guarantees probabilistic relative error on each kernel sum, and can be applied to high-dimensional Gaussian summations which are ubiquitous inside many kernel methods as the key computational bottleneck. We provide empirical speedup results on low to high-dimensional datasets up to 89 dimensions. 1 Fast Gaussian Kernel Summation In this paper, we propose new computational techniques for efficiently approximating the following sum for each query point qi ∈ Q: Φ(qi , R) = e−||qi −rj || 2 /(2h2 ) (1) rj ∈R where R is the reference set; each reference point is associated with a Gaussian function with a smoothing parameter h (the ’bandwidth’). This form of summation is ubiquitous in many statistical learning methods, including kernel density estimation, kernel regression, Gaussian process regression, radial basis function networks, spectral clustering, support vector machines, and kernel PCA [1, 4]. Cross-validation in all of these methods require evaluating Equation 1 for multiple values of h. Kernel density estimation, for example, requires |R| density estimate based on |R| − 1 points, yielding a brute-force computational cost scaling quadratically (that is O(|R| 2 )). Error bounds. Due to its expensive computational cost, many algorithms approximate the Gaussian kernel sums at the expense of reduced precision. Therefore, it is natural to discuss error bound criteria which measure the quality of the approximations with respect to their corresponding true values. The following error bound criteria are common in literature: Definition 1.1. An algorithm guarantees absolute error bound, if for each exact value Φ(q i , R) for qi ∈ Q, it computes Φ(qi , R) such that Φ(qi , R) − Φ(qi , R) ≤ . Definition 1.2. An algorithm guarantees relative error bound, if for each exact value Φ(q i , R) for qi ∈ Q, it computes Φ(qi , R) ∈ R such that Φ(qi , R) − Φ(qi , R) ≤ |Φ(qi , R)|. 1 Bounding the relative error (e.g., the percentage deviation) is much harder because the error bound criterion is in terms of the initially unknown exact quantity. As a result, many previous methods [7] have focused on bounding the absolute error. The relative error bound criterion is preferred to the absolute error bound criterion in statistical applications in which high accuracy is desired. Our new algorithm will enforce the following “relaxed” form of the relative error bound criterion, whose motivation will be discussed shortly. Definition 1.3. An algorithm guarantees (1 − α) probabilistic relative error bound, if for each exact value Φ(qi , R) for qi ∈ Q, it computes Φ(qi , R) ∈ R, such that with at least probability 0 < 1 − α < 1, Φ(qi , R) − Φ(qi , R) ≤ |Φ(qi , R)|. Previous work. The most successful class of acceleration methods employ “higher-order divide and conquer” or generalized N -body algorithms (GNA) [4]. This approach can use any spatial partioning tree such as kd-trees or ball-trees for both the query set Q and reference data R and performs a simulataneous recursive descent on both trees. GNA with relative error bounds (Definition 1.2) [5, 6, 11, 10] utilized bounding boxes and additional cached-sufficient statistics such as higher-order moments needed for series-expansion. [5, 6] utilized bounding-box based error bounds which tend to be very loose, which resulted in slow empirical performance around suboptimally small and large bandwidths. [11, 10] extended GNA-based Gaussian summations with series-expansion which provided tighter bounds; it showed enormous performance improvements, but only up to low dimensional settings (up to D = 5) since the number of required terms in series expansion increases exponentially with respect to D. [9] introduces an iterative sampling based GNA for accelerating the computation of nested sums (a related easier problem). Its speedup is achieved by replacing pessimistic error bounds provided by bounding boxes with normal-based confidence interval from Monte Carlo sampling. [9] demonstrates the speedup many orders of magnitude faster than the previous state of the art in the context of computing aggregates over the queries (such as the LSCV score for selecting the optimal bandwidth). However, the authors did not discuss the sampling-based approach for computations that require per-query estimates, such as those required for kernel density estimation. None of the previous approaches for kernel summations addresses the issue of reducing the computational cost of each distance computation which incurs O(D) cost. However, the intrinsic dimensionality d of most high-dimensional datasets is much smaller than the explicit dimension D (that is, d << D). [12] proposed tree structures using a global dimension reduction method, such as random projection, as a preprocessing step for efficient (1 + ) approximate nearest neighbor search. Similarly, we develop a new data structure for kernel summations; our new data structure is constructed in a top-down fashion to perform the initial spatial partitioning in the original input space R D and performs a local dimension reduction to a localized subset of the data in a bottom-up fashion. This paper. We propose a new fast Gaussian summation algorithm that enables speedup in higher dimensions. Our approach utilizes: 1) probabilistic relative error bounds (Definition 1.3) on kernel sums provided by Monte Carlo estimates 2) a new tree structure called subspace tree for reducing the computational cost of each distance computation. The former can be seen as relaxing the strict requirement of guaranteeing hard relative bound on very small quantities, as done in [5, 6, 11, 10]. The latter was mentioned as a possible way of ameliorating the effects of the curse of dimensionality in [14], a pioneering paper in this area. Notations. Each query point and reference point (a D-dimensional vector) is indexed by natural numbers i, j ∈ N, and denoted qi and rj respectively. For any set S, |S| denotes the number of elements in S. The entities related to the left and the right child are denoted with superscripts L and R; an internal node N has the child nodes N L and N R . 2 Gaussian Summation by Monte Carlo Sampling Here we describe the extension needed for probabilistic computation of kernel summation satisfying Definition 1.3. The main routine for the probabilistic kernel summation is shown in Algorithm 1. The function MCMM takes the query node Q and the reference node R (each initially called with the roots of the query tree and the reference tree, Qroot and Rroot ) and β (initially called with α value which controls the probability guarantee that each kernel sum is within relative error). 2 Algorithm 1 The core dual-tree routine for probabilistic Gaussian kernel summation. MCMM(Q, R, β) if C AN S UMMARIZE E XACT(Q, R, ) then S UMMARIZE E XACT(Q, R) else if C AN S UMMARIZE MC(Q, R, , β) then 5: S UMMARIZE MC(Q, R, , β) else if Q is a leaf node then if R is a leaf node then MCMMBASE(Q, R) 10: else MCMM Q, RL , β , MCMM Q, RR , β 2 2 else if R is a leaf node then MCMM(QL , R, β), MCMM(QR , R, β) 15: else MCMM QL , RL , β , MCMM QL , RR , β 2 2 MCMM QR , RL , β , MCMM QR , RR , β 2 2 The idea of Monte Carlo sampling used in the new algorithm is similar to the one in [9], except the sampling is done per query and we use approximations that provide hard error bounds as well (i.e. finite difference, exhaustive base case: MCMMBASE). This means that the approximation has less variance than a pure Monte Carlo approach used in [9]. Algorithm 1 first attempts approximations with hard error bounds, which are computationally cheaper than sampling-based approximations. For example, finite-difference scheme [5, 6] can be used for the C AN S UMMARIZE E XACT and S UMMARIZE E XACT functions in any general dimension. The C AN S UMMARIZE MC function takes two parameters that specify the accuracy: the relative error and its probability guarantee and decides whether to use Monte Carlo sampling for the given pair of nodes. If the reference node R contains too few points, it may be more efficient to process it using exact methods that use error bounds based on bounding primitives on the node pair or exhaustive pair-wise evaluations, which is determined by the condition: τ · minitial ≤ |R| where τ > 1 controls the minimum number of reference points needed for Monte Carlo sampling to proceed. If the reference node does contain enough points, then for each query point q ∈ Q, the S AMPLE routine samples minitial terms over the terms in the summation Φ(q, R) = Kh (||q − rjn ||) rjn ∈R where Φ(q, R) denotes the exact contribution of R to q’s kernel sum. Basically, we are interested in estimating Φ(q, R) by Φ(q, R) = |R|µS , where µS is the sample mean of S. From the Central 2 Limit Theorem, given enough m samples, µS N (µ, σS /m) where Φ(q, R) = |R|µ (i.e. µ is the average of the kernel value between q and any reference point r ∈ R); this implies that √ |µS − µ| ≤ zβ/2 σS / m with probability 1 − β. The pruning rule we have to enforce for each query point for the contribution of R is: σS Φ(q, R) zβ/2 √ ≤ |R| m where σS the sample standard deviation of S. Since Φ(q, R) is one of the unknown quanities we want to compute, we instead enforce the following: σS zβ/2 √ ≤ m Φl (q, R) + |R| µS − |R| zβ/2 σS √ m (2) where Φl (q, R) is the currently running lower bound on the sum computed using exact methods z σ √ and |R| µS − β/2 S is the probabilistic component contributed by R. Denoting Φl,new (q, R) = m zβ/2 σ Φl (q, R) + |R| µS − √ S , the minimum number of samples for q needed to achieve the |S| 3 target error the right side of the inequality in Equation 2 with at least probability of 1 − β is: 2 2 m ≥ zβ/2 σS (|R| + |R|)2 2 (Φl (q, R) + |R|µ )2 S If the given query node and reference node pair cannot be pruned using either nonprobabilistic/probabilistic approximations, then we recurse on a smaller subsets of two sets. In particular, when dividing over the reference node R, we recurse with half of the β value 1 . We now state the probablistic error guarantee of our algorithm as a theorem. Theorem 2.1. After calling MCMM with Q = Qroot , R = Rroot , and β = α, Algorithm 1 approximates each Φ(q, R) with Φ(q, R) such that Definition 1.3 holds. Proof. For a query/reference (Q, R) pair and 0 < β < 1, MCMMBASE and S UMMARIZE E XACT compute estimates for q ∈ Q such that Φ(q, R) − Φ(q, R) < Φ(q,R)|R| with probability at |R| least 1 > 1 − β. By Equation 2, S UMMARIZE MC computes estimates for q ∈ Q such that Φ(q, R) − Φ(q, R) < Φ(q,R)|R| with probability 1 − β. |R| We now induct on |Q ∪ R|. Line 11 of Algorithm 1 divides over the reference whose subcalls compute estimates that satisfy Φ(q, RL ) − Φ(q, RL ) ≤ Φ(q,R)|RR | |R| L each with at least 1 − β 2 Φ(q,R)|RL | |R| and Φ(q, RR ) − Φ(q, RR ) ≤ probability by induction hypothesis. For q ∈ Q, Φ(q, R) = Φ(q, R )+ Φ(q, R ) which means |Φ(q, R)−Φ(q, R)| ≤ Φ(q,R)|R| with probability at least 1−β. |R| Line 14 divides over the query and each subcall computes estimates that hold with at least probability 1 − β for q ∈ QL and q ∈ QR . Line 16 and 17 divides both over the query and the reference, and the correctness can be proven similarly. Therefore, M CM M (Qroot , Rroot , α) computes estimates satisfying Definition 1.3. R “Reclaiming” probability. We note that the assigned probability β for the query/reference pair computed with exact bounds (S UMMARIZE E XACT and MCMMBASE) is not used. This portion of the probability can be “reclaimed” in a similar fashion as done in [10] and re-used to prune more aggressively in the later stages of the algorithm. All experiments presented in this paper were benefited by this simple modification. 3 Subspace Tree A subspace tree is basically a space-partitioning tree with a set of orthogonal bases associated with each node N : N.Ω = (µ, U, Λ, d) where µ is the mean, U is a D×d matrix whose columns consist of d eigenvectors, and Λ the corresponding eigenvalues. The orthogonal basis set is constructed using a linear dimension reduction method such as PCA. It is constructed in the top-down manner using the PARTITION S ET function dividing the given set of points into two (where the PARTITION S ET function divides along the dimension with the highest variance in case of a kd-tree for example), with the subspace in each node formed in the bottom-up manner. Algorithm 3 shows a PCA tree (a subspace tree using PCA as a dimension reduction) for a 3-D dataset. The subspace of each leaf node is computed using P CA BASE which can use the exact PCA [3] or a stochastic one [2]. For an internal node, the subspaces of the child nodes, N L .Ω = (µL , U L , ΛL , dL ) and N R .Ω = (µR , U R , ΛR , dR ), are approximately merged using the M ERGE S UBSPACES function which involves solving an (d L + dR + 1) × (dL + dR + 1) eigenvalue problem [8], which runs in O((dL + dR + 1)3 ) << O(D 3 ) given that the dataset is sparse. In addition, each data point x in each node N is mapped to its new lower-dimensional coordinate using the orthogonal basis set of N : xproj = U T (x − µ). The L2 norm reconstruction error is given by: ||xrecon − x||2 = ||(U xproj + µ) − x||2 . 2 2 Monte Carlo sampling using a subspace tree. Consider C AN S UMMARIZE MC function in Algorithm 2. The “outer-loop” over this algorithm is over the query set Q, and it would make sense to project each query point q ∈ Q to the subspace owned by the reference node R. Let U and µ be the orthogonal basis system for R consisting of d basis. For each q ∈ Q, consider the squared distance 1 We could also divide β such that the node that may be harder to approximate gets a lower value. 4 Algorithm 2 Monte Carlo sampling based approximation routines. C AN S UMMARIZE MC(Q, R, , α) S AMPLE(q, R, , α, S, m) return τ · minitial ≤ |R| for k = 1 to m do r ← random point in R S UMMARIZE MC(Q, R, , α) S ← S ∪ {Kh (||q − r||)} 2 for qi ∈ Q do µS ← M EAN(S), σS ← VARIANCE(S) S ← ∅, m ← minitial zα/2 σS Φl,new (q, R) ← Φl (q, R) + |R| µS − √ repeat |S| 2 S AMPLE(qi , R, , α, S, m) 2 2 mthresh ← zα/2 σS 2 (Φ(|R|+ |R|) S )2 l (q,R)+|R|µ until m ≤ 0 m ← mthresh − |S| Φ(qi , R) ← Φ(qi , R) + |R| · M EAN(S) ||(q − µ) − rproj ||2 (where (q − µ) is q’s coordinates expressed in terms of the coordinate system of R) as shown in Figure 1. For the Gaussian kernel, each pairwise kernel value is approximated as: e−||q−r|| 2 /(2h2 ) ≈ e−||q−qrecon || 2 (3) /(2h2 ) −||qproj −rproj ||2 /(2h2 ) e where qrecon = U qproj +µ and qproj = U (q−µ). For a fixed query point q, e can be precomputed (which takes d dot products between two D-dimensional vectors) and re-used for every distance computation between q and any reference point r ∈ R whose cost is now O(d) << O(D). Therefore, we can take more samples efficiently. For a total of sufficiently large m samples, the computational cost is O(d(D + m)) << O(D · m) for each query point. −||q−qrecon ||2 /(2h2 ) T Increased variance comes at the cost of inexact distance computations, however. Each distance computation incurs at most squared L2 norm of ||rrecon − r||2 error. That is, 2 ||q − rrecon ||2 − ||q − r||2 ≤ ||rrecon − r||2 . Neverhteless, the sample variance for each query 2 2 2 point plus the inexactness due to dimension reduction τS can be shown to be bounded for the Gaus2 2 sian kernel as: (where each s = e−||q−rrecon || /(2h ) ): 1 m−1 ≤ 1 m−1 s∈S s2 − m · µ 2 S s2 + τS 2 min 1, max e||rrecon −r||2 /h s∈S r∈R 2 2 − m µS min e−||rrecon −r||2 /(2h 2 2 ) r∈R Exhaustive computations using a subspace tree. Now suppose we have built subspace trees for the query and the reference sets. We can project either each query point onto the reference subspace, or each reference point onto the query subspace, depending on which subspace has a smaller dimension and the number of points in each node. The subspaces formed in the leaf nodes usually are highly numerically accurate since it contains very few points compared to the extrinsic dimensionality D. 4 Experimental Results We empirically evaluated the runtime performance of our algorithm on seven real-world datasets, scaled to fit in [0, 1]D hypercube, for approximating the Gaussian sum at every query point with a range of bandwidths. This experiment is motivated by many kernel methods that require computing the Gaussian sum at different bandwidth values (according to the standard least-sqares crossvalidation scores [15]). Nevertheless, we emphasize that the acceleration results are applicable to other kernel methods that require efficient Gaussian summation. In this paper, the reference set equals the query set. All datasets have 50K points so that the exact exhaustive method can be tractably computed. All times are in seconds and include the time needed to build the trees. Codes are in C/C++ and run on a dual Intel Xeon 3GHz with 8 Gb of main memory. The measurements in second to eigth columns are obtained by running the algorithms at the bandwidth kh∗ where 10−3 ≤ k ≤ 103 is the constant in the corresponding column header. The last columns denote the total time needed to run on all seven bandwidth values. Each table has results for five algorithms: the naive algorithm and four algorithms. The algorithms with p = 1 denote the previous state-of-the-art (finite-difference with error redistribution) [10], 5 Algorithm 3 PCA tree building routine. B UILD P CAT REE(P) if C AN PARTITION(P) then {P L , P R } ← PARTITION S ET(P) N ← empty node N L ← B UILD P CAT REE(P L ) N R ← B UILD P CAT REE(P R ) N.S ← M ERGE S UBSPACES(N L .S, N R .S) else N ← B UILD P CAT REE BASE(P) N.S ← P CA BASE(P) N.Pproj ← P ROJECT(P, N.S) return N while those with p < 1 denote our probabilistic version. Each entry has the running time and the percentage of the query points that did not satisfy the relative error . Analysis. Readers should focus on the last columns containing the total time needed for evaluating Gaussian sum at all points for seven different bandwidth values. This is indicated by boldfaced numbers for our probabilistic algorithm. As expected, On low-dimensional datasets (below 6 dimensions), the algorithm using series-expansion based bounds gives two to three times speedup compared to our approach that uses Monte Carlo sampling. Multipole moments are an effective form of compression in low dimensions with analytical error bounds that can be evaluated; our Monte Carlo-based method has an asymptotic error bound which must be “learned” through sampling. As we go from 7 dimensions and beyond, series-expansion cannot be done efficiently because of its slow convergence. Our probabilistic algorithm (p = 0.9) using Monte Carlo consistently performs better than the algorithm using exact bounds (p = 1) by at least a factor of two. Compared to naive, it achieves the maximum speedup of about nine times on an 16-dimensional dataset; on an 89-dimensional dataset, it is at least three times as fast as the naive. Note that all the datasets contain only 50K points, and the speedup will be more dramatic as we increase the number of points. 5 Conclusion We presented an extension to fast multipole methods to use approximation methods with both hard and probabilistic bounds. Our experimental results show speedup over the previous state-of-the-art on high-dimensional datasets. Our future work will include possible improvements inspired by a recent work done in the FMM community using a matrix-factorization formulation [13]. Figure 1: Left: A PCA-tree for a 3-D dataset. Right: The squared Euclidean distance between a given query point and a reference point projected onto a subspace can be decomposed into two components: the orthogonal component and the component in the subspace. 6 Algorithm \ scale 0.001 0.01 0.1 1 10 100 1000 Σ mockgalaxy-D-1M-rnd (cosmology: positions), D = 3, N = 50000, h ∗ = 0.000768201 Naive 182 182 182 182 182 182 182 1274 MCMM 3 3 5 10 26 48 2 97 ( = 0.1, p = 0.9) 1% 1% 1% 1% 1% 1% 5% DFGT 2 2 2 2 6 19 3 36 ( = 0.1, p = 1) 0% 0% 0% 0% 0% 0% 0% MCMM 3 3 4 11 27 58 21 127 ( = 0.01, p = 0.9) 0 % 0% 1% 1% 1% 1% 7% DFGT 2 2 2 2 7 30 5 50 ( = 0.01, p = 1) 0% 0% 0% 0% 0% 0% 0% Algorithm \ scale 0.001 0.01 0.1 1 10 100 1000 Σ bio5-rnd (biology: drug activity), D = 5, N = 50000, h∗ = 0.000567161 Naive 214 214 214 214 214 214 214 1498 MCMM 4 4 6 144 149 65 1 373 ( = 0.1, p = 0.9) 0% 0% 0% 0% 1% 0% 1% DFGT 4 4 5 24 96 65 2 200 ( = 0.1, p = 1) 0% 0% 0% 0% 0% 0% 0% MCMM 4 4 6 148 165 126 1 454 ( = 0.01, p = 0.9) 0 % 0% 0% 0% 1% 0% 1% DFGT 4 4 5 25 139 126 4 307 ( = 0.01, p = 1) 0% 0% 0% 0% 0% 0% 0% Algorithm \ scale 0.001 0.01 0.1 1 10 100 1000 Σ pall7 − rnd , D = 7, N = 50000, h∗ = 0.00131865 Naive 327 327 327 327 327 327 327 2289 MCMM 3 3 3 3 63 224 <1 300 ( = 0.1, p = 0.9) 0% 0% 0% 1% 1% 12 % 0% DFGT 10 10 11 14 84 263 223 615 ( = 0.1, p = 1) 0% 0% 0% 0% 0% 0% 0% MCMM 3 3 3 3 70 265 5 352 ( = 0.01, p = 0.9) 0 % 0% 0% 1% 2% 1% 8% DFGT 10 10 11 14 85 299 374 803 ( = 0.01, p = 1) 0% 0% 0% 0% 0% 0% 0% Algorithm \ scale 0.001 0.01 0.1 1 10 100 1000 Σ covtype − rnd , D = 10, N = 50000, h∗ = 0.0154758 Naive 380 380 380 380 380 380 380 2660 MCMM 11 11 13 39 318 <1 <1 381 ( = 0.1, p = 0.9) 0% 0% 0% 1% 0% 0% 0% DFGT 26 27 38 177 390 244 <1 903 ( = 0.1, p = 1) 0% 0% 0% 0% 0% 0% 0% MCMM 11 11 13 77 362 2 <1 477 ( = 0.01, p = 0.9) 0 % 0% 0% 1% 1% 10 % 0% DFGT 26 27 38 180 427 416 <1 1115 ( = 0.01, p = 1) 0% 0% 0% 0% 0% 0% 0% Algorithm \ scale 0.001 0.01 0.1 1 10 100 1000 Σ CoocTexture − rnd , D = 16, N = 50000, h∗ = 0.0263958 Naive 472 472 472 472 472 472 472 3304 MCMM 10 11 22 189 109 <1 <1 343 ( = 0.1, p = 0.9) 0% 0% 0% 1% 8% 0% 0% DFGT 22 26 82 240 452 66 <1 889 ( = 0.1, p = 1) 0% 0% 0% 0% 0% 0% 0% MCMM 10 11 22 204 285 <1 <1 534 ( = 0.01, p = 0.9) 0 % 0% 1% 1% 10 % 4% 0% DFGT 22 26 83 254 543 230 <1 1159 ( = 0.01, p = 1) 0% 0% 0% 0% 0% 0% 0% 7 Algorithm \ scale 0.001 0.01 0.1 1 10 100 1000 LayoutHistogram − rnd , D = 32, N = 50000, h∗ = 0.0609892 Naive 757 757 757 757 757 757 757 MCMM 32 32 54 168 583 8 8 ( = 0.1, p = 0.9) 0% 0% 1% 1% 1% 0% 0% DFGT 153 159 221 492 849 212 <1 ( = 0.1, p = 1) 0% 0% 0% 0% 0% 0% 0% MCMM 32 45 60 183 858 8 8 ( = 0.01, p = 0.9) 0 % 0% 1% 6% 1% 0% 0% DFGT 153 159 222 503 888 659 <1 ( = 0.01, p = 1) 0% 0% 0% 0% 0% 0% 0% Algorithm \ scale 0.001 0.01 0.1 1 10 100 1000 CorelCombined − rnd , D = 89, N = 50000, h∗ = 0.0512583 Naive 1716 1716 1716 1716 1716 1716 1716 MCMM 384 418 575 428 1679 17 17 ( = 0.1, p = 0.9) 0% 0% 0% 1% 10 % 0% 0% DFGT 659 677 864 1397 1772 836 17 ( = 0.1, p = 1) 0% 0% 0% 0% 0% 0% 0% MCMM 401 419 575 437 1905 17 17 ( = 0.01, p = 0.9) 0 % 0% 0% 1% 2% 0% 0% DFGT 659 677 865 1425 1794 1649 17 ( = 0.01, p = 1) 0% 0% 0% 0% 0% 0% 0% Σ 5299 885 2087 1246 2585 Σ 12012 3518 6205 3771 7086 References [1] Nando de Freitas, Yang Wang, Maryam Mahdaviani, and Dustin Lang. Fast krylov methods for n-body learning. In Y. Weiss, B. Sch¨ lkopf, and J. Platt, editors, Advances in Neural Information Processing o Systems 18, pages 251–258. MIT Press, Cambridge, MA, 2006. [2] P. Drineas, R. Kannan, and M. Mahoney. Fast monte carlo algorithms for matrices iii: Computing a compressed approximate matrix decomposition, 2004. [3] G. Golub. Matrix Computations, Third Edition. The Johns Hopkins University Press, 1996. [4] A. Gray and A. W. Moore. N-Body Problems in Statistical Learning. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, editors, Advances in Neural Information Processing Systems 13 (December 2000). MIT Press, 2001. [5] Alexander G. Gray and Andrew W. Moore. Nonparametric Density Estimation: Toward Computational Tractability. In SIAM International Conference on Data Mining 2003, 2003. [6] Alexander G. Gray and Andrew W. Moore. Very Fast Multivariate Kernel Density Estimation via Computational Geometry. In Joint Statistical Meeting 2003, 2003. to be submitted to JASA. [7] L. Greengard and J. Strain. The Fast Gauss Transform. SIAM Journal of Scientific and Statistical Computing, 12(1):79–94, 1991. [8] Peter Hall, David Marshall, and Ralph Martin. Merging and splitting eigenspace models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(9):1042–1049, 2000. [9] Michael Holmes, Alexander Gray, and Charles Isbell. Ultrafast monte carlo for statistical summations. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 673–680. MIT Press, Cambridge, MA, 2008. [10] Dongryeol Lee and Alexander Gray. Faster gaussian summation: Theory and experiment. In Proceedings of the Twenty-second Conference on Uncertainty in Artificial Intelligence. 2006. [11] Dongryeol Lee, Alexander Gray, and Andrew Moore. Dual-tree fast gauss transforms. In Y. Weiss, B. Sch¨ lkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 747– o 754. MIT Press, Cambridge, MA, 2006. [12] Ting Liu, Andrew W. Moore, and Alexander Gray. Efficient exact k-nn and nonparametric classification ¨ in high dimensions. In Sebastian Thrun, Lawrence Saul, and Bernhard Sch olkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004. [13] P. G. Martinsson and Vladimir Rokhlin. An accelerated kernel-independent fast multipole method in one dimension. SIAM J. Scientific Computing, 29(3):1160–1178, 2007. [14] A. W. Moore, J. Schneider, and K. Deng. Efficient locally weighted polynomial regression predictions. In D. Fisher, editor, Proceedings of the Fourteenth International Conference on Machine Learning, pages 196–204, San Francisco, 1997. Morgan Kaufmann. [15] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall/CRC, 1986. 8
3 0.081202999 60 nips-2008-Designing neurophysiology experiments to optimally constrain receptive field models along parametric submanifolds
Author: Jeremy Lewi, Robert Butera, David M. Schneider, Sarah Woolley, Liam Paninski
Abstract: Sequential optimal design methods hold great promise for improving the efficiency of neurophysiology experiments. However, previous methods for optimal experimental design have incorporated only weak prior information about the underlying neural system (e.g., the sparseness or smoothness of the receptive field). Here we describe how to use stronger prior information, in the form of parametric models of the receptive field, in order to construct optimal stimuli and further improve the efficiency of our experiments. For example, if we believe that the receptive field is well-approximated by a Gabor function, then our method constructs stimuli that optimally constrain the Gabor parameters (orientation, spatial frequency, etc.) using as few experimental trials as possible. More generally, we may believe a priori that the receptive field lies near a known sub-manifold of the full parameter space; in this case, our method chooses stimuli in order to reduce the uncertainty along the tangent space of this sub-manifold as rapidly as possible. Applications to simulated and real data indicate that these methods may in many cases improve the experimental efficiency. 1
4 0.078754514 51 nips-2008-Convergence and Rate of Convergence of a Manifold-Based Dimension Reduction Algorithm
Author: Andrew Smith, Hongyuan Zha, Xiao-ming Wu
Abstract: We study the convergence and the rate of convergence of a local manifold learning algorithm: LTSA [13]. The main technical tool is the perturbation analysis on the linear invariant subspace that corresponds to the solution of LTSA. We derive a worst-case upper bound of errors for LTSA which naturally leads to a convergence result. We then derive the rate of convergence for LTSA in a special case. 1
5 0.068315409 80 nips-2008-Extended Grassmann Kernels for Subspace-Based Learning
Author: Jihun Hamm, Daniel D. Lee
Abstract: Subspace-based learning problems involve data whose elements are linear subspaces of a vector space. To handle such data structures, Grassmann kernels have been proposed and used previously. In this paper, we analyze the relationship between Grassmann kernels and probabilistic similarity measures. Firstly, we show that the KL distance in the limit yields the Projection kernel on the Grassmann manifold, whereas the Bhattacharyya kernel becomes trivial in the limit and is suboptimal for subspace-based problems. Secondly, based on our analysis of the KL distance, we propose extensions of the Projection kernel which can be extended to the set of affine as well as scaled subspaces. We demonstrate the advantages of these extended kernels for classification and recognition tasks with Support Vector Machines and Kernel Discriminant Analysis using synthetic and real image databases. 1
6 0.0616186 238 nips-2008-Theory of matching pursuit
7 0.057598889 29 nips-2008-Automatic online tuning for fast Gaussian summation
8 0.046873447 70 nips-2008-Efficient Inference in Phylogenetic InDel Trees
9 0.046485607 200 nips-2008-Robust Kernel Principal Component Analysis
10 0.044853792 122 nips-2008-Learning with Consistency between Inductive Functions and Kernels
11 0.044369362 157 nips-2008-Nonrigid Structure from Motion in Trajectory Space
12 0.043770868 84 nips-2008-Fast Prediction on a Tree
13 0.043444786 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning
14 0.043224189 12 nips-2008-Accelerating Bayesian Inference over Nonlinear Differential Equations with Gaussian Processes
15 0.043162186 4 nips-2008-A Scalable Hierarchical Distributed Language Model
16 0.042255111 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions
17 0.041288853 129 nips-2008-MAS: a multiplicative approximation scheme for probabilistic inference
18 0.040676136 193 nips-2008-Regularized Co-Clustering with Dual Supervision
19 0.040440373 77 nips-2008-Evaluating probabilities under high-dimensional latent variable models
20 0.038984142 78 nips-2008-Exact Convex Confidence-Weighted Learning
topicId topicWeight
[(0, -0.135), (1, -0.003), (2, -0.019), (3, 0.049), (4, 0.043), (5, -0.038), (6, 0.05), (7, 0.023), (8, 0.022), (9, -0.011), (10, 0.042), (11, -0.046), (12, -0.009), (13, 0.061), (14, 0.013), (15, 0.028), (16, 0.057), (17, 0.023), (18, 0.068), (19, -0.043), (20, 0.01), (21, 0.035), (22, -0.012), (23, -0.079), (24, -0.074), (25, 0.055), (26, -0.003), (27, -0.044), (28, 0.035), (29, 0.173), (30, 0.045), (31, 0.054), (32, 0.085), (33, -0.005), (34, 0.005), (35, -0.032), (36, 0.077), (37, 0.021), (38, 0.096), (39, 0.013), (40, -0.012), (41, -0.105), (42, -0.023), (43, 0.077), (44, -0.029), (45, 0.068), (46, 0.097), (47, -0.015), (48, 0.034), (49, -0.109)]
simIndex simValue paperId paperTitle
same-paper 1 0.94173664 188 nips-2008-QUIC-SVD: Fast SVD Using Cosine Trees
Author: Michael P. Holmes, Jr. Isbell, Charles Lee, Alexander G. Gray
Abstract: The Singular Value Decomposition is a key operation in many machine learning methods. Its computational cost, however, makes it unscalable and impractical for applications involving large datasets or real-time responsiveness, which are becoming increasingly common. We present a new method, QUIC-SVD, for fast approximation of the whole-matrix SVD based on a new sampling mechanism called the cosine tree. Our empirical tests show speedups of several orders of magnitude over exact SVD. Such scalability should enable QUIC-SVD to accelerate and enable a wide array of SVD-based methods and applications. 1
2 0.70408672 83 nips-2008-Fast High-dimensional Kernel Summations Using the Monte Carlo Multipole Method
Author: Dongryeol Lee, Alexander G. Gray
Abstract: We propose a new fast Gaussian summation algorithm for high-dimensional datasets with high accuracy. First, we extend the original fast multipole-type methods to use approximation schemes with both hard and probabilistic error. Second, we utilize a new data structure called subspace tree which maps each data point in the node to its lower dimensional mapping as determined by any linear dimension reduction method such as PCA. This new data structure is suitable for reducing the cost of each pairwise distance computation, the most dominant cost in many kernel methods. Our algorithm guarantees probabilistic relative error on each kernel sum, and can be applied to high-dimensional Gaussian summations which are ubiquitous inside many kernel methods as the key computational bottleneck. We provide empirical speedup results on low to high-dimensional datasets up to 89 dimensions. 1 Fast Gaussian Kernel Summation In this paper, we propose new computational techniques for efficiently approximating the following sum for each query point qi ∈ Q: Φ(qi , R) = e−||qi −rj || 2 /(2h2 ) (1) rj ∈R where R is the reference set; each reference point is associated with a Gaussian function with a smoothing parameter h (the ’bandwidth’). This form of summation is ubiquitous in many statistical learning methods, including kernel density estimation, kernel regression, Gaussian process regression, radial basis function networks, spectral clustering, support vector machines, and kernel PCA [1, 4]. Cross-validation in all of these methods require evaluating Equation 1 for multiple values of h. Kernel density estimation, for example, requires |R| density estimate based on |R| − 1 points, yielding a brute-force computational cost scaling quadratically (that is O(|R| 2 )). Error bounds. Due to its expensive computational cost, many algorithms approximate the Gaussian kernel sums at the expense of reduced precision. Therefore, it is natural to discuss error bound criteria which measure the quality of the approximations with respect to their corresponding true values. The following error bound criteria are common in literature: Definition 1.1. An algorithm guarantees absolute error bound, if for each exact value Φ(q i , R) for qi ∈ Q, it computes Φ(qi , R) such that Φ(qi , R) − Φ(qi , R) ≤ . Definition 1.2. An algorithm guarantees relative error bound, if for each exact value Φ(q i , R) for qi ∈ Q, it computes Φ(qi , R) ∈ R such that Φ(qi , R) − Φ(qi , R) ≤ |Φ(qi , R)|. 1 Bounding the relative error (e.g., the percentage deviation) is much harder because the error bound criterion is in terms of the initially unknown exact quantity. As a result, many previous methods [7] have focused on bounding the absolute error. The relative error bound criterion is preferred to the absolute error bound criterion in statistical applications in which high accuracy is desired. Our new algorithm will enforce the following “relaxed” form of the relative error bound criterion, whose motivation will be discussed shortly. Definition 1.3. An algorithm guarantees (1 − α) probabilistic relative error bound, if for each exact value Φ(qi , R) for qi ∈ Q, it computes Φ(qi , R) ∈ R, such that with at least probability 0 < 1 − α < 1, Φ(qi , R) − Φ(qi , R) ≤ |Φ(qi , R)|. Previous work. The most successful class of acceleration methods employ “higher-order divide and conquer” or generalized N -body algorithms (GNA) [4]. This approach can use any spatial partioning tree such as kd-trees or ball-trees for both the query set Q and reference data R and performs a simulataneous recursive descent on both trees. GNA with relative error bounds (Definition 1.2) [5, 6, 11, 10] utilized bounding boxes and additional cached-sufficient statistics such as higher-order moments needed for series-expansion. [5, 6] utilized bounding-box based error bounds which tend to be very loose, which resulted in slow empirical performance around suboptimally small and large bandwidths. [11, 10] extended GNA-based Gaussian summations with series-expansion which provided tighter bounds; it showed enormous performance improvements, but only up to low dimensional settings (up to D = 5) since the number of required terms in series expansion increases exponentially with respect to D. [9] introduces an iterative sampling based GNA for accelerating the computation of nested sums (a related easier problem). Its speedup is achieved by replacing pessimistic error bounds provided by bounding boxes with normal-based confidence interval from Monte Carlo sampling. [9] demonstrates the speedup many orders of magnitude faster than the previous state of the art in the context of computing aggregates over the queries (such as the LSCV score for selecting the optimal bandwidth). However, the authors did not discuss the sampling-based approach for computations that require per-query estimates, such as those required for kernel density estimation. None of the previous approaches for kernel summations addresses the issue of reducing the computational cost of each distance computation which incurs O(D) cost. However, the intrinsic dimensionality d of most high-dimensional datasets is much smaller than the explicit dimension D (that is, d << D). [12] proposed tree structures using a global dimension reduction method, such as random projection, as a preprocessing step for efficient (1 + ) approximate nearest neighbor search. Similarly, we develop a new data structure for kernel summations; our new data structure is constructed in a top-down fashion to perform the initial spatial partitioning in the original input space R D and performs a local dimension reduction to a localized subset of the data in a bottom-up fashion. This paper. We propose a new fast Gaussian summation algorithm that enables speedup in higher dimensions. Our approach utilizes: 1) probabilistic relative error bounds (Definition 1.3) on kernel sums provided by Monte Carlo estimates 2) a new tree structure called subspace tree for reducing the computational cost of each distance computation. The former can be seen as relaxing the strict requirement of guaranteeing hard relative bound on very small quantities, as done in [5, 6, 11, 10]. The latter was mentioned as a possible way of ameliorating the effects of the curse of dimensionality in [14], a pioneering paper in this area. Notations. Each query point and reference point (a D-dimensional vector) is indexed by natural numbers i, j ∈ N, and denoted qi and rj respectively. For any set S, |S| denotes the number of elements in S. The entities related to the left and the right child are denoted with superscripts L and R; an internal node N has the child nodes N L and N R . 2 Gaussian Summation by Monte Carlo Sampling Here we describe the extension needed for probabilistic computation of kernel summation satisfying Definition 1.3. The main routine for the probabilistic kernel summation is shown in Algorithm 1. The function MCMM takes the query node Q and the reference node R (each initially called with the roots of the query tree and the reference tree, Qroot and Rroot ) and β (initially called with α value which controls the probability guarantee that each kernel sum is within relative error). 2 Algorithm 1 The core dual-tree routine for probabilistic Gaussian kernel summation. MCMM(Q, R, β) if C AN S UMMARIZE E XACT(Q, R, ) then S UMMARIZE E XACT(Q, R) else if C AN S UMMARIZE MC(Q, R, , β) then 5: S UMMARIZE MC(Q, R, , β) else if Q is a leaf node then if R is a leaf node then MCMMBASE(Q, R) 10: else MCMM Q, RL , β , MCMM Q, RR , β 2 2 else if R is a leaf node then MCMM(QL , R, β), MCMM(QR , R, β) 15: else MCMM QL , RL , β , MCMM QL , RR , β 2 2 MCMM QR , RL , β , MCMM QR , RR , β 2 2 The idea of Monte Carlo sampling used in the new algorithm is similar to the one in [9], except the sampling is done per query and we use approximations that provide hard error bounds as well (i.e. finite difference, exhaustive base case: MCMMBASE). This means that the approximation has less variance than a pure Monte Carlo approach used in [9]. Algorithm 1 first attempts approximations with hard error bounds, which are computationally cheaper than sampling-based approximations. For example, finite-difference scheme [5, 6] can be used for the C AN S UMMARIZE E XACT and S UMMARIZE E XACT functions in any general dimension. The C AN S UMMARIZE MC function takes two parameters that specify the accuracy: the relative error and its probability guarantee and decides whether to use Monte Carlo sampling for the given pair of nodes. If the reference node R contains too few points, it may be more efficient to process it using exact methods that use error bounds based on bounding primitives on the node pair or exhaustive pair-wise evaluations, which is determined by the condition: τ · minitial ≤ |R| where τ > 1 controls the minimum number of reference points needed for Monte Carlo sampling to proceed. If the reference node does contain enough points, then for each query point q ∈ Q, the S AMPLE routine samples minitial terms over the terms in the summation Φ(q, R) = Kh (||q − rjn ||) rjn ∈R where Φ(q, R) denotes the exact contribution of R to q’s kernel sum. Basically, we are interested in estimating Φ(q, R) by Φ(q, R) = |R|µS , where µS is the sample mean of S. From the Central 2 Limit Theorem, given enough m samples, µS N (µ, σS /m) where Φ(q, R) = |R|µ (i.e. µ is the average of the kernel value between q and any reference point r ∈ R); this implies that √ |µS − µ| ≤ zβ/2 σS / m with probability 1 − β. The pruning rule we have to enforce for each query point for the contribution of R is: σS Φ(q, R) zβ/2 √ ≤ |R| m where σS the sample standard deviation of S. Since Φ(q, R) is one of the unknown quanities we want to compute, we instead enforce the following: σS zβ/2 √ ≤ m Φl (q, R) + |R| µS − |R| zβ/2 σS √ m (2) where Φl (q, R) is the currently running lower bound on the sum computed using exact methods z σ √ and |R| µS − β/2 S is the probabilistic component contributed by R. Denoting Φl,new (q, R) = m zβ/2 σ Φl (q, R) + |R| µS − √ S , the minimum number of samples for q needed to achieve the |S| 3 target error the right side of the inequality in Equation 2 with at least probability of 1 − β is: 2 2 m ≥ zβ/2 σS (|R| + |R|)2 2 (Φl (q, R) + |R|µ )2 S If the given query node and reference node pair cannot be pruned using either nonprobabilistic/probabilistic approximations, then we recurse on a smaller subsets of two sets. In particular, when dividing over the reference node R, we recurse with half of the β value 1 . We now state the probablistic error guarantee of our algorithm as a theorem. Theorem 2.1. After calling MCMM with Q = Qroot , R = Rroot , and β = α, Algorithm 1 approximates each Φ(q, R) with Φ(q, R) such that Definition 1.3 holds. Proof. For a query/reference (Q, R) pair and 0 < β < 1, MCMMBASE and S UMMARIZE E XACT compute estimates for q ∈ Q such that Φ(q, R) − Φ(q, R) < Φ(q,R)|R| with probability at |R| least 1 > 1 − β. By Equation 2, S UMMARIZE MC computes estimates for q ∈ Q such that Φ(q, R) − Φ(q, R) < Φ(q,R)|R| with probability 1 − β. |R| We now induct on |Q ∪ R|. Line 11 of Algorithm 1 divides over the reference whose subcalls compute estimates that satisfy Φ(q, RL ) − Φ(q, RL ) ≤ Φ(q,R)|RR | |R| L each with at least 1 − β 2 Φ(q,R)|RL | |R| and Φ(q, RR ) − Φ(q, RR ) ≤ probability by induction hypothesis. For q ∈ Q, Φ(q, R) = Φ(q, R )+ Φ(q, R ) which means |Φ(q, R)−Φ(q, R)| ≤ Φ(q,R)|R| with probability at least 1−β. |R| Line 14 divides over the query and each subcall computes estimates that hold with at least probability 1 − β for q ∈ QL and q ∈ QR . Line 16 and 17 divides both over the query and the reference, and the correctness can be proven similarly. Therefore, M CM M (Qroot , Rroot , α) computes estimates satisfying Definition 1.3. R “Reclaiming” probability. We note that the assigned probability β for the query/reference pair computed with exact bounds (S UMMARIZE E XACT and MCMMBASE) is not used. This portion of the probability can be “reclaimed” in a similar fashion as done in [10] and re-used to prune more aggressively in the later stages of the algorithm. All experiments presented in this paper were benefited by this simple modification. 3 Subspace Tree A subspace tree is basically a space-partitioning tree with a set of orthogonal bases associated with each node N : N.Ω = (µ, U, Λ, d) where µ is the mean, U is a D×d matrix whose columns consist of d eigenvectors, and Λ the corresponding eigenvalues. The orthogonal basis set is constructed using a linear dimension reduction method such as PCA. It is constructed in the top-down manner using the PARTITION S ET function dividing the given set of points into two (where the PARTITION S ET function divides along the dimension with the highest variance in case of a kd-tree for example), with the subspace in each node formed in the bottom-up manner. Algorithm 3 shows a PCA tree (a subspace tree using PCA as a dimension reduction) for a 3-D dataset. The subspace of each leaf node is computed using P CA BASE which can use the exact PCA [3] or a stochastic one [2]. For an internal node, the subspaces of the child nodes, N L .Ω = (µL , U L , ΛL , dL ) and N R .Ω = (µR , U R , ΛR , dR ), are approximately merged using the M ERGE S UBSPACES function which involves solving an (d L + dR + 1) × (dL + dR + 1) eigenvalue problem [8], which runs in O((dL + dR + 1)3 ) << O(D 3 ) given that the dataset is sparse. In addition, each data point x in each node N is mapped to its new lower-dimensional coordinate using the orthogonal basis set of N : xproj = U T (x − µ). The L2 norm reconstruction error is given by: ||xrecon − x||2 = ||(U xproj + µ) − x||2 . 2 2 Monte Carlo sampling using a subspace tree. Consider C AN S UMMARIZE MC function in Algorithm 2. The “outer-loop” over this algorithm is over the query set Q, and it would make sense to project each query point q ∈ Q to the subspace owned by the reference node R. Let U and µ be the orthogonal basis system for R consisting of d basis. For each q ∈ Q, consider the squared distance 1 We could also divide β such that the node that may be harder to approximate gets a lower value. 4 Algorithm 2 Monte Carlo sampling based approximation routines. C AN S UMMARIZE MC(Q, R, , α) S AMPLE(q, R, , α, S, m) return τ · minitial ≤ |R| for k = 1 to m do r ← random point in R S UMMARIZE MC(Q, R, , α) S ← S ∪ {Kh (||q − r||)} 2 for qi ∈ Q do µS ← M EAN(S), σS ← VARIANCE(S) S ← ∅, m ← minitial zα/2 σS Φl,new (q, R) ← Φl (q, R) + |R| µS − √ repeat |S| 2 S AMPLE(qi , R, , α, S, m) 2 2 mthresh ← zα/2 σS 2 (Φ(|R|+ |R|) S )2 l (q,R)+|R|µ until m ≤ 0 m ← mthresh − |S| Φ(qi , R) ← Φ(qi , R) + |R| · M EAN(S) ||(q − µ) − rproj ||2 (where (q − µ) is q’s coordinates expressed in terms of the coordinate system of R) as shown in Figure 1. For the Gaussian kernel, each pairwise kernel value is approximated as: e−||q−r|| 2 /(2h2 ) ≈ e−||q−qrecon || 2 (3) /(2h2 ) −||qproj −rproj ||2 /(2h2 ) e where qrecon = U qproj +µ and qproj = U (q−µ). For a fixed query point q, e can be precomputed (which takes d dot products between two D-dimensional vectors) and re-used for every distance computation between q and any reference point r ∈ R whose cost is now O(d) << O(D). Therefore, we can take more samples efficiently. For a total of sufficiently large m samples, the computational cost is O(d(D + m)) << O(D · m) for each query point. −||q−qrecon ||2 /(2h2 ) T Increased variance comes at the cost of inexact distance computations, however. Each distance computation incurs at most squared L2 norm of ||rrecon − r||2 error. That is, 2 ||q − rrecon ||2 − ||q − r||2 ≤ ||rrecon − r||2 . Neverhteless, the sample variance for each query 2 2 2 point plus the inexactness due to dimension reduction τS can be shown to be bounded for the Gaus2 2 sian kernel as: (where each s = e−||q−rrecon || /(2h ) ): 1 m−1 ≤ 1 m−1 s∈S s2 − m · µ 2 S s2 + τS 2 min 1, max e||rrecon −r||2 /h s∈S r∈R 2 2 − m µS min e−||rrecon −r||2 /(2h 2 2 ) r∈R Exhaustive computations using a subspace tree. Now suppose we have built subspace trees for the query and the reference sets. We can project either each query point onto the reference subspace, or each reference point onto the query subspace, depending on which subspace has a smaller dimension and the number of points in each node. The subspaces formed in the leaf nodes usually are highly numerically accurate since it contains very few points compared to the extrinsic dimensionality D. 4 Experimental Results We empirically evaluated the runtime performance of our algorithm on seven real-world datasets, scaled to fit in [0, 1]D hypercube, for approximating the Gaussian sum at every query point with a range of bandwidths. This experiment is motivated by many kernel methods that require computing the Gaussian sum at different bandwidth values (according to the standard least-sqares crossvalidation scores [15]). Nevertheless, we emphasize that the acceleration results are applicable to other kernel methods that require efficient Gaussian summation. In this paper, the reference set equals the query set. All datasets have 50K points so that the exact exhaustive method can be tractably computed. All times are in seconds and include the time needed to build the trees. Codes are in C/C++ and run on a dual Intel Xeon 3GHz with 8 Gb of main memory. The measurements in second to eigth columns are obtained by running the algorithms at the bandwidth kh∗ where 10−3 ≤ k ≤ 103 is the constant in the corresponding column header. The last columns denote the total time needed to run on all seven bandwidth values. Each table has results for five algorithms: the naive algorithm and four algorithms. The algorithms with p = 1 denote the previous state-of-the-art (finite-difference with error redistribution) [10], 5 Algorithm 3 PCA tree building routine. B UILD P CAT REE(P) if C AN PARTITION(P) then {P L , P R } ← PARTITION S ET(P) N ← empty node N L ← B UILD P CAT REE(P L ) N R ← B UILD P CAT REE(P R ) N.S ← M ERGE S UBSPACES(N L .S, N R .S) else N ← B UILD P CAT REE BASE(P) N.S ← P CA BASE(P) N.Pproj ← P ROJECT(P, N.S) return N while those with p < 1 denote our probabilistic version. Each entry has the running time and the percentage of the query points that did not satisfy the relative error . Analysis. Readers should focus on the last columns containing the total time needed for evaluating Gaussian sum at all points for seven different bandwidth values. This is indicated by boldfaced numbers for our probabilistic algorithm. As expected, On low-dimensional datasets (below 6 dimensions), the algorithm using series-expansion based bounds gives two to three times speedup compared to our approach that uses Monte Carlo sampling. Multipole moments are an effective form of compression in low dimensions with analytical error bounds that can be evaluated; our Monte Carlo-based method has an asymptotic error bound which must be “learned” through sampling. As we go from 7 dimensions and beyond, series-expansion cannot be done efficiently because of its slow convergence. Our probabilistic algorithm (p = 0.9) using Monte Carlo consistently performs better than the algorithm using exact bounds (p = 1) by at least a factor of two. Compared to naive, it achieves the maximum speedup of about nine times on an 16-dimensional dataset; on an 89-dimensional dataset, it is at least three times as fast as the naive. Note that all the datasets contain only 50K points, and the speedup will be more dramatic as we increase the number of points. 5 Conclusion We presented an extension to fast multipole methods to use approximation methods with both hard and probabilistic bounds. Our experimental results show speedup over the previous state-of-the-art on high-dimensional datasets. Our future work will include possible improvements inspired by a recent work done in the FMM community using a matrix-factorization formulation [13]. Figure 1: Left: A PCA-tree for a 3-D dataset. Right: The squared Euclidean distance between a given query point and a reference point projected onto a subspace can be decomposed into two components: the orthogonal component and the component in the subspace. 6 Algorithm \ scale 0.001 0.01 0.1 1 10 100 1000 Σ mockgalaxy-D-1M-rnd (cosmology: positions), D = 3, N = 50000, h ∗ = 0.000768201 Naive 182 182 182 182 182 182 182 1274 MCMM 3 3 5 10 26 48 2 97 ( = 0.1, p = 0.9) 1% 1% 1% 1% 1% 1% 5% DFGT 2 2 2 2 6 19 3 36 ( = 0.1, p = 1) 0% 0% 0% 0% 0% 0% 0% MCMM 3 3 4 11 27 58 21 127 ( = 0.01, p = 0.9) 0 % 0% 1% 1% 1% 1% 7% DFGT 2 2 2 2 7 30 5 50 ( = 0.01, p = 1) 0% 0% 0% 0% 0% 0% 0% Algorithm \ scale 0.001 0.01 0.1 1 10 100 1000 Σ bio5-rnd (biology: drug activity), D = 5, N = 50000, h∗ = 0.000567161 Naive 214 214 214 214 214 214 214 1498 MCMM 4 4 6 144 149 65 1 373 ( = 0.1, p = 0.9) 0% 0% 0% 0% 1% 0% 1% DFGT 4 4 5 24 96 65 2 200 ( = 0.1, p = 1) 0% 0% 0% 0% 0% 0% 0% MCMM 4 4 6 148 165 126 1 454 ( = 0.01, p = 0.9) 0 % 0% 0% 0% 1% 0% 1% DFGT 4 4 5 25 139 126 4 307 ( = 0.01, p = 1) 0% 0% 0% 0% 0% 0% 0% Algorithm \ scale 0.001 0.01 0.1 1 10 100 1000 Σ pall7 − rnd , D = 7, N = 50000, h∗ = 0.00131865 Naive 327 327 327 327 327 327 327 2289 MCMM 3 3 3 3 63 224 <1 300 ( = 0.1, p = 0.9) 0% 0% 0% 1% 1% 12 % 0% DFGT 10 10 11 14 84 263 223 615 ( = 0.1, p = 1) 0% 0% 0% 0% 0% 0% 0% MCMM 3 3 3 3 70 265 5 352 ( = 0.01, p = 0.9) 0 % 0% 0% 1% 2% 1% 8% DFGT 10 10 11 14 85 299 374 803 ( = 0.01, p = 1) 0% 0% 0% 0% 0% 0% 0% Algorithm \ scale 0.001 0.01 0.1 1 10 100 1000 Σ covtype − rnd , D = 10, N = 50000, h∗ = 0.0154758 Naive 380 380 380 380 380 380 380 2660 MCMM 11 11 13 39 318 <1 <1 381 ( = 0.1, p = 0.9) 0% 0% 0% 1% 0% 0% 0% DFGT 26 27 38 177 390 244 <1 903 ( = 0.1, p = 1) 0% 0% 0% 0% 0% 0% 0% MCMM 11 11 13 77 362 2 <1 477 ( = 0.01, p = 0.9) 0 % 0% 0% 1% 1% 10 % 0% DFGT 26 27 38 180 427 416 <1 1115 ( = 0.01, p = 1) 0% 0% 0% 0% 0% 0% 0% Algorithm \ scale 0.001 0.01 0.1 1 10 100 1000 Σ CoocTexture − rnd , D = 16, N = 50000, h∗ = 0.0263958 Naive 472 472 472 472 472 472 472 3304 MCMM 10 11 22 189 109 <1 <1 343 ( = 0.1, p = 0.9) 0% 0% 0% 1% 8% 0% 0% DFGT 22 26 82 240 452 66 <1 889 ( = 0.1, p = 1) 0% 0% 0% 0% 0% 0% 0% MCMM 10 11 22 204 285 <1 <1 534 ( = 0.01, p = 0.9) 0 % 0% 1% 1% 10 % 4% 0% DFGT 22 26 83 254 543 230 <1 1159 ( = 0.01, p = 1) 0% 0% 0% 0% 0% 0% 0% 7 Algorithm \ scale 0.001 0.01 0.1 1 10 100 1000 LayoutHistogram − rnd , D = 32, N = 50000, h∗ = 0.0609892 Naive 757 757 757 757 757 757 757 MCMM 32 32 54 168 583 8 8 ( = 0.1, p = 0.9) 0% 0% 1% 1% 1% 0% 0% DFGT 153 159 221 492 849 212 <1 ( = 0.1, p = 1) 0% 0% 0% 0% 0% 0% 0% MCMM 32 45 60 183 858 8 8 ( = 0.01, p = 0.9) 0 % 0% 1% 6% 1% 0% 0% DFGT 153 159 222 503 888 659 <1 ( = 0.01, p = 1) 0% 0% 0% 0% 0% 0% 0% Algorithm \ scale 0.001 0.01 0.1 1 10 100 1000 CorelCombined − rnd , D = 89, N = 50000, h∗ = 0.0512583 Naive 1716 1716 1716 1716 1716 1716 1716 MCMM 384 418 575 428 1679 17 17 ( = 0.1, p = 0.9) 0% 0% 0% 1% 10 % 0% 0% DFGT 659 677 864 1397 1772 836 17 ( = 0.1, p = 1) 0% 0% 0% 0% 0% 0% 0% MCMM 401 419 575 437 1905 17 17 ( = 0.01, p = 0.9) 0 % 0% 0% 1% 2% 0% 0% DFGT 659 677 865 1425 1794 1649 17 ( = 0.01, p = 1) 0% 0% 0% 0% 0% 0% 0% Σ 5299 885 2087 1246 2585 Σ 12012 3518 6205 3771 7086 References [1] Nando de Freitas, Yang Wang, Maryam Mahdaviani, and Dustin Lang. Fast krylov methods for n-body learning. In Y. Weiss, B. Sch¨ lkopf, and J. Platt, editors, Advances in Neural Information Processing o Systems 18, pages 251–258. MIT Press, Cambridge, MA, 2006. [2] P. Drineas, R. Kannan, and M. Mahoney. Fast monte carlo algorithms for matrices iii: Computing a compressed approximate matrix decomposition, 2004. [3] G. Golub. Matrix Computations, Third Edition. The Johns Hopkins University Press, 1996. [4] A. Gray and A. W. Moore. N-Body Problems in Statistical Learning. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, editors, Advances in Neural Information Processing Systems 13 (December 2000). MIT Press, 2001. [5] Alexander G. Gray and Andrew W. Moore. Nonparametric Density Estimation: Toward Computational Tractability. In SIAM International Conference on Data Mining 2003, 2003. [6] Alexander G. Gray and Andrew W. Moore. Very Fast Multivariate Kernel Density Estimation via Computational Geometry. In Joint Statistical Meeting 2003, 2003. to be submitted to JASA. [7] L. Greengard and J. Strain. The Fast Gauss Transform. SIAM Journal of Scientific and Statistical Computing, 12(1):79–94, 1991. [8] Peter Hall, David Marshall, and Ralph Martin. Merging and splitting eigenspace models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(9):1042–1049, 2000. [9] Michael Holmes, Alexander Gray, and Charles Isbell. Ultrafast monte carlo for statistical summations. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 673–680. MIT Press, Cambridge, MA, 2008. [10] Dongryeol Lee and Alexander Gray. Faster gaussian summation: Theory and experiment. In Proceedings of the Twenty-second Conference on Uncertainty in Artificial Intelligence. 2006. [11] Dongryeol Lee, Alexander Gray, and Andrew Moore. Dual-tree fast gauss transforms. In Y. Weiss, B. Sch¨ lkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 747– o 754. MIT Press, Cambridge, MA, 2006. [12] Ting Liu, Andrew W. Moore, and Alexander Gray. Efficient exact k-nn and nonparametric classification ¨ in high dimensions. In Sebastian Thrun, Lawrence Saul, and Bernhard Sch olkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004. [13] P. G. Martinsson and Vladimir Rokhlin. An accelerated kernel-independent fast multipole method in one dimension. SIAM J. Scientific Computing, 29(3):1160–1178, 2007. [14] A. W. Moore, J. Schneider, and K. Deng. Efficient locally weighted polynomial regression predictions. In D. Fisher, editor, Proceedings of the Fourteenth International Conference on Machine Learning, pages 196–204, San Francisco, 1997. Morgan Kaufmann. [15] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall/CRC, 1986. 8
3 0.61982512 51 nips-2008-Convergence and Rate of Convergence of a Manifold-Based Dimension Reduction Algorithm
Author: Andrew Smith, Hongyuan Zha, Xiao-ming Wu
Abstract: We study the convergence and the rate of convergence of a local manifold learning algorithm: LTSA [13]. The main technical tool is the perturbation analysis on the linear invariant subspace that corresponds to the solution of LTSA. We derive a worst-case upper bound of errors for LTSA which naturally leads to a convergence result. We then derive the rate of convergence for LTSA in a special case. 1
Author: Jeremy Lewi, Robert Butera, David M. Schneider, Sarah Woolley, Liam Paninski
Abstract: Sequential optimal design methods hold great promise for improving the efficiency of neurophysiology experiments. However, previous methods for optimal experimental design have incorporated only weak prior information about the underlying neural system (e.g., the sparseness or smoothness of the receptive field). Here we describe how to use stronger prior information, in the form of parametric models of the receptive field, in order to construct optimal stimuli and further improve the efficiency of our experiments. For example, if we believe that the receptive field is well-approximated by a Gabor function, then our method constructs stimuli that optimally constrain the Gabor parameters (orientation, spatial frequency, etc.) using as few experimental trials as possible. More generally, we may believe a priori that the receptive field lies near a known sub-manifold of the full parameter space; in this case, our method chooses stimuli in order to reduce the uncertainty along the tangent space of this sub-manifold as rapidly as possible. Applications to simulated and real data indicate that these methods may in many cases improve the experimental efficiency. 1
5 0.55295497 151 nips-2008-Non-parametric Regression Between Manifolds
Author: Florian Steinke, Matthias Hein
Abstract: This paper discusses non-parametric regression between Riemannian manifolds. This learning problem arises frequently in many application areas ranging from signal processing, computer vision, over robotics to computer graphics. We present a new algorithmic scheme for the solution of this general learning problem based on regularized empirical risk minimization. The regularization functional takes into account the geometry of input and output manifold, and we show that it implements a prior which is particularly natural. Moreover, we demonstrate that our algorithm performs well in a difficult surface registration problem. 1
6 0.54207218 29 nips-2008-Automatic online tuning for fast Gaussian summation
7 0.46971062 126 nips-2008-Localized Sliced Inverse Regression
8 0.44592524 70 nips-2008-Efficient Inference in Phylogenetic InDel Trees
9 0.42924798 80 nips-2008-Extended Grassmann Kernels for Subspace-Based Learning
10 0.41457736 30 nips-2008-Bayesian Experimental Design of Magnetic Resonance Imaging Sequences
11 0.40676296 82 nips-2008-Fast Computation of Posterior Mode in Multi-Level Hierarchical Models
12 0.40227798 3 nips-2008-A Massively Parallel Digital Learning Processor
13 0.39803421 238 nips-2008-Theory of matching pursuit
14 0.39709243 61 nips-2008-Diffeomorphic Dimensionality Reduction
15 0.39583167 68 nips-2008-Efficient Direct Density Ratio Estimation for Non-stationarity Adaptation and Outlier Detection
16 0.39458561 18 nips-2008-An Efficient Sequential Monte Carlo Algorithm for Coalescent Clustering
17 0.3937493 122 nips-2008-Learning with Consistency between Inductive Functions and Kernels
18 0.39030346 221 nips-2008-Stochastic Relational Models for Large-scale Dyadic Data using MCMC
19 0.38793188 165 nips-2008-On the Reliability of Clustering Stability in the Large Sample Regime
20 0.37922734 44 nips-2008-Characteristic Kernels on Groups and Semigroups
topicId topicWeight
[(4, 0.011), (5, 0.011), (6, 0.049), (7, 0.105), (12, 0.046), (15, 0.01), (28, 0.148), (57, 0.059), (59, 0.019), (63, 0.034), (71, 0.023), (77, 0.045), (78, 0.013), (83, 0.033), (84, 0.298)]
simIndex simValue paperId paperTitle
same-paper 1 0.79648393 188 nips-2008-QUIC-SVD: Fast SVD Using Cosine Trees
Author: Michael P. Holmes, Jr. Isbell, Charles Lee, Alexander G. Gray
Abstract: The Singular Value Decomposition is a key operation in many machine learning methods. Its computational cost, however, makes it unscalable and impractical for applications involving large datasets or real-time responsiveness, which are becoming increasingly common. We present a new method, QUIC-SVD, for fast approximation of the whole-matrix SVD based on a new sampling mechanism called the cosine tree. Our empirical tests show speedups of several orders of magnitude over exact SVD. Such scalability should enable QUIC-SVD to accelerate and enable a wide array of SVD-based methods and applications. 1
2 0.72957116 10 nips-2008-A rational model of preference learning and choice prediction by children
Author: Christopher G. Lucas, Thomas L. Griffiths, Fei Xu, Christine Fawcett
Abstract: Young children demonstrate the ability to make inferences about the preferences of other agents based on their choices. However, there exists no overarching account of what children are doing when they learn about preferences or how they use that knowledge. We use a rational model of preference learning, drawing on ideas from economics and computer science, to explain the behavior of children in several recent experiments. Specifically, we show how a simple econometric model can be extended to capture two- to four-year-olds’ use of statistical information in inferring preferences, and their generalization of these preferences. 1
3 0.64121741 195 nips-2008-Regularized Policy Iteration
Author: Amir M. Farahmand, Mohammad Ghavamzadeh, Shie Mannor, Csaba Szepesvári
Abstract: In this paper we consider approximate policy-iteration-based reinforcement learning algorithms. In order to implement a flexible function approximation scheme we propose the use of non-parametric methods with regularization, providing a convenient way to control the complexity of the function approximator. We propose two novel regularized policy iteration algorithms by adding L2 -regularization to two widely-used policy evaluation methods: Bellman residual minimization (BRM) and least-squares temporal difference learning (LSTD). We derive efficient implementation for our algorithms when the approximate value-functions belong to a reproducing kernel Hilbert space. We also provide finite-sample performance bounds for our algorithms and show that they are able to achieve optimal rates of convergence under the studied conditions. 1
4 0.6226058 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization
Author: Shai Shalev-shwartz, Sham M. Kakade
Abstract: We describe a primal-dual framework for the design and analysis of online strongly convex optimization algorithms. Our framework yields the tightest known logarithmic regret bounds for Follow-The-Leader and for the gradient descent algorithm proposed in Hazan et al. [2006]. We then show that one can interpolate between these two extreme cases. In particular, we derive a new algorithm that shares the computational simplicity of gradient descent but achieves lower regret in many practical situations. Finally, we further extend our framework for generalized strongly convex functions. 1
5 0.56346095 66 nips-2008-Dynamic visual attention: searching for coding length increments
Author: Xiaodi Hou, Liqing Zhang
Abstract: A visual attention system should respond placidly when common stimuli are presented, while at the same time keep alert to anomalous visual inputs. In this paper, a dynamic visual attention model based on the rarity of features is proposed. We introduce the Incremental Coding Length (ICL) to measure the perspective entropy gain of each feature. The objective of our model is to maximize the entropy of the sampled visual features. In order to optimize energy consumption, the limit amount of energy of the system is re-distributed amongst features according to their Incremental Coding Length. By selecting features with large coding length increments, the computational system can achieve attention selectivity in both static and dynamic scenes. We demonstrate that the proposed model achieves superior accuracy in comparison to mainstream approaches in static saliency map generation. Moreover, we also show that our model captures several less-reported dynamic visual search behaviors, such as attentional swing and inhibition of return. 1
6 0.557612 71 nips-2008-Efficient Sampling for Gaussian Process Inference using Control Variables
7 0.55749071 62 nips-2008-Differentiable Sparse Coding
9 0.55489802 83 nips-2008-Fast High-dimensional Kernel Summations Using the Monte Carlo Multipole Method
10 0.5548957 4 nips-2008-A Scalable Hierarchical Distributed Language Model
11 0.55470318 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
12 0.55458778 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
13 0.5537582 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
14 0.55289674 118 nips-2008-Learning Transformational Invariants from Natural Movies
15 0.5519495 192 nips-2008-Reducing statistical dependencies in natural signals using radial Gaussianization
16 0.5518896 99 nips-2008-High-dimensional support union recovery in multivariate regression
17 0.55146271 129 nips-2008-MAS: a multiplicative approximation scheme for probabilistic inference
18 0.55098206 137 nips-2008-Modeling Short-term Noise Dependence of Spike Counts in Macaque Prefrontal Cortex
19 0.55089766 138 nips-2008-Modeling human function learning with Gaussian processes
20 0.55088669 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE