nips nips2008 nips2008-29 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Vlad I. Morariu, Balaji V. Srinivasan, Vikas C. Raykar, Ramani Duraiswami, Larry S. Davis
Abstract: Many machine learning algorithms require the summation of Gaussian kernel functions, an expensive operation if implemented straightforwardly. Several methods have been proposed to reduce the computational complexity of evaluating such sums, including tree and analysis based methods. These achieve varying speedups depending on the bandwidth, dimension, and prescribed error, making the choice between methods difficult for machine learning tasks. We provide an algorithm that combines tree methods with the Improved Fast Gauss Transform (IFGT). As originally proposed the IFGT suffers from two problems: (1) the Taylor series expansion does not perform well for very low bandwidths, and (2) parameter selection is not trivial and can drastically affect performance and ease of use. We address the first problem by employing a tree data structure, resulting in four evaluation methods whose performance varies based on the distribution of sources and targets and input parameters such as desired accuracy and bandwidth. To solve the second problem, we present an online tuning approach that results in a black box method that automatically chooses the evaluation method and its parameters to yield the best performance for the input data, desired accuracy, and bandwidth. In addition, the new IFGT parameter selection approach allows for tighter error bounds. Our approach chooses the fastest method at negligible additional cost, and has superior performance in comparisons with previous approaches. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Automatic online tuning for fast Gaussian summation Vlad I. [sent-1, score-0.098]
2 Several methods have been proposed to reduce the computational complexity of evaluating such sums, including tree and analysis based methods. [sent-16, score-0.166]
3 We provide an algorithm that combines tree methods with the Improved Fast Gauss Transform (IFGT). [sent-18, score-0.141]
4 As originally proposed the IFGT suffers from two problems: (1) the Taylor series expansion does not perform well for very low bandwidths, and (2) parameter selection is not trivial and can drastically affect performance and ease of use. [sent-19, score-0.101]
5 We address the first problem by employing a tree data structure, resulting in four evaluation methods whose performance varies based on the distribution of sources and targets and input parameters such as desired accuracy and bandwidth. [sent-20, score-0.414]
6 To solve the second problem, we present an online tuning approach that results in a black box method that automatically chooses the evaluation method and its parameters to yield the best performance for the input data, desired accuracy, and bandwidth. [sent-21, score-0.223]
7 In addition, the new IFGT parameter selection approach allows for tighter error bounds. [sent-22, score-0.138]
8 Our approach chooses the fastest method at negligible additional cost, and has superior performance in comparisons with previous approaches. [sent-23, score-0.063]
9 1 Introduction Gaussian summations occur in many machine learning algorithms, including kernel density estimation [1], Gaussian process regression [2], fast particle smoothing [3], and kernel based machine learning techniques that need to solve a linear system with a similarity matrix [4]. [sent-24, score-0.115]
10 In such algorithms, the sum $g(y_j) = \sum_{i=1}^{N} q_i\, e^{-\|x_i - y_j\|^2/h^2}$ must be computed for $j = 1, \ldots, M$. [sent-25, score-0.094]
11 Here $\{x_1, \ldots, x_N\}$ and $\{y_1, \ldots, y_M\}$ are the d-dimensional source and target (or reference and query) points, respectively; $q_i$ is the weight associated with $x_i$; and h is the bandwidth. [sent-34, score-0.341]
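To make the quadratic cost concrete, here is a minimal sketch of the direct (naive) evaluation of this sum; the function name and array layout are illustrative and are not taken from the released code.

```python
import numpy as np

def gauss_transform_direct(X, Y, q, h):
    """Naive O(N M d) evaluation of g(y_j) = sum_i q_i exp(-||x_i - y_j||^2 / h^2)."""
    # X: (N, d) sources, Y: (M, d) targets, q: (N,) weights, h: bandwidth
    g = np.empty(len(Y))
    for j, y in enumerate(Y):
        d2 = np.sum((X - y) ** 2, axis=1)        # squared distances to every source
        g[j] = np.sum(q * np.exp(-d2 / h ** 2))  # accumulate weighted Gaussian contributions
    return g
```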
12 To reduce the computational complexity, Greengard and Strain proposed the Fast Gauss Transform (FGT) [5], using two expansions, the far-field Hermite expansion and the local Taylor expansion, and a translation process that converts between the two, yielding an overall complexity of O(M + N ). [sent-36, score-0.065]
13 Dual-tree methods [7, 8, 9, 10] approach the problem by building two separate trees for the source and target points respectively, and recursively considering contributions from nodes of the source tree to nodes of the target tree. [sent-40, score-0.584]
14 The most recent works [9, 10] present new expansions and error control schemes that yield improved results for bandwidths in a large range above and below the optimal bandwidth, as determined by the standard least-squares cross-validation score [11]. [sent-41, score-0.177]
15 Efficiency across bandwidth scales is important in cases where the optimal bandwidth must be searched for. [sent-42, score-0.352]
16 ∗ Our code is available for download as open source at http://sourceforge. [sent-43, score-0.135]
17 Another approach, the Improved Fast Gauss Transform (IFGT) [6, 12, 13], uses a Taylor expansion and a space subdivision different from those of the original FGT, allowing for efficient evaluation in higher dimensions. [sent-45, score-0.092]
18 However, the approach as initially presented in [6, 12] was not accompanied by an automatic parameter selection algorithm. [sent-47, score-0.109]
19 Because the parameters interact in a non-trivial way, some authors designed simple parameter selection methods that meet the error bounds but do not maximize performance [14]; others attempted, unsuccessfully, to choose parameters, reporting times of “∞” for IFGT [9, 10]. [sent-48, score-0.098]
20 Recently, Raykar et al. [13] presented an approach that selects parameters to minimize the constant term that appears in the asymptotic complexity of the method, while guaranteeing that error bounds are satisfied. [sent-49, score-0.114]
21 In addition, the IFGT performs poorly at low bandwidths because of the number of Taylor expansion terms that must be retained to meet error bounds. [sent-52, score-0.241]
22 We address both problems with IFGT: 1) small bandwidth performance, and 2) parameter selection. [sent-53, score-0.176]
23 First we employ a tree data structure [15, 16] that allows for fast neighbor search and greatly speeds up computation for low bandwidths. [sent-54, score-0.237]
24 We improve parameter selection by removing the assumption that data is uniformly distributed and by providing a method for selecting individual source and target truncation numbers that allows for tighter error bounds. [sent-56, score-0.607]
25 Finally, we provide an algorithm that automatically selects the evaluation method that is likely to be fastest for the given data, bandwidth, and error tolerance. [sent-57, score-0.204]
26 This is done in a way that is automatic and transparent to the user, as for other software packages such as FFTW [17] and ATLAS [18]. [sent-58, score-0.072]
27 The speedup is achieved by employing a truncated Taylor series factorization, using a space sub-division to reduce the number of terms needed to satisfy the error bound, and ignoring sources whose contributions are negligible. [sent-61, score-0.269]
28 The factorization that IFGT uses involves the truncated multivariate Taylor expansion $e^{-\|y_j - x_i\|^2/h^2} = e^{-\|x_i - x_*\|^2/h^2}\, e^{-\|y_j - x_*\|^2/h^2} \sum_{|\alpha| \le p-1} \frac{2^{|\alpha|}}{\alpha!} \left(\frac{x_i - x_*}{h}\right)^{\alpha} \left(\frac{y_j - x_*}{h}\right)^{\alpha} + \Delta_{ij}$.  (1) [sent-63, score-0.42]
29 Because reducing the distance $\|x_i - x_*\|$ also reduces the error bound given above, the sources can be divided into K clusters, so that the Taylor series center of expansion for source $x_i$ is the center of the cluster to which the source belongs. [sent-66, score-0.69]
30 Because of the rapid decay of the Gaussian function, the contribution of sources in cluster k can be ignored if $\|y_j - c_k\| > r_y^k = r_x^k + h\sqrt{\log(1/\epsilon)}$, where $c_k$ and $r_x^k$ are the center and radius of the k-th cluster, respectively. [sent-67, score-1.109]
31 In [13], the authors ensure that the error bound is met by choosing the truncation number $p_i$ for each source so that $\Delta_{ij} \le \epsilon$. [sent-68, score-0.46]
32 This guarantees that $|\hat{g}(y_j) - g(y_j)| = |\sum_{i=1}^{N} q_i \Delta_{ij}| \le \sum_{i=1}^{N} |q_i|\,\epsilon = Q\epsilon$. [sent-69, score-0.094]
33 Figure 1: The four evaluation methods: direct, direct+tree, ifgt, and ifgt+tree. [sent-91, score-0.729]
34 First, the number of clusters K, maximum truncation number pmax , and the cut-off radius r are selected by assuming that sources are uniformly distributed. [sent-94, score-0.86]
35 Next, the sources are clustered into K clusters with centers $c_1, \ldots, c_K$, and the set of sources S is partitioned into $S_1, \ldots, S_K$. [sent-98, score-0.109]
36 Using the maximum cluster radius $r_x$, the truncation number $p_{max}$ that satisfies the worst-case error bound is found. [sent-102, score-0.936]
37 Choosing $p_i$ for each source $x_i$ so that $\Delta_{ij} \le \epsilon$, source contributions are accumulated at the cluster centers: $C_\alpha^k = \sum_{x_i \in S_k} q_i\, \frac{2^{|\alpha|}}{\alpha!}\, e^{-\|x_i - c_k\|^2/h^2} \left(\frac{x_i - c_k}{h}\right)^{\alpha} \mathbf{1}_{|\alpha| \le p_i - 1}$. [sent-103, score-0.783]
38 Because the number of values of $\alpha$ such that $|\alpha| \le p$ is $r_{pd} = C(p+d,\, d)$, the total complexity of the algorithm is $O\big((N + M n_c)(\log K + r_{(p_{max}-1)d})\big)$, where $n_c$ is the number of cluster centers that are within the cut-off radius of a target point. [sent-105, score-0.792]
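The following sketch illustrates the coefficient accumulation and the target-side evaluation described above; it uses a single truncation number p for every source rather than the per-source $p_i$ of the paper, assumes the clustering (labels, centers, cut-off radii r_y) is already given, and all function names are ours.

```python
import numpy as np
from itertools import product
from math import factorial

def multi_indices(d, p):
    """All multi-indices alpha in N^d with |alpha| <= p - 1."""
    return [a for a in product(range(p), repeat=d) if sum(a) <= p - 1]

def ifgt_coefficients(X, q, centers, labels, h, p):
    """Accumulate C_alpha^k = sum_{x_i in S_k} q_i 2^|a|/a! e^{-||x_i-c_k||^2/h^2} ((x_i-c_k)/h)^a."""
    alphas = multi_indices(X.shape[1], p)
    const = np.array([2.0 ** sum(a) / np.prod([factorial(ai) for ai in a]) for a in alphas])
    C = np.zeros((len(centers), len(alphas)))
    for i, x in enumerate(X):
        k = labels[i]
        u = (x - centers[k]) / h
        w = q[i] * np.exp(-np.sum((x - centers[k]) ** 2) / h ** 2)
        mono = np.array([np.prod(u ** np.array(a)) for a in alphas])  # ((x_i - c_k)/h)^alpha
        C[k] += w * const * mono
    return C, alphas

def ifgt_evaluate(Y, centers, C, alphas, h, r_y):
    """Sum the truncated series only over clusters within each target's cut-off radius."""
    g = np.zeros(len(Y))
    for j, y in enumerate(Y):
        for k, c in enumerate(centers):
            d2 = np.sum((y - c) ** 2)
            if d2 > r_y[k] ** 2:
                continue                  # far cluster: each source contributes less than |q_i| * eps
            v = (y - c) / h
            mono = np.array([np.prod(v ** np.array(a)) for a in alphas])
            g[j] += np.exp(-d2 / h ** 2) * np.dot(C[k], mono)
    return g
```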
39 Searching for clusters within the cut-off radius of each target can take time O(M K), but efficient data-structures can be used to reduce the cost to O(M nc log K). [sent-107, score-0.534]
40 An increase in either K or pmax increases the total cost of the algorithm. [sent-109, score-0.302]
41 However, few sources have a contribution greater than qi ǫ at low bandwidths, since the cut-off radius becomes very small. [sent-111, score-0.38]
42 Also, because the number of clusters increases as the bandwidth decreases, we need an efficient way of searching for clusters that are within the cut-off radius. [sent-112, score-0.354]
43 For this reason, a tree data structure can be used since it allows for efficient fixed-radius nearest neighbor search. [sent-113, score-0.167]
44 If h is moderately low, a tree data structure can be built on the cluster centers, such that the nc influential clusters within the cut-off radius can be found in O(nc log K) time [15, 16]. [sent-114, score-0.726]
45 If the bandwidth is very low, then it is more efficient to simply find all source points xi that influence a target yj and perform exact evaluation for those source points. [sent-115, score-0.758]
46 Thus, if ns source points are within the cut-off radius of yj , then the time to build the structure is O(N log N ) and the time to perform a query is O(ns log N ) for each target. [sent-116, score-0.643]
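A sketch of the direct+tree variant follows; the released implementation uses the nearest-neighbor structures of [15, 16], so SciPy's kd-tree is only a stand-in here, and the cut-off radius formula mirrors the one used for clusters above.

```python
import numpy as np
from scipy.spatial import cKDTree  # stand-in for the fixed-radius search structure of [15, 16]

def gauss_transform_direct_tree(X, Y, q, h, eps):
    """direct+tree: exact summation restricted to the sources within the cut-off radius."""
    r = h * np.sqrt(np.log(1.0 / eps))     # beyond r, a source contributes less than |q_i| * eps
    tree = cKDTree(X)                      # O(N log N) build on the source points
    g = np.zeros(len(Y))
    for j, y in enumerate(Y):
        idx = tree.query_ball_point(y, r)  # the n_s influential sources for this target
        if idx:
            d2 = np.sum((X[idx] - y) ** 2, axis=1)
            g[j] = np.sum(q[idx] * np.exp(-d2 / h ** 2))
    return g
```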
47 Thus, we have four methods that may be used for evaluation of the Gauss Transform: direct evaluation, direct evaluation with the tree data structure, IFGT evaluation, and IFGT evaluation with a tree data structure on the cluster centers. [sent-117, score-0.781]
48 Because a poor choice can be very costly (e.g., using direct+tree evaluation when ifgt is optimal could result in a running time that is many orders of magnitude larger), we will need an efficient and online method selection approach, which is presented in section 5. [sent-121, score-0.757]
49 [Figure 2 plot residue removed; recoverable labels: actual vs. predicted max cluster radius r_x, and speedup for d = 1 to 6 as a function of bandwidth h.]
50 Figure 2: Selecting pmax and K using cluster radius, for M = N = 20000, sources dist.
51 Left: Predicted cluster radius as $K^{-1/d}$ vs. actual cluster radius for d = 3. [sent-132, score-0.622]
52 In [13], the pointwise error bounds described in Eq. 1 were used in an automatic parameter selection scheme that is optimized when sources are uniformly distributed. [sent-136, score-0.241]
54 We remove the uniformity assumption and also make the error bounds tighter by selecting individual source and target truncation numbers to satisfy cluster-wise error bounds instead of the worst-case point-wise error bounds. [sent-137, score-0.719]
55 1 Number of Clusters and Maximum Truncation Number The task of selecting the number of clusters K and maximum truncation number pmax is difficult because they depend on each other indirectly through the source distribution. [sent-140, score-0.71]
56 For example, increasing K decreases the cluster radius, which allows for a lower truncation number while still satisfying the error bound; conversely, increasing pmax allows clusters to have a larger radius, which allows for a smaller K. [sent-141, score-0.782]
57 Unfortunately, we cannot find the balance between the two without analyzing the source distribution because it influences the rate at which the cluster radius decreases. [sent-143, score-0.434]
58 The uniformity assumption leads to an estimate of the maximum cluster radius, $r_x \sim K^{-1/d}$ [13]. [sent-144, score-0.262]
59 As Fig. 2 shows, the actual $r_x$ will decrease faster than $K^{-1/d}$, leading to over-clustering and increased running time. [sent-146, score-0.166]
60 Our solution is to perform clustering as part of the parameter selection process, obtaining the actual cluster radii for each value of K. [sent-147, score-0.231]
61 We can then increment the value K, obtain the maximum cluster radius, and then find the lowest p that satisfies the error bound, picking the final value K which yields the lowest computational cost. [sent-150, score-0.257]
62 Note that if we simply set the maximum number of clusters to Klimit = N , we would spend O(N log N ) time to estimate parameters. [sent-151, score-0.115]
63 However, in practice, the optimal value of K is low relative to N , and it is possible to detect when we cannot lower cost further by increasing K or lowering pmax , thus allowing the search to terminate early. [sent-152, score-0.352]
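A sketch of this parameter search is given below; `error_bound` and `cost_estimate` are assumed helpers standing in for the paper's worst-case truncation bound and cost expression, and incremental farthest-point (k-center) clustering yields the actual maximum cluster radius at every K.

```python
import numpy as np

def choose_K_and_pmax(X, h, eps, K_limit, p_limit, error_bound, cost_estimate):
    """Sweep K, track the actual max cluster radius, and keep the cheapest feasible (K, p)."""
    N = len(X)
    centers = [X[np.random.randint(N)]]              # first center chosen at random
    dist = np.linalg.norm(X - centers[0], axis=1)    # distance of each source to its center
    best = None
    for K in range(1, K_limit + 1):
        rx = dist.max()                              # actual max cluster radius for K centers
        p = next((p for p in range(1, p_limit + 1)
                  if error_bound(rx, h, p) <= eps), None)
        if p is not None:
            cost = cost_estimate(N, K, p)
            if best is None or cost < best[2]:
                best = (K, p, cost)
            # in practice the sweep can stop early once the cost can no longer decrease
        nxt = X[dist.argmax()]                       # farthest-point heuristic: next center
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - nxt, axis=1))
    return best                                      # (K, pmax, estimated cost)
```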
64 2 Individual Truncation Numbers by Cluster-wise Error Bounds Once the maximum truncation number pmax is selected, we can guarantee that the worst source-target pairwise error is below the desired error bound. [sent-155, score-0.706]
65 However, simply setting each source and target truncation number to pmax wastes computational resources since most source-target pairs do not contribute much error. [sent-156, score-0.691]
66 For d=1, the gain of lowering truncation is not large enough to make up for overhead costs. [sent-159, score-0.265]
67 However, this means that each cluster will have to compute $r_{(p_i-1)d}$ coefficients, where $p_i$ is the truncation number of its farthest point. [sent-161, score-0.416]
68 We propose a method for further decreasing most individual source and target truncation numbers by considering the total error incurred by evaluation at any target: $|\hat{g}(y_j) - g(y_j)| \le \sum_{k:\, \|y_j - c_k\| \le r_y^k}\; \sum_{x_i \in S_k} |q_i|\, \Delta_{ij} \; + \sum_{k:\, \|y_j - c_k\| > r_y^k}\; \sum_{x_i \in S_k} |q_i|\, \epsilon$. [sent-162, score-0.671]
69 The left term on the r.h.s. is the error from truncating the Taylor series for the clusters that are within the cut-off radius, and the right term bounds the error from ignoring clusters outside the cut-off radius $r_y^k$. [sent-165, score-0.457]
70 Instead of ensuring that $\Delta_{ij} \le \epsilon$ for all (i, j) pairs, we ensure $\sum_{x_i \in S_k} |q_i|\, \Delta_{ij} \le \sum_{x_i \in S_k} |q_i|\, \epsilon = Q_k \epsilon$ for all clusters. [sent-166, score-0.084]
71 In this case, if a cluster is outside the cut-off radius, then the error incurred is no greater than Qk ǫ; otherwise, the cluster-wise error bounds guarantee that the error is still no greater than Qk ǫ. [sent-167, score-0.357]
72 Summing over all clusters, we have $|\hat{g}(y_j) - g(y_j)| \le \sum_k Q_k \epsilon = Q\epsilon$, our desired error bound. [sent-168, score-0.177]
73 The lowest truncation number that satisfies the cluster-wise error for each cluster is found in $O(p_{max} N)$ time by evaluating the cluster-wise error for all clusters for each value of $p \in \{1, \ldots, p_{max}\}$. [sent-169, score-0.617]
74 In addition, we can find individual target point truncation numbers by not only considering the worst-case target distance $r_y^k$ when computing cluster error contributions, but also considering target errors for sources at varying distance ranges from each cluster center. [sent-173, score-1.081]
75 This yields concentric regions around each cluster, each of which has its own truncation number, which can be used for targets in that region. [sent-174, score-0.263]
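A sketch of the cluster-wise truncation selection is shown below; `trunc_bound(dx, ry, h, p)` is an assumed helper returning the per-source truncation errors $\Delta_{ij}$ for sources at distances dx from the center with worst-case target distance ry, since the exact bound expression is not reproduced here.

```python
import numpy as np

def cluster_truncation_numbers(X, q, centers, labels, h, eps, p_max, trunc_bound):
    """Pick, per cluster, the smallest p whose weighted truncation error stays below Q_k * eps."""
    p_k = np.full(len(centers), p_max)
    for k in range(len(centers)):
        idx = np.where(labels == k)[0]
        if len(idx) == 0:
            continue
        dx = np.linalg.norm(X[idx] - centers[k], axis=1)   # source distances to c_k
        ry = dx.max() + h * np.sqrt(np.log(1.0 / eps))     # worst-case in-range target distance
        Qk_eps = np.abs(q[idx]).sum() * eps
        for p in range(1, p_max + 1):
            if np.sum(np.abs(q[idx]) * trunc_bound(dx, ry, h, p)) <= Qk_eps:
                p_k[k] = p
                break
    return p_k
```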
76 5 Automatic Tuning via Method Selection For any input source and target point distribution, requested absolute error, and Gaussian bandwidth, we have the option of evaluating the Gauss Transform using any one of four methods: direct, direct+tree, ifgt, and ifgt+tree. [sent-176, score-0.236]
77 Thus, we require an efficient scheme to automatically choose the best method online based on the input. [sent-179, score-0.084]
78 The scheme must use the distribution of both the source and target points in making its decision, while at the same time avoiding long computations that would defeat the purpose of automatic method selection. [sent-180, score-0.335]
79 [Figure caption, right panel: Ratio of automatic to fastest method and automatic to slowest method, showing that method selection incurs very small overhead while preventing potentially large slowdowns.] [sent-184, score-0.306]
80 A simple approach to estimating the distribution dependent ns and nc is to build a tree on sample source points and compute the average number of neighbors to a sampled set of targets. [sent-186, score-0.602]
81 However, ns and nc can be estimated in O(M + N) time even without sampling by using techniques from the field of database management systems for estimating spatial join selectivity [21]. [sent-188, score-0.38]
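A minimal version of the sampling estimate described two sentences above (the spatial-join-selectivity estimator of [21] is not shown) could look like this; the sample size and function name are ours.

```python
import numpy as np
from scipy.spatial import cKDTree

def estimate_avg_neighbors(X, Y, r, n_samples=100):
    """Estimate n_s (or n_c): the average number of points of X within radius r of a target."""
    tree = cKDTree(X)  # X can be the source points (for n_s) or the cluster centers (for n_c)
    pick = np.random.choice(len(Y), size=min(n_samples, len(Y)), replace=False)
    return np.mean([len(tree.query_ball_point(Y[j], r)) for j in pick])
```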
82 Given ns , we predict the cost of direct+tree, and estimate Klimit as the highest value that might yield lower costs than direct or direct+tree. [sent-189, score-0.263]
83 If Klimit > 0, then we can estimate the parameters and costs of ifgt or ifgt+tree. [sent-190, score-0.563]
84 As figure 4 shows, our method selection approach chooses the correct method across bandwidths at very low computational cost. [sent-192, score-0.241]
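As a rough illustration of the final comparison step, the selector can evaluate cost estimates such as the following and pick the smallest; the expressions and constants are illustrative stand-ins, not the paper's calibrated cost model.

```python
import numpy as np

def pick_method(N, M, d, n_s, n_c, K, r_pd):
    """Compare rough per-method cost estimates and return the predicted fastest method."""
    costs = {
        "direct":      float(N) * M * d,
        "direct+tree": N * np.log2(N) + M * n_s * d,                 # tree build + local exact sums
        "ifgt":        (N + M * K) * r_pd,                           # every cluster touched per target
        "ifgt+tree":   (N + M * n_c) * (np.log2(max(K, 2)) + r_pd),  # cf. O((N + M n_c)(log K + r_pd))
    }
    return min(costs, key=costs.get), costs
```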
85 As in [10], we scale the data to fit the unit hypercube and evaluate the Gauss transform using all 50K points as sources and targets, with bandwidths varying from $10^{-3}$ to $10^{3}$ times the optimal bandwidth. [sent-195, score-0.267]
86 Because our method satisfies an absolute error, we use for the absolute $\epsilon$ the highest value that guarantees a relative error of $10^{-2}$ (to achieve this, $\epsilon$ ranges from $10^{-1}$ to $10^{-4}$ by factors of 10). [sent-196, score-0.093]
87 We do not include the time required to choose ǫ (since we are doing this only to evaluate the running times of the two methods for the same relative errors) but we do include the time to automatically select the method and parameters. [sent-197, score-0.135]
88 For most bandwidths our method is generally faster by about one order of magnitude (sometimes as much as 1000 times faster). [sent-200, score-0.177]
89 For each dataset we ran five experiments: the first four fixed one of the four methods (direct, direct+tree, ifgt, ifgt+tree) and used it for all conjugate gradient iterations; the fifth automatically selected the best method at each iteration (denoted by auto in figure 6). [sent-224, score-0.361]
90 To validate our solutions, we measured the relative error between the vectors found by the direct method and our approximate methods; the errors were small, ranging from $\sim 10^{-10}$ to $\sim 10^{-5}$. [sent-225, score-0.176]
91 As expected, auto chose the correct method for each dataset, incurring only a small overhead cost. [sent-226, score-0.301]
92 Also, for the abalone dataset, auto outperformed any of the fixed-method experiments; as the right side of figure 6 shows, halfway through the iterations the required accuracy decreased enough to make ifgt faster than direct evaluation. [sent-227, score-0.992]
93 By switching methods dynamically, the automatic selection approach outperformed any fixed method, further demonstrating the usefulness of our online tuning approach. [sent-228, score-0.161]
94 Finally, we embed our automatic method selection in the two-filter particle smoothing demo provided by the authors of [3]. [sent-230, score-0.177]
95 56s for the direct, dual-tree and automatic (ifgt was chosen) methods respectively. [sent-234, score-0.072]
96 [Table residue removed; columns: Dims, Size, direct, ifgt, direct-tree, ifgt-tree, auto; first dataset row begins: Robotarm, 2 dims, 1000 points.]
97 7 Conclusion We presented an automatic online tuning approach to Gaussian summations that combines a tree data structure with the IFGT; it is well suited for both high and low bandwidths, and users can treat it as a black box. [sent-262, score-0.438]
98 The approach also tunes IFGT parameters to the source distribution, and provides tighter error bounds. [sent-263, score-0.236]
99 Experiments demonstrated that our approach outperforms competing methods for most bandwidth settings, and dynamically adapts to various datasets and input parameters. [sent-264, score-0.205]
100 Improved fast Gauss transform and efficient kernel density estimation. [sent-317, score-0.088]
wordName wordTfidf (topN-words)
[('ifgt', 0.563), ('pmax', 0.277), ('auto', 0.239), ('truncation', 0.209), ('bandwidth', 0.176), ('nc', 0.171), ('ns', 0.155), ('radius', 0.153), ('yj', 0.148), ('cluster', 0.146), ('tree', 0.141), ('cpu', 0.137), ('source', 0.135), ('gauss', 0.133), ('ry', 0.129), ('ck', 0.123), ('bandwidths', 0.116), ('klimit', 0.113), ('sources', 0.109), ('costdirect', 0.097), ('costifgt', 0.097), ('qi', 0.094), ('rx', 0.09), ('clusters', 0.089), ('direct', 0.083), ('automatic', 0.072), ('worst', 0.071), ('target', 0.07), ('speedup', 0.066), ('ij', 0.062), ('error', 0.061), ('gpr', 0.056), ('targets', 0.054), ('taylor', 0.052), ('evaluation', 0.052), ('sk', 0.05), ('duraiswami', 0.048), ('feder', 0.048), ('robotarm', 0.048), ('fast', 0.046), ('abalone', 0.046), ('naive', 0.043), ('xi', 0.042), ('dik', 0.042), ('transform', 0.042), ('qk', 0.04), ('tighter', 0.04), ('expansion', 0.04), ('krylov', 0.039), ('selection', 0.037), ('particle', 0.036), ('pi', 0.033), ('contributions', 0.033), ('summations', 0.033), ('arya', 0.032), ('dfd', 0.032), ('dfdo', 0.032), ('dfto', 0.032), ('fgt', 0.032), ('gonzalez', 0.032), ('greene', 0.032), ('klaas', 0.032), ('ramani', 0.032), ('raykar', 0.032), ('rpd', 0.032), ('method', 0.032), ('fastest', 0.031), ('calculate', 0.031), ('four', 0.031), ('overhead', 0.03), ('faster', 0.029), ('datasets', 0.029), ('bounds', 0.028), ('tuning', 0.028), ('farthest', 0.028), ('greengard', 0.028), ('join', 0.028), ('automatically', 0.028), ('desired', 0.027), ('neighbor', 0.026), ('time', 0.026), ('atlas', 0.026), ('lowering', 0.026), ('uniformity', 0.026), ('cost', 0.025), ('lowest', 0.025), ('phase', 0.025), ('complexity', 0.025), ('clustering', 0.024), ('low', 0.024), ('inexact', 0.024), ('uential', 0.024), ('online', 0.024), ('actual', 0.024), ('centers', 0.024), ('gaussian', 0.023), ('running', 0.023), ('housing', 0.023), ('uniformly', 0.023), ('bound', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 29 nips-2008-Automatic online tuning for fast Gaussian summation
Author: Vlad I. Morariu, Balaji V. Srinivasan, Vikas C. Raykar, Ramani Duraiswami, Larry S. Davis
Abstract: Many machine learning algorithms require the summation of Gaussian kernel functions, an expensive operation if implemented straightforwardly. Several methods have been proposed to reduce the computational complexity of evaluating such sums, including tree and analysis based methods. These achieve varying speedups depending on the bandwidth, dimension, and prescribed error, making the choice between methods difficult for machine learning tasks. We provide an algorithm that combines tree methods with the Improved Fast Gauss Transform (IFGT). As originally proposed the IFGT suffers from two problems: (1) the Taylor series expansion does not perform well for very low bandwidths, and (2) parameter selection is not trivial and can drastically affect performance and ease of use. We address the first problem by employing a tree data structure, resulting in four evaluation methods whose performance varies based on the distribution of sources and targets and input parameters such as desired accuracy and bandwidth. To solve the second problem, we present an online tuning approach that results in a black box method that automatically chooses the evaluation method and its parameters to yield the best performance for the input data, desired accuracy, and bandwidth. In addition, the new IFGT parameter selection approach allows for tighter error bounds. Our approach chooses the fastest method at negligible additional cost, and has superior performance in comparisons with previous approaches. 1
2 0.10203908 83 nips-2008-Fast High-dimensional Kernel Summations Using the Monte Carlo Multipole Method
Author: Dongryeol Lee, Alexander G. Gray
Abstract: We propose a new fast Gaussian summation algorithm for high-dimensional datasets with high accuracy. First, we extend the original fast multipole-type methods to use approximation schemes with both hard and probabilistic error. Second, we utilize a new data structure called subspace tree which maps each data point in the node to its lower dimensional mapping as determined by any linear dimension reduction method such as PCA. This new data structure is suitable for reducing the cost of each pairwise distance computation, the most dominant cost in many kernel methods. Our algorithm guarantees probabilistic relative error on each kernel sum, and can be applied to high-dimensional Gaussian summations which are ubiquitous inside many kernel methods as the key computational bottleneck. We provide empirical speedup results on low to high-dimensional datasets up to 89 dimensions.
3 0.088734612 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations
Author: David Sontag, Amir Globerson, Tommi S. Jaakkola
Abstract: We propose a new class of consistency constraints for Linear Programming (LP) relaxations for finding the most probable (MAP) configuration in graphical models. Usual cluster-based LP relaxations enforce joint consistency on the beliefs of a cluster of variables, with computational cost increasing exponentially with the size of the clusters. By partitioning the state space of a cluster and enforcing consistency only across partitions, we obtain a class of constraints which, although less tight, are computationally feasible for large clusters. We show how to solve the cluster selection and partitioning problem monotonically in the dual LP, using the current beliefs to guide these choices. We obtain a dual message passing algorithm and apply it to protein design problems where the variables have large state spaces and the usual cluster-based relaxations are very costly. The resulting method solves many of these problems exactly, and significantly faster than a method that does not use partitioning. 1
4 0.084936306 19 nips-2008-An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis
Author: Gabriele Schweikert, Gunnar Rätsch, Christian Widmer, Bernhard Schölkopf
Abstract: We study the problem of domain transfer for a supervised classification task in mRNA splicing. We consider a number of recent domain transfer methods from machine learning, including some that are novel, and evaluate them on genomic sequence data from model organisms of varying evolutionary distance. We find that in cases where the organisms are not closely related, the use of domain adaptation methods can help improve classification performance.
5 0.074513242 53 nips-2008-Counting Solution Clusters in Graph Coloring Problems Using Belief Propagation
Author: Lukas Kroc, Ashish Sabharwal, Bart Selman
Abstract: We show that an important and computationally challenging solution space feature of the graph coloring problem (COL), namely the number of clusters of solutions, can be accurately estimated by a technique very similar to one for counting the number of solutions. This cluster counting approach can be naturally written in terms of a new factor graph derived from the factor graph representing the COL instance. Using a variant of the Belief Propagation inference framework, we can efficiently approximate cluster counts in random COL problems over a large range of graph densities. We illustrate the algorithm on instances with up to 100, 000 vertices. Moreover, we supply a methodology for computing the number of clusters exactly using advanced techniques from the knowledge compilation literature. This methodology scales up to several hundred variables. 1
6 0.071260698 117 nips-2008-Learning Taxonomies by Dependence Maximization
7 0.067964971 74 nips-2008-Estimating the Location and Orientation of Complex, Correlated Neural Activity using MEG
8 0.06491711 125 nips-2008-Local Gaussian Process Regression for Real Time Online Model Learning
9 0.062275436 47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation
10 0.060098577 48 nips-2008-Clustering via LP-based Stabilities
11 0.059941322 65 nips-2008-Domain Adaptation with Multiple Sources
12 0.057768673 84 nips-2008-Fast Prediction on a Tree
13 0.057598889 188 nips-2008-QUIC-SVD: Fast SVD Using Cosine Trees
14 0.05733633 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
15 0.056666423 55 nips-2008-Cyclizing Clusters via Zeta Function of a Graph
16 0.05534767 4 nips-2008-A Scalable Hierarchical Distributed Language Model
17 0.053299811 3 nips-2008-A Massively Parallel Digital Learning Processor
18 0.051105421 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features
19 0.050681513 2 nips-2008-A Convex Upper Bound on the Log-Partition Function for Binary Distributions
20 0.049209088 73 nips-2008-Estimating Robust Query Models with Convex Optimization
topicId topicWeight
[(0, -0.158), (1, -0.007), (2, -0.021), (3, 0.043), (4, 0.014), (5, -0.042), (6, 0.054), (7, -0.006), (8, 0.02), (9, -0.054), (10, 0.004), (11, 0.028), (12, -0.001), (13, 0.096), (14, -0.044), (15, 0.066), (16, -0.058), (17, 0.067), (18, 0.062), (19, -0.0), (20, 0.138), (21, 0.049), (22, -0.023), (23, 0.002), (24, -0.031), (25, 0.018), (26, 0.085), (27, -0.007), (28, -0.02), (29, 0.079), (30, -0.009), (31, 0.112), (32, -0.038), (33, 0.013), (34, 0.009), (35, 0.094), (36, -0.006), (37, 0.008), (38, 0.093), (39, 0.069), (40, -0.06), (41, -0.047), (42, -0.028), (43, 0.088), (44, -0.082), (45, 0.06), (46, -0.011), (47, -0.031), (48, 0.031), (49, -0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.93393803 29 nips-2008-Automatic online tuning for fast Gaussian summation
Author: Vlad I. Morariu, Balaji V. Srinivasan, Vikas C. Raykar, Ramani Duraiswami, Larry S. Davis
Abstract: Many machine learning algorithms require the summation of Gaussian kernel functions, an expensive operation if implemented straightforwardly. Several methods have been proposed to reduce the computational complexity of evaluating such sums, including tree and analysis based methods. These achieve varying speedups depending on the bandwidth, dimension, and prescribed error, making the choice between methods difficult for machine learning tasks. We provide an algorithm that combines tree methods with the Improved Fast Gauss Transform (IFGT). As originally proposed the IFGT suffers from two problems: (1) the Taylor series expansion does not perform well for very low bandwidths, and (2) parameter selection is not trivial and can drastically affect performance and ease of use. We address the first problem by employing a tree data structure, resulting in four evaluation methods whose performance varies based on the distribution of sources and targets and input parameters such as desired accuracy and bandwidth. To solve the second problem, we present an online tuning approach that results in a black box method that automatically chooses the evaluation method and its parameters to yield the best performance for the input data, desired accuracy, and bandwidth. In addition, the new IFGT parameter selection approach allows for tighter error bounds. Our approach chooses the fastest method at negligible additional cost, and has superior performance in comparisons with previous approaches. 1
2 0.64923155 83 nips-2008-Fast High-dimensional Kernel Summations Using the Monte Carlo Multipole Method
Author: Dongryeol Lee, Alexander G. Gray
Abstract: We propose a new fast Gaussian summation algorithm for high-dimensional datasets with high accuracy. First, we extend the original fast multipole-type methods to use approximation schemes with both hard and probabilistic error. Second, we utilize a new data structure called subspace tree which maps each data point in the node to its lower dimensional mapping as determined by any linear dimension reduction method such as PCA. This new data structure is suitable for reducing the cost of each pairwise distance computation, the most dominant cost in many kernel methods. Our algorithm guarantees probabilistic relative error on each kernel sum, and can be applied to high-dimensional Gaussian summations which are ubiquitous inside many kernel methods as the key computational bottleneck. We provide empirical speedup results on low to high-dimensional datasets up to 89 dimensions. 1 Fast Gaussian Kernel Summation In this paper, we propose new computational techniques for efficiently approximating the following sum for each query point qi ∈ Q: Φ(qi , R) = e−||qi −rj || 2 /(2h2 ) (1) rj ∈R where R is the reference set; each reference point is associated with a Gaussian function with a smoothing parameter h (the ’bandwidth’). This form of summation is ubiquitous in many statistical learning methods, including kernel density estimation, kernel regression, Gaussian process regression, radial basis function networks, spectral clustering, support vector machines, and kernel PCA [1, 4]. Cross-validation in all of these methods require evaluating Equation 1 for multiple values of h. Kernel density estimation, for example, requires |R| density estimate based on |R| − 1 points, yielding a brute-force computational cost scaling quadratically (that is O(|R| 2 )). Error bounds. Due to its expensive computational cost, many algorithms approximate the Gaussian kernel sums at the expense of reduced precision. Therefore, it is natural to discuss error bound criteria which measure the quality of the approximations with respect to their corresponding true values. The following error bound criteria are common in literature: Definition 1.1. An algorithm guarantees absolute error bound, if for each exact value Φ(q i , R) for qi ∈ Q, it computes Φ(qi , R) such that Φ(qi , R) − Φ(qi , R) ≤ . Definition 1.2. An algorithm guarantees relative error bound, if for each exact value Φ(q i , R) for qi ∈ Q, it computes Φ(qi , R) ∈ R such that Φ(qi , R) − Φ(qi , R) ≤ |Φ(qi , R)|. 1 Bounding the relative error (e.g., the percentage deviation) is much harder because the error bound criterion is in terms of the initially unknown exact quantity. As a result, many previous methods [7] have focused on bounding the absolute error. The relative error bound criterion is preferred to the absolute error bound criterion in statistical applications in which high accuracy is desired. Our new algorithm will enforce the following “relaxed” form of the relative error bound criterion, whose motivation will be discussed shortly. Definition 1.3. An algorithm guarantees (1 − α) probabilistic relative error bound, if for each exact value Φ(qi , R) for qi ∈ Q, it computes Φ(qi , R) ∈ R, such that with at least probability 0 < 1 − α < 1, Φ(qi , R) − Φ(qi , R) ≤ |Φ(qi , R)|. Previous work. The most successful class of acceleration methods employ “higher-order divide and conquer” or generalized N -body algorithms (GNA) [4]. 
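For concreteness, here is a minimal sketch (ours, not from the paper) of the brute-force summation in Equation 1 together with a check of the relative-error criterion of Definition 1.2; the function names and the synthetic data are illustrative assumptions only.

```python
import numpy as np

def naive_gauss_sum(Q, R, h):
    """Brute-force Phi(q_i, R) = sum_j exp(-||q_i - r_j||^2 / (2 h^2)) for every query."""
    # Pairwise squared distances between all queries and references: O(|Q| * |R| * D).
    d2 = ((Q[:, None, :] - R[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * h * h)).sum(axis=1)

def satisfies_relative_error(approx, exact, eps):
    """Definition 1.2: |approx - exact| <= eps * |exact| for every query."""
    return np.all(np.abs(approx - exact) <= eps * np.abs(exact))

# Tiny illustration on synthetic data.
rng = np.random.default_rng(0)
R = rng.random((500, 3))   # reference points
Q = rng.random((200, 3))   # query points
exact = naive_gauss_sum(Q, R, h=0.25)
print(satisfies_relative_error(exact * 1.001, exact, eps=0.01))  # True
```

The tree-based GNA methods discussed next avoid this quadratic O(|Q||R|) cost.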
This approach can use any spatial partioning tree such as kd-trees or ball-trees for both the query set Q and reference data R and performs a simulataneous recursive descent on both trees. GNA with relative error bounds (Definition 1.2) [5, 6, 11, 10] utilized bounding boxes and additional cached-sufficient statistics such as higher-order moments needed for series-expansion. [5, 6] utilized bounding-box based error bounds which tend to be very loose, which resulted in slow empirical performance around suboptimally small and large bandwidths. [11, 10] extended GNA-based Gaussian summations with series-expansion which provided tighter bounds; it showed enormous performance improvements, but only up to low dimensional settings (up to D = 5) since the number of required terms in series expansion increases exponentially with respect to D. [9] introduces an iterative sampling based GNA for accelerating the computation of nested sums (a related easier problem). Its speedup is achieved by replacing pessimistic error bounds provided by bounding boxes with normal-based confidence interval from Monte Carlo sampling. [9] demonstrates the speedup many orders of magnitude faster than the previous state of the art in the context of computing aggregates over the queries (such as the LSCV score for selecting the optimal bandwidth). However, the authors did not discuss the sampling-based approach for computations that require per-query estimates, such as those required for kernel density estimation. None of the previous approaches for kernel summations addresses the issue of reducing the computational cost of each distance computation which incurs O(D) cost. However, the intrinsic dimensionality d of most high-dimensional datasets is much smaller than the explicit dimension D (that is, d << D). [12] proposed tree structures using a global dimension reduction method, such as random projection, as a preprocessing step for efficient (1 + ) approximate nearest neighbor search. Similarly, we develop a new data structure for kernel summations; our new data structure is constructed in a top-down fashion to perform the initial spatial partitioning in the original input space R D and performs a local dimension reduction to a localized subset of the data in a bottom-up fashion. This paper. We propose a new fast Gaussian summation algorithm that enables speedup in higher dimensions. Our approach utilizes: 1) probabilistic relative error bounds (Definition 1.3) on kernel sums provided by Monte Carlo estimates 2) a new tree structure called subspace tree for reducing the computational cost of each distance computation. The former can be seen as relaxing the strict requirement of guaranteeing hard relative bound on very small quantities, as done in [5, 6, 11, 10]. The latter was mentioned as a possible way of ameliorating the effects of the curse of dimensionality in [14], a pioneering paper in this area. Notations. Each query point and reference point (a D-dimensional vector) is indexed by natural numbers i, j ∈ N, and denoted qi and rj respectively. For any set S, |S| denotes the number of elements in S. The entities related to the left and the right child are denoted with superscripts L and R; an internal node N has the child nodes N L and N R . 2 Gaussian Summation by Monte Carlo Sampling Here we describe the extension needed for probabilistic computation of kernel summation satisfying Definition 1.3. The main routine for the probabilistic kernel summation is shown in Algorithm 1. 
The function MCMM takes the query node Q and the reference node R (each initially called with the roots of the query tree and the reference tree, Qroot and Rroot ) and β (initially called with α value which controls the probability guarantee that each kernel sum is within relative error). 2 Algorithm 1 The core dual-tree routine for probabilistic Gaussian kernel summation. MCMM(Q, R, β) if C AN S UMMARIZE E XACT(Q, R, ) then S UMMARIZE E XACT(Q, R) else if C AN S UMMARIZE MC(Q, R, , β) then 5: S UMMARIZE MC(Q, R, , β) else if Q is a leaf node then if R is a leaf node then MCMMBASE(Q, R) 10: else MCMM Q, RL , β , MCMM Q, RR , β 2 2 else if R is a leaf node then MCMM(QL , R, β), MCMM(QR , R, β) 15: else MCMM QL , RL , β , MCMM QL , RR , β 2 2 MCMM QR , RL , β , MCMM QR , RR , β 2 2 The idea of Monte Carlo sampling used in the new algorithm is similar to the one in [9], except the sampling is done per query and we use approximations that provide hard error bounds as well (i.e. finite difference, exhaustive base case: MCMMBASE). This means that the approximation has less variance than a pure Monte Carlo approach used in [9]. Algorithm 1 first attempts approximations with hard error bounds, which are computationally cheaper than sampling-based approximations. For example, finite-difference scheme [5, 6] can be used for the C AN S UMMARIZE E XACT and S UMMARIZE E XACT functions in any general dimension. The C AN S UMMARIZE MC function takes two parameters that specify the accuracy: the relative error and its probability guarantee and decides whether to use Monte Carlo sampling for the given pair of nodes. If the reference node R contains too few points, it may be more efficient to process it using exact methods that use error bounds based on bounding primitives on the node pair or exhaustive pair-wise evaluations, which is determined by the condition: τ · minitial ≤ |R| where τ > 1 controls the minimum number of reference points needed for Monte Carlo sampling to proceed. If the reference node does contain enough points, then for each query point q ∈ Q, the S AMPLE routine samples minitial terms over the terms in the summation Φ(q, R) = Kh (||q − rjn ||) rjn ∈R where Φ(q, R) denotes the exact contribution of R to q’s kernel sum. Basically, we are interested in estimating Φ(q, R) by Φ(q, R) = |R|µS , where µS is the sample mean of S. From the Central 2 Limit Theorem, given enough m samples, µS N (µ, σS /m) where Φ(q, R) = |R|µ (i.e. µ is the average of the kernel value between q and any reference point r ∈ R); this implies that √ |µS − µ| ≤ zβ/2 σS / m with probability 1 − β. The pruning rule we have to enforce for each query point for the contribution of R is: σS Φ(q, R) zβ/2 √ ≤ |R| m where σS the sample standard deviation of S. Since Φ(q, R) is one of the unknown quanities we want to compute, we instead enforce the following: σS zβ/2 √ ≤ m Φl (q, R) + |R| µS − |R| zβ/2 σS √ m (2) where Φl (q, R) is the currently running lower bound on the sum computed using exact methods z σ √ and |R| µS − β/2 S is the probabilistic component contributed by R. Denoting Φl,new (q, R) = m zβ/2 σ Φl (q, R) + |R| µS − √ S , the minimum number of samples for q needed to achieve the |S| 3 target error the right side of the inequality in Equation 2 with at least probability of 1 − β is: 2 2 m ≥ zβ/2 σS (|R| + |R|)2 2 (Φl (q, R) + |R|µ )2 S If the given query node and reference node pair cannot be pruned using either nonprobabilistic/probabilistic approximations, then we recurse on a smaller subsets of two sets. 
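The per-query sampling test above (Equation 2) can be sketched as follows. This is an illustrative simplification, not the authors' code: the running exact lower bound Φl is taken to be zero, the batch size and sample cap are arbitrary, and the normal quantile z_{β/2} is fixed at 1.96 (β = 0.05).

```python
import numpy as np

def mc_kernel_sum(q, R, h, eps=0.1, z=1.96, batch=32, m_max=5000, rng=None):
    """Monte Carlo estimate of Phi(q, R) = sum_r exp(-||q - r||^2 / (2 h^2))."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(R)
    samples = np.empty(0)
    while True:
        idx = rng.integers(0, n, size=batch)          # sample reference points with replacement
        d2 = ((q - R[idx]) ** 2).sum(axis=1)
        samples = np.concatenate([samples, np.exp(-d2 / (2.0 * h * h))])
        mu, sd = samples.mean(), samples.std(ddof=1)
        half_width = z * sd / np.sqrt(len(samples))   # CLT confidence half-width on the mean
        # Pruning test (Equation 2 with the exact lower bound Phi_l taken as 0):
        # the absolute error on |R|*mu must be at most eps times the probabilistic
        # lower bound |R|*(mu - half_width) on Phi(q, R).
        if half_width <= eps * max(mu - half_width, 0.0) or len(samples) >= m_max:
            return n * mu
```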
In particular, when dividing over the reference node R, we recurse with half of the β value 1 . We now state the probablistic error guarantee of our algorithm as a theorem. Theorem 2.1. After calling MCMM with Q = Qroot , R = Rroot , and β = α, Algorithm 1 approximates each Φ(q, R) with Φ(q, R) such that Definition 1.3 holds. Proof. For a query/reference (Q, R) pair and 0 < β < 1, MCMMBASE and S UMMARIZE E XACT compute estimates for q ∈ Q such that Φ(q, R) − Φ(q, R) < Φ(q,R)|R| with probability at |R| least 1 > 1 − β. By Equation 2, S UMMARIZE MC computes estimates for q ∈ Q such that Φ(q, R) − Φ(q, R) < Φ(q,R)|R| with probability 1 − β. |R| We now induct on |Q ∪ R|. Line 11 of Algorithm 1 divides over the reference whose subcalls compute estimates that satisfy Φ(q, RL ) − Φ(q, RL ) ≤ Φ(q,R)|RR | |R| L each with at least 1 − β 2 Φ(q,R)|RL | |R| and Φ(q, RR ) − Φ(q, RR ) ≤ probability by induction hypothesis. For q ∈ Q, Φ(q, R) = Φ(q, R )+ Φ(q, R ) which means |Φ(q, R)−Φ(q, R)| ≤ Φ(q,R)|R| with probability at least 1−β. |R| Line 14 divides over the query and each subcall computes estimates that hold with at least probability 1 − β for q ∈ QL and q ∈ QR . Line 16 and 17 divides both over the query and the reference, and the correctness can be proven similarly. Therefore, M CM M (Qroot , Rroot , α) computes estimates satisfying Definition 1.3. R “Reclaiming” probability. We note that the assigned probability β for the query/reference pair computed with exact bounds (S UMMARIZE E XACT and MCMMBASE) is not used. This portion of the probability can be “reclaimed” in a similar fashion as done in [10] and re-used to prune more aggressively in the later stages of the algorithm. All experiments presented in this paper were benefited by this simple modification. 3 Subspace Tree A subspace tree is basically a space-partitioning tree with a set of orthogonal bases associated with each node N : N.Ω = (µ, U, Λ, d) where µ is the mean, U is a D×d matrix whose columns consist of d eigenvectors, and Λ the corresponding eigenvalues. The orthogonal basis set is constructed using a linear dimension reduction method such as PCA. It is constructed in the top-down manner using the PARTITION S ET function dividing the given set of points into two (where the PARTITION S ET function divides along the dimension with the highest variance in case of a kd-tree for example), with the subspace in each node formed in the bottom-up manner. Algorithm 3 shows a PCA tree (a subspace tree using PCA as a dimension reduction) for a 3-D dataset. The subspace of each leaf node is computed using P CA BASE which can use the exact PCA [3] or a stochastic one [2]. For an internal node, the subspaces of the child nodes, N L .Ω = (µL , U L , ΛL , dL ) and N R .Ω = (µR , U R , ΛR , dR ), are approximately merged using the M ERGE S UBSPACES function which involves solving an (d L + dR + 1) × (dL + dR + 1) eigenvalue problem [8], which runs in O((dL + dR + 1)3 ) << O(D 3 ) given that the dataset is sparse. In addition, each data point x in each node N is mapped to its new lower-dimensional coordinate using the orthogonal basis set of N : xproj = U T (x − µ). The L2 norm reconstruction error is given by: ||xrecon − x||2 = ||(U xproj + µ) − x||2 . 2 2 Monte Carlo sampling using a subspace tree. Consider C AN S UMMARIZE MC function in Algorithm 2. The “outer-loop” over this algorithm is over the query set Q, and it would make sense to project each query point q ∈ Q to the subspace owned by the reference node R. 
Let U and µ be the orthogonal basis system for R consisting of d basis. For each q ∈ Q, consider the squared distance 1 We could also divide β such that the node that may be harder to approximate gets a lower value. 4 Algorithm 2 Monte Carlo sampling based approximation routines. C AN S UMMARIZE MC(Q, R, , α) S AMPLE(q, R, , α, S, m) return τ · minitial ≤ |R| for k = 1 to m do r ← random point in R S UMMARIZE MC(Q, R, , α) S ← S ∪ {Kh (||q − r||)} 2 for qi ∈ Q do µS ← M EAN(S), σS ← VARIANCE(S) S ← ∅, m ← minitial zα/2 σS Φl,new (q, R) ← Φl (q, R) + |R| µS − √ repeat |S| 2 S AMPLE(qi , R, , α, S, m) 2 2 mthresh ← zα/2 σS 2 (Φ(|R|+ |R|) S )2 l (q,R)+|R|µ until m ≤ 0 m ← mthresh − |S| Φ(qi , R) ← Φ(qi , R) + |R| · M EAN(S) ||(q − µ) − rproj ||2 (where (q − µ) is q’s coordinates expressed in terms of the coordinate system of R) as shown in Figure 1. For the Gaussian kernel, each pairwise kernel value is approximated as: e−||q−r|| 2 /(2h2 ) ≈ e−||q−qrecon || 2 (3) /(2h2 ) −||qproj −rproj ||2 /(2h2 ) e where qrecon = U qproj +µ and qproj = U (q−µ). For a fixed query point q, e can be precomputed (which takes d dot products between two D-dimensional vectors) and re-used for every distance computation between q and any reference point r ∈ R whose cost is now O(d) << O(D). Therefore, we can take more samples efficiently. For a total of sufficiently large m samples, the computational cost is O(d(D + m)) << O(D · m) for each query point. −||q−qrecon ||2 /(2h2 ) T Increased variance comes at the cost of inexact distance computations, however. Each distance computation incurs at most squared L2 norm of ||rrecon − r||2 error. That is, 2 ||q − rrecon ||2 − ||q − r||2 ≤ ||rrecon − r||2 . Neverhteless, the sample variance for each query 2 2 2 point plus the inexactness due to dimension reduction τS can be shown to be bounded for the Gaus2 2 sian kernel as: (where each s = e−||q−rrecon || /(2h ) ): 1 m−1 ≤ 1 m−1 s∈S s2 − m · µ 2 S s2 + τS 2 min 1, max e||rrecon −r||2 /h s∈S r∈R 2 2 − m µS min e−||rrecon −r||2 /(2h 2 2 ) r∈R Exhaustive computations using a subspace tree. Now suppose we have built subspace trees for the query and the reference sets. We can project either each query point onto the reference subspace, or each reference point onto the query subspace, depending on which subspace has a smaller dimension and the number of points in each node. The subspaces formed in the leaf nodes usually are highly numerically accurate since it contains very few points compared to the extrinsic dimensionality D. 4 Experimental Results We empirically evaluated the runtime performance of our algorithm on seven real-world datasets, scaled to fit in [0, 1]D hypercube, for approximating the Gaussian sum at every query point with a range of bandwidths. This experiment is motivated by many kernel methods that require computing the Gaussian sum at different bandwidth values (according to the standard least-sqares crossvalidation scores [15]). Nevertheless, we emphasize that the acceleration results are applicable to other kernel methods that require efficient Gaussian summation. In this paper, the reference set equals the query set. All datasets have 50K points so that the exact exhaustive method can be tractably computed. All times are in seconds and include the time needed to build the trees. Codes are in C/C++ and run on a dual Intel Xeon 3GHz with 8 Gb of main memory. 
The measurements in the second to eighth columns are obtained by running the algorithms at the bandwidth kh∗, where 10−3 ≤ k ≤ 103 is the constant in the corresponding column header. The last column (Σ) denotes the total time needed to run on all seven bandwidth values. Each table has results for five algorithms: the naive algorithm and four approximate variants (MCMM and DFGT, each run at two error tolerances). The algorithms with p = 1 denote the previous state-of-the-art (finite-difference with error redistribution) [10], while those with p < 1 denote our probabilistic version. Each entry has the running time and the percentage of the query points that did not satisfy the relative error ε.

Algorithm 3 PCA tree building routine.
BUILDPCATREE(P)
  if CANPARTITION(P) then
    {P^L, P^R} ← PARTITIONSET(P)
    N ← empty node
    N^L ← BUILDPCATREE(P^L)
    N^R ← BUILDPCATREE(P^R)
    N.S ← MERGESUBSPACES(N^L.S, N^R.S)
  else
    N ← BUILDPCATREEBASE(P)
    N.S ← PCABASE(P)
  N.Pproj ← PROJECT(P, N.S)
  return N

Analysis. Readers should focus on the last column, which contains the total time needed for evaluating the Gaussian sum at all points over the seven different bandwidth values; this is indicated by boldfaced numbers for our probabilistic algorithm. As expected, on low-dimensional datasets (below 6 dimensions), the algorithm using series-expansion based bounds gives a speedup of two to three times over our approach that uses Monte Carlo sampling. Multipole moments are an effective form of compression in low dimensions with analytical error bounds that can be evaluated; our Monte Carlo-based method has an asymptotic error bound which must be "learned" through sampling. From 7 dimensions and beyond, series expansion cannot be done efficiently because of its slow convergence. Our probabilistic algorithm (p = 0.9) using Monte Carlo consistently performs better than the algorithm using exact bounds (p = 1) by at least a factor of two. Compared to naive, it achieves a maximum speedup of about nine times on a 16-dimensional dataset; on an 89-dimensional dataset, it is at least three times as fast as the naive method. Note that all the datasets contain only 50K points, and the speedup will be more dramatic as the number of points increases.

5 Conclusion We presented an extension to fast multipole methods to use approximation methods with both hard and probabilistic bounds. Our experimental results show speedup over the previous state-of-the-art on high-dimensional datasets. Our future work will include possible improvements inspired by recent work in the FMM community using a matrix-factorization formulation [13].

Figure 1: Left: A PCA-tree for a 3-D dataset. Right: The squared Euclidean distance between a given query point and a reference point projected onto a subspace can be decomposed into two components: the orthogonal component and the component in the subspace.
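To make the subspace-tree machinery concrete, the following sketch (ours; PCANode, approx_gauss_kernel, and the synthetic data are assumptions rather than the paper's code) builds a single PCA node and applies the projected kernel approximation of Equation 3.

```python
import numpy as np

class PCANode:
    """One node of a (simplified) subspace tree: a PCA basis plus projected points."""
    def __init__(self, points, d):
        self.mu = points.mean(axis=0)
        # Principal directions of the centered points; keep the top-d eigenvectors.
        _, _, Vt = np.linalg.svd(points - self.mu, full_matrices=False)
        self.U = Vt[:d].T                         # D x d orthonormal basis
        self.proj = (points - self.mu) @ self.U   # reference points in subspace coordinates

def approx_gauss_kernel(q, node, j, h):
    """Equation 3: exp(-||q-r||^2/(2h^2)) ~ exp(-||q-q_recon||^2/(2h^2)) * exp(-||q_proj-r_proj||^2/(2h^2))."""
    q_proj = node.U.T @ (q - node.mu)
    q_recon = node.U @ q_proj + node.mu
    ortho = np.exp(-np.sum((q - q_recon) ** 2) / (2 * h * h))            # precomputable once per (q, node)
    in_sub = np.exp(-np.sum((q_proj - node.proj[j]) ** 2) / (2 * h * h))  # O(d) per reference point
    return ortho * in_sub

# Illustration: data lying near a 2-D plane embedded in 10-D.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2)) @ rng.normal(size=(2, 10)) + 0.01 * rng.normal(size=(400, 10))
node = PCANode(X, d=2)
q = X[1] + 0.02 * rng.normal(size=10)
h = 2.0
exact = np.exp(-np.sum((q - X[0]) ** 2) / (2 * h * h))
print(exact, approx_gauss_kernel(q, node, 0, h))
```

Because the reference points lie close to a low-dimensional subspace, the approximate and exact kernel values agree closely while each per-reference distance costs only O(d).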
6 Algorithm \ scale 0.001 0.01 0.1 1 10 100 1000 Σ mockgalaxy-D-1M-rnd (cosmology: positions), D = 3, N = 50000, h ∗ = 0.000768201 Naive 182 182 182 182 182 182 182 1274 MCMM 3 3 5 10 26 48 2 97 ( = 0.1, p = 0.9) 1% 1% 1% 1% 1% 1% 5% DFGT 2 2 2 2 6 19 3 36 ( = 0.1, p = 1) 0% 0% 0% 0% 0% 0% 0% MCMM 3 3 4 11 27 58 21 127 ( = 0.01, p = 0.9) 0 % 0% 1% 1% 1% 1% 7% DFGT 2 2 2 2 7 30 5 50 ( = 0.01, p = 1) 0% 0% 0% 0% 0% 0% 0% Algorithm \ scale 0.001 0.01 0.1 1 10 100 1000 Σ bio5-rnd (biology: drug activity), D = 5, N = 50000, h∗ = 0.000567161 Naive 214 214 214 214 214 214 214 1498 MCMM 4 4 6 144 149 65 1 373 ( = 0.1, p = 0.9) 0% 0% 0% 0% 1% 0% 1% DFGT 4 4 5 24 96 65 2 200 ( = 0.1, p = 1) 0% 0% 0% 0% 0% 0% 0% MCMM 4 4 6 148 165 126 1 454 ( = 0.01, p = 0.9) 0 % 0% 0% 0% 1% 0% 1% DFGT 4 4 5 25 139 126 4 307 ( = 0.01, p = 1) 0% 0% 0% 0% 0% 0% 0% Algorithm \ scale 0.001 0.01 0.1 1 10 100 1000 Σ pall7 − rnd , D = 7, N = 50000, h∗ = 0.00131865 Naive 327 327 327 327 327 327 327 2289 MCMM 3 3 3 3 63 224 <1 300 ( = 0.1, p = 0.9) 0% 0% 0% 1% 1% 12 % 0% DFGT 10 10 11 14 84 263 223 615 ( = 0.1, p = 1) 0% 0% 0% 0% 0% 0% 0% MCMM 3 3 3 3 70 265 5 352 ( = 0.01, p = 0.9) 0 % 0% 0% 1% 2% 1% 8% DFGT 10 10 11 14 85 299 374 803 ( = 0.01, p = 1) 0% 0% 0% 0% 0% 0% 0% Algorithm \ scale 0.001 0.01 0.1 1 10 100 1000 Σ covtype − rnd , D = 10, N = 50000, h∗ = 0.0154758 Naive 380 380 380 380 380 380 380 2660 MCMM 11 11 13 39 318 <1 <1 381 ( = 0.1, p = 0.9) 0% 0% 0% 1% 0% 0% 0% DFGT 26 27 38 177 390 244 <1 903 ( = 0.1, p = 1) 0% 0% 0% 0% 0% 0% 0% MCMM 11 11 13 77 362 2 <1 477 ( = 0.01, p = 0.9) 0 % 0% 0% 1% 1% 10 % 0% DFGT 26 27 38 180 427 416 <1 1115 ( = 0.01, p = 1) 0% 0% 0% 0% 0% 0% 0% Algorithm \ scale 0.001 0.01 0.1 1 10 100 1000 Σ CoocTexture − rnd , D = 16, N = 50000, h∗ = 0.0263958 Naive 472 472 472 472 472 472 472 3304 MCMM 10 11 22 189 109 <1 <1 343 ( = 0.1, p = 0.9) 0% 0% 0% 1% 8% 0% 0% DFGT 22 26 82 240 452 66 <1 889 ( = 0.1, p = 1) 0% 0% 0% 0% 0% 0% 0% MCMM 10 11 22 204 285 <1 <1 534 ( = 0.01, p = 0.9) 0 % 0% 1% 1% 10 % 4% 0% DFGT 22 26 83 254 543 230 <1 1159 ( = 0.01, p = 1) 0% 0% 0% 0% 0% 0% 0% 7 Algorithm \ scale 0.001 0.01 0.1 1 10 100 1000 LayoutHistogram − rnd , D = 32, N = 50000, h∗ = 0.0609892 Naive 757 757 757 757 757 757 757 MCMM 32 32 54 168 583 8 8 ( = 0.1, p = 0.9) 0% 0% 1% 1% 1% 0% 0% DFGT 153 159 221 492 849 212 <1 ( = 0.1, p = 1) 0% 0% 0% 0% 0% 0% 0% MCMM 32 45 60 183 858 8 8 ( = 0.01, p = 0.9) 0 % 0% 1% 6% 1% 0% 0% DFGT 153 159 222 503 888 659 <1 ( = 0.01, p = 1) 0% 0% 0% 0% 0% 0% 0% Algorithm \ scale 0.001 0.01 0.1 1 10 100 1000 CorelCombined − rnd , D = 89, N = 50000, h∗ = 0.0512583 Naive 1716 1716 1716 1716 1716 1716 1716 MCMM 384 418 575 428 1679 17 17 ( = 0.1, p = 0.9) 0% 0% 0% 1% 10 % 0% 0% DFGT 659 677 864 1397 1772 836 17 ( = 0.1, p = 1) 0% 0% 0% 0% 0% 0% 0% MCMM 401 419 575 437 1905 17 17 ( = 0.01, p = 0.9) 0 % 0% 0% 1% 2% 0% 0% DFGT 659 677 865 1425 1794 1649 17 ( = 0.01, p = 1) 0% 0% 0% 0% 0% 0% 0% Σ 5299 885 2087 1246 2585 Σ 12012 3518 6205 3771 7086 References [1] Nando de Freitas, Yang Wang, Maryam Mahdaviani, and Dustin Lang. Fast krylov methods for n-body learning. In Y. Weiss, B. Sch¨ lkopf, and J. Platt, editors, Advances in Neural Information Processing o Systems 18, pages 251–258. MIT Press, Cambridge, MA, 2006. [2] P. Drineas, R. Kannan, and M. Mahoney. Fast monte carlo algorithms for matrices iii: Computing a compressed approximate matrix decomposition, 2004. [3] G. Golub. Matrix Computations, Third Edition. The Johns Hopkins University Press, 1996. [4] A. Gray and A. W. 
Moore. N-Body Problems in Statistical Learning. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, editors, Advances in Neural Information Processing Systems 13 (December 2000). MIT Press, 2001.
[5] Alexander G. Gray and Andrew W. Moore. Nonparametric Density Estimation: Toward Computational Tractability. In SIAM International Conference on Data Mining 2003, 2003.
[6] Alexander G. Gray and Andrew W. Moore. Very Fast Multivariate Kernel Density Estimation via Computational Geometry. In Joint Statistical Meeting 2003, 2003. To be submitted to JASA.
[7] L. Greengard and J. Strain. The Fast Gauss Transform. SIAM Journal of Scientific and Statistical Computing, 12(1):79–94, 1991.
[8] Peter Hall, David Marshall, and Ralph Martin. Merging and splitting eigenspace models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(9):1042–1049, 2000.
[9] Michael Holmes, Alexander Gray, and Charles Isbell. Ultrafast Monte Carlo for statistical summations. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 673–680. MIT Press, Cambridge, MA, 2008.
[10] Dongryeol Lee and Alexander Gray. Faster Gaussian summation: Theory and experiment. In Proceedings of the Twenty-second Conference on Uncertainty in Artificial Intelligence, 2006.
[11] Dongryeol Lee, Alexander Gray, and Andrew Moore. Dual-tree fast Gauss transforms. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 747–754. MIT Press, Cambridge, MA, 2006.
[12] Ting Liu, Andrew W. Moore, and Alexander Gray. Efficient exact k-NN and nonparametric classification in high dimensions. In Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.
[13] P. G. Martinsson and Vladimir Rokhlin. An accelerated kernel-independent fast multipole method in one dimension. SIAM J. Scientific Computing, 29(3):1160–1178, 2007.
[14] A. W. Moore, J. Schneider, and K. Deng. Efficient locally weighted polynomial regression predictions. In D. Fisher, editor, Proceedings of the Fourteenth International Conference on Machine Learning, pages 196–204, San Francisco, 1997. Morgan Kaufmann.
[15] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall/CRC, 1986.
3 0.60132718 188 nips-2008-QUIC-SVD: Fast SVD Using Cosine Trees
Author: Michael P. Holmes, Jr. Isbell, Charles Lee, Alexander G. Gray
Abstract: The Singular Value Decomposition is a key operation in many machine learning methods. Its computational cost, however, makes it unscalable and impractical for applications involving large datasets or real-time responsiveness, which are becoming increasingly common. We present a new method, QUIC-SVD, for fast approximation of the whole-matrix SVD based on a new sampling mechanism called the cosine tree. Our empirical tests show speedups of several orders of magnitude over exact SVD. Such scalability should enable QUIC-SVD to accelerate and enable a wide array of SVD-based methods and applications. 1
4 0.58405602 55 nips-2008-Cyclizing Clusters via Zeta Function of a Graph
Author: Deli Zhao, Xiaoou Tang
Abstract: Detecting underlying clusters from large-scale data plays a central role in machine learning research. In this paper, we tackle the problem of clustering complex data of multiple distributions and multiple scales. To this end, we develop an algorithm named Zeta l-links (Zell) which consists of two parts: Zeta merging with a similarity graph and an initial set of small clusters derived from local l-links of samples. More specifically, we propose to structurize a cluster using cycles in the associated subgraph. A new mathematical tool, Zeta function of a graph, is introduced for the integration of all cycles, leading to a structural descriptor of a cluster in determinantal form. The popularity character of a cluster is conceptualized as the global fusion of variations of such a structural descriptor by means of the leave-one-out strategy in the cluster. Zeta merging proceeds, in the hierarchical agglomerative fashion, according to the maximum incremental popularity among all pairwise clusters. Experiments on toy data clustering, imagery pattern clustering, and image segmentation show the competitive performance of Zell. The 98.1% accuracy, in the sense of the normalized mutual information (NMI), is obtained on the FRGC face data of 16028 samples and 466 facial clusters. 1
5 0.53865397 117 nips-2008-Learning Taxonomies by Dependence Maximization
Author: Matthew Blaschko, Arthur Gretton
Abstract: We introduce a family of unsupervised algorithms, numerical taxonomy clustering, to simultaneously cluster data, and to learn a taxonomy that encodes the relationship between the clusters. The algorithms work by maximizing the dependence between the taxonomy and the original data. The resulting taxonomy is a more informative visualization of complex data than simple clustering; in addition, taking into account the relations between different clusters is shown to substantially improve the quality of the clustering, when compared with state-ofthe-art algorithms in the literature (both spectral clustering and a previous dependence maximization approach). We demonstrate our algorithm on image and text data. 1
6 0.524225 53 nips-2008-Counting Solution Clusters in Graph Coloring Problems Using Belief Propagation
7 0.50942975 165 nips-2008-On the Reliability of Clustering Stability in the Large Sample Regime
8 0.50824583 132 nips-2008-Measures of Clustering Quality: A Working Set of Axioms for Clustering
9 0.50155145 48 nips-2008-Clustering via LP-based Stabilities
10 0.48012379 74 nips-2008-Estimating the Location and Orientation of Complex, Correlated Neural Activity using MEG
11 0.45498076 47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation
12 0.44895765 4 nips-2008-A Scalable Hierarchical Distributed Language Model
13 0.43636236 68 nips-2008-Efficient Direct Density Ratio Estimation for Non-stationarity Adaptation and Outlier Detection
14 0.42316115 125 nips-2008-Local Gaussian Process Regression for Real Time Online Model Learning
15 0.40309069 19 nips-2008-An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis
16 0.40046763 249 nips-2008-Variational Mixture of Gaussian Process Experts
17 0.39629513 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations
18 0.39384744 184 nips-2008-Predictive Indexing for Fast Search
19 0.38329262 218 nips-2008-Spectral Clustering with Perturbed Data
20 0.38197675 70 nips-2008-Efficient Inference in Phylogenetic InDel Trees
topicId topicWeight
[(4, 0.343), (5, 0.013), (6, 0.068), (7, 0.061), (12, 0.046), (15, 0.012), (28, 0.173), (57, 0.043), (59, 0.01), (63, 0.034), (71, 0.021), (77, 0.036), (83, 0.056)]
simIndex simValue paperId paperTitle
1 0.92867672 58 nips-2008-Dependence of Orientation Tuning on Recurrent Excitation and Inhibition in a Network Model of V1
Author: Klaus Wimmer, Marcel Stimberg, Robert Martin, Lars Schwabe, Jorge Mariño, James Schummers, David C. Lyon, Mriganka Sur, Klaus Obermayer
Abstract: The computational role of the local recurrent network in primary visual cortex is still a matter of debate. To address this issue, we analyze intracellular recording data of cat V1, which combine measuring the tuning of a range of neuronal properties with a precise localization of the recording sites in the orientation preference map. For the analysis, we consider a network model of Hodgkin-Huxley type neurons arranged according to a biologically plausible two-dimensional topographic orientation preference map. We then systematically vary the strength of the recurrent excitation and inhibition relative to the strength of the afferent input. Each parametrization gives rise to a different model instance for which the tuning of model neurons at different locations of the orientation map is compared to the experimentally measured orientation tuning of membrane potential, spike output, excitatory, and inhibitory conductances. A quantitative analysis shows that the data provides strong evidence for a network model in which the afferent input is dominated by strong, balanced contributions of recurrent excitation and inhibition. This recurrent regime is close to a regime of “instability”, where strong, self-sustained activity of the network occurs. The firing rate of neurons in the best-fitting network is particularly sensitive to small modulations of model parameters, which could be one of the functional benefits of a network operating in this particular regime. 1
2 0.92670721 39 nips-2008-Bounding Performance Loss in Approximate MDP Homomorphisms
Author: Jonathan Taylor, Doina Precup, Prakash Panagaden
Abstract: We define a metric for measuring behavior similarity between states in a Markov decision process (MDP), which takes action similarity into account. We show that the kernel of our metric corresponds exactly to the classes of states defined by MDP homomorphisms (Ravindran & Barto, 2003). We prove that the difference in the optimal value function of different states can be upper-bounded by the value of this metric, and that the bound is tighter than previous bounds provided by bisimulation metrics (Ferns et al. 2004, 2005). Our results hold both for discrete and for continuous actions. We provide an algorithm for constructing approximate homomorphisms, by using this metric to identify states that can be grouped together, as well as actions that can be matched. Previous research on this topic is based mainly on heuristics.
Author: Richard S. Sutton, Hamid R. Maei, Csaba Szepesvári
Abstract: We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, behavior policy, and target policy, and whose complexity scales linearly in the number of parameters. We consider an i.i.d. policy-evaluation setting in which the data need not come from on-policy experience. The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its L2 norm. We prove that this algorithm is stable and convergent under the usual stochastic approximation conditions to the same least-squares solution as found by the LSTD, but without LSTD’s quadratic computational complexity. GTD is online and incremental, and does not involve multiplying by products of likelihood ratios as in importance-sampling methods. 1 Off-policy learning methods Off-policy methods have an important role to play in the larger ambitions of modern reinforcement learning. In general, updates to a statistic of a dynamical process are said to be “off-policy” if their distribution does not match the dynamics of the process, particularly if the mismatch is due to the way actions are chosen. The prototypical example in reinforcement learning is the learning of the value function for one policy, the target policy, using data obtained while following another policy, the behavior policy. For example, the popular Q-learning algorithm (Watkins 1989) is an offpolicy temporal-difference algorithm in which the target policy is greedy with respect to estimated action values, and the behavior policy is something more exploratory, such as a corresponding greedy policy. Off-policy methods are also critical to reinforcement-learning-based efforts to model human-level world knowledge and state representations as predictions of option outcomes (e.g., Sutton, Precup & Singh 1999; Sutton, Rafols & Koop 2006). Unfortunately, off-policy methods such as Q-learning are not sound when used with approximations that are linear in the learned parameters—the most popular form of function approximation in reinforcement learning. Counterexamples have been known for many years (e.g., Baird 1995) in which Q-learning’s parameters diverge to infinity for any positive step size. This is a severe problem in so far as function approximation is widely viewed as necessary for large-scale applications of reinforcement learning. The need is so great that practitioners have often simply ignored the problem and continued to use Q-learning with linear function approximation anyway. Although no instances ∗ Csaba Szepesv´ ri is on leave from MTA SZTAKI. a 1 of absolute divergence in applications have been reported in the literature, the potential for instability is disturbing and probably belies real but less obvious problems. The stability problem is not specific to reinforcement learning. Classical dynamic programming methods such as value and policy iteration are also off-policy methods and also diverge on some problems when used with linear function approximation. Reinforcement learning methods are actually an improvement over conventional dynamic programming methods in that at least they can be used stably with linear function approximation in their on-policy form. 
The stability problem is also not due to the interaction of control and prediction, or to stochastic approximation effects; the simplest counterexamples are for deterministic, expected-value-style, synchronous policy evaluation (see Baird 1995; Sutton & Barto 1998). Prior to the current work, the possibility of instability could not be avoided whenever four individually desirable algorithmic features were combined: 1) off-policy updates, 2) temporal-difference learning, 3) linear function approximation, and 4) linear complexity in memory and per-time-step computation. If any one of these four is abandoned, then stable methods can be obtained relatively easily. But each feature brings value and practitioners are loath to give any of them up, as we discuss later in a penultimate related-work section. In this paper we present the first algorithm to achieve all four desirable features and be stable and convergent for all finite Markov decision processes, all target and behavior policies, and all feature representations for the linear approximator. Moreover, our algorithm does not use importance sampling and can be expected to be much better conditioned and of lower variance than importance sampling methods. Our algorithm can be viewed as performing stochastic gradient-descent in a novel objective function whose optimum is the least-squares TD solution. Our algorithm is also incremental and suitable for online use just as are simple temporaldifference learning algorithms such as Q-learning and TD(λ) (Sutton 1988). Our algorithm can be broadly characterized as a gradient-descent version of TD(0), and accordingly we call it GTD(0). 2 Sub-sampling and i.i.d. formulations of temporal-difference learning In this section we formulate the off-policy policy-evaluation problem for one-step temporaldifference learning such that the data consists of independent, identically-distributed (i.i.d.) samples. We start by considering the standard reinforcement learning framework, in which a learning agent interacts with an environment consisting of a finite Markov decision process (MDP). At each of a sequence of discrete time steps, t = 1, 2, . . ., the environment is in a state st ∈ S, the agent chooses an action at ∈ A, and then the environment emits a reward rt ∈ R, and transitions to its next state st+1 ∈ S. The state and action sets are finite. State transitions are stochastic and dependent on the immediately preceding state and action. Rewards are stochastic and dependent on the preceding state and action, and on the next state. The agent process generating the actions is termed the behavior policy. To start, we assume a deterministic target policy π : S → A. The objective is to learn an approximation to its state-value function: ∞ V π (s) = Eπ γ t−1 rt |s1 = s , (1) t=1 where γ ∈ [0, 1) is the discount rate. The learning is to be done without knowledge of the process dynamics and from observations of a single continuous trajectory with no resets. In many problems of interest the state set is too large for it to be practical to approximate the value of each state individually. Here we consider linear function approximation, in which states are mapped to feature vectors with fewer components than the number of states. That is, for each state s ∈ S there is a corresponding feature vector φ(s) ∈ Rn , with n |S|. The approximation to the value function is then required to be linear in the feature vectors and a corresponding parameter vector θ ∈ Rn : V π (s) ≈ θ φ(s). 
(2) Further, we assume that the states st are not visible to the learning agent in any way other than through the feature vectors. Thus this function approximation formulation can include partialobservability formulations such as POMDPs as a special case. The environment and the behavior policy together generate a stream of states, actions and rewards, s1 , a1 , r1 , s2 , a2 , r2 , . . ., which we can break into causally related 4-tuples, (s1 , a1 , r1 , s1 ), 2 (s2 , a2 , r2 , s2 ), . . . , where st = st+1 . For some tuples, the action will match what the target policy would do in that state, and for others it will not. We can discard all of the latter as not relevant to the target policy. For the former, we can discard the action because it can be determined from the state via the target policy. With a slight abuse of notation, let sk denote the kth state in which an on-policy action was taken, and let rk and sk denote the associated reward and next state. The kth on-policy transition, denoted (sk , rk , sk ), is a triple consisting of the starting state of the transition, the reward on the transition, and the ending state of the transition. The corresponding data available to the learning algorithm is the triple (φ(sk ), rk , φ(sk )). The MDP under the behavior policy is assumed to be ergodic, so that it determines a stationary state-occupancy distribution µ(s) = limk→∞ P r{sk = s}. For any state s, the MDP and target policy together determine an N × N state-transition-probability matrix P , where pss = P r{sk = s |sk = s}, and an N × 1 expected-reward vector R, where Rs = E[rk |sk = s]. These two together completely characterize the statistics of on-policy transitions, and all the samples in the sequence of (φ(sk ), rk , φ(sk )) respect these statistics. The problem still has a Markov structure in that there are temporal dependencies between the sample transitions. In our analysis we first consider a formulation without such dependencies, the i.i.d. case, and then prove that our results extend to the original case. In the i.i.d. formulation, the states sk are generated independently and identically distributed according to an arbitrary probability distribution µ. From each sk , a corresponding sk is generated according to the on-policy state-transition matrix, P , and a corresponding rk is generated according to an arbitrary bounded distribution with expected value Rsk . The final i.i.d. data sequence, from which an approximate value function is to be learned, is then the sequence (φ(sk ), rk , φ(sk )), for k = 1, 2, . . . Further, because each sample is i.i.d., we can remove the indices and talk about a single tuple of random variables (φ, r, φ ) drawn from µ. It remains to define the objective of learning. The TD error for the linear setting is δ = r + γθ φ − θ φ. (3) Given this, we define the one-step linear TD solution as any value of θ at which 0 = E[δφ] = −Aθ + b, (4) where A = E φ(φ − γφ ) and b = E[rφ]. This is the parameter value to which the linear TD(0) algorithm (Sutton 1988) converges under on-policy training, as well as the value found by LSTD(0) (Bradtke & Barto 1996) under both on-policy and off-policy training. The TD solution is always a fixed-point of the linear TD(0) algorithm, but under off-policy training it may not be stable; if θ does not exactly satisfy (4), then the TD(0) algorithm may cause it to move away in expected value and eventually diverge to infinity. 
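To make the quantities in Equations (3) and (4) concrete, here is a small sketch (an illustration under the i.i.d. sampling assumption, not code from the paper) that estimates A and b from transitions, solves for the TD solution, and evaluates the expected TD(0) update direction E[δφ] = b − Aθ.

```python
import numpy as np

def td_solution_from_samples(phi, r, phi_next, gamma):
    """Estimate A = E[phi (phi - gamma*phi')^T] and b = E[r phi] from i.i.d. transitions
    and return the one-step linear TD solution theta solving A theta = b (Equation 4)."""
    A = np.mean([np.outer(p, p - gamma * pn) for p, pn in zip(phi, phi_next)], axis=0)
    b = np.mean(r[:, None] * phi, axis=0)
    return np.linalg.solve(A, b)   # assumes A is non-singular, as in the text

def expected_td_update(theta, phi, r, phi_next, gamma):
    """E[delta * phi] = b - A theta; the TD solution is exactly where this vector is zero."""
    delta = r + gamma * (phi_next @ theta) - (phi @ theta)   # TD errors, Equation 3
    return (delta[:, None] * phi).mean(axis=0)
```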
3 The GTD(0) algorithm We next present the idea and gradient-descent derivation leading to the GTD(0) algorithm. As discussed above, the vector E[δφ] can be viewed as an error in the current solution θ. The vector should be zero, so its norm is a measure of how far we are away from the TD solution. A distinctive feature of our gradient-descent analysis of temporal-difference learning is that we use as our objective function the L2 norm of this vector: J(θ) = E[δφ] E[δφ] . (5) This objective function is quadratic and unimodal; it’s minimum value of 0 is achieved when E[δφ] = 0, which can always be achieved. The gradient of this objective function is θ J(θ) = 2( = 2E φ( θ E[δφ])E[δφ] θ δ) E[δφ] = −2E φ(φ − γφ ) E[δφ] . (6) This last equation is key to our analysis. We would like to take a stochastic gradient-descent approach, in which a small change is made on each sample in such a way that the expected update 3 is the direction opposite to the gradient. This is straightforward if the gradient can be written as a single expected value, but here we have a product of two expected values. One cannot sample both of them because the sample product will be biased by their correlation. However, one could store a long-term, quasi-stationary estimate of either of the expectations and then sample the other. The question is, which expectation should be estimated and stored, and which should be sampled? Both ways seem to lead to interesting learning algorithms. First let us consider the algorithm obtained by forming and storing a separate estimate of the first expectation, that is, of the matrix A = E φ(φ − γφ ) . This matrix is straightforward to estimate from experience as a simple arithmetic average of all previously observed sample outer products φ(φ − γφ ) . Note that A is a stationary statistic in any fixed-policy policy-evaluation problem; it does not depend on θ and would not need to be re-estimated if θ were to change. Let Ak be the estimate of A after observing the first k samples, (φ1 , r1 , φ1 ), . . . , (φk , rk , φk ). Then this algorithm is defined by k 1 Ak = φi (φi − γφi ) (7) k i=1 along with the gradient descent rule: θk+1 = θk + αk Ak δk φk , k ≥ 1, (8) where θ1 is arbitrary, δk = rk + γθk φk − θk φk , and αk > 0 is a series of step-size parameters, possibly decreasing over time. We call this algorithm A TD(0) because it is essentially conventional TD(0) prefixed by an estimate of the matrix A . Although we find this algorithm interesting, we do not consider it further here because it requires O(n2 ) memory and computation per time step. The second path to a stochastic-approximation algorithm for estimating the gradient (6) is to form and store an estimate of the second expectation, the vector E[δφ], and to sample the first expectation, E φ(φ − γφ ) . Let uk denote the estimate of E[δφ] after observing the first k − 1 samples, with u1 = 0. The GTD(0) algorithm is defined by uk+1 = uk + βk (δk φk − uk ) (9) and θk+1 = θk + αk (φk − γφk )φk uk , (10) where θ1 is arbitrary, δk is as in (3) using θk , and αk > 0 and βk > 0 are step-size parameters, possibly decreasing over time. Notice that if the product is formed right-to-left, then the entire computation is O(n) per time step. 4 Convergence The purpose of this section is to establish that GTD(0) converges with probability one to the TD solution in the i.i.d. problem formulation under standard assumptions. In particular, we have the following result: Theorem 4.1 (Convergence of GTD(0)). 
Consider the GTD(0) iteration (9,10) with step-size se∞ ∞ 2 quences αk and βk satisfying βk = ηαk , η > 0, αk , βk ∈ (0, 1], k=0 αk = ∞, k=0 αk < ∞. Further assume that (φk , rk , φk ) is an i.i.d. sequence with uniformly bounded second moments. Let A = E φk (φk − γφk ) and b = E[rk φk ] (note that A and b are well-defined because the distribution of (φk , rk , φk ) does not depend on the sequence index k). Assume that A is non-singular. Then the parameter vector θk converges with probability one to the TD solution (4). Proof. We use the ordinary-differential-equation (ODE) approach (Borkar & Meyn 2000). First, we rewrite the algorithm’s two iterations as a single iteration in a combined parameter vector with √ 2n components ρk = (vk , θk ), where vk = uk / η, and a new reward-related vector with 2n components gk+1 = (rk φk , 0 ): √ ρk+1 = ρk + αk η (Gk+1 ρk + gk+1 ) , where Gk+1 = √ − ηI (φk − γφk )φk 4 φk (γφk − φk ) 0 . Let G = E[Gk ] and g = E[gk ]. Note that G and g are well-defined as by the assumption the process {φk , rk , φk }k is i.i.d. In particular, √ − η I −A b G= , g= . 0 A 0 Further, note that (4) follows from Gρ + g = 0, (11) where ρ = (v , θ ). Now we apply Theorem 2.2 of Borkar & Meyn (2000). For this purpose we write ρk+1 = ρk + √ √ αk η(Gρk +g+(Gk+1 −G)ρk +(gk+1 −g)) = ρk +αk (h(ρk )+Mk+1 ), where αk = αk η, h(ρ) = g + Gρ and Mk+1 = (Gk+1 − G)ρk + gk+1 − g. Let Fk = σ(ρ1 , M1 , . . . , ρk−1 , Mk ). Theorem 2.2 requires the verification of the following conditions: (i) The function h is Lipschitz and h∞ (ρ) = limr→∞ h(rρ)/r is well-defined for every ρ ∈ R2n ; (ii-a) The sequence (Mk , Fk ) is a martingale difference sequence, and (ii-b) for some C0 > 0, E Mk+1 2 | Fk ≤ C0 (1 + ρk 2 ) holds for ∞ any initial parameter vector ρ1 ; (iii) The sequence αk satisfies 0 < αk ≤ 1, k=1 αk = ∞, ∞ 2 ˙ k=1 (αk ) < +∞; and (iv) The ODE ρ = h(ρ) has a globally asymptotically stable equilibrium. Clearly, h(ρ) is Lipschitz with coefficient G and h∞ (ρ) = Gρ. By construction, (Mk , Fk ) satisfies E[Mk+1 |Fk ] = 0 and Mk ∈ Fk , i.e., it is a martingale difference sequence. Condition (ii-b) can be shown to hold by a simple application of the triangle inequality and the boundedness of the the second moments of (φk , rk , φk ). Condition (iii) is satisfied by our conditions on the step-size sequences αk , βk . Finally, the last condition (iv) will follow from the elementary theory of linear differential equations if we can show that the real parts of all the eigenvalues of G are negative. First, let us show that G is non-singular. Using the determinant rule for partitioned matrices1 we get det(G) = det(A A) = 0. This indicates that all the eigenvalues of G are non-zero. Now, let λ ∈ C, λ = 0 be an eigenvalue of G with corresponding normalized eigenvector x ∈ C2n ; 2 that is, x = x∗ x = 1, where x∗ is the complex conjugate of x. Hence x∗ Gx = λ. Let √ 2 x = (x1 , x2 ), where x1 , x2 ∈ Cn . Using the definition of G, λ = x∗ Gx = − η x1 + x∗ Ax2 − x∗ A x1 . Because A is real, A∗ = A , and it follows that (x∗ Ax2 )∗ = x∗ A x1 . Thus, 1 2 1 2 √ 2 Re(λ) = Re(x∗ Gx) = − η x1 ≤ 0. We are now done if we show that x1 cannot be zero. If x1 = 0, then from λ = x∗ Gx we get that λ = 0, which contradicts with λ = 0. The next result concerns the convergence of GTD(0) when (φk , rk , φk ) is obtained by the off-policy sub-sampling process described originally in Section 2. 
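As a concrete companion to this analysis, here is a minimal sketch of the GTD(0) iteration of Equations (9) and (10); it is our illustration, with constant step sizes and a generic stream of (φ, r, φ′) transitions assumed rather than anything prescribed by the paper.

```python
import numpy as np

def gtd0(transitions, n, alpha=0.05, eta=0.1, gamma=0.99):
    """GTD(0) on a stream of (phi, r, phi_next) transitions.

    theta is the value-function parameter vector and u the auxiliary estimate of
    E[delta*phi]; the step sizes alpha and beta = eta*alpha are constant here for simplicity.
    """
    theta = np.zeros(n)
    u = np.zeros(n)
    beta = eta * alpha
    for phi, r, phi_next in transitions:
        delta = r + gamma * phi_next @ theta - phi @ theta      # TD error
        u += beta * (delta * phi - u)                           # Equation 9
        theta += alpha * (phi - gamma * phi_next) * (phi @ u)   # Equation 10
    return theta
```

Forming the scalar product of φ and u first keeps each update O(n), as noted above.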
We make the following assumption: Assumption A1 The behavior policy πb (generator of the actions at ) selects all actions of the target policy π with positive probability in every state, and the target policy is deterministic. This assumption is needed to ensure that the sub-sampled process sk is well-defined and that the obtained sample is of “high quality”. Under this assumption it holds that sk is again a Markov chain by the strong Markov property of Markov processes (as the times selected when actions correspond to those of the behavior policy form Markov times with respect to the filtration defined by the original process st ). The following theorem shows that the conclusion of the previous result continues to hold in this case: Theorem 4.2 (Convergence of GTD(0) with a sub-sampled process.). Assume A1. Let the parameters θk , uk be updated by (9,10). Further assume that (φk , rk , φk ) is such that E φk 2 |sk−1 , 2 E rk |sk−1 , E φk 2 |sk−1 are uniformly bounded. Assume that the Markov chain (sk ) is aperiodic and irreducible, so that limk→∞ P(sk = s |s0 = s) = µ(s ) exists and is unique. Let s be a state randomly drawn from µ, and let s be a state obtained by following π for one time step in the MDP from s. Further, let r(s, s ) be the reward incurred. Let A = E φ(s)(φ(s) − γφ(s )) and b = E[r(s, s )φ(s)]. Assume that A is non-singular. Then the parameter vector θk converges with probability one to the TD solution (4), provided that s1 ∼ µ. Proof. The proof of Theorem 4.1 goes through without any changes once we observe that G = E[Gk+1 |Fk ] and g = E[gk+1 | Fk ]. 1 R According to this rule, if A ∈ Rn×n , B ∈ Rn×m , C ∈ Rm×n , D ∈ Rm×m then for F = [A B; C D] ∈ , det(F ) = det(A) det(D − CA−1 B). (n+m)×(n+m) 5 The condition that (sk ) is aperiodic and irreducible guarantees the existence of the steady state distribution µ. Further, the aperiodicity and irreducibility of (sk ) follows from the same property of the original process (st ). For further discussion of these conditions cf. Section 6.3 of Bertsekas and Tsitsiklis (1996). With considerable more work the result can be extended to the case when s1 follows an arbitrary distribution. This requires an extension of Theorem 2.2 of Borkar and Meyn (2000) to processes of the form ρk+1 + ρk (h(ρk ) + Mk+1 + ek+1 ), where ek+1 is a fast decaying perturbation (see, e.g., the proof of Proposition 4.8 of Bertsekas and Tsitsiklis (1996)). 5 Extensions to action values, stochastic target policies, and other sample weightings The GTD algorithm extends immediately to the case of off-policy learning of action-value functions. For this assume that a behavior policy πb is followed that samples all actions in every state with positive probability. Let the target policy to be evaluated be π. In this case the basis functions are dependent on both the states and actions: φ : S × A → Rn . The learning equations are unchanged, except that φt and φt are redefined as follows: φt = φ(st , at ), (12) φt = (13) π(st+1 , a)φ(st+1 , a). a (We use time indices t denoting physical time.) Here π(s, a) is the probability of selecting action a in state s under the target policy π. Let us call the resulting algorithm “one-step gradient-based Q-evaluation,” or GQE(0). Theorem 5.1 (Convergence of GQE(0)). Assume that st is a state sequence generated by following some stationary policy πb in a finite MDP. 
Let rt be the corresponding sequence of rewards and let φt , φt be given by the respective equations (12) and (13), and assume that E φt 2 |st−1 , 2 E rt |st−1 , E φt 2 |st−1 are uniformly bounded. Let the parameters θt , ut be updated by Equations (9) and (10). Assume that the Markov chain (st ) is aperiodic and irreducible, so that limt→∞ P(st = s |s0 = s) = µ(s ) exists and is unique. Let s be a state randomly drawn from µ, a be an action chosen by πb in s, let s be the next state obtained and let a = π(s ) be the action chosen by the target policy in state s . Further, let r(s, a, s ) be the reward incurred in this transition. Let A = E φ(s, a)(φ(s, a) − γφ(s , a )) and b = E[r(s, a, s )φ(s, a)]. Assume that A is non-singular. Then the parameter vector θt converges with probability one to a TD solution (4), provided that s1 is selected from the steady-state distribution µ. The proof is almost identical to that of Theorem 4.2, and hence it is omitted. Our main convergence results are also readily generalized to stochastic target policies by replacing the sub-sampling process described in Section 2 with a sample-weighting process. That is, instead of including or excluding transitions depending upon whether the action taken matches a deterministic policy, we include all transitions but give each a weight. For example, we might let the weight wt for time step t be equal to the probability π(st , at ) of taking the action actually taken under the target policy. We can consider the i.i.d. samples now to have four components (φk , rk , φk , wk ), with the update rules (9) and (10) replaced by uk+1 = uk + βk (δk φk − uk )wk , (14) θk+1 = θk + αk (φk − γφk )φk uk wk . (15) and Each sample is also weighted by wk in the expected values, such as that defining the TD solution (4). With these changes, Theorems 4.1 and 4.2 go through immediately for stochastic policies. The reweighting is, in effect, an adjustment to the i.i.d. sampling distribution, µ, and thus our results hold because they hold for all µ. The choice wt = π(st , at ) is only one possibility, notable for its equivalence to our original case if the target policy is deterministic. Another natural weighting is wt = π(st , at )/πb (st , at ), where πb is the behavior policy. This weighting may result in the TD solution (4) better matching the target policy’s value function (1). 6 6 Related work There have been several prior attempts to attain the four desirable algorithmic features mentioned at the beginning this paper (off-policy stability, temporal-difference learning, linear function approximation, and O(n) complexity) but none has been completely successful. One idea for retaining all four desirable features is to use importance sampling techniques to reweight off-policy updates so that they are in the same direction as on-policy updates in expected value (Precup, Sutton & Dasgupta 2001; Precup, Sutton & Singh 2000). Convergence can sometimes then be assured by existing results on the convergence of on-policy methods (Tsitsiklis & Van Roy 1997; Tadic 2001). However, the importance sampling weights are cumulative products of (possibly many) target-to-behavior-policy likelihood ratios, and consequently they and the corresponding updates may be of very high variance. The use of “recognizers” to construct the target policy directly from the behavior policy (Precup, Sutton, Paduraru, Koop & Singh 2006) is one strategy for limiting the variance; another is careful choice of the target policies (see Precup, Sutton & Dasgupta 2001). 
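Returning to the extensions of Section 5, the following short sketch (ours; the policy, feature map, and weighting functions are placeholders) illustrates the expected next-feature vector of Equation (13) and one sample-weighted GTD(0) step following Equations (14) and (15).

```python
import numpy as np

def expected_next_feature(s_next, actions, pi, feature):
    """Equation 13: phi'_t = sum_a pi(s', a) * phi(s', a) under the target policy pi.
    pi and feature are placeholder callables supplied by the user."""
    return sum(pi(s_next, a) * feature(s_next, a) for a in actions)

def weighted_gtd0_step(theta, u, phi, r, phi_next, w, alpha, beta, gamma):
    """One sample-weighted GTD(0) step (Equations 14 and 15),
    e.g. with weight w = pi(s, a) or w = pi(s, a) / pi_b(s, a)."""
    delta = r + gamma * phi_next @ theta - phi @ theta
    u = u + beta * (delta * phi - u) * w
    theta = theta + alpha * (phi - gamma * phi_next) * (phi @ u) * w
    return theta, u
```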
However, it remains the case that for all such methods to date there are always choices of problem, behavior policy, and target policy for which the variance is infinite, and thus for which there is no guarantee of convergence.

Residual gradient algorithms (Baird 1995) have also been proposed as a way of obtaining all four desirable features. These methods can be viewed as gradient descent in the expected squared TD error, E[δ²]; thus they converge stably to the solution that minimizes this objective for arbitrary differentiable function approximators. However, this solution has always been found to be much inferior to the TD solution (exemplified by (4) for the one-step linear case). In the literature (Baird 1995; Sutton & Barto 1998), it is often claimed that residual-gradient methods are guaranteed to find the TD solution in two special cases: 1) systems with deterministic transitions and 2) systems in which two samples can be drawn for each next state (e.g., for which a simulation model is available). Our own analysis indicates that even these two special requirements are insufficient to guarantee convergence to the TD solution.²

Gordon (1995) and others have questioned the need for linear function approximation. He has proposed replacing linear function approximation with a more restricted class of approximators, known as averagers, that never extrapolate outside the range of the observed data and thus cannot diverge. Rightly or wrongly, averagers have been seen as being too constraining and have not been used on large applications involving online learning. Linear methods, on the other hand, have been widely used (e.g., Baxter, Tridgell & Weaver 1998; Sturtevant & White 2006; Schaeffer, Hlynka & Jussila 2001).

The need for linear complexity has also been questioned. Second-order methods for linear approximators, such as LSTD (Bradtke & Barto 1996; Boyan 2002) and LSPI (Lagoudakis & Parr 2003; see also Peters, Vijayakumar & Schaal 2005), can be effective on moderately sized problems. If the number of features in the linear approximator is n, then these methods require memory and per-time-step computation that is O(n²). Newer incremental methods such as iLSTD (Geramifard, Bowling & Sutton 2006) have reduced the per-time-step complexity to O(n), but are still O(n²) in memory. Sparsification methods may reduce the complexity further, but they do not help in the general case, and they may apply to O(n) methods as well to further reduce their complexity. Linear function approximation is most powerful when very large numbers of features are used, perhaps millions of features (e.g., as in Silver, Sutton & Müller 2007). In such cases, O(n²) methods are not feasible.

7 Conclusion

GTD(0) is the first off-policy TD algorithm to converge under general conditions with linear function approximation and linear complexity. As such, it breaks new ground in terms of important, absolute abilities not previously available in existing algorithms.

² For a counterexample, consider that given in Dayan’s (1992) Figure 2, except now consider that state A is actually two states, A and A’, which share the same feature vector. The two states occur with 50-50 probability, and when one occurs the transition is always deterministically to B followed by the outcome 1, whereas when the other occurs the transition is always deterministically to the outcome 0. In this case V(A) and V(B) will converge under the residual-gradient algorithm to the wrong answers, 1/3 and 2/3, even though the system is deterministic, and even if multiple samples are drawn from each state (they will all be the same).
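The claim in footnote 2 can be checked numerically. The sketch below is an illustrative assumption-laden rendering of the counterexample (it assumes γ = 1, an ad hoc step size, and a 50-50 split over the two start states); the shared estimate for A and A' and the estimate for B settle near 1/3 and 2/3 rather than the true values.

```python
import numpy as np

# Footnote-2 counterexample: start states A and A' share one feature, hence one estimate
# v_A; state B has its own estimate v_B. Assumed episode structure (gamma = 1):
#   with prob 1/2:  A  -> B (reward 0), then B -> terminal (reward 1)
#   with prob 1/2:  A' -> terminal (reward 0)
rng = np.random.default_rng(0)
v_A, v_B = 0.0, 0.0
alpha = 0.02                      # ad hoc step size for this sketch

for _ in range(300_000):
    if rng.random() < 0.5:
        delta = v_B - v_A         # A -> B transition
        v_A += alpha * delta      # residual gradient adjusts both ends of each transition
        v_B -= alpha * delta
        delta = 1.0 - v_B         # B -> terminal, reward 1 (terminal value fixed at 0)
        v_B += alpha * delta
    else:
        delta = 0.0 - v_A         # A' -> terminal, reward 0
        v_A += alpha * delta

print(round(v_A, 2), round(v_B, 2))   # close to 1/3 and 2/3, not the true values
```

The expected residual-gradient update is zero exactly at v_A = 1/3, v_B = 2/3 (setting the two expected increments to zero gives v_B = 2 v_A and v_A = 2 v_B − 1), which matches the footnote's claim.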
We have conducted empirical studies with the GTD(0) algorithm and have confirmed that it converges reliably on standard off-policy counterexamples such as Baird’s (1995) “star” problem. On on-policy problems such as the n-state random walk (Sutton 1988; Sutton & Barto 1998), GTD(0) does not seem to learn as efficiently as classic TD(0), although we are still exploring different ways of setting the step-size parameters, and other variations on the algorithm. It is not clear that the GTD(0) algorithm in its current form will be a fully satisfactory solution to the off-policy learning problem, but it is clear that it breaks new ground and achieves important abilities that were previously unattainable.

Acknowledgments

The authors gratefully acknowledge insights and assistance they have received from David Silver, Eric Wiewiora, Mark Ring, Michael Bowling, and Alborz Geramifard. This research was supported by iCORE, NSERC and the Alberta Ingenuity Fund.

References

Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 30–37. Morgan Kaufmann.

Baxter, J., Tridgell, A., Weaver, L. (1998). Experiments in parameter learning using temporal differences. International Computer Chess Association Journal, 21, 84–99.

Bertsekas, D. P., Tsitsiklis, J. (1996). Neuro-Dynamic Programming. Athena Scientific.

Borkar, V. S., Meyn, S. P. (2000). The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447–469.

Boyan, J. (2002). Technical update: Least-squares temporal difference learning. Machine Learning, 49:233–246.

Bradtke, S., Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33–57.

Dayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8:341–362.

Geramifard, A., Bowling, M., Sutton, R. S. (2006). Incremental least-squares temporal difference learning. Proceedings of the National Conference on Artificial Intelligence, pp. 356–361.

Gordon, G. J. (1995). Stable function approximation in dynamic programming. Proceedings of the Twelfth International Conference on Machine Learning, pp. 261–268. Morgan Kaufmann, San Francisco.

Lagoudakis, M., Parr, R. (2003). Least squares policy iteration. Journal of Machine Learning Research, 4:1107–1149.

Peters, J., Vijayakumar, S., Schaal, S. (2005). Natural Actor-Critic. Proceedings of the 16th European Conference on Machine Learning, pp. 280–291.

Precup, D., Sutton, R. S., Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. Proceedings of the 18th International Conference on Machine Learning, pp. 417–424.

Precup, D., Sutton, R. S., Paduraru, C., Koop, A., Singh, S. (2006). Off-policy learning with recognizers. Advances in Neural Information Processing Systems 18.

Precup, D., Sutton, R. S., Singh, S. (2000). Eligibility traces for off-policy policy evaluation. Proceedings of the 17th International Conference on Machine Learning, pp. 759–766. Morgan Kaufmann.

Schaeffer, J., Hlynka, M., Jussila, V. (2001).
Temporal difference learning applied to a high-performance game-playing program. Proceedings of the International Joint Conference on Artificial Intelligence, pp. 529–534.

Silver, D., Sutton, R. S., Müller, M. (2007). Reinforcement learning of local shape in the game of Go. Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp. 1053–1058.

Sturtevant, N. R., White, A. M. (2006). Feature construction for reinforcement learning in hearts. In Proceedings of the 5th International Conference on Computers and Games.

Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3:9–44.

Sutton, R. S., Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.

Sutton, R. S., Precup, D., Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211.

Sutton, R. S., Rafols, E. J., Koop, A. (2006). Temporal abstraction in temporal-difference networks. Advances in Neural Information Processing Systems 18.

Tadic, V. (2001). On the convergence of temporal-difference learning with linear function approximation. Machine Learning, 42:241–267.

Tsitsiklis, J. N., Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42:674–690.

Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. Ph.D. thesis, Cambridge University.
same-paper 4 0.76757401 29 nips-2008-Automatic online tuning for fast Gaussian summation
Author: Vlad I. Morariu, Balaji V. Srinivasan, Vikas C. Raykar, Ramani Duraiswami, Larry S. Davis
Abstract: Many machine learning algorithms require the summation of Gaussian kernel functions, an expensive operation if implemented straightforwardly. Several methods have been proposed to reduce the computational complexity of evaluating such sums, including tree and analysis based methods. These achieve varying speedups depending on the bandwidth, dimension, and prescribed error, making the choice between methods difficult for machine learning tasks. We provide an algorithm that combines tree methods with the Improved Fast Gauss Transform (IFGT). As originally proposed the IFGT suffers from two problems: (1) the Taylor series expansion does not perform well for very low bandwidths, and (2) parameter selection is not trivial and can drastically affect performance and ease of use. We address the first problem by employing a tree data structure, resulting in four evaluation methods whose performance varies based on the distribution of sources and targets and input parameters such as desired accuracy and bandwidth. To solve the second problem, we present an online tuning approach that results in a black box method that automatically chooses the evaluation method and its parameters to yield the best performance for the input data, desired accuracy, and bandwidth. In addition, the new IFGT parameter selection approach allows for tighter error bounds. Our approach chooses the fastest method at negligible additional cost, and has superior performance in comparisons with previous approaches. 1
5 0.75978905 94 nips-2008-Goal-directed decision making in prefrontal cortex: a computational framework
Author: Matthew Botvinick, James An
Abstract: Research in animal learning and behavioral neuroscience has distinguished between two forms of action control: a habit-based form, which relies on stored action values, and a goal-directed form, which forecasts and compares action outcomes based on a model of the environment. While habit-based control has been the subject of extensive computational research, the computational principles underlying goal-directed control in animals have so far received less attention. In the present paper, we advance a computational framework for goal-directed control in animals and humans. We take three empirically motivated points as founding premises: (1) Neurons in dorsolateral prefrontal cortex represent action policies, (2) Neurons in orbitofrontal cortex represent rewards, and (3) Neural computation, across domains, can be appropriately understood as performing structured probabilistic inference. On a purely computational level, the resulting account relates closely to previous work using Bayesian inference to solve Markov decision problems, but extends this work by introducing a new algorithm, which provably converges on optimal plans. On a cognitive and neuroscientific level, the theory provides a unifying framework for several different forms of goal-directed action selection, placing emphasis on a novel form, within which orbitofrontal reward representations directly drive policy selection. 1 G oal- d irect ed act i on cont rol In the study of human and animal behavior, it is a long-standing idea that reward-based decision making may rely on two qualitatively different mechanisms. In habit-based decision making, stimuli elicit reflex-like responses, shaped by past reinforcement [1]. In goal-directed or purposive decision making, on the other hand, actions are selected based on a prospective consideration of possible outcomes and future lines of action [2]. Over the past twenty years or so, the attention of cognitive neuroscientists and computationally minded psychologists has tended to focus on habit-based control, due in large part to interest in potential links between dopaminergic function and temporal-difference algorithms for reinforcement learning. However, a resurgence of interest in purposive action selection is now being driven by innovations in animal behavior research, which have yielded powerful new behavioral assays [3], and revealed specific effects of focal neural damage on goaldirected behavior [4]. In discussing some of the relevant data, Daw, Niv and Dayan [5] recently pointed out the close relationship between purposive decision making, as understood in the behavioral sciences, and model-based methods for the solution of Markov decision problems (MDPs), where action policies are derived from a joint analysis of a transition function (a mapping from states and actions to outcomes) and a reward function (a mapping from states to rewards). Beyond this important insight, little work has yet been done to characterize the computations underlying goal-directed action selection (though see [6, 7]). As discussed below, a great deal of evidence indicates that purposive action selection depends critically on a particular region of the brain, the prefrontal cortex. However, it is currently a critical, and quite open, question what the relevant computations within this part of the brain might be. 
Of course, the basic computational problem of formulating an optimal policy given a model of an MDP has been extensively studied, and there is no shortage of algorithms one might consider as potentially relevant to prefrontal function (e.g., value iteration, policy iteration, backward induction, linear programming, and others). However, from a cognitive and neuroscientific perspective, there is one approach to solving MDPs that it seems particularly appealing to consider. In particular, several researchers have suggested methods for solving MDPs through probabilistic inference [8-12]. The interest of this idea, in the present context, derives from a recent movement toward framing human and animal information processing, as well as the underlying neural computations, in terms of structured probabilistic inference [13, 14]. Given this perspective, it is inviting to consider whether goal-directed action selection, and the neural mechanisms that underlie it, might be understood in those same terms. One challenge in investigating this possibility is that previous research furnishes no ‘off-theshelf’ algorithm for solving MDPs through probabilistic inference that both provably yields optimal policies and aligns with what is known about action selection in the brain. We endeavor here to start filling in that gap. In the following section, we introduce an account of how goal-directed action selection can be performed based on probabilisitic inference, within a network whose components map grossly onto specific brain structures. As part of this account, we introduce a new algorithm for solving MDPs through Bayesian inference, along with a convergence proof. We then present results from a set of simulations illustrating how the framework would account for a variety of behavioral phenomena that are thought to involve purposive action selection. 2 Co m p u t a t i o n a l m o d el As noted earlier, the prefrontal cortex (PFC) is believed to play a pivotal role in purposive behavior. This is indicated by a broad association between prefrontal lesions and impairments in goal-directed action in both humans (see [15]) and animals [4]. Single-unit recording and other data suggest that different sectors of PFC make distinct contributions. In particular, neurons in dorsolateral prefrontal cortex (DLPFC) appear to encode taskspecific mappings from stimuli to responses (e.g., [16]): “task representations,” in the language of psychology, or “policies” in the language of dynamic programming. Although there is some understanding of how policy representations in DLPFC may guide action execution [15], little is yet known about how these representations are themselves selected. Our most basic proposal is that DLPFC policy representations are selected in a prospective, model-based fashion, leveraging information about action-outcome contingencies (i.e., the transition function) and about the incentive value associated with specific outcomes or states (the reward function). There is extensive evidence to suggest that state-reward associations are represented in another area of the PFC, the orbitofrontal cortex (OFC) [17, 18]. As for the transition function, although it is clear that the brain contains detailed representations of action-outcome associations [19], their anatomical localization is not yet entirely clear. 
However, some evidence suggests that the environmental effects of simple actions may be represented in inferior fronto-parietal cortex [20], and there is also evidence suggesting that medial temporal structures may be important in forecasting action outcomes [21]. As detailed in the next section, our model assumes that policy representations in DLPFC, reward representations in OFC, and representations of states and actions in other brain regions, are coordinated within a network structure that represents their causal or statistical interdependencies, and that policy selection occurs, within this network, through a process of probabilistic inference.

2.1 Architecture

The implementation takes the form of a directed graphical model [22], with the layout shown in Figure 1. Each node represents a discrete random variable. State variables (s), representing the set of m possible world states, serve the role played by parietal and medial temporal cortices in representing action outcomes. Action variables (a), representing the set of available actions, play the role of high-level cortical motor areas involved in the programming of action sequences. Policy variables (π), each representing the set of all deterministic policies associated with a specific state, capture the representational role of DLPFC. Local and global utility variables, described further below, capture the role of OFC in representing incentive value. A separate set of nodes is included for each discrete time-step up to the planning horizon.

Fig 1. Left: Single-step decision. Right: Sequential decision. Each time-slice includes a set of m policy nodes.

The conditional probabilities associated with each variable are represented in tabular form. State probabilities are based on the state and action variables in the preceding time-step, and thus encode the transition function. Action probabilities depend on the current state and its associated policy variable. Utilities depend only on the current state. Rather than representing reward magnitude as a continuous variable, we adopt an approach introduced by [23], representing reward through the posterior probability of a binary variable (u). States associated with large positive reward raise p(u) (i.e., p(u = 1|s)) near to one; states associated with large negative rewards reduce p(u) to near zero. In the simulations reported below, we used a simple linear transformation to map from scalar reward values to p(u):

p(u | si) = (1/2) (R(si)/rmax + 1),  rmax = maxj |R(sj)|.  (1)

In situations involving sequential actions, expected returns from different time-steps must be integrated into a global representation of expected value. In order to accomplish this, we employ a technique proposed by [8], introducing a “global” utility variable (uG). Like u, this¹ is a binary random variable, but associated with a posterior probability determined as:

p(uG) = (1/N) Σi p(ui),  (2)

where N is the number of u nodes. The network as a whole embodies a generative model for instrumental action. The basic idea is to use this model as a substrate for probabilistic inference, in order to arrive at optimal policies. There are three general methods for accomplishing this, which correspond to three forms of query. First, a desired outcome state can be identified, by treating one of the state variables (as well as the initial state variable) as observed (see [9] for an application of this approach).
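A small sketch of the reward-to-probability mapping (1) and the global utility node (2) is given below. The reward values are hypothetical, and the absolute value inside rmax is an assumption made so that negative rewards map into [0, 1]; this is an illustration, not the authors' code.

```python
import numpy as np

def reward_to_prob(rewards):
    """Equation (1): map scalar rewards R(s) to p(u = 1 | s) in [0, 1]."""
    r = np.asarray(rewards, dtype=float)
    r_max = np.max(np.abs(r))          # assumed: r_max = max_j |R(s_j)|
    return 0.5 * (r / r_max + 1.0)

def global_utility(p_u):
    """Equation (2): p(u_G = 1) as the average of the per-time-step p(u_i = 1)."""
    return float(np.mean(np.asarray(p_u, dtype=float)))

# Hypothetical rewards for a few states (positive, neutral, and negative).
rewards = [2.0, 1.0, 0.0, -3.0]
p_u = reward_to_prob(rewards)
print(np.round(p_u, 3))        # [0.833 0.667 0.5   0.   ]
print(global_utility(p_u))     # a rescaled global expected value in [0, 1]
```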
Second, the expected return for specific plans can be evaluated and compared by conditioning on specific sets of values over the policy nodes (see [5, 21]). However, our focus here is on a less obvious possibility, which is to condition directly on the utility variable u G , as explained next. 2.2 P o l i c y s e l e c t i o n b y p ro b a b i l i s t i c i n f e re n c e : a n i t e r a t i v e a l g o r i t h m Cooper [23] introduced the idea of inferring optimal decisions in influence diagrams by treating utility nodes into binary random variables and then conditioning on these variables. Although this technique has been adopted in some more recent work [9, 12], we are aware of no application that guarantees optimal decisions, in the expected-reward sense, in multi-step tasks. We introduce here a simple algorithm that does furnish such a guarantee. The procedure is as follows: (1) Initialize the policy nodes with any set of non-deterministic 2 priors. (2) Treating the initial state and u G as observed variables (u G = 1), use standard belief 1 Note that temporal discounting can be incorporated into the framework through minimal modifications to Equation 2. 2 In the single-action situation, where there is only one u node, it is this variable that is treated as observed (u = 1). propagation (or a comparable algorithm) to infer the posterior distributions over all policy nodes. (3) Set the prior distributions over the policy nodes to the values (posteriors) obtained in step 2. (4) Go to step 2. The next two sections present proofs of monotonicity and convergence for this algorithm. 2.2.1 Monotonicity We show first that, at each policy node, the probability associated with the optimal policy will rise on every iteration. Define * as follows: ( * p uG , + ) > p (u + , G ), * (3) where + is the current set of probability distributions at all policy nodes on subsequent time-steps. (Note that we assume here, for simplicity, that there is a unique optimal policy.) The objective is to establish that: p ( t* ) > p ( t* 1 ) (4) where t indexes processing iterations. The dynamics of the network entail that p( ) = p( t t 1 uG ) (5) where represents any value (i.e., policy) of the decision node being considered. Substituting this into (4) gives p t* 1 uG > p ( t* 1 ) (6) ( ) From this point on the focus is on a single iteration, which permits us to omit the relevant subscripts. Applying Bayes’ law to (6) yields p (uG * p (uG ) p( ) > p * )p ( ) ( ) * (7) Canceling, and bringing the denominator up, this becomes p (uG * )> p (uG ) p( ) (8) Rewriting the left hand side, we obtain p ( uG * ) p( ) > p (uG ) p( ) (9) Subtracting and further rearranging: p (uG p (uG * ) p ( uG * * * ) p (uG ) p( ) + p (uG * * ) * p ( uG ) p( ) > 0 p (uG * ) p (uG ) p( ) > 0 (10) ) p( ) > 0 (11) (12) Note that this last inequality (12) follows from the definition of *. Remark: Of course, the identity of * depends on +. In particular, the policy * will only be part of a globally optimal plan if the set of choices + is optimal. Fortunately, this requirement is guaranteed to be met, as long as no upper bound is placed on the number of processing cycles. Recalling that we are considering only finite-horizon problems, note that for policies leading to states with no successors, + is empty. Thus * at the relevant policy nodes is fixed, and is guaranteed to be part of the optimal policy. The proof above shows that * will continuously rise. 
Once it reaches a maximum, * at immediately preceding decisions will perforce fit with the globally optimal policy. The process works backward, in the fashion of backward induction. 2.2.2 Convergence Continuing with the same notation, we show now that pt ( limt uG ) = 1 * (13) Note that, if we apply Bayes’ law recursively, pt ( uG ) = ( ) p ( ) = p (u p uG t ) G pi (uG ) 2 pt pi (uG ) pt 1 ( ( )= ) p uG 1 ( uG ) pt (uG ) pt 3 pt 2 ( ) 1 ( u G ) pt 2 ( u G ) … (14) Thus, p1 ( uG ) = ( p uG ) p ( ), p ( p (u ) 1 uG ) = 2 1 G 2 ( ) p ( ), p uG 1 p2 (uG ) p1 (uG ) p3 ( 3 ( ) p( ) p uG uG ) = 1 p3 (uG ) p2 (uG ) p1 (uG ) , (15) and so forth. Thus, what we wish to prove is ( * p uG ) p ( ) =1 * 1 (16) pt (uG ) t =1 or, rearranging, pt (uG ) ( = p1 ( ) p uG t =1 (17) ). Note that, given the stipulated relationship between p( ) on each processing iteration and p( | uG) on the previous iteration, p (uG pt (uG ) = )p ( ) = p ( uG = pt 1 )p ( p (uG t uG ) = t 1 3 )p ( ) 4 p (uG t 1 = (uG ) pt 2 (uG ) pt 1 ) pt 2 p (uG )p ( ) t 1 pt 1 1 ( ) (uG ) pt 2 (uG ) pt 3 (uG ) ( uG ) (18) … With this in mind, we can rewrite the left hand side product in (17) as follows: p ( uG p1 (uG ) ( p uG ) p (u G 2 )p( ) ) p (u 1 G ) 3 p (uG 1 ( p uG )p( ) ) p (u 1 G 4 p (uG 1 ( ) p2 (uG ) p uG ) p (u 1 ) p( ) 1 G ) p2 (uG ) p3 (uG ) … (19) Note that, given (18), the numerator in each factor of (19) cancels with the denominator in the subsequent factor, leaving only p(uG| *) in that denominator. The expression can thus be rewritten as 1 ( p uG 1 ) p (u G ) p (u G 4 p (uG 1 ) ) p( ) 1 ( p uG ) … = p (uG ( p uG ) ) p1 ( ). (20) The objective is then to show that the above equals p( *). It proceeds directly from the definition of * that, for all other than *, p ( uG ( p uG ) ) <1 (21) Thus, all but one of the terms in the sum above approach zero, and the remaining term equals p1( *). Thus, p (uG ( p uG ) ) p1 ( ) = p1 ( ) (22) 3 Simulations 3.1 Binary choice We begin with a simulation of a simple incentive choice situation. Here, an animal faces two levers. Pressing the left lever reliably yields a preferred food (r = 2), the right a less preferred food (r = 1). Representing these contingencies in a network structured as in Fig. 1 (left) and employing the iterative algorithm described in section 2.2 yields the results in Figure 2A. Shown here are the posterior probabilities for the policies press left and press right, along with the marginal value of p(u = 1) under these posteriors (labeled EV for expected value). The dashed horizontal line indicates the expected value for the optimal plan, to which the model obviously converges. A key empirical assay for purposive behavior involves outcome devaluation. Here, actions yielding a previously valued outcome are abandoned after the incentive value of the outcome is reduced, for example by pairing with an aversive event (e.g., [4]). To simulate this within the binary choice scenario just described, we reduced to zero the reward value of the food yielded by the left lever (fL), by making the appropriate change to p(u|fL). This yielded a reversal in lever choice (Fig. 2B). Another signature of purposive actions is that they are abandoned when their causal connection with rewarding outcomes is removed (contingency degradation, see [4]). We simulated this by starting with the model from Fig. 2A and changing conditional probabilities at s for t=2 to reflect a decoupling of the left action from the fL outcome. The resulting behavior is shown in Fig. 2C. Fig 2. 
Simulation results, binary choice. 3.2 Stochastic outcomes A critical aspect of the present modeling paradigm is that it yields reward-maximizing choices in stochastic domains, a property that distinguishes it from some other recent approaches using graphical models to do planning (e.g., [9]). To illustrate, we used the architecture in Figure 1 (left) to simulate a choice between two fair coins. A ‘left’ coin yields $1 for heads, $0 for tails; a ‘right’ coin $2 for heads but for tails a $3 loss. As illustrated in Fig. 2D, the model maximizes expected value by opting for the left coin. Fig 3. Simulation results, two-step sequential choice. 3.3 Sequential decision Here, we adopt the two-step T-maze scenario used by [24] (Fig. 3A). Representing the task contingencies in a graphical model based on the template from Fig 1 (right), and using the reward values indicated in Fig. 3A, yields the choice behavior shown in Figure 3B. Following [24], a shift in motivational state from hunger to thirst can be represented in the graphical model by changing the reward function (R(cheese) = 2, R(X) = 0, R(water) = 4, R(carrots) = 1). Imposing this change at the level of the u variables yields the choice behavior shown in Fig. 3C. The model can also be used to simulate effort-based decision. Starting with the scenario in Fig. 2A, we simulated the insertion of an effort-demanding scalable barrier at S 2 (R(S 2 ) = -2) by making appropriate changes p(u|s). The resulting behavior is shown in Fig. 3D. A famous empirical demonstration of purposive control involves detour behavior. Using a maze like the one shown in Fig. 4A, with a food reward placed at s5 , Tolman [2] found that rats reacted to a barrier at location A by taking the upper route, but to a barrier at B by taking the longer lower route. We simulated this experiment by representing the corresponding 3 transition and reward functions in a graphical model of the form shown in Fig. 1 (right), representing the insertion of barriers by appropriate changes to the transition function. The resulting choice behavior at the critical juncture s2 is shown in Fig. 4. Fig 4. Simulation results, detour behavior. B: No barrier. C: Barrier at A. D: Barrier at B. Another classic empirical demonstration involves latent learning. Blodgett [25] allowed rats to explore the maze shown in Fig. 5. Later insertion of a food reward at s13 was followed immediately by dramatic reductions in the running time, reflecting a reduction in entries into blind alleys. We simulated this effect in a model based on the template in Fig. 1 (right), representing the maze layout via an appropriate transition function. In the absence of a reward at s12 , random choices occurred at each intersection. However, setting R(s13 ) = 1 resulted in the set of choices indicated by the heavier arrows in Fig. 5. 4 Fig 5. Latent learning. Rel a t i o n t o p revi o u s work Initial proposals for how to solve decision problems through probabilistic inference in graphical models, including the idea of encoding reward as the posterior probability of a random utility variable, were put forth by Cooper [23]. Related ideas were presented by Shachter and Peot [12], including the use of nodes that integrate information from multiple utility nodes. More recently, Attias [11] and Verma and Rao [9] have used graphical models to solve shortest-path problems, leveraging probabilistic representations of rewards, though not in a way that guaranteed convergence on optimal (reward maximizing) plans. 
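As a concrete illustration of the iterative procedure from Section 2.2, the sketch below replays the binary incentive choice of Section 3.1 (left lever r = 2, right lever r = 1). Because the single-step network is tiny, exact enumeration stands in here for belief propagation; conditioning on u = 1 and feeding posteriors back as priors drives the policy node toward the higher-valued lever, as in Fig. 2A. The iteration count and printing are arbitrary choices, and this is a sketch under those assumptions rather than the paper's implementation.

```python
import numpy as np

# Section 3.1 binary choice: press-left yields r = 2, press-right yields r = 1.
rewards = np.array([2.0, 1.0])
r_max = np.max(np.abs(rewards))
p_u_given_policy = 0.5 * (rewards / r_max + 1.0)   # Equation (1): [1.0, 0.75]

p_policy = np.array([0.5, 0.5])                    # step 1: non-deterministic prior
for it in range(25):
    joint = p_u_given_policy * p_policy
    ev = joint.sum()                               # marginal p(u = 1): rescaled expected value
    posterior = joint / ev                         # step 2: condition on u = 1 and infer
    p_policy = posterior                           # steps 3-4: posterior becomes the new prior
    print(it, np.round(p_policy, 3), round(ev, 3))
# p_policy approaches [1, 0]: the model settles on the higher-reward lever, as in Fig. 2A.
```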
More closely related to the present research is work by Toussaint and Storkey [10], employing the EM algorithm. The iterative approach we have introduced here has a certain resemblance to the EM procedure, which becomes evident if one views the policy variables in our models as parameters on the mapping from states to actions. It seems possible that there may be a formal equivalence between the algorithm we have proposed and the one reported by [10]. As a cognitive and neuroscientific proposal, the present work bears a close relation to recent work by Hasselmo [6], addressing the prefrontal computations underlying goal-directed action selection (see also [7]). The present efforts are tied more closely to normative principles of decision-making, whereas the work in [6] is tied more closely to the details of neural circuitry. In this respect, the two approaches may prove complementary, and it will be interesting to further consider their interrelations. 3 In this simulation and the next, the set of states associated with each state node was limited to the set of reachable states for the relevant time-step, assuming an initial state of s1 . Acknowledgments Thanks to Andrew Ledvina, David Blei, Yael Niv, Nathaniel Daw, and Francisco Pereira for useful comments. R e f e re n c e s [1] Hull, C.L., Principles of Behavior. 1943, New York: Appleton-Century. [2] Tolman, E.C., Purposive Behavior in Animals and Men. 1932, New York: Century. [3] Dickinson, A., Actions and habits: the development of behavioral autonomy. Philosophical Transactions of the Royal Society (London), Series B, 1985. 308: p. 67-78. [4] Balleine, B.W. and A. Dickinson, Goal-directed instrumental action: contingency and incentive learning and their cortical substrates. Neuropharmacology, 1998. 37: p. 407-419. [5] Daw, N.D., Y. Niv, and P. Dayan, Uncertainty-based competition between prefrontal and striatal systems for behavioral control. Nature Neuroscience, 2005. 8: p. 1704-1711. [6] Hasselmo, M.E., A model of prefrontal cortical mechanisms for goal-directed behavior. Journal of Cognitive Neuroscience, 2005. 17: p. 1115-1129. [7] Schmajuk, N.A. and A.D. Thieme, Purposive behavior and cognitive mapping. A neural network model. Biological Cybernetics, 1992. 67: p. 165-174. [8] Tatman, J.A. and R.D. Shachter, Dynamic programming and influence diagrams. IEEE Transactions on Systems, Man and Cybernetics, 1990. 20: p. 365-379. [9] Verma, D. and R.P.N. Rao. Planning and acting in uncertain enviroments using probabilistic inference. in IEEE/RSJ International Conference on Intelligent Robots and Systems. 2006. [10] Toussaint, M. and A. Storkey. Probabilistic inference for solving discrete and continuous state markov decision processes. in Proceedings of the 23rd International Conference on Machine Learning. 2006. Pittsburgh, PA. [11] Attias, H. Planning by probabilistic inference. in Proceedings of the 9th Int. Workshop on Artificial Intelligence and Statistics. 2003. [12] Shachter, R.D. and M.A. Peot. Decision making using probabilistic inference methods. in Uncertainty in artificial intelligence: Proceedings of the Eighth Conference (1992). 1992. Stanford University: M. Kaufmann. [13] Chater, N., J.B. Tenenbaum, and A. Yuille, Probabilistic models of cognition: conceptual foundations. Trends in Cognitive Sciences, 2006. 10(7): p. 287-291. [14] Doya, K., et al., eds. The Bayesian Brain: Probabilistic Approaches to Neural Coding. 2006, MIT Press: Cambridge, MA. [15] Miller, E.K. and J.D. 
Cohen, An integrative theory of prefrontal cortex function. Annual Review of Neuroscience, 2001. 24: p. 167-202. [16] Asaad, W.F., G. Rainer, and E.K. Miller, Task-specific neural activity in the primate prefrontal cortex. Journal of Neurophysiology, 2000. 84: p. 451-459. [17] Rolls, E.T., The functions of the orbitofrontal cortex. Brain and Cognition, 2004. 55: p. 11-29. [18] Padoa-Schioppa, C. and J.A. Assad, Neurons in the orbitofrontal cortex encode economic value. Nature, 2006. 441: p. 223-226. [19] Gopnik, A., et al., A theory of causal learning in children: causal maps and Bayes nets. Psychological Review, 2004. 111: p. 1-31. [20] Hamilton, A.F.d.C. and S.T. Grafton, Action outcomes are represented in human inferior frontoparietal cortex. Cerebral Cortex, 2008. 18: p. 1160-1168. [21] Johnson, A., M.A.A. van der Meer, and D.A. Redish, Integrating hippocampus and striatum in decision-making. Current Opinion in Neurobiology, 2008. 17: p. 692-697. [22] Jensen, F.V., Bayesian Networks and Decision Graphs. 2001, New York: Springer Verlag. [23] Cooper, G.F. A method for using belief networks as influence diagrams. in Fourth Workshop on Uncertainty in Artificial Intelligence. 1988. University of Minnesota, Minneapolis. [24] Niv, Y., D. Joel, and P. Dayan, A normative perspective on motivation. Trends in Cognitive Sciences, 2006. 10: p. 375-381. [25] Blodgett, H.C., The effect of the introduction of reward upon the maze performance of rats. University of California Publications in Psychology, 1929. 4: p. 113-134.
6 0.70603818 150 nips-2008-Near-optimal Regret Bounds for Reinforcement Learning
7 0.62045133 195 nips-2008-Regularized Policy Iteration
8 0.61210233 87 nips-2008-Fitted Q-iteration by Advantage Weighted Regression
9 0.61119288 181 nips-2008-Policy Search for Motor Primitives in Robotics
10 0.59604973 37 nips-2008-Biasing Approximate Dynamic Programming with a Lower Discount Factor
11 0.58017081 131 nips-2008-MDPs with Non-Deterministic Policies
12 0.56294334 96 nips-2008-Hebbian Learning of Bayes Optimal Decisions
13 0.5625934 231 nips-2008-Temporal Dynamics of Cognitive Control
14 0.55978543 230 nips-2008-Temporal Difference Based Actor Critic Learning - Convergence and Neural Implementation
15 0.55912304 121 nips-2008-Learning to Use Working Memory in Partially Observable Environments through Dopaminergic Reinforcement
16 0.55758208 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
17 0.55659986 173 nips-2008-Optimization on a Budget: A Reinforcement Learning Approach
18 0.55502737 210 nips-2008-Signal-to-Noise Ratio Analysis of Policy Gradient Algorithms
19 0.5491904 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs
20 0.54695898 240 nips-2008-Tracking Changing Stimuli in Continuous Attractor Neural Networks