nips2007-209 — Ultrafast Monte Carlo for Statistical Summations
Source: pdf
Authors: Charles L. Isbell, Michael P. Holmes, Alexander G. Gray
Abstract: Machine learning contains many computational bottlenecks in the form of nested summations over datasets. Kernel estimators and other methods are burdened by these expensive computations. Exact evaluation is typically $O(n^2)$ or higher, which severely limits application to large datasets. We present a multi-stage stratified Monte Carlo method for approximating such summations with probabilistic relative error control. The essential idea is fast approximation by sampling in trees. This method differs from many previous scalability techniques (such as standard multi-tree methods) in that its error is stochastic, but we derive conditions for error control and demonstrate that they work. Further, we give a theoretical sample complexity for the method that is independent of dataset size, and show that this appears to hold in experiments, where speedups reach as high as $10^{14}$, many orders of magnitude beyond the previous state of the art.
1 Introduction

Many machine learning methods have computational bottlenecks in the form of nested summations that become intractable for large datasets.
In this work we formalize the general class of nested summations and present a new multi-stage Monte Carlo method for approximating any problem in the class with rigorous relative error control. We derive error guarantees and sample complexity bounds, with the intriguing result that runtime depends not on dataset size but on statistical features such as variance and kurtosis, which can be controlled through stratification. We also present experiments that validate these theoretical results and demonstrate tremendous speedup over the prior state of the art.

Previous approaches to algorithmic acceleration of this kind fall into roughly two groups: 1) methods that run non-accelerated algorithms on subsets of the data, typically without error bounds, and 2) multi-tree methods with deterministic error bounds. The former are of less interest due to the lack of error control, while the latter are good when exact error control is required, but have built-in overconservatism that limits speedup, and are difficult to extend to new problems. Our Monte Carlo approach offers much larger speedup and a generality that makes it simple to adapt to new problems, while retaining strong error control. While there are non-summative problems to which the standard multi-tree methodology is applicable and our Monte Carlo method is not, our method appears to give greater speedup by many orders of magnitude on problems where both methods can be used.
As a motivating example, the leave-one-out cross-validation score for kernel regression, $S_{KR}$, contains innermost sums of the form $\sum_{j \neq i} K_{h_2}(\|x_i - x_j\|)$ nested inside an outer sum over $i$. These nested sums have quadratic and cubic computation times that are intractable for large datasets.
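To make the bottleneck concrete, the following minimal sketch (our illustration, not code from the paper; the Gaussian kernel and the names `gaussian_kernel` and `loo_kernel_sums` are assumptions) shows the kind of naive $O(n^2)$ leave-one-out computation the method is designed to replace:

```python
import numpy as np

def gaussian_kernel(dist, h):
    # One common choice of K_h; nothing below depends on this choice.
    return np.exp(-dist * dist / (2.0 * h * h))

def loo_kernel_sums(X, h):
    """Naive leave-one-out kernel sums: for each x_i, sum K_h(||x_i - x_j||)
    over all j != i. This inner/outer nesting costs O(n^2) kernel
    evaluations, which is exactly what sampling will approximate."""
    n = len(X)
    out = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if j != i:
                out[i] += gaussian_kernel(np.linalg.norm(X[i] - X[j]), h)
    return out
```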
We formalize such problems as a grammar of generalized summations with two cases. The base case is
$$B(X_c) = \sum_{i \in I(X_c)} f(X_c, X_i), \qquad (2)$$
in which a tuple of constant arguments $X_c$ may be specified and a tuple of variable arguments $X_i$ is indexed by a set $I$, which may be a function of $X_c$. For instance, in the innermost leave-one-out summations of $S_{KR}$, $X_c$ is the single point $x_i$ while $I(X_c)$ indexes all single points other than $x_i$. Note that $|I|$ is the number of terms in a summation of type $B$, and therefore represents the base time complexity. Whenever $I$ consists of all $k$-tuples or leave-one-out $k$-tuples, the base complexity is $O(n^k)$, where $n$ is the size of the dataset.
The inductive case $G$ is either: 1) the base case $B$, or 2) a sum where the arguments to the summand function are $X_c$ and a series of nested instances of type $G$. In $S_{KR}$ the outermost summation is an example of this. The base complexity here is $|I|$ multiplied by the maximum base complexity among the nested instances, e.g., $O(n) \times O(n) = O(n^2)$ for the quadratic sums in $S_{KR}$.
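This grammar maps naturally onto a recursive evaluator. The sketch below is our own illustration; the `S.index`, `S.args`, and `S.f` interface is an assumed encoding of the B/G grammar, not the paper's API. It computes the sum exactly, with a cost that multiplies out as just described:

```python
def eval_summation(S, Xc):
    """Exact recursive evaluation of a type-G generalized summation.
    S.index(Xc): iterable of variable-argument tuples X_i (the set I(Xc)).
    S.args:      nested child summations of type G (empty for base case B).
    S.f:         the summand function."""
    total = 0.0
    for Xi in S.index(Xc):
        inner = [eval_summation(G, Xi) for G in S.args]  # nested type-G sums
        total += S.f(Xc, Xi, *inner)
    return total
```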
While such subset-sampling approaches can have asymptotic convergence, there are no error guarantees for finite sample sizes. Our approach also exploits the speedup that comes from sampling, but provides a rigorous relative error guarantee and is able to automatically determine the necessary sample size to provide that guarantee.
The other main class of acceleration methods consists of those employing "higher order divide and conquer" or multi-tree techniques that give either exact answers or deterministic error bounds (e.g., generalized N-body approaches). While the class of generalized N-body problems (GNPs) has yet to be formally defined, the generalized summations we address are clearly related and have at least partial overlap. The multi-tree methodology has two main drawbacks. First, although it gives deterministic error bounds, the bounds are usually quite loose, resulting in overconservatism that prevents aggressive approximation that could give greater speed. Second, creating a new multi-tree method to accelerate a given algorithm requires complex custom derivation of error bounds and pruning rules. In contrast, error in our method is rigorously controlled and driven by sample variance, allowing the Monte Carlo approach to make aggressive approximations and avoid the overconservatism of deterministic multi-tree methods. This yields speedups greater by many orders of magnitude. Further, our Monte Carlo approach handles the class of nested summations in full generality, making it easy to specialize to new problems. The main tradeoff is that Monte Carlo error bounds are probabilistic, though the bound probability is a parameter to the algorithm.
We then move beyond existing results to present novel sample complexity bounds and extend the single-stage results to the multi-stage and multi-stage stratified cases. These extensions allow us to efficiently bring Monte Carlo principles to bear on the entire class of generalized summations, while yielding insights into the dependence of computational complexity on sample statistics and how tree-based methods can improve those statistics.
To begin, note that the summation $B(X_c)$ can be written as $nE[f_i] = n\mu_f$, where $n = |I|$ and the expectation is taken over a discrete distribution $P_f$ that puts mass $\frac{1}{n}$ on each term $f_i = f(X_c, X_i)$. Our goal is to produce an estimate $\hat B$ that has low relative error with high probability. Let $\hat\mu_f$ be the sample mean of $m$ samples taken from $P_f$.
By the CLT, $\hat\mu_f$ is asymptotically distributed as $N(\mu_f, \sigma_f^2/m)$, so for large $m$ we have $|\hat\mu_f - \mu_f| \le z_{\alpha/2}\,\hat\sigma_f/\sqrt{m}$ with probability approximately $1-\alpha$, where $z_{\alpha/2}$ is the corresponding standard normal quantile. When $\hat\mu_f$ satisfies this bound, our relative error condition is implied by $z_{\alpha/2}\,\hat\sigma_f/\sqrt{m} \le \epsilon|\mu_f|$, and we also have $|\mu_f| \ge |\hat\mu_f| - z_{\alpha/2}\,\hat\sigma_f/\sqrt{m}$. Combining these, we can ensure our target relative error by requiring that $z_{\alpha/2}\,\hat\sigma_f/\sqrt{m} \le \epsilon\,(|\hat\mu_f| - z_{\alpha/2}\,\hat\sigma_f/\sqrt{m})$, which rearranges to:
$$m \ge z_{\alpha/2}^2\,\frac{(1+\epsilon)^2}{\epsilon^2}\,\frac{\hat\sigma_f^2}{\hat\mu_f^2}. \qquad (3)$$
Equation 3 gives an empirically testable condition that guarantees the target relative error level $\epsilon$ with probability $1-\alpha$, given that $\hat\mu_f$ has reached its asymptotic distribution $N(\mu_f, \sigma_f^2/m)$.
This suggests an iterative sampling procedure in which $m$ starts at a value $m_{min}$ chosen to make the normal approximation valid, and is then increased until the condition of Equation 3 is met.
Theorem 1. Given $m_{min}$ large enough to put $\hat\mu_f$ in its asymptotic normal regime, with probability at least $1-\alpha$ Algorithm 1 approximates the summation $S$ with relative error no greater than $\epsilon$.

We have already established that Equation 3 is a sufficient condition for relative error no greater than $\epsilon$ with probability $1-\alpha$.
Algorithm 1 Iterative Monte Carlo approximation for type-B summations.

addSamples(samples, m_needed, S, X_c)
  for i = 1 to m_needed
    X_i ← rand(S.I(X_c))
    samples ← samples ∪ S.f(X_c, X_i)
  end for

MC-Approx(S, X_c, ε, α, m_min)
  samples ← ∅, m_needed ← m_min
  repeat
    addSamples(samples, m_needed, S, X_c)
    m, µ̂_f, σ̂²_f ← calcStats(samples)
    m_thresh ← z²_{α/2} (1+ε)² σ̂²_f / (ε² µ̂²_f)
    m_needed ← m_thresh − m
  until m ≥ m_thresh
  return |S.I(X_c)| · µ̂_f
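A runnable sketch of this loop follows. It is our paraphrase of Algorithm 1, assuming a `draw(k)` callback that returns `k` uniformly sampled term values $f_i$; guards for a near-zero sample mean and other production details are omitted:

```python
import numpy as np
from scipy.stats import norm

def mc_approx(draw, n_terms, eps, alpha, m_min=128):
    """Algorithm 1 sketch: grow the sample until Equation 3 certifies
    relative error <= eps with probability ~ 1 - alpha, then return
    n * (sample mean) as the estimate of the summation."""
    z = norm.ppf(1.0 - alpha / 2.0)  # two-sided normal quantile z_{alpha/2}
    samples = draw(m_min)
    while True:
        m = len(samples)
        mu_hat = samples.mean()
        sigma_hat = samples.std(ddof=1)
        m_thresh = (z * (1 + eps) / eps) ** 2 * sigma_hat**2 / mu_hat**2
        if m >= m_thresh:
            return n_terms * mu_hat
        samples = np.concatenate([samples, draw(int(np.ceil(m_thresh)) - m)])
```

For the inner sum of the kernel example above, `draw` would sample random indices $j \ne i$ and return the corresponding kernel values.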
Theorem 2. Given $m_{min}$ large enough to put $\hat\mu_f$ and $\hat\sigma_f$ in their asymptotic normal regimes, with probability at least $1-2\alpha$ Algorithm 1 terminates with
$$m \le O\!\left(\frac{\sigma_f^2}{\mu_f^2} + \frac{\sigma_f}{|\mu_f|}\sqrt{\frac{\mu_{4f}}{\sigma_f^4} - 1}\right). \qquad (4)$$
Next, because the sample variance $\hat\sigma_f^2$ is asymptotically distributed as $N\big(\sigma_f^2, (\mu_{4f} - \sigma_f^4)/m\big)$, where $\mu_{4f}$ is the fourth central moment, we can apply the delta method to infer that $\hat\sigma_f$ converges in distribution to $N\big(\sigma_f, (\mu_{4f} - \sigma_f^4)/4\sigma_f^2 m\big)$. Note that the dataset size $n$ appears nowhere in the bound. This is because the sample complexity depends only on the distributional features ($\sigma_f^2$, $\mu_f$, and $\mu_{4f}$) of the summation terms, and not on the number of terms.
For datasets drawn from a fixed underlying distribution in particular, these distributional features are convergent, which means the sample or computational complexity converges to a constant while speedup becomes unbounded as the dataset size goes to infinity.
Algorithm 1 therefore gives greatest speedup for summations whose terms have low dispersion and low kurtosis. This motivates the additional speedup we later gain by stratifying the dataset into low-variance regions. Finally, the sample complexity bound indicates whether Algorithm 1 will actually give speedup for any particular problem. For a given summation, let the speedup be defined as the total number of terms $n$ divided by the number of terms evaluated by the approximation. For a desired speedup $\tau$, we need $n \ge \tau m_{bound}$, where $m_{bound}$ is the expression in Equation 4.
This is the fundamental characterization of whether speedup will be attained.
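As a rough screen, one can plug sample statistics into the bound. This helper is our own; the exact constant in Equation 4 is hidden by the $O(\cdot)$, so we pattern it on the $z_{\alpha/2}^2(1+\epsilon)^2/\epsilon^2$ factor of Equation 3, which is an assumption rather than the paper's statement:

```python
import math

def speedup_feasible(n, tau, eps, z, var, mean, mu4):
    """Rough check of the condition n >= tau * m_bound, with m_bound
    patterned on Equation 4:
    z^2 (1+eps)^2/eps^2 * [var/mean^2 + (sd/|mean|) sqrt(mu4/var^2 - 1)]."""
    sd = math.sqrt(var)
    m_bound = (z * (1 + eps) / eps) ** 2 * (
        var / mean ** 2
        + (sd / abs(mean)) * math.sqrt(max(mu4 / var ** 2 - 1.0, 0.0))
    )
    return n >= tau * m_bound
```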
Algorithm 2 Iterative Monte Carlo approximation for nested summations.

MC-Approx: as in Algorithm 1
calcStats: as in Algorithm 1

addSamples(samples, m_needed, S, X_c)
  for i = 1 to m_needed
    X_i ← rand(S.I(X_c))
    mcArgs ← X_i together with recursive MC-Approx estimates of the nested arguments G_j
    samples ← samples ∪ S.f(X_c, mcArgs)
  end for

4 Multi-stage Monte Carlo

We now turn to the inductive case of nested summations, i.e., summations of type G.
The approach we take is to apply the single-stage Monte Carlo algorithm over the terms $f_i$ as before, but with recursive invocation to obtain approximations for the arguments $G_j$.
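A sketch of the recursion follows, reusing `mc_approx` from the single-stage sketch above; the `S.sample_index`, `S.args`, `S.n`, and `S.f` interface is again our illustrative assumption:

```python
import numpy as np

def mc_approx_nested(S, Xc, eps, alpha, m_min=128):
    """Algorithm 2 sketch: the control loop is unchanged, but each sampled
    term first obtains Monte Carlo estimates of its nested arguments G_j."""
    def draw(k):
        vals = []
        for _ in range(k):
            Xi = S.sample_index(Xc)  # uniform draw from I(Xc)
            g_hats = [mc_approx_nested(G, Xi, eps, alpha, m_min)
                      for G in S.args]  # recursive argument estimates
            vals.append(S.f(Xc, Xi, *g_hats))
        return np.array(vals)
    return mc_approx(draw, S.n(Xc), eps, alpha, m_min)
```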
Given $m_{min}$ large enough to put $\hat\mu_f$ in its asymptotic normal regime, with probability at least $1-\alpha$ Algorithm 2 approximates the summation $S$ with relative error no greater than $\epsilon$.
We begin by noting that the proof of correctness for Algorithm 1 rests on 1) the ability to sample from a distribution $P_f$ whose expectation is $\mu_f = \frac{1}{n}\sum_i f_i$, and 2) the ability to invoke the CLT on the sample mean $\hat\mu_f$ in terms of the sample variance $\hat\sigma_f^2$. Given these properties, Equation 3 follows as a sufficient condition for relative error no greater than $\epsilon$ with probability at least $1-\alpha$.
For each sampled $f_i$, let $\hat G_j$ be the recursive approximation for argument $G_j$. Because the $G_j$ are recursively approximated, this is an inductive hypothesis, with the remainder of the proof showing that if the hypothesis holds for the recursive invocations, it also holds for the outer invocation. The base case, where all recursions must bottom out, is the type-B summation already shown to give CLT-governed answers (see proof of Theorem 1).
Let $G_m = (\hat G_1, \hat G_2, \dots)$ be the vector of $\hat G_j$ values after each $G_j$ has been estimated from $m_j$ samples ($\sum_j m_j = m$), and let $G$ be the vector of true $G_j$ values. We leave the detailed entries of the covariance $\Sigma_m$ unspecified, except to note that its $jj$th element is $\sigma_j^2/m_j$, and that its off-diagonal elements may be non-zero if the $G_j$ are generated in a correlated way (this can be used as a variance reduction technique). Given the asymptotic normality of $G_m$, the same arguments used to derive the multivariate delta method can be used, with some modification, to show that $f_i(G_m) \sim N\big(f_i(G),\ \nabla f(G)\,\Sigma_m\,\nabla f(G)^T\big)$. Thus, asymptotically, $f_i(G_m)$ is normally distributed around its true value with a variance that depends on both the gradient of $f$ and the covariance matrix of the approximated arguments in $G_m$. This being the case, uniform sampling of the recursively estimated $f_i$ is equivalent to sampling from a distribution $\tilde P_f$ that gives weight $\frac{1}{n}$ to a normal distribution centered on each $f_i$. The resulting variance is $\tilde\sigma_f^2 = \sigma_f^2 + \frac{1}{n}\sum_i \sigma_i^2$; in other words, the variance with recursive approximation is the exact variance $\sigma_f^2$ plus the average of the variances $\sigma_i^2$ of the approximated $f_i$.
Because we are still dealing with a sample mean, Theorem 2 still holds in the nested case.
Given $m_{min}$ large enough to put $\hat\mu_f$ and $\hat\sigma_f$ in their asymptotic normal regimes, with probability at least $1-2\alpha$ Algorithm 2 terminates with $m \le O\big(\tilde\sigma_f^2/\mu_f^2 + (\tilde\sigma_f/|\mu_f|)\sqrt{\tilde\mu_{4f}/\tilde\sigma_f^4 - 1}\big)$.
Algorithm 3 Iterative Monte Carlo approximation for nested summations with stratification.

MC-Approx: as in Algorithm 1

calcStats(strata, samples)
  m ← count(samples)
  µ̂_fs ← stratAvg(strata, samples)
  σ̂²_fs ← stratVar(strata, samples)
  return m, µ̂_fs, σ̂²_fs

addSamples(strata, samples, m_needed, S, X_c)
  needPerStrat ← optAlloc(samples, strata, m_needed)
  for s = 1 to strata.count
    for i = 1 to needPerStrat[s]
      X_i ← rand(strata[s])
      samples ← samples ∪ S.f(X_c, mcArgs)    (mcArgs as in Algorithm 2)
    end for
  end for

5 Variance Reduction

With Algorithm 2 we have coverage of the entire generalized summation problem class, and our focus turns to maximizing efficiency.
As noted above, Theorem 2 implies we need fewer samples when the summation terms are tightly concentrated in a few clusters. Additionally, by use of correlated sampling we induce covariance between recursively estimated summations whenever the overall variance can be reduced by doing so. The first technique is stratification: we partition the dataset into strata, here using space-partitioning trees. The idea is that strata with higher variance can be sampled more heavily than those with lower variance, thereby making more efficient use of samples than in uniform sampling. The approximation procedure runs as it did before, except that the sampling and sample statistics are modified to make use of the trees. This heuristic will later be justified by the variance expression for the stratified sample mean.
Given $m_{min}$ large enough to put $\hat\mu_{fs}$ in its asymptotic normal regime, with probability at least $1-\alpha$ Algorithm 3 approximates the summation $S$ with relative error no greater than $\epsilon$. Although the sample allocation fractions $q_j$ can be chosen arbitrarily, the stratified variance $\sigma_{fs}^2$ is minimized when $q_j \propto p_j \sigma_j$, where $p_j$ is stratum $j$'s fraction of the terms and $\sigma_j^2$ its variance.
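The allocation rule translates directly into code. Below is a minimal sketch of our own (a real implementation would stratify via the tree nodes and handle empty strata); `opt_alloc` and `strat_mean` are illustrative names:

```python
import numpy as np

def opt_alloc(p, sigma, m_needed):
    """Near-optimal allocation: stratum j receives a share of m_needed
    proportional to p_j * sigma_j (p_j = stratum's fraction of the terms,
    sigma_j = its sample standard deviation)."""
    w = np.asarray(p, dtype=float) * np.asarray(sigma, dtype=float)
    w = w / w.sum() if w.sum() > 0 else np.asarray(p, dtype=float)
    alloc = np.floor(w * m_needed).astype(int)
    alloc[int(np.argmax(w))] += m_needed - alloc.sum()  # hand out remainder
    return alloc

def strat_mean(p, stratum_samples):
    """Stratified estimate of mu_f: p_j-weighted stratum sample means."""
    return sum(pj * np.mean(s) for pj, s in zip(p, stratum_samples))
```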
Finally, the Theorem 2 sample complexity still holds for the CLT-governed stratified sample mean. Given $m_{min}$ large enough to put $\hat\mu_{fs}$ and $\hat\sigma_{fs}$ in their asymptotic normal regimes, with probability at least $1-2\alpha$ Algorithm 3 terminates with $m \le O\big(\sigma_{fs}^2/\mu_f^2 + (\sigma_{fs}/|\mu_f|)\sqrt{\mu_{4fs}/\sigma_{fs}^4 - 1}\big)$.
We now turn to correlated sampling. The variance of recursively estimated $f_i$, as expressed by $\nabla f(G)\,\Sigma_m\,\nabla f(G)^T$, depends on the full covariance matrix of the estimated arguments. If the gradient of $f$ is such that the variance of $f_i$ depends negatively (positively) on a covariance $\sigma_{jk}$, we can reduce the variance by inducing positive (negative) covariance between $\hat G_j$ and $\hat G_k$. In some cases the expression for $f_i$'s variance is such that the effect of correlated sampling is data-dependent; when this happens, it is easy to test and check whether correlation helps.
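As a toy illustration of ours (not an experiment from the paper): if $f(G_1, G_2) = G_1 - G_2$, then $\mathrm{Var}[\hat G_1 - \hat G_2] = \mathrm{Var}[\hat G_1] + \mathrm{Var}[\hat G_2] - 2\,\mathrm{Cov}(\hat G_1, \hat G_2)$, so estimating both sums from the same sampled indices (common random numbers) helps when the term sequences are positively correlated:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100_000, 200
base = rng.normal(0.0, 1.0, n)
g1_terms = base + rng.normal(0.0, 0.1, n)  # two strongly correlated
g2_terms = base + rng.normal(0.0, 0.1, n)  # summation term sequences

idx = rng.integers(0, n, size=m)           # shared indices: correlated
corr_est = g1_terms[idx].mean() - g2_terms[idx].mean()

idx2 = rng.integers(0, n, size=m)          # fresh indices: independent
indep_est = g1_terms[idx].mean() - g2_terms[idx2].mean()
# Over repeated runs, corr_est has far lower variance than indep_est.
```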
6 Experiments

We show that the error distributions conform closely to our asymptotic theory. These results are presented for two method-dataset pairs: kernel regression on a dataset containing 2 million 4-dimensional redshift measurements used for quasar identification, and kernel conditional density estimation on an n-body galaxy simulation dataset containing roughly 3 million points. The objective of the first set of experiments is to validate the guarantee that relative error will be less than or equal to $\epsilon$ with probability $1-\alpha$. We measured the distribution of error on a series of random data subsets up to the highest size for which the exact computation was tractable. The approximation is quite good, and could be improved if desired by increasing $m_{min}$ or the number of strata, but in this case we chose to trade a slight increase in error for an increase in speed. A second set of experiments measures runtime as dataset size grows; runtimes quickly level off, in accord with Theorem 2 and its corollaries, which predict sample and computational complexity independent of dataset size. All speedups are relative to extrapolated runtimes based on the $O(\cdot)$ order of the exact computation.
[Figure 1: Error distribution vs. dataset size.]
[Figure 2: Runtime (ms) vs. dataset size.]
7 Conclusion

We have presented a multi-stage stratified Monte Carlo method for efficiently approximating a broad class of generalized nested summations. Summations of this type lead to computational bottlenecks in kernel estimators and elsewhere in machine learning. The theory derived for this Monte Carlo approach predicts: 1) relative error no greater than $\epsilon$ with probability at least $1-\alpha$, for user-specified $\epsilon$ and $\alpha$, and 2) sample and computational complexity independent of dataset size. Our experimental results validate these theoretical guarantees on real datasets, where we accelerate kernel cross-validation scores by as much as $10^{14}$ on millions of points. In addition to applications, future work will likely include automatic selection of stratification granularity, additional variance reduction techniques, and further generalization to other computational bottlenecks such as linear algebraic operations.
wordName wordTfidf (topN-words)
[('strati', 0.412), ('xc', 0.387), ('summations', 0.234), ('strata', 0.197), ('carlo', 0.193), ('monte', 0.193), ('gj', 0.18), ('mneeded', 0.177), ('mmin', 0.172), ('speedup', 0.171), ('nested', 0.151), ('fi', 0.129), ('summation', 0.127), ('skr', 0.118), ('addsamples', 0.099), ('kcde', 0.099), ('variance', 0.093), ('gm', 0.088), ('kr', 0.088), ('error', 0.084), ('asymptotic', 0.081), ('calcstats', 0.079), ('mcargs', 0.079), ('speedups', 0.078), ('dataset', 0.077), ('sample', 0.076), ('samples', 0.074), ('kh', 0.073), ('clt', 0.069), ('relative', 0.067), ('pf', 0.066), ('xi', 0.063), ('pj', 0.063), ('base', 0.063), ('orders', 0.062), ('isbell', 0.059), ('mthresh', 0.059), ('overconservatism', 0.059), ('skcde', 0.059), ('stratum', 0.059), ('bottlenecks', 0.059), ('sampling', 0.056), ('arguments', 0.055), ('qj', 0.055), ('normal', 0.055), ('recursive', 0.053), ('recursively', 0.052), ('rand', 0.052), ('inductive', 0.05), ('bounds', 0.05), ('kernel', 0.05), ('regimes', 0.047), ('partitioning', 0.046), ('million', 0.046), ('terminates', 0.045), ('put', 0.045), ('quantile', 0.044), ('complexity', 0.043), ('distributional', 0.041), ('bandwidths', 0.041), ('equation', 0.04), ('xj', 0.04), ('erstrat', 0.039), ('gnps', 0.039), ('mbound', 0.039), ('needp', 0.039), ('optalloc', 0.039), ('outermost', 0.039), ('alexander', 0.039), ('greater', 0.038), ('runtime', 0.038), ('theorem', 0.037), ('estimators', 0.037), ('ms', 0.036), ('regime', 0.036), ('generalized', 0.035), ('mj', 0.035), ('guarantees', 0.035), ('holmes', 0.034), ('exact', 0.033), ('datasets', 0.033), ('magnitude', 0.032), ('gray', 0.032), ('correlated', 0.032), ('rigorous', 0.032), ('covariance', 0.031), ('class', 0.03), ('approximation', 0.03), ('least', 0.03), ('size', 0.03), ('guarantee', 0.03), ('kurtosis', 0.029), ('accelerate', 0.029), ('acceleration', 0.029), ('condition', 0.028), ('answers', 0.028), ('aggressive', 0.028), ('validate', 0.027), ('approximated', 0.027), ('bound', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999994 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations
Author: Charles L. Isbell, Michael P. Holmes, Alexander G. Gray
Abstract: Machine learning contains many computational bottlenecks in the form of nested summations over datasets. Kernel estimators and other methods are burdened by these expensive computations. Exact evaluation is typically O(n2 ) or higher, which severely limits application to large datasets. We present a multi-stage stratified Monte Carlo method for approximating such summations with probabilistic relative error control. The essential idea is fast approximation by sampling in trees. This method differs from many previous scalability techniques (such as standard multi-tree methods) in that its error is stochastic, but we derive conditions for error control and demonstrate that they work. Further, we give a theoretical sample complexity for the method that is independent of dataset size, and show that this appears to hold in experiments, where speedups reach as high as 1014 , many orders of magnitude beyond the previous state of the art. 1
2 0.12898757 123 nips-2007-Loop Series and Bethe Variational Bounds in Attractive Graphical Models
Author: Alan S. Willsky, Erik B. Sudderth, Martin J. Wainwright
Abstract: Variational methods are frequently used to approximate or bound the partition or likelihood function of a Markov random field. Methods based on mean field theory are guaranteed to provide lower bounds, whereas certain types of convex relaxations provide upper bounds. In general, loopy belief propagation (BP) provides often accurate approximations, but not bounds. We prove that for a class of attractive binary models, the so–called Bethe approximation associated with any fixed point of loopy BP always lower bounds the true likelihood. Empirically, this bound is much tighter than the naive mean field bound, and requires no further work than running BP. We establish these lower bounds using a loop series expansion due to Chertkov and Chernyak, which we show can be derived as a consequence of the tree reparameterization characterization of BP fixed points. 1
3 0.10874504 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC
Author: Matthew Hoffman, Arnaud Doucet, Nando D. Freitas, Ajay Jasra
Abstract: A recently proposed formulation of the stochastic planning and control problem as one of parameter estimation for suitable artificial statistical models has led to the adoption of inference algorithms for this notoriously hard problem. At the algorithmic level, the focus has been on developing Expectation-Maximization (EM) algorithms. In this paper, we begin by making the crucial observation that the stochastic control problem can be reinterpreted as one of trans-dimensional inference. With this new interpretation, we are able to propose a novel reversible jump Markov chain Monte Carlo (MCMC) algorithm that is more efficient than its EM counterparts. Moreover, it enables us to implement full Bayesian policy search, without the need for gradients and with one single Markov chain. The new approach involves sampling directly from a distribution that is proportional to the reward and, consequently, performs better than classic simulations methods in situations where the reward is a rare event.
4 0.067883141 152 nips-2007-Parallelizing Support Vector Machines on Distributed Computers
Author: Kaihua Zhu, Hao Wang, Hongjie Bai, Jian Li, Zhihuan Qiu, Hang Cui, Edward Y. Chang
Abstract: Support Vector Machines (SVMs) suffer from a widely recognized scalability problem in both memory use and computational time. To improve scalability, we have developed a parallel SVM algorithm (PSVM), which reduces memory use through performing a row-based, approximate matrix factorization, and which loads only essential data to each machine to perform parallel computation. Let n denote the number of training instances, p the reduced matrix dimension after factorization (p is significantly smaller than n), and m the number of machines. PSVM reduces the memory requirement from O(n2 ) to O(np/m), and improves computation time to O(np2 /m). Empirical study shows PSVM to be effective. PSVM Open Source is available for download at http://code.google.com/p/psvm/.
5 0.066456869 61 nips-2007-Convex Clustering with Exemplar-Based Models
Author: Danial Lashkari, Polina Golland
Abstract: Clustering is often formulated as the maximum likelihood estimation of a mixture model that explains the data. The EM algorithm widely used to solve the resulting optimization problem is inherently a gradient-descent method and is sensitive to initialization. The resulting solution is a local optimum in the neighborhood of the initial guess. This sensitivity to initialization presents a significant challenge in clustering large data sets into many clusters. In this paper, we present a different approach to approximate mixture fitting for clustering. We introduce an exemplar-based likelihood function that approximates the exact likelihood. This formulation leads to a convex minimization problem and an efficient algorithm with guaranteed convergence to the globally optimal solution. The resulting clustering can be thought of as a probabilistic mapping of the data points to the set of exemplars that minimizes the average distance and the information-theoretic cost of mapping. We present experimental results illustrating the performance of our algorithm and its comparison with the conventional approach to mixture model clustering. 1
6 0.063877851 92 nips-2007-Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations
7 0.062824063 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis
8 0.062230315 149 nips-2007-Optimal ROC Curve for a Combination of Classifiers
9 0.058883861 174 nips-2007-Selecting Observations against Adversarial Objectives
10 0.058179747 125 nips-2007-Markov Chain Monte Carlo with People
11 0.057421405 157 nips-2007-Privacy-Preserving Belief Propagation and Sampling
12 0.054843716 30 nips-2007-Bayes-Adaptive POMDPs
13 0.054033604 160 nips-2007-Random Features for Large-Scale Kernel Machines
14 0.051756635 13 nips-2007-A Unified Near-Optimal Estimator For Dimension Reduction in $l \alpha$ ($0<\alpha\leq 2$) Using Stable Random Projections
15 0.051700339 213 nips-2007-Variational Inference for Diffusion Processes
16 0.051328134 162 nips-2007-Random Sampling of States in Dynamic Programming
17 0.050645187 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning
18 0.04756549 7 nips-2007-A Kernel Statistical Test of Independence
19 0.046425108 214 nips-2007-Variational inference for Markov jump processes
20 0.046369847 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
topicId topicWeight
[(0, -0.18), (1, -0.019), (2, -0.045), (3, 0.029), (4, -0.05), (5, -0.031), (6, 0.012), (7, -0.024), (8, -0.043), (9, -0.03), (10, -0.031), (11, 0.024), (12, -0.006), (13, -0.006), (14, 0.04), (15, 0.013), (16, 0.092), (17, 0.037), (18, -0.062), (19, -0.04), (20, 0.146), (21, 0.026), (22, 0.01), (23, 0.063), (24, 0.001), (25, 0.005), (26, 0.014), (27, 0.079), (28, -0.062), (29, -0.041), (30, 0.032), (31, -0.112), (32, 0.031), (33, -0.029), (34, 0.073), (35, 0.074), (36, 0.046), (37, -0.001), (38, -0.032), (39, 0.013), (40, -0.062), (41, 0.21), (42, 0.214), (43, -0.125), (44, -0.077), (45, -0.12), (46, 0.045), (47, 0.265), (48, 0.054), (49, -0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.93784702 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations
Author: Charles L. Isbell, Michael P. Holmes, Alexander G. Gray
Abstract: Machine learning contains many computational bottlenecks in the form of nested summations over datasets. Kernel estimators and other methods are burdened by these expensive computations. Exact evaluation is typically O(n2 ) or higher, which severely limits application to large datasets. We present a multi-stage stratified Monte Carlo method for approximating such summations with probabilistic relative error control. The essential idea is fast approximation by sampling in trees. This method differs from many previous scalability techniques (such as standard multi-tree methods) in that its error is stochastic, but we derive conditions for error control and demonstrate that they work. Further, we give a theoretical sample complexity for the method that is independent of dataset size, and show that this appears to hold in experiments, where speedups reach as high as 1014 , many orders of magnitude beyond the previous state of the art. 1
2 0.54692656 125 nips-2007-Markov Chain Monte Carlo with People
Author: Adam Sanborn, Thomas L. Griffiths
Abstract: Many formal models of cognition implicitly use subjective probability distributions to capture the assumptions of human learners. Most applications of these models determine these distributions indirectly. We propose a method for directly determining the assumptions of human learners by sampling from subjective probability distributions. Using a correspondence between a model of human choice and Markov chain Monte Carlo (MCMC), we describe a method for sampling from the distributions over objects that people associate with different categories. In our task, subjects choose whether to accept or reject a proposed change to an object. The task is constructed so that these decisions follow an MCMC acceptance rule, defining a Markov chain for which the stationary distribution is the category distribution. We test this procedure for both artificial categories acquired in the laboratory, and natural categories acquired from experience. 1
3 0.53893876 178 nips-2007-Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains
Author: Andrea Lecchini-visintini, John Lygeros, Jan Maciejowski
Abstract: Simulated annealing is a popular method for approaching the solution of a global optimization problem. Existing results on its performance apply to discrete combinatorial optimization where the optimization variables can assume only a finite set of possible values. We introduce a new general formulation of simulated annealing which allows one to guarantee finite-time performance in the optimization of functions of continuous variables. The results hold universally for any optimization problem on a bounded domain and establish a connection between simulated annealing and up-to-date theory of convergence of Markov chain Monte Carlo methods on continuous domains. This work is inspired by the concept of finite-time learning with known accuracy and confidence developed in statistical learning theory. Optimization is the general problem of finding a value of a vector of variables θ that maximizes (or minimizes) some scalar criterion U (θ). The set of all possible values of the vector θ is called the optimization domain. The elements of θ can be discrete or continuous variables. In the first case the optimization domain is usually finite, such as in the well-known traveling salesman problem; in the second case the optimization domain is a continuous set. An important example of a continuous optimization domain is the set of 3-D configurations of a sequence of amino-acids in the problem of finding the minimum energy folding of the corresponding protein [1]. In principle, any optimization problem on a finite domain can be solved by an exhaustive search. However, this is often beyond computational capacity: the optimization domain of the traveling salesman problem with 100 cities contains more than 10155 possible tours. An efficient algorithm to solve the traveling salesman and many similar problems has not yet been found and such problems remain reliably solvable only in principle [2]. Statistical mechanics has inspired widely used methods for finding good approximate solutions in hard discrete optimization problems which defy efficient exact solutions [3, 4, 5, 6]. Here a key idea has been that of simulated annealing [3]: a random search based on the Metropolis-Hastings algorithm, such that the distribution of the elements of the domain visited during the search converges to an equilibrium distribution concentrated around the global optimizers. Convergence and finite-time performance of simulated annealing on finite domains has been evaluated in many works, e.g. [7, 8, 9, 10]. On continuous domains, most popular optimization methods perform a local gradient-based search and in general converge to local optimizers; with the notable exception of convex criteria where convergence to the unique global optimizer occurs [11]. Simulated annealing performs a global search and can be easily implemented on continuous domains. Hence it can be considered a powerful complement to local methods. In this paper, we introduce for the first time rigorous guarantees on the finite-time performance of simulated annealing on continuous domains. We will show that it is possible to derive simulated annealing algorithms which, with an arbitrarily high level of confidence, find an approximate solution to the problem of optimizing a function of continuous variables, within a specified tolerance to the global optimal solution after a known finite number of steps. 
Rigorous guarantees on the finite-time performance of simulated annealing in the optimization of functions of continuous variables have never been obtained before; the only results available state that simulated annealing converges to a global optimizer as the number of steps grows to infinity, e.g. [12, 13, 14, 15]. The background of our work is twofold. On the one hand, our notion of approximate solution to a global optimization problem is inspired by the concept of finite-time learning with known accuracy and confidence developed in statistical learning theory [16, 17]. We actually maintain an important aspect of statistical learning theory which is that we do not introduce any particular assumption on the optimization criterion, i.e. our results hold regardless of what U is. On the other hand, we ground our results on the theory of convergence, with quantitative bounds on the distance to the target distribution, of the Metropolis-Hastings algorithm and Markov Chain Monte Carlo (MCMC) methods, which has been one of the main achievements of recent research in statistics [18, 19, 20, 21]. In this paper, we will not develop any ready-to-use optimization algorithm. We will instead introduce a general formulation of the simulated annealing method which allows one to derive new simulated annealing algorithms with rigorous finite-time guarantees on the basis of existing theory. The Metropolis-Hastings algorithm and the general family of MCMC methods have many degrees of freedom. The choice and comparison of specific algorithms goes beyond the scope of the paper. The paper is organized in the following sections. In Simulated annealing we introduce the method and fix the notation. In Convergence we recall the reasons why finite-time guarantees for simulated annealing on continuous domains have not been obtained before. In Finite-time guarantees we present the main result of the paper. In Conclusions we state our findings and conclude the paper. 1 Simulated annealing The original formulation of simulated annealing was inspired by the analogy between the stochastic evolution of the thermodynamic state of an annealing material towards the configurations of minimal energy and the search for the global minimum of an optimization criterion [3]. In the procedure, the optimization criterion plays the role of the energy and the state of the annealed material is simulated by the evolution of the state of an inhomogeneous Markov chain. The state of the chain evolves according to the Metropolis-Hastings algorithm in order to simulate the Boltzmann distribution of thermodynamic equilibrium. The Boltzmann distribution is simulated for a decreasing sequence of temperatures (“cooling”). The target distribution of the cooling procedure is the limiting Boltzmann distribution, for the temperature that tends to zero, which takes non-zero values only on the set of global minimizers [7]. The original formulation of the method was for a finite domain. However, simulated annealing can be generalized straightforwardly to a continuous domain because the Metropolis-Hastings algorithm can be used with almost no differences on discrete and continuous domains The main difference is that on a continuous domain the equilibrium distributions are specified by probability densities. 
On a continuous domain, Markov transition kernels in which the distribution of the elements visited by the chain converges to an equilibrium distribution with the desired density can be constructed using the Metropolis-Hastings algorithm and the general family of MCMC methods [22]. We point out that Boltzmann distributions are not the only distributions which can be adopted as equilibrium distributions in simulated annealing [7]. In this paper it is convenient for us to adopt a different type of equilibrium distribution in place of Boltzmann distributions. 1.1 Our setting The optimization criterion is U : Θ → [0, 1], with Θ ⊂ RN . The assumption that U takes values in the interval [0, 1] is a technical one. It does not imply any serious loss of generality. In general, any bounded optimization criterion can be scaled to take values in [0, 1]. We assume that the optimization task is to find a global maximizer; this can be done without loss of generality. We also assume that Θ is a bounded set. We consider equilibrium distributions defined by probability density functions proportional to [U (θ) + δ]J where J and δ are two strictly positive parameters. We use π (J) to denote an equilibrium distribution, i.e. π (J) (dθ) ∝ [U (θ) + δ]J πLeb (dθ) where πLeb is the standard Lebesgue measure. Here, J −1 plays the role of the temperature: if the function U (θ) plus δ is taken to a positive power J then as J increases (i.e. as J −1 decreases) [U (θ) + δ]J becomes increasingly peaked around the global maximizers. The parameter δ is an offset which guarantees that the equilibrium densities are always strictly positive, even if U takes zero values on some elements of the domain. The offset δ is chosen by the user and we show later that our results allow one to make an optimal selection of δ. The zero-temperature distribution is the limiting distribution, for J → ∞, which takes non-zero values only on the set of global maximizers. It is denoted by π (∞) . In the generic formulation of the method, the Markov transition kernel of the k-th step of the inhomogeneous chain has equilibrium distribution π (Jk ) where {Jk }k=1,2,... is the “cooling schedule”. The cooling schedule is a non-decreasing sequence of positive numbers according to which the equilibrium distribution become increasingly sharpened during the evolution of the chain. We use θk to denote the state of the chain and Pθk to denote its probability distribution. The distribution Pθk obviously depends on the initial condition θ0 . However, in this work, we don’t need to make this dependence explicit in the notation. Remark 1: If, given an element θ in Θ, the value U (θ) can be computed directly, we say that U is a deterministic criterion, e.g. the energy landscape in protein structure prediction [1]. In problems involving random variables, the value U (θ) may be the expected value U (θ) = g(x, θ)px (x; θ)dx of some function g which depends on both the optimization variable θ, and on some random variable x which has probability density px (x; θ) (which may itself depend on θ). In such problems it is usually not possible to compute U (θ) directly, either because evaluation of the integral requires too much computation, or because no analytical expression for px (x; θ) is available. Typically one must perform stochastic simulations in order to obtain samples of x for a given θ, hence obtain sample values of g(x, θ), and thus construct a Monte Carlo estimate of U (θ). 
The Bayesian design of clinical trials is an important application area where such expected-value criteria arise [23]. The authors of this paper investigate the optimization of expected-value criteria motivated by problems of aircraft routing [24]. In the particular case that px (x; θ) does not depend on θ, the optimization task is often called “empirical risk minimization”, and is studied extensively in statistical learning theory [16, 17]. The results of this paper apply in the same way to the optimization of both deterministic and expected-value criteria. The MCMC method developed by M¨ ller [25, 26] allows one u to construct simulated annealing algorithms for the optimization of expected-value criteria. M¨ ller u [25, 26] employs the same equilibrium distributions as those described in our setting; in his context J is restricted to integer values. 2 Convergence The rationale of simulated annealing is as follows: if the temperature is kept constant, say Jk = J, then the distribution of the state of the chain Pθk tends to the equilibrium distribution π (J) ; if J → ∞ then the equilibrium distribution π (J) tends to the zero-temperature distribution π (∞) ; as a result, if the cooling schedule Jk tends to infinity, one obtains that Pθk “follows” π (Jk ) and that π (Jk ) tends to π (∞) and eventually that the distribution of the state of the chain Pθk tends to π (∞) . The theory shows that, under conditions on the cooling schedule and the Markov transition kernels, the distribution of the state of the chain Pθk actually converges to the target zero-temperature distribution π (∞) as k → ∞ [12, 13, 14, 15]. Convergence to the zero-temperature distribution implies that asymptotically the state of the chain eventually coincides with a global optimizer with probability one. The difficulty which must be overcome in order to obtain finite step results on simulated annealing algorithms on a continuous domain is that usually, in an optimization problem defined over continuous variables, the set of global optimizers has zero Lebesgue measure (e.g. a set of isolated points). If the set of global optimizers has zero measure then the set of global optimizers has null probability according to the equilibrium distributions π (J) for any finite J and, as a consequence, according to the distributions Pθk for any finite k. Put another way, the probability that the state of the chain visits the set of global optimizers is constantly zero after any finite number of steps. Hence the confidence of the fact that the solution provided by the algorithm in finite time coincides with a global optimizer is also constantly zero. Notice that this is not the case for a finite domain, where the set of global optimizers is of non-null measure with respect to the reference counting measure [7, 8, 9, 10]. It is instructive to look at the issue also in terms of the rate of convergence to the target zerotemperature distribution. On a discrete domain, the distribution of the state of the chain at each step and the zero-temperature distribution are both standard discrete distributions. It is then possible to define a distance between them and study the rate of convergence of this distance to zero. This analysis allows one to obtain results on the finite-time behavior of simulated annealing [7, 8]. On a continuous domain and for a set of global optimizers of measure zero, the target zero-temperature distribution π (∞) ends up being a mixture of probability masses on the set of global optimizers. 
In this situation, although the distribution of the state of the chain Pθk still converges asymptotically to π (∞) , it is not possible to introduce a sensible distance between the two distributions and a rate of convergence to the target distribution cannot even be defined (weak convergence), see [12, Theorem 3.3]. This is the reason that until now there have been no guarantees on the performance of simulated annealing on a continuous domain after a finite number of computations: by adopting the zero-temperature distribution π (∞) as the target distribution it is only possible to prove asymptotic convergence in infinite time to a global optimizer. Remark 2: The standard distance between two distributions, say µ1 and µ2 , on a continuous support is the total variation norm µ1 − µ2 T V = supA |µ1 (A) − µ2 (A)|, see e.g. [21]. In simulated annealing on a continuous domain the distribution of the state of the chain Pθk is absolutely continuous with respect to the Lebesgue measure (i.e. πLeb (A) = 0 ⇒ Pθk (A) = 0), by construction for any finite k. Hence if the set of global optimizers has zero Lebesgue measure then it has zero measure also according to Pθk . The set of global optimizers has however measure 1 according to π (∞) . The distance Pθk − π (∞) T V is then constantly 1 for any finite k. It is also worth mentioning that if the set of global optimizers has zero measure then asymptotic convergence to the zero-temperature distribution π (∞) can be proven only under the additional assumptions of continuity and differentiability of U [12, 13, 14, 15]. 3 Finite-time guarantees In general, optimization algorithms for problems defined on continuous variables can only find approximate solutions in finite time [27]. Given an element θ of a continuous domain how can we assess how good it is as an approximate solution to an optimization problem? Here we introduce the concept of approximate global optimizer to answer this question. The definition is given for a maximization problem in a continuous but bounded domain. We use two parameters: the value imprecision ǫ (greater than or equal to 0) and the residual domain α (between 0 and 1) which together determine the level of approximation. We say that θ is an approximate global optimizer of U with value imprecision ǫ and residual domain α if the function U takes values strictly greater than U (θ) + ǫ only on a subset of values of θ no larger than an α portion of the optimization domain. The formal definition is as follows. Definition 1 Let U : Θ → R be an optimization criterion where Θ ⊂ RN is bounded. Let πLeb denote the standard Lebesgue measure. Let ǫ ≥ 0 and α ∈ [0, 1] be given numbers. Then θ is an approximate global optimizer of U with value imprecision ǫ and residual domain α if πLeb {θ′ ∈ Θ : U (θ′ ) > U (θ) + ǫ} ≤ α πLeb (Θ) . In other words, the value U (θ) is within ǫ of a value which is greater than the values that U takes on at least a 1 − α portion of the domain. The smaller ǫ and α are, the better is the approximation of a true global optimizer. If both α and ǫ are equal to zero then U (θ) coincides with the essential supremum of U . Our definition of approximate global optimizer carries an important property, which holds regardless of what the criterion U is: if ǫ and α have non-zero values then the set of approximate global optimizers always has non-zero Lebesgue measure. It follows that the probability that the chain visits the set of approximate global optimizers can be non-zero. 
Hence, it is sensible to study the confidence of the fact that the solution found by simulated annealing in finite time is an approximate global optimizer. Remark 3: The intuition that our notion of approximate global optimizer can be used to obtain formal guarantees on the finite-time performance of optimization methods based on a stochastic search of the domain is already apparent in the work of Vidyasagar [17, 28]. Vidyasagar [17, 28] introduces a similar definition and obtains rigorous finite-time guarantees in the optimization of ex- pected value criteria based on uniform independent sampling of the domain. Notably, the number of independent samples required to guarantee some desired accuracy and confidence turns out to be polynomial in the values of the desired imprecision, residual domain and confidence. Although the method of Vidyasagar is not highly sophisticated, it has had considerable success in solving difficult control system design applications [28, 29]. Its appeal stems from its rigorous finite-time guarantees which exist without the need for any particular assumption on the optimization criterion. Here we show that finite-time guarantees for simulated annealing can be obtained by selecting a distribution π (J) with a finite J as the target distribution in place of the zero-temperature distribution π (∞) . The fundamental result is the following theorem which allows one to select in a rigorous way δ and J in the target distribution π (J) . It is important to stress that the result holds universally for any optimization criterion U on a bounded domain. The only minor requirement is that U takes values in [0, 1]. Theorem 1 Let U : Θ → [0, 1] be an optimization criterion where Θ ⊂ RN is bounded. Let J ≥ 1 and δ > 0 be given numbers. Let θ be a multivariate random variable with distribution π (J) (dθ) ∝ [U (θ) + δ]J πLeb (dθ). Let α ∈ (0, 1] and ǫ ∈ [0, 1] be given numbers and define σ= 1 1+δ 1+ ǫ+1+δ J 1 1+δ 1+δ −1 α ǫ+δ δ . (1) Then the statement “θ is an approximate global optimizer of U with value imprecision ǫ and residual domain α” holds with probability at least σ. Proof. See Appendix A. The importance of the choice of a target distribution π (J) with a finite J is that π (J) is absolutely continuous with respect to the Lebesgue measure. Hence, the distance Pθk − π (J) TV between the distribution of the state of the chain Pθk and the target distribution π (J) is a meaningful quantity. Convergence of the Metropolis-Hastings algorithm and MCMC methods in total variation norm is a well studied problem. The theory provides simple conditions under which one derives upper bounds on the distance to the target distribution which are known at each step of the chain and decrease monotonically to zero as the number of steps of the chain grows. The theory has been developed mainly for homogeneous chains [18, 19, 20, 21]. In the case of simulated annealing, the factor that enables us to employ these results is the absolute continuity of the target distribution π (J) with respect to the Lebesgue measure. However, simulated annealing involves the simulation of inhomogeneous chains. In this respect, another important fact is that the choice of a target distribution π (J) with a finite J implies that the inhomogeneous Markov chain can in fact be formed by a finite sequence of homogeneous chains (i.e. the cooling schedule {Jk }k=1,2,... can be chosen to be a sequence that takes only a finite set of values). 
In turn, this allows one to apply the theory of homogeneous MCMC methods to study the convergence of Pθk to π (J) in total variation norm. On a bounded domain, simple conditions on the ‘proposal distribution’ in the iteration of the simulated annealing algorithm allows one to obtain upper bounds on Pθk − π (J) TV that decrease geometrically to zero as k → ∞, without the need for any additional assumption on U [18, 19, 20, 21]. It is then appropriate to introduce the following finite-time result. Theorem 2 Let the notation and assumptions of Theorem 1 hold. Let θk , with distribution Pθk , be the state of the inhomogeneous chain of a simulated annealing algorithm with target distribution π (J) . Then the statement “θk is an approximate global optimizer of U with value imprecision ǫ and residual domain α” holds with probability at least σ − Pθk − π (J) TV . The proof of the theorem follows directly from the definition of the total variation norm. It follows that if simulated annealing is implemented with an algorithm which converges in total variation distance to a target distribution π (J) with a finite J, then one can state with confidence arbitrarily close to 1 that the solution found by the algorithm after the known appropriate finite number of steps is an approximate global optimizer with the desired approximation level. For given non-zero values of ǫ, α the value of σ given by (1) can be made arbitrarily close to 1 by choice of J; while the distance Pθk − π (J) TV can be made arbitrarily small by taking the known sufficient number of steps. It can be shown that there exists the possibility of making an optimal choice of δ and J in the target distribution π (J) . In fact, for given ǫ and α and a given value of J there exists an optimal choice of δ which maximizes the value of σ given by (1). Hence, it is possible to obtain a desired σ with the smallest possible J. The advantage of choosing the smallest J, consistent with the required approximation and confidence, is that it will decrease the number of steps required to achieve the desired reduction of Pθk − π (J) TV . 4 Conclusions We have introduced a new formulation of simulated annealing which admits rigorous finite-time guarantees in the optimization of functions of continuous variables. First, we have introduced the notion of approximate global optimizer. Then, we have shown that simulated annealing is guaranteed to find approximate global optimizers, with the desired confidence and the desired level of accuracy, in a known finite number of steps, if a proper choice of the target distribution is made and conditions for convergence in total variation norm are met. The results hold for any optimization criterion on a bounded domain with the only minor requirement that it takes values between 0 and 1. In this framework, simulated annealing algorithms with rigorous finite-time guarantees can be derived by studying the choice of the proposal distribution and of the cooling schedule, in the generic iteration of simulated annealing, in order to ensure convergence to the target distribution in total variation norm. To do this, existing theory of convergence of the Metropolis-Hastings algorithm and MCMC methods on continuous domains can be used [18, 19, 20, 21]. 
Vidyasagar [17, 28] has introduced a similar definition of approximate global optimizer and has shown that approximate optimizers with desired accuracy and confidence can be obtained with a number of uniform independent samples of the domain which is polynomial in the accuracy and confidence parameters. In general, algorithms developed with the MCMC methodology can be expected to be equally or more efficient than uniform independent sampling. Acknowledgments Work supported by EPSRC, Grant EP/C014006/1, and by the European Commission under projects HYGEIA FP6-NEST-4995 and iFly FP6-TREN-037180. We thank S. Brooks, M. Vidyasagar and D. M. Wolpert for discussions and useful comments on the paper. A Proof of Theorem 1 Let α ∈ (0, 1] and ρ ∈ (0, 1] be given numbers. Let Uδ (θ) := U (θ) + δ. Let πδ be a normalized ¯ measure such that πδ (dθ) ∝ Uδ (θ)πLeb (dθ). In the first part of the proof we find a lower bound on the probability that θ belongs to the set {θ ∈ Θ : πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ≤ α} . ¯ Let yα := inf{y : πδ {θ ∈ Θ : Uδ (θ) ≤ y} ≥ 1 − α}. To start with we show that the set ¯ ¯ {θ ∈ Θ : πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ≤ α} coincides with {θ ∈ Θ : Uδ (θ) ≥ ρ yα }. Notice ¯ ¯ that the quantity πδ {θ ∈ Θ : Uδ (θ) ≤ y} is a right-continuous non-decreasing function of y because it has the form of a distribution function (see e.g. [30, p.162] and [17, Lemma 11.1]). Therefore we have πδ {θ ∈ Θ : Uδ (θ) ≤ yα } ≥ 1 − α and ¯ ¯ y ≥ ρ yα ⇒ πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) ≤ y} ≥ 1 − α ⇒ πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > y} ≤ α . ¯ ¯ ¯ Moreover, y < ρ yα ⇒ πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) ≤ y} < 1 − α ⇒ πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > y} > α ¯ ¯ ¯ and taking the contrapositive one obtains πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > y} ≤ α ⇒ y ≥ ρ yα . ¯ ¯ ′ Therefore {θ ∈ Θ : Uδ (θ) ≥ ρ yα } ≡ {θ ∈ Θ : πδ {θ ∈ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ≤ α}. ¯ ¯ We now derive a lower bound on π (J) {θ ∈ Θ : Uδ (θ) ≥ ρ yα }. Let us introduce the notation ¯ ¯¯ Aα := {θ ∈ Θ : Uδ (θ) < yα }, Aα := {θ ∈ Θ : Uδ (θ) ≥ yα }, Bα,ρ := {θ ∈ Θ : Uδ (θ) < ρ yα } ¯ ¯ ¯ ¯ ¯ ¯α,ρ := {θ ∈ Θ : Uδ (θ) ≥ ρ yα }. Notice that Bα,ρ ⊆ Aα and Aα ⊆ Bα,ρ . The quantity ¯¯ ¯¯ and B ¯ ¯ ¯ ¯ πδ {θ ∈ Θ : Uδ (θ) < y} as a function of y is the left-continuous version of πδ {θ ∈ Θ : Uδ (θ) ≤ ¯¯ y}[30, p.162]. Hence, the definition of yα implies πδ (Aα ) ≤ 1 − α and πδ (Aα ) ≥ α. Notice that ¯ ¯ ¯ ¯ δπLeb (Aα ) ¯ ≤ 1−α, ¯ πδ (Aα ) ≤ 1 − α ⇒ ¯ ¯ Uδ (θ)πLeb (dθ) Θ ¯¯ πδ (Aα ) ≥ α ¯ ¯¯ (1 + δ)πLeb (Aα ) ≥ α. ¯ U (θ)πLeb (dθ) Θ δ ⇒ ¯¯ Hence, πLeb (Aα ) > 0 and πLeb (Aα ) 1−α1+δ ¯ ¯ . ¯α ) ≤ α ¯ δ πLeb (A ¯ ¯¯ ¯¯ Notice that πLeb (Aα ) > 0 implies πLeb (Bα,ρ ) > 0. We obtain π (J) {θ ∈ Θ : Uδ (θ) ≥ ρ yα } = ¯ 1+ 1 ≥ Uδ (θ)J πLeb (dθ) Bα,ρ ¯ ¯¯ Bα,ρ ≥ 1+ J ρ J yα ¯ J yα ¯ Uδ (θ)J πLeb (dθ) 1+ 1 Uδ (θ)J πLeb (dθ) Bα,ρ ¯ ¯¯ Aα Uδ (θ)J πLeb (dθ) 1 1 1 ≥ ≥ . 1−α1+δ ¯ πLeb (Aα ) πLeb (Bα,ρ ) ¯ ¯ 1 + ρJ 1 + ρJ ¯¯ ¯¯ α ¯ δ πLeb (Aα ) πLeb (Aα ) Since {θ ∈ Θ : Uδ (θ) ≥ ρ yα } ≡ {θ ∈ Θ : πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ≤ α} the first part of ¯ ¯ the proof is complete. In the second part of the proof we show that the set {θ ∈ Θ : πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ≤ α} is contained in the set of approximate global optimizers of U with value imprecision ¯ ǫ := (ρ−1 − 1)(1 + δ) and residual domain α := 1+δ α. Hence, we show that {θ ∈ Θ : πδ {θ′ ∈ ˜ ˜ ǫ+δ ¯ ˜ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ≤ α} ⊆ {θ ∈ Θ : πLeb {θ′ ∈ Θ : U (θ′ ) > U (θ) + ǫ} ≤ α πLeb (Θ)}. 
We have

  U(θ′) > U(θ) + ǫ̃ ⇔ ρ U_δ(θ′) > ρ [U_δ(θ) + ǫ̃] ⇒ ρ U_δ(θ′) > U_δ(θ),

where the last implication is proven by noticing that ρ [U_δ(θ) + ǫ̃] ≥ U_δ(θ) ⇔ 1 − ρ ≥ U(θ)(1 − ρ), which holds because U(θ) ∈ [0, 1]. Hence

  {θ′ ∈ Θ : ρ U_δ(θ′) > U_δ(θ)} ⊇ {θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃}.

Therefore π̄_δ{θ′ ∈ Θ : ρ U_δ(θ′) > U_δ(θ)} ≤ ᾱ ⇒ π̄_δ{θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃} ≤ ᾱ. Let Q_{θ,ǫ̃} := {θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃} and notice that

  π̄_δ{θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃} = [∫_{Q_{θ,ǫ̃}} U(θ′) π_Leb(dθ′) + δ π_Leb(Q_{θ,ǫ̃})] / [∫_Θ U(θ′) π_Leb(dθ′) + δ π_Leb(Θ)].

We obtain

  π̄_δ{θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃} ≤ ᾱ ⇒ ǫ̃ π_Leb(Q_{θ,ǫ̃}) + δ π_Leb(Q_{θ,ǫ̃}) ≤ ᾱ (1 + δ) π_Leb(Θ)
    ⇒ π_Leb{θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃} ≤ α̃ π_Leb(Θ).

Hence we can conclude that

  π̄_δ{θ′ ∈ Θ : ρ U_δ(θ′) > U_δ(θ)} ≤ ᾱ ⇒ π_Leb{θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃} ≤ α̃ π_Leb(Θ),

and the second part of the proof is complete. We have shown that, given ᾱ ∈ (0, 1], ρ ∈ (0, 1], ǫ̃ := (ρ⁻¹ − 1)(1 + δ), α̃ := ᾱ (1 + δ)/(ǫ̃ + δ) and

  σ := [1 + ρ^J · (1 − ᾱ)/ᾱ · (1 + δ)/δ]⁻¹
    = [1 + ((1 + δ)/(ǫ̃ + 1 + δ))^J · ((1 + δ)/(ǫ̃ + δ) · α̃⁻¹ − 1) · (1 + δ)/δ]⁻¹,

the statement "θ is an approximate global optimizer of U with value imprecision ǫ̃ and residual domain α̃" holds with probability at least σ. Notice that ǫ̃ ∈ [0, 1] and α̃ ∈ (0, 1] are linked through a bijective relation to ρ ∈ [(1 + δ)/(2 + δ), 1] and ᾱ ∈ (0, (ǫ̃ + δ)/(1 + δ)]. The statement of the theorem is eventually obtained by expressing σ as a function of the desired ǫ = ǫ̃ and α = α̃. □
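As a numerical illustration of the closed form for σ derived above, the following sketch evaluates σ(ǫ, α, δ, J) and scans for the δ that maximizes it for fixed ǫ, α and J, as discussed in the text; all numerical values are hypothetical.

```python
import numpy as np

def sigma(eps, alpha, delta, J):
    """Confidence sigma as a function of the desired value imprecision eps,
    residual domain alpha, regularization delta and exponent J."""
    rho = (1 + delta) / (eps + 1 + delta)
    odds = (1 + delta) / ((eps + delta) * alpha) - 1.0
    return 1.0 / (1.0 + rho**J * odds * (1 + delta) / delta)

# Illustrative numbers only: for fixed eps, alpha and J, find the delta
# that maximizes sigma.
eps, alpha, J = 0.01, 0.01, 5000
deltas = np.linspace(1e-4, 1.0, 1000)
best_delta = max(deltas, key=lambda d: sigma(eps, alpha, d, J))
print(best_delta, sigma(eps, alpha, best_delta, J))
```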
References

[1] D. J. Wales. Energy Landscapes. Cambridge University Press, Cambridge, UK, 2003.
[2] D. Achlioptas, A. Naor, and Y. Peres. Rigorous location of phase transitions in hard optimization problems. Nature, 435:759–764, 2005.
[3] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by Simulated Annealing. Science, 220(4598):671–680, 1983.
[4] E. Bonomi and J. Lutton. The N-city travelling salesman problem: statistical mechanics and the Metropolis algorithm. SIAM Rev., 26(4):551–568, 1984.
[5] Y. Fu and P. W. Anderson. Application of statistical mechanics to NP-complete problems in combinatorial optimization. J. Phys. A: Math. Gen., 19(9):1605–1620, 1986.
[6] M. Mézard, G. Parisi, and R. Zecchina. Analytic and Algorithmic Solution of Random Satisfiability Problems. Science, 297:812–815, 2002.
[7] P. M. J. van Laarhoven and E. H. L. Aarts. Simulated Annealing: Theory and Applications. D. Reidel Publishing Company, Dordrecht, Holland, 1987.
[8] D. Mitra, F. Romeo, and A. Sangiovanni-Vincentelli. Convergence and finite-time behavior of simulated annealing. Adv. Appl. Prob., 18:747–771, 1986.
[9] B. Hajek. Cooling schedules for optimal annealing. Math. Oper. Res., 13:311–329, 1988.
[10] J. Hannig, E. K. P. Chong, and S. R. Kulkarni. Relative Frequencies of Generalized Simulated Annealing. Math. Oper. Res., 31(1):199–216, 2006.
[11] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK, 2004.
[12] H. Haario and E. Saksman. Simulated annealing process in general state space. Adv. Appl. Prob., 23:866–893, 1991.
[13] S. B. Gelfand and S. K. Mitter. Simulated Annealing Type Algorithms for Multivariate Optimization. Algorithmica, 6:419–436, 1991.
[14] C. Tsallis and D. A. Stariolo. Generalized simulated annealing. Physica A, 233:395–406, 1996.
[15] M. Locatelli. Simulated Annealing Algorithms for Continuous Global Optimization: Convergence Conditions. J. Optimiz. Theory App., 104(1):121–133, 2000.
[16] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, US, 1995.
[17] M. Vidyasagar. Learning and Generalization: With Applications to Neural Networks. Springer-Verlag, London, second edition, 2003.
[18] S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Springer-Verlag, London, 1993.
[19] J. S. Rosenthal. Minorization Conditions and Convergence Rates for Markov Chain Monte Carlo. J. Am. Stat. Assoc., 90(430):558–566, 1995.
[20] K. L. Mengersen and R. L. Tweedie. Rates of convergence of the Hastings and Metropolis algorithms. Ann. Stat., 24(1):101–121, 1996.
[21] G. O. Roberts and J. S. Rosenthal. General state space Markov chains and MCMC algorithms. Prob. Surv., 1:20–71, 2004.
[22] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer-Verlag, New York, second edition, 2004.
[23] D. J. Spiegelhalter, K. R. Abrams, and J. P. Myles. Bayesian approaches to clinical trials and health-care evaluation. John Wiley & Sons, Chichester, UK, 2004.
[24] A. Lecchini-Visintini, W. Glover, J. Lygeros, and J. M. Maciejowski. Monte Carlo Optimization for Conflict Resolution in Air Traffic Control. IEEE Trans. Intell. Transp. Syst., 7(4):470–482, 2006.
[25] P. Müller. Simulation based optimal design. In J. O. Berger, J. M. Bernardo, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 6: Proceedings of the Sixth Valencia International Meeting, pages 459–474. Oxford: Clarendon Press, 1999.
[26] P. Müller, B. Sansó, and M. De Iorio. Optimal Bayesian design by Inhomogeneous Markov Chain Simulation. J. Am. Stat. Assoc., 99(467):788–798, 2004.
[27] L. Blum, F. Cucker, M. Shub, and S. Smale. Complexity and Real Computation. Springer-Verlag, New York, 1998.
[28] M. Vidyasagar. Randomized algorithms for robust controller synthesis using statistical learning theory. Automatica, 37(10):1515–1528, 2001.
[29] R. Tempo, G. Calafiore, and F. Dabbene. Randomized Algorithms for Analysis and Control of Uncertain Systems. Springer-Verlag, London, 2005.
[30] B. V. Gnedenko. Theory of Probability. Chelsea, New York, fourth edition, 1968.
4 0.47371963 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC
Author: Matthew Hoffman, Arnaud Doucet, Nando de Freitas, Ajay Jasra
Abstract: A recently proposed formulation of the stochastic planning and control problem as one of parameter estimation for suitable artificial statistical models has led to the adoption of inference algorithms for this notoriously hard problem. At the algorithmic level, the focus has been on developing Expectation-Maximization (EM) algorithms. In this paper, we begin by making the crucial observation that the stochastic control problem can be reinterpreted as one of trans-dimensional inference. With this new interpretation, we are able to propose a novel reversible jump Markov chain Monte Carlo (MCMC) algorithm that is more efficient than its EM counterparts. Moreover, it enables us to implement full Bayesian policy search, without the need for gradients and with one single Markov chain. The new approach involves sampling directly from a distribution that is proportional to the reward and, consequently, performs better than classic simulation methods in situations where the reward is a rare event.
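As a loose illustration of the reward-proportional sampling idea (not the paper's reversible jump algorithm, which also moves across dimensions), one can run a fixed-dimension Metropolis-Hastings chain over policy parameters whose stationary distribution is proportional to a positive expected reward; the `expected_reward` function below is a hypothetical stand-in for a simulator.

```python
import numpy as np

def reward_mh(expected_reward, dim, n_iter, step=0.2, rng=None):
    """Metropolis-Hastings over policy parameters whose stationary
    distribution is proportional to a strictly positive expected reward."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.zeros(dim)
    r = expected_reward(theta)
    samples = []
    for _ in range(n_iter):
        prop = theta + step * rng.standard_normal(dim)
        r_prop = expected_reward(prop)
        if rng.uniform() < min(1.0, r_prop / r):  # symmetric proposal
            theta, r = prop, r_prop
        samples.append(theta)
    return np.array(samples)

# Hypothetical smooth reward surface, for illustration only.
samples = reward_mh(lambda th: np.exp(-np.sum(th**2)), dim=3, n_iter=5000)
```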
5 0.47310451 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)
Author: John Wright, Yangyu Tao, Zhouchen Lin, Yi Ma, Heung-yeung Shum
Abstract: We present a simple new criterion for classification, based on principles from lossy data compression. The criterion assigns a test sample to the class that uses the minimum number of additional bits to code the test sample, subject to an allowable distortion. We prove asymptotic optimality of this criterion for Gaussian data and analyze its relationships to classical classifiers. Theoretical results provide new insights into relationships among popular classifiers such as MAP and RDA, as well as unsupervised clustering methods based on lossy compression [13]. Minimizing the lossy coding length induces a regularization effect which stabilizes the (implicit) density estimate in a small-sample setting. Compression also provides a uniform means of handling classes of varying dimension. This simple classification criterion and its kernel and local versions perform competitively against existing classifiers on both synthetic examples and real imagery data such as handwritten digits and human faces, without requiring domain-specific information. 1
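A rough sketch of the minimum-incremental-coding-length rule described in this abstract: using a Gaussian lossy coding length of the form (m + d)/2 · log₂ det(I + (d/ε²) Σ̂) — a zero-mean simplification that may differ in detail from the paper's exact expression — a test sample is assigned to the class whose coding length grows least when the sample is appended.

```python
import numpy as np

def coding_length(X, eps=1.0):
    """Approximate bits to code the columns of X (d x m) up to mean
    squared distortion eps^2, assuming zero-mean Gaussian data."""
    d, m = X.shape
    cov = X @ X.T / m
    return 0.5 * (m + d) * np.log2(np.linalg.det(np.eye(d) + (d / eps**2) * cov))

def micl_predict(x, class_data, priors=None):
    """Assign x to the class whose coding length grows least when x is added."""
    best, best_cost = None, np.inf
    for label, X in class_data.items():
        delta = coding_length(np.column_stack([X, x])) - coding_length(X)
        if priors is not None:
            delta -= np.log2(priors[label])  # bits to code the class label
        if delta < best_cost:
            best, best_cost = label, delta
    return best
```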
7 0.4391171 123 nips-2007-Loop Series and Bethe Variational Bounds in Attractive Graphical Models
8 0.43564367 174 nips-2007-Selecting Observations against Adversarial Objectives
9 0.43086404 152 nips-2007-Parallelizing Support Vector Machines on Distributed Computers
10 0.42173064 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models
11 0.38809267 188 nips-2007-Subspace-Based Face Recognition in Analog VLSI
12 0.38618037 89 nips-2007-Feature Selection Methods for Improving Protein Structure Prediction with Rosetta
13 0.37241319 139 nips-2007-Nearest-Neighbor-Based Active Learning for Rare Category Detection
15 0.35125259 97 nips-2007-Hidden Common Cause Relations in Relational Learning
16 0.34491533 157 nips-2007-Privacy-Preserving Belief Propagation and Sampling
17 0.33720139 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis
18 0.3348543 82 nips-2007-Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization
19 0.31269509 121 nips-2007-Local Algorithms for Approximate Inference in Minor-Excluded Graphs
20 0.3044627 28 nips-2007-Augmented Functional Time Series Representation and Forecasting with Gaussian Processes
topicId topicWeight
[(5, 0.034), (9, 0.137), (13, 0.038), (16, 0.037), (21, 0.086), (31, 0.031), (34, 0.028), (35, 0.041), (47, 0.083), (49, 0.036), (67, 0.127), (83, 0.122), (85, 0.026), (87, 0.025), (90, 0.044)]
simIndex simValue paperId paperTitle
1 0.81182212 96 nips-2007-Heterogeneous Component Analysis
Author: Shigeyuki Oba, Motoaki Kawanabe, Klaus-Robert Müller, Shin Ishii
Abstract: In bioinformatics it is often desirable to combine data from various measurement sources and thus structured feature vectors are to be analyzed that possess different intrinsic blocking characteristics (e.g., different patterns of missing values, observation noise levels, effective intrinsic dimensionalities). We propose a new machine learning tool, heterogeneous component analysis (HCA), for feature extraction in order to better understand the factors that underlie such complex structured heterogeneous data. HCA is a linear block-wise sparse Bayesian PCA based not only on a probabilistic model with block-wise residual variance terms but also on a Bayesian treatment of a block-wise sparse factor-loading matrix. We study various algorithms that implement our HCA concept extracting sparse heterogeneous structure by obtaining common components for the blocks and specific components within each block. Simulations on toy and bioinformatics data underline the usefulness of the proposed structured matrix factorization concept. 1
2 0.80958831 30 nips-2007-Bayes-Adaptive POMDPs
Author: Stephane Ross, Brahim Chaib-draa, Joelle Pineau
Abstract: Bayesian Reinforcement Learning has generated substantial interest recently, as it provides an elegant solution to the exploration-exploitation trade-off in reinforcement learning. However, most investigations of Bayesian reinforcement learning to date focus on the standard Markov Decision Processes (MDPs). Our goal is to extend these ideas to the more general Partially Observable MDP (POMDP) framework, where the state is a hidden variable. To address this problem, we introduce a new mathematical model, the Bayes-Adaptive POMDP. This new model allows us to (1) improve knowledge of the POMDP domain through interaction with the environment, and (2) plan optimal sequences of actions which can trade off between improving the model, identifying the state, and gathering reward. We show how the model can be finitely approximated while preserving the value function. We describe approximations for belief tracking and planning in this model. Empirical results on two domains show that the model estimate and agent's return improve over time, as the agent learns better model estimates. 1
same-paper 3 0.80851156 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations
4 0.75608534 101 nips-2007-How SVMs can estimate quantiles and the median
Author: Andreas Christmann, Ingo Steinwart
Abstract: We investigate quantile regression based on the pinball loss and the ǫ-insensitive loss. For the pinball loss, a condition on the data-generating distribution P is given that ensures that the conditional quantiles are approximated with respect to ‖·‖₁. This result is then used to derive an oracle inequality for an SVM based on the pinball loss. Moreover, we show that SVMs based on the ǫ-insensitive loss estimate the conditional median only under certain conditions on P. 1
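For reference, the pinball loss has a simple closed form, and minimizing its empirical risk over a constant recovers the empirical τ-quantile. The following sketch is an illustrative sanity check on synthetic data, not the SVM estimator analyzed in the paper.

```python
import numpy as np

def pinball_loss(residual, tau):
    """Pinball (quantile) loss: tau * r for r >= 0, (tau - 1) * r otherwise."""
    return np.where(residual >= 0, tau * residual, (tau - 1) * residual)

# Minimizing the empirical pinball loss over a constant recovers the
# empirical tau-quantile.
rng = np.random.default_rng(0)
y = rng.standard_normal(10000)
grid = np.linspace(-3, 3, 601)
risk = [pinball_loss(y - q, 0.9).mean() for q in grid]
print(grid[int(np.argmin(risk))], np.quantile(y, 0.9))
```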
5 0.72060168 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes
Author: Geoffrey E. Hinton, Ruslan Salakhutdinov
Abstract: We show how to use unlabeled data and a deep belief net (DBN) to learn a good covariance kernel for a Gaussian process. We first learn a deep generative model of the unlabeled data using the fast, greedy algorithm introduced by [7]. If the data is high-dimensional and highly-structured, a Gaussian kernel applied to the top layer of features in the DBN works much better than a similar kernel applied to the raw input. Performance at both regression and classification can then be further improved by using backpropagation through the DBN to discriminatively fine-tune the covariance kernel.
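The general recipe here — an RBF kernel applied to learned feature vectors rather than to raw inputs — can be sketched as follows; `feature_map` is a hypothetical stand-in for the DBN's top-layer activities, and the hyperparameters are illustrative.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Gaussian (RBF) kernel between rows of A and rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_predict(X_train, y_train, X_test, feature_map, noise=1e-2):
    """GP regression mean with the kernel applied to learned features."""
    F_tr, F_te = feature_map(X_train), feature_map(X_test)
    K = rbf_kernel(F_tr, F_tr) + noise * np.eye(len(F_tr))
    K_star = rbf_kernel(F_te, F_tr)
    return K_star @ np.linalg.solve(K, y_train)

# Stand-in feature map; in the paper this role is played by the DBN's
# top hidden layer, optionally fine-tuned by backpropagation.
feature_map = lambda X: np.tanh(X @ np.ones((X.shape[1], 4)))
```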
6 0.70235395 158 nips-2007-Probabilistic Matrix Factorization
7 0.70178246 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs
8 0.69802004 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
9 0.69756037 215 nips-2007-What makes some POMDP problems easy to approximate?
10 0.69155514 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images
11 0.68967307 18 nips-2007-A probabilistic model for generating realistic lip movements from speech
12 0.68953204 69 nips-2007-Discriminative Batch Mode Active Learning
13 0.68621218 156 nips-2007-Predictive Matrix-Variate t Models
14 0.68338931 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models
15 0.682863 63 nips-2007-Convex Relaxations of Latent Variable Training
16 0.68060255 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data
17 0.68037617 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons
18 0.68019384 73 nips-2007-Distributed Inference for Latent Dirichlet Allocation
19 0.67929268 104 nips-2007-Inferring Neural Firing Rates from Spike Trains Using Gaussian Processes
20 0.67920876 115 nips-2007-Learning the 2-D Topology of Images