nips nips2006 nips2006-144 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Carla P. Gomes, Ashish Sabharwal, Bart Selman
Abstract: We propose a new technique for sampling the solutions of combinatorial problems in a near-uniform manner. We focus on problems specified as a Boolean formula, i.e., on SAT instances. Sampling for SAT problems has been shown to have interesting connections with probabilistic reasoning, making practical sampling algorithms for SAT highly desirable. The best current approaches are based on Markov Chain Monte Carlo methods, which have some practical limitations. Our approach exploits combinatorial properties of random parity (XOR) constraints to prune away solutions near-uniformly. The final sample is identified amongst the remaining ones using a state-of-the-art SAT solver. The resulting sampling distribution is provably arbitrarily close to uniform. Our experiments show that our technique achieves a significantly better sampling quality than the best alternative. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: We propose a new technique for sampling the solutions of combinatorial problems in a near-uniform manner. [sent-4, score-0.297]
2 Our approach exploits combinatorial properties of random parity (XOR) constraints to prune away solutions near-uniformly. [sent-10, score-0.381]
3 1 Introduction We present a new method, XORSample, for uniformly sampling from the solutions of hard combinatorial problems. [sent-14, score-0.351]
4 There has also been a growing interest in combining logical and probabilistic constraints as in the work of Koller, Russell, Domingos, Bacchus, Halpern, Darwiche, and many others (see e.g. [sent-17, score-0.139]
5 statistical relational learning and Markov logic networks [1]), and a recently proposed Markov logic system for this task uses efficient SAT sampling as its core reasoning mechanism [2]. [sent-19, score-0.152]
6 Typical approaches for sampling from combinatorial spaces are based on Markov Chain Monte Carlo (MCMC) methods, such as the Metropolis algorithm and simulated annealing [3, 4, 5]. [sent-20, score-0.179]
7 Unfortunately, on many combinatorial problems, the time taken by the Markov chain to reach its stationary distribution scales exponentially with the problem size. [sent-23, score-0.124]
8 MCMC methods can also be used to find (globally optimal) solutions to combinatorial problems. [sent-24, score-0.22]
9 By lowering the temperature parameter to near zero, the distribution becomes highly concentrated around the minimum energy states, which correspond to the solutions of the combinatorial problem under consideration. [sent-26, score-0.22]
10 SA has been successfully applied to a number of combinatorial search problems. [sent-27, score-0.102]
11 However, many combinatorial problems, especially those with intricate constraint structure, are beyond the reach of SA and related MCMC methods. [sent-28, score-0.162]
12 As a consequence, these techniques tend to be highly biased, and sample the set of solutions in an extremely non-uniform way. [sent-33, score-0.139]
13 In this paper, we introduce a general probabilistic technique for obtaining near-uniform samples from the set of all (globally optimal) solutions of combinatorial problems. [sent-35, score-0.241]
14 Our method can use any state-of-the-art specialized combinatorial solver as a subroutine, without requiring any modifications to the solver. [sent-36, score-0.216]
15 Most importantly, the quality of our sampling method is not affected by the possible bias of the underlying specialized solver — all we need is a solver that is good at finding some solution or proving that none exists. [sent-38, score-0.347]
16 We also demonstrate the practical feasibility of our approach by sampling near-uniformly from instances of hard combinatorial problems. [sent-40, score-0.229]
17 In the SAT problem, we have a set of logical constraints on a set of Boolean (True/False) variables. [sent-42, score-0.139]
18 The challenge is to find a setting of the variables such that all logical constraints are satisfied. [sent-43, score-0.159]
19 A standard approach in areas such as hardware and software verification and planning and scheduling is to first translate the problem into SAT, and then use a state-of-the-art SAT solver to find a solution (or show that it does not exist). [sent-52, score-0.156]
20 As stated above, these specialized solvers derive much of their power from quickly focusing their search on a very small part of the combinatorial space. [sent-53, score-0.192]
21 Many SAT solvers are deterministic, but even when the solvers incorporate some randomization, solutions will be sampled in a highly non-uniform manner. [sent-54, score-0.28]
22 Assume for simplicity that our original SAT instance on n Boolean variables has 2^s solutions or satisfying assignments. [sent-56, score-0.195]
23 We add special randomly generated logical constraints to our SAT problem. [sent-58, score-0.139]
24 Each random constraint is constructed in such a way that it rules out any given truth assignment exactly with probability 1/2 . [sent-59, score-0.12]
25 We then use a SAT solver to find the remaining satisfying assignment and output this as our first sample. [sent-61, score-0.18]
26 The randomization in the added constraints will guarantee that the assignment is selected uniformly at random. [sent-64, score-0.199]
27 For our added constraints, we use randomly generated parity or “exclusive-or” (XOR) constraints. [sent-66, score-0.106]
28 In recent work, we introduced XOR constraints for the problem of counting the number of solutions using MBound [15]. [sent-67, score-0.217]
29 As we will discuss below, an XOR constraint eliminates any given truth assignment with probability 1/2 , and therefore, in expectation, cuts the set of satisfying assignments in half. [sent-69, score-0.197]
30 Unfortunately, as far as is known, there are no compact (polynomial size) logical constraints that can achieve such complete independence. [sent-71, score-0.139]
31 However, XOR constraints guarantee at least pairwise independence, i.e. [sent-72, score-0.106]
32 if we know that an XOR constraint C eliminates assignment σ1, this provides no information as to whether C will remove any other assignment σ2. [sent-74, score-0.152]
33 Our sampling approach is inspired by earlier work in computational complexity theory by Valiant and Vazirani [16], who considered the question of whether having one or more satisfying assignments affects the hardness of the problem. (Of course, we don’t know the true value of s.) [sent-76, score-0.125]
34 The practical feasibility of our approach exploits the fact that current SAT solvers are very effective in finding such truth assignments in many real-world domains. [sent-79, score-0.119]
35 They showed that, in essence, the number of solutions should not affect the hardness of the problem instances in the worst case [16]. [sent-81, score-0.177]
36 This was received as a negative result because it shows that finding a solution to a Unique SAT problem (a SAT instance that is guaranteed to have at most one solution) is not any easier than finding a solution to an arbitrary SAT instance. [sent-82, score-0.102]
37 Our sampling strategy turns this line of research into a positive direction by showing how a standard SAT solver, tailored to finding just one solution of a SAT problem, can now be used to sample near-uniformly from the set of solutions of an arbitrary SAT problem. [sent-83, score-0.253]
38 One question that arises is whether the state-of-the-art SAT solvers will perform well on problem instances with added XOR (or parity) constraints. [sent-85, score-0.116]
39 In fact, the addition of XOR constraints can be beneficial since the constraints lead to additional propagation that can be exploited by the solvers. [sent-87, score-0.148]
40 Our experiments show that we can effectively sample near-uniformly from hard practical combinatorial problems. [sent-88, score-0.147]
41 Our goal in this paper is to sample uniformly from the set of all solutions of a given formula F. [sent-102, score-0.258]
42 An XOR constraint D over variables V is the logical “xor” or parity of a subset of V ∪ {1}; σ satisfies D if it satisfies an odd number of elements in D. [sent-103, score-0.212]
43 For instance, D = {a, b, c, 1} represents the xor constraint a ⊕ b ⊕ c ⊕ 1, which is TRUE when an even number of a, b, c are TRUE. [sent-105, score-0.367]
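To make these definitions concrete, the following Python sketch (our own illustration, not the authors' code) draws a random XOR constraint in the spirit of the X(n, q) distribution used later, including each variable independently with probability q and the constant 1 with probability 1/2, and implements the odd-parity satisfaction test defined above. With q = 1/2, a fixed assignment satisfies such a constraint with probability exactly 1/2, which is the pruning property the method relies on.

```python
import random

def random_xor(n, q=0.5):
    """Draw a random XOR constraint D over variables 1..n: each variable is
    included independently with probability q, and the constant 1 is
    included with probability 1/2."""
    vars_in = [v for v in range(1, n + 1) if random.random() < q]
    const_one = random.random() < 0.5
    return vars_in, const_one

def satisfies_xor(assignment, xor_constraint):
    """assignment maps variable -> bool; D is satisfied iff an odd number of
    its elements are satisfied (the constant 1 always counts as satisfied)."""
    vars_in, const_one = xor_constraint
    count = sum(bool(assignment[v]) for v in vars_in) + (1 if const_one else 0)
    return count % 2 == 1

# The example from the text: D = {a, b, c, 1} is TRUE exactly when an even
# number of a, b, c are TRUE.
a, b, c = 1, 2, 3
D = ([a, b, c], True)
print(satisfies_xor({a: True, b: True, c: False}, D))   # True  (two of a, b, c)
print(satisfies_xor({a: True, b: False, c: False}, D))  # False (one of a, b, c)
```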
44 Our focus will be on formulas which are a logical conjunction of a formula in Conjunctive Normal Form (CNF) and some XOR constraints. [sent-110, score-0.197]
45 In all our experiments, XOR constraints are translated into CNF using additional variables so that the full formula can be fed directly to standard (CNF-based) SAT solvers. [sent-111, score-0.183]
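The sketch below shows one standard way to perform this translation: the long XOR is cut into a chain of two-variable pieces linked by fresh auxiliary variables, producing clauses over DIMACS-style signed integers. It illustrates the general idea only; the authors' exact encoding may differ.

```python
def xor_to_cnf(xor_vars, target, next_aux):
    """Encode XOR(xor_vars) == target as CNF clauses (DIMACS-style signed ints).
    For a constraint D over V ∪ {1}, target is False when the constant 1 is
    in D and True otherwise. Returns (clauses, next_aux), where next_aux is
    the next unused auxiliary variable index."""
    clauses = []
    if not xor_vars:
        # An empty XOR is unsatisfiable if target is True, trivially satisfied otherwise.
        return ([[]] if target else []), next_aux
    cur = xor_vars[0]
    for x in xor_vars[1:]:
        t = next_aux            # fresh auxiliary variable with t <-> (cur XOR x)
        next_aux += 1
        clauses += [[-t, cur, x], [-t, -cur, -x], [t, cur, -x], [t, -cur, x]]
        cur = t
    clauses.append([cur] if target else [-cur])
    return clauses, next_aux
```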
46 Such complementary XOR constraints have the simple but useful property that any assignment σ satisfies exactly one of them. [sent-129, score-0.151]
47 Note that there are certain classes of structured instances based on parity constraints that are designed to be hard for SAT solvers [17]. [sent-131, score-0.282]
48 Our augmented problem instances appear to behave quite differently from these specially constructed instances because of the interaction between the constraints in the original instance and the added random parity constraints. [sent-132, score-0.26]
49 3 Sampling using XOR constraints In this section, we describe and analyze two randomized algorithms, XORSample and XORSample’, for sampling solutions of a given Boolean formula F near-uniformly using streamlining with random XOR constraints. [sent-145, score-0.383]
50 The first algorithm, XORSample, uses a SAT solver as a subroutine on the randomly streamlined formula. [sent-148, score-0.163]
51 It repeatedly performs the streamlining process until the resulting formula has a unique solution. [sent-149, score-0.143]
52 Here s is chosen to be relatively small so that a moderate number of solutions survive. [sent-152, score-0.118]
53 XORSample’ then uses stronger subroutines, namely a SAT model counter and a model selector, to output one of the surviving solutions uniformly at random. [sent-153, score-0.282]
54 3.1 XOR-based sampling using SAT solvers: XORSample. Let F be a formula over n variables, and q and s be the parameters of XORSample. [sent-155, score-0.166]
55 This generates a streamlined formula Fsq whose solutions (called the surviving solutions) are a subset of the solutions of F. [sent-157, score-0.472]
56 If there is a unique surviving solution σ , XORSample outputs σ and stops. [sent-158, score-0.206]
57 The check for uniqueness of σ is done by adding the negation of σ as a constraint to Fsq and testing whether the resulting formula is still satisfiable. [sent-160, score-0.129]
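Putting the pieces together, here is a minimal sketch of the XORSample loop just described. It reuses the random_xor and xor_to_cnf sketches above and assumes a hypothetical sat_solve(clauses, n_vars) callback that returns the set of true variables in some satisfying assignment, or None if the formula is unsatisfiable; this stands in for whatever SAT solver is plugged in (the experiments use MiniSat) and is not a specific solver API.

```python
def xorsample(clauses, n, s, sat_solve, q=0.5, max_tries=1000):
    """XORSample sketch: streamline the CNF `clauses` (over variables 1..n)
    with s random XOR constraints, get one surviving solution from the SAT
    solver, and accept it only if it is the unique survivor; otherwise
    start over with fresh XORs."""
    for _ in range(max_tries):
        streamlined = list(clauses)
        next_aux = n + 1
        for _ in range(s):
            xor_vars, const_one = random_xor(n, q)
            extra, next_aux = xor_to_cnf(xor_vars, not const_one, next_aux)
            streamlined += extra
        model = sat_solve(streamlined, next_aux - 1)
        if model is None:
            continue  # no solution survived these XORs; reject and retry
        # Uniqueness check: block this assignment on the original variables and
        # test whether the streamlined formula is still satisfiable.
        blocking = [-v if v in model else v for v in range(1, n + 1)]
        if sat_solve(streamlined + [blocking], next_aux - 1) is None:
            return {v for v in model if v <= n}  # unique survivor: output it
    return None  # no unique survivor found within max_tries iterations
```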
58 Let p_s(σ) be the overall probability that XORSample outputs σ. [sent-169, score-0.215]
59 By independence of the s XORs in Qs in XORSample, σ survives with probability exactly 2^{-s}, giving the desired upper bound on p_{one,s}(σ). [sent-185, score-0.108]
60 Recall the interpretation of variable assignments and XOR constraints in the vector space F_2^n (cf. [sent-192, score-0.122]
61 Now, p_{one,s}(σ) = Pr[σ(Qs) = 1 and, for all other solutions σ′ of F, σ′(Qs) = 0] = Pr[σ(Qs) = 1] · (1 − Pr[σ′(Qs) = 1 for some solution σ′ ≠ σ | σ(Qs) = 1]). [sent-218, score-0.155]
62 For any solution σ of F, the probability p_s(σ) with which XORSample with parameters q = 1/2 and s outputs σ satisfies c(α) 2^{-s*} < p_s(σ) < (1/c(α)) 2^{-s*}, and min_σ {p_s(σ)} > c(α) max_σ {p_s(σ)}. [sent-223, score-0.412]
63 p_{one,s}(σ), as before, is the probability that σ is the unique surviving solution. [sent-227, score-0.162]
64 p_s(σ), the overall probability of sampling σ, is given by the infinite geometric series p_s(σ) = p_{one,s}(σ) + (1 − p̂) p_{one,s}(σ) + (1 − p̂)^2 p_{one,s}(σ) + ... [sent-228, score-0.421]
65 In particular, p_s(σ) is proportional to p_{one,s}(σ). [sent-232, score-0.16]
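Writing p̂ for the probability that a single iteration produces a unique surviving solution (the disjoint-events quantity bounded a few sentences below), the summation step behind this proportionality can be made explicit:

```latex
p_s(\sigma) \;=\; \sum_{k \ge 0} (1-\hat{p})^{k}\, p_{one,s}(\sigma)
            \;=\; \frac{p_{one,s}(\sigma)}{\hat{p}},
\qquad \text{hence } p_s(\sigma) \propto p_{one,s}(\sigma).
```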
66 Lemma 1 says that for any two solutions σ1 and σ2 of F, p_{one,s}(σ1) and p_{one,s}(σ2) are strictly within a factor of c(α) of each other. [sent-233, score-0.142]
67 By the above discussion, p_s(σ1) and p_s(σ2) must also be strictly within a factor of c(α) of each other, already proving the min vs. max part of the claim. [sent-234, score-0.344]
68 Further, Σ_σ p_s(σ) = 1 because of rejection sampling. [sent-236, score-0.182]
69 For the first part of the result, suppose for the sake of contradiction that p_s(σ0) ≤ c(α) 2^{-s*} for some σ0, violating the claimed lower bound. [sent-237, score-0.16]
70 By the above argument, p_s(σ) is within a factor of c(α) of p_s(σ0) for every σ, and would therefore be at most 2^{-s*}. [sent-238, score-0.32]
71 This would make Σ_σ p_s(σ) strictly less than one, a contradiction. [sent-239, score-0.16]
72 A similar argument proves the upper bound on p_s(σ). [sent-240, score-0.202]
73 Using the bounds on p_{one,s}(σ) from Lemma 1 and the fact that the unique survival of each of the 2^{s*} solutions σ are disjoint events, we have p̂ ≤ 2^{s*} · 2^{-s} = 2^{-α} and p̂ > 2^{s*} · c(α) 2^{-s} = c(α) 2^{-α}. [sent-242, score-0.147]
74 However, now the resulting streamlined formula Fsq is fed to an exact model counting subroutine to compute the number of surviving solutions, mc. [sent-246, score-0.291]
75 If mc > 0, XORSample’ succeeds and outputs the ith surviving solution using a model selector on Fsq, where i is chosen uniformly from {1, 2, ..., mc}. [sent-247, score-0.414]
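A minimal sketch of XORSample’ as just described, again reusing the helper sketches above. The count_models and nth_model callbacks stand in for an exact model counter and a model selector; their signatures are assumptions made for illustration only, and any exact counter and selector could fill these roles.

```python
import random

def xorsample_prime(clauses, n, s, count_models, nth_model, q=0.5):
    """XORSample' sketch: streamline with s random XORs, count the surviving
    solutions exactly (mc), and if mc > 0 output the i-th survivor for i
    drawn uniformly from {1, ..., mc}; otherwise report failure (None)."""
    streamlined = list(clauses)
    next_aux = n + 1
    for _ in range(s):
        xor_vars, const_one = random_xor(n, q)
        extra, next_aux = xor_to_cnf(xor_vars, not const_one, next_aux)
        streamlined += extra
    # The auxiliary variables introduced by xor_to_cnf are functionally
    # determined by the original ones, so counting models of the streamlined
    # CNF counts the surviving solutions of the original formula.
    mc = count_models(streamlined, next_aux - 1)
    if mc == 0:
        return None  # failure: no solution of F survived the s XORs
    i = random.randint(1, mc)
    model = nth_model(streamlined, next_aux - 1, i)
    return {v for v in model if v <= n}
```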
76 (Algorithm 2: XORSample’, sampling with XORs using a model counter and selector.) The sample-quality analysis of XORSample’ requires somewhat more complex ideas than that of XORSample. [sent-257, score-0.14]
77 That is, knowing the value of an XOR constraint on two variable assignments does not tell us anything about its value on a third assignment. [sent-262, score-0.114]
78 The idea is to show that the number of solutions surviving, given that any fixed solution σ survives, is independent of σ in expectation and is highly likely to be very close to the expected value. [sent-271, score-0.18]
79 As a result, the probability with which σ is output, which is inversely proportional to the number of solutions surviving along with σ , will be very close to the uniform probability. [sent-272, score-0.279]
80 For any solution σ of F, the probability p_s(σ) with which XORSample’ with parameters q = 1/2 and s outputs σ satisfies p_s(σ) > c′(α) 2^{-s*}, where c′(α) = 1 − 2^{-α/3}. [sent-277, score-0.412]
81 We begin by setting up a framework for analyzing the number of surviving solutions after s XORs Qs drawn from X(n,1/2 ) are added to F. [sent-281, score-0.246]
82 The variable mc (see Algorithm 2), which is the number of surviving solutions, equals ∑σ Yσ . [sent-287, score-0.258]
83 Consider the distribution of mc conditioned on the fact that σ survives. [sent-288, score-0.167]
84 We show that mc conditioned on σ (Qs ) = 1 indeed lies very close to µ . [sent-292, score-0.167]
85 By Chebychev’s inequality, Pr[ |mc − µ| ≥ µ/2^β | σ(Qs) = 1 ] ≤ 2^{2β} · Var[mc | σ(Qs) = 1] / (E[mc | σ(Qs) = 1])^2 < 2^{2β} / E[mc | σ(Qs) = 1] = 2^{2β}/µ. Therefore, conditioned on σ(Qs) = 1, with probability more than 1 − 2^{2β}/µ, mc lies between (1 − 2^{-β})µ and (1 + 2^{-β})µ. [sent-294, score-0.191]
86 Recall that p_s(σ) is the probability that XORSample’ outputs σ. [sent-295, score-0.215]
87 p_s(σ) = Pr[σ(Qs) = 1] · Σ_{i=1}^{2^n} Pr[mc = i | σ(Qs) = 1] · (1/i) ≥ 2^{-s} · Pr[mc ≤ (1 + 2^{-β})µ | σ(Qs) = 1] · 1/((1 + 2^{-β})µ) ≥ 2^{-s} · (1 − 2^{2β}/µ) / ((1 + 2^{-β})µ). Simplifying this expression and optimizing it by setting β = α/3 gives the desired bound on p_s(σ). [sent-296, score-0.309]
88 Lastly, the success probability of XORSample’ is Σ_σ p_s(σ) > c′(α). [sent-297, score-0.184]
89 For example, as the number of XORs used in XORSample is increased, α increases, the deviation from the truly uniform sampling probability p* (which is governed by c(α)) approaches 0 exponentially fast, and we get progressively smaller error bands around p*. [sent-300, score-0.129]
90 4 Empirical validation To validate our XOR-sampling technique, we consider two kinds of formulas: a random 3-SAT instance generated near the SAT phase transition [18] and a structured instance derived from a logistics planning domain (data and code available from the authors). [sent-303, score-0.123]
91 We used a complete model counter, Relsat [12], to find all solutions of our problem instances. [sent-304, score-0.118]
92 Our random instance with 75 variables has a total of 48 satisfying assignments, and our logistics formula with 352 variables has 512 satisfying assignments. [sent-305, score-0.258]
93 We used XORSample with MiniSat [14] as the underlying SAT solver to generate samples from the set of solutions of each formula. [sent-308, score-0.234]
94 Contrast this with the results for SampleSAT: SampleSAT does sample quite uniformly from solutions that lie near each other in Hamming distance but different solution clusters are sampled with different frequencies. [sent-316, score-0.226]
95 This SAT instance has two solution clusters: the first 32 solutions are sampled around 2,900 times each, i.e. [sent-317, score-0.203]
96 not frequently enough, whereas the remaining 16 solutions are sampled too frequently, around 6,700 times each. [sent-319, score-0.138]
97 (Although SampleSAT greatly improves on other sampling strategies for SAT, the split into disjoint sampling bands appears inherent in the approach. [sent-320, score-0.154]
98 SampleSAT in fact only found 256 of the 512 solutions in a total of 100,000 samples. [sent-329, score-0.118]
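As a side note on how such non-uniformity can be quantified, one common summary (KL divergence appears among this paper's indexed terms) is the KL divergence between the empirical sampling frequencies and the uniform distribution over all solutions. The sketch below is a generic illustration, not the authors' evaluation code; the toy counts mirror the SampleSAT cluster frequencies reported above.

```python
import math

def kl_from_uniform(counts):
    """KL divergence (in bits) from the empirical distribution over solutions
    (given as per-solution sample counts) to the uniform distribution over
    the same solutions. 0 means perfectly uniform sampling."""
    total = sum(counts)
    k = len(counts)
    kl = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            kl += p * math.log2(p * k)  # p * log(p / (1/k))
    return kl

# Toy illustration with two clusters sampled at different rates (48 solutions).
counts = [2900] * 32 + [6700] * 16
print(round(kl_from_uniform(counts), 3))
```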
99 These experiments show that XORSample is a promising practical technique (with theoretical guarantees) for obtaining near-uniform samples from intricate combinatorial spaces. [sent-339, score-0.143]
100 The minimal disagreement parity problem as a hard satisfiability problem. [sent-464, score-0.111]
wordName wordTfidf (topN-words)
[('xorsample', 0.538), ('sat', 0.478), ('xor', 0.327), ('qs', 0.286), ('xors', 0.163), ('ps', 0.16), ('mc', 0.149), ('samplesat', 0.125), ('solutions', 0.118), ('surviving', 0.109), ('combinatorial', 0.102), ('fsq', 0.1), ('solver', 0.095), ('formula', 0.089), ('parity', 0.087), ('sampling', 0.077), ('constraints', 0.074), ('solvers', 0.071), ('logical', 0.065), ('fs', 0.065), ('pr', 0.058), ('assignment', 0.056), ('boolean', 0.055), ('satis', 0.052), ('cnf', 0.05), ('var', 0.05), ('assignments', 0.048), ('independence', 0.046), ('logistics', 0.043), ('formulas', 0.043), ('constraint', 0.04), ('july', 0.038), ('iterationsuccess', 0.038), ('satsolve', 0.038), ('selector', 0.038), ('streamlined', 0.038), ('survives', 0.038), ('solution', 0.037), ('fn', 0.035), ('hardness', 0.033), ('pairwise', 0.032), ('aaai', 0.032), ('outputs', 0.031), ('uniformly', 0.03), ('subroutine', 0.03), ('unique', 0.029), ('satisfying', 0.029), ('uniform', 0.028), ('instance', 0.028), ('valiant', 0.028), ('instances', 0.026), ('logic', 0.026), ('ul', 0.026), ('knowing', 0.026), ('lemma', 0.025), ('gomes', 0.025), ('jose', 0.025), ('mbound', 0.025), ('minisat', 0.025), ('params', 0.025), ('sabharwal', 0.025), ('streamlining', 0.025), ('subroutines', 0.025), ('counter', 0.025), ('expectation', 0.025), ('counting', 0.025), ('hard', 0.024), ('probability', 0.024), ('planning', 0.024), ('proving', 0.024), ('says', 0.024), ('reasoning', 0.023), ('sa', 0.023), ('rejection', 0.022), ('vazirani', 0.022), ('talk', 0.022), ('bacchus', 0.022), ('stationary', 0.022), ('samples', 0.021), ('sample', 0.021), ('complementary', 0.021), ('proves', 0.021), ('argument', 0.021), ('kl', 0.021), ('variables', 0.02), ('markov', 0.02), ('sampled', 0.02), ('es', 0.02), ('metropolis', 0.02), ('rosenbluth', 0.02), ('succeeds', 0.02), ('cornell', 0.02), ('intricate', 0.02), ('randomization', 0.02), ('added', 0.019), ('divergence', 0.019), ('false', 0.019), ('linearly', 0.019), ('specialized', 0.019), ('conditioned', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 144 nips-2006-Near-Uniform Sampling of Combinatorial Spaces Using XOR Constraints
Author: Carla P. Gomes, Ashish Sabharwal, Bart Selman
Abstract: We propose a new technique for sampling the solutions of combinatorial problems in a near-uniform manner. We focus on problems specified as a Boolean formula, i.e., on SAT instances. Sampling for SAT problems has been shown to have interesting connections with probabilistic reasoning, making practical sampling algorithms for SAT highly desirable. The best current approaches are based on Markov Chain Monte Carlo methods, which have some practical limitations. Our approach exploits combinatorial properties of random parity (XOR) constraints to prune away solutions near-uniformly. The final sample is identified amongst the remaining ones using a state-of-the-art SAT solver. The resulting sampling distribution is provably arbitrarily close to uniform. Our experiments show that our technique achieves a significantly better sampling quality than the best alternative. 1
2 0.11280387 159 nips-2006-Parameter Expanded Variational Bayesian Methods
Author: Tommi S. Jaakkola, Yuan Qi
Abstract: Bayesian inference has become increasingly important in statistical machine learning. Exact Bayesian calculations are often not feasible in practice, however. A number of approximate Bayesian methods have been proposed to make such calculations practical, among them the variational Bayesian (VB) approach. The VB approach, while useful, can nevertheless suffer from slow convergence to the approximate solution. To address this problem, we propose Parameter-eXpanded Variational Bayesian (PX-VB) methods to speed up VB. The new algorithm is inspired by parameter-expanded expectation maximization (PX-EM) and parameterexpanded data augmentation (PX-DA). Similar to PX-EM and -DA, PX-VB expands a model with auxiliary variables to reduce the coupling between variables in the original model. We analyze the convergence rates of VB and PX-VB and demonstrate the superior convergence rates of PX-VB in variational probit regression and automatic relevance determination. 1
3 0.077125371 44 nips-2006-Bayesian Policy Gradient Algorithms
Author: Mohammad Ghavamzadeh, Yaakov Engel
Abstract: Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. Since Monte Carlo methods tend to have high variance, a large number of samples is required, resulting in slow convergence. In this paper, we propose a Bayesian framework that models the policy gradient as a Gaussian process. This reduces the number of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient as well as a measure of the uncertainty in the gradient estimates are provided at little extra cost. 1
4 0.068754219 193 nips-2006-Tighter PAC-Bayes Bounds
Author: Amiran Ambroladze, Emilio Parrado-hernández, John S. Shawe-taylor
Abstract: This paper proposes a PAC-Bayes bound to measure the performance of Support Vector Machine (SVM) classifiers. The bound is based on learning a prior over the distribution of classifiers with a part of the training samples. Experimental work shows that this bound is tighter than the original PAC-Bayes, resulting in an enhancement of the predictive capabilities of the PAC-Bayes bound. In addition, it is shown that the use of this bound as a means to estimate the hyperparameters of the classifier compares favourably with cross validation in terms of accuracy of the model, while saving a lot of computational burden. 1
5 0.062565558 62 nips-2006-Correcting Sample Selection Bias by Unlabeled Data
Author: Jiayuan Huang, Arthur Gretton, Karsten M. Borgwardt, Bernhard Schölkopf, Alex J. Smola
Abstract: We consider the scenario where training and test data are drawn from different distributions, commonly referred to as sample selection bias. Most algorithms for this setting try to first recover sampling distributions and then make appropriate corrections based on the distribution estimate. We present a nonparametric method which directly produces resampling weights without distribution estimation. Our method works by matching distributions between training and testing sets in feature space. Experimental results demonstrate that our method works well in practice.
6 0.058594517 189 nips-2006-Temporal dynamics of information content carried by neurons in the primary visual cortex
7 0.057736084 92 nips-2006-High-Dimensional Graphical Model Selection Using $\ell 1$-Regularized Logistic Regression
8 0.041798815 201 nips-2006-Using Combinatorial Optimization within Max-Product Belief Propagation
9 0.041130424 38 nips-2006-Automated Hierarchy Discovery for Planning in Partially Observable Environments
10 0.040862761 57 nips-2006-Conditional mean field
11 0.038537949 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization
12 0.037544951 155 nips-2006-Optimal Single-Class Classification Strategies
13 0.036048122 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure
14 0.035038676 109 nips-2006-Learnability and the doubling dimension
15 0.034710351 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions
16 0.034603477 186 nips-2006-Support Vector Machines on a Budget
17 0.033450462 98 nips-2006-Inferring Network Structure from Co-Occurrences
18 0.033122137 56 nips-2006-Conditional Random Sampling: A Sketch-based Sampling Technique for Sparse Data
19 0.0328997 116 nips-2006-Learning from Multiple Sources
20 0.029930677 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
topicId topicWeight
[(0, -0.107), (1, 0.013), (2, -0.039), (3, -0.053), (4, 0.02), (5, 0.049), (6, 0.064), (7, 0.015), (8, 0.038), (9, 0.012), (10, 0.044), (11, -0.015), (12, 0.006), (13, -0.02), (14, -0.038), (15, 0.005), (16, -0.016), (17, -0.021), (18, 0.042), (19, 0.005), (20, 0.026), (21, -0.013), (22, -0.044), (23, -0.072), (24, -0.064), (25, -0.035), (26, 0.067), (27, 0.102), (28, 0.024), (29, -0.035), (30, 0.052), (31, -0.018), (32, 0.078), (33, 0.072), (34, -0.082), (35, 0.058), (36, -0.089), (37, 0.011), (38, -0.02), (39, -0.115), (40, -0.149), (41, 0.078), (42, 0.144), (43, -0.137), (44, 0.152), (45, -0.023), (46, 0.012), (47, 0.084), (48, -0.098), (49, 0.003)]
simIndex simValue paperId paperTitle
same-paper 1 0.92230356 144 nips-2006-Near-Uniform Sampling of Combinatorial Spaces Using XOR Constraints
Author: Carla P. Gomes, Ashish Sabharwal, Bart Selman
Abstract: We propose a new technique for sampling the solutions of combinatorial problems in a near-uniform manner. We focus on problems specified as a Boolean formula, i.e., on SAT instances. Sampling for SAT problems has been shown to have interesting connections with probabilistic reasoning, making practical sampling algorithms for SAT highly desirable. The best current approaches are based on Markov Chain Monte Carlo methods, which have some practical limitations. Our approach exploits combinatorial properties of random parity (XOR) constraints to prune away solutions near-uniformly. The final sample is identified amongst the remaining ones using a state-of-the-art SAT solver. The resulting sampling distribution is provably arbitrarily close to uniform. Our experiments show that our technique achieves a significantly better sampling quality than the best alternative. 1
2 0.51667422 159 nips-2006-Parameter Expanded Variational Bayesian Methods
Author: Tommi S. Jaakkola, Yuan Qi
Abstract: Bayesian inference has become increasingly important in statistical machine learning. Exact Bayesian calculations are often not feasible in practice, however. A number of approximate Bayesian methods have been proposed to make such calculations practical, among them the variational Bayesian (VB) approach. The VB approach, while useful, can nevertheless suffer from slow convergence to the approximate solution. To address this problem, we propose Parameter-eXpanded Variational Bayesian (PX-VB) methods to speed up VB. The new algorithm is inspired by parameter-expanded expectation maximization (PX-EM) and parameterexpanded data augmentation (PX-DA). Similar to PX-EM and -DA, PX-VB expands a model with auxiliary variables to reduce the coupling between variables in the original model. We analyze the convergence rates of VB and PX-VB and demonstrate the superior convergence rates of PX-VB in variational probit regression and automatic relevance determination. 1
3 0.44368741 44 nips-2006-Bayesian Policy Gradient Algorithms
Author: Mohammad Ghavamzadeh, Yaakov Engel
Abstract: Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. Since Monte Carlo methods tend to have high variance, a large number of samples is required, resulting in slow convergence. In this paper, we propose a Bayesian framework that models the policy gradient as a Gaussian process. This reduces the number of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient as well as a measure of the uncertainty in the gradient estimates are provided at little extra cost. 1
4 0.43786606 62 nips-2006-Correcting Sample Selection Bias by Unlabeled Data
Author: Jiayuan Huang, Arthur Gretton, Karsten M. Borgwardt, Bernhard Schölkopf, Alex J. Smola
Abstract: We consider the scenario where training and test data are drawn from different distributions, commonly referred to as sample selection bias. Most algorithms for this setting try to first recover sampling distributions and then make appropriate corrections based on the distribution estimate. We present a nonparametric method which directly produces resampling weights without distribution estimation. Our method works by matching distributions between training and testing sets in feature space. Experimental results demonstrate that our method works well in practice.
5 0.42918926 193 nips-2006-Tighter PAC-Bayes Bounds
Author: Amiran Ambroladze, Emilio Parrado-hernández, John S. Shawe-taylor
Abstract: This paper proposes a PAC-Bayes bound to measure the performance of Support Vector Machine (SVM) classifiers. The bound is based on learning a prior over the distribution of classifiers with a part of the training samples. Experimental work shows that this bound is tighter than the original PAC-Bayes, resulting in an enhancement of the predictive capabilities of the PAC-Bayes bound. In addition, it is shown that the use of this bound as a means to estimate the hyperparameters of the classifier compares favourably with cross validation in terms of accuracy of the model, while saving a lot of computational burden. 1
6 0.41357139 189 nips-2006-Temporal dynamics of information content carried by neurons in the primary visual cortex
7 0.38482496 109 nips-2006-Learnability and the doubling dimension
8 0.37974697 98 nips-2006-Inferring Network Structure from Co-Occurrences
9 0.3540087 92 nips-2006-High-Dimensional Graphical Model Selection Using $\ell 1$-Regularized Logistic Regression
10 0.34858945 1 nips-2006-A Bayesian Approach to Diffusion Models of Decision-Making and Response Time
11 0.34475392 5 nips-2006-A Kernel Method for the Two-Sample-Problem
12 0.31361729 155 nips-2006-Optimal Single-Class Classification Strategies
13 0.31347078 38 nips-2006-Automated Hierarchy Discovery for Planning in Partially Observable Environments
14 0.31100133 202 nips-2006-iLSTD: Eligibility Traces and Convergence Analysis
15 0.2932187 186 nips-2006-Support Vector Machines on a Budget
16 0.29069108 57 nips-2006-Conditional mean field
17 0.28600881 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization
18 0.28260654 56 nips-2006-Conditional Random Sampling: A Sketch-based Sampling Technique for Sparse Data
19 0.28247988 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions
20 0.27695557 158 nips-2006-PG-means: learning the number of clusters in data
topicId topicWeight
[(1, 0.094), (3, 0.029), (7, 0.068), (9, 0.042), (20, 0.024), (22, 0.079), (44, 0.075), (57, 0.063), (65, 0.035), (69, 0.046), (87, 0.344)]
simIndex simValue paperId paperTitle
1 0.79011428 123 nips-2006-Learning with Hypergraphs: Clustering, Classification, and Embedding
Author: Dengyong Zhou, Jiayuan Huang, Bernhard Schölkopf
Abstract: We usually endow the investigated objects with pairwise relationships, which can be illustrated as graphs. In many real-world problems, however, relationships among the objects of our interest are more complex than pairwise. Naively squeezing the complex relationships into pairwise ones will inevitably lead to loss of information which can be expected valuable for our learning tasks however. Therefore we consider using hypergraphs instead to completely represent complex relationships among the objects of our interest, and thus the problem of learning with hypergraphs arises. Our main contribution in this paper is to generalize the powerful methodology of spectral clustering which originally operates on undirected graphs to hypergraphs, and further develop algorithms for hypergraph embedding and transductive classification on the basis of the spectral hypergraph clustering approach. Our experiments on a number of benchmarks showed the advantages of hypergraphs over usual graphs. 1
same-paper 2 0.73498631 144 nips-2006-Near-Uniform Sampling of Combinatorial Spaces Using XOR Constraints
Author: Carla P. Gomes, Ashish Sabharwal, Bart Selman
Abstract: We propose a new technique for sampling the solutions of combinatorial problems in a near-uniform manner. We focus on problems specified as a Boolean formula, i.e., on SAT instances. Sampling for SAT problems has been shown to have interesting connections with probabilistic reasoning, making practical sampling algorithms for SAT highly desirable. The best current approaches are based on Markov Chain Monte Carlo methods, which have some practical limitations. Our approach exploits combinatorial properties of random parity (XOR) constraints to prune away solutions near-uniformly. The final sample is identified amongst the remaining ones using a state-of-the-art SAT solver. The resulting sampling distribution is provably arbitrarily close to uniform. Our experiments show that our technique achieves a significantly better sampling quality than the best alternative. 1
3 0.60634089 34 nips-2006-Approximate Correspondences in High Dimensions
Author: Kristen Grauman, Trevor Darrell
Abstract: Pyramid intersection is an efficient method for computing an approximate partial matching between two sets of feature vectors. We introduce a novel pyramid embedding based on a hierarchy of non-uniformly shaped bins that takes advantage of the underlying structure of the feature space and remains accurate even for sets with high-dimensional feature vectors. The matching similarity is computed in linear time and forms a Mercer kernel. Whereas previous matching approximation algorithms suffer from distortion factors that increase linearly with the feature dimension, we demonstrate that our approach can maintain constant accuracy even as the feature dimension increases. When used as a kernel in a discriminative classifier, our approach achieves improved object recognition results over a state-of-the-art set kernel. 1
4 0.46900553 198 nips-2006-Unified Inference for Variational Bayesian Linear Gaussian State-Space Models
Author: David Barber, Silvia Chiappa
Abstract: Linear Gaussian State-Space Models are widely used and a Bayesian treatment of parameters is therefore of considerable interest. The approximate Variational Bayesian method applied to these models is an attractive approach, used successfully in applications ranging from acoustics to bioinformatics. The most challenging aspect of implementing the method is in performing inference on the hidden state sequence of the model. We show how to convert the inference problem so that standard Kalman Filtering/Smoothing recursions from the literature may be applied. This is in contrast to previously published approaches based on Belief Propagation. Our framework both simplifies and unifies the inference problem, so that future applications may be more easily developed. We demonstrate the elegance of the approach on Bayesian temporal ICA, with an application to finding independent dynamical processes underlying noisy EEG signals. 1
5 0.46365502 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
Author: Kilian Q. Weinberger, Fei Sha, Qihui Zhu, Lawrence K. Saul
Abstract: In many areas of science and engineering, the problem arises how to discover low dimensional representations of high dimensional data. Recently, a number of researchers have converged on common solutions to this problem using methods from convex optimization. In particular, many results have been obtained by constructing semidefinite programs (SDPs) with low rank solutions. While the rank of matrix variables in SDPs cannot be directly constrained, it has been observed that low rank solutions emerge naturally by computing high variance or maximal trace solutions that respect local distance constraints. In this paper, we show how to solve very large problems of this type by a matrix factorization that leads to much smaller SDPs than those previously studied. The matrix factorization is derived by expanding the solution of the original problem in terms of the bottom eigenvectors of a graph Laplacian. The smaller SDPs obtained from this matrix factorization yield very good approximations to solutions of the original problem. Moreover, these approximations can be further refined by conjugate gradient descent. We illustrate the approach on localization in large scale sensor networks, where optimizations involving tens of thousands of nodes can be solved in just a few minutes. 1
6 0.46314988 20 nips-2006-Active learning for misspecified generalized linear models
7 0.46200734 175 nips-2006-Simplifying Mixture Models through Function Approximation
8 0.46147743 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
9 0.46116757 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
10 0.46107283 65 nips-2006-Denoising and Dimension Reduction in Feature Space
11 0.45986131 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
12 0.45958379 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
13 0.45827287 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
14 0.45659432 158 nips-2006-PG-means: learning the number of clusters in data
15 0.45551896 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures
16 0.45541474 169 nips-2006-Relational Learning with Gaussian Processes
17 0.45394522 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy
18 0.45376387 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation
19 0.45343289 194 nips-2006-Towards a general independent subspace analysis
20 0.45247856 167 nips-2006-Recursive ICA