nips nips2010 nips2010-32 knowledge-graph by maker-knowledge-mining

32 nips-2010-Approximate Inference by Compilation to Arithmetic Circuits

Source: pdf

Author: Daniel Lowd, Pedro Domingos

Abstract: Arithmetic circuits (ACs) exploit context-speciﬁc independence and determinism to allow exact inference even in networks with high treewidth. In this paper, we introduce the ﬁrst ever approximate inference methods using ACs, for domains where exact inference remains intractable. We propose and evaluate a variety of techniques based on exact compilation, forward sampling, AC structure learning, Markov network parameter learning, variational inference, and Gibbs sampling. In experiments on eight challenging real-world domains, we ﬁnd that the methods based on sampling and learning work best: one such method (AC2 -F) is faster and usually more accurate than loopy belief propagation, mean ﬁeld, and Gibbs sampling; another (AC2 -G) has a running time similar to Gibbs sampling but is consistently more accurate than all baselines. 1

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract Arithmetic circuits (ACs) exploit context-speciﬁc independence and determinism to allow exact inference even in networks with high treewidth. [sent-5, score-0.316]

2 In this paper, we introduce the ﬁrst ever approximate inference methods using ACs, for domains where exact inference remains intractable. [sent-6, score-0.236]

3 We propose and evaluate a variety of techniques based on exact compilation, forward sampling, AC structure learning, Markov network parameter learning, variational inference, and Gibbs sampling. [sent-7, score-0.241]

4 1 Introduction Compilation to arithmetic circuits (ACs) [1] is one of the most effective methods for exact inference in Bayesian networks. [sent-9, score-0.407]

5 Given an AC, we can efﬁciently condition on evidence or marginalize variables to yield a simpler AC for the conditional or marginal distribution, respectively. [sent-13, score-0.251]

6 However, as with junction trees, compiling a BN to an equivalent AC yields an exponentially-sized AC in the worst case, preventing their application to many domains of interest. [sent-16, score-0.171]

7 In this paper, we introduce approximate compilation methods, allowing us to construct effective ACs for previously intractable domains. [sent-17, score-0.283]

8 For selecting circuit structure, we compare exact compilation of a simpliﬁed network to learning it from samples. [sent-18, score-0.49]

9 For selecting circuit parameters, we compare variational inference to maximum likelihood learning from samples. [sent-20, score-0.366]

10 We ﬁnd that learning from samples works 1 best for both structure and parameters, achieving the highest accuracy on eight challenging, realworld domains. [sent-21, score-0.132]

11 In Section 2, we provide background on Bayesian networks and arithmetic circuits. [sent-25, score-0.239]

12 1 Background Bayesian networks Bayesian networks (BNs) exploit conditional independence to compactly represent a probability distribution over a set of variables, {X1 , . [sent-29, score-0.188]

13 A BN consists of a directed, acyclic graph with a node for each variable, and a set of conditional probability distributions (CPDs) describing the probability of each variable, Xi , given its parents in the graph, denoted πi [2]. [sent-33, score-0.182]

14 In a decision tree CPD for variable Xi , each interior node is labeled with one of the parent variables, and each of its outgoing edges is labeled with a value of that variable. [sent-39, score-0.18]

15 Each leaf node is a multinomial representing the marginal distribution of Xi conditioned on the parent values speciﬁed by its ancestor nodes and edges in the tree. [sent-40, score-0.16]

16 Bayesian networks can be represented as log-linear models: (1) log P (X = x) = − log Z + i wi fi (x) where each fi is a feature, each wi is a real-valued weight, and Z is the partition function. [sent-41, score-0.445]

17 The goal of inference in Bayesian networks and other graphical models is to answer arbitrary marginal and conditional queries (i. [sent-44, score-0.275]

18 , to compute the marginal distribution of a set of query variables, possibly conditioned on the values of a set of evidence variables). [sent-46, score-0.224]

19 In variational inference, the goal is to select a tractable distribution Q that is as close as possible to the original, intractable distribution P . [sent-48, score-0.202]

20 What makes the reverse KL divergence more tractable to optimize is that the expectations are done over Q instead of P . [sent-51, score-0.277]

21 This minimization also yields bounds on the log partition function, or the probability of evidence in a BN. [sent-52, score-0.179]

22 Generalized or structured mean ﬁeld operates on a set of clusters (possibly overlapping), or junction tree formed from a subset of the edges [6, 7, 8]. [sent-55, score-0.189]

23 One approach is to greedily delete arcs until the junction tree is tractable [6]. [sent-57, score-0.256]

24 2 Arithmetic circuits The probability distribution represented by a Bayesian network can be equivalently represented by a multilinear function known as the network polynomial [1]: P (X1 = x1 , . [sent-61, score-0.228]

25 This allows arbitrary marginal and conditional queries to be answered in time linear in the size of the polynomial. [sent-66, score-0.134]

26 The size of the network polynomial is exponential in the number of variables, but it can be more compactly represented using an arithmetic circuit (AC). [sent-68, score-0.452]

27 In the case of the network polynomial, the leaves are the indicators and network parameters. [sent-71, score-0.12]

28 Every junction tree has a corresponding AC, with an addition node for every instantiation of a separator, a multiplication node for every instantiation of a clique, and a summation node as the root. [sent-73, score-0.348]

29 Thus one way to compile a BN into an AC is via a junction tree. [sent-74, score-0.125]

30 However, when the network contains context-speciﬁc independences, a much more compact circuit can be obtained. [sent-75, score-0.272]

31 Other exact inference methods include variable elimination with algebraic decision diagrams (which can also be done with ACs [9]), AND/OR graphs [10], bucket elimination [11], and more. [sent-77, score-0.217]

32 The structure search is done in advance, once per network, while the parameters may be selected at query time, conditioned on evidence. [sent-80, score-0.126]

33 The parameter optimization allows us to ﬁne-tune the circuit to speciﬁc pieces of evidence. [sent-82, score-0.212]

34 Just as in variational inference methods such as mean ﬁeld, we optimize the parameters of a tractable distribution to best approximate an intractable one. [sent-83, score-0.319]

35 1 Structure search We considered two methods for generating circuit structures. [sent-86, score-0.212]

36 The second is to approximate the BN distribution with a set of samples and learn a circuit from this pseudo-empirical data. [sent-88, score-0.307]

37 1 Pruning and compiling Pruning and compiling a BN is somewhat analogous to edge deletion methods (e. [sent-91, score-0.152]

38 , [6]), except that instead of removing entire edges and building the full junction tree, we introduce contextspeciﬁc independencies and build an arithmetic circuit that can exploit them. [sent-93, score-0.523]

39 We explored several techniques for greedily simplifying a network into a tractable AC by pruning splits from its decision-tree CPDs. [sent-96, score-0.264]

40 We choose to optimize the KL divergence here because the reverse KL is prone to ﬁtting only a single mode, and we want to avoid excluding any signiﬁcant parts of the distribution before seeing evidence. [sent-102, score-0.184]

41 Since Q’s structure is a subset of P ’s, we can decompose the KL divergence as follows: KL(P Q) = P (xi |πi ) log P (πi ) i πi xi P (xi |πi ) Q(xi |πi ) (3) where the summation is over all states of the Xi ’s parents, Πi . [sent-103, score-0.215]

42 In other words, the KL divergence can be computed by adding the expected divergence of each local factor, where the expectation is computed according to the global probability distribution. [sent-104, score-0.268]

43 1), this means that knowing the distribution of the parent variables allows us to compute the change in KL divergence from pruning a tree CPD. [sent-106, score-0.336]

44 We tried two different methods for computing these distributions: estimating the joint parent probabilities from a large number of samples (one million in our experiments) (“P-Samp”), and forming the product of the parent marginals estimated using mean ﬁeld (“P-MF”). [sent-108, score-0.301]

45 We implement this by starting from a fully pruned network and greedily adding the splits that most decrease KL divergence. [sent-110, score-0.153]

46 After every 10 splits, we check the number of edges by compiling the candidate network to an AC using the C2D compiler. [sent-111, score-0.172]

47 2 Learning from samples The second approach we tried is learning a circuit from a set of generated samples. [sent-115, score-0.272]

48 The samples themselves are generated using forward sampling, in which each variable in the BN is sampled in topological order according to its conditional distribution given its parents. [sent-116, score-0.164]

49 The circuit learning method we chose is the LearnAC algorithm by Lowd and Domingos [13], which greedily learns an AC representing a BN with decision tree CPDs by trading off log likelihood and circuit size. [sent-117, score-0.565]

50 The effect of this modiﬁed procedure is to conservatively selects splits that add few edges to the circuit at ﬁrst, and become increasingly liberal until the edge limit is reached. [sent-120, score-0.301]

51 We also explored using the BN structure to guide the AC structure search (for example, by excluding splits that would violate the partial order of the original BN), but these restrictions offered no signiﬁcant advantage in accuracy. [sent-122, score-0.129]

52 Spending a long time ﬁnding the most accurate circuit may be worthwhile, since the cost is amortized over all queries. [sent-127, score-0.302]

53 We are not the ﬁrst to propose sampling as a method for converting intractable models into tractable ones. [sent-128, score-0.199]

54 They found that the learned models had faster or more accurate inference on a wide range of standard BNs (where exact inference is somewhat tractable). [sent-135, score-0.242]

55 1 Forward sampling In AC2 -F, we use forward sampling to generate a set of samples from the original BN (one million in our experiments) and maximum likelihood estimation to estimate the AC parameters from those samples. [sent-142, score-0.274]

56 AC2 -F can be viewed as approximately minimizing the KL divergence KL(P Q) between the BN distribution P and the AC distribution Q. [sent-146, score-0.134]

57 For conditional queries P (Y |X = xev ), we are more interested in the divergence of the conditional distributions, KL(P (. [sent-147, score-0.488]

58 The following theorem bounds the conditional KL divergence as a function of the unconditional KL divergence: Theorem 1. [sent-150, score-0.204]

59 For discrete probability distributions P and Q, and evidence xev , 1 KL(P (. [sent-151, score-0.286]

60 ) From this theorem, we expect AC2 -F to work better when evidence is likely (i. [sent-154, score-0.136]

61 For rare evidence, the conditional KL divergence could be much larger than the unconditional KL divergence. [sent-157, score-0.204]

62 An alternative is to choose AC parameters that (locally) minimize the reverse KL divergence to the BN conditioned on evidence. [sent-161, score-0.222]

63 In our application, P is the BN conditioned on evidence and Q is the AC. [sent-165, score-0.174]

64 We now discuss how to compute the gradient efﬁciently in a circuit with e edges. [sent-170, score-0.212]

65 By setting leaf values and evaluating the circuit as described by Darwiche [1], we can compute the probability of any conjunctive feature Q(fi ) (or Q(gk )) in O(e) operations. [sent-171, score-0.212]

66 If we differentiate the circuit after conditioning on a feature fi (or gk ), we can obtain the probabilities of the conjunctions Q(fi gj ) (or Q(gk gj )) for all gj in O(e) time. [sent-172, score-0.973]

67 These methods are applicable to any tractable structure represented as an AC, including low treewidth models, mixture models, latent tree models, etc. [sent-175, score-0.159]

68 3 Gibbs sampling While optimizing the reverse KL is a popular choice for approximate inference, there are certain risks. [sent-179, score-0.154]

69 We chose to approximate these expectations using Gibbs sampling, but an alternate inference method (e. [sent-184, score-0.147]

70 4 Experiments In this section, we compare the proposed methods experimentally and demonstrate that approximate compilation is an accurate and efﬁcient technique for inference in intractable networks. [sent-190, score-0.406]

71 1 Datasets We wanted to evaluate our methods on challenging, realistic networks where exact inference is intractable, even for the most sophisticated arithmetic circuit-based techniques. [sent-192, score-0.358]

72 We generated intractable networks by learning them from eight real-world datasets using the WinMine Toolkit [18]. [sent-194, score-0.16]

73 In theory, this additional structure can be exploited by existing arithmetic circuit techniques, but in practice, compilation techniques ran out of memory on all eight networks. [sent-196, score-0.645]

74 Since computing the KL divergence directly is intractable, we approximated it using random samples x(i) : P (x) 1 D(P ||Q) = P (x) log = EP [log(P (x)/Q(x))] ≈ log(P (x(i) )/Q(x(i) )) (8) Q(x) m i x where m is the number of samples (10,000 in our experiments). [sent-201, score-0.297]

75 All circuits were learned using 100,000 samples, and then the parameters were set using AC2 -F with 1 million samples. [sent-204, score-0.15]

76 The learned arithmetic circuit (LAC) achieves the best performance on all datasets, often by a wide margin. [sent-208, score-0.392]

77 We also observe that, of the pruning methods, samples (P-Samp) work better than mean ﬁeld marginals (P-MF). [sent-209, score-0.167]

78 Table 1: KL divergence of different structure selection algorithms. [sent-216, score-0.172]

79 08 Figure 1: Average conditional log likelihood of the query variables (y axis), divided by the number of query variables (x axis). [sent-289, score-0.303]

80 12 Using structures selected by LearnAC, we compared the accuracy of AC2 -F, AC2 -V, and AC2 -G to mean ﬁeld (MF), loopy belief propagation (BP), and Gibbs sampling (Gibbs) on conditional probability queries. [sent-338, score-0.176]

81 16 Since most of these queries are intractable to compute exactly, we cannot determine the true probabilities directly. [sent-347, score-0.169]

82 18 evidence (10%-50% of the total variables), and measured the log 10% 20% 30% 50% conditional probability of the non-evidence variables according to each40% inference method. [sent-349, score-0.376]

83 This approximates the KL divergence between the true and inferred conditional distributions up to a constant. [sent-351, score-0.204]

84 We reduced the variance of this approximation by selecting additional queries for each evidence conﬁguration. [sent-352, score-0.2]

85 Its somewhat worse performance at greater amounts of evidence is consistent with Theorem 1. [sent-358, score-0.136]

86 AC2 -F is also the fastest of the inference methods, making it a very good choice for speedy inference with small to moderate amounts of evidence. [sent-359, score-0.164]

87 Compared to the other AC methods, AC2 -G wins everywhere except for KDD at 10-40% evidence and Netﬂix at 10% evidence. [sent-369, score-0.166]

88 In follow-up experiments, we found that using Gibbs sampling to compute the marginals yielded slightly better accuracy than BP, but much slower. [sent-376, score-0.128]

89 5 Conclusion Arithmetic circuits are an attractive alternative to junction trees due to their ability to exploit determinism and context-speciﬁc independence. [sent-378, score-0.233]

90 However, even with ACs, exact inference remains intractable for many networks of interest. [sent-379, score-0.245]

91 In this paper, we introduced the ﬁrst approximate compilation methods, allowing us to apply ACs to any BN. [sent-380, score-0.216]

92 Our most efﬁcient method, AC2 -F, is faster than traditional approximate inference methods and more accurate most of the time. [sent-381, score-0.158]

93 One of the key lessons is that combining sampling and learning is a good strategy for accurate approximate inference. [sent-383, score-0.145]

94 For structure selection, an AC learning method applied to samples was more effective than exact compilation of a simpliﬁed network. [sent-385, score-0.316]

95 For parameter selection, maximum likelihood estimation applied to Gibbs samples was both faster and more effective than variational inference in ACs. [sent-386, score-0.214]

96 For future work, we hope to extend our methods to Markov networks, in which generating samples is a difﬁcult inference problem in itself. [sent-387, score-0.142]

97 Similar methods could be used to select AC structures tuned to particular queries, since a BN conditioned on evidence can be represented as a Markov network. [sent-388, score-0.174]

98 A variational approach for approximating Bayesian networks by edge deletion. [sent-427, score-0.131]

99 A variational inference procedure allowing internal structure for overlapping clusters and deterministic constraints. [sent-444, score-0.192]

100 Latent tree models and approximate inference in Bayesian networks. [sent-486, score-0.175]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ac', 0.482), ('kl', 0.317), ('bn', 0.245), ('circuit', 0.212), ('compilation', 0.181), ('arithmetic', 0.18), ('gj', 0.169), ('gibbs', 0.155), ('xev', 0.15), ('evidence', 0.136), ('divergence', 0.134), ('acs', 0.121), ('cpds', 0.115), ('eachmovie', 0.113), ('circuits', 0.108), ('fi', 0.099), ('mf', 0.099), ('eq', 0.098), ('junction', 0.095), ('bp', 0.095), ('learnac', 0.094), ('zp', 0.094), ('bns', 0.085), ('inference', 0.082), ('compiling', 0.076), ('gk', 0.074), ('variational', 0.072), ('conditional', 0.07), ('sampling', 0.069), ('intractable', 0.067), ('zq', 0.066), ('queries', 0.064), ('tractable', 0.063), ('lowd', 0.06), ('samples', 0.06), ('network', 0.06), ('networks', 0.059), ('marginals', 0.059), ('tree', 0.058), ('plants', 0.057), ('lac', 0.056), ('msweb', 0.056), ('winmine', 0.056), ('splits', 0.053), ('domingos', 0.051), ('parent', 0.051), ('wi', 0.051), ('query', 0.05), ('reverse', 0.05), ('kdd', 0.049), ('jester', 0.049), ('amortized', 0.049), ('pruning', 0.048), ('simpli', 0.047), ('vj', 0.047), ('variables', 0.045), ('cpd', 0.045), ('ep', 0.045), ('parents', 0.045), ('instantiation', 0.045), ('bayesian', 0.044), ('log', 0.043), ('conditioning', 0.043), ('million', 0.042), ('accurate', 0.041), ('eld', 0.041), ('cup', 0.04), ('greedily', 0.04), ('hq', 0.038), ('structure', 0.038), ('conditioned', 0.038), ('probabilities', 0.038), ('darwiche', 0.038), ('portland', 0.038), ('exact', 0.037), ('book', 0.037), ('loopy', 0.037), ('edges', 0.036), ('approximate', 0.035), ('node', 0.035), ('forward', 0.034), ('elimination', 0.034), ('davis', 0.034), ('eight', 0.034), ('audio', 0.033), ('ix', 0.033), ('hulten', 0.033), ('mas', 0.033), ('finland', 0.033), ('helsinki', 0.033), ('acyclic', 0.032), ('intelligence', 0.032), ('cial', 0.032), ('partly', 0.031), ('expectations', 0.03), ('wins', 0.03), ('auai', 0.03), ('bucket', 0.03), ('compile', 0.03), ('determinism', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 32 nips-2010-Approximate Inference by Compilation to Arithmetic Circuits

Author: Daniel Lowd, Pedro Domingos

2 0.13915032 224 nips-2010-Regularized estimation of image statistics by Score Matching

Author: Diederik P. Kingma, Yann L. Cun

Abstract: Score Matching is a recently-proposed criterion for training high-dimensional density models for which maximum likelihood training is intractable. It has been applied to learning natural image statistics but has so-far been limited to simple models due to the difﬁculty of differentiating the loss with respect to the model parameters. We show how this differentiation can be automated with an extended version of the double-backpropagation algorithm. In addition, we introduce a regularization term for the Score Matching loss that enables its use for a broader range of problem by suppressing instabilities that occur with ﬁnite training sample sizes and quantized input values. Results are reported for image denoising and super-resolution.

3 0.13262792 144 nips-2010-Learning Efficient Markov Networks

Author: Vibhav Gogate, William Webb, Pedro Domingos

Abstract: We present an algorithm for learning high-treewidth Markov networks where inference is still tractable. This is made possible by exploiting context-speciﬁc independence and determinism in the domain. The class of models our algorithm can learn has the same desirable properties as thin junction trees: polynomial inference, closed-form weight learning, etc., but is much broader. Our algorithm searches for a feature that divides the state space into subspaces where the remaining variables decompose into independent subsets (conditioned on the feature and its negation) and recurses on each subspace/subset of variables until no useful new features can be found. We provide probabilistic performance guarantees for our algorithm under the assumption that the maximum feature length is bounded by a constant k (the treewidth can be much larger) and dependences are of bounded strength. We also propose a greedy version of the algorithm that, while forgoing these guarantees, is much more efﬁcient. Experiments on a variety of domains show that our approach outperforms many state-of-the-art Markov network structure learners. 1

4 0.13085099 33 nips-2010-Approximate inference in continuous time Gaussian-Jump processes

Author: Manfred Opper, Andreas Ruttor, Guido Sanguinetti

Abstract: We present a novel approach to inference in conditionally Gaussian continuous time stochastic processes, where the latent process is a Markovian jump process. We ﬁrst consider the case of jump-diffusion processes, where the drift of a linear stochastic differential equation can jump at arbitrary time points. We derive partial differential equations for exact inference and present a very efﬁcient mean ﬁeld approximation. By introducing a novel lower bound on the free energy, we then generalise our approach to Gaussian processes with arbitrary covariance, such as the non-Markovian RBF covariance. We present results on both simulated and real data, showing that the approach is very accurate in capturing latent dynamics and can be useful in a number of real data modelling tasks.

5 0.12204384 56 nips-2010-Deciphering subsampled data: adaptive compressive sampling as a principle of brain communication

Author: Guy Isely, Christopher Hillar, Fritz Sommer

Abstract: A new algorithm is proposed for a) unsupervised learning of sparse representations from subsampled measurements and b) estimating the parameters required for linearly reconstructing signals from the sparse codes. We verify that the new algorithm performs efﬁcient data compression on par with the recent method of compressive sampling. Further, we demonstrate that the algorithm performs robustly when stacked in several stages or when applied in undercomplete or overcomplete situations. The new algorithm can explain how neural populations in the brain that receive subsampled input through ﬁber bottlenecks are able to form coherent response properties. 1

6 0.1075094 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning

7 0.097143926 53 nips-2010-Copula Bayesian Networks

8 0.095300518 63 nips-2010-Distributed Dual Averaging In Networks

9 0.089752994 83 nips-2010-Evidence-Specific Structures for Rich Tractable CRFs

10 0.087676309 101 nips-2010-Gaussian sampling by local perturbations

11 0.080041282 159 nips-2010-Lifted Inference Seen from the Other Side : The Tractable Features

12 0.074801363 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data

13 0.07310275 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks

14 0.072199076 118 nips-2010-Implicit Differentiation by Perturbation

15 0.069489211 194 nips-2010-Online Learning for Latent Dirichlet Allocation

16 0.068993628 268 nips-2010-The Neural Costs of Optimal Control

17 0.068740696 239 nips-2010-Sidestepping Intractable Inference with Structured Ensemble Cascades

18 0.068173572 120 nips-2010-Improvements to the Sequence Memoizer

19 0.066790283 283 nips-2010-Variational Inference over Combinatorial Spaces

20 0.065543227 115 nips-2010-Identifying Dendritic Processing

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.184), (1, 0.037), (2, -0.011), (3, 0.051), (4, -0.145), (5, 0.027), (6, 0.006), (7, 0.046), (8, 0.046), (9, 0.011), (10, -0.152), (11, -0.027), (12, -0.006), (13, -0.064), (14, -0.067), (15, -0.075), (16, -0.03), (17, -0.028), (18, -0.046), (19, -0.019), (20, -0.084), (21, 0.031), (22, 0.048), (23, -0.181), (24, -0.077), (25, -0.062), (26, 0.068), (27, -0.059), (28, -0.006), (29, -0.056), (30, 0.027), (31, -0.018), (32, 0.054), (33, -0.146), (34, 0.006), (35, 0.017), (36, -0.08), (37, 0.073), (38, -0.005), (39, 0.016), (40, 0.045), (41, 0.054), (42, 0.089), (43, 0.119), (44, 0.072), (45, -0.065), (46, -0.178), (47, 0.01), (48, 0.007), (49, -0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93754685 32 nips-2010-Approximate Inference by Compilation to Arithmetic Circuits

Author: Daniel Lowd, Pedro Domingos

2 0.74420363 84 nips-2010-Exact inference and learning for cumulative distribution functions on loopy graphs

Author: Nebojsa Jojic, Chris Meek, Jim C. Huang

Abstract: Many problem domains including climatology and epidemiology require models that can capture both heavy-tailed statistics and local dependencies. Specifying such distributions using graphical models for probability density functions (PDFs) generally lead to intractable inference and learning. Cumulative distribution networks (CDNs) provide a means to tractably specify multivariate heavy-tailed models as a product of cumulative distribution functions (CDFs). Existing algorithms for inference and learning in CDNs are limited to those with tree-structured (nonloopy) graphs. In this paper, we develop inference and learning algorithms for CDNs with arbitrary topology. Our approach to inference and learning relies on recursively decomposing the computation of mixed derivatives based on a junction trees over the cumulative distribution functions. We demonstrate that our systematic approach to utilizing the sparsity represented by the junction tree yields signiďŹ cant performance improvements over the general symbolic differentiation programs Mathematica and D*. Using two real-world datasets, we demonstrate that non-tree structured (loopy) CDNs are able to provide signiďŹ cantly better ďŹ ts to the data as compared to tree-structured and unstructured CDNs and other heavy-tailed multivariate distributions such as the multivariate copula and logistic models.

3 0.63328987 126 nips-2010-Inference with Multivariate Heavy-Tails in Linear Models

Author: Danny Bickson, Carlos Guestrin

Abstract: Heavy-tailed distributions naturally occur in many real life problems. Unfortunately, it is typically not possible to compute inference in closed-form in graphical models which involve such heavy-tailed distributions. In this work, we propose a novel simple linear graphical model for independent latent random variables, called linear characteristic model (LCM), defined in the characteristic function domain. Using stable distributions, a heavy-tailed family of distributions which is a generalization of Cauchy, L´ vy and Gaussian distrie butions, we show for the first time, how to compute both exact and approximate inference in such a linear multivariate graphical model. LCMs are not limited to stable distributions, in fact LCMs are always defined for any random variables (discrete, continuous or a mixture of both). We provide a realistic problem from the field of computer networks to demonstrate the applicability of our construction. Other potential application is iterative decoding of linear channels with non-Gaussian noise. 1

4 0.63202828 144 nips-2010-Learning Efficient Markov Networks

Author: Vibhav Gogate, William Webb, Pedro Domingos

5 0.57195914 118 nips-2010-Implicit Differentiation by Perturbation

Author: Justin Domke

Abstract: This paper proposes a simple and eﬃcient ﬁnite diﬀerence method for implicit diﬀerentiation of marginal inference results in discrete graphical models. Given an arbitrary loss function, deﬁned on marginals, we show that the derivatives of this loss with respect to model parameters can be obtained by running the inference procedure twice, on slightly perturbed model parameters. This method can be used with approximate inference, with a loss function over approximate marginals. Convenient choices of loss functions make it practical to ﬁt graphical models with hidden variables, high treewidth and/or model misspeciﬁcation. 1

6 0.55893636 283 nips-2010-Variational Inference over Combinatorial Spaces

7 0.54612249 224 nips-2010-Regularized estimation of image statistics by Score Matching

8 0.53805727 159 nips-2010-Lifted Inference Seen from the Other Side : The Tractable Features

9 0.50985736 33 nips-2010-Approximate inference in continuous time Gaussian-Jump processes

10 0.50775975 120 nips-2010-Improvements to the Sequence Memoizer

11 0.48992062 83 nips-2010-Evidence-Specific Structures for Rich Tractable CRFs

12 0.48970062 101 nips-2010-Gaussian sampling by local perturbations

13 0.48147237 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning

14 0.47370997 3 nips-2010-A Bayesian Framework for Figure-Ground Interpretation

15 0.44042283 63 nips-2010-Distributed Dual Averaging In Networks

16 0.43030521 190 nips-2010-On the Convexity of Latent Social Network Inference

17 0.41863853 168 nips-2010-Monte-Carlo Planning in Large POMDPs

18 0.41346449 106 nips-2010-Global Analytic Solution for Variational Bayesian Matrix Factorization

19 0.40102145 107 nips-2010-Global seismic monitoring as probabilistic inference

20 0.39931822 239 nips-2010-Sidestepping Intractable Inference with Structured Ensemble Cascades

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.024), (17, 0.016), (27, 0.042), (30, 0.051), (45, 0.69), (50, 0.037), (60, 0.013), (77, 0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99908209 32 nips-2010-Approximate Inference by Compilation to Arithmetic Circuits

Author: Daniel Lowd, Pedro Domingos

2 0.99868006 255 nips-2010-Static Analysis of Binary Executables Using Structural SVMs

Author: Nikos Karampatziakis

Abstract: We cast the problem of identifying basic blocks of code in a binary executable as learning a mapping from a byte sequence to a segmentation of the sequence. In general, inference in segmentation models, such as semi-CRFs, can be cubic in the length of the sequence. By taking advantage of the structure of our problem, we derive a linear-time inference algorithm which makes our approach practical, given that even small programs are tens or hundreds of thousands bytes long. Furthermore, we introduce two loss functions which are appropriate for our problem and show how to use structural SVMs to optimize the learned mapping for these losses. Finally, we present experimental results that demonstrate the advantages of our method against a strong baseline. 1

3 0.99820477 11 nips-2010-A POMDP Extension with Belief-dependent Rewards

Author: Mauricio Araya, Olivier Buffet, Vincent Thomas, Françcois Charpillet

Abstract: Partially Observable Markov Decision Processes (POMDPs) model sequential decision-making problems under uncertainty and partial observability. Unfortunately, some problems cannot be modeled with state-dependent reward functions, e.g., problems whose objective explicitly implies reducing the uncertainty on the state. To that end, we introduce ρPOMDPs, an extension of POMDPs where the reward function ρ depends on the belief state. We show that, under the common assumption that ρ is convex, the value function is also convex, what makes it possible to (1) approximate ρ arbitrarily well with a piecewise linear and convex (PWLC) function, and (2) use state-of-the-art exact or approximate solving algorithms with limited changes. 1

4 0.99812967 167 nips-2010-Mixture of time-warped trajectory models for movement decoding

Author: Elaine Corbett, Eric Perreault, Konrad Koerding

Abstract: Applications of Brain-Machine-Interfaces typically estimate user intent based on biological signals that are under voluntary control. For example, we might want to estimate how a patient with a paralyzed arm wants to move based on residual muscle activity. To solve such problems it is necessary to integrate obtained information over time. To do so, state of the art approaches typically use a probabilistic model of how the state, e.g. position and velocity of the arm, evolves over time – a so-called trajectory model. We wanted to further develop this approach using two intuitive insights: (1) At any given point of time there may be a small set of likely movement targets, potentially identified by the location of objects in the workspace or by gaze information from the user. (2) The user may want to produce movements at varying speeds. We thus use a generative model with a trajectory model incorporating these insights. Approximate inference on that generative model is implemented using a mixture of extended Kalman filters. We find that the resulting algorithm allows us to decode arm movements dramatically better than when we use a trajectory model with linear dynamics. 1 In trod u cti on When patients have lost a limb or the ability to communicate with the outside world, brain machine interfaces (BMIs) are often used to enable robotic prostheses or restore communication. To achieve this, the user's intended state of the device must be decoded from biological signals. In the context of Bayesian statistics, two aspects are important for the design of an estimator of a temporally evolving state: the observation model, which describes how measured variables relate to the system’s state and the trajectory model which describes how the state changes over time in a probabilistic manner. Following this logic many recent BMI applications have relied on Bayesian estimation for a wide range of problems including the decoding of intended human [1] and animal [2] movements. In the context of BMIs, Bayesian approaches offer a principled way of formalizing the uncertainty about signals and thus often result in improvements over other signal processing techniques [1]-[3]. Most work on state estimation in dynamical systems has assumed linear dynamics and Gaussian noise. Under these circumstances, efficient algorithms result from belief propagation. The most frequent application uses the Kalman filter (KF), which recursively combines noisy state observations with the probabilistic evolution of state defined by the trajectory model to estimate the marginal distribution over states [4]. Such approaches have been used widely for applications including upper [1] and lower [5] extremity prosthetic 1 devices, functional electric stimulation [6] and human computer interactions [7]. As these algorithms are so commonly used, it seems promising to develop extensions to nonlinear trajectory models that may better describe the probabilistic distribution of movements in everyday life. One salient departure from the standard assumptions is that people tend to produce both slow and fast movements, depending on the situation. Models with linear dynamics only allow such deviation through the noise term, which makes these models poor at describing the natural variation of movement speeds during real world tasks. Explicitly incorporating movement speed into the trajectory model should lead to better movement estimates. Knowledge of the target position should also strongly affect trajectory models. After all , we tend to accelerate our arm early during movement and slow down later on. Target information can be linearly incorporated into the trajectory model, and this has greatly improved predictions [8]-[12]. Alternatively, if there are a small number of potential targets then a mixture of trajectory models approach [13] can be used. Here we are interested in the case where available data provide a prior over potential t argets but where movement targets may be anywhere. We want to incorporate target uncertainty and allow generalization to novel targets. Prior information about potential targets could come from a number of sources but would generally be noisy. For example, activity in the dorsal premotor cortex provides information about intended target location prior to movement and may be used where such recordings are available [14]. Target information may also be found noninvasively by tracking eye movements. However, such data will generally provide non-zero priors for a number of possible target locations as the subject saccades over the scene. While subjects almost always look at a target before reaching for it [15], there may be a delay of up to a second between looking at the target and the reach – a time interval over which up to 3 saccades are typically made. Each of these fixations could be the target. Hence, a probabilistic distribution of targets is appropriate when using either neural recordings or eye tracking to estimate potential reach targets Here we present an algorithm that uses a mixture of extended Kalman Filters (EKFs) to combine our insights related to the variation of movement speed and the availability of probabilistic target knowledge. Each of the mixture component s allows the speed of the movement to vary continuously over time. We tested how well we could use EMGs and eye movements to decode hand position of humans performing a three -dimensional large workspace reaching task. We find that using a trajectory model that allows for probabilistic target information and variation of speed leads to dramatic improvements in decoding quality. 2 Gen e ral Decod i n g S etti n g We wanted to test how well different decoding algorithms can decode human movement, over a wide range of dynamics. While many recent studies have looked at more restrictive, two-dimensional movements, a system to restore arm function should produce a wide range of 3D trajectories. We recorded arm kinematics and EMGs of healthy subjects during unconstrained 3D reaches to targets over a large workspace. Two healthy subjects were asked to reach at slow, normal and fast speeds, as they would in everyday life. Subjects were seated as they reached towards 16 LEDs in blocks of 150s, which were located on two planes positioned such that all targets were just reachable (Fig 1A). The target LED was lit for one second prior to an auditory go cue, at which time the subject would reach to the target at the appropriate speed. Slow, normal and fast reaches were allotted 3 s, 1.5s and 1s respectively; however, subjects determined the speed. An approximate total of 450 reaches were performed per subject. The subjects provided informed consent, and the protocol was approved by the Northwestern University Institutional Review Board. EMG signals were measured from the pectoralis major, and the three deltoid muscles of the shoulder. This represents a small subset of the muscles involved in reaching, and approximates those muscles retaining some voluntary control following mid-level cervical spinal cord injuries. 2 The EMG signals were band-pass filtered between 10 and 1,000 Hz, and subsequently anti aliased filtered. Hand, wrist, shoulder and head positions were tracked using an Optotrak motion analysis system. We simultaneously recorded eye movements with an ASL EYETRAC-6 head mounted eye tracker. Approximately 25% of the reaches were assigned to the test set, and the rest were used for training. Reaches for which either the motion capture data was incomplete, or there was visible motion artifact on the EMG were removed. As the state we used hand positions and joint angles (3 shoulder, 2 elbow, position, velocity and acceleration, 24 dimensions). Joint angles were calculated from the shoulder and wrist marker data using digitized bony landmarks which defined a coordinate system for the upper limb as detailed by Wu et al. [16]. As the motion data were sampled at 60Hz, the mean absolute value o f the EMG in the corresponding 16.7ms windows was used as an observation of the state at each time-step. Algorithm accuracy was quantified by normalizing the root -mean-squared error by the straight line distance between the first and final position of the endpoint for each reach. We compared the algorithms statistically using repeated measures ANOVAs with Tukey post -hoc tests, treating reach and subject as random effects. In the rest of the paper we will ask how well these reaching movements can be decoded from EMG and eye-tracking data. Figure 1: A Experimental setup and B sample kinematics and processed EMGs for one reach 3 Kal man Fi l ters w i th Target i n f ormati on All models that we consider in this paper assume linear observations with Gaussian noise: (1) where x is the state, y is the observation and v is the measurement noise with p(v) ~ N(0,R), and R is the observation covariance matrix. The model fitted the measured EMGs with an average r2 of 0.55. This highlights the need to integrate information over time. The standard approach also assumes linear dynamics and Gaussian process noise: (2) where, x t represents the hand and joint angle positions, w is the process noise with p(w) ~ N(0,Q), and Q is the state covariance matrix. The Kalman filter does optimal inference for this generative model. This model can effectively capture the dynamics of stereotypical reaches to a single target by appropriately tuning its parameters. However, when used to describe reaches to multiple targets, the model cannot describe target dependent aspects of reaching but boils down to a random drift model. Fast velocities are underestimated as they are unlikely under the trajectory model and there is excessive drift close to the target (Fig. 2A). 3 In many decoding applications we may know the subject’s target. A range of recent studies have addressed the issue of incorporating this information into the trajectory model [8, 13], and we might assume the effect of the target on the dynamics to be linear. This naturally suggests adding the target to the state space, which works well in practice [9, 12]. By appending the target to the state vector (KFT), the simple linear format of the KF may be retained: (3) where xTt is the vector of target positions, with dimensionality less than or equal to that of xt. This trajectory model thus allows describing both the rapid acceleration that characterizes the beginning of a reach and the stabilization towards its end. We compared the accuracy of the KF and the KFT to the Single Target Model (STM), a KF trained only on reaches to the target being tested (Fig. 2). The STM represents the best possible prediction that could be obtained with a Kalman filter. Assuming the target is perfectly known, we implemented the KFT by correctly initializing the target state xT at the beginning of the reach. We will relax this assumption below. The initial hand and joint angle positions were also assumed to be known. Figure 2: A Sample reach and predictions and B average accuracies with standard errors for KFT, KF and MTM. Consistent with the recent literature, both methods that incorporated target information produced higher prediction accuracy than the standard KF (both p<0.0001). Interestingly, there was no significant difference between the KFT and the STM (p=0.9). It seems that when we have knowledge of the target, we do not lose much by training a single model over the whole workspace rather than modeling the targets individually. This is encouraging, as we desire a BMI system that can generalize to any target within the workspace, not just specifically to those that are available in the training data. Clearly, adding the target to the state space allows the dynamics of typical movements to be modeled effectively, resulting in dramatic increases in decoding performance. 4 Ti me Warp i n g 4.1 I m p l e m e n t i n g a t i m e - w a r p e d t r a j e c t o r y mo d e l While the KFT above can capture the general reach trajectory profile, it does not allow for natural variability in the speed of movements. Depending on our task objectives, which would not directly be observed by a BMI, we might lazily reach toward a target or move a t maximal speed. We aim to change the trajectory model to explicitly incorporate a warping factor by which the average movement speed is scaled, allowing for such variability. As the movement speed will be positive in all practical cases, we model the logarithm of this factor, 4 and append it to the state vector: (4) We create a time-warped trajectory model by noting that if the average rate of a trajectory is to be scaled by a factor S, the position at time t will equal that of the original trajectory at time St. Differentiating, the velocity will be multiplied by S, and the acceleration by S 2. For simplicity, the trajectory noise is assumed to be additive and Gaussian, and the model is assumed to be stationary: (5) where Ip is the p-dimensional identity matrix and is a p p matrix of zeros. Only the terms used to predict the acceleration states need to be estimated to build the state transition matrix, and they are scaled as a nonlinear function of xs. After adding the variable movement speed to the state space the system is no longer linear. Therefore we need a different solution strategy. Instead of the typical KFT we use the Extended Kalman Filter (EKFT) to implement a nonlinear trajectory model by linearizing the dynamics around the best estimate at each time-step [17]. With this approach we add only small computational overhead to the KFT recursions. 4.2 Tr a i n i n g t h e t i m e w a r p i n g mo d e l The filter parameters were trained using a variant of the Expectation Maximization (EM) algorithm [18]. For extended Kalman filter learning the initialization for the variables may matter. S was initialized with the ground truth average reach speeds for each movement relative to the average speed across all movements. The state transition parameters were estimated using nonlinear least squares regression, while C, Q and R were estimated linearly for the new system, using the maximum likelihood solution [18] (M-step). For the E-step we used a standard extended Kalman smoother. We thus found the expected values for t he states given the current filter parameters. For this computation, and later when testing the algorithm, xs was initialized to its average value across all reaches while the remaining states were initialized to their true values. The smoothed estimate fo r xs was then used, along with the true values for the other states, to re-estimate the filter parameters in the M-step as before. We alternated between the E and M steps until the log likelihood converged (which it did in all cases). Following the training procedure, the diagonal of the state covariance matrix Q corresponding to xs was set to the variance of the smoothed xs over all reaches, according to how much this state should be allowed to change during prediction. This allowed the estimate of xs to develop over the course of the reach due to the evidence provided by the observations, better capturing the dynamics of reaches at different speeds. 4.3 P e r f o r ma n c e o f t h e t i m e - w a r p e d E K F T Incorporating time warping explicitly into the trajectory model pro duced a noticeable increase in decoding performance over the KFT. As the speed state xs is estimated throughout the course of the reach, based on the evidence provided by the observations, the trajectory model has the flexibility to follow the dynamics of the reach more accurately (Fig. 3). While at the normal self-selected speed the difference between the algorithms is small, for the slow and fast speeds, where the dynamics deviate from average, there i s a clear advantage to the time warping model. 5 Figure 3: Hand positions and predictions of the KFT and EKFT for sample reaches at A slow, B normal and C fast speeds. Note the different time scales between reaches. The models were first trained using data from all speeds (Fig. 4A). The EKFT was 1.8% more accurate on average (p<0.01), and the effect was significant at the slow (1.9%, p<0.05) and the fast (2.8%, p<0.01), but not at the normal (p=0.3) speed. We also trained the models from data using only reaches at the self-selected normal speed, as we wanted to see if there was enough variation to effectively train the EKFT (Fig. 4B). Interestingly, the performance of the EKFT was reduced by only 0.6%, and the KFT by 1.1%. The difference in performance between the EKFT and KFT was even more pronounced on aver age (2.3%, p<0.001), and for the slow and fast speeds (3.6 and 4.1%, both p< 0.0001). At the normal speed, the algorithms again were not statistically different (p=0.6). This result demonstrates that the EKFT is a practical option for a real BMI system, as it is not necessary to greatly vary the speeds while collecting training data for the model to be effective over a wide range of intended speeds. Explicitly incorporating speed information into the trajectory model helps decoding, by modeling the natural variation in volitional speed. Figure 4: Mean and standard error of EKFT and KFT accuracy at the different subjectselected speeds. Models were trained on reaches at A all speeds and B just normal speed reaches. Asterisks indicate statistically significant differences between the algorithms. 5 Mi xtu res of Target s So far, we have assumed that the targets of our reaches are perfectly known. In a real-world system, there will be uncertainty about the intended target of the reach. However, in typical applications there are a small number of possible objectives. Here we address this situation. Drawing on the recent literature, we use a mixture model to consider each of the possible targets [11, 13]. We condition the posterior probability for the state on the N possible targets, T: (6) 6 Using Bayes' Rule, this equation becomes: (7) As we are dealing with a mixture model, we perform the Kalman filter recursion for each possible target, xT, and our solution is a weighted sum of the outputs. The weights are proportional to the prior for that target, , and the likelihood of the model given that target . is independent of the target and does not need to be calculated. We tested mixtures of both algorithms, the mKFT and mEKFT, with real uncert ain priors obtained from eye-tracking in the one-second period preceding movement. As the targets were situated on two planes, the three-dimensional location of the eye gaze was found by projecting its direction onto those planes. The first, middle and last eye samples were selected, and all other samples were assigned to a group according to which of the three was closest. The mean and variance of these three groups were used to initialize three Kalman filters in the mixture model. The priors of the three groups were assigned proportional to the number of samples in them. If the subject looks at multiple positions prior to reaching, this method ensures with a high probability that the correct target was accounted for in one of the filters in the mixture. We also compared the MTM approach of Yu et al. [13], where a different KF model was generated for each target, and a mixture is performed over these models. This approach explicitly captures the dynamics of stereotypical reaches to specific targets. Given perfect target information, it would reduce to the STM described above. Priors for the MTM were found by assigning each valid eye sample to its closest two targets, and weighting the models proportional to the number of samples assigned to the corresponding target, divided by its distance from the mean of those samples. We tried other ways of assigning priors and the one presented gave the best results. We calculated the reduction in decoding quality when instead of perfect priors we provide eye-movement based noisy priors (Fig. 5). The accuracies of the mEKFT, the mKFT and the MTM were only degraded by 0.8, 1.9 and 2.1% respectively, compared to the perfect prior situation. The mEKFT was still close to 10% better than the KF. The mixture model framework is effective in accounting for uncertain priors. Figure 5: Mean and standard errors of accuracy for algorithms with perfect priors, and uncertain priors with full and partial training set. The asterisk indicates a statistically significant effects between the two training types, where real priors are used. Here, only reaches at normal speed were used to train the models, as this is a more realistic training set for a BMI application. This accounts for the degraded performance of the MTM with perfect priors relative to the STM from above (Fig. 2). With even more stereotyped training data for each target, the MTM doesn't generalize as well to new speeds. 7 We also wanted to know if the algorithms could generalize to new targets. In a real application, the available training data will generally not span the entire useable worksp ace. We compared the algorithms where reaches to all targets except the one being tested had been used to train the models. The performance of the MTM was significantly de graded unsurprisingly, as it was designed for reaches to a set of known targets. Performance of the mKFT and mEKFT degraded by about 1%, but not significantly (both p>0.7), demonstrating that the continuous approach to target information is preferable when the target could be anywhere in space, not just at locations for which training data is available. 6 Di scu ssi on and concl u si on s The goal of this work was to design a trajectory model that would improve decoding for BMIs with an application to reaching. We incorporated two features that prominently influence the dynamics of natural reach: the movement speed and the target location. Our approach is appropriate where uncertain target information is available. The model generalizes well to new regions of the workspace for which there is no training data, and across a broad range of reaching dynamics to widely spaced targets in three dimensions. The advantages over linear models in decoding precision we report here could be equally obtained using mixtures over many targets and speeds. While mixture models [11, 13] could allow for slow versus fast movements and any number of potential targets, this strategy will generally require many mixture components. Such an approach would require a lot more training data, as we have shown that it does not generalize well. It would also be run-time intensive which is problematic for prosthetic devices that rely on low power controllers. In contrast, the algorithm introduced here only takes a small amount of additional run-time in comparison to the standard KF approach. The EKF is only marginally slower than the standard KF and the algorithm will not generally need to consider more than 3 mixture components assuming the subject fixates the target within the second pre ceding the reach. In this paper we assumed that subjects always would fixate a reach target – along with other non-targets. While this is close to the way humans usually coordinate eyes and reaches [15], there might be cases where people do not fixate a reach target. Our approach could be easily extended to deal with such situations by adding a dummy mixture component that all ows the description of movements to any target. As an alternative to mixture approaches, a system can explicitly estimate the target position in the state vector [9]. This approach, however, would not straightforwardly allow for the rich target information available; we look at the target but also at other locations, strongly suggesting mixture distributions. A combination of the two approaches could further improve decoding quality. We could both estimate speed and target position for the EKFT in a continuous manner while retaining the mixture over target priors. We believe that the issues that we have addressed here are almost universal. Virtually all types of movements are executed at varying speed. A probabilistic distribution for a small number of action candidates may also be expected in most BMI applications – after all there are usually only a small number of actions that make sense in a given environment. While this work is presented in the context of decoding human reaching, it may be applied to a wide range of BMI applications including lower limb prosthetic devices and human computer interactions, as well as different signal sources such as electrode grid recordings and electroencephalograms. The increased user control in conveying their intended movements would significantly improve the functionality of a neuroprosthetic device. A c k n o w l e d g e me n t s T h e a u t h o r s t h a n k T. H a s w e l l , E . K r e p k o v i c h , a n d V. Ravichandran for assistance with experiments. This work was funded by the NSF Program in Cyber-Physical Systems. R e f e re n c e s [1] L.R. Hochberg, M.D. Serruya, G.M. Friehs, J.A. Mukand, M. Saleh, A.H. Caplan, A. Branner, D. 8 [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] Chen, R.D. Penn, and J.P. Donoghue, “Neuronal ensemble control of prosthetic devices by a human with tetraplegia,” Nature, vol. 442, 2006, pp. 164–171. W. Wu, Y. Gao, E. Bienenstock, J.P. Donoghue, and M.J. Black, “Bayesian population decoding of motor cortical activity using a Kalman filter,” Neural Computation, vol. 18, 2006, pp. 80–118. W. Wu, M.J. Black, Y. Gao, E. Bienenstock, M. Serruya, A. Shaikhouni, and J.P. Donoghue, “Neural decoding of cursor motion using a Kalman filter,” Advances in Neural Information Processing Systems 15: Proceedings of the 2002 Conference, 2003, p. 133. R.E. Kalman, “A new approach to linear filtering and prediction problems,” Journal of basic Engineering, vol. 82, 1960, pp. 35–45. G.G. Scandaroli, G.A. Borges, J.Y. Ishihara, M.H. Terra, A.F.D. Rocha, and F.A.D.O. Nascimento, “Estimation of foot orientation with respect to ground for an above knee robotic prosthesis,” Proceedings of the 2009 IEEE/RSJ international conference on Intelligent robots and systems, St. Louis, MO, USA: IEEE Press, 2009, pp. 1112-1117. I. Cikajlo, Z. Matjačić, and T. Bajd, “Efficient FES triggering applying Kalman filter during sensory supported treadmill walking,” Journal of Medical Engineering & Technology, vol. 32, 2008, pp. 133144. S. Kim, J.D. Simeral, L.R. Hochberg, J.P. Donoghue, and M.J. Black, “Neural control of computer cursor velocity by decoding motor cortical spiking activity in humans with tetraplegia,” Journal of Neural Engineering, vol. 5, 2008, pp. 455-476. L. Srinivasan, U.T. Eden, A.S. Willsky, and E.N. Brown, “A state-space analysis for reconstruction of goal-directed movements using neural signals,” Neural computation, vol. 18, 2006, pp. 2465–2494. G.H. Mulliken, S. Musallam, and R.A. Andersen, “Decoding trajectories from posterior parietal cortex ensembles,” Journal of Neuroscience, vol. 28, 2008, p. 12913. W. Wu, J.E. Kulkarni, N.G. Hatsopoulos, and L. Paninski, “Neural Decoding of Hand Motion Using a Linear State-Space Model With Hidden States,” IEEE Transactions on neural systems and rehabilitation engineering, vol. 17, 2009, p. 1. J.E. Kulkarni and L. Paninski, “State-space decoding of goal-directed movements,” IEEE Signal Processing Magazine, vol. 25, 2008, p. 78. C. Kemere and T. Meng, “Optimal estimation of feed-forward-controlled linear systems,” IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005. Proceedings.(ICASSP'05), 2005. B.M. Yu, C. Kemere, G. Santhanam, A. Afshar, S.I. Ryu, T.H. Meng, M. Sahani, and K.V. Shenoy, “Mixture of trajectory models for neural decoding of goal-directed movements,” Journal of neurophysiology, vol. 97, 2007, p. 3763. N. Hatsopoulos, J. Joshi, and J.G. O'Leary, “Decoding continuous and discrete motor behaviors using motor and premotor cortical ensembles,” Journal of neurophysiology, vol. 92, 2004, p. 1165. R.S. Johansson, G. Westling, A. Backstrom, and J.R. Flanagan, “Eye-hand coordination in object manipulation,” Journal of Neuroscience, vol. 21, 2001, p. 6917. G. Wu, F.C. van der Helm, H.E.J. Veeger, M. Makhsous, P. Van Roy, C. Anglin, J. Nagels, A.R. Karduna, and K. McQuade, “ISB recommendation on definitions of joint coordinate systems of various joints for the reporting of human joint motion–Part II: shoulder, elbow, wrist and hand,” Journal of biomechanics, vol. 38, 2005, pp. 981–992. D. Simon, Optimal state estimation: Kalman, H [infinity] and nonlinear approaches, John Wiley and Sons, 2006. Z. Ghahramani and G.E. Hinton, “Parameter estimation for linear dynamical systems,” University of Toronto technical report CRG-TR-96-2, vol. 6, 1996. 9

5 0.99806809 235 nips-2010-Self-Paced Learning for Latent Variable Models

Author: M. P. Kumar, Benjamin Packer, Daphne Koller

Abstract: Latent variable models are a powerful tool for addressing several tasks in machine learning. However, the algorithms for learning the parameters of latent variable models are prone to getting stuck in a bad local optimum. To alleviate this problem, we build on the intuition that, rather than considering all samples simultaneously, the algorithm should be presented with the training data in a meaningful order that facilitates learning. The order of the samples is determined by how easy they are. The main challenge is that often we are not provided with a readily computable measure of the easiness of samples. We address this issue by proposing a novel, iterative self-paced learning algorithm where each iteration simultaneously selects easy samples and learns a new parameter vector. The number of samples selected is governed by a weight that is annealed until the entire training data has been considered. We empirically demonstrate that the self-paced learning algorithm outperforms the state of the art method for learning a latent structural SVM on four applications: object localization, noun phrase coreference, motif ﬁnding and handwritten digit recognition. 1

6 0.9974364 187 nips-2010-Occlusion Detection and Motion Estimation with Convex Optimization

7 0.99705797 100 nips-2010-Gaussian Process Preference Elicitation

8 0.99495792 133 nips-2010-Kernel Descriptors for Visual Recognition

9 0.9943344 108 nips-2010-Graph-Valued Regression

10 0.97883356 240 nips-2010-Simultaneous Object Detection and Ranking with Weak Supervision

11 0.97676837 83 nips-2010-Evidence-Specific Structures for Rich Tractable CRFs

12 0.9734866 144 nips-2010-Learning Efficient Markov Networks

13 0.97332239 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning

14 0.97119939 197 nips-2010-Optimal Bayesian Recommendation Sets and Myopically Optimal Choice Query Sets

15 0.97021115 118 nips-2010-Implicit Differentiation by Perturbation

16 0.9684661 141 nips-2010-Layered image motion with explicit occlusions, temporal consistency, and depth ordering

17 0.96815526 212 nips-2010-Predictive State Temporal Difference Learning

18 0.96716881 61 nips-2010-Direct Loss Minimization for Structured Prediction

19 0.96537322 93 nips-2010-Feature Construction for Inverse Reinforcement Learning

20 0.96467102 94 nips-2010-Feature Set Embedding for Incomplete Data