nips nips2013 nips2013-101 knowledge-graph by maker-knowledge-mining

101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models


Source: pdf

Author: Khaled Refaat, Arthur Choi, Adnan Darwiche

Abstract: EDML is a recently proposed algorithm for learning parameters in Bayesian networks. It was originally derived in terms of approximate inference on a metanetwork, which underlies the Bayesian approach to parameter estimation. While this initial derivation helped discover EDML in the first place and provided a concrete context for identifying some of its properties (e.g., in contrast to EM), the formal setting was somewhat tedious in the number of concepts it drew on. In this paper, we propose a greatly simplified perspective on EDML, which casts it as a general approach to continuous optimization. The new perspective has several advantages. First, it makes immediate some results that were non-trivial to prove initially. Second, it facilitates the design of EDML algorithms for new graphical models, leading to a new algorithm for learning parameters in Markov networks. We derive this algorithm in this paper, and show, empirically, that it can sometimes learn estimates more efficiently from complete data, compared to commonly used optimization methods, such as conjugate gradient and L-BFGS. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 It was originally derived in terms of approximate inference on a metanetwork, which underlies the Bayesian approach to parameter estimation. [sent-5, score-0.094]

2 We derive this algorithm in this paper, and show, empirically, that it can sometimes learn estimates more efficiently from complete data, compared to commonly used optimization methods, such as conjugate gradient and L-BFGS. [sent-13, score-0.094]

3 1 Introduction EDML is a recently proposed algorithm for learning MAP parameters of a Bayesian network from incomplete data [5, 16]. [sent-14, score-0.059]

4 Some empirical evaluations further suggested that EDML and hybrid EDML/EM algorithms can sometimes find better parameter estimates than vanilla EM, in fewer iterations and less time. [sent-17, score-0.065]

5 EDML was originally derived in terms of approximate inference on a meta-network used for Bayesian approaches to parameter estimation. [sent-18, score-0.077]

6 Second, it facilitates the design of new EDML algorithms for new classes of models, where graphical formulations of parameter estimation, such as meta-networks, are lacking. [sent-25, score-0.069]

7 Here, we derive, in particular, a new parameter estimation algorithm for Markov networks, which is in many ways a more challenging task, compared to the case of Bayesian networks. [sent-26, score-0.049]

8 Empirically, we find that EDML is capable of learning parameter estimates, under complete data, more efficiently than popular methods such as conjugate-gradient and L-BFGS, and in some cases, by an order-of-magnitude. [sent-27, score-0.069]

9 In Section 3, we formulate the EDML algorithm for parameter estimation in Bayesian networks, as an instance of this optimization method. [sent-30, score-0.049]

10 In Section 5, we contrast the two EDML algorithms for directed and undirected graphical models, in the complete data case. [sent-32, score-0.093]

11 We empirically evaluate our new algorithm for parameter estimation under complete data in Markov networks, in Section 6; review related work in Section 7; and conclude in Section 8. [sent-33, score-0.09]

12 Suppose we have a function f(x) over an input vector x = (x1, . . . , xn), where each component xi is a vector in R^{ki} for some ki. [sent-41, score-0.088]

13 Suppose further that we have a constraint on the domain of function f (x) with a corresponding function g that maps an arbitrary point x to a point g(x) satisfying the given constraint. [sent-42, score-0.052]

14 We say in this case that g(x) is a feasibility function and refer to the points in its range as feasible points. [sent-43, score-0.16]

15 Our goal here is to find a feasible input vector x = (x1, . . . , xn) that minimizes the function f(x). [sent-44, score-0.055]

16 Given the difficulty of this optimization problem in general, we will settle for finding stationary points x in the constrained domain of function f (x). [sent-51, score-0.12]

17 One approach for finding such stationary points is as follows. [sent-52, score-0.097]

18 Let x = (x1, . . . , xn) be a feasible point in the domain of function f(x). [sent-59, score-0.09]

19 For each component xi, we define a sub-function fx(xi) = f(x1, . . . , xi-1, xi, xi+1, . . . , xn). [sent-60, score-0.112]

20 Each of these sub-functions is obtained by fixing all inputs xj of f(x), for j ≠ i, to their values in x, while keeping the input xi free. [sent-68, score-0.078]

21 We can now characterize all feasible points x that are stationary with respect to the function f (x), in terms of local conditions on sub-functions fx (xi ). [sent-70, score-0.196]

22 Claim 1: A feasible point x = (x1, . . . , xn) is stationary for function f(x) iff, for all i, component xi is stationary for sub-function fx(xi). [sent-77, score-0.292]

23 This is immediate from the definition of a stationary point. [sent-78, score-0.103]

24 Assuming no constraints, at a stationary point x, the gradient ∇f(x) = 0, i.e. [sent-79, score-0.109]

25 That is, ∇xi f(x) = ∇fx(xi) = 0 for all xi, where ∇xi f(x) denotes the sub-vector of the gradient ∇f(x) with respect to component xi. [sent-81, score-0.285]

26 With these observations, we can now search for feasible stationary points x of the constrained function f(x) using an iterative method that searches instead for stationary points of the constrained sub-functions fx(xi). [sent-82, score-0.368]

27 Start with some feasible point x^t of function f(x) for t = 0; at each iteration, find a stationary point of each sub-function f_{x^t}(xi), combine the resulting components into a point y, and set x^{t+1} = g(y) (a sketch of this loop appears below). [sent-84, score-0.07]

28 With an appropriate feasibility function g(y), one can guarantee that a fixed-point of this procedure yields a stationary point of the constrained function f (x), by Claim 1. [sent-86, score-0.219]

29 Further, any stationary point is trivially a fixed-point of this procedure (one can seed this procedure with such a point). [sent-87, score-0.121]
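
To make the procedure concrete, here is a minimal Python sketch of the loop described above. It assumes the components x1, . . . , xn are stored in one flat vector with known block sizes; the function names, the flat-vector layout, and the use of scipy's general-purpose minimizer are illustrative choices of this summary, not the authors' implementation, and a constrained sub-problem would be handed to a constrained solver instead.

    import numpy as np
    from scipy.optimize import minimize

    def block_fixed_point(f, g, x0, sizes, max_iters=100, tol=1e-8):
        """Repeatedly find a stationary point of each sub-function f_x(x_i)
        (all other components held fixed at their current values), then map the
        combined result back to the feasible set with the feasibility function g."""
        x = g(np.asarray(x0, dtype=float))        # start from a feasible point
        starts = np.cumsum([0] + list(sizes))     # block boundaries in the flat vector
        for _ in range(max_iters):
            y = x.copy()
            for i in range(len(sizes)):
                lo, hi = starts[i], starts[i + 1]
                def sub_f(xi, lo=lo, hi=hi):      # sub-function: vary block i only
                    trial = x.copy()
                    trial[lo:hi] = xi
                    return f(trial)
                y[lo:hi] = minimize(sub_f, x[lo:hi]).x   # stationary point of the sub-function
            x_new = g(y)                          # enforce feasibility on the combined point
            if np.linalg.norm(x_new - x) < tol:   # fixed point of the procedure
                return x_new
            x = x_new
        return x

A fixed point of this loop is a point that the sub-function optimizations and the feasibility map leave unchanged, which is the situation discussed in the surrounding sentences.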

30 Under constraints, we consider points that are stationary with respect to the corresponding Lagrangian. [sent-88, score-0.097]

31 We next show this connection to EDML as proposed for parameter estimation in Bayesian networks. [sent-91, score-0.049]

32 We follow by deriving an EDML algorithm (another instance of the above procedure), but for parameter estimation in undirected graphical models. [sent-92, score-0.101]

33 We will also study the impact of having complete data on both versions of the EDML algorithm, and finally evaluate the new instance of EDML by comparing it to conjugate gradient and L-BFGS when applied to complete datasets. [sent-93, score-0.116]

34 A network parameter will therefore have the general form θx|u , representing the probability Pr (X = x|U = u). [sent-97, score-0.063]

35 Our goal is to find parameter estimates θ that minimize the negative log-likelihood: f(θ) = −ℓ(θ|D) = −∑_{i=1}^N log Pr_θ(di). [sent-102, score-0.066]

36 As such, Pr θ (di ) is the probability of observing example di in dataset D under parameters θ. [sent-111, score-0.098]

37 Each component of θ is a parameter set θX|u , which defines a parameter θx|u for each value x of variable X and instantiation u of its parents U. [sent-112, score-0.111]

38 The feasibility constraint here is that each component θX|u satisfies the convex sum-to-one constraint: ∑_x θx|u = 1. [sent-113, score-0.148]

39 The above parameter estimation problem is clearly in the form of the constrained optimization problem that we phrased in the previous section and, hence, admits the same iterative procedure proposed in that section for finding stationary points. [sent-114, score-0.194]

40 What is the feasibility function g(y) in this case? [sent-119, score-0.088]

41 Figure 1: Estimation given independent soft observations (nodes X1, . . . , XN with corresponding soft-evidence nodes η1, . . . , ηN). [sent-128, score-0.052]

42 This model is illustrated in Figure 1, where our goal is to estimate a parameter set θX, given soft observations η = (η1, . . . , ηN) [sent-131, score-0.08]

43 on variables X1, . . . , XN, where each ηi has a strength specified by a weight on each value xi of Xi. [sent-137, score-0.067]

44 . . . = θx, and (3) the weights P(ηi|xi) denote the strengths of soft evidence ηi on value xi. [sent-144, score-0.127]

45 Theorem 2 Consider Equations 2 and 3, and assume that each soft evidence ηi has the strength P(ηi|xi) = C_u^i + C_{x|u}^i. [sent-146, score-0.088]

46 It then follows that fθ(θX|u) = −log P(η|θX|u) (Equation 4). This theorem yields the following interesting semantics for EDML sub-functions. [sent-147, score-0.073]

47 Consider a parameter set θX|u and example di in our dataset. [sent-148, score-0.114]

48 3 Properties Equation 2 is a convex function, and thus has a unique optimum. [sent-153, score-0.057]

49 The sum of two concave functions is also concave; since our sub-function fθ(θX|u) is the negative of such a sum, it is convex, and it is subject to a convex sum-to-one constraint [16]. [sent-155, score-0.062]

50 4 Finding the Unique Optima In every EDML iteration, and for each parameter set θX|u , we seek the unique optimum for each sub-function fθ (θX|u ), given by Equation 2. [sent-161, score-0.089]

51 Moreover, the solutions it produces already satisfy the convex sum-to-one constraint and, hence, the feasibility function g ends up being the identity function g(θ) = θ. [sent-169, score-0.133]

52 Moreover, while the above procedure is convergent when optimizing sub-functions fθ (θX|u ), the global EDML algorithm that is optimizing function f (θ) may not be convergent in general. [sent-171, score-0.091]

53 5 Connection to Previous Work EDML was originally derived by applying an approximate inference algorithm to a meta-network, which is typically used in Bayesian approaches to parameter estimation [5, 16]. [sent-173, score-0.098]

54 For example, it now follows immediately (from Section 2) that the fixed points of EDML are stationary points of the log-likelihood—a fact that was not proven until [16], using a technique that appealed to the relationship between EDML and EM. [sent-178, score-0.128]

55 Moreover, the proof that EDML under complete data will converge immediately to the optimal estimates is also now immediate (see Section 5). [sent-179, score-0.097]

56 Indeed, in the next section, we use this procedure to derive an EDML instance for Markov networks, which is followed by an empirical evaluation of the new algorithm under complete data. [sent-181, score-0.054]

57 4 EDML for Undirected Models In this section, we show how parameter estimation for undirected graphical models, such as Markov networks, can also be posed as an optimization problem, as described in Section 2. [sent-182, score-0.101]

58 Component θXa is a parameter set for a (tabular) factor a, assigning a number θxa ≥ 0 for each instantiation xa of variables Xa . [sent-190, score-0.279]

59 The negative log-likelihood −ℓ(θ|D) for a Markov network is: −ℓ(θ|D) = N log Zθ − ∑_{i=1}^N log Zθ(di) (Equation 6), where Zθ is the partition function, and where Zθ(di) is the partition function after conditioning on example di, under parameterization θ. [sent-191, score-0.159]

60 Consider instead the following objective function, which we shall subsequently relate to the negative log-likelihood: f(θ) = −∑_{i=1}^N log Zθ(di) (Equation 7), with a feasibility constraint that the partition function Zθ equals some constant α. [sent-193, score-0.144]

61 Moreover, a point θ is stationary for Equation 6 iff the point g(θ) is stationary for Equation 7, subject to its constraint. [sent-201, score-0.207]
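
As a concrete, deliberately brute-force illustration of the quantities in Equations 6 and 7, the sketch below enumerates all states of a tiny pairwise Markov network over binary variables A, B, C (factors AB, BC, AC) to compute the partition function Zθ and the conditioned partition function Zθ(di); the toy parameterization and example are made up for illustration, and a real model would use jointree or other exact inference instead.

    import itertools
    import numpy as np

    def partition(theta, evidence=None):
        """Brute-force Z_theta, or Z_theta(d_i) when evidence fixes variables,
        for a pairwise model over binary A, B, C with factors AB, BC, AC."""
        total = 0.0
        for a, b, c in itertools.product([0, 1], repeat=3):
            state = {"A": a, "B": b, "C": c}
            if evidence and any(state[v] != val for v, val in evidence.items()):
                continue
            total += theta["AB"][a, b] * theta["BC"][b, c] * theta["AC"][a, c]
        return total

    theta = {edge: np.ones((2, 2)) for edge in ["AB", "BC", "AC"]}  # uniform parameters
    print(partition(theta))                            # Z_theta = 8.0
    print(partition(theta, {"A": 1, "B": 0, "C": 1}))  # Z_theta(d_i) = 1.0 for a complete example
    # One term of Equation 6 for this example: log Z_theta - log Z_theta(d_i) = log 8 - log 1.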

62 Note that computing these constants requires inference on a Markov network with parameters θ . [sent-205, score-0.08]

63 Interestingly, both this sub-function (Equation 8) and its constraint (which is now linear) are convex, resulting in a unique optimum, as in Bayesian networks. [sent-206, score-0.056]

64 However, even when θ is a feasible point, the unique optima of these sub-functions may not be feasible when combined. [sent-207, score-0.168]

65 Thus, the feasibility function g(θ) of Theorem 3 must be utilized in this case. [sent-208, score-0.088]

66 We now have another instance of the iterative algorithm proposed in Section 2, but for undirected graphical models. [sent-209, score-0.081]

67 5 EDML under Complete Data We consider now how EDML simplifies under complete data for both Bayesian and Markov networks, identifying forms and properties of the corresponding sub-functions under complete data. [sent-211, score-0.082]

68 Consider a variable X, and a parent instantiation u; and let D#(xu) represent the number of examples that contain xu in the complete dataset D. [sent-213, score-0.121]

69 Given a feasible point (i.e., each θX|u satisfies the sum-to-one constraint), the unique optimum of this sub-function is θx|u = D#(xu)/D#(u), which is guaranteed to yield a feasible point θ, globally. [sent-217, score-0.131]
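
The closed form above is the familiar empirical conditional probability, so under complete data an EDML iteration for a Bayesian network amounts to counting. A small sketch follows; the toy dataset and variable names are hypothetical.

    from collections import Counter

    def complete_data_estimates(data, child, parents):
        """theta[u][x] = D#(xu) / D#(u), computed from a complete dataset given
        as a list of dicts mapping variable names to values."""
        joint, margin = Counter(), Counter()
        for d in data:
            u = tuple(d[p] for p in parents)
            joint[(d[child], u)] += 1
            margin[u] += 1
        return {u: {x: joint[(x, uu)] / margin[u] for (x, uu) in joint if uu == u}
                for u in margin}

    # Hypothetical toy dataset over a child X with a single parent U:
    data = [{"X": 1, "U": 0}, {"X": 0, "U": 0}, {"X": 1, "U": 0}, {"X": 1, "U": 1}]
    print(complete_data_estimates(data, "X", ["U"]))
    # {(0,): {1: 0.666..., 0: 0.333...}, (1,): {1: 1.0}}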

70 Hence, EDML produces the unique optimal estimates in its first iteration and terminates immediately thereafter. [sent-218, score-0.067]

71 Under a complete dataset D, Equation 8 of Theorem 4 reduces to: fθ(θXa) = −∑_{xa} D#(xa) log θxa + C, where C is a constant that is independent of parameter set θXa. [sent-220, score-0.311]

72 Given a feasible point (i.e., one that satisfies Zθ = α), the unique optimum of this sub-function has the closed form θxa = α·D#(xa)/(N·Cxa), which is equivalent to the unique optimum one would obtain in a sub-function for Equation 6 [15, 13]. [sent-223, score-0.122]

73 In contrast to Bayesian networks, the collection of these optima for different parameter sets does not necessarily yield a feasible point θ. [sent-224, score-0.122]

74 Hence, the feasibility function g of Theorem 3 must be applied here. [sent-225, score-0.088]

75 The resulting feasible point, however, may no longer be a stationary point for the corresponding sub-functions, leading EDML to iterate further. [sent-226, score-0.15]

76 Hence, under complete data, EDML for Bayesian networks converges immediately, while EDML for Markov networks may take multiple iterations. [sent-227, score-0.155]

77 Both results are consistent with what is already known in the literature on parameter estimation for Bayesian and Markov networks. [sent-228, score-0.049]

78 The result on Bayesian networks is useful in confirming that EDML performs optimally in this case. [sent-229, score-0.057]

79 The result for Markov networks, however, gives rise to a new algorithm for parameter estimation under complete data. [sent-230, score-0.09]

80 Let D be a complete dataset over three variables A, B and C, specified in terms of the number of times that each instantiation a, b, c appears in D. [sent-232, score-0.093]

81 Suppose we want to learn, from this dataset, a Markov network with 3 edges, (A, B), (B, C) and (A, C), with the corresponding parameter sets θAB, θBC and θAC. [sent-382, score-0.063]

82 If every parameter set is initialized uniformly, i.e., θXY = (1, 1, 1, 1), then Equation 8 gives the sub-function fθ(θAB) = −22·log θab − 15·log θab̄ − 2·log θāb − 61·log θāb̄. [sent-385, score-0.076]

83 Minimizing fθ(θAB) under Zθ = α = 2 corresponds to solving a convex optimization problem, which has the unique solution (θab, θab̄, θāb, θāb̄) = (22/100, 15/100, 2/100, 61/100). [sent-387, score-0.057]

84 We solve similar convex optimization problems for the other parameter sets θBC and θAC, to update the estimates θ. [sent-388, score-0.07]

85 We then apply an appropriate feasibility function g (see Footnote 7), and repeat until convergence. [sent-389, score-0.088]
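
The arithmetic in this example is easy to check numerically. The sketch below minimizes the sub-function for θAB under its linear constraint (scipy is used here as an off-the-shelf solver, a choice of this summary rather than of the authors) and also evaluates the closed form from Section 5; with the uniform initialization the constants Cxa all equal 2, so the constraint ∑ Cxa·θxa = α = 2 reduces to the four parameters summing to one.

    import numpy as np
    from scipy.optimize import minimize

    counts = np.array([22.0, 15.0, 2.0, 61.0])  # D#(ab), D#(a b~), D#(a~ b), D#(a~ b~)
    C = np.array([2.0, 2.0, 2.0, 2.0])          # constants C_xa under the uniform initialization
    alpha, N = 2.0, counts.sum()                # constraint Z_theta = alpha; N = 100 examples

    sub_f = lambda th: -np.sum(counts * np.log(th))   # the complete-data sub-function, up to a constant
    res = minimize(sub_f, x0=np.full(4, 0.5),
                   bounds=[(1e-9, None)] * 4,
                   constraints=[{"type": "eq", "fun": lambda th: C @ th - alpha}])
    print(np.round(res.x, 4))        # approximately [0.22, 0.15, 0.02, 0.61]
    print(alpha * counts / (N * C))  # closed form from Section 5: the same values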

86 6 Experimental Results We evaluate now the efficiency of EDML, conjugate gradient (CG) and Limited-memory BFGS (L-BFGS), when learning Markov networks under complete data. [sent-390, score-0.132]

87 We also simulated datasets from networks used in the probabilistic inference evaluations of UAI-2008, 2010 and 2012, that are amenable to jointree inference. [sent-392, score-0.12]

88 For EDML, we damped parameter estimates at each iteration, which is typical for algorithms like loopy belief propagation, by which EDML was originally inspired [5]. [sent-417, score-0.07]

89 We first run CG until convergence (or until it exceeds 30 minutes) to obtain parameter estimates of some quality qcg (in log likelihood), recording the number of iterations icg and the time tcg required, in minutes. [sent-419, score-0.189]

90 EDML is then run until it obtains an estimate of the same quality qcg, or better, also recording the number of iterations iedml and the time tedml in minutes. [sent-420, score-0.135]

91 We also performed the same comparison with L-BFGS instead of CG, recording the corresponding number of iterations (il-bfgs, iedml) and time taken (tl-bfgs, tedml), giving us the speed-up of EDML over L-BFGS as S = tl-bfgs/tedml. [sent-422, score-0.11]
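
Schematically, the comparison protocol reads as follows; the callables and return conventions are placeholders of this summary, not the authors' experimental code.

    import time

    def compare(baseline_step, edml_step, loglik, theta0, time_limit_s=30 * 60):
        # Phase 1: run the baseline (CG or L-BFGS) to convergence or the time limit,
        # recording the quality q_base it reaches and the time t_base it takes.
        theta, converged, t0, i_base = theta0, False, time.time(), 0
        while not converged and time.time() - t0 < time_limit_s:
            theta, converged = baseline_step(theta)   # one optimizer step; reports convergence
            i_base += 1
        q_base, t_base = loglik(theta), (time.time() - t0) / 60.0

        # Phase 2: run EDML until it matches or exceeds that quality, recording t_edml.
        # (A real harness would also cap EDML's running time.)
        theta, t0, i_edml = theta0, time.time(), 0
        while loglik(theta) < q_base:
            theta = edml_step(theta)
            i_edml += 1
        t_edml = (time.time() - t0) / 60.0
        return {"speedup_S": t_base / t_edml, "iterations": (i_base, i_edml)}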

92 It shows the number of variables in each network (#vars), the average number of iterations taken by each algorithm, and the average speed-up achieved by EDML over CG (L-BFGS). [sent-424, score-0.053]

93 This is due in part to the number of times inference is invoked by CG and L-BFGS (in line search), whereas EDML only needs to invoke inference once per iteration. [sent-430, score-0.052]

94 In the case of complete data, the resulting sub-function is convex, yet for incomplete data, it is not necessarily convex. [sent-441, score-0.065]

95 For relational Markov networks, or Markov networks that otherwise assume a feature-based representation [8], evaluating the likelihood is typically intractable, in which case one instead optimizes the pseudo-log-likelihood [2]. [sent-443, score-0.144]

96 8 Conclusion In this paper, we provided an abstract and simple view of the EDML algorithm, originally proposed for parameter estimation in Bayesian networks, as a particular method for continuous optimization. [sent-445, score-0.072]

97 One consequence of this view is that it is immediate that fixed-points of EDML are stationary points of the log-likelihood, and vice-versa [16]. [sent-446, score-0.12]

98 Empirically, we find that EDML can more efficiently learn parameter estimates for Markov networks under complete data, compared to conjugate gradient and L-BFGS, sometimes by an order-of-magnitude. [sent-448, score-0.179]

99 The empirical evaluation of EDML for Markov networks under incomplete data is left for future work. [sent-449, score-0.081]

100 Footnote 12: For exact inference in Markov networks, we employed a jointree algorithm from the SamIam inference library, http://reasoning.


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('edml', 0.901), ('xa', 0.211), ('cxa', 0.111), ('cg', 0.096), ('feasibility', 0.088), ('di', 0.086), ('stationary', 0.08), ('ab', 0.063), ('adnan', 0.06), ('cx', 0.058), ('networks', 0.057), ('markov', 0.056), ('bayesian', 0.056), ('feasible', 0.055), ('xi', 0.053), ('soft', 0.052), ('iedml', 0.049), ('ipf', 0.049), ('refaat', 0.049), ('equation', 0.044), ('cu', 0.044), ('fx', 0.044), ('pr', 0.042), ('semantics', 0.042), ('complete', 0.041), ('instantiation', 0.04), ('jointree', 0.037), ('tcg', 0.037), ('network', 0.035), ('unique', 0.034), ('khaled', 0.033), ('fxt', 0.033), ('iterative', 0.029), ('parameter', 0.028), ('optimum', 0.027), ('convergent', 0.027), ('inference', 0.026), ('undirected', 0.026), ('graphical', 0.026), ('bc', 0.025), ('apache', 0.025), ('asserted', 0.025), ('icg', 0.025), ('qcg', 0.025), ('subfunctions', 0.025), ('tedml', 0.025), ('optima', 0.024), ('incomplete', 0.024), ('originally', 0.023), ('convex', 0.023), ('immediate', 0.023), ('constrained', 0.023), ('evidence', 0.022), ('constraint', 0.022), ('choi', 0.022), ('suite', 0.022), ('footnote', 0.022), ('procedurally', 0.022), ('em', 0.022), ('arthur', 0.021), ('ac', 0.021), ('estimation', 0.021), ('bfgs', 0.02), ('della', 0.02), ('pietra', 0.02), ('vars', 0.02), ('conjugate', 0.02), ('xn', 0.02), ('constants', 0.019), ('estimates', 0.019), ('log', 0.019), ('iterations', 0.018), ('recording', 0.018), ('points', 0.017), ('votes', 0.017), ('underlies', 0.017), ('library', 0.017), ('subject', 0.017), ('tedious', 0.017), ('xu', 0.016), ('facilitates', 0.015), ('letters', 0.015), ('optimizes', 0.015), ('likelihood', 0.015), ('component', 0.015), ('perspective', 0.015), ('point', 0.015), ('fitting', 0.015), ('shall', 0.015), ('start', 0.014), ('immediately', 0.014), ('strength', 0.014), ('gradient', 0.014), ('procedure', 0.013), ('aij', 0.013), ('benchmarks', 0.013), ('dataset', 0.012), ('parent', 0.012), ('theorem', 0.012), ('optimizing', 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models

Author: Khaled Refaat, Arthur Choi, Adnan Darwiche

Abstract: EDML is a recently proposed algorithm for learning parameters in Bayesian networks. It was originally derived in terms of approximate inference on a metanetwork, which underlies the Bayesian approach to parameter estimation. While this initial derivation helped discover EDML in the first place and provided a concrete context for identifying some of its properties (e.g., in contrast to EM), the formal setting was somewhat tedious in the number of concepts it drew on. In this paper, we propose a greatly simplified perspective on EDML, which casts it as a general approach to continuous optimization. The new perspective has several advantages. First, it makes immediate some results that were non-trivial to prove initially. Second, it facilitates the design of EDML algorithms for new graphical models, leading to a new algorithm for learning parameters in Markov networks. We derive this algorithm in this paper, and show, empirically, that it can sometimes learn estimates more efficiently from complete data, compared to commonly used optimization methods, such as conjugate gradient and L-BFGS. 1

2 0.093800187 291 nips-2013-Sensor Selection in High-Dimensional Gaussian Trees with Nuisances

Author: Daniel S. Levine, Jonathan P. How

Abstract: We consider the sensor selection problem on multivariate Gaussian distributions where only a subset of latent variables is of inferential interest. For pairs of vertices connected by a unique path in the graph, we show that there exist decompositions of nonlocal mutual information into local information measures that can be computed efficiently from the output of message passing algorithms. We integrate these decompositions into a computationally efficient greedy selector where the computational expense of quantification can be distributed across nodes in the network. Experimental results demonstrate the comparative efficiency of our algorithms for sensor selection in high-dimensional distributions. We additionally derive an online-computable performance bound based on augmentations of the relevant latent variable set that, when such a valid augmentation exists, is applicable for any distribution with nuisances. 1

3 0.047724497 151 nips-2013-Learning Chordal Markov Networks by Constraint Satisfaction

Author: Jukka Corander, Tomi Janhunen, Jussi Rintanen, Henrik Nyman, Johan Pensar

Abstract: We investigate the problem of learning the structure of a Markov network from data. It is shown that the structure of such networks can be described in terms of constraints which enables the use of existing solver technology with optimization capabilities to compute optimal networks starting from initial scores computed from the data. To achieve efficient encodings, we develop a novel characterization of Markov network structure using a balancing condition on the separators between cliques forming the network. The resulting translations into propositional satisfiability and its extensions such as maximum satisfiability, satisfiability modulo theories, and answer set programming, enable us to prove optimal certain networks which have been previously found by stochastic search. 1

4 0.03597182 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits

Author: Jason Chang, John W. Fisher III

Abstract: We present an MCMC sampler for Dirichlet process mixture models that can be parallelized to achieve significant computational gains. We combine a nonergodic, restricted Gibbs iteration with split/merge proposals in a manner that produces an ergodic Markov chain. Each cluster is augmented with two subclusters to construct likely split moves. Unlike some previous parallel samplers, the proposed sampler enforces the correct stationary distribution of the Markov chain without the need for finite approximations. Empirical results illustrate that the new sampler exhibits better convergence properties than current methods. 1

5 0.035033762 254 nips-2013-Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms

Author: Adrien Todeschini, François Caron, Marie Chavent

Abstract: We propose a novel class of algorithms for low rank matrix completion. Our approach builds on novel penalty functions on the singular values of the low rank matrix. By exploiting a mixture model representation of this penalty, we show that a suitably chosen set of latent variables enables to derive an ExpectationMaximization algorithm to obtain a Maximum A Posteriori estimate of the completed low rank matrix. The resulting algorithm is an iterative soft-thresholded algorithm which iteratively adapts the shrinkage coefficients associated to the singular values. The algorithm is simple to implement and can scale to large matrices. We provide numerical comparisons between our approach and recent alternatives showing the interest of the proposed approach for low rank matrix completion. 1

6 0.034910042 347 nips-2013-Variational Planning for Graph-based MDPs

7 0.033715498 269 nips-2013-Regression-tree Tuning in a Streaming Setting

8 0.033706177 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching

9 0.031606764 344 nips-2013-Using multiple samples to learn mixture models

10 0.03135227 155 nips-2013-Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition

11 0.030783059 66 nips-2013-Computing the Stationary Distribution Locally

12 0.030666037 132 nips-2013-Global MAP-Optimality by Shrinking the Combinatorial Search Area with Convex Relaxation

13 0.030316994 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search

14 0.030070683 107 nips-2013-Embed and Project: Discrete Sampling with Universal Hashing

15 0.028853659 218 nips-2013-On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations

16 0.028516637 248 nips-2013-Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs

17 0.028424446 271 nips-2013-Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima

18 0.027674127 52 nips-2013-Bayesian inference as iterated random functions with applications to sequential inference in graphical models

19 0.027538722 64 nips-2013-Compete to Compute

20 0.02717492 299 nips-2013-Solving inverse problem of Markov chain with partial observations


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.086), (1, 0.013), (2, -0.007), (3, 0.003), (4, 0.01), (5, 0.034), (6, 0.026), (7, 0.001), (8, 0.006), (9, -0.009), (10, 0.02), (11, -0.01), (12, 0.026), (13, 0.008), (14, -0.005), (15, -0.002), (16, 0.001), (17, -0.004), (18, -0.007), (19, 0.028), (20, 0.019), (21, 0.02), (22, 0.025), (23, -0.027), (24, -0.006), (25, 0.043), (26, -0.018), (27, 0.022), (28, -0.009), (29, 0.049), (30, 0.027), (31, -0.006), (32, -0.01), (33, -0.004), (34, 0.022), (35, 0.044), (36, 0.009), (37, 0.044), (38, 0.036), (39, 0.003), (40, -0.08), (41, -0.019), (42, -0.028), (43, -0.032), (44, -0.034), (45, 0.013), (46, -0.049), (47, 0.005), (48, -0.058), (49, -0.04)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81529826 101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models

Author: Khaled Refaat, Arthur Choi, Adnan Darwiche

Abstract: EDML is a recently proposed algorithm for learning parameters in Bayesian networks. It was originally derived in terms of approximate inference on a metanetwork, which underlies the Bayesian approach to parameter estimation. While this initial derivation helped discover EDML in the first place and provided a concrete context for identifying some of its properties (e.g., in contrast to EM), the formal setting was somewhat tedious in the number of concepts it drew on. In this paper, we propose a greatly simplified perspective on EDML, which casts it as a general approach to continuous optimization. The new perspective has several advantages. First, it makes immediate some results that were non-trivial to prove initially. Second, it facilitates the design of EDML algorithms for new graphical models, leading to a new algorithm for learning parameters in Markov networks. We derive this algorithm in this paper, and show, empirically, that it can sometimes learn estimates more efficiently from complete data, compared to commonly used optimization methods, such as conjugate gradient and L-BFGS. 1

2 0.56894374 122 nips-2013-First-order Decomposition Trees

Author: Nima Taghipour, Jesse Davis, Hendrik Blockeel

Abstract: Lifting attempts to speedup probabilistic inference by exploiting symmetries in the model. Exact lifted inference methods, like their propositional counterparts, work by recursively decomposing the model and the problem. In the propositional case, there exist formal structures, such as decomposition trees (dtrees), that represent such a decomposition and allow us to determine the complexity of inference a priori. However, there is currently no equivalent structure nor analogous complexity results for lifted inference. In this paper, we introduce FO-dtrees, which upgrade propositional dtrees to the first-order level. We show how these trees can characterize a lifted inference solution for a probabilistic logical model (in terms of a sequence of lifted operations), and make a theoretical analysis of the complexity of lifted inference in terms of the novel notion of lifted width for the tree. 1

3 0.56358498 184 nips-2013-Marginals-to-Models Reducibility

Author: Tim Roughgarden, Michael Kearns

Abstract: We consider a number of classical and new computational problems regarding marginal distributions, and inference in models specifying a full joint distribution. We prove general and efficient reductions between a number of these problems, which demonstrate that algorithmic progress in inference automatically yields progress for “pure data” problems. Our main technique involves formulating the problems as linear programs, and proving that the dual separation oracle required by the ellipsoid method is provided by the target problem. This technique may be of independent interest in probabilistic inference. 1

4 0.55820507 291 nips-2013-Sensor Selection in High-Dimensional Gaussian Trees with Nuisances

Author: Daniel S. Levine, Jonathan P. How

Abstract: We consider the sensor selection problem on multivariate Gaussian distributions where only a subset of latent variables is of inferential interest. For pairs of vertices connected by a unique path in the graph, we show that there exist decompositions of nonlocal mutual information into local information measures that can be computed efficiently from the output of message passing algorithms. We integrate these decompositions into a computationally efficient greedy selector where the computational expense of quantification can be distributed across nodes in the network. Experimental results demonstrate the comparative efficiency of our algorithms for sensor selection in high-dimensional distributions. We additionally derive an online-computable performance bound based on augmentations of the relevant latent variable set that, when such a valid augmentation exists, is applicable for any distribution with nuisances. 1

5 0.55107993 332 nips-2013-Tracking Time-varying Graphical Structure

Author: Erich Kummerfeld, David Danks

Abstract: Structure learning algorithms for graphical models have focused almost exclusively on stable environments in which the underlying generative process does not change; that is, they assume that the generating model is globally stationary. In real-world environments, however, such changes often occur without warning or signal. Real-world data often come from generating models that are only locally stationary. In this paper, we present LoSST, a novel, heuristic structure learning algorithm that tracks changes in graphical model structure or parameters in a dynamic, real-time manner. We show by simulation that the algorithm performs comparably to batch-mode learning when the generating graphical structure is globally stationary, and significantly better when it is only locally stationary. 1

6 0.53197289 151 nips-2013-Learning Chordal Markov Networks by Constraint Satisfaction

7 0.52977502 154 nips-2013-Learning Gaussian Graphical Models with Observed or Latent FVSs

8 0.51681155 252 nips-2013-Predictive PAC Learning and Process Decompositions

9 0.51596719 220 nips-2013-On the Complexity and Approximation of Binary Evidence in Lifted Inference

10 0.51219511 52 nips-2013-Bayesian inference as iterated random functions with applications to sequential inference in graphical models

11 0.51129705 134 nips-2013-Graphical Models for Inference with Missing Data

12 0.50218546 340 nips-2013-Understanding variable importances in forests of randomized trees

13 0.49269679 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions

14 0.49240437 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent

15 0.48343828 167 nips-2013-Learning the Local Statistics of Optical Flow

16 0.47906879 160 nips-2013-Learning Stochastic Feedforward Neural Networks

17 0.4763253 344 nips-2013-Using multiple samples to learn mixture models

18 0.4724074 199 nips-2013-More data speeds up training time in learning halfspaces over sparse vectors

19 0.46856532 34 nips-2013-Analyzing Hogwild Parallel Gaussian Gibbs Sampling

20 0.4681755 161 nips-2013-Learning Stochastic Inverses


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.014), (16, 0.041), (33, 0.109), (34, 0.092), (41, 0.027), (49, 0.055), (50, 0.279), (56, 0.102), (70, 0.038), (85, 0.026), (89, 0.035), (93, 0.036), (95, 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.7284078 209 nips-2013-New Subsampling Algorithms for Fast Least Squares Regression

Author: Paramveer Dhillon, Yichao Lu, Dean P. Foster, Lyle Ungar

Abstract: We address the problem of fast estimation of ordinary least squares (OLS) from large amounts of data (n p). We propose three methods which solve the big data problem by subsampling the covariance matrix using either a single or two stage estimation. All three run in the order of size of input i.e. O(np) and our best method, Uluru, gives an error bound of O( p/n) which is independent of the amount of subsampling as long as it is above a threshold. We provide theoretical bounds for our algorithms in the fixed design (with Randomized Hadamard preconditioning) as well as sub-Gaussian random design setting. We also compare the performance of our methods on synthetic and real-world datasets and show that if observations are i.i.d., sub-Gaussian then one can directly subsample without the expensive Randomized Hadamard preconditioning without loss of accuracy. 1

2 0.71186668 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models

Author: Michael Hughes, Erik Sudderth

Abstract: Variational inference algorithms provide the most effective framework for largescale training of Bayesian nonparametric models. Stochastic online approaches are promising, but are sensitive to the chosen learning rate and often converge to poor local optima. We present a new algorithm, memoized online variational inference, which scales to very large (yet finite) datasets while avoiding the complexities of stochastic gradient. Our algorithm maintains finite-dimensional sufficient statistics from batches of the full dataset, requiring some additional memory but still scaling to millions of examples. Exploiting nested families of variational bounds for infinite nonparametric models, we develop principled birth and merge moves allowing non-local optimization. Births adaptively add components to the model to escape local optima, while merges remove redundancy and improve speed. Using Dirichlet process mixture models for image clustering and denoising, we demonstrate major improvements in robustness and accuracy.

same-paper 3 0.69503629 101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models

Author: Khaled Refaat, Arthur Choi, Adnan Darwiche

Abstract: EDML is a recently proposed algorithm for learning parameters in Bayesian networks. It was originally derived in terms of approximate inference on a metanetwork, which underlies the Bayesian approach to parameter estimation. While this initial derivation helped discover EDML in the first place and provided a concrete context for identifying some of its properties (e.g., in contrast to EM), the formal setting was somewhat tedious in the number of concepts it drew on. In this paper, we propose a greatly simplified perspective on EDML, which casts it as a general approach to continuous optimization. The new perspective has several advantages. First, it makes immediate some results that were non-trivial to prove initially. Second, it facilitates the design of EDML algorithms for new graphical models, leading to a new algorithm for learning parameters in Markov networks. We derive this algorithm in this paper, and show, empirically, that it can sometimes learn estimates more efficiently from complete data, compared to commonly used optimization methods, such as conjugate gradient and L-BFGS. 1

4 0.61173606 99 nips-2013-Dropout Training as Adaptive Regularization

Author: Stefan Wager, Sida Wang, Percy Liang

Abstract: Dropout and other feature noising schemes control overfitting by artificially corrupting the training data. For generalized linear models, dropout performs a form of adaptive regularization. Using this viewpoint, we show that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix. We also establish a connection to AdaGrad, an online learning algorithm, and find that a close relative of AdaGrad operates by repeatedly solving linear dropout-regularized problems. By casting dropout as regularization, we develop a natural semi-supervised algorithm that uses unlabeled data to create a better adaptive regularizer. We apply this idea to document classification tasks, and show that it consistently boosts the performance of dropout training, improving on state-of-the-art results on the IMDB reviews dataset. 1

5 0.57128561 121 nips-2013-Firing rate predictions in optimal balanced networks

Author: David G. Barrett, Sophie Denève, Christian K. Machens

Abstract: How are firing rates in a spiking network related to neural input, connectivity and network function? This is an important problem because firing rates are a key measure of network activity, in both the study of neural computation and neural network dynamics. However, it is a difficult problem, because the spiking mechanism of individual neurons is highly non-linear, and these individual neurons interact strongly through connectivity. We develop a new technique for calculating firing rates in optimal balanced networks. These are particularly interesting networks because they provide an optimal spike-based signal representation while producing cortex-like spiking activity through a dynamic balance of excitation and inhibition. We can calculate firing rates by treating balanced network dynamics as an algorithm for optimising signal representation. We identify this algorithm and then calculate firing rates by finding the solution to the algorithm. Our firing rate calculation relates network firing rates directly to network input, connectivity and function. This allows us to explain the function and underlying mechanism of tuning curves in a variety of systems. 1

6 0.56940997 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking

7 0.56380183 236 nips-2013-Optimal Neural Population Codes for High-dimensional Stimulus Variables

8 0.56364232 249 nips-2013-Polar Operators for Structured Sparse Estimation

9 0.56354499 303 nips-2013-Sparse Overlapping Sets Lasso for Multitask Learning and its Application to fMRI Analysis

10 0.56153715 141 nips-2013-Inferring neural population dynamics from multiple partial recordings of the same neural circuit

11 0.56148541 304 nips-2013-Sparse nonnegative deconvolution for compressive calcium imaging: algorithms and phase transitions

12 0.56142509 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits

13 0.56126845 353 nips-2013-When are Overcomplete Topic Models Identifiable? Uniqueness of Tensor Tucker Decompositions with Structured Sparsity

14 0.56119883 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables

15 0.56085759 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA

16 0.55941153 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

17 0.55856019 5 nips-2013-A Deep Architecture for Matching Short Texts

18 0.55820501 318 nips-2013-Structured Learning via Logistic Regression

19 0.55813748 102 nips-2013-Efficient Algorithm for Privately Releasing Smooth Queries

20 0.55812567 355 nips-2013-Which Space Partitioning Tree to Use for Search?