201 nips-2006-Using Combinatorial Optimization within Max-Product Belief Propagation

Source: pdf

Author: Daniel Tarlow, Gal Elidan, Daphne Koller, John C. Duchi

Abstract: In general, the problem of computing a maximum a posteriori (MAP) assignment in a Markov random field (MRF) is computationally intractable. However, in certain subclasses of MRF, an optimal or close-to-optimal assignment can be found very efficiently using combinatorial optimization algorithms: certain MRFs with mutual exclusion constraints can be solved using bipartite matching, and MRFs with regular potentials can be solved using minimum cut methods. However, these solutions do not apply to the many MRFs that contain such tractable components as sub-networks, but also other non-complying potentials. In this paper, we present a new method, called COMPOSE, for exploiting combinatorial optimization for sub-networks within the context of a max-product belief propagation algorithm. COMPOSE uses combinatorial optimization for computing exact max-marginals for an entire sub-network; these can then be used for inference in the context of the network as a whole. We describe highly efficient methods for computing max-marginals for subnetworks corresponding both to bipartite matchings and to regular networks. We present results on both synthetic and real networks encoding correspondence problems between images, which involve both matching constraints and pairwise geometric constraints. We compare to a range of current methods, showing that the ability of COMPOSE to transmit information globally across the network leads to improved convergence, decreased running time, and higher-scoring assignments.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 In this paper, we present a new method, called COMPOSE, for exploiting combinatorial optimization for sub-networks within the context of a max-product belief propagation algorithm. [sent-6, score-0.376]

2 COMPOSE uses combinatorial optimization for computing exact max-marginals for an entire sub-network; these can then be used for inference in the context of the network as a whole. [sent-7, score-0.327]

3 We describe highly efficient methods for computing max-marginals for subnetworks corresponding both to bipartite matchings and to regular networks. [sent-8, score-0.332]

4 We present results on both synthetic and real networks encoding correspondence problems between images, which involve both matching constraints and pairwise geometric constraints. [sent-9, score-0.545]

5 However, the probabilistic inference task in MRFs — computing the posterior distribution of one or more variables — is tractable only in small tree-width networks, which are not often an appropriate model in practice. [sent-12, score-0.166]

6 Thus, one typically must resort to the use of approximate inference methods, most commonly (in recent years) some variant of loopy belief propagation [11]. [sent-13, score-0.352]

7 An alternative approach, whose popularity has grown in recent years, is based on the maximum a posteriori (MAP) inference problem — computing the single most likely assignment relative to the distribution. [sent-14, score-0.186]

8 Somewhat surprisingly, there are certain classes of networks where MAP inference can be performed very efficiently using combinatorial optimization algorithms, even though posterior probability inference is intractable. [sent-15, score-0.35]

9 Regular (or associative) networks [18], where the potentials encode a preference for adjacent variables to take the same value, can be solved optimally or almost optimally using a minimum cut algorithm. [sent-17, score-0.481]

10 Conversely, matching networks, where the potentials encode a type of mutual exclusion constraints between values of adjacent variables, can be solved using matching algorithms. [sent-18, score-0.923]

11 These types of networks have been shown to be applicable in a variety of applications, such as stereo reconstruction [13] and segmentation for regular networks, and image correspondence [15] or word alignment for matching networks [19]. [sent-19, score-0.647]

12 The problem may well have a large component that can be well-modeled as regular or as a matching problem, but there may be additional constraints that take it outside this restricted scope. [sent-21, score-0.337]

13 For example, in a task of registering features between two images or 3D scans, we may formulate the task as a matching problem, but may also want to encode constraints that enforce the preservation of local or global geometry [1]. [sent-22, score-0.34]

14 Unfortunately, once the network contains some “non-complying” potentials, it is not clear if and how one can apply the combinatorial optimization algorithm, even if only as a subroutine. [sent-23, score-0.216]

15 In practice, in such networks, one often simply resorts to applying standard inference methods, such as belief propagation. [sent-24, score-0.209]

16 Unfortunately, belief propagation may be far from an ideal procedure for these types of networks. [sent-25, score-0.274]

17 Indeed, recent empirical studies [17] show that belief propagation methods perform considerably worse than min-cut-based methods when applied to a variety of (purely) regular MRFs. [sent-27, score-0.33]

18 Thus, falling back on belief propagation methods for these MRFs may result in poor performance. [sent-28, score-0.243]

19 The main contribution of this paper is a message-passing scheme for max-product inference that can exploit combinatorial optimization algorithms for tractable subnetworks. [sent-29, score-0.274]

20 The basic idea in our algorithm, called COMPOSE (Combinatorial Optimization for Max-Product on Subnetworks), is that the network can often be partitioned into a number of subnetworks whose union is equivalent to the original distribution. [sent-30, score-0.25]

21 If we have a black box that computes a max-marginal for each variable X in a subnetwork, we can embed that black box as a subroutine in a max-product belief propagation algorithm, without changing the algorithm’s basic properties. [sent-35, score-0.243]

22 In the remainder of this paper, we define the COMPOSE scheme, and show how combinatorial algorithms for both regular networks and matching networks can be embedded in this framework. [sent-36, score-0.623]

23 In particular, we also describe efficient combinatorial optimization algorithms for both types of networks that can compute all the max-marginals in the network at a cost similar to that of finding the single MAP assignment. [sent-37, score-0.364]

24 We evaluate the applicability of COMPOSE on synthetic networks and on an image registration task for scans of a cell obtained using an electron microscope, all of which are matching problems with additional pairwise constraints. [sent-38, score-0.512]

25 2 Markov Random Fields In this paper, for simplicity of presentation, we restrict our discussion to pairwise Markov networks (or Markov Random Fields) over discrete variables X = {X1 , . [sent-41, score-0.2]

26 We denote an assignment of values to X with x, and an assignment of a value to a single variable Xi with xi. [sent-46, score-0.324]

27 A pairwise Markov network M is defined as a graph G = (V, E) and set of potentials F that include both node potentials φi (xi ) and edge potentials φij (xi , xj ). [sent-47, score-0.888]

28 The network encodes a joint probability distribution via an unnormalized density $\tilde{P}_F(x) = \prod_{i=1}^{N} \phi_i(x_i) \prod_{i,j \in U} \phi_{ij}(x_i, x_j)$, defining the distribution as $P_F(x) = \frac{1}{Z} \tilde{P}_F(x)$, where Z is the partition function given by $Z = \sum_{x} \tilde{P}_F(x)$. [sent-48, score-0.172]
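As an illustration of these definitions, the following minimal Python sketch evaluates the unnormalized density, the partition function Z, and the MAP assignment by brute-force enumeration on a toy three-variable chain. The potential tables, the dictionary layout, and the function names are illustrative assumptions and do not come from the paper; enumeration is exponential in the number of variables and serves only to show what each quantity means.

```python
import itertools

# Hypothetical toy chain X0 - X1 - X2 with binary variables (values assumed).
NODE_POT = {0: [1.0, 3.0], 1: [2.0, 1.0], 2: [3.0, 1.0]}
EDGE_POT = {
    (0, 1): [[2.0, 1.0], [1.0, 2.0]],
    (1, 2): [[2.0, 1.0], [1.0, 2.0]],
}


def unnormalized_density(x, node_pot, edge_pot):
    """Product of node potentials phi_i(x_i) and edge potentials phi_ij(x_i, x_j)."""
    score = 1.0
    for i, phi in node_pot.items():
        score *= phi[x[i]]
    for (i, j), phi in edge_pot.items():
        score *= phi[x[i]][x[j]]
    return score


def enumerate_assignments(node_pot):
    """All joint assignments, as tuples indexed by variable id 0..N-1."""
    domains = [range(len(node_pot[i])) for i in sorted(node_pot)]
    return itertools.product(*domains)


def partition_function(node_pot, edge_pot):
    """Z: sum of the unnormalized density over every assignment."""
    return sum(unnormalized_density(x, node_pot, edge_pot)
               for x in enumerate_assignments(node_pot))


def map_assignment(node_pot, edge_pot):
    """MAP: the single highest-scoring assignment; Z is never needed."""
    return max(enumerate_assignments(node_pot),
               key=lambda x: unnormalized_density(x, node_pot, edge_pot))
```

Note that `map_assignment` never calls `partition_function`, which is the reason MAP inference can remain tractable in cases where computing Z is not.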

29 This type of inference is essentially equivalent to computing the partition function, which sums up exponentially many assignments, a computation which is currently intractable except in networks of low tree width. [sent-51, score-0.222]

30 In the MAP problem, we can avoid computing the partition function, so there are certain classes of networks for which the MAP assignment can be computed effectively, even though computing the partition function can be shown to be intractable; we describe two such important classes in Section 4. [sent-53, score-0.279]

31 Max-product belief propagation (MPBP) [20] is a commonly-used method for finding an approximate solution. [sent-55, score-0.243]

32 In this algorithm, each node Xi passes to each of its neighboring nodes Xj, j ∈ Ni, a message which is a vector defining a value for each value xj: $\delta_{i \to j}(x_j) := \max_{x_i} \big[ \phi_i(x_i)\, \phi_{ij}(x_i, x_j) \prod_{k \in N_i - \{j\}} \delta_{k \to i}(x_i) \big]$. [sent-56, score-0.44]
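The message update can be sketched in a few lines of Python. This is a minimal synchronous implementation under assumed data structures (dictionaries of potential tables, a hypothetical function name `mpbp`, and a toy chain); it is not the paper's implementation. On a tree, after enough iterations the beliefs equal the exact unnormalized max-marginals; on loopy graphs the same code only gives an approximation and may not converge.

```python
# Hypothetical toy chain X0 - X1 - X2 with binary variables (values assumed).
NODE_POT = {0: [1.0, 3.0], 1: [2.0, 1.0], 2: [3.0, 1.0]}
EDGE_POT = {
    (0, 1): [[2.0, 1.0], [1.0, 2.0]],
    (1, 2): [[2.0, 1.0], [1.0, 2.0]],
}


def mpbp(node_pot, edge_pot, n_iters):
    """Synchronous max-product BP; returns beliefs b_i(x_i) for each variable."""
    neighbors = {i: set() for i in node_pot}
    for i, j in edge_pot:
        neighbors[i].add(j)
        neighbors[j].add(i)

    def phi_edge(i, j, xi, xj):
        # Each undirected edge is stored once; look it up in either orientation.
        if (i, j) in edge_pot:
            return edge_pot[(i, j)][xi][xj]
        return edge_pot[(j, i)][xj][xi]

    # msgs[(i, j)][xj] is the message delta_{i->j}(xj), initialised to 1.
    msgs = {(i, j): [1.0] * len(node_pot[j])
            for i in neighbors for j in neighbors[i]}

    for _ in range(n_iters):
        new_msgs = {}
        for (i, j) in msgs:
            out = []
            for xj in range(len(node_pot[j])):
                best = 0.0  # potentials are assumed non-negative
                for xi in range(len(node_pot[i])):
                    val = node_pot[i][xi] * phi_edge(i, j, xi, xj)
                    for k in neighbors[i] - {j}:
                        val *= msgs[(k, i)][xi]
                    best = max(best, val)
                out.append(best)
            new_msgs[(i, j)] = out
        msgs = new_msgs

    beliefs = {}
    for i in node_pot:
        belief = list(node_pot[i])
        for k in neighbors[i]:
            belief = [b * m for b, m in zip(belief, msgs[(k, i)])]
        beliefs[i] = belief
    return beliefs
```

Decoding each variable by the argmax of its belief recovers the MAP assignment on a tree when the maximizer is unique.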

33 However, applied to a network with loops, MPBP often does not converge, even when combined with techniques such as smoothing and asynchronous message passing, and the answers obtained can be quite approximate. [sent-60, score-0.207]

34 As the unnormalized probability of an assignment in a Markov network is a product of local potentials, we can partition the potentials in an MRF into an ensemble of k subgraphs G1 , . [sent-62, score-0.437]

35 We require that the product of the potentials in these subnetworks maintain the same information as the original MRF. [sent-72, score-0.32]

36 Even if MAP inference in the original network is intractable, it may be tractable in each of the sub-networks in the ensemble. [sent-78, score-0.19]

37 But how do we combine the results from MAP inference in an ensemble of networks over the same set of variables? [sent-79, score-0.2]

38 We begin by conceptually reformulating the ensemble as a set of networks over disjoint sets of variables (l) (l) {X1 , . [sent-81, score-0.183]

39 More precisely, let δ(l)→i be the message sent from subnetwork l to Xi and δi→(l) the opposite message. [sent-92, score-0.181]

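40 Then we define the COMPOSE message passing scheme as follows: $\delta_{(l) \to i}(x_i) := \max_{x^{(l)} : X_i^{(l)} = x_i} \tilde{P}_{F_l}(x^{(l)}) \prod_{j \neq i} \delta_{j \to (l)}(x_j^{(l)})$ and $\delta_{i \to (l)} := \prod_{l' \neq l} \delta_{(l') \to i}$, where $\tilde{P}_{F_l}$ denotes the unnormalized density of subnetwork l. [sent-93, score-0.188]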

41 It is not difficult to see that this message passing scheme is equivalent to a particular scheduling algorithm for max-product belief propagation over the ensemble of networks, assuming that the max-product computation in each of the subnetworks is computed exactly using a black-box subroutine. [sent-96, score-0.603]
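The scheme can be sketched as follows, with the black box replaced by brute-force enumeration so that the example stays self-contained. Everything here is an assumption made for illustration: the function names, the representation of a subnetwork as a scoring function over full assignments, the message normalization, and the toy data. In the paper's setting the black box would be a combinatorial algorithm (matching or min-cut), not enumeration.

```python
import itertools


def max_marginals(score, domains, incoming, i):
    """Brute-force stand-in for a combinatorial black box.

    For each value xi, returns the max over assignments x with x[i] == xi of
    score(x) * prod_{j != i} incoming[j][x[j]].
    """
    out = [0.0] * domains[i]
    for x in itertools.product(*(range(d) for d in domains)):
        val = score(x)
        for j, xj in enumerate(x):
            if j != i:
                val *= incoming[j][xj]
        out[x[i]] = max(out[x[i]], val)
    return out


def compose(subnet_scores, domains, n_rounds=5):
    """Pass max-marginal messages between subnetworks and shared variables."""
    n, k = len(domains), len(subnet_scores)
    # to_var[l][i] plays the role of the message from subnetwork l to X_i.
    to_var = [[[1.0] * d for d in domains] for _ in range(k)]
    for _ in range(n_rounds):
        for l in range(k):
            # Message from X_i to subnetwork l: product over the other subnetworks.
            to_sub = []
            for i in range(n):
                msg = [1.0] * domains[i]
                for other in range(k):
                    if other != l:
                        msg = [m * t for m, t in zip(msg, to_var[other][i])]
                to_sub.append(msg)
            for i in range(n):
                new = max_marginals(subnet_scores[l], domains, to_sub, i)
                top = max(new) or 1.0  # rescale to avoid under/overflow
                to_var[l][i] = [v / top for v in new]
    beliefs = []
    for i in range(n):
        belief = [1.0] * domains[i]
        for l in range(k):
            belief = [b * t for b, t in zip(belief, to_var[l][i])]
        beliefs.append(belief)
    return beliefs


# Toy ensemble (assumed values): a matching subnetwork plus node potentials.
WEIGHTS = [[3.0, 1.0, 1.0], [1.0, 3.0, 1.0], [1.0, 1.0, 3.0]]
UNARY = [[1.0, 10.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]


def matching_subnet(x):
    """Zero unless x is a one-to-one assignment; otherwise product of weights."""
    if len(set(x)) < len(x):
        return 0.0
    score = 1.0
    for a, b in enumerate(x):
        score *= WEIGHTS[a][b]
    return score


def unary_subnet(x):
    """A second subnetwork holding only node potentials."""
    score = 1.0
    for a, b in enumerate(x):
        score *= UNARY[a][b]
    return score
```

In this toy ensemble the second subnetwork factorizes over variables, so the beliefs are proportional to the exact max-marginals of the product. With a second subnetwork that carries pairwise potentials, the same loop is a loopy approximation with no such guarantee.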

42 We note that this message passing scheme is somewhat related to the tree-reweighted max-product (TRW) method of Wainwright et al. [sent-97, score-0.218]

43 Many problems can be well-formulated as maximum-score (or minimum weight) bipartite matching: We are given a graph G = (V, U), whose nodes are partitioned into disjoint sets V = A ∪ B. [sent-103, score-0.198]

44 A bipartite matching is a subset of the edges W ⊂ U such that each node appears in at most one edge. [sent-105, score-0.365]

45 The notion of a matching can be relaxed to include other types of degree constraints, e. [sent-106, score-0.232]

46 The score of the matching is simply the sum of the scores of the edges in W. [sent-109, score-0.429]

47 The matching problem can also be formulated as an MRF, in several different ways. [sent-110, score-0.201]

48 The edge scores in the matching graph are then simply singleton potentials in the MRF, where φa (Xa = b) = exp(c(a, b)). [sent-112, score-0.649]

49 Unfortunately, while the costs can be easily encoded in an MRF, the degree constraints on the matching induce a set of pairwise mutual-exclusion potentials on all pairs of variables in the MRF, leading to a fully connected network. [sent-113, score-0.514]
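A small Python sketch of this encoding, under assumed names and toy scores: each node a in A becomes a variable whose value is the node b in B it is matched to, the singleton potential is exp(c(a, b)), and every pair of variables shares a mutual-exclusion potential. The MAP assignment is found by enumeration purely for illustration.

```python
import itertools
import math

# Hypothetical score table c(a, b) for 3 nodes in A and 3 nodes in B.
COST = [
    [2.0, 1.0, 0.0],
    [2.5, 0.5, 0.0],
    [0.0, 1.0, 1.5],
]


def singleton_potential(cost, a, b):
    """phi_a(X_a = b) = exp(c(a, b))."""
    return math.exp(cost[a][b])


def mutex_potential(xa1, xa2):
    """Pairwise mutual exclusion: two A-nodes may not pick the same B-node."""
    return 0.0 if xa1 == xa2 else 1.0


def mrf_score(x, cost):
    """Unnormalized density of the matching MRF for assignment x."""
    score = 1.0
    for a, b in enumerate(x):
        score *= singleton_potential(cost, a, b)
    # One potential per pair of variables: the network is fully connected.
    for a1, a2 in itertools.combinations(range(len(x)), 2):
        score *= mutex_potential(x[a1], x[a2])
    return score


def map_by_enumeration(cost):
    """Brute-force MAP over all assignments (illustration only)."""
    n_b = len(cost[0])
    return max(itertools.product(range(n_b), repeat=len(cost)),
               key=lambda x: mrf_score(x, cost))
```

Without the mutual-exclusion potentials, the first two variables would both choose value 0; with them, the MAP assignment is the maximum-score matching.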

50 Thus, standard methods for MRF inference cannot handle the networks associated with matching problems. [sent-114, score-0.368]

51 Nevertheless, finding the maximum score bipartite matching (with any set of degree constraints) can be accomplished easily using standard combinatorial optimization algorithms (e. [sent-115, score-0.507]

52 Fortunately, we can adapt the standard algorithm for finding a single best matching to also find all of the max-marginals. [sent-119, score-0.201]

53 We now run a standard max-weight flow algorithm, and define an edge to be in the matching if it bears flow. [sent-121, score-0.24]

54 A flow in the graph defines a residual graph, where there is an edge in the graph whose capacity is the amount of flow it can carry relative to the current flow. [sent-124, score-0.206]

55 Thus, for example, if the current solution carries a unit of flow along a particular edge (a, b) in the original graph, the residual graph will have an edge with a unit capacity going in the reverse direction, corresponding to the fact that we can now choose to “eliminate” the flow from a to b. [sent-125, score-0.184]

56 The scores in these inverse edges are also negative, corresponding to the fact that score is lost when we reduce the flow. [sent-126, score-0.198]

57 Our goal now is to find, for each pair (a, b), the score of the optimal matching where we force this pair to be matched. [sent-127, score-0.321]

58 Any edges on this new path from A to B will be included in the new matching; any edges from B to A were included in the old matching, but are not in the new matching because of the augmenting path. [sent-130, score-0.307]

59 Thus, we can find the highest-scoring path by simply negating all edge costs and finding the shortest path in the graph. [sent-134, score-0.157]

60 Thus, to compute all of the max-marginals, we simply need to find the shortest path from every node a ∈ A to every node b ∈ B. [sent-135, score-0.18]

61 A very different class of networks that admits an efficient solution is based on the application of a minimum cut algorithm to a graph. [sent-139, score-0.168]
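The residual-graph idea can be sketched for the simplified case of a perfect matching on a complete bipartite graph with additive scores. This is a variant written for illustration, not the paper's exact construction: the optimal matching comes from a brute-force stand-in for the flow solver, all-pairs shortest paths use Floyd-Warshall, and the repair path is taken from b back to a so that it closes an alternating cycle through the forced edge. All names and the score table are assumptions.

```python
import itertools

# Hypothetical 4x4 score table: SCORE[a][b] is the score of matching a to b.
SCORE = [
    [7.0, 2.0, 5.0, 1.0],
    [3.0, 8.0, 2.0, 4.0],
    [6.0, 1.0, 9.0, 3.0],
    [2.0, 5.0, 4.0, 6.0],
]


def best_perfect_matching(score):
    """Brute-force stand-in for a max-weight matching / flow solver."""
    n = len(score)
    return max(itertools.permutations(range(n)),
               key=lambda m: sum(score[a][m[a]] for a in range(n)))


def matching_max_marginals(score):
    """mm[a][b] = best total score over perfect matchings that contain (a, b)."""
    n = len(score)
    match = best_perfect_matching(score)  # match[a] = b
    best = sum(score[a][match[a]] for a in range(n))
    inf = float("inf")
    # Residual graph: nodes 0..n-1 are A-nodes, n..2n-1 are B-nodes.
    # Edge costs are negated scores, so shortest path == highest-scoring path.
    dist = [[inf] * (2 * n) for _ in range(2 * n)]
    for v in range(2 * n):
        dist[v][v] = 0.0
    for a in range(n):
        for b in range(n):
            if match[a] == b:
                dist[n + b][a] = score[a][b]    # reverse edge: dropping (a, b) loses its score
            else:
                dist[a][n + b] = -score[a][b]   # forward edge: adding (a, b) gains its score
    # Floyd-Warshall; valid because an optimal matching leaves no negative cycle.
    for k in range(2 * n):
        for u in range(2 * n):
            for v in range(2 * n):
                if dist[u][k] + dist[k][v] < dist[u][v]:
                    dist[u][v] = dist[u][k] + dist[k][v]
    mm = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            if match[a] == b:
                mm[a][b] = best
            else:
                # Force (a, b), then repair along the best alternating path b -> a.
                mm[a][b] = best + score[a][b] - dist[n + b][a]
    return mm
```

One optimal matching plus one all-pairs shortest-path computation yields every max-marginal, instead of solving a separate matching problem for each forced pair.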

62 At a high level, these networks encode situations where adjacent variables like to take “similar” values. [sent-140, score-0.212]

63 For MRFs with only regular potentials, the MAP solution can be found as the minimum cut of a weighted graph constructed from the MRF [9]. [sent-144, score-0.199]
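A minimal sketch of the min-cut construction for the simplest regular case, written in energy (negative log-potential) form: binary variables, non-negative unary costs, and a non-negative penalty whenever two neighbors disagree. The graph layout, function names, and toy numbers are assumptions for illustration, and the max-flow routine is a plain Edmonds-Karp loop, not a tuned solver.

```python
from collections import deque
import itertools

# Hypothetical binary MRF in energy form.
UNARY = [(0.0, 4.0), (3.0, 1.0), (2.0, 2.0), (5.0, 0.0)]   # (cost of 0, cost of 1)
PAIRWISE = {(0, 1): 2.0, (1, 2): 1.0, (2, 3): 3.0, (0, 3): 1.0}  # disagreement penalty


def energy(x, unary, pairwise):
    """Sum of unary costs plus a penalty for each disagreeing neighbor pair."""
    total = sum(unary[i][xi] for i, xi in enumerate(x))
    total += sum(w for (i, j), w in pairwise.items() if x[i] != x[j])
    return total


def min_cut_map(unary, pairwise):
    """MAP labeling via s-t min cut; source side means label 0."""
    n = len(unary)
    s, t = n, n + 1
    cap = [[0.0] * (n + 2) for _ in range(n + 2)]
    for i, (c0, c1) in enumerate(unary):
        cap[s][i] += c1   # cut when i falls on the sink side (label 1)
        cap[i][t] += c0   # cut when i stays on the source side (label 0)
    for (i, j), w in pairwise.items():
        cap[i][j] += w
        cap[j][i] += w

    def bfs_parents():
        parent = [-1] * (n + 2)
        parent[s] = s
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in range(n + 2):
                if parent[v] == -1 and cap[u][v] > 1e-12:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = bfs_parents()
        if parent[t] == -1:
            break  # no augmenting path left: the flow is maximal
        flow, v = float("inf"), t
        while v != s:
            flow = min(flow, cap[parent[v]][v])
            v = parent[v]
        v = t
        while v != s:
            u = parent[v]
            cap[u][v] -= flow
            cap[v][u] += flow
            v = u
    # Nodes still reachable from the source in the residual graph get label 0.
    return [0 if parent[i] != -1 else 1 for i in range(n)]
```

The value of the cut equals the energy of the returned labeling, so minimizing the cut minimizes the energy, which corresponds to maximizing the product of potentials.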

64 In recent work, Kohli and Torr [7], studying the problem of confidence estimation in MAP problems, showed how all of the max-marginals in a regular network can be computed using dynamic algorithms for flow computations. [sent-148, score-0.17]

65 Their method also applies to non-binary networks with convex potentials (as in [5]), but not to networks for which α-expansion is used to find an approximate MAP assignment. [sent-149, score-0.415]

66 5 Experimental Results We evaluate COMPOSE on the image correspondence problem, which is characteristic of matching problems with geometric constraints. [sent-150, score-0.336]

67 We encode our MRF with a variable Xi for each marker xi in the source image, whose value corresponds to its aligned candidate xj in the target image. [sent-165, score-0.244]

68 The MRF also contains pairwise potentials {φij } that can encode dependencies between the landmark assignments. [sent-167, score-0.293]

69 In particular, we may want to encode geometric potentials, which enforce a preference for preservation of distance or orientation for pairs of markers xi , xj and their assigned targets xk , xl . [sent-168, score-0.335]

70 Finally, as the goal is to find a 1-to-1 mapping between landmarks in the source and target images, we also encode a set of mutual exclusion potentials over pairs of variables, enforcing the constraint that no two markers are assigned to the same candidate xk . [sent-169, score-0.623]

71 , xn } is generated by generating one point from each template point xi , sampling from a Gaussian distribution with mean xi and a diagonal covariance matrix σ 2 I. [sent-180, score-0.274]

72 As there was no true local information, the matching (or singleton) potentials for both types of synthetic networks were generated uniformly at random on [0, 1). [sent-181, score-0.564]

73 The ‘correct’ matching point, or the one the template variable generates, was given weight . [sent-182, score-0.237]

74 7, ensuring that the correct matching gets a non-negligible weight without making the correspondence too obvious. [sent-183, score-0.254]

75 In both cases, we generate pairwise geometric potentials φij (Xi , Xj ) that are Gaussian with mean µ = (xi − xj ) and standard deviation proportional to the Euclidean distance between xi and xj and variance σ². [sent-186, score-0.494]

76 Figure 1: (a) Cumulative percentage of convergent runs versus CPU time on networks with 30 variables and sigma ranging from 3 to 9. [sent-192, score-0.183]

77 Shown is the difference between the log score of each algorithm and the score found by AMP. [sent-194, score-0.186]

78 (d) Score of assignment based on intermediate beliefs versus time for COMPOSE, TRMP, and matching on 100 variable networks. [sent-196, score-0.343]

79 Since sum-product algorithms are known in general to be less susceptible to oscillation than their max-product counterparts, we also compared against sum-product asynchronous belief propagation. [sent-198, score-0.192]

80 Fig. 1(b) shows the average difference in log scores between each algorithm’s result and the average log score of AMP as a function of the number of variables in the networks. [sent-201, score-0.2]

81 Fig. 1(d) examines the intermediate scores obtained by COMPOSE and TRMP on intermediate assignments reached during the inference process, for large (100 variable) problems. [sent-210, score-0.221]

82 We now consider real networks generated for the task of electron microscope tomography: the three-dimensional reconstruction of cell and organelle structures based on a series of images obtained at different tilt angles. [sent-217, score-0.266]

83 The problem is to localize and track markers in images across time, and it is a difficult one; traditional methods like cross correlation and graph matching often result in many errors. [sent-218, score-0.359]

84 We therefore used a variant on AMP called residual max-product (RMP) [3] that schedules messages in an informed way over the network; in this work and others, we have found this variant to achieve better performance than TRMP on difficult networks. [sent-221, score-0.154]

85 Fig. 2(a) shows a source set of markers in an electron tomography image; Fig. [sent-223, score-0.206]

86 Because the network structure was difficult for loopy approximate methods, we ran experiments where we replaced mutual exclusion constraints with soft location constraints on individual landmarks; while convergence improved, actual performance was inferior. [sent-229, score-0.434]

87 It is clear that, though RMP and TRMP run on a simpler network with soft mutual exclusion constraints are competitive with, and even very slightly better than, COMPOSE on simple problems, as problems become more difficult (more variance in target images), COMPOSE clearly dominates. [sent-233, score-0.362]

88 We also compare COMPOSE to simply finding the best matching of markers to candidates without any geometric information; COMPOSE dominates this approach, never scoring worse than the matching. [sent-234, score-0.381]

89 Motivated by the existence of very efficient algorithms to extract all max-marginals from combinatorial substructures, we presented a variation of belief propagation methods that used the max-marginals to take large steps in inference. [sent-236, score-0.344]

90 Some existing variants of belief propagation (such as TRMP) attempt to speed the exchange of information across opposing sides of the network by means of intelligent message scheduling. [sent-239, score-0.417]

91 Even intelligently-scheduled message passing is limited, however, as messages are inherently local. [sent-240, score-0.211]

92 COMPOSE slices the network along a different axis, using subnetworks that are global in nature but that do not have all of the information about any subset of variables. [sent-242, score-0.222]

93 If the component of the network that is difficult for belief propagation can be encoded in an efficient special-purpose subnetwork such as a matching, then we have a means of effectively propagating global information. [sent-243, score-0.416]

94 Some very recent work explores the case where a regular MRF contains terms that are not regular [14, 13], but this work is largely specific to certain types of “close-to-regular” MRFs. [sent-245, score-0.205]

95 Our work is also related to work trying to solve the quadratic assignment problem (QAP) [10], a class of problems of which our generalized matching networks are a special case. [sent-247, score-0.428]

96 Standard algorithms for QAP include simulated annealing, tabu search, branch and bound, and ant algorithms [16]; the latter have some of the flavor of message passing, walking trails over the graph representing a QAP and iteratively updating scores of different assignments to the QAP. [sent-248, score-0.259]

97 To the best of our knowledge, however, none of these previous methods attempts to use a combinatorial algorithm as a component in a general message-passing algorithm, thereby exploiting the structure of the pairwise constraints. [sent-249, score-0.151]

98 A second major direction is the identification of other tractable components within real-world MRFs that one can solve using combinatorial optimization methods, or other efficient approaches. [sent-252, score-0.19]

99 Measuring uncertainty in graph cut solutions - efficiently computing min-marginal energies using dynamic graph cuts. [sent-299, score-0.199]

100 Loopy belief propagation for approximate inference: An empirical study. [sent-318, score-0.243]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ompose', 0.641), ('trmp', 0.208), ('matching', 0.201), ('potentials', 0.181), ('mrf', 0.179), ('mrfs', 0.145), ('exclusion', 0.139), ('subnetworks', 0.139), ('belief', 0.129), ('rmp', 0.121), ('networks', 0.117), ('propagation', 0.114), ('ow', 0.113), ('assignment', 0.11), ('xi', 0.104), ('combinatorial', 0.101), ('score', 0.093), ('message', 0.091), ('subnetwork', 0.09), ('regular', 0.087), ('network', 0.083), ('bipartite', 0.08), ('scores', 0.074), ('pf', 0.073), ('amp', 0.069), ('electron', 0.069), ('markers', 0.069), ('map', 0.067), ('singleton', 0.063), ('passing', 0.063), ('encode', 0.062), ('graph', 0.061), ('xj', 0.059), ('tractable', 0.057), ('messages', 0.057), ('correspondence', 0.053), ('mutual', 0.053), ('node', 0.053), ('communicator', 0.052), ('microscope', 0.052), ('mpbp', 0.052), ('qap', 0.052), ('cut', 0.051), ('inference', 0.05), ('pairwise', 0.05), ('constraints', 0.049), ('kolmogorov', 0.048), ('nb', 0.048), ('ij', 0.048), ('om', 0.045), ('markov', 0.045), ('residual', 0.045), ('path', 0.044), ('landmarks', 0.041), ('image', 0.041), ('geometric', 0.041), ('source', 0.04), ('na', 0.04), ('candidates', 0.04), ('edge', 0.039), ('target', 0.038), ('solved', 0.037), ('template', 0.036), ('elidan', 0.035), ('maxmarginals', 0.035), ('satisfaction', 0.035), ('substructures', 0.035), ('loops', 0.034), ('synthetic', 0.034), ('scheme', 0.034), ('variables', 0.033), ('assignments', 0.033), ('ensemble', 0.033), ('asynchronous', 0.033), ('loopy', 0.033), ('runs', 0.033), ('intermediate', 0.032), ('optimization', 0.032), ('edges', 0.031), ('types', 0.031), ('unnormalized', 0.03), ('simply', 0.03), ('xn', 0.03), ('transmit', 0.03), ('kohli', 0.03), ('maxproduct', 0.03), ('nodes', 0.029), ('intractable', 0.029), ('eccv', 0.029), ('images', 0.028), ('convergence', 0.028), ('partitioned', 0.028), ('tomography', 0.028), ('chatalbashev', 0.028), ('veksler', 0.028), ('force', 0.027), ('variant', 0.026), ('uai', 0.026), ('computing', 0.026), ('maxx', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 201 nips-2006-Using Combinatorial Optimization within Max-Product Belief Propagation

Author: Daniel Tarlow, Gal Elidan, Daphne Koller, John C. Duchi

2 0.18176354 74 nips-2006-Efficient Structure Learning of Markov Networks using $L_1$-Regularization

Author: Su-in Lee, Varun Ganapathi, Daphne Koller

Abstract: Markov networks are commonly used in a wide variety of applications, ranging from computer vision, to natural language, to computational biology. In most current applications, even those that rely heavily on learned models, the structure of the Markov network is constructed by hand, due to the lack of effective algorithms for learning Markov network structure from data. In this paper, we provide a computationally efficient method for learning Markov network structure from data. Our method is based on the use of L1 regularization on the weights of the log-linear model, which has the effect of biasing the model towards solutions where many of the parameters are zero. This formulation converts the Markov network learning problem into a convex optimization problem in a continuous space, which can be solved using efficient gradient methods. A key issue in this setting is the (unavoidable) use of approximate inference, which can lead to errors in the gradient computation when the network structure is dense. Thus, we explore the use of different feature introduction schemes and compare their performance. We provide results for our method on synthetic data, and on two real world data sets: pixel values in the MNIST data, and genetic sequence variations in the human HapMap data. We show that our L1-based method achieves considerably higher generalization performance than the more standard L2-based method (a Gaussian parameter prior) or pure maximum-likelihood learning. We also show that we can learn MRF network structure at a computational cost that is not much greater than learning parameters alone, demonstrating the existence of a feasible method for this important problem.

3 0.14455551 35 nips-2006-Approximate inference using planar graph decomposition

Author: Amir Globerson, Tommi S. Jaakkola

Abstract: A number of exact and approximate methods are available for inference calculations in graphical models. Many recent approximate methods for graphs with cycles are based on tractable algorithms for tree structured graphs. Here we base the approximation on a different tractable model, planar graphs with binary variables and pure interaction potentials (no external field). The partition function for such models can be calculated exactly using an algorithm introduced by Fisher and Kasteleyn in the 1960s. We show how such tractable planar models can be used in a decomposition to derive upper bounds on the partition function of non-planar models. The resulting algorithm also allows for the estimation of marginals. We compare our planar decomposition to the tree decomposition method of Wainwright et al., showing that it results in a much tighter bound on the partition function, improved pairwise marginals, and comparable singleton marginals. Graphical models are a powerful tool for modeling multivariate distributions, and have been successfully applied in various fields such as coding theory and image processing. Applications of graphical models typically involve calculating two types of quantities, namely marginal distributions, and MAP assignments. The evaluation of the model partition function is closely related to calculating marginals [12]. These three problems can rarely be solved exactly in polynomial time, and are provably computationally hard in the general case [1]. When the model conforms to a tree structure, however, all these problems can be solved in polynomial time. This has prompted extensive research into tree based methods. For example, the junction tree method [6] converts a graphical model into a tree by clustering nodes into cliques, such that the graph over cliques is a tree. The resulting maximal clique size (cf. tree width) may nevertheless be prohibitively large. Wainwright et al.
[9, 11] proposed an approximate method based on trees known as tree reweighting (TRW). The TRW approach decomposes the potential vector of a graphical model into a mixture over spanning trees of the model, and then uses convexity arguments to bound various quantities, such as the partition function. One key advantage of this approach is that it provides bounds on partition function value, a property which is not shared by approximations based on Bethe free energies [13]. In this paper we focus on a different class of tractable models: planar graphs. A graph is called planar if it can be drawn in the plane without crossing edges. Works in the 1960s by physicists Fisher [5] and Kasteleyn [7], among others, have shown that the partition function for planar graphs may be calculated in polynomial time. This, however, is true under two key restrictions. One is that the variables xi are binary. The other is that the interaction potential depends only on xi xj (where xi ∈ {±1}), and not on their individual values (i.e., the zero external field case). Here we show how the above method can be used to obtain upper bounds on the partition function for non-planar graphs. As in TRW, we decompose the potential of a non-planar graph into a sum over spanning planar models, and then use a convexity argument to obtain an upper bound on the log partition function. The bound optimization is a convex problem, and can be solved in polynomial time. We compare our method with TRW on a planar graph with an external field, and show that it performs favorably with respect to both pairwise marginals and the bound on the partition function, and the two methods give similar results for singleton marginals. 1 Definitions and Notations Given a graph G with n vertices and a set of edges E, we are interested in pairwise Markov Random Fields (MRF) over the graph G. A pairwise MRF [13] is a multivariate distribution over variables x = {x1 , . . . 
, xn } defined as $p(x) = \frac{1}{Z} e^{\sum_{ij \in E} f_{ij}(x_i, x_j)}$ (1) where fij are a set of |E| functions, or interaction potentials, defined over pairs of variables. The partition function is defined as $Z = \sum_{x} e^{\sum_{ij \in E} f_{ij}(x_i, x_j)}$. Here we will focus on the case where xi ∈ {±1}. Furthermore, we will be interested in interaction potentials which only depend on agreement or disagreement between the signs of their variables. We define those by $f_{ij}(x_i, x_j) = \frac{1}{2} \theta_{ij} (1 + x_i x_j) = \theta_{ij} I(x_i = x_j)$ (2) so that fij (xi , xj ) is zero if xi ≠ xj and θij if xi = xj . The model is then defined via the set of parameters θij . We use θ to denote the vector of parameters θij , and denote the partition function by Z(θ) to highlight its dependence on these parameters. A graph G is defined as planar if it can be drawn in the plane without any intersection of edges [4]. With some abuse of notation, we define E as the set of line segments in $\mathbb{R}^2$ corresponding to the edges in the graph. The regions of $\mathbb{R}^2 \setminus E$ are defined as the faces of the graph. The face which corresponds to an unbounded region is called the external face. Given a planar graph G, its dual graph G∗ is defined in the following way: the vertices of G∗ correspond to faces of G, and there is an edge between two vertices in G∗ iff the two corresponding faces in G share an edge. If the graph G is weighted, the weight on an edge in G∗ is the weight on the edge shared by the corresponding faces in G. A plane triangulation of a planar graph G is obtained from G by adding edges such that all the faces of the resulting graph have exactly three vertices. Thus a plane triangulated graph has a dual where all vertices have degree three. It can be shown that every plane graph can be plane triangulated [4]. We shall also need the notion of a perfect matching on a graph. A perfect matching on a graph G is defined as a set of edges H ⊆ E such that every vertex in G has exactly one edge in H incident on it. 
If the graph is weighted, the weight of the matching is defined as the product of the weights of the edges in the matching. Finally, we recall the definition of a marginal polytope of a graph [12]. Consider an MRF over a graph G where fij are given by Equation 2. Denote the probability of the event I(xi = xj ) under p(x) by τij . The marginal polytope of G, denoted by M(G), is defined as the set of values τij that can be obtained under some assignment to the parameters θij . For a general graph G the polytope M(G) cannot be described using a polynomial number of inequalities. However, for planar graphs, it turns out that a set of $O(n^3)$ constraints, commonly referred to as triangle inequalities, suffice to describe M(G) (see [3] page 434). The triangle inequalities are defined by $\mathrm{TRI}(n) = \{\tau_{ij} : \tau_{ij} + \tau_{jk} - \tau_{ik} \le 1,\ \tau_{ij} + \tau_{jk} + \tau_{ik} \ge 1,\ \forall i, j, k \in \{1, \ldots, n\}\}$ (3) (Footnote 1: The definition here is slightly different from that in [3], since here we refer to agreement probabilities, whereas [3] refers to disagreement probabilities. This polytope is also referred to as the cut polytope.) Note that the above inequalities actually contain variables τij which do not correspond to edges in the original graph G. Thus the equality M(G) = TRI(n) should be understood as referring only to the values of τij that correspond to edges in the graph. Importantly, the values of τij for edges not in the graph need not be valid marginals for any MRF. In other words M(G) is a projection of TRI(n) on the set of edges of G. It is well known that the marginal polytope for trees is described via pairwise constraints. It is thus interesting that for planar graphs, it is triplets, rather than pairwise constraints, that characterize the polytope. In this sense, planar graphs and trees may be viewed as a hierarchy of polytope complexity classes. It remains an interesting problem to characterize other structures in this hierarchy and their related inference algorithms. 
2 Exact calculation of partition function using perfect matching

The seminal works of Kasteleyn [7] and Fisher [5] have shown how one can calculate the partition function for a binary MRF over a planar graph with pure interaction potentials. We briefly review Fisher's construction, which we will use in what follows. Our interpretation of the method differs somewhat from that of Fisher, but we believe it is more straightforward. The key idea in calculating the partition function is to convert the summation over values of x to the problem of calculating the sum of weights of all perfect matchings in a graph constructed from G, as shown below. In this section, we consider weighted graphs (graphs with numbers assigned to their edges). For the graph G associated with the pairwise MRF, we assign weights $w_{ij} = e^{2\theta_{ij}}$ to the edges.

The first step in the construction is to plane triangulate the graph G. Let us call the resulting graph $G_T$. We define an MRF on $G_T$ by assigning a parameter $\theta_{ij} = 0$ to the edges that have been added to G, and the corresponding weight $w_{ij} = 1$. Thus $G_T$ essentially describes the same distribution as G, and therefore has the same partition function. We can thus restrict our attention to calculating the partition function for the MRF on $G_T$.

As a first step in calculating a partition function over $G_T$, we introduce the following definition: a set of edges $\hat{E}$ in $G_T$ is an agreement edge set (or AES) if for every triangle face F in $G_T$ one of the following holds: the edges in F are all in $\hat{E}$, or exactly one of the edges in F is in $\hat{E}$. The weight of a set $\hat{E}$ is defined as the product of the weights of the edges in $\hat{E}$.

It can be shown that there exists a bijection between pairs of assignments $\{x, -x\}$ and agreement edge sets. The mapping from x to an edge set is simply the set of edges such that $x_i = x_j$. It is easy to see that this is an agreement edge set.
The reverse mapping is obtained by finding an assignment x such that $x_i = x_j$ iff the corresponding edge is in the agreement edge set. The existence of this mapping can be shown by induction on the number of (triangle) faces.

The contribution of a given assignment x to the partition function is $e^{\sum_{ij \in E} \theta_{ij} I(x_i = x_j)}$. If x corresponds to an AES denoted by $\hat{E}$ it is easy to see that

$$e^{\sum_{ij \in E} \theta_{ij} I(x_i = x_j)} = e^{-\sum_{ij \in E} \theta_{ij}}\, e^{\sum_{ij \in \hat{E}} 2\theta_{ij}} = c\, e^{\sum_{ij \in \hat{E}} 2\theta_{ij}} = c \prod_{ij \in \hat{E}} w_{ij} \qquad (4)$$

where $c = e^{-\sum_{ij \in E} \theta_{ij}}$. Define the superset $\Lambda$ as the set of agreement edge sets. The above then implies that $Z(\theta) = 2c \sum_{\hat{E} \in \Lambda} \prod_{ij \in \hat{E}} w_{ij}$, and is thus proportional to the sum of AES weights.

To sum over agreement edge sets, we use the following elegant trick introduced by Fisher [5]. Construct a new graph $G_{PM}$ from the dual of $G_T$ by introducing new vertices and edges according to the following rule: replace each original vertex with three vertices that are connected to each other, and assign a weight of one to the new edges. Next, consider the three neighbors of the original vertex (footnote 2). Connect each of the three new vertices to one of these three neighbors, keeping the original weights on these edges. The transformation is illustrated in Figure 1. The new graph $G_{PM}$ has O(3n) vertices, and is also planar. It can be seen that there is a one to one correspondence between perfect matchings in $G_{PM}$ and agreement edge sets in $G_T$. Define $\Omega$ to be the set of perfect matchings in $G_{PM}$. Then $Z(\theta) = 2c \sum_{M \in \Omega} \prod_{ij \in M} w_{ij}$, where we have used the fact that all the new weights have a value of one. Thus, the partition function is a sum over the weights of perfect matchings in $G_{PM}$.

Finally, we need a way of summing over the weights of the set of perfect matchings in a graph.
Kasteleyn [7] proved that for a planar graph $G_{PM}$, this sum may be obtained using the following sequence of steps:

• Direct the edges of the graph $G_{PM}$ such that for every face (except possibly the external face), the number of edges on its perimeter oriented in a clockwise manner is odd. Kasteleyn showed that such a so-called Pfaffian orientation may be constructed in polynomial time for a planar graph (see also [8] page 322).
• Define the matrix $P(G_{PM})$ to be a skew symmetric matrix such that $P_{ij} = 0$ if ij is not an edge, $P_{ij} = w_{ij}$ if the arrow on edge ij runs from i to j and $P_{ij} = -w_{ij}$ otherwise.
• The sum over weighted matchings can then be shown to equal $\sqrt{|P(G_{PM})|}$, where $|\cdot|$ denotes the determinant.

The partition function is thus given by $Z(\theta) = 2c \sqrt{|P(G_{PM})|}$.

Footnote 2: Note that in the dual of $G_T$ all vertices have degree three, since $G_T$ is plane triangulated.

Figure 1: Illustration of the graph transformations in Section 2 for a complete graph with four vertices. Left panel shows the original weighted graph (dotted edges and grey vertices) and its dual (solid edges and black vertices). Right panel shows the dual graph with each vertex replaced by a triangle (the graph $G_{PM}$ in the text). Weights for dual graph edges correspond to the weights on the original graph.

To conclude this section we reiterate the following two key points: the partition function of a binary MRF over a planar graph with interaction potentials as in Equation 2 may be calculated in polynomial time by calculating the determinant of a matrix of size O(3n). An important outcome of this result is that the functional relation between $Z(\theta)$ and the parameters $\theta_{ij}$ is known, a fact we shall use in what follows.

3 Partition function bounds via planar decomposition

Given a non-planar graph G over binary variables with a vector of interaction potentials $\theta$, we wish to use the exact planar computation to obtain a bound on the partition function of the MRF on G.
We assume for simplicity that the potentials on the MRF for G are given in the form of Equation 2. Thus, G violates the assumptions of the previous section only in its non-planarity. Define $G^{(r)}$ as a set of spanning planar subgraphs of G, i.e., each graph $G^{(r)}$ is planar and contains all the vertices of G and some of its edges. Denote by m the number of such graphs. Introduce the following definitions:

• $\theta^{(r)}$ is a set of parameters on the edges of $G^{(r)}$, and $\theta^{(r)}_{ij}$ is an element in this set. $Z(\theta^{(r)})$ is the partition function of the MRF on $G^{(r)}$ with parameters $\theta^{(r)}$.
• $\hat{\theta}^{(r)}$ is a set of parameters on the edges of G such that if edge (ij) is in $G^{(r)}$ then $\hat{\theta}^{(r)}_{ij} = \theta^{(r)}_{ij}$, and otherwise $\hat{\theta}^{(r)}_{ij} = 0$.

Given a distribution $\rho(r)$ on the graphs $G^{(r)}$ (i.e., $\rho(r) \ge 0$ for $r = 1, \ldots, m$ and $\sum_r \rho(r) = 1$), assume that the parameters for $G^{(r)}$ are such that

$$\theta = \sum_r \rho(r)\, \hat{\theta}^{(r)} \qquad (5)$$

Then, by the convexity of the log partition function, as a function of the model parameters, we have

$$\log Z(\theta) \le \sum_r \rho(r) \log Z(\theta^{(r)}) \equiv f(\theta, \rho, \theta^{(r)}) \qquad (6)$$

Since by assumption the graphs $G^{(r)}$ are planar, this bound can be calculated in polynomial time. Since this bound is true for any set of parameters $\theta^{(r)}$ which satisfies the condition in Equation 5 and for any distribution $\rho(r)$, we may optimize over these two variables to obtain the tightest bound possible. Define the optimal bound for a fixed value of $\rho(r)$ by $g(\rho, \theta)$ (optimization is w.r.t. $\theta^{(r)}$):

$$g(\rho, \theta) = \min_{\theta^{(r)} :\ \sum_r \rho(r) \hat{\theta}^{(r)} = \theta} f(\theta, \rho, \theta^{(r)}) \qquad (7)$$

Also, define the optimum of the above w.r.t. $\rho$ by $h(\theta)$:

$$h(\theta) = \min_{\rho(r) \ge 0,\ \sum_r \rho(r) = 1} g(\rho, \theta) \qquad (8)$$

Thus, $h(\theta)$ is the optimal upper bound for the given parameter vector $\theta$. In the following section we argue that we can in fact find the global optimum of the above problem.

4 Globally Optimal Bound Optimization

First consider calculating $g(\rho, \theta)$ from Equation 7.
Note that since $\log Z(\theta^{(r)})$ is a convex function of $\theta^{(r)}$, and the constraints are linear, the overall optimization is convex and can be solved efficiently. In the current implementation, we use a projected gradient algorithm [2]. The gradient of $f(\theta, \rho, \theta^{(r)})$ w.r.t. $\theta^{(r)}$ is given by

$$\frac{\partial f(\theta, \rho, \theta^{(r)})}{\partial \theta^{(r)}_{ij}} = \rho(r) \left( 1 + e^{\theta^{(r)}_{ij}} \left[ P^{-1}(G^{(r)}_{PM}) \right]_{k(i,j)} \mathrm{Sign}\!\left( P_{k(i,j)}(G^{(r)}_{PM}) \right) \right) \qquad (9)$$

where $k(i, j)$ returns the row and column indices of the element in the upper triangular matrix of $P(G^{(r)}_{PM})$ which contains the element $e^{2\theta^{(r)}_{ij}}$.

Since the optimization in Equation 7 is convex, it has an equivalent convex dual. Although we do not use this dual for optimization (because of the difficulty of expressing the entropy of planar models solely in terms of triplet marginals), it nevertheless allows some insight into the structure of the problem. The dual in this case is closely linked to the notion of the marginal polytope defined in Section 1. Using a derivation similar to [11], we arrive at the following characterization of the dual

$$g(\rho, \theta) = \max_{\tau \in \mathrm{TRI}(n)} \left\{ \theta \cdot \tau + \sum_r \rho(r)\, H(\theta^{(r)}(\tau)) \right\} \qquad (10)$$

where $\theta^{(r)}(\tau)$ denotes the parameters of an MRF on $G^{(r)}$ such that its marginals are given by the restriction of $\tau$ to the edges of $G^{(r)}$, and $H(\theta^{(r)}(\tau))$ denotes the entropy of the MRF over $G^{(r)}$ with parameters $\theta^{(r)}(\tau)$. The maximized function in Equation 10 is linear in $\rho$ and thus $g(\rho, \theta)$ is a pointwise maximum over (linear) convex functions in $\rho$ and is thus convex in $\rho$. It therefore has no local minima other than the global minimum. Denote by $\theta^{(r)}_{\min}(\rho)$ the set of parameters that minimizes Equation 7 for a given value of $\rho$. Using a derivation similar to that in [11], the gradient of $g(\rho, \theta)$ can be shown to be

$$\frac{\partial g(\rho, \theta)}{\partial \rho(r)} = H(\theta^{(r)}_{\min}(\rho)) \qquad (11)$$

Since the partition function for $G^{(r)}$ can be calculated efficiently, so can the entropy. We can now summarize the algorithm for calculating $h(\theta)$:

• Initialize $\rho_0$. Iterate:
– For $\rho_t$, find $\theta^{(r)}$ which solves the minimization in Equation 7.
– Calculate the gradient of $g(\rho, \theta)$ at $\rho_t$ using the expression in Equation 11.
– Update $\rho_{t+1} = \rho_t + \alpha v$ where v is a feasible search direction calculated from the gradient of $g(\rho, \theta)$ and the simplex constraints on $\rho$. The step size $\alpha$ is calculated via an Armijo line search.
– Halt when the change in $g(\rho, \theta)$ is smaller than some threshold.

Note that the minimization w.r.t. $\theta^{(r)}$ is not very time consuming since we can initialize it with the minimum from the previous step, and thus only a few iterations are needed to find the new optimum, provided the change in $\rho$ is not too big. The above algorithm is guaranteed to converge to a global optimum of $\rho$ [2], and thus we obtain the tightest possible upper bound on $Z(\theta)$ given our planar graph decomposition.

Figure 2: Illustration of planar subgraph construction for a rectangular lattice with external field. Original graph is shown on the left. The field vertex is connected to all vertices (edges not shown). The graph on the right results from isolating the 4th and 5th columns of the original graph (shown in grey), and connecting the field vertex to the external vertices of the three disconnected components. Note that the resulting graph is planar.

The procedure described here is asymmetric w.r.t. $\rho$ and $\theta^{(r)}$. In a symmetric formulation the minimizing gradient steps could be carried out jointly or in an alternating sequence. The symmetric formulation can be obtained by decoupling $\rho$ and $\theta^{(r)}$ in the bi-linear constraint $\sum_r \rho(r) \hat{\theta}^{(r)} = \theta$. Specifically, we introduce $\tilde{\theta}^{(r)} = \theta^{(r)} \rho(r)$ and perform the optimization w.r.t. $\rho$ and $\tilde{\theta}^{(r)}$. It can be shown that a stationary point of $f(\theta, \rho, \tilde{\theta}^{(r)})$ with the relevant (de-coupled) constraint is equivalent to the procedure described above. The advantage of this approach is that the exact minimization w.r.t. $\theta^{(r)}$ is not required before modifying $\rho$.
Our experiments have shown, however, that the methods take comparable times to converge, although this may be a property of the implementation.

5 Estimating Marginals

The optimization problem as defined above minimizes an upper bound on the partition function. However, it may also be of interest to obtain estimates of the marginals of the MRF over G. To obtain marginal estimates, we follow the approach in [11]. We first characterize the optimum of Equation 7 for a fixed value of $\rho$. Deriving the Lagrangian of Equation 7 w.r.t. $\theta^{(r)}$ we obtain the following characterization of $\theta^{(r)}_{\min}(\rho)$:

Marginal Optimality Criterion: For any two graphs $G^{(r)}$, $G^{(s)}$ such that the edge (ij) is in both graphs, the optimal parameter vector satisfies $\tau_{ij}(\theta^{(r)}_{\min}(\rho)) = \tau_{ij}(\theta^{(s)}_{\min}(\rho))$.

Thus, the optimal set of parameters for the graphs $G^{(r)}$ is such that every two graphs agree on the marginals of all the edges they share. This implies that at the optimum, there is a well defined set of marginals over all the edges. We use this set as an approximation to the true marginals.

A different method for estimating marginals uses the partition function bound directly. We first calculate partition function bounds on the sums $\alpha_i(1) = \sum_{x : x_i = 1} e^{\sum_{ij \in E} f_{ij}(x_i, x_j)}$ and $\alpha_i(-1) = \sum_{x : x_i = -1} e^{\sum_{ij \in E} f_{ij}(x_i, x_j)}$, and then normalize $\frac{\alpha_i(1)}{\alpha_i(1) + \alpha_i(-1)}$ to obtain an estimate for $p(x_i = 1)$. This method has the advantage of being more numerically stable (since it does not depend on derivatives of $\log Z$). However, it needs to be calculated separately for each variable, so that it may be time consuming if one is interested in marginals for a large set of variables.

6 Experimental Evaluation

We study the application of our Planar Decomposition (PDC) method to a binary MRF on a square lattice with an external field. The MRF is given by $p(x) \propto e^{\sum_{ij \in E} \theta_{ij} x_i x_j + \sum_{i \in V} \theta_i x_i}$ where V are the lattice vertices, and $\theta_i$ and $\theta_{ij}$ are parameters.
Note that this interaction does not satisfy the conditions for exact calculation of the partition function, even though the graph is planar. This problem is in fact NP hard [1]. However, it is possible to obtain the desired interaction form by introducing an additional variable $x_{n+1}$ that is connected to all the original variables. Denote the corresponding graph by $G_f$. Consider the distribution $p(x, x_{n+1}) \propto e^{\sum_{ij \in E} \theta_{ij} x_i x_j + \sum_{i \in V} \theta_{i,n+1} x_i x_{n+1}}$, where $\theta_{i,n+1} = \theta_i$. It is easy to see that any property of $p(x)$ (e.g., partition function, marginals) may be calculated from the corresponding property of $p(x, x_{n+1})$. The advantage of the latter distribution is that it has the desired interaction form. We can thus apply PDC by choosing planar subgraphs of the non-planar graph $G_f$.

[Figure 3 plots: Z bound error, pairwise marginals error, and singleton marginal error, each against interaction strength (upper row) and field strength (lower row); legend: PDC, TRW.]

Figure 3: Comparison of the TRW and Planar Decomposition (PDC) algorithms on a 7×7 square lattice. TRW results shown in red squares, and PDC in blue circles. Left column shows the error in the log partition bound. Middle column is the mean error for pairwise marginals, and right column is the error for the singleton marginal of the variable at the lattice center. Results in upper row are for field parameters drawn from U[−0.05, 0.05] and various interaction parameters. Results in the lower row are for interaction parameters drawn from U[−0.5, 0.5] and various field parameters. Error bars are standard errors calculated from 40 random trials.

There are clearly many ways to choose spanning planar subgraphs of $G_f$.
Spanning subtrees are one option, and were used in [11]. Since our optimization is polynomial in the number of subgraphs, we preferred to use a number of subgraphs that is linear in $\sqrt{n}$. The key idea in generating these planar subgraphs is to generate disconnected components of the lattice and connect $x_{n+1}$ only to the external vertices of these components. Here we isolate two neighboring columns (or rows) from the rest of the graph, resulting in three disconnected components. This is illustrated in Figure 2. To this set of $2\sqrt{n}$ graphs, we add the independent variables graph consisting only of edges from the field node to all the other nodes.

We compared the performance of the PDC and TRW methods (footnote 3) on a 7 × 7 lattice (footnote 4). Since the exact partition function and marginals can be calculated for this case, we could compare both algorithms to the true values. The MRF parameters were set according to the two following scenarios: 1) Varying Interaction - The field parameters $\theta_i$ were drawn uniformly from U[−0.05, 0.05], and the interaction $\theta_{ij}$ from U[−α, α] where α ∈ {0.2, 0.4, . . . , 2}. This is the setting tested in [11]. 2) Varying Field - $\theta_i$ was drawn uniformly from U[−α, α], where α ∈ {0.2, 0.4, . . . , 2} and $\theta_{ij}$ from U[−0.5, 0.5].

For each scenario, we calculated the following measures: 1) Normalized log partition error $\frac{1}{49}(\log Z^{alg} - \log Z^{true})$. 2) Error in pairwise marginals $\frac{1}{|E|} \sum_{ij \in E} |p_{alg}(x_i = 1, x_j = 1) - p_{true}(x_i = 1, x_j = 1)|$. Pairwise marginals were calculated jointly using the marginal optimality criterion of Section 5. 3) Error in singleton marginals. We calculated the singleton marginals for the innermost node in the lattice (i.e., coordinate [3, 3]), which intuitively should be the most difficult for the planar based algorithm. This marginal was calculated using two partition functions, as explained in Section 5 (footnote 5). The same method was used for TRW. The reported error measure is $|p_{alg}(x_i = 1) - p_{true}(x_i = 1)|$.
Results were averaged over 40 random trials.

Results for the two scenarios and different evaluation measures are given in Figure 3. It can be seen that the partition function bound for PDC is significantly better than TRW for almost all parameter settings, although the difference becomes smaller for large field values. Errors for the PDC pairwise marginals are smaller than those of TRW for all parameter settings. For the singleton marginals, TRW slightly outperforms PDC. This is not surprising since the field is modeled by every spanning tree in the TRW decomposition, whereas in PDC not all the structures model a given field.

Footnote 3: TRW and PDC bounds were optimized over both the subgraph parameters and the mixture parameters $\rho$.
Footnote 4: In terms of running time, PDC optimization for a fixed value of $\rho$ took about 30 seconds, which is still slower than the TRW message passing implementation.
Footnote 5: Results using the marginal optimality criterion were worse for PDC, possibly due to its reduced numerical precision.

7 Discussion

We have presented a method for using planar graphs as the basis for approximating non-planar graphs such as planar graphs with external fields. While the restriction to binary variables limits the applicability of our approach, it remains relevant in many important applications, such as coding theory and combinatorial optimization. Moreover, it is always possible to convert a non-binary graphical model to a binary one by introducing additional variables. The resulting graph will typically not be planar, even when the original graph over k-ary variables is. However, the planar decomposition method can then be applied to this non-planar graph.

The optimization of the decomposition is carried out explicitly over the planar subgraphs, thus limiting the number of subgraphs that can be used in the approximation. In the TRW method this problem is circumvented since it is possible to implicitly optimize over all spanning trees.
The reason this can be done for trees is that the entropy of an MRF over a tree may be written as a function of its marginal variables. We do not know of an equivalent result for planar graphs, and it remains a challenge to find one. It is however possible to combine the planar and tree decompositions into one single bound, which is guaranteed to outperform the tree or planar approximations alone.

The planar decomposition idea may in principle be applied to bounding the value of the MAP assignment. However, as in TRW, it can be shown that the solution is not dependent on the decomposition (as long as each edge appears in some structure), and the problem is equivalent to maximizing a linear function over the marginal polytope (which can be done in polynomial time for planar graphs). Nevertheless, such a decomposition may suggest new message passing algorithms, as in [10].

Acknowledgments

The authors acknowledge support from the Defense Advanced Research Projects Agency (Transfer Learning program). Amir Globerson is also supported by the Rothschild Yad-Hanadiv fellowship. The authors also wish to thank Martin Wainwright for providing his TRW code.

References

[1] F. Barahona. On the computational complexity of Ising spin glass models. J. Phys. A., 15(10):3241–3253, 1982.
[2] D. P. Bertsekas, editor. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995.
[3] M.M. Deza and M. Laurent. Geometry of Cuts and Metrics. Springer-Verlag, 1997.
[4] R. Diestel. Graph Theory. Springer-Verlag, 1997.
[5] M.E. Fisher. On the dimer solution of planar Ising models. J. Math. Phys., 7:1776–1781, 1966.
[6] M.I. Jordan, editor. Learning in graphical models. MIT Press, Cambridge, MA, 1998.
[7] P.W. Kasteleyn. Dimer statistics and phase transitions. Journal of Math. Physics, 4:287–293, 1963.
[8] L. Lovasz and M.D. Plummer. Matching Theory, volume 29 of Annals of Discrete Mathematics. North-Holland, New York, 1986.
[9] M. J. Wainwright, T. Jaakkola, and A. S. Willsky.
Tree-based reparameterization framework for analysis of sum-product and related algorithms. IEEE Trans. on Information Theory, 49(5):1120–1146, 2003.
[10] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. MAP estimation via agreement on trees: message-passing and linear programming. IEEE Trans. on Information Theory, 51(11):3697–3717, 2005.
[11] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. A new class of upper bounds on the log partition function. IEEE Trans. on Information Theory, 51(7):2313–2335, 2005.
[12] M.J. Wainwright and M.I. Jordan. Graphical models, exponential families, and variational inference. Technical report, UC Berkeley Dept. of Statistics, 2003.
[13] J.S. Yedidia, W.T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. on Information Theory, 51(7):2282–2312, 2005.
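As a numerical sanity check of the decomposition bound in Equation 6, the sketch below compares the exact log partition function of a small non-planar graph with the convex combination of subgraph log partition functions. It is a minimal illustration under stated assumptions, not the authors' implementation: partition functions are computed by brute-force enumeration instead of the determinant method of Section 2, the parameters $\theta^{(r)}$ are one feasible choice satisfying Equation 5 rather than the optimum of Equation 7, and the graph, parameter values, weights $\rho$ and function names are hypothetical.

```python
import itertools
import math

def log_partition(n, theta):
    """Brute-force log Z for potentials theta_ij * I(x_i = x_j) (Eq. 2)."""
    total = 0.0
    for x in itertools.product((-1, 1), repeat=n):
        total += math.exp(sum(t for (i, j), t in theta.items() if x[i] == x[j]))
    return math.log(total)

def decomposition_bound(n, theta, subgraphs, rho):
    """Right-hand side of Eq. 6 for one feasible, non-optimized theta^(r).

    Each edge parameter is divided by the total rho mass of the subgraphs
    containing that edge, so that sum_r rho(r) * theta_hat^(r) = theta (Eq. 5).
    Assumes every edge of theta appears in at least one subgraph.
    """
    bound = 0.0
    for edges_r, rho_r in zip(subgraphs, rho):
        theta_r = {}
        for e in edges_r:
            mass = sum(p for edges_s, p in zip(subgraphs, rho) if e in edges_s)
            theta_r[e] = theta[e] / mass
        bound += rho_r * log_partition(n, theta_r)
    return bound

# K5 is non-planar; deleting any single edge leaves a planar spanning subgraph.
n = 5
all_edges = list(itertools.combinations(range(n), 2))
# Hypothetical parameter values with mixed signs.
theta = {e: 0.3 * ((-1) ** k) * (1 + k % 3) for k, e in enumerate(all_edges)}
subgraphs = [set(all_edges) - {(0, 1)},
             set(all_edges) - {(2, 3)},
             set(all_edges) - {(1, 4)}]
rho = [0.5, 0.3, 0.2]

exact = log_partition(n, theta)
bound = decomposition_bound(n, theta, subgraphs, rho)
print(exact, bound)  # the bound should not fall below the exact value
```

Minimizing the bound over $\theta^{(r)}$ and $\rho$, as in Equations 7 and 8, can only tighten the value produced by this fixed choice.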

4 0.14183837 39 nips-2006-Balanced Graph Matching

Author: Timothee Cour, Praveen Srinivasan, Jianbo Shi

Abstract: Graph matching is a fundamental problem in Computer Vision and Machine Learning. We present two contributions. First, we give a new spectral relaxation technique for approximate solutions to matching problems, that naturally incorporates one-to-one or one-to-many constraints within the relaxation scheme. The second is a normalization procedure for existing graph matching scoring functions that can dramatically improve the matching accuracy. It is based on a reinterpretation of the graph matching compatibility matrix as a bipartite graph on edges for which we seek a bistochastic normalization. We evaluate our two contributions on a comprehensive test set of random graph matching problems, as well as on an image correspondence problem. Our normalization procedure can be used to improve the performance of many existing graph matching algorithms, including spectral matching, graduated assignment and semidefinite programming.

5 0.11418825 190 nips-2006-The Neurodynamics of Belief Propagation on Binary Markov Random Fields

Author: Thomas Ott, Ruedi Stoop

Abstract: We rigorously establish a close relationship between message passing algorithms and models of neurodynamics by showing that the equations of a continuous Hopfield network can be derived from the equations of belief propagation on a binary Markov random field. As Hopfield networks are equipped with a Lyapunov function, convergence is guaranteed. As a consequence, in the limit of many weak connections per neuron, Hopfield networks exactly implement a continuous-time variant of belief propagation starting from message initialisations that avoid convergence problems. Our results lead to a better understanding of the role of message passing algorithms in real biological neural networks.

6 0.11071312 199 nips-2006-Unsupervised Learning of a Probabilistic Grammar for Object Detection and Parsing

7 0.10634274 69 nips-2006-Distributed Inference in Dynamical Systems

8 0.10612841 43 nips-2006-Bayesian Model Scoring in Markov Random Fields

9 0.097897418 57 nips-2006-Conditional mean field

10 0.094052508 34 nips-2006-Approximate Correspondences in High Dimensions

11 0.069026597 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation

12 0.068645954 66 nips-2006-Detecting Humans via Their Pose

13 0.065548375 110 nips-2006-Learning Dense 3D Correspondence

14 0.064290069 7 nips-2006-A Local Learning Approach for Clustering

15 0.060994092 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming

16 0.060249429 151 nips-2006-On the Relation Between Low Density Separation, Spectral Clustering and Graph Cuts

17 0.059783082 98 nips-2006-Inferring Network Structure from Co-Occurrences

18 0.059724424 128 nips-2006-Manifold Denoising

19 0.059640571 14 nips-2006-A Small World Threshold for Economic Network Formation

20 0.058310121 42 nips-2006-Bayesian Image Super-resolution, Continued

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.216), (1, 0.026), (2, 0.073), (3, -0.034), (4, 0.166), (5, -0.025), (6, 0.043), (7, 0.039), (8, -0.059), (9, -0.156), (10, 0.065), (11, -0.174), (12, 0.029), (13, 0.071), (14, 0.098), (15, -0.004), (16, -0.144), (17, 0.07), (18, 0.136), (19, 0.122), (20, 0.121), (21, 0.096), (22, 0.03), (23, 0.069), (24, 0.02), (25, 0.023), (26, 0.144), (27, -0.03), (28, 0.027), (29, -0.073), (30, -0.06), (31, -0.077), (32, -0.023), (33, -0.033), (34, -0.007), (35, 0.003), (36, 0.036), (37, -0.048), (38, -0.025), (39, 0.114), (40, -0.076), (41, -0.005), (42, -0.003), (43, -0.096), (44, 0.082), (45, 0.095), (46, 0.043), (47, 0.118), (48, -0.049), (49, -0.039)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9495948 201 nips-2006-Using Combinatorial Optimization within Max-Product Belief Propagation

Author: Daniel Tarlow, Gal Elidan, Daphne Koller, John C. Duchi

2 0.71707827 190 nips-2006-The Neurodynamics of Belief Propagation on Binary Markov Random Fields

Author: Thomas Ott, Ruedi Stoop

3 0.63821298 35 nips-2006-Approximate inference using planar graph decomposition

Author: Amir Globerson, Tommi S. Jaakkola

4 0.60543811 39 nips-2006-Balanced Graph Matching

Author: Timothee Cour, Praveen Srinivasan, Jianbo Shi

5 0.59940612 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization

Author: Su-in Lee, Varun Ganapathi, Daphne Koller

6 0.59707397 43 nips-2006-Bayesian Model Scoring in Markov Random Fields

7 0.56060332 199 nips-2006-Unsupervised Learning of a Probabilistic Grammar for Object Detection and Parsing

8 0.48979837 57 nips-2006-Conditional mean field

9 0.48351017 34 nips-2006-Approximate Correspondences in High Dimensions

10 0.46546811 69 nips-2006-Distributed Inference in Dynamical Systems

11 0.45154631 139 nips-2006-Multi-dynamic Bayesian Networks

12 0.45153144 98 nips-2006-Inferring Network Structure from Co-Occurrences

13 0.41985089 160 nips-2006-Part-based Probabilistic Point Matching using Equivalence Constraints

14 0.41411352 182 nips-2006-Statistical Modeling of Images with Fields of Gaussian Scale Mixtures

15 0.37896121 5 nips-2006-A Kernel Method for the Two-Sample-Problem

16 0.35382968 169 nips-2006-Relational Learning with Gaussian Processes

17 0.34611455 144 nips-2006-Near-Uniform Sampling of Combinatorial Spaces Using XOR Constraints

18 0.34497786 92 nips-2006-High-Dimensional Graphical Model Selection Using $\ell 1$-Regularized Logistic Regression

19 0.34171268 93 nips-2006-Hyperparameter Learning for Graph Based Semi-supervised Learning Algorithms

20 0.34048852 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.083), (3, 0.022), (7, 0.106), (9, 0.031), (20, 0.01), (22, 0.041), (44, 0.095), (57, 0.096), (65, 0.033), (69, 0.391)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.96380949 147 nips-2006-Non-rigid point set registration: Coherent Point Drift

Author: Andriy Myronenko, Xubo Song, Miguel Á. Carreira-Perpiñán

Abstract: We introduce Coherent Point Drift (CPD), a novel probabilistic method for nonrigid registration of point sets. The registration is treated as a Maximum Likelihood (ML) estimation problem with motion coherence constraint over the velocity ﬁeld such that one point set moves coherently to align with the second set. We formulate the motion coherence constraint and derive a solution of regularized ML estimation through the variational approach, which leads to an elegant kernel form. We also derive the EM algorithm for the penalized ML optimization with deterministic annealing. The CPD method simultaneously ﬁnds both the non-rigid transformation and the correspondence between two point sets without making any prior assumption of the transformation model except that of motion coherence. This method can estimate complex non-linear non-rigid transformations, and is shown to be accurate on 2D and 3D examples and robust in the presence of outliers and missing points.

2 0.95741808 88 nips-2006-Greedy Layer-Wise Training of Deep Networks

Author: Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle

Abstract: Complexity theory of circuits strongly suggests that deep architectures can be much more efﬁcient (sometimes exponentially) than shallow architectures, in terms of computational elements required to represent some functions. Deep multi-layer neural networks have many levels of non-linearities allowing them to compactly represent highly non-linear and highly-varying functions. However, until recently it was not clear how to train such deep networks, since gradient-based optimization starting from random initialization appears to often get stuck in poor solutions. Hinton et al. recently introduced a greedy layer-wise unsupervised learning algorithm for Deep Belief Networks (DBN), a generative model with many layers of hidden causal variables. In the context of the above optimization problem, we study this algorithm empirically and explore variants to better understand its success and extend it to cases where the inputs are continuous or where the structure of the input distribution is not revealing enough about the variable to be predicted in a supervised task. Our experiments also conﬁrm the hypothesis that the greedy layer-wise unsupervised training strategy mostly helps the optimization, by initializing weights in a region near a good local minimum, giving rise to internal distributed representations that are high-level abstractions of the input, bringing better generalization.

3 0.93873364 176 nips-2006-Single Channel Speech Separation Using Factorial Dynamics

Author: John R. Hershey, Trausti Kristjansson, Steven Rennie, Peder A. Olsen

Abstract: Human listeners have the extraordinary ability to hear and recognize speech even when more than one person is talking. Their machine counterparts have historically been unable to compete with this ability, until now. We present a modelbased system that performs on par with humans in the task of separating speech of two talkers from a single-channel recording. Remarkably, the system surpasses human recognition performance in many conditions. The models of speech use temporal dynamics to help infer the source speech signals, given mixed speech signals. The estimated source signals are then recognized using a conventional speech recognition system. We demonstrate that the system achieves its best performance when the model of temporal dynamics closely captures the grammatical constraints of the task. One of the hallmarks of human perception is our ability to solve the auditory cocktail party problem: we can direct our attention to a given speaker in the presence of interfering speech, and understand what was said remarkably well. Until now the same could not be said for automatic speech recognition systems. However, we have recently introduced a system which in many conditions performs this task better than humans [1][2]. The model addresses the Pascal Speech Separation Challenge task [3], and outperforms all other published results by more than 10% word error rate (WER). In this model, dynamics are modeled using a layered combination of one or two Markov chains: one for long-term dependencies and another for short-term dependencies. The combination of the two speakers was handled via an iterative Laplace approximation method known as Algonquin [4]. Here we describe experiments that show better performance on the same task with a simpler version of the model. 
The task we address is provided by the PASCAL Speech Separation Challenge [3], which provides standard training, development, and test data sets of single-channel speech mixtures following an arbitrary but simple grammar. In addition, the challenge organizers have conducted human-listening experiments to provide an interesting baseline for comparison of computational techniques. The overall system we developed is composed of three components: a speaker identification and gain estimation component, a signal separation component, and a speech recognition system. In this paper we focus on the signal separation component, which is composed of the acoustic and grammatical models. The details of the other components are discussed in [2].

Single-channel speech separation has previously been attempted using Gaussian mixture models (GMMs) on individual frames of acoustic features. However, such models tend to perform well only when speakers are of different gender or have rather different voices [4]. When speakers have similar voices, speaker-dependent mixture models cannot unambiguously identify the component speakers. In such cases it is helpful to model the temporal dynamics of the speech. Several models in the literature have attempted to do so either for recognition [5, 6] or enhancement [7, 8] of speech. Such models have typically been based on a discrete-state hidden Markov model (HMM) operating on a frame-based acoustic feature vector. Modeling the dynamics of the log spectrum of speech is challenging in that different speech components evolve at different time-scales. For example, the excitation, which carries mainly pitch, and the filter, which consists of the formant structure, are somewhat independent of each other. The formant structure closely follows the sequences of phonemes in each word, which are pronounced at a rate of several per second.
In non-tonal languages such as English, the pitch fluctuates with prosody over the course of a sentence, and is not directly coupled with the words being spoken. Nevertheless, it seems to be important in separating speech, because the pitch harmonics carry predictable structure that stands out against the background. We address the various dynamic components of speech by testing different levels of dynamic constraints in our models. We explore four different levels of dynamics: no dynamics, low-level acoustic dynamics, high-level grammar dynamics, and a layered combination, dual dynamics, of the acoustic and grammar dynamics. The grammar dynamics and dual dynamics models perform the best in our experiments. The acoustic models are combined to model mixtures of speech using two methods: a nonlinear model known as Algonquin, which models the combination of log-spectrum models as a sum in the power spectrum, and a simpler max model that combines two log spectra using the max function. It turns out that whereas Algonquin works well, our formulation of the max model does better overall. With the combination of the max model and grammar-level dynamics, the model produces remarkable results: it is often able to extract two utterances from a mixture even when they are from the same speaker (see footnote 1). Overall results are given in Table 1, which shows that our closest competitors are human listeners.

Table 1: Overall word error rates across all conditions on the challenge task. Human: average human error rate, IBM: our best result, Next Best: the best of the eight other published results on this task, and Chance: the theoretical error rate for random guessing.

System      Word Error Rate
Human       22.3%
IBM         22.6%
Next Best   34.2%
Chance      93.0%

1 Speech Models

The model consists of an acoustic model and temporal dynamics model for each source, and a mixing model, which models how the source models are combined to describe the mixture.
The acoustic features were short-time log spectrum frames computed every 15 ms. Each frame was of length 40 ms and a 640-point mixed-radix FFT was used. The DC component was discarded, producing a 319-dimensional log-power-spectrum feature vector $y_t$. The acoustic model consists of a set of diagonal-covariance Gaussians in the features. For a given speaker, $a$, we model the conditional probability of the log-power spectrum of each source signal $x^a$ given a discrete acoustic state $s^a$ as Gaussian, $p(x^a \mid s^a) = \mathcal{N}(x^a; \mu_{s^a}, \Sigma_{s^a})$, with mean $\mu_{s^a}$ and covariance matrix $\Sigma_{s^a}$. We used 256 Gaussians, one per acoustic state, to model the acoustic space of each speaker. For efficiency and tractability we restrict the covariance to be diagonal. A model with no dynamics can be formulated by producing state probabilities $p(s^a)$, and is depicted in Figure 1(a).

Acoustic Dynamics: To capture the low-level dynamics of the acoustic signal, we modeled the acoustic dynamics of a given speaker, $a$, via state transitions $p(s^a_t \mid s^a_{t-1})$ as shown in Figure 1(b). There are 256 acoustic states, hence for each speaker $a$, we estimated a 256 × 256 element transition matrix $A^a$.

Grammar Dynamics: The grammar dynamics are modeled by grammar state transitions, $p(v^a_t \mid v^a_{t-1})$, which consist of left-to-right phone models. The legal word sequences are given by the Speech Separation Challenge grammar [3] and are modeled using a set of pronunciations that map from words to three-state context-dependent phone models. The state transition probabilities derived from these phone models are sparse in the sense that most transition probabilities are zero. We model speaker-dependent distributions $p(s^a \mid v^a)$ that associate the grammar states $v^a$ with the speaker-dependent acoustic states. These are learned from training data where the grammar state sequences and acoustic state sequences are known for each utterance. The grammar of our system has 506 states, so we estimate a 506 × 256 element conditional probability matrix $B^a$ for each speaker.

Footnote 1: Demos and information can be found at http://www.research.ibm.com/speechseparation

[Figure 1 panels: (a) No Dynamics, (b) Acoustic Dynamics, (c) Grammar Dynamics, (d) Dual Dynamics]

Figure 1: Graph of models for a given source. In (a), there are no dynamics, so the model is a simple mixture model. In (b), only acoustic dynamics are modeled. In (c), grammar dynamics are modeled with a shared set of acoustic Gaussians. In (d), dual (grammar and acoustic) dynamics have been combined. Note that (a), (b) and (c) are special cases of (d), where different nodes are assumed independent.

Dual Dynamics: The dual-dynamics model combines the acoustic dynamics with the grammar dynamics. It is useful in this case to avoid modeling the full combination of $s$ and $v$ states in the joint transitions $p(s^a_t \mid s^a_{t-1}, v_t)$. Instead we make a naive-Bayes assumption to approximate this as $\frac{1}{z}\, p(s^a_t \mid s^a_{t-1})^{\alpha}\, p(s^a_t \mid v_t)^{\beta}$, where $\alpha$ and $\beta$ adjust the relative influence of the two probabilities, and $z$ is the normalizing constant. Here we simply use the probability matrices $A^a$ and $B^a$, defined above.

2 Mixed Speech Models

The speech separation challenge involves recognizing speech in mixtures of signals from two speakers, $a$ and $b$. We consider only mixing models that operate independently on each frequency for analytical and computational tractability. The short-time log spectrum of the mixture $y_t$, in a given frequency band, is related to that of the two sources $x^a_t$ and $x^b_t$ via the mixing model given by the conditional probability distribution $p(y \mid x^a, x^b)$. The joint distribution of the observation and sources in one feature dimension, given the source states, is thus:

$$p(y_t, x^a_t, x^b_t \mid s^a_t, s^b_t) = p(y_t \mid x^a_t, x^b_t)\, p(x^a_t \mid s^a_t)\, p(x^b_t \mid s^b_t). \tag{1}$$

In general, to infer and reconstruct speech we need to compute the likelihood of the observed mixture

$$p(y_t \mid s^a_t, s^b_t) = \int p(y_t, x^a_t, x^b_t \mid s^a_t, s^b_t)\, dx^a_t\, dx^b_t, \tag{2}$$

and the posterior expected values of the sources given the states,

$$E(x^a_t \mid y_t, s^a_t, s^b_t) = \int x^a_t\, p(x^a_t, x^b_t \mid y_t, s^a_t, s^b_t)\, dx^a_t\, dx^b_t, \tag{3}$$

and similarly for $x^b_t$. These quantities, combined with a prior model for the joint state sequences $\{s^a_{1..T}, s^b_{1..T}\}$, allow us to compute the minimum mean squared error (MMSE) estimators $E(x^a_{1..T} \mid y_{1..T})$ or the maximum a posteriori (MAP) estimate $E(x^a_{1..T} \mid y_{1..T}, \hat{s}^a_{1..T}, \hat{s}^b_{1..T})$, where $\hat{s}^a_{1..T}, \hat{s}^b_{1..T} = \arg\max_{s^a_{1..T}, s^b_{1..T}} p(s^a_{1..T}, s^b_{1..T} \mid y_{1..T})$, and where the subscript $1..T$ refers to all frames in the signal. The mixing model can be defined in a number of ways. We explore two popular candidates, for which the above integrals can be readily computed: Algonquin and the max model.

[Figure 2 panels: (a) Mixing Model, (b) Dual Dynamics Factorial Model]

Figure 2: Model combination for two talkers. In (a) all dependencies are shown. In (b) the full dual-dynamics model is graphed with $x^a$ and $x^b$ integrated out, and corresponding states from each speaker combined into product states. The other models are special cases of this graph with different edges removed, as in Figure 1.

Algonquin: The relationship between the sources and mixture in the log power spectral domain is approximated as

$$p(y_t \mid x^a_t, x^b_t) = \mathcal{N}\big(y_t;\ \log(\exp(x^a_t) + \exp(x^b_t)),\ \Psi\big), \tag{4}$$

where $\Psi$ is introduced to model the error due to the omission of phase [4]. An iterative Newton-Laplace method accurately approximates the conditional posterior $p(x^a_t, x^b_t \mid y_t, s^a_t, s^b_t)$ from (1) as Gaussian. This Gaussian allows us to analytically compute the observation likelihood $p(y_t \mid s^a_t, s^b_t)$ and expected value $E(x^a_t \mid y_t, s^a_t, s^b_t)$, as in [4].

Max model: The mixing model is simplified using the fact that the log of a sum is approximately the log of the maximum:

$$p(y \mid x^a, x^b) = \delta\big(y - \max(x^a, x^b)\big). \tag{5}$$

In this model the likelihood is

$$p(y_t \mid s^a_t, s^b_t) = p_{x^a_t}(y_t \mid s^a_t)\, \Phi_{x^b_t}(y_t \mid s^b_t) + p_{x^b_t}(y_t \mid s^b_t)\, \Phi_{x^a_t}(y_t \mid s^a_t), \tag{6}$$

where $\Phi_{x^a_t}(y_t \mid s^a_t) = \int_{-\infty}^{y_t} \mathcal{N}(x^a_t; \mu_{s^a_t}, \Sigma_{s^a_t})\, dx^a_t$ is a Gaussian cumulative distribution function [5]. In [5], such a model was used to compute state likelihoods and find the optimal state sequence. In [8], a simplified model was used to infer binary masking values for refiltering. We take the max model a step further and derive source posteriors, so that we can compute the MMSE estimators for the log power spectrum. Note that the source posteriors in $x^a_t$ and $x^b_t$ are each a mixture of a delta function and a truncated Gaussian. Thus we analytically derive the necessary expected value:

$$E(x^a_t \mid y_t, s^a_t, s^b_t) = p(x^a_t = y_t \mid y_t, s^a_t, s^b_t)\, y_t + p(x^a_t < y_t \mid y_t, s^a_t, s^b_t)\, E(x^a_t \mid x^a_t < y_t, s^a_t) \tag{7}$$

$$= \pi^a_t\, y_t + \pi^b_t \left( \mu_{s^a_t} - \Sigma_{s^a_t}\, \frac{p_{x^a_t}(y_t \mid s^a_t)}{\Phi_{x^a_t}(y_t \mid s^a_t)} \right), \tag{8}$$

with weights $\pi^a_t = p(x^a_t = y_t \mid y_t, s^a_t, s^b_t) = p_{x^a_t}(y_t \mid s^a_t)\, \Phi_{x^b_t}(y_t \mid s^b_t) / p(y_t \mid s^a_t, s^b_t)$, and $\pi^b_t = 1 - \pi^a_t$. For many pairs of states one model is significantly louder than the other, $\mu_{s^a} \gg \mu_{s^b}$, in a given frequency band, relative to their variances. In such cases it is reasonable to approximate the likelihood as $p(y_t \mid s^a_t, s^b_t) \approx p_{x^a_t}(y_t \mid s^a_t)$, and the posterior expected values according to $E(x^a_t \mid y_t, s^a_t, s^b_t) \approx y_t$ and $E(x^b_t \mid y_t, s^a_t, s^b_t) \approx \min(y_t, \mu_{s^b_t})$, and similarly for $\mu_{s^a} \ll \mu_{s^b}$.

3 Likelihood Estimation

Because of the large number of state combinations, the model would not be practical without techniques to reduce computation time. To speed up the evaluation of the joint state likelihood, we employed both band quantization of the acoustic Gaussians and joint-state pruning.
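The max-model likelihood (6) and posterior expected value (7)-(8) from Section 2 can be illustrated for a single frequency band with scalar Gaussians. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, it works in the linear probability domain using only the Python standard library, and it does not guard against underflow when $y_t$ lies far below a state's mean (a practical system would more likely work with log probabilities).

```python
import math


def gauss_pdf(x, mu, var):
    """Density of a 1-D Gaussian N(x; mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gauss_cdf(x, mu, var):
    """Cumulative distribution of a 1-D Gaussian, i.e. Phi in equation (6)."""
    return 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2.0 * var)))


def max_model_likelihood(y, mu_a, var_a, mu_b, var_b):
    """Equation (6): one source equals y while the other lies at or below y."""
    return (gauss_pdf(y, mu_a, var_a) * gauss_cdf(y, mu_b, var_b)
            + gauss_pdf(y, mu_b, var_b) * gauss_cdf(y, mu_a, var_a))


def max_model_posterior_mean(y, mu_self, var_self, mu_other, var_other):
    """Equations (7)-(8): posterior mean of one source in one frequency band."""
    p_self = gauss_pdf(y, mu_self, var_self)
    cdf_self = gauss_cdf(y, mu_self, var_self)
    lik = max_model_likelihood(y, mu_self, var_self, mu_other, var_other)
    # Posterior weight of the event "this source is the louder one", so x = y.
    pi_self = p_self * gauss_cdf(y, mu_other, var_other) / lik
    # Mean of this source's Gaussian truncated to x < y.
    truncated_mean = mu_self - var_self * p_self / cdf_self
    return pi_self * y + (1.0 - pi_self) * truncated_mean
```

When one state is much louder than the other (for example means of 10 and -10 with unit variances and an observation of 10), the two posterior means returned by this sketch approach $y_t$ and $\min(y_t, \mu_{s^b_t})$ respectively, in line with the approximation described in Section 2.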
Band Quantization: One source of computational savings stems from the fact that some of the Gaussians in our model may differ only in a few features. Band quantization addresses this by approximating each of the $D$ Gaussians of each model with a shared set of $d$ Gaussians, where $d \ll D$, in each of the $F$ frequency bands of the feature vector. A similar idea is described in [9]. It relies on the use of a diagonal covariance matrix, so that $p(x^a \mid s^a) = \prod_f \mathcal{N}(x^a_f; \mu_{f,s^a}, \Sigma_{f,s^a})$, where $\Sigma_{f,s^a}$ are the diagonal elements of covariance matrix $\Sigma_{s^a}$. The mapping $M_f(s_i)$ associates each of the $D$ Gaussians with one of the $d$ Gaussians in band $f$. Now $\hat{p}(x^a \mid s^a) = \prod_f \mathcal{N}(x^a_f; \mu_{f,M_f(s^a)}, \Sigma_{f,M_f(s^a)})$ is used as a surrogate for $p(x^a \mid s^a)$. Figure 3 illustrates the idea.

Figure 3: In band quantization, many multi-dimensional Gaussians are mapped to a few unidimensional Gaussians.

Under this model the $d$ Gaussians are optimized by minimizing the KL-divergence $D\big(\sum_{s^a} p(s^a)\, p(x^a \mid s^a)\ \|\ \sum_{s^a} p(s^a)\, \hat{p}(x^a \mid s^a)\big)$, and likewise for $s^b$. Then in each frequency band, only $d \times d$, instead of $D \times D$, combinations of Gaussians have to be evaluated to compute $p(y \mid s^a, s^b)$. Despite the relatively small number of components $d$ in each band, taken across bands, band quantization is capable of expressing $d^F$ distinct patterns in an $F$-dimensional feature space, although in practice only a subset of these will be used to approximate the Gaussians in a given model. We used $d = 8$ and $D = 256$, which reduced the likelihood computation time by three orders of magnitude.

Joint State Pruning: Another source of computational savings comes from the sparseness of the model. Only a handful of $s^a, s^b$ combinations have likelihoods that are significantly larger than the rest for a given observation. Only these states are required to adequately explain the observation.
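The band-quantized scoring described above, and the ranking of joint states by that score, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the function names and data layout are hypothetical, the per-band tables of $d \times d$ log-likelihoods are assumed to have been computed already (for instance with the mixing model of Section 2), and scores are summed in the log domain across bands, which relies on the diagonal covariances and the per-frequency mixing model.

```python
def quantized_joint_scores(band_tables, map_a, map_b):
    """Sum per-band codebook scores into a table of joint-state scores.

    band_tables[f][i][j]: log-likelihood of the observation in band f under
        codebook Gaussian i (speaker a) and codebook Gaussian j (speaker b).
    map_a[f][s], map_b[f][s]: codebook index M_f(s) of acoustic state s in band f.
    """
    num_states_a = len(map_a[0])
    num_states_b = len(map_b[0])
    scores = [[0.0] * num_states_b for _ in range(num_states_a)]
    for f, table in enumerate(band_tables):
        for sa in range(num_states_a):
            row = table[map_a[f][sa]]  # only d rows exist per band
            for sb in range(num_states_b):
                scores[sa][sb] += row[map_b[f][sb]]
    return scores


def top_joint_states(scores, k):
    """Return the k joint states (sa, sb) with the highest approximate score."""
    pairs = [(sa, sb) for sa in range(len(scores)) for sb in range(len(scores[0]))]
    pairs.sort(key=lambda p: scores[p[0]][p[1]], reverse=True)
    return pairs[:k]
```

In this layout the expensive Gaussian evaluations are confined to the $d \times d$ entries of each band table; the $D \times D$ joint scores are then assembled from table lookups and additions only.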
By pruning the total number of combinations down to a smaller number we can speed up the likelihood calculation, the estimation of the component signals, and the temporal inference. However, we must estimate the likelihoods in order to determine which states to retain. We therefore used band quantization to estimate likelihoods for all states, performed state pruning, and then evaluated the full model on the pruned states using the exact parameters. In the experiments reported here, we pruned down to 256 state combinations. The effect of these speedup methods on accuracy will be reported in a future publication.

4 Inference

In our experiments we performed inference in four different conditions: no dynamics, with acoustic dynamics only, with grammar dynamics only, and with dual dynamics (acoustic and grammar). With no dynamics the source models reduce to GMMs and we infer MMSE estimates of the sources based on $p(x^a, x^b \mid y)$ as computed from (1), using Algonquin or the max model. Once the log spectrum of each source is estimated, we estimate the corresponding time-domain signal as shown in [4]. In the acoustic dynamics condition the exact inference algorithm uses a 2-dimensional Viterbi search, described below, with acoustic temporal constraints $p(s_t \mid s_{t-1})$ and likelihoods from Eqn. (1), to find the most likely joint state sequence $s_{1..T}$. Similarly in the grammar dynamics condition, 2-D Viterbi search is used to infer the grammar state sequences, $v_{1..T}$. Instead of single Gaussians as the likelihood models, however, we have mixture models in this case. So we can perform an MMSE estimate of the sources by averaging over the posterior probability of the mixture components given the grammar Viterbi sequence and the observations. It is critical to use the 2-D Viterbi algorithm in both cases, rather than the forward-backward algorithm, because in the same-speaker condition at 0 dB, the acoustic models and dynamics are symmetric.
This symmetry means that the posterior is essentially bimodal and averaging over these modes would yield identical estimates for both speakers. By finding the best path through the joint state space, the 2-D Viterbi algorithm breaks this symmetry and allows the model to make different estimates for each speaker. In the dual-dynamics condition we use the model of Figure 2(b). With two speakers, exact inference is computationally complex because it requires the full joint distribution over the grammar and acoustic states, $(v^a \times s^a) \times (v^b \times s^b)$, whose state space is very large. Instead we perform approximate inference by alternating the 2-D Viterbi search between two factors: the Cartesian product $s^a \times s^b$ of the acoustic state sequences and the Cartesian product $v^a \times v^b$ of the grammar state sequences. When evaluating each state sequence we hold the other chain constant, which decouples its dynamics and allows for efficient inference. This is a useful factorization because the states $s^a$ and $s^b$ interact strongly with each other, and similarly for $v^a$ and $v^b$. Again, in the same-talker condition, the 2-D Viterbi search breaks the symmetry in each factor.

2-D Viterbi search: The Viterbi algorithm estimates the maximum-likelihood state sequence $s_{1..T}$ given the observations $x_{1..T}$. The complexity of the Viterbi search is $O(TD^2)$, where $D$ is the number of states and $T$ is the number of frames. For producing MAP estimates of the two sources, we require a 2-dimensional Viterbi search which finds the most likely joint state sequences $s^a_{1..T}$ and $s^b_{1..T}$ given the mixed signal $y_{1..T}$, as was proposed in [5]. On the surface, the 2-D Viterbi search appears to be of complexity $O(TD^4)$. Surprisingly, it can be computed in $O(TD^3)$ operations. This stems from the fact that the dynamics for each chain are independent. The forward-backward algorithm for a factorial HMM with $N$ state variables requires only $O(TND^{N+1})$ operations rather than the $O(TD^{2N})$ required for a naive implementation [10].
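The saving for two chains comes from eliminating one chain's previous state at a time instead of iterating over all pairs of previous states. The following is a minimal sketch of a single time-step update that compares the naive and factored computations. The function names are hypothetical, both chains are assumed to have the same number of states $D$, `A[k][i]` and `B[l][j]` stand for the transition probabilities $p(s^a_t = i \mid s^a_{t-1} = k)$ and $p(s^b_t = j \mid s^b_{t-1} = l)$, and `prev[k][l]` stands for the joint score of the previous states; the observation likelihood is left out to keep the example short.

```python
import random


def naive_step(A, B, prev, reducer=sum):
    """O(D^4): for each new pair (i, j), visit all D^2 previous pairs (k, l)."""
    D = len(A)
    return [[reducer(A[k][i] * B[l][j] * prev[k][l]
                     for k in range(D) for l in range(D))
             for j in range(D)] for i in range(D)]


def factored_step(A, B, prev, reducer=sum):
    """O(D^3): eliminate chain b's previous state first, then chain a's."""
    D = len(A)
    # tmp[k][j] = reducer over l of B[l][j] * prev[k][l]
    tmp = [[reducer(B[l][j] * prev[k][l] for l in range(D))
            for j in range(D)] for k in range(D)]
    # out[i][j] = reducer over k of A[k][i] * tmp[k][j]
    return [[reducer(A[k][i] * tmp[k][j] for k in range(D))
             for j in range(D)] for i in range(D)]


def random_table(D, rng):
    """Hypothetical helper: a D x D table of non-negative values for testing."""
    return [[rng.random() for _ in range(D)] for _ in range(D)]
```

With `reducer=sum` this is the prediction part of a forward recursion. Passing `reducer=max` gives the corresponding max-product values (without back-pointers), which is valid here because all entries are non-negative.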
The same is true for the Viterbi algorithm. In the Viterbi algorithm, we wish to find the most probable paths leading to each state by finding the two arguments $s^a_{t-1}$ and $s^b_{t-1}$ of the following maximization:

$$\{\hat{s}^a_{t-1}, \hat{s}^b_{t-1}\} = \arg\max_{s^a_{t-1},\, s^b_{t-1}} p(s^a_t \mid s^a_{t-1})\, p(s^b_t \mid s^b_{t-1})\, p(s^a_{t-1}, s^b_{t-1} \mid y_{1..t-1})$$

$$= \arg\max_{s^a_{t-1}} p(s^a_t \mid s^a_{t-1}) \max_{s^b_{t-1}} p(s^b_t \mid s^b_{t-1})\, p(s^a_{t-1}, s^b_{t-1} \mid y_{1..t-1}). \tag{9}$$

The two maximizations can be done in sequence, requiring $O(D^3)$ operations with $O(D^2)$ storage for each step. In general, as with the forward-backward algorithm, the $N$-dimensional Viterbi search requires $O(TND^{N+1})$ operations. We can also exploit the sparsity of the transition matrices and observation likelihoods by pruning unlikely values. Using both of these methods, our implementation of 2-D Viterbi search is faster than the acoustic likelihood computation that serves as its input, for the model sizes and grammars chosen in the speech separation task.

Speaker and Gain Estimation: In the challenge task, the gains and identities of the two speakers were unknown at test time and were selected from a set of 34 speakers which were mixed at SNRs ranging from 6 dB to −9 dB. We used speaker-dependent acoustic models because of their advantages when separating different speakers. These models were trained on gain-normalized data, so the models are not well matched to the different gains of the signals at test time. This means that we have to estimate both the speaker identities and the gain in order to adapt our models to the source signals for each test utterance. The number of speakers and range of SNRs in the test set make it too expensive to consider every possible combination of models and gains. Instead, we developed an efficient model-based method for identifying the speakers and gains, described in [2].
The algorithm is based upon a very simple idea: identify and utilize frames that are dominated by a single source, based on their likelihoods under each speaker-dependent acoustic model, to determine what sources are present in the mixture. Using this criterion we can eliminate most of the unlikely speakers, and explore all combinations of the remaining speakers. An approximate EM procedure is then used to select a single pair of speakers and estimate their gains.

Recognition: Although inference in the system may involve recognition of the words (for models that contain a grammar), we still found that a separately trained recognizer performed better. After reconstruction, each of the two signals is therefore decoded with a speech recognition system that incorporates Speaker Dependent Labeling (SDL) [2]. This method uses speaker-dependent models for each of the 34 speakers. Instead of using the speaker identities provided by the speaker ID and gain module, we followed the approach for gender dependent labeling (GDL) described in [11]. This technique provides better results than if the true speaker ID is specified.

5 Results

The Speech Separation Challenge [3] involves separating the mixed speech of two speakers drawn from a set of 34 speakers. An example utterance is "place white by R 4 now". In each recording, one of the speakers says "white" while the other says "blue", "red" or "green". The task is to recognize the letter and the digit of the speaker that said "white". Using the SDL recognizer, we decoded the two estimated signals under the assumption that one signal contains "white" and the other does not, and vice versa. We then used the association that yielded the highest combined likelihood.

[Figure 4 plot. Axis: WER (%). Labels: Same Talker, Same Gender, Different Gender, All; No Separation, No dynamics, Acoustic Dyn., Grammar Dyn., Dual Dyn., Human.]

Figure 4: Average word error rate (WER) as a function of model dynamics, in different talker conditions, compared to human error rates, using Algonquin.

Human listener performance [3] is compared in Figure 4 to results using the SDL recognizer without speech separation, and for each of the proposed models. Performance is poor without separation in all conditions. With no dynamics the models do surprisingly well in the different talker conditions, but poorly when the signals come from the same talker. Acoustic dynamics gives some improvement, mainly in the same-talker condition. The grammar dynamics seems to give the most benefit, bringing the error rate in the same-gender condition below that of humans. The dual-dynamics model performed about the same as the grammar dynamics model, despite our intuitions. Replacing Algonquin with the max model reduced the error rate in the dual dynamics model (from 24.3% to 23.5%) and grammar dynamics model (from 24.6% to 22.6%), which brings the latter closer than any other model to the human error rate of 22.3%. Figure 5 shows the relative word error rate of the best system compared to human subjects. When both speakers are around the same loudness, the system exceeds human performance, and in the same-gender condition makes less than half the errors of the humans. Human listeners do better when the two signals are at different levels, even if the target is below the masker (i.e., at −9 dB), suggesting that they are better able to make use of differences in amplitude as a cue for separation.

[Figure 5 plot. Axes: Relative Word Error Rate (WER) versus Signal to Noise Ratio (SNR), from 6 dB to −9 dB. Labels: Same Talker, Same Gender, Different Gender, Human.]

Figure 5: Word error rate of best system relative to human performance. Shaded area is where the system outperforms human listeners.

An interesting question is to what extent different grammar constraints affect the results.
To test this, we limited the grammar to just the two test utterances, and the error rate on the estimated sources dropped to around 10%. This may be a useful paradigm for separating speech from background noise when the text is known, such as in closed-captioned recordings. At the other extreme, in realistic speech recognition scenarios, there is little knowledge of the background speaker's grammar. In such cases the benefits of models of low-level acoustic continuity over purely grammar-based systems may be more apparent. It is our hope that further experiments with both human and machine listeners will provide us with a better understanding of the differences in their performance characteristics, and provide insights into how the human auditory system functions, as well as how automatic speech perception in general can be brought to human levels of performance.

References

[1] T. Kristjansson, J. R. Hershey, P. A. Olsen, S. Rennie, and R. Gopinath, "Super-human multi-talker speech recognition: The IBM 2006 speech separation challenge system," in ICSLP, 2006.
[2] Steven Rennie, Peder A. Olsen, John R. Hershey, and Trausti Kristjansson, "Separating multiple speakers using temporal constraints," in ISCA Workshop on Statistical And Perceptual Audition, 2006.
[3] Martin Cooke and Te-Won Lee, "Interspeech speech separation challenge," http://www.dcs.shef.ac.uk/~martin/SpeechSeparationChallenge.htm, 2006.
[4] T. Kristjansson, J. Hershey, and H. Attias, "Single microphone source separation using high resolution signal reconstruction," ICASSP, 2004.
[5] P. Varga and R.K. Moore, "Hidden Markov model decomposition of speech and noise," ICASSP, pp. 845–848, 1990.
[6] M. Gales and S. Young, "Robust continuous speech recognition using parallel model combination," IEEE Transactions on Speech and Audio Processing, vol. 4, no. 5, pp. 352–359, September 1996.
[7] Y. Ephraim, "A Bayesian estimation approach for speech enhancement using hidden Markov models," vol. 40, no. 4, pp. 725–735, 1992.
[8] S. Roweis, "Factorial models and refiltering for speech separation and denoising," Eurospeech, pp. 1009–1012, 2003.
[9] E. Bocchieri, "Vector quantization for the efficient computation of continuous density likelihoods," in ICASSP, 1993, vol. II, pp. 692–695.
[10] Zoubin Ghahramani and Michael I. Jordan, "Factorial hidden Markov models," in Advances in Neural Information Processing Systems, vol. 8.
[11] Peder Olsen and Satya Dharanipragada, "An efficient integrated gender detection scheme and time mediated averaging of gender dependent acoustic models," in Eurospeech 2003, 2003, vol. 4, pp. 2509–2512.

4 0.92105889 93 nips-2006-Hyperparameter Learning for Graph Based Semi-supervised Learning Algorithms

Author: Xinhua Zhang, Wee S. Lee

Abstract: Semi-supervised learning algorithms have been successfully applied in many applications with scarce labeled data, by utilizing the unlabeled data. One important category is graph based semi-supervised learning algorithms, for which the performance depends considerably on the quality of the graph, or its hyperparameters. In this paper, we deal with the less explored problem of learning the graphs. We propose a graph learning method for the harmonic energy minimization method; this is done by minimizing the leave-one-out prediction error on labeled data points. We use a gradient based method and design an efficient algorithm which significantly accelerates the calculation of the gradient by applying the matrix inversion lemma and using careful pre-computation. Experimental results show that the graph learning method is effective in improving the performance of the classification algorithm.

same-paper 5 0.90173876 201 nips-2006-Using Combinatorial Optimization within Max-Product Belief Propagation

Author: Daniel Tarlow, Gal Elidan, Daphne Koller, John C. Duchi

6 0.71191156 134 nips-2006-Modeling Human Motion Using Binary Latent Variables

7 0.66447163 160 nips-2006-Part-based Probabilistic Point Matching using Equivalence Constraints

8 0.64051628 167 nips-2006-Recursive ICA

9 0.61639524 43 nips-2006-Bayesian Model Scoring in Markov Random Fields

10 0.60772526 34 nips-2006-Approximate Correspondences in High Dimensions

11 0.60322762 158 nips-2006-PG-means: learning the number of clusters in data

12 0.5970943 97 nips-2006-Inducing Metric Violations in Human Similarity Judgements

13 0.58938193 31 nips-2006-Analysis of Contour Motions

14 0.58671004 198 nips-2006-Unified Inference for Variational Bayesian Linear Gaussian State-Space Models

15 0.58254558 15 nips-2006-A Switched Gaussian Process for Estimating Disparity and Segmentation in Binocular Stereo

16 0.57823062 67 nips-2006-Differential Entropic Clustering of Multivariate Gaussians

17 0.57810032 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation

18 0.56920016 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures

19 0.56892699 72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model

20 0.56567711 118 nips-2006-Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields