nips nips2006 nips2006-169 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Wei Chu, Vikas Sindhwani, Zoubin Ghahramani, S. S. Keerthi
Abstract: Correlation between instances is often modelled via a kernel function using input attributes of the instances. Relational knowledge can further reveal additional pairwise correlations between variables of interest. In this paper, we develop a class of models which incorporates both reciprocal relational information and input attributes using Gaussian process techniques. This approach provides a novel non-parametric Bayesian framework with a data-dependent covariance function for supervised learning tasks. We also apply this framework to semi-supervised learning. Experimental results on several real world data sets verify the usefulness of this algorithm. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Correlation between instances is often modelled via a kernel function using input attributes of the instances. [sent-10, score-0.216]
2 In this paper, we develop a class of models which incorporates both reciprocal relational information and input attributes using Gaussian process techniques. [sent-12, score-0.345]
3 Many such domains involve relational data in which instances have “links” or inter-relationships between them that are highly informative for learning tasks, e. [sent-17, score-0.261]
4 One simple but general type of relational information can be effectively represented in the form of a graph G = (V, E). [sent-25, score-0.255]
5 The vertex set V represents a collection of input instances (which may contain the labelled inputs as a subset, but is typically a much larger set of instances). [sent-26, score-0.392]
6 The edge set E ⊂ V × V represents the pairwise relations over these input instances. [sent-27, score-0.12]
7 , reciprocal relations, though directionality may be an important aspect of some relational datasets. [sent-30, score-0.262]
8 These undirected edges provide useful structural knowledge about correlation between the vertex instances. [sent-31, score-0.186]
9 In particular, we allow edges to be of two types – “positive” or “negative” depending on whether the associated adjacent vertices are positively or negatively correlated, respectively. [sent-32, score-0.154]
10 This setting is also applicable to semi-supervised tasks even on traditional “flat” datasets where the linkage structure may be derived from data input attributes. [sent-34, score-0.161]
11 In graph-based semi-supervised methods, G is typically an adjacency graph constructed by linking each instance (including labelled and unlabelled) to its neighbors according to some distance metric in the input space. [sent-35, score-0.322]
12 , 2003), have been derived under the assumption that data points nearby on this graph are positively correlated. [sent-42, score-0.124]
13 Several methods have been proposed recently to incorporate relational information within learning algorithms, e. [sent-43, score-0.193]
14 The reciprocal relations over input instances essentially reflect the network structure or the distribution underlying the data, enriching our prior belief about how instances in the entire input space are correlated. [sent-50, score-0.33]
15 In this paper, we integrate relational information with input attributes in a non-parametric Bayesian framework based on Gaussian processes (GP) (Rasmussen & Williams, 2006), which leads to a data-dependent covariance/kernel function. [sent-51, score-0.276]
16 We highlight the following aspects of our approach: 1) We propose a novel likelihood function for undirected linkages and carry out approximate inference using efficient Expectation Propagation techniques under a Gaussian process prior. [sent-52, score-0.337]
17 The covariance function of the approximate posterior distribution defines a relational Gaussian process, hereafter abbreviated as RGP. [sent-53, score-0.36]
18 We also derive explicit formulae for linkage prediction over pairs of test points. [sent-55, score-0.124]
19 2) When applied to semi-supervised learning tasks involving labelled and unlabelled data, RGP is closely related to the warped reproducing kernel Hilbert Space approach of (Sindhwani et al. [sent-56, score-0.574]
20 , 2005), which are mainly transductive by design, RGP delineates a decision boundary in the input space and provides probabilistic induction over unseen test points. [sent-63, score-0.117]
21 Furthermore, by maximizing the joint evidence of known labels and linkages, we explicitly involve unlabelled data in the model selection procedure. [sent-64, score-0.247]
22 3) On a variety of classification tasks, RGP requires very few labels for providing high-quality generalization on unseen test examples as compared to standard GP classification that ignores relational information. [sent-66, score-0.218]
23 2 Relational Gaussian Processes In the standard setting of learning from data, instances are usually described by a collection of input attributes, denoted as a column vector x ∈ X ⊂ Rd . [sent-73, score-0.13]
24 The key idea in Gaussian process models is to introduce a random variable $f_x$ for all points $x$ in the input space $\mathcal{X}$. [sent-74, score-0.116]
25 The covariance between $f_x$ and $f_z$ is fully determined by the coordinates of the data pair $x$ and $z$, and is defined by any Mercer kernel function $K(x, z)$. [sent-76, score-0.29]
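To make the construction above concrete, here is a minimal Python sketch (not from the paper) that builds the prior Gram matrix $\Sigma$ over a vertex set from its input attributes with a Gaussian base kernel; the kernel width, the toy data, and the jitter term are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, Z, kappa=1.0):
    """Gaussian base kernel K(x, z) = exp(-kappa/2 * ||x - z||^2)."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-0.5 * kappa * sq)

# Prior covariance Sigma of the latent variables f over the vertex set V.
X = np.random.randn(30, 2)                    # 30 instances, 2 input attributes
Sigma = rbf_kernel(X, X) + 1e-8 * np.eye(30)  # small jitter for numerical stability
```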
26 In the following, we consider the scenario with undirected linkages over a set of instances. [sent-84, score-0.304]
27 2.1 Undirected Linkages Let the vertex set $V$ in the relational graph be associated with $n$ input instances $x_1, \ldots$ [sent-86, score-0.391]
28 Consider a set of observed pairwise undirected linkages on these instances, denoted as E = {Eij }. [sent-90, score-0.304]
29 Here Eij = +1 indicates that the instances xi and xj are “positively tied” and Eij = −1 indicates the instances are “negatively tied”. [sent-94, score-0.136]
30 In the presence of uncertainty in observing $E_{ij}$, we assume the variable values $f_{x_i}$ and $f_{x_j}$ are contaminated with Gaussian noise, which allows some tolerance for noisy observations. [sent-96, score-0.594]
31 Then the likelihood function (2) becomes $P(E_{ij} = +1 \mid f_{x_i}, f_{x_j}) = \int\!\!\int P_{\mathrm{ideal}}(E_{ij} = +1 \mid f_{x_i} + \delta_i, f_{x_j} + \delta_j)\, \mathcal{N}(\delta_i; 0, \sigma^2)\, \mathcal{N}(\delta_j; 0, \sigma^2)\, d\delta_i\, d\delta_j = \Phi\!\left(\tfrac{f_{x_i}}{\sigma}\right)\Phi\!\left(\tfrac{f_{x_j}}{\sigma}\right) + \left(1 - \Phi\!\left(\tfrac{f_{x_i}}{\sigma}\right)\right)\left(1 - \Phi\!\left(\tfrac{f_{x_j}}{\sigma}\right)\right)$ (3), where $\Phi(z) = \int_{-\infty}^{z} \mathcal{N}(\gamma; 0, 1)\, d\gamma$. [sent-99, score-1.716]
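The likelihood (3) translates directly into code. The sketch below (illustrative, assuming only numpy/scipy) returns the probability of a positive or negative edge given the two latent values and the noise level sigma.

```python
import numpy as np
from scipy.stats import norm

def link_likelihood(f_i, f_j, sigma=1.0, e_ij=+1):
    """Eq. (3): probability that the two noisy latent values share the same sign;
    an observed E_ij = -1 takes the complementary probability."""
    p_i, p_j = norm.cdf(f_i / sigma), norm.cdf(f_j / sigma)
    p_pos = p_i * p_j + (1.0 - p_i) * (1.0 - p_j)
    return p_pos if e_ij == +1 else 1.0 - p_pos
```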
32 The integral in (3) evaluates the volume of a joint Gaussian in the first and third quadrants where fxi and fxj have the same sign. [sent-100, score-0.59]
33 Note that $P(E_{ij} = -1 \mid f_{x_i}, f_{x_j}) = 1 - P(E_{ij} = +1 \mid f_{x_i}, f_{x_j})$ and $P(E_{ij} = +1 \mid f_{x_i}, f_{x_j}) = P(E_{ij} = +1 \mid -f_{x_i}, -f_{x_j})$. [sent-101, score-1.157]
34 (Footnote) For example, we could define $P_l(E_{ij} = +1 \mid f_{x_i}, f_{x_j}) = \frac{1}{1 + \exp(-\nu f_{x_i} f_{x_j})}$, where $\nu > 0$. [sent-103, score-0.382]
35 Instead of treating edges as Bernoulli variables, we could consider the graph itself as a random variable; the probability of observing the graph $G$ can then be simply evaluated as $P(G \mid f) = \frac{1}{Z} \exp\left(-\tfrac{1}{2} f^{T} \Psi f\right)$, where $\Psi$ is a graph-regularization matrix (e.g. [sent-105, score-0.177]
36 2.2 Approximate Inference Combining the Gaussian process prior (1) with the likelihood function (3), we obtain the posterior distribution $P(f \mid E) = \frac{1}{P(E)}\, P(f) \prod_{ij} P(E_{ij} \mid f_{x_i}, f_{x_j})$ (4), where $f = [f_{x_1}, \ldots$ [sent-111, score-0.508]
37 $\ldots, f_{x_n}]^{T}$, and $ij$ runs over the set of observed undirected linkages. [sent-114, score-0.278]
38 It is important to note that reciprocal relations update the correlation between examples but never change the individual means. [sent-118, score-0.128]
39 To preserve computational tractability and the true posterior mean, we would rather approximate the posterior distribution as a joint Gaussian centered at the true mean than resort to sampling methods. [sent-119, score-0.196]
40 It is inappropriate to apply the Laplace approximation in this case, since the posterior distribution is not unimodal and the true posterior mean is a saddle point. [sent-122, score-0.158]
41 Importantly this still captures the posterior covariance structure allowing prediction of link presence. [sent-126, score-0.164]
42 The parameters $\{s_{ij}, \Pi_{ij}\}$ in $\{\tilde{t}(f^{ij})\}$ are successively optimized by locally minimizing the Kullback-Leibler divergence. (Footnote: We could specify different noise levels for weighted edges.) [sent-128, score-0.178]
43 $\tilde{t}(f^{ij})^{\mathrm{new}} = \arg\min_{\tilde{t}(f^{ij})} \mathrm{KL}\!\left(\frac{Q(f)}{\tilde{t}(f^{ij})^{\mathrm{old}}}\, P(E_{ij} \mid f^{ij}) \,\Big\Vert\, \frac{Q(f)}{\tilde{t}(f^{ij})^{\mathrm{old}}}\, \tilde{t}(f^{ij})\right)$ (5). [sent-131, score-0.429]
44 Since $Q(f)$ is in the exponential family, this minimization can be simply solved by moment matching up to the second order. [sent-132, score-0.429]
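The update (5) amounts to matching the zeroth, first, and second moments of the tilted distribution for one edge. The paper's own updating formulae are omitted from this extract, so the sketch below only illustrates the moment-matching idea with a crude Monte Carlo stand-in: it reweights samples from a 2-D cavity Gaussian by the edge likelihood (3).

```python
import numpy as np
from scipy.stats import norm

def tilted_moments(m, V, e_ij, sigma=1.0, n_samples=200_000, seed=0):
    """Moments of q_cav(f_i, f_j) * P(E_ij | f_i, f_j) for one edge, estimated by
    importance-weighting cavity samples with the likelihood of Eq. (3)."""
    rng = np.random.default_rng(seed)
    F = rng.multivariate_normal(m, V, size=n_samples)           # cavity samples
    p_i, p_j = norm.cdf(F[:, 0] / sigma), norm.cdf(F[:, 1] / sigma)
    p_pos = p_i * p_j + (1.0 - p_i) * (1.0 - p_j)
    w = p_pos if e_ij == +1 else 1.0 - p_pos                    # likelihood weights
    Z = w.mean()                                                # site normalizer s_ij
    mean = (w[:, None] * F).sum(0) / w.sum()
    d = F - mean
    cov = (w[:, None, None] * d[:, :, None] * d[:, None, :]).sum(0) / w.sum()
    return Z, mean, cov
```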
45 At the equilibrium the EP algorithm returns a Gaussian approximation to the posterior distribution, $P(f \mid E) \approx \mathcal{N}(0, A)$ (6), where $A = (\Sigma^{-1} + \Pi)^{-1}$, $\Pi = \sum_{ij} \check{\Pi}^{ij}$, and $\check{\Pi}^{ij}$ is an $n \times n$ matrix with four non-zero entries augmented from $\Pi_{ij}$. [sent-133, score-0.352]
46 The normalization factor in this Gaussian approximation serves as an approximate model evidence, which can be explicitly written as $P(E) \approx \frac{|A|^{1/2}}{|\Sigma|^{1/2}} \prod_{ij} s_{ij}$ (7). The detailed updating formulae are omitted here to save space. [sent-135, score-0.293]
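Assuming the site precisions have already been accumulated into the n x n matrix Pi and the site normalizers s_ij are available, Eqs. (6)-(7) can be assembled as in the following sketch (illustrative code, not from the paper).

```python
import numpy as np

def ep_posterior_and_evidence(Sigma, Pi, s_ij):
    """Eq. (6): A = (Sigma^-1 + Pi)^-1; Eq. (7) in log form:
    log P(E) ~= 0.5 * (log|A| - log|Sigma|) + sum_ij log s_ij."""
    A = np.linalg.inv(np.linalg.inv(Sigma) + Pi)
    _, logdet_A = np.linalg.slogdet(A)
    _, logdet_S = np.linalg.slogdet(Sigma)
    log_evidence = 0.5 * (logdet_A - logdet_S) + np.sum(np.log(s_ij))
    return A, log_evidence
```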
47 The approximate evidence (7) provides an upper bound on the true value of P(E) (Wainwright et al. [sent-136, score-0.134]
48 2.3 Data-dependent Covariance Function After approximate inference as outlined above, the posterior process conditioned on E is explicitly given by a modified covariance function, defined in the following proposition. [sent-143, score-0.167]
49 $\ldots, K(x_n, x)]^{T}$, $\Sigma$ is the $n \times n$ covariance matrix of the vertex set $V$ obtained by evaluating the base kernel $K$, and $\Pi$ is defined as in (6). [sent-147, score-0.168]
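The proposition's formula (8) is not reproduced in this extract. Assuming the modified covariance is the usual Gaussian-process posterior covariance implied by the approximation (6), namely K_tilde(x, z) = K(x, z) - k_x^T Sigma^-1 (Sigma - A) Sigma^-1 k_z with k_x = [K(x_1, x), ..., K(x_n, x)]^T, a sketch could look like the following; the paper's exact expression may differ in form.

```python
import numpy as np

def rgp_covariance(x, z, X, Sigma, A, base_kernel):
    """Sketch of the data-dependent covariance K_tilde(x, z), assuming the
    GP-posterior form K(x,z) - k_x^T Sigma^-1 (Sigma - A) Sigma^-1 k_z,
    where A is the EP posterior covariance (6) over the linked vertex set X."""
    k_x = base_kernel(X, x[None, :]).ravel()
    k_z = base_kernel(X, z[None, :]).ravel()
    S_inv = np.linalg.inv(Sigma)
    M = S_inv @ (Sigma - A) @ S_inv
    return base_kernel(x[None, :], z[None, :])[0, 0] - k_x @ M @ k_z
```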
50 RGP is obtained by a Bayesian update of a standard GP using relational knowledge, which is closely related to the warped reproducing kernel Hilbert space approach (Sindhwani et al. [sent-149, score-0.342]
51 , 2005) using a novel graph regularizer Π in place of the standard graph Laplacian. [sent-150, score-0.124]
52 2.4 Linkage Prediction Given an RGP, the joint distribution of the random variables $f^{rs} = [f_{x_r}, f_{x_s}]^{T}$, associated with a test pair $x_r$ and $x_s$, is a Gaussian as well. [sent-155, score-0.359]
53 The linkage predictive distribution $P(f^{rs} \mid E)$ can be explicitly written as a zero-mean bivariate Gaussian $\mathcal{N}(f^{rs}; 0, \tilde{\Sigma}^{rs})$ with covariance matrix $\tilde{\Sigma}^{rs} = \begin{bmatrix} \tilde{K}(x_r, x_r) & \tilde{K}(x_r, x_s) \\ \tilde{K}(x_s, x_r) & \tilde{K}(x_s, x_s) \end{bmatrix}$, where $\tilde{K}$ is defined as in (8). [sent-156, score-0.735]
54 The predictive probability of having a positive edge can be evaluated as $P(E_{rs} \mid E) = \int\!\!\int P_{\mathrm{ideal}}(E_{rs} \mid f^{rs})\, \mathcal{N}(f^{rs}; 0, \tilde{\Sigma}^{rs})\, df_{x_r}\, df_{x_s}$, which can be simplified as $P(E_{rs} \mid E) = \frac{1}{2} + \frac{\arcsin(\rho\, E_{rs})}{\pi}$ (9), where $\rho = \frac{\tilde{K}(x_r, x_s)}{\sqrt{\tilde{K}(x_s, x_s)\, \tilde{K}(x_r, x_r)}}$. [sent-157, score-0.24]
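Equation (9) is a one-liner once the three entries of the RGP covariance for the test pair have been computed with (8); the sketch below is illustrative.

```python
import numpy as np

def link_prediction(K_rr, K_ss, K_rs, e_rs=+1):
    """Eq. (9): P(E_rs | E) = 1/2 + arcsin(rho * E_rs) / pi, where rho is the
    correlation of the test pair under the RGP covariance K_tilde."""
    rho = K_rs / np.sqrt(K_rr * K_ss)
    return 0.5 + np.arcsin(rho * e_rs) / np.pi
```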
55 3 Semi-supervised Learning We now apply the RGP framework to semi-supervised learning, where a large collection of unlabelled examples is available and labelled data is scarce. [sent-159, score-0.425]
56 To apply RGP, we construct positive reciprocal relations between each example and its K nearest neighbors. [sent-164, score-0.128]
57 K can be heuristically set to the smallest integer for which the resulting nearest-neighbor graph over the labelled and unlabelled examples is connected, i.e., there is a path between each pair of nodes. [sent-165, score-0.458]
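One way to implement this heuristic, assuming scikit-learn and scipy are available (the paper does not prescribe an implementation), is to grow K until the symmetrized K-nearest-neighbor graph is connected.

```python
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import kneighbors_graph

def minimal_connected_k(X, k_max=50):
    """Smallest K for which the symmetrized K-NN graph over all labelled and
    unlabelled inputs X is connected (a single connected component)."""
    for k in range(1, k_max + 1):
        W = kneighbors_graph(X, k, mode='connectivity')
        W = W.maximum(W.T)                       # make the edges undirected
        n_comp, _ = connected_components(W, directed=False)
        if n_comp == 1:
            return k
    raise ValueError("graph not connected for any K <= k_max")
```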
58 Learning on these constructed relational data results in a RGP as described in the previous section (see section 4. [sent-166, score-0.193]
59 Given a set of labelled pairs $\{z_\ell, y_\ell\}_{\ell=1}^{m}$ where $y_\ell \in \{+1, -1\}$, the Gaussian process classifier (Rasmussen & Williams, 2006) relates the variable $f_{z_\ell}$ at $z_\ell$ to the label $y_\ell$ through a probit noise model, i.e., $P(y_\ell \mid f_{z_\ell}) = \Phi\!\left(\frac{y_\ell f_{z_\ell}}{\sigma_n}\right)$. [sent-170, score-0.436]
60 Combining the probit likelihood with the RGP prior defined by the covariance function (8), we have the posterior distribution $P(f \mid \mathcal{Y}, E) = \frac{1}{P(\mathcal{Y} \mid E)}\, P(f \mid E) \prod_{\ell} P(y_\ell \mid f_{z_\ell})$, where $f = [f_{z_1}, \ldots$ [sent-173, score-0.16]
61 The predictive distribution of the variable $f_{z_t}$ at a test case $z_t$ then becomes a Gaussian, i.e. [sent-178, score-0.137]
62 $P(f_{z_t} \mid \mathcal{Y}, E) \approx \mathcal{N}(\mu_t, \sigma_t^2)$, where $\mu_t = k_t^{T} \Sigma^{-1} \mu$ and $\sigma_t^2 = \tilde{K}(z_t, z_t) - k_t^{T}\left(\Sigma^{-1} - \Sigma^{-1} C\, \Sigma^{-1}\right) k_t$, with $k_t = [K(z_1, z_t), \ldots$ [sent-180, score-0.188]
63 $P(y_t \mid \mathcal{Y}, E) = \Phi\!\left(\frac{y_t\, \mu_t}{\sqrt{\sigma_n^2 + \sigma_t^2}}\right)$ (10). To summarize, we first incorporate linkage information into a standard GP, which leads to an RGP, and then perform standard inference with the RGP as the prior in supervised learning. [sent-185, score-0.129]
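The predictive probability (10) is again a direct transcription; the sketch assumes the predictive mean and variance from the previous step and a label noise level sigma_n.

```python
import numpy as np
from scipy.stats import norm

def predictive_label_prob(mu_t, var_t, sigma_n, y_t=+1):
    """Eq. (10): P(y_t | Y, E) = Phi( y_t * mu_t / sqrt(sigma_n^2 + sigma_t^2) )."""
    return norm.cdf(y_t * mu_t / np.sqrt(sigma_n**2 + var_t))
```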
64 As for model selection, it is advantageous to directly use the joint evidence $P(\mathcal{Y}, E) = P(\mathcal{Y} \mid E)\, P(E)$ (11) to determine the model parameters (such as the kernel parameter, the edge noise level, and the label noise level). [sent-187, score-0.241]
65 Note that P(Y, E) explicitly involves unlabelled data for model selection. [sent-188, score-0.169]
66 This can be particularly useful when labelled data is very scarce and possibly noisy. [sent-189, score-0.227]
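Selection by the joint evidence (11) can be organized as a simple search over hyperparameters. The sketch below is a generic grid search with two assumed callbacks returning the two log-evidence terms (for instance, wrappers around the EP approximations above); it is not the paper's actual procedure.

```python
import numpy as np
from itertools import product

def select_hyperparameters(log_p_e, log_p_y_given_e, kappas, sigmas, sigma_ns):
    """Pick (kappa, sigma, sigma_n) maximizing log P(Y, E) = log P(Y|E) + log P(E)."""
    best, best_val = None, -np.inf
    for kappa, sigma, sigma_n in product(kappas, sigmas, sigma_ns):
        val = log_p_e(kappa, sigma) + log_p_y_given_e(kappa, sigma, sigma_n)
        if val > best_val:
            best, best_val = (kappa, sigma, sigma_n), val
    return best, best_val
```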
67 We employ the linear kernel $K(x, z) = x \cdot z$ or the Gaussian kernel $K(x, z) = \exp\!\left(-\tfrac{\kappa}{2}\, \|x - z\|^2\right)$, and shift the origin of the kernel space to the empirical mean, i.e. [sent-193, score-0.195]
68 $K(x, z) - \frac{1}{n}\sum_i K(x, x_i) - \frac{1}{n}\sum_i K(z, x_i) + \frac{1}{n^2}\sum_i \sum_j K(x_i, x_j)$, where $n$ is the number of available labelled and unlabelled data points. [sent-195, score-0.396]
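This is the standard Gram-matrix centering operation; a minimal, illustrative sketch is:

```python
import numpy as np

def center_kernel(K):
    """Centre a Gram matrix so that the origin of the kernel feature space is the
    empirical mean over all n labelled and unlabelled inputs."""
    n = K.shape[0]
    J = np.full((n, n), 1.0 / n)
    return K - J @ K - K @ J + J @ K @ J
```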
69 The centralized kernel is then used as base kernel in our experiments. [sent-196, score-0.13]
70 The optimal setting of $\sigma^2$ and $\kappa$ in the Gaussian kernel is determined by the joint evidence (11) in each trial. [sent-200, score-0.143]
71 The 30 samples drawn from the Gaussian mixture are presented as dots in (a) and the two labelled samples are indicated by a diamond and a circle respectively. [sent-212, score-0.386]
72 The numbers of categorized Web pages and undirected linkages in the four university dataset are listed in the second column. [sent-217, score-0.304]
73 The averaged AUC scores of label prediction on unlabelled cases are recorded along with standard deviation over 100 trials. [sent-218, score-0.169]
74 Figure 1(c) presents the posterior covariance function $\tilde{K}$ (8) at this optimal setting. [sent-288, score-0.134]
75 Compared to the data-independent prior covariance function defined by the Gaussian kernel, the posterior covariance function captures the density information of the unlabelled samples. [sent-289, score-0.371]
76 Given two labelled samples, one per class, indicated by the diamond and the circle in Figure 1(a), we carried out supervised learning on the basis of the new prior $\tilde{K}$, as described in section 3. [sent-292, score-0.294]
77 The hyperlinks were translated into 66249 undirected “positive” linkages over the pages under the assumption that two pages are likely to be positively correlated if they are hyper-linked by the same hub page. [sent-304, score-0.425]
78 Note there are no “negative” linkages in this case. [sent-305, score-0.206]
79 The numbers of samples and linkages of the four universities are listed in Table 1. [sent-309, score-0.311]
80 We randomly selected 10% samples as labelled data and used the remaining samples as unlabelled data. [sent-310, score-0.518]
81 The grouped boxes from left to right represent the results of GPC, LapSVM, and RGP respectively at different percentages of labelled samples over 100 trials. [sent-341, score-0.288]
82 We conducted this experiment in a transductive setting where the entire linkage data was used to learn the RGP model and comparisons were made with GPC for predicting labels of unlabelled samples. [sent-347, score-0.327]
83 We make comparisons with a discriminant kernel approach to semi-supervised learning – the Laplacian SVM (Sindhwani et al. [sent-348, score-0.119]
84 , 2005) using the linear kernel and a graph Laplacian based regularizer. [sent-349, score-0.127]
85 We recorded the average AUC for predicting labels of unlabelled cases in Table 1. [sent-350, score-0.169]
86 As future work, it would be interesting to utilize weighted linkages and to compare with other graph kernels. [sent-353, score-0.268]
87 We varied the percentage of labelled data from 0. [sent-359, score-0.318]
88 1% to 10% gradually, and at each percentage repeated the random selection of labelled data 100 times. [sent-360, score-0.28]
89 Model parameters for LapSVM were tuned using cross-validation with 50 labelled samples, since it is difficult for discriminant kernel approaches to carry out cross-validation when the labelled samples are scarce. [sent-364, score-0.58]
90 Our algorithm yields much better results than GPC and LapSVM, especially when the fraction of labelled data is less than 5%. [sent-365, score-0.227]
91 When the labelled samples are few (a typical case in semi-supervised learning), cross-validation becomes hard to use, while our approach provides Bayesian model selection via the model evidence. [sent-366, score-0.288]
92 This partition contains 1214 training samples (556 samples of digit 3 and 658 samples of digit 5) and 326 test samples. [sent-372, score-0.278]
93 We randomly picked a subset of the training samples as labelled data and treated the remaining samples as unlabelled. [sent-374, score-0.349]
94 We varied the percentage of labelled data from 0. [sent-375, score-0.318]
95 1% to 10% gradually, and at each percentage repeated the selection of labelled data 100 times. [sent-376, score-0.28]
96 When the percentage of labelled data is less than 5%, our algorithm achieved much better performance than GPC, and very competitive results compared with LapSVM (tuned with 50 labelled samples), even though RGP used ... (Footnote 4: AUC stands for the area under the Receiver Operating Characteristic (ROC) curve.) [sent-390, score-0.507]
97 5 Conclusion We developed a Bayesian framework to learn from relational data based on Gaussian processes. [sent-393, score-0.193]
98 The resulting relational Gaussian processes provide a unified data-dependent covariance function for many learning tasks. [sent-394, score-0.261]
99 While this paper has focused on modelling symmetric (undirected) relations, this relational Gaussian process framework can be generalized for asymmetric (directed) relations as well as multiple classes of relations. [sent-396, score-0.252]
100 (2006) have represented each relational pair by a tensor product of the attributes of the associated nodes, and have further proposed efficient algorithms. [sent-398, score-0.243]
wordName wordTfidf (topN-words)
[('rgp', 0.505), ('fxj', 0.299), ('fxi', 0.26), ('gpc', 0.243), ('labelled', 0.227), ('eij', 0.211), ('linkages', 0.206), ('relational', 0.193), ('unlabelled', 0.169), ('ij', 0.143), ('xr', 0.114), ('sindhwani', 0.104), ('lapsvm', 0.104), ('linkage', 0.099), ('undirected', 0.098), ('rs', 0.088), ('fx', 0.083), ('fz', 0.074), ('reciprocal', 0.069), ('covariance', 0.068), ('instances', 0.068), ('posterior', 0.066), ('gaussian', 0.066), ('kernel', 0.065), ('xs', 0.064), ('graph', 0.062), ('positively', 0.062), ('auc', 0.062), ('samples', 0.061), ('transductive', 0.059), ('relations', 0.059), ('pideal', 0.056), ('chu', 0.055), ('kt', 0.055), ('et', 0.054), ('edges', 0.053), ('percentage', 0.053), ('ep', 0.051), ('attributes', 0.05), ('student', 0.05), ('evidence', 0.047), ('gp', 0.046), ('universities', 0.044), ('documents', 0.044), ('semisupervised', 0.041), ('sij', 0.039), ('seeger', 0.039), ('negatively', 0.039), ('zt', 0.039), ('williams', 0.038), ('web', 0.038), ('varied', 0.038), ('diamond', 0.037), ('fxn', 0.037), ('fxr', 0.037), ('fxs', 0.037), ('fzt', 0.037), ('kapoor', 0.037), ('ncnm', 0.037), ('pcmac', 0.037), ('quartile', 0.037), ('tsvm', 0.037), ('preprocessed', 0.037), ('predictive', 0.036), ('noise', 0.035), ('digit', 0.035), ('vertex', 0.035), ('usps', 0.034), ('rasmussen', 0.034), ('input', 0.033), ('bayesian', 0.033), ('classi', 0.033), ('zhu', 0.033), ('whiskers', 0.033), ('winther', 0.033), ('basu', 0.033), ('wagstaff', 0.033), ('hyperlinks', 0.033), ('approximate', 0.033), ('taskar', 0.032), ('laplacian', 0.031), ('normalization', 0.031), ('joint', 0.031), ('supervised', 0.03), ('link', 0.03), ('zhou', 0.03), ('opper', 0.03), ('warped', 0.03), ('krishnapuram', 0.03), ('tasks', 0.029), ('collection', 0.029), ('bernoulli', 0.028), ('edge', 0.028), ('minka', 0.028), ('lawrence', 0.028), ('saddle', 0.026), ('getoor', 0.026), ('probit', 0.026), ('correlated', 0.026), ('test', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 169 nips-2006-Relational Learning with Gaussian Processes
Author: Wei Chu, Vikas Sindhwani, Zoubin Ghahramani, S. S. Keerthi
Abstract: Correlation between instances is often modelled via a kernel function using input attributes of the instances. Relational knowledge can further reveal additional pairwise correlations between variables of interest. In this paper, we develop a class of models which incorporates both reciprocal relational information and input attributes using Gaussian process techniques. This approach provides a novel non-parametric Bayesian framework with a data-dependent covariance function for supervised learning tasks. We also apply this framework to semi-supervised learning. Experimental results on several real world data sets verify the usefulness of this algorithm. 1
2 0.13577385 183 nips-2006-Stochastic Relational Models for Discriminative Link Prediction
Author: Kai Yu, Wei Chu, Shipeng Yu, Volker Tresp, Zhao Xu
Abstract: We introduce a Gaussian process (GP) framework, stochastic relational models (SRM), for learning social, physical, and other relational phenomena where interactions between entities are observed. The key idea is to model the stochastic structure of entity relationships (i.e., links) via a tensor interaction of multiple GPs, each defined on one type of entities. These models in fact define a set of nonparametric priors on infinite dimensional tensor matrices, where each element represents a relationship between a tuple of entities. By maximizing the marginalized likelihood, information is exchanged between the participating GPs through the entire relational network, so that the dependency structure of links is messaged to the dependency of entities, reflected by the adapted GP kernels. The framework offers a discriminative approach to link prediction, namely, predicting the existences, strengths, or types of relationships based on the partially observed linkage network as well as the attributes of entities (if given). We discuss properties and variants of SRM and derive an efficient learning algorithm. Very encouraging experimental results are achieved on a toy problem and a user-movie preference link prediction task. In the end we discuss extensions of SRM to general relational learning tasks. 1
3 0.10551544 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors
Author: Mark Girolami, Mingjun Zhong
Abstract: By adopting Gaussian process priors a fully Bayesian solution to the problem of integrating possibly heterogeneous data sets within a classification setting is presented. Approximate inference schemes employing Variational & Expectation Propagation based methods are developed and rigorously assessed. We demonstrate our approach to integrating multiple data sets on a large scale protein fold prediction problem where we infer the optimal combinations of covariance functions and achieve state-of-the-art performance without resorting to any ad hoc parameter tuning and classifier combination. 1
4 0.10383414 104 nips-2006-Large-Scale Sparsified Manifold Regularization
Author: Ivor W. Tsang, James T. Kwok
Abstract: Semi-supervised learning is more powerful than supervised learning by using both labeled and unlabeled data. In particular, the manifold regularization framework, together with kernel methods, leads to the Laplacian SVM (LapSVM) that has demonstrated state-of-the-art performance. However, the LapSVM solution typically involves kernel expansions of all the labeled and unlabeled examples, and is slow on testing. Moreover, existing semi-supervised learning methods, including the LapSVM, can only handle a small number of unlabeled examples. In this paper, we integrate manifold regularization with the core vector machine, which has been used for large-scale supervised and unsupervised learning. By using a sparsified manifold regularizer and formulating as a center-constrained minimum enclosing ball problem, the proposed method produces sparse solutions with low time and space complexities. Experimental results show that it is much faster than the LapSVM, and can handle a million unlabeled examples on a standard PC; while the LapSVM can only handle several thousand patterns. 1
5 0.08699467 35 nips-2006-Approximate inference using planar graph decomposition
Author: Amir Globerson, Tommi S. Jaakkola
Abstract: A number of exact and approximate methods are available for inference calculations in graphical models. Many recent approximate methods for graphs with cycles are based on tractable algorithms for tree structured graphs. Here we base the approximation on a different tractable model, planar graphs with binary variables and pure interaction potentials (no external field). The partition function for such models can be calculated exactly using an algorithm introduced by Fisher and Kasteleyn in the 1960s. We show how such tractable planar models can be used in a decomposition to derive upper bounds on the partition function of non-planar models. The resulting algorithm also allows for the estimation of marginals. We compare our planar decomposition to the tree decomposition method of Wainwright et. al., showing that it results in a much tighter bound on the partition function, improved pairwise marginals, and comparable singleton marginals.
6 0.083525196 57 nips-2006-Conditional mean field
7 0.079851247 46 nips-2006-Blind source separation for over-determined delayed mixtures
8 0.076788515 150 nips-2006-On Transductive Regression
9 0.070370592 43 nips-2006-Bayesian Model Scoring in Markov Random Fields
10 0.069785416 151 nips-2006-On the Relation Between Low Density Separation, Spectral Clustering and Graph Cuts
11 0.068418369 115 nips-2006-Learning annotated hierarchies from relational data
12 0.067137852 2 nips-2006-A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation
13 0.06674882 92 nips-2006-High-Dimensional Graphical Model Selection Using $\ell 1$-Regularized Logistic Regression
14 0.062043261 65 nips-2006-Denoising and Dimension Reduction in Feature Space
15 0.061350912 39 nips-2006-Balanced Graph Matching
16 0.060855389 166 nips-2006-Recursive Attribute Factoring
17 0.059635058 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization
18 0.058653072 77 nips-2006-Fast Computation of Graph Kernels
19 0.057637736 48 nips-2006-Branch and Bound for Semi-Supervised Support Vector Machines
20 0.057508998 163 nips-2006-Prediction on a Graph with a Perceptron
topicId topicWeight
[(0, -0.206), (1, 0.064), (2, 0.015), (3, 0.018), (4, 0.05), (5, 0.057), (6, 0.096), (7, 0.061), (8, -0.055), (9, -0.086), (10, -0.001), (11, -0.049), (12, 0.005), (13, -0.132), (14, -0.036), (15, 0.064), (16, -0.017), (17, -0.057), (18, -0.078), (19, -0.009), (20, -0.058), (21, -0.065), (22, 0.042), (23, 0.133), (24, 0.144), (25, 0.071), (26, -0.05), (27, 0.004), (28, 0.024), (29, 0.005), (30, 0.051), (31, 0.0), (32, 0.004), (33, -0.008), (34, -0.011), (35, -0.024), (36, 0.135), (37, -0.076), (38, 0.123), (39, 0.121), (40, 0.112), (41, -0.021), (42, 0.037), (43, 0.079), (44, 0.128), (45, 0.074), (46, 0.073), (47, 0.015), (48, -0.07), (49, -0.05)]
simIndex simValue paperId paperTitle
same-paper 1 0.89442122 169 nips-2006-Relational Learning with Gaussian Processes
Author: Wei Chu, Vikas Sindhwani, Zoubin Ghahramani, S. S. Keerthi
Abstract: Correlation between instances is often modelled via a kernel function using input attributes of the instances. Relational knowledge can further reveal additional pairwise correlations between variables of interest. In this paper, we develop a class of models which incorporates both reciprocal relational information and input attributes using Gaussian process techniques. This approach provides a novel non-parametric Bayesian framework with a data-dependent covariance function for supervised learning tasks. We also apply this framework to semi-supervised learning. Experimental results on several real world data sets verify the usefulness of this algorithm. 1
2 0.72484815 183 nips-2006-Stochastic Relational Models for Discriminative Link Prediction
Author: Kai Yu, Wei Chu, Shipeng Yu, Volker Tresp, Zhao Xu
Abstract: We introduce a Gaussian process (GP) framework, stochastic relational models (SRM), for learning social, physical, and other relational phenomena where interactions between entities are observed. The key idea is to model the stochastic structure of entity relationships (i.e., links) via a tensor interaction of multiple GPs, each defined on one type of entities. These models in fact define a set of nonparametric priors on infinite dimensional tensor matrices, where each element represents a relationship between a tuple of entities. By maximizing the marginalized likelihood, information is exchanged between the participating GPs through the entire relational network, so that the dependency structure of links is messaged to the dependency of entities, reflected by the adapted GP kernels. The framework offers a discriminative approach to link prediction, namely, predicting the existences, strengths, or types of relationships based on the partially observed linkage network as well as the attributes of entities (if given). We discuss properties and variants of SRM and derive an efficient learning algorithm. Very encouraging experimental results are achieved on a toy problem and a user-movie preference link prediction task. In the end we discuss extensions of SRM to general relational learning tasks. 1
3 0.49461538 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors
Author: Mark Girolami, Mingjun Zhong
Abstract: By adopting Gaussian process priors, a fully Bayesian solution to the problem of integrating possibly heterogeneous data sets within a classification setting is presented. Approximate inference schemes employing Variational & Expectation Propagation based methods are developed and rigorously assessed. We demonstrate our approach to integrating multiple data sets on a large scale protein fold prediction problem where we infer the optimal combinations of covariance functions and achieve state-of-the-art performance without resorting to any ad hoc parameter tuning and classifier combination. 1
4 0.46933398 82 nips-2006-Gaussian and Wishart Hyperkernels
Author: Risi Kondor, Tony Jebara
Abstract: We propose a new method for constructing hyperkernels and define two promising special cases that can be computed in closed form. These we call the Gaussian and Wishart hyperkernels. The former is especially attractive in that it has an interpretable regularization scheme reminiscent of that of the Gaussian RBF kernel. We discuss how kernel learning can be used not just for improving the performance of classification and regression methods, but also as a stand-alone algorithm for dimensionality reduction and relational or metric learning. 1
5 0.45670494 166 nips-2006-Recursive Attribute Factoring
Author: Deepak Verma, Karl Pfleger, David Tax
Abstract: Clustering, or factoring, of a document collection attempts to “explain” each observed document in terms of one or a small number of inferred prototypes. Prior work demonstrated that when links exist between documents in the corpus (as is the case with a collection of web pages or scientific papers), building a joint model of document contents and connections produces a better model than that built from contents or connections alone. Many problems arise when trying to apply these joint models to corpora at the scale of the World Wide Web, however; one of these is that the sheer overhead of representing a feature space on the order of billions of dimensions becomes impractical. We address this problem with a simple representational shift inspired by probabilistic relational models: instead of representing document linkage in terms of the identities of linking documents, we represent it by the explicit and inferred attributes of the linking documents. Several surprising results come with this shift: in addition to being computationally more tractable, the new model produces factors that more cleanly decompose the document collection. We discuss several variations on this model and show how some can be seen as exact generalizations of the PageRank algorithm. 1
6 0.45598522 57 nips-2006-Conditional mean field
7 0.45033282 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods
8 0.41478994 104 nips-2006-Large-Scale Sparsified Manifold Regularization
9 0.41066697 5 nips-2006-A Kernel Method for the Two-Sample-Problem
10 0.40805137 35 nips-2006-Approximate inference using planar graph decomposition
11 0.40637404 2 nips-2006-A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation
12 0.39033532 43 nips-2006-Bayesian Model Scoring in Markov Random Fields
13 0.37729251 115 nips-2006-Learning annotated hierarchies from relational data
14 0.37594354 124 nips-2006-Linearly-solvable Markov decision problems
15 0.36990216 103 nips-2006-Kernels on Structured Objects Through Nested Histograms
16 0.36395103 150 nips-2006-On Transductive Regression
17 0.35265642 93 nips-2006-Hyperparameter Learning for Graph Based Semi-supervised Learning Algorithms
18 0.35210627 177 nips-2006-Sparse Kernel Orthonormalized PLS for feature extraction in large data sets
19 0.34282634 48 nips-2006-Branch and Bound for Semi-Supervised Support Vector Machines
20 0.33648777 201 nips-2006-Using Combinatorial Optimization within Max-Product Belief Propagation
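The same-paper abstract above describes a data-dependent covariance that fuses an input-attribute kernel with pairwise relational links. The snippet below is a minimal sketch of one plausible construction, not necessarily the paper's exact formula: it warps a base RBF kernel with the (signed) graph Laplacian of the relational graph, so that positively linked instances become more strongly correlated and negatively linked ones less so. The function names, the lengthscale, and the coupling strength gamma are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, lengthscale=1.0):
    # Squared-exponential kernel over input attributes.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def signed_laplacian(W):
    # Laplacian of a signed adjacency matrix W (+1 for positive edges, -1 for negative edges).
    return np.diag(np.abs(W).sum(axis=1)) - W

def relational_covariance(X, W, lengthscale=1.0, gamma=1.0):
    # Graph-warped kernel: K_tilde = K - K (I/gamma + L K)^{-1} L K.
    # The relational links deform the base RBF covariance; gamma controls how much.
    K = rbf_kernel(X, lengthscale)
    L = signed_laplacian(W)
    n = X.shape[0]
    M = np.linalg.solve(np.eye(n) / gamma + L @ K, L @ K)
    return K - K @ M
```

Treated as a prior covariance over the latent function values on the vertex set, a matrix of this kind can be plugged into standard GP machinery for the supervised and semi-supervised tasks listed above.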
topicId topicWeight
topicId topicWeight
[(1, 0.111), (3, 0.027), (7, 0.091), (9, 0.043), (20, 0.02), (22, 0.081), (29, 0.266), (44, 0.067), (52, 0.017), (57, 0.073), (64, 0.014), (65, 0.052), (69, 0.038), (71, 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.7899403 169 nips-2006-Relational Learning with Gaussian Processes
Author: Wei Chu, Vikas Sindhwani, Zoubin Ghahramani, S. S. Keerthi
Abstract: Correlation between instances is often modelled via a kernel function using input attributes of the instances. Relational knowledge can further reveal additional pairwise correlations between variables of interest. In this paper, we develop a class of models which incorporates both reciprocal relational information and input attributes using Gaussian process techniques. This approach provides a novel non-parametric Bayesian framework with a data-dependent covariance function for supervised learning tasks. We also apply this framework to semi-supervised learning. Experimental results on several real world data sets verify the usefulness of this algorithm. 1
2 0.60125089 20 nips-2006-Active learning for misspecified generalized linear models
Author: Francis R. Bach
Abstract: Active learning refers to algorithmic frameworks aimed at selecting training data points in order to reduce the number of required training data points and/or improve the generalization performance of a learning method. In this paper, we present an asymptotic analysis of active learning for generalized linear models. Our analysis holds under the common practical situation of model misspecification, and is based on realistic assumptions regarding the nature of the sampling distributions, which are usually neither independent nor identical. We derive unbiased estimators of generalization performance, as well as estimators of expected reduction in generalization error after adding a new training data point, that allow us to optimize its sampling distribution through a convex optimization problem. Our analysis naturally leads to an algorithm for sequential active learning which is applicable for all tasks supported by generalized linear models (e.g., binary classification, multi-class classification, regression) and can be applied in non-linear settings through the use of Mercer kernels. 1
3 0.6008485 65 nips-2006-Denoising and Dimension Reduction in Feature Space
Author: Mikio L. Braun, Klaus-Robert Müller, Joachim M. Buhmann
Abstract: We show that the relevant information about a classification problem in feature space is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem. Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data to perform classification. Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and to (3) de-noise in feature space in order to yield better classification results. 1
4 0.60039526 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
Author: Hamed Valizadegan, Rong Jin
Abstract: Maximum margin clustering was proposed recently and has shown promising performance in recent studies [1, 2]. It extends the theory of support vector machines to unsupervised learning. Despite its good performance, there are three major problems with maximum margin clustering that question its efficiency for real-world applications. First, it is computationally expensive and difficult to scale to large-scale datasets because the number of parameters in maximum margin clustering is quadratic in the number of examples. Second, it requires data preprocessing to ensure that any clustering boundary will pass through the origins, which makes it unsuitable for clustering unbalanced datasets. Third, it is sensitive to the choice of kernel functions, and requires an external procedure to determine the appropriate values for the parameters of kernel functions. In this paper, we propose a “generalized maximum margin clustering” framework that addresses the above three problems simultaneously. The new framework generalizes the maximum margin clustering algorithm by allowing any clustering boundaries including those not passing through the origins. It significantly improves the computational efficiency by reducing the number of parameters. Furthermore, the new framework is able to automatically determine the appropriate kernel matrix without any labeled data. Finally, we show a formal connection between maximum margin clustering and spectral clustering. We demonstrate the efficiency of the generalized maximum margin clustering algorithm using both synthetic datasets and real datasets from the UCI repository. 1
5 0.60035288 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
Author: Dong S. Cheng, Vittorio Murino, Mário Figueiredo
Abstract: This paper proposes a new approach to model-based clustering under prior knowledge. The proposed formulation can be interpreted from two different angles: as penalized logistic regression, where the class labels are only indirectly observed (via the probability density of each class); as finite mixture learning under a grouping prior. To estimate the parameters of the proposed model, we derive a (generalized) EM algorithm with a closed-form E-step, in contrast with other recent approaches to semi-supervised probabilistic clustering which require Gibbs sampling or suboptimal shortcuts. We show that our approach is ideally suited for image segmentation: it avoids the combinatorial nature of Markov random field priors, and opens the door to more sophisticated spatial priors (e.g., wavelet-based) in a simple and computationally efficient way. Finally, we extend our formulation to work in unsupervised, semi-supervised, or discriminative modes. 1
6 0.59865314 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
7 0.59691811 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
8 0.59632534 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
9 0.59618902 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
10 0.59584707 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy
11 0.59407508 175 nips-2006-Simplifying Mixture Models through Function Approximation
12 0.59191513 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions
13 0.59171045 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition
14 0.59125012 117 nips-2006-Learning on Graph with Laplacian Regularization
15 0.58951789 61 nips-2006-Convex Repeated Games and Fenchel Duality
16 0.58819669 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation
17 0.58706945 79 nips-2006-Fast Iterative Kernel PCA
18 0.58644336 158 nips-2006-PG-means: learning the number of clusters in data
19 0.58560193 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments
20 0.58482337 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis
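Because a covariance of the kind sketched earlier already couples labelled and unlabelled vertices through the relational graph, semi-supervised prediction reduces to ordinary GP inference over the joint covariance. Below is a sketch assuming a Gaussian (regression) likelihood for simplicity; for the classification tasks described in the abstracts above one would substitute a probit or logistic likelihood with approximate inference (e.g., EP or a Laplace approximation). The helper name, index handling, and noise level are illustrative assumptions.

```python
import numpy as np

def gp_predict(K_full, labelled_idx, y, unlabelled_idx, noise=0.1):
    # Standard GP posterior on the unlabelled vertices, given a joint covariance
    # K_full over all vertices (e.g., produced by relational_covariance above).
    K_ll = K_full[np.ix_(labelled_idx, labelled_idx)]
    K_ul = K_full[np.ix_(unlabelled_idx, labelled_idx)]
    K_uu = K_full[np.ix_(unlabelled_idx, unlabelled_idx)]
    A = K_ll + noise ** 2 * np.eye(len(labelled_idx))
    alpha = np.linalg.solve(A, y)
    mean = K_ul @ alpha                              # posterior mean on unlabelled vertices
    cov = K_uu - K_ul @ np.linalg.solve(A, K_ul.T)   # posterior covariance
    return mean, np.diag(cov)
```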