nips nips2007 nips2007-131 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Peter Hoff
Abstract: This article discusses a latent variable model for inference and prediction of symmetric relational data. The model, based on the idea of the eigenvalue decomposition, represents the relationship between two nodes as the weighted inner-product of node-specific vectors of latent characteristics. This “eigenmodel” generalizes other popular latent variable models, such as latent class and distance models: It is shown mathematically that any latent class or distance model has a representation as an eigenmodel, but not vice-versa. The practical implications of this are examined in the context of three real datasets, for which the eigenmodel has as good or better out-of-sample predictive performance than the other two models. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Modeling homophily and stochastic equivalence in symmetric relational data Peter D. [sent-1, score-0.516]
2 edu Abstract This article discusses a latent variable model for inference and prediction of symmetric relational data. [sent-5, score-0.498]
3 The model, based on the idea of the eigenvalue decomposition, represents the relationship between two nodes as the weighted inner-product of node-specific vectors of latent characteristics. [sent-6, score-0.358]
4 This “eigenmodel” generalizes other popular latent variable models, such as latent class and distance models: It is shown mathematically that any latent class or distance model has a representation as an eigenmodel, but not vice-versa. [sent-7, score-1.089]
5 The practical implications of this are examined in the context of three real datasets, for which the eigenmodel has as good or better out-of-sample predictive performance than the other two models. [sent-8, score-0.293]
6 The examples considered in this article include friendships among people, associations among words and interactions among proteins. [sent-10, score-0.11]
7 Such measurements are often represented by a sociomatrix Y , which is a symmetric n × n matrix with an undefined diagonal. [sent-11, score-0.153]
8 One of the goals of relational data analysis is to describe the variation among the entries of Y , as well as any potential covariation of Y with observed explanatory variables X = {xi,j , 1 ≤ i < j ≤ n}. [sent-12, score-0.291]
9 To this end, a variety of statistical models have been developed that describe yi,j as some function of node-specific latent variables ui and uj and a linear predictor β T xi,j . [sent-13, score-0.797]
10 , un } represent across-node variation in the yi,j ’s and β represents covariation of the yi,j ’s with the xi,j ’s. [sent-17, score-0.225]
11 For example, Nowicki and Snijders [2001] present a model in which each node i is assumed to belong to an unobserved latent class ui , and a probability distribution describes the relationships between each pair of classes (see Kemp et al. [sent-18, score-0.635]
12 Such a model captures stochastic equivalence, a type of pattern often seen in network data in which the nodes can be divided into groups such that members of the same group have similar patterns of relationships. [sent-21, score-0.234]
13 An alternative approach to representing across-node variation is based on the idea of homophily, in which the relationships between nodes with similar characteristics are stronger than the relationships between nodes having different characteristics. [sent-22, score-0.297]
14 Homophily provides an explanation to data patterns often seen in social networks, such as transitivity (“a friend of a friend is a friend”), balance (“the enemy of my friend is an enemy”) and the existence of cohesive subgroups of nodes. [sent-23, score-0.391]
15 [2002] present a model in which the conditional mean of yi,j is a function of β xi,j − |ui − uj |, where {u1 , . [sent-25, score-0.329]
16 , un } are vectors of unobserved, latent characteristics in a Euclidean space. [sent-28, score-0.352]
17 In the context of binary relational data, such a model predicts the existence of more transitive triples, or “triangles,” than would be seen under a random allocation of edges among pairs of nodes. [sent-29, score-0.242]
18 The latent class model of Nowicki and Snijders [2001] and the latent distance model of Hoff et al. [sent-32, score-0.662]
19 [2002] are able to identify, respectively, classes of nodes with similar roles, and the locational properties of the nodes. [sent-33, score-0.129]
20 These two items are perhaps the two primary features of interest in social network and relational data analysis. [sent-34, score-0.294]
21 However, a model that can represent one feature may not be able to represent the other: Consider the two graphs in Figure 1. [sent-36, score-0.095]
22 The graph on the left displays a large degree of transitivity, and can be well-represented by the latent distance model with a set of vectors {u1 , . [sent-37, score-0.395]
23 , un } in two-dimensional space, in which the probability of an edge between i and j is decreasing in |ui − uj |. [sent-40, score-0.377]
24 In contrast, representation of the graph by a latent class model would require a large number of classes, none of which would be particularly cohesive or distinguishable from the others. [sent-41, score-0.394]
25 The second panel of Figure 1 displays a network involving three classes of stochastically equivalent nodes, two of which (say A and B) have only across-class ties, and one (C) that has both within- and across-class ties. [sent-42, score-0.175]
26 This graph is well-represented by a latent class model in which edges occur with high probability between pairs having one member in each of A and B or in B and C, and among pairs having both members in C (in models of stochastic equivalence, nodes within each class are not differentiated). [sent-43, score-0.602]
27 In contrast, representation of this type of graph with a latent distance model would require the dimension of the latent characteristics to be on the order of the class membership sizes. [sent-44, score-0.668]
28 Many real networks exhibit combinations of structural equivalence and homophily in varying degrees. [sent-45, score-0.329]
29 In these situations, use of either the latent class or distance model would only be representing part of the network structure. [sent-46, score-0.443]
30 , un } are node-specific factors and Λ is a diagonal matrix. [sent-50, score-0.115]
31 In this i article, we show mathematically and by example how this eigenmodel can represent both stochastic equivalence and homophily in symmetric relational data, and thus is more general than the other two latent variable models. [sent-51, score-1.079]
32 The next section motivates the use of latent variables models for relational data, and shows mathematically that the eigenmodel generalizes the latent class and distance models in the sense that it can compactly represent the same network features as these other models but not vice-versa. [sent-52, score-1.268]
33 Section 3 compares the out-of-sample predictive performance of these three models on three different datasets: a social network of 12th graders; a relational dataset on word association counts from the first chapter of Genesis; and a dataset on protein-protein interactions. [sent-53, score-0.409]
34 The first two networks exhibit latent homophily and stochastic equivalence respectively, whereas the third shows both to some degree. [sent-54, score-0.602]
35 2 In support of the theoretical results of Section 2, the latent distance and class models perform well for the first and second datasets respectively, whereas the eigenmodel performs well for all three. [sent-55, score-0.694]
36 14] show that if a model satisfies the above exchangeability condition for each integer n, then it can be written as a latent variable model of the form yi,j = h(µ, ui , uj , i,j ) (1) for i. [sent-63, score-0.856]
37 This result is very general - it says that any statistical model for a sociomatrix in which the nodes are exchangeable can be written as a latent variable model. [sent-73, score-0.498]
38 A general probit model for binary network data can be put in the form of (1) as follows: { : 1 ≤ i < j ≤ n} ∼ i. [sent-75, score-0.155]
39 f (u|ψ) = h(µ, ui , uj , i,j ) = δ(0,∞) (µ + α(ui , uj ) + i,j yi,j i,j ), where µ and ψ are parameters to be estimated, and α is a symmetric function, also potentially involving parameters to be estimated. [sent-84, score-0.808]
40 Finally, integrating over i,j we obtain Pr(yi,j = 1|xi,j , ui , uj ) = Φ[µ + β T xi,j + α(ui , uj )]. [sent-86, score-0.767]
41 , un } can be expressed as Pr(yi,j = 1|xi,j , ui , uj ) ≡ θi,j = Pr(Y |X, u1 , . [sent-90, score-0.62]
42 , un ) Φ[µ + β T xi,j + α(ui , uj )] yi,j θi,j (1 = (2) yi,j − θi,j ) i 0 or λk < 0. [sent-93, score-0.377]
43 In this way, the model can represent both positive or negative homophily in varying degrees, and stochastically equivalent nodes (nodes with the same or similar latent vectors) may or may not have strong relationships with one another. [sent-94, score-0.669]
44 We now show that the eigenmodel generalizes the latent class and distance models: Let Sn be the set of n × n sociomatrices, and let CK = {C ∈ Sn : ci,j = mui ,uj , ui ∈ {1, . [sent-95, score-0.966]
45 , K}, M a K × K symmetric matrix}; DK = {D ∈ Sn : di,j = −|ui − uj |, ui ∈ RK }; EK = {E ∈ Sn : ei,j = uT Λuj , ui ∈ RK , Λ a K × K diagonal matrix}. [sent-98, score-0.789]
46 i In other words, CK is the set of possible values of {α(ui , uj ), 1 ≤ i < j ≤ n} under a Kdimensional latent class model, and similarly for DK and EK . [sent-99, score-0.559]
47 ˜ EK generalizes CK : Let C ∈ CK and let C be a completion of C obtained by setting ci,i = mui ,ui . [sent-100, score-0.09]
48 Such a (negative) distance matrix will generally be of full rank, in which case it cannot be represented exactly by an E ∈ EK for K < n. [sent-105, score-0.099]
49 This is because the probit and ordered probit model we are considering include threshold variables {µy : y ∈ Y} which can be adjusted to accommodate monotone transformations of α(ui , uj ). [sent-107, score-0.455]
50 Furthermore, letting ui = (zi , r2 − zi zi ) ∈ RK+1 for each i ∈ {1, . [sent-118, score-0.299]
51 For large r this is approximately r 2 − |z − z |2 /2, which ui uj = zi zj + (r i j i j is an increasing function of the negative distance di,j . [sent-122, score-0.607]
52 DK does not weakly generalize E1 : Consider E ∈ E1 generated by Λ = 1, u1 = 1 and ui = r < 1 for i > 1. [sent-124, score-0.297]
53 This is because the kissing number in R2 , or the number of non-overlapping spheres of unit radius that can simultaneously touch a central sphere of unit radius, is 6. [sent-129, score-0.173]
54 If we put node 1 at the center of the central sphere, and 6 nodes at the centers of the 6 kissing spheres, then we have d1,i1 = d1,i2 = di1 ,i2 for all i1 , i2 = 1. [sent-130, score-0.154]
55 We can only have d1,i1 = d1,i2 > di1 ,i2 if we remove one of the non-central spheres to allow for more room between those remaining, leaving one central sphere plus five kissing spheres for a total of n = 6. [sent-131, score-0.248]
56 4 A less general positive semi-definite version of the eigenmodel has been studied by Hoff [2005], in which Λ was taken to be the identity matrix. [sent-133, score-0.262]
57 Such a model can weakly generalize a distance model, but cannot generalize a latent class model, as the eigenvalues of a latent class model could be negative. [sent-134, score-0.801]
58 For these algorithms, it is useful to formulate the probit models described in Section 2. [sent-137, score-0.108]
59 1 in terms of an additional latent variable zi,j ∼ normal[β xi,j + α(ui , uj )], for which yi,j = y if µy < zi,j < µy+1 . [sent-138, score-0.499]
60 For each {i, j}, sample zi,j from its (constrained normal) full conditional distribution. [sent-146, score-0.065]
61 For each y ∈ Y, sample µy from its (normal) full conditional distribution. [sent-147, score-0.065]
62 , un and their associated parameters: • For the latent distance model, propose and accept or reject new values of the ui ’s with the Metropolis algorithm, and then sample the population variances of the ui ’s from their (inverse-gamma) full conditional distributions. [sent-152, score-0.977]
63 • For the latent class model, update each class variable ui from its (multinomial) conditional distribution given current values of Z, {uj : j = i} and the variance of the elements of M (but marginally over M to improve mixing). [sent-153, score-0.64]
64 Then sample the elements of M from their (normal) full conditional distributions and the variance of the entries of M from its (inverse-gamma) full conditional distribution. [sent-154, score-0.158]
65 • For the latent vector model, sample each ui from its (multivariate normal) full conditional distribution, sample the mean of the ui ’s from their (normal) full conditional distributions, and then sample Λ from its (multivariate normal) full conditional distribution. [sent-155, score-0.918]
66 To facilitate comparison across models, we used prior distributions in which the level of prior variability in α(ui , uj ) was similar across the three different models (further details and code to implement these algorithms are available at my website). [sent-156, score-0.287]
67 2 Cross validation To compare the performance of these three different models we evaluated their out-of-sample predictive performance under a range of dimensions (K ∈ {3, 5, 10}) and on three different datasets exhibiting varying combinations of homophily and stochastic equivalence. [sent-158, score-0.39]
68 For each combination of dataset, dimension and model we performed a five-fold cross validation experiment as follows: 1. [sent-159, score-0.076]
69 , 5}: (a) Obtain posterior distributions of the model parameter conditional on {yi,j : si,j = s}, the data on pairs not in set s. [sent-165, score-0.067]
70 ˆ This procedure generates a sociomatrix Y , in which each entry yi,j represents a predicted value ˆ ˆ obtained from using a subset of the data that does not include yi,j . [sent-167, score-0.112]
71 Thus Y is a sociomatrix of out-of-sample predictions of the observed data Y . [sent-168, score-0.112]
72 Add health Genesis Protein interaction dist class eigen dist class eigen dist class eigen 0. [sent-170, score-0.525]
73 90 0 5000 15000 25000 false positives Figure 2: Social network data and unscaled ROC curves for the K = 3 models. [sent-197, score-0.167]
74 3 Adolescent Health social network The first dataset records friendship ties among 247 12th-graders, obtained from the National Longitudinal Study of Adolescent Health (www. [sent-199, score-0.219]
75 These data are represented as an undirected graph in the first panel of Figure 2. [sent-204, score-0.084]
76 Like many social networks, these data exhibit a good deal of transitivity. [sent-205, score-0.113]
77 It is therefore not surprising that the best performing models considered (in terms of area under the ROC curve, given in Table 1) are the distance models, with the eigenmodels close behind. [sent-206, score-0.099]
78 In contrast, the latent class models perform poorly, and the results suggest that increasing K for this model would not improve its performance. [sent-207, score-0.349]
79 4 Word neighbors in Genesis The second dataset we consider is derived from word and punctuation counts in the first chapter of the King James version of Genesis (www. [sent-209, score-0.087]
80 There are 158 unique words and punctuation marks in this chapter, and for our example we take yi,j to be the number of times that word i and word j appear next to each other (a model extension, appropriate for an asymmetric version of this dataset, is discussed in the next section). [sent-213, score-0.125]
81 These data can be viewed as a graph with weighted edges, the unweighted version of which is shown in the first panel of Figure 3. [sent-214, score-0.084]
82 The cross validation results support this claim, in that the latent class model performs much better than the distance model on these data, as seen in the second panel of Figure 3 and in Table 1. [sent-217, score-0.525]
83 As discussed in the previous section, the eigenmodel generalizes the latent class model and performs equally well. [sent-218, score-0.639]
84 We note that parameter estimates for these data were obtained using the ordered probit versions of the models (as the data are not binary), but the out-of-sample predictive performance was evaluated based on each model’s ability to predict a non-zero relationship. [sent-221, score-0.139]
85 5 Protein-protein interaction data Our last example is the protein-protein interaction data of Butland et al. [sent-223, score-0.066]
86 We analyze the large connected component of this graph, which includes 230 proteins and is displayed in the first panel of 4. [sent-225, score-0.075]
87 This graph indicates patterns of both stochastic equivalence and homophily: Some nodes could be described as “hubs”, connecting to many other nodes which in turn do not connect to each other. [sent-226, score-0.386]
88 Such structure is better represented by a latent class model than a distance model. [sent-227, score-0.398]
89 However, most nodes connecting to hubs generally connect to only one hub, which is a feature that is hard to represent with a small number of latent classes. [sent-228, score-0.429]
90 To represent this structure well, we would need two latent classes per hub, one for the hub itself and one for the nodes connecting to the hub. [sent-229, score-0.483]
91 Furthermore, the core of the network (the nodes with more than two connections) displays a good degree of homophily in the form of transitive triads, a feature which is easiest to represent with a distance model. [sent-230, score-0.515]
92 The eigenmodel is able to capture both of these data features and performs better than the other two models in terms of out-of-sample predictive performance. [sent-231, score-0.318]
93 In fact, the K = 3 eigenmodel performs better than the other two models for any value of K considered. [sent-232, score-0.287]
94 4 Discussion Latent distance and latent class models provide concise, easily interpreted descriptions of social networks and relational data. [sent-233, score-0.671]
95 However, neither of these models will provide a complete picture of relational data that exhibit degrees of both homophily and stochastic equivalence. [sent-234, score-0.459]
96 In contrast, we have shown that a latent eigenmodel is able to represent datasets with either or both of these data patterns. [sent-235, score-0.569]
97 This is due to the fact that the eigenmodel provides an unrestricted low-rank approximation to the sociomatrix, and is therefore able to represent a wide array of patterns in the data. [sent-236, score-0.324]
98 The concept behind the eigenmodel is the familiar eigenvalue decomposition of a symmetric matrix. [sent-237, score-0.356]
99 The analogue for directed networks or rectangular matrix data would be a model based on the singular value decomposition, in which data yi,j could be modeled as depending on uT Dvj , where i ui and vj represent vectors of latent row and column effects respectively. [sent-238, score-0.567]
100 the approach for binary and other non-Gaussian relational datasets could be implemented using the ordered probit model discussed in this paper. [sent-241, score-0.307]
wordName wordTfidf (topN-words)
[('qq', 0.636), ('eigenmodel', 0.262), ('uj', 0.262), ('ui', 0.243), ('latent', 0.237), ('homophily', 0.212), ('relational', 0.161), ('ek', 0.131), ('un', 0.115), ('sociomatrix', 0.112), ('hoff', 0.098), ('nodes', 0.098), ('genesis', 0.094), ('social', 0.088), ('probit', 0.083), ('dk', 0.08), ('spheres', 0.075), ('distance', 0.074), ('positives', 0.066), ('equivalence', 0.066), ('class', 0.06), ('friend', 0.06), ('exchangeability', 0.06), ('hub', 0.056), ('kissing', 0.056), ('nowicki', 0.056), ('sociomatrices', 0.056), ('unscaled', 0.056), ('generalizes', 0.053), ('normal', 0.052), ('panel', 0.051), ('eigen', 0.049), ('covariation', 0.049), ('issn', 0.049), ('ck', 0.049), ('zn', 0.045), ('network', 0.045), ('roc', 0.043), ('sphere', 0.042), ('dist', 0.042), ('symmetric', 0.041), ('conditional', 0.04), ('health', 0.039), ('adolescent', 0.037), ('butland', 0.037), ('cohesive', 0.037), ('friendship', 0.037), ('mui', 0.037), ('snijders', 0.037), ('verbs', 0.037), ('relationships', 0.037), ('datasets', 0.036), ('stochastic', 0.036), ('sn', 0.035), ('word', 0.035), ('pr', 0.034), ('represent', 0.034), ('interaction', 0.033), ('graph', 0.033), ('enemy', 0.033), ('hubs', 0.033), ('article', 0.032), ('predictive', 0.031), ('classes', 0.031), ('mathematically', 0.03), ('predictor', 0.03), ('airoldi', 0.03), ('decomposition', 0.03), ('weakly', 0.029), ('zi', 0.028), ('wasserman', 0.028), ('transitive', 0.028), ('punctuation', 0.028), ('entries', 0.028), ('patterns', 0.028), ('connecting', 0.027), ('model', 0.027), ('variation', 0.027), ('kemp', 0.026), ('exhibiting', 0.026), ('networks', 0.026), ('ut', 0.026), ('rk', 0.026), ('among', 0.026), ('models', 0.025), ('exhibit', 0.025), ('full', 0.025), ('transitivity', 0.025), ('generalize', 0.025), ('cross', 0.025), ('displays', 0.024), ('exchangeable', 0.024), ('proteins', 0.024), ('stochastically', 0.024), ('chapter', 0.024), ('validation', 0.024), ('eigenvalue', 0.023), ('ordering', 0.023), ('ties', 0.023), ('yang', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 131 nips-2007-Modeling homophily and stochastic equivalence in symmetric relational data
Author: Peter Hoff
Abstract: This article discusses a latent variable model for inference and prediction of symmetric relational data. The model, based on the idea of the eigenvalue decomposition, represents the relationship between two nodes as the weighted inner-product of node-specific vectors of latent characteristics. This “eigenmodel” generalizes other popular latent variable models, such as latent class and distance models: It is shown mathematically that any latent class or distance model has a representation as an eigenmodel, but not vice-versa. The practical implications of this are examined in the context of three real datasets, for which the eigenmodel has as good or better out-of-sample predictive performance than the other two models. 1
2 0.12814078 97 nips-2007-Hidden Common Cause Relations in Relational Learning
Author: Ricardo Silva, Wei Chu, Zoubin Ghahramani
Abstract: When predicting class labels for objects within a relational database, it is often helpful to consider a model for relationships: this allows for information between class labels to be shared and to improve prediction performance. However, there are different ways by which objects can be related within a relational database. One traditional way corresponds to a Markov network structure: each existing relation is represented by an undirected edge. This encodes that, conditioned on input features, each object label is independent of other object labels given its neighbors in the graph. However, there is no reason why Markov networks should be the only representation of choice for symmetric dependence structures. Here we discuss the case when relationships are postulated to exist due to hidden common causes. We discuss how the resulting graphical model differs from Markov networks, and how it describes different types of real-world relational processes. A Bayesian nonparametric classification model is built upon this graphical representation and evaluated with several empirical studies. 1 Contribution Prediction problems, such as classification, can be easier when class labels share a sort of relational dependency that is not accounted by the input features [10]. If the variables to be predicted are attributes of objects in a relational database, such dependencies are often postulated from the relations that exist in the database. This paper proposes and evaluates a new method for building classifiers that uses information concerning the relational structure of the problem. Consider the following standard example, adapted from [3]. There are different webpages, each one labeled according to some class (e.g., “student page” or “not a student page”). Features such as the word distribution within the body of each page can be used to predict each webpage’s class. However, webpages do not exist in isolation: there are links connecting them. Two pages having a common set of links is evidence for similarity between such pages. For instance, if W1 and W3 both link to W2 , this is commonly considered to be evidence for W1 and W3 having the same class. One way of expressing this dependency is through the following Markov network [5]: ∗ Now at the Statistical Laboratory, University of Cambridge. E-mail: silva@statslab.cam.ac.uk F1 F2 F 3 C1 C2 C3 Here Fi are the features of page Wi , and Ci is its respective page label. Other edges linking F variables to C variables (e.g., F1 −C2 ) can be added without affecting the main arguments presented in this section. The semantics of the graph, for a fixed input feature set {F1 , F2 , F3 }, are as follows: C1 is marginally dependent on C3 , but conditionally independent given C2 . Depending on the domain, this might be either a suitable or unsuitable representation of relations. For instance, in some domains it could be the case that the most sensible model would state that C1 is only informative about C3 once we know what C2 is: that is, C1 and C3 are marginally independent, but dependent given C2 . This can happen if the existence of a relation (Ci , Cj ) corresponds to the existence of hidden common causes generating this pair of random variables. Consider the following example, loosely based on a problem described by [12]. We have three objects, Microsoft (M ), Sony (S) and Philips (P ). The task is a regression task where we want to predict the stock market price of each company given its profitability from last year. The given relationships are that M and S are direct competitors (due to the videogame console market), as well S and P (due to the TV set market). M.Profit S.Profit P.Profit M.Profit S.Profit P.Profit M.Profit S.Profit P.Profit M.Stock S.Stock P.Stock M.Stock S.Stock P.Stock M.Stock S.Stock P.Stock εm εs εp εm εs εp (a) (b) (c) Figure 1: (a) Assumptions that relate Microsoft, Sony and Philips stock prices through hidden common cause mechanisms, depicted as unlabeled gray vertices; (b) A graphical representation for generic hidden common causes relationships by using bi-directed edges; (c) A depiction of the same relationship skeleton by a Markov network model, which has different probabilistic semantics. It is expected that several market factors that affect stock prices are unaccounted by the predictor variable Past Year Profit. For example, a shortage of Microsoft consoles is a hidden common factor for both Microsoft’s and Sony’s stock. Another hidden common cause would be a high price for Sony’s consoles. Assume here that these factors have no effect on Philips’ stock value. A depiction of several hidden common causes that correpond to the relations Competitor(M, S) and Competitor(S, P ) is given in Figure 1(a) as unlabeled gray vertices. Consider a linear regression model for this setup. We assume that for each object Oi ∈ {M, S, P }, the stock price Oi .Stock, centered at the mean, is given by Oi .Stock = β × Oi .P rof it + where each i i (1) is a Gaussian random variable. The fact that there are several hidden common causes between M and S can be modeled by the covariance of m and s , σms . That is, unlike in standard directed Gaussian models, σms is allowed to be non-zero. The same holds for σsp . Covariances of error terms of unrelated objects should be zero (σmp = 0). This setup is very closely related to the classic seemingly unrelated regression model popular in economics [12]. A graphical representation for this type of model is the directed mixed graph (DMG) [9, 11], with bi-directed edges representing the relationship of having hidden common causes between a pair of vertices. This is shown in Figure 1(b). Contrast this to the Markov network representation in Figure 1(c). The undirected representation encodes that m and p are marginally dependent, which does not correspond to our assumptions1 . Moreover, the model in Figure 1(b) states that once we observe Sony’s stock price, Philip’s stocks (and profit) should have a non-zero association with Microsoft’s profit: this follows from a extension of d-separation to DMGs [9]. This is expected from the assumptions (Philip’s stocks should tell us something about Microsoft’s once we know Sony’s, but not before it), but does not hold in the graphical model in Figure 1(c). While it is tempting to use Markov networks to represent relational models (free of concerns raised by cyclic directed representations), it is clear that there are problems for which they are not a sensible choice. This is not to say that Markov networks are not the best representation for large classes of relational problems. Conditional random fields [4] are well-motivated Markov network models for sequence learning. The temporal relationship is closed under marginalization: if we do not measure some steps in the sequence, we will still link the corresponding remaining vertices accordingly, as illustrated in Figure 2. Directed mixed graphs are not a good representation for this sequence structure. X1 X2 X3 X4 X5 Y1 Y2 Y3 Y4 Y5 X1 X2 X3 X4 X5 Y1 Y2 Y3 Y4 Y5 (a) (b) X1 X3 X5 Y1 Y3 Y5 (c) Figure 2: (a) A conditional random field (CRF) graph for sequence data; (b) A hypothetical scenario where two of the time slices are not measured, as indicated by dashed boxes; (c) The resulting CRF graph for the remaining variables, which corresponds to the same criteria for construction of (a). To summarize, the decision between using a Markov network or a DMG reduces to the following modeling issue: if two unlinked object labels yi , yj are statistically associated when some chain of relationships exists between yi and yj , then the Markov network semantics should apply (as in the case for temporal relationships). However, if the association arises only given the values of the other objects in the chain, then this is accounted by the dependence semantics of the directed mixed graph representation. The DMG representation propagates training data information through other training points. The Markov network representation propagates training data information through test points. Propagation through training points is relevant in real problems. For instance, in a webpage domain where each webpage has links to pages of several kinds (e.g., [3]), a chain of intermediated points between two classes labels yi and yj is likely to be more informative if we know the values of the labels in this chain. The respective Markov network would ignore all training points in this chain besides the endpoints. In this paper, we introduce a non-parametric classification model for relational data that factorizes according to a directed mixed graph. Sections 2 and 3 describes the model and contrasts it to a closely related approach which bears a strong analogy to the Markov network formulation. Experiments in text classification are described in Section 4. 2 Model Chu et al. [2] describe an approach for Gaussian process classification using relational information, which we review and compare to our proposed model. Previous approach: relational Gaussian processes through indicators − For each point x in the input space X , there is a corresponding function value fx . Given observed input points x1 , x2 , . . . , xn , a Gaussian process prior over f = [f1 , f2 , . . . , fn ]T has the shape P(f ) = 1 (2π)n/2 |Σ|1/2 1 exp − f T Σ−1 f 2 (2) 1 For Gaussian models, the absence of an edge in the undirected representation (i.e., Gaussian Markov random fields) corresponds to a zero entry in the inverse covariance matrix, where in the DMG it corresponds to a zero in the covariance matrix [9]. X1 X2 f1 X3 X2 X3 X1 X2 X3 f3 f2 ξ 12 X1 f1 f2 f3 f1 f2 f3 Y2 Y3 ξ 23 Y1 Y2 Y3 Y1 Y2 Y3 Y1 ε1 ε2 ε3 ε1 ε2 ε3 ε1 (a) ζ1 (b) ζ2 ε2 ζ3 ε3 (c) Figure 3: (a) A prediction problem where y3 is unknown and the training set is composed of other two datapoints. Dependencies between f1 , f2 and f3 are given by a Gaussian process prior and not represented in the picture. Indicators ξij are known and set to 1; (b) The extra associations that arise by conditioning on ξ = 1 can be factorized as the Markov network model here depicted, in the spirit of [9]; (c) Our proposed model, which ties the error terms and has origins in known statistical models such as seemingly unrelated regression and structural equation models [11]. where the ijth entry of Σ is given by a Mercer kernel function K(xi , xj ) [8]. The idea is to start from a standard Gaussian process prior, and add relational information by conditioning on relational indicators. Let ξij be an indicator that assumes different values, e.g., 1 or 0. The indicator values are observed for each pair of data points (xi , xj ): they are an encoding of the given relational structure. A model for P (ξij = 1|fi , fj ) is defined. This evidence is incorporated into the Gaussian process by conditioning on all indicators ξij that are positive. Essentially, the idea boils down to using P(f |ξ = 1) as the prior for a Gaussian process classifier. Figure 3(a) illustrates a problem with datapoints {(x1 , y1 ), (x2 , y2 ), (x3 , y3 )}. Gray vertices represent unobserved variables. Each yi is a binary random variable, with conditional probability given by P(yi = 1|fi ) = Φ(fi /σ) (3) where Φ(·) is the standard normal cumulative function and σ is a hyperparameter. This can be interpreted as the cumulative distribution of fi + i , where fi is given and i is a normal random variable with zero mean and variance σ 2 . In the example of Figure 3(a), one has two relations: (x1 , x2 ), (x2 , x3 ). This information is incorporated by conditioning on the evidence (ξ12 = 1, ξ23 = 1). Observed points (x1 , y1 ), (x2 , y2 ) form the training set. The prediction task is to estimate y3 . Notice that ξ12 is not used to predict y3 : the Markov blanket for f3 includes (f1 , f2 , ξ23 , y3 , 3 ) and the input features. Essentially, conditioning on ξ = 1 corresponds to a pairwise Markov network structure, as depicted in Figure 3(b) [9]2 . Our approach: mixed graph relational model − Figure 3(c) illustrates our proposed setup. For reasons that will become clear in the sequel, we parameterize the conditional probability of yi as √ (4) P(yi = 1|gi , vi ) = Φ(gi / vi ) where gi = fi + ζi . As before, Equation (4) can be interpreted as the cumulative distribution of 2 gi + i , with i as a normal random variable with zero mean and variance vi = σ 2 − σζi , the last term being the variance of ζi . That is, we break the original error term as i = ζi + i , where i and j are independent for all i = j. Random vector ζ is a multivariate normal with zero mean and covariance matrix Σζ . The key aspect in our model is that the covariance of ζi and ζj is non-zero only if objects i and j are related (that is, bi-directed edge yi ↔ yj is in the relational graph). Parameterizing Σζ for relational problems is non-trivial and discussed in the next section. In the example of Figure 3, one noticeable difference of our model 3(c) to a standard Markov network models 3(b) is that now the Markov blanket for f3 includes error terms for all variables (both and ζ terms), following the motivation presented in Section 1. 2 In the figure, we are not representing explicitly that f1 , f2 and f3 are not independent (the prior covariance matrix Σ is complete). The figure is meant as a representation of the extra associations that arise when conditioning on ξ = 1, and the way such associations factorize. As before, the prior for f in our setup is the Gaussian process prior (2). This means that g has the following Gaussian process prior (implicitly conditioned on x): P(g) = 1 (2π)n/2 |R|1/2 1 exp − g R−1 g 2 (5) where R = K + Σζ is the covariance matrix of g = f + ζ, with Kij = K(xi , xj ). 3 Parametrizing a mixed graph model for relational classification For simplicity, in this paper we will consider only relationships that induce positive associations between labels. Ideally, the parameterization of Σζ has to fulfill two desiderata: (i). it should respect the marginal independence constraints as encoded by the graphical model (i.e., zero covariance for vertices that are not adjacent), and be positive definite; (ii). it has to be parsimonious in order to facilitate hyperparameter selection, both computationally and statistically. Unlike the multivariate analysis problems in [11], the size of our covariance matrix grows with the number of data points. As shown by [11], exact inference in models with covariance matrices with zero-entry constraints is computationally demanding. We provide two alternative parameterizations that are not as flexible, but which lead to covariance matrices that are simple to compute and easy to implement. We will work under the transductive scenario, where training and all test points are given in advance. The corresponding graph thus contain unobserved and observed label nodes. 3.1 Method I The first method is an automated method to relax some of the independence constraints, while guaranteeing positive-definiteness, and a parameterization that depends on a single scalar ρ. This allows for more efficient inference and is done as follows: 1. Let Gζ be the corresponding bi-directed subgraph of our original mixed graph, and let U0 be a matrix with n × n entries, n being the number of nodes in Gζ 2. Set U0 to be the number of cliques in Gζ where yi and yj appear together; ij 3. Set U0 to be the number of cliques containing yi , plus a small constant ∆; ii 4. Set U to be the corresponding correlation matrix obtained by intepreting U0 as a covariance matrix and rescaling it; Finally, set Σζ = ρU, where ρ ∈ [0, 1] is a given hyperparameter. Matrix U is always guaranteed to be positive definite: it is equivalent to obtaining the covariance matrix of y from a linear latent variable model, where there is an independent standard Gaussian latent variable as a common parent to every clique, and every observed node yi is given by the sum of its parents plus an independent error term of variance ∆. Marginal independencies are respected, since independent random variables will never be in a same clique in Gζ . In practice, this method cannot be used as is since the number of cliques will in general grow at an exponential rate as a function of n. Instead, we first triangulate the graph: in this case, extracting cliques can be done in polynomial time. This is a relaxation of the original goal, since some of the original marginal independence constraints will not be enforced due to the triangulation3. 3.2 Method II The method suggested in the previous section is appealing under the assumption that vertices that appear in many common cliques are more likely to have more hidden common causes, and hence should have stronger associations. However, sometimes the triangulation introduces bad artifacts, with lots of marginal independence constraints being violated. In this case, this will often result in a poor prediction performance. A cheap alternative approach is not generating cliques, and instead 3 The need for an approximation is not a shortcoming only of the DMG approach. Notice that the relational Gaussian process of [2] also requires an approximation of its relational kernel. 10 10 10 20 20 20 30 30 30 40 40 40 50 50 50 60 60 60 70 70 70 80 80 80 90 90 90 100 100 100 10 20 30 40 50 60 (a) 70 80 90 100 10 20 30 40 50 60 70 80 90 (b) 100 10 20 30 40 50 60 70 80 90 100 (c) Figure 4: (a) The link matrix for the political books dataset. (b) The relational kernel matrix obtained with the approximated Method I. (c) The kernel matrix obtained with Method II, which tends to produce much weaker associations but does not introduce spurious relations. getting a marginal covariance matrix from a different latent variable model. In this model, we create an independent standard Gaussian variable for each edge yi ↔ yj instead of each clique. No triangulation will be necessary, and all marginal independence constraints will be respected. This, however, has shortcomings of its own: for all pairs (yi , yj ) connected by an edge, it will be the case that U0 = 1, while U0 can be as large as n. This means that the resulting correlation in Uij can be ij ii close to zero even if yi and yj are always in the same cliques. In Section 4, we will choose between Methods I and II according to the marginal likelihood of the model. 3.3 Algorithm Recall that our model is a Gaussian process classifier with error terms i of variance σ such that i = ζi + i . Without loss of generality, we will assume that σ = 1. This results in the following parameterization of the full error covariance matrix: Σ = (1 − ρ)I + ρU (6) where I is an n × n identity matrix. Matrix (1 − ρ)I corresponds to the covariance matrix Σ . The usefulness of separating as and ζ becomes evident when we use an expectation-propagation (EP) algorithm [7] to perform inference in our relational classifier. Instead of approximating the posterior of f , we approximate the posterior density P(g|D), D = {(x1 , y1 ), . . . , (xn , yn )} being ˜ the given training data. The approximate posterior has the form Q(g) ∝ P(g) i ti (gi ) where P(g) is the Gaussian process prior with kernel matrix R = K + Σζ as defined in the previous section. Since the covariance matrix Σ is diagonal, the true likelihood of y given g factorizes n over each datapoint: P(y|g) = i=1 P(yi |gi ), and standard EP algorithms for Gaussian process classification can be used [8] (with the variance given by Σ instead of Σ , and kernel matrix R instead of K). The final algorithm defines a whole new class of relational models, depends on a single hyperparameter ρ which can be optimized by grid search in [0, 1], and requires virtually no modification of code written for EP-based Gaussian process classifiers4 . 4 Results We now compare three different methods in relational classification tasks. We will compare a standard Gaussian process classifier (GPC), the relational Gaussian process (RGP) of [2] and our method, the mixed graph Gaussian process (XGP). A linear kernel K(x, z) = x · z is used, as described by [2]. We set ∆ = 10−4 and the hyperparameter ρ is found by a grid search in the space {0.1, 0.2, 0.3, . . . , 1.0} maximizing the approximate EP marginal likelihood5. 4 We provide MATLAB/Octave code for our method in http://www.statslab.cam.ac.uk/∼silva. For triangulation, we used the MATLAB implementation of the Reverse Cuthill McKee vertex ordering available at http://people.scs.fsu.edu/∼burkardt/m src/rcm/rcm.html 5 Table 1: The averaged AUC scores of citation prediction on test cases of the Cora database are recorded along with standard deviation over 100 trials. “n” denotes the number of papers in one class. “Citations” denotes the citation count within the two paper classes. Group n Citations GPC GPC with Citations XGP 5vs1 346/488 2466 0.905 ± 0.031 0.891 ± 0.022 0.945 ± 0.053 5vs2 346/619 3417 0.900 ± 0.032 0.905 ± 0.044 0.933 ± 0.059 5vs3 346/1376 3905 0.863 ± 0.040 0.893 ± 0.017 0.883 ± 0.013 5vs4 346/646 2858 0.916 ± 0.030 0.887 ± 0.018 0.951 ± 0.042 5vs6 346/281 1968 0.887 ± 0.054 0.843 ± 0.076 0.955 ± 0.041 5vs7 346/529 2948 0.869 ± 0.045 0.867 ± 0.041 0.926 ± 0.076 4.1 Political books We consider first a simple classification problem where the goal is to classify whether a particular book is of liberal political inclination or not. The features of each book are given by the words in the Amazon.com front page for that particular book. The choice of books, labels, and relationships are given in the data collected by Valdis Krebs and available at http://www-personal.umich.edu/ mejn/netdata. The data containing book features can be found at http://www.statslab.cam.ac.uk/∼silva. There are 105 books, 43 of which are labeled as liberal books. The relationships are pairs of books which are frequently purchased together by a same customer. Notice this is an easy problem, where labels are strongly associated if they share a relationship. We performed evaluation by sampling 100 times from the original pool of books, assigning half of them as trainining data. The evaluation criterion was the area under the curve (AUC) for this binary problem. This is a problem where Method I is suboptimal. Figure 4(a) shows the original binary link matrix. Figure 4(b) depicts the corresponding U0 matrix obtained with Method I, where entries closer to red correspond to stronger correlations. Method II gives a better performance here (Method I was better in the next two experiments). The AUC result for GPC was of 0.92, while both RGP and XGP achieved 0.98 (the difference between XGP and GPC having a std. deviation of 0.02). 4.2 Cora The Cora collection [6] contains over 50,000 computer science research papers including bibliographic citations. We used a subset in our experiment. The subset consists of 4,285 machine learning papers categorized into 7 classes. The second column of Table 1 shows the class sizes. Each paper was preprocessed as a bag-of-words, a vector of “term frequency” components scaled by “inverse document frequency”, and then normalized to unity length. This follows the pre-processing used in [2]. There is a total of 20,082 features. For each class, we randomly selected 1% of the labelled samples for training and tested on the remainder. The partition was repeated 100 times. We used the fact that the database is composed of fairly specialized papers as an illustration of when XGP might not be as optimal as RGP (whose AUC curves are very close to 1), since the population of links tends to be better separated between different classes (but this is also means that the task is fairly easy, and differences disappear very rapidly with increasing sample sizes). The fact there is very little training data also favors RGP, since XGP propagates information through training points. Still, XGP does better than the non-relational GPC. Notice that adding the citation adjacency matrix as a binary input feature for each paper does not improve the performance of the GPC, as shown in Table 1. Results for other classes are of similar qualitative nature and not displayed here. 4.3 WebKB The WebKB dataset consists of homepages from 4 different universities: Cornell, Texas, Washington and Wisconsin [3]. Each webpage belongs to one out of 7 categories: student, professor, course, project, staff, department and “other”. The relations come from actual links in the webpages. There is relatively high heterogeneity of types of links in each page: in terms of mixed graph modeling, this linkage mechanism is explained by a hidden common cause (e.g., a student and a course page are associated because that person’s interest in enrolling as a student also creates demand for a course). The heterogeneity also suggests that two unlinked pages should not, on average, have an association if they link to a common page W . However, observing the type of page W might create Table 2: Comparison of the three algorithms on the task “other” vs. “not-other” in the WebKB domain. Results for GPC and RGP taken from [2]. The same partitions for training and test are used to generate the results for XGP. Mean and standard deviation of AUC results are reported. University Numbers Other or Not Other All Link GPC RGP XGP Cornell 617 865 13177 0.708 ± 0.021 0.884 ± 0.025 0.917 ± 0.022 Texas 571 827 16090 0.799 ± 0.021 0.906 ± 0.026 0.949 ± 0.015 Washington 939 1205 15388 0.782 ± 0.023 0.877 ± 0.024 0.923 ± 0.016 Wisconsin 942 1263 21594 0.839 ± 0.014 0.899 ± 0.015 0.941 ± 0.018 the association. We compare how the three algorithms perform when trying to predict if a webpage is of class “other” or not (the other classifications are easier, with smaller differences. Results are omitted for space purposes). The proportion of “other” to non-“other” is about 4:1, which makes the area under the curve (AUC) a more suitable measure of success. We used the same 100 subsamples from [2], where 10% of the whole data is sampled from the pool for a specific university, and the remaining is used for test. We also used the same features as in [2], pre-processed as described in the previous section. The results are shown in Table 2. Both relational Gaussian processes are far better than the non-relational GPC. XGP gives significant improvements over RGP in all four universities. 5 Conclusion We introduced a new family of relational classifiers by extending a classical statistical model [12] to non-parametric relational classification. This is inspired by recent advances in relational Gaussian processes [2] and Bayesian inference for mixed graph models [11]. We showed empirically that modeling the type of latent phenomena that our approach postulates can sometimes improve prediction performance in problems traditionally approached by Markov network structures. Several interesting problems can be treated in the future. It is clear that there are many different ways by which the relational covariance matrix can be parameterized. Intermediate solutions between Methods I and II, approximations through matrix factorizations and graph cuts are only a few among many alternatives that can be explored. Moreover, there is a relationship between our model and multiple kernel learning [1], where one of the kernels comes from error covariances. This might provide alternative ways of learning our models, including multiple types of relationships. Acknowledgements: We thank Vikas Sindhwani for the preprocessed Cora database. References [1] F. Bach, G. Lanckriet, and M. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. 21st International Conference on Machine Learning, 2004. [2] W. Chu, V. Sindhwani, Z. Ghahramani, and S. Keerthi. Relational learning with Gaussian processes. Neural Information Processing Systems, 2006. [3] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to extract symbolic knowledge from the World Wide Web. Proceedings of AAAI’98, pages 509–516, 1998. [4] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. 18th International Conference on Machine Learning, 2001. [5] S. Lauritzen. Graphical Models. Oxford University Press, 1996. [6] A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Automating the construction of Internet portals with machine learning. Information Retrieval Journal, 3:127–163, 2000. [7] T. Minka. A family of algorithms for approximate Bayesian inference. PhD Thesis, MIT, 2001. [8] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006. [9] T. Richardson and P. Spirtes. Ancestral graph Markov models. Annals of Statistics, 30:962–1030, 2002. [10] P. Sen and L. Getoor. Link-based classification. Report CS-TR-4858, University of Maryland, 2007. [11] R. Silva and Z. Ghahramani. Bayesian inference for Gaussian mixed graph models. UAI, 2006. [12] A. Zellner. An efficient method of estimating seemingly unrelated regression equations and tests for aggregation bias. Journal of the American Statistical Association, 1962.
3 0.089764178 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
Author: Kai Yu, Wei Chu
Abstract: This paper aims to model relational data on edges of networks. We describe appropriate Gaussian Processes (GPs) for directed, undirected, and bipartite networks. The inter-dependencies of edges can be effectively modeled by adapting the GP hyper-parameters. The framework suggests an intimate connection between link prediction and transfer learning, which were traditionally two separate research topics. We develop an efficient learning algorithm that can handle a large number of observations. The experimental results on several real-world data sets verify superior learning capacity. 1
4 0.078107648 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
Author: Francis R. Bach, Zaïd Harchaoui
Abstract: We present a novel linear clustering framework (D IFFRAC) which relies on a linear discriminative cost function and a convex relaxation of a combinatorial optimization problem. The large convex optimization problem is solved through a sequence of lower dimensional singular value decompositions. This framework has several attractive properties: (1) although apparently similar to K-means, it exhibits superior clustering performance than K-means, in particular in terms of robustness to noise. (2) It can be readily extended to non linear clustering if the discriminative cost function is based on positive definite kernels, and can then be seen as an alternative to spectral clustering. (3) Prior information on the partition is easily incorporated, leading to state-of-the-art performance for semi-supervised learning, for clustering or classification. We present empirical evaluations of our algorithms on synthetic and real medium-scale datasets.
5 0.066845737 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model
Author: Zhengdong Lu, Cristian Sminchisescu, Miguel Á. Carreira-Perpiñán
Abstract: Reliably recovering 3D human pose from monocular video requires models that bias the estimates towards typical human poses and motions. We construct priors for people tracking using the Laplacian Eigenmaps Latent Variable Model (LELVM). LELVM is a recently introduced probabilistic dimensionality reduction model that combines the advantages of latent variable models—a multimodal probability density for latent and observed variables, and globally differentiable nonlinear mappings for reconstruction and dimensionality reduction—with those of spectral manifold learning methods—no local optima, ability to unfold highly nonlinear manifolds, and good practical scaling to latent spaces of high dimension. LELVM is computationally efficient, simple to learn from sparse training data, and compatible with standard probabilistic trackers such as particle filters. We analyze the performance of a LELVM-based probabilistic sigma point mixture tracker in several real and synthetic human motion sequences and demonstrate that LELVM not only provides sufficient constraints for robust operation in the presence of missing, noisy and ambiguous image measurements, but also compares favorably with alternative trackers based on PCA or GPLVM priors. Recent research in reconstructing articulated human motion has focused on methods that can exploit available prior knowledge on typical human poses or motions in an attempt to build more reliable algorithms. The high-dimensionality of human ambient pose space—between 30-60 joint angles or joint positions depending on the desired accuracy level, makes exhaustive search prohibitively expensive. This has negative impact on existing trackers, which are often not sufficiently reliable at reconstructing human-like poses, self-initializing or recovering from failure. Such difficulties have stimulated research in algorithms and models that reduce the effective working space, either using generic search focusing methods (annealing, state space decomposition, covariance scaling) or by exploiting specific problem structure (e.g. kinematic jumps). Experience with these procedures has nevertheless shown that any search strategy, no matter how effective, can be made significantly more reliable if restricted to low-dimensional state spaces. This permits a more thorough exploration of the typical solution space, for a given, comparatively similar computational effort as a high-dimensional method. The argument correlates well with the belief that the human pose space, although high-dimensional in its natural ambient parameterization, has a significantly lower perceptual (latent or intrinsic) dimensionality, at least in a practical sense—many poses that are possible are so improbable in many real-world situations that it pays off to encode them with low accuracy. A perceptual representation has to be powerful enough to capture the diversity of human poses in a sufficiently broad domain of applicability (the task domain), yet compact and analytically tractable for search and optimization. This justifies the use of models that are nonlinear and low-dimensional (able to unfold highly nonlinear manifolds with low distortion), yet probabilistically motivated and globally continuous for efficient optimization. Reducing dimensionality is not the only goal: perceptual representations have to preserve critical properties of the ambient space. Reliable tracking needs locality: nearby regions in ambient space have to be mapped to nearby regions in latent space. If this does not hold, the tracker is forced to make unrealistically large, and difficult to predict jumps in latent space in order to follow smooth trajectories in the joint angle ambient space. 1 In this paper we propose to model priors for articulated motion using a recently introduced probabilistic dimensionality reduction method, the Laplacian Eigenmaps Latent Variable Model (LELVM) [1]. Section 1 discusses the requirements of priors for articulated motion in the context of probabilistic and spectral methods for manifold learning, and section 2 describes LELVM and shows how it combines both types of methods in a principled way. Section 3 describes our tracking framework (using a particle filter) and section 4 shows experiments with synthetic and real human motion sequences using LELVM priors learned from motion-capture data. Related work: There is significant work in human tracking, using both generative and discriminative methods. Due to space limitations, we will focus on the more restricted class of 3D generative algorithms based on learned state priors, and not aim at a full literature review. Deriving compact prior representations for tracking people or other articulated objects is an active research field, steadily growing with the increased availability of human motion capture data. Howe et al. and Sidenbladh et al. [2] propose Gaussian mixture representations of short human motion fragments (snippets) and integrate them in a Bayesian MAP estimation framework that uses 2D human joint measurements, independently tracked by scaled prismatic models [3]. Brand [4] models the human pose manifold using a Gaussian mixture and uses an HMM to infer the mixture component index based on a temporal sequence of human silhouettes. Sidenbladh et al. [5] use similar dynamic priors and exploit ideas in texture synthesis—efficient nearest-neighbor search for similar motion fragments at runtime—in order to build a particle-filter tracker with observation model based on contour and image intensity measurements. Sminchisescu and Jepson [6] propose a low-dimensional probabilistic model based on fitting a parametric reconstruction mapping (sparse radial basis function) and a parametric latent density (Gaussian mixture) to the embedding produced with a spectral method. They track humans walking and involved in conversations using a Bayesian multiple hypotheses framework that fuses contour and intensity measurements. Urtasun et al. [7] use a dynamic MAP estimation framework based on a GPLVM and 2D human joint correspondences obtained from an independent image-based tracker. Li et al. [8] use a coordinated mixture of factor analyzers within a particle filtering framework, in order to reconstruct human motion in multiple views using chamfer matching to score different configuration. Wang et al. [9] learn a latent space with associated dynamics where both the dynamics and observation mapping are Gaussian processes, and Urtasun et al. [10] use it for tracking. Taylor et al. [11] also learn a binary latent space with dynamics (using an energy-based model) but apply it to synthesis, not tracking. Our work learns a static, generative low-dimensional model of poses and integrates it into a particle filter for tracking. We show its ability to work with real or partially missing data and to track multiple activities. 1 Priors for articulated human pose We consider the problem of learning a probabilistic low-dimensional model of human articulated motion. Call y ∈ RD the representation in ambient space of the articulated pose of a person. In this paper, y contains the 3D locations of anywhere between 10 and 60 markers located on the person’s joints (other representations such as joint angles are also possible). The values of y have been normalised for translation and rotation in order to remove rigid motion and leave only the articulated motion (see section 3 for how we track the rigid motion). While y is high-dimensional, the motion pattern lives in a low-dimensional manifold because most values of y yield poses that violate body constraints or are simply atypical for the motion type considered. Thus we want to model y in terms of a small number of latent variables x given a collection of poses {yn }N (recorded from a human n=1 with motion-capture technology). The model should satisfy the following: (1) It should define a probability density for x and y, to be able to deal with noise (in the image or marker measurements) and uncertainty (from missing data due to occlusion or markers that drop), and to allow integration in a sequential Bayesian estimation framework. The density model should also be flexible enough to represent multimodal densities. (2) It should define mappings for dimensionality reduction F : y → x and reconstruction f : x → y that apply to any value of x and y (not just those in the training set); and such mappings should be defined on a global coordinate system, be continuous (to avoid physically impossible discontinuities) and differentiable (to allow efficient optimisation when tracking), yet flexible enough to represent the highly nonlinear manifold of articulated poses. From a statistical machine learning point of view, this is precisely what latent variable models (LVMs) do; for example, factor analysis defines linear mappings and Gaussian densities, while the generative topographic mapping (GTM; [12]) defines nonlinear mappings and a Gaussian-mixture density in ambient space. However, factor analysis is too limited to be of practical use, and GTM— 2 while flexible—has two important practical problems: (1) the latent space must be discretised to allow tractable learning and inference, which limits it to very low (2–3) latent dimensions; (2) the parameter estimation is prone to bad local optima that result in highly distorted mappings. Another dimensionality reduction method recently introduced, GPLVM [13], which uses a Gaussian process mapping f (x), partly improves this situation by defining a tunable parameter xn for each data point yn . While still prone to local optima, this allows the use of a better initialisation for {xn }N (obtained from a spectral method, see later). This has prompted the application of n=1 GPLVM for tracking human motion [7]. However, GPLVM has some disadvantages: its training is very costly (each step of the gradient iteration is cubic on the number of training points N , though approximations based on using few points exist); unlike true LVMs, it defines neither a posterior distribution p(x|y) in latent space nor a dimensionality reduction mapping E {x|y}; and the latent representation it obtains is not ideal. For example, for periodic motions such as running or walking, repeated periods (identical up to small noise) can be mapped apart from each other in latent space because nothing constrains xn and xm to be close even when yn = ym (see fig. 3 and [10]). There exists a different type of dimensionality reduction methods, spectral methods (such as Isomap, LLE or Laplacian eigenmaps [14]), that have advantages and disadvantages complementary to those of LVMs. They define neither mappings nor densities but just a correspondence (xn , yn ) between points in latent space xn and ambient space yn . However, the training is efficient (a sparse eigenvalue problem) and has no local optima, and often yields a correspondence that successfully models highly nonlinear, convoluted manifolds such as the Swiss roll. While these attractive properties have spurred recent research in spectral methods, their lack of mappings and densities has limited their applicability in people tracking. However, a new model that combines the advantages of LVMs and spectral methods in a principled way has been recently proposed [1], which we briefly describe next. 2 The Laplacian Eigenmaps Latent Variable Model (LELVM) LELVM is based on a natural way of defining an out-of-sample mapping for Laplacian eigenmaps (LE) which, in addition, results in a density model. In LE, typically we first define a k-nearestneighbour graph on the sample data {yn }N and weigh each edge yn ∼ ym by a Gaussian affinity n=1 2 1 function K(yn , ym ) = wnm = exp (− 2 (yn − ym )/σ ). Then the latent points X result from: min tr XLX⊤ s.t. X ∈ RL×N , XDX⊤ = I, XD1 = 0 (1) where we define the matrix XL×N = (x1 , . . . , xN ), the symmetric affinity matrix WN ×N , the deN gree matrix D = diag ( n=1 wnm ), the graph Laplacian matrix L = D−W, and 1 = (1, . . . , 1)⊤ . The constraints eliminate the two trivial solutions X = 0 (by fixing an arbitrary scale) and x1 = · · · = xN (by removing 1, which is an eigenvector of L associated with a zero eigenvalue). The solution is given by the leading u2 , . . . , uL+1 eigenvectors of the normalised affinity matrix 1 1 1 N = D− 2 WD− 2 , namely X = V⊤ D− 2 with VN ×L = (v2 , . . . , vL+1 ) (an a posteriori translated, rotated or uniformly scaled X is equally valid). Following [1], we now define an out-of-sample mapping F(y) = x for a new point y as a semisupervised learning problem, by recomputing the embedding as in (1) (i.e., augmenting the graph Laplacian with the new point), but keeping the old embedding fixed: L K(y) X⊤ min tr ( X x ) K(y)⊤ 1⊤ K(y) (2) x⊤ x∈RL 2 where Kn (y) = K(y, yn ) = exp (− 1 (y − yn )/σ ) for n = 1, . . . , N is the kernel induced by 2 the Gaussian affinity (applied only to the k nearest neighbours of y, i.e., Kn (y) = 0 if y ≁ yn ). This is one natural way of adding a new point to the embedding by keeping existing embedded points fixed. We need not use the constraints from (1) because they would trivially determine x, and the uninteresting solutions X = 0 and X = constant were already removed in the old embedding anyway. The solution yields an out-of-sample dimensionality reduction mapping x = F(y): x = F(y) = X K(y) 1⊤ K(y) N K(y,yn ) PN x n=1 K(y,yn′ ) n ′ = (3) n =1 applicable to any point y (new or old). This mapping is formally identical to a Nadaraya-Watson estimator (kernel regression; [15]) using as data {(xn , yn )}N and the kernel K. We can take this n=1 a step further by defining a LVM that has as joint distribution a kernel density estimate (KDE): p(x, y) = 1 N N n=1 Ky (y, yn )Kx (x, xn ) p(y) = 3 1 N N n=1 Ky (y, yn ) p(x) = 1 N N n=1 Kx (x, xn ) where Ky is proportional to K so it integrates to 1, and Kx is a pdf kernel in x–space. Consequently, the marginals in observed and latent space are also KDEs, and the dimensionality reduction and reconstruction mappings are given by kernel regression (the conditional means E {y|x}, E {x|y}): F(y) = N n=1 p(n|y)xn f (x) = N K (x,xn ) PN x y n=1 Kx (x,xn′ ) n ′ = n =1 N n=1 p(n|x)yn . (4) We allow the bandwidths to be different in the latent and ambient spaces: 2 2 1 Kx (x, xn ) ∝ exp (− 1 (x − xn )/σx ) and Ky (y, yn ) ∝ exp (− 2 (y − yn )/σy ). They 2 may be tuned to control the smoothness of the mappings and densities [1]. Thus, LELVM naturally extends a LE embedding (efficiently obtained as a sparse eigenvalue problem with a cost O(N 2 )) to global, continuous, differentiable mappings (NW estimators) and potentially multimodal densities having the form of a Gaussian KDE. This allows easy computation of posterior probabilities such as p(x|y) (unlike GPLVM). It can use a continuous latent space of arbitrary dimension L (unlike GTM) by simply choosing L eigenvectors in the LE embedding. It has no local optima since it is based on the LE embedding. LELVM can learn convoluted mappings (e.g. the Swiss roll) and define maps and densities for them [1]. The only parameters to set are the graph parameters (number of neighbours k, affinity width σ) and the smoothing bandwidths σx , σy . 3 Tracking framework We follow the sequential Bayesian estimation framework, where for state variables s and observation variables z we have the recursive prediction and correction equations: p(st |z0:t−1 ) = p(st |st−1 ) p(st−1 |z0:t−1 ) dst−1 p(st |z0:t ) ∝ p(zt |st ) p(st |z0:t−1 ). (5) L We define the state variables as s = (x, d) where x ∈ R is the low-dim. latent space (for pose) and d ∈ R3 is the centre-of-mass location of the body (in the experiments our state also includes the orientation of the body, but for simplicity here we describe only the translation). The observed variables z consist of image features or the perspective projection of the markers on the camera plane. The mapping from state to observations is (for the markers’ case, assuming M markers): P f x ∈ RL − − → y ∈ R3M −→ ⊕ − − − z ∈ R2M −− − − −→ d ∈ R3 (6) where f is the LELVM reconstruction mapping (learnt from mocap data); ⊕ shifts each 3D marker by d; and P is the perspective projection (pinhole camera), applied to each 3D point separately. Here we use a simple observation model p(zt |st ): Gaussian with mean given by the transformation (6) and isotropic covariance (set by the user to control the influence of measurements in the tracking). We assume known correspondences and observations that are obtained either from the 3D markers (for tracking synthetic data) or 2D tracks obtained from a 2D tracker. Our dynamics model is p(st |st−1 ) ∝ pd (dt |dt−1 ) px (xt |xt−1 ) p(xt ) (7) where both dynamics models for d and x are random walks: Gaussians centred at the previous step value dt−1 and xt−1 , respectively, with isotropic covariance (set by the user to control the influence of dynamics in the tracking); and p(xt ) is the LELVM prior. Thus the overall dynamics predicts states that are both near the previous state and yield feasible poses. Of course, more complex dynamics models could be used if e.g. the speed and direction of movement are known. As tracker we use the Gaussian mixture Sigma-point particle filter (GMSPPF) [16]. This is a particle filter that uses a Gaussian mixture representation for the posterior distribution in state space and updates it with a Sigma-point Kalman filter. This Gaussian mixture will be used as proposal distribution to draw the particles. As in other particle filter implementations, the prediction step is carried out by approximating the integral (5) with particles and updating the particles’ weights. Then, a new Gaussian mixture is fitted with a weighted EM algorithm to these particles. This replaces the resampling stage needed by many particle filters and mitigates the problem of sample depletion while also preventing the number of components in the Gaussian mixture from growing over time. The choice of this particular tracker is not critical; we use it to illustrate the fact that LELVM can be introduced in any probabilistic tracker for nonlinear, nongaussian models. Given the corrected distribution p(st |z0:t ), we choose its mean as recovered state (pose and location). It is also possible to choose instead the mode closest to the state at t − 1, which could be found by mean-shift or Newton algorithms [17] since we are using a Gaussian-mixture representation in state space. 4 4 Experiments We demonstrate our low-dimensional tracker on image sequences of people walking and running, both synthetic (fig. 1) and real (fig. 2–3). Fig. 1 shows the model copes well with persistent partial occlusion and severely subsampled training data (A,B), and quantitatively evaluates temporal reconstruction (C). For all our experiments, the LELVM parameters (number of neighbors k, Gaussian affinity σ, and bandwidths σx and σy ) were set manually. We mainly considered 2D latent spaces (for pose, plus 6D for rigid motion), which were expressive enough for our experiments. More complex, higher-dimensional models are straightforward to construct. The initial state distribution p(s0 ) was chosen a broad Gaussian, the dynamics and observation covariance were set manually to control the tracking smoothness, and the GMSPPF tracker used a 5-component Gaussian mixture in latent space (and in the state space of rigid motion) and a small set of 500 particles. The 3D representation we use is a 102-D vector obtained by concatenating the 3D markers coordinates of all the body joints. These would be highly unconstrained if estimated independently, but we only use them as intermediate representation; tracking actually occurs in the latent space, tightly controlled using the LELVM prior. For the synthetic experiments and some of the real experiments (figs. 2–3) the camera parameters and the body proportions were known (for the latter, we used the 2D outputs of [6]). For the CMU mocap video (fig. 2B) we roughly guessed. We used mocap data from several sources (CMU, OSU). As observations we always use 2D marker positions, which, depending on the analyzed sequence were either known (the synthetic case), or provided by an existing tracker [6] or specified manually (fig. 2B). Alternatively 2D point trackers similar to the ones of [7] can be used. The forward generative model is obtained by combining the latent to ambient space mapping (this provides the position of the 3D markers) with a perspective projection transformation. The observation model is a product of Gaussians, each measuring the probability of a particular marker position given its corresponding image point track. Experiments with synthetic data: we analyze the performance of our tracker in controlled conditions (noise perturbed synthetically generated image tracks) both under regular circumstances (reasonable sampling of training data) and more severe conditions with subsampled training points and persistent partial occlusion (the man running behind a fence, with many of the 2D marker tracks obstructed). Fig. 1B,C shows both the posterior (filtered) latent space distribution obtained from our tracker, and its mean (we do not show the distribution of the global rigid body motion; in all experiments this is tracked with good accuracy). In the latent space plot shown in fig. 1B, the onset of running (two cycles were used) appears as a separate region external to the main loop. It does not appear in the subsampled training set in fig. 1B, where only one running cycle was used for training and the onset of running was removed. In each case, one can see that the model is able to track quite competently, with a modest decrease in its temporal accuracy, shown in fig. 1C, where the averages are computed per 3D joint (normalised wrt body height). Subsampling causes some ambiguity in the estimate, e.g. see the bimodality in the right plot in fig. 1C. In another set of experiments (not shown) we also tracked using different subsets of 3D markers. The estimates were accurate even when about 30% of the markers were dropped. Experiments with real images: this shows our tracker’s ability to work with real motions of different people, with different body proportions, not in its latent variable model training set (figs. 2–3). We study walking, running and turns. In all cases, tracking and 3D reconstruction are reasonably accurate. We have also run comparisons against low-dimensional models based on PCA and GPLVM (fig. 3). It is important to note that, for LELVM, errors in the pose estimates are primarily caused by mismatches between the mocap data used to learn the LELVM prior and the body proportions of the person in the video. For example, the body proportions of the OSU motion captured walker are quite different from those of the image in fig. 2–3 (e.g. note how the legs of the stick man are shorter relative to the trunk). Likewise, the style of the runner from the OSU data (e.g. the swinging of the arms) is quite different from that of the video. Finally, the interest points tracked by the 2D tracker do not entirely correspond either in number or location to the motion capture markers, and are noisy and sometimes missing. In future work, we plan to include an optimization step to also estimate the body proportions. This would be complicated for a general, unconstrained model because the dimensions of the body couple with the pose, so either one or the other can be changed to improve the tracking error (the observation likelihood can also become singular). But for dedicated prior pose models like ours these difficulties should be significantly reduced. The model simply cannot assume highly unlikely stances—these are either not representable at all, or have reduced probability—and thus avoids compensatory, unrealistic body proportion estimates. 5 n = 15 n = 40 n = 65 n = 90 n = 115 n = 140 A 1.5 1.5 1.5 1.5 1 −1 −1 −1 −1 −1 −1.5 −1 −0.5 0 0.5 1 −1.5 −1 −0.5 0 0.5 1 −1.5 −1 −0.5 0 0.5 1 −1.5 −1 −0.5 0 0.5 1 −1 −1 −1 −1 −0.5 0 0.5 1 −1.5 −1 −0.5 0 0.5 1 −1.5 −1 −0.5 0 0.5 1 −1.5 −1 −0.5 0 0.5 1 0.5 1 −1.5 −1 −0.5 0 0.5 1 −1.5 −1 −0.5 0 0.5 1 −1.5 −1 −0.5 0 0.5 1 −1 −1 −0.5 0 0.5 1 −1.5 −1 −0.5 0 0.5 1 −1.5 −1 −0.5 0 0.5 1 −1.5 −1 −0.5 0 0.5 −1 −1 1 −1 −1 −0.5 0 0.5 1 −1.5 −1 −0.5 0 0.5 1 −1.5 −1 −0.5 0 0.5 1 −1.5 −1 −0.5 0 0.5 1 −1.5 −1.5 −0.5 0 0.5 1 −1.5 −1 −0.5 0 0.5 1 −1.5 −1 −0.5 0 0.5 1 −1.5 −1 −0.5 0.3 0.2 0.1 −1 0.4 0.3 0.2 −1.5 −1.5 n = 60 0.4 0.3 −2 −1 −1.5 −1 0 −0.5 n = 49 0.4 0.1 −1 −1.5 1 0 −0.5 −2 −1 −1.5 −1 0.5 0 −0.5 −1.5 −1.5 0.5 1 0.5 0 −0.5 −1.5 −1.5 −1.5 1 0.5 0 −0.5 n = 37 0.2 0.1 −1 −1 −1.5 −0.5 −1 −1.5 −2 1 0.5 0 −0.5 −0.5 −1.5 −1.5 0.3 0.2 0.1 0 0.4 0.3 0.2 −0.5 n = 25 0.4 0.3 −1 0 −0.5 −1 −1.5 1 0.5 0 0 −0.5 1.5 1 0.5 0 −0.5 −2 1 0.5 0.5 0 −1.5 −1.5 −1.5 1.5 1 0.5 0 −1 −1.5 −2 1 1 0.5 −0.5 −1 −1.5 −1.5 n = 13 0.4 −1 −1 −1.5 −1.5 −1.5 n=1 B −1 −1 −1.5 −1.5 1.5 1 0.5 −0.5 0 −1 −1.5 −2 −2 1 0 −0.5 −0.5 −0.5 −1 −1.5 −2 0.5 0 0 −0.5 0.5 0 −0.5 −1 −1.5 −2 1 0.5 0.5 0 1 0.5 0 −0.5 −1 −1.5 −1 −1.5 −2 1 1 0.5 1.5 1 0.5 0 −0.5 0 −0.5 −1 −1.5 −2 1 −0.5 1.5 1.5 1 0.5 0.5 0 −0.5 −1 −1.5 −2 1.5 1 1 0.5 0 −0.5 −2 0 −0.5 −0.5 1.5 1 0.5 0 −1 −1.5 −1 −1.5 0.5 0 0 −0.5 −1.5 −1.5 −1.5 1.5 1.5 1 0.5 −0.5 0 −0.5 −2 1 0.5 0.5 0 −0.5 −1 −1.5 −1.5 1.5 1 1 0.5 0 −1 −1.5 1 1 0.5 0 −0.5 −0.5 1.5 1 −0.5 −2 1 0.5 0 0 −0.5 0.5 0 −1 −1.5 −2 1 0.5 0.5 0 −0.5 1.5 1.5 1 −0.5 −2 1 1 0.5 0.5 0 −1 −1.5 −1 −1.5 −2 −2 1 0 −0.5 1.5 1 0.5 −0.5 0 −0.5 −1 −1.5 −2 0.5 0 −0.5 0.5 0 −0.5 −1 −1.5 −2 1 0.5 0 −0.5 0.5 0 −0.5 −1 −1.5 −1 −1.5 −2 1 0.5 0 1 0.5 0 −0.5 0 −0.5 −1 −1.5 −2 1 0.5 1.5 1 0.5 0.5 0 −0.5 −1 −1.5 1 1 1 0.5 0 −0.5 −2 1.5 1.5 1 1 0.5 0 −1 −1.5 1.5 1.5 1 0.5 −0.5 0.2 0.1 0.1 0 0 0 0 0 0 −0.1 −0.1 −0.1 −0.1 −0.1 −0.1 −0.2 −0.2 −0.2 −0.2 −0.2 −0.2 −0.3 −0.3 −0.3 −0.3 −0.3 −0.3 −0.4 0.4 0.6 0.8 1.5 1 1.2 1.4 1.6 1.8 2 2.2 −0.4 0.4 2.4 1.5 0.6 0.8 1.5 1 1.2 1.4 1.6 1.8 2 2.2 −0.4 0.4 2.4 1.5 0.6 0.8 1.5 1.5 1.5 1 1.2 1.4 1.6 1.8 2 2.2 −0.4 0.4 2.4 1.5 0.6 0.8 1.5 1.5 1.5 1 1.2 1.4 1.6 1.8 2 2.2 −0.4 0.4 2.4 1.5 0.6 0.8 1.5 1.5 1.5 1 1.2 1.4 1.6 1.8 2 2.2 −0.4 0.4 2.4 1.5 0.6 0.8 1.5 1.5 1.5 1 1.2 1.4 1.6 1.8 2 2.2 2.4 1.5 1.5 1.5 1.5 1.5 1 1 0.5 0.5 0 0 1 −0.5 −0.5 0 0 0.5 1 −1.5 −1 −0.5 0 0.5 1 −1.5 −1 −0.5 0 0.5 1 −1.5 −1 −0.5 0 0.5 1 −1.5 0.5 1 −1.5 −1 −0.5 0 0.5 1 −1 −0.5 0 0.5 1 −1.5 −1 −0.5 0 0.5 1 −1.5 0.1 0.5 1 −1.5 −1 −0.5 0 0.5 1 −1 −0.5 0 0.5 1 −1.5 −1 −0.5 0 0.5 1 −1.5 0.5 1 −1.5 −1 −0.5 0 0.5 1 −1 −0.5 0 0.5 1 −1.5 −1 −0.5 0 0.5 1 −1.5 0 −0.5 −1.5 0.5 0 0 −0.5 0 −0.5 −0.5 −0.5 −2 −1 −1 −1 −0.5 0 0.5 1 −1.5 −1 −0.5 0 0.5 1 −1.5 −1 −0.5 0 0.5 1 −1.5 −1 −0.5 0 0.5 1 −1.5 −2 −1 −1.5 −1 −1.5 1 0.5 0.5 0 −1.5 −1 1 1 0.5 −2 −1 −1.5 −1.5 −0.5 −1 −2 1 −1 0 −1 −1.5 −2 −0.5 −2 −1.5 0.5 −0.5 −1 −1.5 0 −0.5 −1 −1.5 0 −1.5 0.5 0 −0.5 −1 −0.5 −1 −1.5 1 0.5 0 −0.5 −1 −0.5 −1 1 0.5 0 −1.5 1 0.5 0 −0.5 −2 1 0.5 −2 −1 −1.5 −1.5 −0.5 0 −1 1 −1 0 1 −1.5 −2 −0.5 −2 −1.5 1 0.5 0 0.5 −0.5 −1 −1.5 0 −0.5 −1 −1.5 0 −1.5 0.5 0 −0.5 −1 −0.5 −1 −1.5 1 0.5 0 −0.5 −1 −0.5 −1 1 0.5 0 −1.5 1 0.5 1 0.5 0 −0.5 −2 1 0.5 −2 −1 −1.5 −1.5 −0.5 0 −1 1 −1 0 1 −1.5 −2 −0.5 −2 −1.5 1 0.5 0 0.5 −0.5 −1 −1.5 0 −0.5 −1 −1.5 0 −1.5 0.5 0 −0.5 −1 −0.5 −1 −1.5 1 0.5 0 −0.5 −1 −0.5 −1 1 0.5 0 −1.5 1 0.5 1 0.5 0 −0.5 −2 1 0.5 −2 −1 −1.5 −1.5 −0.5 0 −1 1 −1 0 1 −1.5 −2 −0.5 −2 −1 −1.5 −0.5 1 0.5 0 0.5 −0.5 −1 −1.5 0 −0.5 −0.5 −1 −1 −1.5 0.5 0 0 −0.5 −2 −1.5 0 −1.5 1 0.5 0.5 0 −2 −1 −1.5 −1.5 −1 1 1 0.5 −0.5 −1 1 0.5 1 0.5 −0.5 −1 −2 1 0 −0.5 −1 −1.5 −1.5 −1.5 −2 −1.5 0.5 0 −0.5 −1 −0.5 0 −0.5 −1.5 −1 −1.5 1 0.5 0 −0.5 0 1 −1 −0.5 −1 1 0.5 0 1 0.5 0 0.5 −0.5 −1 −0.5 −2 1 0.5 1 0.5 1 0.5 0 −1 1 0 1 −1.5 −2 0.5 0 1 0.5 −0.5 −1 −1.5 1 0.5 1 0.5 −1.5 −1.5 −1.5 −1 −0.5 0 0.5 1 −1.5 −1 −0.5 0 0.5 1 −1.5 −1 −0.5 0 0.5 1 −1.5 −1 −0.5 0 0.5 1 0.13 0.12 RMSE C RMSE 0.08 0.06 0.04 0.02 0.11 0.1 0.09 0.08 0.07 0.06 0 0 50 100 150 0.05 0 10 time step n 20 30 40 50 60 time step n Figure 1: OSU running man motion capture data. A: we use 217 datapoints for training LELVM (with added noise) and for tracking. Row 1: tracking in the 2D latent space. The contours (very tight in this sequence) are the posterior probability. Row 2: perspective-projection-based observations with occlusions. Row 3: each quadruplet (a, a′ , b, b′ ) show the true pose of the running man from a front and side views (a, b), and the reconstructed pose by tracking with our model (a′ , b′ ). B: we use the first running cycle for training LELVM and the second cycle for tracking. C: RMSE errors for each frame, for the tracking of A (left plot) and B (middle plot), normalised so that 1 equals the M 1 ˆ 2 height of the stick man. RMSE(n) = M j=1 ynj − ynj −1/2 for all 3D locations of the M ˆ markers, i.e., comparison of reconstructed stick man yn with ground-truth stick man yn . Right plot: multimodal posterior distribution in pose space for the model of A (frame 42). Comparison with PCA and GPLVM (fig. 3): for these models, the tracker uses the same GMSPPF setting as for LELVM (number of particles, initialisation, random-walk dynamics, etc.) but with the mapping y = f (x) provided by GPLVM or PCA, and with a uniform prior p(x) in latent space (since neither GPLVM nor the non-probabilistic PCA provide one). The LELVM-tracker uses both its f (x) and latent space prior p(x), as discussed. All methods use a 2D latent space. We ensured the best possible training of GPLVM by model selection based on multiple runs. For PCA, the latent space looks deceptively good, showing non-intersecting loops. However, (1) individual loops do not collect together as they should (for LELVM they do); (2) worse still, the mapping from 2D to pose space yields a poor observation model. The reason is that the loop in 102-D pose space is nonlinearly bent and a plane can at best intersect it at a few points, so the tracker often stays put at one of those (typically an “average” standing position), since leaving it would increase the error a lot. Using more latent dimensions would improve this, but as LELVM shows, this is not necessary. For GPLVM, we found high sensitivity to filter initialisation: the estimates have high variance across runs and are inaccurate ≈ 80% of the time. When it fails, the GPLVM tracker often freezes in latent space, like PCA. When it does succeed, it produces results that are comparable with LELVM, although somewhat less accurate visually. However, even then GPLVM’s latent space consists of continuous chunks spread apart and offset from each other; GPLVM has no incentive to place nearby two xs mapping to the same y. This effect, combined with the lack of a data-sensitive, realistic latent space density p(x), makes GPLVM jump erratically from chunk to chunk, in contrast with LELVM, which smoothly follows the 1D loop. Some GPLVM problems might be alleviated using higher-order dynamics, but our experiments suggest that such modeling sophistication is less 6 0 0.5 1 n=1 n = 15 n = 29 n = 43 n = 55 n = 69 A 100 100 100 100 100 50 50 50 50 50 0 0 0 0 0 0 −50 −50 −50 −50 −50 −50 −100 −100 −100 −100 −100 −100 50 50 40 −40 −20 −50 −30 −10 0 −40 20 −20 40 n=4 −40 10 20 −20 40 50 0 n=9 −40 −20 30 40 50 0 n = 14 −40 20 −20 40 50 30 20 10 0 0 −40 −20 −30 −20 −40 10 20 −30 −10 0 −50 30 50 n = 19 −50 −30 −10 −50 30 −10 −20 −30 −40 10 50 40 30 20 10 0 −50 −30 −10 −50 20 −10 −20 −30 −40 10 50 40 30 20 10 0 −50 −30 −10 −50 30 −10 −20 −30 −40 0 50 50 40 30 20 10 0 −50 −30 −10 −50 30 −10 −20 −30 −40 10 40 30 20 10 0 −10 −20 −30 50 40 30 20 10 0 −10 −50 100 40 −40 10 20 −50 30 50 n = 24 40 50 n = 29 B 20 20 20 20 20 40 40 40 40 40 60 60 60 60 60 80 80 80 80 80 80 100 100 100 100 100 100 120 120 120 120 120 120 140 140 140 140 140 140 160 160 160 160 160 160 180 180 180 180 180 200 200 200 200 200 220 220 220 220 220 50 100 150 200 250 300 350 50 100 150 200 250 300 350 50 100 150 200 250 300 350 50 100 150 200 250 300 350 20 40 60 180 200 220 50 100 150 200 250 300 350 50 100 150 100 100 100 100 100 50 50 50 50 200 250 300 350 100 50 50 0 0 0 0 0 0 −50 −50 −50 −50 −50 −50 −100 −100 −100 −100 −100 −100 50 50 40 −40 −20 −50 −30 −10 0 −40 10 20 −50 30 40 −40 −20 −50 −30 −10 0 −40 10 20 −50 30 40 −40 −20 −50 −30 −10 0 −40 −20 30 40 50 0 −40 10 20 −50 30 40 50 40 30 30 20 20 10 10 0 −50 −30 −10 −50 20 0 −10 −20 −30 −40 10 40 30 20 10 0 −10 −20 −30 50 50 40 30 20 10 0 −10 −20 −30 50 50 40 30 20 10 0 −10 −20 −30 50 40 30 20 10 0 −10 −50 50 −40 −20 −30 −20 −30 −10 0 −40 10 20 −50 30 40 50 −10 −50 −40 −20 −30 −20 −30 −10 0 −40 10 20 −50 30 40 50 Figure 2: A: tracking of a video from [6] (turning & walking). We use 220 datapoints (3 full walking cycles) for training LELVM. Row 1: tracking in the 2D latent space. The contours are the estimated posterior probability. Row 2: tracking based on markers. The red dots are the 2D tracks and the green stick man is the 3D reconstruction obtained using our model. Row 3: our 3D reconstruction from a different viewpoint. B: tracking of a person running straight towards the camera. Notice the scale changes and possible forward-backward ambiguities in the 3D estimates. We train the LELVM using 180 datapoints (2.5 running cycles); 2D tracks were obtained by manually marking the video. In both A–B the mocap training data was for a person different from the video’s (with different body proportions and motions), and no ground-truth estimate was available for favourable initialisation. LELVM GPLVM PCA tracking in latent space tracking in latent space tracking in latent space 2.5 0.02 30 38 2 38 0.99 0.015 20 1.5 38 0.01 1 10 0.005 0.5 0 0 0 −0.005 −0.5 −10 −0.01 −1 −0.015 −1.5 −20 −0.02 −0.025 −0.025 −2 −0.02 −0.015 −0.01 −0.005 0 0.005 0.01 0.015 0.02 0.025 −2.5 −2 −1 0 1 2 3 −30 −80 −60 −40 −20 0 20 40 60 80 Figure 3: Method comparison, frame 38. PCA and GPLVM map consecutive walking cycles to spatially distinct latent space regions. Compounded by a data independent latent prior, the resulting tracker gets easily confused: it jumps across loops and/or remains put, trapped in local optima. In contrast, LELVM is stable and follows tightly a 1D manifold (see videos). crucial if locality constraints are correctly modeled (as in LELVM). We conclude that, compared to LELVM, GPLVM is significantly less robust for tracking, has much higher training overhead and lacks some operations (e.g. computing latent conditionals based on partly missing ambient data). 7 5 Conclusion and future work We have proposed the use of priors based on the Laplacian Eigenmaps Latent Variable Model (LELVM) for people tracking. LELVM is a probabilistic dim. red. method that combines the advantages of latent variable models and spectral manifold learning algorithms: a multimodal probability density over latent and ambient variables, globally differentiable nonlinear mappings for reconstruction and dimensionality reduction, no local optima, ability to unfold highly nonlinear manifolds, and good practical scaling to latent spaces of high dimension. LELVM is computationally efficient, simple to learn from sparse training data, and compatible with standard probabilistic trackers such as particle filters. Our results using a LELVM-based probabilistic sigma point mixture tracker with several real and synthetic human motion sequences show that LELVM provides sufficient constraints for robust operation in the presence of missing, noisy and ambiguous image measurements. Comparisons with PCA and GPLVM show LELVM is superior in terms of accuracy, robustness and computation time. The objective of this paper was to demonstrate the ability of the LELVM prior in a simple setting using 2D tracks obtained automatically or manually, and single-type motions (running, walking). Future work will explore more complex observation models such as silhouettes; the combination of different motion types in the same latent space (whose dimension will exceed 2); and the exploration of multimodal posterior distributions in latent space caused by ambiguities. Acknowledgments This work was partially supported by NSF CAREER award IIS–0546857 (MACP), NSF IIS–0535140 and EC MCEXT–025481 (CS). CMU data: http://mocap.cs.cmu.edu (created with funding from NSF EIA–0196217). OSU data: http://accad.osu.edu/research/mocap/mocap data.htm. References ´ [1] M. A. Carreira-Perpi˜ an and Z. Lu. The Laplacian Eigenmaps Latent Variable Model. In AISTATS, 2007. n´ [2] N. R. Howe, M. E. Leventon, and W. T. Freeman. Bayesian reconstruction of 3D human motion from single-camera video. In NIPS, volume 12, pages 820–826, 2000. [3] T.-J. Cham and J. M. Rehg. A multiple hypothesis approach to figure tracking. In CVPR, 1999. [4] M. Brand. Shadow puppetry. In ICCV, pages 1237–1244, 1999. [5] H. Sidenbladh, M. J. Black, and L. Sigal. Implicit probabilistic models of human motion for synthesis and tracking. In ECCV, volume 1, pages 784–800, 2002. [6] C. Sminchisescu and A. Jepson. Generative modeling for continuous non-linearly embedded visual inference. In ICML, pages 759–766, 2004. [7] R. Urtasun, D. J. Fleet, A. Hertzmann, and P. Fua. Priors for people tracking from small training sets. In ICCV, pages 403–410, 2005. [8] R. Li, M.-H. Yang, S. Sclaroff, and T.-P. Tian. Monocular tracking of 3D human motion with a coordinated mixture of factor analyzers. In ECCV, volume 2, pages 137–150, 2006. [9] J. M. Wang, D. Fleet, and A. Hertzmann. Gaussian process dynamical models. In NIPS, volume 18, 2006. [10] R. Urtasun, D. J. Fleet, and P. Fua. Gaussian process dynamical models for 3D people tracking. In CVPR, pages 238–245, 2006. [11] G. W. Taylor, G. E. Hinton, and S. Roweis. Modeling human motion using binary latent variables. In NIPS, volume 19, 2007. [12] C. M. Bishop, M. Svens´ n, and C. K. I. Williams. GTM: The generative topographic mapping. Neural e Computation, 10(1):215–234, January 1998. [13] N. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6:1783–1816, November 2005. [14] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, June 2003. [15] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman & Hall, 1986. [16] R. van der Merwe and E. A. Wan. Gaussian mixture sigma-point particle filters for sequential probabilistic inference in dynamic state-space models. In ICASSP, volume 6, pages 701–704, 2003. ´ [17] M. A. Carreira-Perpi˜ an. Acceleration strategies for Gaussian mean-shift image segmentation. In CVPR, n´ pages 1160–1167, 2006. 8
6 0.065330401 181 nips-2007-Sparse Overcomplete Latent Variable Decomposition of Counts Data
7 0.055139195 1 nips-2007-A Bayesian Framework for Cross-Situational Word-Learning
8 0.055111196 105 nips-2007-Infinite State Bayes-Nets for Structured Domains
9 0.050506171 84 nips-2007-Expectation Maximization and Posterior Constraints
10 0.049838159 156 nips-2007-Predictive Matrix-Variate t Models
11 0.048157103 170 nips-2007-Robust Regression with Twinned Gaussian Processes
12 0.044933528 32 nips-2007-Bayesian Co-Training
13 0.043880191 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs
14 0.04351373 72 nips-2007-Discriminative Log-Linear Grammars with Latent Variables
15 0.043189969 197 nips-2007-The Infinite Markov Model
16 0.042366996 189 nips-2007-Supervised Topic Models
17 0.041402847 99 nips-2007-Hierarchical Penalization
18 0.040952809 145 nips-2007-On Sparsity and Overcompleteness in Image Models
19 0.040632039 149 nips-2007-Optimal ROC Curve for a Combination of Classifiers
20 0.039377637 63 nips-2007-Convex Relaxations of Latent Variable Training
topicId topicWeight
[(0, -0.141), (1, 0.044), (2, -0.039), (3, -0.057), (4, 0.01), (5, -0.027), (6, -0.044), (7, -0.023), (8, -0.03), (9, -0.017), (10, -0.076), (11, -0.023), (12, 0.016), (13, -0.015), (14, -0.032), (15, -0.05), (16, -0.053), (17, -0.03), (18, -0.04), (19, -0.055), (20, -0.0), (21, 0.065), (22, 0.018), (23, 0.056), (24, -0.013), (25, 0.053), (26, -0.025), (27, -0.052), (28, 0.028), (29, 0.004), (30, 0.062), (31, 0.077), (32, -0.0), (33, 0.088), (34, -0.115), (35, -0.059), (36, 0.054), (37, 0.224), (38, -0.075), (39, -0.111), (40, -0.065), (41, 0.001), (42, -0.012), (43, -0.027), (44, 0.015), (45, -0.186), (46, -0.046), (47, 0.045), (48, 0.126), (49, -0.038)]
simIndex simValue paperId paperTitle
same-paper 1 0.92951959 131 nips-2007-Modeling homophily and stochastic equivalence in symmetric relational data
Author: Peter Hoff
Abstract: This article discusses a latent variable model for inference and prediction of symmetric relational data. The model, based on the idea of the eigenvalue decomposition, represents the relationship between two nodes as the weighted inner-product of node-specific vectors of latent characteristics. This “eigenmodel” generalizes other popular latent variable models, such as latent class and distance models: It is shown mathematically that any latent class or distance model has a representation as an eigenmodel, but not vice-versa. The practical implications of this are examined in the context of three real datasets, for which the eigenmodel has as good or better out-of-sample predictive performance than the other two models. 1
2 0.62065864 97 nips-2007-Hidden Common Cause Relations in Relational Learning
Author: Ricardo Silva, Wei Chu, Zoubin Ghahramani
Abstract: When predicting class labels for objects within a relational database, it is often helpful to consider a model for relationships: this allows for information between class labels to be shared and to improve prediction performance. However, there are different ways by which objects can be related within a relational database. One traditional way corresponds to a Markov network structure: each existing relation is represented by an undirected edge. This encodes that, conditioned on input features, each object label is independent of other object labels given its neighbors in the graph. However, there is no reason why Markov networks should be the only representation of choice for symmetric dependence structures. Here we discuss the case when relationships are postulated to exist due to hidden common causes. We discuss how the resulting graphical model differs from Markov networks, and how it describes different types of real-world relational processes. A Bayesian nonparametric classification model is built upon this graphical representation and evaluated with several empirical studies. 1 Contribution Prediction problems, such as classification, can be easier when class labels share a sort of relational dependency that is not accounted by the input features [10]. If the variables to be predicted are attributes of objects in a relational database, such dependencies are often postulated from the relations that exist in the database. This paper proposes and evaluates a new method for building classifiers that uses information concerning the relational structure of the problem. Consider the following standard example, adapted from [3]. There are different webpages, each one labeled according to some class (e.g., “student page” or “not a student page”). Features such as the word distribution within the body of each page can be used to predict each webpage’s class. However, webpages do not exist in isolation: there are links connecting them. Two pages having a common set of links is evidence for similarity between such pages. For instance, if W1 and W3 both link to W2 , this is commonly considered to be evidence for W1 and W3 having the same class. One way of expressing this dependency is through the following Markov network [5]: ∗ Now at the Statistical Laboratory, University of Cambridge. E-mail: silva@statslab.cam.ac.uk F1 F2 F 3 C1 C2 C3 Here Fi are the features of page Wi , and Ci is its respective page label. Other edges linking F variables to C variables (e.g., F1 −C2 ) can be added without affecting the main arguments presented in this section. The semantics of the graph, for a fixed input feature set {F1 , F2 , F3 }, are as follows: C1 is marginally dependent on C3 , but conditionally independent given C2 . Depending on the domain, this might be either a suitable or unsuitable representation of relations. For instance, in some domains it could be the case that the most sensible model would state that C1 is only informative about C3 once we know what C2 is: that is, C1 and C3 are marginally independent, but dependent given C2 . This can happen if the existence of a relation (Ci , Cj ) corresponds to the existence of hidden common causes generating this pair of random variables. Consider the following example, loosely based on a problem described by [12]. We have three objects, Microsoft (M ), Sony (S) and Philips (P ). The task is a regression task where we want to predict the stock market price of each company given its profitability from last year. The given relationships are that M and S are direct competitors (due to the videogame console market), as well S and P (due to the TV set market). M.Profit S.Profit P.Profit M.Profit S.Profit P.Profit M.Profit S.Profit P.Profit M.Stock S.Stock P.Stock M.Stock S.Stock P.Stock M.Stock S.Stock P.Stock εm εs εp εm εs εp (a) (b) (c) Figure 1: (a) Assumptions that relate Microsoft, Sony and Philips stock prices through hidden common cause mechanisms, depicted as unlabeled gray vertices; (b) A graphical representation for generic hidden common causes relationships by using bi-directed edges; (c) A depiction of the same relationship skeleton by a Markov network model, which has different probabilistic semantics. It is expected that several market factors that affect stock prices are unaccounted by the predictor variable Past Year Profit. For example, a shortage of Microsoft consoles is a hidden common factor for both Microsoft’s and Sony’s stock. Another hidden common cause would be a high price for Sony’s consoles. Assume here that these factors have no effect on Philips’ stock value. A depiction of several hidden common causes that correpond to the relations Competitor(M, S) and Competitor(S, P ) is given in Figure 1(a) as unlabeled gray vertices. Consider a linear regression model for this setup. We assume that for each object Oi ∈ {M, S, P }, the stock price Oi .Stock, centered at the mean, is given by Oi .Stock = β × Oi .P rof it + where each i i (1) is a Gaussian random variable. The fact that there are several hidden common causes between M and S can be modeled by the covariance of m and s , σms . That is, unlike in standard directed Gaussian models, σms is allowed to be non-zero. The same holds for σsp . Covariances of error terms of unrelated objects should be zero (σmp = 0). This setup is very closely related to the classic seemingly unrelated regression model popular in economics [12]. A graphical representation for this type of model is the directed mixed graph (DMG) [9, 11], with bi-directed edges representing the relationship of having hidden common causes between a pair of vertices. This is shown in Figure 1(b). Contrast this to the Markov network representation in Figure 1(c). The undirected representation encodes that m and p are marginally dependent, which does not correspond to our assumptions1 . Moreover, the model in Figure 1(b) states that once we observe Sony’s stock price, Philip’s stocks (and profit) should have a non-zero association with Microsoft’s profit: this follows from a extension of d-separation to DMGs [9]. This is expected from the assumptions (Philip’s stocks should tell us something about Microsoft’s once we know Sony’s, but not before it), but does not hold in the graphical model in Figure 1(c). While it is tempting to use Markov networks to represent relational models (free of concerns raised by cyclic directed representations), it is clear that there are problems for which they are not a sensible choice. This is not to say that Markov networks are not the best representation for large classes of relational problems. Conditional random fields [4] are well-motivated Markov network models for sequence learning. The temporal relationship is closed under marginalization: if we do not measure some steps in the sequence, we will still link the corresponding remaining vertices accordingly, as illustrated in Figure 2. Directed mixed graphs are not a good representation for this sequence structure. X1 X2 X3 X4 X5 Y1 Y2 Y3 Y4 Y5 X1 X2 X3 X4 X5 Y1 Y2 Y3 Y4 Y5 (a) (b) X1 X3 X5 Y1 Y3 Y5 (c) Figure 2: (a) A conditional random field (CRF) graph for sequence data; (b) A hypothetical scenario where two of the time slices are not measured, as indicated by dashed boxes; (c) The resulting CRF graph for the remaining variables, which corresponds to the same criteria for construction of (a). To summarize, the decision between using a Markov network or a DMG reduces to the following modeling issue: if two unlinked object labels yi , yj are statistically associated when some chain of relationships exists between yi and yj , then the Markov network semantics should apply (as in the case for temporal relationships). However, if the association arises only given the values of the other objects in the chain, then this is accounted by the dependence semantics of the directed mixed graph representation. The DMG representation propagates training data information through other training points. The Markov network representation propagates training data information through test points. Propagation through training points is relevant in real problems. For instance, in a webpage domain where each webpage has links to pages of several kinds (e.g., [3]), a chain of intermediated points between two classes labels yi and yj is likely to be more informative if we know the values of the labels in this chain. The respective Markov network would ignore all training points in this chain besides the endpoints. In this paper, we introduce a non-parametric classification model for relational data that factorizes according to a directed mixed graph. Sections 2 and 3 describes the model and contrasts it to a closely related approach which bears a strong analogy to the Markov network formulation. Experiments in text classification are described in Section 4. 2 Model Chu et al. [2] describe an approach for Gaussian process classification using relational information, which we review and compare to our proposed model. Previous approach: relational Gaussian processes through indicators − For each point x in the input space X , there is a corresponding function value fx . Given observed input points x1 , x2 , . . . , xn , a Gaussian process prior over f = [f1 , f2 , . . . , fn ]T has the shape P(f ) = 1 (2π)n/2 |Σ|1/2 1 exp − f T Σ−1 f 2 (2) 1 For Gaussian models, the absence of an edge in the undirected representation (i.e., Gaussian Markov random fields) corresponds to a zero entry in the inverse covariance matrix, where in the DMG it corresponds to a zero in the covariance matrix [9]. X1 X2 f1 X3 X2 X3 X1 X2 X3 f3 f2 ξ 12 X1 f1 f2 f3 f1 f2 f3 Y2 Y3 ξ 23 Y1 Y2 Y3 Y1 Y2 Y3 Y1 ε1 ε2 ε3 ε1 ε2 ε3 ε1 (a) ζ1 (b) ζ2 ε2 ζ3 ε3 (c) Figure 3: (a) A prediction problem where y3 is unknown and the training set is composed of other two datapoints. Dependencies between f1 , f2 and f3 are given by a Gaussian process prior and not represented in the picture. Indicators ξij are known and set to 1; (b) The extra associations that arise by conditioning on ξ = 1 can be factorized as the Markov network model here depicted, in the spirit of [9]; (c) Our proposed model, which ties the error terms and has origins in known statistical models such as seemingly unrelated regression and structural equation models [11]. where the ijth entry of Σ is given by a Mercer kernel function K(xi , xj ) [8]. The idea is to start from a standard Gaussian process prior, and add relational information by conditioning on relational indicators. Let ξij be an indicator that assumes different values, e.g., 1 or 0. The indicator values are observed for each pair of data points (xi , xj ): they are an encoding of the given relational structure. A model for P (ξij = 1|fi , fj ) is defined. This evidence is incorporated into the Gaussian process by conditioning on all indicators ξij that are positive. Essentially, the idea boils down to using P(f |ξ = 1) as the prior for a Gaussian process classifier. Figure 3(a) illustrates a problem with datapoints {(x1 , y1 ), (x2 , y2 ), (x3 , y3 )}. Gray vertices represent unobserved variables. Each yi is a binary random variable, with conditional probability given by P(yi = 1|fi ) = Φ(fi /σ) (3) where Φ(·) is the standard normal cumulative function and σ is a hyperparameter. This can be interpreted as the cumulative distribution of fi + i , where fi is given and i is a normal random variable with zero mean and variance σ 2 . In the example of Figure 3(a), one has two relations: (x1 , x2 ), (x2 , x3 ). This information is incorporated by conditioning on the evidence (ξ12 = 1, ξ23 = 1). Observed points (x1 , y1 ), (x2 , y2 ) form the training set. The prediction task is to estimate y3 . Notice that ξ12 is not used to predict y3 : the Markov blanket for f3 includes (f1 , f2 , ξ23 , y3 , 3 ) and the input features. Essentially, conditioning on ξ = 1 corresponds to a pairwise Markov network structure, as depicted in Figure 3(b) [9]2 . Our approach: mixed graph relational model − Figure 3(c) illustrates our proposed setup. For reasons that will become clear in the sequel, we parameterize the conditional probability of yi as √ (4) P(yi = 1|gi , vi ) = Φ(gi / vi ) where gi = fi + ζi . As before, Equation (4) can be interpreted as the cumulative distribution of 2 gi + i , with i as a normal random variable with zero mean and variance vi = σ 2 − σζi , the last term being the variance of ζi . That is, we break the original error term as i = ζi + i , where i and j are independent for all i = j. Random vector ζ is a multivariate normal with zero mean and covariance matrix Σζ . The key aspect in our model is that the covariance of ζi and ζj is non-zero only if objects i and j are related (that is, bi-directed edge yi ↔ yj is in the relational graph). Parameterizing Σζ for relational problems is non-trivial and discussed in the next section. In the example of Figure 3, one noticeable difference of our model 3(c) to a standard Markov network models 3(b) is that now the Markov blanket for f3 includes error terms for all variables (both and ζ terms), following the motivation presented in Section 1. 2 In the figure, we are not representing explicitly that f1 , f2 and f3 are not independent (the prior covariance matrix Σ is complete). The figure is meant as a representation of the extra associations that arise when conditioning on ξ = 1, and the way such associations factorize. As before, the prior for f in our setup is the Gaussian process prior (2). This means that g has the following Gaussian process prior (implicitly conditioned on x): P(g) = 1 (2π)n/2 |R|1/2 1 exp − g R−1 g 2 (5) where R = K + Σζ is the covariance matrix of g = f + ζ, with Kij = K(xi , xj ). 3 Parametrizing a mixed graph model for relational classification For simplicity, in this paper we will consider only relationships that induce positive associations between labels. Ideally, the parameterization of Σζ has to fulfill two desiderata: (i). it should respect the marginal independence constraints as encoded by the graphical model (i.e., zero covariance for vertices that are not adjacent), and be positive definite; (ii). it has to be parsimonious in order to facilitate hyperparameter selection, both computationally and statistically. Unlike the multivariate analysis problems in [11], the size of our covariance matrix grows with the number of data points. As shown by [11], exact inference in models with covariance matrices with zero-entry constraints is computationally demanding. We provide two alternative parameterizations that are not as flexible, but which lead to covariance matrices that are simple to compute and easy to implement. We will work under the transductive scenario, where training and all test points are given in advance. The corresponding graph thus contain unobserved and observed label nodes. 3.1 Method I The first method is an automated method to relax some of the independence constraints, while guaranteeing positive-definiteness, and a parameterization that depends on a single scalar ρ. This allows for more efficient inference and is done as follows: 1. Let Gζ be the corresponding bi-directed subgraph of our original mixed graph, and let U0 be a matrix with n × n entries, n being the number of nodes in Gζ 2. Set U0 to be the number of cliques in Gζ where yi and yj appear together; ij 3. Set U0 to be the number of cliques containing yi , plus a small constant ∆; ii 4. Set U to be the corresponding correlation matrix obtained by intepreting U0 as a covariance matrix and rescaling it; Finally, set Σζ = ρU, where ρ ∈ [0, 1] is a given hyperparameter. Matrix U is always guaranteed to be positive definite: it is equivalent to obtaining the covariance matrix of y from a linear latent variable model, where there is an independent standard Gaussian latent variable as a common parent to every clique, and every observed node yi is given by the sum of its parents plus an independent error term of variance ∆. Marginal independencies are respected, since independent random variables will never be in a same clique in Gζ . In practice, this method cannot be used as is since the number of cliques will in general grow at an exponential rate as a function of n. Instead, we first triangulate the graph: in this case, extracting cliques can be done in polynomial time. This is a relaxation of the original goal, since some of the original marginal independence constraints will not be enforced due to the triangulation3. 3.2 Method II The method suggested in the previous section is appealing under the assumption that vertices that appear in many common cliques are more likely to have more hidden common causes, and hence should have stronger associations. However, sometimes the triangulation introduces bad artifacts, with lots of marginal independence constraints being violated. In this case, this will often result in a poor prediction performance. A cheap alternative approach is not generating cliques, and instead 3 The need for an approximation is not a shortcoming only of the DMG approach. Notice that the relational Gaussian process of [2] also requires an approximation of its relational kernel. 10 10 10 20 20 20 30 30 30 40 40 40 50 50 50 60 60 60 70 70 70 80 80 80 90 90 90 100 100 100 10 20 30 40 50 60 (a) 70 80 90 100 10 20 30 40 50 60 70 80 90 (b) 100 10 20 30 40 50 60 70 80 90 100 (c) Figure 4: (a) The link matrix for the political books dataset. (b) The relational kernel matrix obtained with the approximated Method I. (c) The kernel matrix obtained with Method II, which tends to produce much weaker associations but does not introduce spurious relations. getting a marginal covariance matrix from a different latent variable model. In this model, we create an independent standard Gaussian variable for each edge yi ↔ yj instead of each clique. No triangulation will be necessary, and all marginal independence constraints will be respected. This, however, has shortcomings of its own: for all pairs (yi , yj ) connected by an edge, it will be the case that U0 = 1, while U0 can be as large as n. This means that the resulting correlation in Uij can be ij ii close to zero even if yi and yj are always in the same cliques. In Section 4, we will choose between Methods I and II according to the marginal likelihood of the model. 3.3 Algorithm Recall that our model is a Gaussian process classifier with error terms i of variance σ such that i = ζi + i . Without loss of generality, we will assume that σ = 1. This results in the following parameterization of the full error covariance matrix: Σ = (1 − ρ)I + ρU (6) where I is an n × n identity matrix. Matrix (1 − ρ)I corresponds to the covariance matrix Σ . The usefulness of separating as and ζ becomes evident when we use an expectation-propagation (EP) algorithm [7] to perform inference in our relational classifier. Instead of approximating the posterior of f , we approximate the posterior density P(g|D), D = {(x1 , y1 ), . . . , (xn , yn )} being ˜ the given training data. The approximate posterior has the form Q(g) ∝ P(g) i ti (gi ) where P(g) is the Gaussian process prior with kernel matrix R = K + Σζ as defined in the previous section. Since the covariance matrix Σ is diagonal, the true likelihood of y given g factorizes n over each datapoint: P(y|g) = i=1 P(yi |gi ), and standard EP algorithms for Gaussian process classification can be used [8] (with the variance given by Σ instead of Σ , and kernel matrix R instead of K). The final algorithm defines a whole new class of relational models, depends on a single hyperparameter ρ which can be optimized by grid search in [0, 1], and requires virtually no modification of code written for EP-based Gaussian process classifiers4 . 4 Results We now compare three different methods in relational classification tasks. We will compare a standard Gaussian process classifier (GPC), the relational Gaussian process (RGP) of [2] and our method, the mixed graph Gaussian process (XGP). A linear kernel K(x, z) = x · z is used, as described by [2]. We set ∆ = 10−4 and the hyperparameter ρ is found by a grid search in the space {0.1, 0.2, 0.3, . . . , 1.0} maximizing the approximate EP marginal likelihood5. 4 We provide MATLAB/Octave code for our method in http://www.statslab.cam.ac.uk/∼silva. For triangulation, we used the MATLAB implementation of the Reverse Cuthill McKee vertex ordering available at http://people.scs.fsu.edu/∼burkardt/m src/rcm/rcm.html 5 Table 1: The averaged AUC scores of citation prediction on test cases of the Cora database are recorded along with standard deviation over 100 trials. “n” denotes the number of papers in one class. “Citations” denotes the citation count within the two paper classes. Group n Citations GPC GPC with Citations XGP 5vs1 346/488 2466 0.905 ± 0.031 0.891 ± 0.022 0.945 ± 0.053 5vs2 346/619 3417 0.900 ± 0.032 0.905 ± 0.044 0.933 ± 0.059 5vs3 346/1376 3905 0.863 ± 0.040 0.893 ± 0.017 0.883 ± 0.013 5vs4 346/646 2858 0.916 ± 0.030 0.887 ± 0.018 0.951 ± 0.042 5vs6 346/281 1968 0.887 ± 0.054 0.843 ± 0.076 0.955 ± 0.041 5vs7 346/529 2948 0.869 ± 0.045 0.867 ± 0.041 0.926 ± 0.076 4.1 Political books We consider first a simple classification problem where the goal is to classify whether a particular book is of liberal political inclination or not. The features of each book are given by the words in the Amazon.com front page for that particular book. The choice of books, labels, and relationships are given in the data collected by Valdis Krebs and available at http://www-personal.umich.edu/ mejn/netdata. The data containing book features can be found at http://www.statslab.cam.ac.uk/∼silva. There are 105 books, 43 of which are labeled as liberal books. The relationships are pairs of books which are frequently purchased together by a same customer. Notice this is an easy problem, where labels are strongly associated if they share a relationship. We performed evaluation by sampling 100 times from the original pool of books, assigning half of them as trainining data. The evaluation criterion was the area under the curve (AUC) for this binary problem. This is a problem where Method I is suboptimal. Figure 4(a) shows the original binary link matrix. Figure 4(b) depicts the corresponding U0 matrix obtained with Method I, where entries closer to red correspond to stronger correlations. Method II gives a better performance here (Method I was better in the next two experiments). The AUC result for GPC was of 0.92, while both RGP and XGP achieved 0.98 (the difference between XGP and GPC having a std. deviation of 0.02). 4.2 Cora The Cora collection [6] contains over 50,000 computer science research papers including bibliographic citations. We used a subset in our experiment. The subset consists of 4,285 machine learning papers categorized into 7 classes. The second column of Table 1 shows the class sizes. Each paper was preprocessed as a bag-of-words, a vector of “term frequency” components scaled by “inverse document frequency”, and then normalized to unity length. This follows the pre-processing used in [2]. There is a total of 20,082 features. For each class, we randomly selected 1% of the labelled samples for training and tested on the remainder. The partition was repeated 100 times. We used the fact that the database is composed of fairly specialized papers as an illustration of when XGP might not be as optimal as RGP (whose AUC curves are very close to 1), since the population of links tends to be better separated between different classes (but this is also means that the task is fairly easy, and differences disappear very rapidly with increasing sample sizes). The fact there is very little training data also favors RGP, since XGP propagates information through training points. Still, XGP does better than the non-relational GPC. Notice that adding the citation adjacency matrix as a binary input feature for each paper does not improve the performance of the GPC, as shown in Table 1. Results for other classes are of similar qualitative nature and not displayed here. 4.3 WebKB The WebKB dataset consists of homepages from 4 different universities: Cornell, Texas, Washington and Wisconsin [3]. Each webpage belongs to one out of 7 categories: student, professor, course, project, staff, department and “other”. The relations come from actual links in the webpages. There is relatively high heterogeneity of types of links in each page: in terms of mixed graph modeling, this linkage mechanism is explained by a hidden common cause (e.g., a student and a course page are associated because that person’s interest in enrolling as a student also creates demand for a course). The heterogeneity also suggests that two unlinked pages should not, on average, have an association if they link to a common page W . However, observing the type of page W might create Table 2: Comparison of the three algorithms on the task “other” vs. “not-other” in the WebKB domain. Results for GPC and RGP taken from [2]. The same partitions for training and test are used to generate the results for XGP. Mean and standard deviation of AUC results are reported. University Numbers Other or Not Other All Link GPC RGP XGP Cornell 617 865 13177 0.708 ± 0.021 0.884 ± 0.025 0.917 ± 0.022 Texas 571 827 16090 0.799 ± 0.021 0.906 ± 0.026 0.949 ± 0.015 Washington 939 1205 15388 0.782 ± 0.023 0.877 ± 0.024 0.923 ± 0.016 Wisconsin 942 1263 21594 0.839 ± 0.014 0.899 ± 0.015 0.941 ± 0.018 the association. We compare how the three algorithms perform when trying to predict if a webpage is of class “other” or not (the other classifications are easier, with smaller differences. Results are omitted for space purposes). The proportion of “other” to non-“other” is about 4:1, which makes the area under the curve (AUC) a more suitable measure of success. We used the same 100 subsamples from [2], where 10% of the whole data is sampled from the pool for a specific university, and the remaining is used for test. We also used the same features as in [2], pre-processed as described in the previous section. The results are shown in Table 2. Both relational Gaussian processes are far better than the non-relational GPC. XGP gives significant improvements over RGP in all four universities. 5 Conclusion We introduced a new family of relational classifiers by extending a classical statistical model [12] to non-parametric relational classification. This is inspired by recent advances in relational Gaussian processes [2] and Bayesian inference for mixed graph models [11]. We showed empirically that modeling the type of latent phenomena that our approach postulates can sometimes improve prediction performance in problems traditionally approached by Markov network structures. Several interesting problems can be treated in the future. It is clear that there are many different ways by which the relational covariance matrix can be parameterized. Intermediate solutions between Methods I and II, approximations through matrix factorizations and graph cuts are only a few among many alternatives that can be explored. Moreover, there is a relationship between our model and multiple kernel learning [1], where one of the kernels comes from error covariances. This might provide alternative ways of learning our models, including multiple types of relationships. Acknowledgements: We thank Vikas Sindhwani for the preprocessed Cora database. References [1] F. Bach, G. Lanckriet, and M. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. 21st International Conference on Machine Learning, 2004. [2] W. Chu, V. Sindhwani, Z. Ghahramani, and S. Keerthi. Relational learning with Gaussian processes. Neural Information Processing Systems, 2006. [3] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to extract symbolic knowledge from the World Wide Web. Proceedings of AAAI’98, pages 509–516, 1998. [4] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. 18th International Conference on Machine Learning, 2001. [5] S. Lauritzen. Graphical Models. Oxford University Press, 1996. [6] A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Automating the construction of Internet portals with machine learning. Information Retrieval Journal, 3:127–163, 2000. [7] T. Minka. A family of algorithms for approximate Bayesian inference. PhD Thesis, MIT, 2001. [8] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006. [9] T. Richardson and P. Spirtes. Ancestral graph Markov models. Annals of Statistics, 30:962–1030, 2002. [10] P. Sen and L. Getoor. Link-based classification. Report CS-TR-4858, University of Maryland, 2007. [11] R. Silva and Z. Ghahramani. Bayesian inference for Gaussian mixed graph models. UAI, 2006. [12] A. Zellner. An efficient method of estimating seemingly unrelated regression equations and tests for aggregation bias. Journal of the American Statistical Association, 1962.
3 0.46451306 158 nips-2007-Probabilistic Matrix Factorization
Author: Andriy Mnih, Ruslan Salakhutdinov
Abstract: Many existing approaches to collaborative filtering can neither handle very large datasets nor easily deal with users who have very few ratings. In this paper we present the Probabilistic Matrix Factorization (PMF) model which scales linearly with the number of observations and, more importantly, performs well on the large, sparse, and very imbalanced Netflix dataset. We further extend the PMF model to include an adaptive prior on the model parameters and show how the model capacity can be controlled automatically. Finally, we introduce a constrained version of the PMF model that is based on the assumption that users who have rated similar sets of movies are likely to have similar preferences. The resulting model is able to generalize considerably better for users with very few ratings. When the predictions of multiple PMF models are linearly combined with the predictions of Restricted Boltzmann Machines models, we achieve an error rate of 0.8861, that is nearly 7% better than the score of Netflix’s own system.
4 0.43227983 156 nips-2007-Predictive Matrix-Variate t Models
Author: Shenghuo Zhu, Kai Yu, Yihong Gong
Abstract: It is becoming increasingly important to learn from a partially-observed random matrix and predict its missing elements. We assume that the entire matrix is a single sample drawn from a matrix-variate t distribution and suggest a matrixvariate t model (MVTM) to predict those missing elements. We show that MVTM generalizes a range of known probabilistic models, and automatically performs model selection to encourage sparse predictive models. Due to the non-conjugacy of its prior, it is difficult to make predictions by computing the mode or mean of the posterior distribution. We suggest an optimization method that sequentially minimizes a convex upper-bound of the log-likelihood, which is very efficient and scalable. The experiments on a toy data and EachMovie dataset show a good predictive accuracy of the model. 1
5 0.42808023 72 nips-2007-Discriminative Log-Linear Grammars with Latent Variables
Author: Slav Petrov, Dan Klein
Abstract: We demonstrate that log-linear grammars with latent variables can be practically trained using discriminative methods. Central to efficient discriminative training is a hierarchical pruning procedure which allows feature expectations to be efficiently approximated in a gradient-based procedure. We compare L1 and L2 regularization and show that L1 regularization is superior, requiring fewer iterations to converge, and yielding sparser solutions. On full-scale treebank parsing experiments, the discriminative latent models outperform both the comparable generative latent models as well as the discriminative non-latent baselines.
6 0.41261601 197 nips-2007-The Infinite Markov Model
7 0.40357473 105 nips-2007-Infinite State Bayes-Nets for Structured Domains
8 0.37760171 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
9 0.37360686 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model
10 0.36884677 181 nips-2007-Sparse Overcomplete Latent Variable Decomposition of Counts Data
11 0.36774814 19 nips-2007-Active Preference Learning with Discrete Choice Data
12 0.36299026 196 nips-2007-The Infinite Gamma-Poisson Feature Model
13 0.34486774 1 nips-2007-A Bayesian Framework for Cross-Situational Word-Learning
14 0.34432939 63 nips-2007-Convex Relaxations of Latent Variable Training
15 0.341701 125 nips-2007-Markov Chain Monte Carlo with People
16 0.31785205 41 nips-2007-COFI RANK - Maximum Margin Matrix Factorization for Collaborative Ranking
17 0.3173365 29 nips-2007-Automatic Generation of Social Tags for Music Recommendation
18 0.31444559 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data
19 0.3119531 78 nips-2007-Efficient Principled Learning of Thin Junction Trees
20 0.31146458 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
topicId topicWeight
[(5, 0.033), (13, 0.028), (16, 0.039), (18, 0.022), (19, 0.013), (21, 0.077), (31, 0.021), (34, 0.028), (35, 0.022), (47, 0.09), (64, 0.344), (83, 0.103), (85, 0.033), (87, 0.025), (90, 0.034)]
simIndex simValue paperId paperTitle
1 0.78964281 67 nips-2007-Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation
Author: Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul V. Buenau, Motoaki Kawanabe
Abstract: A situation where training and test samples follow different input distributions is called covariate shift. Under covariate shift, standard learning methods such as maximum likelihood estimation are no longer consistent—weighted variants according to the ratio of test and training input densities are consistent. Therefore, accurately estimating the density ratio, called the importance, is one of the key issues in covariate shift adaptation. A naive approach to this task is to first estimate training and test input densities separately and then estimate the importance by taking the ratio of the estimated densities. However, this naive approach tends to perform poorly since density estimation is a hard task particularly in high dimensional cases. In this paper, we propose a direct importance estimation method that does not involve density estimation. Our method is equipped with a natural cross validation procedure and hence tuning parameters such as the kernel width can be objectively optimized. Simulations illustrate the usefulness of our approach. 1
same-paper 2 0.70325404 131 nips-2007-Modeling homophily and stochastic equivalence in symmetric relational data
Author: Peter Hoff
Abstract: This article discusses a latent variable model for inference and prediction of symmetric relational data. The model, based on the idea of the eigenvalue decomposition, represents the relationship between two nodes as the weighted inner-product of node-specific vectors of latent characteristics. This “eigenmodel” generalizes other popular latent variable models, such as latent class and distance models: It is shown mathematically that any latent class or distance model has a representation as an eigenmodel, but not vice-versa. The practical implications of this are examined in the context of three real datasets, for which the eigenmodel has as good or better out-of-sample predictive performance than the other two models. 1
3 0.64811099 53 nips-2007-Compressed Regression
Author: Shuheng Zhou, Larry Wasserman, John D. Lafferty
Abstract: Recent research has studied the role of sparsity in high dimensional regression and signal reconstruction, establishing theoretical limits for recovering sparse models from sparse data. In this paper we study a variant of this problem where the original n input variables are compressed by a random linear transformation to m n examples in p dimensions, and establish conditions under which a sparse linear model can be successfully recovered from the compressed data. A primary motivation for this compression procedure is to anonymize the data and preserve privacy by revealing little information about the original data. We characterize the number of random projections that are required for 1 -regularized compressed regression to identify the nonzero coefficients in the true model with probability approaching one, a property called “sparsistence.” In addition, we show that 1 -regularized compressed regression asymptotically predicts as well as an oracle linear model, a property called “persistence.” Finally, we characterize the privacy properties of the compression procedure in information-theoretic terms, establishing upper bounds on the rate of information communicated between the compressed and uncompressed data that decay to zero. 1
4 0.44633037 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data
Author: Michael Ross, Andrew Cohen
Abstract: This paper describes a new model for human visual classification that enables the recovery of image features that explain human subjects’ performance on different visual classification tasks. Unlike previous methods, this algorithm does not model their performance with a single linear classifier operating on raw image pixels. Instead, it represents classification as the combination of multiple feature detectors. This approach extracts more information about human visual classification than previous methods and provides a foundation for further exploration. 1
5 0.44519958 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images
Author: Matthias Bethge, Philipp Berens
Abstract: Maximum entropy analysis of binary variables provides an elegant way for studying the role of pairwise correlations in neural populations. Unfortunately, these approaches suffer from their poor scalability to high dimensions. In sensory coding, however, high-dimensional data is ubiquitous. Here, we introduce a new approach using a near-maximum entropy model, that makes this type of analysis feasible for very high-dimensional data—the model parameters can be derived in closed form and sampling is easy. Therefore, our NearMaxEnt approach can serve as a tool for testing predictions from a pairwise maximum entropy model not only for low-dimensional marginals, but also for high dimensional measurements of more than thousand units. We demonstrate its usefulness by studying natural images with dichotomized pixel intensities. Our results indicate that the statistics of such higher-dimensional measurements exhibit additional structure that are not predicted by pairwise correlations, despite the fact that pairwise correlations explain the lower-dimensional marginal statistics surprisingly well up to the limit of dimensionality where estimation of the full joint distribution is feasible. 1
6 0.44318488 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
7 0.44236192 18 nips-2007-A probabilistic model for generating realistic lip movements from speech
8 0.44212338 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC
9 0.43972129 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model
10 0.43875977 86 nips-2007-Exponential Family Predictive Representations of State
11 0.43690476 104 nips-2007-Inferring Neural Firing Rates from Spike Trains Using Gaussian Processes
12 0.43619588 100 nips-2007-Hippocampal Contributions to Control: The Third Way
13 0.43543726 69 nips-2007-Discriminative Batch Mode Active Learning
14 0.43524495 158 nips-2007-Probabilistic Matrix Factorization
15 0.43427768 63 nips-2007-Convex Relaxations of Latent Variable Training
16 0.43386042 73 nips-2007-Distributed Inference for Latent Dirichlet Allocation
17 0.43374437 97 nips-2007-Hidden Common Cause Relations in Relational Learning
18 0.43354684 172 nips-2007-Scene Segmentation with CRFs Learned from Partially Labeled Images
19 0.43343222 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons
20 0.43332982 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations