nips nips2007 nips2007-94 knowledge-graph by maker-knowledge-mining

94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

Source: pdf

Author: Kai Yu, Wei Chu

Abstract: This paper aims to model relational data on edges of networks. We describe appropriate Gaussian Processes (GPs) for directed, undirected, and bipartite networks. The inter-dependencies of edges can be effectively modeled by adapting the GP hyper-parameters. The framework suggests an intimate connection between link prediction and transfer learning, which were traditionally two separate research topics. We develop an efﬁcient learning algorithm that can handle a large number of observations. The experimental results on several real-world data sets verify superior learning capacity. 1

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 Gaussian Process Models for Link Analysis and Transfer Learning Kai Yu NEC Laboratories America Cupertino, CA 95014 Wei Chu Columbia University, CCLS New York, NY 10115 Abstract This paper aims to model relational data on edges of networks. [sent-1, score-0.304]

2 We describe appropriate Gaussian Processes (GPs) for directed, undirected, and bipartite networks. [sent-2, score-0.172]

3 The inter-dependencies of edges can be effectively modeled by adapting the GP hyper-parameters. [sent-3, score-0.111]

4 The framework suggests an intimate connection between link prediction and transfer learning, which were traditionally two separate research topics. [sent-4, score-0.408]

5 1 Introduction In many scenarios the data of interest consist of relational observations on the edges of networks. [sent-7, score-0.358]

6 Typically, a given ﬁnite collection of such relational data can be represented as an M × N matrix Y = {yi,j }, which is often partially observed because many elements are missing. [sent-8, score-0.232]

7 Sometimes accompanying Y are attributes of nodes or edges. [sent-9, score-0.177]

8 One notable example is transfer learning, also known as multi-task learning, which jointly learns multiple related but different predictive functions based on the M × N observed labels Y, namely, the results of N functions acting on a set of M data examples. [sent-13, score-0.157]

9 Collaborative ﬁltering is an important application of transfer learning that learns many users’ interests on a large set of items. [sent-14, score-0.124]

10 The data are measurements of existences, strengths, and types of links between a set of nodes in a graph, where a given collection of observations are an M ×M (in this case N = M ) matrix Y, which can be symmetric or asymmetric, depending on whether the links are undirected or directed. [sent-16, score-0.669]

11 Examples include protein-protein interactions, social networks, citation networks, and hyperlinks on the WEB. [sent-17, score-0.096]

12 Link prediction aims to recover those missing measurements in Y, for example, predicting unknown protein-protein interactions based on known interactions. [sent-18, score-0.135]

13 The goal of this paper is to design a Gaussian process (GP) [13] framework to model the dependence structure of networks, and to contribute an efﬁcient algorithm to learn and predict large-scale relational data. [sent-19, score-0.269]

14 We explicitly construct a series of parametric models indexed by their dimensionality, and show that in the limit we obtain nonparametric GP priors consistent with the dependence of edge-wise measurements. [sent-20, score-0.133]

15 Since the kernel matrix is on a quadratic number of edges and the computation cost is even cubic of the kernel size, we develop an efﬁcient algorithm to reduce the computational complexity. [sent-21, score-0.365]

16 We also demonstrate that transfer learning has an intimate connection to link prediction. [sent-22, score-0.374]

17 Our method generalizes several recent transfer learning algorithms by additionally learning a task-speciﬁc kernel that directly expresses the dependence between tasks. [sent-23, score-0.293]

18 1 The application of GPs to learning on networks or graphs has been fairly recent. [sent-24, score-0.147]

19 Most of the work in this direction has focused on GPs over nodes of graphs and targeted at the classiﬁcation of nodes [20, 6, 10]. [sent-25, score-0.214]

20 In this paper, we regard the edges as the ﬁrst-class citizen and develop a general GP framework for modeling the dependence of edge-wise observations on bipartite, undirected and directed graphs. [sent-26, score-0.637]

21 This work extends [19], which built GPs for only bipartite graphs and proposed an algorithm scaling cubically to the number of nodes. [sent-27, score-0.286]

22 Our study promises a careful treatment to model the nature of edge-wise observations and offers a promising tool for link prediction. [sent-29, score-0.258]

23 1 Gaussian Processes for Network Data Modeling Bipartite Graphs We ﬁrst review the edge-wise GP for bipartite graphs [19], where each observation is a measurement on a pair of objects of different types, or under a pair of heterogenous conditions. [sent-31, score-0.334]

24 In the context of transfer learning, the pair involves a data instance i and a task j, and yi,j denotes the label of data i within task j. [sent-33, score-0.124]

25 , N form a matrix F, following a matrix-variate normal distribution NM ×N (B, Σ, Ω), or equivalently a normal distribution N (b, K) with mean b = vec(B) and covariance K = Ω ⊗ Σ, where ⊗ means Kronecker product. [sent-41, score-0.152]

26 The dependence structure of edges is decomposed into the dependence of nodes. [sent-42, score-0.263]

27 Since a kernel is a notion of similarity, the model expresses a prior belief – if node i is similar to node i and node j is similar node j , then so are f (i, j) and f (i , j ). [sent-43, score-0.301]

28 For transfer learning, this means to learn the kernel Σ between data instances and the kernel Ω between tasks. [sent-45, score-0.31]

29 Having Σ and Ω is it then possible to predict those missing yi,j based on known observations by using GP inference. [sent-46, score-0.109]

30 Let f (i, j) = D−1/2 D k=1 gk (i)hk (j) iid + b(i, j), where gk ∼ GP(0, Σ) and iid hk ∼ GP(0, Ω), then f ∼ GP(b, K) in the limit D → ∞, and the covariance between pairs is K ((i, j), (i , j )) = Σ(i, i )Ω(j, j ). [sent-49, score-0.54]

31 The theorem suggests that the GP model for bipartite relational data is a generalization of a Bayesian low-rank matrix factorization F = HG + B, under the prior H ∼ NM ×D (0, Σ, I) and G ∼ NN ×D (0, Ω, I). [sent-56, score-0.491]

32 2 Modeling Directed and Undirected Graphs In this section we model observations on pairs of nodes of the same set U. [sent-59, score-0.104]

33 It turns out that the directed graph is relatively easy to handle while deriving a GP prior for undirected graphs is slightly non-trivial. [sent-61, score-0.574]

34 For the case of directed graphs, we let the function f : U × U → R follow GP(b, K), where the covariance function between edges is K ((i, j), (i , j )) = C(i, i )C(j, j ) (2) and C : U × U → R is a kernel function between nodes. [sent-62, score-0.505]

35 Since a random function f drawn from the GP is generally asymmetric (even if b is symmetric), namely f (i, j) = f (j, i), the direction of edges can be modeled. [sent-63, score-0.139]

36 For the case of undirected graphs, we need to design a GP that ensures any sampled function to be symmetric. [sent-72, score-0.176]

37 1), it seems that f is symmetric if gk ≡ hk for k = 1, . [sent-74, score-0.228]

38 2) shows that the problem can be solved by subtracting a growing quantity D1/2 C(i, j) as D → ∞, and suggests the covariance function K ((i, j), (i , j )) = C(i, i )C(j, j ) + C(i, j )C(j, i ). [sent-80, score-0.113]

39 (3) With such covariance function , f is ensured to be symmetric because the covariance between f (i, j) and f (j, i) equals the variance of either. [sent-81, score-0.255]

40 Let f (i, j) = D−1/2 k=1 tk (i)tk (j)+b(i, j)−D1/2 C(i, j), where tk ∼ GP(0, C), then f ∼ GP(b, K) in the limit D → ∞, and the covariance between pairs is K ((i, j), (i , j )) = C(i, i )C(j, j ) + C(i, j )C(j, i ). [sent-84, score-0.641]

41 The covariance function is Cov(f (i, j), f (i , j )) = D k=1 {E[tk (i)tk (j)tk (i )tk (j )] − C(i, j)E[tk (i )tk (j )] − C(i , j )E[tk (i)tk (j)] + C(i, j)C(i , j )} = C(i, i )C(j, j ) + C(i, j )C(j, i ) + C(i, j)C(i , j ) − C(i, j)C(i , j ) = C(i, i )C(j, j ) + C(i, j )C(j, i ). [sent-89, score-0.113]

42 To see the connection, let hk ∼ GP(0, Σ) and gk ∼ GP(0, Ω) be concatenated to form a function tk , then we have tk ∼ GP(0, C) and the covariance is  Σ(i, j), if i, j ∈ U, (4) C(i, j) = Ω(i, j), if i, j ∈ V,  0, if i, j are in different sets. [sent-93, score-0.81]

43 2) leads to D f (i, j) = D−1/2 D tk (i)tk (j) + b(i, j) − D1/2 C(i, j) = D−1/2 k=1 hk (i)gk (j) + b(i, j), (5) k=1 K ((i, j), (i , j )) = C(i, i )C(j, j ) + C(i, j )C(j, i ) = Σ(i, i )Ω(j, j ). [sent-95, score-0.352]

44 2) suggest a general GP framework to model directed or undirected relationships connecting heterogeneous types of nodes. [sent-98, score-0.455]

45 Basically, we learn node-wise covariance functions, like Σ, Ω, and C, such that edge-wise covariances composed by Eq. [sent-99, score-0.113]

46 The proposed framework can be extended to cope with more complex network data, for example, networks containing both undirected links and directed links. [sent-101, score-0.562]

47 Let y = [yi,j ](i,j)∈O be the observational vector of length |O|, f be the corresponding quantities of the latent function f , and K be the |O| × |O| matrix of K between edges having observations, computed by Eq. [sent-105, score-0.15]

48 Then observations on edges are generated by yi,j = f (i, j) + bi,j + i,j (7) iid where f ∼ N (0, K), i,j ∼ N (0, β −1 ), and the mean has a parametric form bi,j = µi + νj . [sent-107, score-0.243]

49 However the computation can be prohibitively high when the size |O| of measured edges is very big, because the memory cost is O(|O|2 ), and the computational cost is O(|O|3 ). [sent-116, score-0.111]

50 2 and assume K to be composed by node-wise linear kernels Σ(i, i ) = xi , xi , Ω(i, i ) = zj , zj , and C(i, j) = xi , xj , with x ∈ RL1 and z ∈ RL2 . [sent-122, score-0.599]

51 The edge-wise covariance is then • Bipartite Graphs: K ((i, j), (i , j )) = xi ⊗ zj , xi ⊗ zj . [sent-123, score-0.445]

52 • Directed Graphs: K ((i, j), (i , j )) = xi ⊗ xj , xi ⊗ xj . [sent-124, score-0.534]

53 • Undirected Graphs: K ((i, j), (i , j )) = xi ⊗ xj , xi ⊗ xj + xi ⊗ xj , xj ⊗ xi We turn the problem of optimizing K into the problem of optimizing X = [x1 , . [sent-125, score-1.068]

54 It is important to note that in all the cases the kernel matrix has the form K = UU , where U is an |O| × L matrix, L |O|, therefore applying the Woodbury identity C−1 = β[I−U(U U+β −1 I)−1 U ] can dramatically reduce the computational cost. [sent-132, score-0.132]

55 For example, in the bipartite graph case and the directed graph case, respectively there are U = xi ⊗ zj (i,j)∈O , and U = xi ⊗ xj (i,j)∈O , (10) where the rows of U are indexed by (i, j) ∈ O. [sent-133, score-0.887]

56 1 Incorporating Additional Attributes and Learning from Discrete Observations There are different ways to incorporate node or edge attributes into our model. [sent-141, score-0.208]

57 A common practice is to let the kernel K, Σ, or Ω be some parametric function of attributes. [sent-142, score-0.12]

58 However, node or edge attributes are typically local information while the network itself is rather a global dependence structure, thus the network data often has a large part of patterns that are independent of those known predictors. [sent-144, score-0.364]

59 Let Σ0 1 be the covariance that we wish Σ to be apriori close to. [sent-146, score-0.113]

60 If we let Σ0 be the linear kernel of attributes, normalized by the dimensionality, then E(Σ) can be derived from a likelihood of Σ as if each dimension of the attributes is a random sample from GP(0, Σ + γ −1 δ). [sent-149, score-0.22]

61 If the attributes are nonlinear predictors we can conveniently set Σ0 by a nonlinear kernel. [sent-150, score-0.127]

62 We set Σ0 = I if the corresponding attributes are absent. [sent-151, score-0.127]

63 4 Discussions on Related Work Transfer Learning: As we have suggested before, the link prediction for bipartite graphs has a tight connection to transfer learning. [sent-160, score-0.652]

64 To make it clear, let fj (·) = f (·, j), then the edge-wise function f : U × V → R consists of N node-wise functions fj : U → R for j = 1, . [sent-161, score-0.308]

65 If we ﬁx Ω(j, j ) ≡ δ(j, j ), namely a Dirac delta function, then fj are assumed to be i. [sent-165, score-0.154]

66 In particular, the negative logarithm of p {yi,j }, {fj }|Σ is   N N 1  log |Σ|, (14) L {fj }, Σ = l yi,j , fj (i) + f j Σ−1 f j  + 2 2 j=1 i∈Oj where l(yi,j , fj (i)) = − log p(yi,j |fj (i)). [sent-170, score-0.308]

67 It was proven in [3] that if l(·, ·) is convex with fj , then the minimization of (14) is convex with jointly {fj } and Σ. [sent-172, score-0.154]

68 The GP approach differs from the regularization approach in two aspects: (1) fj are treated as random variables which are marginalized out, thus we only need to estimate Σ; (2) The regularization for Σ is a non-convex log-determinant term. [sent-173, score-0.291]

69 From a probabilistic modeling point of view, the independence of {fj } conditioned on Σ is a restrictive assumption and even incorrect when some task-speciﬁc attributes are given (which means that {fj } are not exchangeable anymore). [sent-176, score-0.159]

70 The task-speciﬁc kernel for transfer learning has been recently introduced in [4], which however increased the computational complexity by a factor of N 2 . [sent-177, score-0.217]

71 One contribution of this paper on transfer learning is an algorithm that can efﬁciently solve the learning problem with both data kernel Σ and task kernel Ω. [sent-178, score-0.31]

72 If we enforce Ω(j, j ) = δ(j, j ) in the model of bipartite graphs, then the evidence Eq. [sent-180, score-0.172]

73 In this paper we consider a situation with complex dependence of edges in network graphs. [sent-183, score-0.227]

74 [7] introduced link uncertainty in the framework of probabilistic relational models. [sent-185, score-0.366]

75 Latent-class relational models [17, 11, 1] have been popular, aiming to ﬁnd the block structure of links. [sent-186, score-0.221]

76 The second row is of our results, the third row is of the MMMF results, and the fourth row is of the bilinear results. [sent-191, score-0.161]

77 In cases that the additional attributes on nodes or edges are either unavailable or very weak, we compare our method with max-margin matrix factorization (MMMF) [14] using a square loss, which is similar to singular value decomposition (SVD) but can handle missing measurements. [sent-195, score-0.473]

78 We manually knocked 10 images off as test cases, as presented in Figure 1, and treated each image as a vector that leads to a 103040 × 10 matrix with 103040 missing values, where each column corresponds a view of faces. [sent-198, score-0.148]

79 We also employed the bilinear models introduced by [16], which however does not handle missing data of a matrix, and put the results at the bottom row for comparison. [sent-204, score-0.223]

80 2 Collaborative Filtering Collaborative ﬁltering is a typical case of bipartite graphs, where ratings are measurements on edges of user-item pairs. [sent-207, score-0.392]

81 We tried 20 or 50 principal components as attributes in this experiment and carried out least square regression on observed entries. [sent-213, score-0.127]

82 We treated the citation network as a directed graph and modeled the link existence as binary labels. [sent-316, score-0.546]

83 Our model applied the probit likelihood and learned a node-wise covariance function C, L = 50 × 50, which composes an edge-wise covariance K by Eq. [sent-317, score-0.262]

84 We set the prior covariance C0 by the linear kernel computed by bag-of-word content attributes. [sent-319, score-0.248]

85 Thus the learned linear features encode both link and content information, which were then used for document classiﬁcation. [sent-320, score-0.215]

86 We compare several other methods that provide linear features for one-against-all categorization using SVM: 1) CONTENT: bag-of-words features; 2) LINK: each paper’s citation list; 3) PCA: 50 components by PCA on the concatenation of bag-of-word features and citation list for each paper. [sent-321, score-0.142]

87 6 Conclusion and Extensions In this paper we proposed GPs for modeling data living on links of networks. [sent-326, score-0.157]

88 We described solutions to handle directed and undirected links, as well as links connecting heterogenous nodes. [sent-327, score-0.621]

89 This work paves a way for future extensions for learning more complex relational data. [sent-328, score-0.224]

90 For example, we can model a network containing both directed and undirected links. [sent-329, score-0.404]

91 Let (i, j) be directed and (i , j ) be undirected. [sent-330, score-0.188]

92 (12) √ for undirected links, the covaraince is K((i, j), (i , j )) = 1/ 2[C(i, i )C(j, j ) + C(i, j )C(j, i )], 7 which indicates that dependence between a directed link and an undirected link is penalized compared to dependence between two undirected links. [sent-333, score-1.239]

93 Letting covariance between two different types of nodes be zero, we obtain a huge block-diagonal node-wise covariance matrix, where each block corresponds to one type of nodes. [sent-336, score-0.329]

94 This big covariance matrix will induce the edge-wise covariance for links connecting nodes of the same or different types. [sent-337, score-0.501]

95 In the near future it is promising to apply the model to various link prediction or network completion problems. [sent-338, score-0.247]

96 Xing, Mixed membership stochastic block models for relational data with application to protein-protein interactions. [sent-347, score-0.221]

97 Probabilistic models of text and link structure for hypertext classiﬁcation. [sent-386, score-0.173]

98 Modeling homophily and stochastic equivalence in symmetric relational data. [sent-394, score-0.222]

99 Hyperparameter and kernel learning for graph based semi-supervised classiﬁcation. [sent-402, score-0.14]

100 Learning systems of concepts with an inﬁnite relational model. [sent-412, score-0.193]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('gp', 0.557), ('tk', 0.249), ('relational', 0.193), ('directed', 0.188), ('undirected', 0.176), ('link', 0.173), ('bipartite', 0.172), ('mmmf', 0.171), ('xj', 0.158), ('fj', 0.154), ('attributes', 0.127), ('links', 0.125), ('transfer', 0.124), ('informaiton', 0.12), ('graphs', 0.114), ('covariance', 0.113), ('edges', 0.111), ('xi', 0.109), ('hk', 0.103), ('gk', 0.096), ('gps', 0.096), ('kernel', 0.093), ('dependence', 0.076), ('citation', 0.071), ('rmse', 0.068), ('pearson', 0.067), ('collaborative', 0.064), ('ratings', 0.063), ('gplvm', 0.057), ('zj', 0.057), ('missing', 0.055), ('observations', 0.054), ('users', 0.053), ('node', 0.052), ('yu', 0.052), ('iid', 0.051), ('eachmovie', 0.05), ('bilinear', 0.05), ('nodes', 0.05), ('handle', 0.049), ('ean', 0.048), ('earson', 0.048), ('ethods', 0.048), ('getoor', 0.048), ('heterogenous', 0.048), ('umist', 0.048), ('graph', 0.047), ('measurements', 0.046), ('movie', 0.046), ('papers', 0.045), ('theorem', 0.045), ('chu', 0.044), ('tresp', 0.042), ('content', 0.042), ('factorization', 0.042), ('cora', 0.042), ('intimate', 0.042), ('regularization', 0.041), ('tr', 0.041), ('network', 0.04), ('gaussian', 0.039), ('matrix', 0.039), ('row', 0.037), ('rating', 0.037), ('probit', 0.036), ('ha', 0.036), ('pl', 0.036), ('mae', 0.036), ('connecting', 0.035), ('pca', 0.035), ('connection', 0.035), ('prediction', 0.034), ('networks', 0.033), ('predictive', 0.033), ('employed', 0.032), ('modeling', 0.032), ('dirac', 0.032), ('offers', 0.031), ('extensions', 0.031), ('movies', 0.031), ('heterogeneous', 0.031), ('dimensionality', 0.03), ('limit', 0.03), ('cubic', 0.029), ('symmetric', 0.029), ('edge', 0.029), ('asymmetric', 0.028), ('marginalized', 0.028), ('block', 0.028), ('images', 0.027), ('taskar', 0.027), ('nm', 0.027), ('treated', 0.027), ('parametric', 0.027), ('tenenbaum', 0.027), ('big', 0.026), ('user', 0.025), ('types', 0.025), ('penalized', 0.025), ('social', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

Author: Kai Yu, Wei Chu

2 0.29059827 104 nips-2007-Inferring Neural Firing Rates from Spike Trains Using Gaussian Processes

Author: Maneesh Sahani, Byron M. Yu, John P. Cunningham, Krishna V. Shenoy

Abstract: Neural spike trains present challenges to analytical efforts due to their noisy, spiking nature. Many studies of neuroscientiﬁc and neural prosthetic importance rely on a smoothed, denoised estimate of the spike train’s underlying ﬁring rate. Current techniques to ﬁnd time-varying ﬁring rates require ad hoc choices of parameters, offer no conﬁdence intervals on their estimates, and can obscure potentially important single trial variability. We present a new method, based on a Gaussian Process prior, for inferring probabilistically optimal estimates of ﬁring rate functions underlying single or multiple neural spike trains. We test the performance of the method on simulated data and experimentally gathered neural spike trains, and we demonstrate improvements over conventional estimators. 1

3 0.28303322 32 nips-2007-Bayesian Co-Training

Author: Shipeng Yu, Balaji Krishnapuram, Harald Steck, R. B. Rao, Rómer Rosales

Abstract: We propose a Bayesian undirected graphical model for co-training, or more generally for semi-supervised multi-view learning. This makes explicit the previously unstated assumptions of a large class of co-training type algorithms, and also clariﬁes the circumstances under which these assumptions fail. Building upon new insights from this model, we propose an improved method for co-training, which is a novel co-training kernel for Gaussian process classiﬁers. The resulting approach is convex and avoids local-maxima problems, unlike some previous multi-view learning methods. Furthermore, it can automatically estimate how much each view should be trusted, and thus accommodate noisy or unreliable views. Experiments on toy data and real world data sets illustrate the beneﬁts of this approach. 1

4 0.27936146 135 nips-2007-Multi-task Gaussian Process Prediction

Author: Edwin V. Bonilla, Kian M. Chai, Christopher Williams

Abstract: In this paper we investigate multi-task learning in the context of Gaussian Processes (GP). We propose a model that learns a shared covariance function on input-dependent features and a “free-form” covariance matrix over tasks. This allows for good ﬂexibility when modelling inter-task dependencies while avoiding the need for large amounts of data for training. We show that under the assumption of noise-free observations and a block design, predictions for a given task only depend on its target values and therefore a cancellation of inter-task transfer occurs. We evaluate the beneﬁts of our model on two practical applications: a compiler performance prediction problem and an exam score prediction task. Additionally, we make use of GP approximations and properties of our model in order to provide scalability to large data sets. 1

5 0.23592907 97 nips-2007-Hidden Common Cause Relations in Relational Learning

Author: Ricardo Silva, Wei Chu, Zoubin Ghahramani

Abstract: When predicting class labels for objects within a relational database, it is often helpful to consider a model for relationships: this allows for information between class labels to be shared and to improve prediction performance. However, there are different ways by which objects can be related within a relational database. One traditional way corresponds to a Markov network structure: each existing relation is represented by an undirected edge. This encodes that, conditioned on input features, each object label is independent of other object labels given its neighbors in the graph. However, there is no reason why Markov networks should be the only representation of choice for symmetric dependence structures. Here we discuss the case when relationships are postulated to exist due to hidden common causes. We discuss how the resulting graphical model differs from Markov networks, and how it describes different types of real-world relational processes. A Bayesian nonparametric classiﬁcation model is built upon this graphical representation and evaluated with several empirical studies. 1 Contribution Prediction problems, such as classiﬁcation, can be easier when class labels share a sort of relational dependency that is not accounted by the input features [10]. If the variables to be predicted are attributes of objects in a relational database, such dependencies are often postulated from the relations that exist in the database. This paper proposes and evaluates a new method for building classiﬁers that uses information concerning the relational structure of the problem. Consider the following standard example, adapted from [3]. There are different webpages, each one labeled according to some class (e.g., “student page” or “not a student page”). Features such as the word distribution within the body of each page can be used to predict each webpage’s class. However, webpages do not exist in isolation: there are links connecting them. Two pages having a common set of links is evidence for similarity between such pages. For instance, if W1 and W3 both link to W2 , this is commonly considered to be evidence for W1 and W3 having the same class. One way of expressing this dependency is through the following Markov network [5]: ∗ Now at the Statistical Laboratory, University of Cambridge. E-mail: silva@statslab.cam.ac.uk F1 F2 F 3 C1 C2 C3 Here Fi are the features of page Wi , and Ci is its respective page label. Other edges linking F variables to C variables (e.g., F1 −C2 ) can be added without affecting the main arguments presented in this section. The semantics of the graph, for a ﬁxed input feature set {F1 , F2 , F3 }, are as follows: C1 is marginally dependent on C3 , but conditionally independent given C2 . Depending on the domain, this might be either a suitable or unsuitable representation of relations. For instance, in some domains it could be the case that the most sensible model would state that C1 is only informative about C3 once we know what C2 is: that is, C1 and C3 are marginally independent, but dependent given C2 . This can happen if the existence of a relation (Ci , Cj ) corresponds to the existence of hidden common causes generating this pair of random variables. Consider the following example, loosely based on a problem described by [12]. We have three objects, Microsoft (M ), Sony (S) and Philips (P ). The task is a regression task where we want to predict the stock market price of each company given its proﬁtability from last year. The given relationships are that M and S are direct competitors (due to the videogame console market), as well S and P (due to the TV set market). M.Profit S.Profit P.Profit M.Profit S.Profit P.Profit M.Profit S.Profit P.Profit M.Stock S.Stock P.Stock M.Stock S.Stock P.Stock M.Stock S.Stock P.Stock εm εs εp εm εs εp (a) (b) (c) Figure 1: (a) Assumptions that relate Microsoft, Sony and Philips stock prices through hidden common cause mechanisms, depicted as unlabeled gray vertices; (b) A graphical representation for generic hidden common causes relationships by using bi-directed edges; (c) A depiction of the same relationship skeleton by a Markov network model, which has different probabilistic semantics. It is expected that several market factors that affect stock prices are unaccounted by the predictor variable Past Year Proﬁt. For example, a shortage of Microsoft consoles is a hidden common factor for both Microsoft’s and Sony’s stock. Another hidden common cause would be a high price for Sony’s consoles. Assume here that these factors have no effect on Philips’ stock value. A depiction of several hidden common causes that correpond to the relations Competitor(M, S) and Competitor(S, P ) is given in Figure 1(a) as unlabeled gray vertices. Consider a linear regression model for this setup. We assume that for each object Oi ∈ {M, S, P }, the stock price Oi .Stock, centered at the mean, is given by Oi .Stock = β × Oi .P rof it + where each i i (1) is a Gaussian random variable. The fact that there are several hidden common causes between M and S can be modeled by the covariance of m and s , σms . That is, unlike in standard directed Gaussian models, σms is allowed to be non-zero. The same holds for σsp . Covariances of error terms of unrelated objects should be zero (σmp = 0). This setup is very closely related to the classic seemingly unrelated regression model popular in economics [12]. A graphical representation for this type of model is the directed mixed graph (DMG) [9, 11], with bi-directed edges representing the relationship of having hidden common causes between a pair of vertices. This is shown in Figure 1(b). Contrast this to the Markov network representation in Figure 1(c). The undirected representation encodes that m and p are marginally dependent, which does not correspond to our assumptions1 . Moreover, the model in Figure 1(b) states that once we observe Sony’s stock price, Philip’s stocks (and proﬁt) should have a non-zero association with Microsoft’s proﬁt: this follows from a extension of d-separation to DMGs [9]. This is expected from the assumptions (Philip’s stocks should tell us something about Microsoft’s once we know Sony’s, but not before it), but does not hold in the graphical model in Figure 1(c). While it is tempting to use Markov networks to represent relational models (free of concerns raised by cyclic directed representations), it is clear that there are problems for which they are not a sensible choice. This is not to say that Markov networks are not the best representation for large classes of relational problems. Conditional random ﬁelds [4] are well-motivated Markov network models for sequence learning. The temporal relationship is closed under marginalization: if we do not measure some steps in the sequence, we will still link the corresponding remaining vertices accordingly, as illustrated in Figure 2. Directed mixed graphs are not a good representation for this sequence structure. X1 X2 X3 X4 X5 Y1 Y2 Y3 Y4 Y5 X1 X2 X3 X4 X5 Y1 Y2 Y3 Y4 Y5 (a) (b) X1 X3 X5 Y1 Y3 Y5 (c) Figure 2: (a) A conditional random ﬁeld (CRF) graph for sequence data; (b) A hypothetical scenario where two of the time slices are not measured, as indicated by dashed boxes; (c) The resulting CRF graph for the remaining variables, which corresponds to the same criteria for construction of (a). To summarize, the decision between using a Markov network or a DMG reduces to the following modeling issue: if two unlinked object labels yi , yj are statistically associated when some chain of relationships exists between yi and yj , then the Markov network semantics should apply (as in the case for temporal relationships). However, if the association arises only given the values of the other objects in the chain, then this is accounted by the dependence semantics of the directed mixed graph representation. The DMG representation propagates training data information through other training points. The Markov network representation propagates training data information through test points. Propagation through training points is relevant in real problems. For instance, in a webpage domain where each webpage has links to pages of several kinds (e.g., [3]), a chain of intermediated points between two classes labels yi and yj is likely to be more informative if we know the values of the labels in this chain. The respective Markov network would ignore all training points in this chain besides the endpoints. In this paper, we introduce a non-parametric classiﬁcation model for relational data that factorizes according to a directed mixed graph. Sections 2 and 3 describes the model and contrasts it to a closely related approach which bears a strong analogy to the Markov network formulation. Experiments in text classiﬁcation are described in Section 4. 2 Model Chu et al. [2] describe an approach for Gaussian process classiﬁcation using relational information, which we review and compare to our proposed model. Previous approach: relational Gaussian processes through indicators − For each point x in the input space X , there is a corresponding function value fx . Given observed input points x1 , x2 , . . . , xn , a Gaussian process prior over f = [f1 , f2 , . . . , fn ]T has the shape P(f ) = 1 (2π)n/2 |Σ|1/2 1 exp − f T Σ−1 f 2 (2) 1 For Gaussian models, the absence of an edge in the undirected representation (i.e., Gaussian Markov random ﬁelds) corresponds to a zero entry in the inverse covariance matrix, where in the DMG it corresponds to a zero in the covariance matrix [9]. X1 X2 f1 X3 X2 X3 X1 X2 X3 f3 f2 ξ 12 X1 f1 f2 f3 f1 f2 f3 Y2 Y3 ξ 23 Y1 Y2 Y3 Y1 Y2 Y3 Y1 ε1 ε2 ε3 ε1 ε2 ε3 ε1 (a) ζ1 (b) ζ2 ε2 ζ3 ε3 (c) Figure 3: (a) A prediction problem where y3 is unknown and the training set is composed of other two datapoints. Dependencies between f1 , f2 and f3 are given by a Gaussian process prior and not represented in the picture. Indicators ξij are known and set to 1; (b) The extra associations that arise by conditioning on ξ = 1 can be factorized as the Markov network model here depicted, in the spirit of [9]; (c) Our proposed model, which ties the error terms and has origins in known statistical models such as seemingly unrelated regression and structural equation models [11]. where the ijth entry of Σ is given by a Mercer kernel function K(xi , xj ) [8]. The idea is to start from a standard Gaussian process prior, and add relational information by conditioning on relational indicators. Let ξij be an indicator that assumes different values, e.g., 1 or 0. The indicator values are observed for each pair of data points (xi , xj ): they are an encoding of the given relational structure. A model for P (ξij = 1|fi , fj ) is deﬁned. This evidence is incorporated into the Gaussian process by conditioning on all indicators ξij that are positive. Essentially, the idea boils down to using P(f |ξ = 1) as the prior for a Gaussian process classiﬁer. Figure 3(a) illustrates a problem with datapoints {(x1 , y1 ), (x2 , y2 ), (x3 , y3 )}. Gray vertices represent unobserved variables. Each yi is a binary random variable, with conditional probability given by P(yi = 1|fi ) = Φ(fi /σ) (3) where Φ(·) is the standard normal cumulative function and σ is a hyperparameter. This can be interpreted as the cumulative distribution of fi + i , where fi is given and i is a normal random variable with zero mean and variance σ 2 . In the example of Figure 3(a), one has two relations: (x1 , x2 ), (x2 , x3 ). This information is incorporated by conditioning on the evidence (ξ12 = 1, ξ23 = 1). Observed points (x1 , y1 ), (x2 , y2 ) form the training set. The prediction task is to estimate y3 . Notice that ξ12 is not used to predict y3 : the Markov blanket for f3 includes (f1 , f2 , ξ23 , y3 , 3 ) and the input features. Essentially, conditioning on ξ = 1 corresponds to a pairwise Markov network structure, as depicted in Figure 3(b) [9]2 . Our approach: mixed graph relational model − Figure 3(c) illustrates our proposed setup. For reasons that will become clear in the sequel, we parameterize the conditional probability of yi as √ (4) P(yi = 1|gi , vi ) = Φ(gi / vi ) where gi = fi + ζi . As before, Equation (4) can be interpreted as the cumulative distribution of 2 gi + i , with i as a normal random variable with zero mean and variance vi = σ 2 − σζi , the last term being the variance of ζi . That is, we break the original error term as i = ζi + i , where i and j are independent for all i = j. Random vector ζ is a multivariate normal with zero mean and covariance matrix Σζ . The key aspect in our model is that the covariance of ζi and ζj is non-zero only if objects i and j are related (that is, bi-directed edge yi ↔ yj is in the relational graph). Parameterizing Σζ for relational problems is non-trivial and discussed in the next section. In the example of Figure 3, one noticeable difference of our model 3(c) to a standard Markov network models 3(b) is that now the Markov blanket for f3 includes error terms for all variables (both and ζ terms), following the motivation presented in Section 1. 2 In the ﬁgure, we are not representing explicitly that f1 , f2 and f3 are not independent (the prior covariance matrix Σ is complete). The ﬁgure is meant as a representation of the extra associations that arise when conditioning on ξ = 1, and the way such associations factorize. As before, the prior for f in our setup is the Gaussian process prior (2). This means that g has the following Gaussian process prior (implicitly conditioned on x): P(g) = 1 (2π)n/2 |R|1/2 1 exp − g R−1 g 2 (5) where R = K + Σζ is the covariance matrix of g = f + ζ, with Kij = K(xi , xj ). 3 Parametrizing a mixed graph model for relational classiﬁcation For simplicity, in this paper we will consider only relationships that induce positive associations between labels. Ideally, the parameterization of Σζ has to fulﬁll two desiderata: (i). it should respect the marginal independence constraints as encoded by the graphical model (i.e., zero covariance for vertices that are not adjacent), and be positive deﬁnite; (ii). it has to be parsimonious in order to facilitate hyperparameter selection, both computationally and statistically. Unlike the multivariate analysis problems in [11], the size of our covariance matrix grows with the number of data points. As shown by [11], exact inference in models with covariance matrices with zero-entry constraints is computationally demanding. We provide two alternative parameterizations that are not as ﬂexible, but which lead to covariance matrices that are simple to compute and easy to implement. We will work under the transductive scenario, where training and all test points are given in advance. The corresponding graph thus contain unobserved and observed label nodes. 3.1 Method I The ﬁrst method is an automated method to relax some of the independence constraints, while guaranteeing positive-deﬁniteness, and a parameterization that depends on a single scalar ρ. This allows for more efﬁcient inference and is done as follows: 1. Let Gζ be the corresponding bi-directed subgraph of our original mixed graph, and let U0 be a matrix with n × n entries, n being the number of nodes in Gζ 2. Set U0 to be the number of cliques in Gζ where yi and yj appear together; ij 3. Set U0 to be the number of cliques containing yi , plus a small constant ∆; ii 4. Set U to be the corresponding correlation matrix obtained by intepreting U0 as a covariance matrix and rescaling it; Finally, set Σζ = ρU, where ρ ∈ [0, 1] is a given hyperparameter. Matrix U is always guaranteed to be positive deﬁnite: it is equivalent to obtaining the covariance matrix of y from a linear latent variable model, where there is an independent standard Gaussian latent variable as a common parent to every clique, and every observed node yi is given by the sum of its parents plus an independent error term of variance ∆. Marginal independencies are respected, since independent random variables will never be in a same clique in Gζ . In practice, this method cannot be used as is since the number of cliques will in general grow at an exponential rate as a function of n. Instead, we ﬁrst triangulate the graph: in this case, extracting cliques can be done in polynomial time. This is a relaxation of the original goal, since some of the original marginal independence constraints will not be enforced due to the triangulation3. 3.2 Method II The method suggested in the previous section is appealing under the assumption that vertices that appear in many common cliques are more likely to have more hidden common causes, and hence should have stronger associations. However, sometimes the triangulation introduces bad artifacts, with lots of marginal independence constraints being violated. In this case, this will often result in a poor prediction performance. A cheap alternative approach is not generating cliques, and instead 3 The need for an approximation is not a shortcoming only of the DMG approach. Notice that the relational Gaussian process of [2] also requires an approximation of its relational kernel. 10 10 10 20 20 20 30 30 30 40 40 40 50 50 50 60 60 60 70 70 70 80 80 80 90 90 90 100 100 100 10 20 30 40 50 60 (a) 70 80 90 100 10 20 30 40 50 60 70 80 90 (b) 100 10 20 30 40 50 60 70 80 90 100 (c) Figure 4: (a) The link matrix for the political books dataset. (b) The relational kernel matrix obtained with the approximated Method I. (c) The kernel matrix obtained with Method II, which tends to produce much weaker associations but does not introduce spurious relations. getting a marginal covariance matrix from a different latent variable model. In this model, we create an independent standard Gaussian variable for each edge yi ↔ yj instead of each clique. No triangulation will be necessary, and all marginal independence constraints will be respected. This, however, has shortcomings of its own: for all pairs (yi , yj ) connected by an edge, it will be the case that U0 = 1, while U0 can be as large as n. This means that the resulting correlation in Uij can be ij ii close to zero even if yi and yj are always in the same cliques. In Section 4, we will choose between Methods I and II according to the marginal likelihood of the model. 3.3 Algorithm Recall that our model is a Gaussian process classiﬁer with error terms i of variance σ such that i = ζi + i . Without loss of generality, we will assume that σ = 1. This results in the following parameterization of the full error covariance matrix: Σ = (1 − ρ)I + ρU (6) where I is an n × n identity matrix. Matrix (1 − ρ)I corresponds to the covariance matrix Σ . The usefulness of separating as and ζ becomes evident when we use an expectation-propagation (EP) algorithm [7] to perform inference in our relational classiﬁer. Instead of approximating the posterior of f , we approximate the posterior density P(g|D), D = {(x1 , y1 ), . . . , (xn , yn )} being ˜ the given training data. The approximate posterior has the form Q(g) ∝ P(g) i ti (gi ) where P(g) is the Gaussian process prior with kernel matrix R = K + Σζ as deﬁned in the previous section. Since the covariance matrix Σ is diagonal, the true likelihood of y given g factorizes n over each datapoint: P(y|g) = i=1 P(yi |gi ), and standard EP algorithms for Gaussian process classiﬁcation can be used [8] (with the variance given by Σ instead of Σ , and kernel matrix R instead of K). The ﬁnal algorithm deﬁnes a whole new class of relational models, depends on a single hyperparameter ρ which can be optimized by grid search in [0, 1], and requires virtually no modiﬁcation of code written for EP-based Gaussian process classiﬁers4 . 4 Results We now compare three different methods in relational classiﬁcation tasks. We will compare a standard Gaussian process classiﬁer (GPC), the relational Gaussian process (RGP) of [2] and our method, the mixed graph Gaussian process (XGP). A linear kernel K(x, z) = x · z is used, as described by [2]. We set ∆ = 10−4 and the hyperparameter ρ is found by a grid search in the space {0.1, 0.2, 0.3, . . . , 1.0} maximizing the approximate EP marginal likelihood5. 4 We provide MATLAB/Octave code for our method in http://www.statslab.cam.ac.uk/∼silva. For triangulation, we used the MATLAB implementation of the Reverse Cuthill McKee vertex ordering available at http://people.scs.fsu.edu/∼burkardt/m src/rcm/rcm.html 5 Table 1: The averaged AUC scores of citation prediction on test cases of the Cora database are recorded along with standard deviation over 100 trials. “n” denotes the number of papers in one class. “Citations” denotes the citation count within the two paper classes. Group n Citations GPC GPC with Citations XGP 5vs1 346/488 2466 0.905 ± 0.031 0.891 ± 0.022 0.945 ± 0.053 5vs2 346/619 3417 0.900 ± 0.032 0.905 ± 0.044 0.933 ± 0.059 5vs3 346/1376 3905 0.863 ± 0.040 0.893 ± 0.017 0.883 ± 0.013 5vs4 346/646 2858 0.916 ± 0.030 0.887 ± 0.018 0.951 ± 0.042 5vs6 346/281 1968 0.887 ± 0.054 0.843 ± 0.076 0.955 ± 0.041 5vs7 346/529 2948 0.869 ± 0.045 0.867 ± 0.041 0.926 ± 0.076 4.1 Political books We consider ﬁrst a simple classiﬁcation problem where the goal is to classify whether a particular book is of liberal political inclination or not. The features of each book are given by the words in the Amazon.com front page for that particular book. The choice of books, labels, and relationships are given in the data collected by Valdis Krebs and available at http://www-personal.umich.edu/ mejn/netdata. The data containing book features can be found at http://www.statslab.cam.ac.uk/∼silva. There are 105 books, 43 of which are labeled as liberal books. The relationships are pairs of books which are frequently purchased together by a same customer. Notice this is an easy problem, where labels are strongly associated if they share a relationship. We performed evaluation by sampling 100 times from the original pool of books, assigning half of them as trainining data. The evaluation criterion was the area under the curve (AUC) for this binary problem. This is a problem where Method I is suboptimal. Figure 4(a) shows the original binary link matrix. Figure 4(b) depicts the corresponding U0 matrix obtained with Method I, where entries closer to red correspond to stronger correlations. Method II gives a better performance here (Method I was better in the next two experiments). The AUC result for GPC was of 0.92, while both RGP and XGP achieved 0.98 (the difference between XGP and GPC having a std. deviation of 0.02). 4.2 Cora The Cora collection [6] contains over 50,000 computer science research papers including bibliographic citations. We used a subset in our experiment. The subset consists of 4,285 machine learning papers categorized into 7 classes. The second column of Table 1 shows the class sizes. Each paper was preprocessed as a bag-of-words, a vector of “term frequency” components scaled by “inverse document frequency”, and then normalized to unity length. This follows the pre-processing used in [2]. There is a total of 20,082 features. For each class, we randomly selected 1% of the labelled samples for training and tested on the remainder. The partition was repeated 100 times. We used the fact that the database is composed of fairly specialized papers as an illustration of when XGP might not be as optimal as RGP (whose AUC curves are very close to 1), since the population of links tends to be better separated between different classes (but this is also means that the task is fairly easy, and differences disappear very rapidly with increasing sample sizes). The fact there is very little training data also favors RGP, since XGP propagates information through training points. Still, XGP does better than the non-relational GPC. Notice that adding the citation adjacency matrix as a binary input feature for each paper does not improve the performance of the GPC, as shown in Table 1. Results for other classes are of similar qualitative nature and not displayed here. 4.3 WebKB The WebKB dataset consists of homepages from 4 different universities: Cornell, Texas, Washington and Wisconsin [3]. Each webpage belongs to one out of 7 categories: student, professor, course, project, staff, department and “other”. The relations come from actual links in the webpages. There is relatively high heterogeneity of types of links in each page: in terms of mixed graph modeling, this linkage mechanism is explained by a hidden common cause (e.g., a student and a course page are associated because that person’s interest in enrolling as a student also creates demand for a course). The heterogeneity also suggests that two unlinked pages should not, on average, have an association if they link to a common page W . However, observing the type of page W might create Table 2: Comparison of the three algorithms on the task “other” vs. “not-other” in the WebKB domain. Results for GPC and RGP taken from [2]. The same partitions for training and test are used to generate the results for XGP. Mean and standard deviation of AUC results are reported. University Numbers Other or Not Other All Link GPC RGP XGP Cornell 617 865 13177 0.708 ± 0.021 0.884 ± 0.025 0.917 ± 0.022 Texas 571 827 16090 0.799 ± 0.021 0.906 ± 0.026 0.949 ± 0.015 Washington 939 1205 15388 0.782 ± 0.023 0.877 ± 0.024 0.923 ± 0.016 Wisconsin 942 1263 21594 0.839 ± 0.014 0.899 ± 0.015 0.941 ± 0.018 the association. We compare how the three algorithms perform when trying to predict if a webpage is of class “other” or not (the other classiﬁcations are easier, with smaller differences. Results are omitted for space purposes). The proportion of “other” to non-“other” is about 4:1, which makes the area under the curve (AUC) a more suitable measure of success. We used the same 100 subsamples from [2], where 10% of the whole data is sampled from the pool for a speciﬁc university, and the remaining is used for test. We also used the same features as in [2], pre-processed as described in the previous section. The results are shown in Table 2. Both relational Gaussian processes are far better than the non-relational GPC. XGP gives signiﬁcant improvements over RGP in all four universities. 5 Conclusion We introduced a new family of relational classiﬁers by extending a classical statistical model [12] to non-parametric relational classiﬁcation. This is inspired by recent advances in relational Gaussian processes [2] and Bayesian inference for mixed graph models [11]. We showed empirically that modeling the type of latent phenomena that our approach postulates can sometimes improve prediction performance in problems traditionally approached by Markov network structures. Several interesting problems can be treated in the future. It is clear that there are many different ways by which the relational covariance matrix can be parameterized. Intermediate solutions between Methods I and II, approximations through matrix factorizations and graph cuts are only a few among many alternatives that can be explored. Moreover, there is a relationship between our model and multiple kernel learning [1], where one of the kernels comes from error covariances. This might provide alternative ways of learning our models, including multiple types of relationships. Acknowledgements: We thank Vikas Sindhwani for the preprocessed Cora database. References [1] F. Bach, G. Lanckriet, and M. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. 21st International Conference on Machine Learning, 2004. [2] W. Chu, V. Sindhwani, Z. Ghahramani, and S. Keerthi. Relational learning with Gaussian processes. Neural Information Processing Systems, 2006. [3] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to extract symbolic knowledge from the World Wide Web. Proceedings of AAAI’98, pages 509–516, 1998. [4] J. Lafferty, A. McCallum, and F. Pereira. Conditional random ﬁelds: Probabilistic models for segmenting and labeling sequence data. 18th International Conference on Machine Learning, 2001. [5] S. Lauritzen. Graphical Models. Oxford University Press, 1996. [6] A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Automating the construction of Internet portals with machine learning. Information Retrieval Journal, 3:127–163, 2000. [7] T. Minka. A family of algorithms for approximate Bayesian inference. PhD Thesis, MIT, 2001. [8] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006. [9] T. Richardson and P. Spirtes. Ancestral graph Markov models. Annals of Statistics, 30:962–1030, 2002. [10] P. Sen and L. Getoor. Link-based classiﬁcation. Report CS-TR-4858, University of Maryland, 2007. [11] R. Silva and Z. Ghahramani. Bayesian inference for Gaussian mixed graph models. UAI, 2006. [12] A. Zellner. An efﬁcient method of estimating seemingly unrelated regression equations and tests for aggregation bias. Journal of the American Statistical Association, 1962.

6 0.23236774 170 nips-2007-Robust Regression with Twinned Gaussian Processes

7 0.22832209 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes

8 0.14597172 161 nips-2007-Random Projections for Manifold Learning

9 0.14440607 156 nips-2007-Predictive Matrix-Variate t Models

10 0.12935096 195 nips-2007-The Generalized FITC Approximation

11 0.11166268 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs

12 0.10715201 41 nips-2007-COFI RANK - Maximum Margin Matrix Factorization for Collaborative Ranking

13 0.10554555 92 nips-2007-Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations

14 0.10448773 157 nips-2007-Privacy-Preserving Belief Propagation and Sampling

15 0.097338662 207 nips-2007-Transfer Learning using Kolmogorov Complexity: Basic Theory and Empirical Evaluations

16 0.089764178 131 nips-2007-Modeling homophily and stochastic equivalence in symmetric relational data

17 0.0728423 103 nips-2007-Inferring Elapsed Time from Stochastic Neural Processes

18 0.072472297 6 nips-2007-A General Boosting Method and its Application to Learning Ranking Functions for Web Search

19 0.071555838 118 nips-2007-Learning with Transformation Invariant Kernels

20 0.06926427 63 nips-2007-Convex Relaxations of Latent Variable Training

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.292), (1, 0.13), (2, -0.1), (3, 0.136), (4, -0.097), (5, -0.046), (6, -0.349), (7, -0.22), (8, 0.008), (9, -0.027), (10, -0.314), (11, -0.116), (12, -0.061), (13, 0.152), (14, -0.132), (15, -0.098), (16, -0.028), (17, -0.112), (18, 0.002), (19, 0.017), (20, 0.004), (21, 0.05), (22, -0.026), (23, 0.057), (24, -0.035), (25, 0.029), (26, -0.02), (27, -0.176), (28, -0.039), (29, 0.038), (30, -0.005), (31, -0.043), (32, -0.054), (33, 0.002), (34, -0.094), (35, -0.046), (36, -0.023), (37, 0.01), (38, 0.016), (39, 0.062), (40, -0.06), (41, 0.04), (42, -0.013), (43, -0.065), (44, -0.008), (45, -0.042), (46, -0.039), (47, -0.055), (48, 0.046), (49, 0.083)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95282108 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

Author: Kai Yu, Wei Chu

2 0.69015801 32 nips-2007-Bayesian Co-Training

Author: Shipeng Yu, Balaji Krishnapuram, Harald Steck, R. B. Rao, Rómer Rosales

3 0.65692121 135 nips-2007-Multi-task Gaussian Process Prediction

Author: Edwin V. Bonilla, Kian M. Chai, Christopher Williams

4 0.65583897 97 nips-2007-Hidden Common Cause Relations in Relational Learning

Author: Ricardo Silva, Wei Chu, Zoubin Ghahramani

5 0.60344273 104 nips-2007-Inferring Neural Firing Rates from Spike Trains Using Gaussian Processes

Author: Maneesh Sahani, Byron M. Yu, John P. Cunningham, Krishna V. Shenoy

6 0.59431136 170 nips-2007-Robust Regression with Twinned Gaussian Processes

7 0.56631601 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes

8 0.52030343 195 nips-2007-The Generalized FITC Approximation

9 0.49232736 156 nips-2007-Predictive Matrix-Variate t Models

10 0.4389759 131 nips-2007-Modeling homophily and stochastic equivalence in symmetric relational data

11 0.37129471 158 nips-2007-Probabilistic Matrix Factorization

12 0.36050457 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

13 0.35217834 28 nips-2007-Augmented Functional Time Series Representation and Forecasting with Gaussian Processes

14 0.34862852 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs

15 0.34200349 207 nips-2007-Transfer Learning using Kolmogorov Complexity: Basic Theory and Empirical Evaluations

16 0.34017858 24 nips-2007-An Analysis of Inference with the Universum

17 0.33771887 174 nips-2007-Selecting Observations against Adversarial Objectives

18 0.33457217 19 nips-2007-Active Preference Learning with Discrete Choice Data

19 0.32864147 157 nips-2007-Privacy-Preserving Belief Propagation and Sampling

20 0.3235175 118 nips-2007-Learning with Transformation Invariant Kernels

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.048), (11, 0.026), (13, 0.027), (16, 0.034), (18, 0.016), (21, 0.129), (31, 0.031), (32, 0.105), (34, 0.022), (35, 0.047), (47, 0.094), (49, 0.012), (64, 0.011), (83, 0.15), (85, 0.056), (87, 0.036), (88, 0.027), (90, 0.057)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91956496 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

Author: Kai Yu, Wei Chu

2 0.91046113 146 nips-2007-On higher-order perceptron algorithms

Author: Claudio Gentile, Fabio Vitale, Cristian Brotto

Abstract: A new algorithm for on-line learning linear-threshold functions is proposed which efﬁciently combines second-order statistics about the data with the ”logarithmic behavior” of multiplicative/dual-norm algorithms. An initial theoretical analysis is provided suggesting that our algorithm might be viewed as a standard Perceptron algorithm operating on a transformed sequence of examples with improved margin properties. We also report on experiments carried out on datasets from diverse domains, with the goal of comparing to known Perceptron algorithms (ﬁrst-order, second-order, additive, multiplicative). Our learning procedure seems to generalize quite well, and converges faster than the corresponding multiplicative baseline algorithms. 1 Introduction and preliminaries The problem of on-line learning linear-threshold functions from labeled data is one which have spurred a substantial amount of research in Machine Learning. The relevance of this task from both the theoretical and the practical point of view is widely recognized: On the one hand, linear functions combine ﬂexiblity with analytical and computational tractability, on the other hand, online algorithms provide efﬁcient methods for processing massive amounts of data. Moreover, the widespread use of kernel methods in Machine Learning (e.g., [24]) have greatly improved the scope of this learning technology, thereby increasing even further the general attention towards the speciﬁc task of incremental learning (generalized) linear functions. Many models/algorithms have been proposed in the literature (stochastic, adversarial, noisy, etc.) : Any list of references would not do justice of the existing work on this subject. In this paper, we are interested in the problem of online learning linear-threshold functions from adversarially generated examples. We introduce a new family of algorithms, collectively called the Higher-order Perceptron algorithm (where ”higher” means here ”higher than one”, i.e., ”higher than ﬁrst-order” descent algorithms such as gradientdescent or standard Perceptron-like algorithms”). Contrary to other higher-order algorithms, such as the ridge regression-like algorithms considered in, e.g., [4, 7], Higher-order Perceptron has the ability to put together in a principled and ﬂexible manner second-order statistics about the data with the ”logarithmic behavior” of multiplicative/dual-norm algorithms (e.g., [18, 19, 6, 13, 15, 20]). Our algorithm exploits a simpliﬁed form of the inverse data matrix, lending itself to be easily combined with the dual norms machinery introduced by [13] (see also [12, 23]). As we will see, this has also computational advantages, allowing us to formulate an efﬁcient (subquadratic) implementation. Our contribution is twofold. First, we provide an initial theoretical analysis suggesting that our algorithm might be seen as a standard Perceptron algorithm [21] operating on a transformed sequence of examples with improved margin properties. The same analysis also suggests a simple (but principled) way of switching on the ﬂy between higher-order and ﬁrst-order updates. This is ∗ The authors gratefully acknowledge partial support by the PASCAL Network of Excellence under EC grant n. 506778. This publication only reﬂects the authors’ views. especially convenient when we deal with kernel functions, a major concern being the sparsity of the computed solution. The second contribution of this paper is an experimental investigation of our algorithm on artiﬁcial and real-world datasets from various domains: We compared Higher-order Perceptron to baseline Perceptron algorithms, like the Second-order Perceptron algorithm deﬁned in [7] and the standard (p-norm) Perceptron algorithm, as in [13, 12]. We found in our experiments that Higher-order Perceptron generalizes quite well. Among our experimental ﬁndings are the following: 1) Higher-order Perceptron is always outperforming the corresponding multiplicative (p-norm) baseline (thus the stored data matrix is always beneﬁcial in terms of convergence speed); 2) When dealing with Euclidean norms (p = 2), the comparison to Second-order Perceptron is less clear and depends on the speciﬁc task at hand. Learning protocol and notation. Our algorithm works in the well-known mistake bound model of on-line learning, as introduced in [18, 2], and further investigated by many authors (e.g., [19, 6, 13, 15, 7, 20, 23] and references therein). Prediction proceeds in a sequence of trials. In each trial t = 1, 2, . . . the prediction algorithm is given an instance vector in Rn (for simplicity, all vectors are normalized, i.e., ||xt || = 1, where || · || is the Euclidean norm unless otherwise speciﬁed), and then guesses the binary label yt ∈ {−1, 1} associated with xt . We denote the algorithm’s prediction by yt ∈ {−1, 1}. Then the true label yt is disclosed. In the case when yt = yt we say that the algorithm has made a prediction mistake. We call an example a pair (xt , yt ), and a sequence of examples S any sequence S = (x1 , y1 ), (x2 , y2 ), . . . , (xT , yT ). In this paper, we are competing against the class of linear-threshold predictors, parametrized by normal vectors u ∈ {v ∈ Rn : ||v|| = 1}. In this case, a common way of measuring the (relative) prediction performance of an algorithm A is to compare the total number of mistakes of A on S to some measure of the linear separability of S. One such measure (e.g., [24]) is the cumulative hinge-loss (or soft-margin) Dγ (u; S) of S w.r.t. a T linear classiﬁer u at a given margin value γ > 0: Dγ (u; S) = t=1 max{0, γ − yt u xt } (observe that Dγ (u; S) vanishes if and only if u separates S with margin at least γ. A mistake-driven algorithm A is one which updates its internal state only upon mistakes. One can therefore associate with the run of A on S a subsequence M = M(S, A) ⊆ {1, . . . , T } of mistaken trials. Now, the standard analysis of these algorithms allows us to restrict the behavior of the comparison class to mistaken trials only and, as a consequence, to reﬁne Dγ (u; S) so as to include only trials in M: Dγ (u; S) = t∈M max{0, γ − yt u xt }. This gives bounds on A’s performance relative to the best u over a sequence of examples produced (or, actually, selected) by A during its on-line functioning. Our analysis in Section 3 goes one step further: the number of mistakes of A on S is contrasted to the cumulative hinge loss of the best u on a transformed ˜ sequence S = ((˜ i1 , yi1 ), (˜ i2 , yi2 ), . . . , (˜ im , yim )), where each instance xik gets transformed x x x ˜ into xik through a mapping depending only on the past behavior of the algorithm (i.e., only on ˜ examples up to trial t = ik−1 ). As we will see in Section 3, this new sequence S tends to be ”more separable” than the original sequence, in the sense that if S is linearly separable with some margin, ˜ then the transformed sequence S is likely to be separable with a larger margin. 2 The Higher-order Perceptron algorithm The algorithm (described in Figure 1) takes as input a sequence of nonnegative parameters ρ1 , ρ2 , ..., and maintains a product matrix Bk (initialized to the identity matrix I) and a sum vector v k (initialized to 0). Both of them are indexed by k, a counter storing the current number of mistakes (plus one). Upon receiving the t-th normalized instance vector xt ∈ Rn , the algorithm computes its binary prediction value yt as the sign of the inner product between vector Bk−1 v k−1 and vector Bk−1 xt . If yt = yt then matrix Bk−1 is updates multiplicatively as Bk = Bk−1 (I − ρk xt xt ) while vector v k−1 is updated additively through the standard Perceptron rule v k = v k−1 + yt xt . The new matrix Bk and the new vector v k will be used in the next trial. If yt = yt no update is performed (hence the algorithm is mistake driven). Observe that ρk = 0 for any k makes this algorithm degenerate into the standard Perceptron algorithm [21]. Moreover, one can easily see that, in order to let this algorithm exploit the information collected in the matrix B (and let the algorithm’s ∞ behavior be substantially different from Perceptron’s) we need to ensure k=1 ρk = ∞. In the sequel, our standard choice will be ρk = c/k, with c ∈ (0, 1). See Sections 3 and 4. Implementing Higher-Order Perceptron can be done in many ways. Below, we quickly describe three of them, each one having its own merits. 1) Primal version. We store and update an n×n matrix Ak = Bk Bk and an n-dimensional column Parameters: ρ1 , ρ2 , ... ∈ [0, 1). Initialization: B0 = I; v 0 = 0; k = 1. Repeat for t = 1, 2, . . . , T : 1. Get instance xt ∈ Rn , ||xt || = 1; 2. Predict yt = SGN(wk−1 xt ) ∈ {−1, +1}, where wk−1 = Bk−1 Bk−1 v k−1 ; 3. Get label yt ∈ {−1, +1}; v k = v k−1 + yt xt 4. if yt = yt then: Bk k = Bk−1 (I − ρk xt xt ) ← k + 1. Figure 1: The Higher-order Perceptron algorithm (for p = 2). vector v k . Matrix Ak is updated as Ak = Ak−1 − ρAk−1 xx − ρxx Ak−1 + ρ2 (x Ak−1 x)xx , taking O(n2 ) operations, while v k is updated as in Figure 1. Computing the algorithm’s margin v Ax can then be carried out in time quadratic in the dimension n of the input space. 2) Dual version. This implementation allows us the use of kernel functions (e.g., [24]). Let us denote by Xk the n × k matrix whose columns are the n-dimensional instance vectors x1 , ..., xk where a mistake occurred so far, and y k be the k-dimensional column vector of the corresponding (k) labels. We store and update the k × k matrix Dk = [di,j ]k i,j=1 , the k × k diagonal matrix Hk = DIAG {hk }, (k) (k) hk = (h1 , ..., hk ) = Xk Xk y k , and the k-dimensional column vector g k = y k + Dk Hk 1k , being 1k a vector of k ones. If we interpret the primal matrix Ak above as Ak = (k) k I + i,j=1 di,j xi xj , it is not hard to show that the margin value wk−1 x is equal to g k−1 Xk−1 x, and can be computed through O(k) extra inner products. Now, on the k-th mistake, vector g can be updated with O(k 2 ) extra inner products by updating D and H in the following way. We let D0 and H0 be empty matrices. Then, given Dk−1 and Hk−1 = DIAG{hk−1 }, we have1 Dk = Dk−1 −ρk bk (k) , where bk = Dk−1 Xk−1 xk , and dk,k = ρ2 xk Xk−1 bk − 2ρk + ρ2 . On (k) k k −ρk bk dk,k the other hand, Hk = DIAG {hk−1 (k) (k) + yk Xk−1 xk , hk }, with hk = y k−1 Xk−1 xk + yk . Observe that on trials when ρk = 0 matrix Dk−1 is padded with a zero row and a zero column. (k) k This amounts to say that matrix Ak = I + i,j=1 di,j xi xj , is not updated, i.e., Ak = Ak−1 . A closer look at the above update mechanism allows us to conclude that the overall extra inner products needed to compute g k is actually quadratic only in the number of past mistaken trials having ρk > 0. This turns out to be especially important when using a sparse version of our algorithm which, on a mistaken trial, decides whether to update both B and v or just v (see Section 4). 3) Implicit primal version and the dual norms algorithm. This is based on the simple observation that for any vector z we can compute Bk z by unwrapping Bk as in Bk z = Bk−1 (I − ρxx )z = Bk−1 z , where vector z = (z − ρx x z) can be calculated in time O(n). Thus computing the margin v Bk−1 Bk−1 x actually takes O(nk). Maintaining this implicit representation for the product matrix B can be convenient when an efﬁcient dual version is likely to be unavailable, as is the case for the multiplicative (or, more generally, dual norms) extension of our algorithm. We recall that a multiplicative algorithm is useful when learning sparse target hyperplanes (e.g., [18, 15, 3, 12, 11, 20]). We obtain a dual norms algorithm by introducing a norm parameter p ≥ 2, and the associated gradient mapping2 g : θ ∈ Rn → θ ||θ||2 / 2 ∈ Rn . Then, in Figure 1, we p normalize instance vectors xt w.r.t. the p-norm, we deﬁne wk−1 = Bk−1 g(Bk−1 v k−1 ), and generalize the matrix update as Bk = Bk−1 (I − ρk xt g(xt ) ). As we will see, the resulting algorithm combines the multiplicative behavior of the p-norm algorithms with the ”second-order” information contained in the matrix Bk . One can easily see that the above-mentioned argument for computing the margin g(Bk−1 v k−1 ) Bk−1 x in time O(nk) still holds. 1 Observe that, by construction, Dk is a symmetric matrix. This mapping has also been used in [12, 11]. Recall that setting p = O(log n) yields an algorithm similar to Winnow [18]. Also, notice that p = 2 yields g = identity. 2 3 Analysis We express the performance of the Higher-order Perceptron algorithm in terms of the hinge-loss behavior of the best linear classiﬁer over the transformed sequence ˜ S = (B0 xt(1) , yt(1) ), (B1 xt(2) , yt(2) ), (B2 xt(3) , yt(3) ), . . . , (1) being t(k) the trial where the k-th mistake occurs, and Bk the k-th matrix produced by the algorithm. Observe that each feature vector xt(k) gets transformed by a matrix Bk depending on past examples ˜ only. This is relevant to the argument that S tends to have a larger margin than the original sequence (see the discussion at the end of this section). This neat ”on-line structure” does not seem to be shared by other competing higher-order algorithms, such as the ”ridge regression-like” algorithms considered, e.g., in [25, 4, 7, 23]. For the sake of simplicity, we state the theorem below only in the case p = 2. A more general statement holds when p ≥ 2. Theorem 1 Let the Higher-order Perceptron algorithm in Figure 1 be run on a sequence of examples S = (x1 , y1 ), (x2 , y2 ), . . . , (xT , yT ). Let the sequence of parameters ρk satisfy 0 ≤ ρk ≤ 1−c , where xt is the k-th mistaken instance vector, and c ∈ (0, 1]. Then the total number m 1+|v k−1 xt | of mistakes satisﬁes3 ˜ ˜ Dγ (u; Sc )) α2 α Dγ (u; Sc )) α2 m≤α + 2+ α + 2, (2) γ 2γ γ γ 4γ holding for any γ > 0 and any unit norm vector u ∈ Rn , where α = α(c) = (2 − c)/c. Proof. The analysis deliberately mimics the standard Perceptron convergence analysis [21]. We ﬁx an arbitrary sequence S = (x1 , y1 ), (x2 , y2 ), . . . , (xT , yT ) and let M ⊆ {1, 2, . . . , T } be the set of trials where the algorithm in Figure 1 made a mistake. Let t = t(k) be the trial where the k-th mistake occurred. We study the evolution of ||Bk v k ||2 over mistaken trials. Notice that the matrix Bk Bk is positive semideﬁnite for any k. We can write ||Bk v k ||2 = ||Bk−1 (I − ρk xt xt ) (v k−1 + yt xt ) ||2 (from the update rule v k = v k−1 + yt xt and Bk = Bk−1 (I − ρk xt xt ) ) = ||Bk−1 v k−1 + yt (1 − ρk yt v k−1 xt − ρk )Bk−1 xt ||2 2 = ||Bk−1 v k−1 || + 2 yt rk v k−1 Bk−1 Bk−1 xt + (using ||xt || = 1) 2 rk ||Bk−1 xt ||2 , where we set for brevity rk = 1 − ρk yt v k−1 xt − ρk . We proceed by upper and lower bounding the above chain of equalities. To this end, we need to ensure rk ≥ 0. Observe that yt v k−1 xt ≥ 0 implies rk ≥ 0 if and only if ρk ≤ 1/(1 + yt v k−1 xt ). On the other hand, if yt v k−1 xt < 0 then, in order for rk to be nonnegative, it sufﬁces to pick ρk ≤ 1. In both cases ρk ≤ (1 − c)/(1 + |v k−1 xt |) implies 2 rk ≥ c > 0, and also rk ≤ (1+ρk |v k−1 xt |−ρk )2 ≤ (2−c)2 . Now, using yt v k−1 Bk−1 Bk−1 xt ≤ 0 (combined with rk ≥ 0), we conclude that ||Bk v k ||2 − ||Bk−1 v k−1 ||2 ≤ (2 − c)2 ||Bk−1 xt ||2 = (2 − c)2 xt Ak−1 xt , where we set Ak = Bk Bk . A simple4 (and crude) upper bound on the last term follows by observing that ||xt || = 1 implies xt Ak−1 xt ≤ ||Ak−1 ||, the spectral norm (largest eigenvalue) of Ak−1 . Since a factor matrix of the form (I − ρ xx ) with ρ ≤ 1 and ||x|| = 1 has k−1 spectral norm one, we have xt Ak−1 xt ≤ ||Ak−1 || ≤ i=1 ||I − ρi xt(i) xt(i) ||2 ≤ 1. Therefore, summing over k = 1, . . . , m = |M| (or, equivalently, over t ∈ M) and using v 0 = 0 yields the upper bound ||Bm v m ||2 ≤ (2 − c)2 m. (3) To ﬁnd a lower bound of the left-hand side of (3), we ﬁrst pick any unit norm vector u ∈ Rn , and apply the standard Cauchy-Schwartz inequality: ||Bm v m || ≥ u Bm v m . Then, we observe that for a generic trial t = t(k) the update rule of our algorithm allows us to write u Bk v k − u Bk−1 v k−1 = rk yt u Bk−1 xt ≥ rk (γ − max{0, γ − yt u Bk−1 xt }), where the last inequality follows from rk ≥ 0 and holds for any margin value γ > 0. We sum 3 ˜ The subscript c in Sc emphasizes the dependence of the transformed sequence on the choice of c. Note that in the special case c = 1 we have ρk = 0 for any k and α = 1, thereby recovering the standard Perceptron bound for nonseparable sequences (see, e.g., [12]). 4 A slightly more reﬁned bound can be derived which depends on the trace of matrices I − Ak . Details will be given in the full version of this paper. the above over k = 1, . . . , m and exploit c ≤ rk ≤ 2 − c after rearranging terms. This gets ˜ ||Bm v m || ≥ u Bm v m ≥ c γ m − (2 − c)Dγ (u; Sc ). Combining with (3) and solving for m gives the claimed bound. From the above result one can see that our algorithm might be viewed as a standard Perceptron ˜ algorithm operating on the transformed sequence Sc in (1). We now give a qualitative argument, ˜ which is suggestive of the improved margin properties of Sc . Assume for simplicity that all examples (xt , yt ) in the original sequence are correctly classiﬁed by hyperplane u with the same margin γ = yt u xt > 0, where t = t(k). According to Theorem 1, the parameters ρ1 , ρ2 , . . . should be small positive numbers. Assume, again for simplicity, that all ρk are set to the same small enough k value ρ > 0. Then, up to ﬁrst order, matrix Bk = i=1 (I − ρ xt(i) xt(i) ) can be approximated as Bk I −ρ k i=1 xt(i) xt(i) . Then, to the extent that the above approximation holds, we can write:5 yt u Bk−1 xt = yt u I −ρ = yt u xt − ρ yt k−1 i=1 xt(i) xt(i) xt = yt u k−1 i=1 I −ρ k−1 i=1 yt(i) xt(i) yt(i) xt(i) xt yt(i) u xt(i) yt(i) xt(i) xt = γ − ρ γ yt v k−1 xt . Now, yt v k−1 xt is the margin of the (ﬁrst-order) Perceptron vector v k−1 over a mistaken trial for the Higher-order Perceptron vector wk−1 . Since the two vectors v k−1 and wk−1 are correlated (recall that v k−1 wk−1 = v k−1 Bk−1 Bk−1 v k−1 = ||Bk−1 v k−1 ||2 ≥ 0) the mistaken condition yt wk−1 xt ≤ 0 is more likely to imply yt v k−1 xt ≤ 0 than the opposite. This tends to yield a margin larger than the original margin γ. As we mentioned in Section 2, this is also advantageous from a computational standpoint, since in those cases the matrix update Bk−1 → Bk might be skipped (this is equivalent to setting ρk = 0), still Theorem 1 would hold. Though the above might be the starting point of a more thorough theoretical understanding of the margin properties of our algorithm, in this paper we prefer to stop early and leave any further investigation to collecting experimental evidence. 4 Experiments We tested the empirical performance of our algorithm by conducting a number of experiments on a collection of datasets, both artiﬁcial and real-world from diverse domains (Optical Character Recognition, text categorization, DNA microarrays). The main goal of these experiments was to compare Higher-order Perceptron (with both p = 2 and p > 2) to known Perceptron-like algorithms, such as ﬁrst-order [21] and second-order Perceptron [7], in terms of training accuracy (i.e., convergence speed) and test set accuracy. The results are contained in Tables 1, 2, 3, and in Figure 2. Task 1: DNA microarrays and artiﬁcial data. The goal here was to test the convergence properties of our algorithms on sparse target learning tasks. We ﬁrst tested on a couple of well-known DNA microarray datasets. For each dataset, we ﬁrst generated a number of random training/test splits (our random splits also included random permutations of the training set). The reported results are averaged over these random splits. The two DNA datasets are: i. The ER+/ER− dataset from [14]. Here the task is to analyze expression proﬁles of breast cancer and classify breast tumors according to ER (Estrogen Receptor) status. This dataset (which we call the “Breast” dataset) contains 58 expression proﬁles concerning 3389 genes. We randomly split 1000 times into a training set of size 47 and a test set of size 11. ii. The “Lymphoma” dataset [1]. Here the goal is to separate cancerous and normal tissues in a large B-Cell lymphoma problem. The dataset contains 96 expression proﬁles concerning 4026 genes. We randomly split the dataset into a training set of size 60 and a test set of size 36. Again, the random split was performed 1000 times. On both datasets, the tested algorithms have been run by cycling 5 times over the current training set. No kernel functions have been used. We also artiﬁcially generated two (moderately) sparse learning problems with margin γ ≥ 0.005 at labeling noise levels η = 0.0 (linearly separable) and η = 0.1, respectively. The datasets have been generated at random by ﬁrst generating two (normalized) target vectors u ∈ {−1, 0, +1}500 , where the ﬁrst 50 components are selected independently at random in {−1, +1} and the remaining 450 5 Again, a similar argument holds in the more general setting p ≥ 2. The reader should notice how important the dependence of Bk on the past is to this argument. components are 0. Then we set η = 0.0 for the ﬁrst target and η = 0.1 for the second one and, corresponding to each of the two settings, we randomly generated 1000 training examples and 1000 test examples. The instance vectors are chosen at random from [−1, +1]500 and then normalized. If u · xt ≥ γ then a +1 label is associated with xt . If u · xt ≤ −γ then a −1 label is associated with xt . The labels so obtained are ﬂipped with probability η. If |u · xt | < γ then xt is rejected and a new vector xt is drawn. We call the two datasets ”Artiﬁcial 0.0 ” and ”Artiﬁcial 0.1 ”. We tested our algorithms by training over an increasing number of epochs and checking the evolution of the corresponding test set accuracy. Again, no kernel functions have been used. Task 2: Text categorization. The text categorization datasets are derived from the ﬁrst 20,000 newswire stories in the Reuters Corpus Volume 1 (RCV1, [22]). A standard TF - IDF bag-of-words encoding was used to transform each news story into a normalized vector of real attributes. We built four binary classiﬁcation problems by “binarizing” consecutive news stories against the four target categories 70, 101, 4, and 59. These are the 2nd, 3rd, 4th, and 5th most frequent6 categories, respectively, within the ﬁrst 20,000 news stories of RCV1. We call these datasets RCV1x , where x = 70, 101, 4, 59. Each dataset was split into a training set of size 10,000 and a test set of the same size. All algorithms have been trained for a single epoch. We initially tried polynomial kernels, then realized that kernel functions did not signiﬁcantly alter our conclusions on this task. Thus the reported results refer to algorithms with no kernel functions. Task 3: Optical character recognition (OCR). We used two well-known OCR benchmarks: the USPS dataset and the MNIST dataset [16] and followed standard experimental setups, such as the one in [9], including the one-versus-rest scheme for reducing a multiclass problem to a set of binary tasks. We used for each algorithm the standard Gaussian and polynomial kernels, with parameters chosen via 5-fold cross validation on the training set across standard ranges. Again, all algorithms have been trained for a single epoch over the training set. The results in Table 3 only refer to the best parameter settings for each kernel. Algorithms. We implemented the standard Perceptron algorithm (with and without kernels), the Second-order Perceptron algorithm, as described in [7] (with and without kernels), and our Higherorder Perceptron algorithm. The implementation of the latter algorithm (for both p = 2 and p > 2) was ”implicit primal” when tested on the sparse learning tasks, and in dual variables for the other two tasks. When using Second-order Perceptron, we set its parameter a (see [7] for details) by testing on a generous range of values. For brevity, only the settings achieving the best results are reported. On the sparse learning tasks we tried Higher-order Perceptron with norm p = 2, 4, 7, 10, while on the other two tasks we set p = 2. In any case, for each value of p, we set7 ρk = c/k, with c = 0, 0.2, 0.4, 0.6, 0.8. Since c = 0 corresponds to a standard p-norm Perceptron algorithm [13, 12] we tried to emphasize the comparison c = 0 vs. c > 0. Finally, when using kernels on the OCR tasks, we also compared to a sparse dual version of Higher-order Perceptron. On a mistaken round t = t(k), this algorithm sets ρk = c/k if yt v k−1 xt ≥ 0, and ρk = 0 otherwise (thus, when yt v k−1 xt < 0 the matrix Bk−1 is not updated). For the sake of brevity, the standard Perceptron algorithm is called FO (”First Order”), the Second-order algorithm is denoted by SO (”Second Order”), while the Higher-order algorithm with norm parameter p and ρk = c/k is abbreviated as HOp (c). Thus, for instance, FO = HO2 (0). Results and conclusions. Our Higher-order Perceptron algorithm seems to deliver interesting results. In all our experiments HOp (c) with c > 0 outperforms HOp (0). On the other hand, the comparison HOp (c) vs. SO depends on the speciﬁc task. On the DNA datasets, HOp (c) with c > 0 is clearly superior in Breast. On Lymphoma, HOp (c) gets worse as p increases. This is a good indication that, in general, a multiplicative algorithm is not suitable for this dataset. In any case, HO2 turns out to be only slightly worse than SO. On the artiﬁcial datasets HOp (c) with c > 0 is always better than the corresponding p-norm Perceptron algorithm. On the text categorization tasks, HO2 tends to perform better than SO. On USPS, HO2 is superior to the other competitors, while on MNIST it performs similarly when combined with Gaussian kernels (though it turns out to be relatively sparser), while it is slightly inferior to SO when using polynomial kernels. The sparse version of HO2 cuts the matrix updates roughly by half, still maintaining a good performance. In all cases HO2 (either sparse or not) signiﬁcantly outperforms FO. In conclusion, the Higher-order Perceptron algorithm is an interesting tool for on-line binary clas6 7 We did not use the most frequent category because of its signiﬁcant overlap with the other ones. Notice that this setting fulﬁlls the condition on ρk stated in Theorem 1. Table 1: Training and test error on the two datasets ”Breast” and ”Lymphoma”. Training error is the average total number of updates over 5 training epochs, while test error is the average fraction of misclassiﬁed patterns in the test set, The results refer to the same training/test splits. For each algorithm, only the best setting is shown (best training and best test setting coincided in these experiments). Thus, for instance, HO2 differs from FO because of the c parameter. We emphasized the comparison HO7 (0) vs. HO7 (c) with best c among the tested values. According to Wilcoxon signed rank test, an error difference of 0.5% or larger might be considered signiﬁcant. In bold are the smallest ﬁgures achieved on each row of the table. FO TRAIN TEST TRAIN TEST LYMPHOMA HO 2 HO 4 HO 7 (0) HO 7 HO 10 SO 45.2 23.4% 22.1 11.8% 21.7 16.4% 19.6 10.0% 24.5 13.3% 18.9 10.0% 47.4 15.7% 23.0 11.5% 24.5 12.0% 20.0 11.5% 32.4 13.5 23.1 11.9% 29.6 15.0% 19.3 9.6% FO = HO 2(0.0) Training updates vs training epochs on Artificial 0.0 SO # of training updates 800 * HO 4(0.4) 600 HO 7(0.0) * 400 300 * * * * * SO 2400 HO 2(0.4) 700 500 HO 7 (0.4) * 2000 * 1200 400 2 3 5 10 15 20 * 1 * * 2 3 * Test error rates * * * (a = 0.2) HO 2(0.4) HO 4(0.4) * * * * HO 7(0.0) HO 7 (0.4) 14% Test error rates (minus 10%) FO = HO 2(0.0) SO 18% * HO 7(0.0) HO 7(0.4) 5 10 15 20 # of training epochs Test error rates vs training epochs on Artificial 0.0 22% * * # of training epochs 26% (a = 0.2) HO 2(0.4) HO 4(0.4) 1600 800 * 1 FO = HO 2(0.0) Training updates vs training epochs on Artificial 0.1 (a = 0.2) # of training updates B REAST FO = HO 2(0.0) Test error rates vs training epochs on Artificial 0.1 SO 26% 22% * * * * * * * * (a = 0.2) HO 2(0.4) HO 4(0.4) 18% HO 7(0.0) 14% HO 7 (0.4) 10% 10% 6% 6% 1 2 3 5 10 # of training epochs 15 20 1 2 3 5 10 15 20 # of training epochs Figure 2: Experiments on the two artiﬁcial datasets (Artiﬁcial0.0 , on the left, and Artiﬁcial0.1 , on the right). The plots give training and test behavior as a function of the number of training epochs. Notice that the test set in Artiﬁcial0.1 is affected by labelling noise of rate 10%. Hence, a visual comparison between the two plots at the bottom can only be made once we shift down the y-axis of the noisy plot by 10%. On the other hand, the two training plots (top) are not readily comparable. The reader might have difﬁculty telling apart the two kinds of algorithms HOp (0.0) and HOp (c) with c > 0. In practice, the latter turned out to be always slightly superior in performance to the former. siﬁcation, having the ability to combine multiplicative (or nonadditive) and second-order behavior into a single inference procedure. Like other algorithms, HOp can be extended (details omitted due to space limitations) in several ways through known worst-case learning technologies, such as large margin (e.g., [17, 11]), label-efﬁcient/active learning (e.g., [5, 8]), and bounded memory (e.g., [10]). References [1] A. Alizadeh, et al. (2000). Distinct types of diffuse large b-cell lymphoma identiﬁed by gene expression proﬁling. Nature, 403, 503–511. [2] D. Angluin (1988). Queries and concept learning. Machine Learning, 2(4), 319–342. [3] P. Auer & M.K. Warmuth (1998). Tracking the best disjunction. Machine Learning, 32(2), 127–150. [4] K.S. Azoury & M.K. Warmuth (2001). Relative loss bounds for on-line density estimation with the exponential familiy of distributions. Machine Learning, 43(3), 211–246. [5] A. Bordes, S. Ertekin, J. Weston, & L. Bottou (2005). Fast kernel classiﬁers with on-line and active learning. JMLR, 6, 1579–1619. [6] N. Cesa-Bianchi, Y. Freund, D. Haussler, D.P. Helmbold, R.E. Schapire, & M.K. Warmuth (1997). How to use expert advice. J. ACM, 44(3), 427–485. Table 2: Experimental results on the four binary classiﬁcation tasks derived from RCV1. ”Train” denotes the number of training corrections, while ”Test” gives the fraction of misclassiﬁed patterns in the test set. Only the results corresponding to the best test set accuracy are shown. In bold are the smallest ﬁgures achieved for each of the 8 combinations of dataset (RCV1x , x = 70, 101, 4, 59) and phase (training or test). FO TRAIN TEST 993 673 803 767 7.20% 6.39% 6.14% 6.45% RCV170 RCV1101 RCV14 RCV159 HO 2 TRAIN TEST 941 665 783 762 6.83% 5.81% 5.94% 6.04% SO TRAIN TEST 880 677 819 760 6.95% 5.48% 6.05% 6.84% Table 3: Experimental results on the OCR tasks. ”Train” denotes the total number of training corrections, summed over the 10 categories, while ”Test” denotes the fraction of misclassiﬁed patterns in the test set. Only the results corresponding to the best test set accuracy are shown. For the sparse version of HO2 we also reported (in parentheses) the number of matrix updates during training. In bold are the smallest ﬁgures achieved for each of the 8 combinations of dataset (USPS or MNIST), kernel type (Gaussian or Polynomial), and phase (training or test). FO TRAIN U SPS M NIST G AUSS P OLY G AUSS P OLY TEST 1385 1609 5834 8148 6.53% 7.37% 2.10% 3.04% HO 2 TRAIN TEST 945 1090 5351 6404 4.76% 5.71% 1.79% 2.27% Sparse HO2 SO TRAIN TEST TRAIN TEST 965 (440) 1081 (551) 5363 (2596) 6476 (3311) 5.13% 5.52% 1.81% 2.28% 1003 1054 5684 6440 5.05% 5.53% 1.82% 2.03% [7] N. Cesa-Bianchi, A. Conconi & C. Gentile (2005). A second-order perceptron algorithm. SIAM Journal of Computing, 34(3), 640–668. [8] N. Cesa-Bianchi, C. Gentile, & L. Zaniboni (2006). Worst-case analysis of selective sampling for linearthreshold algorithms. JMLR, 7, 1205–1230. [9] C. Cortes & V. Vapnik (1995). Support-vector networks. Machine Learning, 20(3), 273–297. [10] O. Dekel, S. Shalev-Shwartz, & Y. Singer (2006). The Forgetron: a kernel-based Perceptron on a ﬁxed budget. NIPS 18, MIT Press, pp. 259–266. [11] C. Gentile (2001). A new approximate maximal margin classiﬁcation algorithm. JMLR, 2, 213–242. [12] C. Gentile (2003). The Robustness of the p-norm Algorithms. Machine Learning, 53(3), pp. 265–299. [13] A.J. Grove, N. Littlestone & D. Schuurmans (2001). General convergence results for linear discriminant updates. Machine Learning Journal, 43(3), 173–210. [14] S. Gruvberger, et al. (2001). Estrogen receptor status in breast cancer is associated with remarkably distinct gene expression patterns. Cancer Res., 61, 5979–5984. [15] J. Kivinen, M.K. Warmuth, & P. Auer (1997). The perceptron algorithm vs. winnow: linear vs. logarithmic mistake bounds when few input variables are relevant. Artiﬁcial Intelligence, 97, 325–343. [16] Y. Le Cun, et al. (1995). Comparison of learning algorithms for handwritten digit recognition. ICANN 1995, pp. 53–60. [17] Y. Li & P. Long (2002). The relaxed online maximum margin algorithm. Machine Learning, 46(1-3), 361–387. [18] N. Littlestone (1988). Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. Machine Learning, 2(4), 285–318. [19] N. Littlestone & M.K. Warmuth (1994). The weighted majority algorithm. Information and Computation, 108(2), 212–261. [20] P. Long & X. Wu (2004). Mistake bounds for maximum entropy discrimination. NIPS 2004. [21] A.B.J. Novikov (1962). On convergence proofs on perceptrons. Proc. of the Symposium on the Mathematical Theory of Automata, vol. XII, pp. 615–622. [22] Reuters: 2000. http://about.reuters.com/researchandstandards/corpus/. [23] S. Shalev-Shwartz & Y. Singer (2006). Online Learning Meets Optimization in the Dual. COLT 2006, pp. 423–437. [24] B. Schoelkopf & A. Smola (2002). Learning with kernels. MIT Press. [25] Vovk, V. (2001). Competitive on-line statistics. International Statistical Review, 69, 213-248.

3 0.87854332 97 nips-2007-Hidden Common Cause Relations in Relational Learning

Author: Ricardo Silva, Wei Chu, Zoubin Ghahramani

4 0.87270278 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods

Author: Alessandro Lazaric, Marcello Restelli, Andrea Bonarini

Abstract: Learning in real-world domains often requires to deal with continuous state and action spaces. Although many solutions have been proposed to apply Reinforcement Learning algorithms to continuous state problems, the same techniques can be hardly extended to continuous action spaces, where, besides the computation of a good approximation of the value function, a fast method for the identiﬁcation of the highest-valued action is needed. In this paper, we propose a novel actor-critic approach in which the policy of the actor is estimated through sequential Monte Carlo methods. The importance sampling step is performed on the basis of the values learned by the critic, while the resampling step modiﬁes the actor’s policy. The proposed approach has been empirically compared to other learning algorithms into several domains; in this paper, we report results obtained in a control problem consisting of steering a boat across a river. 1

5 0.86907411 69 nips-2007-Discriminative Batch Mode Active Learning

Author: Yuhong Guo, Dale Schuurmans

Abstract: Active learning sequentially selects unlabeled instances to label with the goal of reducing the effort needed to learn a good classiﬁer. Most previous studies in active learning have focused on selecting one unlabeled instance to label at a time while retraining in each iteration. Recently a few batch mode active learning approaches have been proposed that select a set of most informative unlabeled instances in each iteration under the guidance of heuristic scores. In this paper, we propose a discriminative batch mode active learning approach that formulates the instance selection task as a continuous optimization problem over auxiliary instance selection variables. The optimization is formulated to maximize the discriminative classiﬁcation performance of the target classiﬁer, while also taking the unlabeled data into account. Although the objective is not convex, we can manipulate a quasi-Newton method to obtain a good local solution. Our empirical studies on UCI datasets show that the proposed active learning is more effective than current state-of-the art batch mode active learning algorithms. 1

6 0.86100858 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data

7 0.85942721 172 nips-2007-Scene Segmentation with CRFs Learned from Partially Labeled Images

8 0.85913819 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images

9 0.85540378 156 nips-2007-Predictive Matrix-Variate t Models

10 0.85283697 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations

11 0.85167676 18 nips-2007-A probabilistic model for generating realistic lip movements from speech

12 0.84979737 175 nips-2007-Semi-Supervised Multitask Learning

13 0.84620076 41 nips-2007-COFI RANK - Maximum Margin Matrix Factorization for Collaborative Ranking

14 0.84577179 73 nips-2007-Distributed Inference for Latent Dirichlet Allocation

15 0.84523427 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes

16 0.84179944 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model

17 0.84152436 158 nips-2007-Probabilistic Matrix Factorization

18 0.84119618 88 nips-2007-Fast and Scalable Training of Semi-Supervised CRFs with Application to Activity Recognition

19 0.83933914 131 nips-2007-Modeling homophily and stochastic equivalence in symmetric relational data

20 0.8392452 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC