nips nips2002 nips2002-124 knowledge-graph by maker-knowledge-mining

124 nips-2002-Learning Graphical Models with Mercer Kernels


Source: pdf

Author: Francis R. Bach, Michael I. Jordan

Abstract: We present a class of algorithms for learning the structure of graphical models from data. The algorithms are based on a measure known as the kernel generalized variance (KGV), which essentially allows us to treat all variables on an equal footing as Gaussians in a feature space obtained from Mercer kernels. Thus we are able to learn hybrid graphs involving discrete and continuous variables of arbitrary type. We explore the computational properties of our approach, showing how to use the kernel trick to compute the relevant statistics in linear time. We illustrate our framework with experiments involving discrete and continuous data.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We present a class of algorithms for learning the structure of graphical models from data. [sent-7, score-0.247]

2 The algorithms are based on a measure known as the kernel generalized variance (KGV), which essentially allows us to treat all variables on an equal footing as Gaussians in a feature space obtained from Mercer kernels. [sent-8, score-0.569]

3 Thus we are able to learn hybrid graphs involving discrete and continuous variables of arbitrary type. [sent-9, score-0.485]

4 We explore the computational properties of our approach, showing how to use the kernel trick to compute the relevant statistics in linear time. [sent-10, score-0.206]

5 We illustrate our framework with experiments involving discrete and continuous data. [sent-11, score-0.228]

6 In recent years, there has been a growing interest in learning the structure of graphical models directly from data, either in the directed case [1, 2, 3, 4] or the undirected case [5]. [sent-13, score-0.471]

7 Current algorithms deal reasonably well with models involving discrete variables or Gaussian variables having only limited interaction with discrete neighbors. [sent-14, score-0.551]

8 However, applications to general hybrid graphs and to domains with general continuous variables are few, and are generally based on discretization. [sent-15, score-0.349]

9 We make use of a relationship between kernel-based measures of “generalized variance” in a feature space, and quantities such as mutual information and pairwise independence in the input space. [sent-17, score-0.633]

10 Consider the set of random variables obtained by mapping the original variables into feature space. [sent-19, score-0.233]

11 Suppose that we compute the mean and covariance matrix of these feature-space variables, and consider a set of Gaussian variables that have the same mean and covariance. [sent-20, score-0.383]

12 We showed in [6] that a canonical correlation analysis of these feature-space variables yields a measure, known as the “kernel generalized variance,” that characterizes pairwise independence among the original variables and is closely related to the mutual information among the original variables. [sent-21, score-0.731]

13 In the current paper we pursue this idea in a different direction, considering the use of the kernel generalized variance as a surrogate for the mutual information in model selection problems. [sent-23, score-0.57]

14 Effectively, we map data into a feature space via a set of Mercer kernels, with different kernels for different data types, and treat all data on an equal footing as Gaussian in feature space. [sent-24, score-0.378]

15 We briefly review the structure-learning problem in Section 2, and in Section 4 and Section 5 we show how classical approaches to the problem, based on MDL/BIC and conditional independence tests, can be extended to our kernel-based approach. [sent-25, score-0.322]

16 In Section 3 we show that by making use of the “kernel trick” we are able to compute the sample covariance matrix in feature space in linear time in the number of samples. [sent-26, score-0.396]

17 For directed graphical models, in the MDL/BIC setting of [2], the likelihood is penalized by a model selection term that is equal to (log N)/2 times the number of parameters necessary to encode the local distributions. [sent-31, score-0.488]

18 The likelihood term can be decomposed over nodes and expressed, up to an additive term that does not depend on the graph, as N times the sum over nodes of the empirical mutual information between each variable and the vector of its parents in the graph to be scored. [sent-32, score-0.366]

19 These mutual information terms and the number of parameters for each local conditional distribution are easily computable in discrete models, as well as in Gaussian models. [sent-33, score-0.523]
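
To make the decomposable score concrete, here is a minimal sketch of a BIC-style local score for fully discrete data, under the assumption that the penalty takes the standard form (log N)/2 per free parameter; the function names, the plug-in mutual-information estimator, and the parameter count are illustrative, not the paper's implementation.

```python
import numpy as np
from collections import Counter

def empirical_mutual_information(x, y_columns):
    """Plug-in estimate of I(x; y) for discrete data.

    x is a 1-D array of labels; y_columns is a 2-D array whose rows are
    joint configurations of the parent variables.
    """
    n = len(x)
    y_keys = [tuple(row) for row in y_columns]
    p_x = Counter(x)
    p_y = Counter(y_keys)
    p_xy = Counter(zip(x, y_keys))
    mi = 0.0
    for (xv, yv), c in p_xy.items():
        # (c/n) * log( (c/n) / ((p_x/n)*(p_y/n)) )
        mi += (c / n) * np.log(c * n / (p_x[xv] * p_y[yv]))
    return mi

def bic_local_score(data, node, parents, cardinalities):
    """Schematic BIC-style local score: N * I_hat - (log N / 2) * #params."""
    parents = sorted(parents)
    n = data.shape[0]
    if parents:
        mi = empirical_mutual_information(data[:, node], data[:, parents])
    else:
        mi = 0.0
    # Free parameters of the local conditional probability table.
    n_parent_configs = int(np.prod([cardinalities[p] for p in parents])) if parents else 1
    n_params = (cardinalities[node] - 1) * n_parent_configs
    return n * mi - 0.5 * np.log(n) * n_params
```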

20 In this approach, conditional independence tests are performed to constrain the structure of possible graphs. [sent-41, score-0.443]

21 For undirected models, going from the graph to the set of conditional independencies is relatively easy: there is an edge between two variables if and only if they are not independent given all the other variables [7]. [sent-42, score-0.406]

22 In Section 5, we show how our approach could be used to perform independence tests and learn an undirected graphical model. [sent-43, score-0.575]

23 We also show how this approach can be used to prune the search space for the local search of a directed model. [sent-44, score-0.273]

24 3 Gaussians in feature space: In this section, we introduce our Gaussianity assumption and show how to approximate the mutual information, as required for the structure learning algorithms. [sent-45, score-0.416]

25 3.1 Mercer kernels: A Mercer kernel on a space X is a function k from X x X to the real numbers such that, for any set of points x_1, ..., x_N in X, the matrix K, defined by K_ij = k(x_i, x_j), is positive semidefinite. [sent-47, score-0.243]
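
As a small illustration of the definition (not taken from the paper), the sketch below builds the Gram matrix of a Gaussian kernel on random data and checks numerically that it is positive semidefinite; the kernel width sigma and the sample size are arbitrary choices.

```python
import numpy as np

def gaussian_gram_matrix(x, sigma=1.0):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) for 2-D input x."""
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))
K = gaussian_gram_matrix(x, sigma=1.0)

# A Mercer kernel must give a positive semidefinite Gram matrix.
eigvals = np.linalg.eigvalsh(K)
assert eigvals.min() > -1e-10
```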

26 This matrix is usually referred to as the Gram matrix of the points. [sent-48, score-0.174]

27 Given a Mercer kernel k, it is possible to find a space F and a map Phi from X to F, such that k(x, y) is the dot product in F between Phi(x) and Phi(y) (see, e.g., standard references on kernel methods). [sent-49, score-0.189]

28 The space F is usually referred to as the feature space and the map Phi as the feature map. [sent-52, score-0.281]

29 We also use a bracket notation to denote the dot product of two elements in feature space, and a corresponding notation to denote the representative of an element of the feature space in its dual space. [sent-54, score-0.16]

30 For a discrete variable which takes values in a finite set of size K, we use the trivial kernel k(x, y) = 1 if x = y and 0 otherwise, which corresponds to a feature space of dimension K. [sent-55, score-0.377]

31 For continuous variables, we use the Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)). [sent-59, score-0.212]

32 The feature space has infinite dimension, but as we will show, the data only occupy a small linear manifold and this linear subspace can be determined adaptively in linear time. [sent-60, score-0.124]

33 Note that an alternative is to use the linear kernel, which corresponds to simply modeling the data as Gaussian in the input space. [sent-61, score-0.12]

34 3.2 Notation: Let x_1, ..., x_m be random variables with values in spaces X_1, ..., X_m. [sent-63, score-0.145]

35 Let us assign a Mercer kernel k_i to each of the input spaces X_i, with feature space F_i and feature map Phi_i. [sent-64, score-0.365]

36 The random vector of feature images (Phi_1(x_1), ..., Phi_m(x_m)) has a covariance matrix defined by blocks, with block (i, j) being the covariance between Phi_i(x_i) and Phi_j(x_j). [sent-65, score-0.525]

37 Let phi denote a jointly Gaussian vector with the same mean and covariance as this vector of feature images. [sent-66, score-0.15]

38 This Gaussian vector will be used as the random vector on which the learning of graphical model structure is based. [sent-67, score-0.21]

39 No dependency involving strictly more than two variables is modeled explicitly, which makes our scoring metric easy to compute. [sent-69, score-0.477]

40 3.3 Computing sample covariances using the kernel trick: We are given a random sample of N elements of X_1 x ... x X_m. [sent-72, score-0.238]

41 We assume that for each variable the data in feature space have been centered, i.e., that the empirical mean of the feature images is zero. [sent-74, score-0.124]

42 Note that a Gaussian with this covariance matrix has zero variance along directions that are orthogonal to the images of the data. [sent-78, score-0.248]

43 Using Eq. (1), we see that the sample covariance matrix of the feature images, expressed in the “data basis,” has blocks given by products of the centered Gram matrices. [sent-81, score-0.357]
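
A hedged sketch of these sample statistics: the helpers below center a Gram matrix in feature space and form a covariance block in the data basis, under the assumption that the block between variables i and j is (1/N) K_i K_j for centered Gram matrices; both function names are illustrative.

```python
import numpy as np

def center_gram(K):
    """Center a Gram matrix in feature space: K <- H K H with H = I - (1/N) * ones."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def covariance_block(K_i, K_j):
    """Sample covariance block between variables i and j in the data basis,
    assuming the block is (1/N) * K_i @ K_j for centered Gram matrices."""
    n = K_i.shape[0]
    return (K_i @ K_j) / n
```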

44 3.4 Regularization: When the feature space has infinite dimension (as in the case of a Gaussian kernel on a continuous input space), the covariance we are implicitly fitting with a kernel method has an infinite number of parameters. [sent-83, score-0.524]

45 We add to the Gaussian vector an isotropic Gaussian with small covariance (a multiple of the identity) in an orthonormal basis. [sent-86, score-0.115]

46 In the data basis, the covariance of this isotropic Gaussian is exactly the block-diagonal matrix whose blocks are the Gram matrices scaled by the regularization constant. [sent-87, score-0.387]

47 Consequently, our regularized Gaussian covariance keeps the sample covariance blocks off the diagonal, while the diagonal blocks are augmented by the regularization term. [sent-88, score-0.236]

48 Since the regularization constant is a small constant, a first-order approximation leads to a more compact correlation matrix, with normalized cross-correlation blocks off the diagonal and identity blocks on the diagonal. [sent-89, score-0.27]

49 These cross-correlation matrices have exact dimension N, but since the eigenvalues of the Gram matrices are softly thresholded to zero or one by the regularization, the effective dimension is much smaller. [sent-90, score-0.124]

50 This dimensionality will be used as the dimension of our Gaussian variables for the MDL/BIC criterion, in Section 4. [sent-91, score-0.19]

51 The approximation is exact when the feature space has finite dimension (e.g., for discrete variables). [sent-95, score-0.169]

52 Using the incomplete Cholesky decomposition, for each Gram matrix K_i we obtain a factorization K_i approximately equal to G_i G_i', where G_i is an N x M_i matrix of rank M_i, with M_i much smaller than N. [sent-100, score-0.205]

53 We then perform a singular value decomposition of G_i to obtain an N x M_i matrix U_i with orthonormal columns (i.e., U_i' U_i = I) and the corresponding singular values. [sent-101, score-0.123]
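
The low-rank step can be sketched as a pivoted incomplete Cholesky factorization of a Gram matrix followed by an SVD of the factor; the stopping tolerance eta is an arbitrary choice and this is a generic implementation, not the authors' code.

```python
import numpy as np

def incomplete_cholesky(K, eta=1e-4):
    """Pivoted incomplete Cholesky: returns G (n x m) with K approximately G @ G.T."""
    n = K.shape[0]
    d = np.diag(K).astype(float).copy()   # residual diagonal
    G = np.zeros((n, n))
    m = 0
    while d.sum() > eta and m < n:
        pivot = int(np.argmax(d))
        G[:, m] = K[:, pivot] - G[:, :m] @ G[pivot, :m]
        G[:, m] /= np.sqrt(d[pivot])
        d -= G[:, m] ** 2
        d[d < 0] = 0.0                    # guard against round-off
        m += 1
    return G[:, :m]

def orthonormal_basis(G):
    """SVD of the low-rank factor: columns of U are orthonormal and K ~= U diag(s**2) U.T."""
    U, s, _ = np.linalg.svd(G, full_matrices=False)
    return U, s
```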

54 In this basis, the regularized correlation blocks involve a diagonal matrix obtained from the diagonal matrix of singular values by applying the soft-thresholding function induced by the regularization to its elements. [sent-112, score-0.236]

55 Thus the Gaussian vector has a correlation matrix whose blocks, expressed in the new basis defined by the columns of the matrices U_i, will be used to compute the various mutual information terms. [sent-113, score-0.691]

56 We now show how to compute the mutual information between the Gaussian feature-space variables, link it with the mutual information of the original variables, and make use of the following result: let z_1, ..., z_m be jointly Gaussian random vectors with covariance matrix C, defined in terms of blocks C_ij. [sent-118, score-1.092]

57 The mutual information between these jointly Gaussian variables is equal to -(1/2) log [ det(C) / (det(C_11) ... det(C_mm)) ] (a standard result for Gaussian random vectors). [sent-119, score-0.449]

58 The ratio of determinants in this expression is usually referred to as the generalized variance, and is independent of the basis chosen to compute it. [sent-122, score-0.111]

59 Applying this formula to the regularized correlation matrix of Eq. (2) gives the mutual information between the Gaussian feature-space variables as the analogous ratio of determinants (Eq. (3)). We refer to this quantity as the KGV-mutual information (KGV stands for kernel generalized variance). [sent-124, score-0.465]

60 It is always nonnegative and can also be defined for partitions of the variables into subsets, by simply partitioning the correlation matrix accordingly. [sent-125, score-0.264]
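
Putting the pieces together, the sketch below computes a KGV-style mutual information directly from Gram matrices, using regularized blocks (K_i + N*kappa/2*I)^2 on the diagonal and K_i K_j off the diagonal, a form borrowed from the authors' related kernel ICA work [6]; treat the exact regularization and the default kappa as assumptions rather than the paper's precise recipe.

```python
import numpy as np
from scipy.linalg import block_diag

def kgv_mutual_information(gram_matrices, kappa=1e-2):
    """Sketch of the KGV-mutual information: -1/2 * log(det(M) / det(D)), where M
    has blocks (K_i + N*kappa/2 * I)^2 on the diagonal and K_i @ K_j off the
    diagonal, and D is the block-diagonal part of M (regularization form assumed)."""
    n = gram_matrices[0].shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Ks = [H @ K @ H for K in gram_matrices]              # center in feature space
    Rs = [K + (n * kappa / 2.0) * np.eye(n) for K in Ks]
    nvars = len(Ks)
    blocks = [[Rs[i] @ Rs[i] if i == j else Ks[i] @ Ks[j] for j in range(nvars)]
              for i in range(nvars)]
    M = np.block(blocks)
    D = block_diag(*[blocks[i][i] for i in range(nvars)])
    _, logdet_M = np.linalg.slogdet(M)
    _, logdet_D = np.linalg.slogdet(D)
    return -0.5 * (logdet_M - logdet_D)
```

By Fischer's inequality for positive definite matrices, det(M) is at most det(D), so the returned value is nonnegative, matching the statement above.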

61 The KGV has an interesting relationship to the mutual information among the original variables. [sent-126, score-0.292]

62 In particular, as shown in [6], in the case of two discrete variables, the KGV is equal to the mutual information up to second order, when expanding around the manifold of distributions that factorize in the trivial graphical model (i.e., the model with no edges). [sent-127, score-0.57]

63 Moreover, in the case of continuous variables, when the width of the Gaussian kernel tends to zero, the KGV necessarily tends to a limit, and also provides a second-order expansion of the mutual information around independence. [sent-130, score-0.472]

64 4 Structure learning using local search: In this approach, an objective function measures the goodness of fit of the directed graphical model, and is minimized. [sent-133, score-0.381]

65 Given the scoring metric, we are faced with an NP-hard optimization problem on the space of directed acyclic graphs [10]. [sent-141, score-0.469]

66 Because the score decomposes as a sum of local scores, local greedy search heuristics are usually exploited. [sent-142, score-0.175]
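
A schematic outline of such a greedy search is given below: it hill-climbs over single-edge additions, deletions, and reversals using any decomposable local score (for instance the BIC-style sketch above, wrapped to fix the data and cardinalities), re-scoring only the nodes whose parent sets change; the cycle check and move set are generic, not the authors' implementation.

```python
import itertools

def is_acyclic(parents):
    """parents maps each node to the set of its parents; DFS-based cycle check."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in parents}

    def visit(v):
        color[v] = GRAY
        for p in parents[v]:
            if color[p] == GRAY:
                return False
            if color[p] == WHITE and not visit(p):
                return False
        color[v] = BLACK
        return True

    return all(visit(v) for v in parents if color[v] == WHITE)

def hill_climb(nodes, local_score, max_iter=100):
    """Greedy search over edge additions, deletions and reversals, using any
    decomposable local_score(node, parent_set)."""
    parents = {v: set() for v in nodes}
    scores = {v: local_score(v, parents[v]) for v in nodes}
    for _ in range(max_iter):
        best_gain, best_parents = 1e-12, None
        for u, v in itertools.permutations(nodes, 2):
            moves = [("del", u, v), ("rev", u, v)] if u in parents[v] else [("add", u, v)]
            for kind, a, b in moves:
                cand = {w: set(ps) for w, ps in parents.items()}
                if kind == "add":
                    cand[b].add(a)
                elif kind == "del":
                    cand[b].discard(a)
                else:                      # reverse a -> b into b -> a
                    cand[b].discard(a)
                    cand[a].add(b)
                if not is_acyclic(cand):
                    continue
                touched = {a, b} if kind == "rev" else {b}
                gain = sum(local_score(w, cand[w]) - scores[w] for w in touched)
                if gain > best_gain:
                    best_gain, best_parents = gain, cand
        if best_parents is None:
            break                          # local optimum reached
        parents = best_parents
        scores = {v: local_score(v, parents[v]) for v in nodes}
    return parents
```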

67 5 Conditional independence tests using KGV: In this section, we indicate how conditional independence tests can be performed using the KGV, and show how these tests can be used to estimate Markov blankets of nodes. [sent-146, score-0.901]

68 In the case of marginal independence, the likelihood ratio criterion is exactly equal to a power of the mutual information (see, e.g., classical results on likelihood ratio tests). [sent-148, score-0.402]

69 This generalizes easily to conditional independence, where the likelihood ratio criterion used to test the conditional independence of two variables given a third is determined by a difference of empirical mutual information terms scaled by the number of samples N, the mutual information terms being computed using empirical distributions. [sent-150, score-0.846]

70 Applied to our Gaussian feature-space variables, we obtain a test statistic based on a linear combination of KGV-mutual information terms. [sent-152, score-0.176]

71 Theoretical threshold values exist for conditional independence tests with Gaussian variables [7], but instead we prefer to use the threshold value given by the MDL/BIC criterion, i.e. [sent-153, score-0.556]

72 a threshold that depends on the dimensions of the Gaussians, so that the same decision regarding conditional independence is made in the two approaches (scoring metric or independence tests) [12]. [sent-155, score-0.672]

73 For Gaussian variables, it is well-known that some conditional independencies can be read out from the inverse of the joint covariance matrix [7]. [sent-157, score-0.414]

74 More precisely, if the feature-space variables are jointly Gaussian random vectors with covariance matrix C defined in terms of blocks C_ij, then variables i and j are independent given all the other variables if and only if the (i, j) block of the precision matrix (the inverse of C) is equal to zero. [sent-158, score-0.58]
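
This criterion translates directly into a numerical check: invert the joint (regularized) covariance and test whether the corresponding off-diagonal block is negligible. In the sketch below the block layout `dims` and the tolerance are illustrative assumptions.

```python
import numpy as np

def conditionally_independent(cov, dims, i, j, tol=1e-8):
    """True if the (i, j) block of the precision matrix (inverse covariance) is
    numerically zero, i.e. blocks i and j are conditionally independent given
    the remaining blocks (Gaussian case). dims gives the block sizes."""
    precision = np.linalg.inv(cov)
    offsets = np.concatenate(([0], np.cumsum(dims)))
    rows = slice(offsets[i], offsets[i + 1])
    cols = slice(offsets[j], offsets[j + 1])
    return np.max(np.abs(precision[rows, cols])) < tol
```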

75 Applying this test statistic to all pairs of nodes, we can find an undirected graphical model in polynomial time, and thus a set of Markov blankets [4]. [sent-160, score-0.574]

76 We may also be interested in constructing a directed model from the Markov blankets; however, this transformation is not always possible [7]. [sent-161, score-0.116]

77 Consequently, most approaches use heuristics to define a directed model from a set of conditional independencies [4, 13]. [sent-162, score-0.329]

78 Alternatively, as a pruning step in learning a directed graphical model, the Markov blanket can be safely used by only considering directed models whose moral graph is covered by the undirected graph. [sent-163, score-0.586]

79 6 Experiments: We compare the performance of three hill-climbing algorithms for directed graphical models: one using the KGV metric (with fixed settings of the regularization and kernel-width parameters), one using the MDL/BIC metric of [2], and one using the BDe metric of [1] (with a fixed equivalent prior sample size). [sent-164, score-0.808]

80 When the domain includes continuous variables, we used two discretization strategies: the first is to use K-means with a given number of clusters, and the second uses the adaptive discretization scheme for the MDL/BIC scoring metric of [14]. [sent-165, score-0.568]
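
For reference, the K-means discretization baseline can be sketched with scikit-learn, replacing each continuous column by its cluster index; the column list, cluster count, and seed are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

def discretize_columns(data, continuous_cols, n_clusters=5, seed=0):
    """Replace each continuous column by the index of its K-means cluster."""
    data = np.array(data, dtype=float, copy=True)
    for col in continuous_cols:
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
        data[:, col] = km.fit_predict(data[:, [col]])
    return data
```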

81 Also, to parameterize the local conditional probabilities we used mixture models (mixture of Gaussians, mixture of softmax regressions, mixture of linear regressions), which provide enough flexibility at reasonable cost. [sent-166, score-0.311]

82 When the true generating network is known, we measure the performance of algorithms by the KL divergence to the true distribution; otherwise, we report log-likelihood on held-out test data. [sent-169, score-0.136]

83 We see that on the discrete networks, the performance of all three algorithms is similar, degrading slightly as the size of the network increases. [sent-179, score-0.118]

84 We used three networks commonly used as benchmarks (footnote 1): the ALARM network (37 variables), the INSURANCE network (27 variables) and the HAILFINDER network (56 variables). [sent-183, score-0.245]

85 We see that the performance of our metric lies between the (approximate Bayesian) BIC metric and the (full Bayesian) BDe metric. (Footnote 1: Available at http://www.) [sent-186, score-0.33]

86 Figure legend (discrete networks), plotted against the size of the discrete network: KGV (plain), BDe (dashed), MDL/BIC (dotted). [sent-233, score-0.155]

87 Figure legend (linear Gaussian networks), plotted against the size of the network: KGV (plain), BDe with discretized data (dashed), MDL/BIC with discretized data (dotted x), MDL/BIC with adaptive discretization (dotted +). [sent-235, score-0.204]

88 Here N is the number of samples, and the other two quantities are the numbers of discrete and continuous variables, respectively. [sent-268, score-0.18]

89 Thus the performance of the new metric appears to be competitive with standard metrics for discrete data, providing some assurance that even in this case pairwise sufficient statistics in feature space seem to provide a reasonable characterization of Bayesian network structure. [sent-271, score-0.512]

90 It is the case of hybrid discrete/continuous networks that is our principal interest—in this case the KGV metric can be applied directly, without discretization of the continuous variables. [sent-273, score-0.461]

91 We also log-transformed all continuous variables that represent rates or counts. [sent-275, score-0.237]

92 We report average results (over 10 replications) in Table 1 for the KGV metric and for the BDe metric—continuous variables are discretized using K-means with 5 clusters (d-5) or 10 clusters (d-10). [sent-276, score-0.411]

93 7 Conclusion We have presented a general method for learning the structure of graphical models, based on treating variables as Gaussians in a high-dimensional feature space. [sent-278, score-0.443]

94 The method seamlessly integrates discrete and continuous variables in a unified framework, and can provide improvements in performance when compared to approaches based on discretization of continuous variables. [sent-279, score-0.543]

95 The Gaussianity assumption also provides a direct way to approximate Markov blankets for undirected graphical models, based on the classical link between conditional independence and zeros in the precision matrix. [sent-281, score-0.789]

96 While the use of the KGV as a scoring metric is inspired by the relationship between the KGV and the mutual information, it must be emphasized that this relationship is a local one, based on an expansion of the mutual information around independence. [sent-282, score-0.921]

97 While our empirical results suggest that the KGV is also an effective surrogate for the mutual information more generally, further theoretical work is needed to provide a deeper understanding of the KGV in models that are far from independence. [sent-283, score-0.37]

98 Finally, our algorithms have free parameters, in particular the regularization parameter and the width of the Gaussian kernel for continuous variables. [sent-284, score-0.245]

99 Although the performance is empirically robust to the setting of these parameters, learning those parameters from data would not only provide better and more consistent performance, but it would also provide a principled way to learn graphical models with local structure [15]. [sent-285, score-0.33]

100 Conditions under which conditional independence and scoring methods lead to identical selection of Bayesian network models. [sent-358, score-0.553]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('kgv', 0.562), ('mutual', 0.26), ('independence', 0.2), ('bde', 0.197), ('graphical', 0.178), ('metric', 0.15), ('variables', 0.145), ('scoring', 0.134), ('conditional', 0.122), ('blocks', 0.121), ('kernel', 0.12), ('directed', 0.116), ('covariance', 0.115), ('blankets', 0.112), ('cdc', 0.112), ('undirected', 0.108), ('gaussian', 0.105), ('mercer', 0.105), ('discretization', 0.096), ('continuous', 0.092), ('tests', 0.089), ('feature', 0.088), ('discrete', 0.088), ('matrix', 0.087), ('ccdc', 0.084), ('cdcc', 0.084), ('dcd', 0.084), ('dcdc', 0.084), ('hybrid', 0.079), ('gram', 0.075), ('gaussianity', 0.075), ('surrogate', 0.073), ('replications', 0.073), ('network', 0.067), ('bayesian', 0.066), ('ailfinder', 0.056), ('independencies', 0.056), ('larm', 0.056), ('nsurance', 0.056), ('regressions', 0.056), ('gaussians', 0.056), ('discretized', 0.054), ('local', 0.053), ('pairwise', 0.053), ('trick', 0.05), ('bic', 0.049), ('footing', 0.049), ('involving', 0.048), ('kl', 0.047), ('variance', 0.046), ('dimension', 0.045), ('equal', 0.044), ('networks', 0.044), ('geiger', 0.042), ('parents', 0.041), ('generalized', 0.041), ('kernels', 0.04), ('divergence', 0.039), ('bach', 0.039), ('mdl', 0.039), ('cholesky', 0.039), ('della', 0.039), ('pietra', 0.037), ('models', 0.037), ('compute', 0.036), ('space', 0.036), ('markov', 0.036), ('zeros', 0.036), ('decomposition', 0.036), ('jointly', 0.035), ('dotted', 0.035), ('heuristics', 0.035), ('matrices', 0.034), ('plain', 0.034), ('read', 0.034), ('centered', 0.034), ('search', 0.034), ('likelihood', 0.034), ('sample', 0.034), ('ratio', 0.034), ('mixture', 0.033), ('block', 0.033), ('graphs', 0.033), ('regularization', 0.033), ('map', 0.033), ('penalized', 0.033), ('link', 0.033), ('structure', 0.032), ('relationship', 0.032), ('correlation', 0.032), ('diagonal', 0.031), ('graph', 0.031), ('clusters', 0.031), ('rank', 0.031), ('consequently', 0.031), ('statistic', 0.031), ('selection', 0.03), ('criterion', 0.03), ('performance', 0.03), ('compact', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 124 nips-2002-Learning Graphical Models with Mercer Kernels

Author: Francis R. Bach, Michael I. Jordan

Abstract: We present a class of algorithms for learning the structure of graphical models from data. The algorithms are based on a measure known as the kernel generalized variance (KGV), which essentially allows us to treat all variables on an equal footing as Gaussians in a feature space obtained from Mercer kernels. Thus we are able to learn hybrid graphs involving discrete and continuous variables of arbitrary type. We explore the computational properties of our approach, showing how to use the kernel trick to compute the relevant statistics in linear time. We illustrate our framework with experiments involving discrete and continuous data.

2 0.11114659 52 nips-2002-Cluster Kernels for Semi-Supervised Learning

Author: Olivier Chapelle, Jason Weston, Bernhard SchĂślkopf

Abstract: We propose a framework to incorporate unlabeled data in kernel classifier, based on the idea that two points in the same cluster are more likely to have the same label. This is achieved by modifying the eigenspectrum of the kernel matrix. Experimental results assess the validity of this approach. 1

3 0.1094821 114 nips-2002-Information Regularization with Partially Labeled Data

Author: Martin Szummer, Tommi S. Jaakkola

Abstract: Classification with partially labeled data requires using a large number of unlabeled examples (or an estimated marginal P (x)), to further constrain the conditional P (y|x) beyond a few available labeled examples. We formulate a regularization approach to linking the marginal and the conditional in a general way. The regularization penalty measures the information that is implied about the labels over covering regions. No parametric assumptions are required and the approach remains tractable even for continuous marginal densities P (x). We develop algorithms for solving the regularization problem for finite covers, establish a limiting differential equation, and exemplify the behavior of the new regularization approach in simple cases.

4 0.10820977 156 nips-2002-On the Complexity of Learning the Kernel Matrix

Author: Olivier Bousquet, Daniel Herrmann

Abstract: We investigate data based procedures for selecting the kernel when learning with Support Vector Machines. We provide generalization error bounds by estimating the Rademacher complexities of the corresponding function classes. In particular we obtain a complexity bound for function classes induced by kernels with given eigenvectors, i.e., we allow to vary the spectrum and keep the eigenvectors fix. This bound is only a logarithmic factor bigger than the complexity of the function class induced by a single kernel. However, optimizing the margin over such classes leads to overfitting. We thus propose a suitable way of constraining the class. We use an efficient algorithm to solve the resulting optimization problem, present preliminary experimental results, and compare them to an alignment-based approach.

5 0.10766652 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information

Author: Eric P. Xing, Michael I. Jordan, Stuart Russell, Andrew Y. Ng

Abstract: Many algorithms rely critically on being given a good metric over their inputs. For instance, data can often be clustered in many “plausible” ways, and if a clustering algorithm such as K-means initially fails to find one that is meaningful to a user, the only recourse may be for the user to manually tweak the metric until sufficiently good clusters are found. For these and other applications requiring good metrics, it is desirable that we provide a more systematic way for users to indicate what they consider “similar.” For instance, we may ask them to provide examples. In this paper, we present an algorithm that, given examples of similar (and, , learns a distance metric over if desired, dissimilar) pairs of points in that respects these relationships. Our method is based on posing metric learning as a convex optimization problem, which allows us to give efficient, local-optima-free algorithms. We also demonstrate empirically that the learned metrics can be used to significantly improve clustering performance. £ ¤¢ £ ¥¢

6 0.10535555 90 nips-2002-Feature Selection in Mixture-Based Clustering

7 0.10233635 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

8 0.10008829 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization

9 0.094894454 111 nips-2002-Independent Components Analysis through Product Density Estimation

10 0.0941993 113 nips-2002-Information Diffusion Kernels

11 0.093873225 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks

12 0.092874631 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

13 0.092800245 120 nips-2002-Kernel Design Using Boosting

14 0.091547161 197 nips-2002-The Stability of Kernel Principal Components Analysis and its Relation to the Process Eigenspectrum

15 0.08748515 106 nips-2002-Hyperkernels

16 0.086882375 77 nips-2002-Effective Dimension and Generalization of Kernel Learning

17 0.078393117 125 nips-2002-Learning Semantic Similarity

18 0.078047916 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine

19 0.077993006 119 nips-2002-Kernel Dependency Estimation

20 0.077334106 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.248), (1, -0.101), (2, 0.02), (3, -0.002), (4, -0.107), (5, 0.052), (6, -0.037), (7, 0.081), (8, -0.07), (9, 0.074), (10, 0.114), (11, 0.019), (12, 0.088), (13, 0.053), (14, -0.005), (15, -0.021), (16, -0.021), (17, -0.014), (18, 0.013), (19, 0.008), (20, 0.007), (21, 0.017), (22, -0.014), (23, -0.023), (24, -0.075), (25, -0.094), (26, -0.058), (27, 0.112), (28, 0.041), (29, 0.13), (30, 0.007), (31, -0.166), (32, 0.039), (33, 0.025), (34, -0.05), (35, -0.083), (36, 0.019), (37, 0.078), (38, 0.077), (39, 0.083), (40, 0.144), (41, -0.038), (42, 0.047), (43, -0.082), (44, 0.062), (45, -0.095), (46, -0.012), (47, 0.014), (48, -0.121), (49, 0.136)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94228578 124 nips-2002-Learning Graphical Models with Mercer Kernels

Author: Francis R. Bach, Michael I. Jordan

Abstract: We present a class of algorithms for learning the structure of graphical models from data. The algorithms are based on a measure known as the kernel generalized variance (KGV), which essentially allows us to treat all variables on an equal footing as Gaussians in a feature space obtained from Mercer kernels. Thus we are able to learn hybrid graphs involving discrete and continuous variables of arbitrary type. We explore the computational properties of our approach, showing how to use the kernel trick to compute the relevant statistics in linear time. We illustrate our framework with experiments involving discrete and continuous data.

2 0.61707413 114 nips-2002-Information Regularization with Partially Labeled Data

Author: Martin Szummer, Tommi S. Jaakkola

Abstract: Classification with partially labeled data requires using a large number of unlabeled examples (or an estimated marginal P (x)), to further constrain the conditional P (y|x) beyond a few available labeled examples. We formulate a regularization approach to linking the marginal and the conditional in a general way. The regularization penalty measures the information that is implied about the labels over covering regions. No parametric assumptions are required and the approach remains tractable even for continuous marginal densities P (x). We develop algorithms for solving the regularization problem for finite covers, establish a limiting differential equation, and exemplify the behavior of the new regularization approach in simple cases.

3 0.55452091 111 nips-2002-Independent Components Analysis through Product Density Estimation

Author: Trevor Hastie, Rob Tibshirani

Abstract: We present a simple direct approach for solving the ICA problem, using density estimation and maximum likelihood. Given a candidate orthogonal frame, we model each of the coordinates using a semi-parametric density estimate based on cubic splines. Since our estimates have two continuous derivatives , we can easily run a second order search for the frame parameters. Our method performs very favorably when compared to state-of-the-art techniques. 1

4 0.54064488 99 nips-2002-Graph-Driven Feature Extraction From Microarray Data Using Diffusion Kernels and Kernel CCA

Author: Jean-philippe Vert, Minoru Kanehisa

Abstract: We present an algorithm to extract features from high-dimensional gene expression profiles, based on the knowledge of a graph which links together genes known to participate to successive reactions in metabolic pathways. Motivated by the intuition that biologically relevant features are likely to exhibit smoothness with respect to the graph topology, the algorithm involves encoding the graph and the set of expression profiles into kernel functions, and performing a generalized form of canonical correlation analysis in the corresponding reproducible kernel Hilbert spaces. Function prediction experiments for the genes of the yeast S. Cerevisiae validate this approach by showing a consistent increase in performance when a state-of-the-art classifier uses the vector of features instead of the original expression profile to predict the functional class of a gene.

5 0.53270447 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization

Author: Harald Steck, Tommi S. Jaakkola

Abstract: A common objective in learning a model from data is to recover its network structure, while the model parameters are of minor interest. For example, we may wish to recover regulatory networks from high-throughput data sources. In this paper we examine how Bayesian regularization using a product of independent Dirichlet priors over the model parameters affects the learned model structure in a domain with discrete variables. We show that a small scale parameter - often interpreted as

6 0.5310728 197 nips-2002-The Stability of Kernel Principal Components Analysis and its Relation to the Process Eigenspectrum

7 0.52809221 34 nips-2002-Artefactual Structure from Least-Squares Multidimensional Scaling

8 0.49710968 178 nips-2002-Robust Novelty Detection with Single-Class MPM

9 0.49158052 89 nips-2002-Feature Selection by Maximum Marginal Diversity

10 0.47668219 113 nips-2002-Information Diffusion Kernels

11 0.47136971 106 nips-2002-Hyperkernels

12 0.46464598 172 nips-2002-Recovering Articulated Model Topology from Observed Rigid Motion

13 0.45198286 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information

14 0.44560686 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks

15 0.43955106 90 nips-2002-Feature Selection in Mixture-Based Clustering

16 0.43620217 41 nips-2002-Bayesian Monte Carlo

17 0.42962265 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

18 0.42348221 52 nips-2002-Cluster Kernels for Semi-Supervised Learning

19 0.4216159 98 nips-2002-Going Metric: Denoising Pairwise Data

20 0.42101541 179 nips-2002-Scaling of Probability-Based Optimization Algorithms


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(11, 0.028), (14, 0.011), (23, 0.019), (29, 0.213), (42, 0.099), (54, 0.159), (55, 0.029), (68, 0.039), (74, 0.145), (87, 0.02), (92, 0.034), (98, 0.114)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.8741895 60 nips-2002-Convergence Properties of Some Spike-Triggered Analysis Techniques

Author: Liam Paninski

Abstract: vVe analyze the convergence properties of three spike-triggered data analysis techniques. All of our results are obtained in the setting of a (possibly multidimensional) linear-nonlinear (LN) cascade model for stimulus-driven neural activity. We start by giving exact rate of convergence results for the common spike-triggered average (STA) technique. Next, we analyze a spike-triggered covariance method, variants of which have been recently exploited successfully by Bialek, Simoncelli, and colleagues. These first two methods suffer from extraneous conditions on their convergence; therefore, we introduce an estimator for the LN model parameters which is designed to be consistent under general conditions. We provide an algorithm for the computation of this estimator and derive its rate of convergence. We close with a brief discussion of the efficiency of these estimators and an application to data recorded from the primary motor cortex of awake, behaving primates. 1

same-paper 2 0.84701216 124 nips-2002-Learning Graphical Models with Mercer Kernels

Author: Francis R. Bach, Michael I. Jordan

Abstract: We present a class of algorithms for learning the structure of graphical models from data. The algorithms are based on a measure known as the kernel generalized variance (KGV), which essentially allows us to treat all variables on an equal footing as Gaussians in a feature space obtained from Mercer kernels. Thus we are able to learn hybrid graphs involving discrete and continuous variables of arbitrary type. We explore the computational properties of our approach, showing how to use the kernel trick to compute the relevant statistics in linear time. We illustrate our framework with experiments involving discrete and continuous data.

3 0.76872683 52 nips-2002-Cluster Kernels for Semi-Supervised Learning

Author: Olivier Chapelle, Jason Weston, Bernhard SchĂślkopf

Abstract: We propose a framework to incorporate unlabeled data in kernel classifier, based on the idea that two points in the same cluster are more likely to have the same label. This is achieved by modifying the eigenspectrum of the kernel matrix. Experimental results assess the validity of this approach. 1

4 0.7647053 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions

Author: Max Welling, Simon Osindero, Geoffrey E. Hinton

Abstract: We propose a model for natural images in which the probability of an image is proportional to the product of the probabilities of some filter outputs. We encourage the system to find sparse features by using a Studentt distribution to model each filter output. If the t-distribution is used to model the combined outputs of sets of neurally adjacent filters, the system learns a topographic map in which the orientation, spatial frequency and location of the filters change smoothly across the map. Even though maximum likelihood learning is intractable in our model, the product form allows a relatively efficient learning procedure that works well even for highly overcomplete sets of filters. Once the model has been learned it can be used as a prior to derive the “iterated Wiener filter” for the purpose of denoising images.

5 0.7646035 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture

Author: David R. Martin, Charless C. Fowlkes, Jitendra Malik

Abstract: The goal of this work is to accurately detect and localize boundaries in natural scenes using local image measurements. We formulate features that respond to characteristic changes in brightness and texture associated with natural boundaries. In order to combine the information from these features in an optimal way, a classifier is trained using human labeled images as ground truth. We present precision-recall curves showing that the resulting detector outperforms existing approaches.

6 0.76056492 53 nips-2002-Clustering with the Fisher Score

7 0.75615829 74 nips-2002-Dynamic Structure Super-Resolution

8 0.7560274 3 nips-2002-A Convergent Form of Approximate Policy Iteration

9 0.75355196 169 nips-2002-Real-Time Particle Filters

10 0.75241661 152 nips-2002-Nash Propagation for Loopy Graphical Games

11 0.75067848 175 nips-2002-Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games

12 0.75005227 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

13 0.7500158 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation

14 0.74964571 173 nips-2002-Recovering Intrinsic Images from a Single Image

15 0.74851328 39 nips-2002-Bayesian Image Super-Resolution

16 0.74831045 2 nips-2002-A Bilinear Model for Sparse Coding

17 0.74727488 48 nips-2002-Categorization Under Complexity: A Unified MDL Account of Human Learning of Regular and Irregular Categories

18 0.7454679 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction

19 0.74532533 122 nips-2002-Learning About Multiple Objects in Images: Factorial Learning without Factorial Search

20 0.74516177 188 nips-2002-Stability-Based Model Selection