nips nips2006 nips2006-84 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Wenye Li, Kin-hong Lee, Kwong-sak Leung
Abstract: Kernel-based regularized learning seeks a model in a hypothesis space by minimizing the empirical error and the model’s complexity. Based on the representer theorem, the solution consists of a linear combination of translates of a kernel. This paper investigates a generalized form of representer theorem for kernel-based learning. After mapping predefined features and translates of a kernel simultaneously onto a hypothesis space by a specific way of constructing kernels, we proposed a new algorithm by utilizing a generalized regularizer which leaves part of the space unregularized. Using a squared-loss function in calculating the empirical error, a simple convex solution is obtained which combines predefined features with translates of the kernel. Empirical evaluations have confirmed the effectiveness of the algorithm for supervised learning tasks.
Reference: text
sentIndex sentText sentNum sentScore
1 hk Abstract Kernel-based regularized learning seeks a model in a hypothesis space by minimizing the empirical error and the model’s complexity. [sent-4, score-0.668]
2 Based on the representer theorem, the solution consists of a linear combination of translates of a kernel. [sent-5, score-0.437]
3 This paper investigates a generalized form of representer theorem for kernel-based learning. [sent-6, score-0.298]
4 After mapping predefined features and translates of a kernel simultaneously onto a hypothesis space by a specific way of constructing kernels, we proposed a new algorithm by utilizing a generalized regularizer which leaves part of the space unregularized. [sent-7, score-0.863]
5 Using a squared-loss function in calculating the empirical error, a simple convex solution is obtained which combines predefined features with translates of the kernel. [sent-8, score-0.472]
6 Empirical evaluations have confirmed the effectiveness of the algorithm for supervised learning tasks. [sent-9, score-0.06]
7 Examples of such tasks include regression which produces continuous values, and classification which predicts a class label for an input object. [sent-12, score-0.077]
8 Vapnik’s seminal work[1] shows that the key to effectively solving this problem is by controlling the solution’s complexity, which leads to the techniques known as regularized kernel methods[1] [2][3] and regularization networks[4]. [sent-13, score-0.365]
9 The work championed by Poggio and other researchers[5][6] implicitly treats learning as an approximation problem and gives a general scheme with ideas going back to modern regularization theory[7][8][9]. [sent-14, score-0.084]
10 For both frameworks, a solution is sought by simultaneously minimizing the empirical error and the complexity. [sent-15, score-0.182]
11 K According to representer theorem [10][11] [12], the minimizer of (1) admits a simple solution as a linear combination of translates of the kernel K by the training data m f∗ = ci Kxi , ci ∈ R, 1 ≤ i ≤ m i=1 for a variety of loss functions. [sent-18, score-0.931]
12 For 2 example, when used for classification, a squared-loss (y − f (x)) brings about the regularized least-squares classification (RLSC) algorithm[13][14][15]; while a hinge loss (1 − yf (x)) + ≡ max (1 − yf (x) , 0) corresponds to the classical support vector machines(SVM). [sent-20, score-0.404]
13 Using this model, data are implicitly projected onto the hypothesis space H K via a transformation φK : x → K x and a linear functional is sought by finding its representer in HK , which generally has infinite dimensions. [sent-21, score-0.405]
14 Using an existing trick in designing kernels, an RKHS is constructed which contains a subspace spanned by some predefined features and this subspace is left unregularized during the learning process. [sent-25, score-0.501]
15 Empirical results have shown the embedding of these features often has the effect of stabilizing the algorithms’s performance for different choices of kernels and prevents the results from deteriorating for inappropriate kernels. [sent-26, score-0.28]
16 First, a generalized regularized learning model and its associated representer theorem are studied. [sent-28, score-0.416]
17 Then, we introduce an existing trick with which we constructed a hypothesis space which has a subspace of the predefined features. [sent-29, score-0.323]
18 Next, a generic learning algorithm is proposed based on the model and especially evaluated for classification problems. [sent-30, score-0.064]
19 x1 , · · · , xm ∈ Rd , K ∈ Rm×m , and y1 , · · · , ym ∈ R. [sent-35, score-0.047]
20 2 Generalized regularized least-squares learning model Suppose the space HK decomposes into the direct sum: HK = H0 ⊕ H1 , where H0 is spanned by (≤ m) linearly independent features: H0 = span (ϕ1 , · · · , ϕ ). [sent-38, score-0.268]
21 We propose the generalized regularized least-squares (G-RLS) learning model as min L (f ) = f ∈HK 1 m m (yi − f (xi ))2 + γ f − P f 2 K , (2) i=1 where P f is the orthogonal projection of f onto H0 . [sent-39, score-0.33]
22 (4) f ∗ − P f ∗ = i=1 mγ P f ∗ is the orthogonal projection of f ∗ onto H0 and hence, P f∗ = λp ϕp , λp ∈ R, 1 ≤ p ≤ . [sent-49, score-0.082]
23 (5) p=1 So (4) is simplified to m f∗ = λp ϕ p + p=1 c i Kxi , i=1 (6) where yi − f ∗ (xi ) , 1 ≤ i ≤ m. [sent-50, score-0.087]
24 (7) mγ The coefficients λ1 , · · · , λ , c1 , · · · , cm are uniquely specified by m + linear equations. [sent-51, score-0.154]
25 The first m equations are obtained by substituting (6) into (7). [sent-52, score-0.047]
26 The rest equations are derived from the orthogonality constraint between P f ∗ and f − P f ∗ , which can be written as ci = m ϕp , (8) = 0, 1 ≤ p ≤ , c i Kxi i=1 K or equivalently due to the property of reproducing kernels, m (9) ci ϕp (xi ) = 0, 1 ≤ p ≤ . [sent-53, score-0.392]
27 i=1 m The solution (6) derived from (2) satisfies the reproduction property. [sent-54, score-0.082]
28 Suppose (x i ; yi )i=1 comes purely from a model which is perfectly linearly related to ϕ1 , · · · , ϕ , it is desirable to get back a solution that is independent of the other features. [sent-55, score-0.131]
29 As an evident result of (2), the property is satisfied. [sent-56, score-0.045]
30 The parameters c1 , · · · , cm in the resulting estimator (6) are all zero, which makes the regularizer in (2) equal to zero. [sent-57, score-0.245]
31 3 Kernel construction By decomposing a hypothesis space HK and studying a generalized regularizer, we have proposed the G-RLS model and derived a solution which consists of predefined features as well as translates of a kernel function. [sent-58, score-0.749]
32 In this section, starting with predefined features ϕ 1 , · · · , ϕ and a kernel Φ, we will construct a hypothesis space which contains the features and translates of the kernel by using an existing trick. [sent-59, score-0.819]
33 (13) This trick is studied in [16] to provide an alternative basis for radial basis functions and first used in a fast RBF interpolation algorithm[17]. [sent-65, score-0.385]
34 A sketch of properties which are peripheral to our concerns in this paper are given below. [sent-66, score-0.035]
35 Kxp = ϕ p , 1 ≤ p ≤ (14) ϕ p , ϕq K 1 0 = 1≤p=q≤ 1≤p=q≤ (15) (16) Hxp = H (xp , ·) = 0, 1 ≤ p ≤ H x i , ϕp = H (xi , ·) , ϕp K Hxi , Hxj K + 1 ≤ i ≤ m, 1 ≤ p ≤ = H (xi , xj ) , K Another property is that the matrix H = (H be used in the computations below. [sent-67, score-0.045]
36 (17) + 1 ≤ i, j ≤ m = 0, (18) m (xi , xj ))i,j= +1 is strictly positive definite, which will By constructing a kernel K using this trick, predefined features ϕ1 , · · · , ϕ are explicitly mapped onto HK which has a subspace H0 = span (ϕ1 , · · · , ϕ ) = span (ϕ1 , · · · , ϕ ). [sent-68, score-0.631]
37 By property (15), we can see that ϕ1 , · · · , ϕ also forms an orthonormal basis of H0 . [sent-69, score-0.09]
38 2 Computation After projecting the features ϕ1 , · · · , ϕ onto an RKHS HK , let’s study the regularized minimization problem in (2). [sent-71, score-0.362]
39 As shown in (6), the minimizer has a form of a linear combination of predefined features and translates of a kernel. [sent-72, score-0.428]
40 Furthermore, from the orthog˜ onal property between ϕp and Hxi in (17), we have m f∗ − P f∗ = (20) c i Hx i . [sent-74, score-0.045]
41 I× ET T +1 , · · · O ×(m− ) I(m− )×(m− = ) , (24) = y, ˜ λ ˜ c O ×(m− ) H + γmI where y1 = (y1 , · · · , y ) and y2 = (y O× O(m− )× ˜ ˜ and K−1 H = ˜ = I y1 y2 , (25) T ˜ , ym ) . [sent-83, score-0.047]
42 Equation (25) uniquely specifies λ by ˜ = y1 , λ (26) and ˜ by c λ. [sent-84, score-0.04]
43 (27) (H + γmI) ˜ = y2 − ET ˜ c H+γmI is a strictly positive definite matrix. [sent-85, score-0.095]
44 fast multipole method), and the complexity reduces to nearly O (m log m), which provides the potential to solve large scale problems. [sent-90, score-0.038]
45 4 A generic learning algorithm Based on the discussions above, a generic learning algorithm (G-RLS algorithm) is summarized below. [sent-91, score-0.128]
46 For (≤ m) predefined linearly independent features ϕ1 , · · · , ϕ of the data, define ϕ1 , · · · , ϕ according to equation (12). [sent-95, score-0.124]
47 Choose a symmetric, strictly positive definite function Φx (x ) = Φ (x, x ) which is continuous on X × X . [sent-97, score-0.095]
48 The estimator f : X → Y is given by m ˜ λp ϕp (x) + f (x) = p=1 ci Hxi (x) ˜ (28) i= +1 ˜ ˜ ˜ where λ1 , · · · , λ , c +1 , · · · , cm are obtained by solving equations (26) and (27). [sent-100, score-0.336]
49 ˜ The algorithm can be applied to a number of applications including regression and binary classification. [sent-101, score-0.077]
50 Polynomial features up to the second degree (ϕ1 = 1, ϕ2 = x, ϕ3 = x2 ) were used for G-RLS algorithm along with a Gaussian RBF kernel x−· 2 Φx (·) = e− σ2 . [sent-103, score-0.249]
51 We selected ridge regression with the Gaussian RBF kernel for a comparison, which can be regarded as an implementation of standard regularized least-squares model for regression tasks. [sent-104, score-0.535]
52 For both algorithms, three trials were made in which the parameter σ was set to a large value, to a small value, and by cross validation respectively. [sent-105, score-0.045]
53 For each σ, the parameter γ was set by cross validation. [sent-106, score-0.045]
54 Comparing with ridge regression in figure 1(b), the existence of polynomial features in G-RLS has the effect of stabilizing the results, as shown in figure 1(a). [sent-107, score-0.427]
55 Varying σ, different fitting results were obtained by ridge regression. [sent-108, score-0.1]
56 In the case of generalized regularized least-squares classification (G-RLSC), each y i of the training set takes the values {−1, 1}. [sent-110, score-0.248]
57 otherwise G-RLSC uses the ”classical” squared-loss as a classification loss criterion. [sent-112, score-0.048]
58 The effectiveness of this criterion has been reported by the empirical results[13][14][15]. [sent-113, score-0.139]
59 The existence of polynomial features in G-RLS helped to improve the stability of the algorithm. [sent-117, score-0.173]
60 5 Experiments To evaluate the performance of G-RLS algorithm, empirical results are reported on text categorization tasks using the three datasets from CMU text mining group1. [sent-118, score-0.179]
61 The 7-sectors dataset has 4, 573 web pages belonging to seven economic sectors, with each sector containing pages varying from 300 to 1, 099. [sent-119, score-0.092]
62 The 4-universities dataset consists of 8, 282 webpages collected mainly from four universities, in which the pages belong to seven classes and each class has 137 to 3, 764 pages. [sent-120, score-0.092]
63 The 20-newsgroups dataset collects UseNet postings into twenty newsgroups and each group has about 1, 000 messages. [sent-121, score-0.088]
64 Probabilistic latent semantic analysis[19] (pLSA) was used to get ten latent features ϕ1 , · · · , ϕ10 out of the data. [sent-130, score-0.306]
65 Each experiment consisted of ten runs and the average accuracy is reported. [sent-132, score-0.047]
66 In each run, the data were separated by the xval-prep utility accompanied in C4. [sent-133, score-0.035]
67 Moreover, an insightful observation may find that although SVM excels on the dataset when the number of training data increases, G-RLSC shows better performance than standard RLSC. [sent-137, score-0.05]
68 A possible reason is that the hinge loss used by SVM is more appropriate than the squared-loss used by RLSC and G-RLSC on this dataset; while the embedding of pLSA features still improves the accuracy. [sent-138, score-0.254]
69 6 Conclusion In this paper, we first proposed a generic G-RLS learning model. [sent-139, score-0.064]
70 Unlike the standard kernel-based methods which only consider the translates of a kernel for model learning, the new model takes predefined features into special consideration. [sent-140, score-0.474]
71 A generalized regularizer is studied which leaves part of the hypothesis space unregularized. [sent-141, score-0.307]
72 Similar ideas were explored in spline smoothing[9] in which low degree polynomials are not regularized. [sent-142, score-0.088]
73 Another example is semi-parametric SVM[2], which considers the addition of some features to the kernel expansion for SVM. [sent-143, score-0.249]
74 However, to our knowledge, few learning algorithms and applications have been studied along this line from a unified RKHS regularization point of view, or investigated for empirical evaluations. [sent-144, score-0.163]
75 An RKHS that contains the combined solutions is explicitly constructed based on a special trick in designing kernels. [sent-146, score-0.2]
76 (The idea of a conditionally positive definite function[20] is lurking in the back1 2 http://www. [sent-147, score-0.077]
77 Figure 2: Classification accuracies on CMU text datasets with different number of training samples. [sent-157, score-0.05]
78 Ten pLSA features along with a linear kernel Φ were used for G-RLSC. [sent-158, score-0.249]
79 Both bag-of-words (BoW) and pLSA representations of documents were experimented for RLSC and SVM with a linear kernel. [sent-159, score-0.039]
80 ) With the construction of the RKHS, the computation is further optimized and the theoretical analysis of such algorithms is also potentially facilitated. [sent-164, score-0.043]
81 The empirical results from real-world applications have confirmed the effectiveness of the algorithm. [sent-166, score-0.139]
82 Regularization algorithms for learning that are equivalent to multilayer networks. [sent-199, score-0.035]
83 Spaces of distributions, interpolation by translates of a basis function and error estimates. [sent-272, score-0.357]
84 Fast solution of the radial basis function interpolation equations: Domain decomposition methods. [sent-283, score-0.224]
85 Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. [sent-291, score-0.05]
86 Interpolation of scattered data: Distances, matrices, and conditionally positive definite functions. [sent-303, score-0.077]
wordName wordTfidf (topN-words)
[('hk', 0.337), ('prede', 0.259), ('translates', 0.225), ('rlsc', 0.221), ('kxi', 0.192), ('representer', 0.168), ('trick', 0.16), ('regularized', 0.156), ('plsa', 0.154), ('hong', 0.14), ('hxi', 0.133), ('hx', 0.131), ('kernel', 0.125), ('rkhs', 0.124), ('features', 0.124), ('ci', 0.121), ('xp', 0.117), ('cm', 0.114), ('xi', 0.104), ('ridge', 0.1), ('kong', 0.098), ('hypothesis', 0.096), ('poggio', 0.092), ('generalized', 0.092), ('bow', 0.088), ('spline', 0.088), ('yi', 0.087), ('interpolation', 0.087), ('regularization', 0.084), ('onto', 0.082), ('rmed', 0.081), ('minimizer', 0.079), ('empirical', 0.079), ('regularizer', 0.077), ('stabilizing', 0.077), ('rgc', 0.077), ('yf', 0.077), ('xq', 0.077), ('regression', 0.077), ('svm', 0.07), ('span', 0.069), ('subspace', 0.067), ('generic', 0.064), ('cmu', 0.062), ('effectiveness', 0.06), ('strictly', 0.059), ('sought', 0.059), ('reproducing', 0.058), ('classi', 0.057), ('chinese', 0.056), ('cv', 0.056), ('groups', 0.056), ('estimator', 0.054), ('kx', 0.054), ('rbf', 0.052), ('dataset', 0.05), ('text', 0.05), ('nite', 0.049), ('mi', 0.049), ('polynomial', 0.049), ('radial', 0.048), ('loss', 0.048), ('ym', 0.047), ('ten', 0.047), ('derivative', 0.047), ('equations', 0.047), ('latent', 0.046), ('hinge', 0.046), ('basis', 0.045), ('property', 0.045), ('cross', 0.045), ('solution', 0.044), ('spanned', 0.043), ('semantic', 0.043), ('construction', 0.043), ('kernels', 0.043), ('seven', 0.042), ('leaves', 0.042), ('conditionally', 0.041), ('designing', 0.04), ('sons', 0.04), ('uniquely', 0.04), ('experimented', 0.039), ('hilbert', 0.039), ('investigates', 0.038), ('beatson', 0.038), ('multipole', 0.038), ('newsgroups', 0.038), ('om', 0.038), ('reproduction', 0.038), ('tchebychef', 0.038), ('positive', 0.036), ('embedding', 0.036), ('soc', 0.035), ('accompanied', 0.035), ('fresh', 0.035), ('suykens', 0.035), ('multilayer', 0.035), ('peripheral', 0.035), ('fung', 0.035)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space
Author: Wenye Li, Kin-hong Lee, Kwong-sak Leung
Abstract: Kernel-based regularized learning seeks a model in a hypothesis space by minimizing the empirical error and the model’s complexity. Based on the representer theorem, the solution consists of a linear combination of translates of a kernel. This paper investigates a generalized form of representer theorem for kernel-based learning. After mapping predefined features and translates of a kernel simultaneously onto a hypothesis space by a specific way of constructing kernels, we proposed a new algorithm by utilizing a generalized regularizer which leaves part of the space unregularized. Using a squared-loss function in calculating the empirical error, a simple convex solution is obtained which combines predefined features with translates of the kernel. Empirical evaluations have confirmed the effectiveness of the algorithm for supervised learning tasks.
2 0.1912407 181 nips-2006-Stability of $K$-Means Clustering
Author: Alexander Rakhlin, Andrea Caponnetto
Abstract: We phrase K-means clustering as an empirical risk minimization procedure over a class HK and explicitly calculate the covering number for this class. Next, we show that stability of K-means clustering is characterized by the geometry of HK with respect to the underlying distribution. We prove that in the case of a unique global minimizer, the clustering solution is stable with respect to complete changes of the data, while for the case of multiple minimizers, the change of Ω(n1/2 ) samples defines the transition between stability and instability. While for a finite number of minimizers this result follows from multinomial distribution estimates, the case of infinite minimizers requires more refined tools. We conclude by proving that stability of the functions in HK implies stability of the actual centers of the clusters. Since stability is often used for selecting the number of clusters in practice, we hope that our analysis serves as a starting point for finding theoretically grounded recipes for the choice of K. 1
3 0.13378328 116 nips-2006-Learning from Multiple Sources
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields results for classification and regression generally, and for density estimation within the exponential family. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. 1
4 0.13146345 65 nips-2006-Denoising and Dimension Reduction in Feature Space
Author: Mikio L. Braun, Klaus-Robert Müller, Joachim M. Buhmann
Abstract: We show that the relevant information about a classification problem in feature space is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem. Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data to perform classification. Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and to (3) de-noise in feature space in order to yield better classification results. 1
5 0.12694672 95 nips-2006-Implicit Surfaces with Globally Regularised and Compactly Supported Basis Functions
Author: Christian Walder, Olivier Chapelle, Bernhard Schölkopf
Abstract: We consider the problem of constructing a function whose zero set is to represent a surface, given sample points with surface normal vectors. The contributions include a novel means of regularising multi-scale compactly supported basis functions that leads to the desirable properties previously only associated with fully supported bases, and show equivalence to a Gaussian process with modified covariance function. We also provide a regularisation framework for simpler and more direct treatment of surface normals, along with a corresponding generalisation of the representer theorem. We demonstrate the techniques on 3D problems of up to 14 million data points, as well as 4D time series data. 1
6 0.11885414 186 nips-2006-Support Vector Machines on a Budget
7 0.11096213 104 nips-2006-Large-Scale Sparsified Manifold Regularization
8 0.10835701 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models
9 0.097916864 150 nips-2006-On Transductive Regression
10 0.09489242 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods
11 0.09073928 102 nips-2006-Kernel Maximum Entropy Data Transformation and an Enhanced Spectral Clustering Algorithm
12 0.08692386 130 nips-2006-Max-margin classification of incomplete data
13 0.085224018 82 nips-2006-Gaussian and Wishart Hyperkernels
14 0.083190143 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
15 0.082089067 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization
16 0.07668189 20 nips-2006-Active learning for misspecified generalized linear models
17 0.075921342 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
18 0.074847654 103 nips-2006-Kernels on Structured Objects Through Nested Histograms
19 0.073346123 7 nips-2006-A Local Learning Approach for Clustering
20 0.073156379 6 nips-2006-A Kernel Subspace Method by Stochastic Realization for Learning Nonlinear Dynamical Systems
topicId topicWeight
[(0, -0.249), (1, 0.099), (2, 0.009), (3, 0.078), (4, -0.054), (5, 0.136), (6, -0.084), (7, 0.026), (8, 0.039), (9, 0.079), (10, -0.165), (11, -0.054), (12, -0.019), (13, -0.105), (14, 0.008), (15, -0.023), (16, 0.003), (17, -0.001), (18, -0.093), (19, 0.066), (20, 0.018), (21, 0.084), (22, 0.095), (23, 0.08), (24, -0.163), (25, -0.045), (26, 0.159), (27, -0.132), (28, -0.145), (29, 0.072), (30, 0.047), (31, -0.086), (32, -0.057), (33, -0.169), (34, 0.132), (35, -0.004), (36, -0.015), (37, -0.097), (38, -0.037), (39, -0.074), (40, -0.057), (41, 0.021), (42, 0.039), (43, 0.09), (44, -0.052), (45, 0.086), (46, 0.042), (47, -0.065), (48, 0.02), (49, 0.059)]
simIndex simValue paperId paperTitle
same-paper 1 0.93318319 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space
Author: Wenye Li, Kin-hong Lee, Kwong-sak Leung
Abstract: Kernel-based regularized learning seeks a model in a hypothesis space by minimizing the empirical error and the model’s complexity. Based on the representer theorem, the solution consists of a linear combination of translates of a kernel. This paper investigates a generalized form of representer theorem for kernel-based learning. After mapping predefined features and translates of a kernel simultaneously onto a hypothesis space by a specific way of constructing kernels, we proposed a new algorithm by utilizing a generalized regularizer which leaves part of the space unregularized. Using a squared-loss function in calculating the empirical error, a simple convex solution is obtained which combines predefined features with translates of the kernel. Empirical evaluations have confirmed the effectiveness of the algorithm for supervised learning tasks.
2 0.66107607 181 nips-2006-Stability of $K$-Means Clustering
Author: Alexander Rakhlin, Andrea Caponnetto
Abstract: We phrase K-means clustering as an empirical risk minimization procedure over a class HK and explicitly calculate the covering number for this class. Next, we show that stability of K-means clustering is characterized by the geometry of HK with respect to the underlying distribution. We prove that in the case of a unique global minimizer, the clustering solution is stable with respect to complete changes of the data, while for the case of multiple minimizers, the change of Ω(n1/2 ) samples defines the transition between stability and instability. While for a finite number of minimizers this result follows from multinomial distribution estimates, the case of infinite minimizers requires more refined tools. We conclude by proving that stability of the functions in HK implies stability of the actual centers of the clusters. Since stability is often used for selecting the number of clusters in practice, we hope that our analysis serves as a starting point for finding theoretically grounded recipes for the choice of K. 1
3 0.65670556 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models
Author: S. S. Keerthi, Vikas Sindhwani, Olivier Chapelle
Abstract: We consider the task of tuning hyperparameters in SVM models based on minimizing a smooth performance validation function, e.g., smoothed k-fold crossvalidation error, using non-linear optimization techniques. The key computation in this approach is that of the gradient of the validation function with respect to hyperparameters. We show that for large-scale problems involving a wide choice of kernel-based models and validation functions, this computation can be very efficiently done; often within just a fraction of the training time. Empirical results show that a near-optimal set of hyperparameters can be identified by our approach with very few training rounds and gradient computations. . 1
4 0.620139 82 nips-2006-Gaussian and Wishart Hyperkernels
Author: Risi Kondor, Tony Jebara
Abstract: We propose a new method for constructing hyperkenels and define two promising special cases that can be computed in closed form. These we call the Gaussian and Wishart hyperkernels. The former is especially attractive in that it has an interpretable regularization scheme reminiscent of that of the Gaussian RBF kernel. We discuss how kernel learning can be used not just for improving the performance of classification and regression methods, but also as a stand-alone algorithm for dimensionality reduction and relational or metric learning. 1
5 0.6127041 95 nips-2006-Implicit Surfaces with Globally Regularised and Compactly Supported Basis Functions
Author: Christian Walder, Olivier Chapelle, Bernhard Schölkopf
Abstract: We consider the problem of constructing a function whose zero set is to represent a surface, given sample points with surface normal vectors. The contributions include a novel means of regularising multi-scale compactly supported basis functions that leads to the desirable properties previously only associated with fully supported bases, and show equivalence to a Gaussian process with modified covariance function. We also provide a regularisation framework for simpler and more direct treatment of surface normals, along with a corresponding generalisation of the representer theorem. We demonstrate the techniques on 3D problems of up to 14 million data points, as well as 4D time series data. 1
6 0.58704305 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods
7 0.54070175 104 nips-2006-Large-Scale Sparsified Manifold Regularization
8 0.53559381 6 nips-2006-A Kernel Subspace Method by Stochastic Realization for Learning Nonlinear Dynamical Systems
9 0.5314014 186 nips-2006-Support Vector Machines on a Budget
10 0.51755339 177 nips-2006-Sparse Kernel Orthonormalized PLS for feature extraction in large data sets
11 0.48549646 116 nips-2006-Learning from Multiple Sources
12 0.46863559 129 nips-2006-Map-Reduce for Machine Learning on Multicore
13 0.46568587 30 nips-2006-An Oracle Inequality for Clipped Regularized Risk Minimizers
14 0.45202288 150 nips-2006-On Transductive Regression
15 0.44682571 65 nips-2006-Denoising and Dimension Reduction in Feature Space
16 0.42746869 102 nips-2006-Kernel Maximum Entropy Data Transformation and an Enhanced Spectral Clustering Algorithm
17 0.42247173 55 nips-2006-Computation of Similarity Measures for Sequential Data using Generalized Suffix Trees
18 0.408995 140 nips-2006-Multiple Instance Learning for Computer Aided Diagnosis
19 0.40479225 153 nips-2006-Online Clustering of Moving Hyperplanes
20 0.40439141 103 nips-2006-Kernels on Structured Objects Through Nested Histograms
topicId topicWeight
[(1, 0.153), (3, 0.042), (7, 0.076), (9, 0.048), (22, 0.076), (44, 0.075), (57, 0.051), (64, 0.01), (65, 0.074), (69, 0.047), (83, 0.168), (90, 0.102)]
simIndex simValue paperId paperTitle
1 0.88251436 77 nips-2006-Fast Computation of Graph Kernels
Author: Karsten M. Borgwardt, Nicol N. Schraudolph, S.v.n. Vishwanathan
Abstract: Using extensions of linear algebra concepts to Reproducing Kernel Hilbert Spaces (RKHS), we define a unifying framework for random walk kernels on graphs. Reduction to a Sylvester equation allows us to compute many of these kernels in O(n3 ) worst-case time. This includes kernels whose previous worst-case time complexity was O(n6 ), such as the geometric kernels of G¨ rtner et al. [1] and a the marginal graph kernels of Kashima et al. [2]. Our algebra in RKHS allow us to exploit sparsity in directed and undirected graphs more effectively than previous methods, yielding sub-cubic computational complexity when combined with conjugate gradient solvers or fixed-point iterations. Experiments on graphs from bioinformatics and other application domains show that our algorithms are often more than 1000 times faster than existing approaches. 1
same-paper 2 0.86830723 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space
Author: Wenye Li, Kin-hong Lee, Kwong-sak Leung
Abstract: Kernel-based regularized learning seeks a model in a hypothesis space by minimizing the empirical error and the model’s complexity. Based on the representer theorem, the solution consists of a linear combination of translates of a kernel. This paper investigates a generalized form of representer theorem for kernel-based learning. After mapping predefined features and translates of a kernel simultaneously onto a hypothesis space by a specific way of constructing kernels, we proposed a new algorithm by utilizing a generalized regularizer which leaves part of the space unregularized. Using a squared-loss function in calculating the empirical error, a simple convex solution is obtained which combines predefined features with translates of the kernel. Empirical evaluations have confirmed the effectiveness of the algorithm for supervised learning tasks.
3 0.84945208 35 nips-2006-Approximate inference using planar graph decomposition
Author: Amir Globerson, Tommi S. Jaakkola
Abstract: A number of exact and approximate methods are available for inference calculations in graphical models. Many recent approximate methods for graphs with cycles are based on tractable algorithms for tree structured graphs. Here we base the approximation on a different tractable model, planar graphs with binary variables and pure interaction potentials (no external field). The partition function for such models can be calculated exactly using an algorithm introduced by Fisher and Kasteleyn in the 1960s. We show how such tractable planar models can be used in a decomposition to derive upper bounds on the partition function of non-planar models. The resulting algorithm also allows for the estimation of marginals. We compare our planar decomposition to the tree decomposition method of Wainwright et. al., showing that it results in a much tighter bound on the partition function, improved pairwise marginals, and comparable singleton marginals. Graphical models are a powerful tool for modeling multivariate distributions, and have been successfully applied in various fields such as coding theory and image processing. Applications of graphical models typically involve calculating two types of quantities, namely marginal distributions, and MAP assignments. The evaluation of the model partition function is closely related to calculating marginals [12]. These three problems can rarely be solved exactly in polynomial time, and are provably computationally hard in the general case [1]. When the model conforms to a tree structure, however, all these problems can be solved in polynomial time. This has prompted extensive research into tree based methods. For example, the junction tree method [6] converts a graphical model into a tree by clustering nodes into cliques, such that the graph over cliques is a tree. The resulting maximal clique size (cf. tree width) may nevertheless be prohibitively large. Wainwright et. al. [9, 11] proposed an approximate method based on trees known as tree reweighting (TRW). The TRW approach decomposes the potential vector of a graphical model into a mixture over spanning trees of the model, and then uses convexity arguments to bound various quantities, such as the partition function. One key advantage of this approach is that it provides bounds on partition function value, a property which is not shared by approximations based on Bethe free energies [13]. In this paper we focus on a different class of tractable models: planar graphs. A graph is called planar if it can be drawn in the plane without crossing edges. Works in the 1960s by physicists Fisher [5] and Kasteleyn [7], among others, have shown that the partition function for planar graphs may be calculated in polynomial time. This, however, is true under two key restrictions. One is that the variables xi are binary. The other is that the interaction potential depends only on xi xj (where xi ∈ {±1}), and not on their individual values (i.e., the zero external field case). Here we show how the above method can be used to obtain upper bounds on the partition function for non-planar graphs. As in TRW, we decompose the potential of a non-planar graph into a sum over spanning planar models, and then use a convexity argument to obtain an upper bound on the log partition function. The bound optimization is a convex problem, and can be solved in polynomial time. We compare our method with TRW on a planar graph with an external field, and show that it performs favorably with respect to both pairwise marginals and the bound on the partition function, and the two methods give similar results for singleton marginals. 1 Definitions and Notations Given a graph G with n vertices and a set of edges E, we are interested in pairwise Markov Random Fields (MRF) over the graph G. A pairwise MRF [13] is a multivariate distribution over variables x = {x1 , . . . , xn } defined as 1 P p(x) = e ij∈E fij (xi ,xj ) (1) Z where fij are a set of |E| functions, or interaction potentials, defined over pairs of variables. The P partition function is defined as Z = x e ij∈E fij (xi ,xj ) . Here we will focus on the case where xi ∈ {±1}. Furthermore, we will be interested in interaction potentials which only depend on agreement or disagreement between the signs of their variables. We define those by 1 θij (1 + xi xj ) = θij I(xi = xj ) (2) 2 so that fij (xi , xj ) is zero if xi = xj and θij if xi = xj . The model is then defined via the set of parameters θij . We use θ to denote the vector of parameters θij , and denote the partition function by Z(θ) to highlight its dependence on these parameters. f (xi , xj ) = A graph G is defined as planar if it can be drawn in the plane without any intersection of edges [4]. With some abuse of notation, we define E as the set of line segments in 2 corresponding to the edges in the graph. The regions of 2 \ E are defined as the faces of the graph. The face which corresponds to an unbounded region is called the external face. Given a planar graph G, its dual graph G∗ is defined in the following way: the vertices of G∗ correspond to faces of G, and there is an edge between two vertices in G∗ iff the two corresponding faces in G share an edge. If the graph G is weighted, the weight on an edge in G∗ is the weight on the edge shared by the corresponding faces in G. A plane triangulation of a planar graph G is obtained from G by adding edges such that all the faces of the resulting graph have exactly three vertices. Thus a plane triangulated graph has a dual where all vertices have degree three. It can be shown that every plane graph can be plane triangulated [4]. We shall also need the notion of a perfect matching on a graph. A perfect matching on a graph G is defined as a set of edges H ⊆ E such that every vertex in G has exactly one edge in H incident on it. If the graph is weighted, the weight of the matching is defined as the product of the weights of the edges in the matching. Finally, we recall the definition of a marginal polytope of a graph [12]. Consider an MRF over a graph G where fij are given by Equation 2. Denote the probability of the event I(xi = xj ) under p(x) by τij . The marginal polytope of G, denoted by M(G), is defined as the set of values τij that can be obtained under some assignment to the parameters θij . For a general graph G the polytope M(G) cannot be described using a polynomial number of inequalities. However, for planar graphs, it turns out that a set of O(n3 ) constraints, commonly referred to as triangle inequalities, suffice to describe M(G) (see [3] page 434). The triangle inequalities are defined by 1 TRI(n) = {τij : τij + τjk − τik ≤ 1, τij + τjk + τik ≥ 1, ∀i, j, k ∈ {1, . . . , n}} (3) Note that the above inequalities actually contain variables τij which do not correspond to edges in the original graph G. Thus the equality M(G) = TRI(n) should be understood as referring only to the values of τij that correspond to edges in the graph. Importantly, the values of τij for edges not in the graph need not be valid marginals for any MRF. In other words M(G) is a projection of TRI(n) on the set of edges of G. It is well known that the marginal polytope for trees is described via pairwise constraints. It is thus interesting that for planar graphs, it is triplets, rather than pairwise 1 The definition here is slightly different from that in [3], since here we refer to agreement probabilities, whereas [3] refers to disagreement probabilities. This polytope is also referred to as the cut polytope. constraints, that characterize the polytope. In this sense, planar graphs and trees may be viewed as a hierarchy of polytope complexity classes. It remains an interesting problem to characterize other structures in this hierarchy and their related inference algorithms. 2 Exact calculation of partition function using perfect matching The seminal works of Kasteleyn [7] and Fisher [5] have shown how one can calculate the partition function for a binary MRF over a planar graph with pure interaction potentials. We briefly review Fisher’s construction, which we will use in what follows. Our interpretation of the method differs somewhat from that of Fisher, but we believe it is more straightforward. The key idea in calculating the partition function is to convert the summation over values of x to the problem of calculating the sum of weights of all perfect matchings in a graph constructed from G, as shown below. In this section, we consider weighted graphs (graphs with numbers assigned to their edges). For the graph G associated with the pairwise MRF, we assign weights wij = e2θij to the edges. The first step in the construction is to plane triangulate the graph G. Let us call the resulting graph GT . We define an MRF on GT by assigning a parameter θij = 0 to the edges that have been added to G, and the corresponding weight wij = 1. Thus GT essentially describes the same distribution as G, and therefore has the same partition function. We can thus restrict our attention to calculating the partition function for the MRF on GT . As a first step in calculating a partition function over GT , we introduce the following definition: a ˆ set of edges E in GT is an agreement edge set (or AES) if for every triangle face F in GT one of the ˆ ˆ following holds: The edges in F are all in E, or exactly one of the edges in F is in E. The weight ˆ is defined as the product of the weights of the edges in E. ˆ of a set E It can be shown that there exists a bijection between pairs of assignments {x, −x} and agreement edge sets. The mapping from x to an edge set is simply the set of edges such that xi = xj . It is easy to see that this is an agreement edge set. The reverse mapping is obtained by finding an assignment x such that xi = xj iff the corresponding edge is in the agreement edge set. The existence of this mapping can be shown by induction on the number of (triangle) faces. P The contribution of a given assignment x to the partition function is e ˆ sponds to an AES denoted by E it is easy to see that P e ij∈E θij I(xi =xj ) = e− P ij∈E θij P e ˆ ij∈E 2θij = ce P ˆ ij∈E ij∈E 2θij θij I(xi =xj ) =c wij . If x corre(4) ˆ ij∈E P where c = e− ij∈E θij . Define the superset Λ as the set of agreement edge sets. The above then implies that Z(θ) = 2c E∈Λ ij∈E wij , and is thus proportional to the sum of AES weights. ˆ ˆ To sum over agreement edge sets, we use the following elegant trick introduced by Fisher [5]. Construct a new graph GPM from the dual of GT by introducing new vertices and edges according to the following rule: Replace each original vertex with three vertices that are connected to each other, and assign a weight of one to the new edges. Next, consider the three neighbors of the original vertex 2 . Connect each of the three new vertices to one of these three neighbors, keeping the original weights on these edges. The transformation is illustrated in Figure 1. The new graph GPM has O(3n) vertices, and is also planar. It can be seen that there is a one to one correspondence between perfect matchings in GPM and agreement edge sets in GT . Define Ω to be the set of perfect matchings in GPM . Then Z(θ) = 2c M ∈Ω ij∈M wij where we have used the fact that all the new weights have a value of one. Thus, the partition function is a sum over the weights of perfect matchings in GPM . Finally, we need a way of summing over the weights of the set of perfect matchings in a graph. Kasteleyn [7] proved that for a planar graph GPM , this sum may be obtained using the following sequence of steps: • Direct the edges of the graph GPM such that for every face (except possibly the external face), the number of edges on its perimeter oriented in a clockwise manner is odd. Kasteleyn showed that such a so called Pfaffian orientation may be constructed in polynomial time for a planar graph (see also [8] page 322). 2 Note that in the dual of GT all vertices have degree three, since GT is plane triangulated. 1.2 0.7 0.6 1 1 1 0.8 0.6 0.8 1.5 1.4 1.5 1 1 1.2 1 1 1 1 0.7 1.4 1 1 1 Figure 1: Illustration of the graph transformations in Section 2 for a complete graph with four vertices. Left panel shows the original weighted graph (dotted edges and grey vertices) and its dual (solid edges and black vertices). Right panel shows the dual graph with each vertex replaced by a triangle (the graph GPM in the text). Weights for dual graph edges correspond to the weights on the original graph. • Define the matrix P (GPM ) to be a skew symmetric matrix such that Pij = 0 if ij is not an edge, Pij = wij if the arrow on edge ij runs from i to j and Pij = −wij otherwise. • The sum over weighted matchings can then be shown to equal |P (GPM )|. The partition function is thus given by Z(θ) = 2c |P (GPM )|. To conclude this section we reiterate the following two key points: the partition function of a binary MRF over a planar graph with interaction potentials as in Equation 2 may be calculated in polynomial time by calculating the determinant of a matrix of size O(3n). An important outcome of this result is that the functional relation between Z(θ) and the parameters θij is known, a fact we shall use in what follows. 3 Partition function bounds via planar decomposition Given a non-planar graph G over binary variables with a vector of interaction potentials θ, we wish to use the exact planar computation to obtain a bound on the partition function of the MRF on G. We assume for simplicity that the potentials on the MRF for G are given in the form of Equation 2. Thus, G violates the assumptions of the previous section only in its non-planarity. Define G(r) as a set of spanning planar subgraphs of G, i.e., each graph G(r) is planar and contains all the vertices of G and some its edges. Denote by m the number of such graphs. Introduce the following definitions: (r) • θ (r) is a set of parameters on the edges of G(r) , and θij is an element in this set. Z(θ (r) ) is the partition function of the MRF on G(r) with parameters θ (r) . ˆ (r) ˆ(r) • θ is a set of parameters on the edges of G such that if edge (ij) is in G(r) then θij = (r) ˆ(r) θ , and otherwise θ = 0. ij ij Given a distribution ρ(r) on the graphs G(r) (i.e., ρ(r) ≥ 0 for r = 1, . . . , m and assume that the parameters for G(r) are such that ˆ ρ(r)θ θ= (r) r ρ(r) = 1), (5) r Then, by the convexity of the log partition function, as a function of the model parameters, we have ρ(r) log Z(θ (r) ) ≡ f (θ, ρ, θ (r) ) log Z(θ) ≤ (6) r Since by assumption the graphs G(r) are planar, this bound can be calculated in polynomial time. Since this bound is true for any set of parameters θ (r) which satisfies the condition in Equation 5 and for any distribution ρ(r), we may optimize over these two variables to obtain the tightest bound possible. Define the optimal bound for a fixed value of ρ(r) by g(ρ, θ) (optimization is w.r.t. θ (r) ) g(ρ, θ) = f (θ, ρ, θ (r) ) min θ (r) : P ˆ ρ(r)θ (r) =θ (7) Also, define the optimum of the above w.r.t. ρ by h(θ). h(θ) = min g(θ, ρ) ρ(r) ≥ 0, ρ(r) = 1 (8) Thus, h(θ) is the optimal upper bound for the given parameter vector θ. In the following section we argue that we can in fact find the global optimum of the above problem. 4 Globally Optimal Bound Optimization First consider calculating g(ρ, θ) from Equation 7. Note that since log Z(θ (r) ) is a convex function of θ (r) , and the constraints are linear, the overall optimization is convex and can be solved efficiently. In the current implementation, we use a projected gradient algorithm [2]. The gradient of f (θ, ρ, θ (r) ) w.r.t. θ (r) is given by ∂f (θ, ρ, θ (r) ) (r) ∂θij (r) = ρ(r) 1 + eθij (r) P −1 (GPM ) (r) k(i,j) Sign(Pk(i,j) (GPM )) (9) where k(i, j) returns the row and column indices of the element in the upper triangular matrix of (r) (r) P (GPM ), which contains the element e2θij . Since the optimization in Equation 7 is convex, it has an equivalent convex dual. Although we do not use this dual for optimization (because of the difficulty of expressing the entropy of planar models solely in terms of triplet marginals), it nevertheless allows some insight into the structure of the problem. The dual in this case is closely linked to the notion of the marginal polytope defined in Section 1. Using a derivation similar to [11], we arrive at the following characterization of the dual g(ρ, θ) = max τ ∈TRI(n) ρ(r)H(θ (r) (τ )) θ·τ + (10) r where θ (r) (τ ) denotes the parameters of an MRF on G(r) such that its marginals are given by the restriction of τ to the edges of G(r) , and H(θ (r) (τ )) denotes the entropy of the MRF over G(r) with parameters θ (r) (τ ). The maximized function in Equation 10 is linear in ρ and thus g(ρ, θ) is a pointwise maximum over (linear) convex functions in ρ and is thus convex in ρ. It therefore has no (r) local minima. Denote by θmin (ρ) the set of parameters that minimizes Equation 7 for a given value of ρ. Using a derivation similar to that in [11], the gradient of g(ρ, θ) can be shown to be ∂g(ρ, θ) (r) = H(θmin (ρ)) ∂ρ(r) (11) Since the partition function for G(r) can be calculated efficiently, so can the entropy. We can now summarize the algorithm for calculating h(θ) • Initialize ρ0 . Iterate: – For ρt , find θ (r) which solves the minimization in Equation 7. – Calculate the gradient of g(ρ, θ) at ρt using the expression in Equation 11 – Update ρt+1 = ρt + αv where v is a feasible search direction calculated from the gradient of g(ρ, θ) and the simplex constraints on ρ. The step size α is calculated via an Armijo line search. – Halt when the change in g(ρ, θ) is smaller than some threshold. Note that the minimization w.r.t. θ (r) is not very time consuming since we can initialize it with the minimum from the previous step, and thus only a few iterations are needed to find the new optimum, provided the change in ρ is not too big. The above algorithm is guaranteed to converge to a global optimum of ρ [2], and thus we obtain the tightest possible upper bound on Z(θ) given our planar graph decomposition. The procedure described here is asymmetric w.r.t. ρ and θ (r) . In a symmetric formulation the minimizing gradient steps could be carried out jointly or in an alternating sequence. The symmetric ˆ (r) formulation can be obtained by decoupling ρ and θ (r) in the bi-linear constraint ρ(r)θ = θ. Field Figure 2: Illustration of planar subgraph construction for a rectangular lattice with external field. Original graph is shown on the left. The field vertex is connected to all vertices (edges not shown). The graph on the right results from isolating the 4th ,5th columns of the original graph (shown in grey), and connecting the field vertex to the external vertices of the three disconnected components. Note that the resulting graph is planar. ˜ ˜ Specifically, we introduce θ (r) = θ (r) ρ(r) and perform the optimization w.r.t. ρ and θ (r) . It can be ˜(r) ) with the relevant (de-coupled) constraint is equivalent shown that a stationary point of f (θ, ρ, θ to the procedure described above. The advantage of this approach is that the exact minimization w.r.t θ (r) is not required before modifying ρ. Our experiments have shown, however, that the methods take comparable times to converge, although this may be a property of the implementation. 5 Estimating Marginals The optimization problem as defined above minimizes an upper bound on the partition function. However, it may also be of interest to obtain estimates of the marginals of the MRF over G. To obtain marginal estimates, we follow the approach in [11]. We first characterize the optimum of Equation 7 for a fixed value of ρ. Deriving the Lagrangian of Equation 7 w.r.t. θ (r) we obtain the (r) following characterization of θmin (ρ): Marginal Optimality Criterion: For any two graphs G(r) , G(s) such that the edge (ij) is in both (r) (s) graphs, the optimal parameter vector satisfies τij (θmin (ρ)) = τij (θmin (ρ)). Thus, the optimal set of parameters for the graphs G(r) is such that every two graphs agree on the marginals of all the edges they share. This implies that at the optimum, there is a well defined set of marginals over all the edges. We use this set as an approximation to the true marginals. A different method for estimating marginals uses the partition function bound directly. We first P calculate partition function bounds on the sums: αi (1) = x:xi =1 e ij∈E fij (xi ,xj ) and αi (−1) = P αi (1) e ij∈E fij (xi ,xj ) and then normalize αi (1)+αi (−1) to obtain an estimate for p(xi = 1). This method has the advantage of being more numerically stable (since it does not depend on derivatives of log Z). However, it needs to be calculated separately for each variable, so that it may be time consuming if one is interested in marginals for a large set of variables. x:xi =−1 6 Experimental Evaluation We study the application of our Planar Decomposition (PDC) P method to a binary MRF on a square P lattice with an external field. The MRF is given by p(x) ∝ e ij∈E θij xi xj + i∈V θi xi where V are the lattice vertices, and θi and θij are parameters. Note that this interaction does not satisfy the conditions for exact calculation of the partition function, even though the graph is planar. This problem is in fact NP hard [1]. However, it is possible to obtain the desired interaction form by introducing an additional variable xn+1 that is connected to all the original variables.P Denote the correspondP ij∈E θij xi xj + i∈V θi,n+1 xi xn+1 , where ing graph by Gf . Consider the distribution p(x, xn+1 ) ∝ e θi,n+1 = θi . It is easy to see that any property of p(x) (e.g., partition function, marginals) may be calculated from the corresponding property of p(x, xn+1 ). The advantage of the latter distribution is that it has the desired interaction form. We can thus apply PDC by choosing planar subgraphs of the non-planar graph Gf . 0.25 0.15 0.1 0.05 0.5 1 1.5 Interaction Strength 0.03 Singleton Marginal Error Z Bound Error Pairwise Marginals Error 0.08 PDC TRW 0.2 0.07 0.06 0.05 0.04 0.03 0.02 2 0.5 1 1.5 Interaction Strength 0.025 0.02 0.015 0.01 0.005 2 0.5 1 1.5 Interaction Strength 2 !3 x 10 0.025 0.02 0.015 0.5 1 Field Strength 1.5 2 Singleton Marginal Error Pairwise Marginals Error Z Bound Error 0.03 0.03 0.025 0.02 0.015 0.5 1 Field Strength 1.5 2 9 8 7 6 5 4 3 0.5 1 Field Strength 1.5 2 Figure 3: Comparison of the TRW and Planar Decomposition (PDC) algorithms on a 7×7 square lattice. TRW results shown in red squares, and PDC in blue circles. Left column shows the error in the log partition bound. Middle column is the mean error for pairwise marginals, and right column is the error for the singleton marginal of the variable at the lattice center. Results in upper row are for field parameters drawn from U[−0.05, 0.05] and various interaction parameters. Results in the lower row are for interaction parameters drawn from U [−0.5, 0.5] and various field parameters. Error bars are standard errors calculated from 40 random trials. There are clearly many ways to choose spanning planar subgraphs of Gf . Spanning subtrees are one option, and were used in [11]. Since our optimization is polynomial in the number of subgraphs, √ we preferred to use a number of subgraphs that is linear in n. The key idea in generating these planar subgraphs is to generate disconnected components of the lattice and connect xn+1 only to the external vertices of these components. Here we generate three disconnected components by isolating two neighboring columns (or rows) from the rest of the graph, resulting in three components. This is √ illustrated in Figure 2. To this set of 2 n graphs, we add the independent variables graph consisting only of edges from the field node to all the other nodes. We compared the performance of the PDC and TRW methods 3 4 on a 7 × 7 lattice . Since the exact partition function and marginals can be calculated for this case, we could compare both algorithms to the true values. The MRF parameters were set according to the two following scenarios: 1) Varying Interaction - The field parameters θi were drawn uniformly from U[−0.05, 0.05], and the interaction θij from U[−α, α] where α ∈ {0.2, 0.4, . . . , 2}. This is the setting tested in [11]. 2) Varying Field θi was drawn uniformly from U[−α, α], where α ∈ {0.2, 0.4, . . . , 2} and θij from U[−0.5, 0.5]. For each scenario, we calculated the following measures: 1) Normalized log partition error 1 1 alg − log Z true ). 2) Error in pairwise marginals |E| ij∈E |palg (xi = 1, xj = 1) − 49 (log Z ptrue (xi = 1, xj = 1)|. Pairwise marginals were calculated jointly using the marginal optimality criterion of Section 5. 3) Error in singleton marginals. We calculated the singleton marginals for the innermost node in the lattice (i.e., coordinate [3, 3]), which intuitively should be the most difficult for the planar based algorithm. This marginal was calculated using two partition functions, as explained in Section 5 5 . The same method was used for TRW. The reported error measure is |palg (xi = 1) − ptrue (xi = 1)|. Results were averaged over 40 random trials. Results for the two scenarios and different evaluation measures are given in Figure 3. It can be seen that the partition function bound for PDC is significantly better than TRW for almost all parameter settings, although the difference becomes smaller for large field values. Error for the PDC pairwise 3 TRW and PDC bounds were optimized over both the subgraph parameters and the mixture parameters ρ. In terms of running time, PDC optimization for a fixed value of ρ took about 30 seconds, which is still slower than the TRW message passing implementation. 5 Results using the marginal optimality criterion were worse for PDC, possibly due to its reduced numerical precision. 4 marginals are smaller than those of TRW for all parameter settings. For the singleton parameters, TRW slightly outperforms PDC. This is not surprising since the field is modeled by every spanning tree in the TRW decomposition, whereas in PDC not all the structures model a given field. 7 Discussion We have presented a method for using planar graphs as the basis for approximating non-planar graphs such as planar graphs with external fields. While the restriction to binary variables limits the applicability of our approach, it remains relevant in many important applications, such as coding theory and combinatorial optimization. Moreover, it is always possible to convert a non-binary graphical model to a binary one by introducing additional variables. The resulting graph will typically not be planar, even when the original graph over k−ary variables is. However, the planar decomposition method can then be applied to this non-planar graph. The optimization of the decomposition is carried out explicitly over the planar subgraphs, thus limiting the number of subgraphs that can be used in the approximation. In the TRW method this problem is circumvented since it is possible to implicitly optimize over all spanning trees. The reason this can be done for trees is that the entropy of an MRF over a tree may be written as a function of its marginal variables. We do not know of an equivalent result for planar graphs, and it remains a challenge to find one. It is however possible to combine the planar and tree decompositions into one single bound, which is guaranteed to outperform the tree or planar approximations alone. The planar decomposition idea may in principle be applied to bounding the value of the MAP assignment. However, as in TRW, it can be shown that the solution is not dependent on the decomposition (as long as each edge appears in some structure), and the problem is equivalent to maximizing a linear function over the marginal polytope (which can be done in polynomial time for planar graphs). However, such a decomposition may suggest new message passing algorithms, as in [10]. Acknowledgments The authors acknowledge support from the Defense Advanced Research Projects Agency (Transfer Learning program). Amir Globerson is also supported by the Rothschild Yad-Hanadiv fellowship. The authors also wish to thank Martin Wainwright for providing his TRW code. References [1] F. Barahona. On the computational complexity of ising spin glass models. J. Phys. A., 15(10):3241–3253, 1982. [2] D. P. Bertsekas, editor. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995. [3] M.M. Deza and M. Laurent. Geometry of Cuts and Metrics. Springe-Verlag, 1997. [4] R. Diestel. Graph Theory. Springer-Verlag, 1997. [5] M.E. Fisher. On the dimer solution of planar ising models. J. Math. Phys., 7:1776–1781, 1966. [6] M.I. Jordan, editor. Learning in graphical models. MIT press, Cambridge, MA, 1998. [7] P.W. Kasteleyn. Dimer statistics and phase transitions. Journal of Math. Physics, 4:287–293, 1963. [8] L. Lovasz and M.D. Plummer. Matching Theory, volume 29 of Annals of discrete mathematics. NorthHolland, New-York, 1986. [9] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. Tree-based reparameterization framework for analysis of sum-product and related algorithms. IEEE Trans. on Information Theory, 49(5):1120–1146, 2003. [10] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. Map estimation via agreement on trees: messagepassing and linear programming. IEEE Trans. on Information Theory, 51(11):1120–1146, 2005. [11] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. A new class of upper bounds on the log partition function. IEEE Trans. on Information Theory, 51(7):2313–2335, 2005. [12] M.J. Wainwright and M.I. Jordan. Graphical models, exponential families, and variational inference. Technical report, UC Berkeley Dept. of Statistics, 2003. [13] J.S. Yedidia, W.T. W.T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. on Information Theory, 51(7):2282–2312, 2005.
4 0.75847399 54 nips-2006-Comparative Gene Prediction using Conditional Random Fields
Author: Jade P. Vinson, David Decaprio, Matthew D. Pearson, Stacey Luoma, James E. Galagan
Abstract: Computational gene prediction using generative models has reached a plateau, with several groups converging to a generalized hidden Markov model (GHMM) incorporating phylogenetic models of nucleotide sequence evolution. Further improvements in gene calling accuracy are likely to come through new methods that incorporate additional data, both comparative and species specific. Conditional Random Fields (CRFs), which directly model the conditional probability P (y|x) of a vector of hidden states conditioned on a set of observations, provide a unified framework for combining probabilistic and non-probabilistic information and have been shown to outperform HMMs on sequence labeling tasks in natural language processing. We describe the use of CRFs for comparative gene prediction. We implement a model that encapsulates both a phylogenetic-GHMM (our baseline comparative model) and additional non-probabilistic features. We tested our model on the genome sequence of the fungal human pathogen Cryptococcus neoformans. Our baseline comparative model displays accuracy comparable to the the best available gene prediction tool for this organism. Moreover, we show that discriminative training and the incorporation of non-probabilistic evidence significantly improve performance. Our software implementation, Conrad, is freely available with an open source license at http://www.broad.mit.edu/annotation/conrad/. 1
5 0.75045907 1 nips-2006-A Bayesian Approach to Diffusion Models of Decision-Making and Response Time
Author: Michael D. Lee, Ian G. Fuss, Daniel J. Navarro
Abstract: We present a computational Bayesian approach for Wiener diffusion models, which are prominent accounts of response time distributions in decision-making. We first develop a general closed-form analytic approximation to the response time distributions for one-dimensional diffusion processes, and derive the required Wiener diffusion as a special case. We use this result to undertake Bayesian modeling of benchmark data, using posterior sampling to draw inferences about the interesting psychological parameters. With the aid of the benchmark data, we show the Bayesian account has several advantages, including dealing naturally with the parameter variation needed to account for some key features of the data, and providing quantitative measures to guide decisions about model construction. 1
6 0.74877053 163 nips-2006-Prediction on a Graph with a Perceptron
7 0.74816567 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition
8 0.74499196 65 nips-2006-Denoising and Dimension Reduction in Feature Space
9 0.74363923 101 nips-2006-Isotonic Conditional Random Fields and Local Sentiment Flow
10 0.74264568 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
11 0.74222231 108 nips-2006-Large Scale Hidden Semi-Markov SVMs
12 0.74049348 79 nips-2006-Fast Iterative Kernel PCA
13 0.73918408 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy
14 0.73867708 20 nips-2006-Active learning for misspecified generalized linear models
15 0.73503584 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
16 0.73412967 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections
17 0.73226517 61 nips-2006-Convex Repeated Games and Fenchel Duality
18 0.73123896 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
19 0.72986287 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
20 0.72947246 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation