nips nips2005 nips2005-42 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Andreas Argyriou, Mark Herbster, Massimiliano Pontil
Abstract: A foundational problem in semi-supervised learning is the construction of a graph underlying the data. We propose to use a method which optimally combines a number of differently constructed graphs. For each of these graphs we associate a basic graph kernel. We then compute an optimal combined kernel. This kernel solves an extended regularization problem which requires a joint minimization over both the data and the set of graph kernels. We present encouraging results on different OCR tasks where the optimal combined kernel is computed from graphs constructed with a variety of distance functions and the ‘k’ in k-nearest neighbors.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract A foundational problem in semi-supervised learning is the construction of a graph underlying the data. [sent-7, score-0.396]
2 For each of these graphs we associate a basic graph kernel. [sent-9, score-0.372]
3 This kernel solves an extended regularization problem which requires a joint minimization over both the data and the set of graph kernels. [sent-11, score-0.713]
4 We present encouraging results on different OCR tasks where the optimal combined kernel is computed from graphs constructed with a variety of distance functions and the ‘k’ in k-nearest neighbors. [sent-12, score-0.484]
5 1 Introduction Semi-supervised learning has received significant attention in machine learning in recent years, see, for example, [2, 3, 4, 8, 9, 16, 17, 18] and references therein. [sent-13, score-0.094]
6 The defining insight of semi-supervised methods is that unlabeled data may be used to improve the performance of learners in a supervised task. [sent-14, score-0.102]
7 Graph construction consists of two stages, first selection of a distance function and then application of it to determine the graph’s edges (or weights thereof). [sent-16, score-0.268]
8 For example, in this paper we consider distances between images based on the Euclidean distance, Euclidean distance combined with image transformations, and the related tangent distance [6]; we determine the edge set of the graph with k-nearest neighbors. [sent-17, score-1.032]
9 Another common choice is to weight edges by a decreasing function of the distance d, such as e^{−βd²}. [sent-18, score-0.196]
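To make the construction just described concrete, here is a minimal sketch of building a k-nearest-neighbor graph from a pairwise distance matrix, with an optional e^{−βd²} edge weighting. The function name, the symmetrization rule and the NumPy usage are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def knn_graph(D, k, beta=None):
    """Build a symmetric adjacency matrix from a pairwise distance matrix D.

    D    : (m, m) array of pairwise distances, with D[i, i] == 0
    k    : number of nearest neighbors used to define the edge set
    beta : if given, weight each edge (i, j) by exp(-beta * D[i, j]**2);
           otherwise use 0/1 edge weights
    """
    m = D.shape[0]
    A = np.zeros((m, m))
    for i in range(m):
        nn = np.argsort(D[i])[1:k + 1]          # k nearest neighbors of vertex i
        A[i, nn] = 1.0 if beta is None else np.exp(-beta * D[i, nn] ** 2)
    return np.maximum(A, A.T)                   # symmetrize (an assumed convention)
```

Applied with, say, the Euclidean distance matrix of the images and k = 1, ..., 10, this would yield one graph per (distance, k) pair, in the spirit of the experiments of Section 4.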
10 Although a surplus of unlabeled data may improve the quality of the empirical approximation of the manifold (via the graph), leading to improved performance, practical experience with these methods indicates that their performance depends significantly on how the graph is constructed. [sent-19, score-0.458]
11 Hence, the model selection problem must consider both the selection of the distance function and the parameters k or β used in the graph building process described above. [sent-20, score-0.508]
12 A diversity of methods has been proposed for graph construction; in this paper we do not advocate selecting a single graph but rather propose combining a number of graphs. [sent-21, score-0.599]
13 Our solution implements a method based on regularization which builds upon the work in [1]. [sent-22, score-0.235]
14 For a given dataset, each combination of a distance function and an edge set specification derived from that distance leads to a specific graph. [sent-23, score-0.501]
15 Each of these graphs may then be associated with a kernel. [sent-24, score-0.09]
16 We then apply regularization to select the best convex combination of these kernels; the minimizing function will trade off its fit to the data against its norm. [sent-25, score-0.509]
17 What is unique about this regularization is that the minimization is not over a single kernel space but rather over a space corresponding to all convex combinations of kernels. [sent-26, score-0.632]
18 Thus all data (labeled vertices) may be retained for training rather than set aside for cross-validation, which is not an appealing option when the number of labeled vertices per class is very small. [sent-27, score-0.307]
19 There, three different distances for 400 images of the digits ‘six’ and ‘nine’ are depicted, namely, the Euclidean distance, a distance invariant under small centered image rotations from [−10°, 10°] and a distance invariant under rotations from [−180°, 180°]. [sent-29, score-0.777]
20 Clearly, the last distance is problematic as sixes become similar to nines. [sent-30, score-0.196]
21 The performance of our graph regularization learning algorithm discussed in Section 2.2 with these distances is reported below each plot. [sent-31, score-0.534]
22 As expected, this performance is much lower in the case that the third distance is used. [sent-32, score-0.28]
23 In Section 2 we discuss how regularization may be applied to single graphs. [sent-34, score-0.205]
24 First, we review regularization in the context of reproducing kernel Hilbert spaces (Section 2.1). [sent-35, score-0.53]
25 Here we review the (normalized) Laplacian of the graph and a kernel which is the pseudoinverse of the graph Laplacian. [sent-38, score-0.763]
26 In Section 3 we detail our algorithm for learning an optimal convex combination of Laplacian kernels. [sent-39, score-0.386]
27 2 Background on graph regularization In this section we review graph regularization [2, 9, 14] from the perspective of reproducing kernel Hilbert spaces. [sent-41, score-1.249]
28 2.1 Reproducing kernel Hilbert spaces Let X be a set and K : X × X → IR a kernel function. [sent-45, score-0.348]
29 We say that HK is a reproducing kernel Hilbert space (RKHS) of functions f : X → IR if (i) for every x ∈ X, K(x, ·) ∈ HK and (ii) the reproducing kernel property f(x) = ⟨f, K(x, ·)⟩_K holds for every f ∈ HK and x ∈ X, where ⟨·, ·⟩_K is the inner product on HK. [sent-46, score-0.55]
30 Regularization in an RKHS learns a function f ∈ HK on the basis of available input/output examples {(xi, yi) : i ∈ INℓ} by solving the variational problem Eγ(K) := min{ Σ_{i=1}^{ℓ} V(yi, f(xi)) + γ‖f‖²_K : f ∈ HK }  (2.1) [sent-51, score-0.175]
31 where V : IR × IR → [0, ∞) is a loss function and γ a positive parameter. [sent-52, score-0.077]
32 If f solves problem (2.1) then it has the form f(x) = Σ_{i=1}^{ℓ} ci K(xi, x), x ∈ X  (2.2). [sent-54, score-0.093]
33 However, in many practical situations it is more convenient to compute c by solving the dual problem to (2.1), namely [sent-59, score-0.088]
34 −Eγ(K) := min{ (1/(4γ)) c⊤Kc + Σ_{i=1}^{ℓ} V*(yi, ci) : c ∈ IRℓ }  (2.3) [sent-60, score-0.138]
35 where K = (K(xi, xj))_{i,j=1}^{ℓ} and the function V* : IR × IR → IR ∪ {+∞} is the conjugate of the loss function V, defined for every z, α ∈ IR as V*(z, α) := sup{λα − V(z, λ) : λ ∈ IR}; see, for example, [1] for a discussion. [sent-61, score-0.077]
36 The choice of the loss function V leads to different learning methods, among which the most prominent are square loss regularization and support vector machines; see, for example, [15]. [sent-62, score-0.461]
37 2.2 Graph regularization Let G be an undirected graph with m vertices and an m × m adjacency matrix A such that Aij = 1 if there is an edge connecting vertices i and j and zero otherwise. [sent-64, score-0.741]
38 The graph Laplacian L is the m × m matrix defined as L := D − A, where D = diag(di : i ∈ INm) and di is the degree of vertex i, that is, di = Σ_{j=1}^{m} Aij. [sent-65, score-0.358]
39 We identify the linear space of real-valued functions defined on the graph with IRm and introduce on it the semi-inner product ⟨u, v⟩ := u⊤Lv, u, v ∈ IRm. [sent-66, score-0.282]
40 We let {σi, ui}_{i=1}^{m} be a system of eigenvalues/eigenvectors of L, where the eigenvalues are non-decreasing in order and σi = 0 for i ∈ INr, and define the linear subspace H(G) of IRm which is orthogonal to the eigenvectors with zero eigenvalue, that is, H(G) := {v : v⊤ui = 0, i ∈ INr}. [sent-72, score-0.099]
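As a small illustration of these definitions (function names and the numerical tolerance for detecting zero eigenvalues are assumptions):

```python
import numpy as np

def graph_laplacian(A):
    """L = D - A, where D is the diagonal matrix of vertex degrees."""
    return np.diag(A.sum(axis=1)) - A

def subspace_projector(A, tol=1e-10):
    """Orthogonal projector onto H(G): the span of eigenvectors of L
    with nonzero eigenvalue (i.e. orthogonal to the zero-eigenvalue vectors)."""
    L = graph_laplacian(A)
    eigvals, U = np.linalg.eigh(L)          # eigenvalues in non-decreasing order
    nonzero = U[:, eigvals > tol]
    return nonzero @ nonzero.T
```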
41 Within this framework, we wish to learn a function v ∈ H(G) on the basis of a set of labeled vertices. [sent-73, score-0.149]
42 Without loss of generality, we assume that the first ℓ ≤ m vertices are labeled and let y1, . . . , yℓ denote the corresponding labels. [sent-74, score-0.353]
43 Following [2] we prescribe a loss function V and compute the function v by solving the optimization problem min{ Σ_{i=1}^{ℓ} V(yi, vi) + γ‖v‖² : v ∈ H(G) }  (2.4). [sent-78, score-0.297]
44 We note that a similar approach is presented in [17] where v is (essentially) obtained as the minimal norm interpolant in H(G) to the labeled vertices. [sent-80, score-0.187]
45 Problem (2.4) balances the error on the labeled points with a smoothness term measuring the complexity of v on the graph. [sent-82, score-0.211]
46 Note that this last term contains the information of both the labeled and unlabeled vertices via the graph Laplacian. [sent-83, score-0.66]
47 Moreover, the pseudoinverse of the Laplacian, L+, is the reproducing kernel of H(G); see, for example, [7] for a proof. [sent-89, score-0.325]
48 This means that for every v ∈ H(G) and i ∈ INm there holds the reproducing kernel property vi = ⟨L+_i, v⟩, where L+_i is the i-th column of L+. [sent-90, score-0.362]
49 Hence, by setting X ≡ INm, f(i) = vi and K(i, j) = L+_{ij}, i, j ∈ INm, we see that HK ≡ H(G). [sent-91, score-0.118]
50 In particular, for square loss regularization [2] and minimal norm interpolation [17] this requires solving a linear system of m and m − ℓ equations, respectively. [sent-96, score-0.433]
51 The coefficients ci are obtained by solving problem (2.3). [sent-100, score-0.181]
52 For example, for square loss regularization the computation of the parameter vector c = (ci : i ∈ INℓ) involves solving a linear system of ℓ equations, namely (K + γI)c = y  (2.5). [sent-102, score-0.395]
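Putting the pieces of this section together, here is a hedged sketch of single-graph square-loss regularization: the kernel L+ and the linear system (K + γI)c = y follow the text above, while the function name, the use of NumPy's pseudoinverse and the prediction step vi = Σj cj K(i, xj) are illustrative.

```python
import numpy as np

def graph_square_loss_regularization(A, y, labeled, gamma):
    """Square-loss regularization on a single graph.

    A       : (m, m) adjacency matrix
    y       : (ell,) labels for the labeled vertices
    labeled : indices of the labeled vertices (length ell)
    gamma   : regularization parameter
    Returns the predicted values v_i for all m vertices.
    """
    L = np.diag(A.sum(axis=1)) - A
    K_full = np.linalg.pinv(L)                           # kernel K = L^+
    K = K_full[np.ix_(labeled, labeled)]                 # kernel matrix on labeled vertices
    c = np.linalg.solve(K + gamma * np.eye(len(y)), y)   # (K + gamma*I) c = y
    return K_full[:, labeled] @ c                        # v_i = sum_j c_j K(i, x_j)
```

With, for example, gamma = 1e-5 (the value reported in Section 4) and a graph built with knn_graph above, the sign of the returned vector gives a predicted class for each unlabeled vertex.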
53 3 Learning a convex combination of Laplacian kernels We now describe our framework for learning with multiple graph Laplacians. [sent-104, score-0.893]
54 We assume that we are given n graphs G(q), q ∈ INn, all having m vertices, with corresponding Laplacians L(q), kernels K(q) = (L(q))+, Hilbert spaces H(q) := H(G(q)) and norms ‖v‖²_{(q)} := v⊤ L(q) v, v ∈ H(q). [sent-105, score-0.4]
55 We propose to learn an optimal convex combination of these graph kernels, that is, we solve the optimization problem ρ = min{ Σ_{i=1}^{ℓ} V(yi, vi) + γ‖v‖²_{K(λ)} : λ ∈ Λ, v ∈ HK(λ) }  (3.1) [sent-106, score-0.783]
56 where we have defined the set Λ := {λ ∈ IRn : λq ≥ 0, Σ_{q=1}^{n} λq = 1} and, for each λ ∈ Λ, the kernel K(λ) := Σ_{q=1}^{n} λq K(q). [sent-107, score-0.149]
57 Hence an optimal convex combination of kernels has a smaller right hand side than that of any individual kernel, motivating the expectation of improved performance. [sent-109, score-0.599]
58 This is an instance of the more general problem of minimizing (2.1) over v ∈ HK and K ∈ co(K), the convex hull of kernels in a prescribed set K. [sent-113, score-0.484]
59 When solving (3.1) it is important to require that the kernels K(q) satisfy a normalization condition, such as that they all have the same trace or the same Frobenius norm; see [10] for a discussion. [sent-117, score-0.26]
60 Figure 1: Algorithm to compute an optimal convex combination of kernels in the set co{K(q) : q ∈ INn}. [sent-125, score-0.599]
61 Using the dual problem (2.3), we can rewrite this problem as −ρ = max{ min{ (1/(4γ)) c⊤K(λ)c + Σ_{i=1}^{ℓ} V*(yi, ci) : c ∈ IRℓ } : λ ∈ Λ }  (3.2). [sent-128, score-0.168]
62 Equation (3.2) expresses the optimal convex combination of the kernels as the solution to a saddle point problem. [sent-131, score-0.64]
63 In particular, for square loss regularization this requires solving the equation (2.5). [sent-139, score-0.395]
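The algorithm of Figure 1 is not reproduced in the text, so the following is only a plausible sketch of tackling the saddle-point problem (3.2) under square loss: for fixed λ the coefficients are taken from (K(λ) + γI)c = y as above, and λ is then moved toward the simplex vertex with the largest value of c⊤K(q)c, a conditional-gradient-style update chosen here for illustration. The starting point (the average of the n kernels) and the iteration cap T = 100 match the settings reported in Section 4; the step size and everything else are assumptions.

```python
import numpy as np

def combine_graph_kernels(K_list, y, gamma=1e-5, T=100, step=0.1):
    """Illustrative sketch of learning a convex combination of graph kernels.

    K_list : list of n (ell, ell) kernel matrices restricted to the labeled vertices
    y      : (ell,) labels of the labeled vertices
    Returns the mixing weights lambda over the n kernels.
    """
    n, ell = len(K_list), len(y)
    lam = np.ones(n) / n                                  # start from the average kernel
    for _ in range(T):
        K = sum(l * Kq for l, Kq in zip(lam, K_list))
        c = np.linalg.solve(K + gamma * np.eye(ell), y)   # square-loss coefficients
        scores = np.array([c @ Kq @ c for Kq in K_list])  # per-kernel quadratic terms
        vertex = np.zeros(n)
        vertex[np.argmax(scores)] = 1.0                   # best simplex vertex
        lam = (1.0 - step) * lam + step * vertex          # conditional-gradient-style step
    return lam
```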
64 4 Experiments In this section we present our experiments on optical character recognition. [sent-141, score-0.105]
65 First, the optimal convex combination of kernels computed by our algorithm is competitive with the best base kernels. [sent-143, score-0.632]
66 Second, by observing the ‘weights’ of the convex combination we can distinguish the strong from the weak candidate kernels. [sent-144, score-0.304]
67 We used the USPS dataset of 16×16 images of handwritten digits, with pixel values ranging between −1 and 1. [sent-146, score-0.158]
68 We present the results for 5 pairwise classification tasks of varying difficulty and for odd vs. even classification. [sent-147, score-0.213]
69 For pairwise classification, the training set consisted of the first 200 images for each digit in the USPS training set and the number of labeled points was chosen to be 4, 8 or 12 (with equal numbers for each digit). [sent-149, score-0.464]
70 For odd vs. even digit classification, the training set consisted of the first 80 images per digit in the USPS training set and the number of labeled points was 10, 20 or 30, with equal numbers for each digit. [sent-151, score-0.541]
71 Performance was averaged over 30 random selections, each with the same number of labeled points. [sent-152, score-0.149]
72 In each experiment, we constructed n = 30 graphs G(q) (q ∈ INn ) by combining k-nearest neighbors (k ∈ IN10 ) with three different distances. [sent-153, score-0.161]
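A sketch of how such a family of base kernels could be assembled from the three precomputed distance matrices; the function and variable names, and the symmetrization of the k-NN edges, are assumptions for illustration.

```python
import numpy as np

def base_laplacian_kernels(distance_matrices, k_values=range(1, 11)):
    """One kernel K^(q) = (L^(q))^+ per (distance, k) pair: 3 x 10 = 30 kernels here."""
    kernels = []
    for D in distance_matrices:                 # e.g. Euclidean, transformation, tangent
        order = np.argsort(D, axis=1)           # neighbors sorted by increasing distance
        m = D.shape[0]
        for k in k_values:
            A = np.zeros((m, m))
            rows = np.repeat(np.arange(m), k)
            A[rows, order[:, 1:k + 1].ravel()] = 1.0   # k-NN edges for each vertex
            A = np.maximum(A, A.T)                     # symmetrize
            L = np.diag(A.sum(axis=1)) - A
            kernels.append(np.linalg.pinv(L))
    return kernels
```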
73 Since kernels obtained from different types of graphs can vary widely, it was necessary to renormalize them. [sent-156, score-0.35]
74 Hence, we chose to normalize each kernel during the training process by the Frobenius norm of its submatrix corresponding to the labeled data. [sent-157, score-0.149]
75 (Footnote 2: the USPS dataset is available at: http://www-stat-class.)
76 Table 1: Misclassification error percentage (top) and standard deviation (bottom) for the best convex combination of kernels on different handwritten digit recognition tasks, using different distances. [sent-312, score-0.766]
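A one-function sketch of the Frobenius-norm normalization described above (the name is illustrative):

```python
import numpy as np

def normalize_by_labeled_frobenius(K, labeled):
    """Scale a kernel by the Frobenius norm of its submatrix on the labeled data."""
    sub = K[np.ix_(labeled, labeled)]
    return K / np.linalg.norm(sub, 'fro')
```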
77 The regularization parameter was set to 10−5 in all algorithms. [sent-316, score-0.205]
78 For the convex minimization, we always used the average of the n kernels as the starting kernel in the algorithm of Figure 1 and set the maximum number of iterations to T = 100. [sent-317, score-0.604]
79 Table 1 shows the results obtained using the three distances, each combined with k-NN (k ∈ IN10). [sent-318, score-0.128]
80 The first distance is the Euclidean distance between images. [sent-319, score-0.392]
81 The second method is transformation, where the distance between two images is given by the smallest Euclidean distance between any pair of transformed images as determined by applying a number of affine transformations and a thickness transformation; see [6] for more information. [sent-320, score-0.566]
82 The third distance is tangent distance, as described in [6], which is a first-order approximation to the above transformations. [sent-321, score-0.319]
83 For the first three columns in the table the Euclidean distance was used; for columns 4–6 the image transformation distance was used; for columns 7–9 the tangent distance was used. [sent-322, score-0.98]
84 We also noted that within each of the methods the performance of the convex combination is comparable to that of the best kernels. [sent-325, score-0.304]
85 Figure 2 reports the weight of each individual kernel learned by our algorithm when 2% of the points are labeled in the pairwise tasks and 20 labels are used for odd vs. even classification. [sent-326, score-0.479]
86 In the 6 vs. 7 task, the large weights are associated with the graphs/kernels built with the tangent distance. [sent-329, score-0.158]
87 The effectiveness of our algorithm in selecting the good graphs/kernels is better demonstrated in Figure 3, where the Euclidean and the transformation kernels are combined with a “low-quality” kernel. [sent-330, score-0.41]
88 The transformation distance was approximated using Matlab's constrained minimization function. [sent-332, score-0.243]
89 The figure shows the distance matrices on the set of labeled and unlabeled data for the Euclidean, transformation and “low-quality” distances, respectively. [sent-333, score-0.553]
90 The best error among 15 different values of k within each method, the error of the learned convex combination and the total learned weights for each method are shown below each plot. [sent-334, score-0.467]
91 It is clear that the solution of the algorithm is dominated by the good kernels and is not influenced by the ones with low performance. [sent-335, score-0.26]
92 As a result, the error of the convex combination is comparable to that of the Euclidean and transformation methods. [sent-336, score-0.443]
93 The final experiment (see Figure 4) demonstrates that unlabeled data improves the performance of our method. [sent-337, score-0.102]
94 [Figure 3 panels: Euclidean, Transformation and Low-quality distance matrices; the axis tick values and most per-panel error values are not recoverable here (one legible value reads error = 0.26%).] [sent-384, score-0.229]
95 Figure 3: Similarity matrices and corresponding learned coefficients of the convex combination for the 6 vs. 7 task. [sent-391, score-0.335]
96 5 Conclusion We have presented a method for computing an optimal kernel within the framework of regularization over graphs. [sent-394, score-0.389]
97 When tested on optical character recognition tasks, the method exhibits competitive performance and is able to select good graph structures. [sent-396, score-0.42]
98 In particular, we may consider a continuous family of graphs each corresponding to a different weight matrix and study graph kernel combinations over this class. [sent-398, score-0.557]
99 The number of labeled points is 10 on the left and 20 on the right. [sent-424, score-0.149]
100 Diffusion kernels on graphs and other discrete input spaces. [sent-486, score-0.35]
wordName wordTfidf (topN-words)
[('ir', 0.301), ('graph', 0.282), ('kernels', 0.26), ('euclidean', 0.222), ('regularization', 0.205), ('inm', 0.203), ('distance', 0.196), ('convex', 0.195), ('inn', 0.177), ('hk', 0.158), ('kernel', 0.149), ('labeled', 0.149), ('irm', 0.136), ('odd', 0.129), ('vertices', 0.127), ('reproducing', 0.126), ('tangent', 0.123), ('digit', 0.115), ('laplacian', 0.114), ('combination', 0.109), ('transformation', 0.106), ('inp', 0.102), ('unlabeled', 0.102), ('hilbert', 0.097), ('ci', 0.093), ('graphs', 0.09), ('vi', 0.087), ('usps', 0.086), ('distances', 0.084), ('loss', 0.077), ('laplacians', 0.075), ('images', 0.07), ('argyriou', 0.068), ('herbster', 0.068), ('inr', 0.068), ('co', 0.067), ('character', 0.062), ('solving', 0.058), ('aij', 0.056), ('square', 0.055), ('handwritten', 0.054), ('spaces', 0.05), ('micchelli', 0.05), ('pseudoinverse', 0.05), ('th', 0.049), ('minimization', 0.047), ('pontil', 0.047), ('learning', 0.047), ('tasks', 0.046), ('rotations', 0.045), ('min', 0.045), ('manifold', 0.044), ('combined', 0.044), ('optical', 0.043), ('rkhs', 0.043), ('labels', 0.043), ('columns', 0.042), ('yi', 0.042), ('pk', 0.041), ('saddle', 0.041), ('frobenius', 0.041), ('eigenvectors', 0.039), ('belkin', 0.038), ('norm', 0.038), ('di', 0.038), ('pairwise', 0.038), ('construction', 0.037), ('zhu', 0.037), ('misclassi', 0.037), ('image', 0.037), ('constructed', 0.036), ('combinations', 0.036), ('combining', 0.035), ('optimal', 0.035), ('invariant', 0.035), ('weights', 0.035), ('transformations', 0.034), ('colt', 0.034), ('digits', 0.034), ('competitive', 0.033), ('error', 0.033), ('semide', 0.032), ('ij', 0.031), ('learned', 0.031), ('cients', 0.031), ('text', 0.031), ('training', 0.031), ('consisted', 0.03), ('ui', 0.03), ('builds', 0.03), ('smola', 0.03), ('smoothing', 0.03), ('experience', 0.03), ('problem', 0.03), ('balances', 0.029), ('irn', 0.029), ('kondor', 0.029), ('lesser', 0.029), ('matveeva', 0.029), ('prescribed', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning
Author: Andreas Argyriou, Mark Herbster, Massimiliano Pontil
Abstract: A foundational problem in semi-supervised learning is the construction of a graph underlying the data. We propose to use a method which optimally combines a number of differently constructed graphs. For each of these graphs we associate a basic graph kernel. We then compute an optimal combined kernel. This kernel solves an extended regularization problem which requires a joint minimization over both the data and the set of graph kernels. We present encouraging results on different OCR tasks where the optimal combined kernel is computed from graphs constructed with a variety of distance functions and the ‘k’ in k-nearest neighbors.
2 0.23455104 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
Author: Y. Altun, D. McAllester, M. Belkin
Abstract: Many real-world classification problems involve the prediction of multiple inter-dependent variables forming some structural dependency. Recent progress in machine learning has mainly focused on supervised classification of such structured variables. In this paper, we investigate structured classification in a semi-supervised setting. We present a discriminative approach that utilizes the intrinsic geometry of input patterns revealed by unlabeled data points and we derive a maximum-margin formulation of semi-supervised learning for structured variables. Unlike transductive algorithms, our formulation naturally extends to new test points. 1
3 0.23417634 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
Author: Tong Zhang, Rie Kubota Ando
Abstract: We consider a framework for semi-supervised learning using spectral decomposition based un-supervised kernel design. This approach subsumes a class of previously proposed semi-supervised learning methods on data graphs. We examine various theoretical properties of such methods. In particular, we derive a generalization performance bound, and obtain the optimal kernel design by minimizing the bound. Based on the theoretical analysis, we are able to demonstrate why spectral kernel design based methods can often improve the predictive performance. Experiments are used to illustrate the main consequences of our analysis.
4 0.23401718 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
Author: Ashish Kapoor, Hyungil Ahn, Yuan Qi, Rosalind W. Picard
Abstract: There have been many graph-based approaches for semi-supervised classification. One problem is that of hyperparameter learning: performance depends greatly on the hyperparameters of the similarity graph, transformation of the graph Laplacian and the noise model. We present a Bayesian framework for learning hyperparameters for graph-based semisupervised classification. Given some labeled data, which can contain inaccurate labels, we pose the semi-supervised classification as an inference problem over the unknown labels. Expectation Propagation is used for approximate inference and the mean of the posterior is used for classification. The hyperparameters are learned using EM for evidence maximization. We also show that the posterior mean can be written in terms of the kernel matrix, providing a Bayesian classifier to classify new points. Tests on synthetic and real datasets show cases where there are significant improvements in performance over the existing approaches. 1
5 0.19544083 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
Author: Kilian Q. Weinberger, John Blitzer, Lawrence K. Saul
Abstract: We show how to learn a Mahanalobis distance metric for k-nearest neighbor (kNN) classification by semidefinite programming. The metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. On seven data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification—for example, achieving a test error rate of 1.3% on the MNIST handwritten digits. As in support vector machines (SVMs), the learning problem reduces to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our framework requires no modification or extension for problems in multiway (as opposed to binary) classification. 1
6 0.16379881 105 nips-2005-Large-Scale Multiclass Transduction
7 0.15478024 178 nips-2005-Soft Clustering on Graphs
8 0.15057719 126 nips-2005-Metric Learning by Collapsing Classes
9 0.13923551 50 nips-2005-Convex Neural Networks
10 0.12801148 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm
11 0.12526633 47 nips-2005-Consistency of one-class SVM and related algorithms
12 0.12524198 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators
13 0.11976518 175 nips-2005-Sequence and Tree Kernels with Statistical Feature Mining
14 0.11811413 114 nips-2005-Learning Rankings via Convex Hull Separation
15 0.1173673 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging
16 0.1109932 199 nips-2005-Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions
17 0.10598096 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines
18 0.10507421 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models
19 0.10335619 138 nips-2005-Non-Local Manifold Parzen Windows
20 0.098105162 75 nips-2005-Fixing two weaknesses of the Spectral Method
topicId topicWeight
[(0, 0.321), (1, 0.204), (2, -0.149), (3, -0.141), (4, -0.11), (5, 0.113), (6, 0.157), (7, -0.025), (8, 0.105), (9, -0.036), (10, 0.09), (11, -0.099), (12, -0.19), (13, -0.01), (14, -0.085), (15, -0.245), (16, 0.059), (17, -0.011), (18, -0.018), (19, 0.093), (20, -0.045), (21, 0.005), (22, -0.03), (23, -0.019), (24, 0.062), (25, -0.029), (26, 0.031), (27, -0.041), (28, -0.126), (29, -0.072), (30, 0.022), (31, -0.013), (32, -0.044), (33, 0.03), (34, 0.07), (35, 0.089), (36, -0.072), (37, 0.053), (38, -0.019), (39, -0.007), (40, -0.075), (41, 0.025), (42, 0.087), (43, 0.057), (44, -0.012), (45, -0.045), (46, -0.09), (47, 0.002), (48, 0.015), (49, -0.049)]
simIndex simValue paperId paperTitle
same-paper 1 0.97389805 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning
Author: Andreas Argyriou, Mark Herbster, Massimiliano Pontil
Abstract: A foundational problem in semi-supervised learning is the construction of a graph underlying the data. We propose to use a method which optimally combines a number of differently constructed graphs. For each of these graphs we associate a basic graph kernel. We then compute an optimal combined kernel. This kernel solves an extended regularization problem which requires a joint minimization over both the data and the set of graph kernels. We present encouraging results on different OCR tasks where the optimal combined kernel is computed from graphs constructed with a variety of distance functions and the ‘k’ in k-nearest neighbors.
2 0.79151893 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
Author: Y. Altun, D. McAllester, M. Belkin
Abstract: Many real-world classification problems involve the prediction of multiple inter-dependent variables forming some structural dependency. Recent progress in machine learning has mainly focused on supervised classification of such structured variables. In this paper, we investigate structured classification in a semi-supervised setting. We present a discriminative approach that utilizes the intrinsic geometry of input patterns revealed by unlabeled data points and we derive a maximum-margin formulation of semi-supervised learning for structured variables. Unlike transductive algorithms, our formulation naturally extends to new test points. 1
3 0.71322095 105 nips-2005-Large-Scale Multiclass Transduction
Author: Thomas Gärtner, Quoc V. Le, Simon Burton, Alex J. Smola, Vishy Vishwanathan
Abstract: We present a method for performing transductive inference on very large datasets. Our algorithm is based on multiclass Gaussian processes and is effective whenever the multiplication of the kernel matrix or its inverse with a vector can be computed sufficiently fast. This holds, for instance, for certain graph and string kernels. Transduction is achieved by variational inference over the unlabeled data subject to a balancing constraint. 1
4 0.70503819 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
Author: Ashish Kapoor, Hyungil Ahn, Yuan Qi, Rosalind W. Picard
Abstract: There have been many graph-based approaches for semi-supervised classification. One problem is that of hyperparameter learning: performance depends greatly on the hyperparameters of the similarity graph, transformation of the graph Laplacian and the noise model. We present a Bayesian framework for learning hyperparameters for graph-based semisupervised classification. Given some labeled data, which can contain inaccurate labels, we pose the semi-supervised classification as an inference problem over the unknown labels. Expectation Propagation is used for approximate inference and the mean of the posterior is used for classification. The hyperparameters are learned using EM for evidence maximization. We also show that the posterior mean can be written in terms of the kernel matrix, providing a Bayesian classifier to classify new points. Tests on synthetic and real datasets show cases where there are significant improvements in performance over the existing approaches. 1
5 0.70500475 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
Author: Tong Zhang, Rie Kubota Ando
Abstract: We consider a framework for semi-supervised learning using spectral decomposition based un-supervised kernel design. This approach subsumes a class of previously proposed semi-supervised learning methods on data graphs. We examine various theoretical properties of such methods. In particular, we derive a generalization performance bound, and obtain the optimal kernel design by minimizing the bound. Based on the theoretical analysis, we are able to demonstrate why spectral kernel design based methods can often improve the predictive performance. Experiments are used to illustrate the main consequences of our analysis.
6 0.64142305 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm
7 0.59680748 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
8 0.56859267 126 nips-2005-Metric Learning by Collapsing Classes
9 0.52063876 114 nips-2005-Learning Rankings via Convex Hull Separation
10 0.48898911 44 nips-2005-Computing the Solution Path for the Regularized Support Vector Regression
11 0.48580769 107 nips-2005-Large scale networks fingerprinting and visualization using the k-core decomposition
12 0.4776006 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines
13 0.46580634 75 nips-2005-Fixing two weaknesses of the Spectral Method
14 0.45632693 175 nips-2005-Sequence and Tree Kernels with Statistical Feature Mining
15 0.45607811 116 nips-2005-Learning Topology with the Generative Gaussian Graph and the EM Algorithm
16 0.41662639 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging
17 0.41502389 178 nips-2005-Soft Clustering on Graphs
18 0.41354191 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning
19 0.41096094 71 nips-2005-Fast Krylov Methods for N-Body Learning
20 0.40786904 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators
topicId topicWeight
[(3, 0.04), (10, 0.035), (31, 0.018), (34, 0.573), (41, 0.012), (50, 0.012), (55, 0.023), (65, 0.013), (69, 0.036), (73, 0.041), (88, 0.09), (91, 0.029)]
simIndex simValue paperId paperTitle
1 0.99452198 2 nips-2005-A Bayes Rule for Density Matrices
Author: Manfred K. Warmuth
Abstract: The classical Bayes rule computes the posterior model probability from the prior probability and the data likelihood. We generalize this rule to the case when the prior is a density matrix (symmetric positive definite and trace one) and the data likelihood a covariance matrix. The classical Bayes rule is retained as the special case when the matrices are diagonal. In the classical setting, the calculation of the probability of the data is an expected likelihood, where the expectation is over the prior distribution. In the generalized setting, this is replaced by an expected variance calculation where the variance is computed along the eigenvectors of the prior density matrix and the expectation is over the eigenvalues of the density matrix (which form a probability vector). The variances along any direction is determined by the covariance matrix. Curiously enough this expected variance calculation is a quantum measurement where the co-variance matrix specifies the instrument and the prior density matrix the mixture state of the particle. We motivate both the classical and the generalized Bayes rule with a minimum relative entropy principle, where the Kullbach-Leibler version gives the classical Bayes rule and Umegaki’s quantum relative entropy the new Bayes rule for density matrices. 1
2 0.98641884 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine
Author: François Laviolette, Mario Marchand, Mohak Shah
Abstract: We design a new learning algorithm for the Set Covering Machine from a PAC-Bayes perspective and propose a PAC-Bayes risk bound which is minimized for classifiers achieving a non trivial margin-sparsity trade-off. 1
3 0.98597836 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm
Author: Sören Sonnenburg, Gunnar Rätsch, Christin Schäfer
Abstract: While classical kernel-based learning algorithms are based on a single kernel, in practice it is often desirable to use multiple kernels. Lankriet et al. (2004) considered conic combinations of kernel matrices for classification, leading to a convex quadratically constraint quadratic program. We show that it can be rewritten as a semi-infinite linear program that can be efficiently solved by recycling the standard SVM implementations. Moreover, we generalize the formulation and our method to a larger class of problems, including regression and one-class classification. Experimental results show that the proposed algorithm helps for automatic model selection, improving the interpretability of the learning result and works for hundred thousands of examples or hundreds of kernels to be combined. 1
same-paper 4 0.98533446 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning
Author: Andreas Argyriou, Mark Herbster, Massimiliano Pontil
Abstract: A foundational problem in semi-supervised learning is the construction of a graph underlying the data. We propose to use a method which optimally combines a number of differently constructed graphs. For each of these graphs we associate a basic graph kernel. We then compute an optimal combined kernel. This kernel solves an extended regularization problem which requires a joint minimization over both the data and the set of graph kernels. We present encouraging results on different OCR tasks where the optimal combined kernel is computed from graphs constructed with a variety of distance functions and the ‘k’ in k-nearest neighbors.
5 0.96565157 143 nips-2005-Off-Road Obstacle Avoidance through End-to-End Learning
Author: Urs Muller, Jan Ben, Eric Cosatto, Beat Flepp, Yann L. Cun
Abstract: We describe a vision-based obstacle avoidance system for off-road mobile robots. The system is trained from end to end to map raw input images to steering angles. It is trained in supervised mode to predict the steering angles provided by a human driver during training runs collected in a wide variety of terrains, weather conditions, lighting conditions, and obstacle types. The robot is a 50cm off-road truck, with two forwardpointing wireless color cameras. A remote computer processes the video and controls the robot via radio. The learning system is a large 6-layer convolutional network whose input is a single left/right pair of unprocessed low-resolution images. The robot exhibits an excellent ability to detect obstacles and navigate around them in real time at speeds of 2 m/s.
6 0.83656406 44 nips-2005-Computing the Solution Path for the Regularized Support Vector Regression
7 0.82521278 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
8 0.8092525 114 nips-2005-Learning Rankings via Convex Hull Separation
9 0.79627877 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
10 0.7939831 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines
11 0.79114282 105 nips-2005-Large-Scale Multiclass Transduction
12 0.78623897 77 nips-2005-From Lasso regression to Feature vector machine
13 0.78563625 31 nips-2005-Asymptotics of Gaussian Regularized Least Squares
14 0.77075446 184 nips-2005-Structured Prediction via the Extragradient Method
15 0.76105952 50 nips-2005-Convex Neural Networks
16 0.7542007 126 nips-2005-Metric Learning by Collapsing Classes
17 0.75282496 195 nips-2005-Transfer learning for text classification
18 0.74933219 138 nips-2005-Non-Local Manifold Parzen Windows
19 0.74369216 58 nips-2005-Divergences, surrogate loss functions and experimental design
20 0.74338186 47 nips-2005-Consistency of one-class SVM and related algorithms