nips nips2005 nips2005-105 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Thomas Gärtner, Quoc V. Le, Simon Burton, Alex J. Smola, Vishy Vishwanathan
Abstract: We present a method for performing transductive inference on very large datasets. Our algorithm is based on multiclass Gaussian processes and is effective whenever the multiplication of the kernel matrix or its inverse with a vector can be computed sufficiently fast. This holds, for instance, for certain graph and string kernels. Transduction is achieved by variational inference over the unlabeled data subject to a balancing constraint. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We present a method for performing transductive inference on very large datasets. [sent-13, score-0.194]
2 Our algorithm is based on multiclass Gaussian processes and is effective whenever the multiplication of the kernel matrix or its inverse with a vector can be computed sufficiently fast. [sent-14, score-0.409]
3 This holds, for instance, for certain graph and string kernels. [sent-15, score-0.204]
4 Transduction is achieved by variational inference over the unlabeled data subject to a balancing constraint. [sent-16, score-0.245]
5 1 Introduction While obtaining labeled data remains a time- and labor-consuming task, the acquisition and storage of unlabelled data is becoming increasingly cheap and easy. [sent-17, score-0.113]
6 This development has driven machine learning research into exploring algorithms that make extensive use of unlabelled data at training time in order to obtain better generalization performance. [sent-18, score-0.116]
7 A common problem of many transductive approaches is that they scale badly with the amount of unlabeled data, which prohibits their use on massive unlabeled datasets. [sent-19, score-0.312]
8 We perform classification on a dataset consisting of a digraph with 75,888 vertices and 508,960 edges. [sent-21, score-0.177]
9 To the best of our knowledge it has so far not been possible to perform transduction on graphs of this size in reasonable time (with standard hardware). [sent-22, score-0.296]
10 Moreover, on graphs various methods for unsupervised learning have been proposed [12, 11], all of which are mainly concerned with computing the kernel matrix on training and test set jointly. [sent-25, score-0.358]
11 Other formulations impose that the label assignment on the test set be consistent with the assumption of confident classification [8]. [sent-26, score-0.138]
12 Yet others impose that training and test set have similar marginal distributions [4]. [sent-27, score-0.222]
13 It is particularly efficient whenever Kα or K^{-1}α can be computed in linear time, where K ∈ R^{m×m} is the kernel matrix and α ∈ R^m. [sent-29, score-0.123]
14 • We require consistency of training and test marginals. [sent-30, score-0.115]
15 • Kernels (or their inverses) are computed on training and test set simultaneously. [sent-32, score-0.115]
16 On graphs this can lead to considerable computational savings. [sent-33, score-0.152]
17 • Self consistency of the estimates is achieved by a variational approach. [sent-34, score-0.073]
18 This allows us to make use of Gaussian Process multiclass formulations. [sent-35, score-0.176]
19 2 Multiclass Classification We begin with a brief overview of Gaussian Process multiclass classification [10], recast in terms of exponential families. [sent-36, score-0.176]
20 It is our goal to estimate y|x via p(y|x, θ) = exp(⟨φ(x, y), θ⟩ − g(θ|x)), where g(θ|x) = log Σ_y exp(⟨φ(x, y), θ⟩). [sent-47, score-0.141]
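For concreteness, a minimal sketch (ours, not code from the paper) of this exponential-family form: `joint_features` is one common, hypothetical choice of φ(x, y) that places x in the block belonging to class y, and the log-partition g(θ|x) is computed with a numerically stable log-sum-exp.

```python
import numpy as np
from scipy.special import logsumexp

def joint_features(x, y, n_classes):
    """phi(x, y): a copy of x placed in the block belonging to class y (one common choice)."""
    phi = np.zeros(n_classes * x.shape[0])
    phi[y * x.shape[0]:(y + 1) * x.shape[0]] = x
    return phi

def log_p_y_given_x(x, theta, n_classes):
    """log p(y|x, theta) = <phi(x, y), theta> - g(theta|x) for every class y."""
    scores = np.array([joint_features(x, y, n_classes) @ theta for y in range(n_classes)])
    g = logsumexp(scores)  # g(theta|x) = log sum_y exp(<phi(x, y), theta>)
    return scores - g

x = np.array([0.5, -1.0, 2.0])
theta = np.random.default_rng(0).normal(size=3 * 3)
print(np.exp(log_p_y_given_x(x, theta, n_classes=3)).sum())  # probabilities sum to ~1
```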
21 We impose a normal prior on θ, leading to the following negative joint likelihood in θ and Y: P := −log p(θ, Y|X) = Σ_{i=1}^m [g(θ|x_i) − ⟨φ(x_i, y_i), θ⟩] + (1/(2σ²)) ‖θ‖² + const. (2) [sent-49, score-0.185]
22 For transduction purposes p(θ, Y|X) will prove more useful than p(θ|Y, X). [sent-50, score-0.176]
23 Note that a normal prior on θ with variance σ²1 implies a Gaussian process on the random variable t(x, y) := ⟨φ(x, y), θ⟩ with covariance kernel Cov[t(x, y), t(x′, y′)] = σ² ⟨φ(x, y), φ(x′, y′)⟩ =: σ² k((x, y), (x′, y′)). [sent-51, score-0.104]
24 Here the joint log-likelihood (2), written in terms of α and K, yields Σ_{i=1}^m log Σ_{y=1}^n exp([Kα]_{iy}) − tr μ⊤Kα + (1/(2σ²)) tr α⊤Kα + const. [sent-60, score-0.354]
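A small sketch (ours; the helper name and the dense-matrix assumption are ours, not the paper's implementation) of this α-parameterized objective and its gradient, taking the regularization weight to be 1/(2σ²) as in (2):

```python
import numpy as np
from scipy.special import logsumexp, softmax

def objective_and_grad(alpha, K, mu, sigma2):
    """Negative joint log-likelihood in the alpha-parameterization and its gradient.
    alpha, mu: (m, n) arrays; K: (m, m) dense kernel matrix."""
    t = K @ alpha                                   # t = K alpha
    pi = softmax(t, axis=1)                         # predicted class probabilities
    obj = (logsumexp(t, axis=1).sum()               # sum_i log sum_y exp([K alpha]_iy)
           - np.sum(mu * t)                         # - tr(mu^T K alpha)
           + np.sum(alpha * t) / (2.0 * sigma2))    # + (1/(2 sigma^2)) tr(alpha^T K alpha)
    grad = K @ (pi - mu) + t / sigma2               # chain rule through t = K alpha
    return obj, grad

rng = np.random.default_rng(0)
m, n = 6, 3
K = np.eye(m)
mu = np.eye(m, n)                                   # toy one-hot targets (remaining rows zero)
print(objective_and_grad(rng.normal(size=(m, n)), K, mu, sigma2=1.0)[0])
```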
25 This is commonly done in the Gaussian process literature, and we will use both formulations, depending on the problem we need to solve: if Kα can be computed efficiently, as is the case with string kernels [9], we use the α-parameterization. [sent-62, score-0.186]
26 Conversely, if K^{-1}α is cheap, as for example with graph kernels [7], we use the t-parameterization. [sent-63, score-0.222]
27 Derivatives Second order methods such as Conjugate Gradient require the computation of derivatives of − log p(θ, Y |X) with respect to θ in terms of α or t. [sent-64, score-0.09]
28 Equations (8a) and (8b) give the directional second derivatives of P in the α- and t-parameterizations; both combine a softmax curvature term, e.g. tr u⊤(π …) in the t-parameterization, with the regularizer's contribution, σ^{-2} tr γ⊤Kβ in the α-parameterization. [sent-71, score-0.256]
29 Combining this with rates of convergence for Newton-type or nonlinear CG solver strategies yields overall time costs on the order of O(m log m) to O(m²) in the worst case, a significant improvement over conventional O(m³) methods. [sent-77, score-0.095]
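The cost argument hinges on never forming K explicitly: the optimizer only ever touches K through matrix-vector products. A sketch (ours, all names hypothetical) of how this might look with SciPy, assuming, as in the graph case, that K^{-1} is sparse, so each product Kv is obtained by a conjugate-gradient solve at O(|E|) cost per CG iteration:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, cg

def kernel_operator_from_sparse_inverse(K_inv):
    """Expose K = (K_inv)^{-1} only through matrix-vector products K @ v,
    each obtained by a CG solve against the sparse K_inv."""
    m = K_inv.shape[0]
    def matvec(v):
        x, _ = cg(K_inv, v)
        return x
    return LinearOperator((m, m), matvec=matvec)

# toy graph: path on 5 vertices; K^{-1} = L + eps * I is sparse and positive definite
W = sp.diags([np.ones(4), np.ones(4)], offsets=[-1, 1]).tocsr()
D = sp.diags(np.asarray(W.sum(axis=1)).ravel())
K_inv = (D - W) + 1e-2 * sp.eye(5)
K_op = kernel_operator_from_sparse_inverse(K_inv)
print(K_op.matvec(np.ones(5)))   # K @ 1 without ever forming the dense K
```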
30 3 Transductive Inference by Variational Methods As we are interested in transduction, the labels Y (and analogously the data X) decompose as Y = Ytrain ∪ Ytest . [sent-78, score-0.07]
31 Instead, we now aim at estimating the mode of p(θ|X, Ytrain ) by variational means. [sent-80, score-0.073]
32 By the standard variational bound (see e.g. [5]), −log p(θ|X, Ytrain) ≤ −log p(θ|X, Ytrain) + D(q(Ytest) ‖ p(Ytest|X, Ytrain, θ)) (9) = −Σ_{Ytest} q(Ytest) (log p(Ytest, θ|X, Ytrain) − log q(Ytest)) (10) holds. [sent-83, score-0.165]
33 The key trick is that while using a factorizing approximation for q we restrict the latter to distributions which satisfy balancing constraints. [sent-85, score-0.122]
34 That is, we require them to yield marginals on the unlabeled data which are comparable with the labeled observations. [sent-86, score-0.106]
35 With q_ij := q(y_i = j) we define μ_ij(q) = q_ij for all i > m_train, and μ_ij(q) = 1 if y_i = j and 0 otherwise for all i ≤ m_train. [sent-89, score-1.258]
36 In other words, we are taking the expectation in µ over all unobserved labels Ytest with respect to the distribution q(Ytest ). [sent-90, score-0.07]
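A small sketch (ours) of how μ(q) can be assembled from the definition above: one-hot rows for the labeled points, the current q for the unlabeled ones.

```python
import numpy as np

def expected_labels(y_train, q_test, n_classes):
    """mu(q): one-hot rows for labeled points (mu_ij = 1 iff y_i = j),
    the current soft assignments q for unlabeled points (mu_ij = q_ij)."""
    mu_train = np.eye(n_classes)[y_train]
    return np.vstack([mu_train, q_test])

y_train = np.array([0, 2, 1])                      # m_train = 3 labeled points
q_test = np.full((4, 3), 1.0 / 3.0)                # m_test = 4, uniform initial q
print(expected_labels(y_train, q_test, n_classes=3))
```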
37 We have −Σ_{Ytest} q(Ytest) log p(Ytest, θ|X, Ytrain) = Σ_{i=1}^m log Σ_{j=1}^n exp([Kα]_{ij}) − tr μ(q)⊤Kα + (1/(2σ²)) tr α⊤Kα + const. [sent-91, score-0.409]
38 Since q factorizes, Σ_{Ytest} q(Ytest) log q(Ytest) = Σ_{i=m_train+1}^m Σ_j q_ij log q_ij. (13) [sent-95, score-0.834]
39 (13) i=mtrain +1 It is unreasonable to assume that q may be chosen freely from all factorizing distributions (the latter would lead to a straightforward EM algorithm for transductive inference): if we observe a certain distribution of labels on the training set, e. [sent-96, score-0.373]
40 If m ≫ mtrain , however, a naive application of the variational bound can lead to cases where q is concentrated on one class — the increase in likelihood for a resulting very simple classifier completely outweighs any balancing constraints implicit in the data. [sent-100, score-0.496]
41 It is, incidentally, also the reason why SVM transduction optimization codes [4] impose a balancing constraint on the assignment of test labels. [sent-102, score-0.415]
42 We impose the following conditions: r_j^− ≤ m_test^{-1} Σ_{i=m_train+1}^m q_ij ≤ r_j^+ for all j ∈ Y, and Σ_{j=1}^n q_ij = 1 for all i ∈ {m_train+1, …, m}. [sent-103, score-0.949]
43 Here the constraints r_j^− = p_emp(y = j) − ε and r_j^+ = p_emp(y = j) + ε are chosen so as to correspond to confidence intervals given by finite-sample-size tail bounds. [sent-106, score-0.36]
44 In other words, we set p_emp(y = j) = m_train^{-1} Σ_{i=1}^{m_train} 1{y_i = j} and ε such as to satisfy Pr{ |m_train^{-1} Σ_{i=1}^{m_train} ξ_i − m_test^{-1} Σ_{i=1}^{m_test} ξ′_i| > ε } ≤ δ (14) for iid {0, 1} random variables ξ_i and ξ′_i with mean p. [sent-107, score-0.713]
45 After application of a union bound over the class labels, one obtains ε ≤ √(log(2n/δ) m / (2 m_train m_test)). [sent-111, score-0.186]
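Since ε has this closed form, the interval [r_j^−, r_j^+] is cheap to compute. A sketch (ours, assuming the Hoeffding-plus-union-bound form quoted above):

```python
import numpy as np

def balancing_intervals(y_train, m_test, n_classes, delta=0.05):
    """Empirical class marginals on the training labels and the interval
    [p_emp - eps, p_emp + eps] with eps = sqrt(log(2n/delta) m / (2 m_train m_test))."""
    m_train = len(y_train)
    m = m_train + m_test
    p_emp = np.bincount(y_train, minlength=n_classes) / m_train
    eps = np.sqrt(np.log(2 * n_classes / delta) * m / (2.0 * m_train * m_test))
    return np.clip(p_emp - eps, 0.0, 1.0), np.clip(p_emp + eps, 0.0, 1.0)

r_lo, r_hi = balancing_intervals(np.array([0, 0, 1, 2, 1, 0]), m_test=100, n_classes=3)
print(r_lo, r_hi)
```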
46 4 Graphs, Strings and Vectors We now discuss the two main applications where computational savings can be achieved: graphs and strings. [sent-112, score-0.12]
47 In the case of graphs, the advantage arises from the fact that K^{-1} is sparse, whereas for texts we can use fast string kernels [9] to compute Kα in linear time. [sent-113, score-0.186]
48 Graphs Denote by G(V, E) the graph given by vertices V and edges E where each edge is a set of two vertices. [sent-114, score-0.172]
49 Then W ∈ R^{|V|×|V|} denotes the adjacency matrix of the graph, where W_ij > 0 only if edge {i, j} ∈ E. [sent-115, score-0.101]
50 We assume that the graph G, and thus also the adjacency matrix W , is sparse. [sent-116, score-0.221]
51 Now denote by 1 the identity matrix and by D the diagonal matrix of vertex degrees, i.e. D_ii = Σ_j W_ij. [sent-117, score-0.108]
52 Then the graph Laplacian and the normalized graph Laplacian of G are given by L := D − W and L̃ := 1 − D^{-1/2} W D^{-1/2}, (15) respectively. [sent-120, score-0.24]
53 Many kernels K (or their inverses) on G are given by low-degree polynomials of the Laplacian or the adjacency matrix of G, such as the following: K = Σ_{i=1}^l c_i W^{2i}, K = Π_{i=1}^l (1 − c_i L̃), or K^{-1} = L̃ + ε1. [sent-121, score-0.279]
54 The first kernel arises from an l-step random walk; the third case is typically referred to as the regularized graph Laplacian. [sent-123, score-0.189]
55 This means that if the average degree of the graph does not increase with the number of observations, the Laplacian has only O(m) nonzero entries, where m = |V| for inference on graphs. [sent-125, score-0.155]
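For concreteness, a sketch (ours, using SciPy sparse matrices) of the Laplacians in (15) and of the regularized-Laplacian inverse kernel; everything stays sparse, so multiplying by K^{-1} costs O(|E|):

```python
import numpy as np
import scipy.sparse as sp

def laplacians(W):
    """L = D - W and normalized Laplacian Ltilde = 1 - D^{-1/2} W D^{-1/2}, both sparse."""
    deg = np.asarray(W.sum(axis=1)).ravel()
    D = sp.diags(deg)
    D_inv_sqrt = sp.diags(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    return (D - W).tocsr(), (sp.eye(W.shape[0]) - D_inv_sqrt @ W @ D_inv_sqrt).tocsr()

# toy graph: a 4-cycle
rows = [0, 1, 2, 3, 1, 2, 3, 0]
cols = [1, 2, 3, 0, 0, 1, 2, 3]
W = sp.csr_matrix((np.ones(8), (rows, cols)), shape=(4, 4))
L, L_tilde = laplacians(W)
K_inv = L_tilde + 1e-2 * sp.eye(4)      # the 'regularized graph Laplacian' inverse kernel
print(K_inv.toarray().round(2))
```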
56 From Graphs to Graphical Models Graphs are one of the examples where transduction actually improves computational cost: Assume that we are given the inverse kernel matrix K^{-1} on training and test set and we wish to perform induction only. [sent-126, score-0.494]
57 In this case we need to compute the kernel matrix (or its inverse) restricted to the training set. [sent-127, score-0.191]
58 Let K^{-1} = (A B; B⊤ C); then the inverse of the upper left hand corner of K (representing the training set part only) is given by the Schur complement A − B C^{-1} B⊤. [sent-128, score-0.101]
59 Moreover, neither the Schur complement nor its inverse is typically sparse. [sent-130, score-0.113]
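The fill-in effect is easy to see numerically. A small sketch (ours) comparing the sparsity of K^{-1} on training and test data with the density of the training-set block of K obtained through the Schur complement:

```python
import numpy as np
import scipy.sparse as sp

# Sparse inverse kernel on train+test: a chain-graph Laplacian plus a small ridge.
m, m_train = 8, 3
off = -np.ones(m - 1)
K_inv = sp.diags([off, 2.1 * np.ones(m), off], offsets=[-1, 0, 1]).toarray()

A = K_inv[:m_train, :m_train]       # train-train block of K^{-1}
B = K_inv[:m_train, m_train:]       # train-test block
C = K_inv[m_train:, m_train:]       # test-test block
K_train = np.linalg.inv(A - B @ np.linalg.inv(C) @ B.T)   # training block of K

print(np.count_nonzero(K_inv), "nonzeros in the sparse K^{-1}")    # 22
print(np.count_nonzero(K_train.round(12)), "nonzeros in K_train")  # 9: fully dense
```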
60 Here we have a nice connection between graphical models and graph kernels. [sent-131, score-0.165]
61 In this case the inverse covariance matrix has nonzero entries only for variables with a direct dependency structure. [sent-133, score-0.134]
62 In other words, if we are given a graphical model of normal random variables, their conditional independence structure is reflected by K^{-1}. [sent-135, score-0.08]
63 In the same way that marginalization in graphical models may induce dependencies, computing the kernel matrix on the training set only may lead to dense matrices, even when the inverse kernel on training and test data combined is sparse. [sent-136, score-0.587]
64 The bottom line is that there are cases where it is computationally cheaper to take both training and test set into account and optimize over a larger set of variables, rather than dealing with a smaller dense matrix. [sent-137, score-0.17]
65 Strings: Efficient computation of string kernels using suffix trees was described in [9]. [sent-138, score-0.186]
66 The efficient computation scheme covers all kernels of type k(x, x′) = Σ_s w_s #_s(x) #_s(x′) (17) for arbitrary w_s ≥ 0. [sent-141, score-0.184]
67 This means that computation time for evaluating Kα is again O(Σ_i |x_i|), as we need to evaluate the kernel expansion for all x ∈ X. [sent-143, score-0.069]
68 Since the average string length is independent of m this yields an O(m) algorithm for Kα. [sent-144, score-0.084]
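A sketch (ours) of the kernel family in (17) with explicit substring-count feature maps. The suffix-tree machinery of [9] evaluates the same expansion without materializing the counts; the naive version below is only meant to make the quantities concrete.

```python
from collections import Counter
import numpy as np

def substring_counts(x, max_len=3):
    """#_s(x) for all substrings s of length at most max_len."""
    counts = Counter()
    for i in range(len(x)):
        for l in range(1, min(max_len, len(x) - i) + 1):
            counts[x[i:i + l]] += 1
    return counts

def string_kernel(x, xp, max_len=3, weight=lambda s: 1.0):
    """k(x, x') = sum_s w_s #_s(x) #_s(x') over shared substrings."""
    cx, cxp = substring_counts(x, max_len), substring_counts(xp, max_len)
    return sum(weight(s) * n * cxp[s] for s, n in cx.items() if s in cxp)

def kernel_expansion(alpha, X, x_new, max_len=3):
    """f(x_new) = sum_i alpha_i k(x_i, x_new); suffix trees [9] give this in O(|x_new|)."""
    return sum(a * string_kernel(x, x_new, max_len) for a, x in zip(alpha, X))

X = ["gatta", "cataract", "attack"]
alpha = np.array([0.5, -0.2, 1.0])
print(kernel_expansion(alpha, X, "attic"))
```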
69 We have: minimize over q tr q⊤τ + Σ_{i,j} q_ij log q_ij (18) subject to q_j^− ≤ Σ_i q_ij ≤ q_j^+, q_ij ≥ 0, and Σ_j q_lj = 1 for all j ∈ Y and l ∈ {1, …, m_test}. [sent-158, score-1.695]
70 Table 1: Error rates on some benchmark datasets (mostly from UCI). [sent-160, score-0.176]
71 The last column gives the error rates reported in [1]. [sent-161, score-0.15]
72 Using Lagrange multipliers one can show that q needs to satisfy q_ij = exp(−τ_ij) b_i c_j where b_i, c_j ≥ 0. [sent-219, score-0.424]
73 Solving for Σ_j q_ij = 1 yields q_ij = exp(−τ_ij) c_j / Σ_{l=1}^n exp(−τ_il) c_l. [sent-220, score-0.809]
74 This means that instead of an optimization problem in m_test × n variables we now only need to optimize over n variables subject to 2n constraints. [sent-221, score-0.136]
75 Note that the exact matching constraint, where q_j^+ = q_j^−, amounts to a maximum likelihood problem for a shifted exponential family model with q_ij = exp(−τ_ij) exp(γ_j − g_i(γ)). [sent-222, score-0.426]
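A sketch (ours) of the reduced problem for the exact-matching case: given τ and per-class multipliers c = exp(γ), the row-normalized q is available in closed form, and only the n class marginals remain to be matched. The fixed-point update below is a plausible stand-in, not necessarily the solver used in the paper.

```python
import numpy as np
from scipy.special import softmax

def q_from_gamma(tau, gamma):
    """q_ij = exp(-tau_ij) c_j / sum_l exp(-tau_il) c_l with c_j = exp(gamma_j)."""
    return softmax(-tau + gamma, axis=1)

def fit_class_multipliers(tau, target_counts, n_iter=200):
    """Adjust gamma so the expected class counts sum_i q_ij match target_counts
    (a simple multiplicative fixed-point update; a hypothetical stand-in for the solver)."""
    gamma = np.zeros(tau.shape[1])
    for _ in range(n_iter):
        counts = q_from_gamma(tau, gamma).sum(axis=0)
        gamma += np.log(target_counts / np.maximum(counts, 1e-12))
    return gamma

rng = np.random.default_rng(0)
tau = rng.normal(size=(10, 3))                      # tau_ij from the current GP predictions
target = np.array([0.5, 0.3, 0.2]) * 10             # desired expected counts per class
q = q_from_gamma(tau, fit_class_multipliers(tau, target))
print(q.sum(axis=0))                                 # approximately [5.0, 3.0, 2.0]
```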
76 It can be shown that the approximate matching problem is equivalent to a maximum a posteriori optimization problem using the norm dual to expectation constraints on qij . [sent-223, score-0.445]
77 As initialization we choose γi such that the per class averages match the marginal constraint while ignoring the per sample balance. [sent-226, score-0.079]
78 6 Experiments Unfortunately, we are not aware of other multiclass transductive learning algorithms. [sent-228, score-0.369]
79 To still be able to compare our approach to other transductive learning algorithms we performed experiments on some benchmark datasets. [sent-229, score-0.212]
80 To investigate the performance of our algorithm in classifying vertices of a graph, we choose the WebKB dataset. [sent-230, score-0.084]
81 Benchmark datasets Table 1 reports results on some benchmark datasets. [sent-231, score-0.083]
82 To be able to compare the error rates of the transductive multiclass Gaussian Process classifier proposed in this paper, we also report error rates from [2] and an inductive multiclass Gaussian Process classifier. [sent-232, score-0.591]
83 Parameters were chosen by crossvalidation inside the training folds. [sent-234, score-0.134]
84 Graph Mining To illustrate the effectiveness of our approach on graphs we performed experiments on the well known WebKB dataset. [sent-235, score-0.12]
85 This dataset consists of 8275 webpages classified into 7 classes. [sent-236, score-0.117]
86 As we are using this dataset to evaluate our graph mining algorithm, we ignore the text on each webpage and consider the dataset as a labelled directed graph. [sent-238, score-0.335]
87 We use the co-linkage graph and report results for ‘inverse’ 10-fold stratified cross-validations, i.e. [sent-242, score-0.12]
88 we use 1 fold as training data and 9 folds as test data. [sent-244, score-0.115]
89 To overcome this, we predict on the test set as follows: For each class the instances that are most likely to be in this class are picked (if they haven’t been picked for a class with lower index) such that the fraction of instances assigned to this class is the same on the training and test set. [sent-247, score-0.382]
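A sketch (ours) of the prediction rule just described: classes are visited in index order, and each takes its training-set share of the still-unassigned test points, most confident first. The handling of leftover points is our own choice; the text does not specify it.

```python
import numpy as np

def balanced_predict(q, class_fractions):
    """Greedy balanced assignment: for each class j (in index order) pick the as-yet
    unassigned test points with the highest q_ij, matching the training-set fraction."""
    m_test, n_classes = q.shape
    y_hat = np.full(m_test, -1)
    for j in range(n_classes):
        budget = int(round(class_fractions[j] * m_test))
        order = np.argsort(-q[:, j])                        # most confident first
        picked = np.array([i for i in order if y_hat[i] == -1][:budget], dtype=int)
        y_hat[picked] = j
    leftover = y_hat == -1                                  # rounding leftovers:
    y_hat[leftover] = np.argmax(q[leftover], axis=1)        # plain argmax (our choice)
    return y_hat

rng = np.random.default_rng(0)
q = rng.dirichlet(np.ones(3), size=10)                      # soft test-set assignments
print(balanced_predict(q, class_fractions=np.array([0.5, 0.3, 0.2])))
```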
90 Although a directed graph approach outperforms an undirected one there, we resorted to kernels for undirected graphs, as those are computationally more attractive. [sent-250, score-0.258]
91 We will investigate computationally attractive digraph kernels in future work and expect similar benefits as reported by [11]. [sent-251, score-0.222]
92 To investigate the behaviour of our algorithm with less training data, we performed a 20-fold inverse crossvalidation on the ‘wisconsin’ subset and observed an error rate of 17% there. [sent-253, score-0.246]
93 To further strengthen our results and show that the runtime performance of our algorithm is sufficient for classifying the vertices of massive graphs, we also performed initial experiments on the Epinions dataset collected by Mathew Richardson and Pedro Domingos. [sent-254, score-0.156]
94 However, the experiments show that the algorithm can be run on very large graph datasets. [sent-259, score-0.12]
95 7 Discussion and Extensions We presented an efficient method for performing transduction on multiclass estimation problems with Gaussian Processes. [sent-260, score-0.352]
96 It performs particularly well whenever the kernel matrix has special numerical properties which allow fast matrix vector multiplication. [sent-261, score-0.177]
97 That said, also on standard dense problems we observed very good improvements (typically a 10% reduction of the training error) over standard induction. [sent-262, score-0.123]
98 Structured Labels and Conditional Random Fields are a clear area in which to extend the transductive setting. [sent-263, score-0.159]
99 The key obstacle to overcome in this context is to find a suitable marginal distribution: with increasing structure of the labels the confidence bounds per subclass decrease dramatically. [sent-264, score-0.116]
100 Learning from labeled and unlabeled data on a directed graph. [sent-358, score-0.095]
wordName wordTfidf (topN-words)
[('ytest', 0.459), ('qij', 0.362), ('ytrain', 0.306), ('mtrain', 0.25), ('transduction', 0.176), ('multiclass', 0.176), ('transductive', 0.159), ('tr', 0.128), ('graphs', 0.12), ('graph', 0.12), ('webkb', 0.111), ('kernels', 0.102), ('ij', 0.085), ('string', 0.084), ('iy', 0.083), ('mtest', 0.083), ('pemp', 0.083), ('rj', 0.082), ('inverse', 0.08), ('balancing', 0.078), ('variational', 0.073), ('labels', 0.07), ('kernel', 0.069), ('dataset', 0.069), ('training', 0.068), ('rm', 0.067), ('crossvalidation', 0.066), ('impose', 0.061), ('unlabeled', 0.059), ('classi', 0.059), ('digraph', 0.056), ('epinions', 0.056), ('mathew', 0.056), ('pedro', 0.056), ('rror', 0.056), ('topreviewers', 0.056), ('vishwanathan', 0.056), ('log', 0.055), ('dense', 0.055), ('matrix', 0.054), ('optimization', 0.053), ('benchmark', 0.053), ('gp', 0.053), ('cg', 0.053), ('vertices', 0.052), ('cornell', 0.048), ('unlabelled', 0.048), ('schur', 0.048), ('webpages', 0.048), ('test', 0.047), ('adjacency', 0.047), ('marginals', 0.047), ('sch', 0.047), ('marginal', 0.046), ('graphical', 0.045), ('picked', 0.044), ('richardson', 0.044), ('factorizing', 0.044), ('xi', 0.044), ('exp', 0.043), ('ws', 0.041), ('australian', 0.041), ('webpage', 0.041), ('rates', 0.04), ('gaussian', 0.04), ('wisconsin', 0.039), ('strings', 0.039), ('cancer', 0.039), ('ci', 0.038), ('lkopf', 0.036), ('smola', 0.036), ('directed', 0.036), ('inference', 0.035), ('australia', 0.035), ('cheap', 0.035), ('usps', 0.035), ('massive', 0.035), ('normal', 0.035), ('laplacian', 0.035), ('derivatives', 0.035), ('yi', 0.034), ('kluwer', 0.034), ('aware', 0.034), ('class', 0.033), ('complement', 0.033), ('suf', 0.032), ('investigate', 0.032), ('reported', 0.032), ('cation', 0.032), ('lead', 0.032), ('expand', 0.032), ('qj', 0.032), ('qi', 0.032), ('conjugate', 0.031), ('cj', 0.031), ('constraints', 0.03), ('storage', 0.03), ('formulations', 0.03), ('reports', 0.03), ('multiplication', 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 105 nips-2005-Large-Scale Multiclass Transduction
Author: Thomas Gärtner, Quoc V. Le, Simon Burton, Alex J. Smola, Vishy Vishwanathan
Abstract: We present a method for performing transductive inference on very large datasets. Our algorithm is based on multiclass Gaussian processes and is effective whenever the multiplication of the kernel matrix or its inverse with a vector can be computed sufficiently fast. This holds, for instance, for certain graph and string kernels. Transduction is achieved by variational inference over the unlabeled data subject to a balancing constraint. 1
2 0.16379881 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning
Author: Andreas Argyriou, Mark Herbster, Massimiliano Pontil
Abstract: A foundational problem in semi-supervised learning is the construction of a graph underlying the data. We propose to use a method which optimally combines a number of differently constructed graphs. For each of these graphs we associate a basic graph kernel. We then compute an optimal combined kernel. This kernel solves an extended regularization problem which requires a joint minimization over both the data and the set of graph kernels. We present encouraging results on different OCR tasks where the optimal combined kernel is computed from graphs constructed with a variety of distances functions and the ‘k’ in nearest neighbors. 1
3 0.14308925 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
Author: Y. Altun, D. McAllester, M. Belkin
Abstract: Many real-world classification problems involve the prediction of multiple inter-dependent variables forming some structural dependency. Recent progress in machine learning has mainly focused on supervised classification of such structured variables. In this paper, we investigate structured classification in a semi-supervised setting. We present a discriminative approach that utilizes the intrinsic geometry of input patterns revealed by unlabeled data points and we derive a maximum-margin formulation of semi-supervised learning for structured variables. Unlike transductive algorithms, our formulation naturally extends to new test points. 1
4 0.12674733 46 nips-2005-Consensus Propagation
Author: Benjamin V. Roy, Ciamac C. Moallemi
Abstract: We propose consensus propagation, an asynchronous distributed protocol for averaging numbers across a network. We establish convergence, characterize the convergence rate for regular graphs, and demonstrate that the protocol exhibits better scaling properties than pairwise averaging, an alternative that has received much recent attention. Consensus propagation can be viewed as a special case of belief propagation, and our results contribute to the belief propagation literature. In particular, beyond singly-connected graphs, there are very few classes of relevant problems for which belief propagation is known to converge. 1
5 0.12626649 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
Author: Ashish Kapoor, Hyungil Ahn, Yuan Qi, Rosalind W. Picard
Abstract: There have been many graph-based approaches for semi-supervised classification. One problem is that of hyperparameter learning: performance depends greatly on the hyperparameters of the similarity graph, transformation of the graph Laplacian and the noise model. We present a Bayesian framework for learning hyperparameters for graph-based semisupervised classification. Given some labeled data, which can contain inaccurate labels, we pose the semi-supervised classification as an inference problem over the unknown labels. Expectation Propagation is used for approximate inference and the mean of the posterior is used for classification. The hyperparameters are learned using EM for evidence maximization. We also show that the posterior mean can be written in terms of the kernel matrix, providing a Bayesian classifier to classify new points. Tests on synthetic and real datasets show cases where there are significant improvements in performance over the existing approaches. 1
6 0.1261327 195 nips-2005-Transfer learning for text classification
7 0.12348839 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
8 0.10389281 114 nips-2005-Learning Rankings via Convex Hull Separation
9 0.098580308 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models
10 0.095311105 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
11 0.086056724 178 nips-2005-Soft Clustering on Graphs
12 0.083744206 30 nips-2005-Assessing Approximations for Gaussian Process Classification
13 0.082099386 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
14 0.079251081 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm
15 0.076144606 76 nips-2005-From Batch to Transductive Online Learning
16 0.076079644 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis
17 0.075721607 175 nips-2005-Sequence and Tree Kernels with Statistical Feature Mining
18 0.074483678 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts
19 0.070461266 52 nips-2005-Correlated Topic Models
20 0.070095919 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification
topicId topicWeight
[(0, 0.258), (1, 0.143), (2, -0.075), (3, -0.072), (4, 0.001), (5, -0.049), (6, 0.051), (7, 0.057), (8, 0.075), (9, 0.064), (10, 0.056), (11, -0.119), (12, -0.03), (13, 0.029), (14, -0.078), (15, -0.165), (16, 0.025), (17, -0.069), (18, 0.031), (19, 0.082), (20, -0.034), (21, 0.056), (22, 0.014), (23, -0.003), (24, 0.103), (25, -0.067), (26, 0.021), (27, 0.045), (28, -0.008), (29, 0.057), (30, -0.002), (31, -0.044), (32, -0.031), (33, -0.025), (34, 0.088), (35, 0.111), (36, -0.077), (37, 0.029), (38, 0.047), (39, -0.013), (40, -0.024), (41, -0.001), (42, -0.038), (43, -0.08), (44, 0.017), (45, -0.017), (46, -0.143), (47, -0.032), (48, -0.047), (49, -0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.92930567 105 nips-2005-Large-Scale Multiclass Transduction
Author: Thomas Gärtner, Quoc V. Le, Simon Burton, Alex J. Smola, Vishy Vishwanathan
Abstract: We present a method for performing transductive inference on very large datasets. Our algorithm is based on multiclass Gaussian processes and is effective whenever the multiplication of the kernel matrix or its inverse with a vector can be computed sufficiently fast. This holds, for instance, for certain graph and string kernels. Transduction is achieved by variational inference over the unlabeled data subject to a balancing constraint. 1
2 0.7376439 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning
Author: Andreas Argyriou, Mark Herbster, Massimiliano Pontil
Abstract: A foundational problem in semi-supervised learning is the construction of a graph underlying the data. We propose to use a method which optimally combines a number of differently constructed graphs. For each of these graphs we associate a basic graph kernel. We then compute an optimal combined kernel. This kernel solves an extended regularization problem which requires a joint minimization over both the data and the set of graph kernels. We present encouraging results on different OCR tasks where the optimal combined kernel is computed from graphs constructed with a variety of distances functions and the ‘k’ in nearest neighbors. 1
3 0.72102213 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
Author: Ashish Kapoor, Hyungil Ahn, Yuan Qi, Rosalind W. Picard
Abstract: There have been many graph-based approaches for semi-supervised classification. One problem is that of hyperparameter learning: performance depends greatly on the hyperparameters of the similarity graph, transformation of the graph Laplacian and the noise model. We present a Bayesian framework for learning hyperparameters for graph-based semisupervised classification. Given some labeled data, which can contain inaccurate labels, we pose the semi-supervised classification as an inference problem over the unknown labels. Expectation Propagation is used for approximate inference and the mean of the posterior is used for classification. The hyperparameters are learned using EM for evidence maximization. We also show that the posterior mean can be written in terms of the kernel matrix, providing a Bayesian classifier to classify new points. Tests on synthetic and real datasets show cases where there are significant improvements in performance over the existing approaches. 1
4 0.68863273 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
Author: Y. Altun, D. McAllester, M. Belkin
Abstract: Many real-world classification problems involve the prediction of multiple inter-dependent variables forming some structural dependency. Recent progress in machine learning has mainly focused on supervised classification of such structured variables. In this paper, we investigate structured classification in a semi-supervised setting. We present a discriminative approach that utilizes the intrinsic geometry of input patterns revealed by unlabeled data points and we derive a maximum-margin formulation of semi-supervised learning for structured variables. Unlike transductive algorithms, our formulation naturally extends to new test points. 1
5 0.6062026 114 nips-2005-Learning Rankings via Convex Hull Separation
Author: Glenn Fung, Rómer Rosales, Balaji Krishnapuram
Abstract: We propose efficient algorithms for learning ranking functions from order constraints between sets—i.e. classes—of training samples. Our algorithms may be used for maximizing the generalized Wilcoxon Mann Whitney statistic that accounts for the partial ordering of the classes: special cases include maximizing the area under the ROC curve for binary classification and its generalization for ordinal regression. Experiments on public benchmarks indicate that: (a) the proposed algorithm is at least as accurate as the current state-of-the-art; (b) computationally, it is several orders of magnitude faster and—unlike current methods—it is easily able to handle even large datasets with over 20,000 samples. 1
6 0.55372673 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
7 0.53148818 195 nips-2005-Transfer learning for text classification
8 0.51904714 62 nips-2005-Efficient Estimation of OOMs
9 0.51876402 175 nips-2005-Sequence and Tree Kernels with Statistical Feature Mining
10 0.51594549 116 nips-2005-Learning Topology with the Generative Gaussian Graph and the EM Algorithm
11 0.50732523 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm
12 0.49674147 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
13 0.48568073 30 nips-2005-Assessing Approximations for Gaussian Process Classification
14 0.47564873 107 nips-2005-Large scale networks fingerprinting and visualization using the k-core decomposition
15 0.45697549 46 nips-2005-Consensus Propagation
16 0.4495343 69 nips-2005-Fast Gaussian Process Regression using KD-Trees
17 0.44937837 189 nips-2005-Tensor Subspace Analysis
18 0.44658872 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
19 0.44515204 76 nips-2005-From Batch to Transductive Online Learning
20 0.4416644 204 nips-2005-Walk-Sum Interpretation and Analysis of Gaussian Belief Propagation
topicId topicWeight
[(3, 0.052), (10, 0.035), (11, 0.01), (18, 0.013), (27, 0.025), (31, 0.049), (34, 0.151), (39, 0.011), (41, 0.021), (50, 0.013), (55, 0.03), (65, 0.022), (69, 0.061), (73, 0.046), (88, 0.082), (91, 0.067), (98, 0.238)]
simIndex simValue paperId paperTitle
same-paper 1 0.82134283 105 nips-2005-Large-Scale Multiclass Transduction
Author: Thomas Gärtner, Quoc V. Le, Simon Burton, Alex J. Smola, Vishy Vishwanathan
Abstract: We present a method for performing transductive inference on very large datasets. Our algorithm is based on multiclass Gaussian processes and is effective whenever the multiplication of the kernel matrix or its inverse with a vector can be computed sufficiently fast. This holds, for instance, for certain graph and string kernels. Transduction is achieved by variational inference over the unlabeled data subject to a balancing constraint. 1
2 0.64463753 184 nips-2005-Structured Prediction via the Extragradient Method
Author: Ben Taskar, Simon Lacoste-Julian, Michael I. Jordan
Abstract: We present a simple and scalable algorithm for large-margin estimation of structured models, including an important class of Markov networks and combinatorial models. We formulate the estimation problem as a convex-concave saddle-point problem and apply the extragradient method, yielding an algorithm with linear convergence using simple gradient and projection calculations. The projection step can be solved using combinatorial algorithms for min-cost quadratic flow. This makes the approach an efficient alternative to formulations based on reductions to a quadratic program (QP). We present experiments on two very different structured prediction tasks: 3D image segmentation and word alignment, illustrating the favorable scaling properties of our algorithm. 1
3 0.6410777 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
Author: John D. Lafferty, Pradeep K. Ravikumar
Abstract: We present a family of approximation techniques for probabilistic graphical models, based on the use of graphical preconditioners developed in the scientific computing literature. Our framework yields rigorous upper and lower bounds on event probabilities and the log partition function of undirected graphical models, using non-iterative procedures that have low time complexity. As in mean field approaches, the approximations are built upon tractable subgraphs; however, we recast the problem of optimizing the tractable distribution parameters and approximate inference in terms of the well-studied linear systems problem of obtaining a good matrix preconditioner. Experiments are presented that compare the new approximation schemes to variational methods. 1
4 0.64083856 114 nips-2005-Learning Rankings via Convex Hull Separation
Author: Glenn Fung, Rómer Rosales, Balaji Krishnapuram
Abstract: We propose efficient algorithms for learning ranking functions from order constraints between sets—i.e. classes—of training samples. Our algorithms may be used for maximizing the generalized Wilcoxon Mann Whitney statistic that accounts for the partial ordering of the classes: special cases include maximizing the area under the ROC curve for binary classification and its generalization for ordinal regression. Experiments on public benchmarks indicate that: (a) the proposed algorithm is at least as accurate as the current state-of-the-art; (b) computationally, it is several orders of magnitude faster and—unlike current methods—it is easily able to handle even large datasets with over 20,000 samples. 1
5 0.63859224 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines
Author: Yoshua Bengio, Olivier Delalleau, Nicolas L. Roux
Abstract: We present a series of theoretical arguments supporting the claim that a large class of modern learning algorithms that rely solely on the smoothness prior – with similarity between examples expressed with a local kernel – are sensitive to the curse of dimensionality, or more precisely to the variability of the target. Our discussion covers supervised, semisupervised and unsupervised learning algorithms. These algorithms are found to be local in the sense that crucial properties of the learned function at x depend mostly on the neighbors of x in the training set. This makes them sensitive to the curse of dimensionality, well studied for classical non-parametric statistical learning. We show in the case of the Gaussian kernel that when the function to be learned has many variations, these algorithms require a number of training examples proportional to the number of variations, which could be large even though there may exist short descriptions of the target function, i.e. their Kolmogorov complexity may be low. This suggests that there exist non-local learning algorithms that at least have the potential to learn about such structured but apparently complex functions (because locally they have many variations), while not using very specific prior domain knowledge. 1
6 0.63766217 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
7 0.6350463 50 nips-2005-Convex Neural Networks
8 0.62900883 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
9 0.6282962 77 nips-2005-From Lasso regression to Feature vector machine
10 0.62823635 47 nips-2005-Consistency of one-class SVM and related algorithms
11 0.62749755 58 nips-2005-Divergences, surrogate loss functions and experimental design
12 0.62713784 78 nips-2005-From Weighted Classification to Policy Search
13 0.62674308 98 nips-2005-Infinite latent feature models and the Indian buffet process
14 0.62525553 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators
15 0.6249305 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
16 0.62481803 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation
17 0.6208393 143 nips-2005-Off-Road Obstacle Avoidance through End-to-End Learning
18 0.6205343 23 nips-2005-An Application of Markov Random Fields to Range Sensing
19 0.61927974 178 nips-2005-Soft Clustering on Graphs
20 0.61845326 177 nips-2005-Size Regularized Cut for Data Clustering