jmlr jmlr2008 jmlr2008-76 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Olivier Chapelle, Vikas Sindhwani, Sathiya S. Keerthi
Abstract: Due to its wide applicability, the problem of semi-supervised classification is attracting increasing attention in machine learning. Semi-Supervised Support Vector Machines (S3VMs) are based on applying the margin maximization principle to both labeled and unlabeled examples. Unlike SVMs, their formulation leads to a non-convex optimization problem. A suite of algorithms has recently been proposed for solving S3VMs. This paper reviews key ideas in this literature. The performance and behavior of various S3VM algorithms are studied together, under a common experimental setting. Keywords: semi-supervised learning, support vector machines, non-convex optimization, transductive learning
Reference: text
sentIndex sentText sentNum sentScore
1 Semi-Supervised Support Vector Machines (S 3 VMs) are based on applying the margin maximization principle to both labeled and unlabeled examples. [sent-9, score-0.474]
2 The goal of semi-supervised learning is to employ the large collection of unlabeled data jointly with a few labeled examples for improving generalization performance. [sent-17, score-0.474]
3 By maximizing the margin in the presence of unlabeled data, one learns a decision boundary that passes through low data-density regions while respecting labels in the input space. [sent-20, score-0.449]
4 There are 2 labeled points (the triangle and the cross) and 100 unlabeled points. [sent-26, score-0.499]
5 The training set consists of l labeled examples {(x_i, y_i)}_{i=1}^{l}, y_i = ±1, and u unlabeled examples {x_i}_{i=l+1}^{n}, with n = l + u. [sent-60, score-0.552]
6 Writing y_u = [y_{l+1}, ..., y_n]^T, the problem is min_{(w,b), y_u} I(w, b, y_u) = (1/2)||w||^2 + C ∑_{i=1}^{l} V(y_i, o_i) + C' ∑_{i=l+1}^{n} V(y_i, o_i)   (1), where o_i = w·x_i + b and V is a loss function. [sent-64, score-0.543]
7 The hinge loss is a popular choice for V: V(y_i, o_i) = max(0, 1 − y_i o_i)^p. [sent-65, score-0.307]
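As a concrete illustration of objective (1) (this is not code from the paper; the names w, b, X_lab, y_lab, X_unl, y_unl, C and C_prime, and the quadratic choice p = 2, are assumptions made for the sketch):

```python
import numpy as np

def hinge(y, o, p=2):
    """Loss V(y, o) = max(0, 1 - y*o)^p: hinge for p = 1, quadratic hinge for p = 2."""
    return np.maximum(0.0, 1.0 - y * o) ** p

def s3vm_objective(w, b, X_lab, y_lab, X_unl, y_unl, C, C_prime, p=2):
    """Objective (1): regularizer plus weighted losses over the labeled examples
    and the (tentatively labeled) unlabeled examples."""
    o_lab = X_lab @ w + b   # real-valued outputs on labeled points
    o_unl = X_unl @ w + b   # real-valued outputs on unlabeled points
    return (0.5 * np.dot(w, w)
            + C * hinge(y_lab, o_lab, p).sum()
            + C_prime * hinge(y_unl, o_unl, p).sum())
```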
8 The loss over labeled and unlabeled examples is weighted by two hyperparameters, C and C', which reflect confidence in the labels and in the cluster assumption respectively. [sent-71, score-0.606]
9 The balance constraint reads (1/u) ∑_{i=l+1}^{n} (1 + y_i)/2 = r   (3). This constraint helps in avoiding unbalanced solutions by enforcing that a certain user-specified fraction, r, of the unlabeled data should be assigned to the positive class. [sent-74, score-0.464]
10 Since the true class ratio is unknown for the unlabeled data, r is estimated from the class ratio on the labeled set, or from prior knowledge about the classification problem. [sent-76, score-0.474]
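A small hedged sketch of how r can be taken from the labeled class ratio and how far a candidate labeling is from satisfying the balance constraint (3); the helper names are illustrative only:

```python
import numpy as np

def estimate_r(y_lab):
    """Estimate the target positive fraction r from the labeled class ratio."""
    return float(np.mean(y_lab == 1))

def balance_gap(y_unl, r):
    """Signed violation of constraint (3): positive fraction among the current
    unlabeled labels minus the target fraction r (zero when satisfied)."""
    return float(np.mean(y_unl == 1)) - r
```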
11 Therefore, the optimal y_u is simply given by the signs of o_i = w·x_i + b. [sent-91, score-0.204]
12 Eliminating y_u in this manner gives a continuous objective function over (w, b): (1/2)||w||^2 + C ∑_{i=1}^{l} max(0, 1 − y_i o_i)^2 + C' ∑_{i=l+1}^{n} max(0, 1 − |o_i|)^2. [sent-92, score-0.295]
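Since the optimal label of each unlabeled point is the sign of its output, the unlabeled part collapses to the symmetric hinge max(0, 1 − |o|)². A sketch of the resulting objective over (w, b) alone, reusing the assumed names from the previous snippet:

```python
import numpy as np

def effective_unlabeled_loss(o):
    """max(0, 1 - |o|)^2: loss left on an unlabeled point after eliminating its label."""
    return np.maximum(0.0, 1.0 - np.abs(o)) ** 2

def s3vm_continuous_objective(w, b, X_lab, y_lab, X_unl, C, C_prime):
    """Continuous objective over (w, b) only, with quadratic hinge on labeled points."""
    o_lab = X_lab @ w + b
    o_unl = X_unl @ w + b
    return (0.5 * np.dot(w, w)
            + C * (np.maximum(0.0, 1.0 - y_lab * o_lab) ** 2).sum()
            + C_prime * effective_unlabeled_loss(o_unl).sum())
```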
13 The last term (see Figure 2) drives the decision boundary, that is, the zero output contour, away from unlabeled points. [sent-95, score-0.406]
14 Figure 2: The effective loss max(0, 1 − |o|)^2 is shown as a function of o = w·x + b, the real-valued output at an unlabeled point x. [sent-104, score-0.448]
15 Combinatorial Optimization. We now discuss combinatorial techniques in which the labels y_u of the unlabeled points are explicit optimization variables. [sent-110, score-0.551]
16 BB organizes subsets of solutions into a binary tree (Figure 3) where nodes are associated with a fixed partial labeling of the unlabeled data set and the two children correspond to the labeling of some new unlabeled point. [sent-130, score-0.891]
17 The effectiveness of BB depends on the following design issues: (1) the lower bound at a node and (2) the sequence of unlabeled examples to branch on. [sent-134, score-0.406]
18 (2006c) use a labeling-confidence criterion to choose an unlabeled example and a label on which to branch. [sent-140, score-0.406]
19 The vector yu is initialized as the labeling given by an SVM trained on the labeled set, thresholding outputs so that u × r unlabeled examples are positive. [sent-151, score-0.575]
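A minimal sketch of that initialization step (an assumed helper, not taken from the paper): label the fraction r of unlabeled points with the largest SVM outputs as positive and the rest as negative.

```python
import numpy as np

def threshold_labels(outputs_unl, r):
    """Assign +1 to the u*r unlabeled points with largest real-valued outputs, -1 to the rest."""
    u = len(outputs_unl)
    n_pos = int(round(r * u))
    y = -np.ones(u)
    y[np.argsort(-np.asarray(outputs_unl))[:n_pos]] = 1.0
    return y
```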
20 Consider an iteration of the algorithm where ỹ_u is the temporary labeling of the unlabeled data, and let (w̃, b̃) = argmin_{w,b} I(w, b, ỹ_u) and J(ỹ_u) = I(w̃, b̃, ỹ_u). [sent-153, score-0.649]
21 Suppose a pair of unlabeled examples indexed by (i, j) satisfies the following condition: y_i = 1, y_j = −1, and V(1, o_i) + V(−1, o_j) > V(−1, o_i) + V(1, o_j)   (6), where o_i, o_j are the outputs of (w̃, b̃) on the examples x_i, x_j. [sent-154, score-0.804]
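A hedged sketch of the switching test (6), assuming the quadratic hinge V from the earlier snippet and outputs o_i, o_j of the current classifier; note that swapping a (+1, −1) pair leaves the balance constraint untouched:

```python
import numpy as np

def hinge2(y, o):
    """Quadratic hinge V(y, o) = max(0, 1 - y*o)^2."""
    return np.maximum(0.0, 1.0 - y * o) ** 2

def switch_improves(o_i, o_j):
    """Condition (6): True if swapping the labels of a pair currently labeled
    (y_i, y_j) = (+1, -1) strictly decreases the loss on the two points."""
    return hinge2(1.0, o_i) + hinge2(-1.0, o_j) > hinge2(-1.0, o_i) + hinge2(1.0, o_j)
```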
22 Since C' controls the non-convex part of the objective function (4), this annealing loop can be interpreted as implementing a "smoothing" heuristic as a means to protect the algorithm from sub-optimal local minima. [sent-159, score-0.253]
23 3 Deterministic Annealing S3 VM Deterministic annealing (DA) is a global optimization heuristic that has been used to approach hard combinatorial or non-convex problems. [sent-162, score-0.238]
24 As proposed in 2006, it consists of relaxing the discrete label variables y_u to real-valued probability variables p_u = (p_{l+1}, ..., p_n). [sent-164, score-0.197]
25 The following objective function is now considered: I(w, b, p_u) = E[I(w, b, y_u)] = (1/2)||w||^2 + C ∑_{i=1}^{l} V(y_i, o_i) + C' ∑_{i=l+1}^{n} [p_i V(1, o_i) + (1 − p_i) V(−1, o_i)]   (7). [sent-168, score-0.648]
26 Note that in this SVM training, the loss terms associated with the (originally) labeled and the (currently labeled) unlabeled examples are weighted by C and C' respectively. [sent-169, score-0.516]
27 Note that at optimality with respect to p_u, each p_i must concentrate all its mass on y_i = sign(w·x_i + b), which leads to the smaller of the two losses V(1, o_i) and V(−1, o_i). [sent-180, score-0.451]
28 In DA, an additional entropy term −H(p_u) is added to the objective: I(w, b, p_u; T) = I(w, b, p_u) − T H(p_u), where H(p_u) = −∑_i [p_i log p_i + (1 − p_i) log(1 − p_i)] and T ≥ 0 is usually referred to as the 'temperature'. [sent-182, score-0.412]
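A minimal sketch of I(w, b, p_u; T) with the entropy term; p_u is a vector of probabilities in (0, 1), the loss is the quadratic hinge, and all names are assumptions of the sketch:

```python
import numpy as np

def binary_entropy(p, eps=1e-12):
    """H(p_u) = -sum_i [p_i log p_i + (1 - p_i) log(1 - p_i)]."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.sum(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def da_objective(w, b, X_lab, y_lab, X_unl, p_u, C, C_prime, T):
    """I(w, b, p_u; T) = I(w, b, p_u) - T * H(p_u), with expected loss on unlabeled points."""
    V = lambda y, o: np.maximum(0.0, 1.0 - y * o) ** 2
    o_lab = X_lab @ w + b
    o_unl = X_unl @ w + b
    expected_unl = p_u * V(1.0, o_unl) + (1.0 - p_u) * V(-1.0, o_unl)
    return (0.5 * np.dot(w, w)
            + C * V(y_lab, o_lab).sum()
            + C_prime * expected_unl.sum()
            - T * binary_entropy(p_u))
```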
29 The balance constraint becomes (1/u) ∑_{i=l+1}^{n} p_i = r. Note that when T = 0, I reduces to (7) and the optimal p_u identifies the optimal y_u. [sent-184, score-0.197]
30 Keeping p_u fixed, the minimization over (w, b) is standard SVM training: each unlabeled example contributes two loss terms weighted by C' p_i and C' (1 − p_i). [sent-191, score-0.683]
31 This leads to: p_i = 1 / (1 + e^{(g_i − ν)/T})   (8), where g_i = C' [V(1, o_i) − V(−1, o_i)] and ν, the Lagrange multiplier associated with the balance constraint, is obtained by solving the root-finding problem that arises by plugging (8) back into the balance constraint. [sent-193, score-0.338]
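A sketch of update (8): each p_i is a logistic function of (g_i − ν)/T and ν is set so that the mean of the p_i equals r. The bisection solver below is my choice of root finder, not necessarily the paper's:

```python
import numpy as np

def p_of_nu(g, nu, T):
    """Equation (8): p_i = 1 / (1 + exp((g_i - nu) / T))."""
    return 1.0 / (1.0 + np.exp(np.clip((g - nu) / T, -500.0, 500.0)))

def update_p(g, T, r, iters=100):
    """Find nu by bisection so that mean(p) = r (mean(p) is increasing in nu)."""
    lo, hi = g.min() - 50.0 * T, g.max() + 50.0 * T   # brackets where mean(p) is ~0 and ~1
    for _ in range(iters):
        nu = 0.5 * (lo + hi)
        if p_of_nu(g, nu, T).mean() < r:
            lo = nu
        else:
            hi = nu
    return p_of_nu(g, 0.5 * (lo + hi), T)
```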
32 Gradient Methods: An alternative possibility is to substitute the optimal p_u (8) as a function of (w, b) and obtain an objective function over (w, b) for which gradient techniques can be used: S(w, b) := min_{p_u} I(w, b, p_u; T). [sent-197, score-0.354]
33 Figure 4 shows the effective loss terms in S associated with an unlabeled example for various values of T . [sent-208, score-0.448]
34 while H(p_u^T) > ε do: solve (w_T, b_T, p_u^T) = argmin_{(w,b), p_u} I(w, b, p_u; T) subject to (1/u) ∑_{i=l+1}^{n} p_i = r (find a local minimum starting from the previous solution; alternating optimization or gradient methods can be used). [sent-220, score-0.306]
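A schematic version of this outer DA loop (the inner solver is left abstract; solve_inner, the initial temperature, the cooling factor R and the threshold eps are all placeholders of the sketch):

```python
import numpy as np

def deterministic_annealing(solve_inner, p_init, T_init=10.0, R=1.5, eps=1e-6):
    """Outer DA loop: minimize at the current temperature, warm-start from the
    previous solution, and cool T until the entropy of p_u is small.

    solve_inner(p_u, T) is assumed to return (w, b, p_u) satisfying the balance
    constraint (alternating optimization or gradient methods can be used inside)."""
    T, p_u, w, b = T_init, p_init, None, None
    while True:
        w, b, p_u = solve_inner(p_u, T)
        p = np.clip(p_u, 1e-12, 1.0 - 1e-12)
        H = -np.sum(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))
        if H <= eps:
            break
        T /= R
    return w, b, p_u
```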
35 Figure 4: DA parameterizes a family of loss functions (over unlabeled examples) where the degree of non-convexity is controlled by T. [sent-232, score-0.448]
36 DA is the original algorithm (alternating optimization on p_u and w). [sent-258, score-0.176]
37 ∇DA is a direct gradient optimization on (w, b), where pu should be considered as a function of (w, b). [sent-259, score-0.206]
38 optimization problem of SVMs: min_{(w,b), y_u} (1/2)||w||^2 + C ∑_{i=1}^{l} ξ_i^2 + C' ∑_{i=l+1}^{n} ξ_i^2, subject to y_i o_i ≥ 1 − ξ_i for i = 1, ..., n. [sent-262, score-0.202]
39 The labels of the unlabeled points are estimated from Γ (through one of its columns or its largest eigenvector). [sent-280, score-0.474]
40 Continuous Optimization. In this section we consider methods which do not include y_u, the labels of unlabeled examples, as optimization variables, but instead solve suitably modified versions of (5) by continuous optimization techniques. [sent-286, score-0.549]
41 For a given r, an easy way of enforcing (12) is to translate all the points such that the mean of the unlabeled points is the origin, that is, ∑_{i=l+1}^{n} x_i = 0. [sent-292, score-0.476]
42 In addition to being easy to implement, this linear constraint may also add some robustness against uncertainty about the true unknown class ratio in the unlabeled set (see also Chen et al.). [sent-295, score-0.445]
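A minimal sketch of the translation trick (the constraint (12) itself is not reproduced in this dump, so only the centering step is shown):

```python
import numpy as np

def center_on_unlabeled_mean(X_lab, X_unl):
    """Translate all points so that the mean of the unlabeled points is the origin."""
    mu = X_unl.mean(axis=0)
    return X_lab - mu, X_unl - mu
```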
43 While most of the methods of Section 3 can use a standard SVM solver as a subroutine, the methods of this section need to solve (5) with a non-convex loss function over unlabeled examples. [sent-304, score-0.448]
44 Splitting the last non-convex term corresponding to the unlabeled part as the sum of a convex and a concave function, we have: max(0, 1 − |t|)^2 = [max(0, 1 − |t|)^2 + 2|t|] + [−2|t|], where the first bracketed term is convex and the second is concave. [sent-333, score-0.447]
45 If an unlabeled point is currently classified positive, then at the next iteration, the effective (convex) loss on this point will be L̃(t) = 0 if t ≥ 1, (1 − t)^2 if |t| < 1, and −4t if t ≤ −1. [sent-334, score-0.489]
46 A corresponding L̃ can be defined for the case of an unlabeled point being classified negative. [sent-335, score-0.406]
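A sketch of the effective convex loss used at one CCCP iteration, following the piecewise form above; current_sign is the sign of the point's current classification, and the negative case is taken to be the mirror image of the positive one (an assumption consistent with the symmetry of the original loss):

```python
import numpy as np

def cccp_effective_loss(t, current_sign=1):
    """Convex surrogate for max(0, 1 - |t|)^2 after linearizing the concave part -2|t|.
    For current_sign = +1: 0 if t >= 1, (1 - t)^2 if |t| < 1, -4t if t <= -1.
    The negative case is obtained by replacing t with -t."""
    s = current_sign * np.asarray(t, dtype=float)
    return np.where(s >= 1.0, 0.0,
                    np.where(np.abs(s) < 1.0, (1.0 - s) ** 2, -4.0 * s))
```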
47 As for S3VMlight, ∇S3VM performs annealing in an outer loop on C'. [sent-349, score-0.163]
48 Figure 5: The loss function on the unlabeled points, t → max(0, 1 − |t|)^2, is replaced by a differentiable approximation t → exp(−5t^2). [sent-366, score-0.473]
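A sketch of this differentiable surrogate and its derivative, as a gradient-based method would use them; the factor 5 matches the figure caption, the rest is an assumption of the sketch:

```python
import numpy as np

def smooth_unlabeled_loss(t, s=5.0):
    """Differentiable approximation exp(-s * t^2) of max(0, 1 - |t|)^2."""
    return np.exp(-s * t * t)

def smooth_unlabeled_loss_grad(t, s=5.0):
    """Derivative d/dt exp(-s t^2) = -2 s t exp(-s t^2), used in gradient descent."""
    return -2.0 * s * t * np.exp(-s * t * t)
```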
49 unlabeled points as shown in Figure 5), but the method used for annealing is different. [sent-367, score-0.574]
50 It is noteworthy that since the loss for the unlabeled points is bounded, its convolution with a Gaussian of infinite width tends to the zero function. [sent-384, score-0.473]
51 In other words, with enough smoothing, the unlabeled part of the objective function vanishes and the optimization is identical to a standard SVM. [sent-385, score-0.528]
52 3 is that their complexity scales as O(n^3) because they employ an unlabeled loss function that does not have a linear part, for example, see (14). [sent-392, score-0.448]
53 In this subsection we propose a new loss function for unlabeled points and an associated Newton method (along the lines of Chapelle, 2007) which brings down the O(n^3) complexity of the ∇S3VM method. [sent-395, score-0.473]
54 Take for instance a Gaussian kernel and consider an unlabeled point far from the labeled points. [sent-465, score-0.497]
55 This unlabeled point does not contribute to pushing the decision boundary one way or the other. [sent-467, score-0.406]
56 This seems like a satisfactory behavior: it is better to wait to have more information before taking a decision on an unlabeled point for which we are unsure. [sent-468, score-0.406]
57 Left: error rates on the unlabeled set; right: average training time in seconds for one split and one binary classifier training (with annealing and dichotomic search on the threshold as explained in the experimental section). [sent-492, score-0.625]
58 S3 VMs were originally motivated by Transductive learning, the problem of estimating labels of unlabeled examples without necessarily producing a decision function over the entire input space. [sent-505, score-0.449]
59 5, we run S3 VM algorithms in an inductive mode and analyze performance differences between unlabeled and test examples. [sent-508, score-0.424]
60 Such a comparison would require, say, cross-validation over multiple hyperparameters, randomization over choices of labels, dichotomic search to neutralize balance constraint differences and handling different choices of annealing sequences. [sent-514, score-0.271]
61 For this data set, the labeled points are fixed and new unlabeled points are randomly generated for each repetition of the experiment. [sent-536, score-0.524]
62 They are found with cross-validation by training an inductive SVM on the entire data set (the unlabeled points being assigned their real label). [sent-541, score-0.449]
63 Balancing constraint: For the sake of simplicity, we set r in the balancing constraint (3) to the true ratio of the positive points in the unlabeled set. [sent-556, score-0.543]
64 For a given data set, we randomly split the data into a labeled set and an unlabeled set. [sent-559, score-0.497]
65 We refer to the error rate on the unlabeled set as the unlabeled error to differentiate it from the test error which would be computed on an unseen test set. [sent-560, score-0.834]
66 The difference between unlabeled and test performance is discussed in Section 5. [sent-562, score-0.406]
67 3 Suitability of the S3 VM Objective Function Table 1 shows unlabeled error rates for common S3 VM implementations on two small data sets Coil3 and 2moons. [sent-565, score-0.433]
68 Alternatively, one could choose C' so that the labeled and unlabeled examples contribute equally to the objective. [sent-569, score-0.474]
69 Table 6 records the rank correlation between unlabeled error and objective function. [sent-575, score-0.478]
70 For each split, we take 10 different solutions and compute the associated unlabeled error and objective value. [sent-577, score-0.497]
71 A standard SVM is trained on the original labeled set and a fraction of the unlabeled set (with their true labels). [sent-580, score-0.474]
72 The labels of the remaining unlabeled points are assigned through a thresholding of the real-valued outputs of the SVM. [sent-582, score-0.474]
73 Table 6 provides evidence that the unlabeled error is correlated with the objective values. [sent-586, score-0.478]
74 Table 6: Kendall's rank correlation (Abdi, 2006) between the unlabeled error and the objective function, averaged over the 10 splits (see text for details). [sent-594, score-0.513]
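For reference, this statistic can be computed with SciPy's kendalltau; the two arrays below are purely illustrative placeholders for the ten (unlabeled error, objective value) pairs collected on one split:

```python
import numpy as np
from scipy.stats import kendalltau

unl_errors = np.array([0.21, 0.18, 0.25, 0.19, 0.30, 0.22, 0.17, 0.28, 0.24, 0.20])  # placeholder values
obj_values = np.array([105.0, 98.0, 120.0, 101.0, 140.0, 108.0, 95.0, 130.0, 118.0, 103.0])  # placeholder values
tau, p_value = kendalltau(unl_errors, obj_values)  # rank correlation between error and objective
print(f"Kendall tau = {tau:.2f}")
```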
75 However, as we will see in the next section, this does not necessarily translate into lower unlabeled errors. [sent-602, score-0.406]
76 2 QUALITY OF GENERALIZATION. Table 8 reports the unlabeled errors of the different algorithms on our benchmark data sets. [sent-614, score-0.406]
77 However, the unlabeled errors on the other data sets, Uspst, Isolet, Coil20, Coil3, 2moons (see also Table 1) are poor and sometimes even worse than a standard SVM. [sent-616, score-0.406]
78 The next-to-last column is an SVM trained only on the labeled data, while the last column reports 5-fold cross-validation results of an SVM trained on the whole data set using the labels of the unlabeled points. [sent-658, score-0.517]
79 Due to computational constraints, instead of setting the three hyperparameters (C, C', and the Gaussian kernel width σ) by cross-validation for each of the methods, we explored the influence of these hyperparameters on one split of the data. [sent-676, score-0.2]
80 By measuring the variation of the unlabeled errors with respect to the choice of the hyperparameters, Table 10 records an indication of the sensitivity of the method with respect to that choice. [sent-721, score-0.406]
81 Table 10: The variance of the unlabeled errors has been averaged over the 27 possible hyperparameter settings (cf. [sent-729, score-0.483]
82 , gradually decreasing T in DA or increasing C' in S3VMlight in an outer loop) where the role of the unlabeled points is progressively increased. [sent-781, score-0.431]
83 The starting point is usually chosen in such a way that the unlabeled points have very little influence and the problem is thus almost convex. [sent-783, score-0.431]
84 More precisely, we have noticed that the number of steps does not matter as much as the minimization around a “critical” value of the annealing parameter; if that point is missed, then the results can be bad. [sent-800, score-0.192]
85 Figure 9: Influence of the number of annealing steps on the first split of Text. [sent-805, score-0.206]
86 • DA seems to be the method that relies most on a relatively slow annealing schedule, while ∇DA can use a faster one. [sent-822, score-0.286]
87 This is because (a) we took into account the knowledge of the true class ratio among the unlabeled examples, and (b) we enforced the constraint (3) exactly (with the dichotomic search described at the beginning of Section 4). [sent-830, score-0.516]
88 5 Transductive Versus Semi-supervised Learning. S3VMs were introduced as Transductive SVMs, originally designed for the task of directly estimating labels of unlabeled points. [sent-882, score-0.449]
89 We expect S3VMs to perform equally well on the unlabeled set and on an unseen test set. [sent-887, score-0.428]
90 To test this hypothesis, we took out 25% of the unlabeled set and used it as an unseen test set. [sent-888, score-0.428]
91 Based on 10 splits, the error rate was found to be better on the unlabeled set than on the test set. [sent-895, score-0.406]
92 Indeed, because of small-sample effects, the unlabeled and test sets could appear not to come from the same distribution (this is even more likely in high dimensions). [sent-897, score-0.406]
93 We considered the g50c data set and biased the split between unlabeled and test set such that there is an angle between the principal directions of the two sets. [sent-898, score-0.451]
94 Figure 10: The test set of g50c has been chosen with a bias such that there is an angle between the principal directions of the unlabeled and test sets. [sent-918, score-0.428]
95 The figure shows the difference in error rates (negative means better error rate on the unlabeled set) as a function of the angle for different random (biased) splits. [sent-919, score-0.428]
96 Note that in practice it is not possible to do semi-supervised one-vs-one multiclass training because the labels of the unlabeled points are truly unknown. [sent-930, score-0.493]
97 First, all methods use annealing and so the complexity depends on the number of annealing steps. [sent-1030, score-0.286]
98 Methods whose complexity is of the same order as that of a standard SVM trained with the predicted labels of the unlabeled points, which is O(n_sv^3 + n^2), where n_sv is the number of support vector (sv) points which lie in a non-linear part of the loss function. [sent-1033, score-0.643]
99 Ultimately a semi-supervised learning algorithm should be able to handle data sets with millions of unlabeled points. [sent-1049, score-0.406]
100 Learning from labeled and unlabeled data with label propagation. [sent-1285, score-0.474]
wordName wordTfidf (topN-words)
[('vm', 0.426), ('unlabeled', 0.406), ('da', 0.256), ('vmlight', 0.239), ('chapelle', 0.198), ('vms', 0.188), ('cccp', 0.173), ('annealing', 0.143), ('isolet', 0.142), ('achines', 0.133), ('echniques', 0.133), ('emi', 0.133), ('hapelle', 0.133), ('indhwani', 0.133), ('upervised', 0.133), ('sindhwani', 0.128), ('pu', 0.126), ('uspst', 0.12), ('oi', 0.113), ('ptimization', 0.113), ('eerthi', 0.113), ('ector', 0.101), ('upport', 0.101), ('newton', 0.097), ('zien', 0.092), ('collobert', 0.08), ('hyperparameters', 0.077), ('objective', 0.072), ('yu', 0.071), ('kdk', 0.071), ('transductive', 0.069), ('labeled', 0.068), ('nsv', 0.068), ('continuation', 0.06), ('sv', 0.059), ('svm', 0.057), ('bb', 0.054), ('dichotomic', 0.053), ('asv', 0.053), ('optimization', 0.05), ('ki', 0.048), ('dii', 0.047), ('cluster', 0.047), ('ksv', 0.044), ('labels', 0.043), ('table', 0.043), ('minima', 0.042), ('loss', 0.042), ('pi', 0.04), ('yi', 0.039), ('constraint', 0.039), ('lapsvm', 0.038), ('balance', 0.036), ('text', 0.035), ('balancing', 0.034), ('cholesky', 0.034), ('gradient', 0.03), ('labeling', 0.03), ('minimization', 0.029), ('bie', 0.029), ('joachims', 0.028), ('combinatorial', 0.027), ('implementations', 0.027), ('kg', 0.027), ('fung', 0.027), ('dsv', 0.027), ('smoothing', 0.027), ('points', 0.025), ('yahoo', 0.025), ('globally', 0.024), ('hybrid', 0.023), ('kernel', 0.023), ('split', 0.023), ('transduction', 0.023), ('demiriz', 0.023), ('concave', 0.022), ('unseen', 0.022), ('angle', 0.022), ('bennett', 0.022), ('hyperparameter', 0.022), ('enforce', 0.021), ('manifold', 0.02), ('mangasarian', 0.02), ('loop', 0.02), ('steps', 0.02), ('xi', 0.02), ('hinge', 0.019), ('multiclass', 0.019), ('solutions', 0.019), ('convex', 0.019), ('local', 0.018), ('inductive', 0.018), ('support', 0.018), ('beginning', 0.018), ('global', 0.018), ('astorino', 0.018), ('blvd', 0.018), ('cole', 0.018), ('excel', 0.018), ('fcave', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
Author: Olivier Chapelle, Vikas Sindhwani, Sathiya S. Keerthi
Abstract: Due to its wide applicability, the problem of semi-supervised classification is attracting increasing attention in machine learning. Semi-Supervised Support Vector Machines (S 3 VMs) are based on applying the margin maximization principle to both labeled and unlabeled examples. Unlike SVMs, their formulation leads to a non-convex optimization problem. A suite of algorithms have recently been proposed for solving S3 VMs. This paper reviews key ideas in this literature. The performance and behavior of various S3 VM algorithms is studied together, under a common experimental setting. Keywords: semi-supervised learning, support vector machines, non-convex optimization, transductive learning
2 0.081664413 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
Author: Matthias W. Seeger
Abstract: We propose a highly efficient framework for penalized likelihood kernel methods applied to multiclass models with a large, structured set of classes. As opposed to many previous approaches which try to decompose the fitting problem into many smaller ones, we focus on a Newton optimization of the complete model, making use of model structure and linear conjugate gradients in order to approximate Newton search directions. Crucially, our learning method is based entirely on matrix-vector multiplication primitives with the kernel matrices and their derivatives, allowing straightforward specialization to new kernels, and focusing code optimization efforts to these primitives only. Kernel parameters are learned automatically, by maximizing the cross-validation log likelihood in a gradient-based way, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical structure on thousands of classes, achieving state-of-the-art results in an order of magnitude less time than previous work. Parts of this work appeared in the conference paper Seeger (2007). Keywords: multi-way classification, kernel logistic regression, hierarchical classification, cross validation optimization, Newton-Raphson optimization
3 0.078567103 91 jmlr-2008-Trust Region Newton Method for Logistic Regression
Author: Chih-Jen Lin, Ruby C. Weng, S. Sathiya Keerthi
Abstract: Large-scale logistic regression arises in many applications such as document classification and natural language processing. In this paper, we apply a trust region Newton method to maximize the log-likelihood of the logistic regression model. The proposed method uses only approximate Newton steps in the beginning, but achieves fast convergence in the end. Experiments show that it is faster than the commonly used quasi Newton approach for logistic regression. We also extend the proposed method to large-scale L2-loss linear support vector machines (SVM). Keywords: logistic regression, newton method, trust region, conjugate gradient, support vector machines
4 0.075606838 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
Author: Arthur D. Szlam, Mauro Maggioni, Ronald R. Coifman
Abstract: Harmonic analysis and diffusion on discrete data has been shown to lead to state-of-the-art algorithms for machine learning tasks, especially in the context of semi-supervised and transductive learning. The success of these algorithms rests on the assumption that the function(s) to be studied (learned, interpolated, etc.) are smooth with respect to the geometry of the data. In this paper we present a method for modifying the given geometry so the function(s) to be studied are smoother with respect to the modified geometry, and thus more amenable to treatment using harmonic analysis methods. Among the many possible applications, we consider the problems of image denoising and transductive classification. In both settings, our approach improves on standard diffusion based methods. Keywords: diffusion processes, diffusion geometry, spectral graph theory, image denoising, transductive learning, semi-supervised learning
5 0.059377775 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine
Author: Ja-Yong Koo, Yoonkyung Lee, Yuwon Kim, Changyi Park
Abstract: The support vector machine has been successful in a variety of applications. Also on the theoretical front, statistical properties of the support vector machine have been studied quite extensively with a particular attention to its Bayes risk consistency under some conditions. In this paper, we study somewhat basic statistical properties of the support vector machine yet to be investigated, namely the asymptotic behavior of the coefficients of the linear support vector machine. A Bahadur type representation of the coefficients is established under appropriate conditions, and their asymptotic normality and statistical variability are derived on the basis of the representation. These asymptotic results do not only help further our understanding of the support vector machine, but also they can be useful for related statistical inferences. Keywords: asymptotic normality, Bahadur representation, classification, convexity lemma, Radon transform
6 0.05281505 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
7 0.051835578 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
8 0.046202589 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
9 0.044865213 86 jmlr-2008-SimpleMKL
10 0.042334896 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
11 0.0397585 28 jmlr-2008-Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines
12 0.038606483 56 jmlr-2008-Magic Moments for Structured Output Prediction
13 0.037692301 96 jmlr-2008-Visualizing Data using t-SNE
14 0.036793001 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
15 0.036313485 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection
16 0.033074465 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
17 0.032576606 58 jmlr-2008-Max-margin Classification of Data with Absent Features
18 0.031366836 52 jmlr-2008-Learning from Multiple Sources
19 0.02995163 15 jmlr-2008-An Information Criterion for Variable Selection in Support Vector Machines (Special Topic on Model Selection)
20 0.029360576 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
topicId topicWeight
[(0, 0.173), (1, -0.072), (2, 0.062), (3, 0.125), (4, 0.034), (5, -0.017), (6, 0.022), (7, 0.053), (8, 0.078), (9, -0.009), (10, -0.069), (11, 0.072), (12, -0.0), (13, 0.027), (14, 0.036), (15, 0.057), (16, 0.016), (17, -0.022), (18, 0.067), (19, 0.131), (20, -0.133), (21, -0.037), (22, -0.023), (23, 0.085), (24, 0.02), (25, -0.005), (26, 0.001), (27, 0.056), (28, -0.334), (29, -0.194), (30, -0.023), (31, 0.262), (32, 0.085), (33, -0.015), (34, 0.166), (35, -0.262), (36, 0.188), (37, -0.223), (38, 0.123), (39, 0.123), (40, 0.027), (41, -0.165), (42, -0.115), (43, -0.148), (44, -0.062), (45, -0.095), (46, -0.23), (47, -0.109), (48, -0.12), (49, -0.07)]
simIndex simValue paperId paperTitle
same-paper 1 0.93944776 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
Author: Olivier Chapelle, Vikas Sindhwani, Sathiya S. Keerthi
Abstract: Due to its wide applicability, the problem of semi-supervised classification is attracting increasing attention in machine learning. Semi-Supervised Support Vector Machines (S 3 VMs) are based on applying the margin maximization principle to both labeled and unlabeled examples. Unlike SVMs, their formulation leads to a non-convex optimization problem. A suite of algorithms have recently been proposed for solving S3 VMs. This paper reviews key ideas in this literature. The performance and behavior of various S3 VM algorithms is studied together, under a common experimental setting. Keywords: semi-supervised learning, support vector machines, non-convex optimization, transductive learning
2 0.42612499 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
Author: Arthur D. Szlam, Mauro Maggioni, Ronald R. Coifman
Abstract: Harmonic analysis and diffusion on discrete data has been shown to lead to state-of-the-art algorithms for machine learning tasks, especially in the context of semi-supervised and transductive learning. The success of these algorithms rests on the assumption that the function(s) to be studied (learned, interpolated, etc.) are smooth with respect to the geometry of the data. In this paper we present a method for modifying the given geometry so the function(s) to be studied are smoother with respect to the modified geometry, and thus more amenable to treatment using harmonic analysis methods. Among the many possible applications, we consider the problems of image denoising and transductive classification. In both settings, our approach improves on standard diffusion based methods. Keywords: diffusion processes, diffusion geometry, spectral graph theory, image denoising, transductive learning, semi-supervised learning
3 0.30135036 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
Author: Matthias W. Seeger
Abstract: We propose a highly efficient framework for penalized likelihood kernel methods applied to multiclass models with a large, structured set of classes. As opposed to many previous approaches which try to decompose the fitting problem into many smaller ones, we focus on a Newton optimization of the complete model, making use of model structure and linear conjugate gradients in order to approximate Newton search directions. Crucially, our learning method is based entirely on matrix-vector multiplication primitives with the kernel matrices and their derivatives, allowing straightforward specialization to new kernels, and focusing code optimization efforts to these primitives only. Kernel parameters are learned automatically, by maximizing the cross-validation log likelihood in a gradient-based way, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical structure on thousands of classes, achieving state-of-the-art results in an order of magnitude less time than previous work. Parts of this work appeared in the conference paper Seeger (2007). Keywords: multi-way classification, kernel logistic regression, hierarchical classification, cross validation optimization, Newton-Raphson optimization
4 0.28985342 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine
Author: Ja-Yong Koo, Yoonkyung Lee, Yuwon Kim, Changyi Park
Abstract: The support vector machine has been successful in a variety of applications. Also on the theoretical front, statistical properties of the support vector machine have been studied quite extensively with a particular attention to its Bayes risk consistency under some conditions. In this paper, we study somewhat basic statistical properties of the support vector machine yet to be investigated, namely the asymptotic behavior of the coefficients of the linear support vector machine. A Bahadur type representation of the coefficients is established under appropriate conditions, and their asymptotic normality and statistical variability are derived on the basis of the representation. These asymptotic results do not only help further our understanding of the support vector machine, but also they can be useful for related statistical inferences. Keywords: asymptotic normality, Bahadur representation, classification, convexity lemma, Radon transform
5 0.21840943 91 jmlr-2008-Trust Region Newton Method for Logistic Regression
Author: Chih-Jen Lin, Ruby C. Weng, S. Sathiya Keerthi
Abstract: Large-scale logistic regression arises in many applications such as document classification and natural language processing. In this paper, we apply a trust region Newton method to maximize the log-likelihood of the logistic regression model. The proposed method uses only approximate Newton steps in the beginning, but achieves fast convergence in the end. Experiments show that it is faster than the commonly used quasi Newton approach for logistic regression. We also extend the proposed method to large-scale L2-loss linear support vector machines (SVM). Keywords: logistic regression, newton method, trust region, conjugate gradient, support vector machines
6 0.19750267 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection
7 0.18828765 86 jmlr-2008-SimpleMKL
8 0.18778467 56 jmlr-2008-Magic Moments for Structured Output Prediction
9 0.1825999 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
10 0.17619327 85 jmlr-2008-Shark (Machine Learning Open Source Software Paper)
11 0.17343313 80 jmlr-2008-Ranking Individuals by Group Comparisons
13 0.16764762 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
14 0.1534535 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
15 0.15340568 52 jmlr-2008-Learning from Multiple Sources
16 0.14879175 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
17 0.14831752 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
18 0.13902247 28 jmlr-2008-Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines
19 0.13663746 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
20 0.13503884 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
topicId topicWeight
[(0, 0.03), (5, 0.468), (31, 0.014), (40, 0.038), (54, 0.027), (58, 0.049), (66, 0.043), (76, 0.017), (88, 0.073), (92, 0.035), (94, 0.075), (99, 0.034)]
simIndex simValue paperId paperTitle
1 0.91253036 87 jmlr-2008-Stationary Features and Cat Detection
Author: François Fleuret, Donald Geman
Abstract: Most discriminative techniques for detecting instances from object categories in still images consist of looping over a partition of a pose space with dedicated binary classifiers. The efficiency of this strategy for a complex pose, that is, for fine-grained descriptions, can be assessed by measuring the effect of sample size and pose resolution on accuracy and computation. Two conclusions emerge: (1) fragmenting the training data, which is inevitable in dealing with high in-class variation, severely reduces accuracy; (2) the computational cost at high resolution is prohibitive due to visiting a massive pose partition. To overcome data-fragmentation we propose a novel framework centered on pose-indexed features which assign a response to a pair consisting of an image and a pose, and are designed to be stationary: the probability distribution of the response is always the same if an object is actually present. Such features allow for efficient, one-shot learning of pose-specific classifiers. To avoid expensive scene processing, we arrange these classifiers in a hierarchy based on nested partitions of the pose; as in previous work on coarse-to-fine search, this allows for efficient processing. The hierarchy is then "folded" for training: all the classifiers at each level are derived from one base predictor learned from all the data. The hierarchy is "unfolded" for testing: parsing a scene amounts to examining increasingly finer object descriptions only when there is sufficient evidence for coarser ones. In this way, the detection results are equivalent to an exhaustive search at high resolution. We illustrate these ideas by detecting and localizing cats in highly cluttered greyscale scenes. Keywords: supervised learning, computer vision, image interpretation, cats, stationary features, hierarchical search
same-paper 2 0.83584785 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
Author: Olivier Chapelle, Vikas Sindhwani, Sathiya S. Keerthi
Abstract: Due to its wide applicability, the problem of semi-supervised classification is attracting increasing attention in machine learning. Semi-Supervised Support Vector Machines (S 3 VMs) are based on applying the margin maximization principle to both labeled and unlabeled examples. Unlike SVMs, their formulation leads to a non-convex optimization problem. A suite of algorithms have recently been proposed for solving S3 VMs. This paper reviews key ideas in this literature. The performance and behavior of various S3 VM algorithms is studied together, under a common experimental setting. Keywords: semi-supervised learning, support vector machines, non-convex optimization, transductive learning
Author: Shann-Ching Chen, Geoffrey J. Gordon, Robert F. Murphy
Abstract: In structured classification problems, there is a direct conflict between expressive models and efficient inference: while graphical models such as Markov random fields or factor graphs can represent arbitrary dependences among instance labels, the cost of inference via belief propagation in these models grows rapidly as the graph structure becomes more complicated. One important source of complexity in belief propagation is the need to marginalize large factors to compute messages. This operation takes time exponential in the number of variables in the factor, and can limit the expressiveness of the models we can use. In this paper, we study a new class of potential functions, which we call decomposable k-way potentials, and provide efficient algorithms for computing messages from these potentials during belief propagation. We believe these new potentials provide a good balance between expressive power and efficient inference in practical structured classification problems. We discuss three instances of decomposable potentials: the associative Markov network potential, the nested junction tree, and a new type of potential which we call the voting potential. We use these potentials to classify images of protein subcellular location patterns in groups of cells. Classifying subcellular location patterns can help us answer many important questions in computational biology, including questions about how various treatments affect the synthesis and behavior of proteins and networks of proteins within a cell. Our new representation and algorithm lead to substantial improvements in both inference speed and classification accuracy. Keywords: factor graphs, approximate inference algorithms, structured classification, protein subcellular location patterns, location proteomics
4 0.38344714 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
Author: Matthias W. Seeger
Abstract: The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007). Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing
5 0.36892328 83 jmlr-2008-Robust Submodular Observation Selection
Author: Andreas Krause, H. Brendan McMahan, Carlos Guestrin, Anupam Gupta
Abstract: In many applications, one has to actively select among a set of expensive observations before making an informed decision. For example, in environmental monitoring, we want to select locations to measure in order to most effectively predict spatial phenomena. Often, we want to select observations which are robust against a number of possible objective functions. Examples include minimizing the maximum posterior variance in Gaussian Process regression, robust experimental design, and sensor placement for outbreak detection. In this paper, we present the Submodular Saturation algorithm, a simple and efficient algorithm with strong theoretical approximation guarantees for cases where the possible objective functions exhibit submodularity, an intuitive diminishing returns property. Moreover, we prove that better approximation algorithms do not exist unless NP-complete problems admit efficient algorithms. We show how our algorithm can be extended to handle complex cost functions (incorporating non-unit observation cost or communication and path costs). We also show how the algorithm can be used to near-optimally trade off expected-case (e.g., the Mean Square Prediction Error in Gaussian Process regression) and worst-case (e.g., maximum predictive variance) performance. We show that many important machine learning problems fit our robust submodular observation selection formalism, and provide extensive empirical evaluation on several real-world problems. For Gaussian Process regression, our algorithm compares favorably with state-of-the-art heuristics described in the geostatistics literature, while being simpler, faster and providing theoretical guarantees. For robust experimental design, our algorithm performs favorably compared to SDP-based algorithms. Keywords: observation selection, experimental design, active learning, submodular functions, Gaussian processes
6 0.36214334 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
7 0.35867995 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
8 0.35637715 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
9 0.35603437 86 jmlr-2008-SimpleMKL
10 0.34853965 58 jmlr-2008-Max-margin Classification of Data with Absent Features
11 0.347599 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
12 0.34663594 59 jmlr-2008-Maximal Causes for Non-linear Component Extraction
13 0.34000432 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
14 0.33857688 16 jmlr-2008-Approximations for Binary Gaussian Process Classification
15 0.33774358 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
16 0.32656237 67 jmlr-2008-Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies
17 0.32380554 28 jmlr-2008-Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines
18 0.31993803 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection
19 0.31845403 56 jmlr-2008-Magic Moments for Structured Output Prediction
20 0.31814617 73 jmlr-2008-On the Suitable Domain for SVM Training in Image Coding