jmlr jmlr2007 jmlr2007-58 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Roni Khardon, Gabriel Wachman
Abstract: A large number of variants of the Perceptron algorithm have been proposed and partially evaluated in recent work. One type of algorithm aims for noise tolerance by replacing the last hypothesis of the perceptron with another hypothesis or a vote among hypotheses. Another type simply adds a margin term to the perceptron in order to increase robustness and accuracy, as done in support vector machines. A third type borrows further from support vector machines and constrains the update function of the perceptron in ways that mimic soft-margin techniques. The performance of these algorithms, and the potential for combining different techniques, has not been studied in depth. This paper provides such an experimental study and reveals some interesting facts about the algorithms. In particular the perceptron with margin is an effective method for tolerating noise and stabilizing the algorithm. This is surprising since the margin in itself is not designed or used for noise tolerance, and there are no known guarantees for such performance. In most cases, similar performance is obtained by the voted-perceptron which has the advantage that it does not require parameter selection. Techniques using soft margin ideas are run-time intensive and do not give additional performance benefits. The results also highlight the difficulty with automatic parameter selection which is required with some of these variants. Keywords: perceptron algorithm, on-line learning, noise tolerance, kernel methods
Reference: text
sentIndex sentText sentNum sentScore
1 One type of algorithm aims for noise tolerance by replacing the last hypothesis of the perceptron with another hypothesis or a vote among hypotheses. [sent-7, score-0.572]
2 Another type simply adds a margin term to the perceptron in order to increase robustness and accuracy, as done in support vector machines. [sent-8, score-0.5]
3 A third type borrows further from support vector machines and constrains the update function of the perceptron in ways that mimic soft-margin techniques. [sent-9, score-0.331]
4 In particular the perceptron with margin is an effective method for tolerating noise and stabilizing the algorithm. [sent-12, score-0.543]
5 This is surprising since the margin in itself is not designed or used for noise tolerance, and there are no known guarantees for such performance. [sent-13, score-0.23]
6 Techniques using soft margin ideas are run-time intensive and do not give additional performance benefits. [sent-15, score-0.223]
7 Keywords: perceptron algorithm, on-line learning, noise tolerance, kernel methods. [sent-17, score-0.374]
8 , 1992; Cristianini and Shawe-Taylor, 2000) has led to increasing interest in the perceptron algorithm. [sent-19, score-0.331]
9 Like SVM, the perceptron algorithm has a linear threshold hypothesis and can be used with kernels, but unlike SVM, it is simple and efficient. [sent-20, score-0.381]
10 Several on-line algorithms have been proposed which iteratively construct large margin hypotheses in the feature space, and therefore combine the advantages of large margin hypotheses with the efficiency of the perceptron algorithm. [sent-25, score-0.727]
11 Other variants adapt the on-line algorithms to work in a batch setting, choosing a more robust hypothesis to be used instead of the last hypothesis from the on-line session. [sent-26, score-0.326]
12 The first explicitly uses the idea of hard and soft margin from SVM. [sent-36, score-0.203]
13 The basic perceptron algorithm is mistake driven, that is, it only updates the hypothesis when it makes a mistake on the current example. [sent-37, score-0.535]
14 The perceptron algorithm with margin (Krauth and Mézard, 1987; Li et al. [sent-38, score-0.5]
15 , 2002) forces the hypothesis to have some margin by making updates even when it does not make a mistake but where the margin is too small. [sent-39, score-0.501]
16 Adding to this idea, one can mimic soft-margin versions of support vector machines within the perceptron algorithm that allow it to tolerate noisy data (e. [sent-40, score-0.388]
17 The algorithms that arise from this idea constrain the update function of the perceptron and limit the effect of any single example on the final hypothesis. [sent-45, score-0.331]
18 Each of these performs margin based updates and has other small differences motivated by various considerations. [sent-47, score-0.205]
19 The second family of variants tackles the use of on-line learning algorithms in a batch setting, where one trains the algorithm on a data set and tests its performance on a separate test set. [sent-49, score-0.228]
20 In particular, the longest survivor variant (Kearns et al. [sent-53, score-0.586]
21 The voted perceptron variant (Freund and Schapire, 1999) takes a vote among hypotheses produced during training. [sent-55, score-1.121]
22 In this paper we report on experiments with a large number of such variants that arise when combining some of margin, soft margin, and on-line to batch conversions. [sent-59, score-0.246]
23 The experiments lead to the following conclusions: First, the perceptron with margin is the most successful variant. [sent-61, score-0.525]
24 Second, the soft-margin variants on their own are weaker than the perceptron with margin, and combining soft-margin with the regular margin variant does not provide additional improvements. [sent-63, score-0.679]
25 The third conclusion is that in most cases the voted perceptron performs similarly to the perceptron with margin. [sent-64, score-1.347]
26 The voted perceptron has the advantage that it does not require parameter selection (for the margin) that can be costly in terms of run time. [sent-65, score-1.066]
27 Combining the two to get the voted perceptron with margin has the potential for further improvements but this occasionally degrades performance. [sent-66, score-1.185]
28 Finally, both the voted perceptron and the margin variant reduce the deviation in accuracy in addition to improving the accuracy. [sent-67, score-1.261]
29 1 The Perceptron Algorithm: The perceptron algorithm (Rosenblatt, 1958) takes as input a set of training examples in R^n with labels in {−1, 1}. [sent-83, score-0.35]
30 When the data are linearly separable via some hyperplane (w, θ), the margin is defined as γ = min_{1≤i≤m} y_i(⟨w, x_i⟩ − θ). [sent-94, score-0.241]
31 If the data are linearly separable, and θ is initialized to 0, the perceptron algorithm is guaranteed to converge in at most (R/γ)² iterations (Novikoff, 1962; Cristianini and Shawe-Taylor, 2000), where R = max_{1≤i≤m} ‖x_i‖. [sent-96, score-0.428]
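To make the algorithm described above concrete, the following is a minimal sketch of the primal perceptron as summarized in these sentences (labels in {−1, 1}, w and θ initialized to 0, mistake-driven updates). The learning rate η, the fixed number of passes, and the exact threshold update are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, max_iter=100):
    """Minimal primal perceptron sketch: X is an (m, n) array, y holds labels in {-1, +1}."""
    m, n = X.shape
    w = np.zeros(n)      # weight vector, initialized to 0
    theta = 0.0          # threshold, initialized to 0
    for _ in range(max_iter):
        mistakes = 0
        for j in range(m):
            # mistake-driven: update only when the current example is misclassified
            if y[j] * (np.dot(w, X[j]) - theta) <= 0:
                w += eta * y[j] * X[j]
                theta -= eta * y[j]
                mistakes += 1
        if mistakes == 0:  # a full pass with no mistakes: converged (separable case)
            break
    return w, theta
```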
32 In the case of non-separable data, the extent to which a data point fails to have margin γ via the hyperplane w can be quantified by a slack variable ξ_i = max(0, γ − y_i(⟨w, x_i⟩ + θ)). [sent-97, score-0.19]
33 The perceptron is guaranteed to make no more than (2(R+D)/γ)² mistakes on m examples, for any w and γ > 0, where D = √(∑_{i=1}^{m} ξ_i²) (Freund and Schapire, 1999; Shalev-Shwartz and Singer, 2005). [sent-99, score-0.372]
34 It is well known that the perceptron can be re-written in “dual form” whereby it can be used with kernel functions (Cristianini and Shawe-Taylor, 2000). [sent-101, score-0.331]
35 The dual form of the perceptron is obtained by observing that the weight vector can be written as w = ∑_{i=1}^{m} η α_i y_i x_i, where α_i is the number of mistakes that have been made on example i. [sent-102, score-0.409]
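A minimal sketch of the dual form just described: α_i counts the mistakes made on example i, the weight vector is never formed explicitly, and both training and prediction use only kernel evaluations. The kernel argument, learning rate, and threshold handling are assumptions of the sketch.

```python
import numpy as np

def kernel_perceptron_train(X, y, kernel, eta=1.0, max_iter=100):
    """Dual-form perceptron sketch: alpha[i] counts mistakes on example i,
    so the implicit weight vector is w = sum_i eta * alpha[i] * y[i] * x[i]."""
    m = len(y)
    K = np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
    alpha = np.zeros(m)
    theta = 0.0
    for _ in range(max_iter):
        mistakes = 0
        for j in range(m):
            score = eta * np.dot(alpha * y, K[:, j]) - theta
            if y[j] * score <= 0:
                alpha[j] += 1          # one more mistake charged to example j
                theta -= eta * y[j]
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, theta

def kernel_perceptron_predict(alpha, theta, X, y, kernel, x, eta=1.0):
    """Predict a new point x using kernel evaluations only."""
    score = eta * sum(alpha[i] * y[i] * kernel(X[i], x) for i in range(len(y))) - theta
    return 1 if score >= 0 else -1
```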
36 All the perceptron variants we discuss below have dual form representations. [sent-105, score-0.483]
37 , 2002) attempts to minimize the effect of noisy examples during training similar to the L2 soft margin technique used with support vector machines (Cristianini and Shawe-Taylor, 2000). [sent-111, score-0.258]
38 3 The α-bound: This variant is motivated by the L1 soft margin technique used with support vector machines (Cristianini and Shawe-Taylor, 2000). [sent-119, score-0.25]
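The extract does not spell out the α-bound update rule, so the following is only a hedged sketch of the idea stated earlier (constraining the update so as to limit the effect of any single example on the final hypothesis): each dual coefficient is simply capped at a bound, so a noisy example can trigger at most a fixed number of updates. The function name and the exact capping rule are assumptions, not code from the paper.

```python
def capped_update(alpha, j, bound):
    """Hypothetical alpha-bound style update: example j may accumulate at most
    `bound` mistakes' worth of influence on the hypothesis; beyond that it is ignored."""
    if alpha[j] < bound:
        alpha[j] += 1
    return alpha
```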
39 4 Perceptron Using Margins: The classical perceptron attempts to separate the data but has no guarantees on the separation margin obtained. [sent-129, score-0.569]
40 , 1992; Cristianini and Shawe-Taylor, 2000) one may expect that providing the perceptron with higher margin will add to the stability and accuracy of the hypothesis produced. [sent-132, score-0.579]
41 Notice that this includes the case of a mistake, where y_j(SUM − θ) < 0, and the case of correct classification with low margin, when 0 ≤ y_j(SUM − θ) < τ. [sent-134, score-0.228]
42 When the data are linearly separable and τ < γ_opt, PAM finds a separating hyperplane with a margin that is guaranteed to be at least γ_opt·τ/(√8(ηR² + τ)), where γ_opt is the maximum possible margin (Krauth and Mézard, 1987; Li et al. [sent-138, score-0.41]
43 As above, it is convenient to make the margin parameter data set independent so we can use the same values across data sets. [sent-140, score-0.239]
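A minimal sketch of the perceptron with margin (PAM) as described above: the only change from the basic algorithm is that an update is made whenever y_j(⟨w, x_j⟩ − θ) < τ, that is, on mistakes and on correctly classified examples whose margin is below τ. The learning rate, stopping rule, and threshold update are assumptions of the sketch.

```python
import numpy as np

def pam_train(X, y, tau, eta=1.0, max_iter=100):
    """Perceptron with margin (PAM) sketch: update on mistakes and on correct
    predictions whose margin falls below tau."""
    m, n = X.shape
    w = np.zeros(n)
    theta = 0.0
    for _ in range(max_iter):
        updated = False
        for j in range(m):
            if y[j] * (np.dot(w, X[j]) - theta) < tau:
                w += eta * y[j] * X[j]
                theta -= eta * y[j]
                updated = True
        if not updated:   # no update in a full pass: every example has margin >= tau
            break
    return w, theta
```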
44 5 Longest Survivor and Voted Perceptron: The classical perceptron returns the last weight vector w, that is, the one obtained after all training has been completed, but this may not always be useful, especially if the data is noisy. [sent-147, score-0.457]
45 (1987) show that the longest survivor hypothesis, that is, the one that made the largest number of correct predictions during training in the on-line setting, is a good choice in the sense that one can provide guarantees on its performance in the PAC learning model (Valiant, 1984). [sent-155, score-0.596]
46 The voted perceptron (Freund and Schapire, 1999) assigns each vector a “vote” based on the number of sequential correct classifications by that weight vector. [sent-157, score-1.033]
47 Whenever an example is misclassified, the voted perceptron records the number of correct classifications made since the previous misclassification, assigns this number to the current weight vector’s vote, saves the current weight vector, and then updates as normal. [sent-158, score-1.086]
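A minimal sketch of the bookkeeping described above, assuming training makes a fixed number of passes: each stored weight vector carries a vote equal to the number of correct predictions it made before the next mistake, prediction takes a weighted vote of the stored hypotheses' sign predictions, and the longest-survivor variant simply returns the stored hypothesis with the largest vote. The pass count and the tie-breaking in the final vote are assumptions of the sketch.

```python
import numpy as np

def voted_perceptron_train(X, y, eta=1.0, epochs=10):
    """Voted perceptron sketch: save each weight vector with its 'vote' (survival count)."""
    m, n = X.shape
    w, theta, count = np.zeros(n), 0.0, 0
    stored = []                                   # list of (w, theta, vote) triples
    for _ in range(epochs):
        for j in range(m):
            if y[j] * (np.dot(w, X[j]) - theta) <= 0:
                stored.append((w.copy(), theta, count))   # save old vector with its vote
                w = w + eta * y[j] * X[j]
                theta = theta - eta * y[j]
                count = 0
            else:
                count += 1
    stored.append((w.copy(), theta, count))       # final hypothesis and its vote
    return stored

def voted_predict(stored, x):
    """Weighted vote over the sign predictions of all stored hypotheses."""
    s = sum(vote * np.sign(np.dot(w, x) - theta) for w, theta, vote in stored)
    return 1 if s >= 0 else -1

def longest_survivor(stored):
    """Longest-survivor variant: return the stored hypothesis with the largest vote."""
    return max(stored, key=lambda t: t[2])
```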
48 So when using the dual perceptron the prediction phase of the voted perceptron is not substantially more expensive than the prediction phase of the classical perceptron, although it is more expensive in the primal form. [sent-162, score-1.459]
49 When the data are linearly separable and given enough iterations, both these variants will converge to a hypothesis that is very close to that of the simple perceptron algorithm. [sent-163, score-0.585]
50 In one set of experiments we searched through a pre-determined set of values of τ, α, and λ by running each of the classical, longest survivor, and voted perceptron using 10-fold cross-validation and parameterized by each value of τ, α, and λ as well as each combination of τ × α and τ × λ. [sent-177, score-1.453]
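A sketch of the parameter search described above for a single parameter (τ); α and λ, and the τ × α and τ × λ combinations, would be handled the same way. The helper names, the random fold split, and the reuse of the pam_train sketch from above are assumptions, not code from the paper.

```python
import numpy as np

def cv_accuracy(train_fn, predict_fn, X, y, folds=10, seed=0):
    """Hypothetical helper: 10-fold cross-validation accuracy for one parameter setting.
    X and y are assumed to be numpy arrays."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(y)), folds)
    accs = []
    for k in range(folds):
        test = parts[k]
        train = np.concatenate([parts[i] for i in range(folds) if i != k])
        model = train_fn(X[train], y[train])
        preds = np.array([predict_fn(model, x) for x in X[test]])
        accs.append(np.mean(preds == y[test]))
    return float(np.mean(accs))

def select_tau(tau_values, X, y):
    """Pick the tau with the best cross-validated accuracy (PAM with the last hypothesis)."""
    def predict(model, x):
        w, theta = model
        return 1 if np.dot(w, x) - theta >= 0 else -1
    scores = {tau: cv_accuracy(lambda Xt, yt, t=tau: pam_train(Xt, yt, tau=t), predict, X, y)
              for tau in tau_values}
    return max(scores, key=scores.get)
```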
51 For further comparison, we also ran SVM on the data sets using the L1-norm soft margin and the L2-norm soft margin. [sent-207, score-0.292]
52 In the following, we report on experiments with artificial data where the margin and noise levels are varied so we effectively span all four different types. [sent-218, score-0.258]
53 We first specify parameters for the number of features, the noise rate, and the required margin size. [sent-220, score-0.212]
54 [Table 1: UCI and Other Natural Data Set Characteristics.] We measured the actual margin of the examples with respect to the weight vector and then discarded any examples x_i such that |⟨w, x_i⟩ − θ| < M·θ, where M is the margin parameter specified. [sent-232, score-0.383]
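A hedged sketch of the artificial data generation described above: examples whose margin with respect to the target weight vector is below M·θ are discarded, and labels are then flipped at the specified noise rate. The sampling distributions, the value of θ, and the order of the filtering and noise steps are assumptions, not details from the paper.

```python
import numpy as np

def make_margin_data(m, f, M, noise_rate, theta=1.0, seed=None):
    """Hypothetical generator: sample a target hyperplane, keep only examples with
    margin at least M*theta relative to it, then apply label noise at noise_rate."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=f)                          # target weight vector
    X, y = [], []
    while len(y) < m:
        x = rng.normal(size=f)
        if abs(np.dot(w, x) - theta) < M * theta:
            continue                                # discard low-margin examples
        label = 1 if np.dot(w, x) - theta > 0 else -1
        if rng.random() < noise_rate:
            label = -label                          # label noise
        X.append(x)
        y.append(label)
    return np.array(X), np.array(y)
```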
55 In the tables of results presented below, f stands for number of features, M stands for the margin size, and N stands for the noise rate. [sent-234, score-0.239]
56 The parameters of interest in our experiments are τ, α, and λ, which control the margin and soft-margin variants. [sent-287, score-0.397]
57 [Table legend — Dominance: V = Voted, C = Classical, LS = Longest Survivor.] Finally, we performed a comparison of the classical perceptron, the longest survivor, and the voted perceptron algorithms without the additional variants. [sent-305, score-1.585]
58 One can see that with a higher noise rate the voted perceptron and the longest survivor improve performance over the base algorithm. [sent-308, score-1.618]
59 Over all 150 artificial data sets, the voted perceptron strictly improves performance over the classical perceptron in 59 cases, and ties or improves in 138 cases. [sent-309, score-1.418]
60 Using the distribution over our artificial data sets one can calculate a simple weak statistical test that supports the hypothesis that using the voted algorithm does not hurt accuracy, and can often improve it. [sent-310, score-0.756]
61 Looking next at the on-line-to-batch conversions, we see that the differences between the basic algorithm, the longest survivor, and the voted perceptron are noticeable without the margin-based variants. [sent-344, score-1.779]
62 For the artificial data sets this only holds for one group of data sets ( f = 50), the one with [...]. [Table: breast-cancer-wisconsin, bupa, wdbc, crx, promoters, ionosphere, wpbc, sonar.] [sent-345, score-0.195]
63 The longest survivor seems less stable and degrades performance in some cases. [sent-526, score-0.559]
64 Finally, we compare the τ variant with the voted perceptron and their combination. [sent-527, score-1.063]
65 For the artificial data sets, using τ alone (with the last hypothesis) gives higher accuracy than using the voted perceptron alone. [sent-528, score-1.105]
66 We see that voted perceptron and the τ variant give similar performance on this data set as well. [sent-532, score-1.104]
67 In these cases there was no difference between the basic, longest, and voted versions except when combined with other variants. [sent-535, score-1.097]
68 As with the performance on the artificial data mentioned above, the voted perceptron performs well even with a small training set size (and a small ratio of examples to features). [sent-624, score-1.076]
69 One can see that both the τ variant and the voted perceptron significantly reduce the variance in results. [sent-627, score-1.063]
70 The longest survivor does so most of the time but not always. [sent-628, score-0.539]
71 Cursory experiments to investigate the differences show that, when the data has a zero-threshold separator and the perceptron is run for just one iteration, then (as reported in Servedio, 1999) the average algorithm performs better than the basic perceptron. [sent-632, score-0.399]
72 However, the margin-based perceptron performs better, and this becomes more pronounced with a non-zero threshold and multiple iterations. [sent-633, score-0.5]
73 Thus in some sense the perceptron variants are doing something non-trivial that is beyond the scope of the simple average algorithm. [sent-634, score-0.463]
74 In contrast with the tables for parameter search, columns of parameter variants in Tables 6 and 7 do include the inactive option. [sent-884, score-0.215]
75 Both τ and the voted perceptron provide consistent improvement over the classical perceptron; the longest survivor provides improvement over the classical perceptron on its own, but a smaller one than voted or τ in most cases. [sent-1348, score-2.631]
76 As observed above in parameter search, the variants with α and λ offer improvements in some cases, but when they do, τ and voted almost always offer a better improvement. [sent-1350, score-0.845]
77 Ignoring the intervals we also see that neither the τ variant nor the voted perceptron dominates the other. [sent-1351, score-1.063]
78 We can see that for “MNIST” the variants make little difference in performance but that for “USPS” we get small improvements and the usual pattern relating the variants is observed. [sent-1356, score-0.284]
79 As can be seen, the perceptron variants give similar accuracies and smaller variance, and they are therefore an excellent alternative to SVM. [sent-1369, score-0.463]
80 Discussion and Conclusions: The paper provides an experimental evaluation of several noise-tolerant variants of the perceptron algorithm. [sent-1371, score-0.535]
81 The results are surprising since they suggest that the perceptron with margin is the most successful variant although it is the only one not designed for noise tolerance. [sent-1372, score-0.59]
82 The voted perceptron has similar performance in most cases, and it has the advantage that no parameter selection is required for it. [sent-1373, score-1.064]
83 The differences between the voted perceptron and the perceptron with margin are most noticeable in the artificial data sets, and the two are indistinguishable in their performance on the UCI data. [sent-1374, score-1.226]
84 The experiments also show that the soft-margin variants are generally weaker than the voted or margin-based algorithms, and they do not provide additional improvement in performance when combined with them. [sent-1375, score-1.031]
85 Both the voted perceptron and the margin variant reduced the deviation in accuracy in addition to improving the accuracy. [sent-1376, score-1.261]
86 Combining the voted perceptron and the perceptron with margin has the potential for further improvements but can harm performance in high-variance cases. [sent-1378, score-1.227]
87 In terms of run time, the voted perceptron does not require parameter selection and can therefore be faster to train. [sent-1379, score-1.066]
88 As mentioned in the introduction, a number of variants that perform margin-based updates exist in the literature (Friess et al. [sent-1387, score-0.337]
89 , 2004) performs gradient descent on the soft margin risk resulting in an algorithm that rescales the old weight vector before the additive update. [sent-1394, score-0.22]
90 , 2005) adapts η on each example to guarantee that it is immediately separable with a margin (although the update size is limited for noisy data). [sent-1396, score-0.238]
91 The Ballseptron (Shalev-Shwartz and Singer, 2005) establishes a normalized margin and replaces margin updates with updates on hypothetical examples on which a mistake would be made. [sent-1397, score-0.469]
92 Curiously, identical or similar bounds hold for the basic perceptron algorithm so that these results do not establish an advantage of the variants over the basic perceptron. [sent-1402, score-0.463]
93 , 2005), does not perform margin updates but uses spectral properties of the data in the updates in a way that can reduce mistakes in some cases. [sent-1404, score-0.303]
94 Finally, experiments with aggressive ROMMA (Li and Long, 2002; Li, 2000) have shown that adding a voted perceptron scheme can harm performance, just as we observed for the margin perceptron. [sent-1411, score-1.251]
95 It may be worth pointing out here that, although we have relative loss bounds for several variants and these can be translated to some error bounds in the batch setting, the bounds are not sufficiently strong to provide significant guarantees in the agnostic PAC model. [sent-1417, score-0.245]
96 The first is whether the more sophisticated versions of margin perceptron add significant performance improvements. [sent-1420, score-0.52]
97 Given the failure of the simple longest survivor it would be useful to evaluate the more robust versions of Gallant (1990) and Cesa-Bianchi et al. [sent-1423, score-0.539]
98 Alternatively, one could further investigate the tail variants of the voting scheme (Li, 2000) or the “averaging” version of voting (Freund and Schapire, 1999; Gentile, 2001), explaining when they work with different variants and why. [sent-1426, score-0.304]
99 Finally, as mentioned above, to our knowledge there is no theoretical explanation for the fact that the perceptron with margin performs better than the basic perceptron in the presence of noise. [sent-1427, score-0.831]
100 Ranking algorithms for named entity extraction: Boosting and the voted perceptron. [sent-1457, score-0.685]
wordName wordTfidf (topN-words)
[('voted', 0.685), ('longest', 0.412), ('perceptron', 0.331), ('margin', 0.169), ('init', 0.136), ('variants', 0.132), ('survivor', 0.127), ('wachman', 0.117), ('olerant', 0.097), ('hardon', 0.091), ('mnist', 0.088), ('tally', 0.088), ('erceptron', 0.082), ('usps', 0.075), ('oise', 0.059), ('mistake', 0.059), ('adult', 0.056), ('li', 0.056), ('batch', 0.055), ('hypothesis', 0.05), ('variant', 0.047), ('noise', 0.043), ('tufts', 0.041), ('promoters', 0.041), ('mistakes', 0.041), ('iterations', 0.04), ('search', 0.04), ('gallant', 0.039), ('krauth', 0.039), ('pam', 0.039), ('tsampouka', 0.039), ('unrealizable', 0.039), ('last', 0.039), ('cristianini', 0.037), ('updates', 0.036), ('noisy', 0.036), ('ran', 0.034), ('soft', 0.034), ('ls', 0.033), ('gentile', 0.033), ('separable', 0.033), ('tolerance', 0.03), ('classical', 0.03), ('tolerant', 0.029), ('accuracy', 0.029), ('alma', 0.029), ('bupa', 0.029), ('kowalczyk', 0.029), ('votek', 0.029), ('wdbc', 0.029), ('wpbc', 0.029), ('hypotheses', 0.029), ('uci', 0.029), ('vote', 0.029), ('parameter', 0.028), ('tables', 0.027), ('kearns', 0.026), ('primal', 0.026), ('sign', 0.026), ('experiments', 0.025), ('crx', 0.025), ('roni', 0.025), ('kivinen', 0.025), ('svm', 0.024), ('wk', 0.024), ('freund', 0.024), ('picks', 0.023), ('xm', 0.023), ('agnostic', 0.022), ('harm', 0.022), ('zard', 0.022), ('pac', 0.022), ('run', 0.022), ('data', 0.021), ('schapire', 0.021), ('arti', 0.021), ('votes', 0.02), ('dekel', 0.02), ('voting', 0.02), ('performance', 0.02), ('dual', 0.02), ('conconi', 0.019), ('friess', 0.019), ('gabriel', 0.019), ('garg', 0.019), ('romma', 0.019), ('training', 0.019), ('experimented', 0.019), ('aggressive', 0.019), ('cial', 0.018), ('expensive', 0.018), ('guarantees', 0.018), ('translated', 0.018), ('opt', 0.018), ('forces', 0.018), ('linearly', 0.018), ('initialized', 0.018), ('outer', 0.018), ('sum', 0.017), ('weight', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm
Author: Roni Khardon, Gabriel Wachman
Abstract: A large number of variants of the Perceptron algorithm have been proposed and partially evaluated in recent work. One type of algorithm aims for noise tolerance by replacing the last hypothesis of the perceptron with another hypothesis or a vote among hypotheses. Another type simply adds a margin term to the perceptron in order to increase robustness and accuracy, as done in support vector machines. A third type borrows further from support vector machines and constrains the update function of the perceptron in ways that mimic soft-margin techniques. The performance of these algorithms, and the potential for combining different techniques, has not been studied in depth. This paper provides such an experimental study and reveals some interesting facts about the algorithms. In particular the perceptron with margin is an effective method for tolerating noise and stabilizing the algorithm. This is surprising since the margin in itself is not designed or used for noise tolerance, and there are no known guarantees for such performance. In most cases, similar performance is obtained by the voted-perceptron which has the advantage that it does not require parameter selection. Techniques using soft margin ideas are run-time intensive and do not give additional performance benefits. The results also highlight the difficulty with automatic parameter selection which is required with some of these variants. Keywords: perceptron algorithm, on-line learning, noise tolerance, kernel methods
2 0.087928861 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss
Author: Ofer Dekel, Philip M. Long, Yoram Singer
Abstract: We study the problem of learning multiple tasks in parallel within the online learning framework. On each online round, the algorithm receives an instance for each of the parallel tasks and responds by predicting the label of each instance. We consider the case where the predictions made on each round all contribute toward a common goal. The relationship between the various tasks is defined by a global loss function, which evaluates the overall quality of the multiple predictions made on each round. Specifically, each individual prediction is associated with its own loss value, and then these multiple loss values are combined into a single number using the global loss function. We focus on the case where the global loss function belongs to the family of absolute norms, and present several online learning algorithms for the induced problem. We prove worst-case relative loss bounds for all of our algorithms, and demonstrate the effectiveness of our approach on a largescale multiclass-multilabel text categorization problem. Keywords: online learning, multitask learning, multiclass multilabel classiifcation, perceptron
3 0.083101287 57 jmlr-2007-Multi-class Protein Classification Using Adaptive Codes
Author: Iain Melvin, Eugene Ie, Jason Weston, William Stafford Noble, Christina Leslie
Abstract: Predicting a protein’s structural class from its amino acid sequence is a fundamental problem in computational biology. Recent machine learning work in this domain has focused on developing new input space representations for protein sequences, that is, string kernels, some of which give state-of-the-art performance for the binary prediction task of discriminating between one class and all the others. However, the underlying protein classification problem is in fact a huge multiclass problem, with over 1000 protein folds and even more structural subcategories organized into a hierarchy. To handle this challenging many-class problem while taking advantage of progress on the binary problem, we introduce an adaptive code approach in the output space of one-vsthe-rest prediction scores. Specifically, we use a ranking perceptron algorithm to learn a weighting of binary classifiers that improves multi-class prediction with respect to a fixed set of output codes. We use a cross-validation set-up to generate output vectors for training, and we define codes that capture information about the protein structural hierarchy. Our code weighting approach significantly improves on the standard one-vs-all method for two difficult multi-class protein classification problems: remote homology detection and fold recognition. Our algorithm also outperforms a previous code learning approach due to Crammer and Singer, trained here using a perceptron, when the dimension of the code vectors is high and the number of classes is large. Finally, we compare against PSI-BLAST, one of the most widely used methods in protein sequence analysis, and find that our method strongly outperforms it on every structure clas∗. The first two authors contributed equally to this work. c 2007 Iain Melvin, Eugene Ie, Jason Weston, William Stafford Noble and Christina Leslie. M ELVIN , I E , W ESTON , N OBLE AND L ESLIE sification problem that we consider. Supplementary data and source code are available at http: //www.cs
4 0.071455494 52 jmlr-2007-Margin Trees for High-dimensional Classification
Author: Robert Tibshirani, Trevor Hastie
Abstract: We propose a method for the classification of more than two classes, from high-dimensional features. Our approach is to build a binary decision tree in a top-down manner, using the optimal margin classifier at each split. We implement an exact greedy algorithm for this task, and compare its performance to less greedy procedures based on clustering of the matrix of pairwise margins. We compare the performance of the “margin tree” to the closely related “all-pairs” (one versus one) support vector machine, and nearest centroids on a number of cancer microarray data sets. We also develop a simple method for feature selection. We find that the margin tree has accuracy that is competitive with other methods and offers additional interpretability in its putative grouping of the classes. Keywords: maximum margin classifier, support vector machine, decision tree, CART
5 0.04600336 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
Author: Yann Guermeur
Abstract: In the context of discriminant analysis, Vapnik’s statistical learning theory has mainly been developed in three directions: the computation of dichotomies with binary-valued functions, the computation of dichotomies with real-valued functions, and the computation of polytomies with functions taking their values in finite sets, typically the set of categories itself. The case of classes of vectorvalued functions used to compute polytomies has seldom been considered independently, which is unsatisfactory, for three main reasons. First, this case encompasses the other ones. Second, it cannot be treated appropriately through a na¨ve extension of the results devoted to the computation of ı dichotomies. Third, most of the classification problems met in practice involve multiple categories. In this paper, a VC theory of large margin multi-category classifiers is introduced. Central in this theory are generalized VC dimensions called the γ-Ψ-dimensions. First, a uniform convergence bound on the risk of the classifiers of interest is derived. The capacity measure involved in this bound is a covering number. This covering number can be upper bounded in terms of the γ-Ψdimensions thanks to generalizations of Sauer’s lemma, as is illustrated in the specific case of the scale-sensitive Natarajan dimension. A bound on this latter dimension is then computed for the class of functions on which multi-class SVMs are based. This makes it possible to apply the structural risk minimization inductive principle to those machines. Keywords: multi-class discriminant analysis, large margin classifiers, uniform strong laws of large numbers, generalized VC dimensions, multi-class SVMs, structural risk minimization inductive principle, model selection
6 0.036356095 47 jmlr-2007-Learning Horn Expressions with LOGAN-H
7 0.033874512 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression
8 0.03250799 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis
9 0.03236777 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
10 0.029428916 59 jmlr-2007-Nonlinear Boosting Projections for Ensemble Construction
11 0.025266653 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition
12 0.023161218 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation
14 0.020966735 91 jmlr-2007-Very Fast Online Learning of Highly Non Linear Problems
15 0.020102747 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression
16 0.019958643 83 jmlr-2007-The On-Line Shortest Path Problem Under Partial Monitoring
17 0.01895896 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
18 0.018236762 11 jmlr-2007-Anytime Learning of Decision Trees
19 0.01813551 30 jmlr-2007-Dynamics and Generalization Ability of LVQ Algorithms
20 0.016958782 90 jmlr-2007-Value Regularization and Fenchel Duality
topicId topicWeight
[(0, 0.131), (1, 0.061), (2, -0.024), (3, 0.033), (4, 0.038), (5, 0.052), (6, -0.139), (7, 0.033), (8, -0.119), (9, 0.093), (10, 0.034), (11, 0.074), (12, 0.137), (13, 0.192), (14, 0.207), (15, -0.155), (16, -0.119), (17, -0.086), (18, -0.025), (19, 0.019), (20, -0.059), (21, 0.105), (22, -0.274), (23, -0.029), (24, -0.276), (25, -0.197), (26, -0.003), (27, 0.097), (28, -0.063), (29, 0.008), (30, -0.05), (31, 0.005), (32, 0.078), (33, 0.054), (34, 0.007), (35, 0.016), (36, 0.029), (37, -0.177), (38, -0.048), (39, -0.069), (40, -0.166), (41, 0.146), (42, 0.127), (43, -0.078), (44, 0.097), (45, 0.007), (46, 0.181), (47, -0.088), (48, -0.027), (49, 0.086)]
simIndex simValue paperId paperTitle
same-paper 1 0.95990133 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm
Author: Roni Khardon, Gabriel Wachman
Abstract: A large number of variants of the Perceptron algorithm have been proposed and partially evaluated in recent work. One type of algorithm aims for noise tolerance by replacing the last hypothesis of the perceptron with another hypothesis or a vote among hypotheses. Another type simply adds a margin term to the perceptron in order to increase robustness and accuracy, as done in support vector machines. A third type borrows further from support vector machines and constrains the update function of the perceptron in ways that mimic soft-margin techniques. The performance of these algorithms, and the potential for combining different techniques, has not been studied in depth. This paper provides such an experimental study and reveals some interesting facts about the algorithms. In particular the perceptron with margin is an effective method for tolerating noise and stabilizing the algorithm. This is surprising since the margin in itself is not designed or used for noise tolerance, and there are no known guarantees for such performance. In most cases, similar performance is obtained by the voted-perceptron which has the advantage that it does not require parameter selection. Techniques using soft margin ideas are run-time intensive and do not give additional performance benefits. The results also highlight the difficulty with automatic parameter selection which is required with some of these variants. Keywords: perceptron algorithm, on-line learning, noise tolerance, kernel methods
2 0.61939836 57 jmlr-2007-Multi-class Protein Classification Using Adaptive Codes
Author: Iain Melvin, Eugene Ie, Jason Weston, William Stafford Noble, Christina Leslie
Abstract: Predicting a protein’s structural class from its amino acid sequence is a fundamental problem in computational biology. Recent machine learning work in this domain has focused on developing new input space representations for protein sequences, that is, string kernels, some of which give state-of-the-art performance for the binary prediction task of discriminating between one class and all the others. However, the underlying protein classification problem is in fact a huge multiclass problem, with over 1000 protein folds and even more structural subcategories organized into a hierarchy. To handle this challenging many-class problem while taking advantage of progress on the binary problem, we introduce an adaptive code approach in the output space of one-vsthe-rest prediction scores. Specifically, we use a ranking perceptron algorithm to learn a weighting of binary classifiers that improves multi-class prediction with respect to a fixed set of output codes. We use a cross-validation set-up to generate output vectors for training, and we define codes that capture information about the protein structural hierarchy. Our code weighting approach significantly improves on the standard one-vs-all method for two difficult multi-class protein classification problems: remote homology detection and fold recognition. Our algorithm also outperforms a previous code learning approach due to Crammer and Singer, trained here using a perceptron, when the dimension of the code vectors is high and the number of classes is large. Finally, we compare against PSI-BLAST, one of the most widely used methods in protein sequence analysis, and find that our method strongly outperforms it on every structure clas∗. The first two authors contributed equally to this work. c 2007 Iain Melvin, Eugene Ie, Jason Weston, William Stafford Noble and Christina Leslie. M ELVIN , I E , W ESTON , N OBLE AND L ESLIE sification problem that we consider. Supplementary data and source code are available at http: //www.cs
3 0.34337273 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss
Author: Ofer Dekel, Philip M. Long, Yoram Singer
Abstract: We study the problem of learning multiple tasks in parallel within the online learning framework. On each online round, the algorithm receives an instance for each of the parallel tasks and responds by predicting the label of each instance. We consider the case where the predictions made on each round all contribute toward a common goal. The relationship between the various tasks is defined by a global loss function, which evaluates the overall quality of the multiple predictions made on each round. Specifically, each individual prediction is associated with its own loss value, and then these multiple loss values are combined into a single number using the global loss function. We focus on the case where the global loss function belongs to the family of absolute norms, and present several online learning algorithms for the induced problem. We prove worst-case relative loss bounds for all of our algorithms, and demonstrate the effectiveness of our approach on a largescale multiclass-multilabel text categorization problem. Keywords: online learning, multitask learning, multiclass multilabel classiifcation, perceptron
4 0.33662832 52 jmlr-2007-Margin Trees for High-dimensional Classification
Author: Robert Tibshirani, Trevor Hastie
Abstract: We propose a method for the classification of more than two classes, from high-dimensional features. Our approach is to build a binary decision tree in a top-down manner, using the optimal margin classifier at each split. We implement an exact greedy algorithm for this task, and compare its performance to less greedy procedures based on clustering of the matrix of pairwise margins. We compare the performance of the “margin tree” to the closely related “all-pairs” (one versus one) support vector machine, and nearest centroids on a number of cancer microarray data sets. We also develop a simple method for feature selection. We find that the margin tree has accuracy that is competitive with other methods and offers additional interpretability in its putative grouping of the classes. Keywords: maximum margin classifier, support vector machine, decision tree, CART
5 0.2974093 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis
Author: Simon Günter, Nicol N. Schraudolph, S. V. N. Vishwanathan
Abstract: We develop gain adaptation methods that improve convergence of the kernel Hebbian algorithm (KHA) for iterative kernel PCA (Kim et al., 2005). KHA has a scalar gain parameter which is either held constant or decreased according to a predetermined annealing schedule, leading to slow convergence. We accelerate it by incorporating the reciprocal of the current estimated eigenvalues as part of a gain vector. An additional normalization term then allows us to eliminate a tuning parameter in the annealing schedule. Finally we derive and apply stochastic meta-descent (SMD) gain vector adaptation (Schraudolph, 1999, 2002) in reproducing kernel Hilbert space to further speed up convergence. Experimental results on kernel PCA and spectral clustering of USPS digits, motion capture and image denoising, and image super-resolution tasks confirm that our methods converge substantially faster than conventional KHA. To demonstrate scalability, we perform kernel PCA on the entire MNIST data set. Keywords: step size adaptation, gain vector adaptation, stochastic meta-descent, kernel Hebbian algorithm, online learning
6 0.25087211 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
7 0.18491755 47 jmlr-2007-Learning Horn Expressions with LOGAN-H
8 0.15676165 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression
9 0.1435357 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
10 0.1335198 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition
11 0.12946863 25 jmlr-2007-Covariate Shift Adaptation by Importance Weighted Cross Validation
13 0.12525718 59 jmlr-2007-Nonlinear Boosting Projections for Ensemble Construction
14 0.12265976 30 jmlr-2007-Dynamics and Generalization Ability of LVQ Algorithms
15 0.1174641 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation
16 0.11499211 14 jmlr-2007-Behavioral Shaping for Geometric Concepts
17 0.11492082 83 jmlr-2007-The On-Line Shortest Path Problem Under Partial Monitoring
18 0.11448918 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study
19 0.10616965 29 jmlr-2007-Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts
20 0.10475907 11 jmlr-2007-Anytime Learning of Decision Trees
topicId topicWeight
[(4, 0.017), (8, 0.018), (10, 0.024), (12, 0.019), (15, 0.039), (20, 0.023), (28, 0.089), (40, 0.04), (45, 0.02), (48, 0.028), (60, 0.033), (68, 0.353), (80, 0.035), (85, 0.055), (98, 0.102)]
simIndex simValue paperId paperTitle
same-paper 1 0.68660438 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm
Author: Roni Khardon, Gabriel Wachman
Abstract: A large number of variants of the Perceptron algorithm have been proposed and partially evaluated in recent work. One type of algorithm aims for noise tolerance by replacing the last hypothesis of the perceptron with another hypothesis or a vote among hypotheses. Another type simply adds a margin term to the perceptron in order to increase robustness and accuracy, as done in support vector machines. A third type borrows further from support vector machines and constrains the update function of the perceptron in ways that mimic soft-margin techniques. The performance of these algorithms, and the potential for combining different techniques, has not been studied in depth. This paper provides such an experimental study and reveals some interesting facts about the algorithms. In particular the perceptron with margin is an effective method for tolerating noise and stabilizing the algorithm. This is surprising since the margin in itself is not designed or used for noise tolerance, and there are no known guarantees for such performance. In most cases, similar performance is obtained by the voted-perceptron which has the advantage that it does not require parameter selection. Techniques using soft margin ideas are run-time intensive and do not give additional performance benefits. The results also highlight the difficulty with automatic parameter selection which is required with some of these variants. Keywords: perceptron algorithm, on-line learning, noise tolerance, kernel methods
2 0.40185475 47 jmlr-2007-Learning Horn Expressions with LOGAN-H
Author: Marta Arias, Roni Khardon, Jérôme Maloberti
Abstract: The paper introduces L OG A N -H —a system for learning first-order function-free Horn expressions from interpretations. The system is based on an algorithm that learns by asking questions and that was proved correct in previous work. The current paper shows how the algorithm can be implemented in a practical system, and introduces a new algorithm based on it that avoids interaction and learns from examples only. The L OG A N -H system implements these algorithms and adds several facilities and optimizations that allow efficient applications in a wide range of problems. As one of the important ingredients, the system includes several fast procedures for solving the subsumption problem, an NP-complete problem that needs to be solved many times during the learning process. We describe qualitative and quantitative experiments in several domains. The experiments demonstrate that the system can deal with varied problems, large amounts of data, and that it achieves good classification accuracy. Keywords: inductive logic programming, subsumption, bottom-up learning, learning with queries
3 0.39382493 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
Author: Wei Pan, Xiaotong Shen
Abstract: Variable selection in clustering analysis is both challenging and important. In the context of modelbased clustering analysis with a common diagonal covariance matrix, which is especially suitable for “high dimension, low sample size” settings, we propose a penalized likelihood approach with an L1 penalty function, automatically realizing variable selection via thresholding and delivering a sparse solution. We derive an EM algorithm to fit our proposed model, and propose a modified BIC as a model selection criterion to choose the number of components and the penalization parameter. A simulation study and an application to gene function prediction with gene expression profiles demonstrate the utility of our method. Keywords: BIC, EM, mixture model, penalized likelihood, soft-thresholding, shrinkage
4 0.39351153 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss
Author: Ofer Dekel, Philip M. Long, Yoram Singer
Abstract: We study the problem of learning multiple tasks in parallel within the online learning framework. On each online round, the algorithm receives an instance for each of the parallel tasks and responds by predicting the label of each instance. We consider the case where the predictions made on each round all contribute toward a common goal. The relationship between the various tasks is defined by a global loss function, which evaluates the overall quality of the multiple predictions made on each round. Specifically, each individual prediction is associated with its own loss value, and then these multiple loss values are combined into a single number using the global loss function. We focus on the case where the global loss function belongs to the family of absolute norms, and present several online learning algorithms for the induced problem. We prove worst-case relative loss bounds for all of our algorithms, and demonstrate the effectiveness of our approach on a largescale multiclass-multilabel text categorization problem. Keywords: online learning, multitask learning, multiclass multilabel classiifcation, perceptron
5 0.39125031 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation
Author: Guy Lebanon, Yi Mao, Joshua Dillon
Abstract: The popular bag of words assumption represents a document as a histogram of word occurrences. While computationally efficient, such a representation is unable to maintain any sequential information. We present an effective sequential document representation that goes beyond the bag of words representation and its n-gram extensions. This representation uses local smoothing to embed documents as smooth curves in the multinomial simplex thereby preserving valuable sequential information. In contrast to bag of words or n-grams, the new representation is able to robustly capture medium and long range sequential trends in the document. We discuss the representation and its geometric properties and demonstrate its applicability for various text processing tasks. Keywords: text processing, local smoothing
6 0.39003462 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
7 0.38838834 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
8 0.38782355 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
9 0.387485 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
10 0.38671452 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition
11 0.38570741 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification
12 0.38569137 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data
13 0.38355666 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression
14 0.38302594 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study
15 0.38068676 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
17 0.37987497 39 jmlr-2007-Handling Missing Values when Applying Classification Models
19 0.37893724 77 jmlr-2007-Stagewise Lasso
20 0.37755549 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning