jmlr jmlr2009 jmlr2009-86 knowledge-graph by maker-knowledge-mining

86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms


Source: pdf

Author: Yihua Chen, Eric K. Garcia, Maya R. Gupta, Ali Rahimi, Luca Cazzanti

Abstract: This paper reviews and extends the field of similarity-based classification, presenting new analyses, algorithms, data sets, and a comprehensive set of experimental results for a rich collection of classification problems. Specifically, the generalizability of using similarities as features is analyzed, design goals and methods for weighting nearest-neighbors for similarity-based learning are proposed, and different methods for consistently converting similarities into kernels are compared. Experiments on eight real data sets compare eight approaches and their variants to similarity-based learning. Keywords: similarity, dissimilarity, similarity-based learning, indefinite kernels

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Specifically, the generalizability of using similarities as features is analyzed, design goals and methods for weighting nearest-neighbors for similarity-based learning are proposed, and different methods for consistently converting similarities into kernels are compared. [sent-15, score-0.93]

2 Introduction Similarity-based classifiers estimate the class label of a test sample based on the similarities between the test sample and a set of labeled training samples, and the pairwise similarities between the training samples. [sent-18, score-0.93]

3 Similarity-based classification does not require direct access to the features of the samples, and thus the sample space can be any set, not necessarily a Euclidean space, as long as the similarity function is well defined for any pair of samples. [sent-20, score-0.263]

4 We assume that the pairwise similarities between n training samples are given as an n × n similarity matrix S whose (i, j)-entry is ψ(x_i, x_j), where x_i ∈ Ω, i = 1, . . . [sent-23, score-0.774]

5 The problem is to estimate the class label y for a test sample x based on its similarities to the training samples ψ(x, x_i), i = 1, . . . [sent-30, score-0.558]
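To make the problem setup concrete, here is a minimal sketch (an illustration only, not code from the paper) of the simplest similarity-based rule: assign the label of the most similar training sample. The data and the function name are hypothetical.

```python
import numpy as np

def most_similar_label(s, y):
    """Assign the label of the most similar training sample.

    s : (n,) array with s[i] = psi(x, x_i), similarities of the test sample
        to the n training samples
    y : (n,) array of training labels
    """
    return y[np.argmax(s)]

# Toy example: 4 training samples from 2 classes, hypothetical similarities.
y_train = np.array([0, 0, 1, 1])
s_test = np.array([0.2, 0.9, 0.4, 0.1])       # psi(x, x_i) for i = 1..4
print(most_similar_label(s_test, y_train))    # -> 0
```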

6 , 2002), and pyramid match kernel (Grauman and Darrell, 2007) to measure the similarity or dissimilarity between images in order to do image retrieval and object recognition. [sent-43, score-0.353]

7 , 1990) are popular methods to compute the similarity between different amino acid sequences for protein classification. [sent-45, score-0.33]

8 Notions of similarity appear to play a fundamental role in human learning, and thus psychologists have done extensive research to model human similarity judgement. [sent-47, score-0.514]

9 First, we discuss the idea of similarities as inner products in Section 2, then the concept of treating similarities as features in Section 3. [sent-53, score-0.881]

10 The generalizability of using similarities as features and that of using similarities as kernels are compared in Section 4. [sent-54, score-0.863]

11 Similarities as Inner Products A popular approach to similarity-based classification is to treat the given similarities as inner products in some Hilbert space or to treat dissimilarities as distances in some Euclidean space. [sent-61, score-0.437]

12 We discuss different methods for modifying similarities into kernels in Section 2. [sent-63, score-0.402]

13 Although the mathematical meaning of a kernel is the inner product in some Hilbert space, a standard interpretation of a kernel is the pairwise similarity between different samples. [sent-69, score-0.38]

14 Conversely, many researchers have suggested treating similarities as kernels, and applying any classification algorithm that only depends on inner products. [sent-70, score-0.439]

15 Using similarities as kernels eliminates the need to explicitly embed the samples in a Euclidean space. [sent-71, score-0.457]

16 functions do not satisfy the properties of an inner product, and thus the similarity matrix S can be indefinite. [sent-80, score-0.295]

17 In the following subsections we discuss several methods to modify similarities into kernels; a previous review can be found in Wu et al. [sent-81, score-0.368]

18 Spectrum clip makes S PSD by clipping all the negative eigenvalues to zero. [sent-106, score-0.364]

19 Some researchers assume that the negative eigenvalues of the similarity matrix are caused by noise and view spectrum clip as a denoising step (Wu et al. [sent-107, score-0.751]

20 . . . , max(λ_n, 0)), and the modified PSD similarity matrix be S_clip = U^T Λ_clip U. [sent-112, score-0.257]
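A small numerical sketch of spectrum clip, assuming a symmetric similarity matrix S. Note that numpy's `eigh` returns S = V diag(λ) Vᵀ with eigenvectors as columns, which matches the U^T Λ U notation above with U = Vᵀ; names are illustrative.

```python
import numpy as np

def spectrum_clip(S):
    """Clip negative eigenvalues of a symmetric similarity matrix to zero."""
    eigvals, V = np.linalg.eigh(S)         # S = V diag(eigvals) V.T
    clipped = np.maximum(eigvals, 0.0)     # Lambda_clip
    return V @ np.diag(clipped) @ V.T      # S_clip

# Hypothetical symmetric but indefinite similarity matrix.
S = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.9],
              [0.1, 0.9, 1.0]])
print(np.linalg.eigvalsh(S))                 # one eigenvalue is negative
print(np.linalg.eigvalsh(spectrum_clip(S)))  # all eigenvalues >= 0 (up to round-off)
```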

21 Using S_clip as a kernel matrix for training the SVM is equivalent to implicitly using x_i = Λ_clip^{1/2} u_i as the representation of the ith training sample, since ⟨x_i, x_j⟩ is equal to the (i, j)-entry of S_clip. [sent-114, score-0.326]

22 A mathematical justification for spectrum clip is that S_clip is the nearest PSD matrix to S in terms of the Frobenius norm (Higham, 1988), that is, S_clip = arg min_{K ⪰ 0} ‖K − S‖_F, where ⪰ denotes the generalized inequality with respect to the PSD cone. [sent-115, score-0.537]

23 (2006) show that the negative eigenvalues of some similarity data can code useful information about object features or categories, which agrees with some fundamental psychological studies (Tversky and Gati, 1982; Gati and Tversky, 1982). [sent-126, score-0.31]

24 1, they assume that the samples lie in a Kreĭn space K = H_+ ⊕ H_− with similarities given by ψ(a, b) = ⟨a_+, b_+⟩_{H_+} − ⟨a_−, b_−⟩_{H_−}. [sent-132, score-0.423]

25 This is equivalent to flipping the sign of the negative eigenvalues of the similarity matrix S: let Λ_flip = diag(|λ_1|, . . . [sent-134, score-0.357]

26 . . . , |λ_n|), and then the similarity matrix after spectrum flip is S_flip = U^T Λ_flip U. [sent-137, score-0.387]
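A corresponding sketch of spectrum flip under the same assumptions (symmetric S, eigendecomposition via `eigh`); the function name is illustrative.

```python
import numpy as np

def spectrum_flip(S):
    """Replace each eigenvalue of the symmetric matrix S by its absolute value."""
    eigvals, V = np.linalg.eigh(S)              # S = V diag(eigvals) V.T
    return V @ np.diag(np.abs(eigvals)) @ V.T   # S_flip
```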

27 4 Spectrum Shift: Spectrum shift is another popular approach to modifying a similarity matrix into a kernel matrix: since S + λI = U^T (Λ + λI) U, any indefinite similarity matrix can be made PSD by shifting its spectrum by the absolute value of its minimum eigenvalue |λ_min(S)|. [sent-142, score-0.787]

28 Let Λ_shift = Λ + |min(λ_min(S), 0)| I, which is used to form the modified similarity matrix S_shift = U^T Λ_shift U. [sent-143, score-0.257]
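A sketch of spectrum shift; as noted in the next sentence, only the diagonal (the self-similarities) changes. Names are illustrative.

```python
import numpy as np

def spectrum_shift(S):
    """Add |min(lambda_min(S), 0)| to the diagonal so that S becomes PSD.

    Off-diagonal entries (similarities between different samples) are unchanged;
    only the self-similarities are increased.
    """
    lam_min = np.linalg.eigvalsh(S).min()
    return S + abs(min(lam_min, 0.0)) * np.eye(S.shape[0])
```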

29 Compared with spectrum clip and flip, spectrum shift only enhances all the self-similarities by the amount of |λ_min(S)| and does not change the similarity between any two different samples. [sent-144, score-0.879]

30 They used spectrum shift to produce a kernel from the similarity data. [sent-152, score-0.493]

31 It is also true that using SS^T is the same as defining a new similarity function ψ̃ for any a, b ∈ Ω as ψ̃(a, b) = ∑_{i=1}^{n} ψ(a, x_i) ψ(x_i, b). [sent-158, score-0.258]

32 We note that for symmetric S, treating SS^T as a kernel matrix K is equivalent to representing each x_i by its similarity feature vector s_i = (ψ(x_i, x_1), . . . [sent-159, score-0.389]
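A short check of the equivalence stated above: using SSᵀ as a kernel matrix coincides with taking each sample's row of similarities as its feature vector and applying the ordinary linear kernel. The toy matrix is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
S = (A + A.T) / 2.0                      # hypothetical symmetric similarity matrix

features = S                             # row i is the similarity feature vector s_i
K_from_features = features @ features.T  # linear kernel on similarity features
K_sst = S @ S.T                          # S S^T used directly as a kernel matrix

print(np.allclose(K_from_features, K_sst))   # True: the two constructions coincide
```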

33 The concept of treating similarities as features is discussed in more detail in Section 3. [sent-163, score-0.444]

34 Then if one uses an ERM classifier trained with modified similarities S̃, but uses the unmodified test similarities, represented by the vector s = (ψ(x, x_1), . . . [sent-166, score-0.409]

35 In general, one would like to modify the training and test similarities in a consistent fashion, that is, to modify the underlying similarity function rather than only modifying the matrix S. [sent-170, score-0.685]

36 (2005) proposed to first modify S and train the classifier using the modified n × n similarity matrix S̃, and then for each test sample modify its s in an effort to be consistent with the modified similarities used to train the model. [sent-178, score-0.666]

37 Their approach is to re-compute the same modification on the augmented (n + 1) × (n + 1) similarity matrix S′ = [ S  s ; s^T  ψ(x, x) ]. [sent-179, score-0.257]

38 They originally use dissimilarities in their cost function, and we reformulate it into similarities with the assumption that the relationship between dissimilarities and similarities is affine. [sent-180, score-0.736]

39 to form S̃′, and then let the modified test similarities s̃ be the first n elements of the last column of S̃′. [sent-181, score-0.409]

40 To attain consistency, we note that both the spectrum clip and flip modifications can be represented by linear transformations, that is, S̃ = PS, where P is the corresponding transformation matrix, and we propose to apply the same linear transformation P on s such that s̃ = Ps. [sent-186, score-0.447]

41 For spectrum clip, this linear transformation is equivalent to embedding the test sample as a feature vector into the same Euclidean space of the embedded training samples: Proposition 1 Let S_clip be the Gram matrix of the column vectors of X ∈ R^{m×n}, where rank(X) = m. [sent-196, score-0.264]
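A sketch of the consistent train/test treatment described above for spectrum clip: one matrix P satisfying S_clip = PS is the projector onto the non-negative eigenspace of S, and the same P can then be applied to the test similarity vector s. This is an illustration of the idea, not the paper's code; names and data are ours.

```python
import numpy as np

def clip_transform(S):
    """Return (S_clip, P) with S_clip = P @ S, where P projects onto the
    non-negative eigenspace of S; the same P can be applied to a test
    similarity vector so that train and test are modified consistently."""
    eigvals, V = np.linalg.eigh(S)            # S = V diag(eigvals) V.T
    keep = (eigvals > 0).astype(float)        # zero out negative directions
    P = V @ np.diag(keep) @ V.T
    return P @ S, P

# Hypothetical indefinite training similarities and raw test similarities.
S = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.9],
              [0.1, 0.9, 1.0]])
s = np.array([0.8, 0.5, 0.3])
S_clip, P = clip_transform(S)
s_clip = P @ s                                     # modified test similarities
print(np.linalg.eigvalsh(S_clip).min() >= -1e-12)  # S_clip is PSD up to round-off
```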

42 Similarities as Features Similarity-based classification problems can be formulated into standard learning problems in Euclidean space by treating the similarities between a sample x and the n training samples as features (Graepel et al. [sent-204, score-0.555]

43 As detailed in Section 4, the generalizability analysis yields different results for using similarities as features and using similarities as inner products. [sent-208, score-0.867]

44 (2007) show that under slightly less restrictive assumptions on the similarity function there exists with high probability a convex combination of simple classifiers on the similarities as features which has a maximum specifiable error. [sent-221, score-0.631]

45 consider generative classifiers for similarity feature vectors; they propose a regularized Fisher linear discriminant classifier (Pekalska et al. [sent-229, score-0.314]

46 We note that treating similarities as features may not capture discriminative information if there is a large intraclass variance compared to the interclass variance, even if the classes are well-separated. [sent-231, score-0.474]

47 Generalization Bounds of Similarity SVM Classifiers To investigate the generalizability of SVM classifiers using similarities, we analyze two forms of SVMs: using similarity as a kernel as discussed in Section 2, and a linear SVM using the similarities as features as given by (4). [sent-234, score-0.742]

48 Next, we state a weaker result in Theorem 2 for the SVM using (arbitrary) similarities as features. [sent-251, score-0.368]

49 Let the features be the similarities to m (< n) prototypes {(x1 , y1 ), . [sent-252, score-0.464]

50 If the original similarities are not PSD, then they must be modified to be PSD before this result applies; see Section 2 for a discussion of common PSD modifications. [sent-261, score-0.368]

51 1 Design Goals for Similarity-based Weighted k-NN In this section, for a test sample x, we use xi to denote its ith nearest neighbor from the training set as defined by the similarity function ψ for i = 1, . [sent-289, score-0.448]

52 Also, we redefine S as the k × k matrix of the similarities between the k-nearest neighbors and s the k × 1 vector of the similarities between the test sample x and its k-nearest neighbors. [sent-293, score-0.866]

53 (2006) proposed weights for k-NN in Euclidean space that satisfy a linear interpolation with maximum entropy (LIME) objective: minimize_w ‖∑_{i=1}^{k} w_i x_i − x‖_2^2 − λH(w)  (10), subject to ∑_{i=1}^{k} w_i = 1, w_i ≥ 0, i = 1, . . . [sent-314, score-0.288]

54 Note that (11) is completely specified in terms of the inner products of the feature vectors, ⟨x_i, x_j⟩ and ⟨x, x_i⟩, and thus we term the solution to (11) the kernel ridge interpolation (KRI) weights. [sent-321, score-0.254]
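A hedged sketch of the kernelized LIME idea: replace the inner products ⟨x_i, x_j⟩ and ⟨x, x_i⟩ with the local similarity matrix S and vector s, and solve the resulting constrained problem numerically. This illustrates the construction, not necessarily the exact formulation (11) or the closed form used in the paper; `lam` and the solver choice are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def lime_style_weights(S, s, lam=1.0):
    """Solve a LIME-style objective using only similarities.

    S   : (k, k) similarities between the k nearest neighbors (stands in for <x_i, x_j>)
    s   : (k,)   similarities of the test sample to those neighbors (stands in for <x, x_i>)
    lam : entropy regularization strength (hypothetical value)
    """
    k = len(s)

    def objective(w):
        w_safe = np.clip(w, 1e-12, None)          # keep the log well defined
        entropy = -np.sum(w_safe * np.log(w_safe))
        # ||sum_i w_i x_i - x||^2 expands to w@K@w - 2*k_vec@w + const,
        # with K replaced by S and k_vec replaced by s.
        return w @ S @ w - 2.0 * s @ w - lam * entropy

    constraints = ({'type': 'eq', 'fun': lambda w: np.sum(w) - 1.0},)
    bounds = [(0.0, 1.0)] * k
    w0 = np.full(k, 1.0 / k)
    return minimize(objective, w0, bounds=bounds, constraints=constraints).x
```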

55 These two terms work together to give more weight to the training samples that are more similar to the test sample, and thus help the resulting weights satisfy the first design goal of rewarding neighbors with high affinity to the test sample. [sent-328, score-0.354]

56 (15) The k-NN decision rule (9) using these weights is equivalent to classifying by maximizing the discriminant of a local kernel ridge regression. [sent-347, score-0.267]

57 Maximizing f_g(x) over g ∈ G produces the same estimated class label as (9) using the weights given in (15); thus we refer to these weights as kernel ridge regression (KRR) weights. [sent-353, score-0.257]

58 One sees from s that the four distinct training samples are not equally similar to the test sample, and from S that the training samples have zero similarity to each other. [sent-361, score-0.515]

59 The four training samples all have similarity 3 to the test sample, but x_2 and x_3 are very similar to each other, and are thus weighted down as prescribed by the design goal of diversity. [sent-364, score-0.407]

60 One approach to generative classification given similarities is using the similarities as features to define an n-dimensional feature space, and then fitting standard generative models in that space, as discussed in Section 3. [sent-413, score-0.875]

61 Another generative framework for similarity-based classification termed similarity discriminant analysis (SDA) has been proposed that models the class-conditional distributions of similarity statistics (Cazzanti et al. [sent-415, score-0.534]

62 In this paper, we take the set of descriptive statistics to be the similarities to class centroids: T(x) = {ψ(x, µ_1), ψ(x, µ_2), . . . [sent-420, score-0.417]
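A hedged sketch of computing the centroid-similarity descriptive statistics T(x). Here each class centroid is taken to be the training sample with the largest summed within-class similarity, which is one natural similarity-space notion of a centroid; the paper's exact centroid definition is not shown in the extract above, so treat this choice as an assumption.

```python
import numpy as np

def class_centroids(S, y):
    """For each class, pick the training sample with the largest summed
    within-class similarity (an assumed, illustrative notion of centroid)."""
    centroids = {}
    for g in np.unique(y):
        idx = np.flatnonzero(y == g)
        within = S[np.ix_(idx, idx)].sum(axis=1)
        centroids[g] = idx[np.argmax(within)]
    return centroids

def descriptive_stats(s, centroids):
    """T(x): similarities of the test sample to each class centroid."""
    return np.array([s[centroids[g]] for g in sorted(centroids)])
```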

63 In this paper we experimentally compare to local SDA (Cazzanti and Gupta, 2007) with local centroid-similarity descriptive statistics given by (17), in which SDA is applied to the k nearest neighbors of a test point, where the parameter k is trained by cross-validation. [sent-434, score-0.271]

64 Experiments We compare eight similarity-based classification approaches: a linear and an RBF SVM using similarities as features, the P-SVM (Hochreiter and Obermayer, 2006), a local SVM (SVM-KNN) (Zhang et al. [sent-444, score-0.441]

65 , 2006) and a global SVM using the given similarities as a kernel, local SDA, k-NN, and the three weighted k-NN methods discussed in Section 5: the proposed KRR and KRI weights, and affinity weights as a control, defined by w_i = a ψ(x, x_i), i = 1, . . . [sent-445, score-0.567]
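A small sketch of the affinity-weighted k-NN control described above, assuming the decision rule (9) is the usual weighted vote over the k nearest neighbors; the scale a and the toy data are illustrative.

```python
import numpy as np

def affinity_weighted_knn(s, y_neighbors, a=1.0):
    """Affinity-weighted k-NN vote: w_i = a * psi(x, x_i), then choose the
    class with the largest total weight (assumed weighted-vote form of (9))."""
    w = a * s
    classes = np.unique(y_neighbors)
    totals = [w[y_neighbors == g].sum() for g in classes]
    return classes[int(np.argmax(totals))]

# Toy example with k = 4 neighbors from two classes.
print(affinity_weighted_knn(np.array([0.9, 0.2, 0.3, 0.8]),
                            np.array([1, 2, 2, 1])))    # -> 1
```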

66 1 Data Sets We tested the proposed classifiers on eight real data sets representing a diverse set of similarities ranging from the human judgement of audio signals to sequence alignment of proteins. [sent-454, score-0.47]

67 Every pair of signals was assigned a similarity score from 1 to 5 by two randomly chosen human subjects unaware of the true labels, and these scores were added to produce a 100 × 100 similarity matrix with integer values from 2 to 10. [sent-462, score-0.514]

68 Similarities for pairs of the original three-dimensional face data were computed as the cosine similarity between integral invariant signatures based on surface curves of the face (Feng et al. [sent-469, score-0.294]

69 The original paper demonstrated comparable results to the state-of-the-art using these similarities with a 1-NN classifier. [sent-471, score-0.368]

70 The Protein data set has sequence-alignment similarities for 213 proteins from 4 classes, where classes one through four contain 72, 72, 39, and 30 samples, respectively (Hofmann and Buhmann, 1997). [sent-500, score-0.368]

71 As further discussed in the results, we define an additional similarity termed RBF-sim for the Protein data set: ψ_RBF(x_i, x_j) = e^{−‖s(x_i) − s(x_j)‖_2}, where s(x) is the 213 × 1 vector of similarities with mth component ψ(x, x_m). [sent-501, score-0.588]
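A sketch of computing RBF-sim from the original similarity matrix as reconstructed above, taking the exponent to be the Euclidean distance between rows; the exact scaling in the paper is assumed.

```python
import numpy as np

def rbf_sim(S):
    """RBF-sim: psi_RBF(x_i, x_j) = exp(-||s(x_i) - s(x_j)||_2), where s(x_i)
    is row i of the original similarity matrix (scaling assumed as above)."""
    diffs = S[:, None, :] - S[None, :, :]     # pairwise differences of rows
    dists = np.linalg.norm(diffs, axis=2)     # Euclidean distances between rows
    return np.exp(-dists)
```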

72 Note that a purely block-diagonal similarity matrix would indicate a particularly easy classification problem, as objects have nonzero similarity only to objects of the same class. [sent-507, score-0.477]

73 (Table 2: Cross-validation Parameter Choices.) The Amazon-47 data set is very sparse with at most four non-zero similarities per row. [sent-555, score-0.368]

74 However, the Mirex data set is also relatively sparse, and yet the global classifiers do well, in particular the SVMs that use similarities as features. [sent-560, score-0.368]

75 The Protein data set exhibits large differences in performance, with the statistically significantly best performances achieved by two of the three SVMs that use similarity as features. [sent-563, score-0.258]

76 The reason that using similarities as features performs so well while the other methods do so poorly can be seen in the Protein similarity matrix in Figure 1. [sent-564, score-0.668]

77 The first and second classes (roughly the first and second thirds of the matrix) exhibit a strong interclass similarity, and rows belonging to the same class exhibit very similar patterns, thus treating the rows of the similarity matrix as feature vectors provides good discrimination of classes. [sent-565, score-0.32]

78 To investigate this effect, we transformed the entire data set using a radial basis function (RBF) to create a similarity based on the Euclidean distance between rows of the original similarity matrix, yielding the 213 × 213 Protein RBF similarity matrix. [sent-566, score-0.66]

79 The Caltech-101 data set is the largest data set, and with 8677 samples in 101 classes, an analysis of the structure of this similarity matrix is difficult. [sent-569, score-0.312]

80 (2006) in part as a way to reduce the computations required to train a global SVM using similarities as a kernel, and their results showed that it performed similarly to a global SVM. [sent-757, score-0.368]

81 For the Amazon-47 and Patrol data sets the local methods all do well including SVM-KNN, but the global methods do poorly, including the global SVM using similarities as a kernel. [sent-759, score-0.406]

82 On the other hand, the global SVM using similarities as a kernel is statistically significantly better than SVM-KNN on Caltech-101, even though the best performance on that data set is achieved by a local method (KRR). [sent-760, score-0.505]

83 Among the four global SVMs (including P-SVM), it is hard to draw conclusions about the performance of the one that uses similarities as a kernel versus the three that use similarities as features. [sent-762, score-0.797]

84 In terms of statistical significance, the SVM using similarities as a kernel outperforms the others on Patrol and Caltech-101 whereas it is the worst on Amazon-47 and Protein, and there is no clear division on the remaining data sets. [sent-763, score-0.429]

85 Lastly, the results do not show statistically significant differences between using the linear or RBF version of the SVM with similarities as features except for small differences on Face Rec and Patrol. [sent-764, score-0.449]

86 Different approaches to modifying similarities to form a kernel were discussed in Section 2. [sent-767, score-0.429]

87 We experimentally compared clip, flip, and shift for the KRI weights, SVM-KNN, and SVM, and flip, clip, shift and pinv for the KRR weights on the nine data sets. [sent-769, score-0.357]

88 For KRR weights, one sees that the pinv solution is never statistically significantly worse than clip, flip, or shift, which are worse than pinv at least once. [sent-771, score-0.308]

89 Thus it is not surprising that for the Protein data set, which we have seen in Table 3 works best with similarities as features, flip makes a large positive difference for SVM-KNN and SVM. [sent-776, score-0.368]

90 (Table fragment: results of the spectrum modifications for the KRI and KRR weights on the Amazon, Mirex, Patrol, Protein, and Voting data sets.) [sent-778, score-0.317]

91 (Table fragment: results of the spectrum modifications for SVM-KNN and the SVM on the Amazon, Mirex, Patrol, Protein, and Voting data sets.) [sent-813, score-0.317]

92 Conclusions and Some Open Questions Similarity-based learning is a practical learning framework for many problems in bioinformatics, computer vision, and those regarding human similarity judgement. [sent-847, score-0.257]

93 Flipping the spectrum does create significantly better performance for the original Protein problem because, as we noted earlier, flipping the spectrum has a similar effect to using the similarities as features, which works favorably for the original Protein problem. [sent-852, score-0.628]

94 Concerning approximating similarities, we addressed the issue of consistent treatment of training and test samples when approximating the similarities to be PSD. [sent-854, score-0.52]

95 A fundamental question is whether it is more advisable to use the similarities as kernels or features. [sent-856, score-0.402]

96 The Caltech-101 similarities are PSD, and it may be that the KRI and KRR methods are sensitive to approximations of the matrix S. [sent-868, score-0.405]

97 Given test similarity vector s, the least-squares solution to the equation X^T x = s is x = (XX^T)^{−1} X s. [sent-886, score-0.261]
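A one-line sketch of the least-squares embedding of a test sample from its similarity vector, matching the formula above; X is assumed to hold the training embeddings as columns.

```python
import numpy as np

def embed_test_sample(X, s):
    """Least-squares embedding of a test sample: solve X.T @ x = s in the
    least-squares sense, giving x = (X X.T)^{-1} X s when rank(X) = m.

    X : (m, n) matrix whose columns are the embedded training samples
    s : (n,)   similarities of the test sample to the training samples
    """
    return np.linalg.solve(X @ X.T, X @ s)
```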

98 The proof of Theorem 2 illustrates why using similarities as features has a poorer generalization guarantee than using similarities as a kernel. [sent-932, score-0.779]

99 Combining pairwise sequence similarity and support vector machines for detecting remote protein evolutionary and structural relationships. [sent-1218, score-0.33]

100 An analysis of transformation on non-positive semidefinite similarity matrix for kernel machines. [sent-1374, score-0.318]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('similarities', 0.368), ('clip', 0.317), ('krr', 0.278), ('kri', 0.268), ('sda', 0.238), ('similarity', 0.22), ('patrol', 0.178), ('psd', 0.168), ('ahimi', 0.149), ('azzanti', 0.149), ('spectrum', 0.13), ('ip', 0.127), ('upta', 0.126), ('arcia', 0.126), ('pinv', 0.119), ('protein', 0.11), ('mirex', 0.109), ('sclip', 0.109), ('svm', 0.09), ('cazzanti', 0.089), ('tversky', 0.089), ('hen', 0.086), ('shift', 0.082), ('wt', 0.076), ('weights', 0.074), ('lassification', 0.073), ('sw', 0.069), ('sonar', 0.068), ('pekalska', 0.067), ('gupta', 0.064), ('inde', 0.064), ('rbf', 0.061), ('kernel', 0.061), ('aural', 0.059), ('gati', 0.059), ('training', 0.056), ('samples', 0.055), ('nearest', 0.053), ('af', 0.053), ('prototypes', 0.053), ('diag', 0.053), ('rademacher', 0.052), ('neighbors', 0.052), ('sst', 0.051), ('generalizability', 0.05), ('laub', 0.05), ('wi', 0.049), ('descriptive', 0.049), ('classi', 0.048), ('generative', 0.048), ('ridge', 0.048), ('voting', 0.048), ('eigenvalues', 0.047), ('lt', 0.046), ('discriminant', 0.046), ('features', 0.043), ('rec', 0.042), ('dissimilarity', 0.042), ('nity', 0.041), ('test', 0.041), ('ith', 0.04), ('gml', 0.04), ('kre', 0.04), ('pclip', 0.04), ('pectrum', 0.04), ('rd', 0.039), ('tm', 0.039), ('balcan', 0.039), ('graepel', 0.039), ('inner', 0.038), ('xi', 0.038), ('euclidean', 0.038), ('fs', 0.038), ('statistically', 0.038), ('local', 0.038), ('matrix', 0.037), ('st', 0.037), ('face', 0.037), ('human', 0.037), ('amazon', 0.036), ('eight', 0.035), ('mendelson', 0.035), ('design', 0.035), ('kernels', 0.034), ('hochreiter', 0.034), ('treating', 0.033), ('sees', 0.032), ('wu', 0.032), ('goals', 0.032), ('washington', 0.032), ('ei', 0.032), ('products', 0.031), ('regularization', 0.031), ('grauman', 0.03), ('interclass', 0.03), ('judgement', 0.03), ('pyramid', 0.03), ('rkks', 0.03), ('svms', 0.03), ('minimize', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms

Author: Yihua Chen, Eric K. Garcia, Maya R. Gupta, Ali Rahimi, Luca Cazzanti

Abstract: This paper reviews and extends the field of similarity-based classification, presenting new analyses, algorithms, data sets, and a comprehensive set of experimental results for a rich collection of classification problems. Specifically, the generalizability of using similarities as features is analyzed, design goals and methods for weighting nearest-neighbors for similarity-based learning are proposed, and different methods for consistently converting similarities into kernels are compared. Experiments on eight real data sets compare eight approaches and their variants to similarity-based learning. Keywords: similarity, dissimilarity, similarity-based learning, indefinite kernels

2 0.099412426 23 jmlr-2009-Discriminative Learning Under Covariate Shift

Author: Steffen Bickel, Michael Brückner, Tobias Scheffer

Abstract: We address classification problems for which the training instances are governed by an input distribution that is allowed to differ arbitrarily from the test distribution—problems also referred to as classification under covariate shift. We derive a solution that is purely discriminative: neither training nor test distribution are modeled explicitly. The problem of learning under covariate shift can be written as an integrated optimization problem. Instantiating the general optimization problem leads to a kernel logistic regression and an exponential model classifier for covariate shift. The optimization problem is convex under certain conditions; our findings also clarify the relationship to the known kernel mean matching procedure. We report on experiments on problems of spam filtering, text classification, and landmine detection. Keywords: covariate shift, discriminative learning, transfer learning

3 0.071836695 90 jmlr-2009-Structure Spaces

Author: Brijnesh J. Jain, Klaus Obermayer

Abstract: Finite structures such as point patterns, strings, trees, and graphs occur as ”natural” representations of structured data in different application areas of machine learning. We develop the theory of structure spaces and derive geometrical and analytical concepts such as the angle between structures and the derivative of functions on structures. In particular, we show that the gradient of a differentiable structural function is a well-defined structure pointing in the direction of steepest ascent. Exploiting the properties of structure spaces, it will turn out that a number of problems in structural pattern recognition such as central clustering or learning in structured output spaces can be formulated as optimization problems with cost functions that are locally Lipschitz. Hence, methods from nonsmooth analysis are applicable to optimize those cost functions. Keywords: graphs, graph matching, learning in structured domains, nonsmooth optimization

4 0.06666182 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers

Author: Andreas Argyriou, Charles A. Micchelli, Massimiliano Pontil

Abstract: We consider a general class of regularization methods which learn a vector of parameters on the basis of linear measurements. It is well known that if the regularizer is a nondecreasing function of the L2 norm, then the learned vector is a linear combination of the input data. This result, known as the representer theorem, lies at the basis of kernel-based methods in machine learning. In this paper, we prove the necessity of the above condition, in the case of differentiable regularizers. We further extend our analysis to regularization methods which learn a matrix, a problem which is motivated by the application to multi-task learning. In this context, we study a more general representer theorem, which holds for a larger class of regularizers. We provide a necessary and sufficient condition characterizing this class of matrix regularizers and we highlight some concrete examples of practical importance. Our analysis uses basic principles from matrix theory, especially the useful notion of matrix nondecreasing functions. Keywords: kernel methods, matrix learning, minimal norm interpolation, multi-task learning, regularization

5 0.064580671 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting

Author: John Duchi, Yoram Singer

Abstract: We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1. We derive concrete and very simple algorithms for minimization of loss functions with ℓ1, ℓ2, ℓ2^2, and ℓ∞ regularization. We also show how to construct efficient algorithms for mixed-norm ℓ1/ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural data sets. Keywords: subgradient methods, group sparsity, online learning, convex optimization

6 0.063587941 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent

7 0.061389901 50 jmlr-2009-Learning When Concepts Abound

8 0.056028277 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification

9 0.054673325 87 jmlr-2009-Sparse Online Learning via Truncated Gradient

10 0.051507924 78 jmlr-2009-Refinement of Reproducing Kernels

11 0.05046929 2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

12 0.050053194 58 jmlr-2009-NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM

13 0.048446052 56 jmlr-2009-Model Monitor (M2): Evaluating, Comparing, and Monitoring Models    (Machine Learning Open Source Software Paper)

14 0.048150994 69 jmlr-2009-Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization

15 0.04750384 8 jmlr-2009-An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems

16 0.047309894 39 jmlr-2009-Hybrid MPI OpenMP Parallel Linear Support Vector Machine Training

17 0.04492379 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory

18 0.043081444 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability

19 0.042550225 29 jmlr-2009-Estimating Labels from Label Proportions

20 0.042389378 99 jmlr-2009-Using Local Dependencies within Batches to Improve Large Margin Classifiers


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.232), (1, -0.009), (2, 0.123), (3, -0.119), (4, 0.069), (5, 0.015), (6, 0.08), (7, -0.005), (8, -0.031), (9, 0.021), (10, 0.043), (11, -0.024), (12, 0.081), (13, -0.107), (14, 0.04), (15, -0.087), (16, 0.041), (17, 0.034), (18, -0.033), (19, -0.106), (20, -0.006), (21, 0.077), (22, -0.137), (23, -0.034), (24, -0.112), (25, -0.079), (26, -0.119), (27, 0.114), (28, 0.001), (29, -0.031), (30, -0.067), (31, -0.058), (32, 0.013), (33, 0.074), (34, -0.063), (35, 0.036), (36, 0.046), (37, 0.18), (38, 0.2), (39, -0.001), (40, -0.003), (41, 0.078), (42, -0.01), (43, -0.056), (44, 0.037), (45, -0.037), (46, 0.169), (47, -0.116), (48, -0.021), (49, -0.093)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91962409 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms

Author: Yihua Chen, Eric K. Garcia, Maya R. Gupta, Ali Rahimi, Luca Cazzanti

Abstract: This paper reviews and extends the field of similarity-based classification, presenting new analyses, algorithms, data sets, and a comprehensive set of experimental results for a rich collection of classification problems. Specifically, the generalizability of using similarities as features is analyzed, design goals and methods for weighting nearest-neighbors for similarity-based learning are proposed, and different methods for consistently converting similarities into kernels are compared. Experiments on eight real data sets compare eight approaches and their variants to similarity-based learning. Keywords: similarity, dissimilarity, similarity-based learning, indefinite kernels

2 0.50864434 99 jmlr-2009-Using Local Dependencies within Batches to Improve Large Margin Classifiers

Author: Volkan Vural, Glenn Fung, Balaji Krishnapuram, Jennifer G. Dy, Bharat Rao

Abstract: Most classification methods assume that the samples are drawn independently and identically from an unknown data generating distribution, yet this assumption is violated in several real life problems. In order to relax this assumption, we consider the case where batches or groups of samples may have internal correlations, whereas the samples from different batches may be considered to be uncorrelated. This paper introduces three algorithms for classifying all the samples in a batch jointly: one based on a probabilistic formulation, and two based on mathematical programming analysis. Experiments on three real-life computer aided diagnosis (CAD) problems demonstrate that the proposed algorithms are significantly more accurate than a naive support vector machine which ignores the correlations among the samples. Keywords: batch-wise classification, support vector machine, linear programming, machine learning, statistical methods, unconstrained optimization

3 0.50441903 23 jmlr-2009-Discriminative Learning Under Covariate Shift

Author: Steffen Bickel, Michael Brückner, Tobias Scheffer

Abstract: We address classification problems for which the training instances are governed by an input distribution that is allowed to differ arbitrarily from the test distribution—problems also referred to as classification under covariate shift. We derive a solution that is purely discriminative: neither training nor test distribution are modeled explicitly. The problem of learning under covariate shift can be written as an integrated optimization problem. Instantiating the general optimization problem leads to a kernel logistic regression and an exponential model classifier for covariate shift. The optimization problem is convex under certain conditions; our findings also clarify the relationship to the known kernel mean matching procedure. We report on experiments on problems of spam filtering, text classification, and landmine detection. Keywords: covariate shift, discriminative learning, transfer learning

4 0.50158173 58 jmlr-2009-NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM

Author: Pradip Ghanty, Samrat Paul, Nikhil R. Pal

Abstract: In this paper we propose a new multilayer classifier architecture. The proposed hybrid architecture has two cascaded modules: feature extraction module and classification module. In the feature extraction module we use the multilayered perceptron (MLP) neural networks, although other tools such as radial basis function (RBF) networks can be used. In the classification module we use support vector machines (SVMs)—here also other tool such as MLP or RBF can be used. The feature extraction module has several sub-modules each of which is expected to extract features capturing the discriminating characteristics of different areas of the input space. The classification module classifies the data based on the extracted features. The resultant architecture with MLP in feature extraction module and SVM in classification module is called NEUROSVM. The NEUROSVM is tested on twelve benchmark data sets and the performance of the NEUROSVM is found to be better than both MLP and SVM. We also compare the performance of proposed architecture with that of two ensemble methods: majority voting and averaging. Here also the NEUROSVM is found to perform better than these two ensemble methods. Further we explore the use of MLP and RBF in the classification module of the proposed architecture. The most attractive feature of NEUROSVM is that it practically eliminates the severe dependency of SVM on the choice of kernel. This has been verified with respect to both linear and non-linear kernels. We have also demonstrated that for the feature extraction module, the full training of MLPs is not needed. Keywords: feature extraction, neural networks (NNs), support vector machines (SVMs), hybrid system, majority voting, averaging

5 0.49634042 56 jmlr-2009-Model Monitor (M2): Evaluating, Comparing, and Monitoring Models    (Machine Learning Open Source Software Paper)

Author: Troy Raeder, Nitesh V. Chawla

Abstract: This paper presents Model Monitor (M 2 ), a Java toolkit for robustly evaluating machine learning algorithms in the presence of changing data distributions. M 2 provides a simple and intuitive framework in which users can evaluate classifiers under hypothesized shifts in distribution and therefore determine the best model (or models) for their data under a number of potential scenarios. Additionally, M 2 is fully integrated with the WEKA machine learning environment, so that a variety of commodity classifiers can be used if desired. Keywords: machine learning, open-source software, distribution shift, scenario analysis

6 0.41835791 50 jmlr-2009-Learning When Concepts Abound

7 0.40017876 69 jmlr-2009-Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization

8 0.39997235 82 jmlr-2009-Robustness and Regularization of Support Vector Machines

9 0.39103934 8 jmlr-2009-An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems

10 0.38257068 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification

11 0.36418903 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory

12 0.3605777 90 jmlr-2009-Structure Spaces

13 0.35212907 30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods

14 0.34197438 2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

15 0.32368717 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers

16 0.32099947 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures

17 0.31416333 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks

18 0.30605313 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting

19 0.30277076 15 jmlr-2009-Cautious Collective Classification

20 0.29611036 29 jmlr-2009-Estimating Labels from Label Proportions


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.017), (11, 0.022), (19, 0.012), (21, 0.01), (26, 0.011), (38, 0.026), (47, 0.012), (52, 0.066), (55, 0.035), (58, 0.527), (66, 0.085), (68, 0.036), (90, 0.064), (96, 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.96303487 6 jmlr-2009-An Algorithm for Reading Dependencies from the Minimal Undirected Independence Map of a Graphoid that Satisfies Weak Transitivity

Author: Jose M. Peña, Roland Nilsson, Johan Björkegren, Jesper Tegnér

Abstract: We present a sound and complete graphical criterion for reading dependencies from the minimal undirected independence map G of a graphoid M that satisfies weak transitivity. Here, complete means that it is able to read all the dependencies in M that can be derived by applying the graphoid properties and weak transitivity to the dependencies used in the construction of G and the independencies obtained from G by vertex separation. We argue that assuming weak transitivity is not too restrictive. As an intermediate step in the derivation of the graphical criterion, we prove that for any undirected graph G there exists a strictly positive discrete probability distribution with the prescribed sample spaces that is faithful to G. We also report an algorithm that implements the graphical criterion and whose running time is considered to be at most O(n2 (e + n)) for n nodes and e edges. Finally, we illustrate how the graphical criterion can be used within bioinformatics to identify biologically meaningful gene dependencies. Keywords: graphical models, vertex separation, graphoids, weak transitivity, bioinformatics

same-paper 2 0.83717293 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms

Author: Yihua Chen, Eric K. Garcia, Maya R. Gupta, Ali Rahimi, Luca Cazzanti

Abstract: This paper reviews and extends the field of similarity-based classification, presenting new analyses, algorithms, data sets, and a comprehensive set of experimental results for a rich collection of classification problems. Specifically, the generalizability of using similarities as features is analyzed, design goals and methods for weighting nearest-neighbors for similarity-based learning are proposed, and different methods for consistently converting similarities into kernels are compared. Experiments on eight real data sets compare eight approaches and their variants to similarity-based learning. Keywords: similarity, dissimilarity, similarity-based learning, indefinite kernels

3 0.82633483 69 jmlr-2009-Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization

Author: Vojtěch Franc, Sören Sonnenburg

Abstract: We have developed an optimized cutting plane algorithm (OCA) for solving large-scale risk minimization problems. We prove that the number of iterations OCA requires to converge to a ε precise solution is approximately linear in the sample size. We also derive OCAS, an OCA-based linear binary Support Vector Machine (SVM) solver, and OCAM, a linear multi-class SVM solver. In an extensive empirical evaluation we show that OCAS outperforms current state-of-the-art SVM solvers like SVMlight , SVMperf and BMRM, achieving speedup factor more than 1,200 over SVMlight on some data sets and speedup factor of 29 over SVMperf , while obtaining the same precise support vector solution. OCAS, even in the early optimization steps, often shows faster convergence than the currently prevailing approximative methods in this domain, SGD and Pegasos. In addition, our proposed linear multi-class SVM solver, OCAM, achieves speedups of factor of up to 10 compared to SVMmulti−class . Finally, we use OCAS and OCAM in two real-world applications, the problem of human acceptor splice site detection and malware detection. Effectively parallelizing OCAS, we achieve state-of-the-art results on an acceptor splice site recognition problem only by being able to learn from all the available 50 million examples in a 12-million-dimensional feature space. Source code, data sets and scripts to reproduce the experiments are available at http://cmp.felk.cvut.cz/˜xfrancv/ocas/html/. Keywords: risk minimization, linear support vector machine, multi-class classification, binary classification, large-scale learning, parallelization

4 0.40142313 54 jmlr-2009-Markov Properties for Linear Causal Models with Correlated Errors    (Special Topic on Causality)

Author: Changsung Kang, Jin Tian

Abstract: A linear causal model with correlated errors, represented by a DAG with bi-directed edges, can be tested by the set of conditional independence relations implied by the model. A global Markov property specifies, by the d-separation criterion, the set of all conditional independence relations holding in any model associated with a graph. A local Markov property specifies a much smaller set of conditional independence relations which will imply all other conditional independence relations which hold under the global Markov property. For DAGs with bi-directed edges associated with arbitrary probability distributions, a local Markov property is given in Richardson (2003) which may invoke an exponential number of conditional independencies. In this paper, we show that for a class of linear structural equation models with correlated errors, there is a local Markov property which will invoke only a linear number of conditional independence relations. For general linear models, we provide a local Markov property that often invokes far fewer conditional independencies than that in Richardson (2003). The results have applications in testing linear structural equation models with correlated errors. Keywords: Markov properties, linear causal models, linear structural equation models, graphical models

5 0.39712569 89 jmlr-2009-Strong Limit Theorems for the Bayesian Scoring Criterion in Bayesian Networks

Author: Nikolai Slobodianik, Dmitry Zaporozhets, Neal Madras

Abstract: In the machine learning community, the Bayesian scoring criterion is widely used for model selection problems. One of the fundamental theoretical properties justifying the usage of the Bayesian scoring criterion is its consistency. In this paper we refine this property for the case of binomial Bayesian network models. As a by-product of our derivations we establish strong consistency and obtain the law of iterated logarithm for the Bayesian scoring criterion. Keywords: Bayesian networks, consistency, scoring criterion, model selection, BIC

6 0.38462338 99 jmlr-2009-Using Local Dependencies within Batches to Improve Large Margin Classifiers

7 0.3816565 19 jmlr-2009-Controlling the False Discovery Rate of the Association Causality Structure Learned with the PC Algorithm    (Special Topic on Mining and Learning with Graphs and Relations)

8 0.37958735 58 jmlr-2009-NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM

9 0.37921509 82 jmlr-2009-Robustness and Regularization of Support Vector Machines

10 0.36688638 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures

11 0.36575449 39 jmlr-2009-Hybrid MPI OpenMP Parallel Linear Support Vector Machine Training

12 0.3624467 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications

13 0.3602846 18 jmlr-2009-Consistency and Localizability

14 0.3597737 38 jmlr-2009-Hash Kernels for Structured Data

15 0.35670394 85 jmlr-2009-Settable Systems: An Extension of Pearl's Causal Model with Optimization, Equilibrium, and Learning

16 0.35563692 48 jmlr-2009-Learning Nondeterministic Classifiers

17 0.35032326 53 jmlr-2009-Marginal Likelihood Integrals for Mixtures of Independence Models

18 0.34947661 70 jmlr-2009-Particle Swarm Model Selection    (Special Topic on Model Selection)

19 0.34513229 41 jmlr-2009-Improving the Reliability of Causal Discovery from Small Data Sets Using Argumentation    (Special Topic on Causality)

20 0.34485358 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions