jmlr jmlr2012 jmlr2012-36 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Aharon Ben-Tal, Sahely Bhadra, Chiranjib Bhattacharyya, Arkadi Nemirovski
Abstract: In this paper we study the problem of designing SVM classifiers when the kernel matrix, K, is affected by uncertainty. Specifically, K is modeled as a positive affine combination of given positive semidefinite kernels, with the coefficients ranging in a norm-bounded uncertainty set. We treat the problem using the Robust Optimization methodology. This reduces the uncertain SVM problem to a deterministic conic quadratic problem which can be solved in principle by a polynomial time Interior Point (IP) algorithm. However, for large-scale classification problems, IP methods become intractable and one has to resort to first-order gradient type methods. The strategy we use here is to reformulate the robust counterpart of the uncertain SVM problem as a saddle point problem and employ a special gradient scheme which works directly on the convex-concave saddle function. The algorithm is a simplified version of a general scheme due to Juditski and Nemirovski (2011). It achieves an O(1/T^2) reduction of the initial error after T iterations. A comprehensive empirical study on both synthetic data and real-world protein structure data sets shows that the proposed formulations achieve the desired robustness, and the saddle point based algorithm outperforms the IP method significantly. Keywords: robust optimization, uncertain classification, kernel functions
Reference: text
sentIndex sentText sentNum sentScore
1 The strategy we use here is to reformulate the robust counterpart of the uncertain SVM problem as a saddle point problem and employ a special gradient scheme which works directly on the convex-concave saddle function. [sent-18, score-0.446]
2 A comprehensive empirical study on both synthetic data and real-world protein structure data sets shows that the proposed formulations achieve the desired robustness, and the saddle point based algorithm outperforms the IP method significantly. [sent-21, score-0.354]
3 Keywords: robust optimization, uncertain classification, kernel functions 1. [sent-22, score-0.241]
4 , 2011) at designing robust classifiers have been limited to the case of linear classification where the uncertainty is specified over an explicitly stated feature map. [sent-45, score-0.23]
5 Consider the problem of automated protein structure classification, an important problem of Computational Biology, where no such feature map is available. [sent-46, score-0.173]
6 Protein structures are specified by a set of 3D coordinates, and it is possible to design kernel functions for protein structures based on the coordinates (Qiu et al. [sent-47, score-0.292]
7 , 2010) initiated a study of designing robust classifiers when the entries of the kernel matrix are independently distributed random variables (a somewhat problematic assumption). [sent-52, score-0.185]
8 The uncertainty in the kernel matrix K is modeled by a bounded convex set, which encompasses several possible realizations of K. [sent-59, score-0.175]
9 This new approach results first in a robust counterpart of the uncertain SVM which can be cast as a Conic Quadratic (CQ) problem. [sent-60, score-0.176]
10 Our main contribution here is to reformulate the robust counterpart as a saddle point problem. [sent-63, score-0.224]
11 Due to favorable conditions satisfied by the saddle function, one can in principle refer to a gradient-based general scheme introduced in Juditski and Nemirovski (2011) for solving such saddle point problems. [sent-64, score-0.270]
12 Experimental results performed on synthetic data, as well as real-world protein structure data sets, show that the saddle-point based algorithm outperforms the IP method considerably. [sent-66, score-0.173]
13 To motivate the paper we start with a brief discussion on issues underlying protein structure classification and kernel based classifiers in Section 2. [sent-69, score-0.238]
14 In Section 5 we present the saddle point algorithm and discuss its application to the minimax problem. [sent-74, score-0.135]
15 Motivation: Uncertain Kernels and Automated Protein Structure Classification. Classification of protein structures into various classes like families, superfamilies, etc., remains an important research challenge in computational biology (see Holm and Sander, 1996, for an introduction). [sent-88, score-0.282]
16 Usually a protein structure is specified by the positions of alpha carbon (Cα ) atoms. [sent-92, score-0.173]
17 A formal description of Cα atoms and protein structures is beyond the scope of the paper and we refer the reader to Branden and Tooze (1999) for an introduction. [sent-93, score-0.266]
18 In the sequel we will denote a protein structure by a set P = {ci ∈ R3 | i = 1, . . . , s}, (3) where each Cα atom is determined by its spatial coordinates ci = (ci1, ci2, ci3) obtained by X-ray crystallography. [sent-94, score-0.173]
19 Biologists often determine the similarity between a pair of structures by first computing an alignment and then measuring the quality of the alignment by root mean square deviation (RMSD). [sent-99, score-0.174]
20 All such procedures implicitly assume that the protein structures are specified exactly, that is, the location of the atoms constituting the structure is known precisely. [sent-105, score-0.266]
21 For a protein structure P, the resolution information r specifies the error in each coordinate. [sent-107, score-0.213]
22 More formally, the position of the ith atom in a protein structure P (see (3)) could be anywhere in the uncertainty box {c : ‖c − c̄i‖∞ ≤ r} (4) around the nominal value c̄i. [sent-108, score-0.283]
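The uncertainty box in (4) is easy to simulate. The sketch below assumes nothing beyond NumPy and uses made-up coordinates and resolution values; it draws random structures from the ℓ∞ box around the nominal Cα coordinates, which is the kind of sampling that produces the kernel value ranges reported further down.

```python
import numpy as np

def sample_from_uncertainty_box(nominal_coords, r, n_samples=10, seed=0):
    """Draw structures uniformly from the box {c : ||c - c_bar||_inf <= r}
    around each nominal C-alpha coordinate (a sketch of the set in (4))."""
    rng = np.random.default_rng(seed)
    nominal_coords = np.asarray(nominal_coords, dtype=float)  # shape (s, 3)
    # Independent uniform perturbation of at most r in each coordinate.
    noise = rng.uniform(-r, r, size=(n_samples,) + nominal_coords.shape)
    return nominal_coords[None, :, :] + noise

# Illustrative nominal structure with 4 atoms and a resolution of 2.0 angstroms.
P_bar = np.array([[0.0, 0.0, 0.0], [1.5, 0.2, 0.1],
                  [3.1, 0.4, 0.3], [4.4, 0.9, 0.2]])
realizations = sample_from_uncertainty_box(P_bar, r=2.0, n_samples=5)
print(realizations.shape)  # (5, 4, 3)
```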
23 Figure 1: (a) Pictorial presentation of Cα atoms of protein d1vsra1 (top) and d1gefa1 (bottom). [sent-120, score-0.212]
24 We refer to P̄ = {c̄i | i = 1, . . . , n} as the nominal structure and U(P) as the uncertainty set associated with it. [sent-126, score-0.338]
25 The structural alignment between P and P′ in the presence of uncertainty sets U(P) and U(P′) is not defined anymore. [sent-128, score-0.170]
26 Even when r is small, the alignment scores between two nominal structures P̄ and P̄′ can differ significantly from the alignment scores between an arbitrary R and R′ where R ∈ U(P), R′ ∈ U(P′). [sent-129, score-0.348]
27 This difference in alignment scores leads to uncertain kernel values. [sent-130, score-0.212]
28 For example, consider two proteins, d1vsra1 (denote it by P) and d1gefa1 (denote it by P′), belonging to the protein superfamily Restriction endonuclease-like. [sent-131, score-0.173]
29 If one ignores the uncertainty one obtains a kernel value of 1. [sent-138, score-0.175]
30 On structures randomly sampled from the corresponding uncertainty box (4), we observe that the kernel value ranged from 1. [sent-141, score-0.175]
31 In superfamily Restriction endonuclease-like, more than 60% of the kernel values, computed between any pair of nominal protein structures, lie between Kmin and Kmax . [sent-145, score-0.466]
32 This demonstrates that accounting for resolution information leads to considerable uncertainty in kernel values. [sent-149, score-0.215]
33 (2010), the uncertainty is modeled by independent noise in each of the entries of the kernel matrix K. [sent-157, score-0.175]
34 To constitute a valid characterization of the uncertainty set, the constraint (6) needs to be modified as follows: Prob(α⊤Y(K + Z)Yα ≤ t) ≥ 1 − ε, K + Z ∈ Sn+. The formulation (5) solves a relaxed version of the above problem by ignoring the psd requirement. [sent-180, score-0.167]
35 Thirdly, the assumption that the entries of Z are independently distributed is extremely unrealistic; often the uncertainty in the entries is due to uncertainty in the observations, hence K(xi, xj) is seldom independent of K(xi, xl) for distinct i, j, l. [sent-182, score-0.220]
36 Affine Uncertainty Set Model for Uncertain Kernel Matrices In this section we introduce an uncertainty set over psd matrices and study the resultant robust SVM problem using an RO approach. [sent-187, score-0.258]
37 In the RO framework the information related to Ψ is modelled as a geometric uncertainty set E ⊂ Rk, and the family of problems (8) is replaced by its robust counterpart: r∗ = min_x max_{Ψ∈E} f(x, Ψ) subject to gi(x, Ψ) ≤ 0 ∀Ψ ∈ E, i = 1, . . . (9) [sent-194, score-0.199]
38 A general representation of E is as follows: E = {Ψ = Ψ̄ + ∑_{i=1}^{L} ηi Ψi : ‖η‖ ≤ ρ}, where Ψ̄ is the nominal value of the uncertain vector Ψ, the vectors Ψi are possible scenarios of it, and η is a perturbation vector. [sent-202, score-0.315]
39 We assume that if z and z′ are noisy observations with uncertainty sets U(z) and U(z′) and nominal values znom and z′nom respectively, then K̄(z, z′) = K(znom, z′nom) defines a kernel function and will be called the nominal kernel. [sent-212, score-0.631]
40 The difference between the actual and the nominal kernel is expressed by a linear combination of L known kernel functions, Kl, l = 1, . . . , L. [sent-213, score-0.358]
41 We impose the uncertainty set (10) on all examples of interest, which immediately leads to the following model of uncertainty on the kernel matrix corresponding to the training set: E(κ) = {K = K̄ + ∑_{l=1}^{L} ηl Kl : ‖η‖p ≤ κ, ηl ≥ 0, l = 1, . . . , L}. [sent-227, score-0.285]
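A minimal sketch of drawing a member of E(κ) is shown below. It assumes the nominal kernel K̄ and the base kernels Kl are supplied as NumPy arrays; the function name and the way η is rescaled into the norm ball are illustrative choices, not the paper's implementation.

```python
import numpy as np

def sample_uncertain_kernel(K_bar, base_kernels, kappa, p=2, seed=0):
    """Return one member of E(kappa): K = K_bar + sum_l eta_l * K_l
    with eta >= 0 and ||eta||_p <= kappa (a sketch, not the paper's code)."""
    rng = np.random.default_rng(seed)
    eta = rng.random(len(base_kernels))          # nonnegative coefficients
    norm = np.linalg.norm(eta, ord=p)
    if norm > kappa:                             # rescale into the norm ball
        eta *= kappa / norm
    K = K_bar + sum(e * Kl for e, Kl in zip(eta, base_kernels))
    return K, eta

# Toy example: a nominal kernel plus two psd base kernels.
rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))
K_bar = A @ A.T
base = [np.eye(5), 0.1 * np.ones((5, 5))]
K, eta = sample_uncertain_kernel(K_bar, base, kappa=0.5)
print(np.min(np.linalg.eigvalsh(K)) >= -1e-9)   # psd, since every piece is psd
```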
42 As any K ∈ E (κ) is always positive semi-definite, the set E (κ) defines a valid model for describing uncertainty in psd matrices. [sent-232, score-0.14]
43 In a later subsection we will discuss the relevance of this setup to the protein structure classification problem. [sent-233, score-0.173]
44 3 Evaluation of Base Kernels. Recall that in the protein structure classification problem each observation is specified by a triple (P̄, U(P), y), where P̄ (see (3)) is the nominal structure, U(P) (see (4)) is the uncertainty set specified by the resolution, and y is the label. [sent-266, score-0.551]
45 To closely parallel the protein structure classification setting we consider the following setup. [sent-268, score-0.173]
46 Consider a set indexed by i = 1, . . . , m, where an observation xi is not directly specified; instead a nominal value x̄i and an uncertainty set Ui are given. [sent-272, score-0.338]
47 An Algorithm for a Special Class of Convex-Concave Saddle Point Problems. In this section we describe a novel algorithm, which is essentially a special case of an algorithm presented in Juditski and Nemirovski (2011), for a class of convex-concave saddle point problems. [sent-292, score-0.135]
48 We are interested in solving a saddle point problem SadVal = min_{x∈X} max_{y∈Y} φ(x, y), (14) where φ(·, ·) satisfies the following assumptions. [sent-313, score-0.135]
49 We have min_{x∈X} φ(x) = SadVal = max_{y∈Y} φ(y), and the set of saddle points of φ on X × Y is X∗ × {y∗}, where X∗ = ArgminX φ and y∗ = ArgmaxY φ. [sent-327, score-0.135]
50 Finally, we denote by εsad(z), z ∈ Z, the natural saddle point proximity measure: εsad(x, y) = φ(x) − φ(y) = [φ(x) − min_X φ] + [max_Y φ − φ(y)]. [sent-328, score-0.135]
51 2 Fixed Step-size per Stage (FSS) Algorithm for Convex-Concave Saddle Point Problem. The MPb algorithm presented in Juditski and Nemirovski (2011) is an extremely fast algorithm for convex-concave saddle point problems. [sent-330, score-0.135]
52 1 FSS Algorithm. We begin by introducing some notation: let G(x, y) = (Gx(x, y), Gy(x, y)), with Gx(x, y) := ∂φ(x, y)/∂x = a(y) and Gy(x, y) := −∂φ(x, y)/∂y, mapping Z := X × Y into itself, be the monotone operator associated with the saddle point problem (14). [sent-339, score-0.135]
53 we have at our disposal a positive Rs and a point ȳs ∈ Y such that ‖ȳs − y∗‖Y ≤ Rs/2. [sent-348, score-0.674]
54 Initialization: We set z1,s = (xω , ys ) = argminZ ωs (·). [sent-351, score-0.337]
55 When t = Ns, we define the approximate solution to (14) built in the course of s stages as (x^s, y^s) = (1/Ns) ∑_{t=1}^{Ns} w_{t,s}, (18) set ȳs+1 = y^s, Rs+1 = Rs/2, and pass to stage s + 1. [sent-357, score-0.706]
56 Then, for every s, (Is) takes place, and εsad(x^s, y^s) ≤ θ R0^2 2^{−2s−5}, (19) while the total number Ms = ∑_{i=0}^{s} Ni of steps of the algorithm needed to build (x^s, y^s) admits the bound (20) below, expressed in terms of Lxy, Lyy, ΩX, ΩY and θ. [sent-364, score-0.674]
57 For s ≤ s∗ we have εsad(x^s, y^s) ≤ θ R0^2 2^{−2(s+2)}, while for s > s∗ we have εsad(x^s, y^s) ≤ O(1) Lxy^2 ΩX ΩY / (θ Ms^2). (20) Proof: see Appendix A. [sent-366, score-0.674]
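To make the stage structure of this result concrete, here is a simplified sketch on a toy saddle problem. Plain projected extragradient steps stand in for the prox mappings of the actual FSS/MPb scheme, and the step sizes, stage lengths and radii are placeholders rather than the constants prescribed by the theorem; only the restart pattern (run a stage, average the iterates, halve Rs) mirrors the description above.

```python
import numpy as np

def proj_ball(v, center, radius):
    """Euclidean projection of v onto the ball with the given center and radius."""
    d = v - center
    n = np.linalg.norm(d)
    return v if n <= radius else center + radius * d / n

def fss_stages_sketch(A, theta, R0, n_stages=5, steps=300, lr=0.05):
    """Illustrative multi-stage scheme for the toy saddle problem
    phi(x, y) = x^T A y - (theta/2) ||y||^2, with x in the unit ball and
    y in a ball of radius R around the current center y_bar.  Projected
    extragradient steps stand in for the paper's prox mappings."""
    n, m = A.shape
    x, y_bar, R = np.zeros(n), np.zeros(m), R0
    for _ in range(n_stages):
        y = y_bar.copy()
        x_sum, y_sum = np.zeros(n), np.zeros(m)
        for _ in range(steps):
            # Predictor step at the current point ...
            gx, gy = A @ y, A.T @ x - theta * y
            xh = proj_ball(x - lr * gx, np.zeros(n), 1.0)
            yh = proj_ball(y + lr * gy, y_bar, R)
            # ... corrector step using gradients at the predicted point.
            gxh, gyh = A @ yh, A.T @ xh - theta * yh
            x = proj_ball(x - lr * gxh, np.zeros(n), 1.0)
            y = proj_ball(y + lr * gyh, y_bar, R)
            x_sum += x
            y_sum += y
        x, y_bar = x_sum / steps, y_sum / steps   # stage averages, cf. (18)
        R /= 2.0                                  # halve the radius for the next stage
    return x, y_bar

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))
x_hat, y_hat = fss_stages_sketch(A, theta=1.0, R0=2.0)
print(x_hat.shape, y_hat.shape)
```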
58 In order to be consistent with the notation of the FSS procedure we define the following map, where the left-hand side denotes quantities involving formulation (12) and the right-hand side corresponds to the saddle point procedure detailed in the previous section. [sent-375, score-0.162]
59 However, the situation still allows one to achieve an O(1/T^2) convergence rate by applying the FSS algorithm to the saddle point reformulation of (24). [sent-412, score-0.135]
60 The prototype algorithm MP solves the saddle point problem (24) with convex-concave and smooth (with Lipschitz continuous gradient) φ at the rate O(1/T), with the hidden factor in O(·) depending on the distance from the starting point to the solution set of (5. [sent-414, score-0.135]
61 Now, when φ is strongly concave in y, the above convergence implies qualified convergence of the y-components yt of approximate solutions to the y-component y∗ of the saddle point of φ, so that eventually we know that ‖yt − y∗‖Y is, say, twice smaller than (an a priori upper bound R on) ‖y1 − y∗‖Y. [sent-422, score-0.238]
62 1 Prediction Rules. For each observation xt, the values K(xt, xi) are only approximately known and lie in an uncertainty set (10). [sent-436, score-0.150]
63 In other words the classifier is robust to uncertainty in the value of K. [sent-440, score-0.199]
64 The simplest case would be to use K(xt, xi) = K̄(xt, xi), which, when used in conjunction with (1), gives the following labelling rule: yt^pr = sign(∑_i yi αi K̄(xt, xi) + b), (25) which, in the sequel, will be referred to as the nominal rule. [sent-441, score-0.294]
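The nominal rule (25) amounts to a standard SVM decision function evaluated with the nominal kernel. A sketch follows; it assumes the dual variables αi, the training labels and the nominal kernel values for a test point are already available, and the variable names are illustrative.

```python
import numpy as np

def nominal_rule(alpha, y_train, k_bar_t, b):
    """Nominal labelling rule (25): sign(sum_i y_i alpha_i K_bar(x_t, x_i) + b).
    k_bar_t holds the nominal kernel values K_bar(x_t, x_i) for one test point."""
    return int(np.sign(np.dot(y_train * alpha, k_bar_t) + b))

# Toy usage with made-up dual variables and kernel values.
alpha = np.array([0.3, 0.0, 0.7])
y_train = np.array([1, -1, -1])
k_bar_t = np.array([0.9, 0.1, 0.4])
print(nominal_rule(alpha, y_train, k_bar_t, b=0.05))
```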
65 One option for arriving at a label would be to take the majority vote with the above kernel function: yt^pr = sign(∑_{s=1}^{R} yt^s), where yt^s = sign(∑_i αi yi K^s(xt, xi) + b). (26) [sent-448, score-0.237]
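A sketch of the majority vote rule (26) is given below. It assumes R realizations of the kernel row K^s(xt, ·) have already been drawn from the uncertainty set (for example with a sampler like the one sketched earlier), and the names are again illustrative.

```python
import numpy as np

def majority_vote_rule(alpha, y_train, k_samples_t, b):
    """Majority vote rule (26): each sampled kernel realization casts a vote
    y_t^s = sign(sum_i alpha_i y_i K^s(x_t, x_i) + b); the final label is the
    sign of the sum of the votes.  k_samples_t has shape (R, n_train)."""
    votes = np.sign(k_samples_t @ (alpha * y_train) + b)
    return int(np.sign(votes.sum()))

# Toy usage: 3 sampled kernel rows for one test point, 3 training points.
alpha = np.array([0.3, 0.0, 0.7])
y_train = np.array([1, -1, -1])
k_samples_t = np.array([[0.9, 0.1, 0.4],
                        [0.8, 0.2, 0.5],
                        [1.0, 0.1, 0.3]])
print(majority_vote_rule(alpha, y_train, k_samples_t, b=0.05))
```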
66 Once we have defined these two prediction rules, namely the majority vote and the nominal rule, it is important to devise measures for evaluating the resultant classifiers. [sent-449, score-0.289]
67 For the nominal classifier (25) the usual 0/1 loss works well, and we define NominalErr (NE) = (1/ntst) ∑_{t=1}^{ntst} 1(ȳt^pr ≠ yt). [sent-456, score-0.255]
68 Similarly, for the majority vote based classifier (26), we define MajorityErr (ME) = (1/ntst) ∑_{t=1}^{ntst} 1(yt^pr ≠ yt), (27) where yt is the true label for xt. [sent-457, score-0.212]
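Both measures are plain empirical 0/1 error rates over the test set; a small helper, assuming predicted labels produced by the two rules above, might look as follows.

```python
import numpy as np

def zero_one_error(y_pred, y_true):
    """Fraction of test points whose predicted label differs from the true
    label; used for both NominalErr (NE) and MajorityErr (ME) in (27)."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return float(np.mean(y_pred != y_true))

y_true = [1, -1, 1, 1]
print(zero_one_error([1, -1, -1, 1], y_true))  # 0.25
```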
69 The Nominal − SVM formulation is the usual SVM formulation (2) with the nominal kernel. [sent-471, score-0.282]
70 Additionally, we have also empirically tested them on protein structure data. [sent-494, score-0.173]
71 We study the problem of robust classification when the kernel values are not available but are governed by (10). [sent-518, score-0.154]
72 We are given {(Pi, ri, yi) | i = 1, . . . , n}, where Pi is the nominal structure described in (3) with label yi. [sent-548, score-0.257]
73 Incorporation of resolution information ri leads to uncertainty sets U(Pi ) (see (4)). [sent-549, score-0.15]
74 (2007), and assuming that the resultant uncertainty in kernel values obeys (10), the kernel functions K̄, Kl are computed by the procedure outlined in Section 4. [sent-551, score-0.269]
75 However, we still provide a comparison to the robust formulations described in this paper for the sake of completeness. [sent-554, score-0.135]
76 When κ = 0 there is no uncertainty, and as κ increases the uncertainty becomes more pronounced. [sent-571, score-0.220]
77 The utility of robust formulations would become clear as κ is increased. [sent-572, score-0.135]
78 This shows that non-robust classifiers, for example SVM, are unable to handle uncertainty compared to the proposed robust classifiers. [sent-593, score-0.199]
79 This experiment demonstrates that in the presence of uncertainty the performance of extremely accurate classifiers suffers drastically, but the proposed robust formulations fare much better in handling uncertainty. [sent-626, score-0.245]
80 4 Verification of Convergence of the FSS Algorithm. In this section we experimentally verify that the proposed saddle point based algorithm has an O(1/M^2) convergence rate (see (20)). [sent-629, score-0.135]
81 This shows that, to build a robust classifier with a medium-scale data set (even more than 1000 examples), the saddle point based algorithm is much more effective than a Conic Quadratic Program based formulation. [sent-656, score-0.224]
82 All three formulations are more robust than Nominal-SVM. [sent-659, score-0.135]
83 The experimental methodology follows a “one-vs-one” classification setting with all 15 classes of protein structures. [sent-680, score-0.173]
84 Let D = {(Pi, ri, yi)} be a protein structure data set, where Pi is the set of coordinates of the ith protein structure obtained from the Astral database, ri is the corresponding resolution information obtained from the PDB, and yi is the class label. [sent-683, score-0.444]
85 One can create a set of uncertain kernels, where K(p, p′) is a kernel function computed between two protein structures p ∈ Qi and p′ ∈ Qj. [sent-689, score-0.379]
86 These kernels are purely based on protein structure (specifically, the positions of the Cα atoms). [sent-698, score-0.207]
87 Table 2 and Table 3 report results for RSVM, USSVMMN and Nominal-SVM (SVM with kernels based on the nominal protein structures reported in PDB files) using both standard and robust error measures defined in Section 6. [sent-703, score-0.524]
88 Conclusion. We studied the problem of designing robust classifiers when the kernel matrices are uncertain. [sent-722, score-0.185]
89 We adapt the general-purpose algorithm of Nemirovski (2004) for solving saddle point problems to this problem. [sent-878, score-0.135]
90 Empirical results show that USSVMSOCP is indeed a robust alternative under uncertainty in the kernel matrices, on both synthetic and real-world data sets. [sent-887, score-0.264]
91 for Z compatible with the norm ‖·‖, and (a) argminZ ωs(·) = (xω, ȳs) ∈ Zs and maxZs ωs(·) − minZs ωs(·) ≤ 1, (b) ∀ z, z′ ∈ Z : ‖G(z) − G(z′)‖∗ ≤ Ls ‖z − z′‖. [sent-902, score-0.337]
92 Indeed, assuming the opposite, let Ys = {y ∈ Y : ‖y − ȳs‖Y ≤ Rs}, and let ȳ = argmax_{y∈Ys} φ(x^s, y), so that φs(x^s) = φ(x^s, ȳ). [sent-948, score-0.337]
93 Since φ(x^s) := max_{y∈Y} φ(x^s, y) > φs(x^s) := max_{y∈Ys} φ(x^s, y) and Ys is cut off Y by the inequality ‖y − ȳs‖Y ≤ Rs, we have ‖ȳ − ȳs‖Y = Rs, while by (Is) we have ‖y∗ − ȳs‖Y ≤ Rs/2, whence, in particular, y∗ ∈ Ys and ‖y∗ − ȳ‖Y ≥ Rs/2. [sent-949, score-1.348]
94 Since φ is strongly concave with respect to ‖·‖Y with modulus θ and attains its maximum over y ∈ Y at y∗, we have φ(y∗) − φ(y^s) ≥ θ‖y∗ − y^s‖Y^2, which combines with (36) to imply that ‖y∗ − y^s‖Y ≤ Rs/4 = Rs+1/2; this is nothing but (Is+1). [sent-960, score-0.674]
95 It is easy to verify that min(−max_{i+}(yi + q̄i), min_{i−}(yi + q̄i) − C) ≤ ν ≤ max(C − min_{i+}(yi + q̄i), max_{i−}(yi + q̄i)). [sent-970, score-0.176]
96 Kernel Functions for Protein Structures. Experiments on protein structures have been conducted with the Weighted Pairwise Distance Substructure Kernel described in Bhattacharya et al. [sent-975, score-0.227]
97 A substructure Nia consists of the l spatially nearest residues to the ath residue of protein Pi. [sent-980, score-0.247]
98 The substructure kernel between two substructures Nia and Njb is defined as Kpds(Nia, Njb) = ∑_{π∈Π(l)} exp(−‖dia − π(djb)‖^2 / σ^2), where dia denotes the set of pairwise distances between all possible pairs of residues in Nia. [sent-981, score-0.139]
99 Finally, the kernel function between two protein structures is defined as K(Pi, Pj) = ∑_{a,b=1}^{ni} ∑_{c,d=1}^{nj} Kpds(Nia, Njc) Kpds(Nib, Njd) Knorm(ia, ib, jc, jd), where Knorm(ia, ib, jc, jd) = exp(−(‖cia − cib‖ − ‖cjc − cjd‖)^2 / σ^2). [sent-987, score-0.396]
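The sketch below illustrates how the substructure kernel Kpds can be evaluated for very small substructures. Treating the pairwise distances as an l × l matrix and letting the permutation act by reordering rows and columns is our own illustrative reading of the formula; the normalising factor Knorm is omitted, and the names are not from the paper's code.

```python
import itertools
import numpy as np

def pairwise_dists(coords):
    """Pairwise distance matrix of an l x 3 array of residue coordinates."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def kpds(coords_a, coords_b, sigma=1.0):
    """Substructure kernel sketch: sum over permutations pi of
    exp(-||D_a - pi(D_b)||^2 / sigma^2), where pi reorders the residues of
    the second substructure.  Only practical for very small l."""
    Da, Db = pairwise_dists(coords_a), pairwise_dists(coords_b)
    l = Da.shape[0]
    total = 0.0
    for pi in itertools.permutations(range(l)):
        Db_perm = Db[np.ix_(pi, pi)]
        total += np.exp(-np.sum((Da - Db_perm) ** 2) / sigma ** 2)
    return total

# Toy example with substructures of l = 3 residues each.
rng = np.random.default_rng(0)
Na, Nb = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
print(kpds(Na, Nb, sigma=2.0))
```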
100 Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. [sent-1122, score-0.135]
wordName wordTfidf (topN-words)
[('ys', 0.337), ('ussvmmn', 0.33), ('ussvmsocp', 0.321), ('fss', 0.266), ('nominal', 0.228), ('rsvm', 0.211), ('bhadra', 0.183), ('protein', 0.173), ('lxy', 0.156), ('ns', 0.149), ('emirovski', 0.147), ('hadra', 0.147), ('hattacharyya', 0.147), ('ncertain', 0.137), ('saddle', 0.135), ('xs', 0.119), ('rs', 0.114), ('uncertainty', 0.11), ('bhattacharya', 0.101), ('atrices', 0.098), ('lyy', 0.094), ('vzs', 0.092), ('robust', 0.089), ('ernel', 0.087), ('uncertain', 0.087), ('ro', 0.082), ('ussvm', 0.082), ('socp', 0.078), ('kl', 0.075), ('svm', 0.068), ('kernel', 0.065), ('sadval', 0.064), ('sad', 0.063), ('en', 0.062), ('alignment', 0.06), ('proxz', 0.055), ('superfamilies', 0.055), ('bhattacharyya', 0.055), ('zs', 0.055), ('gy', 0.054), ('structures', 0.054), ('formulations', 0.046), ('gx', 0.046), ('nemirovski', 0.046), ('errormeasure', 0.046), ('juditski', 0.046), ('nia', 0.046), ('robusterr', 0.046), ('ky', 0.046), ('qi', 0.044), ('ms', 0.043), ('robustness', 0.042), ('xt', 0.04), ('resolution', 0.04), ('atoms', 0.039), ('yt', 0.039), ('sn', 0.039), ('classi', 0.038), ('mpb', 0.037), ('ntst', 0.037), ('residues', 0.037), ('substructure', 0.037), ('ussv', 0.037), ('ytpr', 0.037), ('yts', 0.037), ('ql', 0.035), ('kernels', 0.034), ('qiu', 0.033), ('stage', 0.032), ('majority', 0.032), ('holm', 0.031), ('scop', 0.031), ('argminx', 0.031), ('modulus', 0.031), ('ip', 0.031), ('designing', 0.031), ('vs', 0.031), ('psd', 0.03), ('du', 0.029), ('chance', 0.029), ('resultant', 0.029), ('db', 0.029), ('ls', 0.029), ('yi', 0.029), ('zl', 0.028), ('aharon', 0.027), ('argminy', 0.027), ('kpds', 0.027), ('rmsd', 0.027), ('sahely', 0.027), ('sedumi', 0.027), ('shindyalov', 0.027), ('tst', 0.027), ('formulation', 0.027), ('al', 0.027), ('pi', 0.027), ('jc', 0.026), ('dg', 0.026), ('beta', 0.026), ('concave', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices
Author: Aharon Ben-Tal, Sahely Bhadra, Chiranjib Bhattacharyya, Arkadi Nemirovski
Abstract: In this paper we study the problem of designing SVM classifiers when the kernel matrix, K, is affected by uncertainty. Specifically K is modeled as a positive affine combination of given positive semi definite kernels, with the coefficients ranging in a norm-bounded uncertainty set. We treat the problem using the Robust Optimization methodology. This reduces the uncertain SVM problem into a deterministic conic quadratic problem which can be solved in principle by a polynomial time Interior Point (IP) algorithm. However, for large-scale classification problems, IP methods become intractable and one has to resort to first-order gradient type methods. The strategy we use here is to reformulate the robust counterpart of the uncertain SVM problem as a saddle point problem and employ a special gradient scheme which works directly on the convex-concave saddle function. The algorithm is a simplified version of a general scheme due to Juditski and Nemirovski (2011). It achieves an O(1/T 2 ) reduction of the initial error after T iterations. A comprehensive empirical study on both synthetic data and real-world protein structure data sets show that the proposed formulations achieve the desired robustness, and the saddle point based algorithm outperforms the IP method significantly. Keywords: robust optimization, uncertain classification, kernel functions
2 0.087345466 100 jmlr-2012-Robust Kernel Density Estimation
Author: JooSeuk Kim, Clayton D. Scott
Abstract: We propose a method for nonparametric density estimation that exhibits robustness to contamination of the training sample. This method achieves robustness by combining a traditional kernel density estimator (KDE) with ideas from classical M-estimation. We interpret the KDE based on a positive semi-definite kernel as a sample mean in the associated reproducing kernel Hilbert space. Since the sample mean is sensitive to outliers, we estimate it robustly via M-estimation, yielding a robust kernel density estimator (RKDE). An RKDE can be computed efficiently via a kernelized iteratively re-weighted least squares (IRWLS) algorithm. Necessary and sufficient conditions are given for kernelized IRWLS to converge to the global minimizer of the M-estimator objective function. The robustness of the RKDE is demonstrated with a representer theorem, the influence function, and experimental results for density estimation and anomaly detection. Keywords: outlier, reproducing kernel Hilbert space, kernel trick, influence function, M-estimation
3 0.068058684 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
Author: Fei Yan, Josef Kittler, Krystian Mikolajczyk, Atif Tahir
Abstract: Sparsity-inducing multiple kernel Fisher discriminant analysis (MK-FDA) has been studied in the literature. Building on recent advances in non-sparse multiple kernel learning (MKL), we propose a non-sparse version of MK-FDA, which imposes a general ℓ p norm regularisation on the kernel weights. We formulate the associated optimisation problem as a semi-infinite program (SIP), and adapt an iterative wrapper algorithm to solve it. We then discuss, in light of latest advances in MKL optimisation techniques, several reformulations and optimisation strategies that can potentially lead to significant improvements in the efficiency and scalability of MK-FDA. We carry out extensive experiments on six datasets from various application areas, and compare closely the performance of ℓ p MK-FDA, fixed norm MK-FDA, and several variants of SVM-based MKL (MK-SVM). Our results demonstrate that ℓ p MK-FDA improves upon sparse MK-FDA in many practical situations. The results also show that on image categorisation problems, ℓ p MK-FDA tends to outperform its SVM counterpart. Finally, we also discuss the connection between (MK-)FDA and (MK-)SVM, under the unified framework of regularised kernel machines. Keywords: multiple kernel learning, kernel fisher discriminant analysis, regularised least squares, support vector machines
4 0.058510631 16 jmlr-2012-Algorithms for Learning Kernels Based on Centered Alignment
Author: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh
Abstract: This paper presents new and effective algorithms for learning kernels. In particular, as shown by our empirical results, these algorithms consistently outperform the so-called uniform combination solution that has proven to be difficult to improve upon in the past, as well as other algorithms for learning kernels based on convex combinations of base kernels in both classification and regression. Our algorithms are based on the notion of centered alignment which is used as a similarity measure between kernels or kernel matrices. We present a number of novel algorithmic, theoretical, and empirical results for learning kernels based on our notion of centered alignment. In particular, we describe efficient algorithms for learning a maximum alignment kernel by showing that the problem can be reduced to a simple QP and discuss a one-stage algorithm for learning both a kernel and a hypothesis based on that kernel using an alignment-based regularization. Our theoretical results include a novel concentration bound for centered alignment between kernel matrices, the proof of the existence of effective predictors for kernels with high alignment, both for classification and for regression, and the proof of stability-based generalization bounds for a broad family of algorithms for learning kernels based on centered alignment. We also report the results of experiments with our centered alignment-based algorithms in both classification and regression. Keywords: kernel methods, learning kernels, feature selection
5 0.058060091 97 jmlr-2012-Regularization Techniques for Learning with Matrices
Author: Sham M. Kakade, Shai Shalev-Shwartz, Ambuj Tewari
Abstract: There is growing body of learning problems for which it is natural to organize the parameters into a matrix. As a result, it becomes easy to impose sophisticated prior knowledge by appropriately regularizing the parameters under some matrix norm. This work describes and analyzes a systematic method for constructing such matrix-based regularization techniques. In particular, we focus on how the underlying statistical properties of a given problem can help us decide which regularization function is appropriate. Our methodology is based on a known duality phenomenon: a function is strongly convex with respect to some norm if and only if its conjugate function is strongly smooth with respect to the dual norm. This result has already been found to be a key component in deriving and analyzing several learning algorithms. We demonstrate the potential of this framework by deriving novel generalization and regret bounds for multi-task learning, multi-class learning, and multiple kernel learning. Keywords: regularization, strong convexity, regret bounds, generalization bounds, multi-task learning, multi-class learning, multiple kernel learning
6 0.052938741 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization
7 0.050252274 33 jmlr-2012-Distance Metric Learning with Eigenvalue Optimization
8 0.044974253 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors
9 0.043592285 89 jmlr-2012-Pairwise Support Vector Machines and their Application to Large Scale Problems
10 0.043049067 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation
11 0.040071648 91 jmlr-2012-Plug-in Approach to Active Learning
12 0.039204802 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
13 0.038370278 12 jmlr-2012-Active Clustering of Biological Sequences
14 0.037921909 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
15 0.037044987 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
16 0.036125083 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
17 0.03566888 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
18 0.033322789 44 jmlr-2012-Feature Selection via Dependence Maximization
19 0.032825068 80 jmlr-2012-On Ranking and Generalization Bounds
20 0.031420924 39 jmlr-2012-Estimation and Selection via Absolute Penalized Convex Minimization And Its Multistage Adaptive Applications
topicId topicWeight
[(0, -0.168), (1, -0.032), (2, 0.048), (3, 0.102), (4, 0.056), (5, -0.003), (6, -0.055), (7, 0.016), (8, -0.02), (9, -0.002), (10, 0.013), (11, -0.06), (12, -0.041), (13, -0.008), (14, 0.047), (15, 0.102), (16, 0.051), (17, -0.05), (18, 0.047), (19, -0.002), (20, -0.007), (21, 0.044), (22, 0.073), (23, -0.015), (24, 0.044), (25, -0.055), (26, 0.016), (27, 0.012), (28, 0.068), (29, 0.236), (30, 0.176), (31, 0.175), (32, 0.214), (33, -0.211), (34, -0.069), (35, -0.079), (36, -0.211), (37, 0.015), (38, 0.025), (39, -0.277), (40, -0.176), (41, -0.162), (42, 0.06), (43, 0.109), (44, -0.094), (45, 0.09), (46, -0.081), (47, 0.125), (48, -0.112), (49, 0.043)]
simIndex simValue paperId paperTitle
same-paper 1 0.91524953 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices
Author: Aharon Ben-Tal, Sahely Bhadra, Chiranjib Bhattacharyya, Arkadi Nemirovski
Abstract: In this paper we study the problem of designing SVM classifiers when the kernel matrix, K, is affected by uncertainty. Specifically K is modeled as a positive affine combination of given positive semi definite kernels, with the coefficients ranging in a norm-bounded uncertainty set. We treat the problem using the Robust Optimization methodology. This reduces the uncertain SVM problem into a deterministic conic quadratic problem which can be solved in principle by a polynomial time Interior Point (IP) algorithm. However, for large-scale classification problems, IP methods become intractable and one has to resort to first-order gradient type methods. The strategy we use here is to reformulate the robust counterpart of the uncertain SVM problem as a saddle point problem and employ a special gradient scheme which works directly on the convex-concave saddle function. The algorithm is a simplified version of a general scheme due to Juditski and Nemirovski (2011). It achieves an O(1/T 2 ) reduction of the initial error after T iterations. A comprehensive empirical study on both synthetic data and real-world protein structure data sets show that the proposed formulations achieve the desired robustness, and the saddle point based algorithm outperforms the IP method significantly. Keywords: robust optimization, uncertain classification, kernel functions
2 0.61595529 100 jmlr-2012-Robust Kernel Density Estimation
Author: JooSeuk Kim, Clayton D. Scott
Abstract: We propose a method for nonparametric density estimation that exhibits robustness to contamination of the training sample. This method achieves robustness by combining a traditional kernel density estimator (KDE) with ideas from classical M-estimation. We interpret the KDE based on a positive semi-definite kernel as a sample mean in the associated reproducing kernel Hilbert space. Since the sample mean is sensitive to outliers, we estimate it robustly via M-estimation, yielding a robust kernel density estimator (RKDE). An RKDE can be computed efficiently via a kernelized iteratively re-weighted least squares (IRWLS) algorithm. Necessary and sufficient conditions are given for kernelized IRWLS to converge to the global minimizer of the M-estimator objective function. The robustness of the RKDE is demonstrated with a representer theorem, the influence function, and experimental results for density estimation and anomaly detection. Keywords: outlier, reproducing kernel Hilbert space, kernel trick, influence function, M-estimation
3 0.40386465 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors
Author: Emilio Parrado-Hernández, Amiran Ambroladze, John Shawe-Taylor, Shiliang Sun
Abstract: This paper presents the prior PAC-Bayes bound and explores its capabilities as a tool to provide tight predictions of SVMs’ generalization. The computation of the bound involves estimating a prior of the distribution of classifiers from the available data, and then manipulating this prior in the usual PAC-Bayes generalization bound. We explore two alternatives: to learn the prior from a separate data set, or to consider an expectation prior that does not need this separate data set. The prior PAC-Bayes bound motivates two SVM-like classification algorithms, prior SVM and ηprior SVM, whose regularization term pushes towards the minimization of the prior PAC-Bayes bound. The experimental work illustrates that the new bounds can be significantly tighter than the original PAC-Bayes bound when applied to SVMs, and among them the combination of the prior PAC-Bayes bound and the prior SVM algorithm gives the tightest bound. Keywords: PAC-Bayes bound, support vector machine, generalization capability prediction, classification
4 0.39134002 16 jmlr-2012-Algorithms for Learning Kernels Based on Centered Alignment
Author: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh
Abstract: This paper presents new and effective algorithms for learning kernels. In particular, as shown by our empirical results, these algorithms consistently outperform the so-called uniform combination solution that has proven to be difficult to improve upon in the past, as well as other algorithms for learning kernels based on convex combinations of base kernels in both classification and regression. Our algorithms are based on the notion of centered alignment which is used as a similarity measure between kernels or kernel matrices. We present a number of novel algorithmic, theoretical, and empirical results for learning kernels based on our notion of centered alignment. In particular, we describe efficient algorithms for learning a maximum alignment kernel by showing that the problem can be reduced to a simple QP and discuss a one-stage algorithm for learning both a kernel and a hypothesis based on that kernel using an alignment-based regularization. Our theoretical results include a novel concentration bound for centered alignment between kernel matrices, the proof of the existence of effective predictors for kernels with high alignment, both for classification and for regression, and the proof of stability-based generalization bounds for a broad family of algorithms for learning kernels based on centered alignment. We also report the results of experiments with our centered alignment-based algorithms in both classification and regression. Keywords: kernel methods, learning kernels, feature selection
5 0.300751 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
Author: Fei Yan, Josef Kittler, Krystian Mikolajczyk, Atif Tahir
Abstract: Sparsity-inducing multiple kernel Fisher discriminant analysis (MK-FDA) has been studied in the literature. Building on recent advances in non-sparse multiple kernel learning (MKL), we propose a non-sparse version of MK-FDA, which imposes a general ℓ p norm regularisation on the kernel weights. We formulate the associated optimisation problem as a semi-infinite program (SIP), and adapt an iterative wrapper algorithm to solve it. We then discuss, in light of latest advances in MKL optimisation techniques, several reformulations and optimisation strategies that can potentially lead to significant improvements in the efficiency and scalability of MK-FDA. We carry out extensive experiments on six datasets from various application areas, and compare closely the performance of ℓ p MK-FDA, fixed norm MK-FDA, and several variants of SVM-based MKL (MK-SVM). Our results demonstrate that ℓ p MK-FDA improves upon sparse MK-FDA in many practical situations. The results also show that on image categorisation problems, ℓ p MK-FDA tends to outperform its SVM counterpart. Finally, we also discuss the connection between (MK-)FDA and (MK-)SVM, under the unified framework of regularised kernel machines. Keywords: multiple kernel learning, kernel fisher discriminant analysis, regularised least squares, support vector machines
6 0.27468863 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation
7 0.26102594 89 jmlr-2012-Pairwise Support Vector Machines and their Application to Large Scale Problems
8 0.25770107 23 jmlr-2012-Breaking the Curse of Kernelization: Budgeted Stochastic Gradient Descent for Large-Scale SVM Training
9 0.24998334 112 jmlr-2012-Structured Sparsity via Alternating Direction Methods
10 0.23078842 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
11 0.22967386 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
12 0.22225028 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization
13 0.19728062 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets
14 0.18834975 97 jmlr-2012-Regularization Techniques for Learning with Matrices
15 0.17842551 52 jmlr-2012-Iterative Reweighted Algorithms for Matrix Rank Minimization
16 0.1749139 91 jmlr-2012-Plug-in Approach to Active Learning
17 0.16844063 104 jmlr-2012-Security Analysis of Online Centroid Anomaly Detection
18 0.16783206 44 jmlr-2012-Feature Selection via Dependence Maximization
19 0.16552053 109 jmlr-2012-Stability of Density-Based Clustering
20 0.1568092 102 jmlr-2012-Sally: A Tool for Embedding Strings in Vector Spaces
topicId topicWeight
[(7, 0.026), (21, 0.034), (26, 0.04), (29, 0.03), (35, 0.04), (49, 0.016), (56, 0.018), (57, 0.014), (64, 0.02), (69, 0.021), (75, 0.074), (77, 0.012), (79, 0.016), (88, 0.385), (92, 0.067), (96, 0.106)]
simIndex simValue paperId paperTitle
same-paper 1 0.68981236 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices
Author: Aharon Ben-Tal, Sahely Bhadra, Chiranjib Bhattacharyya, Arkadi Nemirovski
Abstract: In this paper we study the problem of designing SVM classifiers when the kernel matrix, K, is affected by uncertainty. Specifically K is modeled as a positive affine combination of given positive semi definite kernels, with the coefficients ranging in a norm-bounded uncertainty set. We treat the problem using the Robust Optimization methodology. This reduces the uncertain SVM problem into a deterministic conic quadratic problem which can be solved in principle by a polynomial time Interior Point (IP) algorithm. However, for large-scale classification problems, IP methods become intractable and one has to resort to first-order gradient type methods. The strategy we use here is to reformulate the robust counterpart of the uncertain SVM problem as a saddle point problem and employ a special gradient scheme which works directly on the convex-concave saddle function. The algorithm is a simplified version of a general scheme due to Juditski and Nemirovski (2011). It achieves an O(1/T 2 ) reduction of the initial error after T iterations. A comprehensive empirical study on both synthetic data and real-world protein structure data sets show that the proposed formulations achieve the desired robustness, and the saddle point based algorithm outperforms the IP method significantly. Keywords: robust optimization, uncertain classification, kernel functions
2 0.37580526 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao
Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization
3 0.3750183 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
Author: Chunhua Shen, Junae Kim, Lei Wang, Anton van den Hengel
Abstract: The success of many machine learning and pattern recognition methods relies heavily upon the identification of an appropriate distance metric on the input data. It is often beneficial to learn such a metric from the input training data, instead of using a default one such as the Euclidean distance. In this work, we propose a boosting-based technique, termed B OOST M ETRIC, for learning a quadratic Mahalanobis distance metric. Learning a valid Mahalanobis distance metric requires enforcing the constraint that the matrix parameter to the metric remains positive semidefinite. Semidefinite programming is often used to enforce this constraint, but does not scale well and is not easy to implement. B OOST M ETRIC is instead based on the observation that any positive semidefinite matrix can be decomposed into a linear combination of trace-one rank-one matrices. B OOST M ETRIC thus uses rank-one positive semidefinite matrices as weak learners within an efficient and scalable boosting-based learning process. The resulting methods are easy to implement, efficient, and can accommodate various types of constraints. We extend traditional boosting algorithms in that its weak learner is a positive semidefinite matrix with trace and rank being one rather than a classifier or regressor. Experiments on various data sets demonstrate that the proposed algorithms compare favorably to those state-of-the-art methods in terms of classification accuracy and running time. Keywords: Mahalanobis distance, semidefinite programming, column generation, boosting, Lagrange duality, large margin nearest neighbor
4 0.37383139 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
Author: Mehrdad Mahdavi, Rong Jin, Tianbao Yang
Abstract: In this paper we propose efficient algorithms for solving constrained online convex optimization problems. Our motivation stems from the observation that most algorithms proposed for online convex optimization require a projection onto the convex set K from which the decisions are made. While the projection is straightforward for simple shapes (e.g., Euclidean ball), for arbitrary complex sets it is the main computational challenge and may be inefficient in practice. In this paper, we consider an alternative online convex optimization problem. Instead of requiring that decisions belong to K for all rounds, we only require that the constraints, which define the set K , be satisfied in the long run. By turning the problem into an online convex-concave optimization problem, √ we propose an efficient algorithm which achieves O( T ) regret bound and O(T 3/4 ) bound on the violation of constraints. Then, we modify the algorithm in order to guarantee that the constraints are satisfied in the long run. This gain is achieved at the price of getting O(T 3/4 ) regret bound. Our second algorithm is based on the mirror prox method (Nemirovski, 2005) to solve variational inequalities which achieves O(T 2/3 ) bound for both regret and the violation of constraints when the domain K can be described by a finite number of linear constraints. Finally, we extend the results to the setting where we only have partial access to the convex set K and propose a multipoint bandit feedback algorithm with the same bounds in expectation as our first algorithm. Keywords: online convex optimization, convex-concave optimization, bandit feedback, variational inequality
5 0.37352058 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
Author: Ofer Dekel, Claudio Gentile, Karthik Sridharan
Abstract: We present a new online learning algorithm in the selective sampling framework, where labels must be actively queried before they are revealed. We prove bounds on the regret of our algorithm and on the number of labels it queries when faced with an adaptive adversarial strategy of generating the instances. Our bounds both generalize and strictly improve over previous bounds in similar settings. Additionally, our selective sampling algorithm can be converted into an efficient statistical active learning algorithm. We extend our algorithm and analysis to the multiple-teacher setting, where the algorithm can choose which subset of teachers to query for each label. Finally, we demonstrate the effectiveness of our techniques on a real-world Internet search problem. Keywords: online learning, regret, label-efficient, crowdsourcing
6 0.37302309 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
7 0.37094712 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
8 0.36823756 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
9 0.36789858 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
10 0.36243361 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
11 0.36231229 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection
12 0.36185807 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
13 0.36116296 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
14 0.35903245 103 jmlr-2012-Sampling Methods for the Nyström Method
15 0.35871336 18 jmlr-2012-An Improved GLMNET for L1-regularized Logistic Regression
16 0.35700262 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
17 0.35558823 13 jmlr-2012-Active Learning via Perfect Selective Classification
18 0.35542426 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
19 0.3547166 80 jmlr-2012-On Ranking and Generalization Bounds
20 0.35452318 16 jmlr-2012-Algorithms for Learning Kernels Based on Centered Alignment