jmlr jmlr2013 jmlr2013-29 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yu-Feng Li, Ivor W. Tsang, James T. Kwok, Zhi-Hua Zhou
Abstract: In this paper, we study the problem of learning from weakly labeled data, where labels of the training examples are incomplete. This includes, for example, (i) semi-supervised learning where labels are partially known; (ii) multi-instance learning where labels are implicitly known; and (iii) clustering where labels are completely unknown. Unlike supervised learning, learning with weak labels involves a difficult Mixed-Integer Programming (MIP) problem. Therefore, it can suffer from poor scalability and may also get stuck in local minima. In this paper, we focus on SVMs and propose the WellSVM via a novel label generation strategy. This leads to a convex relaxation of the original MIP, which is at least as tight as existing convex Semi-Definite Programming (SDP) relaxations. Moreover, the WellSVM can be solved via a sequence of SVM subproblems that are much more scalable than previous convex SDP relaxations. Experiments on three weakly labeled learning tasks, namely, (i) semi-supervised learning; (ii) multi-instance learning for locating regions of interest in content-based information retrieval; and (iii) clustering, clearly demonstrate improved performance, and WellSVM is also readily applicable to large data sets. Keywords: weakly labeled data, semi-supervised learning, multi-instance learning, clustering, cutting plane, convex relaxation
Reference: text
sentIndex sentText sentNum sentScore
1 National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China. Editor: Sathiya Keerthi. Abstract: In this paper, we study the problem of learning from weakly labeled data, where labels of the training examples are incomplete. [sent-13, score-0.22]
2 This includes, for example, (i) semi-supervised learning where labels are partially known; (ii) multi-instance learning where labels are implicitly known; and (iii) clustering where labels are completely unknown. [sent-14, score-0.204]
3 This leads to a convex relaxation of the original MIP, which is at least as tight as existing convex Semi-Definite Programming (SDP) relaxations. [sent-18, score-0.13]
4 Keywords: weakly labeled data, semi-supervised learning, multi-instance learning, clustering, cutting plane, convex relaxation 1. [sent-21, score-0.301]
5 On the other hand, weakly labeled data, where the labels are incomplete, are ubiquitous in many applications. [sent-24, score-0.198]
6 Therefore, exploiting weakly labeled training data may help improve performance and discover the underlying structure of the data. [sent-26, score-0.17]
7 In the following, we summarize several major learning paradigms with weakly labeled data: • Labels are partially known. [sent-29, score-0.148]
8 In traditional MIL, a bag is labeled positive when it contains at least one positive instance, and is labeled negative otherwise. [sent-43, score-0.25]
9 Although the bag labels are often available, the instance labels are only implicitly known. [sent-44, score-0.16]
10 It is worth noting that identification of the key (or positive) instances from the positive bags can be very useful in many real-world applications. [sent-45, score-0.109]
11 (2011) considered weakly labeled data in the context of multi-label learning, whereas Yang et al. [sent-58, score-0.148]
12 (2013) considered weakly labeled data in the context of multi-instance multi-label learning (Zhou et al. [sent-59, score-0.148]
13 Unlike supervised learning where the training labels are complete, weak-label learning needs to infer the integer-valued labels of the training examples, resulting in a difficult mixed-integer programming (MIP) problem. [sent-61, score-0.144]
14 Therefore, it is desirable to develop a scalable yet convex optimization method for learning with large-scale weakly labeled data. [sent-75, score-0.193]
15 (2009a,c), we propose a convex weakly labeled SVM (denoted WellSVM, for WEakly LabeLed SVM) via a novel "label generation" strategy. [sent-80, score-0.193]
16 Instead of obtaining a label relation matrix via SDP, WellSVM maximizes the margin by iteratively generating the most violated label vectors, and then combines them via efficient multiple kernel learning techniques. [sent-81, score-0.286]
17 Furthermore, it can be shown that the learned linear combination of label vector outer-products is in the convex hull of the label space. [sent-83, score-0.157]
18 Since the convex hull is the smallest convex set containing the target non-convex set (Boyd and Vandenberghe, 2004), our formulation is at least as tight as the convex SDP relaxations proposed in Xu et al. [sent-84, score-0.168]
19 , α_N]' contains the dual variables, and: • A is a convex set; • G(α, ŷ) is a concave function in α for any fixed ŷ; • g_ŷ(α) = −G(α, ŷ) is λ-strongly convex and M-Lipschitz. [sent-187, score-0.195]
20 • Standard SVM without offset: We have A = {α | C1 ≥ α ≥ 0}, G(α, ŷ) = α'1 − ½ α'(K ⊙ ŷŷ')α, ∇²g_ŷ(α) = K ⊙ ŷŷ' ⪰ λ_min (I ⊙ ŷŷ') = λ_min I, |g_ŷ(α) − g_ŷ(ᾱ)| ≤ (1 + CυN)√N ‖α − ᾱ‖, 0 ≤ max_{α∈A} G(α, ŷ) ≤ CN, and Ḡ(α, M_ŷ) = α'1 − ½ α'(K ⊙ M_ŷ)α, where M_ŷ = ŷŷ'. [sent-193, score-0.244]
21 • ν-SVM (Schölkopf and Smola, 2002): We have A = {α | α ≥ 0, α'1 = 1}, G(α, ŷ) = −½ α'((K + (1/C)I) ⊙ ŷŷ')α, ∇²g_ŷ(α) = (K + (1/C)I) ⊙ ŷŷ' ⪰ (λ_min + 1/C) I, |g_ŷ(α) − g_ŷ(ᾱ)| ≤ (υ + 1/C)√N ‖α − ᾱ‖, −½(υ + 1/C) ≤ max_{α∈A} G(α, ŷ) ≤ 0, and Ḡ(α, M_ŷ) = −½ α'((K + (1/C)I) ⊙ M_ŷ)α. [sent-194, score-0.244]
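For concreteness, the quantity G(α, ŷ) is cheap to evaluate directly. The following minimal NumPy sketch is our own illustration (not the authors' code); `alpha`, `y_hat`, and `K` are assumed to be a dual vector, a ±1 label vector, and a kernel matrix.

```python
import numpy as np

def G(alpha, y_hat, K):
    """G(alpha, y_hat) = alpha'1 - 1/2 alpha'(K ⊙ y_hat y_hat') alpha
    for the offset-free SVM case listed above."""
    K_yy = K * np.outer(y_hat, y_hat)   # Hadamard product K ⊙ ŷŷ'
    return alpha.sum() - 0.5 * alpha @ K_yy @ alpha
```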
22 Proposition 1 The objective of WellSVM can be rewritten as the following optimization problem: min_{µ∈M} max_{α∈A} ∑_{t: ŷ_t∈B} µ_t G(α, ŷ_t), (6) where µ is the vector of the µ_t's, M is the simplex {µ | ∑_t µ_t = 1, µ_t ≥ 0}, and ŷ_t ∈ B. [sent-201, score-0.214]
23 We can then replace the inner optimization subproblem with its dual, and Equation (5) becomes: max_{α∈A} min_{µ∈M} ∑_{t: ŷ_t∈B} µ_t G(α, ŷ_t) = min_{µ∈M} max_{α∈A} ∑_{t: ŷ_t∈B} µ_t G(α, ŷ_t). [sent-208, score-0.241]
24 Our minimax relaxation in Equation (6) can be written as min_{µ∈M} max_{α∈A} ∑_{t: ŷ_t∈B} µ_t Ḡ(α, M_{ŷ_t}) = min_{µ∈M} max_{α∈A} Ḡ(α, ∑_{t: ŷ_t∈B} µ_t M_{ŷ_t}), which is of the form min max_{α∈A} Ḡ(α, M). [sent-224, score-0.168]
25 (2005) used B = {ŷ | −β ≤ 1'ŷ ≤ β}, where β is a parameter controlling the class imbalance, and M_B is defined for clustering as M_B = {M = [m_ij] | −1 ≤ m_ij ≤ 1; m_ii = 1, m_ij = m_ji, m_ik ≥ m_ij + m_jk − 1, m_jk ≥ −m_ij − m_ik − 1, −β ≤ ∑_{j=1}^N m_ij ≤ β, ∀ i, j, k = 1, . . . [sent-228, score-0.31]
26 Therefore, we can apply the cutting plane method (Kelley, 1960). [sent-252, score-0.106]
27 The cutting plane algorithm is described in Algorithm 1. [sent-253, score-0.106]
28 First, we initialize a label vector ŷ, set the working set C to {ŷ}, and obtain α from min_{µ∈M} max_{α∈A} ∑_{t: ŷ_t∈C} µ_t G(α, ŷ_t) (9) via standard supervised learning methods. [sent-254, score-0.187]
29 Then, a violated label vector ŷ in Equation (5) is generated and added to C. [sent-255, score-0.151]
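As a rough illustration of the loop just described, the sketch below is schematic only; `solve_subproblem` and `find_violated_label` are hypothetical placeholders standing in for the subproblem of Equation (9) and the label-generation step.

```python
def wellsvm_cutting_plane(y_init, solve_subproblem, find_violated_label,
                          max_iter=50, tol=1e-3):
    """Schematic WellSVM outer loop: grow a working set C of label vectors
    and re-solve the min-max subproblem over the current C."""
    C = [y_init]
    for _ in range(max_iter):
        alpha, mu = solve_subproblem(C)            # min_mu max_alpha over current C
        y_new, violation = find_violated_label(alpha, C)
        if violation <= tol:                       # no sufficiently violated label left
            break
        C.append(y_new)
    return alpha, mu, C
```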
30 1, a new label assignment for the unlabeled data is also generated in each iteration. [sent-259, score-0.159]
31 Moreover, they update the label assignment to approach a locally optimal solution, while our WellSVM aims at the solution of a tight convex relaxation. [sent-264, score-0.171]
32 Thus, Proposition 2 shows that with the use of the cutting plane algorithm, the number of active constraints only scales polynomially in N. [sent-274, score-0.106]
33 Proposition 2 can be further refined by taking the search effort of a violated label into account. [sent-278, score-0.151]
34 Let ε_r be the magnitude of the violation of a violated label in the r-th iteration, that is, ε_r = min_{ŷ∈C_r} G(α, ŷ) − G(α, ŷ_r), where C_r and ŷ_r denote the set of violated labels and the violated label obtained in the r-th iteration, respectively. [sent-283, score-0.447]
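The violation magnitude ε_r just defined can be computed directly; a hedged sketch, where `G` is any callable implementing the dual objective G(α, ŷ) and `C_r`, `y_r` follow the notation above:

```python
def violation_magnitude(alpha, C_r, y_r, G):
    """epsilon_r = min_{ŷ in C_r} G(alpha, ŷ) - G(alpha, ŷ_r)."""
    return min(G(alpha, y) for y in C_r) - G(alpha, y_r)
```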
35 Let D_L = {x_i, y_i}_{i=1}^l and D_U = {x_j}_{j=l+1}^N be the sets of labeled and unlabeled examples, respectively, and L = {1, . . . [sent-299, score-0.168]
36 In semi-supervised learning, unlabeled data are typically much more abundant than labeled data, that is, N − l ≫ l. [sent-308, score-0.168]
37 Hence, one can obtain a trivially “optimal” solution with infinite margin by assigning all the unlabeled examples to the same label. [sent-309, score-0.117]
38 To prevent such a useless solution, Joachims (1999) introduced the balance constraint (1'ŷ_U)/(N − l) = (1'y_L)/l, where ŷ = [ŷ_1, · · · , ŷ_N]' is the vector of learned labels on both labeled and unlabeled examples, and ŷ_L = [ŷ_1, . . . [sent-310, score-0.218]
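The balance constraint is straightforward to check; a small sketch (our illustration, assuming ±1 labels stored as NumPy arrays):

```python
import numpy as np

def satisfies_balance(y_hat_U, y_L, tol=1e-6):
    """Joachims' balance constraint 1'ŷ_U/(N-l) = 1'y_L/l: the mean label on
    the unlabeled part must match the mean label on the labeled part."""
    return abs(np.mean(y_hat_U) - np.mean(y_L)) <= tol
```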
39 Let Ω = ½‖w‖² and let ℓ_f(D) be the sum of hinge losses on both the labeled and unlabeled data; Equation (2) then leads to min_{ŷ∈B} min_{w,ξ} ½‖w‖² + C1 ∑_{i=1}^l ξ_i + C2 ∑_{j=l+1}^N ξ_j s. [sent-317, score-0.168]
40 , N, where B = {ŷ | ŷ = [ŷ_L; ŷ_U], ŷ_L = y_L, ŷ_U ∈ {±1}^{N−l}; (1'ŷ_U)/(N−l) = (1'y_L)/l}, and C1, C2 trade off model complexity and the empirical losses on the labeled and unlabeled data, respectively. [sent-322, score-0.168]
41 Using Proposition 1, we have min_{µ∈M} max_{α∈A} 1'α − ½ α'(∑_{t: ŷ_t∈B} µ_t K ⊙ ŷ_t ŷ_t')α, (12) which is a convex relaxation of Equation (11). [sent-327, score-0.163]
42 , 2004), where the target kernel matrix is a convex combination of |B| base kernel matrices {K ⊙ ŷ_t ŷ_t'}_{t: ŷ_t∈B}, each of which is constructed from a feasible label vector ŷ_t ∈ B. [sent-332, score-0.277]
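In other words, the learned kernel has the form K(µ) = ∑_t µ_t (K ⊙ ŷ_t ŷ_t'); a minimal sketch of assembling it (illustrative only, variable names are ours):

```python
import numpy as np

def combined_label_kernel(K, label_vectors, mu):
    """Convex combination of base label-kernels K ⊙ ŷ_t ŷ_t' with simplex weights mu."""
    K_mix = np.zeros_like(K, dtype=float)
    for mu_t, y_t in zip(mu, label_vectors):
        K_mix += mu_t * (K * np.outer(y_t, y_t))
    return K_mix
```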
43 Algorithm: From Section 3, the cutting plane algorithm is used to solve Equation (12). [sent-335, score-0.106]
44 There are two important issues that have to be addressed in the use of cutting plane algorithms. [sent-336, score-0.106]
45 Unlike standard MKL problems which try to find the optimal kernel function/matrix for a given set of labels, here, we have to find the optimal label kernel matrix. [sent-360, score-0.126]
46 Note that the feature map corresponding to the base kernel matrix K ⊙ ŷ_t ŷ_t' is x_i → ŷ_{ti} φ(x_i). [sent-367, score-0.115]
47 It is easy to verify that its dual can be written as min_{µ∈M} max_{α∈A} 1'α − ½ α'(∑_{t=1}^T µ_t K ⊙ ŷ_t ŷ_t')α, which is the same as Equation (12). [sent-377, score-0.11]
48 Fix the w_t's and update µ in closed form, as µ_t = ‖w_t‖ / ∑_{t'=1}^T ‖w_{t'}‖, t = 1, . . . [sent-394, score-0.15]
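A sketch of this closed-form step, assuming the norm-based simplex update common in group-lasso style MKL (the exact norm follows the MKL solver actually used):

```python
import numpy as np

def update_mu(w_list):
    """Simplex update mu_t ∝ ||w_t||: fix the per-kernel weight vectors w_t,
    then renormalize their norms so the weights sum to one."""
    norms = np.array([np.linalg.norm(w_t) for w_t in w_list])
    return norms / norms.sum()
```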
49 Note that while the use of the most violated constraint may lead to faster convergence, the cutting plane algorithm only requires the addition of a violated constraint at each iteration (Kelley, 1960; Tsochantaridis et al. [sent-405, score-0.296]
50 Hence, we propose in the following a simple and efficient method for finding a violated label assignment. [sent-407, score-0.151]
51 Proposition 4 y∗ is a violated label assignment if ȳ'Hy∗ ≥ ȳ'Hȳ. [sent-412, score-0.181]
52 So, (y∗)'Hy∗ > ȳ'Hȳ, which indicates that y∗ is a violated label assignment. [sent-415, score-0.151]
53 A similar trick can be used in checking if y∗ is a violated label assignment. [sent-439, score-0.151]
54 O(N log N) time; and finally check whether y∗ is a violated label assignment using Proposition 4, which takes O(N²) (resp. [sent-440, score-0.181]
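A simplified sketch of this search (our illustration): it fixes the labeled part, exploits the fact that the balance constraint fixes the number of positive assignments among the unlabeled points, sorts the unlabeled scores, and then applies the Proposition 4 test.

```python
import numpy as np

def find_candidate_label(H, y_bar, y_L, n_pos_unlabeled):
    """Greedy candidate ŷ maximizing ŷ'(H y_bar): keep labeled entries fixed and
    give +1 to the unlabeled points with the largest scores (O(N log N) sort)."""
    l = len(y_L)
    c = H @ y_bar                        # linear scores
    y = np.empty_like(y_bar)
    y[:l] = y_L                          # labeled part is fixed
    order = np.argsort(-c[l:])           # unlabeled scores, descending
    y[l:] = -1.0
    y[l + order[:n_pos_unlabeled]] = 1.0
    return y

def is_violated(y_star, y_bar, H):
    """Proposition 4 test: ŷ* is a violated label assignment if ȳ'H ŷ* ≥ ȳ'H ȳ."""
    return y_bar @ H @ y_star >= y_bar @ H @ y_bar
```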
55 2 Multi-Instance Learning In this section, we consider the second weakly labeled learning problem, namely, multi-instance learning (MIL), where examples are bags containing multiple instances. [sent-457, score-0.216]
56 In traditional MIL, a bag is labeled positive if it contains at least one key (or positive) instance, and negative otherwise. [sent-464, score-0.155]
57 Thus, we only have the bag labels available, while the instance labels are only implicitly known. [sent-465, score-0.16]
58 Identification of the key instances from positive bags can be very useful in CBIR. [sent-466, score-0.109]
59 Traditional MIL implies that the label of a bag is determined by its most representative key instance, that is, f (Bi ) = max{ f (xi,1 ), · · · , f (xi,mi )}. [sent-470, score-0.116]
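In code, the bag-level rule above is just a max over the instance outputs of a bag (illustrative sketch; `instance_scores` holds the values f(x_{i,j}) of one bag):

```python
import numpy as np

def bag_prediction(instance_scores):
    """MIL rule f(B_i) = max_j f(x_{i,j}); also returns the arg max,
    i.e., the candidate key instance (e.g., the ROI in CBIR)."""
    j = int(np.argmax(instance_scores))
    return np.sign(instance_scores[j]), j
```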
60 Combining all these together, Equation (18) can be rewritten as min_{s∈∆} min_{w,ξ} ½‖w‖² + C1 ∑_{i=1}^p ξ_i + C2 ∑_{i=p+1}^m ∑_{j=1}^{m_i} ξ_{s(i,j)} s. [sent-495, score-0.158]
61 With Proposition 1, we have min_{µ∈M} max_{α∈A} 1'α − ½ (α ⊙ ŷ)'(∑_{t: s_t∈∆} µ_t K_{s_t})(α ⊙ ŷ), (21) which is a convex relaxation of Equation (19). [sent-527, score-0.11]
62 Algorithm: Similar to semi-supervised learning, the cutting plane algorithm is used for solving Equation (21). [sent-530, score-0.106]
63 Recall that there are two issues in the use of cutting-plane algorithms, namely, efficient multiple label-kernel learning and the finding of a violated label assignment. [sent-531, score-0.151]
64 According to the definition of ψ in Equation (20), this can be rewritten as max_{s∈∆} ‖∑_{i=1}^p α_i ∑_{j=1}^{m_i} d_{i,j} φ(x_{i,j}) − ∑_{i=p+1}^m ∑_{j=1}^{m_i} α_{s(i,j)} φ(x_{s(i,j)})‖², which can be reformulated as max_{s∈∆} s'Hs + τ's, (23) where H ∈ R^{J_p×J_p} and τ ∈ R^{J_p}. [sent-557, score-0.233]
65 , m_i, we have H_{v(i,j),v(î,ĵ)} = α_i α_î φ(x_{i,j})'φ(x_{î,ĵ}) and τ_{v(i,j)} = −2 α_i φ(x_{i,j})'(∑_{i=p+1}^m ∑_{j=1}^{m_i} α_{s(i,j)} φ(x_{s(i,j)})). [sent-564, score-0.128]
66 (24) Proposition 6 s∗ is a violated label assignment when (s∗)'Hs̄ + τ's∗/2 > s̄'Hs̄ + τ's̄/2. [sent-575, score-0.181]
67 So (s∗)'Hs∗ + τ's∗ > s̄'Hs̄ + τ's̄, which indicates that s∗ is a violated label assignment. [sent-579, score-0.151]
68 Therefore, one can obtain its optimal solution by solving the p subproblems individually mi max di s. [sent-589, score-0.117]
69 3 Clustering In this section, we consider the third weakly labeled learning task, namely, clustering, where all the class labels are unknown. [sent-609, score-0.198]
70 With Proposition 1, we have min_{µ∈M} max_{α∈A} 1'α − ½ α'(∑_{t: ŷ_t∈B} µ_t K ⊙ ŷ_t ŷ_t')α (29) as a convex relaxation of Equation (28). [sent-634, score-0.163]
71 Algorithm: The cutting plane algorithm can still be applied for clustering. [sent-640, score-0.106]
72 , N, and its dual is min_{µ∈M} max_{α∈A} 1'α − ½ α'(∑_{t=1}^T µ_t K ⊙ ŷ_t ŷ_t')α, which is the same as Equation (29). [sent-649, score-0.11]
73 As for finding a violated label assignment, let ȳ ∈ C be ȳ = arg max_{ŷ∈C} ŷ'Hŷ, where H = K ⊙ (αα') is a positive semidefinite matrix. [sent-651, score-0.151]
74 (31) With Proposition 4, we obtain that y∗ is a violated label assignment if ȳ'Hy∗ ≥ ȳ'Hȳ. [sent-653, score-0.181]
75 Similar to semi-supervised learning, the complexity of finding a violated label scales as O(N²) (resp. [sent-663, score-0.151]
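For clustering, the same Proposition 4 style check applies, with H built from the current dual solution; a small sketch under the notation above:

```python
import numpy as np

def clustering_H(K, alpha):
    """H = K ⊙ (alpha alpha'): positive semidefinite, since K is PSD and
    alpha alpha' is rank-one PSD (Schur product of PSD matrices is PSD)."""
    return K * np.outer(alpha, alpha)

# A candidate y_star is then accepted as a violated label assignment exactly as
# in Proposition 4: y_bar @ H @ y_star >= y_bar @ H @ y_bar.
```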
76 Experiments are conducted on all the three aforementioned weakly labeled learning tasks: semi-supervised learning (Section 5. [sent-677, score-0.148]
77 We investigate the performance of each approach with varying amounts of labeled data (namely, 5%, 10% and 15% of all the labeled data). [sent-697, score-0.19]
78 Besides the local minimum problem, another possible reason may be that there are multiple large-margin separators coinciding well with the labeled data, and the labeled examples are too few to reliably select among these separators (Li and Zhou, 2011). [sent-938, score-0.234]
79 [Plot residue: x-axis data set labels — echocardiogram, house, heart, heart-statlog, haberman, liverDisorders, spectf, ionosphere, house-votes, clean1, isolet, australian, diabetes, german, krvskp, sick; y-axis: objective values. sent-1334, sent-1355]
81 The high standard deviation of WellSVM on real-sim with 25 labeled examples may be due to the fact that the large number of unlabeled instances leads to a large variance in deriving a large-margin classifier, whereas the number of labeled examples is too small to reduce this variance. [sent-1463, score-0.348]
82 Specifically, a training set of 50 images is created by randomly sampling 10 images from each of the five categories. [sent-1590, score-0.11]
83 The ROI success rate is computed based on those images that are predicted as relevant, that is, success rate of ROIs = (number of ROI successes) / (number of images predicted as relevant). [sent-1620, score-0.135]
84 When an algorithm classifies many images as relevant, the success rate of relevant images (Equation (33)) is high, while the success rate of ROIs (Equation (34)) can be low, since many images are predicted as relevant, which inflates the denominator of Equation (34). [sent-1622, score-0.192]
85 On the other hand, when an algorithm classifies many images as irrelevant, the success rate of ROIs is high while the success rate of relevant images is low, since many relevant images are missed. [sent-1623, score-0.192]
86 This is similar to the F-score in information retrieval, as 1/(success rate) = (#relevant images + #predicted relevant images) / (2 · #ROI successes) = ½ (#relevant images/#ROI successes + #predicted relevant images/#ROI successes). [sent-1700, score-0.149]
87 Intuitively, when an algorithm correctly recognizes all the relevant images and their ROIs, the success rate will be high. [sent-1701, score-0.224]
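Numerically, this success rate is the harmonic mean of the two rates in Equations (33) and (34); a small illustrative sketch using plain integer counts:

```python
def overall_success_rate(n_roi_successes, n_relevant, n_predicted_relevant):
    """Harmonic-mean style success rate:
    1/rate = (#relevant + #predicted relevant) / (2 * #ROI successes)."""
    if n_roi_successes == 0:
        return 0.0
    return 2.0 * n_roi_successes / (n_relevant + n_predicted_relevant)
```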
88 Clustering: In this section, we further evaluate our WellSVM on clustering problems, where all the labels are unknown. [sent-1709, score-0.104]
89 Small-Scale Experiments: WellSVM is compared with the following methods: 1) k-means clustering (KM); 2) kernel k-means clustering (KKM); 3) normalized cut (NC) (Shi and Malik, 2000); 4) GMMC (Valizadegan and Jin, 2007); 5) IterSVR (Zhang et al. [sent-1713, score-0.143]
90 Conclusion Learning from weakly labeled data, where the training labels are incomplete, is generally regarded as a crucial yet challenging machine learning task. [sent-1971, score-0.22]
91 Specifically, three common weak-label learning tasks, namely (i) semi-supervised learning, where labels are partially known; (ii) multi-instance learning, where labels are implicitly known; and (iii) clustering, where labels are totally unknown, can all be handled by WellSVM under the same formulation. [sent-1979, score-0.204]
92 The focus of this paper is on binary weakly labeled problems. [sent-1982, score-0.148]
93 Multi-class weakly labeled problems can be easily handled by decomposing them into multiple binary problems (Crammer and Singer, 2002). [sent-1983, score-0.148]
94 Note that g_ŷ(α) is λ-strongly convex and ∑_{ŷ∈C^(t)} µ̄_ŷ^(t) = 1, so J(α, µ̄^(t)) is also λ-strongly convex. [sent-1997, score-0.118]
95 Using the definition of J(α, µ), we then have ∑_{ŷ∈C^(t)} µ̄_ŷ^(t) g_ŷ(α) − ∑_{ŷ∈C^(t)} µ̄_ŷ^(t) g_ŷ(ᾱ^(t)) ≥ (λ/2) ∑_{ŷ∈C^(t)} µ̄_ŷ^(t) ‖α − ᾱ^(t)‖² = (λ/2) ‖α − ᾱ^(t)‖². [sent-1999, score-0.146]
96 (35) Let ŷ^(t+1) be the violated label vector selected at iteration t+1 in Algorithm 1, that is, C^(t+1) = C^(t) ∪ {ŷ^(t+1)}. [sent-2000, score-0.151]
97 From the definition, we have g_{ŷ^(t+1)}(ᾱ^(t)) = −G(ᾱ^(t), ŷ^(t+1)) ≥ max_{ŷ∈C^(t)} −G(ᾱ^(t), ŷ) + ε = max_{ŷ∈C^(t)} g_ŷ(ᾱ^(t)) + ε ≥ ∑_{ŷ∈C^(t)} µ̄_ŷ^(t) g_ŷ(ᾱ^(t)) + ε = −p^(t) + ε. [sent-2001, score-0.196]
98 (39) Using Equations (35), (36), (38) and (39), we have η ≥ ∑_{ŷ∈C^(t)} µ̄_ŷ^(t) g_ŷ(α̃^(t)) − ∑_{ŷ∈C^(t)} µ̄_ŷ^(t) g_ŷ(ᾱ^(t)) ≥ (λ/2)‖α̃^(t) − ᾱ^(t)‖², and ε − η ≤ g_{ŷ^(t+1)}(α̃^(t)) − g_{ŷ^(t+1)}(ᾱ^(t)) ≤ M‖α̃^(t) − ᾱ^(t)‖. [sent-2008, score-0.146]
99 Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. [sent-2039, score-0.168]
100 Logistic regression and boosting for labeled bags of instances. [sent-2450, score-0.163]
wordName wordTfidf (topN-words)
[('ell', 0.729), ('svm', 0.324), ('sang', 0.135), ('wok', 0.128), ('abeled', 0.122), ('eakly', 0.122), ('calable', 0.104), ('hou', 0.101), ('labeled', 0.095), ('violated', 0.095), ('xu', 0.094), ('onvex', 0.094), ('usvm', 0.088), ('tsvm', 0.087), ('sdp', 0.08), ('gy', 0.073), ('chapelle', 0.073), ('unlabeled', 0.073), ('roi', 0.069), ('rois', 0.069), ('vm', 0.068), ('cutting', 0.068), ('bags', 0.068), ('wellsvm', 0.068), ('mi', 0.064), ('svmlin', 0.061), ('bag', 0.06), ('yy', 0.058), ('bie', 0.058), ('yl', 0.057), ('label', 0.056), ('equation', 0.056), ('mil', 0.054), ('clustering', 0.054), ('weakly', 0.053), ('yt', 0.053), ('cristianini', 0.053), ('lapsvm', 0.052), ('labels', 0.05), ('wt', 0.05), ('mkl', 0.049), ('zhou', 0.048), ('andrews', 0.047), ('vms', 0.047), ('miny', 0.047), ('tsang', 0.047), ('gmmc', 0.046), ('sindhwani', 0.046), ('schuurmans', 0.046), ('convex', 0.045), ('margin', 0.044), ('images', 0.044), ('kwok', 0.043), ('instances', 0.041), ('relaxation', 0.04), ('plane', 0.038), ('kernel', 0.035), ('echocardiogram', 0.035), ('haberman', 0.035), ('itersvr', 0.035), ('nanjing', 0.035), ('spectf', 0.035), ('australian', 0.034), ('castle', 0.034), ('krvskp', 0.034), ('lb', 0.034), ('mklgl', 0.034), ('rework', 0.034), ('sunset', 0.034), ('hy', 0.034), ('relaxations', 0.033), ('dual', 0.032), ('sick', 0.031), ('successes', 0.031), ('rewritten', 0.03), ('success', 0.03), ('uci', 0.03), ('assignment', 0.03), ('mmc', 0.029), ('diabetes', 0.029), ('di', 0.028), ('cale', 0.027), ('liverdisorders', 0.027), ('maron', 0.027), ('mip', 0.027), ('waterfall', 0.027), ('xperiments', 0.027), ('yti', 0.027), ('ionosphere', 0.026), ('collobert', 0.026), ('max', 0.025), ('cbir', 0.023), ('cpmmc', 0.023), ('runing', 0.023), ('ssl', 0.023), ('joachims', 0.023), ('mb', 0.023), ('liblinear', 0.022), ('isolet', 0.022), ('training', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs
2 0.11027399 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
Author: Jun Wang, Tony Jebara, Shih-Fu Chang
Abstract: Graph-based semi-supervised learning (SSL) methods play an increasingly important role in practical machine learning systems, particularly in agnostic settings when no parametric information or other prior knowledge is available about the data distribution. Given the constructed graph represented by a weight matrix, transductive inference is used to propagate known labels to predict the values of all unlabeled vertices. Designing a robust label diffusion algorithm for such graphs is a widely studied problem and various methods have recently been suggested. Many of these can be formalized as regularized function estimation through the minimization of a quadratic cost. However, most existing label diffusion methods minimize a univariate cost with the classification function as the only variable of interest. Since the observed labels seed the diffusion process, such univariate frameworks are extremely sensitive to the initial label choice and any label noise. To alleviate the dependency on the initial observed labels, this article proposes a bivariate formulation for graph-based SSL, where both the binary label information and a continuous classification function are arguments of the optimization. This bivariate formulation is shown to be equivalent to a linearly constrained Max-Cut problem. Finally an efficient solution via greedy gradient Max-Cut (GGMC) is derived which gradually assigns unlabeled vertices to each class with minimum connectivity. Once convergence guarantees are established, this greedy Max-Cut based SSL is applied on both artificial and standard benchmark data sets where it obtains superior classification accuracy compared to existing state-of-the-art SSL methods. Moreover, GGMC shows robustness with respect to the graph construction method and maintains high accuracy over extensive experiments with various edge linking and weighting schemes. Keywords: graph transduction, semi-supervised learning, bivariate formulation, mixed integer programming, greedy
3 0.096777588 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach
Author: Gang Niu, Bo Dai, Lin Shang, Masashi Sugiyama
Abstract: The large volume principle proposed by Vladimir Vapnik, which advocates that hypotheses lying in an equivalence class with a larger volume are more preferable, is a useful alternative to the large margin principle. In this paper, we introduce a new discriminative clustering model based on the large volume principle called maximum volume clustering (MVC), and then propose two approximation schemes to solve this MVC model: A soft-label MVC method using sequential quadratic programming and a hard-label MVC method using semi-definite programming, respectively. The proposed MVC is theoretically advantageous for three reasons. The optimization involved in hardlabel MVC is convex, and under mild conditions, the optimization involved in soft-label MVC is akin to a convex one in terms of the resulting clusters. Secondly, the soft-label MVC method pos∗. A preliminary and shorter version has appeared in Proceedings of 14th International Conference on Artificial Intelligence and Statistics (Niu et al., 2011). The preliminary work was done when GN was studying at Department of Computer Science and Technology, Nanjing University, and BD was studying at Institute of Automation, Chinese Academy of Sciences. A Matlab implementation of maximum volume clustering is available from http://sugiyama-www.cs.titech.ac.jp/∼gang/software.html. c 2013 Gang Niu, Bo Dai, Lin Shang and Masashi Sugiyama. N IU , DAI , S HANG AND S UGIYAMA sesses a clustering error bound. Thirdly, MVC includes the optimization problems of a spectral clustering, two relaxed k-means clustering and an information-maximization clustering as special limit cases when its regularization parameter goes to infinity. Experiments on several artificial and benchmark data sets demonstrate that the proposed MVC compares favorably with state-of-the-art clustering methods. Keywords: discriminative clustering, large volume principle, sequential quadratic programming, semi-definite programming, finite sample stability, clustering error
4 0.084809586 54 jmlr-2013-JKernelMachines: A Simple Framework for Kernel Machines
Author: David Picard, Nicolas Thome, Matthieu Cord
Abstract: JKernelMachines is a Java library for learning with kernels. It is primarily designed to deal with custom kernels that are not easily found in standard libraries, such as kernels on structured data. These types of kernels are often used in computer vision or bioinformatics applications. We provide several kernels leading to state of the art classification performances in computer vision, as well as various kernels on sets. The main focus of the library is to be easily extended with new kernels. Standard SVM optimization algorithms are available, but also more sophisticated learning-based kernel combination methods such as Multiple Kernel Learning (MKL), and a recently published algorithm to learn powered products of similarities (Product Kernel Learning). Keywords: classification, support vector machines, kernel, computer vision
5 0.083019242 19 jmlr-2013-BudgetedSVM: A Toolbox for Scalable SVM Approximations
Author: Nemanja Djuric, Liang Lan, Slobodan Vucetic, Zhuang Wang
Abstract: We present BudgetedSVM, an open-source C++ toolbox comprising highly-optimized implementations of recently proposed algorithms for scalable training of Support Vector Machine (SVM) approximators: Adaptive Multi-hyperplane Machines, Low-rank Linearization SVM, and Budgeted Stochastic Gradient Descent. BudgetedSVM trains models with accuracy comparable to LibSVM in time comparable to LibLinear, solving non-linear problems with millions of high-dimensional examples within minutes on a regular computer. We provide command-line and Matlab interfaces to BudgetedSVM, an efficient API for handling large-scale, high-dimensional data sets, as well as detailed documentation to help developers use and further extend the toolbox. Keywords: non-linear classification, large-scale learning, SVM, machine learning toolbox
6 0.069553405 64 jmlr-2013-Lovasz theta function, SVMs and Finding Dense Subgraphs
7 0.043659344 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
8 0.042822395 117 jmlr-2013-Universal Consistency of Localized Versions of Regularized Kernel Methods
10 0.041486938 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
11 0.038689554 56 jmlr-2013-Keep It Simple And Sparse: Real-Time Action Recognition
12 0.03720447 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
13 0.037161566 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
14 0.034856148 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning
15 0.032729302 114 jmlr-2013-The Rate of Convergence of AdaBoost
16 0.031998079 76 jmlr-2013-Nonparametric Sparsity and Regularization
17 0.031641748 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
18 0.031535614 68 jmlr-2013-Machine Learning with Operational Costs
19 0.030028364 69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses
20 0.029318746 55 jmlr-2013-Joint Harmonic Functions and Their Supervised Connections
topicId topicWeight
[(0, -0.166), (1, 0.059), (2, -0.036), (3, -0.0), (4, 0.074), (5, 0.028), (6, 0.034), (7, 0.018), (8, -0.037), (9, -0.288), (10, 0.139), (11, -0.078), (12, 0.19), (13, 0.044), (14, 0.067), (15, 0.112), (16, 0.145), (17, 0.01), (18, -0.059), (19, -0.05), (20, -0.148), (21, 0.096), (22, -0.087), (23, 0.184), (24, 0.079), (25, -0.091), (26, -0.077), (27, 0.072), (28, 0.018), (29, 0.035), (30, -0.151), (31, 0.223), (32, 0.03), (33, -0.065), (34, 0.015), (35, 0.037), (36, 0.126), (37, 0.005), (38, 0.068), (39, 0.112), (40, -0.029), (41, 0.141), (42, 0.082), (43, 0.029), (44, -0.117), (45, -0.13), (46, -0.04), (47, 0.012), (48, -0.079), (49, -0.054)]
simIndex simValue paperId paperTitle
same-paper 1 0.94571221 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs
2 0.64914566 19 jmlr-2013-BudgetedSVM: A Toolbox for Scalable SVM Approximations
Author: Nemanja Djuric, Liang Lan, Slobodan Vucetic, Zhuang Wang
Abstract: We present BudgetedSVM, an open-source C++ toolbox comprising highly-optimized implementations of recently proposed algorithms for scalable training of Support Vector Machine (SVM) approximators: Adaptive Multi-hyperplane Machines, Low-rank Linearization SVM, and Budgeted Stochastic Gradient Descent. BudgetedSVM trains models with accuracy comparable to LibSVM in time comparable to LibLinear, solving non-linear problems with millions of high-dimensional examples within minutes on a regular computer. We provide command-line and Matlab interfaces to BudgetedSVM, an efficient API for handling large-scale, high-dimensional data sets, as well as detailed documentation to help developers use and further extend the toolbox. Keywords: non-linear classification, large-scale learning, SVM, machine learning toolbox
3 0.49822971 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
Author: Jun Wang, Tony Jebara, Shih-Fu Chang
Abstract: Graph-based semi-supervised learning (SSL) methods play an increasingly important role in practical machine learning systems, particularly in agnostic settings when no parametric information or other prior knowledge is available about the data distribution. Given the constructed graph represented by a weight matrix, transductive inference is used to propagate known labels to predict the values of all unlabeled vertices. Designing a robust label diffusion algorithm for such graphs is a widely studied problem and various methods have recently been suggested. Many of these can be formalized as regularized function estimation through the minimization of a quadratic cost. However, most existing label diffusion methods minimize a univariate cost with the classification function as the only variable of interest. Since the observed labels seed the diffusion process, such univariate frameworks are extremely sensitive to the initial label choice and any label noise. To alleviate the dependency on the initial observed labels, this article proposes a bivariate formulation for graph-based SSL, where both the binary label information and a continuous classification function are arguments of the optimization. This bivariate formulation is shown to be equivalent to a linearly constrained Max-Cut problem. Finally an efficient solution via greedy gradient Max-Cut (GGMC) is derived which gradually assigns unlabeled vertices to each class with minimum connectivity. Once convergence guarantees are established, this greedy Max-Cut based SSL is applied on both artificial and standard benchmark data sets where it obtains superior classification accuracy compared to existing state-of-the-art SSL methods. Moreover, GGMC shows robustness with respect to the graph construction method and maintains high accuracy over extensive experiments with various edge linking and weighting schemes. Keywords: graph transduction, semi-supervised learning, bivariate formulation, mixed integer programming, greedy
4 0.42449492 54 jmlr-2013-JKernelMachines: A Simple Framework for Kernel Machines
Author: David Picard, Nicolas Thome, Matthieu Cord
Abstract: JKernelMachines is a Java library for learning with kernels. It is primarily designed to deal with custom kernels that are not easily found in standard libraries, such as kernels on structured data. These types of kernels are often used in computer vision or bioinformatics applications. We provide several kernels leading to state of the art classification performances in computer vision, as well as various kernels on sets. The main focus of the library is to be easily extended with new kernels. Standard SVM optimization algorithms are available, but also more sophisticated learning-based kernel combination methods such as Multiple Kernel Learning (MKL), and a recently published algorithm to learn powered products of similarities (Product Kernel Learning). Keywords: classification, support vector machines, kernel, computer vision
5 0.39962098 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach
Author: Gang Niu, Bo Dai, Lin Shang, Masashi Sugiyama
Abstract: The large volume principle proposed by Vladimir Vapnik, which advocates that hypotheses lying in an equivalence class with a larger volume are more preferable, is a useful alternative to the large margin principle. In this paper, we introduce a new discriminative clustering model based on the large volume principle called maximum volume clustering (MVC), and then propose two approximation schemes to solve this MVC model: A soft-label MVC method using sequential quadratic programming and a hard-label MVC method using semi-definite programming, respectively. The proposed MVC is theoretically advantageous for three reasons. The optimization involved in hardlabel MVC is convex, and under mild conditions, the optimization involved in soft-label MVC is akin to a convex one in terms of the resulting clusters. Secondly, the soft-label MVC method pos∗. A preliminary and shorter version has appeared in Proceedings of 14th International Conference on Artificial Intelligence and Statistics (Niu et al., 2011). The preliminary work was done when GN was studying at Department of Computer Science and Technology, Nanjing University, and BD was studying at Institute of Automation, Chinese Academy of Sciences. A Matlab implementation of maximum volume clustering is available from http://sugiyama-www.cs.titech.ac.jp/∼gang/software.html. c 2013 Gang Niu, Bo Dai, Lin Shang and Masashi Sugiyama. N IU , DAI , S HANG AND S UGIYAMA sesses a clustering error bound. Thirdly, MVC includes the optimization problems of a spectral clustering, two relaxed k-means clustering and an information-maximization clustering as special limit cases when its regularization parameter goes to infinity. Experiments on several artificial and benchmark data sets demonstrate that the proposed MVC compares favorably with state-of-the-art clustering methods. Keywords: discriminative clustering, large volume principle, sequential quadratic programming, semi-definite programming, finite sample stability, clustering error
6 0.37270525 64 jmlr-2013-Lovasz theta function, SVMs and Finding Dense Subgraphs
8 0.29348582 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning
9 0.27198377 55 jmlr-2013-Joint Harmonic Functions and Their Supervised Connections
10 0.25291297 18 jmlr-2013-Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications
11 0.2237103 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
12 0.20984763 57 jmlr-2013-Kernel Bayes' Rule: Bayesian Inference with Positive Definite Kernels
13 0.20036253 107 jmlr-2013-Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization
14 0.19994339 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
15 0.1896529 117 jmlr-2013-Universal Consistency of Localized Versions of Regularized Kernel Methods
17 0.18407144 68 jmlr-2013-Machine Learning with Operational Costs
18 0.18084836 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
19 0.17676459 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
20 0.15940073 90 jmlr-2013-Quasi-Newton Method: A New Direction
topicId topicWeight
[(0, 0.027), (5, 0.111), (6, 0.028), (10, 0.104), (20, 0.018), (23, 0.025), (41, 0.015), (44, 0.016), (68, 0.029), (70, 0.016), (75, 0.04), (82, 0.363), (85, 0.026), (87, 0.02), (88, 0.012), (89, 0.013), (90, 0.037), (93, 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.68791735 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs
2 0.41621691 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
Author: Jun Wang, Tony Jebara, Shih-Fu Chang
Abstract: Graph-based semi-supervised learning (SSL) methods play an increasingly important role in practical machine learning systems, particularly in agnostic settings when no parametric information or other prior knowledge is available about the data distribution. Given the constructed graph represented by a weight matrix, transductive inference is used to propagate known labels to predict the values of all unlabeled vertices. Designing a robust label diffusion algorithm for such graphs is a widely studied problem and various methods have recently been suggested. Many of these can be formalized as regularized function estimation through the minimization of a quadratic cost. However, most existing label diffusion methods minimize a univariate cost with the classification function as the only variable of interest. Since the observed labels seed the diffusion process, such univariate frameworks are extremely sensitive to the initial label choice and any label noise. To alleviate the dependency on the initial observed labels, this article proposes a bivariate formulation for graph-based SSL, where both the binary label information and a continuous classification function are arguments of the optimization. This bivariate formulation is shown to be equivalent to a linearly constrained Max-Cut problem. Finally an efficient solution via greedy gradient Max-Cut (GGMC) is derived which gradually assigns unlabeled vertices to each class with minimum connectivity. Once convergence guarantees are established, this greedy Max-Cut based SSL is applied on both artificial and standard benchmark data sets where it obtains superior classification accuracy compared to existing state-of-the-art SSL methods. Moreover, GGMC shows robustness with respect to the graph construction method and maintains high accuracy over extensive experiments with various edge linking and weighting schemes. Keywords: graph transduction, semi-supervised learning, bivariate formulation, mixed integer programming, greedy
3 0.41099095 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding
Author: Ery Arias-Castro, Bruno Pelletier
Abstract: Maximum Variance Unfolding is one of the main methods for (nonlinear) dimensionality reduction. We study its large sample limit, providing specific rates of convergence under standard assumptions. We find that it is consistent when the underlying submanifold is isometric to a convex subset, and we provide some simple examples where it fails to be consistent. Keywords: maximum variance unfolding, isometric embedding, U-processes, empirical processes, proximity graphs.
4 0.40238732 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
Author: Wendelin Böhmer, Steffen Grünewälder, Yun Shen, Marek Musial, Klaus Obermayer
Abstract: Linear reinforcement learning (RL) algorithms like least-squares temporal difference learning (LSTD) require basis functions that span approximation spaces of potential value functions. This article investigates methods to construct these bases from samples. We hypothesize that an ideal approximation spaces should encode diffusion distances and that slow feature analysis (SFA) constructs such spaces. To validate our hypothesis we provide theoretical statements about the LSTD value approximation error and induced metric of approximation spaces constructed by SFA and the state-of-the-art methods Krylov bases and proto-value functions (PVF). In particular, we prove that SFA minimizes the average (over all tasks in the same environment) bound on the above approximation error. Compared to other methods, SFA is very sensitive to sampling and can sometimes fail to encode the whole state space. We derive a novel importance sampling modification to compensate for this effect. Finally, the LSTD and least squares policy iteration (LSPI) performance of approximation spaces constructed by Krylov bases, PVF, SFA and PCA is compared in benchmark tasks and a visual robot navigation experiment (both in a realistic simulation and with a robot). The results support our hypothesis and suggest that (i) SFA provides subspace-invariant features for MDPs with self-adjoint transition operators, which allows strong guarantees on the approximation error, (ii) the modified SFA algorithm is best suited for LSPI in both discrete and continuous state spaces and (iii) approximation spaces encoding diffusion distances facilitate LSPI performance. Keywords: reinforcement learning, diffusion distance, proto value functions, slow feature analysis, least-squares policy iteration, visual robot navigation c 2013 Wendelin B¨ hmer, Steffen Gr¨ new¨ lder, Yun Shen, Marek Musial and Klaus Obermayer. o u a ¨ ¨ ¨ B OHMER , G R UNEW ALDER , S HEN , M USIAL AND O BERMAYER
Author: Alberto N. Escalante-B., Laurenz Wiskott
Abstract: Supervised learning from high-dimensional data, for example, multimedia data, is a challenging task. We propose an extension of slow feature analysis (SFA) for supervised dimensionality reduction called graph-based SFA (GSFA). The algorithm extracts a label-predictive low-dimensional set of features that can be post-processed by typical supervised algorithms to generate the final label or class estimation. GSFA is trained with a so-called training graph, in which the vertices are the samples and the edges represent similarities of the corresponding labels. A new weighted SFA optimization problem is introduced, generalizing the notion of slowness from sequences of samples to such training graphs. We show that GSFA computes an optimal solution to this problem in the considered function space and propose several types of training graphs. For classification, the most straightforward graph yields features equivalent to those of (nonlinear) Fisher discriminant analysis. Emphasis is on regression, where four different graphs were evaluated experimentally with a subproblem of face detection on photographs. The method proposed is promising particularly when linear models are insufficient as well as when feature selection is difficult. Keywords: slow feature analysis, feature extraction, classification, regression, pattern recognition, training graphs, nonlinear dimensionality reduction, supervised learning, implicitly supervised, high-dimensional data, image analysis
7 0.39224514 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach
8 0.39221638 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
9 0.39210996 59 jmlr-2013-Large-scale SVD and Manifold Learning
10 0.39063829 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering
11 0.39029786 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning
12 0.38882089 68 jmlr-2013-Machine Learning with Operational Costs
13 0.38804877 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
14 0.38786972 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
15 0.38736826 86 jmlr-2013-Parallel Vector Field Embedding
16 0.38704723 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components
17 0.38678658 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
18 0.38511977 73 jmlr-2013-Multicategory Large-Margin Unified Machines
19 0.38504118 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
20 0.38493556 41 jmlr-2013-Experiment Selection for Causal Discovery