jmlr jmlr2007 jmlr2007-44 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Junhui Wang, Xiaotong Shen
Abstract: In classification, semi-supervised learning occurs when a large amount of unlabeled data is available with only a small number of labeled data. In such a situation, how to enhance predictability of classification through unlabeled data is the focus. In this article, we introduce a novel large margin semi-supervised learning methodology, using grouping information from unlabeled data, together with the concept of margins, in a form of regularization controlling the interplay between labeled and unlabeled data. Based on this methodology, we develop two specific machines involving support vector machines and ψ-learning, denoted as SSVM and SPSI, through difference convex programming. In addition, we estimate the generalization error using both labeled and unlabeled data, for tuning regularizers. Finally, our theoretical and numerical analyses indicate that the proposed methodology achieves the desired objective of delivering high performance in generalization, particularly against some strong performers. Keywords: generalization, grouping, sequential quadratic programming, support vectors
Reference: text
sentIndex sentText sentNum sentScore
1 School of Statistics, University of Minnesota, Minneapolis, MN 55455, USA. Editor: Tommi Jaakkola. Abstract: In classification, semi-supervised learning occurs when a large amount of unlabeled data is available with only a small number of labeled data. [sent-5, score-0.129]
2 In such a situation, how to enhance predictability of classification through unlabeled data is the focus. [sent-6, score-0.079]
3 In this article, we introduce a novel large margin semi-supervised learning methodology, using grouping information from unlabeled data, together with the concept of margins, in a form of regularization controlling the interplay between labeled and unlabeled data. [sent-7, score-0.3]
4 In addition, we estimate the generalization error using both labeled and unlabeled data, for tuning regularizers. [sent-9, score-0.16]
5 Introduction. In many classification problems, a large amount of unlabeled data is available, while it is costly to obtain labeled data. [sent-12, score-0.129]
6 In text categorization, particularly web-page classification, a machine is trained with a small number of manually labeled texts (web-pages), as well as a huge amount of unlabeled texts (web-pages), because manual labeling is impractical; compare with Joachims (1999). [sent-13, score-0.145]
7 In spam detection, a small group of identified e-mails, spam or non-spam, is used, in conjunction with a large number of unidentified e-mails, to train a filter to flag incoming spam e-mails, compare with Amini and Gallinari (2003). [sent-14, score-0.081]
8 In such a situation, one research problem is how to enhance the accuracy of prediction in classification by using both unlabeled and labeled data. [sent-17, score-0.129]
9 The problem of this sort is referred to as semi-supervised learning, which differs from a conventional “missing data” problem in that the size of the unlabeled data greatly exceeds that of the labeled data, and missingness occurs only in the response. [sent-18, score-0.129]
10 The central issue that this article addresses is how to use information from unlabeled data to enhance predictability of classification. [sent-19, score-0.079]
11 In semi-supervised learning, a sample {Zi = (Xi, Yi)}_{i=1}^{nl} is observed with labeling Yi ∈ {−1, 1}, in addition to an independent unlabeled sample {Xj}_{j=nl+1}^{n} with n = nl + nu, where Xk = (Xk1, · · · , Xkp); k = 1, · · · , n, is a p-dimensional input. [sent-20, score-0.795]
12 Here the labeled sample is independently and identically distributed according to an unknown joint distribution P(x, y), and the unlabeled sample is c 2007 Junhui Wang and Xiaotong Shen. [sent-21, score-0.129]
13 Essentially all existing methods make various assumptions about the relationship between P(Y = 1|X = x) and P(x) in a way for an improvement to occur when unlabeled data is used. [sent-27, score-0.094]
14 The primary objective of this article is to develop a large margin semi-supervised learning methodology to deliver high performance of classification by using unlabeled data. [sent-31, score-0.125]
15 The methodology is designed to adapt to a variety of situations by identifying, as opposed to specifying, a relationship between labeled and unlabeled data from the data itself. [sent-32, score-0.154]
16 It yields an improvement when unlabeled data can reconstruct the optimal classification boundary, and performs no worse than its supervised counterpart otherwise. [sent-33, score-0.127]
17 These ingredients are integrated in a form of regularization involving three regularizers, controlling, respectively, classification with labeled data, grouping with unlabeled data, and the interplay between them. [sent-36, score-0.2]
18 Moreover, we introduce a method that uses unlabeled data for tuning the regularizers. [sent-37, score-0.141]
19 Moreover, a novel learning theory is developed to quantify SPSI’s generalization error as a function of complexity of the class of candidate decision functions, the sample sizes (nl , nu ), and the regularizers. [sent-40, score-0.349]
20 To our knowledge, this is the first attempt to relate a classifier’s generalization error to (nl , nu ) and regularizers in semisupervised learning. [sent-41, score-0.368]
21 Section 4 proposes a tuning methodology that uses both labeled and unlabeled data to enhance the accuracy of estimation of the generalization error. [sent-46, score-0.185]
22 In order to extract useful information about classification from unlabeled data, we construct a loss U(·) for a grouping decision function g(x) = (1, x^T) w_g ≡ w̃_g^T x + w_{g,0}, with Sign(g(x)) indicating grouping. [sent-58, score-0.136]
23 While U can be used to extract the grouping boundary, it needs to yield the Bayes decision function f* = arg min_{f∈F} E L(Y f(X)) in order for it to be useful for classification, where E is the expectation with respect to (X, Y). [sent-61, score-0.087]
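As a point of reference, a standard fact about the hinge loss (general background, not specific to this paper) makes the requirement on U concrete: the population minimizer of the hinge risk is the Bayes rule itself.

```latex
% With eta(x) = P(Y = 1 | X = x) and hinge loss L(z) = (1 - z)_+,
% the population risk E[(1 - Y f(X))_+] is minimized over measurable f by
\[
  f^*(x) \;=\; \operatorname{sign}\!\left(\eta(x) - \tfrac{1}{2}\right),
\]
% so Sign(f^*) coincides with the Bayes classifier; a grouping loss U is useful for
% classification precisely when its minimizer recovers this f^*.
```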
24 Whereas L(y f(x)) regularizes the contribution from labeled data, U(g(x)) controls the information extracted from unlabeled data, and ‖w_f − w_g‖^2 penalizes the disagreement between f and g, specifying a loose relationship between f and g. [sent-66, score-0.361]
25 Note that in (1) the geometric margin term ‖w̃_f‖^2/2 does not enter, as it is regularized implicitly through ‖w_f − w_g‖^2/2 and ‖w̃_g‖^2/2. [sent-68, score-0.253]
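To make the three-term form concrete, here is a small illustrative Python sketch of a linear-case objective of this shape. The symmetric hinge U(z) = (1 − |z|)_+ and the inclusion of the intercepts in the disagreement penalty are assumptions for illustration, not the paper's exact specification.

```python
import numpy as np

def hinge(z):
    # L(z) = (1 - z)_+, applied to labeled functional margins y_i * f(x_i)
    return np.maximum(0.0, 1.0 - z)

def sym_hinge(z):
    # U(z) = (1 - |z|)_+, one common grouping loss on unlabeled margins (assumed form)
    return np.maximum(0.0, 1.0 - np.abs(z))

def objective(wf, wg, Xl, yl, Xu, C1, C2, C3):
    """C1 * sum_i L(y_i f(x_i)) + C2 * sum_j U(g(x_j))
       + (C3/2) * ||w_f - w_g||^2 + (1/2) * ||w~_g||^2,
       with f(x) = wf[0] + x @ wf[1:] and g(x) = wg[0] + x @ wg[1:]."""
    f_l = Xl @ wf[1:] + wf[0]                      # f on labeled inputs
    g_u = Xu @ wg[1:] + wg[0]                      # g on unlabeled inputs
    labeled = C1 * np.sum(hinge(yl * f_l))         # classification term
    unlabeled = C2 * np.sum(sym_hinge(g_u))        # grouping term
    interplay = 0.5 * C3 * np.sum((wf - wg) ** 2)  # disagreement between f and g
    ridge = 0.5 * np.sum(wg[1:] ** 2)              # ||w~_g||^2 / 2
    return labeled + unlabeled + interplay + ridge
```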
26 In nonlinear learning, a kernel K(·, ·) that maps from S × S to R^1 is usually introduced for flexible representations: f(x) = (1, K(x, x1), · · · , K(x, xn)) w_f and g(x) = (1, K(x, x1), · · · , K(x, xn)) w_g, with w_f = (w̃_f, w_{f,0}) and w_g = (w̃_g, w_{g,0}). [sent-69, score-0.464]
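A brief sketch of the kernel design just described; the Gaussian kernel, its bandwidth, and the function names are illustrative choices rather than the paper's notation.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.sum((np.asarray(x) - np.asarray(z)) ** 2) / (2.0 * sigma ** 2))

def kernel_design(X_eval, X_train, kernel=gaussian_kernel):
    """Map each x to (1, K(x, x_1), ..., K(x, x_n)), so that f(x) = phi(x) @ w_f
    and g(x) = phi(x) @ w_g with coefficient vectors of length n + 1."""
    Phi = np.array([[kernel(x, xi) for xi in X_train] for x in X_eval])
    return np.hstack([np.ones((len(X_eval), 1)), Phi])
```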
27 The constrained version of (2), after introducing slack variables {ξk ≥ 0; k = 1, · · · , n}, becomes C1 ∑_{i=1}^{nl} ξi + C2 ∑_{j=nl+1}^{n} ξj + (C3/2) ‖f − g‖^2 + (1/2) ‖g̃‖^2, (3) subject to ξi − L(yi f(xi)) ≥ 0; i = 1, · · · , nl; ξj − U(g(xj)) ≥ 0; j = nl + 1, · · · , n. [sent-80, score-1.095]
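Below is a hedged sketch of one convex subproblem derived from the slack form (3), written with cvxpy for the linear case. It assumes U(z) = (1 − |z|)_+ with the DC split U(z) = max(1, |z|) − |z| and linearizes the concave part at a current iterate g_old; the paper's sequential quadratic programming details may differ.

```python
import numpy as np
import cvxpy as cp

def ssvm_subproblem(Xl, yl, Xu, C1, C2, C3, g_old):
    """One convexified instance of (3): slack constraints for the labeled hinge loss,
    and for the unlabeled loss with its concave part linearized at g_old
    (a length-nu vector of current values of g on the unlabeled inputs)."""
    nl, p = Xl.shape
    nu = Xu.shape[0]
    wf = cp.Variable(p + 1)               # (w_{f,0}, w~_f)
    wg = cp.Variable(p + 1)               # (w_{g,0}, w~_g)
    xi_l = cp.Variable(nl, nonneg=True)   # slacks for labeled points
    xi_u = cp.Variable(nu, nonneg=True)   # slacks for unlabeled points
    f_l = Xl @ wf[1:] + wf[0]
    g_u = Xu @ wg[1:] + wg[0]
    # linearization of the concave part -|z| of U at g_old
    lin = np.abs(g_old) + cp.multiply(np.sign(g_old), g_u - g_old)
    constraints = [
        xi_l >= 1 - cp.multiply(yl, f_l),          # xi_i >= L(y_i f(x_i))
        xi_u >= cp.maximum(1, cp.abs(g_u)) - lin,  # xi_j >= convexified U(g(x_j))
    ]
    obj = (C1 * cp.sum(xi_l) + C2 * cp.sum(xi_u)
           + 0.5 * C3 * cp.sum_squares(wf - wg)
           + 0.5 * cp.sum_squares(wg[1:]))
    cp.Problem(cp.Minimize(obj), constraints).solve()
    return wf.value, wg.value
```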
28 SSVM in (3) stems from grouping and interplay between grouping and classification, whereas TSVM focuses on classification. [sent-88, score-0.114]
29 Similarly, ∇s2^ψ = (∇1f^ψ, ∇2f^ψ, ∇1g^ψ, ∇2g^ψ) is the gradient vector of s2 with respect to (w_f, w_g), with ∇1f^ψ = C1 ∑_{i=1}^{nl} ∇ψ2(yi f(xi)) yi xi, ∇2f^ψ = C1 ∑_{i=1}^{nl} ∇ψ2(yi f(xi)) yi, ∇1g^ψ = 2∇1g^{SVM}, and ∇2g^ψ = 2∇2g^{SVM}, where ∇ψ2(z) = 0 if z > 0 and ∇ψ2(z) = −2 otherwise. [sent-133, score-0.262]
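A minimal sketch of the outer difference convex (DC) iteration that such gradients feed into: the concave part of the nonconvex loss is linearized at the current iterate (using subgradients such as ∇ψ2 above) and a convex subproblem is solved until the iterates stabilize. The function names and stopping rule are illustrative, not the paper's.

```python
import numpy as np

def grad_psi2(z):
    # Subgradient used for the concave part, as stated in the text:
    # 0 for z > 0 and -2 otherwise.
    return np.where(np.asarray(z) > 0, 0.0, -2.0)

def dc_iterate(solve_convexified, wf0, wg0, max_iter=50, tol=1e-4):
    """Generic DC / sequential-convex loop: `solve_convexified(wf, wg)` is assumed to
    return updated (wf, wg) after linearizing the concave parts at the current point."""
    wf, wg = np.asarray(wf0, float), np.asarray(wg0, float)
    for _ in range(max_iter):
        wf_new, wg_new = solve_convexified(wf, wg)
        change = max(np.max(np.abs(wf_new - wf)), np.max(np.abs(wg_new - wg)))
        wf, wg = wf_new, wg_new
        if change < tol:
            break
    return wf, wg
```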
30 Tuning Involving Unlabeled Data. This section proposes a novel tuning method based on the concept of generalized degrees of freedom (GDF) and the technique of data perturbation (Shen and Huang, 2006; Wang and Shen, 2006), through both labeled and unlabeled data. [sent-138, score-0.173]
31 By Theorem 1 of Wang and Shen (2006), the optimal estimated GE(f̂_C), after ignoring the terms independent of f̂_C, has the form EGE(f̂_C) + (2nl)^{-1} ∑_{i=1}^{nl} Cov(Yi, Sign(f̂_C(Xi)) | X^l) + (1/4) D1(X^l, f̂_C). (5) [sent-144, score-0.379]
32 Here, EGE(f̂_C) = (2nl)^{-1} ∑_{i=1}^{nl} (1 − Yi Sign(f̂_C(Xi))) is the training error, and D1(X^l, f̂_C) = E[ E(ℓ(X)) − nl^{-1} ∑_{i=1}^{nl} ℓ(Xi) | X^l ] with ℓ(X) = (E(Y|X) − Sign(f̂_C(X)))^2, where E(·|X) and E(·|X^l) are conditional expectations with respect to Y and Y^l respectively. [sent-145, score-0.365]
33 In (5), Cov(Yi, Sign(f̂_C(Xi)) | X^l); i = 1, · · · , nl, and D1(X^l, f̂_C) need to be estimated. [sent-147, score-0.365]
34 It appears that Cov(Yi , Sign( fˆC (Xi ))|X l ) is estimated only through labeled data, for which we apply the data perturbation technique of Wang and Shen (2006). [sent-148, score-0.077]
35 Our proposed estimate of GE becomes ĜE(f̂_C) = EGE(f̂_C) + (2nl)^{-1} ∑_{i=1}^{nl} Ĉov(Yi, Sign(f̂_C(Xi)) | X^l) + (1/4) D̂1(X^l, f̂_C), (10) where Ȳi* = D^{-1} ∑_d Yi^{*d} as in (9) and f̂_C^{*d} is trained on the d-th perturbed sample; by the law of large numbers, ĜE converges to (5) as D → ∞. [sent-156, score-0.365]
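A rough Monte-Carlo sketch of how the covariance term in (10) can be estimated by data perturbation. The label-flipping scheme, the scaling of the covariance, and the refitting interface are simplifying assumptions; the exact recipe follows Wang and Shen (2006).

```python
import numpy as np

def perturbation_cov(fit_classifier, Xl, yl, D=200, tau=0.5, seed=None):
    """Estimate Cov(Y_i, Sign(f_C(X_i)) | X^l) for each labeled point by perturbing the
    labels D times, refitting, and averaging the cross-products of the perturbed labels
    and the refitted signs (up to the scheme-dependent scaling)."""
    rng = np.random.default_rng(seed)
    nl = len(yl)
    ystar = np.empty((D, nl))
    signs = np.empty((D, nl))
    for d in range(D):
        flip = rng.random(nl) < tau          # assumed perturbation: flip each label w.p. tau
        y_d = np.where(flip, -yl, yl)
        f_d = fit_classifier(Xl, y_d)        # refit on the perturbed labels; returns f(.)
        ystar[d] = y_d
        signs[d] = np.sign(f_d(Xl))
    return ((ystar - ystar.mean(0)) * (signs - signs.mean(0))).mean(0)
```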
36 In practice, we recommend D to be at least nl to ensure the precision of MC approximation and τ to be 0. [sent-157, score-0.365]
37 In contrast to the estimated GE with labeled data alone, the GE( fˆC ) in (10) requires no perturbation of X when X u is available. [sent-159, score-0.077]
38 2) (Consistency of initial estimates): For almost all x, p̂i(x) → pi(x), as nl → ∞; i = 1, · · · , nl. [sent-165, score-0.73]
39 lim_{τ→0+} lim_{nl,nu→∞} ĜE(f̂_Ĉ) / inf_C GE(f̂_C) = 1. [sent-170, score-0.384]
40 Theorem 2 says the ideal optimal performance inf_C GE(f̂_C) can be realized by ĜE(f̂_C) when τ → 0+ and nl, nu → ∞, against any other tuning method. [sent-171, score-0.731]
41 Numerical Examples. This section examines the effectiveness of SSVM and SPSI and compares them against SVM with labeled data alone, TSVM and a graphical method of Zhu, Ghahramani and Lafferty (2003), in both simulated and benchmark examples. [sent-173, score-0.104]
42 Specifically, one regularizer for SVM and one tuning parameter σ in the Gaussian weight matrix for the graphical method, two regularizers for TSVM, and three regularizers for SSVM and SPSI are optimized over [10^{-2}, 10^3]. [sent-187, score-0.123]
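A simple sketch of the tuning search implied here: each candidate (C1, C2, C3) on a log-spaced grid in [10^{-2}, 10^3] is scored, for example by the estimated generalization error of Section 4, and the minimizer is kept. The grid resolution and the scoring callback are assumptions.

```python
import itertools
import numpy as np

def tune_regularizers(fit_and_score, grid=None):
    """Return the (C1, C2, C3) with the smallest score over a grid in [1e-2, 1e3].
    `fit_and_score(C1, C2, C3)` is assumed to fit the machine and return, e.g., GE-hat."""
    if grid is None:
        grid = np.logspace(-2, 3, 6)      # 10^-2, 10^-1, ..., 10^3
    best_score, best_C = np.inf, None
    for C1, C2, C3 in itertools.product(grid, repeat=3):
        score = fit_and_score(C1, C2, C3)
        if score < best_score:
            best_score, best_C = score, (C1, C2, C3)
    return best_C
```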
43 Next, 190 unlabeled instances (Xi1 , Xi2 ) are obtained by removing labels from a randomly chosen subset of the training sample, whereas the remaining 10 instances are treated as labeled data. [sent-197, score-0.157]
44 As in Example 1, we obtain 200 (10 labeled and 190 unlabeled) instances for training as well as 800 instances for testing. [sent-206, score-0.078]
45 Instances in the WBC and Mushroom examples are randomly divided into halves with 10 labeled and 190 unlabeled instances for training, and the remaining instances for testing. [sent-212, score-0.157]
46 Instances in the Spam email example are randomly divided into halves with 20 labeled and 580 unlabeled instances for training, and the remaining instances for testing. [sent-213, score-0.18]
47 Table 1: Averaged test errors as well as the estimated standard errors (in parentheses) of SVM with labeled data alone, TSVM, the graphical method, SSVM and SPSI, over 100 pairs of training and testing samples, in the simulated examples. [sent-280, score-0.105]
48 Table 2: Averaged test errors as well as the estimated standard errors (in parentheses) of SVM with labeled data alone, TSVM, the graphical method, SSVM and SPSI, over 100 pairs of training and testing samples, in the benchmark examples. [sent-373, score-0.103]
49 The amount of improvement is defined in (12), where the performance of SVM with labeled data alone serves as a baseline for comparison in absence of the Bayes error. [sent-374, score-0.086]
50 In contrast, the grouping boundaries estimated from the unlabeled covariates almost recover the true decision boundaries for classification. [sent-394, score-0.15]
51 This, together with the information obtained from the labeled data regarding the sign of labeling, results in much better estimated classification boundaries. [sent-395, score-0.109]
52 The solid, dashed, dotted and dotted-dashed (vertical) lines represent our ψ-learning-based decision function, the SVM decision function with labeled data alone, the partition decision function defined by unlabeled data, and the true decision boundary for classification. [sent-397, score-0.185]
53 Particularly, SVM is tuned using the method of Wang and Shen (2006) with labeled data alone, and SPSI, SSVM , TSVM and the graphical method are tuned by minimizing the GE( fˆC ) in (10) involving both labeled and unlabeled data over a set of grid points in the same fashion as in Section 5. [sent-405, score-0.205]
54 As expected, SPSI and SSVM outperform both SVM with labeled data alone and TSVM in all cases, and the graphical method in all examples except Mushroom, with improvements ranging from 2. [sent-409, score-0.112]
55 Table 3: Averaged test errors as well as the estimated standard errors (in parentheses) of SVM with labeled data alone, TSVM, the graphical method, SSVM and SPSI after tuning, over 100 pairs of training and testing samples, for the simulated and benchmark examples. [sent-561, score-0.118]
56 It is also interesting to note that TSVM obtained from SVMlight performs even worse than SVM with labeled data alone in the WBC example for linear learning, and the Spam email example for both linear and Gaussian kernel learning. [sent-563, score-0.109]
57 Statistical Learning Theory This section derives a finite-sample probability upper bound measuring the performance of SPSI in terms of complexity of the class of candidate decision functions F , sample sizes (n l , nu ) and tuning parameter C. [sent-568, score-0.38]
58 Theorem 3 (Finite-sample probability bound for SPSI) In addition to Assumptions A-C, assume that nl ≤ nu . [sent-599, score-0.7]
59 5 exp(−a7 nu ((nuC2 )−1 J0 )max(1,2−βg ) )+ C ∗ 6. [sent-601, score-0.335]
60 5 exp(−a10 nl ((nl C1 )−1 min(Jl , J( f ∗ )))max(1,2−β f ) )+ ∗ 6. [sent-602, score-0.365]
61 Corollary 4 Under the assumptions of Theorem 3, as nu ≥ nl → ∞, inf_{C∈C} |e(f̂_C, f*)| = O_p(s_n), where s_n = min(δ_{nl}^{2α_f}, max(δ_{nu}^{2α_g}, inf_C |e(g*_C, f*)|)). [sent-604, score-1.117]
62 Theorem 3 provides a probability bound for the upper tail of |e(f̂_C, f*)| for any finite (nl, nu). [sent-605, score-0.335]
63 Furthermore, Corollary 4 says that the Bayesian regret inf_{C∈C} |e(g*_C, f*)| for the SPSI classifier Sign(f̂_C) after tuning is of order no larger than s_n, when nu ≥ nl → ∞. [sent-606, score-0.761]
64 The local entropy avoids a loss of log nu factor in the linear case, although it may not be useful in the nonlinear case. [sent-612, score-0.364]
65 It then follows from Corollary 4 that infC |e( fˆC , f ∗ )| = O p (nu (log nu )(θ+1)/2 ) as nu ≥ nl → ∞. [sent-626, score-1.035]
66 This says that the optimal ideal performance of the Bayes rule is recovered by SPSI −(θ+1)/2 at speed of nu (log nu )(θ+1)/2 as nu ≥ nl → ∞. [sent-627, score-1.37]
67 Similarly, it follows −1/2 from Corollary 4 that infC |e( fˆC , f ∗ )| = O p (min(n−1 (log nl Jl )3 , nu (log nu Ju )3/2 )) as nu ≥ nl → ∞. [sent-634, score-1.735]
68 l Therefore, the optimal ideal performance of the Bayes rule is recovered by SPSI at fast speed of −1/2 (log nu Ju )3/2 ) as nu ≥ nl → ∞. [sent-635, score-1.035]
69 In contrast to most semi-supervised learning methods assuming various dependencies between the marginal and conditional distributions, the proposed methodology integrates labeled and unlabeled data through regularization to identify such dependencies for enhancing classification. [sent-638, score-0.154]
70 The theoretical and numerical results show that our methodology outperforms SVM and TSVM in situations where unlabeled data provides useful information, and performs no worse when it does not. [sent-639, score-0.183]
71 5 exp(−a11 nu (nuC2 )−1 Ju ), where Jl and Ju are defined in Lemma 5. [sent-658, score-0.335]
72 Define the scaled empirical process as E u (U(gC ) − −1 n ∗ (x )) − U(g(x )) + λ(J(g∗ ) − J(g)) − E(U(g∗ (X )) − U(g(X ))+ U(g)) = nu ∑ j=nl +1 U(gC j j j j C C ∗ ∗ λ(J(gC ) − J(g))) = Eu (U(gC ) −U(g)). [sent-660, score-0.335]
73 Similarly, I2 ≤ 3 exp(−a7 nu (λJ0 )max(1,2−βg ) )/(1 − exp(−a7 nu (λJ0 )max(1,2−βg ) ))2 . [sent-683, score-0.67]
74 Thus I ≤ I1 + I2 ≤ 6 exp(−a7 nu ((nuC2 )−1 J0 )max(1,2−βg ) )/(1 − exp(−a7 nu ((nuC2 )−1 J0 )max(1,2−βg ) ))2 , and I 1/2 ≤ (2. [sent-684, score-0.67]
75 5 + I 1/2 ) exp(−a7 nu ((nuC2 )−1 J0 )max(1,2−βg ) ). [sent-685, score-0.335]
76 5 exp(−a7 nu ˆ ∗ n ((nuC2 )−1 J0 )max(1,2−βg ) ) + 6. [sent-687, score-0.335]
77 By Assumption A and the triangle inequality, |e( f , g∗ )| ≤ ˆC a6 B/C3 ≤ δnu 3 nu C 6 ˆC , g∗ ))αg ≤ a1 (e (gC , g∗ ) + |e ( fˆC , gC )|)αg ≤ a1 (e (gC , g∗ ) + δ2 , implying that ˆ ˆ C ˆ C a1 (eU ( f C nu U U U ∗ ˆ ∗ P |e( fˆC , gC )| ≥ a1 (2δ2u )αg ≤ P(eU (gC , gC ) ≥ δ2u ), ∀C ∈ C . [sent-691, score-0.67]
78 5 exp(− a7 nu ((nuC2 )−1 J0 )max(1,2−βg ) ) + ˆ n 1884 L ARGE M ARGIN S EMI - SUPERVISED L EARNING ∗ ∗ 6. [sent-693, score-0.335]
79 5 exp(−a10 nl ((nl C1 )−1 Jl )max(1,2−β f ) ) + 6. [sent-694, score-0.365]
80 5 exp(−a11 nu ((nuC2 )−1 Ju )max(1,2−βg ) ), where C∗ = ∗ , f ∗ )|. [sent-695, score-0.335]
81 − i=1 2 ˆC , f ∗ ) ≥ a1 δ2 ) ≤ P(infC eL ( fˆC , f ∗ ) ≥ An application of the same treatment yields that P(infC e( f nl ∗ ∗ a1 δ2l ) ≤ 3. [sent-697, score-0.378]
82 5 exp(−a10 nl ((nl C1 )−1 J( f ∗ ))max(1,2−β f ) ) when nl C1 ≥ 2δ−2 max(J( f ∗ ), 1). [sent-698, score-0.73]
83 C1 nl nu i=nl +1 An application of the same argument as in the proof of Theorem 3 yields that for constants a 8 , a9 > 0, ˜ P(eW ( fˆC , gC ; fC , gC ) ≥ δ2 ) is upper bounded by ˆ ∗ ∗ w 3. [sent-705, score-0.713]
84 5 exp(−a9 nu ((nuC2 )−1 Ju )max(1,2−βg ) ), ∗ ∗ ∗ ∗ ˜ ˜ provided that 2Jl ≤ nl C1 δ2l and 2Ju ≤ nuC2 δ2u , where eW ( f , g; fC , gC ) = eL ( f , fC ) + C2 eU (g, gC ), n n C1 C2 nu ˜ 2 ˜ ˜ δ2 = δ2 + δ , Jl = max(Jl ( f ∗ , g∗ ), 1) and Ju = max(Ju ( f ∗ , g∗ ), 1). [sent-707, score-1.035]
85 w nl C1 nl nu C C C C ∗ ∗ ∗ ∗ Without loss of generality, assume min(Jl ( fC , gC ), Ju ( fC , gC )) ≥ 1. [sent-708, score-1.065]
86 5 exp(−a10 nl ((nl C1 )−1 Jl )max(1,2−β f ) ) + 6. [sent-712, score-0.365]
87 −1 ≤ Note that J( fˆC , gC ) ≤ s( fˆ, g) ≤ s(1, 1) ≤ 2C1 nl . [sent-714, score-0.365]
88 Suppose there exist ε-brackets (g^l_m, g^u_m)_{m=1}^{M} for some M such that for any g ∈ F(k) = {g ∈ F : J(g) ≤ k}, g^l_m ≤ g ≤ g^u_m and ‖g^u_m − g^l_m‖_∞ ≤ ε for some 1 ≤ m ≤ M. [sent-736, score-0.09]
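For readability, the bracketing condition can be written as a display, together with the standard metric entropy with bracketing that conditions such as (15) are phrased in terms of (the symbol H_B and the display are standard conventions assumed here, not notation quoted from this text).

```latex
% epsilon-brackets for F(k) = { g in F : J(g) <= k }
\[
  \forall g \in \mathcal{F}(k)\ \ \exists\, m \in \{1,\dots,M\}:\quad
  g^{l}_{m} \le g \le g^{u}_{m}, \qquad \| g^{u}_{m} - g^{l}_{m} \|_{\infty} \le \varepsilon .
\]
% The metric entropy with bracketing is the log of the smallest such M:
\[
  H_{B}(\varepsilon, \mathcal{F}(k)) \;=\; \log \min \{ M : \text{such brackets exist} \}.
\]
```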
89 Solving (15), we obtain εnl = ( loglnl )1/2 when C1 ∼ n J( f ∗ )δ−2 n−1 ∼ log nl and εnu = ( logunu )1/2 when C2 ∼ J0 δ−2 n−1 ∼ log nu . [sent-756, score-0.728]
90 Assumption C is fulfilled nl l nu u n because E(X 2 ) < ∞. [sent-757, score-0.7]
91 In conclusion, we obtain, by Corollary 4, that infC |e( fˆC , f ∗ )| = O p ((n−1 log nu )(θ+1)/2 ). [sent-758, score-0.349]
92 k=0 2σ For (13), note that F is rich for sufficiently large nl in that for any continuous function f , there exists a f˜ ∈ F such that f − f˜ ∞ ≤ ε2l , compare with Steinwart (2001). [sent-761, score-0.365]
93 Similarly, we have εnl = (n−1 (log nl Jl )3 )1/2 when C1 ∼ l J( f ∗ )δ−2 n−1 ∼ (log nl Jl )−3 and εnu = (n−1 (log nu Ju )3 )1/2 when C2 ∼ J0 δ−2 n−1 ∼ (log nu Ju )−3 . [sent-779, score-1.4]
94 Asnl l u nu u sumption C is fulfilled with the Gaussian kernel. [sent-780, score-0.335]
95 Furthermore, KKT’s condition requires that αi (1 − yi ( w f , xi ) − ξi ) = 0, β j ( wg , x j − 1 − ξ j ), γ j ( wg , x j +1+ξ j ) = 0, γi ξi = 0, and η j ξ j = 0. [sent-791, score-0.543]
96 Therefore, if 0 < αi < C1 , then ξi = 0 and 1 − yi ( w f , xi ) = 0, if 0 < β j + γ j < C2 , then ξ j = 0 and wg , x j + 1 = 0 or wg , x j − 1 = 0. [sent-793, score-0.543]
97 C2 ∇U2 (g nl +1 )), · · · , C2 ∇U2 (g n Proof of Theorem 10: The proof is similar to that of Theorem 9, and thus is omitted. [sent-799, score-0.365]
wordName wordTfidf (topN-words)
[('gc', 0.646), ('nl', 0.365), ('nu', 0.335), ('wg', 0.232), ('spsi', 0.221), ('fc', 0.215), ('ssvm', 0.175), ('ge', 0.144), ('tsvm', 0.124), ('shen', 0.098), ('ju', 0.086), ('infc', 0.085), ('unlabeled', 0.079), ('eu', 0.078), ('jl', 0.065), ('ssv', 0.057), ('hen', 0.051), ('emi', 0.051), ('sv', 0.051), ('labeled', 0.05), ('yi', 0.049), ('wang', 0.047), ('arge', 0.047), ('argin', 0.047), ('vh', 0.045), ('wbc', 0.045), ('sign', 0.045), ('grouping', 0.043), ('wh', 0.04), ('svm', 0.04), ('nv', 0.038), ('wl', 0.036), ('cov', 0.036), ('alone', 0.036), ('regularizers', 0.033), ('tuning', 0.031), ('mushroom', 0.031), ('fu', 0.031), ('wong', 0.03), ('xi', 0.03), ('fv', 0.029), ('interplay', 0.028), ('spam', 0.027), ('graphical', 0.026), ('ew', 0.025), ('max', 0.025), ('methodology', 0.025), ('vg', 0.024), ('email', 0.023), ('dc', 0.023), ('svmc', 0.023), ('supervised', 0.022), ('sup', 0.022), ('earning', 0.021), ('el', 0.021), ('wu', 0.021), ('margin', 0.021), ('exp', 0.02), ('var', 0.02), ('bayes', 0.019), ('inf', 0.019), ('kkt', 0.018), ('gl', 0.018), ('gu', 0.018), ('lemma', 0.018), ('fl', 0.017), ('ege', 0.017), ('svml', 0.017), ('ug', 0.017), ('vgc', 0.017), ('ynl', 0.017), ('zhu', 0.016), ('decompositions', 0.016), ('labeling', 0.016), ('minimizer', 0.016), ('arg', 0.016), ('gaussian', 0.016), ('gm', 0.015), ('sn', 0.015), ('assumptions', 0.015), ('entropy', 0.015), ('simulated', 0.015), ('regret', 0.015), ('fm', 0.014), ('instances', 0.014), ('minimization', 0.014), ('parenthesis', 0.014), ('bracket', 0.014), ('ugc', 0.014), ('log', 0.014), ('decision', 0.014), ('estimated', 0.014), ('min', 0.014), ('yields', 0.013), ('benchmark', 0.013), ('perturbation', 0.013), ('tu', 0.013), ('delivering', 0.013), ('adams', 0.013), ('inequality', 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 44 jmlr-2007-Large Margin Semi-supervised Learning
Author: Junhui Wang, Xiaotong Shen
Abstract: In classification, semi-supervised learning occurs when a large amount of unlabeled data is available with only a small number of labeled data. In such a situation, how to enhance predictability of classification through unlabeled data is the focus. In this article, we introduce a novel large margin semi-supervised learning methodology, using grouping information from unlabeled data, together with the concept of margins, in a form of regularization controlling the interplay between labeled and unlabeled data. Based on this methodology, we develop two specific machines involving support vector machines and ψ-learning, denoted as SSVM and SPSI, through difference convex programming. In addition, we estimate the generalization error using both labeled and unlabeled data, for tuning regularizers. Finally, our theoretical and numerical analyses indicate that the proposed methodology achieves the desired objective of delivering high performance in generalization, particularly against some strong performers. Keywords: generalization, grouping, sequential quadratic programming, support vectors
2 0.065089121 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
Author: Philippe Rigollet
Abstract: We consider semi-supervised classification when part of the available data is unlabeled. These unlabeled data can be useful for the classification problem when we make an assumption relating the behavior of the regression function to that of the marginal distribution. Seeger (2000) proposed the well-known cluster assumption as a reasonable one. We propose a mathematical formulation of this assumption and a method based on density level sets estimation that takes advantage of it to achieve fast rates of convergence both in the number of unlabeled examples and the number of labeled examples. Keywords: semi-supervised learning, statistical learning theory, classification, cluster assumption, generalization bounds
3 0.043370668 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
Author: Yiming Ying, Ding-Xuan Zhou
Abstract: Gaussian kernels with flexible variances provide a rich family of Mercer kernels for learning algorithms. We show that the union of the unit balls of reproducing kernel Hilbert spaces generated by Gaussian kernels with flexible variances is a uniform Glivenko-Cantelli (uGC) class. This result confirms a conjecture concerning learnability of Gaussian kernels and verifies the uniform convergence of many learning algorithms involving Gaussians with changing variances. Rademacher averages and empirical covering numbers are used to estimate sample errors of multi-kernel regularization schemes associated with general loss functions. It is then shown that the regularization error associated with the least square loss and the Gaussian kernels can be greatly improved when flexible variances are allowed. Finally, for regularization schemes generated by Gaussian kernels with flexible variances we present explicit learning rates for regression with least square loss and classification with hinge loss. Keywords: Gaussian kernel, flexible variances, learning theory, Glivenko-Cantelli class, regularization scheme, empirical covering number
4 0.038739797 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
Author: Yann Guermeur
Abstract: In the context of discriminant analysis, Vapnik’s statistical learning theory has mainly been developed in three directions: the computation of dichotomies with binary-valued functions, the computation of dichotomies with real-valued functions, and the computation of polytomies with functions taking their values in finite sets, typically the set of categories itself. The case of classes of vectorvalued functions used to compute polytomies has seldom been considered independently, which is unsatisfactory, for three main reasons. First, this case encompasses the other ones. Second, it cannot be treated appropriately through a na¨ve extension of the results devoted to the computation of ı dichotomies. Third, most of the classification problems met in practice involve multiple categories. In this paper, a VC theory of large margin multi-category classifiers is introduced. Central in this theory are generalized VC dimensions called the γ-Ψ-dimensions. First, a uniform convergence bound on the risk of the classifiers of interest is derived. The capacity measure involved in this bound is a covering number. This covering number can be upper bounded in terms of the γ-Ψdimensions thanks to generalizations of Sauer’s lemma, as is illustrated in the specific case of the scale-sensitive Natarajan dimension. A bound on this latter dimension is then computed for the class of functions on which multi-class SVMs are based. This makes it possible to apply the structural risk minimization inductive principle to those machines. Keywords: multi-class discriminant analysis, large margin classifiers, uniform strong laws of large numbers, generalized VC dimensions, multi-class SVMs, structural risk minimization inductive principle, model selection
5 0.032362182 90 jmlr-2007-Value Regularization and Fenchel Duality
Author: Ryan M. Rifkin, Ross A. Lippert
Abstract: Regularization is an approach to function learning that balances fit and smoothness. In practice, we search for a function f with a finite representation f = ∑i ci φi (·). In most treatments, the ci are the primary objects of study. We consider value regularization, constructing optimization problems in which the predicted values at the training points are the primary variables, and therefore the central objects of study. Although this is a simple change, it has profound consequences. From convex conjugacy and the theory of Fenchel duality, we derive separate optimality conditions for the regularization and loss portions of the learning problem; this technique yields clean and short derivations of standard algorithms. This framework is ideally suited to studying many other phenomena at the intersection of learning theory and optimization. We obtain a value-based variant of the representer theorem, which underscores the transductive nature of regularization in reproducing kernel Hilbert spaces. We unify and extend previous results on learning kernel functions, with very simple proofs. We analyze the use of unregularized bias terms in optimization problems, and low-rank approximations to kernel matrices, obtaining new results in these areas. In summary, the combination of value regularization and Fenchel duality are valuable tools for studying the optimization problems in machine learning. Keywords: kernel machines, duality, optimization, convex analysis, kernel learning
6 0.030270562 9 jmlr-2007-AdaBoost is Consistent
7 0.029054226 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
8 0.025756218 75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results
9 0.025176359 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
10 0.02272981 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression
11 0.022611393 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction inl1Using Cauchy Random Projections
12 0.022161664 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
13 0.021830956 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition
14 0.021611052 15 jmlr-2007-Bilinear Discriminant Component Analysis
15 0.020373346 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
16 0.020029962 52 jmlr-2007-Margin Trees for High-dimensional Classification
17 0.01935917 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
18 0.018559558 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression
20 0.017505543 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss
topicId topicWeight
[(0, 0.12), (1, -0.037), (2, 0.042), (3, 0.027), (4, 0.06), (5, 0.001), (6, -0.022), (7, 0.013), (8, 0.036), (9, -0.011), (10, 0.039), (11, -0.013), (12, 0.011), (13, -0.001), (14, 0.09), (15, -0.002), (16, 0.082), (17, 0.017), (18, -0.088), (19, 0.009), (20, 0.087), (21, 0.241), (22, -0.006), (23, -0.008), (24, 0.028), (25, -0.079), (26, -0.279), (27, -0.099), (28, 0.144), (29, -0.277), (30, 0.205), (31, 0.332), (32, -0.023), (33, 0.106), (34, -0.08), (35, -0.043), (36, 0.256), (37, 0.114), (38, -0.054), (39, -0.041), (40, 0.126), (41, -0.222), (42, 0.054), (43, -0.1), (44, 0.321), (45, -0.143), (46, 0.098), (47, 0.145), (48, 0.025), (49, 0.111)]
simIndex simValue paperId paperTitle
same-paper 1 0.96796775 44 jmlr-2007-Large Margin Semi-supervised Learning
Author: Junhui Wang, Xiaotong Shen
Abstract: In classification, semi-supervised learning occurs when a large amount of unlabeled data is available with only a small number of labeled data. In such a situation, how to enhance predictability of classification through unlabeled data is the focus. In this article, we introduce a novel large margin semi-supervised learning methodology, using grouping information from unlabeled data, together with the concept of margins, in a form of regularization controlling the interplay between labeled and unlabeled data. Based on this methodology, we develop two specific machines involving support vector machines and ψ-learning, denoted as SSVM and SPSI, through difference convex programming. In addition, we estimate the generalization error using both labeled and unlabeled data, for tuning regularizers. Finally, our theoretical and numerical analyses indicate that the proposed methodology achieves the desired objective of delivering high performance in generalization, particularly against some strong performers. Keywords: generalization, grouping, sequential quadratic programming, support vectors
2 0.34148538 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
Author: Philippe Rigollet
Abstract: We consider semi-supervised classification when part of the available data is unlabeled. These unlabeled data can be useful for the classification problem when we make an assumption relating the behavior of the regression function to that of the marginal distribution. Seeger (2000) proposed the well-known cluster assumption as a reasonable one. We propose a mathematical formulation of this assumption and a method based on density level sets estimation that takes advantage of it to achieve fast rates of convergence both in the number of unlabeled examples and the number of labeled examples. Keywords: semi-supervised learning, statistical learning theory, classification, cluster assumption, generalization bounds
3 0.20246175 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
Author: Yann Guermeur
Abstract: In the context of discriminant analysis, Vapnik’s statistical learning theory has mainly been developed in three directions: the computation of dichotomies with binary-valued functions, the computation of dichotomies with real-valued functions, and the computation of polytomies with functions taking their values in finite sets, typically the set of categories itself. The case of classes of vectorvalued functions used to compute polytomies has seldom been considered independently, which is unsatisfactory, for three main reasons. First, this case encompasses the other ones. Second, it cannot be treated appropriately through a na¨ve extension of the results devoted to the computation of ı dichotomies. Third, most of the classification problems met in practice involve multiple categories. In this paper, a VC theory of large margin multi-category classifiers is introduced. Central in this theory are generalized VC dimensions called the γ-Ψ-dimensions. First, a uniform convergence bound on the risk of the classifiers of interest is derived. The capacity measure involved in this bound is a covering number. This covering number can be upper bounded in terms of the γ-Ψdimensions thanks to generalizations of Sauer’s lemma, as is illustrated in the specific case of the scale-sensitive Natarajan dimension. A bound on this latter dimension is then computed for the class of functions on which multi-class SVMs are based. This makes it possible to apply the structural risk minimization inductive principle to those machines. Keywords: multi-class discriminant analysis, large margin classifiers, uniform strong laws of large numbers, generalized VC dimensions, multi-class SVMs, structural risk minimization inductive principle, model selection
4 0.15721275 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
Author: Yiming Ying, Ding-Xuan Zhou
Abstract: Gaussian kernels with flexible variances provide a rich family of Mercer kernels for learning algorithms. We show that the union of the unit balls of reproducing kernel Hilbert spaces generated by Gaussian kernels with flexible variances is a uniform Glivenko-Cantelli (uGC) class. This result confirms a conjecture concerning learnability of Gaussian kernels and verifies the uniform convergence of many learning algorithms involving Gaussians with changing variances. Rademacher averages and empirical covering numbers are used to estimate sample errors of multi-kernel regularization schemes associated with general loss functions. It is then shown that the regularization error associated with the least square loss and the Gaussian kernels can be greatly improved when flexible variances are allowed. Finally, for regularization schemes generated by Gaussian kernels with flexible variances we present explicit learning rates for regression with least square loss and classification with hinge loss. Keywords: Gaussian kernel, flexible variances, learning theory, Glivenko-Cantelli class, regularization scheme, empirical covering number
5 0.15710595 15 jmlr-2007-Bilinear Discriminant Component Analysis
Author: Mads Dyrholm, Christoforos Christoforou, Lucas C. Parra
Abstract: Factor analysis and discriminant analysis are often used as complementary approaches to identify linear components in two dimensional data arrays. For three dimensional arrays, which may organize data in dimensions such as space, time, and trials, the opportunity arises to combine these two approaches. A new method, Bilinear Discriminant Component Analysis (BDCA), is derived and demonstrated in the context of functional brain imaging data for which it seems ideally suited. The work suggests to identify a subspace projection which optimally separates classes while ensuring that each dimension in this space captures an independent contribution to the discrimination. Keywords: bilinear, decomposition, component, classification, regularization
6 0.15524061 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
7 0.13791285 75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results
8 0.13024652 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
9 0.12759551 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels
10 0.11972343 90 jmlr-2007-Value Regularization and Fenchel Duality
11 0.11493137 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression
12 0.1139985 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
14 0.10542407 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models
15 0.10159592 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation
16 0.098840185 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction inl1Using Cauchy Random Projections
17 0.091281563 9 jmlr-2007-AdaBoost is Consistent
18 0.090993464 40 jmlr-2007-Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization
19 0.090878136 70 jmlr-2007-Ranking the Best Instances
20 0.087469742 77 jmlr-2007-Stagewise Lasso
topicId topicWeight
[(8, 0.017), (10, 0.023), (12, 0.033), (15, 0.044), (19, 0.429), (28, 0.051), (40, 0.036), (45, 0.016), (48, 0.034), (60, 0.043), (80, 0.015), (85, 0.043), (98, 0.083)]
simIndex simValue paperId paperTitle
same-paper 1 0.74560106 44 jmlr-2007-Large Margin Semi-supervised Learning
Author: Junhui Wang, Xiaotong Shen
Abstract: In classification, semi-supervised learning occurs when a large amount of unlabeled data is available with only a small number of labeled data. In such a situation, how to enhance predictability of classification through unlabeled data is the focus. In this article, we introduce a novel large margin semi-supervised learning methodology, using grouping information from unlabeled data, together with the concept of margins, in a form of regularization controlling the interplay between labeled and unlabeled data. Based on this methodology, we develop two specific machines involving support vector machines and ψ-learning, denoted as SSVM and SPSI, through difference convex programming. In addition, we estimate the generalization error using both labeled and unlabeled data, for tuning regularizers. Finally, our theoretical and numerical analyses indicate that the proposed methodology achieves the desired objective of delivering high performance in generalization, particularly against some strong performers. Keywords: generalization, grouping, sequential quadratic programming, support vectors
2 0.28546464 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
Author: Roland Nilsson, José M. Peña, Johan Björkegren, Jesper Tegnér
Abstract: We analyze two different feature selection problems: finding a minimal feature set optimal for classification (MINIMAL - OPTIMAL) vs. finding all features relevant to the target variable (ALL RELEVANT ). The latter problem is motivated by recent applications within bioinformatics, particularly gene expression analysis. For both problems, we identify classes of data distributions for which there exist consistent, polynomial-time algorithms. We also prove that ALL - RELEVANT is much harder than MINIMAL - OPTIMAL and propose two consistent, polynomial-time algorithms. We argue that the distribution classes considered are reasonable in many practical cases, so that our results simplify feature selection in a wide range of machine learning tasks. Keywords: learning theory, relevance, classification, Markov blanket, bioinformatics
3 0.28452134 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation
Author: Guy Lebanon, Yi Mao, Joshua Dillon
Abstract: The popular bag of words assumption represents a document as a histogram of word occurrences. While computationally efficient, such a representation is unable to maintain any sequential information. We present an effective sequential document representation that goes beyond the bag of words representation and its n-gram extensions. This representation uses local smoothing to embed documents as smooth curves in the multinomial simplex thereby preserving valuable sequential information. In contrast to bag of words or n-grams, the new representation is able to robustly capture medium and long range sequential trends in the document. We discuss the representation and its geometric properties and demonstrate its applicability for various text processing tasks. Keywords: text processing, local smoothing
4 0.27893621 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
Author: Rie Johnson, Tong Zhang
Abstract: This paper investigates the effect of Laplacian normalization in graph-based semi-supervised learning. To this end, we consider multi-class transductive learning on graphs with Laplacian regularization. Generalization bounds are derived using geometric properties of the graph. Specifically, by introducing a definition of graph cut from learning theory, we obtain generalization bounds that depend on the Laplacian regularizer. We then use this analysis to better understand the role of graph Laplacian matrix normalization. Under assumptions that the cut is small, we derive near-optimal normalization factors by approximately minimizing the generalization bounds. The analysis reveals the limitations of the standard degree-based normalization method in that the resulting normalization factors can vary significantly within each connected component with the same class label, which may cause inferior generalization performance. Our theory also suggests a remedy that does not suffer from this problem. Experiments confirm the superiority of the normalization scheme motivated by learning theory on artificial and real-world data sets. Keywords: transductive learning, graph learning, Laplacian regularization, normalization of graph Laplacian
Author: Miroslav Dudík, Steven J. Phillips, Robert E. Schapire
Abstract: We present a unified and complete account of maximum entropy density estimation subject to constraints represented by convex potential functions or, alternatively, by convex regularization. We provide fully general performance guarantees and an algorithm with a complete convergence proof. As special cases, we easily derive performance guarantees for many known regularization types, including 1 , 2 , 2 , and 1 + 2 style regularization. We propose an algorithm solving a large and 2 2 general subclass of generalized maximum entropy problems, including all discussed in the paper, and prove its convergence. Our approach generalizes and unifies techniques based on information geometry and Bregman divergences as well as those based more directly on compactness. Our work is motivated by a novel application of maximum entropy to species distribution modeling, an important problem in conservation biology and ecology. In a set of experiments on real-world data, we demonstrate the utility of maximum entropy in this setting. We explore effects of different feature types, sample sizes, and regularization levels on the performance of maxent, and discuss interpretability of the resulting models. Keywords: maximum entropy, density estimation, regularization, iterative scaling, species distribution modeling
6 0.27809918 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data
7 0.27668053 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
8 0.27653664 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
9 0.27633268 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
10 0.27526575 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition
11 0.27468562 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression
12 0.27198574 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
13 0.27164006 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
14 0.2715205 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
15 0.27072996 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm
16 0.2689476 88 jmlr-2007-Unlabeled Compression Schemes for Maximum Classes
17 0.26877803 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification
18 0.2686756 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study
19 0.26862198 39 jmlr-2007-Handling Missing Values when Applying Classification Models
20 0.26760116 55 jmlr-2007-Minimax Regret Classifier for Imprecise Class Distributions