jmlr jmlr2007 jmlr2007-44 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Junhui Wang, Xiaotong Shen
Abstract: In classification, semi-supervised learning occurs when a large amount of unlabeled data is available with only a small number of labeled data. In such a situation, how to enhance predictability of classification through unlabeled data is the focus. In this article, we introduce a novel large margin semi-supervised learning methodology, using grouping information from unlabeled data, together with the concept of margins, in a form of regularization controlling the interplay between labeled and unlabeled data. Based on this methodology, we develop two specific machines involving support vector machines and ψ-learning, denoted as SSVM and SPSI, through difference convex programming. In addition, we estimate the generalization error using both labeled and unlabeled data, for tuning regularizers. Finally, our theoretical and numerical analyses indicate that the proposed methodology achieves the desired objective of delivering high performance in generalization, particularly against some strong performers. Keywords: generalization, grouping, sequential quadratic programming, support vectors
Reference: text
sentIndex sentText sentNum sentScore
1 School of Statistics, University of Minnesota, Minneapolis, MN 55455, USA. Editor: Tommi Jaakkola. Abstract: In classification, semi-supervised learning occurs when a large amount of unlabeled data is available with only a small number of labeled data. [sent-5, score-0.129]
2 In such a situation, how to enhance predictability of classification through unlabeled data is the focus. [sent-6, score-0.079]
3 In this article, we introduce a novel large margin semi-supervised learning methodology, using grouping information from unlabeled data, together with the concept of margins, in a form of regularization controlling the interplay between labeled and unlabeled data. [sent-7, score-0.3]
4 In addition, we estimate the generalization error using both labeled and unlabeled data, for tuning regularizers. [sent-9, score-0.16]
5 Introduction. In many classification problems, a large amount of unlabeled data is available, while it is costly to obtain labeled data. [sent-12, score-0.129]
6 In text categorization, particularly web-page classification, a machine is trained with a small number of manually labeled texts (web-pages), as well as a huge amount of unlabeled texts (web-pages), because manual labeling is impractical; compare with Joachims (1999). [sent-13, score-0.145]
7 In spam detection, a small group of identified e-mails, spam or non-spam, is used, in conjunction with a large number of unidentified e-mails, to train a filter to flag incoming spam e-mails, compare with Amini and Gallinari (2003). [sent-14, score-0.081]
8 In such a situation, one research problem is how to enhance the accuracy of prediction in classification by using both unlabeled and labeled data. [sent-17, score-0.129]
9 The problem of this sort is referred to as semi-supervised learning, which differs from a conventional “missing data” problem in that the size of the unlabeled data greatly exceeds that of the labeled data, and missingness occurs only in the response. [sent-18, score-0.129]
10 The central issue that this article addresses is how to use information from unlabeled data to enhance predictability of classification. [sent-19, score-0.079]
11 In semi-supervised learning, a sample {Zi = (Xi, Yi)}_{i=1}^{nl} is observed with labeling Yi ∈ {−1, 1}, in addition to an independent unlabeled sample {Xj}_{j=nl+1}^{n} with n = nl + nu, where Xk = (Xk1, · · · , Xkp); k = 1, · · · , n, is a p-dimensional input. [sent-20, score-0.795]
12 Here the labeled sample is independently and identically distributed according to an unknown joint distribution P(x, y), and the unlabeled sample is c 2007 Junhui Wang and Xiaotong Shen. [sent-21, score-0.129]
13 Essentially all existing methods make various assumptions about the relationship between P(Y = 1|X = x) and P(x) in a way for an improvement to occur when unlabeled data is used. [sent-27, score-0.094]
14 The primary objective of this article is to develop a large margin semi-supervised learning methodology to deliver high performance of classification by using unlabeled data. [sent-31, score-0.125]
15 The methodology is designed to adapt to a variety of situations by identifying, as opposed to specifying, a relationship between labeled and unlabeled data from the data itself. [sent-32, score-0.154]
16 It yields an improvement when unlabeled data can reconstruct the optimal classification boundary, and performs no worse than its supervised counterpart otherwise. [sent-33, score-0.127]
17 These ingredients are integrated in a form of regularization involving three regularizers, controlling, respectively, classification with labeled data, grouping with unlabeled data, and the interplay between them. [sent-36, score-0.2]
18 Moreover, we introduce a method that uses unlabeled data for tuning the regularizers. [sent-37, score-0.141]
19 Moreover, a novel learning theory is developed to quantify SPSI’s generalization error as a function of complexity of the class of candidate decision functions, the sample sizes (nl , nu ), and the regularizers. [sent-40, score-0.349]
20 To our knowledge, this is the first attempt to relate a classifier’s generalization error to (nl , nu ) and regularizers in semisupervised learning. [sent-41, score-0.368]
21 Section 4 proposes a tuning methodology that uses both labeled and unlabeled data to enhance the accuracy of estimation of the generalization error. [sent-46, score-0.185]
22 In order to extract useful information about classification from unlabeled data, we construct a loss U(·) for a grouping decision function g(x) = (1, x^T) w_g ≡ w̃_g^T x + w_{g,0}, with Sign(g(x)) indicating grouping. [sent-58, score-0.136]
23 While U can be used to extract the grouping boundary, it needs to yield the Bayes decision function f* = arg min_{f∈F} E L(Y f(X)) in order for it to be useful for classification, where E is the expectation with respect to (X, Y). [sent-61, score-0.087]
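As a point of reference, a standard fact about the hinge loss (general background, not specific to this paper) makes the requirement on U concrete: the population minimizer of the hinge risk is the Bayes rule itself.

```latex
% With eta(x) = P(Y = 1 | X = x) and hinge loss L(z) = (1 - z)_+,
% the population risk E[(1 - Y f(X))_+] is minimized over measurable f by
\[
  f^*(x) \;=\; \operatorname{sign}\!\left(\eta(x) - \tfrac{1}{2}\right),
\]
% so Sign(f^*) coincides with the Bayes classifier; a grouping loss U is useful for
% classification precisely when its minimizer recovers this f^*.
```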
24 Whereas L(y f(x)) regularizes the contribution from labeled data, U(g(x)) controls the information extracted from unlabeled data, and ‖w_f − w_g‖^2 penalizes the disagreement between f and g, specifying a loose relationship between f and g. [sent-66, score-0.361]
25 Note that in (1) the geometric margin term ‖w̃_f‖^2/2 does not enter, as it is regularized implicitly through ‖w_f − w_g‖^2/2 and ‖w̃_g‖^2/2. [sent-68, score-0.253]
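To make the three-term form concrete, here is a small illustrative Python sketch of a linear-case objective of this shape. The symmetric hinge U(z) = (1 − |z|)_+ and the inclusion of the intercepts in the disagreement penalty are assumptions for illustration, not the paper's exact specification.

```python
import numpy as np

def hinge(z):
    # L(z) = (1 - z)_+, applied to labeled functional margins y_i * f(x_i)
    return np.maximum(0.0, 1.0 - z)

def sym_hinge(z):
    # U(z) = (1 - |z|)_+, one common grouping loss on unlabeled margins (assumed form)
    return np.maximum(0.0, 1.0 - np.abs(z))

def objective(wf, wg, Xl, yl, Xu, C1, C2, C3):
    """C1 * sum_i L(y_i f(x_i)) + C2 * sum_j U(g(x_j))
       + (C3/2) * ||w_f - w_g||^2 + (1/2) * ||w~_g||^2,
       with f(x) = wf[0] + x @ wf[1:] and g(x) = wg[0] + x @ wg[1:]."""
    f_l = Xl @ wf[1:] + wf[0]                      # f on labeled inputs
    g_u = Xu @ wg[1:] + wg[0]                      # g on unlabeled inputs
    labeled = C1 * np.sum(hinge(yl * f_l))         # classification term
    unlabeled = C2 * np.sum(sym_hinge(g_u))        # grouping term
    interplay = 0.5 * C3 * np.sum((wf - wg) ** 2)  # disagreement between f and g
    ridge = 0.5 * np.sum(wg[1:] ** 2)              # ||w~_g||^2 / 2
    return labeled + unlabeled + interplay + ridge
```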
26 In nonlinear learning, a kernel K(·, ·) that maps from S × S to R^1 is usually introduced for flexible representations: f(x) = (1, K(x, x1), · · · , K(x, xn)) w_f and g(x) = (1, K(x, x1), · · · , K(x, xn)) w_g, with w_f = (w̃_f, w_{f,0}) and w_g = (w̃_g, w_{g,0}). [sent-69, score-0.464]
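A brief sketch of the kernel design just described; the Gaussian kernel, its bandwidth, and the function names are illustrative choices rather than the paper's notation.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.sum((np.asarray(x) - np.asarray(z)) ** 2) / (2.0 * sigma ** 2))

def kernel_design(X_eval, X_train, kernel=gaussian_kernel):
    """Map each x to (1, K(x, x_1), ..., K(x, x_n)), so that f(x) = phi(x) @ w_f
    and g(x) = phi(x) @ w_g with coefficient vectors of length n + 1."""
    Phi = np.array([[kernel(x, xi) for xi in X_train] for x in X_eval])
    return np.hstack([np.ones((len(X_eval), 1)), Phi])
```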
27 The constrained version of (2), after introducing slack variables {ξk ≥ 0; k = 1, · · · , n}, becomes C1 ∑_{i=1}^{nl} ξi + C2 ∑_{j=nl+1}^{n} ξj + (C3/2) ‖f − g‖^2 + (1/2) ‖g̃‖^2, (3) subject to ξi − L(yi f(xi)) ≥ 0; i = 1, · · · , nl; ξj − U(g(xj)) ≥ 0; j = nl + 1, · · · , n. [sent-80, score-1.095]
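Below is a hedged sketch of one convex subproblem derived from the slack form (3), written with cvxpy for the linear case. It assumes U(z) = (1 − |z|)_+ with the DC split U(z) = max(1, |z|) − |z| and linearizes the concave part at a current iterate g_old; the paper's sequential quadratic programming details may differ.

```python
import numpy as np
import cvxpy as cp

def ssvm_subproblem(Xl, yl, Xu, C1, C2, C3, g_old):
    """One convexified instance of (3): slack constraints for the labeled hinge loss,
    and for the unlabeled loss with its concave part linearized at g_old
    (a length-nu vector of current values of g on the unlabeled inputs)."""
    nl, p = Xl.shape
    nu = Xu.shape[0]
    wf = cp.Variable(p + 1)               # (w_{f,0}, w~_f)
    wg = cp.Variable(p + 1)               # (w_{g,0}, w~_g)
    xi_l = cp.Variable(nl, nonneg=True)   # slacks for labeled points
    xi_u = cp.Variable(nu, nonneg=True)   # slacks for unlabeled points
    f_l = Xl @ wf[1:] + wf[0]
    g_u = Xu @ wg[1:] + wg[0]
    # linearization of the concave part -|z| of U at g_old
    lin = np.abs(g_old) + cp.multiply(np.sign(g_old), g_u - g_old)
    constraints = [
        xi_l >= 1 - cp.multiply(yl, f_l),          # xi_i >= L(y_i f(x_i))
        xi_u >= cp.maximum(1, cp.abs(g_u)) - lin,  # xi_j >= convexified U(g(x_j))
    ]
    obj = (C1 * cp.sum(xi_l) + C2 * cp.sum(xi_u)
           + 0.5 * C3 * cp.sum_squares(wf - wg)
           + 0.5 * cp.sum_squares(wg[1:]))
    cp.Problem(cp.Minimize(obj), constraints).solve()
    return wf.value, wg.value
```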
28 SSVM in (3) stems from grouping and interplay between grouping and classification, whereas TSVM focuses on classification. [sent-88, score-0.114]
29 Similarly, ∇s2^ψ = (∇1f^ψ, ∇2f^ψ, ∇1g^ψ, ∇2g^ψ) is the gradient vector of s2 with respect to (w_f, w_g), with ∇1f^ψ = C1 ∑_{i=1}^{nl} ∇ψ2(yi f(xi)) yi xi, ∇2f^ψ = C1 ∑_{i=1}^{nl} ∇ψ2(yi f(xi)) yi, ∇1g^ψ = 2∇1g^{SVM}, and ∇2g^ψ = 2∇2g^{SVM}, where ∇ψ2(z) = 0 if z > 0 and ∇ψ2(z) = −2 otherwise. [sent-133, score-0.262]
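A minimal sketch of the outer difference convex (DC) iteration that such gradients feed into: the concave part of the nonconvex loss is linearized at the current iterate (using subgradients such as ∇ψ2 above) and a convex subproblem is solved until the iterates stabilize. The function names and stopping rule are illustrative, not the paper's.

```python
import numpy as np

def grad_psi2(z):
    # Subgradient used for the concave part, as stated in the text:
    # 0 for z > 0 and -2 otherwise.
    return np.where(np.asarray(z) > 0, 0.0, -2.0)

def dc_iterate(solve_convexified, wf0, wg0, max_iter=50, tol=1e-4):
    """Generic DC / sequential-convex loop: `solve_convexified(wf, wg)` is assumed to
    return updated (wf, wg) after linearizing the concave parts at the current point."""
    wf, wg = np.asarray(wf0, float), np.asarray(wg0, float)
    for _ in range(max_iter):
        wf_new, wg_new = solve_convexified(wf, wg)
        change = max(np.max(np.abs(wf_new - wf)), np.max(np.abs(wg_new - wg)))
        wf, wg = wf_new, wg_new
        if change < tol:
            break
    return wf, wg
```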
30 Tuning Involving Unlabeled Data. This section proposes a novel tuning method based on the concept of generalized degrees of freedom (GDF) and the technique of data perturbation (Shen and Huang, 2006; Wang and Shen, 2006), through both labeled and unlabeled data. [sent-138, score-0.173]
31 By Theorem 1 of Wang and Shen (2006), the optimal estimated GE(f̂_C), after ignoring the terms independent of f̂_C, has the form EGE(f̂_C) + (2nl)^{-1} ∑_{i=1}^{nl} Cov(Yi, Sign(f̂_C(Xi)) | X^l) + (1/4) D1(X^l, f̂_C). (5) [sent-144, score-0.379]
32 Here, EGE(f̂_C) = (2nl)^{-1} ∑_{i=1}^{nl} (1 − Yi Sign(f̂_C(Xi))) is the training error, and D1(X^l, f̂_C) = E[ E(ℓ(X)) − nl^{-1} ∑_{i=1}^{nl} ℓ(Xi) | X^l ] with ℓ(X) = (E(Y|X) − Sign(f̂_C(X)))^2, where E(·|X) and E(·|X^l) are conditional expectations with respect to Y and Y^l respectively. [sent-145, score-0.365]
33 In (5), Cov(Yi, Sign(f̂_C(Xi)) | X^l); i = 1, · · · , nl, and D1(X^l, f̂_C) need to be estimated. [sent-147, score-0.365]
34 It appears that Cov(Yi , Sign( fˆC (Xi ))|X l ) is estimated only through labeled data, for which we apply the data perturbation technique of Wang and Shen (2006). [sent-148, score-0.077]
35 Our proposed estimate of GE becomes ĜE(f̂_C) = EGE(f̂_C) + (2nl)^{-1} ∑_{i=1}^{nl} Ĉov(Yi, Sign(f̂_C(Xi)) | X^l) + (1/4) D̂1(X^l, f̂_C), (10) where Ȳi* = D^{-1} ∑_d Yi^{*d} as in (9) and f̂_C^{*d} is trained on the d-th perturbed sample; by the law of large numbers, ĜE converges to (5) as D → ∞. [sent-156, score-0.365]
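A rough Monte-Carlo sketch of how the covariance term in (10) can be estimated by data perturbation. The label-flipping scheme, the scaling of the covariance, and the refitting interface are simplifying assumptions; the exact recipe follows Wang and Shen (2006).

```python
import numpy as np

def perturbation_cov(fit_classifier, Xl, yl, D=200, tau=0.5, seed=None):
    """Estimate Cov(Y_i, Sign(f_C(X_i)) | X^l) for each labeled point by perturbing the
    labels D times, refitting, and averaging the cross-products of the perturbed labels
    and the refitted signs (up to the scheme-dependent scaling)."""
    rng = np.random.default_rng(seed)
    nl = len(yl)
    ystar = np.empty((D, nl))
    signs = np.empty((D, nl))
    for d in range(D):
        flip = rng.random(nl) < tau          # assumed perturbation: flip each label w.p. tau
        y_d = np.where(flip, -yl, yl)
        f_d = fit_classifier(Xl, y_d)        # refit on the perturbed labels; returns f(.)
        ystar[d] = y_d
        signs[d] = np.sign(f_d(Xl))
    return ((ystar - ystar.mean(0)) * (signs - signs.mean(0))).mean(0)
```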
36 In practice, we recommend D to be at least nl to ensure the precision of MC approximation and τ to be 0. [sent-157, score-0.365]
37 In contrast to the estimated GE with labeled data alone, the GE( fˆC ) in (10) requires no perturbation of X when X u is available. [sent-159, score-0.077]
38 2) (Consistency of initial estimates): For almost all x, p̂i(x) → pi(x), as nl → ∞; i = 1, · · · , nl. [sent-165, score-0.73]
39 lim_{τ→0+} lim_{nl,nu→∞} ĜE(f̂_Ĉ) / inf_C GE(f̂_C) = 1. [sent-170, score-0.384]
40 Theorem 2 says the ideal optimal performance inf_C GE(f̂_C) can be realized by ĜE(f̂_C) when τ → 0+ and nl, nu → ∞, against any other tuning method. [sent-171, score-0.731]
41 Numerical Examples. This section examines the effectiveness of SSVM and SPSI and compares them against SVM with labeled data alone, TSVM and a graphical method of Zhu, Ghahramani and Lafferty (2003), in both simulated and benchmark examples. [sent-173, score-0.104]
42 Specifically, one regularizer for SVM and one tuning parameter σ in the Gaussian weight matrix for the graphical method, two regularizers for TSVM, and three regularizers for SSVM and SPSI are optimized over [10^{-2}, 10^3]. [sent-187, score-0.123]
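A simple sketch of the tuning search implied here: each candidate (C1, C2, C3) on a log-spaced grid in [10^{-2}, 10^3] is scored, for example by the estimated generalization error of Section 4, and the minimizer is kept. The grid resolution and the scoring callback are assumptions.

```python
import itertools
import numpy as np

def tune_regularizers(fit_and_score, grid=None):
    """Return the (C1, C2, C3) with the smallest score over a grid in [1e-2, 1e3].
    `fit_and_score(C1, C2, C3)` is assumed to fit the machine and return, e.g., GE-hat."""
    if grid is None:
        grid = np.logspace(-2, 3, 6)      # 10^-2, 10^-1, ..., 10^3
    best_score, best_C = np.inf, None
    for C1, C2, C3 in itertools.product(grid, repeat=3):
        score = fit_and_score(C1, C2, C3)
        if score < best_score:
            best_score, best_C = score, (C1, C2, C3)
    return best_C
```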
43 Next, 190 unlabeled instances (Xi1 , Xi2 ) are obtained by removing labels from a randomly chosen subset of the training sample, whereas the remaining 10 instances are treated as labeled data. [sent-197, score-0.157]
44 As in Example 1, we obtain 200 (10 labeled and 190 unlabeled) instances for training as well as 800 instances for testing. [sent-206, score-0.078]
45 Instances in the WBC and Mushroom examples are randomly divided into halves with 10 labeled and 190 unlabeled instances for training, and the remaining instances for testing. [sent-212, score-0.157]
46 Instances in the Spam email example are randomly divided into halves with 20 labeled and 580 unlabeled instances for training, and the remaining instances for testing. [sent-213, score-0.18]
47 Table 1: Averaged test errors as well as the estimated standard errors (in parentheses) of SVM with labeled data alone, TSVM, the graphical method, SSVM and SPSI, over 100 pairs of training and testing samples, in the simulated examples. [sent-280, score-0.105]
48 Table 2: Averaged test errors as well as the estimated standard errors (in parentheses) of SVM with labeled data alone, TSVM, the graphical method, SSVM and SPSI, over 100 pairs of training and testing samples, in the benchmark examples. [sent-373, score-0.103]
49 The amount of improvement is defined in (12), where the performance of SVM with labeled data alone serves as a baseline for comparison in absence of the Bayes error. [sent-374, score-0.086]
50 In contrast, the grouping boundaries estimated from the unlabeled covariates almost recover the true decision boundaries for classification. [sent-394, score-0.15]
51 This, together with the information obtained from the labeled data regarding the sign of labeling, results in much better estimated classification boundaries. [sent-395, score-0.109]
52 The solid, dashed, dotted and dotted-dashed (vertical) lines represent our ψ-learning-based decision function, the SVM decision function with labeled data alone, the partition decision function defined by unlabeled data, and the true decision boundary for classification. [sent-397, score-0.185]
53 Particularly, SVM is tuned using the method of Wang and Shen (2006) with labeled data alone, and SPSI, SSVM , TSVM and the graphical method are tuned by minimizing the GE( fˆC ) in (10) involving both labeled and unlabeled data over a set of grid points in the same fashion as in Section 5. [sent-405, score-0.205]
54 As expected, SPSI and SSVM outperform both SVM with labeled data alone and TSVM in all cases, and the graphical method in all examples except Mushroom, with improvements ranging from 2. [sent-409, score-0.112]
55 Table 3: Averaged test errors as well as the estimated standard errors (in parentheses) of SVM with labeled data alone, TSVM, the graphical method, SSVM and SPSI after tuning, over 100 pairs of training and testing samples, for the simulated and benchmark examples. [sent-561, score-0.118]
56 It is also interesting to note that TSVM obtained from SVMlight performs even worse than SVM with labeled data alone in the WBC example for linear learning, and the Spam email example for both linear and Gaussian kernel learning. [sent-563, score-0.109]
57 Statistical Learning Theory This section derives a finite-sample probability upper bound measuring the performance of SPSI in terms of complexity of the class of candidate decision functions F , sample sizes (n l , nu ) and tuning parameter C. [sent-568, score-0.38]
58 Theorem 3 (Finite-sample probability bound for SPSI) In addition to Assumptions A-C, assume that nl ≤ nu . [sent-599, score-0.7]
59 5 exp(−a7 nu ((nuC2 )−1 J0 )max(1,2−βg ) )+ C ∗ 6. [sent-601, score-0.335]
60 5 exp(−a10 nl ((nl C1 )−1 min(Jl , J( f ∗ )))max(1,2−β f ) )+ ∗ 6. [sent-602, score-0.365]
61 Corollary 4 Under the assumptions of Theorem 3, as nu ≥ nl → ∞, inf_{C∈C} |e(f̂_C, f*)| = O_p(s_n), where s_n = min(δ_{nl}^{2α_f}, max(δ_{nu}^{2α_g}, inf_C |e(g*_C, f*)|)). [sent-604, score-1.117]
62 Theorem 3 provides a probability bound for the upper tail of |e(f̂_C, f*)| for any finite (nl, nu). [sent-605, score-0.335]
63 Furthermore, Corollary 4 says that the Bayesian regret inf_{C∈C} |e(g*_C, f*)| for the SPSI classifier Sign(f̂_C) after tuning is of order no larger than s_n, when nu ≥ nl → ∞. [sent-606, score-0.761]
64 The local entropy avoids a loss of log nu factor in the linear case, although it may not be useful in the nonlinear case. [sent-612, score-0.364]
65 It then follows from Corollary 4 that infC |e( fˆC , f ∗ )| = O p (nu (log nu )(θ+1)/2 ) as nu ≥ nl → ∞. [sent-626, score-1.035]
66 This says that the optimal ideal performance of the Bayes rule is recovered by SPSI −(θ+1)/2 at speed of nu (log nu )(θ+1)/2 as nu ≥ nl → ∞. [sent-627, score-1.37]
67 Similarly, it follows −1/2 from Corollary 4 that infC |e( fˆC , f ∗ )| = O p (min(n−1 (log nl Jl )3 , nu (log nu Ju )3/2 )) as nu ≥ nl → ∞. [sent-634, score-1.735]
68 l Therefore, the optimal ideal performance of the Bayes rule is recovered by SPSI at fast speed of −1/2 (log nu Ju )3/2 ) as nu ≥ nl → ∞. [sent-635, score-1.035]
69 In contrast to most semi-supervised learning methods assuming various dependencies between the marginal and conditional distributions, the proposed methodology integrates labeled and unlabeled data through regularization to identify such dependencies for enhancing classification. [sent-638, score-0.154]
70 The theoretical and numerical results show that our methodology outperforms SVM and TSVM in situations where unlabeled data provides useful information, and performs no worse when it does not. [sent-639, score-0.183]
71 5 exp(−a11 nu (nuC2 )−1 Ju ), where Jl and Ju are defined in Lemma 5. [sent-658, score-0.335]
72 Define the scaled empirical process as E u (U(gC ) − −1 n ∗ (x )) − U(g(x )) + λ(J(g∗ ) − J(g)) − E(U(g∗ (X )) − U(g(X ))+ U(g)) = nu ∑ j=nl +1 U(gC j j j j C C ∗ ∗ λ(J(gC ) − J(g))) = Eu (U(gC ) −U(g)). [sent-660, score-0.335]
73 Similarly, I2 ≤ 3 exp(−a7 nu (λJ0 )max(1,2−βg ) )/(1 − exp(−a7 nu (λJ0 )max(1,2−βg ) ))2 . [sent-683, score-0.67]
74 Thus I ≤ I1 + I2 ≤ 6 exp(−a7 nu ((nuC2 )−1 J0 )max(1,2−βg ) )/(1 − exp(−a7 nu ((nuC2 )−1 J0 )max(1,2−βg ) ))2 , and I 1/2 ≤ (2. [sent-684, score-0.67]
75 5 + I 1/2 ) exp(−a7 nu ((nuC2 )−1 J0 )max(1,2−βg ) ). [sent-685, score-0.335]
76 5 exp(−a7 nu ˆ ∗ n ((nuC2 )−1 J0 )max(1,2−βg ) ) + 6. [sent-687, score-0.335]
77 By Assumption A and the triangle inequality, |e( f , g∗ )| ≤ ˆC a6 B/C3 ≤ δnu 3 nu C 6 ˆC , g∗ ))αg ≤ a1 (e (gC , g∗ ) + |e ( fˆC , gC )|)αg ≤ a1 (e (gC , g∗ ) + δ2 , implying that ˆ ˆ C ˆ C a1 (eU ( f C nu U U U ∗ ˆ ∗ P |e( fˆC , gC )| ≥ a1 (2δ2u )αg ≤ P(eU (gC , gC ) ≥ δ2u ), ∀C ∈ C . [sent-691, score-0.67]
78 5 exp(− a7 nu ((nuC2 )−1 J0 )max(1,2−βg ) ) + ˆ n 1884 L ARGE M ARGIN S EMI - SUPERVISED L EARNING ∗ ∗ 6. [sent-693, score-0.335]
79 5 exp(−a10 nl ((nl C1 )−1 Jl )max(1,2−β f ) ) + 6. [sent-694, score-0.365]
80 5 exp(−a11 nu ((nuC2 )−1 Ju )max(1,2−βg ) ), where C∗ = ∗ , f ∗ )|. [sent-695, score-0.335]
81 − i=1 2 ˆC , f ∗ ) ≥ a1 δ2 ) ≤ P(infC eL ( fˆC , f ∗ ) ≥ An application of the same treatment yields that P(infC e( f nl ∗ ∗ a1 δ2l ) ≤ 3. [sent-697, score-0.378]
82 5 exp(−a10 nl ((nl C1 )−1 J( f ∗ ))max(1,2−β f ) ) when nl C1 ≥ 2δ−2 max(J( f ∗ ), 1). [sent-698, score-0.73]
83 C1 nl nu i=nl +1 An application of the same argument as in the proof of Theorem 3 yields that for constants a 8 , a9 > 0, ˜ P(eW ( fˆC , gC ; fC , gC ) ≥ δ2 ) is upper bounded by ˆ ∗ ∗ w 3. [sent-705, score-0.713]
84 5 exp(−a9 nu ((nuC2 )−1 Ju )max(1,2−βg ) ), ∗ ∗ ∗ ∗ ˜ ˜ provided that 2Jl ≤ nl C1 δ2l and 2Ju ≤ nuC2 δ2u , where eW ( f , g; fC , gC ) = eL ( f , fC ) + C2 eU (g, gC ), n n C1 C2 nu ˜ 2 ˜ ˜ δ2 = δ2 + δ , Jl = max(Jl ( f ∗ , g∗ ), 1) and Ju = max(Ju ( f ∗ , g∗ ), 1). [sent-707, score-1.035]
85 w nl C1 nl nu C C C C ∗ ∗ ∗ ∗ Without loss of generality, assume min(Jl ( fC , gC ), Ju ( fC , gC )) ≥ 1. [sent-708, score-1.065]
86 5 exp(−a10 nl ((nl C1 )−1 Jl )max(1,2−β f ) ) + 6. [sent-712, score-0.365]
87 −1 ≤ Note that J( fˆC , gC ) ≤ s( fˆ, g) ≤ s(1, 1) ≤ 2C1 nl . [sent-714, score-0.365]
88 Suppose there exist ε-brackets (g^l_m, g^u_m)_{m=1}^{M} for some M such that for any g ∈ F(k) = {g ∈ F : J(g) ≤ k}, g^l_m ≤ g ≤ g^u_m and ‖g^u_m − g^l_m‖_∞ ≤ ε for some 1 ≤ m ≤ M. [sent-736, score-0.09]
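For readability, the bracketing condition can be written as a display, together with the standard metric entropy with bracketing that conditions such as (15) are phrased in terms of (the symbol H_B and the display are standard conventions assumed here, not notation quoted from this text).

```latex
% epsilon-brackets for F(k) = { g in F : J(g) <= k }
\[
  \forall g \in \mathcal{F}(k)\ \ \exists\, m \in \{1,\dots,M\}:\quad
  g^{l}_{m} \le g \le g^{u}_{m}, \qquad \| g^{u}_{m} - g^{l}_{m} \|_{\infty} \le \varepsilon .
\]
% The metric entropy with bracketing is the log of the smallest such M:
\[
  H_{B}(\varepsilon, \mathcal{F}(k)) \;=\; \log \min \{ M : \text{such brackets exist} \}.
\]
```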
89 Solving (15), we obtain εnl = ( loglnl )1/2 when C1 ∼ n J( f ∗ )δ−2 n−1 ∼ log nl and εnu = ( logunu )1/2 when C2 ∼ J0 δ−2 n−1 ∼ log nu . [sent-756, score-0.728]
90 Assumption C is fulfilled nl l nu u n because E(X 2 ) < ∞. [sent-757, score-0.7]
91 In conclusion, we obtain, by Corollary 4, that infC |e( fˆC , f ∗ )| = O p ((n−1 log nu )(θ+1)/2 ). [sent-758, score-0.349]
92 k=0 2σ For (13), note that F is rich for sufficiently large nl in that for any continuous function f , there exists a f˜ ∈ F such that f − f˜ ∞ ≤ ε2l , compare with Steinwart (2001). [sent-761, score-0.365]
93 Similarly, we have εnl = (n−1 (log nl Jl )3 )1/2 when C1 ∼ l J( f ∗ )δ−2 n−1 ∼ (log nl Jl )−3 and εnu = (n−1 (log nu Ju )3 )1/2 when C2 ∼ J0 δ−2 n−1 ∼ (log nu Ju )−3 . [sent-779, score-1.4]
94 Asnl l u nu u sumption C is fulfilled with the Gaussian kernel. [sent-780, score-0.335]
95 Furthermore, KKT’s condition requires that αi (1 − yi ( w f , xi ) − ξi ) = 0, β j ( wg , x j − 1 − ξ j ), γ j ( wg , x j +1+ξ j ) = 0, γi ξi = 0, and η j ξ j = 0. [sent-791, score-0.543]
96 Therefore, if 0 < αi < C1 , then ξi = 0 and 1 − yi ( w f , xi ) = 0, if 0 < β j + γ j < C2 , then ξ j = 0 and wg , x j + 1 = 0 or wg , x j − 1 = 0. [sent-793, score-0.543]
97 C2 ∇U2 (g nl +1 )), · · · , C2 ∇U2 (g n Proof of Theorem 10: The proof is similar to that of Theorem 9, and thus is omitted. [sent-799, score-0.365]
wordName wordTfidf (topN-words)
[('gc', 0.646), ('nl', 0.365), ('nu', 0.335), ('wg', 0.232), ('spsi', 0.221), ('fc', 0.215), ('ssvm', 0.175), ('ge', 0.144), ('tsvm', 0.124), ('shen', 0.098), ('ju', 0.086), ('infc', 0.085), ('unlabeled', 0.079), ('eu', 0.078), ('jl', 0.065), ('ssv', 0.057), ('hen', 0.051), ('emi', 0.051), ('sv', 0.051), ('labeled', 0.05), ('yi', 0.049), ('wang', 0.047), ('arge', 0.047), ('argin', 0.047), ('vh', 0.045), ('wbc', 0.045), ('sign', 0.045), ('grouping', 0.043), ('wh', 0.04), ('svm', 0.04), ('nv', 0.038), ('wl', 0.036), ('cov', 0.036), ('alone', 0.036), ('regularizers', 0.033), ('tuning', 0.031), ('mushroom', 0.031), ('fu', 0.031), ('wong', 0.03), ('xi', 0.03), ('fv', 0.029), ('interplay', 0.028), ('spam', 0.027), ('graphical', 0.026), ('ew', 0.025), ('max', 0.025), ('methodology', 0.025), ('vg', 0.024), ('email', 0.023), ('dc', 0.023), ('svmc', 0.023), ('supervised', 0.022), ('sup', 0.022), ('earning', 0.021), ('el', 0.021), ('wu', 0.021), ('margin', 0.021), ('exp', 0.02), ('var', 0.02), ('bayes', 0.019), ('inf', 0.019), ('kkt', 0.018), ('gl', 0.018), ('gu', 0.018), ('lemma', 0.018), ('fl', 0.017), ('ege', 0.017), ('svml', 0.017), ('ug', 0.017), ('vgc', 0.017), ('ynl', 0.017), ('zhu', 0.016), ('decompositions', 0.016), ('labeling', 0.016), ('minimizer', 0.016), ('arg', 0.016), ('gaussian', 0.016), ('gm', 0.015), ('sn', 0.015), ('assumptions', 0.015), ('entropy', 0.015), ('simulated', 0.015), ('regret', 0.015), ('fm', 0.014), ('instances', 0.014), ('minimization', 0.014), ('parenthesis', 0.014), ('bracket', 0.014), ('ugc', 0.014), ('log', 0.014), ('decision', 0.014), ('estimated', 0.014), ('min', 0.014), ('yields', 0.013), ('benchmark', 0.013), ('perturbation', 0.013), ('tu', 0.013), ('delivering', 0.013), ('adams', 0.013), ('inequality', 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 44 jmlr-2007-Large Margin Semi-supervised Learning
Author: Junhui Wang, Xiaotong Shen
Abstract: In classification, semi-supervised learning occurs when a large amount of unlabeled data is available with only a small number of labeled data. In such a situation, how to enhance predictability of classification through unlabeled data is the focus. In this article, we introduce a novel large margin semi-supervised learning methodology, using grouping information from unlabeled data, together with the concept of margins, in a form of regularization controlling the interplay between labeled and unlabeled data. Based on this methodology, we develop two specific machines involving support vector machines and ψ-learning, denoted as SSVM and SPSI, through difference convex programming. In addition, we estimate the generalization error using both labeled and unlabeled data, for tuning regularizers. Finally, our theoretical and numerical analyses indicate that the proposed methodology achieves the desired objective of delivering high performance in generalization, particularly against some strong performers. Keywords: generalization, grouping, sequential quadratic programming, support vectors
2 0.065089121 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
Author: Philippe Rigollet
Abstract: We consider semi-supervised classification when part of the available data is unlabeled. These unlabeled data can be useful for the classification problem when we make an assumption relating the behavior of the regression function to that of the marginal distribution. Seeger (2000) proposed the well-known cluster assumption as a reasonable one. We propose a mathematical formulation of this assumption and a method based on density level sets estimation that takes advantage of it to achieve fast rates of convergence both in the number of unlabeled examples and the number of labeled examples. Keywords: semi-supervised learning, statistical learning theory, classification, cluster assumption, generalization bounds
3 0.043370668 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
Author: Yiming Ying, Ding-Xuan Zhou
Abstract: Gaussian kernels with flexible variances provide a rich family of Mercer kernels for learning algorithms. We show that the union of the unit balls of reproducing kernel Hilbert spaces generated by Gaussian kernels with flexible variances is a uniform Glivenko-Cantelli (uGC) class. This result confirms a conjecture concerning learnability of Gaussian kernels and verifies the uniform convergence of many learning algorithms involving Gaussians with changing variances. Rademacher averages and empirical covering numbers are used to estimate sample errors of multi-kernel regularization schemes associated with general loss functions. It is then shown that the regularization error associated with the least square loss and the Gaussian kernels can be greatly improved when flexible variances are allowed. Finally, for regularization schemes generated by Gaussian kernels with flexible variances we present explicit learning rates for regression with least square loss and classification with hinge loss. Keywords: Gaussian kernel, flexible variances, learning theory, Glivenko-Cantelli class, regularization scheme, empirical covering number
4 0.038739797 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
Author: Yann Guermeur
Abstract: In the context of discriminant analysis, Vapnik’s statistical learning theory has mainly been developed in three directions: the computation of dichotomies with binary-valued functions, the computation of dichotomies with real-valued functions, and the computation of polytomies with functions taking their values in finite sets, typically the set of categories itself. The case of classes of vectorvalued functions used to compute polytomies has seldom been considered independently, which is unsatisfactory, for three main reasons. First, this case encompasses the other ones. Second, it cannot be treated appropriately through a na¨ve extension of the results devoted to the computation of ı dichotomies. Third, most of the classification problems met in practice involve multiple categories. In this paper, a VC theory of large margin multi-category classifiers is introduced. Central in this theory are generalized VC dimensions called the γ-Ψ-dimensions. First, a uniform convergence bound on the risk of the classifiers of interest is derived. The capacity measure involved in this bound is a covering number. This covering number can be upper bounded in terms of the γ-Ψdimensions thanks to generalizations of Sauer’s lemma, as is illustrated in the specific case of the scale-sensitive Natarajan dimension. A bound on this latter dimension is then computed for the class of functions on which multi-class SVMs are based. This makes it possible to apply the structural risk minimization inductive principle to those machines. Keywords: multi-class discriminant analysis, large margin classifiers, uniform strong laws of large numbers, generalized VC dimensions, multi-class SVMs, structural risk minimization inductive principle, model selection
5 0.032362182 90 jmlr-2007-Value Regularization and Fenchel Duality
Author: Ryan M. Rifkin, Ross A. Lippert
Abstract: Regularization is an approach to function learning that balances fit and smoothness. In practice, we search for a function f with a finite representation f = ∑i ci φi (·). In most treatments, the ci are the primary objects of study. We consider value regularization, constructing optimization problems in which the predicted values at the training points are the primary variables, and therefore the central objects of study. Although this is a simple change, it has profound consequences. From convex conjugacy and the theory of Fenchel duality, we derive separate optimality conditions for the regularization and loss portions of the learning problem; this technique yields clean and short derivations of standard algorithms. This framework is ideally suited to studying many other phenomena at the intersection of learning theory and optimization. We obtain a value-based variant of the representer theorem, which underscores the transductive nature of regularization in reproducing kernel Hilbert spaces. We unify and extend previous results on learning kernel functions, with very simple proofs. We analyze the use of unregularized bias terms in optimization problems, and low-rank approximations to kernel matrices, obtaining new results in these areas. In summary, the combination of value regularization and Fenchel duality are valuable tools for studying the optimization problems in machine learning. Keywords: kernel machines, duality, optimization, convex analysis, kernel learning
6 0.030270562 9 jmlr-2007-AdaBoost is Consistent
7 0.029054226 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
8 0.025756218 75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results
9 0.025176359 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
10 0.02272981 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression
11 0.022611393 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction inl1Using Cauchy Random Projections
12 0.022161664 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
13 0.021830956 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition
14 0.021611052 15 jmlr-2007-Bilinear Discriminant Component Analysis
15 0.020373346 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
16 0.020029962 52 jmlr-2007-Margin Trees for High-dimensional Classification
17 0.01935917 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
18 0.018559558 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression
20 0.017505543 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss
topicId topicWeight
[(0, 0.12), (1, -0.037), (2, 0.042), (3, 0.027), (4, 0.06), (5, 0.001), (6, -0.022), (7, 0.013), (8, 0.036), (9, -0.011), (10, 0.039), (11, -0.013), (12, 0.011), (13, -0.001), (14, 0.09), (15, -0.002), (16, 0.082), (17, 0.017), (18, -0.088), (19, 0.009), (20, 0.087), (21, 0.241), (22, -0.006), (23, -0.008), (24, 0.028), (25, -0.079), (26, -0.279), (27, -0.099), (28, 0.144), (29, -0.277), (30, 0.205), (31, 0.332), (32, -0.023), (33, 0.106), (34, -0.08), (35, -0.043), (36, 0.256), (37, 0.114), (38, -0.054), (39, -0.041), (40, 0.126), (41, -0.222), (42, 0.054), (43, -0.1), (44, 0.321), (45, -0.143), (46, 0.098), (47, 0.145), (48, 0.025), (49, 0.111)]
simIndex simValue paperId paperTitle
same-paper 1 0.96796775 44 jmlr-2007-Large Margin Semi-supervised Learning
Author: Junhui Wang, Xiaotong Shen
Abstract: In classification, semi-supervised learning occurs when a large amount of unlabeled data is available with only a small number of labeled data. In such a situation, how to enhance predictability of classification through unlabeled data is the focus. In this article, we introduce a novel large margin semi-supervised learning methodology, using grouping information from unlabeled data, together with the concept of margins, in a form of regularization controlling the interplay between labeled and unlabeled data. Based on this methodology, we develop two specific machines involving support vector machines and ψ-learning, denoted as SSVM and SPSI, through difference convex programming. In addition, we estimate the generalization error using both labeled and unlabeled data, for tuning regularizers. Finally, our theoretical and numerical analyses indicate that the proposed methodology achieves the desired objective of delivering high performance in generalization, particularly against some strong performers. Keywords: generalization, grouping, sequential quadratic programming, support vectors
2 0.34148538 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
Author: Philippe Rigollet
Abstract: We consider semi-supervised classification when part of the available data is unlabeled. These unlabeled data can be useful for the classification problem when we make an assumption relating the behavior of the regression function to that of the marginal distribution. Seeger (2000) proposed the well-known cluster assumption as a reasonable one. We propose a mathematical formulation of this assumption and a method based on density level sets estimation that takes advantage of it to achieve fast rates of convergence both in the number of unlabeled examples and the number of labeled examples. Keywords: semi-supervised learning, statistical learning theory, classification, cluster assumption, generalization bounds
3 0.20246175 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
Author: Yann Guermeur
Abstract: In the context of discriminant analysis, Vapnik’s statistical learning theory has mainly been developed in three directions: the computation of dichotomies with binary-valued functions, the computation of dichotomies with real-valued functions, and the computation of polytomies with functions taking their values in finite sets, typically the set of categories itself. The case of classes of vectorvalued functions used to compute polytomies has seldom been considered independently, which is unsatisfactory, for three main reasons. First, this case encompasses the other ones. Second, it cannot be treated appropriately through a na¨ve extension of the results devoted to the computation of ı dichotomies. Third, most of the classification problems met in practice involve multiple categories. In this paper, a VC theory of large margin multi-category classifiers is introduced. Central in this theory are generalized VC dimensions called the γ-Ψ-dimensions. First, a uniform convergence bound on the risk of the classifiers of interest is derived. The capacity measure involved in this bound is a covering number. This covering number can be upper bounded in terms of the γ-Ψdimensions thanks to generalizations of Sauer’s lemma, as is illustrated in the specific case of the scale-sensitive Natarajan dimension. A bound on this latter dimension is then computed for the class of functions on which multi-class SVMs are based. This makes it possible to apply the structural risk minimization inductive principle to those machines. Keywords: multi-class discriminant analysis, large margin classifiers, uniform strong laws of large numbers, generalized VC dimensions, multi-class SVMs, structural risk minimization inductive principle, model selection
4 0.15721275 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
Author: Yiming Ying, Ding-Xuan Zhou
Abstract: Gaussian kernels with flexible variances provide a rich family of Mercer kernels for learning algorithms. We show that the union of the unit balls of reproducing kernel Hilbert spaces generated by Gaussian kernels with flexible variances is a uniform Glivenko-Cantelli (uGC) class. This result confirms a conjecture concerning learnability of Gaussian kernels and verifies the uniform convergence of many learning algorithms involving Gaussians with changing variances. Rademacher averages and empirical covering numbers are used to estimate sample errors of multi-kernel regularization schemes associated with general loss functions. It is then shown that the regularization error associated with the least square loss and the Gaussian kernels can be greatly improved when flexible variances are allowed. Finally, for regularization schemes generated by Gaussian kernels with flexible variances we present explicit learning rates for regression with least square loss and classification with hinge loss. Keywords: Gaussian kernel, flexible variances, learning theory, Glivenko-Cantelli class, regularization scheme, empirical covering number
5 0.15710595 15 jmlr-2007-Bilinear Discriminant Component Analysis
Author: Mads Dyrholm, Christoforos Christoforou, Lucas C. Parra
Abstract: Factor analysis and discriminant analysis are often used as complementary approaches to identify linear components in two dimensional data arrays. For three dimensional arrays, which may organize data in dimensions such as space, time, and trials, the opportunity arises to combine these two approaches. A new method, Bilinear Discriminant Component Analysis (BDCA), is derived and demonstrated in the context of functional brain imaging data for which it seems ideally suited. The work suggests to identify a subspace projection which optimally separates classes while ensuring that each dimension in this space captures an independent contribution to the discrimination. Keywords: bilinear, decomposition, component, classification, regularization
6 0.15524061 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
7 0.13791285 75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results
8 0.13024652 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
9 0.12759551 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels
10 0.11972343 90 jmlr-2007-Value Regularization and Fenchel Duality
11 0.11493137 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression
12 0.1139985 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
14 0.10542407 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models
15 0.10159592 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation
16 0.098840185 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction inl1Using Cauchy Random Projections
17 0.091281563 9 jmlr-2007-AdaBoost is Consistent
18 0.090993464 40 jmlr-2007-Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization
19 0.090878136 70 jmlr-2007-Ranking the Best Instances
20 0.087469742 77 jmlr-2007-Stagewise Lasso
topicId topicWeight
[(8, 0.017), (10, 0.023), (12, 0.033), (15, 0.044), (19, 0.429), (28, 0.051), (40, 0.036), (45, 0.016), (48, 0.034), (60, 0.043), (80, 0.015), (85, 0.043), (98, 0.083)]
simIndex simValue paperId paperTitle
same-paper 1 0.74560106 44 jmlr-2007-Large Margin Semi-supervised Learning
Author: Junhui Wang, Xiaotong Shen
Abstract: In classification, semi-supervised learning occurs when a large amount of unlabeled data is available with only a small number of labeled data. In such a situation, how to enhance predictability of classification through unlabeled data is the focus. In this article, we introduce a novel large margin semi-supervised learning methodology, using grouping information from unlabeled data, together with the concept of margins, in a form of regularization controlling the interplay between labeled and unlabeled data. Based on this methodology, we develop two specific machines involving support vector machines and ψ-learning, denoted as SSVM and SPSI, through difference convex programming. In addition, we estimate the generalization error using both labeled and unlabeled data, for tuning regularizers. Finally, our theoretical and numerical analyses indicate that the proposed methodology achieves the desired objective of delivering high performance in generalization, particularly against some strong performers. Keywords: generalization, grouping, sequential quadratic programming, support vectors
2 0.28546464 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
Author: Roland Nilsson, José M. Peña, Johan Björkegren, Jesper Tegnér
Abstract: We analyze two different feature selection problems: finding a minimal feature set optimal for classification (MINIMAL - OPTIMAL) vs. finding all features relevant to the target variable (ALL RELEVANT ). The latter problem is motivated by recent applications within bioinformatics, particularly gene expression analysis. For both problems, we identify classes of data distributions for which there exist consistent, polynomial-time algorithms. We also prove that ALL - RELEVANT is much harder than MINIMAL - OPTIMAL and propose two consistent, polynomial-time algorithms. We argue that the distribution classes considered are reasonable in many practical cases, so that our results simplify feature selection in a wide range of machine learning tasks. Keywords: learning theory, relevance, classification, Markov blanket, bioinformatics
3 0.28452134 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation
Author: Guy Lebanon, Yi Mao, Joshua Dillon
Abstract: The popular bag of words assumption represents a document as a histogram of word occurrences. While computationally efficient, such a representation is unable to maintain any sequential information. We present an effective sequential document representation that goes beyond the bag of words representation and its n-gram extensions. This representation uses local smoothing to embed documents as smooth curves in the multinomial simplex thereby preserving valuable sequential information. In contrast to bag of words or n-grams, the new representation is able to robustly capture medium and long range sequential trends in the document. We discuss the representation and its geometric properties and demonstrate its applicability for various text processing tasks. Keywords: text processing, local smoothing
4 0.27893621 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
Author: Rie Johnson, Tong Zhang
Abstract: This paper investigates the effect of Laplacian normalization in graph-based semi-supervised learning. To this end, we consider multi-class transductive learning on graphs with Laplacian regularization. Generalization bounds are derived using geometric properties of the graph. Specifically, by introducing a definition of graph cut from learning theory, we obtain generalization bounds that depend on the Laplacian regularizer. We then use this analysis to better understand the role of graph Laplacian matrix normalization. Under assumptions that the cut is small, we derive near-optimal normalization factors by approximately minimizing the generalization bounds. The analysis reveals the limitations of the standard degree-based normalization method in that the resulting normalization factors can vary significantly within each connected component with the same class label, which may cause inferior generalization performance. Our theory also suggests a remedy that does not suffer from this problem. Experiments confirm the superiority of the normalization scheme motivated by learning theory on artificial and real-world data sets. Keywords: transductive learning, graph learning, Laplacian regularization, normalization of graph Laplacian
Author: Miroslav Dudík, Steven J. Phillips, Robert E. Schapire
Abstract: We present a unified and complete account of maximum entropy density estimation subject to constraints represented by convex potential functions or, alternatively, by convex regularization. We provide fully general performance guarantees and an algorithm with a complete convergence proof. As special cases, we easily derive performance guarantees for many known regularization types, including 1 , 2 , 2 , and 1 + 2 style regularization. We propose an algorithm solving a large and 2 2 general subclass of generalized maximum entropy problems, including all discussed in the paper, and prove its convergence. Our approach generalizes and unifies techniques based on information geometry and Bregman divergences as well as those based more directly on compactness. Our work is motivated by a novel application of maximum entropy to species distribution modeling, an important problem in conservation biology and ecology. In a set of experiments on real-world data, we demonstrate the utility of maximum entropy in this setting. We explore effects of different feature types, sample sizes, and regularization levels on the performance of maxent, and discuss interpretability of the resulting models. Keywords: maximum entropy, density estimation, regularization, iterative scaling, species distribution modeling
6 0.27809918 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data
7 0.27668053 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
8 0.27653664 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
9 0.27633268 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
10 0.27526575 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition
11 0.27468562 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression
12 0.27198574 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
13 0.27164006 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
14 0.2715205 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
15 0.27072996 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm
16 0.2689476 88 jmlr-2007-Unlabeled Compression Schemes for Maximum Classes
17 0.26877803 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification
18 0.2686756 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study
19 0.26862198 39 jmlr-2007-Handling Missing Values when Applying Classification Models
20 0.26760116 55 jmlr-2007-Minimax Regret Classifier for Imprecise Class Distributions