jmlr jmlr2010 jmlr2010-103 knowledge-graph by maker-knowledge-mining

103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions


Source: pdf

Author: Shiliang Sun, John Shawe-Taylor

Abstract: In this paper, we propose a general framework for sparse semi-supervised learning, which concerns using a small portion of unlabeled data and a few labeled data to represent target functions and thus has the merit of accelerating function evaluations when predicting the output of a new example. This framework makes use of Fenchel-Legendre conjugates to rewrite a convex insensitive loss involving a regularization with unlabeled data, and is applicable to a family of semi-supervised learning methods such as multi-view co-regularized least squares and single-view Laplacian support vector machines (SVMs). As an instantiation of this framework, we propose sparse multi-view SVMs which use a squared ε-insensitive loss. The resultant optimization is an inf-sup problem and the optimal solutions have arguably saddle-point properties. We present a globally optimal iterative algorithm to optimize the problem. We give the margin bound on the generalization error of the sparse multi-view SVMs, and derive the empirical Rademacher complexity for the induced function class. Experiments on artificial and real-world data show their effectiveness. We further give a sequential training approach to show their possibility and potential for uses in large-scale problems and provide encouraging experimental results indicating the efficacy of the margin bound and empirical Rademacher complexity on characterizing the roles of unlabeled data for semi-supervised learning. Keywords: semi-supervised learning, Fenchel-Legendre conjugate, representer theorem, multiview regularization, support vector machine, statistical learning theory

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Introduction Semi-supervised learning, which considers how to estimate a target function from a few labeled examples and a large quantity of unlabeled examples, is one of the currently active research directions. [sent-16, score-0.579]

2 If the unlabeled data are properly used, a semi-supervised method can achieve superior performance over its counterpart supervised learning approaches. [sent-17, score-0.535]

3 Generally, in the objective function J( f ) of these semi-supervised learning methods, labeled examples are used to calculate an empirical loss of the target function and simultaneously unlabeled examples are used for some regularization purpose. [sent-40, score-0.717]

4 By the representer theorem, the target function would involve kernel evaluations on all the labeled and unlabeled examples. [sent-41, score-0.623]

5 This is computationally undesirable, because for semi-supervised learning usually a considerably large number of unlabeled examples are available. [sent-42, score-0.515]

6 Consequently, sparsity in the number of unlabeled data used to represent target functions is crucial, which constitutes the focus of this paper. [sent-43, score-0.537]

7 As far as multi-view learning is concerned there has been work that introduces sparsity of the unlabeled data into the representation of the classifiers (Szedmak and Shawe-Taylor, 2007). [sent-48, score-0.537]

8 Unfortunately the resulting optimization is somewhat unmanageable and only scales to small-scale data sets despite interesting theoretical bounds that show the improvement gained using the unlabeled data. [sent-52, score-0.521]

9 To compute the value of this function on new data would still require a non-sparse dual representation in terms of the unlabeled data. [sent-55, score-0.535]

10 However, we show that through optimizing weights of the unlabeled data the solution of the l2 problem converges to the solution of an ε-insensitive problem ensuring that we subsequently obtain sparsity in the unlabeled data. [sent-56, score-1.027]

11 To show the application of Fenchel-Legendre conjugates, in Section 2 we propose a novel sparse semi-supervised learning approach: sparse multi-view SVMs, where the conjugate functions play a central role in reformulating the optimization problem. [sent-58, score-0.267]

12 The dual optimization of a subroutine of the sparse multi-view SVMs is converted to a quadratic programming problem in Section 3 whose scale only depends on the number of labeled examples, indicating the advantages of using conjugate functions. [sent-59, score-0.31]

13 Suppose we have a set of ℓ labeled examples $\{(x_i, y_i)\}_{i=1}^{\ell}$ with $y_i \in \{1, -1\}$, and a set of u unlabeled examples $\{x_i\}_{i=\ell+1}^{\ell+u}$. [sent-81, score-0.83]

14 From the above equation, $w_1^*$ is a linear combination of the labeled examples with $\lambda_i > 0$ and of those unlabeled examples on which the difference of predictions from the two views exceeds ε. [sent-109, score-0.697]

15 Therefore, function f (x) is sparse in the number of used unlabeled examples. [sent-112, score-0.556]

16 Thus, function $f_\varepsilon(t_i)$ is convex for $t_i \in (\varepsilon^2, +\infty)$. [sent-124, score-0.415]

17 Moreover, the value of function $f_\varepsilon(t_i)$ for $t_i \in (\varepsilon^2, +\infty)$ is larger than 0, which is the value of $f_\varepsilon(t_i)$ for $t_i \in [0, \varepsilon^2]$, and function $f_\varepsilon(t_i)$ with domain $[0, +\infty)$ is continuous at $\varepsilon^2$. [sent-125, score-0.36]

18 The Fenchel-Legendre conjugate (which is also often called the convex conjugate or conjugate function) $f_\varepsilon^*(z)$ is $f_\varepsilon^*(z) = \sup_{t \in \operatorname{dom} f_\varepsilon} (t^\top z - f_\varepsilon(t)) = \sup_t \sum_{i=1}^{\ell+u} [z_i t_i - (\sqrt{t_i} - \varepsilon)_+^2] = \sum_{i=1}^{\ell+u} \sup_{t_i} [z_i t_i - (\sqrt{t_i} - \varepsilon)_+^2]$. [sent-132, score-0.981]

19 The domain of the conjugate function consists of $z \in \mathbb{R}^{\ell+u}$ for which the supremum is finite (i.e. [sent-133, score-0.335]

20 Theorem 3 Function $f_\varepsilon^*(z_i)$ defined by (4) has the following form: $f_\varepsilon^*(z_i) = \frac{z_i \varepsilon^2}{1 - z_i}$ for $0 < z_i < 1$, and $f_\varepsilon^*(z_i) = 0$ for $z_i \le 0$. (5) [sent-141, score-0.6]
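
The closed form in Theorem 3 can be sanity-checked numerically. The sketch below (Python/NumPy, a minimal illustration rather than the authors' code) compares the stated expression with a brute-force supremum of $z t - f_\varepsilon(t)$ over a grid of t values; the grid bound and the tested z values are arbitrary choices.

```python
import numpy as np

def f_eps(t, eps):
    """Squared eps-insensitive loss f_eps(t) = (sqrt(t) - eps)_+^2 for t >= 0."""
    return np.maximum(np.sqrt(t) - eps, 0.0) ** 2

def conjugate_closed_form(z, eps):
    """Closed form of the Fenchel-Legendre conjugate from Theorem 3."""
    if z <= 0:
        return 0.0
    if z < 1:
        return z * eps ** 2 / (1.0 - z)
    return np.inf  # z >= 1 lies outside the domain of the conjugate

eps = 0.3
t_grid = np.linspace(0.0, 50.0, 200001)  # brute-force search grid for the sup over t
for z in (-0.5, 0.0, 0.2, 0.6, 0.9):
    brute = np.max(z * t_grid - f_eps(t_grid, eps))
    print(f"z = {z:+.1f}:  brute-force sup = {brute:.4f},  closed form = {conjugate_closed_form(z, eps):.4f}")
```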

21 Proof By definition, we have $f_\varepsilon^*(z_i) = \max\left\{\sup_{0 \le t_i \le \varepsilon^2} z_i t_i,\ \sup_{t_i > \varepsilon^2} [z_i t_i - (\sqrt{t_i} - \varepsilon)^2]\right\}$. (6) [sent-142, score-0.882]

22 The value of the function $\sup_{0 \le t_i \le \varepsilon^2} z_i t_i$ is simple to characterize. [sent-143, score-0.448]

23 We now characterize the second term: $\sup_{t_i > \varepsilon^2} [z_i t_i - (\sqrt{t_i} - \varepsilon)^2] = \sup_{t_i > \varepsilon^2} (z_i t_i - t_i - \varepsilon^2 + 2\varepsilon\sqrt{t_i})$. [sent-144, score-0.588]

24 For zi ≤ 0 or zi ≥ 1 the derivative does not exist and thus we use function values at end points to find the supremum. [sent-146, score-0.4]

25 As a result, we have $\sup_{t_i > \varepsilon^2} [z_i t_i - (\sqrt{t_i} - \varepsilon)^2] = \frac{z_i \varepsilon^2}{1 - z_i}$ for $0 < z_i < 1$, $= z_i \varepsilon^2$ for $z_i \le 0$, and $= +\infty$ for $z_i \ge 1$. [sent-147, score-1.487]

26 That is, $\sum_{i=1}^{\ell+u} (\sqrt{t_i} - \varepsilon)_+^2 = \sup_z (z^\top t - f_\varepsilon^*(z)) = \sup_z \sum_{i=1}^{\ell+u} [z_i t_i - f_\varepsilon^*(z_i)]$. (7) [sent-157, score-0.307]

27 By (7), we have $\sum_{i=1}^{\ell+u} (|f_1(x_i) - f_2(x_i)| - \varepsilon)_+^2 = \sum_{i=1}^{\ell+u} (\sqrt{t_i} - \varepsilon)_+^2 = \sup_z \sum_{i=1}^{\ell+u} [z_i t_i - f_\varepsilon^*(z_i)]$. [sent-158, score-0.307]

28 3 Saddle-Point Property $\gamma_v [(K_1\alpha_1 - K_2\alpha_2)^\top U (K_1\alpha_1 - K_2\alpha_2) - \sum_{i=1}^{\ell+u} f_\varepsilon^*(z_i)]$, subject to $y_i(\sum_{j=1}^{\ell+u} \alpha_1^j k_1(x_j, x_i) + b_1) \ge 1 - \xi_i^1$, $y_i(\sum_{j=1}^{\ell+u} \alpha_2^j k_2(x_j, x_i) + b_2) \ge 1 - \xi_i^2$, $\xi_i^1, \xi_i^2 \ge 0$, $i = 1, \ldots, \ell$. [sent-182, score-0.374]

29 As $f_\varepsilon^*(z_i)$ is convex, $z_i [f_1(x_i) - f_2(x_i)]^2 - f_\varepsilon^*(z_i)$ is concave with respect to $z_i$. [sent-204, score-0.4]

30 We can simply denote the above optimization problem as $\inf_\theta \sup_z f(\theta, z)$ (11), associated with constraints on the labeled examples, where $f(\theta, z)$ is convex with respect to $\theta$ and concave with respect to $z$. [sent-208, score-0.388]

31 Theorem 7 (Boyd and Vandenberghe, 2004) If $f(\theta, z)$ with domain $\Theta$ and $Z$ is convex with respect to $\theta \in \Theta$ and concave with respect to $z \in Z$, then under some mild assumptions the following equality holds: $\inf_\theta \sup_z f(\theta, z) = \sup_z \inf_\theta f(\theta, z)$. [sent-210, score-0.483]

32 The conjugate of $p^*(z)$ is given by $p^{**}(u) = \sup_z (u^\top z - p^*(z)) = \sup_z (u^\top z + \inf_\theta f(\theta, z)) = \sup_z \inf_\theta [f(\theta, z) + u^\top z]$. [sent-218, score-0.294]

33 Theorem 8 If the following equality holds for function $f(\theta, z)$: $\inf_\theta \sup_z f(\theta, z) = \sup_z \inf_\theta f(\theta, z) = f(\tilde\theta, \tilde z)$, then the optimal pair $(\tilde\theta, \tilde z)$ forms a saddle point. [sent-222, score-0.38]

34 Proof From the given equality, we have $\inf_\theta \sup_z f(\theta, z) = \sup_z f(\tilde\theta, z) = f(\tilde\theta, \tilde z)$ and $\sup_z \inf_\theta f(\theta, z) = \inf_\theta f(\theta, \tilde z) = f(\tilde\theta, \tilde z)$. Therefore, $f(\tilde\theta, z) \le f(\tilde\theta, \tilde z) \le f(\theta, \tilde z)$, which indeed satisfies the definition of a saddle point. [sent-223, score-0.57]
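
As a quick numerical illustration of Theorems 7 and 8 (on a toy function, not the paper's objective), the sketch below evaluates a convex-concave $f(\theta, z)$ on a grid and checks that the inf-sup and sup-inf values coincide, with the common value attained at the saddle point $(0, 0)$.

```python
import numpy as np

def f(theta, z):
    """Toy convex-concave function: convex in theta, concave in z."""
    return theta ** 2 + 2.0 * theta * z - z ** 2

thetas = np.linspace(-3.0, 3.0, 601)
zs = np.linspace(-3.0, 3.0, 601)
T, Z = np.meshgrid(thetas, zs, indexing="ij")  # rows index theta, columns index z
F = f(T, Z)

inf_sup = F.max(axis=1).min()  # inf over theta of sup over z
sup_inf = F.min(axis=0).max()  # sup over z of inf over theta
print(inf_sup, sup_inf)        # both approximately 0, attained at (theta, z) = (0, 0)
```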

35 Theorem 10 $\sup_{z_i \in \operatorname{dom} f_\varepsilon^*} [z_i t_i - f_\varepsilon^*(z_i)] = (\sqrt{t_i} - \varepsilon)_+^2$, and $\arg\sup_{z_i \in \operatorname{dom} f_\varepsilon^*} [z_i t_i - f_\varepsilon^*(z_i)] = 1 - \frac{\varepsilon}{\sqrt{t_i}}$ for $t_i > \varepsilon^2$, and $= 0$ for $0 \le t_i \le \varepsilon^2$. [sent-241, score-1.194]

36 Proof We have $\sup_{z_i \in \operatorname{dom} f_\varepsilon^*} [z_i t_i - f_\varepsilon^*(z_i)] = \max\left\{\sup_{0 \le z_i < 1} \left(z_i t_i - \frac{z_i \varepsilon^2}{1 - z_i}\right),\ \sup_{z_i \le 0} z_i t_i\right\}$. [sent-243, score-1.517]

37 The first supremum can be solved by setting the derivative with respect to $z_i$ to zero. [sent-244, score-0.651]

38 We have $\sup_{0 \le z_i < 1} \left(z_i t_i - \frac{z_i \varepsilon^2}{1 - z_i}\right) = (\sqrt{t_i} - \varepsilon)^2$ where $t_i > \varepsilon^2$, and the supremum is attained with $z_i = 1 - \frac{\varepsilon}{\sqrt{t_i}}$. [sent-245, score-1.406]

39 When $0 \le t_i \le \varepsilon^2$, $\sup_{z_i \le 0} z_i t_i = 0$ with the supremum attained at $z_i = 0$. [sent-247, score-0.885]

40 Therefore, $\max_{z_i}\left\{\sup_{0 \le z_i < 1}\left(z_i t_i - \frac{z_i \varepsilon^2}{1 - z_i}\right),\ \sup_{z_i \le 0} z_i t_i\right\} = (\sqrt{t_i} - \varepsilon)_+^2$ with the supremum attained when $z_i \in [0, 1)$, which completes the proof. [sent-248, score-1.004]
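
Theorem 10 can likewise be verified with a small script; the snippet below (illustrative only) compares its closed-form supremum and maximizer against a brute-force search over z in the domain of $f_\varepsilon^*$.

```python
import numpy as np

def conj(z, eps):
    """Conjugate f_eps^*(z) on its domain: z*eps^2/(1-z) for 0 < z < 1, and 0 for z <= 0."""
    return np.where(z > 0, z * eps ** 2 / (1.0 - z), 0.0)

eps = 0.3
z_grid = np.linspace(-2.0, 0.999, 400001)  # the domain of f_eps^*, restricted to z < 1
for t in (0.05, 0.09, 0.5, 4.0):
    vals = z_grid * t - conj(z_grid, eps)
    closed_sup = max(np.sqrt(t) - eps, 0.0) ** 2
    closed_arg = 1.0 - eps / np.sqrt(t) if t > eps ** 2 else 0.0
    print(f"t = {t}: brute sup = {vals.max():.4f}, closed sup = {closed_sup:.4f}, "
          f"brute argmax = {z_grid[np.argmax(vals)]:.4f}, closed argmax = {closed_arg:.4f}")
```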

41 For sparsity pursuit, during each iteration we remove those unlabeled examples whose corresponding zi ’s are zero. [sent-249, score-0.762]

42 When there are no unlabeled examples eligible for elimination, the iteration will terminate and the convergence point $(\hat\theta, \hat z)$ is reached. [sent-252, score-0.515]
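
A minimal sketch of this elimination step is given below, assuming the two views' predictions on the current unlabeled pool are already available. The weights are the closed-form maximizers from Theorem 10, and examples with zero weight (where the views differ by at most ε) are dropped; function and variable names here are illustrative, not the authors'.

```python
import numpy as np

def update_z_and_prune(f1_vals, f2_vals, eps):
    """One sparsity-pursuit step: weight each unlabeled example by the maximizer of
    Theorem 10, z_i = (1 - eps/sqrt(t_i))_+ with t_i = (f1(x_i) - f2(x_i))^2, and
    keep only examples with a nonzero weight."""
    t = (f1_vals - f2_vals) ** 2
    # guard the sqrt so the unused branch never divides by zero
    z = np.where(t > eps ** 2, 1.0 - eps / np.sqrt(np.maximum(t, eps ** 2)), 0.0)
    keep = z > 0
    return z, keep

# Hypothetical view predictions on five unlabeled examples.
f1 = np.array([0.9, 0.2, -0.4, 0.05, 1.3])
f2 = np.array([0.8, -0.5, -0.1, 0.00, 0.2])
z, keep = update_z_and_prune(f1, f2, eps=0.2)
print(z)     # zero entries mark examples to eliminate
print(keep)  # boolean mask of retained unlabeled examples
```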

43 1 Lagrange Dual Function $\gamma_v (K_1\alpha_1 - K_2\alpha_2)^\top U (K_1\alpha_1 - K_2\alpha_2)$, subject to $y_i(\sum_{j=1}^{\ell+u} \alpha_1^j k_1(x_j, x_i) + b_1) \ge 1 - \xi_i^1$, $y_i(\sum_{j=1}^{\ell+u} \alpha_2^j k_2(x_j, x_i) + b_2) \ge 1 - \xi_i^2$, $\xi_i^1, \xi_i^2 \ge 0$, $i = 1, \ldots, \ell$. [sent-259, score-0.374]

44 Therefore, we can resolve the bias terms by averaging $y_i(\sum_{j=1}^{\ell+u} \alpha_1^j k_1(x_j, x_i) + b_1) - 1 = 0$ and $y_i(\sum_{j=1}^{\ell+u} \alpha_2^j k_2(x_j, x_i) + b_2) - 1 = 0$ over all support vectors. [sent-334, score-0.374]
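
A generic single-view sketch of this averaging is shown below (the multi-view case applies it to each view separately); K, alpha, and sv_idx are assumed to be available from a trained model, and the expression follows from $y_i \in \{1, -1\}$, so that an active constraint gives $b = y_i - \sum_j \alpha_j k(x_j, x_i)$.

```python
import numpy as np

def bias_from_support_vectors(K, alpha, y, sv_idx):
    """Average y_i - sum_j alpha_j k(x_j, x_i) over the support vectors.
    K: (n, n) Gram matrix with K[j, i] = k(x_j, x_i); alpha: (n,) expansion
    coefficients; y: labels of the labeled examples, indexed consistently with
    sv_idx, the indices of support vectors where the margin constraint is active."""
    g_sv = K[:, sv_idx].T @ alpha  # predictions without the bias at the support vectors
    return float(np.mean(y[sv_idx] - g_sv))
```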

45 With $y_i \in \{1, -1\}$, we have $\hat R_\ell(F) = E_\sigma[\sup_{\tilde f \in F} |\frac{2}{\ell}\sum_{i=1}^{\ell} \sigma_i \tilde f(x_i, y_i)|] = E_\sigma[\sup_{g \in G} |\frac{2}{\ell}\sum_{i=1}^{\ell} \sigma_i y_i g(x_i)|] = E_\sigma[\sup_{g \in G} |\frac{2}{\ell}\sum_{i=1}^{\ell} \sigma_i g(x_i)|] = \hat R_\ell(G)$. [sent-427, score-0.339]

46 (32) Now, we see that for iterative optimization of sparse multi-view SVMs, the empirical Rademacher complexity $\hat R_\ell(G)$ with U 2 = tr(S) only depends on the ℓ labeled examples and the chosen kernel functions. [sent-492, score-0.295]
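
For intuition, the empirical Rademacher complexity of a kernel-norm ball can be estimated by Monte Carlo over the sign variables σ, using the standard identity $\sup_{\|g\| \le B} \sum_i \sigma_i g(x_i) = B\sqrt{\sigma^\top K \sigma}$; this is a simplification of the paper's class G (so the constants differ), intended only to show the computation.

```python
import numpy as np

def empirical_rademacher_kernel_ball(K, B, n_draws=2000, seed=0):
    """Monte Carlo estimate of (2/l) E_sigma[ sup_{||g|| <= B} |sum_i sigma_i g(x_i)| ]
    for a kernel-norm ball, where the supremum equals B * sqrt(sigma^T K sigma)."""
    rng = np.random.default_rng(seed)
    l = K.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, l))  # Rademacher sign variables
    sups = B * np.sqrt(np.einsum("ni,ij,nj->n", sigma, K, sigma))
    return 2.0 / l * sups.mean()

# Illustrative example: RBF Gram matrix on random 2-D points.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / 2.0)
print(empirical_rademacher_kernel_ball(K, B=1.0))
```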

47 Consequently, the margin bound does not rely on the unlabeled training sets. [sent-493, score-0.632]

48 By a similar technique as in (32), we can show that Θ only depends on the unlabeled data and the kernel functions, while J encodes the interaction between labeled and unlabeled data. [sent-496, score-1.065]

49 As a result, the margin bound relies on both labeled and unlabeled data. [sent-497, score-0.672]

50 For this case, we will give an evaluation of the margin bound with different sizes of unlabeled sets in Section 6. [sent-498, score-0.608]

51 The termination condition for iterative optimization is either that no unlabeled examples can be removed or that the maximum iteration number exceeds 50. [sent-504, score-0.604]

52 Each accuracy/error reported in this paper is an averaged accuracy/error value over ten random splits of data into labeled, unlabeled and test data. [sent-506, score-0.49]

53 Later in this section, we also provide a sequential training strategy for SpMvSVMs, which shows an accuracy improvement with the gradual adding of unlabeled data while incurring only roughly linear and sub-linear increases in running time. [sent-507, score-0.589]

54 For SpMvSVMs, the numbers of examples in the labeled training set, unlabeled training set and test set were fixed as four, 596, and 200, respectively. [sent-517, score-0.627]

55 In this paper, γv is normalized by the number of labeled and unlabeled examples involved in the multi-view regularization term. [sent-522, score-0.625]

56 To evaluate SpMvSVMs, we varied the size of the unlabeled training set from 20%, 60% to 100% of the total number of unlabeled data, and used different values for the insensitive parameter ε, which ranged from 0 to 0.2. [sent-524, score-1.03]

57 The test accuracies and transductive accuracies (on the corresponding unlabeled set) are given in Figure 2(a) and Figure 2(b), respectively. [sent-527, score-0.792]

58 The numbers of removed unlabeled examples for different ε values are shown in Figure 3. [sent-529, score-0.542]

59 From Figure 2 and Figure 3, we find that with the increase of ε, more and more unlabeled data are removed, and the removal of a small number of unlabeled data hardly decreases the performance of the resultant classifiers, especially when the original size of the unlabeled set is large. [sent-530, score-1.496]

60 In addition, more unlabeled data can benefit the performance of the learned classifiers with the same ε. [sent-532, score-0.49]

61 (b) Transductive accuracy. Figure 2: Classification accuracies of SpMvSVMs with different sizes of unlabeled set and ε values on the artificial data. [sent-549, score-0.61]

62 Figure 3: The numbers of unlabeled examples removed by SpMvSVMs for different ε values on the artificial data. [sent-555, score-0.542]

63 (b) Transductive accuracy. Figure 4: Classification accuracies of SpMvSVMs with different sizes of unlabeled set and ε values on text classification. [sent-582, score-0.64]

64 For SpMvSVMs, the numbers of examples in the labeled training set, unlabeled training set and test set were fixed as 32, 699, and 320, respectively. [sent-590, score-0.627]

65 To evaluate SpMvSVMs, we also varied the size of the unlabeled training set from 20%, 60% to 100% of the total number of unlabeled data, and used different values for the insensitive parameter ε ranging from 0 to 0.2. [sent-597, score-1.03]

66 The test accuracies and transductive accuracies are given in Figure 4(a) and Figure 4(b), respectively. [sent-600, score-0.302]

67 The numbers of removed unlabeled examples for different ε values are shown in Figure 5. [sent-601, score-0.542]

68 From Figure 5, we find that with the increase of ε, more and more unlabeled data can be removed. [sent-602, score-0.49]

69 As reflected in Figure 4, the removal of unlabeled data only slightly decreases the performance of the resultant classifiers, and this decrease is smaller when more unlabeled data are used. [sent-603, score-1.006]

70 We also observe a different phenomenon, namely that using 60% and 100% of the unlabeled data results in similar test accuracies, as shown in Figure 4(a). [sent-605, score-0.61]

71 Figure 5: The numbers of unlabeled examples removed by SpMvSVMs for different ε values on text classification. [sent-611, score-0.572]

72 3 Comparison with SVM-2K, and Sequential Training The SVM-2K method proposed by Szedmak and Shawe-Taylor (2007) can exploit unlabeled data for multi-view learning. [sent-613, score-0.49]

73 However, it adopts the l1 norm for multi-view regularization, and this regularization only uses unlabeled data. [sent-615, score-0.536]

74 Our first comparison takes the ε-insensitive parameter in both SpMvSVMs and SVM-2K as zero and uses the above two data sets with different sizes of unlabeled training sets, namely 20%, 60%, and 100%. [sent-626, score-0.514]

75 The experimental results are listed in Table 1, from which we see that both the test accuracies and transductive accuracies of SpMvSVMs are superior to their counterparts for SVM-2K. [sent-633, score-0.302]

76 Table 1: Test and transductive accuracies (%) of SVM-2K and SpMvSVMs with different sizes of unlabeled training sets on the artificial data (the first three lines) and text classification data (the last three lines). [sent-663, score-0.726]

77 gradually added unlabeled examples, and thus evaluate the possibility and potential of applying the methods to large-scale problems. [sent-664, score-0.512]

78 The text classification data were used, where all the unlabeled training data were divided into ten equal parts. [sent-665, score-0.544]

79 First, we train SpMvSVMs using the labeled data and the first portion of unlabeled data. [sent-670, score-0.554]

80 Then, we combine the retained unlabeled data from the previous round with the next portion of unlabeled data to train SpMvSVMs (together with the original labeled data), as sketched below. [sent-671, score-1.111]
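
The sequential strategy just described can be written as a short loop. In the sketch below, `train_spmvsvm` is a hypothetical placeholder for the actual training routine (assumed to return the model and the unlabeled examples it retains); this is only the bookkeeping, not the authors' implementation.

```python
def sequential_training(labeled, unlabeled_portions, train_spmvsvm):
    """Train sequentially: at each step, combine the unlabeled examples retained from
    the previous round with the next portion of unlabeled data, always together with
    the original labeled data."""
    retained = []
    models = []
    for portion in unlabeled_portions:
        pool = list(retained) + list(portion)  # carry over the retained examples
        model, retained = train_spmvsvm(labeled, pool)
        models.append(model)
    return models, retained
```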

81 The test accuracies and total numbers of retained unlabeled data after each step are shown in Figure 6(a) and Figure 6(b). [sent-673, score-0.653]

82 Figure 6(b) shows that SpMvSVMs obtain sparse solutions in the sense that the number of retained unlabeled data is small compared to the number of all the added unlabeled data. [sent-676, score-1.089]

83 When ε = 0.1 the running time is roughly linear with respect to the gradual adding of unlabeled data, and when ε = 0.2 the increase is sub-linear. [sent-678, score-0.521]

84 Though the SVM-2K method was not initially proposed for sparse semi-supervised learning, we find that with ε > 0 it can reduce the number of unlabeled data used for representing classifiers. [sent-682, score-0.556]

85 However, for different ε values we did not observe an improvement of test accuracies with the gradual adding of unlabeled data. [sent-684, score-0.641]

86 At ε = 0.05, SVM-2K would not use any unlabeled data at all. [sent-686, score-0.49]

87 (b) # Retained Unlabeled Examples. Figure 6: Classification accuracies and numbers of retained unlabeled examples for sequential training of SpMvSVMs. [sent-721, score-0.746]

88 We have normalized the running times with 10% unlabeled examples to be 1. [sent-744, score-0.515]

89 We fixed ε = 0.2 but varied the size of the unlabeled training set from {10%, 20%, …, 100%}. [sent-750, score-0.514]

90 (b) Complexity and margin bound. Figure 8: Classification error rates, empirical Rademacher complexity, and the margin bound of SpMvSVMs with different sizes of unlabeled sets. [sent-780, score-0.748]

91 The overall decrease of error rates is well explained by the drop of the margin bound and empirical Rademacher complexity brought by the regularization role of more and more unlabeled data. [sent-782, score-0.711]

92 Figure 8(b) also indicates that after adding a certain number of unlabeled data, including more unlabeled data will only improve the performance marginally. [sent-783, score-0.98]

93 For each example, the ε-insensitive loss used is $f_\varepsilon(t_i) = (\sqrt{t_i} - \varepsilon)_+^2$ with $t_i = [f_1(x_i) - f_2(x_i)]^2$. [sent-791, score-0.382]

94 This can be relaxed to a general class of user-designed losses that can be defined as a convex function of ti , for example, using existing convex functions or compositions of convex functions with some good properties (Boyd and Vandenberghe, 2004). [sent-792, score-0.345]

95 (..., 2006) $\min_{f_1 \in H_1, f_2 \in H_2} \frac{1}{2\ell}\sum_{i=1}^{\ell} [(f_1(x_i) - y_i)^2 + (f_2(x_i) - y_i)^2] + \gamma_n(\|f_1\|^2 + \|f_2\|^2) + \gamma_v \sum_{i=1}^{\ell+u} (f_1(x_i) - f_2(x_i))^2$, where the nonnegative scalars $\gamma_n, \gamma_v$ are respectively the norm regularization and multi-view regularization coefficients. [sent-799, score-0.367]

96 Replacing the last term with $\sum_{i=1}^{\ell+u} (|f_1(x_i) - f_2(x_i)| - \varepsilon)_+^2$ results in the sparse Co-RLS algorithm $\min_{f_1 \in H_1, f_2 \in H_2} \frac{1}{2\ell}\sum_{i=1}^{\ell} [(f_1(x_i) - y_i)^2 + (f_2(x_i) - y_i)^2] + \gamma_n(\|f_1\|^2 + \|f_2\|^2) + \gamma_v \sum_{i=1}^{\ell+u} (|f_1(x_i) - f_2(x_i)| - \varepsilon)_+^2$. [sent-801, score-0.315]
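
For concreteness, the sparse Co-RLS criterion above can be evaluated as follows, given the two views' predictions and their (precomputed) RKHS norms; this is only an illustrative evaluation of the objective, not a solver.

```python
import numpy as np

def sparse_corls_objective(f1_lab, f2_lab, y, f1_all, f2_all,
                           norm1_sq, norm2_sq, gamma_n, gamma_v, eps):
    """Sparse Co-RLS value: averaged squared losses on the l labeled examples,
    RKHS-norm regularization, and the eps-insensitive multi-view penalty
    sum_i (|f1(x_i) - f2(x_i)| - eps)_+^2 over all l + u examples."""
    l = len(y)
    data_fit = ((f1_lab - y) ** 2 + (f2_lab - y) ** 2).sum() / (2.0 * l)
    norm_reg = gamma_n * (norm1_sq + norm2_sq)
    disagreement = np.maximum(np.abs(f1_all - f2_all) - eps, 0.0) ** 2
    return data_fit + norm_reg + gamma_v * disagreement.sum()
```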

97 We also performed experiments to validate the usefulness of the margin bound and empirical Rademacher complexity, which explain well the regularization role unlabeled data play for multi-view learning. [sent-825, score-0.676]

98 A PAC-style model for learning from labeled and unlabeled data. [sent-832, score-0.554]

99 Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. [sent-849, score-0.554]

100 Synthesis of maximum margin and multiview learning using unlabeled data. [sent-947, score-0.612]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('unlabeled', 0.49), ('spmvsvms', 0.366), ('ziti', 0.268), ('rademacher', 0.217), ('zi', 0.2), ('hawe', 0.195), ('ti', 0.18), ('onjugate', 0.167), ('emi', 0.15), ('unctions', 0.15), ('sing', 0.129), ('sup', 0.127), ('epsilon', 0.125), ('accuracies', 0.12), ('svms', 0.117), ('uu', 0.113), ('yi', 0.113), ('conjugate', 0.104), ('szedmak', 0.098), ('views', 0.093), ('margin', 0.091), ('parse', 0.09), ('rosenberg', 0.084), ('un', 0.084), ('xi', 0.074), ('conjugates', 0.073), ('earning', 0.067), ('sparse', 0.066), ('sindhwani', 0.065), ('labeled', 0.064), ('inf', 0.063), ('transductive', 0.062), ('vtr', 0.061), ('boyd', 0.058), ('convex', 0.055), ('vandenberghe', 0.055), ('tr', 0.054), ('gl', 0.053), ('supremum', 0.051), ('representer', 0.048), ('concave', 0.048), ('sparsity', 0.047), ('regularization', 0.046), ('dual', 0.045), ('supervised', 0.045), ('sequential', 0.044), ('rl', 0.043), ('retained', 0.043), ('laplacian', 0.042), ('dom', 0.04), ('lagrange', 0.038), ('balcan', 0.038), ('farquhar', 0.037), ('supz', 0.037), ('supzi', 0.037), ('blum', 0.036), ('bartlett', 0.036), ('complexity', 0.035), ('theorem', 0.034), ('optimization', 0.031), ('gradual', 0.031), ('multiview', 0.031), ('iterative', 0.031), ('lagrangian', 0.031), ('manifold', 0.031), ('text', 0.03), ('rkhs', 0.029), ('belkin', 0.029), ('multimedia', 0.028), ('removed', 0.027), ('bound', 0.027), ('nonnegative', 0.026), ('af', 0.026), ('insensitive', 0.026), ('resultant', 0.026), ('classi', 0.026), ('china', 0.026), ('wi', 0.025), ('gram', 0.025), ('examples', 0.025), ('latala', 0.024), ('salton', 0.024), ('shiliang', 0.024), ('supti', 0.024), ('training', 0.024), ('convexity', 0.024), ('proportion', 0.024), ('objective', 0.023), ('sgn', 0.023), ('web', 0.023), ('min', 0.023), ('empirical', 0.022), ('possibility', 0.022), ('loss', 0.022), ('rewritten', 0.022), ('concavity', 0.022), ('kernel', 0.021), ('maxz', 0.021), ('porter', 0.021), ('rls', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions

2 0.21402752 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data

Author: Gideon S. Mann, Andrew McCallum

Abstract: In this paper, we present an overview of generalized expectation criteria (GE), a simple, robust, scalable method for semi-supervised training using weakly-labeled data. GE fits model parameters by favoring models that match certain expectation constraints, such as marginal label distributions, on the unlabeled data. This paper shows how to apply generalized expectation criteria to two classes of parametric models: maximum entropy models and conditional random fields. Experimental results demonstrate accuracy improvements over supervised training and a number of other stateof-the-art semi-supervised learning methods for these models. Keywords: generalized expectation criteria, semi-supervised learning, logistic regression, conditional random fields

3 0.12688248 102 jmlr-2010-Semi-Supervised Novelty Detection

Author: Gilles Blanchard, Gyemin Lee, Clayton Scott

Abstract: A common setting for novelty detection assumes that labeled examples from the nominal class are available, but that labeled examples of novelties are unavailable. The standard (inductive) approach is to declare novelties where the nominal density is low, which reduces the problem to density level set estimation. In this paper, we consider the setting where an unlabeled and possibly contaminated sample is also available at learning time. We argue that novelty detection in this semi-supervised setting is naturally solved by a general reduction to a binary classification problem. In particular, a detector with a desired false positive rate can be achieved through a reduction to Neyman-Pearson classification. Unlike the inductive approach, semi-supervised novelty detection (SSND) yields detectors that are optimal (e.g., statistically consistent) regardless of the distribution on novelties. Therefore, in novelty detection, unlabeled data have a substantial impact on the theoretical properties of the decision rule. We validate the practical utility of SSND with an extensive experimental study. We also show that SSND provides distribution-free, learning-theoretic solutions to two well known problems in hypothesis testing. First, our results provide a general solution to the general two-sample problem, that is, the problem of determining whether two random samples arise from the same distribution. Second, a specialization of SSND coincides with the standard p-value approach to multiple testing under the so-called random effects model. Unlike standard rejection regions based on thresholded p-values, the general SSND framework allows for adaptation to arbitrary alternative distributions in multiple dimensions. Keywords: semi-supervised learning, novelty detection, Neyman-Pearson classification, learning reduction, two-sample problem, multiple testing

4 0.11392342 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels

Author: Pinar Donmez, Guy Lebanon, Krishnakumar Balasubramanian

Abstract: Estimating the error rates of classifiers or regression models is a fundamental task in machine learning which has thus far been studied exclusively using supervised learning techniques. We propose a novel unsupervised framework for estimating these error rates using only unlabeled data and mild assumptions. We prove consistency results for the framework and demonstrate its practical applicability on both synthetic and real world data. Keywords: classification and regression, maximum likelihood, latent variable models

5 0.11364057 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning

Author: Vladimir Koltchinskii

Abstract: Sequential algorithms of active learning based on the estimation of the level sets of the empirical risk are discussed in the paper. Localized Rademacher complexities are used in the algorithms to estimate the sample sizes needed to achieve the required accuracy of learning in an adaptive way. Probabilistic bounds on the number of active examples have been proved and several applications to binary classification problems are considered. Keywords: active learning, excess risk, Rademacher complexities, capacity function, disagreement coefficient

6 0.09666045 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models

7 0.092791595 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization

8 0.073750786 4 jmlr-2010-A Generalized Path Integral Control Approach to Reinforcement Learning

9 0.068437025 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization

10 0.066161752 22 jmlr-2010-Classification Using Geometric Level Sets

11 0.061964456 44 jmlr-2010-Graph Kernels

12 0.060809158 15 jmlr-2010-Approximate Tree Kernels

13 0.059848472 84 jmlr-2010-On Spectral Learning

14 0.059769504 65 jmlr-2010-Learning Translation Invariant Kernels for Classification

15 0.059022088 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes

16 0.054826871 62 jmlr-2010-Learning Gradients: Predictive Models that Infer Geometry and Statistical Dependence

17 0.049469467 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide

18 0.048903514 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization

19 0.048222754 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs

20 0.04816144 45 jmlr-2010-High-dimensional Variable Selection with Sparse Random Projections: Measurement Sparsity and Statistical Efficiency


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.25), (1, -0.089), (2, -0.016), (3, 0.132), (4, -0.024), (5, -0.002), (6, 0.034), (7, 0.156), (8, 0.028), (9, -0.165), (10, -0.09), (11, -0.209), (12, 0.344), (13, 0.121), (14, 0.082), (15, 0.252), (16, -0.027), (17, 0.035), (18, -0.008), (19, 0.043), (20, -0.041), (21, 0.054), (22, 0.024), (23, -0.107), (24, 0.093), (25, -0.028), (26, 0.032), (27, -0.054), (28, -0.004), (29, -0.143), (30, -0.145), (31, 0.074), (32, 0.113), (33, 0.053), (34, 0.027), (35, 0.087), (36, 0.198), (37, 0.05), (38, -0.027), (39, 0.057), (40, 0.015), (41, -0.029), (42, -0.037), (43, -0.005), (44, 0.062), (45, -0.005), (46, 0.025), (47, -0.039), (48, -0.013), (49, -0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94422364 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions

2 0.66263735 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data

3 0.55749995 102 jmlr-2010-Semi-Supervised Novelty Detection

4 0.42424038 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization

Author: Pannagadatta K. Shivaswamy, Tony Jebara

Abstract: Leading classification methods such as support vector machines (SVMs) and their counterparts achieve strong generalization performance by maximizing the margin of separation between data classes. While the maximum margin approach has achieved promising performance, this article identifies its sensitivity to affine transformations of the data and to directions with large data spread. Maximum margin solutions may be misled by the spread of data and preferentially separate classes along large spread directions. This article corrects these weaknesses by measuring margin not in the absolute sense but rather only relative to the spread of data in any projection direction. Maximum relative margin corresponds to a data-dependent regularization on the classification function while maximum absolute margin corresponds to an ℓ2 norm constraint on the classification function. Interestingly, the proposed improvements only require simple extensions to existing maximum margin formulations and preserve the computational efficiency of SVMs. Through the maximization of relative margin, surprising performance gains are achieved on real-world problems such as digit, text classification and on several other benchmark data sets. In addition, risk bounds are derived for the new formulation based on Rademacher averages. Keywords: support vector machines, kernel methods, large margin, Rademacher complexity

5 0.4044826 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels

6 0.3999874 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning

7 0.37909579 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models

8 0.33641922 4 jmlr-2010-A Generalized Path Integral Control Approach to Reinforcement Learning

9 0.31309468 62 jmlr-2010-Learning Gradients: Predictive Models that Infer Geometry and Statistical Dependence

10 0.28248736 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization

11 0.27814665 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding

12 0.26022503 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression

13 0.25344509 84 jmlr-2010-On Spectral Learning

14 0.25134742 65 jmlr-2010-Learning Translation Invariant Kernels for Classification

15 0.24476302 22 jmlr-2010-Classification Using Geometric Level Sets

16 0.22199035 14 jmlr-2010-Approximate Riemannian Conjugate Gradient Learning for Fixed-Form Variational Bayes

17 0.2207693 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes

18 0.21439005 98 jmlr-2010-Regularized Discriminant Analysis, Ridge Regression and Beyond

19 0.21025428 40 jmlr-2010-Fast and Scalable Local Kernel Machines

20 0.20730479 69 jmlr-2010-Lp-Nested Symmetric Distributions


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.028), (3, 0.036), (4, 0.02), (8, 0.012), (15, 0.017), (21, 0.041), (32, 0.048), (36, 0.033), (37, 0.117), (66, 0.274), (75, 0.157), (81, 0.014), (85, 0.083), (96, 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.83129752 13 jmlr-2010-Approximate Inference on Planar Graphs using Loop Calculus and Belief Propagation

Author: Vicenç Gómez, Hilbert J. Kappen, Michael Chertkov

Abstract: We introduce novel results for approximate inference on planar graphical models using the loop calculus framework. The loop calculus (Chertkov and Chernyak, 2006a) allows to express the exact partition function of a graphical model as a finite sum of terms that can be evaluated once the belief propagation (BP) solution is known. In general, full summation over all correction terms is intractable. We develop an algorithm for the approach presented in Chertkov et al. (2008) which represents an efficient truncation scheme on planar graphs and a new representation of the series in terms of Pfaffians of matrices. We analyze the performance of the algorithm for models with binary variables and pairwise interactions on grids and other planar graphs. We study in detail both the loop series and the equivalent Pfaffian series and show that the first term of the Pfaffian series for the general, intractable planar model, can provide very accurate approximations. The algorithm outperforms previous truncation schemes of the loop series and is competitive with other state of the art methods for approximate inference. Keywords: belief propagation, loop calculus, approximate inference, partition function, planar graphs, Ising model

same-paper 2 0.7389192 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions

3 0.60995257 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data

4 0.60793257 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming

Author: Ming Yuan

Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity

5 0.60440183 57 jmlr-2010-Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models

Author: Fang-Lan Huang, Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin

Abstract: Maximum entropy (Maxent) is useful in natural language processing and many other areas. Iterative scaling (IS) methods are one of the most popular approaches to solve Maxent. With many variants of IS methods, it is difficult to understand them and see the differences. In this paper, we create a general and unified framework for iterative scaling methods. This framework also connects iterative scaling and coordinate descent methods. We prove general convergence results for IS methods and analyze their computational complexity. Based on the proposed framework, we extend a coordinate descent method for linear SVM to Maxent. Results show that it is faster than existing iterative scaling methods. Keywords: maximum entropy, iterative scaling, coordinate descent, natural language processing, optimization

6 0.59884238 104 jmlr-2010-Sparse Spectrum Gaussian Process Regression

7 0.58886838 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization

8 0.5840677 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels

9 0.58097082 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond

10 0.58052897 109 jmlr-2010-Stochastic Composite Likelihood

11 0.58014715 102 jmlr-2010-Semi-Supervised Novelty Detection

12 0.57808429 110 jmlr-2010-The SHOGUN Machine Learning Toolbox

13 0.5777685 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data

14 0.57549119 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking

15 0.57532066 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems

16 0.57286227 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning

17 0.57068062 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values

18 0.5698759 9 jmlr-2010-An Efficient Explanation of Individual Classifications using Game Theory

19 0.56949371 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization

20 0.56920493 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices