jmlr jmlr2009 jmlr2009-63 knowledge-graph by maker-knowledge-mining

63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory


Source: pdf

Author: Junhui Wang, Xiaotong Shen, Wei Pan

Abstract: In classification, semisupervised learning usually involves a large amount of unlabeled data with only a small number of labeled data. This imposes a great challenge in that it is difficult to achieve good classification performance through labeled data alone. To leverage unlabeled data for enhancing classification, this article introduces a large margin semisupervised learning method within the framework of regularization, based on an efficient margin loss for unlabeled data, which seeks efficient extraction of the information from unlabeled data for estimating the Bayes decision boundary for classification. For implementation, an iterative scheme is derived through conditional expectations. Finally, theoretical and numerical analyses are conducted, in addition to an application to gene function prediction. They suggest that the proposed method is able to recover the performance of its supervised counterpart based on complete data in rates of convergence, when possible. Keywords: difference convex programming, classification, nonconvex minimization, regularization, support vectors

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Division of Biostatistics, University of Minnesota, Minneapolis, MN 55455, USA. Editor: John Shawe-Taylor. Abstract: In classification, semisupervised learning usually involves a large amount of unlabeled data with only a small number of labeled data. [sent-7, score-0.536]

2 They suggest that the proposed method is able to recover the performance of its supervised counterpart based on complete data in rates of convergence, when possible. [sent-12, score-0.12]

3 Introduction Semisupervised learning occurs in classification, where only a small number of labeled data is available with a large amount of unlabeled data, because of the difficulty of labeling. [sent-14, score-0.275]

4 In situations as such, the primary goal is to leverage unlabeled data to enhance the predictive performance of classification (Zhu, 2005). [sent-21, score-0.197]

5 In semisupervised learning, labeled data $\{(x_i, y_i)\}_{i=1}^{n_l}$ are sampled from an unknown distribution $P(x, y)$, together with an independent unlabeled sample $\{x_j\}_{j=n_l+1}^{n}$ from its marginal distribution $q(x)$. [sent-22, score-0.562]

6 Here the label $y_i \in \{-1, 1\}$, $x_i = (x_{i1}, \cdots, x_{id})$ is a d-dimensional input, $n_l \ll n_u$, and $n = n_l + n_u$ is the combined size of labeled and unlabeled samples. [sent-23, score-1.343]

7 This article develops a large margin semisupervised learning method, which aims to extract the information from unlabeled data for estimating the Bayes decision boundary. [sent-32, score-0.632]

8 This is achieved by constructing an efficient loss for unlabeled data with regard to reconstruction of the Bayes decision boundary and by incorporating some knowledge from an estimate of p. [sent-33, score-0.281]

9 This permits efficient use of unlabeled data for accurate estimation of the Bayes decision boundary, thus enhancing the classification performance over that based on labeled data alone. [sent-34, score-0.323]

10 The proposed method, using both the grouping (clustering) structure of unlabeled data and the smoothness structure of p, is designed to recover the classification performance based on complete data without missing labels, when possible. [sent-35, score-0.308]

11 That is, given a consistent initial classifier, an iterative improvement can be obtained through the constructed loss function. [sent-37, score-0.157]

12 Numerical analysis indicates that the proposed method performs well against several state-of-the-art semisupervised methods, including TSVM and Wang and Shen (2007), where Wang and Shen (2007) compares favorably against several smooth and clustering-based semisupervised methods. [sent-38, score-0.617]

13 The theory reveals that the ψ-learning classifier’s generalization performance based on complete data can be recovered by its semisupervised counterpart based on incomplete data in rates of convergence, when some regularity assumptions are satisfied. [sent-40, score-0.336]

14 The theory also says that the least favorable situation for a semisupervised problem occurs at points near p(x) = 0 or 1 because little information can be provided by these points for reconstructing the classification boundary as discussed in Section 2. [sent-41, score-0.391]

15 In conclusion, this semisupervised method achieves the desired objective of delivering higher generalization performance. [sent-45, score-0.261]

16 This article also examines one novel application in gene function prediction, which has been a primary focus of biomedical research. [sent-46, score-0.11]

17 In gene function prediction, microarray gene expression profiles can be used to predict gene functions, because genes sharing the same function tend to co-express; see Zhou, Kao and Wong (2002). [sent-47, score-0.292]

18 Therefore, gene function prediction is an ideal application for semisupervised methods and is also employed in this article as a real numerical example. [sent-51, score-0.41]

19 Methodology In this section, we present our proposed efficient large margin semisupervised learning method as well as its connection to other existing popular methodologies. [sent-59, score-0.348]

20 1 Large Margin Classification Consider large margin classification with labeled data $(x_i, y_i)_{i=1}^{n_l}$. [sent-61, score-0.165]

21 In linear classification, given a class of candidate decision functions $\mathcal F$, a cost function $C \sum_{i=1}^{n_l} L(y_i f(x_i)) + J(f)$ (1) is minimized over $f \in \mathcal F = \{ f(x) = \tilde w_f^{T} x + w_{f,0} \equiv (1, x^T) w_f \}$ to yield the minimizer $\hat f$, leading to the usual classifier $\mathrm{sign}(\hat f)$. [sent-62, score-0.437]

22 Here $J(f)$ is the reciprocal of the geometric margin of various forms, with the usual $L_2$ margin $J(f) = \|\tilde w_f\|^2/2$ to be discussed in further detail; $L(\cdot)$ is a margin loss defined through the functional margin $z = y f(x)$, and $C > 0$ is a regularization parameter. [sent-63, score-0.384]

23 A margin loss L(z) is said to be large margin if L(z) is non-increasing in z, penalizing small margin values. [sent-70, score-0.297]
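
As a concrete illustration of the cost function (1), the sketch below evaluates a regularized large-margin objective for a linear rule f(x) = w_0 + w^T x, taking the hinge loss as one possible choice of L(·); the toy data and all names are assumptions for illustration, not code from the paper.

```python
import numpy as np

def hinge(z):
    # A large margin loss: L(z) = (1 - z)_+ is non-increasing in the functional margin z = y f(x).
    return np.maximum(0.0, 1.0 - z)

def supervised_cost(w0, w, X, y, C):
    # Cost (1): C * sum_i L(y_i f(x_i)) + J(f), with the L2 margin penalty J(f) = ||w||^2 / 2.
    f = w0 + X @ w                    # linear decision function f(x) = w_0 + w^T x
    return C * np.sum(hinge(y * f)) + 0.5 * np.dot(w, w)

# Hypothetical toy data: n_l = 4 labeled points in two dimensions.
X = np.array([[1.0, 2.0], [-1.0, -0.5], [0.5, 1.5], [-2.0, -1.0]])
y = np.array([1.0, -1.0, 1.0, -1.0])
print(supervised_cost(0.0, np.array([0.3, 0.4]), X, y, C=1.0))
```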

24 ... with $\bar f_{.5}$ being a global minimizer of the generalization error $GE(f) = E\, I(Y \ne \mathrm{sign}(f(X)))$, which is usually estimated by labeled data through $L(\cdot)$ in (1). [sent-77, score-0.103]

25 In the absence of sufficient labeled data, the focus is on how to improve (1) by using additional unlabeled data. [sent-78, score-0.275]

26 For this, we construct a margin loss U to measure the performance of estimating $\bar f_{.5}$. [sent-79, score-0.123]

27 Lemma 1 (Optimal loss) For any margin loss $L(z)$, $\arg\min_{T} E\big(L(Y f(X)) - T(f(X))\big)^2 = E\big(L(Y f(X)) \mid X = x\big) = U(f(x))$, where $U(f(x)) = p(x)L(f(x)) + (1-p(x))L(-f(x))$ and $p(x) = P(Y = 1 \mid X = x)$. [sent-83, score-0.123]

28 Based on Lemma 1, we define $\hat U(f)$ to be $\hat p(x)L(f(x)) + (1 - \hat p(x))L(-f(x))$ by replacing $p$ in $U(f)$ by $\hat p$; then $\hat U(f)$ approximates the ideal loss $U(f)$ for reconstructing the Bayes decision boundary. [sent-85, score-0.118]

29 Through (approximately) optimal loss $U(f)$, an iterative improvement of estimation accuracy is achieved by starting with a consistent estimate $\hat p$ of $p$, which, for instance, can be obtained through SVM or TSVM. [sent-89, score-0.112]

30 The preceding discussion leads to our proposed cost function: $s(f) = C\big( n_l^{-1} \sum_{i=1}^{n_l} L(y_i f(x_i)) + n_u^{-1} \sum_{j=n_l+1}^{n} \hat U(f(x_j)) \big) + J(f)$. (2) [sent-95, score-0.437]
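
A minimal sketch of the combined cost (2) under the same assumptions (hinge loss for L(·), linear f, and the reading that C scales both empirical terms); here p̂ over the unlabeled points is supplied externally, for example from an initial SVM or TSVM fit, and every name is hypothetical.

```python
import numpy as np

def hinge(z):
    return np.maximum(0.0, 1.0 - z)

def u_hat(f_vals, p_hat):
    # Efficient loss for unlabeled data (Lemma 1 with p replaced by an estimate):
    # U_hat(f(x)) = p_hat(x) L(f(x)) + (1 - p_hat(x)) L(-f(x)).
    return p_hat * hinge(f_vals) + (1.0 - p_hat) * hinge(-f_vals)

def semisup_cost(w0, w, X_lab, y_lab, X_unl, p_hat_unl, C):
    # Cost (2): C * ( n_l^{-1} sum_i L(y_i f(x_i)) + n_u^{-1} sum_j U_hat(f(x_j)) ) + J(f).
    f_lab = w0 + X_lab @ w
    f_unl = w0 + X_unl @ w
    labeled_term = np.mean(hinge(y_lab * f_lab))
    unlabeled_term = np.mean(u_hat(f_unl, p_hat_unl))
    return C * (labeled_term + unlabeled_term) + 0.5 * np.dot(w, w)

# Hypothetical tiny example: 2 labeled and 3 unlabeled points in one dimension.
X_lab = np.array([[1.0], [-1.0]]); y_lab = np.array([1.0, -1.0])
X_unl = np.array([[0.5], [-0.3], [2.0]]); p_hat_unl = np.array([0.8, 0.4, 0.9])
print(semisup_cost(0.0, np.array([0.7]), X_lab, y_lab, X_unl, p_hat_unl, C=1.0))
```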

31 By comparison, $\hat U(f)$ not only identifies the clustering boundary through the hat function, as $L(|f|)$ does, but also discriminates $f(x)$ from $-f(x)$ through an estimated $\hat p(x)$. [sent-105, score-0.168]

32 That is, $\hat U(f)$ favors $f(x)$ over $-f(x)$ when $\hat p(x) > .5$, and vice versa, whereas $L(|f|)$ is indiscriminative with regard to the sign of $f(x)$. [sent-107, score-0.169]

33 There $\hat U(f)$ favors one clustering boundary for classification if a consistent $\hat p$ is provided, whereas $L(|f|)$ fails to discriminate these two. [sent-110, score-0.143]

34 In conclusion, $\hat U(f)$ yields a more efficient loss for a semisupervised problem, as it uses the clustering information from the unlabeled data as $L(|f|)$ does, in addition to guidance about labeling through $\hat p$ to gain higher discriminative power. [sent-113, score-0.589]
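
The discriminating behaviour described above can be checked numerically. In this small example (the hinge loss and the particular numbers are arbitrary assumptions), Û assigns a much larger loss to the decision value that contradicts p̂, whereas L(|f|) cannot tell f from −f.

```python
import numpy as np

def hinge(z):
    return np.maximum(0.0, 1.0 - z)

def u_hat(f_val, p_hat):
    return p_hat * hinge(f_val) + (1.0 - p_hat) * hinge(-f_val)

f_val, p_hat = 0.8, 0.9                         # hypothetical decision value and estimated P(Y=1|x)
print(u_hat(f_val, p_hat))                      # 0.36: f agrees with p_hat > 0.5, small loss
print(u_hat(-f_val, p_hat))                     # 1.64: -f contradicts p_hat, large loss
print(hinge(abs(f_val)), hinge(abs(-f_val)))    # identical: L(|f|) cannot distinguish f from -f
```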

35 Computation In this section, we implement the proposed semisupervised method through an iterative scheme as well as a nonconvex optimization technique. [sent-115, score-0.357]

36 1, respectively, to $\hat f$, which are a more general version of the clustering assumption and a smoothness assumption of p. [sent-121, score-0.253]

37 In other words, the marginal information from unlabeled data has been effectively incorporated in each iteration of Algorithm 1 for improving the estimation accuracy of $\hat f$ and $\hat p$. [sent-122, score-0.223]

38 Train weighted margin classifiers $\hat f_{\pi_t}$ by solving $\min_{f \in \mathcal F}\ C n^{-1}\big( (1-\pi_t) \sum_{y_i = 1} L(y_i f(x_i)) + \pi_t \sum_{y_i = -1} L(y_i f(x_i)) \big) + J(f)$, with $1-\pi_t$ associated with positive instances and $\pi_t$ associated with negative instances. [sent-130, score-0.152]
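
A sketch of the weighted objective solved at this step, again with a hinge loss standing in for L(·) and a linear f; the code simply attaches weight 1 − π_t to positive-labeled terms and π_t to negative-labeled terms, and all names are illustrative assumptions.

```python
import numpy as np

def hinge(z):
    return np.maximum(0.0, 1.0 - z)

def weighted_cost(w0, w, X, y, pi_t, C):
    # Weighted objective for f_hat_{pi_t}:
    # C n^{-1} [ (1 - pi_t) sum_{y_i=+1} L(y_i f(x_i)) + pi_t sum_{y_i=-1} L(y_i f(x_i)) ] + J(f).
    f = w0 + X @ w
    weights = np.where(y == 1, 1.0 - pi_t, pi_t)
    return C * np.mean(weights * hinge(y * f)) + 0.5 * np.dot(w, w)

# Hypothetical usage with three labeled points.
X = np.array([[1.0, 0.5], [-0.8, -1.2], [0.3, 0.9]])
y = np.array([1, -1, 1])
print(weighted_cost(0.0, np.array([0.2, 0.1]), X, y, pi_t=0.4, C=1.0))
```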

39 Algorithm 1 also differs from Yarowsky's algorithm (Yarowsky, 1995; Abney, 2004) in that Yarowsky's algorithm solely relies on the strength of the estimated $\hat p$, ignoring the potential information from the clustering assumption. [sent-156, score-0.12]

40 Most importantly, the smoothness and clustering assumptions have been used in estimating p, and thus in semisupervised learning. [sent-160, score-0.438]

41 2 Nonconvex Minimization This section develops a nonconvex minimization method based on DC programming (An and Tao, 1997) for (2) with the ψ-loss, which was previously employed in Liu, Shen and Wong (2005) for supervised ψ-learning. [sent-165, score-0.153]
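
The generic difference convex (DC) iteration behind such a scheme can be sketched as follows: write the nonconvex loss as a difference of two convex functions, linearize the subtracted part at the current iterate, and minimize the resulting convex surrogate. The particular ψ-type decomposition 2(1 − z)_+ − 2(−z)_+ and the plain subgradient inner solver used here are assumptions for illustration, not the DCA formulation of the paper.

```python
import numpy as np

def psi_loss(z):
    # psi-type loss written as a difference of convex functions:
    # psi(z) = 2*(1 - z)_+ - 2*(-z)_+  (equals 2 for z < 0, 2(1 - z) for 0 <= z < 1, 0 for z >= 1).
    return 2.0 * np.maximum(0.0, 1.0 - z) - 2.0 * np.maximum(0.0, -z)

def dca_fit(X, y, C=1.0, outer=10, inner=200, lr=0.01):
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])      # prepend an intercept column
    w = np.zeros(d + 1)
    for _ in range(outer):
        z = y * (Xb @ w)
        # Linearize the subtracted convex part psi2(z) = 2*(-z)_+ at the current iterate.
        g2 = C * Xb.T @ (np.where(z < 0.0, -2.0, 0.0) * y)
        for _ in range(inner):                # subgradient descent on the convex surrogate
            z = y * (Xb @ w)
            g1 = C * Xb.T @ (np.where(z < 1.0, -2.0, 0.0) * y)   # subgradient of sum of 2*(1 - z)_+
            grad = g1 - g2 + np.r_[0.0, w[1:]]                   # J(f) = ||w||^2/2; intercept unpenalized
            w = w - lr * grad
    return w

# Hypothetical toy data: two Gaussian clusters labeled +1 and -1.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2)) + np.outer(np.repeat([1.0, -1.0], 20), [1.5, 1.5])
y = np.repeat([1.0, -1.0], 20)
w_hat = dca_fit(X, y)
Xb = np.hstack([np.ones((X.shape[0], 1)), X])
print(w_hat, np.sum(psi_loss(y * (Xb @ w_hat))) + 0.5 * np.dot(w_hat[1:], w_hat[1:]))
```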

42 They are SVM (with labeled data alone), TSVM (TSVM$_{DCA}$; Wang, Shen and Pan, 2007), and the methods of Wang and Shen (2007) with the hinge loss (SSVM) and with the ψ-loss (SPSI), where SSVM and SPSI compare favorably against their competitors. [sent-207, score-0.114]

43 Corresponding to these methods, our method, with $m = n^{1/2}$ and $\varepsilon = 10^{-3}$, yields four semisupervised classifiers denoted as ESVM, ETSVM, ESSVM and ESPSI. [sent-208, score-0.261]

44 1 Simulated Examples: Examples 1 and 2 are taken from Wang and Shen (2007), where 200 and 800 labeled instances are randomly selected for training and testing. [sent-211, score-0.116]

45 Instances in WBC, PIMA, HEART, and MUSHROOM are randomly divided into two halves with 10 labeled and 190 unlabeled instances for training, and the remaining 400 for testing. [sent-222, score-0.313]

46 Instances in SPAM are randomly divided into halves with 20 labeled and 380 unlabeled instances for training, and the remaining instances for testing. [sent-223, score-0.351]
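
A minimal sketch of the splitting protocol described above (the function name and the use of NumPy are assumptions): indices are randomly partitioned into a small labeled training set, an unlabeled training set, and a test set made of the remainder; 600 available instances are assumed here so that 400 remain for testing, and the counts can be changed to 20/380 for SPAM.

```python
import numpy as np

def split_semisup(n_total, n_labeled, n_unlabeled, seed=0):
    # Randomly partition instance indices into a labeled training set, an unlabeled training set,
    # and a test set made of whatever remains.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_total)
    labeled = perm[:n_labeled]
    unlabeled = perm[n_labeled:n_labeled + n_unlabeled]
    test = perm[n_labeled + n_unlabeled:]
    return labeled, unlabeled, test

lab, unl, test = split_semisup(600, 10, 190)
print(len(lab), len(unl), len(test))   # 10 190 400
```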

47 html, with 10 labeled and 390 unlabeled instances, while no instances are used for testing. [sent-229, score-0.313]

48 An averaged error rate over the unlabeled set is used in the BCI example to approximate the test error. [sent-230, score-0.228]

49 SVMc denotes the performance of SVM with complete labeled data. [sent-423, score-0.107]

50 This is the case for ESVM with linear kernel in SPAM, where ESVM performs worse than SVM with nl = 10 labeled data alone. [sent-610, score-0.546]

51 This suggests that a better initial estimate should be used together with unlabeled data. [sent-611, score-0.242]

52 Moreover, ESPSI nearly recovers the classification performance of its counterpart SVM with complete labeled data in the two simulated examples, WBC and HEART. [sent-613, score-0.179]

53 2 Gene Function Prediction Through Expression Profiles This section applies the proposed method to predict gene functions through gene data in Hughes et al. [sent-615, score-0.154]

54 The error rate is computed on the unlabeled data and averaged over twelve splits. [sent-618, score-0.228]

55 In this case almost half of the genes have unknown functions although gene expression profiles are available for almost the entire yeast genome. [sent-622, score-0.168]

56 The goal is to predict gene functional categories for genes annotated within these two categories by training our semisupervised classifier on expression profiles of genes, where some genes are treated as if their functions were unknown, to mimic the semisupervised scenario in the complete data set. [sent-626, score-0.75]

57 The training set involves a random sample of nl = 20 labeled and nu = 380 unlabeled gene profiles, while the testing set contains 280 remaining profiles. [sent-629, score-0.886]

58 Table 3: Averaged test errors as well as estimated standard errors (in parentheses) of ESVM, ETSVM, ESSVM, ESPSI, and their initial counterparts, over 100 pairs of training and testing samples, in gene function prediction. [sent-672, score-0.147]

59 Here I stands for an initial classifier, E stands for our proposed method with the initial method, and the amount of improvement is defined in (6). [sent-673, score-0.124]

60 Statistical Learning Theory In the literature, several theories have been developed to understand the problem of semisupervised learning, including Rigollet (2007) and Singh, Nowak and Zhu (2008). [sent-678, score-0.261]

61 Both theories rely on a different clustering assumption, namely that labels are homogeneous over local clusters. [sent-679, score-0.133]

62 Based on the original clustering assumption, as well as a smoothness assumption on the conditional probability p(x), this section develops a novel statistical learning theory. [sent-680, score-0.269]

63 The resulting error rates for $e(\hat f_C, \bar f_{.5})$ are expressed in terms of the complexity of the candidate class $\mathcal F$, the sample size $n$, the tuning parameter $\lambda = (nC)^{-1}$, the error rate of the initial classifier $\delta_n^{(0)}$, and the maximum number of iterations $K$ in Algorithm 1. [sent-688, score-0.108]

64 The results imply that ESPSI, without knowing labels of the unlabeled data, is able to recover the classification accuracy of ψ-learning based on complete data under regularity conditions. [sent-689, score-0.226]

65 Define the margin loss Vπ ( f , Z) for unequal cost classification to be Sπ (y)L(y f (x)), with cost 0 < π < 1 for the positive class and Sπ (y) = 1 − π if y = 1, and π otherwise. [sent-692, score-0.172]

66 Assumption B (Equation (9)) requires that $\sup_{\{f \in \mathcal F:\, e_{V_\pi}(f, \bar f_\pi) \le \delta\}} e(f, \bar f_{.5})$ be controlled in terms of $\delta$. [sent-700, score-0.118]

67 Assumption B describes local smoothness of the Bayesian regret $e(f, \bar f_{.5})$. [sent-702, score-0.119]

68 Denote the corresponding exponents as β and γ respectively, where β quantifies the clustering assumption through the degree to which the positive and negative clusters are distinguishable, and γ measures the conversion rate between the classification and probability estimation accuracies. [sent-709, score-0.164]

69 Given any ε > 0, denote $\{(f_r^l, f_r^u)\}_{r=1}^{R}$ as an ε-bracketing function set of $\mathcal F$ if for any $f \in \mathcal F$, there exists an $r$ such that $f_r^l \le f \le f_r^u$ and $\|f_r^l - f_r^u\|_2 \le \varepsilon$; $r = 1, \cdots, R$. [sent-711, score-0.258]

70 ..., where the parameter B measures the level of difficulty of a semisupervised problem, with a small value of B indicating more difficulty. [sent-736, score-0.261]

71 In fact, α, β and γ quantify the local smoothness of the Bayesian regret $e(f, \bar f_{.5})$. [sent-738, score-0.119]

72 Next, by letting $n_l, n_u$ tend to infinity, we obtain the rates of convergence of ESPSI in terms of the error rate $\delta_n^{2\alpha}$ of its supervised counterpart ψ-learning based on complete data, the initial error rate $\delta_n^{(0)}$, $B$, and the maximum number $K$ of iterations. [sent-740, score-0.761]

73 Corollary 4 Under the assumptions of Theorem 3, as $n_u, n_l \to \infty$, $|e(\hat f_C^{(K)}, \bar f_{.5})|$ converges at a rate determined by $\delta_n^{2\alpha}$, $\delta_n^{(0)}$, $B$ and $K$. [sent-741, score-0.534]

74 $\ldots \ge 2 a_{11} \rho_n (\delta_n^{(0)})^2\big) \to 0$, with any slowly varying sequence $\rho_n \to \infty$ and $\rho_n \delta_n^{(0)} \to 0$, and the tuning parameter $\lambda$ chosen such that $n(\lambda J_\pi^*)^{\max(1, 2-\zeta)}$ and $n_l(\lambda J_{.5}^*)\ldots$ [sent-745, score-0.469]

75 When $B > 1$, ESPSI achieves the convergence rate $\delta_n^{2\alpha}$ of its supervised counterpart ψ-learning based on complete data. [sent-748, score-0.151]

76 Corollary 5 (Optimality) Under the assumptions of Corollary 4, as $n_u, n_l \to \infty$, $\sup_{f \in \mathcal F} |\hat U(f) - U(f)| = O_p\big(\max(\delta_n^{\beta\gamma}, (\rho_n \delta_n^{(0)})^{\beta\gamma \max(1, B)^K})\big)$, where $\hat U(f)$ is the estimated $U(f)$ loss with $p$ estimated based on $\hat f_C^{(K)}$. [sent-757, score-0.679]

77 To argue that the approximation error rate of $\hat U(f)$ to $U(f)$ is sufficiently small, note that $\hat f_C$ obtained from minimizing (2) recovers the classification error rate of its supervised counterpart based on complete data, as suggested by Corollary 4. [sent-758, score-0.208]

78 In conclusion, ESPSI, without knowing label values of unlabeled instances, is able to reconstruct the classification and estimation performance of ψ-learning based on complete data in rates of convergence, when possible. [sent-760, score-0.226]

79 In all cases, ESPSI (nearly) achieves the generalization error rates of ψ-learning for complete data when unlabeled data provide useful information, and yields no worse performance than its initial classifier otherwise. [sent-763, score-0.271]

80 It is evident that the clustering assumption (Assumption B) is met since the neighborhood of $\bar f_{.5}$ [sent-769, score-0.17]

81 $\bar f_{.5}(x)$ has low density, as shown in the left panel of Figure 3, and the smoothness assumption (Assumption D) and the boundedness assumption of p(x) (Assumption E) are met as well, since p(x) is a hyperplane bounded by (0. [sent-770, score-0.195]

82 For ESPSI $\hat f_C$, we choose $\delta_n^{(0)} = n_l^{-1} \log n_l$, the convergence rate of supervised linear ψ-learning, $\rho_n \to \infty$ to be an arbitrarily slow sequence, and $C = O((\log n)^{-1})$. [sent-778, score-0.551]

83 $|e(\hat f_C, \bar f_{.5})| = O\big(\max\big(n^{-1}\log n,\ (n_l^{-1}(\log n_l)^2)^{\max(1, 2B)}\big)\big)$, with $B = \frac{1+\kappa_1}{2\kappa_2(1+\kappa_2)}$. [sent-780, score-0.475]

84 Similarly, we choose $\delta_n^{(0)} = n_l^{-1}(\log n_l)^3$ to be the convergence rate of supervised ψ-learning with Gaussian kernel, $\rho_n \to \infty$ to be an arbitrarily slow sequence, and $C = O((\log n)^{-3})$. [sent-789, score-0.513]

85 $|e(\hat f_C, \bar f_{.5})| = O\big(\max\big((n_l^{-1}(\log n_l)^3)^{\max(1, 2B)},\ n^{-1}(\log n)^3\big)\big) = O(n^{-1}(\log n)^3)$ when $\kappa_1 + 1 > 2\kappa_2(1 + \kappa_2)$ and $K$ is sufficiently large, and $O(n_l^{-1}(\log n_l)^3)$ otherwise. [sent-791, score-0.874]

86 Summary This article introduces a large margin semisupervised learning method through an iterative scheme based on an efficient loss for unlabeled data. [sent-794, score-0.656]

87 In contrast to most methods assuming a relationship between the conditional and the marginal distributions, the proposed method integrates labeled and unlabeled data by using the clustering structure of unlabeled data as well as the smoothness structure of the estimated p. [sent-795, score-0.7]

88 The theoretical and numerical results suggest that the method compares favorably against top competitors, and achieves the desired goal of reconstructing the classification performance of its supervised counterpart on complete labeled data. [sent-796, score-0.241]

89 One critical issue is how to use unlabeled data to enhance the accuracy of estimating the generalization error so that adaptive tuning is possible. [sent-798, score-0.229]

90 In this proof, we denote labeled and unlabeled samples by $\{(X_i, Y_i)\}_{i=1}^{n_l}$ and $\{X_j\}_{j=n_l+1}^{n}$ to indicate that they are all random variables. [sent-815, score-0.275]

91 Step 1: First we bound the probability of the percentage of unlabeled instances wrongly labeled by $\mathrm{sign}(\hat f^{(k)})$ by the tail probability of $e_{V_{.5}}$. [sent-816, score-0.313]

92 Define $D_f = \{X_j : \mathrm{sign}(\hat f^{(k)}(X_j)) \ne \mathrm{sign}(\bar f_{.5}(X_j));\ n_l + 1 \le j \le n\}$ to be the set of unlabeled data that are wrongly labeled by $\mathrm{sign}(\hat f^{(k)})$, with $n_f = \#\{D_f\}$ being its cardinality. [sent-820, score-0.712]

93 According to Markov's inequality and the fact that $E(n_f/n_u)$ equals the corresponding misclassification probability, the deviation between $\mathrm{sign}(\hat f^{(k+1)})$ and $\mathrm{sign}(\bar f_{.5})$ can be controlled ... [sent-821, score-0.194]

94 Let $\tilde V_\pi(f, Z) = V_\pi(f, Z) + \lambda J(f)$, and $Z_j = (X_j, Y_j)$ with $Y_j = \mathrm{sign}(\hat f^{(k)}(X_j))$; $n_l + 1 \le j \le n$. [sent-832, score-0.437]

95 By the definition of $\hat f_\pi^{(k)}$ and (11), the tail probability $P\big(e_{V_\pi}(\hat f_\pi^{(k)}, \bar f_\pi) \ge \delta_k^2\big)$ is bounded by $P^*\big(\sup_{N_k} n^{-1}\sum_{i=1}^n (\tilde V_\pi(f_\pi^*, Z_i) - \tilde V_\pi(f, Z_i)) \ge 0\big)$ plus terms of order $a_1(a_{11}\rho_n^2(\delta_n^{(k)})^2)^\beta$ and $n_f/n$. [sent-834, score-0.253]

96 Without loss of generality, assume that $4 s_n < \delta_k$. For the first moment, note that $\nabla(f, f_\pi^*) \ge e_{V_\pi}(f, f_\pi^*) - \frac{n_f}{n} E\big|V_\pi(f, Z) - V_\pi(f_\pi^*, Z) + \bar V_\pi(f, Z) - \bar V_\pi(f_\pi^*, Z)\big| \ge e_{V_\pi}(f, f_\pi^*) - 4\frac{n_f}{n}$, with $\bar V_\pi(f, z) = S_\pi(-y)L(-y f(x))$. [sent-841, score-0.133]

97 $U_{.5}(f(x)) = \frac{1}{2}\big(p(x)L(f(x)) + (1 - p(x))L(-f(x))\big)$ is the ideal loss for unlabeled data. [sent-872, score-0.272]

98 A framework for learning predictive structures from multiple tasks and unlabeled data. [sent-1070, score-0.197]

99 Text classification from labeled and unlabeled documents using EM. [sent-1247, score-0.275]

100 A probability analysis on the value of unlabeled data for classification problems. [sent-1369, score-0.197]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('nl', 0.437), ('espsi', 0.272), ('semisupervised', 0.261), ('shen', 0.26), ('unlabeled', 0.197), ('pan', 0.194), ('wang', 0.186), ('esvm', 0.172), ('sign', 0.169), ('ev', 0.16), ('emisupervised', 0.157), ('essvm', 0.114), ('etsvm', 0.114), ('arge', 0.11), ('fficient', 0.11), ('spsi', 0.1), ('hen', 0.099), ('nu', 0.097), ('nf', 0.097), ('argin', 0.096), ('clustering', 0.095), ('margin', 0.087), ('spam', 0.086), ('ssvm', 0.086), ('tsvm', 0.086), ('smoothness', 0.082), ('labeled', 0.078), ('gene', 0.077), ('pima', 0.076), ('mushroom', 0.073), ('wbc', 0.073), ('bci', 0.071), ('zi', 0.062), ('genes', 0.061), ('sup', 0.059), ('chapelle', 0.058), ('zhu', 0.058), ('var', 0.057), ('yarowsky', 0.057), ('max', 0.056), ('en', 0.055), ('hb', 0.054), ('nonconvex', 0.054), ('develops', 0.054), ('bayes', 0.054), ('classi', 0.053), ('nk', 0.049), ('unequal', 0.049), ('ge', 0.049), ('zien', 0.049), ('boundary', 0.048), ('earning', 0.047), ('counterpart', 0.046), ('initial', 0.045), ('supervised', 0.045), ('pro', 0.044), ('frl', 0.043), ('fru', 0.043), ('reconstructing', 0.043), ('svmc', 0.043), ('iterative', 0.042), ('heart', 0.041), ('exp', 0.041), ('corollary', 0.04), ('ideal', 0.039), ('dc', 0.039), ('favorable', 0.039), ('assumption', 0.038), ('instances', 0.038), ('log', 0.038), ('regret', 0.037), ('met', 0.037), ('parenthesis', 0.036), ('loss', 0.036), ('les', 0.035), ('improvement', 0.034), ('svm', 0.033), ('sn', 0.033), ('article', 0.033), ('tuning', 0.032), ('liu', 0.031), ('rate', 0.031), ('kernel', 0.031), ('mitchell', 0.03), ('yeast', 0.03), ('blum', 0.03), ('complete', 0.029), ('junhui', 0.029), ('kao', 0.029), ('mason', 0.029), ('mewes', 0.029), ('mips', 0.029), ('min', 0.027), ('el', 0.026), ('deferred', 0.026), ('distributional', 0.026), ('recovers', 0.026), ('marginal', 0.026), ('er', 0.025), ('estimated', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999911 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory

Author: Junhui Wang, Xiaotong Shen, Wei Pan

Abstract: In classification, semisupervised learning usually involves a large amount of unlabeled data with only a small number of labeled data. This imposes a great challenge in that it is difficult to achieve good classification performance through labeled data alone. To leverage unlabeled data for enhancing classification, this article introduces a large margin semisupervised learning method within the framework of regularization, based on an efficient margin loss for unlabeled data, which seeks efficient extraction of the information from unlabeled data for estimating the Bayes decision boundary for classification. For implementation, an iterative scheme is derived through conditional expectations. Finally, theoretical and numerical analyses are conducted, in addition to an application to gene function prediction. They suggest that the proposed method is able to recover the performance of its supervised counterpart based on complete data in rates of convergence, when possible. Keywords: difference convex programming, classification, nonconvex minimization, regularization, support vectors

2 0.075727664 23 jmlr-2009-Discriminative Learning Under Covariate Shift

Author: Steffen Bickel, Michael Brückner, Tobias Scheffer

Abstract: We address classification problems for which the training instances are governed by an input distribution that is allowed to differ arbitrarily from the test distribution—problems also referred to as classification under covariate shift. We derive a solution that is purely discriminative: neither training nor test distribution are modeled explicitly. The problem of learning under covariate shift can be written as an integrated optimization problem. Instantiating the general optimization problem leads to a kernel logistic regression and an exponential model classifier for covariate shift. The optimization problem is convex under certain conditions; our findings also clarify the relationship to the known kernel mean matching procedure. We report on experiments on problems of spam filtering, text classification, and landmine detection. Keywords: covariate shift, discriminative learning, transfer learning

3 0.058630645 29 jmlr-2009-Estimating Labels from Label Proportions

Author: Novi Quadrianto, Alex J. Smola, Tibério S. Caetano, Quoc V. Le

Abstract: Consider the following problem: given sets of unlabeled observations, each set with known label proportions, predict the labels of another set of observations, possibly with known label proportions. This problem occurs in areas like e-commerce, politics, spam filtering and improper content detection. We present consistent estimators which can reconstruct the correct labels with high probability in a uniform convergence sense. Experiments show that our method works well in practice. Keywords: unsupervised learning, Gaussian processes, classification and prediction, probabilistic models, missing variables

4 0.054050684 28 jmlr-2009-Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks

Author: Jean Hausser, Korbinian Strimmer

Abstract: We present a procedure for effective estimation of entropy and mutual information from small-sample data, and apply it to the problem of inferring high-dimensional gene association networks. Specifically, we develop a James-Stein-type shrinkage estimator, resulting in a procedure that is highly efficient statistically as well as computationally. Despite its simplicity, we show that it outperforms eight other entropy estimation procedures across a diverse range of sampling scenarios and data-generating models, even in cases of severe undersampling. We illustrate the approach by analyzing E. coli gene expression data and computing an entropy-based gene-association network from gene expression data. A computer program is available that implements the proposed shrinkage estimator. Keywords: entropy, shrinkage estimation, James-Stein estimator, "small n, large p" setting, mutual information, gene association network

5 0.050545104 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods

Author: Christian Rieger, Barbara Zwicknagl

Abstract: We introduce a new technique for the analysis of kernel-based regression problems. The basic tools are sampling inequalities which apply to all machine learning problems involving penalty terms induced by kernels related to Sobolev spaces. They lead to explicit deterministic results concerning the worst case behaviour of ε- and ν-SVRs. Using these, we show how to adjust regularization parameters to get best possible approximation orders for regression. The results are illustrated by some numerical examples. Keywords: sampling inequality, radial basis functions, approximation theory, reproducing kernel Hilbert space, Sobolev space

6 0.048974618 87 jmlr-2009-Sparse Online Learning via Truncated Gradient

7 0.048467975 50 jmlr-2009-Learning When Concepts Abound

8 0.047482874 16 jmlr-2009-Classification with Gaussians and Convex Loss

9 0.04492379 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms

10 0.044390637 98 jmlr-2009-Universal Kernel-Based Learning with Applications to Regular Languages    (Special Topic on Mining and Learning with Graphs and Relations)

11 0.043769479 43 jmlr-2009-Java-ML: A Machine Learning Library    (Machine Learning Open Source Software Paper)

12 0.042675771 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost

13 0.042388145 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs

14 0.042019814 10 jmlr-2009-Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification

15 0.041845109 99 jmlr-2009-Using Local Dependencies within Batches to Improve Large Margin Classifiers

16 0.04117012 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers

17 0.040875968 18 jmlr-2009-Consistency and Localizability

18 0.039771322 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks

19 0.039622951 59 jmlr-2009-Nearest Neighbor Clustering: A Baseline Method for Consistent Clustering with Arbitrary Objective Functions

20 0.039433617 82 jmlr-2009-Robustness and Regularization of Support Vector Machines


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.203), (1, -0.039), (2, 0.102), (3, -0.025), (4, -0.023), (5, -0.003), (6, 0.011), (7, 0.038), (8, -0.014), (9, 0.119), (10, 0.116), (11, -0.03), (12, -0.072), (13, -0.055), (14, 0.123), (15, -0.15), (16, 0.102), (17, 0.023), (18, 0.113), (19, -0.143), (20, -0.13), (21, 0.052), (22, -0.036), (23, -0.021), (24, 0.0), (25, 0.052), (26, -0.042), (27, 0.028), (28, -0.016), (29, 0.021), (30, -0.086), (31, 0.035), (32, -0.184), (33, -0.208), (34, 0.113), (35, -0.102), (36, -0.13), (37, -0.048), (38, 0.041), (39, 0.183), (40, -0.117), (41, 0.008), (42, -0.159), (43, -0.041), (44, -0.144), (45, 0.211), (46, -0.087), (47, -0.209), (48, -0.15), (49, -0.082)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9316048 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory

Author: Junhui Wang, Xiaotong Shen, Wei Pan

Abstract: In classification, semisupervised learning usually involves a large amount of unlabeled data with only a small number of labeled data. This imposes a great challenge in that it is difficult to achieve good classification performance through labeled data alone. To leverage unlabeled data for enhancing classification, this article introduces a large margin semisupervised learning method within the framework of regularization, based on an efficient margin loss for unlabeled data, which seeks efficient extraction of the information from unlabeled data for estimating the Bayes decision boundary for classification. For implementation, an iterative scheme is derived through conditional expectations. Finally, theoretical and numerical analyses are conducted, in addition to an application to gene function prediction. They suggest that the proposed method is able to recover the performance of its supervised counterpart based on complete data in rates of convergence, when possible. Keywords: difference convex programming, classification, nonconvex minimization, regularization, support vectors

2 0.40230134 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods

Author: Christian Rieger, Barbara Zwicknagl

Abstract: We introduce a new technique for the analysis of kernel-based regression problems. The basic tools are sampling inequalities which apply to all machine learning problems involving penalty terms induced by kernels related to Sobolev spaces. They lead to explicit deterministic results concerning the worst case behaviour of ε- and ν-SVRs. Using these, we show how to adjust regularization parameters to get best possible approximation orders for regression. The results are illustrated by some numerical examples. Keywords: sampling inequality, radial basis functions, approximation theory, reproducing kernel Hilbert space, Sobolev space

3 0.39291608 28 jmlr-2009-Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks

Author: Jean Hausser, Korbinian Strimmer

Abstract: We present a procedure for effective estimation of entropy and mutual information from small-sample data, and apply it to the problem of inferring high-dimensional gene association networks. Specifically, we develop a James-Stein-type shrinkage estimator, resulting in a procedure that is highly efficient statistically as well as computationally. Despite its simplicity, we show that it outperforms eight other entropy estimation procedures across a diverse range of sampling scenarios and data-generating models, even in cases of severe undersampling. We illustrate the approach by analyzing E. coli gene expression data and computing an entropy-based gene-association network from gene expression data. A computer program is available that implements the proposed shrinkage estimator. Keywords: entropy, shrinkage estimation, James-Stein estimator, "small n, large p" setting, mutual information, gene association network

4 0.33075082 99 jmlr-2009-Using Local Dependencies within Batches to Improve Large Margin Classifiers

Author: Volkan Vural, Glenn Fung, Balaji Krishnapuram, Jennifer G. Dy, Bharat Rao

Abstract: Most classification methods assume that the samples are drawn independently and identically from an unknown data generating distribution, yet this assumption is violated in several real life problems. In order to relax this assumption, we consider the case where batches or groups of samples may have internal correlations, whereas the samples from different batches may be considered to be uncorrelated. This paper introduces three algorithms for classifying all the samples in a batch jointly: one based on a probabilistic formulation, and two based on mathematical programming analysis. Experiments on three real-life computer aided diagnosis (CAD) problems demonstrate that the proposed algorithms are significantly more accurate than a naive support vector machine which ignores the correlations among the samples. Keywords: batch-wise classification, support vector machine, linear programming, machine learning, statistical methods, unconstrained optimization

5 0.3177121 16 jmlr-2009-Classification with Gaussians and Convex Loss

Author: Dao-Hong Xiang, Ding-Xuan Zhou

Abstract: This paper considers binary classification algorithms generated from Tikhonov regularization schemes associated with general convex loss functions and varying Gaussian kernels. Our main goal is to provide fast convergence rates for the excess misclassification error. Allowing varying Gaussian kernels in the algorithms improves learning rates measured by regularization error and sample error. Special structures of Gaussian kernels enable us to construct, by a nice approximation scheme with a Fourier analysis technique, uniformly bounded regularizing functions achieving polynomial decays of the regularization error under a Sobolev smoothness condition. The sample error is estimated by using a projection operator and a tight bound for the covering numbers of reproducing kernel Hilbert spaces generated by Gaussian kernels. The convexity of the general loss function plays a very important role in our analysis. Keywords: reproducing kernel Hilbert space, binary classification, general convex loss, varying Gaussian kernels, covering number, approximation

6 0.31256649 29 jmlr-2009-Estimating Labels from Label Proportions

7 0.30340451 10 jmlr-2009-Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification

8 0.30203596 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks

9 0.29925165 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms

10 0.27679071 23 jmlr-2009-Discriminative Learning Under Covariate Shift

11 0.26456395 98 jmlr-2009-Universal Kernel-Based Learning with Applications to Regular Languages    (Special Topic on Mining and Learning with Graphs and Relations)

12 0.24758497 87 jmlr-2009-Sparse Online Learning via Truncated Gradient

13 0.23923789 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification

14 0.23499216 59 jmlr-2009-Nearest Neighbor Clustering: A Baseline Method for Consistent Clustering with Arbitrary Objective Functions

15 0.23184809 50 jmlr-2009-Learning When Concepts Abound

16 0.23086913 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs

17 0.23026724 58 jmlr-2009-NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM

18 0.2277666 12 jmlr-2009-Bi-Level Path Following for Cross Validated Solution of Kernel Quantile Regression

19 0.21836983 43 jmlr-2009-Java-ML: A Machine Learning Library    (Machine Learning Open Source Software Paper)

20 0.21560974 3 jmlr-2009-A Parameter-Free Classification Method for Large Scale Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.017), (11, 0.022), (38, 0.029), (47, 0.012), (52, 0.052), (55, 0.579), (58, 0.03), (66, 0.093), (68, 0.013), (90, 0.041), (96, 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.95276833 65 jmlr-2009-On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality

Author: Wenxin Jiang

Abstract: The statistical learning theory of risk minimization depends heavily on probability bounds for uniform deviations of the empirical risks. Classical probability bounds using Hoeffding’s inequality cannot accommodate more general situations with unbounded loss and dependent data. The current paper introduces an inequality that extends Hoeffding’s inequality to handle these more general situations. We will apply this inequality to provide probability bounds for uniform deviations in a very general framework, which can involve discrete decision rules, unbounded loss, and a dependence structure that can be more general than either martingale or strong mixing. We will consider two examples with high dimensional predictors: autoregression (AR) with ℓ1 -loss, and ARX model with variable selection for sign classification, which uses both lagged responses and exogenous predictors. Keywords: dependence, empirical risk, probability bound, unbounded loss, uniform deviation

2 0.94195849 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques

Author: Barnabás Póczos, András Lőrincz

Abstract: We introduce novel online Bayesian methods for the identification of a family of noisy recurrent neural networks (RNNs). We present Bayesian active learning techniques for stimulus selection given past experiences. In particular, we consider the unknown parameters as stochastic variables and use A-optimality and D-optimality principles to choose optimal stimuli. We derive myopic cost functions in order to maximize the information gain concerning network parameters at each time step. We also derive the A-optimal and D-optimal estimations of the additive noise that perturbs the dynamical system of the RNN. Here we investigate myopic as well as non-myopic estimations, and study the problem of simultaneous estimation of both the system parameters and the noise. Employing conjugate priors our derivations remain approximation-free and give rise to simple update rules for the online learning of the parameters. The efficiency of our method is demonstrated for a number of selected cases, including the task of controlled independent component analysis. Keywords: active learning, system identification, online Bayesian learning, A-optimality, Doptimality, infomax control, optimal design

same-paper 3 0.92639303 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory

Author: Junhui Wang, Xiaotong Shen, Wei Pan

Abstract: In classification, semisupervised learning usually involves a large amount of unlabeled data with only a small number of labeled data. This imposes a great challenge in that it is difficult to achieve good classification performance through labeled data alone. To leverage unlabeled data for enhancing classification, this article introduces a large margin semisupervised learning method within the framework of regularization, based on an efficient margin loss for unlabeled data, which seeks efficient extraction of the information from unlabeled data for estimating the Bayes decision boundary for classification. For implementation, an iterative scheme is derived through conditional expectations. Finally, theoretical and numerical analyses are conducted, in addition to an application to gene function prediction. They suggest that the proposed method is able to recover the performance of its supervised counterpart based on complete data in rates of convergence, when possible. Keywords: difference convex programming, classification, nonconvex minimization, regularization, support vectors

4 0.51866078 9 jmlr-2009-Analysis of Perceptron-Based Active Learning

Author: Sanjoy Dasgupta, Adam Tauman Kalai, Claire Monteleoni

Abstract: We start by showing that in an active learning setting, the Perceptron algorithm needs Ω(1/ε²) labels to learn linear separators within generalization error ε. We then present a simple active learning algorithm for this problem, which combines a modification of the Perceptron update with an adaptive filtering rule for deciding which points to query. For data distributed uniformly over the unit sphere, we show that our algorithm reaches generalization error ε after asking for just Õ(d log(1/ε)) labels. This exponential improvement over the usual sample complexity of supervised learning had previously been demonstrated only for the computationally more complex query-by-committee algorithm. Keywords: active learning, perceptron, label complexity bounds, online learning

5 0.49771515 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression

Author: Tong Zhang

Abstract: This paper studies the feature selection problem using a greedy least squares regression algorithm. We show that under a certain irrepresentable condition on the design matrix (but independent of the sparse target), the greedy algorithm can select features consistently when the sample size approaches infinity. The condition is identical to a corresponding condition for Lasso. Moreover, under a sparse eigenvalue condition, the greedy algorithm can reliably identify features as long as each nonzero coefficient is larger than a constant times the noise level. In comparison, Lasso may require the coefficients to be larger than O(√s) times the noise level in the worst case, where s is the number of nonzero coefficients. Keywords: greedy algorithm, feature selection, sparsity

6 0.47776544 82 jmlr-2009-Robustness and Regularization of Support Vector Machines

7 0.4659251 10 jmlr-2009-Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification

8 0.45314217 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods

9 0.44667438 5 jmlr-2009-Adaptive False Discovery Rate Control under Independence and Dependence

10 0.44583318 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs

11 0.4445661 28 jmlr-2009-Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks

12 0.44355103 19 jmlr-2009-Controlling the False Discovery Rate of the Association Causality Structure Learned with the PC Algorithm    (Special Topic on Mining and Learning with Graphs and Relations)

13 0.42671865 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models

14 0.42392188 38 jmlr-2009-Hash Kernels for Structured Data

15 0.42355105 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model

16 0.42058963 1 jmlr-2009-A Least-squares Approach to Direct Importance Estimation

17 0.41562781 89 jmlr-2009-Strong Limit Theorems for the Bayesian Scoring Criterion in Bayesian Networks

18 0.41021627 18 jmlr-2009-Consistency and Localizability

19 0.40859625 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences

20 0.40157643 87 jmlr-2009-Sparse Online Learning via Truncated Gradient