jmlr jmlr2009 jmlr2009-23 knowledge-graph by maker-knowledge-mining

23 jmlr-2009-Discriminative Learning Under Covariate Shift


Source: pdf

Author: Steffen Bickel, Michael Brückner, Tobias Scheffer

Abstract: We address classification problems for which the training instances are governed by an input distribution that is allowed to differ arbitrarily from the test distribution—problems also referred to as classification under covariate shift. We derive a solution that is purely discriminative: neither training nor test distribution are modeled explicitly. The problem of learning under covariate shift can be written as an integrated optimization problem. Instantiating the general optimization problem leads to a kernel logistic regression and an exponential model classifier for covariate shift. The optimization problem is convex under certain conditions; our findings also clarify the relationship to the known kernel mean matching procedure. We report on experiments on problems of spam filtering, text classification, and landmine detection. Keywords: covariate shift, discriminative learning, transfer learning

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We address classification problems for which the training instances are governed by an input distribution that is allowed to differ arbitrarily from the test distribution—problems also referred to as classification under covariate shift. [sent-8, score-0.765]

2 The problem of learning under covariate shift can be written as an integrated optimization problem. [sent-10, score-0.837]

3 Instantiating the general optimization problem leads to a kernel logistic regression and an exponential model classifier for covariate shift. [sent-11, score-0.749]

4 The optimization problem is convex under certain conditions; our findings also clarify the relationship to the known kernel mean matching procedure. [sent-12, score-0.47]

5 Keywords: covariate shift, discriminative learning, transfer learning. [sent-14, score-0.379]

6 The covariate shift model and the missing at random case in the sample selection bias model allow for differences between the training and test distribution of instances; the conditional distribution of the class variable given the instance is constant over training and test set. [sent-19, score-1.262]

7 In the covariate shift problem setting, a training sample is available in matrix XL with row vectors x1 , . [sent-20, score-0.589]

8 We contribute a discriminative model for learning under different training and test distributions. [sent-40, score-0.414]

9 The model directly characterizes the divergence between training and test distribution, without the intermediate—intrinsically model-based—step of estimating training and test distribution. [sent-41, score-0.64]

10 We formulate the search for all model parameters as an integrated optimization problem. [sent-42, score-0.413]

11 This complements the predominant procedure of first estimating the bias of the training sample, and then learning the classifier on a weighted version of the training sample. [sent-43, score-0.35]

12 We show that the integrated optimization can be convex, depending on the model type; it is convex for the exponential model. [sent-44, score-0.521]

13 We derive a Newton gradient descent procedure, leading to a kernel logistic regression and an exponential model classifier for covariate shift. [sent-45, score-0.723]

14 After reviewing models for differing training and test distributions in Section 2, we introduce our integrated model in Section 3. [sent-46, score-0.64]

15 We derive primal and kernelized classifiers for differing training and test distributions in Sections 4 and 5. [sent-47, score-0.372]

16 In Section 6, we analyze the convexity of the integrated optimization problem. [sent-48, score-0.437]

17 Section 7 describes an approximation to the joint optimization problem and Section 8 reveals a new interpretation of kernel mean matching and analyzes the relationship to our model. [sent-49, score-0.424]

18 In Section 9 we discuss different tuning procedures for learning under covariate shift. [sent-50, score-0.441]

19 New findings (Section 6) show that the integrated optimization problem can in fact be convex when the loss function is chosen appropriately. [sent-55, score-0.461]

20 Prior Work: If training and test distributions were known, then the loss on the test distribution could be minimized by weighting the loss on the training distribution with an instance-specific factor. [sent-63, score-0.754]
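
Sentence 20 states the classical importance-weighting argument: if both densities were known, the expected test loss could be estimated from the training sample by weighting each training loss with the test-to-training density ratio. A minimal numpy sketch of that weighted risk estimate (the function name, and the assumption that the ratios are already given, are this sketch's, not the paper's):

```python
import numpy as np

def importance_weighted_risk(losses, density_ratios):
    """Estimate the expected test loss from training instances.

    losses         : per-instance training losses l(f(x_i), y_i)
    density_ratios : p_test(x_i) / p_train(x_i) for the same instances
    """
    return float(np.mean(density_ratios * losses))
```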

21 This shows that covariate shift can only be compensated for as long as the training distribution covers the entire support of the test distribution. [sent-70, score-0.778]

22 If a test instance had zero density under the training distribution, the test-to-training density ratio which it would need to be scaled with would incur a zero denominator. [sent-71, score-0.454]

23 A straightforward approach to compensating for covariate shift is to first obtain estimates p̂(x|θ) and p̂(x|λ) from the test and training data, respectively, using kernel density estimation (Shimodaira, 2000; Sugiyama and Müller, 2005). [sent-73, score-0.922]
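
Sentence 23 describes the classical two-step baseline: estimate p̂(x|θ) and p̂(x|λ) separately by kernel density estimation and weight each training instance by their ratio. A hedged sketch using scikit-learn's KernelDensity (the fixed bandwidth and the clipping of extreme log-ratios are illustrative choices, not values from the paper):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def kde_density_ratio_weights(X_train, X_test, bandwidth=1.0):
    """Approximate p_test(x) / p_train(x) at the training points via two KDEs."""
    kde_test = KernelDensity(bandwidth=bandwidth).fit(X_test)
    kde_train = KernelDensity(bandwidth=bandwidth).fit(X_train)
    # score_samples returns log-densities, so the ratio is exp of the difference
    log_ratio = kde_test.score_samples(X_train) - kde_train.score_samples(X_train)
    return np.exp(np.clip(log_ratio, -10.0, 10.0))  # clip to avoid huge or zero weights
```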

24 For instances in the training set (σ = 1) a label is drawn from p(y|x), for the instances in the rejected set the labels are unknown. [sent-85, score-0.353]

25 The distribution of the selector variable then maps the test onto the training distribution: p(x|λ) ∝ p(x|θ)p(σ = 1|x, θ, λ). [sent-90, score-0.437]

26 No test examples drawn directly from p(x|θ) are needed to train the model, only labeled selected and unlabeled rejected examples are required. [sent-94, score-0.357]

27 This is in contrast to the covariate shift model that requires samples drawn from the test distribution, but no selection process is assumed and no rejected examples are needed. [sent-95, score-0.778]

28 Covariate shift models can be applied to learning under sample selection bias in the missing at random setting by treating the selected examples as labeled sample and the union of selected (ignoring the labels) and rejected examples as unlabeled sample. [sent-96, score-0.471]

29 Propensity scores (Rosenbaum and Rubin, 1983; Lunceford and Davidian, 2004) are applied in settings related to sample selection bias; the training data is again assumed to be drawn from the test distribution p(x|θ) followed by a selection process. [sent-97, score-0.392]

30 Weighting the selected examples by the inverse of the propensity score p(σ = 1|x, λ, θ)−1 and weighting the rejected examples by p(σ = −1|x, λ, θ)−1 results in two unbiased samples with respect to the test distribution. [sent-99, score-0.39]

31 , 2007) is a two-step method that first finds weights for the training instances such that the first momentum of training and test sets—that is, their mean value— matches in feature space. [sent-112, score-0.517]

32 , 2008) estimates resampling weights for the training examples by minimizing the Kullback-Leibler divergence between the test distribution and the weighted training distribution. [sent-122, score-0.447]

33 A regular maximum a posteriori estimation w′ = argmaxw p(y|XL ; w)p(w), would only use the training data (y, XL ) governed by p(x|λ). [sent-128, score-0.345]

34 By ignoring the test data, this estimate will not generally result in a model that predicts the missing labels of the test data with a minimum error because the training distribution p(x|λ) is different from the test distribution p(x|θ). [sent-129, score-0.698]

35 The model is based on a binary selector variable s: Given an instance vector x from the joint matrix X of all available instances, selector variable s decides whether x is drawn into the test data XT (s = −1) or into the training data XL (s = 1) in which case y is determined. [sent-132, score-0.558]

36 For all selected training examples (all examples xi with si = 1) draw vector y of all labels from p(y|s, X; w, v). [sent-138, score-0.404]

37 When p(s = −1) > 0, which is implied by the test set not being empty, the definition of s allows us to rewrite the test distribution as p(x|θ) = p(x|s = −1, θ). [sent-159, score-0.344]

38 The conditional p(s = 1|x, θ, λ) discriminates training (s = 1) against test instances (s = −1). [sent-165, score-0.396]

39 p(x|θ)/p(x|λ) = p(x|s = −1, θ, λ)/p(x|s = 1, θ, λ) = [p(s = 1|θ, λ)/p(s = −1|θ, λ)] · [p(s = −1|x, θ, λ)/p(s = 1|x, θ, λ)] = [p(s = 1|θ, λ)/p(s = −1|θ, λ)] · (1/p(s = 1|x, θ, λ) − 1) (Equations 9–12). The significance of Equation 12 is that it shows how the optimal example weights, the test-to-training ratio p(x|θ)/p(x|λ), can be determined without knowledge of either training or test density. [sent-167, score-0.463]

40 The right hand side of Equation 12 can be evaluated based on a model that discriminates training against test examples and outputs how much more likely an instance is to occur in the test data than it is to occur in the training data. [sent-168, score-0.649]
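
Sentences 39–40 contain the key identity: the weight p(x|θ)/p(x|λ) follows from a probabilistic model that discriminates training (s = 1) from test (s = −1) instances, without estimating either density. A hedged sketch that uses a scikit-learn logistic regression as a stand-in for the paper's discriminator p(s = 1|x; v) (an illustration of Equation 12, not the authors' implementation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def covariate_shift_weights(X_train, X_test, C=1.0):
    """Weights proportional to p(x|theta)/p(x|lambda) via Equation 12:
    ratio = p(s=1)/p(s=-1) * (1 / p(s=1|x) - 1)."""
    X = np.vstack([X_train, X_test])
    s = np.concatenate([np.ones(len(X_train)), -np.ones(len(X_test))])
    disc = LogisticRegression(C=C, max_iter=1000).fit(X, s)
    # column of predict_proba that corresponds to the training class s = +1
    col = list(disc.classes_).index(1.0)
    p_s1 = disc.predict_proba(X_train)[:, col]
    prior_ratio = len(X_train) / len(X_test)   # empirical p(s=1)/p(s=-1)
    return prior_ratio * (1.0 / p_s1 - 1.0)
```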

41 By using this likelihood p(y|XL; w) instead of p(y|s, X; w, v), one would wrongly assume that the training data XL was governed by the test distribution p(x|θ). [sent-173, score-0.43]

42 Optimization Problem for Integrated Model: The likelihood function p(s|X; v) resolves to p(si = 1|xi; v) for all training instances and p(si = −1|xi; v) for all test instances: p(s|X; v) = ∏i=1..m p(si = 1|xi; v) · ∏i=m+1..m+n p(si = −1|xi; v). [sent-187, score-0.402]

43 Optimization Problem 1 is derived from Equation 16 in logarithmic form, using linear models vT xi and wT xi and a logistic model for p(s = 1|x; v). [sent-196, score-0.38]

44 Kernelized Learning Algorithm We derive a kernelized version of the integrated classifier for differing training and test distributions. [sent-210, score-0.639]

45 With this theorem we can easily check whether the integrated classifier for covariate shift is convex for specific models of the negative log-likelihood functions. [sent-231, score-0.773]

46 We use a logistic model ℓv(si vT x) = log(1 + exp(−si vT x)); the abbreviations of Appendix A can now be expanded: ℓ′v,i si xij = −[exp(−si vT xi)/(1 + exp(−si vT xi))] si xij ; ℓ′′v,i xij xik = [exp(−si vT xi)/(1 + exp(−si vT xi))²] xij xik. [sent-248, score-1.481]

47 For the logistic model the derivatives of ℓw are the same as for ℓv , only v needs to be replaced by w and si by yi . [sent-250, score-0.422]

48 For an exponential model with ℓw(yi wT x) = exp(−yi wT x) the abbreviations are expanded as follows: ℓ′w,i yi xij = −exp(−yi wT xi) yi xij ; ℓ′′w,i xij xik = exp(−yi wT xi) xij xik. [sent-251, score-0.931]
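
Sentences 46–48 expand the per-instance derivatives that the Newton procedure needs, for the logistic loss ℓ(z) = log(1 + exp(−z)) and the exponential loss ℓ(z) = exp(−z) with margin z = si vT xi or z = yi wT xi. A small numpy sketch of those derivative factors (a direct transcription of the standard calculus, not code from the paper):

```python
import numpy as np

def logistic_derivs(z):
    """l(z) = log(1 + exp(-z)); returns (l(z), l'(z), l''(z))."""
    e = np.exp(-z)
    return np.log1p(e), -e / (1.0 + e), e / (1.0 + e) ** 2

def exponential_derivs(z):
    """l(z) = exp(-z); returns (l(z), l'(z), l''(z))."""
    e = np.exp(-z)
    return e, -e, e
```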

49 Using Theorem 4 we can now easily check the convexity of the integrated classifier with logistic model and with exponential model for ℓw. [sent-252, score-0.599]

50 Corollary 5: With a logistic model for ℓw, the condition of Equation 24 is violated and therefore Optimization Problem 1 with a logistic model for ℓw may not be convex in general. [sent-253, score-0.394]

51 w∗ = argmaxw ∫ p(w, v|y, s, X) dv = argmaxw ∫ p(w|y, s, X; v) p(v|y, s, X) dv ≈ argmaxw p(w|y, s, X; vMAP′) = argmaxw p(y|s, X; w, vMAP′) p(w) (Equations 26–27), with vMAP′ = argmaxv p(v|y, s, X) = argmaxv p(s|y, X; v) p(v) (Equation 29). [sent-281, score-0.484]

52 Huang et al. (2007) motivate the kernel mean matching algorithm as a procedure that minimizes the distance between the means of unlabeled and weighted labeled data in feature space. [sent-295, score-0.426]

53 In order to solve the second stage (Optimization Problem 3), kernel mean matching does not use re-weighting factors [p(s=1)/p(s=−1)] exp(−vT xi − b) but directly uses the dual αi parameters as weights. [sent-314, score-0.48]
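
Sentences 52–53 describe kernel mean matching: it searches directly for non-negative instance weights that make the weighted training mean match the test mean in feature space, instead of parameterizing the weights as [p(s=1)/p(s=−1)] exp(−vT x − b). A hedged sketch of the underlying quadratic program with an RBF kernel, solved here with SciPy's SLSQP (the bound B, the slack eps, and the kernel width are illustrative defaults, not values from the paper):

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics.pairwise import rbf_kernel

def kernel_mean_matching(X_train, X_test, gamma=1.0, B=10.0, eps=None):
    """Weights beta minimizing the distance between the weighted training mean
    and the test mean in the feature space induced by an RBF kernel."""
    m, n = len(X_train), len(X_test)
    K = rbf_kernel(X_train, X_train, gamma=gamma)
    kappa = (m / n) * rbf_kernel(X_train, X_test, gamma=gamma).sum(axis=1)
    eps = eps if eps is not None else B / np.sqrt(m)

    objective = lambda b: 0.5 * b @ K @ b - kappa @ b
    gradient = lambda b: K @ b - kappa
    constraints = [  # keep the weights roughly normalized: |sum(beta) - m| <= m * eps
        {"type": "ineq", "fun": lambda b: m * (1 + eps) - b.sum()},
        {"type": "ineq", "fun": lambda b: b.sum() - m * (1 - eps)},
    ]
    result = minimize(objective, np.ones(m), jac=gradient, bounds=[(0.0, B)] * m,
                      constraints=constraints, method="SLSQP")
    return result.x
```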

54 It discriminates training against test examples using a partially Rocchio-style approximation to the SVM optimization criterion. [sent-316, score-0.439]

55 For the two-stage approximation of Section 7 and reference methods like kernel mean matching, two similar parameters need to be specified. [sent-319, score-0.411]

56 Parameter tuning for covariate shift models is much more difficult than for regular prediction models because in the covariate shift setting there is no labeled data available drawn from the test distribution. [sent-321, score-1.242]

57 Parameter tuning by regular cross-validation on the labeled training data is inappropriate because the labeled training data is not governed by the test distribution. [sent-322, score-0.719]

58 The one without prior knowledge cannot be used for kernel mean matching and the one-stage model of Optimization Problem 1. [sent-325, score-0.35]

59 A typical setting with prior knowledge on the hyper-parameters is when the difference between training and test data is introduced by a covariate shift over time and the input distribution shifts constantly over time. [sent-326, score-0.778]

60 The most recent data is the unlabeled test data and the older data has been labeled and is the training data. [sent-327, score-0.396]

61 Another setting with prior knowledge is when, in addition to the pair of training and test sets, an additional pair of training and fully labeled test sets from a different domain with a similar magnitude of covariate shift is available. [sent-330, score-1.072]

62 Due to the similar magnitude of the covariate shift the optimal parameters for the additional domain are assumed to be a good choice for the parameters of the target domain. [sent-332, score-0.46]

63 For some two-stage models for covariate shift there is no prior knowledge necessary to tune hyper-parameters. [sent-333, score-0.518]

64 The training and the test data are both split into training and tuning folds and the hold-out likelihood of the tuning folds is optimized with grid search on σ2 (and kernel parameters). [sent-337, score-0.898]
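
Sentence 64 describes a tuning step that needs no class labels: split both the training and the test sample into fitting and tuning folds and pick the first-stage regularizer by the hold-out likelihood of the train-vs-test discriminator. A hedged sketch of that grid search (the grid, split size, and use of scikit-learn's LogisticRegression are this sketch's choices, not the paper's):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def tune_stage1_regularizer(X_train, X_test, C_grid=(0.01, 0.1, 1.0, 10.0), seed=0):
    """Choose the stage-1 regularization strength by held-out log-likelihood
    of the classifier that separates training (1) from test (0) instances."""
    X = np.vstack([X_train, X_test])
    s = np.concatenate([np.ones(len(X_train), dtype=int), np.zeros(len(X_test), dtype=int)])
    X_fit, X_hold, s_fit, s_hold = train_test_split(
        X, s, test_size=0.3, random_state=seed, stratify=s)

    best_C, best_ll = None, -np.inf
    for C in C_grid:
        clf = LogisticRegression(C=C, max_iter=1000).fit(X_fit, s_fit)
        proba = clf.predict_proba(X_hold)
        ll = np.log(proba[np.arange(len(s_hold)), s_hold] + 1e-12).sum()
        if ll > best_ll:
            best_C, best_ll = C, ll
    return best_C
```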

65 Once the regularizer of the first stage is tuned, the second-stage parameter σ2 (and kernel parameters) can be tuned with cross-validation on weighted training data (Sugiyama and Müller, 2005). [sent-339, score-0.417]

66 In order to compare the one-stage model and kernel mean matching to the other two-stage models we use tuning procedures based on prior knowledge in the empirical studies in the next section. [sent-343, score-0.506]

67 Empirical Results We study the benefit of two versions of the integrated classifier for covariate shift and other reference methods on spam filtering, text classification, and landmine detection problems. [sent-345, score-0.958]

68 The first integrated classifier uses a logistic model for ℓw (“integrated log model”), the second an exponential model for ℓw (“integrated exp model”); ℓv is a logistic model in both cases. [sent-346, score-0.782]

69 All other reference methods consist of a two-stage procedure: first, the difference between training and test distribution is estimated; in a second step, the classifier is trained on weighted data. [sent-348, score-0.352]

70 In the third method, separate density estimates for p(x|λ) and p(x|θ) are obtained using kernel density estimation (Shimodaira, 2000); the bandwidth of the kernel is chosen according to the rule-of-thumb of Silverman (1986). [sent-351, score-0.356]

71 The example weights are computed according to Equation 17 using a logistic model in the first stage; p(s = 1|x; v) is estimated by training a logistic regression that discriminates training from test examples. [sent-353, score-0.77]
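
Sentence 71 summarizes the two-stage reference method: a first-stage logistic regression discriminating training from test examples yields the weights of Equation 17, and the target classifier is then fitted on the reweighted labeled data. A hedged end-to-end sketch with scikit-learn (the function and parameter names are this sketch's, and the weighted linear second stage stands in for the paper's weighted kernel logistic regression):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def two_stage_covariate_shift_fit(X_train, y_train, X_test, C_v=1.0, C_w=1.0):
    """Stage 1: discriminate train (s=1) vs. test (s=-1) and derive example weights.
    Stage 2: fit the target classifier on the weighted labeled training data."""
    X = np.vstack([X_train, X_test])
    s = np.concatenate([np.ones(len(X_train)), -np.ones(len(X_test))])
    stage1 = LogisticRegression(C=C_v, max_iter=1000).fit(X, s)
    p_s1 = stage1.predict_proba(X_train)[:, list(stage1.classes_).index(1.0)]
    weights = (len(X_train) / len(X_test)) * (1.0 / p_s1 - 1.0)

    stage2 = LogisticRegression(C=C_w, max_iter=1000)
    stage2.fit(X_train, y_train, sample_weight=weights)
    return stage2, weights
```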

72 (2007); the collection contains nine different inboxes with test emails (5270 to 10964 emails, depending on inbox) and one set of training emails compiled from various different sources. [sent-361, score-0.528]

73 We repeat this process 10 times for 2048 test emails and 20 to 640 times for 1024 to 32 test emails. [sent-364, score-0.407]

74 As tuning data we use the labeled emails from an additional inbox different from the test inboxes. [sent-365, score-0.493]

75 We analyze the rank of the kernel matrix and find that it fulfills the universal kernel requirement of kernel mean matching; this is due to the high-dimensionality of the data. [sent-368, score-0.391]

76 The left column of Figure 1 compares the integrated classifiers for covariate shift to the kernel mean matching and kernel density estimation baselines. [sent-370, score-1.219]

77 The integrated and two-step logistic regression and exponential models and kernel mean matching perform similarly well. [sent-376, score-0.781]

78 For 1048 examples the 1 − AUC risk is even reduced by an average of 30% with the integrated exponential model classifier! [sent-378, score-0.399]

79–82 [Figure 1 residue: legend entries "integrated log model", "integrated exp model", "iid baseline", "kernel mean matching", "kernel density estimation"; panel titles include "spam filtering - average of nine users" and "landmine detection"; axes plot the number of test examples (64 to 2048, per user's inbox) against the reduction of 1-AUC risk or the increase of AUC.]

83 The convex integrated exponential model performs slightly better than its two-stage approximation; for larger numbers of test examples (512 to 2048) this difference is statistically significant according to a paired t-test with significance level of 5%. [sent-402, score-0.566]

84 For the logistic model, the two-stage optimization performs similarly well as the integrated version. [sent-403, score-0.515]

85 We again use linear kernels (rank analysis verifies the universal kernel property) and average the results over 20 to 640 random test samples for different sizes (1024 for 20 samples to 32 for 640 samples) of test sets. [sent-410, score-0.428]

86 The integrated models reduce the 1 − AUC risk by 15% for 1024 test examples. [sent-416, score-0.456]

87 For this problem, the exponential model classifiers and kernel mean matching significantly outperform all other methods on average. [sent-432, score-0.412]

88 Considering only methods with logistic target model, kernel mean matching is better than all other methods. [sent-433, score-0.452]

89 Integrated logistic regression and two-stage logistic regression are still significantly better than the iid baseline except for 32 and 64 test examples. [sent-434, score-0.606]

90 Conclusion We derived a discriminative model for learning under differing training and test distributions. [sent-437, score-0.467]

91 We show that this ratio can be expressed—without modeling either training or test density—by a discriminative model that characterizes how much more likely an instance is to occur in the test sample than it is to occur in the training sample. [sent-439, score-0.784]

92 We gave a new interpretation for kernel mean matching and showed that it is also based on a discriminative model similar to Optimization Problem 2. [sent-445, score-0.444]

93 Empirically, we found that the integrated and the two-stage models as well as kernel mean matching outperform the iid baseline and the kernel density estimation model in almost all cases. [sent-446, score-0.97]

94 The performance of kernel mean matching depends on the problem; for one out of three problems it did not beat the iid baseline, for the others it yielded comparable results to the integrated models. [sent-448, score-0.701]

95 The main advantage compared to the integrated model is that regularization parameters can be tuned without prior knowledge by cross-validation. [sent-450, score-0.347]

96 ∂F(v, w)/∂vj = −∑i=1..m ωi ℓw,i xij + ∑i=1..m+n ℓ′v,i si xij + (1/σ2v) vj ; ∂F(v, w)/∂wj = ∑i=1..m ωi ℓ′w,i yi xij + (1/σ2w) wj. [sent-461, score-0.557]

97 ∂2F(v, w)/∂vj∂vk = ∑i=1..m ωi ℓw,i xij xik + ∑i=1..m+n ℓ′′v,i xij xik + (1/σ2v) δjk ; ∂2F(v, w)/∂vj∂wk = −∑i=1..m ωi ℓ′w,i yi xij xik ; ∂2F(v, w)/∂wj∂wk = ∑i=1..m ωi ℓ′′w,i xij xik + (1/σ2w) δjk. [sent-463, score-0.972]
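
Sentences 96–97 give the gradient and the block Hessian of the integrated objective F(v, w), which the paper plugs into a Newton-type update. The sketch below assembles one Newton step from exactly those expressions, assuming logistic losses for both ℓv and ℓw and ωi = exp(−vT xi) (the constant factor p(s=1)/p(s=−1) and any bias terms are dropped); it illustrates the update, it is not the authors' implementation:

```python
import numpy as np

def newton_step(v, w, X_train, y_train, X_all, s_all, sigma2_v, sigma2_w):
    """One Newton step (v, w) <- (v, w) - H^{-1} g for the integrated objective,
    with g and H built from the formulas in sentences 96-97."""
    def logistic_derivs(z):                       # l(z) = log(1 + exp(-z))
        e = np.exp(-z)
        return np.log1p(e), -e / (1.0 + e), e / (1.0 + e) ** 2

    lw0, lw1, lw2 = logistic_derivs(y_train * (X_train @ w))   # target-model loss terms
    _, lv1, lv2 = logistic_derivs(s_all * (X_all @ v))         # selection-model loss terms
    omega = np.exp(-(X_train @ v))                             # example weights omega_i(v)

    # gradient (sentence 96)
    g_v = -X_train.T @ (omega * lw0) + X_all.T @ (lv1 * s_all) + v / sigma2_v
    g_w = X_train.T @ (omega * lw1 * y_train) + w / sigma2_w

    # block Hessian (sentence 97)
    H_vv = X_train.T @ (X_train * (omega * lw0)[:, None]) \
        + X_all.T @ (X_all * lv2[:, None]) + np.eye(len(v)) / sigma2_v
    H_vw = -X_train.T @ (X_train * (omega * lw1 * y_train)[:, None])
    H_ww = X_train.T @ (X_train * (omega * lw2)[:, None]) + np.eye(len(w)) / sigma2_w

    H = np.block([[H_vv, H_vw], [H_vw.T, H_ww]])
    step = np.linalg.solve(H, np.concatenate([g_v, g_w]))
    return v - step[:len(v)], w - step[len(v):]
```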

98 Improving predictive inference under covariate shift by weighting the log-likelihood function. [sent-568, score-0.502]

99 Direct importance estimation with model selection and its application to covariate shift adaptation. [sent-586, score-0.533]

100 Direct density ratio estimation for large-scale covariate shift adaptation. [sent-594, score-0.57]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('covariate', 0.285), ('integrated', 0.267), ('vt', 0.179), ('shift', 0.175), ('si', 0.172), ('matching', 0.159), ('cheffer', 0.156), ('ickel', 0.156), ('iscriminative', 0.156), ('ovariate', 0.156), ('uckner', 0.156), ('test', 0.155), ('xl', 0.144), ('wt', 0.139), ('logistic', 0.138), ('hift', 0.132), ('training', 0.129), ('bickel', 0.127), ('tuning', 0.123), ('auc', 0.121), ('argmaxw', 0.121), ('landmine', 0.121), ('vmap', 0.121), ('xik', 0.121), ('iid', 0.12), ('selector', 0.119), ('kernel', 0.118), ('optimization', 0.11), ('xi', 0.103), ('propensity', 0.103), ('nder', 0.101), ('emails', 0.097), ('governed', 0.095), ('discriminative', 0.094), ('rejected', 0.09), ('potsdam', 0.087), ('newton', 0.084), ('sugiyama', 0.079), ('spam', 0.076), ('yi', 0.076), ('inbox', 0.074), ('wmap', 0.069), ('exp', 0.069), ('unlabeled', 0.068), ('instances', 0.067), ('stage', 0.063), ('exponential', 0.062), ('density', 0.06), ('convexity', 0.06), ('er', 0.059), ('xt', 0.059), ('cora', 0.059), ('tune', 0.058), ('bias', 0.057), ('equation', 0.056), ('baseline', 0.055), ('differing', 0.053), ('scheffer', 0.053), ('mine', 0.053), ('rocchio', 0.052), ('huang', 0.052), ('likelihood', 0.051), ('classi', 0.05), ('nine', 0.05), ('ratio', 0.05), ('gradient', 0.049), ('convex', 0.046), ('virtually', 0.045), ('discriminates', 0.045), ('tuned', 0.044), ('kliep', 0.044), ('shimodaira', 0.044), ('zadrozny', 0.044), ('xg', 0.044), ('labeled', 0.044), ('weighting', 0.042), ('lt', 0.04), ('earning', 0.04), ('diag', 0.038), ('loss', 0.038), ('mean', 0.037), ('selection', 0.037), ('characterizes', 0.036), ('model', 0.036), ('descent', 0.035), ('folds', 0.035), ('kernelized', 0.035), ('dudik', 0.035), ('incidence', 0.035), ('japkowicz', 0.035), ('lunceford', 0.035), ('manski', 0.035), ('predominant', 0.035), ('rosenbaum', 0.035), ('reference', 0.034), ('customers', 0.034), ('distribution', 0.034), ('risk', 0.034), ('ltering', 0.033), ('procedures', 0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999934 23 jmlr-2009-Discriminative Learning Under Covariate Shift

Author: Steffen Bickel, Michael Brückner, Tobias Scheffer

Abstract: We address classification problems for which the training instances are governed by an input distribution that is allowed to differ arbitrarily from the test distribution—problems also referred to as classification under covariate shift. We derive a solution that is purely discriminative: neither training nor test distribution are modeled explicitly. The problem of learning under covariate shift can be written as an integrated optimization problem. Instantiating the general optimization problem leads to a kernel logistic regression and an exponential model classifier for covariate shift. The optimization problem is convex under certain conditions; our findings also clarify the relationship to the known kernel mean matching procedure. We report on experiments on problems of spam filtering, text classification, and landmine detection. Keywords: covariate shift, discriminative learning, transfer learning

2 0.18371046 9 jmlr-2009-Analysis of Perceptron-Based Active Learning

Author: Sanjoy Dasgupta, Adam Tauman Kalai, Claire Monteleoni

Abstract: We start by showing that in an active learning setting, the Perceptron algorithm needs Ω( ε12 ) labels to learn linear separators within generalization error ε. We then present a simple active learning algorithm for this problem, which combines a modification of the Perceptron update with an adaptive filtering rule for deciding which points to query. For data distributed uniformly over the unit ˜ sphere, we show that our algorithm reaches generalization error ε after asking for just O(d log 1 ) ε labels. This exponential improvement over the usual sample complexity of supervised learning had previously been demonstrated only for the computationally more complex query-by-committee algorithm. Keywords: active learning, perceptron, label complexity bounds, online learning

3 0.13557895 65 jmlr-2009-On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality

Author: Wenxin Jiang

Abstract: The statistical learning theory of risk minimization depends heavily on probability bounds for uniform deviations of the empirical risks. Classical probability bounds using Hoeffding’s inequality cannot accommodate more general situations with unbounded loss and dependent data. The current paper introduces an inequality that extends Hoeffding’s inequality to handle these more general situations. We will apply this inequality to provide probability bounds for uniform deviations in a very general framework, which can involve discrete decision rules, unbounded loss, and a dependence structure that can be more general than either martingale or strong mixing. We will consider two examples with high dimensional predictors: autoregression (AR) with ℓ1 -loss, and ARX model with variable selection for sign classification, which uses both lagged responses and exogenous predictors. Keywords: dependence, empirical risk, probability bound, unbounded loss, uniform deviation

4 0.12727655 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent

Author: Antoine Bordes, Léon Bottou, Patrick Gallinari

Abstract: The SGD-QN algorithm is a stochastic gradient descent algorithm that makes careful use of secondorder information and splits the parameter update into independently scheduled components. Thanks to this design, SGD-QN iterates nearly as fast as a first-order stochastic gradient descent but requires less iterations to achieve the same accuracy. This algorithm won the “Wild Track” of the first PASCAL Large Scale Learning Challenge (Sonnenburg et al., 2008). Keywords: support vector machine, stochastic gradient descent

5 0.11612451 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting

Author: John Duchi, Yoram Singer

Abstract: We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1 . We derive concrete and very simple algorithms for minimization of loss functions with ℓ1 , ℓ2 , ℓ2 , and ℓ∞ regularization. We also show how to construct ef2 ficient algorithms for mixed-norm ℓ1 /ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural data sets. Keywords: subgradient methods, group sparsity, online learning, convex optimization

6 0.10390235 87 jmlr-2009-Sparse Online Learning via Truncated Gradient

7 0.10284683 56 jmlr-2009-Model Monitor (M2): Evaluating, Comparing, and Monitoring Models    (Machine Learning Open Source Software Paper)

8 0.099412426 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms

9 0.093986996 29 jmlr-2009-Estimating Labels from Label Proportions

10 0.075805098 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences

11 0.075727664 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory

12 0.071971104 2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

13 0.071592428 90 jmlr-2009-Structure Spaces

14 0.071545072 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures

15 0.070871577 1 jmlr-2009-A Least-squares Approach to Direct Importance Estimation

16 0.064492382 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification

17 0.061059795 50 jmlr-2009-Learning When Concepts Abound

18 0.060787987 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers

19 0.060609855 38 jmlr-2009-Hash Kernels for Structured Data

20 0.058580946 8 jmlr-2009-An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.343), (1, 0.093), (2, 0.083), (3, -0.084), (4, 0.194), (5, -0.014), (6, 0.006), (7, 0.013), (8, -0.023), (9, -0.09), (10, 0.206), (11, -0.01), (12, 0.042), (13, -0.137), (14, 0.041), (15, -0.181), (16, 0.347), (17, 0.006), (18, -0.022), (19, -0.114), (20, 0.134), (21, -0.041), (22, 0.075), (23, -0.112), (24, 0.088), (25, 0.04), (26, 0.03), (27, 0.018), (28, 0.006), (29, -0.097), (30, 0.015), (31, -0.041), (32, 0.024), (33, 0.065), (34, -0.055), (35, 0.054), (36, -0.031), (37, 0.099), (38, 0.003), (39, -0.017), (40, 0.017), (41, -0.001), (42, 0.042), (43, 0.022), (44, 0.026), (45, 0.066), (46, -0.054), (47, 0.006), (48, 0.003), (49, 0.038)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94879055 23 jmlr-2009-Discriminative Learning Under Covariate Shift

Author: Steffen Bickel, Michael Brückner, Tobias Scheffer

Abstract: We address classification problems for which the training instances are governed by an input distribution that is allowed to differ arbitrarily from the test distribution—problems also referred to as classification under covariate shift. We derive a solution that is purely discriminative: neither training nor test distribution are modeled explicitly. The problem of learning under covariate shift can be written as an integrated optimization problem. Instantiating the general optimization problem leads to a kernel logistic regression and an exponential model classifier for covariate shift. The optimization problem is convex under certain conditions; our findings also clarify the relationship to the known kernel mean matching procedure. We report on experiments on problems of spam filtering, text classification, and landmine detection. Keywords: covariate shift, discriminative learning, transfer learning

2 0.66644901 9 jmlr-2009-Analysis of Perceptron-Based Active Learning

Author: Sanjoy Dasgupta, Adam Tauman Kalai, Claire Monteleoni

Abstract: We start by showing that in an active learning setting, the Perceptron algorithm needs Ω( ε12 ) labels to learn linear separators within generalization error ε. We then present a simple active learning algorithm for this problem, which combines a modification of the Perceptron update with an adaptive filtering rule for deciding which points to query. For data distributed uniformly over the unit ˜ sphere, we show that our algorithm reaches generalization error ε after asking for just O(d log 1 ) ε labels. This exponential improvement over the usual sample complexity of supervised learning had previously been demonstrated only for the computationally more complex query-by-committee algorithm. Keywords: active learning, perceptron, label complexity bounds, online learning

3 0.5438447 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms

Author: Yihua Chen, Eric K. Garcia, Maya R. Gupta, Ali Rahimi, Luca Cazzanti

Abstract: This paper reviews and extends the field of similarity-based classification, presenting new analyses, algorithms, data sets, and a comprehensive set of experimental results for a rich collection of classification problems. Specifically, the generalizability of using similarities as features is analyzed, design goals and methods for weighting nearest-neighbors for similarity-based learning are proposed, and different methods for consistently converting similarities into kernels are compared. Experiments on eight real data sets compare eight approaches and their variants to similarity-based learning. Keywords: similarity, dissimilarity, similarity-based learning, indefinite kernels

4 0.52667236 56 jmlr-2009-Model Monitor (M2): Evaluating, Comparing, and Monitoring Models    (Machine Learning Open Source Software Paper)

Author: Troy Raeder, Nitesh V. Chawla

Abstract: This paper presents Model Monitor (M 2 ), a Java toolkit for robustly evaluating machine learning algorithms in the presence of changing data distributions. M 2 provides a simple and intuitive framework in which users can evaluate classifiers under hypothesized shifts in distribution and therefore determine the best model (or models) for their data under a number of potential scenarios. Additionally, M 2 is fully integrated with the WEKA machine learning environment, so that a variety of commodity classifiers can be used if desired. Keywords: machine learning, open-source software, distribution shift, scenario analysis

5 0.46567115 65 jmlr-2009-On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality

Author: Wenxin Jiang

Abstract: The statistical learning theory of risk minimization depends heavily on probability bounds for uniform deviations of the empirical risks. Classical probability bounds using Hoeffding’s inequality cannot accommodate more general situations with unbounded loss and dependent data. The current paper introduces an inequality that extends Hoeffding’s inequality to handle these more general situations. We will apply this inequality to provide probability bounds for uniform deviations in a very general framework, which can involve discrete decision rules, unbounded loss, and a dependence structure that can be more general than either martingale or strong mixing. We will consider two examples with high dimensional predictors: autoregression (AR) with ℓ1 -loss, and ARX model with variable selection for sign classification, which uses both lagged responses and exogenous predictors. Keywords: dependence, empirical risk, probability bound, unbounded loss, uniform deviation

6 0.45236936 1 jmlr-2009-A Least-squares Approach to Direct Importance Estimation

7 0.42669964 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures

8 0.42580914 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent

9 0.39095938 50 jmlr-2009-Learning When Concepts Abound

10 0.39036286 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences

11 0.38827312 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory

12 0.3873421 82 jmlr-2009-Robustness and Regularization of Support Vector Machines

13 0.3781462 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting

14 0.37407991 29 jmlr-2009-Estimating Labels from Label Proportions

15 0.36033499 38 jmlr-2009-Hash Kernels for Structured Data

16 0.35779014 87 jmlr-2009-Sparse Online Learning via Truncated Gradient

17 0.3558003 69 jmlr-2009-Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization

18 0.32586405 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification

19 0.32300675 8 jmlr-2009-An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems

20 0.31420439 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.017), (11, 0.016), (19, 0.012), (21, 0.01), (26, 0.016), (38, 0.027), (52, 0.605), (55, 0.047), (58, 0.03), (66, 0.074), (68, 0.025), (90, 0.032), (96, 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9727037 23 jmlr-2009-Discriminative Learning Under Covariate Shift

Author: Steffen Bickel, Michael Brückner, Tobias Scheffer

Abstract: We address classification problems for which the training instances are governed by an input distribution that is allowed to differ arbitrarily from the test distribution—problems also referred to as classification under covariate shift. We derive a solution that is purely discriminative: neither training nor test distribution are modeled explicitly. The problem of learning under covariate shift can be written as an integrated optimization problem. Instantiating the general optimization problem leads to a kernel logistic regression and an exponential model classifier for covariate shift. The optimization problem is convex under certain conditions; our findings also clarify the relationship to the known kernel mean matching procedure. We report on experiments on problems of spam filtering, text classification, and landmine detection. Keywords: covariate shift, discriminative learning, transfer learning

2 0.94171327 50 jmlr-2009-Learning When Concepts Abound

Author: Omid Madani, Michael Connor, Wiley Greiner

Abstract: Many learning tasks, such as large-scale text categorization and word prediction, can benefit from efficient training and classification when the number of classes, in addition to instances and features, is large, that is, in the thousands and beyond. We investigate the learning of sparse class indices to address this challenge. An index is a mapping from features to classes. We compare the index-learning methods against other techniques, including one-versus-rest and top-down classification using perceptrons and support vector machines. We find that index learning is highly advantageous for space and time efficiency, at both training and classification times. Moreover, this approach yields similar and at times better accuracies. On problems with hundreds of thousands of instances and thousands of classes, the index is learned in minutes, while other methods can take hours or days. As we explain, the design of the learning update enables conveniently constraining each feature to connect to a small subset of the classes in the index. This constraint is crucial for scalability. Given an instance with l active (positive-valued) features, each feature on average connecting to d classes in the index (in the order of 10s in our experiments), update and classification take O(dl log(dl)). Keywords: index learning, many-class learning, multiclass learning, online learning, text categorization

3 0.65864629 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures

Author: Babak Shahbaba, Radford Neal

Abstract: We introduce a new nonlinear model for classification, in which we model the joint distribution of response variable, y, and covariates, x, non-parametrically using Dirichlet process mixtures. We keep the relationship between y and x linear within each component of the mixture. The overall relationship becomes nonlinear if the mixture contains more than one component, with different regression coefficients. We use simulated data to compare the performance of this new approach to alternative methods such as multinomial logit (MNL) models, decision trees, and support vector machines. We also evaluate our approach on two classification problems: identifying the folding class of protein sequences and detecting Parkinson’s disease. Our model can sometimes improve predictive accuracy. Moreover, by grouping observations into sub-populations (i.e., mixture components), our model can sometimes provide insight into hidden structure in the data. Keywords: mixture models, Dirichlet process, classification

4 0.64491618 48 jmlr-2009-Learning Nondeterministic Classifiers

Author: Juan José del Coz, Jorge Díez, Antonio Bahamonde

Abstract: Nondeterministic classifiers are defined as those allowed to predict more than one class for some entries from an input space. Given that the true class should be included in predictions and the number of classes predicted should be as small as possible, these kind of classifiers can be considered as Information Retrieval (IR) procedures. In this paper, we propose a family of IR loss functions to measure the performance of nondeterministic learners. After discussing such measures, we derive an algorithm for learning optimal nondeterministic hypotheses. Given an entry from the input space, the algorithm requires the posterior probabilities to compute the subset of classes with the lowest expected loss. From a general point of view, nondeterministic classifiers provide an improvement in the proportion of predictions that include the true class compared to their deterministic counterparts; the price to be paid for this increase is usually a tiny proportion of predictions with more than one class. The paper includes an extensive experimental study using three deterministic learners to estimate posterior probabilities: a multiclass Support Vector Machine (SVM), a Logistic Regression, and a Na¨ve Bayes. The data sets considered comprise both UCI ı multi-class learning tasks and microarray expressions of different kinds of cancer. We successfully compare nondeterministic classifiers with other alternative approaches. Additionally, we shall see how the quality of posterior probabilities (measured by the Brier score) determines the goodness of nondeterministic predictions. Keywords: nondeterministic, multiclassification, reject option, multi-label classification, posterior probabilities

5 0.62236947 58 jmlr-2009-NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM

Author: Pradip Ghanty, Samrat Paul, Nikhil R. Pal

Abstract: In this paper we propose a new multilayer classifier architecture. The proposed hybrid architecture has two cascaded modules: feature extraction module and classification module. In the feature extraction module we use the multilayered perceptron (MLP) neural networks, although other tools such as radial basis function (RBF) networks can be used. In the classification module we use support vector machines (SVMs)—here also other tool such as MLP or RBF can be used. The feature extraction module has several sub-modules each of which is expected to extract features capturing the discriminating characteristics of different areas of the input space. The classification module classifies the data based on the extracted features. The resultant architecture with MLP in feature extraction module and SVM in classification module is called NEUROSVM. The NEUROSVM is tested on twelve benchmark data sets and the performance of the NEUROSVM is found to be better than both MLP and SVM. We also compare the performance of proposed architecture with that of two ensemble methods: majority voting and averaging. Here also the NEUROSVM is found to perform better than these two ensemble methods. Further we explore the use of MLP and RBF in the classification module of the proposed architecture. The most attractive feature of NEUROSVM is that it practically eliminates the severe dependency of SVM on the choice of kernel. This has been verified with respect to both linear and non-linear kernels. We have also demonstrated that for the feature extraction module, the full training of MLPs is not needed. Keywords: feature extraction, neural networks (NNs), support vector machines (SVMs), hybrid system, majority voting, averaging c 2009 Pradip Ghanty, Samrat Paul and Nikhil R. Pal. G HANTY, PAUL AND PAL

6 0.61151391 3 jmlr-2009-A Parameter-Free Classification Method for Large Scale Learning

7 0.5963878 38 jmlr-2009-Hash Kernels for Structured Data

8 0.58565825 87 jmlr-2009-Sparse Online Learning via Truncated Gradient

9 0.57792431 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model

10 0.5530811 99 jmlr-2009-Using Local Dependencies within Batches to Improve Large Margin Classifiers

11 0.54328179 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification

12 0.52692419 10 jmlr-2009-Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification

13 0.52426225 1 jmlr-2009-A Least-squares Approach to Direct Importance Estimation

14 0.51227897 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms

15 0.50898421 70 jmlr-2009-Particle Swarm Model Selection    (Special Topic on Model Selection)

16 0.49509296 25 jmlr-2009-Distributed Algorithms for Topic Models

17 0.4844507 82 jmlr-2009-Robustness and Regularization of Support Vector Machines

18 0.4705762 15 jmlr-2009-Cautious Collective Classification

19 0.4537091 29 jmlr-2009-Estimating Labels from Label Proportions

20 0.44776502 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability