nips nips2006 nips2006-62 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jiayuan Huang, Arthur Gretton, Karsten M. Borgwardt, Bernhard Schölkopf, Alex J. Smola
Abstract: We consider the scenario where training and test data are drawn from different distributions, commonly referred to as sample selection bias. Most algorithms for this setting try to first recover sampling distributions and then make appropriate corrections based on the distribution estimate. We present a nonparametric method which directly produces resampling weights without distribution estimation. Our method works by matching distributions between training and testing sets in feature space. Experimental results demonstrate that our method works well in practice.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We consider the scenario where training and test data are drawn from different distributions, commonly referred to as sample selection bias. [sent-16, score-0.296]
2 Most algorithms for this setting try to first recover sampling distributions and then make appropriate corrections based on the distribution estimate. [sent-17, score-0.123]
3 Our method works by matching distributions between training and testing sets in feature space. [sent-19, score-0.173]
4 1 Introduction The default assumption in many learning scenarios is that training and test data are independently and identically (iid) drawn from the same distribution. [sent-21, score-0.155]
5 When the distributions on training and test set do not match, we are facing sample selection bias or covariate shift. [sent-22, score-0.384]
6 Specifically, given a domain of patterns X and labels Y, we obtain training samples Z = {(x1 , y1 ), . . . , (xm , ym )} ⊆ X × Y from a Borel probability distribution Pr(x, y), [sent-23, score-0.089]
7 and test samples Z′ = {(x′1 , y′1 ), . . . , (x′m′ , y′m′ )} ⊆ X × Y drawn from another such distribution Pr′(x, y). [sent-26, score-0.088]
8 Although there exists previous work addressing this problem [2, 5, 8, 9, 12, 16, 20], sample selection bias is typically ignored in standard estimation algorithms. [sent-30, score-0.193]
9 Nonetheless, in reality the problem occurs rather frequently: While the available data have been collected in a biased manner, the test is usually performed over a more general target population. [sent-31, score-0.18]
10 Suppose we wish to generate a model to diagnose breast cancer. [sent-34, score-0.142]
11 Suppose, moreover, that most women who participate in the breast screening test are middle-aged and likely to have attended the screening in the preceding 3 years. [sent-35, score-0.3]
12 Consequently our sample includes mostly older women and those who have low risk of breast cancer because they have been tested before. [sent-36, score-0.421]
13 The examples do not reflect the general population with respect to age (which amounts to a bias in Pr(x)) and they only contain very few diseased cases (i. [sent-37, score-0.141]
14 Gene expression profile studies using DNA microarrays are used in tumor diagnosis. [sent-41, score-0.174]
15 A common problem is that the samples are obtained using certain protocols, microarray platforms and analysis techniques. [sent-42, score-0.082]
16 The test cases are recorded under different conditions, resulting in a different distribution of gene expression values. [sent-44, score-0.154]
17 In this paper, we utilize the availability of unlabeled data to direct a sample selection de-biasing procedure for various learning methods. [sent-45, score-0.135]
18 Unlike previous work we infer the resampling weight directly by distribution matching between training and testing sets in feature space in a non-parametric manner. [sent-46, score-0.17]
19 We do not require the estimation of biased densities or selection probabilities [20, 2, 12], or the assumption that probabilities of the different classes are known [8]. [sent-47, score-0.245]
20 Rather, we account for the difference between Pr(x, y) and Pr′ (x, y) by reweighting the training points such that the means of the training and test points in a reproducing kernel Hilbert space (RKHS) are close. [sent-48, score-0.498]
21 We call this reweighting process kernel mean matching (KMM). [sent-49, score-0.346]
22 The required optimisation is a simple QP problem, and the reweighted sample can be incorporated straightforwardly into several different regression and classification algorithms. [sent-51, score-0.18]
23 We apply our method to a variety of regression and classification benchmarks from UCI and elsewhere, as well as to classification of microarrays from prostate and breast cancer patients. [sent-52, score-0.487]
24 These experiments demonstrate that KMM greatly improves learning performance compared with training on unweighted data, and that our reweighting scheme can in some cases outperform reweighting using the true sample bias distribution. [sent-53, score-0.892]
25 In other words, the conditional probabilities of y|x remain unchanged (this particular case of sample selection bias has been termed covariate shift [12]). [sent-57, score-0.253]
26 However, since typically we only observe examples (x, y) drawn from Pr(x, y) rather than Pr′(x, y), we resort to computing the empirical average Remp [Z, θ, l(x, y, θ)] = (1/m) Σ_{i=1}^m l(xi , yi , θ). [sent-62, score-0.136]
27 The training set is drawn from Pr; however, what we would really like is to minimize R[Pr′ , θ, l], as we wish to generalize to test examples drawn from Pr′ . [sent-66, score-0.242]
28 An observation from the field of importance sampling is that
R[Pr′ , θ, l(x, y, θ)] = E(x,y)∼Pr′ [l(x, y, θ)] = E(x,y)∼Pr [ (Pr′(x, y)/Pr(x, y)) l(x, y, θ) ]   (3)
= R[Pr, θ, β(x, y) l(x, y, θ)],  where β(x, y) := Pr′(x, y)/Pr(x, y),   (4)
provided that the support of Pr′ is contained in the support of Pr. [sent-67, score-0.197]
29 When Pr and Pr′ differ only in Pr(x) and Pr′ (x), we have β(x, y) = Pr′ (x)/Pr(x), where β is a reweighting factor for the training examples. [sent-71, score-0.317]
30 This is closely related to the methods in [20, 8], as they have to either estimate the selection probabilities or have prior knowledge of the class distributions. [sent-74, score-0.08]
31 Although intuitive, this approach has two major problems: first, it only works whenever the density estimates for Pr and Pr′ (or potentially, the selection probabilities or class distributions) are good. [sent-75, score-0.08]
32 Second, estimating both densities just for the purpose of computing reweighting coefficients may be overkill: we may be able to directly estimate the coefficients βi := β(xi , yi ) without having to estimate the two distributions. [sent-77, score-0.307]
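To make the contrast concrete, the naive route implied by (3)-(4) would estimate both densities and plug their ratio into the empirical risk. Below is a minimal numpy sketch of that estimate; the helper name is ours, and the densities are assumed to be available, which is precisely the assumption the present method drops.

```python
import numpy as np

def reweighted_empirical_risk(losses, p_test, p_train):
    """Importance-sampling estimate of R[Pr', theta, l] from a biased training sample.

    losses[i]  -- l(x_i, y_i, theta) evaluated on the i-th training point
    p_test[i]  -- Pr'(x_i), p_train[i] -- Pr(x_i); assumed known here for illustration only.
    """
    beta = p_test / p_train          # beta(x_i) = Pr'(x_i) / Pr(x_i)
    return np.mean(beta * losses)    # (1/m) * sum_i beta_i * l(x_i, y_i, theta)
```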
33 Support Vector Classification: Utilizing the setting of [17] we can have the following minimization problem (the original SVMs can be formulated in the same way):
minimize over θ, ξ:  (1/2)‖θ‖² + C Σ_{i=1}^m βi ξi   (6a)
subject to  ⟨φ(xi , yi ) − φ(xi , y), θ⟩ ≥ 1 − ξi /∆(yi , y) for all y ∈ Y, and ξi ≥ 0.   (6b) [sent-81, score-0.096]
34 Here, φ(x, y) is a feature map from X × Y into a feature space F, where θ ∈ F and ∆(y, y′) denotes a discrepancy function between y and y′. [sent-82, score-0.109]
35 The advantage of this formulation is that it can be solved as easily as solving the standard penalized regression problem. [sent-95, score-0.078]
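As an illustration of how such per-example weights are consumed downstream (a sketch under our own assumptions, using scikit-learn's standard binary soft-margin SVM rather than the structured formulation (6)), the coefficients βi simply scale each example's slack penalty:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))                           # toy biased training sample
y_train = np.sign(X_train[:, 0] + 0.1 * rng.normal(size=100))
beta = rng.uniform(0.5, 2.0, size=100)                        # placeholder resampling weights

# Each beta_i multiplies the slack penalty C * xi_i for example i.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X_train, y_train, sample_weight=beta)
```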
36 1 Kernel Mean Matching and its relation to importance sampling Let Φ : X → F be a map into a feature space F and denote by µ : P → F the expectation operator µ(Pr) := Ex∼Pr(x) [Φ(x)] . [sent-98, score-0.18]
37 The use of feature space means to compare distributions is further explored in [3]. [sent-103, score-0.09]
38 The resulting optimization over β, equation (10), is the kernel mean matching (KMM) procedure. [sent-105, score-0.119]
39 2 Convergence of reweighted means in feature space Lemma 2 shows that in principle, if we knew Pr and µ[Pr′ ], we could fully recover Pr′ by solving a simple quadratic program. [sent-112, score-0.133]
40 Instead, we only have samples X and X ′ of size m and m′ , drawn iid from Pr and Pr′ respectively. [sent-114, score-0.089]
41 However, it is to be expected that empirical averages will differ from each other due to finite sample size effects. [sent-116, score-0.138]
42 Lemma 3 If β(x) ∈ [0, B] is some fixed function of x ∈ X, then given xi ∼ Pr iid such that β(xi ) has finite mean and non-zero variance, the sample mean (1/m) Σi β(xi ) converges in distribution to a Gaussian with mean ∫ β(x) dPr(x) and standard deviation bounded by B/(2√m). [sent-119, score-0.262]
43 Our second result demonstrates the deviation between the empirical means of Pr′ and β(x) Pr in feature space, given β(x) is chosen perfectly in the population sense. [sent-124, score-0.156]
44 In particular, this result shows that convergence of these two means will be slow if there is a large difference in the probability mass of Pr′ and Pr (and thus the bound B on the ratio of probability masses is large). [sent-125, score-0.075]
45 This means that, for very different distributions we need a large equivalent sample size to get reasonable convergence. [sent-133, score-0.114]
46 3 Empirical KMM optimization To find suitable values of β ∈ R^m we want to minimize the discrepancy between means subject to constraints βi ∈ [0, B] and |(1/m) Σ_{i=1}^m βi − 1| ≤ ε. [sent-136, score-0.111]
47 The objective function is given by the discrepancy term between the two empirical means. [sent-138, score-0.074]
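Expanding the squared RKHS distance between the weighted training mean and the empirical test mean gives a quadratic program in β with kernel matrix Kij = k(xi , xj) on the training points and linear term κi = (m/m′) Σj k(xi , x′j) over the test points. Below is a minimal sketch of this QP; the function name, the cvxopt solver choice, and the jitter term are our own, while the default ε and B follow the values reported later in the section.

```python
import numpy as np
from cvxopt import matrix, solvers

def kmm_weights(X_train, X_test, sigma, B=1000.0, eps=None):
    """Sketch of the empirical KMM quadratic program.

    Minimise (1/2) b'Kb - kappa'b over b, subject to 0 <= b_i <= B and
    |sum_i b_i / m - 1| <= eps, matching the constraints described in the text.
    """
    m, m_prime = X_train.shape[0], X_test.shape[0]
    if eps is None:
        eps = (np.sqrt(m) - 1.0) / np.sqrt(m)

    def gram(A, C):
        # Gaussian kernel k(x, x') = exp(-sigma * ||x - x'||^2)
        sq = (A ** 2).sum(1)[:, None] + (C ** 2).sum(1)[None, :] - 2.0 * A @ C.T
        return np.exp(-sigma * sq)

    K = gram(X_train, X_train) + 1e-8 * np.eye(m)                 # jitter keeps the QP well posed
    kappa = (m / m_prime) * gram(X_train, X_test).sum(axis=1)

    G = np.vstack([np.ones((1, m)), -np.ones((1, m)), np.eye(m), -np.eye(m)])
    h = np.hstack([m * (1 + eps), m * (eps - 1), B * np.ones(m), np.zeros(m)])

    sol = solvers.qp(matrix(K), matrix(-kappa), matrix(G), matrix(h))
    return np.array(sol["x"]).ravel()
```

The returned β can then be passed as per-example weights to a weighted SVM or penalized LMS, as in the experiments described below.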
48 The red line is a second reference result, derived only from the training data via OLS, and predicts the test data very poorly. [sent-158, score-0.129]
49 The other three dashed lines are fit with weighted ordinary least squares (WOLS), using one of three weighting schemes: the ratio of the underlying training and test densities, KMM, and the information criterion of [12]. [sent-159, score-0.199]
[Figure 1 omitted; legend entries: x from q0, x from q1, true fitting model, OLS fitting on q0 and q1; bar labels: ratio, KMM, IC, OLS.]
Figure 1: (a) Polynomial models of degree 1 fit with OLS and WOLS; (b) Average performances of three WOLS methods and OLS on the test data in (a). [sent-182, score-0.11]
52 Labels are Ratio for ratio of test to training density; KMM for our approach; min IC for the approach of [12]; and OLS for the model trained on the labeled test points. [sent-183, score-0.239]
53 2 Real world datasets We next test our approach on real world data sets, from which we select training examples using a deliberately biased procedure (as in [20, 9]). [sent-185, score-0.305]
54 To describe our biased selection scheme, we need to define an additional random variable si for each point in the pool of possible training samples, where si = 1 means the ith sample is included, and si = 0 indicates an excluded sample. [sent-186, score-0.483]
55 Two situations are considered: the selection bias corresponds to our assumption regarding the relation between the training and test distributions, and P (si = 1|xi , yi ) = P (si |xi ); or si is dependent only on yi , i. [sent-187, score-0.433]
56 In the following, we compare our method (labeled KMM) against two others: a baseline unweighted method (unweighted), in which no modification is made, and a weighting by the inverse of the true sampling distribution (importance sampling), as in [20, 9]. [sent-190, score-0.303]
57 We emphasise, however, that our method does not require any prior knowledge of the true sampling probabilities. [sent-191, score-0.095]
58 In our experiments, we used a Gaussian kernel exp(−σ‖xi − xj‖²) in our kernel classification and regression algorithms, and parameters ε = (√m − 1)/√m and B = 1000 in the optimization (12). [sent-192, score-0.216]
59 02 0 1 2 3 4 5 6 biased feature 7 8 0 9 (a) Simple bias on features 0. [sent-210, score-0.231]
60 1 optimal weights inverse of true sampling probabilites 10 0. [sent-217, score-0.095]
61 01 0 1 2 3 4 training set proportion (c) Bias on labels 0 0 5 (d) β 10 20 30 40 50 vs inverse sampling prob. [sent-222, score-0.183]
62 Figure 2: Classification performance analysis on breast cancer dataset from UCI. [sent-223, score-0.325]
63 The data are randomly split into training and test sets, where the proportion of examples used for training varies from 10% to 50%. [sent-228, score-0.239]
64 First, we consider a biased sampling scheme based on the input features, of which there are nine, with integer values from 0 to 9. [sent-231, score-0.239]
65 Since smaller feature values predominate in the unbiased data, we sample according to P (s = 1|x ≤ 5) = 0. [sent-232, score-0.09]
66 Performance is shown in Figure 2(a): we consistently outperform the unweighted method, and match or exceed the performance obtained using the known distribution ratio. [sent-236, score-0.184]
67 Next, we consider a sampling bias that operates jointly across multiple features. [sent-237, score-0.177]
68 We select samples less often when they are further from the sample mean x over the training data, i. [sent-238, score-0.17]
69 Performance of our method in 2(b) is again better than the unweighted case, and as good as or better than reweighting using the sampling model. [sent-241, score-0.506]
70 Finally, we consider a simple biased sampling scheme which depends only on the label y: P (s = 1|y = 1) = 0. [sent-242, score-0.263]
71 Figure 2(d) shows the weights β are proportional to the inverse of true sampling probabilities: positive examples have higher weights and negative ones have lower weights. [sent-246, score-0.117]
72 2 Further Benchmark Datasets We next compare the performance on further benchmark datasets1 by selecting training data via various biased sampling schemes. [sent-249, score-0.3]
73 Specifically, for the sampling distribution bias on labels, we use P (s = 1|y) = exp(a + by)/(1 + exp(a + by)) (datasets 1 to 5), or the simple step distribution P (s = 1|y = 1) = a, P (s = 1|y = −1) = b (datasets 6 and 7). [sent-250, score-0.177]
74 For the remaining datasets, we generate biased sampling schemes over their features. [sent-251, score-0.211]
75 Denoting the minimum value of the projection as m and its mean as m̄, we apply a normal distribution with mean m + (m̄ − m)/a and variance (m̄ − m)/b as the biased sampling scheme. [sent-253, score-0.259]
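For concreteness, a small sketch (our own; the values of a and b below are illustrative, not those used in the experiments) of drawing the selection variable si for the label-dependent logistic scheme P(s = 1|y) = exp(a + by)/(1 + exp(a + by)):

```python
import numpy as np

rng = np.random.default_rng(0)

def biased_subsample(X, y, inclusion_prob):
    """Keep example i with probability inclusion_prob(x_i, y_i), mimicking s_i in the text."""
    p = np.array([inclusion_prob(xi, yi) for xi, yi in zip(X, y)])
    keep = rng.random(len(y)) < p          # s_i = 1  <=>  example i enters the training set
    return X[keep], y[keep]

# Label-dependent logistic scheme; a and b here are illustrative values only.
a, b = -0.5, 1.0
X = rng.normal(size=(200, 3))
y = np.sign(rng.normal(size=200))
X_sel, y_sel = biased_subsample(X, y, lambda xi, yi: 1.0 / (1.0 + np.exp(-(a + b * yi))))
```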
76 We use penalized LMS for regression problems and SVM for classification problems. [sent-255, score-0.078]
77 To evaluate generalization performance, we utilize the normalized mean square error (NMSE), given by (1/n) Σ_{i=1}^n (yi − µi)² / var y, for regression problems, and the average test error for classification problems. [sent-256, score-0.197]
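Under this reading of the NMSE formula, a minimal implementation (helper name is ours) is:

```python
import numpy as np

def nmse(y_true, y_pred):
    # Normalized mean square error: mean squared residual divided by the label variance.
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)
```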
78 In 13 out of 23 experiments, our reweighting approach is the most accurate (see Table 1), despite having no prior information about the bias of the test sample (and, in some cases, despite the additional fact that the data reweighting does not conform to our key assumption 1). [sent-257, score-0.657]
79 In addition, the KMM always improves test performance compared with the unweighted case. [sent-258, score-0.27]
80 Two additional points should be borne in mind: first, we use the same σ for the kernel mean matching and the SVM, as listed in Table 1. [sent-259, score-0.119]
81 This is not surprising, since a reweighting would further reduce the effective number of points used for training, resulting in insufficient data for learning. [sent-262, score-0.227]
82 Table 1: Test results for three methods on 18 datasets with different sampling schemes. [sent-263, score-0.133]
83 The results are averages over 10 trials for regression problems (marked *) and 30 trials for classification problems. [sent-264, score-0.129]
84 We used a Gaussian kernel of size σ for both the kernel mean matching and the SVM/LMS regression, and set B = 1000. [sent-265, score-0.167]
85 3 Tumor Diagnosis using Microarrays Our next benchmark is a dataset of 102 microarrays from prostate cancer patients [13]. [sent-420, score-0.367]
86 Each of these microarrays measures the expression levels of 12,600 genes. [sent-421, score-0.123]
87 The dataset comprises 50 samples from normal tissues (positive label) and 52 from tumor tissues (negative label). [sent-422, score-0.166]
88 We simulate the realistic scenario that two sets of microarrays A and B are given with dissimilar proportions of tumor samples, and we want to perform cancer diagnosis via classification, training on A and predicting on B. [sent-423, score-0.439]
1 Regression data from http://www.
89 Sets with numbers in brackets are examined by different sampling schemes. [sent-428, score-0.095]
90 We select training examples via the biased selection scheme P (s = 1|y = 1) = 0. [sent-430, score-0.285]
91 We then perform SVM classification for the unweighted, KMM, and importance sampling approaches. [sent-434, score-0.147]
92 The experiment was repeated over 500 independent draws from the dataset according to our biased scheme; the 500 resulting test errors are plotted in [7]. [sent-435, score-0.205]
93 The KMM achieves much higher accuracy levels than the unweighted approach, and is very close to the importance sampling approach. [sent-436, score-0.331]
94 We study a very similar scenario on two breast cancer microarray datasets from [4] and [19], measuring the expression levels of 2,166 common genes for normal and cancer patients [18]. [sent-437, score-0.647]
95 Our reweighting method achieves significant improvement in classification accuracy over the unweighted SVM (see [7]). [sent-439, score-0.411]
96 Correcting sample selection bias in maximum entropy density estimation. [sent-456, score-0.193]
97 Estrogen receptor status in breast cancer is associated with remarkably distinct gene expression patterns. [sent-479, score-0.417]
98 A method for inferring label sampling mechanisms in semisupervised learning. [sent-508, score-0.119]
99 Cross-platform analysis of cancer microarray data improves gene expression based classification of phenotypes. [sent-570, score-0.328]
100 Predicting the clinical status of human breast cancer by using gene expression profiles. [sent-585, score-0.417]
wordName wordTfidf (topN-words)
[('pr', 0.714), ('kmm', 0.283), ('reweighting', 0.227), ('unweighted', 0.184), ('cancer', 0.158), ('ols', 0.153), ('breast', 0.142), ('biased', 0.116), ('wols', 0.115), ('sampling', 0.095), ('microarrays', 0.085), ('bias', 0.082), ('iy', 0.077), ('reweighted', 0.071), ('xi', 0.068), ('remp', 0.067), ('training', 0.065), ('test', 0.064), ('microarray', 0.058), ('usps', 0.058), ('ailerons', 0.058), ('sample', 0.057), ('yi', 0.057), ('si', 0.054), ('classi', 0.054), ('selection', 0.054), ('gene', 0.052), ('importance', 0.052), ('regression', 0.052), ('tumor', 0.051), ('gretton', 0.05), ('prostate', 0.05), ('lemma', 0.05), ('kernel', 0.048), ('matching', 0.047), ('ratio', 0.046), ('borgwardt', 0.046), ('correcting', 0.046), ('discrepancy', 0.043), ('fitting', 0.04), ('iid', 0.039), ('minimize', 0.039), ('haberman', 0.038), ('lms', 0.038), ('nmse', 0.038), ('rreg', 0.038), ('datasets', 0.038), ('expression', 0.038), ('population', 0.037), ('rkhs', 0.036), ('risk', 0.036), ('ic', 0.035), ('covariate', 0.034), ('huang', 0.034), ('tissues', 0.033), ('warnat', 0.033), ('screening', 0.033), ('feature', 0.033), ('empirical', 0.031), ('ex', 0.031), ('arthur', 0.03), ('nicta', 0.03), ('cients', 0.03), ('svm', 0.03), ('scenario', 0.03), ('cation', 0.03), ('means', 0.029), ('women', 0.028), ('scheme', 0.028), ('distributions', 0.028), ('coef', 0.027), ('diagnosis', 0.027), ('status', 0.027), ('waterloo', 0.027), ('deviation', 0.026), ('penalized', 0.026), ('trials', 0.026), ('drawn', 0.026), ('probabilities', 0.026), ('dataset', 0.025), ('mpi', 0.025), ('resampling', 0.025), ('patients', 0.025), ('support', 0.025), ('averages', 0.025), ('differ', 0.025), ('label', 0.024), ('samples', 0.024), ('mean', 0.024), ('benchmark', 0.024), ('weighting', 0.024), ('unlabeled', 0.024), ('proportions', 0.023), ('kij', 0.023), ('proportion', 0.023), ('densities', 0.023), ('smola', 0.022), ('examples', 0.022), ('improves', 0.022), ('nonetheless', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000012 62 nips-2006-Correcting Sample Selection Bias by Unlabeled Data
Author: Jiayuan Huang, Arthur Gretton, Karsten M. Borgwardt, Bernhard Schölkopf, Alex J. Smola
Abstract: We consider the scenario where training and test data are drawn from different distributions, commonly referred to as sample selection bias. Most algorithms for this setting try to first recover sampling distributions and then make appropriate corrections based on the distribution estimate. We present a nonparametric method which directly produces resampling weights without distribution estimation. Our method works by matching distributions between training and testing sets in feature space. Experimental results demonstrate that our method works well in practice.
2 0.29242697 38 nips-2006-Automated Hierarchy Discovery for Planning in Partially Observable Environments
Author: Laurent Charlin, Pascal Poupart, Romy Shioda
Abstract: Planning in partially observable domains is a notoriously difficult problem. However, in many real-world scenarios, planning can be simplified by decomposing the task into a hierarchy of smaller planning problems. Several approaches have been proposed to optimize a policy that decomposes according to a hierarchy specified a priori. In this paper, we investigate the problem of automatically discovering the hierarchy. More precisely, we frame the optimization of a hierarchical policy as a non-convex optimization problem that can be solved with general non-linear solvers, a mixed-integer non-linear approximation or a form of bounded hierarchical policy iteration. By encoding the hierarchical structure as variables of the optimization problem, we can automatically discover a hierarchy. Our method is flexible enough to allow any parts of the hierarchy to be specified based on prior knowledge while letting the optimization discover the unknown parts. It can also discover hierarchical policies, including recursive policies, that are more compact (potentially infinitely fewer parameters) and often easier to understand given the decomposition induced by the hierarchy. 1
3 0.15881056 143 nips-2006-Natural Actor-Critic for Road Traffic Optimisation
Author: Silvia Richter, Douglas Aberdeen, Jin Yu
Abstract: Current road-traffic optimisation practice around the world is a combination of hand tuned policies with a small degree of automatic adaption. Even state-ofthe-art research controllers need good models of the road traffic, which cannot be obtained directly from existing sensors. We use a policy-gradient reinforcement learning approach to directly optimise the traffic signals, mapping currently deployed sensor observations to control signals. Our trained controllers are (theoretically) compatible with the traffic system used in Sydney and many other cities around the world. We apply two policy-gradient methods: (1) the recent natural actor-critic algorithm, and (2) a vanilla policy-gradient algorithm for comparison. Along the way we extend natural-actor critic approaches to work for distributed and online infinite-horizon problems. 1
4 0.1234615 44 nips-2006-Bayesian Policy Gradient Algorithms
Author: Mohammad Ghavamzadeh, Yaakov Engel
Abstract: Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. Since Monte Carlo methods tend to have high variance, a large number of samples is required, resulting in slow convergence. In this paper, we propose a Bayesian framework that models the policy gradient as a Gaussian process. This reduces the number of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient as well as a measure of the uncertainty in the gradient estimates are provided at little extra cost. 1
5 0.10463791 20 nips-2006-Active learning for misspecified generalized linear models
Author: Francis R. Bach
Abstract: Active learning refers to algorithmic frameworks aimed at selecting training data points in order to reduce the number of required training data points and/or improve the generalization performance of a learning method. In this paper, we present an asymptotic analysis of active learning for generalized linear models. Our analysis holds under the common practical situation of model misspecification, and is based on realistic assumptions regarding the nature of the sampling distributions, which are usually neither independent nor identical. We derive unbiased estimators of generalization performance, as well as estimators of expected reduction in generalization error after adding a new training data point, that allow us to optimize its sampling distribution through a convex optimization problem. Our analysis naturally leads to an algorithm for sequential active learning which is applicable for all tasks supported by generalized linear models (e.g., binary classification, multi-class classification, regression) and can be applied in non-linear settings through the use of Mercer kernels. 1
6 0.099834763 142 nips-2006-Mutagenetic tree Fisher kernel improves prediction of HIV drug resistance from viral genotype
7 0.090481915 54 nips-2006-Comparative Gene Prediction using Conditional Random Fields
8 0.087754592 193 nips-2006-Tighter PAC-Bayes Bounds
9 0.086052254 5 nips-2006-A Kernel Method for the Two-Sample-Problem
10 0.081956975 150 nips-2006-On Transductive Regression
11 0.075691491 131 nips-2006-Mixture Regression for Covariate Shift
12 0.070727699 65 nips-2006-Denoising and Dimension Reduction in Feature Space
13 0.06810803 157 nips-2006-PAC-Bayes Bounds for the Risk of the Majority Vote and the Variance of the Gibbs Classifier
14 0.067812011 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions
15 0.066095755 186 nips-2006-Support Vector Machines on a Budget
16 0.064831309 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space
17 0.062565558 144 nips-2006-Near-Uniform Sampling of Combinatorial Spaces Using XOR Constraints
18 0.061809726 68 nips-2006-Dirichlet-Enhanced Spam Filtering based on Biased Samples
19 0.060992047 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
20 0.058108173 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation
topicId topicWeight
[(0, -0.205), (1, 0.056), (2, -0.07), (3, -0.085), (4, 0.002), (5, 0.119), (6, -0.003), (7, 0.022), (8, 0.231), (9, 0.05), (10, -0.088), (11, -0.046), (12, 0.038), (13, -0.057), (14, -0.098), (15, -0.003), (16, -0.021), (17, -0.068), (18, 0.008), (19, 0.12), (20, 0.014), (21, -0.255), (22, -0.177), (23, -0.045), (24, 0.106), (25, -0.217), (26, 0.285), (27, 0.113), (28, 0.076), (29, -0.091), (30, 0.074), (31, -0.093), (32, 0.148), (33, 0.141), (34, -0.046), (35, -0.051), (36, -0.072), (37, -0.03), (38, -0.009), (39, -0.066), (40, -0.057), (41, -0.047), (42, 0.023), (43, 0.132), (44, 0.034), (45, -0.019), (46, 0.001), (47, 0.039), (48, 0.02), (49, 0.004)]
simIndex simValue paperId paperTitle
same-paper 1 0.94800121 62 nips-2006-Correcting Sample Selection Bias by Unlabeled Data
Author: Jiayuan Huang, Arthur Gretton, Karsten M. Borgwardt, Bernhard Schölkopf, Alex J. Smola
Abstract: We consider the scenario where training and test data are drawn from different distributions, commonly referred to as sample selection bias. Most algorithms for this setting try to first recover sampling distributions and then make appropriate corrections based on the distribution estimate. We present a nonparametric method which directly produces resampling weights without distribution estimation. Our method works by matching distributions between training and testing sets in feature space. Experimental results demonstrate that our method works well in practice.
2 0.73614478 38 nips-2006-Automated Hierarchy Discovery for Planning in Partially Observable Environments
Author: Laurent Charlin, Pascal Poupart, Romy Shioda
Abstract: Planning in partially observable domains is a notoriously difficult problem. However, in many real-world scenarios, planning can be simplified by decomposing the task into a hierarchy of smaller planning problems. Several approaches have been proposed to optimize a policy that decomposes according to a hierarchy specified a priori. In this paper, we investigate the problem of automatically discovering the hierarchy. More precisely, we frame the optimization of a hierarchical policy as a non-convex optimization problem that can be solved with general non-linear solvers, a mixed-integer non-linear approximation or a form of bounded hierarchical policy iteration. By encoding the hierarchical structure as variables of the optimization problem, we can automatically discover a hierarchy. Our method is flexible enough to allow any parts of the hierarchy to be specified based on prior knowledge while letting the optimization discover the unknown parts. It can also discover hierarchical policies, including recursive policies, that are more compact (potentially infinitely fewer parameters) and often easier to understand given the decomposition induced by the hierarchy. 1
3 0.70577508 143 nips-2006-Natural Actor-Critic for Road Traffic Optimisation
Author: Silvia Richter, Douglas Aberdeen, Jin Yu
Abstract: Current road-traffic optimisation practice around the world is a combination of hand tuned policies with a small degree of automatic adaption. Even state-ofthe-art research controllers need good models of the road traffic, which cannot be obtained directly from existing sensors. We use a policy-gradient reinforcement learning approach to directly optimise the traffic signals, mapping currently deployed sensor observations to control signals. Our trained controllers are (theoretically) compatible with the traffic system used in Sydney and many other cities around the world. We apply two policy-gradient methods: (1) the recent natural actor-critic algorithm, and (2) a vanilla policy-gradient algorithm for comparison. Along the way we extend natural-actor critic approaches to work for distributed and online infinite-horizon problems. 1
4 0.51674259 142 nips-2006-Mutagenetic tree Fisher kernel improves prediction of HIV drug resistance from viral genotype
Author: Tobias Sing, Niko Beerenwinkel
Abstract: Starting with the work of Jaakkola and Haussler, a variety of approaches have been proposed for coupling domain-specific generative models with statistical learning methods. The link is established by a kernel function which provides a similarity measure based inherently on the underlying model. In computational biology, the full promise of this framework has rarely ever been exploited, as most kernels are derived from very generic models, such as sequence profiles or hidden Markov models. Here, we introduce the MTreeMix kernel, which is based on a generative model tailored to the underlying biological mechanism. Specifically, the kernel quantifies the similarity of evolutionary escape from antiviral drug pressure between two viral sequence samples. We compare this novel kernel to a standard, evolution-agnostic amino acid encoding in the prediction of HIV drug resistance from genotype, using support vector regression. The results show significant improvements in predictive performance across 17 anti-HIV drugs. Thus, in our study, the generative-discriminative paradigm is key to bridging the gap between population genetic modeling and clinical decision making. 1
5 0.46768594 44 nips-2006-Bayesian Policy Gradient Algorithms
Author: Mohammad Ghavamzadeh, Yaakov Engel
Abstract: Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. Since Monte Carlo methods tend to have high variance, a large number of samples is required, resulting in slow convergence. In this paper, we propose a Bayesian framework that models the policy gradient as a Gaussian process. This reduces the number of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient as well as a measure of the uncertainty in the gradient estimates are provided at little extra cost. 1
6 0.45618898 144 nips-2006-Near-Uniform Sampling of Combinatorial Spaces Using XOR Constraints
7 0.42975333 5 nips-2006-A Kernel Method for the Two-Sample-Problem
8 0.36158007 150 nips-2006-On Transductive Regression
9 0.35865396 193 nips-2006-Tighter PAC-Bayes Bounds
10 0.35208642 131 nips-2006-Mixture Regression for Covariate Shift
11 0.33225554 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions
12 0.31754628 68 nips-2006-Dirichlet-Enhanced Spam Filtering based on Biased Samples
13 0.31706923 140 nips-2006-Multiple Instance Learning for Computer Aided Diagnosis
14 0.30935118 20 nips-2006-Active learning for misspecified generalized linear models
15 0.2926797 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation
16 0.27785203 54 nips-2006-Comparative Gene Prediction using Conditional Random Fields
17 0.2764706 186 nips-2006-Support Vector Machines on a Budget
18 0.25824699 118 nips-2006-Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields
19 0.25786892 25 nips-2006-An Application of Reinforcement Learning to Aerobatic Helicopter Flight
20 0.24908535 82 nips-2006-Gaussian and Wishart Hyperkernels
topicId topicWeight
[(1, 0.099), (3, 0.012), (7, 0.051), (9, 0.046), (20, 0.021), (22, 0.473), (44, 0.063), (57, 0.042), (65, 0.038), (69, 0.03), (83, 0.014), (90, 0.014)]
simIndex simValue paperId paperTitle
1 0.99534261 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
Author: Manfred K. Warmuth, Dima Kuzmin
Abstract: We design an on-line algorithm for Principal Component Analysis. In each trial the current instance is projected onto a probabilistically chosen low dimensional subspace. The total expected quadratic approximation error equals the total quadratic approximation error of the best subspace chosen in hindsight plus some additional term that grows linearly in dimension of the subspace but logarithmically in the dimension of the instances. 1
2 0.96025604 92 nips-2006-High-Dimensional Graphical Model Selection Using $\ell 1$-Regularized Logistic Regression
Author: Martin J. Wainwright, John D. Lafferty, Pradeep K. Ravikumar
Abstract: We focus on the problem of estimating the graph structure associated with a discrete Markov random field. We describe a method based on 1 regularized logistic regression, in which the neighborhood of any given node is estimated by performing logistic regression subject to an 1 -constraint. Our framework applies to the high-dimensional setting, in which both the number of nodes p and maximum neighborhood sizes d are allowed to grow as a function of the number of observations n. Our main result is to establish sufficient conditions on the triple (n, p, d) for the method to succeed in consistently estimating the neighborhood of every node in the graph simultaneously. Under certain mutual incoherence conditions analogous to those imposed in previous work on linear regression, we prove that consistent neighborhood selection can be obtained as long as the number of observations n grows more quickly than 6d6 log d + 2d5 log p, thereby establishing that logarithmic growth in the number of samples n relative to graph size p is sufficient to achieve neighborhood consistency. Keywords: Graphical models; Markov random fields; structure learning; 1 -regularization; model selection; convex risk minimization; high-dimensional asymptotics; concentration. 1
same-paper 3 0.93655908 62 nips-2006-Correcting Sample Selection Bias by Unlabeled Data
Author: Jiayuan Huang, Arthur Gretton, Karsten M. Borgwardt, Bernhard Schölkopf, Alex J. Smola
Abstract: We consider the scenario where training and test data are drawn from different distributions, commonly referred to as sample selection bias. Most algorithms for this setting try to first recover sampling distributions and then make appropriate corrections based on the distribution estimate. We present a nonparametric method which directly produces resampling weights without distribution estimation. Our method works by matching distributions between training and testing sets in feature space. Experimental results demonstrate that our method works well in practice.
4 0.92972469 172 nips-2006-Scalable Discriminative Learning for Natural Language Parsing and Translation
Author: Joseph Turian, Benjamin Wellington, I. D. Melamed
Abstract: Parsing and translating natural languages can be viewed as problems of predicting tree structures. For machine learning approaches to these predictions, the diversity and high dimensionality of the structures involved mandate very large training sets. This paper presents a purely discriminative learning method that scales up well to problems of this size. Its accuracy was at least as good as other comparable methods on a standard parsing task. To our knowledge, it is the first purely discriminative learning algorithm for translation with treestructured models. Unlike other popular methods, this method does not require a great deal of feature engineering a priori, because it performs feature selection over a compound feature space as it learns. Experiments demonstrate the method’s versatility, accuracy, and efficiency. Relevant software is freely available at http://nlp.cs.nyu.edu/parser and http://nlp.cs.nyu.edu/GenPar. 1
5 0.74012309 203 nips-2006-implicit Online Learning with Kernels
Author: Li Cheng, Dale Schuurmans, Shaojun Wang, Terry Caelli, S.v.n. Vishwanathan
Abstract: We present two new algorithms for online learning in reproducing kernel Hilbert spaces. Our first algorithm, ILK (implicit online learning with kernels), employs a new, implicit update technique that can be applied to a wide variety of convex loss functions. We then introduce a bounded memory version, SILK (sparse ILK), that maintains a compact representation of the predictor without compromising solution quality, even in non-stationary environments. We prove loss bounds and analyze the convergence rate of both. Experimental evidence shows that our proposed algorithms outperform current methods on synthetic and real data. 1
6 0.7168293 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments
7 0.70493126 126 nips-2006-Logistic Regression for Single Trial EEG Classification
8 0.7048409 61 nips-2006-Convex Repeated Games and Fenchel Duality
9 0.68759823 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy
10 0.67769992 68 nips-2006-Dirichlet-Enhanced Spam Filtering based on Biased Samples
11 0.65679979 6 nips-2006-A Kernel Subspace Method by Stochastic Realization for Learning Nonlinear Dynamical Systems
12 0.65176928 194 nips-2006-Towards a general independent subspace analysis
13 0.6479193 131 nips-2006-Mixture Regression for Covariate Shift
14 0.64284676 67 nips-2006-Differential Entropic Clustering of Multivariate Gaussians
15 0.64225137 11 nips-2006-A PAC-Bayes Risk Bound for General Loss Functions
16 0.64090914 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
17 0.63664097 124 nips-2006-Linearly-solvable Markov decision problems
18 0.63121825 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections
19 0.62918991 5 nips-2006-A Kernel Method for the Two-Sample-Problem
20 0.62804407 20 nips-2006-Active learning for misspecified generalized linear models