nips nips2011 nips2011-106 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Gilles Blanchard, Gyemin Lee, Clayton Scott
Abstract: We consider the problem of assigning class labels to an unlabeled test data set, given several labeled training data sets drawn from similar distributions. This problem arises in several applications where data distributions fluctuate because of biological, technical, or other sources of variation. We develop a distributionfree, kernel-based approach to the problem. This approach involves identifying an appropriate reproducing kernel Hilbert space and optimizing a regularized empirical risk over the space. We present generalization error analysis, describe universal kernels, and establish universal consistency of the proposed methodology. Experimental results on flow cytometry data are presented. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: We consider the problem of assigning class labels to an unlabeled test data set, given several labeled training data sets drawn from similar distributions. [sent-4, score-0.139]
2 This approach involves identifying an appropriate reproducing kernel Hilbert space and optimizing a regularized empirical risk over the space. [sent-7, score-0.157]
3 We present generalization error analysis, describe universal kernels, and establish universal consistency of the proposed methodology. [sent-8, score-0.294]
4 For a given distribution PXY , we refer to the X marginal distribution PX as simply the marginal distribution, and the conditional PXY (Y |X) as the posterior distribution. [sent-14, score-0.15]
5 For each i, there is a training sample Si = (Xij, Yij)_{1≤j≤ni} of iid realizations of PXY^(i). [sent-19, score-0.115]
6 There is also a test distribution PXY^T that is similar to, but again distinct from, the “training distributions” PXY^(i). [sent-20, score-0.058]
7 Finally, there is a test sample (Xj^T, Yj^T)_{1≤j≤nT} of iid realizations of PXY^T, but in this case the labels Yj^T are not observed. [sent-21, score-0.151]
8 Essentially, given a random sample from the marginal test distribution PX^T, we would like to predict the corresponding labels. [sent-23, score-0.133]
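To make the data layout concrete, here is a minimal sketch in Python; the variable names, array shapes, and synthetic values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# N labeled training samples, each drawn from its own distribution PXY^(i):
# training_sets[i] = (X_i, y_i), with X_i of shape (n_i, d) and labels y_i in {-1, +1}.
training_sets = [
    (rng.normal(size=(100, 3)), rng.choice([-1.0, 1.0], size=100))
    for _ in range(5)
]

# One unlabeled test sample, drawn from a new but similar distribution PXY^T.
X_test = rng.normal(size=(80, 3))

# Goal: learn, from the labeled training samples alone, a rule that maps
# (empirical test marginal, test point) -> a predicted label for every row of X_test.
```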
9 Thus, when we say that the training and test distributions are “similar,” we mean that there is some pattern making it possible to learn a mapping from marginal distributions to labels. [sent-24, score-0.2]
10 In multi-task learning, only the training distributions are of interest, and the goal is to use the similarity among distributions to improve the training of individual classifiers [1, 2, 3]. [sent-28, score-0.128]
11 In the covariate shift problem, the marginal test distribution is different from the marginal training distribution(s), but the posterior distribution is assumed to be the same [4]. [sent-30, score-0.283]
12 In our case, both marginal and posterior test distributions can differ from their training counterparts [5]. [sent-31, score-0.173]
13 Finally, in transfer learning, it is typically assumed that at least a few labels are available for the test data, and the training data sets are used to improve the performance of a standard classifier, for example by learning a metric or embedding which is appropriate for all data sets [6, 7]. [sent-32, score-0.163]
14 In our case, no test labels are available, but we hope that through access to multiple training data sets, it is still possible to obtain collective knowledge about the “labeling process” that may be transferred to the test distribution. [sent-33, score-0.148]
15 Some authors have considered transductive transfer learning, which is similar to the problem studied here in that no test labels are available. [sent-34, score-0.109]
16 However, existing work has focused on the case N = 1 and typically relies on the covariate shift assumption [8]. [sent-35, score-0.056]
17 Our methodology is shown to yield a consistent learning procedure, meaning that the generalization error tends to the best possible as the sample sizes N, {ni }, nT tend to infinity. [sent-37, score-0.072]
18 We also offer a proof-of-concept experimental study validating the proposed approach on flow cytometry data, including comparisons to multi-task kernels and a simple pooling approach. [sent-38, score-0.326]
19 2 Motivating Application: Automatic Gating of Flow Cytometry Data Flow cytometry is a high-throughput measurement platform that is an important clinical tool for the diagnosis of many blood-related pathologies. [sent-39, score-0.248]
20 This technology allows for quantitative analysis of individual cells from a given population, derived for example from a blood sample from a patient. [sent-40, score-0.076]
21 We may think of a flow cytometry data set as a set of d-dimensional attribute vectors (Xj )1≤j≤n , where n is the number of cells analyzed, and d is the number of attributes recorded per cell. [sent-41, score-0.254]
22 Thus, a flow cytometry data set is a random sample from a patient-specific distribution. [sent-43, score-0.23]
23 For example, lymphocytes are known to be relevant for the diagnosis of leukemia, whereas non-lymphocytes may potentially confound the analysis. [sent-46, score-0.079]
24 Because of the variability in flow cytometry data, this process is difficult to quantify in terms of a small subset of simple rules. [sent-50, score-0.206]
25 Since clinical laboratories maintain historical databases, we can assume access to a number (N ) of historical patients that have already been expert-gated. [sent-53, score-0.171]
26 Because of biological and technical variations in flow cytometry data, the distributions PXY^(i) of the historical patients will vary. [sent-54, score-0.317]
27 Nonetheless, there are certain general trends that are known to hold for all flow cytometry measurements. [sent-57, score-0.206]
28 For example, lymphocytes are known to exhibit low levels of the “side-scatter” (SS) attribute, while expressing high levels of the attribute CD45 (see column 2 of Fig. [sent-58, score-0.084]
29 Therefore, it is reasonable to assume that there is an underlying distribution (on distributions) governing flow cytometry data sets, that produces roughly similar distributions thereby making possible the automation of the gating process. [sent-63, score-0.267]
30 The distribution of cells differs from patient to patient. [sent-67, score-0.072]
31 The training distributions PXY^(i) are iid realizations from µ, and the sample Si is made of ni iid realizations of (X, Y) following the distribution PXY^(i). [sent-82, score-0.164] [sent-85, score-0.072]
33 Now consider a test distribution PXY^T and test sample S^T = (Xj^T, Yj^T)_{1≤j≤nT}, whose labels are not observed. [sent-86, score-0.153]
34 If ℓ : R × Y → R+ is a loss, then the average loss incurred on the test sample is (1/nT) Σ_{i=1}^{nT} ℓ(Ŷi^T, Yi^T). [sent-88, score-0.099]
35 Based on this, we define the average generalization error of a decision function over test samples of size nT as E(f, nT) := E_{PXY^T ∼ µ} E_{S^T ∼ (PXY^T)^{⊗nT}} [ (1/nT) Σ_{i=1}^{nT} ℓ(f(P̂X^T, Xi^T), Yi^T) ]. (1) [sent-89, score-0.084]
36 An important point of the analysis is that, at training time as well as at test time, the marginal distribution PX for a sample is only known through the sample itself, that is, through the empirical marginal P̂X. [sent-90, score-0.243]
37 As is clear from equation (1), because of this the generalization error also depends on the test sample size nT. [sent-91, score-0.092]
38 This motivates the following generalization error when we have an infinite test sample, where we then assume that the true marginal PX^T is observed: E(f, ∞) := E_{PXY^T ∼ µ} E_{(X^T, Y^T) ∼ PXY^T} [ ℓ(f(PX^T, X^T), Y^T) ]. (2) [sent-93, score-0.119]
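For readability, the two generalization errors can also be written as display equations; this is only a transcription of (1) and (2) above, with ℓ the loss, P̂X^T the empirical test marginal, and PX^T the true test marginal.

```latex
% Average generalization error over test samples of size n_T  (eq. 1)
E(f, n_T) :=
  \mathbb{E}_{P_{XY}^{T} \sim \mu}\;
  \mathbb{E}_{S^{T} \sim (P_{XY}^{T})^{\otimes n_T}}
  \left[ \frac{1}{n_T} \sum_{i=1}^{n_T}
         \ell\bigl(f(\widehat{P}_X^{T}, X_i^{T}),\, Y_i^{T}\bigr) \right]

% Infinite-test-sample version, where the true marginal is observed  (eq. 2)
E(f, \infty) :=
  \mathbb{E}_{P_{XY}^{T} \sim \mu}\;
  \mathbb{E}_{(X^{T}, Y^{T}) \sim P_{XY}^{T}}
  \left[ \ell\bigl(f(P_X^{T}, X^{T}),\, Y^{T}\bigr) \right]
```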
39 To gain some insight into this risk, let us decompose µ into two parts: µX, which generates the marginal distribution PX, and µY|X, which, conditioned on PX, generates the posterior PY|X. [sent-94, score-0.113]
40 Here Qµ is the distribution that generates X̃ by first drawing PX according to µX, and then drawing X̃ according to PX. [sent-97, score-0.073]
41 From this last expression, we see that the risk is like a standard binary classification risk based on (X̃, Y) ∼ Qµ. [sent-99, score-0.078]
42 Despite the similarity to standard binary classification in the infinite sample case, we emphasize that the learning task here is different, because the realizations (Xij , Yij ) are neither independent nor identically distributed. [sent-105, score-0.08]
43 Finally, we note that there is a condition under which, for µ-almost every test distribution PXY^T, the classifier f*(PX^T, ·) (where f* is the global minimizer of (2)) coincides with the optimal Bayes classifier for PXY^T, although no labels from this test distribution are observed. [sent-106, score-0.072] [sent-107, score-0.101]
45 Although we will not be assuming this condition throughout the paper, it is implicitly assumed in the motivating application presented in Section 2, where an expert labels the data points by just looking at their marginal distribution. [sent-110, score-0.114]
46 For a fixed distribution PXY and a decision function f : X → R, let us denote R(f, PXY) = E_{(X,Y)∼PXY}[ℓ(f(X), Y)], and R*(PXY) := min_{f:X→R} R(f, PXY) = min_{f:X→R} E_{(X,Y)∼PXY}[ℓ(f(X), Y)] the corresponding optimal (Bayes) risk for the loss function ℓ. [sent-113, score-0.104]
47 4 Learning Algorithm We consider an approach based on positive semi-definite kernels, or simply kernels for short. [sent-128, score-0.1]
48 Background information on kernels, including the definition, normalized kernels, universal kernels, and reproducing kernel Hilbert spaces (RKHSs), may be found in [11]. [sent-129, score-0.229]
49 Several well-known learning algorithms, such as support vector machines and kernel ridge regression, may be viewed as minimizers of a norm-regularized empirical risk over the RKHS of a kernel. [sent-130, score-0.119]
50 Inspired by this framework, we consider a general kernel algorithm as follows. [sent-132, score-0.08]
51 Let k be a kernel on PX × X, and let Hk be the associated RKHS. [sent-134, score-0.08]
52 Now define fλ = arg min_{f∈Hk} (1/N) Σ_{i=1}^{N} (1/ni) Σ_{j=1}^{ni} ℓ(f(P̂X^(i), Xij), Yij) + λ ‖f‖². [sent-138, score-0.168]
53 The final predictor has the form fλ(P̂X, x) = Σ_{i=1}^{N} Σ_{j=1}^{ni} αij Yij k((P̂X^(i), Xij), (P̂X, x)), where the αij are nonnegative and mostly zero. [sent-140, score-0.084]
54 In the rest of the paper we will consider a kernel k on PX × X of the product form k((P1, x1), (P2, x2)) = kP(P1, P2) kX(x1, x2), (4) where kP is a kernel on PX and kX is a kernel on X. [sent-142, score-0.24]
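A minimal sketch of this training and prediction step, with squared loss substituted for the generic loss ℓ so the regularized minimization has a closed form (weighted kernel ridge regression); `kP` is any kernel on empirical distributions (one concrete choice is sketched further below), and all function names are illustrative, not from the paper.

```python
import numpy as np

def kX_gram(kX, A, B):
    """Pairwise Gram matrix of the point-level kernel kX between rows of A and B."""
    return np.array([[kX(a, b) for b in B] for a in A])

def fit_product_kernel_erm(training_sets, kP, kX, lam):
    """Minimize (1/N) sum_i (1/n_i) sum_j (f(P_i, X_ij) - Y_ij)^2 + lam * ||f||^2
    over the RKHS of k((P1, x1), (P2, x2)) = kP(P1, P2) * kX(x1, x2).
    training_sets: list of (X_i, y_i) with X_i of shape (n_i, d)."""
    y_all = np.concatenate([y for _, y in training_sets])
    N = len(training_sets)
    # weights 1/(N * n_i), matching the averaged empirical risk
    w = np.concatenate([np.full(len(y), 1.0 / (N * len(y))) for _, y in training_sets])
    # Gram matrix of the product kernel over all training pairs (P_hat^(i), X_ij)
    blocks = [[kP(Xi, Xj) * kX_gram(kX, Xi, Xj) for Xj, _ in training_sets]
              for Xi, _ in training_sets]
    K = np.block(blocks)
    # one standard solution of the weighted kernel ridge problem
    W = np.diag(w)
    alpha = np.linalg.solve(W @ K + lam * np.eye(len(y_all)), W @ y_all)
    return alpha

def predict(alpha, training_sets, kP, kX, X_test):
    """Evaluate sign of f(P_hat^T, x) for each test point x; the empirical test
    marginal P_hat^T is represented by the test sample X_test itself."""
    rows = [kP(Xi, X_test) * kX_gram(kX, Xi, X_test) for Xi, _ in training_sets]
    K_cross = np.vstack(rows)          # shape: (total training points, n_test)
    return np.sign(alpha @ K_cross)
```

The test sample enters only through kP(Xi, X_test) inside predict, which is how the learned rule adapts to the test marginal; the squared loss is just a convenient stand-in for the generic loss used in the paper.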
55 Furthermore, we will consider kernels on PX of a particular form. [sent-143, score-0.1]
56 Let kX′ denote a kernel on X (which might be different from kX) that is measurable and bounded. [sent-144, score-0.08]
57 Therefore, if we consider the kernel kP(PX, PX′) = ⟨Ψ(PX), Ψ(PX′)⟩, it is a linear kernel on PX and cannot be a universal kernel. [sent-148, score-0.094]
58 For this reason, we introduce yet another kernel K on HkX′ and consider the kernel on PX given by kP(PX, PX′) = K(Ψ(PX), Ψ(PX′)). (6) [sent-149, score-0.08]
59 Note that particular kernels inspired by the finite-dimensional case are of the form K(v, v′) = F(‖v − v′‖), (7) or K(v, v′) = G(⟨v, v′⟩), (8) where F, G are real functions of a real variable such that they define a kernel. [sent-150, score-0.1]
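A sketch of this construction under the assumption that kX′ is a Gaussian kernel: the inner product ⟨Ψ(P̂1), Ψ(P̂2)⟩ between empirical mean embeddings is the average of kX′ over all cross pairs, and plugging the induced distance into a kernel of form (7) gives a Gaussian-like kP. The bandwidth names below are illustrative.

```python
import numpy as np

def gaussian_gram(A, B, sigma):
    """kX'(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) for all pairs of rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def embedding_inner(X1, X2, sigma_prime):
    """<Psi(P_hat_1), Psi(P_hat_2)> = average of kX'(x, x') over all cross pairs."""
    return gaussian_gram(X1, X2, sigma_prime).mean()

def kP_gaussian_like(X1, X2, sigma_prime=1.0, sigma_P=1.0):
    """K(Psi(P1), Psi(P2)) = exp(-||Psi(P1) - Psi(P2)||^2 / (2 sigma_P^2)),
    with the squared RKHS distance expanded into three inner products."""
    sq_dist = (embedding_inner(X1, X1, sigma_prime)
               + embedding_inner(X2, X2, sigma_prime)
               - 2.0 * embedding_inner(X1, X2, sigma_prime))
    return float(np.exp(-sq_dist / (2.0 * sigma_P ** 2)))
```

A kernel of this kind can be passed directly to the training sketch above, e.g. `kP = lambda A, B: kP_gaussian_like(A, B, sigma_prime=1.0, sigma_P=1.0)`.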
60 Below we apply their results to deduce that k is a universal kernel for certain choices of kX, kX′, and K. [sent-153, score-0.222]
61 We then discuss universal kernels, and finally deduce universal consistency of the algorithm. [sent-156, score-0.286]
62 To simplify somewhat the analysis, we assume below that all training samples have the same size ni = n. [sent-157, score-0.119]
63 Also let Bk (r) denote the closed ball of radius r, centered at the origin, in the RKHS of the kernel k. [sent-158, score-0.093]
64 We consider the following assumptions on the loss and kernels: (Loss) The loss function ℓ : R × Y → R+ is Lℓ-Lipschitz in its first variable and bounded by Bℓ. [sent-159, score-0.066]
65 (Kernels-A) The kernels kX, kX′ and K are bounded respectively by constants Bk², Bk′² ≥ 1, and BK². [sent-160, score-0.1]
66 As an example, the condition is shown to hold with α = 1 when K is the Gaussian-like kernel on HkX′. [sent-163, score-0.094]
67 We use the decomposition sup_{f∈Bk(R)} (1/N) Σ_{i=1}^{N} (1/ni) Σ_{j=1}^{ni} ℓ(f(P̂X^(i), Xij), Yij) − E(f, ∞) ≤ sup_{f∈Bk(R)} (1/N) Σ_{i=1}^{N} (1/ni) Σ_{j=1}^{ni} [ℓ(f(P̂X^(i), Xij), Yij) − ℓ(f(PX^(i), Xij), Yij)] + sup_{f∈Bk(R)} (1/N) Σ_{i=1}^{N} (1/ni) Σ_{j=1}^{ni} ℓ(f(PX^(i), Xij), Yij) − E(f, ∞) =: (I) + (II). [sent-184, score-0.555]
68 This can be obtained using the reproducing property of the kernel k, the convergence of Ψ(P̂X^(i)) to Ψ(PX^(i)) as a consequence of Hoeffding’s inequality in a Hilbert space, and the other assumptions (boundedness/Hölder property) on the kernels. [sent-191, score-0.122]
69 In both cases, a standard approach using the Azuma-McDiarmid inequality [17] followed by symmetrization and Rademacher complexity analysis on a kernel space [18, 19] can be applied. [sent-193, score-0.08]
70 To establish that k is universal on PX × X , the following lemma is useful. [sent-198, score-0.124]
71 Let Ω, Ω′ be two compact spaces and k, k′ be kernels on Ω, Ω′, respectively. [sent-201, score-0.12]
72 If k, k′ are both universal, then the product kernel defined by ((x, x′), (y, y′)) ↦ k(x, y) k′(x′, y′) is universal on Ω × Ω′. [sent-202, score-0.204]
73 Several examples of universal kernels are known on Euclidean space. [sent-203, score-0.224]
74 Some additional assumptions on the kernels and feature space are required: (Kernels-B) kX, kX′, K, and X satisfy the following: X is a compact metric space; kX is universal on X; kX′ is continuous and universal on X; K is universal on any compact subset of HkX′. [sent-206, score-0.512]
75 Then, for kP defined as in (6), the product kernel k in (4) is universal on PX ×X . [sent-211, score-0.204]
76 Furthermore, the assumption on K is fulfilled if K is of the form (8), where G is an analytical function with positive Taylor series coefficients, or if K is the normalized kernel associated to such a kernel. [sent-212, score-0.08]
77 Taking G(t) = exp(t), it follows that K(PX, PX′) = exp(⟨Ψ(PX), Ψ(PX′)⟩_{HkX′}) is universal on PX. [sent-215, score-0.124]
78 By similar reasoning as in the finite-dimensional case, the Gaussian-like kernel K(PX, PX′) = exp(−(1/(2σ²)) ‖Ψ(PX) − Ψ(PX′)‖²_{HkX′}) is also universal on PX. [sent-216, score-0.204]
79 6 Experiments We demonstrate the proposed methodology for flow cytometry data auto-gating, described above. [sent-223, score-0.228]
80 Peripheral blood samples were obtained from 35 normal patients, and lymphocytes were classified by a domain expert. [sent-224, score-0.088]
81 The corresponding flow cytometry data sets have sample sizes ranging from 10,000 to 100,000, and the proportion of lymphocytes in each data set ranges from 10 to 40%. [sent-225, score-0.311]
82 To speed training time, we subsampled the 10 training data sets to have 1000 data points (cells) each. [sent-227, score-0.087]
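An end-to-end sketch of this experimental setup, reusing the illustrative functions from the earlier sketches; the synthetic data, subsample size, bandwidths, and regularization value below are placeholders, not the paper's data or tuned settings.

```python
import numpy as np
from functools import partial

rng = np.random.default_rng(0)

# synthetic stand-in for expert-gated flow cytometry data: (cells x attributes, labels)
patient_data = [(rng.normal(size=(5000, 6)), rng.choice([-1.0, 1.0], size=5000))
                for _ in range(11)]

def subsample(X, y, m=200):            # the paper subsamples to 1000; kept small here
    idx = rng.choice(len(X), size=m, replace=False)
    return X[idx], y[idx]

train_sets = [subsample(X, y) for X, y in patient_data[:10]]
X_test, y_test_hidden = patient_data[10]        # test labels are withheld from training

kX = lambda a, b: np.exp(-np.sum((a - b) ** 2) / (2 * 1.0 ** 2))  # placeholder sigma_X
kP = partial(kP_gaussian_like, sigma_prime=1.0, sigma_P=1.0)      # placeholder bandwidths

alpha = fit_product_kernel_erm(train_sets, kP, kX, lam=1e-3)
y_hat = predict(alpha, train_sets, kP, kX, X_test[:500])          # predict on a subset
```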
83 The kernels kX, kX′, and K are all taken to be Gaussian kernels with respective bandwidths σX, σX′, and σ.
84 (Table 1 appears here; its caption reads: “The misclassification rates (%) on training data sets and test data sets for different kP. The proposed method adapts the decision function to the test data (through its empirical marginal).” Its rows include Pooling (τ = 1).)
85 For comparison, we also considered three other options for kP, all of the form kP(P1, P2) = 1 if P1 = P2 and kP(P1, P2) = τ otherwise; taking τ = 1 amounts to pooling all the training data.
86 This idea has been previously studied in the context of flow cytometry by [21]. [sent-248, score-0.206]
87 When 0 < τ < 1, we obtain a kernel like what was used for multi-task learning (MTL) by [3]. [sent-249, score-0.08]
88 Note that these kernels have the property that if P1 is a training data set, and P2 a test data set, then P1 ≠ P2 and so kP(P1, P2) is simply a constant. [sent-250, score-0.177]
89 This implies that the learning rules produced by these kernels do not adapt to the test distribution, unlike the proposed kernel. [sent-251, score-0.156]
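For contrast, the non-adaptive choices of kP described here can be sketched as a kernel on data-set identity rather than on the empirical distribution; this is a hypothetical rendering (the object-identity check is a simplification) with τ as the only parameter.

```python
def kP_identity(tau):
    """kP(P1, P2) = 1 if P1 and P2 are the same data set, tau otherwise.
    tau = 1 recovers pooling; 0 < tau < 1 mimics the MTL-style kernel."""
    def kP(X1, X2):
        return 1.0 if X1 is X2 else tau   # object identity as a stand-in for P1 = P2
    return kP
```

Such a kernel can be dropped into the earlier training sketch via `kP = kP_identity(tau=0.5)`; because a test sample is never one of the training samples, it returns the constant τ for every training/test pair.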
90 Thus, we study the natural algorithm of minimizing a regularized empirical loss over a reproducing kernel Hilbert space associated with the extended input domain PX × X. (Figure 2 appears here; its caption reads: “The misclassification rates (%) on training data sets and test data sets for different kP.”)
92 We also establish universal consistency, using a novel generalization error analysis under the inherent non-iid sampling plan, and a construction of a universal kernel on PX × X . [sent-264, score-0.354]
93 The algorithm is applied to flow cytometry autogating, and shown to improve upon kernels that do not adapt to the test distribution. [sent-266, score-0.362]
94 However, irregularities in a test patient’s heartbeat will differ from irregularities of historical patients, hence the need to adapt to the test distribution [22]. [sent-270, score-0.225]
95 We can also ask how the methodology and analysis can be extended to the context where a small number of labels are available for the test distribution, as is commonly assumed in transfer learning. [sent-271, score-0.116]
96 In this sense, the marginal-adaptive decision function learned from the training samples would serve as a “prior” for learning on the test data. [sent-277, score-0.093]
97 Smola, “A kernel approach to comparing distributions,” in Proceedings of the 22nd AAAI Conference on Artificial Intelligence, R. [sent-363, score-0.08]
98 Smola, “A kernel method for the two-sample-problem,” in Advances in Neural Information Processing Systems 19, B. [sent-373, score-0.08]
99 Steinwart, “Universal kernels on non-standard input spaces,” in Advances in Neural Information Processing Systems 23, J. [sent-389, score-0.1]
100 Spang, “Automated in-silico detection of cell populations in flow cytometry readouts and its application to leukemia disease monitoring,” BMC Bioinformatics, vol. [sent-426, score-0.249]
wordName wordTfidf (topN-words)
[('px', 0.714), ('pxy', 0.489), ('cytometry', 0.206), ('kx', 0.139), ('universal', 0.124), ('kp', 0.111), ('xij', 0.109), ('kernels', 0.1), ('yij', 0.098), ('py', 0.087), ('ni', 0.084), ('bk', 0.083), ('nt', 0.083), ('kernel', 0.08), ('ow', 0.075), ('hkx', 0.064), ('lymphocytes', 0.064), ('realizations', 0.056), ('yjt', 0.052), ('marginal', 0.051), ('hk', 0.051), ('patients', 0.043), ('test', 0.042), ('mtl', 0.042), ('blanchard', 0.042), ('risk', 0.039), ('historical', 0.039), ('epxy', 0.039), ('training', 0.035), ('loss', 0.033), ('labels', 0.029), ('shift', 0.029), ('distributions', 0.029), ('lkopf', 0.028), ('cells', 0.028), ('patient', 0.028), ('hilbert', 0.027), ('covariate', 0.027), ('clinical', 0.027), ('misclassi', 0.026), ('generalization', 0.026), ('bkx', 0.026), ('christmann', 0.026), ('epx', 0.026), ('epy', 0.026), ('heartbeat', 0.026), ('pathologist', 0.026), ('sch', 0.025), ('classi', 0.025), ('reproducing', 0.025), ('lk', 0.024), ('blood', 0.024), ('scott', 0.024), ('sample', 0.024), ('transfer', 0.023), ('ey', 0.023), ('laboratories', 0.023), ('irregularities', 0.023), ('cell', 0.023), ('gretton', 0.022), ('methodology', 0.022), ('steinwart', 0.021), ('drawing', 0.021), ('attribute', 0.02), ('pooling', 0.02), ('consistency', 0.02), ('motivating', 0.02), ('compact', 0.02), ('leukemia', 0.02), ('rademacher', 0.02), ('ex', 0.018), ('deduce', 0.018), ('rasch', 0.018), ('flow', 0.018), ('sup', 0.017), ('lder', 0.017), ('hinge', 0.017), ('sets', 0.017), ('scatter', 0.016), ('gating', 0.016), ('decision', 0.016), ('posterior', 0.016), ('distribution', 0.016), ('borgwardt', 0.016), ('unlabeled', 0.016), ('bayes', 0.016), ('transductive', 0.015), ('diagnosis', 0.015), ('smola', 0.015), ('si', 0.015), ('generates', 0.015), ('rkhs', 0.015), ('nity', 0.014), ('condition', 0.014), ('minimizer', 0.014), ('adapt', 0.014), ('mapping', 0.014), ('er', 0.014), ('ball', 0.013), ('regularized', 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 106 nips-2011-Generalizing from Several Related Classification Tasks to a New Unlabeled Sample
Author: Gilles Blanchard, Gyemin Lee, Clayton Scott
Abstract: We consider the problem of assigning class labels to an unlabeled test data set, given several labeled training data sets drawn from similar distributions. This problem arises in several applications where data distributions fluctuate because of biological, technical, or other sources of variation. We develop a distributionfree, kernel-based approach to the problem. This approach involves identifying an appropriate reproducing kernel Hilbert space and optimizing a regularized empirical risk over the space. We present generalization error analysis, describe universal kernels, and establish universal consistency of the proposed methodology. Experimental results on flow cytometry data are presented. 1
2 0.32933626 207 nips-2011-Optimal learning rates for least squares SVMs using Gaussian kernels
Author: Mona Eberts, Ingo Steinwart
Abstract: We prove a new oracle inequality for support vector machines with Gaussian RBF kernels solving the regularized least squares regression problem. To this end, we apply the modulus of smoothness. With the help of the new oracle inequality we then derive learning rates that can also be achieved by a simple data-dependent parameter selection method. Finally, it turns out that our learning rates are asymptotically optimal for regression functions satisfying certain standard smoothness conditions. 1
3 0.11214818 59 nips-2011-Composite Multiclass Losses
Author: Elodie Vernet, Mark D. Reid, Robert C. Williamson
Abstract: We consider loss functions for multiclass prediction problems. We show when a multiclass loss can be expressed as a “proper composite loss”, which is the composition of a proper loss and a link function. We extend existing results for binary losses to multiclass losses. We determine the stationarity condition, Bregman representation, order-sensitivity, existence and uniqueness of the composite representation for multiclass losses. We subsume existing results on “classification calibration” by relating it to properness and show that the simple integral representation for binary proper losses can not be extended to multiclass losses. 1
4 0.10684506 256 nips-2011-Solving Decision Problems with Limited Information
Author: Denis D. Maua, Cassio Campos
Abstract: We present a new algorithm for exactly solving decision-making problems represented as an influence diagram. We do not require the usual assumptions of no forgetting and regularity, which allows us to solve problems with limited information. The algorithm, which implements a sophisticated variable elimination procedure, is empirically shown to outperform a state-of-the-art algorithm in randomly generated problems of up to 150 variables and 1064 strategies. 1
5 0.099878117 162 nips-2011-Lower Bounds for Passive and Active Learning
Author: Maxim Raginsky, Alexander Rakhlin
Abstract: We develop unified information-theoretic machinery for deriving lower bounds for passive and active learning schemes. Our bounds involve the so-called Alexander’s capacity function. The supremum of this function has been recently rediscovered by Hanneke in the context of active learning under the name of “disagreement coefficient.” For passive learning, our lower bounds match the upper bounds of Gin´ and Koltchinskii up to constants and generalize analogous results of Mase sart and N´ d´ lec. For active learning, we provide first known lower bounds based e e on the capacity function rather than the disagreement coefficient. 1
6 0.096814871 139 nips-2011-Kernel Bayes' Rule
7 0.075289756 189 nips-2011-Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels
8 0.052543383 198 nips-2011-On U-processes and clustering performance
9 0.050746869 285 nips-2011-The Kernel Beta Process
10 0.047005467 29 nips-2011-Algorithms and hardness results for parallel large margin learning
11 0.045623094 114 nips-2011-Hierarchical Multitask Structured Output Learning for Large-scale Sequence Segmentation
12 0.045509037 171 nips-2011-Metric Learning with Multiple Kernels
13 0.043051321 248 nips-2011-Semi-supervised Regression via Parallel Field Regularization
14 0.042977341 12 nips-2011-A Two-Stage Weighting Framework for Multi-Source Domain Adaptation
15 0.041493606 301 nips-2011-Variational Gaussian Process Dynamical Systems
16 0.040774103 152 nips-2011-Learning in Hilbert vs. Banach Spaces: A Measure Embedding Viewpoint
17 0.039499812 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning
18 0.0391718 286 nips-2011-The Local Rademacher Complexity of Lp-Norm Multiple Kernel Learning
19 0.0391188 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
20 0.037492957 28 nips-2011-Agnostic Selective Classification
topicId topicWeight
[(0, 0.117), (1, -0.017), (2, -0.035), (3, -0.071), (4, -0.028), (5, 0.025), (6, 0.014), (7, -0.107), (8, -0.046), (9, 0.031), (10, -0.136), (11, 0.085), (12, 0.137), (13, 0.057), (14, -0.036), (15, -0.026), (16, 0.07), (17, 0.049), (18, -0.061), (19, 0.175), (20, -0.005), (21, -0.114), (22, 0.024), (23, 0.011), (24, -0.082), (25, 0.012), (26, -0.013), (27, -0.11), (28, -0.259), (29, 0.017), (30, 0.17), (31, 0.121), (32, 0.024), (33, -0.06), (34, -0.009), (35, 0.116), (36, 0.213), (37, -0.021), (38, -0.133), (39, -0.114), (40, 0.034), (41, -0.11), (42, -0.071), (43, 0.089), (44, -0.098), (45, -0.094), (46, -0.103), (47, -0.055), (48, -0.019), (49, 0.067)]
simIndex simValue paperId paperTitle
same-paper 1 0.94585121 106 nips-2011-Generalizing from Several Related Classification Tasks to a New Unlabeled Sample
Author: Gilles Blanchard, Gyemin Lee, Clayton Scott
Abstract: We consider the problem of assigning class labels to an unlabeled test data set, given several labeled training data sets drawn from similar distributions. This problem arises in several applications where data distributions fluctuate because of biological, technical, or other sources of variation. We develop a distributionfree, kernel-based approach to the problem. This approach involves identifying an appropriate reproducing kernel Hilbert space and optimizing a regularized empirical risk over the space. We present generalization error analysis, describe universal kernels, and establish universal consistency of the proposed methodology. Experimental results on flow cytometry data are presented. 1
2 0.81401438 207 nips-2011-Optimal learning rates for least squares SVMs using Gaussian kernels
Author: Mona Eberts, Ingo Steinwart
Abstract: We prove a new oracle inequality for support vector machines with Gaussian RBF kernels solving the regularized least squares regression problem. To this end, we apply the modulus of smoothness. With the help of the new oracle inequality we then derive learning rates that can also be achieved by a simple data-dependent parameter selection method. Finally, it turns out that our learning rates are asymptotically optimal for regression functions satisfying certain standard smoothness conditions. 1
3 0.60981238 152 nips-2011-Learning in Hilbert vs. Banach Spaces: A Measure Embedding Viewpoint
Author: Kenji Fukumizu, Gert R. Lanckriet, Bharath K. Sriperumbudur
Abstract: The goal of this paper is to investigate the advantages and disadvantages of learning in Banach spaces over Hilbert spaces. While many works have been carried out in generalizing Hilbert methods to Banach spaces, in this paper, we consider the simple problem of learning a Parzen window classifier in a reproducing kernel Banach space (RKBS)—which is closely related to the notion of embedding probability measures into an RKBS—in order to carefully understand its pros and cons over the Hilbert space classifier. We show that while this generalization yields richer distance measures on probabilities compared to its Hilbert space counterpart, it however suffers from serious computational drawback limiting its practical applicability, which therefore demonstrates the need for developing efficient learning algorithms in Banach spaces.
4 0.58281195 256 nips-2011-Solving Decision Problems with Limited Information
Author: Denis D. Maua, Cassio Campos
Abstract: We present a new algorithm for exactly solving decision-making problems represented as an influence diagram. We do not require the usual assumptions of no forgetting and regularity, which allows us to solve problems with limited information. The algorithm, which implements a sophisticated variable elimination procedure, is empirically shown to outperform a state-of-the-art algorithm in randomly generated problems of up to 150 variables and 1064 strategies. 1
5 0.55763066 59 nips-2011-Composite Multiclass Losses
Author: Elodie Vernet, Mark D. Reid, Robert C. Williamson
Abstract: We consider loss functions for multiclass prediction problems. We show when a multiclass loss can be expressed as a “proper composite loss”, which is the composition of a proper loss and a link function. We extend existing results for binary losses to multiclass losses. We determine the stationarity condition, Bregman representation, order-sensitivity, existence and uniqueness of the composite representation for multiclass losses. We subsume existing results on “classification calibration” by relating it to properness and show that the simple integral representation for binary proper losses can not be extended to multiclass losses. 1
6 0.50359762 139 nips-2011-Kernel Bayes' Rule
7 0.3026005 162 nips-2011-Lower Bounds for Passive and Active Learning
8 0.29644492 189 nips-2011-Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels
9 0.28846326 201 nips-2011-On the Completeness of First-Order Knowledge Compilation for Lifted Probabilistic Inference
10 0.26905018 198 nips-2011-On U-processes and clustering performance
11 0.26325023 305 nips-2011-k-NN Regression Adapts to Local Intrinsic Dimension
12 0.24735808 238 nips-2011-Relative Density-Ratio Estimation for Robust Distribution Comparison
13 0.24721447 285 nips-2011-The Kernel Beta Process
14 0.24491607 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines
15 0.23684889 295 nips-2011-Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction
16 0.23645952 284 nips-2011-The Impact of Unlabeled Patterns in Rademacher Complexity Theory for Kernel Classifiers
17 0.23107043 248 nips-2011-Semi-supervised Regression via Parallel Field Regularization
18 0.2292472 69 nips-2011-Differentially Private M-Estimators
19 0.21802618 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators
20 0.21492535 237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization
topicId topicWeight
[(0, 0.066), (4, 0.07), (20, 0.04), (26, 0.037), (31, 0.059), (33, 0.014), (43, 0.063), (45, 0.129), (48, 0.042), (57, 0.024), (62, 0.203), (65, 0.015), (74, 0.032), (83, 0.043), (99, 0.035)]
simIndex simValue paperId paperTitle
1 0.82338458 161 nips-2011-Linearized Alternating Direction Method with Adaptive Penalty for Low-Rank Representation
Author: Zhouchen Lin, Risheng Liu, Zhixun Su
Abstract: Many machine learning and signal processing problems can be formulated as linearly constrained convex programs, which could be efficiently solved by the alternating direction method (ADM). However, usually the subproblems in ADM are easily solvable only when the linear mappings in the constraints are identities. To address this issue, we propose a linearized ADM (LADM) method by linearizing the quadratic penalty term and adding a proximal term when solving the subproblems. For fast convergence, we also allow the penalty to change adaptively according a novel update rule. We prove the global convergence of LADM with adaptive penalty (LADMAP). As an example, we apply LADMAP to solve lowrank representation (LRR), which is an important subspace clustering technique yet suffers from high computation cost. By combining LADMAP with a skinny SVD representation technique, we are able to reduce the complexity O(n3 ) of the original ADM based method to O(rn2 ), where r and n are the rank and size of the representation matrix, respectively, hence making LRR possible for large scale applications. Numerical experiments verify that for LRR our LADMAP based methods are much faster than state-of-the-art algorithms. 1
2 0.7900933 216 nips-2011-Portmanteau Vocabularies for Multi-Cue Image Representation
Author: Fahad S. Khan, Joost Weijer, Andrew D. Bagdanov, Maria Vanrell
Abstract: We describe a novel technique for feature combination in the bag-of-words model of image classification. Our approach builds discriminative compound words from primitive cues learned independently from training images. Our main observation is that modeling joint-cue distributions independently is more statistically robust for typical classification problems than attempting to empirically estimate the dependent, joint-cue distribution directly. We use Information theoretic vocabulary compression to find discriminative combinations of cues and the resulting vocabulary of portmanteau1 words is compact, has the cue binding property, and supports individual weighting of cues in the final image representation. State-of-theart results on both the Oxford Flower-102 and Caltech-UCSD Bird-200 datasets demonstrate the effectiveness of our technique compared to other, significantly more complex approaches to multi-cue image representation. 1
same-paper 3 0.78622794 106 nips-2011-Generalizing from Several Related Classification Tasks to a New Unlabeled Sample
Author: Gilles Blanchard, Gyemin Lee, Clayton Scott
Abstract: We consider the problem of assigning class labels to an unlabeled test data set, given several labeled training data sets drawn from similar distributions. This problem arises in several applications where data distributions fluctuate because of biological, technical, or other sources of variation. We develop a distributionfree, kernel-based approach to the problem. This approach involves identifying an appropriate reproducing kernel Hilbert space and optimizing a regularized empirical risk over the space. We present generalization error analysis, describe universal kernels, and establish universal consistency of the proposed methodology. Experimental results on flow cytometry data are presented. 1
4 0.77382362 208 nips-2011-Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness
Author: Rémi Munos
Abstract: We consider a global optimization problem of a deterministic function f in a semimetric space, given a finite budget of n evaluations. The function f is assumed to be locally smooth (around one of its global maxima) with respect to a semi-metric ℓ. We describe two algorithms based on optimistic exploration that use a hierarchical partitioning of the space at all scales. A first contribution is an algorithm, DOO, that requires the knowledge of ℓ. We report a finite-sample performance bound in terms of a measure of the quantity of near-optimal states. We then define a second algorithm, SOO, which does not require the knowledge of the semimetric ℓ under which f is smooth, and whose performance is almost as good as DOO optimally-fitted.
5 0.6826244 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari
Abstract: Learning theory has largely focused on two main learning scenarios: the classical statistical setting where instances are drawn i.i.d. from a fixed distribution, and the adversarial scenario wherein, at every time step, an adversarially chosen instance is revealed to the player. It can be argued that in the real world neither of these assumptions is reasonable. We define the minimax value of a game where the adversary is restricted in his moves, capturing stochastic and non-stochastic assumptions on data. Building on the sequential symmetrization approach, we define a notion of distribution-dependent Rademacher complexity for the spectrum of problems ranging from i.i.d. to worst-case. The bounds let us immediately deduce variation-type bounds. We study a smoothed online learning scenario and show that exponentially small amount of noise can make function classes with infinite Littlestone dimension learnable. 1
6 0.6738475 80 nips-2011-Efficient Online Learning via Randomized Rounding
7 0.67154044 198 nips-2011-On U-processes and clustering performance
8 0.66751015 22 nips-2011-Active Ranking using Pairwise Comparisons
9 0.66552669 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
10 0.66462076 114 nips-2011-Hierarchical Multitask Structured Output Learning for Large-scale Sequence Segmentation
11 0.66437167 29 nips-2011-Algorithms and hardness results for parallel large margin learning
12 0.66334391 231 nips-2011-Randomized Algorithms for Comparison-based Search
13 0.66027957 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
14 0.6579197 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
15 0.65678972 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition
16 0.65582967 199 nips-2011-On fast approximate submodular minimization
17 0.65470868 265 nips-2011-Sparse recovery by thresholded non-negative least squares
18 0.6545887 64 nips-2011-Convergent Bounds on the Euclidean Distance
19 0.65455353 236 nips-2011-Regularized Laplacian Estimation and Fast Eigenvector Approximation
20 0.65233123 186 nips-2011-Noise Thresholds for Spectral Clustering