nips nips2011 nips2011-106 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Gilles Blanchard, Gyemin Lee, Clayton Scott
Abstract: We consider the problem of assigning class labels to an unlabeled test data set, given several labeled training data sets drawn from similar distributions. This problem arises in several applications where data distributions fluctuate because of biological, technical, or other sources of variation. We develop a distributionfree, kernel-based approach to the problem. This approach involves identifying an appropriate reproducing kernel Hilbert space and optimizing a regularized empirical risk over the space. We present generalization error analysis, describe universal kernels, and establish universal consistency of the proposed methodology. Experimental results on flow cytometry data are presented. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: We consider the problem of assigning class labels to an unlabeled test data set, given several labeled training data sets drawn from similar distributions. [sent-4, score-0.139]
2 This approach involves identifying an appropriate reproducing kernel Hilbert space and optimizing a regularized empirical risk over the space. [sent-7, score-0.157]
3 We present generalization error analysis, describe universal kernels, and establish universal consistency of the proposed methodology. [sent-8, score-0.294]
4 For a given distribution PXY , we refer to the X marginal distribution PX as simply the marginal distribution, and the conditional PXY (Y |X) as the posterior distribution. [sent-14, score-0.15]
5 For each i, there is a training sample Si = (Xij, Yij)_{1≤j≤ni} of iid realizations of PXY^(i). [sent-19, score-0.115]
6 There is also a test distribution PXY^T that is similar to, but again distinct from, the “training distributions” PXY^(i). [sent-20, score-0.058]
7 Finally, there is a test sample (Xj^T, Yj^T)_{1≤j≤nT} of iid realizations of PXY^T, but in this case the labels Yj^T are not observed. [sent-21, score-0.151]
8 Essentially, given a random sample from the marginal test distribution PX^T, we would like to predict the corresponding labels. [sent-23, score-0.133]
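To make the data layout concrete, here is a minimal sketch in Python; the variable names, array shapes, and synthetic values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# N labeled training samples, each drawn from its own distribution PXY^(i):
# training_sets[i] = (X_i, y_i), with X_i of shape (n_i, d) and labels y_i in {-1, +1}.
training_sets = [
    (rng.normal(size=(100, 3)), rng.choice([-1.0, 1.0], size=100))
    for _ in range(5)
]

# One unlabeled test sample, drawn from a new but similar distribution PXY^T.
X_test = rng.normal(size=(80, 3))

# Goal: learn, from the labeled training samples alone, a rule that maps
# (empirical test marginal, test point) -> a predicted label for every row of X_test.
```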
9 Thus, when we say that the training and test distributions are “similar,” we mean that there is some pattern making it possible to learn a mapping from marginal distributions to labels. [sent-24, score-0.2]
10 In multi-task learning, only the training distributions are of interest, and the goal is to use the similarity among distributions to improve the training of individual classifiers [1, 2, 3]. [sent-28, score-0.128]
11 In the covariate shift problem, the marginal test distribution is different from the marginal training distribution(s), but the posterior distribution is assumed to be the same [4]. [sent-30, score-0.283]
12 In our case, both marginal and posterior test distributions can differ from their training counterparts [5]. [sent-31, score-0.173]
13 Finally, in transfer learning, it is typically assumed that at least a few labels are available for the test data, and the training data sets are used to improve the performance of a standard classifier, for example by learning a metric or embedding which is appropriate for all data sets [6, 7]. [sent-32, score-0.163]
14 In our case, no test labels are available, but we hope that through access to multiple training data sets, it is still possible to obtain collective knowledge about the “labeling process” that may be transferred to the test distribution. [sent-33, score-0.148]
15 Some authors have considered transductive transfer learning, which is similar to the problem studied here in that no test labels are available. [sent-34, score-0.109]
16 However, existing work has focused on the case N = 1 and typically relies on the covariate shift assumption [8]. [sent-35, score-0.056]
17 Our methodology is shown to yield a consistent learning procedure, meaning that the generalization error tends to the best possible as the sample sizes N, {ni }, nT tend to infinity. [sent-37, score-0.072]
18 We also offer a proof-of-concept experimental study validating the proposed approach on flow cytometry data, including comparisons to multi-task kernels and a simple pooling approach. [sent-38, score-0.326]
19 2 Motivating Application: Automatic Gating of Flow Cytometry Data Flow cytometry is a high-throughput measurement platform that is an important clinical tool for the diagnosis of many blood-related pathologies. [sent-39, score-0.248]
20 This technology allows for quantitative analysis of individual cells from a given population, derived for example from a blood sample from a patient. [sent-40, score-0.076]
21 We may think of a flow cytometry data set as a set of d-dimensional attribute vectors (Xj )1≤j≤n , where n is the number of cells analyzed, and d is the number of attributes recorded per cell. [sent-41, score-0.254]
22 Thus, a flow cytometry data set is a random sample from a patient-specific distribution. [sent-43, score-0.23]
23 For example, lymphocytes are known to be relevant for the diagnosis of leukemia, whereas non-lymphocytes may potentially confound the analysis. [sent-46, score-0.079]
24 Because of the variability in flow cytometry data, this process is difficult to quantify in terms of a small subset of simple rules. [sent-50, score-0.206]
25 Since clinical laboratories maintain historical databases, we can assume access to a number (N ) of historical patients that have already been expert-gated. [sent-53, score-0.171]
26 Because of biological and technical variations in flow cytometry data, the distributions PXY^(i) of the historical patients will vary. [sent-54, score-0.317]
27 Nonetheless, there are certain general trends that are known to hold for all flow cytometry measurements. [sent-57, score-0.206]
28 For example, lymphocytes are known to exhibit low levels of the “side-scatter” (SS) attribute, while expressing high levels of the attribute CD45 (see column 2 of Fig. [sent-58, score-0.084]
29 Therefore, it is reasonable to assume that there is an underlying distribution (on distributions) governing flow cytometry data sets, that produces roughly similar distributions thereby making possible the automation of the gating process. [sent-63, score-0.267]
30 The distribution of cells differs from patient to patient. [sent-67, score-0.072]
31 The training distributions PXY^(i) are iid realizations from µ, and the sample Si is made of ni iid realizations of (X, Y) following the distribution PXY^(i). [sent-82, score-0.164] [sent-85, score-0.072]
33 Now consider a test distribution PXY^T and test sample S^T = (Xj^T, Yj^T)_{1≤j≤nT}, whose labels are not observed. [sent-86, score-0.153]
34 If ℓ : R × Y → R+ is a loss, then the average loss incurred on the test sample is (1/nT) Σ_{i=1}^{nT} ℓ(Ŷi^T, Yi^T). [sent-88, score-0.099]
35 Based on this, we define the average generalization error of a decision function over test samples of size nT as E(f, nT) := E_{PXY^T ∼ µ} E_{S^T ∼ (PXY^T)^{⊗nT}} [ (1/nT) Σ_{i=1}^{nT} ℓ(f(P̂X^T, Xi^T), Yi^T) ]. (1) [sent-89, score-0.084]
36 An important point of the analysis is that, at training time as well as at test time, the marginal distribution PX for a sample is only known through the sample itself, that is, through the empirical marginal P̂X. [sent-90, score-0.243]
37 As is clear from equation (1), because of this the generalization error also depends on the test sample size nT. [sent-91, score-0.092]
38 This motivates the following generalization error when we have an infinite test sample, where we then assume that the true marginal PX^T is observed: E(f, ∞) := E_{PXY^T ∼ µ} E_{(X^T, Y^T) ∼ PXY^T} [ ℓ(f(PX^T, X^T), Y^T) ]. (2) [sent-93, score-0.119]
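For readability, the two generalization errors can also be written as display equations; this is only a transcription of (1) and (2) above, with ℓ the loss, P̂X^T the empirical test marginal, and PX^T the true test marginal.

```latex
% Average generalization error over test samples of size n_T  (eq. 1)
E(f, n_T) :=
  \mathbb{E}_{P_{XY}^{T} \sim \mu}\;
  \mathbb{E}_{S^{T} \sim (P_{XY}^{T})^{\otimes n_T}}
  \left[ \frac{1}{n_T} \sum_{i=1}^{n_T}
         \ell\bigl(f(\widehat{P}_X^{T}, X_i^{T}),\, Y_i^{T}\bigr) \right]

% Infinite-test-sample version, where the true marginal is observed  (eq. 2)
E(f, \infty) :=
  \mathbb{E}_{P_{XY}^{T} \sim \mu}\;
  \mathbb{E}_{(X^{T}, Y^{T}) \sim P_{XY}^{T}}
  \left[ \ell\bigl(f(P_X^{T}, X^{T}),\, Y^{T}\bigr) \right]
```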
39 To gain some insight into this risk, let us decompose µ into two parts: µX, which generates the marginal distribution PX, and µY|X, which, conditioned on PX, generates the posterior PY|X. [sent-94, score-0.113]
40 Here Qµ is the distribution that generates X̃ by first drawing PX according to µX, and then drawing X̃ according to PX. [sent-97, score-0.073]
41 From this last expression, we see that the risk is like a standard binary classification risk based on (X̃, Y) ∼ Qµ. [sent-99, score-0.078]
42 Despite the similarity to standard binary classification in the infinite sample case, we emphasize that the learning task here is different, because the realizations (Xij , Yij ) are neither independent nor identically distributed. [sent-105, score-0.08]
43 Finally, we note that there is a condition under which, for µ-almost every test distribution PXY^T, the classifier f*(PX^T, ·) (where f* is the global minimizer of (2)) coincides with the optimal Bayes classifier for PXY^T, although no labels from this test distribution are observed. [sent-106, score-0.072] [sent-107, score-0.101]
45 Although we will not be assuming this condition throughout the paper, it is implicitly assumed in the motivating application presented in Section 2, where an expert labels the data points by just looking at their marginal distribution. [sent-110, score-0.114]
46 For a fixed distribution PXY and a decision function f : X → R, let us denote R(f, PXY) = E_{(X,Y)∼PXY}[ℓ(f(X), Y)], and R*(PXY) := min_{f:X→R} R(f, PXY) = min_{f:X→R} E_{(X,Y)∼PXY}[ℓ(f(X), Y)] the corresponding optimal (Bayes) risk for the loss function ℓ. [sent-113, score-0.104]
47 4 Learning Algorithm We consider an approach based on positive semi-definite kernels, or simply kernels for short. [sent-128, score-0.1]
48 Background information on kernels, including the definition, normalized kernels, universal kernels, and reproducing kernel Hilbert spaces (RKHSs), may be found in [11]. [sent-129, score-0.229]
49 Several well-known learning algorithms, such as support vector machines and kernel ridge regression, may be viewed as minimizers of a norm-regularized empirical risk over the RKHS of a kernel. [sent-130, score-0.119]
50 Inspired by this framework, we consider a general kernel algorithm as follows. [sent-132, score-0.08]
51 Let k be a kernel on PX × X, and let Hk be the associated RKHS. [sent-134, score-0.08]
52 Now define fλ = arg min_{f∈Hk} (1/N) Σ_{i=1}^{N} (1/ni) Σ_{j=1}^{ni} ℓ(f(P̂X^(i), Xij), Yij) + λ ‖f‖². [sent-138, score-0.168]
53 The final predictor has the form fλ(P̂X, x) = Σ_{i=1}^{N} Σ_{j=1}^{ni} αij Yij k((P̂X^(i), Xij), (P̂X, x)), where the αij are nonnegative and mostly zero. [sent-140, score-0.084]
54 In the rest of the paper we will consider a kernel k on PX × X of the product form k((P1, x1), (P2, x2)) = kP(P1, P2) kX(x1, x2), (4) where kP is a kernel on PX and kX is a kernel on X. [sent-142, score-0.24]
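A minimal sketch of this training and prediction step, with squared loss substituted for the generic loss ℓ so the regularized minimization has a closed form (weighted kernel ridge regression); `kP` is any kernel on empirical distributions (one concrete choice is sketched further below), and all function names are illustrative, not from the paper.

```python
import numpy as np

def kX_gram(kX, A, B):
    """Pairwise Gram matrix of the point-level kernel kX between rows of A and B."""
    return np.array([[kX(a, b) for b in B] for a in A])

def fit_product_kernel_erm(training_sets, kP, kX, lam):
    """Minimize (1/N) sum_i (1/n_i) sum_j (f(P_i, X_ij) - Y_ij)^2 + lam * ||f||^2
    over the RKHS of k((P1, x1), (P2, x2)) = kP(P1, P2) * kX(x1, x2).
    training_sets: list of (X_i, y_i) with X_i of shape (n_i, d)."""
    y_all = np.concatenate([y for _, y in training_sets])
    N = len(training_sets)
    # weights 1/(N * n_i), matching the averaged empirical risk
    w = np.concatenate([np.full(len(y), 1.0 / (N * len(y))) for _, y in training_sets])
    # Gram matrix of the product kernel over all training pairs (P_hat^(i), X_ij)
    blocks = [[kP(Xi, Xj) * kX_gram(kX, Xi, Xj) for Xj, _ in training_sets]
              for Xi, _ in training_sets]
    K = np.block(blocks)
    # one standard solution of the weighted kernel ridge problem
    W = np.diag(w)
    alpha = np.linalg.solve(W @ K + lam * np.eye(len(y_all)), W @ y_all)
    return alpha

def predict(alpha, training_sets, kP, kX, X_test):
    """Evaluate sign of f(P_hat^T, x) for each test point x; the empirical test
    marginal P_hat^T is represented by the test sample X_test itself."""
    rows = [kP(Xi, X_test) * kX_gram(kX, Xi, X_test) for Xi, _ in training_sets]
    K_cross = np.vstack(rows)          # shape: (total training points, n_test)
    return np.sign(alpha @ K_cross)
```

The test sample enters only through kP(Xi, X_test) inside predict, which is how the learned rule adapts to the test marginal; the squared loss is just a convenient stand-in for the generic loss used in the paper.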
55 Furthermore, we will consider kernels on PX of a particular form. [sent-143, score-0.1]
56 Let kX′ denote a kernel on X (which might be different from kX) that is measurable and bounded. [sent-144, score-0.08]
57 Therefore, if we consider the kernel kP(PX, PX′) = ⟨Ψ(PX), Ψ(PX′)⟩, it is a linear kernel on PX and cannot be a universal kernel. [sent-148, score-0.094]
58 For this reason, we introduce yet another kernel K on HkX′ and consider the kernel on PX given by kP(PX, PX′) = K(Ψ(PX), Ψ(PX′)). (6) [sent-149, score-0.08]
59 Note that particular kernels inspired by the finite-dimensional case are of the form K(v, v′) = F(‖v − v′‖), (7) or K(v, v′) = G(⟨v, v′⟩), (8) where F, G are real functions of a real variable such that they define a kernel. [sent-150, score-0.1]
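A sketch of this construction under the assumption that kX′ is a Gaussian kernel: the inner product ⟨Ψ(P̂1), Ψ(P̂2)⟩ between empirical mean embeddings is the average of kX′ over all cross pairs, and plugging the induced distance into a kernel of form (7) gives a Gaussian-like kP. The bandwidth names below are illustrative.

```python
import numpy as np

def gaussian_gram(A, B, sigma):
    """kX'(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) for all pairs of rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def embedding_inner(X1, X2, sigma_prime):
    """<Psi(P_hat_1), Psi(P_hat_2)> = average of kX'(x, x') over all cross pairs."""
    return gaussian_gram(X1, X2, sigma_prime).mean()

def kP_gaussian_like(X1, X2, sigma_prime=1.0, sigma_P=1.0):
    """K(Psi(P1), Psi(P2)) = exp(-||Psi(P1) - Psi(P2)||^2 / (2 sigma_P^2)),
    with the squared RKHS distance expanded into three inner products."""
    sq_dist = (embedding_inner(X1, X1, sigma_prime)
               + embedding_inner(X2, X2, sigma_prime)
               - 2.0 * embedding_inner(X1, X2, sigma_prime))
    return float(np.exp(-sq_dist / (2.0 * sigma_P ** 2)))
```

A kernel of this kind can be passed directly to the training sketch above, e.g. `kP = lambda A, B: kP_gaussian_like(A, B, sigma_prime=1.0, sigma_P=1.0)`.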
60 Below we apply their results to deduce that k is a universal kernel for certain choices of kX, kX′, and K. [sent-153, score-0.222]
61 We then discuss universal kernels, and finally deduce universal consistency of the algorithm. [sent-156, score-0.286]
62 To simplify somewhat the analysis, we assume below that all training samples have the same size ni = n. [sent-157, score-0.119]
63 Also let Bk (r) denote the closed ball of radius r, centered at the origin, in the RKHS of the kernel k. [sent-158, score-0.093]
64 We consider the following assumptions on the loss and kernels: (Loss) The loss function ℓ : R × Y → R+ is Lℓ-Lipschitz in its first variable and bounded by Bℓ. [sent-159, score-0.066]
65 (Kernels-A) The kernels kX, kX′ and K are bounded respectively by constants Bk², Bk′² ≥ 1, and BK². [sent-160, score-0.1]
66 As an example, the condition is shown to hold with α = 1 when K is the Gaussian-like kernel on HkX′. [sent-163, score-0.094]
67 We use the decomposition sup_{f∈Bk(R)} (1/N) Σ_{i=1}^{N} (1/ni) Σ_{j=1}^{ni} ℓ(f(P̂X^(i), Xij), Yij) − E(f, ∞) ≤ sup_{f∈Bk(R)} (1/N) Σ_{i=1}^{N} (1/ni) Σ_{j=1}^{ni} [ℓ(f(P̂X^(i), Xij), Yij) − ℓ(f(PX^(i), Xij), Yij)] + sup_{f∈Bk(R)} (1/N) Σ_{i=1}^{N} (1/ni) Σ_{j=1}^{ni} ℓ(f(PX^(i), Xij), Yij) − E(f, ∞) =: (I) + (II). [sent-184, score-0.555]
68 This can be obtained using the reproducing property of the kernel k, the convergence of Ψ(P̂X^(i)) to Ψ(PX^(i)) as a consequence of Hoeffding’s inequality in a Hilbert space, and the other assumptions (boundedness/Hölder property) on the kernels. [sent-191, score-0.122]
69 In both cases, a standard approach using the Azuma-McDiarmid inequality [17] followed by symmetrization and Rademacher complexity analysis on a kernel space [18, 19] can be applied. [sent-193, score-0.08]
70 To establish that k is universal on PX × X , the following lemma is useful. [sent-198, score-0.124]
71 Let Ω, Ω′ be two compact spaces and k, k′ be kernels on Ω, Ω′, respectively. [sent-201, score-0.12]
72 If k, k′ are both universal, then the product kernel defined by ((x, x′), (y, y′)) ↦ k(x, y) k′(x′, y′) is universal on Ω × Ω′. [sent-202, score-0.204]
73 Several examples of universal kernels are known on Euclidean space. [sent-203, score-0.224]
74 Some additional assumptions on the kernels and feature space are required: (Kernels-B) kX, kX′, K, and X satisfy the following: X is a compact metric space; kX is universal on X; kX′ is continuous and universal on X; K is universal on any compact subset of HkX′. [sent-206, score-0.512]
75 Then, for kP defined as in (6), the product kernel k in (4) is universal on PX ×X . [sent-211, score-0.204]
76 Furthermore, the assumption on K is fulfilled if K is of the form (8), where G is an analytical function with positive Taylor series coefficients, or if K is the normalized kernel associated to such a kernel. [sent-212, score-0.08]
77 Taking G(t) = exp(t), it follows that K(PX, PX′) = exp(⟨Ψ(PX), Ψ(PX′)⟩_{HkX′}) is universal on PX. [sent-215, score-0.124]
78 By similar reasoning as in the finite-dimensional case, the Gaussian-like kernel K(PX, PX′) = exp(−(1/(2σ²)) ‖Ψ(PX) − Ψ(PX′)‖²_{HkX′}) is also universal on PX. [sent-216, score-0.204]
79 6 Experiments We demonstrate the proposed methodology for flow cytometry data auto-gating, described above. [sent-223, score-0.228]
80 Peripheral blood samples were obtained from 35 normal patients, and lymphocytes were classified by a domain expert. [sent-224, score-0.088]
81 The corresponding flow cytometry data sets have sample sizes ranging from 10,000 to 100,000, and the proportion of lymphocytes in each data set ranges from 10 to 40%. [sent-225, score-0.311]
82 To speed training time, we subsampled the 10 training data sets to have 1000 data points (cells) each. [sent-227, score-0.087]
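An end-to-end sketch of this experimental setup, reusing the illustrative functions from the earlier sketches; the synthetic data, subsample size, bandwidths, and regularization value below are placeholders, not the paper's data or tuned settings.

```python
import numpy as np
from functools import partial

rng = np.random.default_rng(0)

# synthetic stand-in for expert-gated flow cytometry data: (cells x attributes, labels)
patient_data = [(rng.normal(size=(5000, 6)), rng.choice([-1.0, 1.0], size=5000))
                for _ in range(11)]

def subsample(X, y, m=200):            # the paper subsamples to 1000; kept small here
    idx = rng.choice(len(X), size=m, replace=False)
    return X[idx], y[idx]

train_sets = [subsample(X, y) for X, y in patient_data[:10]]
X_test, y_test_hidden = patient_data[10]        # test labels are withheld from training

kX = lambda a, b: np.exp(-np.sum((a - b) ** 2) / (2 * 1.0 ** 2))  # placeholder sigma_X
kP = partial(kP_gaussian_like, sigma_prime=1.0, sigma_P=1.0)      # placeholder bandwidths

alpha = fit_product_kernel_erm(train_sets, kP, kX, lam=1e-3)
y_hat = predict(alpha, train_sets, kP, kX, X_test[:500])          # predict on a subset
```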
83 The kernels kX, kX′, and K are all taken to be Gaussian kernels with respective bandwidths σX, σX′, and σ.
84 (Table 1 appears here; its caption reads: “The misclassification rates (%) on training data sets and test data sets for different kP. The proposed method adapts the decision function to the test data (through its empirical marginal).” Its rows include Pooling (τ = 1).)
85 For comparison, we also considered three other options for kP, all of the form kP(P1, P2) = 1 if P1 = P2 and kP(P1, P2) = τ otherwise; taking τ = 1 amounts to pooling all the training data.
86 This idea has been previously studied in the context of flow cytometry by [21]. [sent-248, score-0.206]
87 When 0 < τ < 1, we obtain a kernel like what was used for multi-task learning (MTL) by [3]. [sent-249, score-0.08]
88 Note that these kernels have the property that if P1 is a training data set, and P2 a test data set, then P1 ≠ P2 and so kP(P1, P2) is simply a constant. [sent-250, score-0.177]
89 This implies that the learning rules produced by these kernels do not adapt to the test distribution, unlike the proposed kernel. [sent-251, score-0.156]
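For contrast, the non-adaptive choices of kP described here can be sketched as a kernel on data-set identity rather than on the empirical distribution; this is a hypothetical rendering (the object-identity check is a simplification) with τ as the only parameter.

```python
def kP_identity(tau):
    """kP(P1, P2) = 1 if P1 and P2 are the same data set, tau otherwise.
    tau = 1 recovers pooling; 0 < tau < 1 mimics the MTL-style kernel."""
    def kP(X1, X2):
        return 1.0 if X1 is X2 else tau   # object identity as a stand-in for P1 = P2
    return kP
```

Such a kernel can be dropped into the earlier training sketch via `kP = kP_identity(tau=0.5)`; because a test sample is never one of the training samples, it returns the constant τ for every training/test pair.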
90 Thus, we study the natural algorithm of minimizing a regularized empirical loss over a reproducing kernel Hilbert space associated with the extended input domain PX × X. (Figure 2 appears here; its caption reads: “The misclassification rates (%) on training data sets and test data sets for different kP.”)
92 We also establish universal consistency, using a novel generalization error analysis under the inherent non-iid sampling plan, and a construction of a universal kernel on PX × X . [sent-264, score-0.354]
93 The algorithm is applied to flow cytometry autogating, and shown to improve upon kernels that do not adapt to the test distribution. [sent-266, score-0.362]
94 However, irregularities in a test patient’s heartbeat will differ from irregularities of historical patients, hence the need to adapt to the test distribution [22]. [sent-270, score-0.225]
95 We can also ask how the methodology and analysis can be extended to the context where a small number of labels are available for the test distribution, as is commonly assumed in transfer learning. [sent-271, score-0.116]
96 In this sense, the marginal-adaptive decision function learned from the training samples would serve as a “prior” for learning on the test data. [sent-277, score-0.093]
97 Smola, “A kernel approach to comparing distributions,” in Proceedings of the 22nd AAAI Conference on Artificial Intelligence, R. [sent-363, score-0.08]
98 Smola, “A kernel method for the two-sample-problem,” in Advances in Neural Information Processing Systems 19, B. [sent-373, score-0.08]
99 Steinwart, “Universal kernels on non-standard input spaces,” in Advances in Neural Information Processing Systems 23, J. [sent-389, score-0.1]
100 Spang, “Automated in-silico detection of cell populations in flow cytometry readouts and its application to leukemia disease monitoring,” BMC Bioinformatics, vol. [sent-426, score-0.249]
wordName wordTfidf (topN-words)
[('px', 0.714), ('pxy', 0.489), ('cytometry', 0.206), ('kx', 0.139), ('universal', 0.124), ('kp', 0.111), ('xij', 0.109), ('kernels', 0.1), ('yij', 0.098), ('py', 0.087), ('ni', 0.084), ('bk', 0.083), ('nt', 0.083), ('kernel', 0.08), ('ow', 0.075), ('hkx', 0.064), ('lymphocytes', 0.064), ('realizations', 0.056), ('yjt', 0.052), ('marginal', 0.051), ('hk', 0.051), ('patients', 0.043), ('test', 0.042), ('mtl', 0.042), ('blanchard', 0.042), ('risk', 0.039), ('historical', 0.039), ('epxy', 0.039), ('training', 0.035), ('loss', 0.033), ('labels', 0.029), ('shift', 0.029), ('distributions', 0.029), ('lkopf', 0.028), ('cells', 0.028), ('patient', 0.028), ('hilbert', 0.027), ('covariate', 0.027), ('clinical', 0.027), ('misclassi', 0.026), ('generalization', 0.026), ('bkx', 0.026), ('christmann', 0.026), ('epx', 0.026), ('epy', 0.026), ('heartbeat', 0.026), ('pathologist', 0.026), ('sch', 0.025), ('classi', 0.025), ('reproducing', 0.025), ('lk', 0.024), ('blood', 0.024), ('scott', 0.024), ('sample', 0.024), ('transfer', 0.023), ('ey', 0.023), ('laboratories', 0.023), ('irregularities', 0.023), ('cell', 0.023), ('gretton', 0.022), ('methodology', 0.022), ('steinwart', 0.021), ('drawing', 0.021), ('attribute', 0.02), ('pooling', 0.02), ('consistency', 0.02), ('motivating', 0.02), ('compact', 0.02), ('leukemia', 0.02), ('rademacher', 0.02), ('ex', 0.018), ('deduce', 0.018), ('rasch', 0.018), ('flow', 0.018), ('sup', 0.017), ('lder', 0.017), ('hinge', 0.017), ('sets', 0.017), ('scatter', 0.016), ('gating', 0.016), ('decision', 0.016), ('posterior', 0.016), ('distribution', 0.016), ('borgwardt', 0.016), ('unlabeled', 0.016), ('bayes', 0.016), ('transductive', 0.015), ('diagnosis', 0.015), ('smola', 0.015), ('si', 0.015), ('generates', 0.015), ('rkhs', 0.015), ('nity', 0.014), ('condition', 0.014), ('minimizer', 0.014), ('adapt', 0.014), ('mapping', 0.014), ('er', 0.014), ('ball', 0.013), ('regularized', 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 106 nips-2011-Generalizing from Several Related Classification Tasks to a New Unlabeled Sample
Author: Gilles Blanchard, Gyemin Lee, Clayton Scott
Abstract: We consider the problem of assigning class labels to an unlabeled test data set, given several labeled training data sets drawn from similar distributions. This problem arises in several applications where data distributions fluctuate because of biological, technical, or other sources of variation. We develop a distributionfree, kernel-based approach to the problem. This approach involves identifying an appropriate reproducing kernel Hilbert space and optimizing a regularized empirical risk over the space. We present generalization error analysis, describe universal kernels, and establish universal consistency of the proposed methodology. Experimental results on flow cytometry data are presented. 1
2 0.32933626 207 nips-2011-Optimal learning rates for least squares SVMs using Gaussian kernels
Author: Mona Eberts, Ingo Steinwart
Abstract: We prove a new oracle inequality for support vector machines with Gaussian RBF kernels solving the regularized least squares regression problem. To this end, we apply the modulus of smoothness. With the help of the new oracle inequality we then derive learning rates that can also be achieved by a simple data-dependent parameter selection method. Finally, it turns out that our learning rates are asymptotically optimal for regression functions satisfying certain standard smoothness conditions. 1
3 0.11214818 59 nips-2011-Composite Multiclass Losses
Author: Elodie Vernet, Mark D. Reid, Robert C. Williamson
Abstract: We consider loss functions for multiclass prediction problems. We show when a multiclass loss can be expressed as a “proper composite loss”, which is the composition of a proper loss and a link function. We extend existing results for binary losses to multiclass losses. We determine the stationarity condition, Bregman representation, order-sensitivity, existence and uniqueness of the composite representation for multiclass losses. We subsume existing results on “classification calibration” by relating it to properness and show that the simple integral representation for binary proper losses can not be extended to multiclass losses. 1
4 0.10684506 256 nips-2011-Solving Decision Problems with Limited Information
Author: Denis D. Maua, Cassio Campos
Abstract: We present a new algorithm for exactly solving decision-making problems represented as an influence diagram. We do not require the usual assumptions of no forgetting and regularity, which allows us to solve problems with limited information. The algorithm, which implements a sophisticated variable elimination procedure, is empirically shown to outperform a state-of-the-art algorithm in randomly generated problems of up to 150 variables and 1064 strategies. 1
5 0.099878117 162 nips-2011-Lower Bounds for Passive and Active Learning
Author: Maxim Raginsky, Alexander Rakhlin
Abstract: We develop unified information-theoretic machinery for deriving lower bounds for passive and active learning schemes. Our bounds involve the so-called Alexander’s capacity function. The supremum of this function has been recently rediscovered by Hanneke in the context of active learning under the name of “disagreement coefficient.” For passive learning, our lower bounds match the upper bounds of Gin´ and Koltchinskii up to constants and generalize analogous results of Mase sart and N´ d´ lec. For active learning, we provide first known lower bounds based e e on the capacity function rather than the disagreement coefficient. 1
6 0.096814871 139 nips-2011-Kernel Bayes' Rule
7 0.075289756 189 nips-2011-Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels
8 0.052543383 198 nips-2011-On U-processes and clustering performance
9 0.050746869 285 nips-2011-The Kernel Beta Process
10 0.047005467 29 nips-2011-Algorithms and hardness results for parallel large margin learning
11 0.045623094 114 nips-2011-Hierarchical Multitask Structured Output Learning for Large-scale Sequence Segmentation
12 0.045509037 171 nips-2011-Metric Learning with Multiple Kernels
13 0.043051321 248 nips-2011-Semi-supervised Regression via Parallel Field Regularization
14 0.042977341 12 nips-2011-A Two-Stage Weighting Framework for Multi-Source Domain Adaptation
15 0.041493606 301 nips-2011-Variational Gaussian Process Dynamical Systems
16 0.040774103 152 nips-2011-Learning in Hilbert vs. Banach Spaces: A Measure Embedding Viewpoint
17 0.039499812 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning
18 0.0391718 286 nips-2011-The Local Rademacher Complexity of Lp-Norm Multiple Kernel Learning
19 0.0391188 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
20 0.037492957 28 nips-2011-Agnostic Selective Classification
topicId topicWeight
[(0, 0.117), (1, -0.017), (2, -0.035), (3, -0.071), (4, -0.028), (5, 0.025), (6, 0.014), (7, -0.107), (8, -0.046), (9, 0.031), (10, -0.136), (11, 0.085), (12, 0.137), (13, 0.057), (14, -0.036), (15, -0.026), (16, 0.07), (17, 0.049), (18, -0.061), (19, 0.175), (20, -0.005), (21, -0.114), (22, 0.024), (23, 0.011), (24, -0.082), (25, 0.012), (26, -0.013), (27, -0.11), (28, -0.259), (29, 0.017), (30, 0.17), (31, 0.121), (32, 0.024), (33, -0.06), (34, -0.009), (35, 0.116), (36, 0.213), (37, -0.021), (38, -0.133), (39, -0.114), (40, 0.034), (41, -0.11), (42, -0.071), (43, 0.089), (44, -0.098), (45, -0.094), (46, -0.103), (47, -0.055), (48, -0.019), (49, 0.067)]
simIndex simValue paperId paperTitle
same-paper 1 0.94585121 106 nips-2011-Generalizing from Several Related Classification Tasks to a New Unlabeled Sample
Author: Gilles Blanchard, Gyemin Lee, Clayton Scott
Abstract: We consider the problem of assigning class labels to an unlabeled test data set, given several labeled training data sets drawn from similar distributions. This problem arises in several applications where data distributions fluctuate because of biological, technical, or other sources of variation. We develop a distributionfree, kernel-based approach to the problem. This approach involves identifying an appropriate reproducing kernel Hilbert space and optimizing a regularized empirical risk over the space. We present generalization error analysis, describe universal kernels, and establish universal consistency of the proposed methodology. Experimental results on flow cytometry data are presented. 1
2 0.81401438 207 nips-2011-Optimal learning rates for least squares SVMs using Gaussian kernels
Author: Mona Eberts, Ingo Steinwart
Abstract: We prove a new oracle inequality for support vector machines with Gaussian RBF kernels solving the regularized least squares regression problem. To this end, we apply the modulus of smoothness. With the help of the new oracle inequality we then derive learning rates that can also be achieved by a simple data-dependent parameter selection method. Finally, it turns out that our learning rates are asymptotically optimal for regression functions satisfying certain standard smoothness conditions. 1
3 0.60981238 152 nips-2011-Learning in Hilbert vs. Banach Spaces: A Measure Embedding Viewpoint
Author: Kenji Fukumizu, Gert R. Lanckriet, Bharath K. Sriperumbudur
Abstract: The goal of this paper is to investigate the advantages and disadvantages of learning in Banach spaces over Hilbert spaces. While many works have been carried out in generalizing Hilbert methods to Banach spaces, in this paper, we consider the simple problem of learning a Parzen window classifier in a reproducing kernel Banach space (RKBS)—which is closely related to the notion of embedding probability measures into an RKBS—in order to carefully understand its pros and cons over the Hilbert space classifier. We show that while this generalization yields richer distance measures on probabilities compared to its Hilbert space counterpart, it however suffers from serious computational drawback limiting its practical applicability, which therefore demonstrates the need for developing efficient learning algorithms in Banach spaces.
4 0.58281195 256 nips-2011-Solving Decision Problems with Limited Information
Author: Denis D. Maua, Cassio Campos
Abstract: We present a new algorithm for exactly solving decision-making problems represented as an influence diagram. We do not require the usual assumptions of no forgetting and regularity, which allows us to solve problems with limited information. The algorithm, which implements a sophisticated variable elimination procedure, is empirically shown to outperform a state-of-the-art algorithm in randomly generated problems of up to 150 variables and 1064 strategies. 1
5 0.55763066 59 nips-2011-Composite Multiclass Losses
Author: Elodie Vernet, Mark D. Reid, Robert C. Williamson
Abstract: We consider loss functions for multiclass prediction problems. We show when a multiclass loss can be expressed as a “proper composite loss”, which is the composition of a proper loss and a link function. We extend existing results for binary losses to multiclass losses. We determine the stationarity condition, Bregman representation, order-sensitivity, existence and uniqueness of the composite representation for multiclass losses. We subsume existing results on “classification calibration” by relating it to properness and show that the simple integral representation for binary proper losses can not be extended to multiclass losses. 1
6 0.50359762 139 nips-2011-Kernel Bayes' Rule
7 0.3026005 162 nips-2011-Lower Bounds for Passive and Active Learning
8 0.29644492 189 nips-2011-Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels
9 0.28846326 201 nips-2011-On the Completeness of First-Order Knowledge Compilation for Lifted Probabilistic Inference
10 0.26905018 198 nips-2011-On U-processes and clustering performance
11 0.26325023 305 nips-2011-k-NN Regression Adapts to Local Intrinsic Dimension
12 0.24735808 238 nips-2011-Relative Density-Ratio Estimation for Robust Distribution Comparison
13 0.24721447 285 nips-2011-The Kernel Beta Process
14 0.24491607 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines
15 0.23684889 295 nips-2011-Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction
16 0.23645952 284 nips-2011-The Impact of Unlabeled Patterns in Rademacher Complexity Theory for Kernel Classifiers
17 0.23107043 248 nips-2011-Semi-supervised Regression via Parallel Field Regularization
18 0.2292472 69 nips-2011-Differentially Private M-Estimators
19 0.21802618 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators
20 0.21492535 237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization
topicId topicWeight
[(0, 0.066), (4, 0.07), (20, 0.04), (26, 0.037), (31, 0.059), (33, 0.014), (43, 0.063), (45, 0.129), (48, 0.042), (57, 0.024), (62, 0.203), (65, 0.015), (74, 0.032), (83, 0.043), (99, 0.035)]
simIndex simValue paperId paperTitle
1 0.82338458 161 nips-2011-Linearized Alternating Direction Method with Adaptive Penalty for Low-Rank Representation
Author: Zhouchen Lin, Risheng Liu, Zhixun Su
Abstract: Many machine learning and signal processing problems can be formulated as linearly constrained convex programs, which could be efficiently solved by the alternating direction method (ADM). However, usually the subproblems in ADM are easily solvable only when the linear mappings in the constraints are identities. To address this issue, we propose a linearized ADM (LADM) method by linearizing the quadratic penalty term and adding a proximal term when solving the subproblems. For fast convergence, we also allow the penalty to change adaptively according a novel update rule. We prove the global convergence of LADM with adaptive penalty (LADMAP). As an example, we apply LADMAP to solve lowrank representation (LRR), which is an important subspace clustering technique yet suffers from high computation cost. By combining LADMAP with a skinny SVD representation technique, we are able to reduce the complexity O(n3 ) of the original ADM based method to O(rn2 ), where r and n are the rank and size of the representation matrix, respectively, hence making LRR possible for large scale applications. Numerical experiments verify that for LRR our LADMAP based methods are much faster than state-of-the-art algorithms. 1
2 0.7900933 216 nips-2011-Portmanteau Vocabularies for Multi-Cue Image Representation
Author: Fahad S. Khan, Joost Weijer, Andrew D. Bagdanov, Maria Vanrell
Abstract: We describe a novel technique for feature combination in the bag-of-words model of image classification. Our approach builds discriminative compound words from primitive cues learned independently from training images. Our main observation is that modeling joint-cue distributions independently is more statistically robust for typical classification problems than attempting to empirically estimate the dependent, joint-cue distribution directly. We use Information theoretic vocabulary compression to find discriminative combinations of cues and the resulting vocabulary of portmanteau1 words is compact, has the cue binding property, and supports individual weighting of cues in the final image representation. State-of-theart results on both the Oxford Flower-102 and Caltech-UCSD Bird-200 datasets demonstrate the effectiveness of our technique compared to other, significantly more complex approaches to multi-cue image representation. 1
same-paper 3 0.78622794 106 nips-2011-Generalizing from Several Related Classification Tasks to a New Unlabeled Sample
Author: Gilles Blanchard, Gyemin Lee, Clayton Scott
Abstract: We consider the problem of assigning class labels to an unlabeled test data set, given several labeled training data sets drawn from similar distributions. This problem arises in several applications where data distributions fluctuate because of biological, technical, or other sources of variation. We develop a distributionfree, kernel-based approach to the problem. This approach involves identifying an appropriate reproducing kernel Hilbert space and optimizing a regularized empirical risk over the space. We present generalization error analysis, describe universal kernels, and establish universal consistency of the proposed methodology. Experimental results on flow cytometry data are presented. 1
4 0.77382362 208 nips-2011-Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness
Author: Rémi Munos
Abstract: We consider a global optimization problem of a deterministic function f in a semimetric space, given a finite budget of n evaluations. The function f is assumed to be locally smooth (around one of its global maxima) with respect to a semi-metric ℓ. We describe two algorithms based on optimistic exploration that use a hierarchical partitioning of the space at all scales. A first contribution is an algorithm, DOO, that requires the knowledge of ℓ. We report a finite-sample performance bound in terms of a measure of the quantity of near-optimal states. We then define a second algorithm, SOO, which does not require the knowledge of the semimetric ℓ under which f is smooth, and whose performance is almost as good as DOO optimally-fitted.
5 0.6826244 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari
Abstract: Learning theory has largely focused on two main learning scenarios: the classical statistical setting where instances are drawn i.i.d. from a fixed distribution, and the adversarial scenario wherein, at every time step, an adversarially chosen instance is revealed to the player. It can be argued that in the real world neither of these assumptions is reasonable. We define the minimax value of a game where the adversary is restricted in his moves, capturing stochastic and non-stochastic assumptions on data. Building on the sequential symmetrization approach, we define a notion of distribution-dependent Rademacher complexity for the spectrum of problems ranging from i.i.d. to worst-case. The bounds let us immediately deduce variation-type bounds. We study a smoothed online learning scenario and show that exponentially small amount of noise can make function classes with infinite Littlestone dimension learnable. 1
6 0.6738475 80 nips-2011-Efficient Online Learning via Randomized Rounding
7 0.67154044 198 nips-2011-On U-processes and clustering performance
8 0.66751015 22 nips-2011-Active Ranking using Pairwise Comparisons
9 0.66552669 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
10 0.66462076 114 nips-2011-Hierarchical Multitask Structured Output Learning for Large-scale Sequence Segmentation
11 0.66437167 29 nips-2011-Algorithms and hardness results for parallel large margin learning
12 0.66334391 231 nips-2011-Randomized Algorithms for Comparison-based Search
13 0.66027957 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
14 0.6579197 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
15 0.65678972 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition
16 0.65582967 199 nips-2011-On fast approximate submodular minimization
17 0.65470868 265 nips-2011-Sparse recovery by thresholded non-negative least squares
18 0.6545887 64 nips-2011-Convergent Bounds on the Euclidean Distance
19 0.65455353 236 nips-2011-Regularized Laplacian Estimation and Fast Eigenvector Approximation
20 0.65233123 186 nips-2011-Noise Thresholds for Spectral Clustering