nips nips2012 nips2012-271 knowledge-graph by maker-knowledge-mining

271 nips-2012-Pointwise Tracking the Optimal Regression Function


Source: pdf

Author: Yair Wiener, Ran El-Yaniv

Abstract: This paper examines the possibility of a ‘reject option’ in the context of least squares regression. It is shown that using rejection it is theoretically possible to learn ‘selective’ regressors that can ǫ-pointwise track the best regressor in hindsight from the same hypothesis class, while rejecting only a bounded portion of the domain. Moreover, the rejected volume vanishes with the training set size, under certain conditions. We then develop efficient and exact implementation of these selective regressors for the case of linear regression. Empirical evaluation over a suite of real-world datasets corroborates the theoretical analysis and indicates that our selective regressors can provide substantial advantage by reducing estimation error.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract This paper examines the possibility of a ‘reject option’ in the context of least squares regression. [sent-4, score-0.132]

2 It is shown that using rejection it is theoretically possible to learn ‘selective’ regressors that can ǫ-pointwise track the best regressor in hindsight from the same hypothesis class, while rejecting only a bounded portion of the domain. [sent-5, score-1.004]

3 We then develop efficient and exact implementation of these selective regressors for the case of linear regression. [sent-7, score-0.588]

4 Empirical evaluation over a suite of real-world datasets corroborates the theoretical analysis and indicates that our selective regressors can provide substantial advantage by reducing estimation error. [sent-8, score-0.725]

5 1 Introduction Consider a standard least squares regression problem. [sent-9, score-0.217]

6 Given realizations of some unknown stochastic source, $P(x, y)$, we would like to choose output predictions $\hat{f}$ so as to minimize the standard least squares risk functional, $R(\hat{f}) = \int (y - \hat{f}(x))^2 \, dP(x, y)$. [sent-17, score-0.278]
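
As a concrete illustration of the empirical counterpart of this risk functional and of the least squares ERM, here is a minimal numpy sketch; the synthetic data and all variable names are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 200, 3
X = rng.normal(size=(m, d))                 # inputs x_1, ..., x_m
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=m)   # noisy targets y_1, ..., y_m

def empirical_risk(w, X, y):
    """Empirical squared-loss risk of the linear predictor f(x) = w . x."""
    return np.mean((y - X @ w) ** 2)

# The ERM over the class of linear predictors is ordinary least squares.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("empirical risk of the ERM:", empirical_risk(w_hat, X, y))
```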

7 Let $f^* = \operatorname{argmin}_{f \in \mathcal{F}} R(f)$ be the optimal predictor in hindsight (based on full knowledge of $P$). [sent-18, score-0.112]

8 A classical result in statistical learning is that under certain structural conditions on F and possibly on P , one can learn a regressor that approaches the average optimal performance, R(f ∗ ), when the sample size, m, approaches infinity [1]. [sent-19, score-0.578]

9 In this paper we contemplate the challenge of pointwise tracking the optimal predictions of f ∗ after observing only a finite (and possibly small) set of training samples. [sent-20, score-0.39]

10 Instead of predicting the output for the entire input domain, the regressor is allowed to abstain from prediction for part of the domain. [sent-22, score-0.564]

11 We present here new techniques for regression with a reject option, capable of achieving pointwise optimality on substantial parts of the input domain, under certain conditions. [sent-23, score-0.572]

12 Section 3 introduces a general strategy for learning selective regressors. [sent-24, score-0.506]

13 This strategy is guaranteed to achieve ǫ-pointwise optimality (defined in Section 2) all through its region of action. [sent-25, score-0.149]

14 8, which also shows that the guaranteed coverage increases monotonically with the training sample size and converges to 1. [sent-27, score-0.341]

15 This type of guarantee is quite strong, as it ensures tight tracking of individual optimal predictions made by f ∗ , while covering a substantial portion of the input domain. [sent-28, score-0.166]

16 At the outset, the general strategy we propose appears to be out of reach because accept/reject decisions require the computation of a supremum over a very large, and possibly infinite, hypothesis subset. [sent-29, score-0.285]

17 In Section 4, however, we show how to compute the strategy for each point of interest using only two constrained ERM calculations. [sent-30, score-0.149]

18 2, opens possibilities for efficient implementations of optimal selective regressors whenever the hypothesis class of interest allows for efficient (constrained) ERM (see Definition 4. [sent-32, score-0.759]

19 For the case of linear least squares regression, we utilize known techniques for both ERM and constrained ERM and derive in Section 5 an exact implementation achieving pointwise optimal selective regression. [sent-34, score-0.981]

20 3 in this section states a novel pointwise bound on the difference between the prediction of an ERM linear regressor and the prediction of f ∗ for each individual point. [sent-37, score-0.773]

21 Finally, in Section 6 we present numerical examples over a suite of real-world regression datasets demonstrating the effectiveness of our methods, and indicating that substantial performance improvements can be gained by using selective regression. [sent-38, score-0.621]

22 Utilizations of a reject option are quite common in classification where this technique was initiated more than 50 years ago with Chow’s pioneering work [2, 3]. [sent-40, score-0.23]

23 However, the reject option is only scarcely and anecdotally mentioned in the context of regression. [sent-41, score-0.232]

24 In [4], a boosting algorithm for regression is proposed and a few reject mechanisms are considered, applied to the aggregate decision and/or to the underlying weak regressors. [sent-42, score-0.248]

25 A straightforward threshold-based reject mechanism (rejecting low response values) is applied in [5] on top of support vector regression. [sent-43, score-0.137]

26 The present paper is inspired by and draws upon recent results on selective classification [6, 7, 8], and can be viewed as a natural continuation of the results of [8]. [sent-45, score-0.451]

27 2 Selective regression and other preliminary definitions We begin with a definition of the following general and standard regression setting. [sent-47, score-0.17]

28 Using Sm we are required to select a regressor f ∈ F, where F is a fixed hypothesis class containing potential regressors of the form f : X → Y. [sent-49, score-0.815]

29 Given a loss function, $\ell : \mathcal{Y} \times \mathcal{Y} \to [0, \infty)$, we quantify the prediction quality of any $f$ through its true error or risk, $R(f)$, defined as its expected loss with respect to $P$: $R(f) \triangleq E_{(x,y)}\{\ell(f(x), y)\} = \int \ell(f(x), y)\, dP(x, y)$. [sent-55, score-0.11]

30 Let $\hat{f} \triangleq \arg\inf_{f \in \mathcal{F}} \hat{R}(f)$ be the empirical risk minimizer (ERM), and $f^* \triangleq \arg\inf_{f \in \mathcal{F}} R(f)$ the true risk minimizer. [sent-57, score-0.263]

31 Next we define selective regression using the following definitions, which are taken, as is, from the selective classification setting of [6]. [sent-58, score-0.958]

32 Here again, we are given a training sample Sm as above, but are now required to output a selective regressor defined to be a pair (f, g), with f ∈ F being a standard regressor and g : X → {0, 1} a selection function, which serves as a qualifier for f as follows. [sent-59, score-0.972]

33 Thus, the selective regressor abstains from prediction at a point x iff g(x) = 0. [sent-61, score-0.959]

34 The general performance of a selective regressor is characterized in terms of two quantities: coverage and risk. [sent-62, score-1.179]

35 The true risk of $(f, g)$ is the risk of $f$ restricted to its region of activity as qualified by $g$, and normalized by its coverage $\Phi(f, g) = E_P\{g(x)\}$: $R(f, g) \triangleq E_P[\ell(f(x), y) \cdot g(x)] \,/\, \Phi(f, g)$. [sent-64, score-0.24]

36 We say that the selective regressor $(f, g)$ is ǫ-pointwise optimal if $\forall x \in \{x \in \mathcal{X} : g(x) = 1\}$, $|f(x) - f^*(x)| \le \epsilon$. [sent-65, score-0.964]
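
To make the coverage and selective-risk quantities concrete, the following numpy sketch estimates both empirically for a given pair (f, g); the selection rule used here is only an arbitrary placeholder, not the strategy analyzed in the paper.

```python
import numpy as np

def coverage_and_selective_risk(f, g, X, y):
    """Empirical analogues of Phi(f, g) = E[g(x)] and
    R(f, g) = E[loss(f(x), y) * g(x)] / Phi(f, g), for squared loss."""
    accept = g(X).astype(float)                     # g(x) in {0, 1} per point
    phi = accept.mean()                             # empirical coverage
    if phi == 0.0:
        return 0.0, float("nan")                    # the regressor abstains everywhere
    risk = np.mean((f(X) - y) ** 2 * accept) / phi  # risk on the accepted region
    return phi, risk

# Placeholder example: a linear predictor that abstains far from the sample mean.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = X @ np.array([1.0, -1.0]) + 0.1 * rng.normal(size=500)
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
f = lambda X: X @ w_hat
g = lambda X: np.linalg.norm(X - X.mean(axis=0), axis=1) < 2.0
print(coverage_and_selective_risk(f, g, X, y))
```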

37 Note that pointwise optimality is a considerably stronger property than risk, which only refers to average performance. [sent-66, score-0.282]

38 We define a (standard) distance metric over the hypothesis class F . [sent-67, score-0.142]

39 3 Pointwise optimality with bounded coverage In this section we analyze the following strategy for learning a selective regressor, which turns out to ensure ǫ-pointwise optimality with monotonically increasing coverage (with m). [sent-71, score-1.19]

40 We call it a strategy rather than an algorithm because it is not at all clear at the outset how to implement it. [sent-72, score-0.133]

41 For any hypothesis class $\mathcal{F}$, target hypothesis $f \in \mathcal{F}$, distribution $P$, sample $S_m$, and real $r > 0$, define $\mathcal{V}(f, r) \triangleq \{f' \in \mathcal{F} : R(f') \le R(f) + r\}$ and $\hat{\mathcal{V}}(f, r) \triangleq \{f' \in \mathcal{F} : \hat{R}(f') \le \hat{R}(f) + r\}$ (2). [sent-75, score-0.261]

42 Strategy 1: A learning strategy for ǫ-pointwise optimal selective regressors. Input: $S_m$, $m$, $\delta$, $\mathcal{F}$, $\epsilon$. Output: a selective regressor $(\hat{f}, g)$ achieving ǫ-pointwise optimality. Step 1: set $\hat{f} = \mathrm{ERM}(\mathcal{F}, S_m)$, i.e. [sent-76, score-1.729]

43 , $\hat{f}$ is any empirical risk minimizer from $\mathcal{F}$. Step 2: set $\mathcal{G} = \hat{\mathcal{V}}\big(\hat{f},\, (\sigma(m, \delta/4, \mathcal{F})^2 - 1) \cdot \hat{R}(\hat{f})\big)$ /* see Definition 3. [sent-78, score-0.12]
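
The following Python sketch shows the shape of Strategy 1's acceptance rule for a single query point; it is an assumption-laden outline, not the paper's implementation: `sigma` stands for a user-supplied value of the plug-in risk bound $\sigma(m, \delta/4, \mathcal{F})$, and `sup_disagreement` is a hypothetical routine returning $\sup_{f \in \mathcal{G}} |f(x) - \hat{f}(x)|$, which Section 4 reduces to two constrained ERM computations (see the sketch after the constrained-ERM definition below).

```python
import numpy as np

def strategy1_accepts(x, X, y, epsilon, sigma, sup_disagreement):
    """Outline of Strategy 1 at a query point x.

    sigma            -- assumed value of the multiplicative risk bound sigma(m, delta/4, F)
    sup_disagreement -- hypothetical callable returning the maximal disagreement
                        sup_{f in G} |f(x) - f_hat(x)| over
                        G = V_hat(f_hat, (sigma^2 - 1) * R_hat(f_hat)).
    Returns (prediction or None, accepted?)."""
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)        # f_hat = ERM(F, S_m)
    r_hat = np.mean((y - X @ w_hat) ** 2)                # empirical risk R_hat(f_hat)
    radius = (sigma ** 2 - 1.0) * r_hat                  # empirical-risk slack defining G
    accepted = sup_disagreement(x, X, y, w_hat, radius) < epsilon
    return (float(x @ w_hat) if accepted else None), accepted
```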

44 Let ℓ : Y × Y → [0, ∞) be the squared loss function and F be a convex hypothesis class. [sent-86, score-0.263]

45 Let $\sigma_\delta \triangleq \sigma(m, \delta, \mathcal{F})$ be defined such that, for any $0 < \delta < 1$, with probability of at least $1 - \delta$ over the choice of $S_m$ from $P^m$, any hypothesis $f \in \mathcal{F}$ satisfies $R(f) \le \hat{R}(f) \cdot \sigma(m, \delta, \mathcal{F})$. [sent-99, score-0.176]

46 3 is to facilitate the use of any (known) risk bound as a plug-in component in subsequent derivations. [sent-104, score-0.169]

47 We define σ as a multiplicative bound, which is common in the treatment of unbounded loss functions such as the squared loss (see discussion by Vapnik in [10], page 993). [sent-105, score-0.232]

48 We also developed the entire set of results that follow while relying on additive bounds, which are common when using bounded loss functions. [sent-109, score-0.114]

49 The proof of the following lemma follows closely the proof of Lemma 5. [sent-111, score-0.139]

50 However, it considers a multiplicative risk bound rather than an additive one. [sent-113, score-0.224]

51 Let $\mathcal{F}$ be a convex hypothesis space, $\ell : \mathcal{Y} \times \mathcal{Y} \to [0, \infty)$ a convex loss function, and $\hat{f}$ be an ERM. [sent-119, score-0.272]

52 Then, with probability of at least $1 - \delta/2$, for any $x \in \mathcal{X}$, $|f^*(x) - \hat{f}(x)| \le \sup_{f \in \hat{\mathcal{V}}(\hat{f},\, (\sigma_{\delta/4}^2 - 1) \cdot \hat{R}(\hat{f}))} |f(x) - \hat{f}(x)|$. [sent-120, score-0.146]

53 Applying the multiplicative risk bound, we get that with probability of at least $1 - \delta/4$, $\hat{R}(f^*) \le R(f^*) \cdot \sigma_{\delta/4}$. [sent-122, score-0.281]

54 Applying the multiplicative risk bound on $\hat{f}$, we know also that with probability of at least $1 - \delta/4$, $R(\hat{f}) \le \hat{R}(\hat{f}) \cdot \sigma_{\delta/4}$. [sent-124, score-0.224]

55 Combining the three inequalities by using the union bound, we get that with probability of at least $1 - \delta/2$, $\hat{R}(f^*) \le \hat{R}(\hat{f}) \cdot \sigma_{\delta/4}^2 = \hat{R}(\hat{f}) + (\sigma_{\delta/4}^2 - 1) \cdot \hat{R}(\hat{f})$. [sent-125, score-0.242]

56 Hence, with probability of at least $1 - \delta/2$ we get $f^* \in \hat{\mathcal{V}}\big(\hat{f}, (\sigma_{\delta/4}^2 - 1) \cdot \hat{R}(\hat{f})\big)$. Let $\mathcal{G} \subseteq \mathcal{F}$. [sent-126, score-0.106]

57 The ǫ-disagreement coefficient of $\mathcal{F}$ under $P$ is $\theta_\epsilon \triangleq \sup_{r > r_0} \frac{\Delta_\epsilon B(f^*, r)}{r}$. [sent-138, score-0.089]

58 Specifically, the coefficient proposed there is unbounded for the squared loss function when Y is unbounded. [sent-144, score-0.122]

59 Let F be a convex hypothesis class, and assume ℓ : Y × Y → [0, ∞) is the squared loss function. [sent-147, score-0.263]

60 The following theorem is the main result of this section, showing that Strategy 1 achieves ǫ-pointwise optimality with a meaningful coverage that converges to 1. [sent-151, score-0.355]

61 Although R(f ∗ ) in the bound (5) is an unknown quantity, it is still a constant, and as σ approaches 1, the coverage lower bound approaches 1 as well. [sent-152, score-0.368]

62 When using a typical additive risk bound, R(f ∗ ) disappears from the RHS. [sent-153, score-0.146]

63 Let (f, g) be the selective regressor chosen by Strategy 1. [sent-158, score-0.935]

64 5 we get that, with probability of at least 1 − δ/2, ∀x ∈ {x ∈ X : g(x) = 1} |f (x) − f ∗ (x)| < ǫ. [sent-163, score-0.106]

65 Since $\hat{f} \in \hat{\mathcal{V}}\big(\hat{f}, (\sigma_{\delta/4}^2 - 1) \cdot \hat{R}(\hat{f})\big) = \mathcal{G}$, we get $\Phi(f, g) = E\{g(X)\} = E\big\{I\big(\sup_{f \in \mathcal{G}} |f(x) - \hat{f}(x)| < \epsilon\big)\big\} = 1 - E\big\{I\big(\sup_{f \in \mathcal{G}} |f(x) - \hat{f}(x)| \ge \epsilon\big)\big\} \ge 1 - E\big\{I\big(\sup_{f_1, f_2 \in \mathcal{G}} |f_1(x) - f_2(x)| \ge \epsilon\big)\big\} = 1 - \Delta_\epsilon \mathcal{G}$. [sent-164, score-0.342]

66 7 and the union bound, we conclude that with probability of at least $1 - \delta$, $\Phi(f, g) = E\{g(X)\} \ge 1 - \theta_\epsilon \big(\sigma_{\delta/4}^2 - 1\big)\big(R(f^*) + \sigma_{\delta/4} \cdot \hat{R}(\hat{f})\big)$. [sent-166, score-0.136]

67 4 Rejection via constrained ERM In Strategy 1 we are required to track the supremum of a possibly infinite hypothesis subset, which in general might be intractable. [sent-167, score-0.303]

68 2 reduces the problem of calculating the supremum to a problem of calculating a constrained ERM for two hypotheses. [sent-169, score-0.16]

69 Define $\hat{f}_{\epsilon,x} \triangleq \operatorname{argmin}_{f \in \mathcal{F},\, f(x) = \hat{f}(x) + \epsilon} \hat{R}(f)$, where $\hat{f}(x)$ is, as usual, the value of the unconstrained ERM regressor at point $x$. [sent-173, score-0.51]
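
For linear least squares, $\hat{f}_{\epsilon,x}$ is an equality-constrained least squares problem (minimize the empirical risk subject to $w \cdot x = \hat{f}(x) + \epsilon$), which admits a textbook Lagrange-multiplier closed form; the numpy sketch below uses that standard formula and assumes $X^\top X$ is invertible, so it is an illustration rather than the paper's exact routine.

```python
import numpy as np

def constrained_erm(X, y, x0, eps):
    """Sketch: linear least squares ERM subject to f(x0) = f_hat(x0) + eps.
    Standard equality-constrained least squares; assumes X^T X is invertible."""
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # unconstrained ERM f_hat
    a_inv_x0 = np.linalg.solve(X.T @ X, x0)         # (X^T X)^{-1} x0
    # Shift the ERM along the direction that changes f(x0) while minimally
    # increasing the residual sum of squares.
    return w_hat + eps * a_inv_x0 / (x0 @ a_inv_x0)

def empirical_risk(w, X, y):
    return np.mean((y - X @ w) ** 2)
```

For instance, `constrained_erm(X, y, x0, 0.3)` returns the linear regressor with the smallest empirical risk among those predicting $\hat{f}(x_0) + 0.3$ at $x_0$.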

70 Let F be a convex hypothesis space, and ℓ : Y × Y → [0, ∞), a convex loss function. [sent-176, score-0.272]

71 Let ǫ > 0 be given, and let (f, g) be a selective regressor chosen by Strategy 1 after observing the training sample Sm. [sent-177, score-0.996]

72 2 provides the means to compute such an optimal regressor assuming that a method to compute a constrained ERM is available (as is the case for squared loss linear regressors; see the next section). [sent-194, score-0.865]

73 However, as was discussed in [6], in many cases our objective is to explore the entire risk-coverage trade-off, in other words, to get a pointwise bound on |f ∗ (x)−f (x)|, i. [sent-195, score-0.342]

74 Let $\mathcal{F}$ be a convex hypothesis class, $\ell : \mathcal{Y} \times \mathcal{Y} \to [0, \infty)$ a convex loss function, and let $\hat{f}$ be an ERM. [sent-201, score-0.296]

75 Then, with probability of at least $1 - \delta/2$ over the choice of $S_m$ from $P^m$, for any $x \in \mathcal{X}$, $|f^*(x) - \hat{f}(x)| \le \sup\big\{|\epsilon| : \hat{R}(\hat{f}_{\epsilon,x}) \le \hat{R}(\hat{f}) \cdot \sigma_{\delta/4}^2\big\}$. [sent-202, score-0.146]

76 We thus have $\epsilon' = \sup_{\epsilon \in \mathbb{R}} \big\{|\epsilon| : \hat{R}(\hat{f}_{\epsilon,x}) \le \hat{R}(\hat{f}) \cdot \sigma_{\delta/4}^2\big\}$ and $\sup_{f \in \hat{\mathcal{V}}(\hat{f},\, (\sigma_{\delta/4}^2 - 1) \cdot \hat{R}(\hat{f}))} |f(x) - \hat{f}(x)| = a \le \epsilon'$. [sent-212, score-0.178]

77 We conclude this section with a general result on the monotonicity of the empirical risk attained by constrained ERM regressors. [sent-215, score-0.221]

78 Let $\mathcal{F}$ be a convex hypothesis space, $\ell : \mathcal{Y} \times \mathcal{Y} \to [0, \infty)$ a convex loss function, and $0 \le \epsilon_1 < \epsilon_2$ be given. [sent-219, score-0.272]

79 5 Selective linear regression We now restrict attention to linear least squares regression (LLSR), and, relying on Theorem 4. [sent-222, score-0.331]

80 4, as well as on known closed-form expressions for LLSR, we derive an efficient implementation of Strategy 1 and a new pointwise bound. [sent-224, score-0.214]

81 4, for squared loss, $\hat{R}(\hat{f}_{\epsilon,x_0})$ is strictly monotonically increasing for $\epsilon > 0$, and decreasing for $\epsilon < 0$. [sent-241, score-0.1]

82 Denoting these solutions by $\epsilon_1, \epsilon_2$, we get $\sup\big\{|\epsilon| : \hat{R}(\hat{f}_{\epsilon,x_0}) \le \hat{R}(\hat{f}) \cdot \sigma_{\delta/4}^2\big\} = \max(|\epsilon_1|, |\epsilon_2|)$. [sent-243, score-0.089]
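
Given the monotonicity just noted, the two boundary solutions can also be found numerically rather than in closed form; the sketch below does so by bisection for the positive side, using the same equality-constrained least squares formula as in the earlier sketch, and is an illustrative numeric route under the assumption that $X^\top X$ is invertible.

```python
import numpy as np

def positive_boundary_eps(X, y, x0, sigma, hi=1.0, tol=1e-8):
    """Find eps2 > 0 with R_hat(f_hat_{eps, x0}) = sigma^2 * R_hat(f_hat) by bisection;
    eps1 < 0 is found the same way with eps -> -eps, and the supremum in the text
    equals max(|eps1|, |eps2|)."""
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    risk = lambda w: np.mean((y - X @ w) ** 2)
    a_inv_x0 = np.linalg.solve(X.T @ X, x0)
    w_eps = lambda e: w_hat + e * a_inv_x0 / (x0 @ a_inv_x0)  # constrained ERM at x0
    target = sigma ** 2 * risk(w_hat)
    lo = 0.0
    while risk(w_eps(hi)) < target:      # grow the bracket until the target risk is crossed
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if risk(w_eps(mid)) < target else (lo, mid)
    return hi
```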

83 6 Numerical examples Focusing on linear least squares regression, we empirically evaluated the proposed method. [sent-253, score-0.132]

84 The selective regressor (f, g) is computed as follows. [sent-255, score-0.935]

85 The regressor f is an ERM over Sm, and for any coverage value Φ, the function g selects a subset of Sn of size n · Φ, including all test points with the lowest values of the bound in Theorem 5. [sent-256, score-0.829]
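
The risk–coverage curves reported in this section can be produced with a simple sort-and-sweep loop; the sketch below assumes the per-point bound values (e.g., from the constrained-ERM sketches above) have already been computed for the test points, and the function name and arguments are illustrative.

```python
import numpy as np

def risk_coverage_curve(bounds, preds, y_test, coverages):
    """For each target coverage Phi, accept the n * Phi test points with the
    lowest bound values and report the squared-loss test error on that subset."""
    order = np.argsort(bounds)                    # most trusted test points first
    sq_err = (preds - y_test) ** 2
    curve = []
    for phi in coverages:
        k = max(1, int(round(phi * len(order))))  # size n * Phi accepted set
        curve.append((phi, float(sq_err[order[:k]].mean())))
    return curve
```

For instance, `risk_coverage_curve(b, yhat, y, np.linspace(0.1, 1.0, 10))` traces the test error from 10% coverage up to full coverage.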

86 These ρ(x) distances, corresponding to all x ∈ Sn, were used as an alternative method to reject test points in decreasing order of their ρ(x) values. [sent-260, score-0.163]

87 Each of the graphs on the first row shows the average absolute difference between the selective regressor (f, g) and the optimal regressor f ∗ (taken as an ERM over the entire dataset) as a function of coverage, where the average is taken over the accepted instances. [sent-268, score-1.537]

88–89 [Figure 1 axis residue omitted; panels correspond to the bodyfat, cadata, cpusmall, housing, and space datasets.]

90 Figure 1: (top row) absolute difference between the selective regressor (f, g) and the optimal regressor f ∗ .

91 (bottom row) test error of selective regressor (f, g). [sent-324, score-0.961]

92 Each of the graphs in the second row shows the test error of the selective regressor (f, g) as a function of coverage. [sent-327, score-0.994]

93 7 Concluding remarks Rooted in the centuries-old linear least squares method of Gauss and Legendre, regression estimation remains an indispensable routine in statistical analysis, modeling and prediction. [sent-331, score-0.217]

94 This paper proposes a novel rejection technique allowing a least squares regressor, learned from a finite and possibly small training sample, to pointwise track, within its selected region of activity, the predictions of the globally optimal regressor in hindsight (from the same class). [sent-332, score-1.134]

95 Immediate plausible extensions are the handling of other types of regression, including regularized and kernel regression, as well as extensions to other convex loss functions such as the epsilon-insensitive loss. [sent-334, score-0.104]

96 The presence of the ǫ-disagreement coefficient in our coverage bound suggests a possible relation to active learning, since the standard version of this coefficient has a key role in characterizing the efficiency of active learning in classification [17]. [sent-335, score-0.465]

97 Indeed, a formal reduction of active learning to selective classification was recently found, whereby rejected points are precisely those points to be queried in a stream-based active learning setting. [sent-336, score-0.635]

98 Moreover, “fast” coverage bounds in selective classification give rise to fast rates in active learning [7]. [sent-337, score-0.783]

99 Carrying this intuition over to our setting, one could consider devising a querying function for active regression that is based on the pointwise bound of Theorem 5. [sent-338, score-0.434]

100 A bound on the label complexity of agnostic active learning. [sent-402, score-0.19]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('regressor', 0.51), ('selective', 0.425), ('erm', 0.284), ('coverage', 0.244), ('sm', 0.217), ('pointwise', 0.214), ('regressors', 0.163), ('lemma', 0.139), ('reject', 0.137), ('risk', 0.12), ('hypothesis', 0.119), ('sup', 0.089), ('active', 0.086), ('regression', 0.085), ('strategy', 0.081), ('squares', 0.075), ('option', 0.069), ('constrained', 0.068), ('optimality', 0.068), ('nn', 0.061), ('monotonically', 0.06), ('hindsight', 0.059), ('cadata', 0.058), ('cpusmall', 0.058), ('llsr', 0.058), ('ols', 0.058), ('least', 0.057), ('loss', 0.055), ('agnostic', 0.055), ('multiplicative', 0.055), ('disagreement', 0.053), ('rejection', 0.053), ('outset', 0.052), ('bodyfat', 0.052), ('get', 0.049), ('convex', 0.049), ('bound', 0.049), ('quali', 0.048), ('sn', 0.046), ('xk', 0.046), ('supremum', 0.046), ('coef', 0.044), ('theorem', 0.043), ('suite', 0.043), ('rejecting', 0.043), ('housing', 0.043), ('nition', 0.042), ('dis', 0.041), ('substantial', 0.04), ('squared', 0.04), ('possibly', 0.039), ('rejected', 0.038), ('training', 0.037), ('rc', 0.037), ('predictions', 0.037), ('curve', 0.035), ('tracking', 0.034), ('row', 0.033), ('monotonicity', 0.033), ('yi', 0.033), ('track', 0.031), ('libsvm', 0.031), ('applying', 0.031), ('union', 0.03), ('nitions', 0.03), ('entire', 0.03), ('completes', 0.03), ('optimal', 0.029), ('ordinary', 0.029), ('relying', 0.029), ('jensen', 0.029), ('classi', 0.029), ('bounds', 0.028), ('achieving', 0.028), ('datasets', 0.028), ('unbounded', 0.027), ('unknown', 0.026), ('xt', 0.026), ('boosting', 0.026), ('test', 0.026), ('dp', 0.026), ('portion', 0.026), ('corroborates', 0.026), ('continuation', 0.026), ('harnessing', 0.026), ('wet', 0.026), ('disappears', 0.026), ('cls', 0.026), ('anecdotally', 0.026), ('let', 0.024), ('technique', 0.024), ('xi', 0.024), ('abstain', 0.024), ('substitutions', 0.024), ('abstains', 0.024), ('predictor', 0.024), ('baseline', 0.023), ('class', 0.023), ('calculating', 0.023), ('inf', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 271 nips-2012-Pointwise Tracking the Optimal Regression Function

Author: Yair Wiener, Ran El-Yaniv

Abstract: This paper examines the possibility of a ‘reject option’ in the context of least squares regression. It is shown that using rejection it is theoretically possible to learn ‘selective’ regressors that can ǫ-pointwise track the best regressor in hindsight from the same hypothesis class, while rejecting only a bounded portion of the domain. Moreover, the rejected volume vanishes with the training set size, under certain conditions. We then develop efficient and exact implementation of these selective regressors for the case of linear regression. Empirical evaluation over a suite of real-world datasets corroborates the theoretical analysis and indicates that our selective regressors can provide substantial advantage by reducing estimation error.

2 0.12053008 32 nips-2012-Active Comparison of Prediction Models

Author: Christoph Sawade, Niels Landwehr, Tobias Scheffer

Abstract: We address the problem of comparing the risks of two given predictive models—for instance, a baseline model and a challenger—as confidently as possible on a fixed labeling budget. This problem occurs whenever models cannot be compared on held-out training data, possibly because the training data are unavailable or do not reflect the desired test distribution. In this case, new test instances have to be drawn and labeled at a cost. We devise an active comparison method that selects instances according to an instrumental sampling distribution. We derive the sampling distribution that maximizes the power of a statistical test applied to the observed empirical risks, and thereby minimizes the likelihood of choosing the inferior model. Empirically, we investigate model selection problems on several classification and regression tasks and study the accuracy of the resulting p-values. 1

3 0.11112892 200 nips-2012-Local Supervised Learning through Space Partitioning

Author: Joseph Wang, Venkatesh Saligrama

Abstract: We develop a novel approach for supervised learning based on adaptively partitioning the feature space into different regions and learning local region-specific classifiers. We formulate an empirical risk minimization problem that incorporates both partitioning and classification in to a single global objective. We show that space partitioning can be equivalently reformulated as a supervised learning problem and consequently any discriminative learning method can be utilized in conjunction with our approach. Nevertheless, we consider locally linear schemes by learning linear partitions and linear region classifiers. Locally linear schemes can not only approximate complex decision boundaries and ensure low training error but also provide tight control on over-fitting and generalization error. We train locally linear classifiers by using LDA, logistic regression and perceptrons, and so our scheme is scalable to large data sizes and high-dimensions. We present experimental results demonstrating improved performance over state of the art classification techniques on benchmark datasets. We also show improved robustness to label noise.

4 0.10410042 297 nips-2012-Robustness and risk-sensitivity in Markov decision processes

Author: Takayuki Osogami

Abstract: We uncover relations between robust MDPs and risk-sensitive MDPs. The objective of a robust MDP is to minimize a function, such as the expectation of cumulative cost, for the worst case when the parameters have uncertainties. The objective of a risk-sensitive MDP is to minimize a risk measure of the cumulative cost when the parameters are known. We show that a risk-sensitive MDP of minimizing the expected exponential utility is equivalent to a robust MDP of minimizing the worst-case expectation with a penalty for the deviation of the uncertain parameters from their nominal values, which is measured with the Kullback-Leibler divergence. We also show that a risk-sensitive MDP of minimizing an iterated risk measure that is composed of certain coherent risk measures is equivalent to a robust MDP of minimizing the worst-case expectation when the possible deviations of uncertain parameters from their nominal values are characterized with a concave function. 1

5 0.10210232 305 nips-2012-Selective Labeling via Error Bound Minimization

Author: Quanquan Gu, Tong Zhang, Jiawei Han, Chris H. Ding

Abstract: In many practical machine learning problems, the acquisition of labeled data is often expensive and/or time consuming. This motivates us to study a problem as follows: given a label budget, how to select data points to label such that the learning performance is optimized. We propose a selective labeling method by analyzing the out-of-sample error of Laplacian regularized Least Squares (LapRLS). In particular, we derive a deterministic out-of-sample error bound for LapRLS trained on subsampled data, and propose to select a subset of data points to label by minimizing this upper bound. Since the minimization is a combinational problem, we relax it into continuous domain and solve it by projected gradient descent. Experiments on benchmark datasets show that the proposed method outperforms the state-of-the-art methods.

6 0.09769278 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications

7 0.097688302 16 nips-2012-A Polynomial-time Form of Robust Regression

8 0.081991233 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing

9 0.079541147 62 nips-2012-Burn-in, bias, and the rationality of anchoring

10 0.079541147 116 nips-2012-Emergence of Object-Selective Features in Unsupervised Feature Learning

11 0.077279709 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

12 0.073184818 227 nips-2012-Multiclass Learning with Simplex Coding

13 0.07266625 330 nips-2012-Supervised Learning with Similarity Functions

14 0.070992768 145 nips-2012-Gradient Weights help Nonparametric Regressors

15 0.069987252 141 nips-2012-GenDeR: A Generic Diversified Ranking Algorithm

16 0.068641305 107 nips-2012-Effective Split-Merge Monte Carlo Methods for Nonparametric Models of Sequential Data

17 0.066442706 266 nips-2012-Patient Risk Stratification for Hospital-Associated C. diff as a Time-Series Classification Task

18 0.065607727 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs

19 0.064470626 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

20 0.064344689 361 nips-2012-Volume Regularization for Binary Classification


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.182), (1, 0.011), (2, 0.047), (3, -0.022), (4, 0.102), (5, 0.019), (6, 0.006), (7, 0.104), (8, -0.005), (9, -0.024), (10, 0.006), (11, 0.037), (12, -0.026), (13, -0.027), (14, -0.008), (15, -0.086), (16, -0.038), (17, -0.008), (18, -0.032), (19, 0.113), (20, -0.038), (21, 0.056), (22, -0.018), (23, 0.06), (24, 0.035), (25, -0.039), (26, 0.038), (27, 0.053), (28, -0.065), (29, -0.037), (30, -0.058), (31, -0.066), (32, -0.078), (33, 0.004), (34, -0.069), (35, 0.044), (36, -0.1), (37, -0.013), (38, -0.001), (39, -0.023), (40, -0.034), (41, -0.036), (42, -0.134), (43, 0.018), (44, -0.02), (45, 0.052), (46, -0.065), (47, 0.04), (48, -0.074), (49, -0.038)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93133241 271 nips-2012-Pointwise Tracking the Optimal Regression Function

Author: Yair Wiener, Ran El-Yaniv

Abstract: This paper examines the possibility of a ‘reject option’ in the context of least squares regression. It is shown that using rejection it is theoretically possible to learn ‘selective’ regressors that can ǫ-pointwise track the best regressor in hindsight from the same hypothesis class, while rejecting only a bounded portion of the domain. Moreover, the rejected volume vanishes with the training set size, under certain conditions. We then develop efficient and exact implementation of these selective regressors for the case of linear regression. Empirical evaluation over a suite of real-world datasets corroborates the theoretical analysis and indicates that our selective regressors can provide substantial advantage by reducing estimation error.

2 0.71797031 32 nips-2012-Active Comparison of Prediction Models

Author: Christoph Sawade, Niels Landwehr, Tobias Scheffer

Abstract: We address the problem of comparing the risks of two given predictive models—for instance, a baseline model and a challenger—as confidently as possible on a fixed labeling budget. This problem occurs whenever models cannot be compared on held-out training data, possibly because the training data are unavailable or do not reflect the desired test distribution. In this case, new test instances have to be drawn and labeled at a cost. We devise an active comparison method that selects instances according to an instrumental sampling distribution. We derive the sampling distribution that maximizes the power of a statistical test applied to the observed empirical risks, and thereby minimizes the likelihood of choosing the inferior model. Empirically, we investigate model selection problems on several classification and regression tasks and study the accuracy of the resulting p-values. 1

3 0.70300758 212 nips-2012-Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL

Author: Nishant Mehta, Dongryeol Lee, Alexander G. Gray

Abstract: Since its inception, the modus operandi of multi-task learning (MTL) has been to minimize the task-wise mean of the empirical risks. We introduce a generalized loss-compositional paradigm for MTL that includes a spectrum of formulations as a subfamily. One endpoint of this spectrum is minimax MTL: a new MTL formulation that minimizes the maximum of the tasks’ empirical risks. Via a certain relaxation of minimax MTL, we obtain a continuum of MTL formulations spanning minimax MTL and classical MTL. The full paradigm itself is loss-compositional, operating on the vector of empirical risks. It incorporates minimax MTL, its relaxations, and many new MTL formulations as special cases. We show theoretically that minimax MTL tends to avoid worst case outcomes on newly drawn test tasks in the learning to learn (LTL) test setting. The results of several MTL formulations on synthetic and real problems in the MTL and LTL test settings are encouraging. 1

4 0.6752634 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications

Author: Amit Daniely, Sivan Sabato, Shai S. Shwartz

Abstract: We theoretically analyze and compare the following five popular multiclass classification methods: One vs. All, All Pairs, Tree-based classifiers, Error Correcting Output Codes (ECOC) with randomly generated code matrices, and Multiclass SVM. In the first four methods, the classification is based on a reduction to binary classification. We consider the case where the binary classifier comes from a class of VC dimension d, and in particular from the class of halfspaces over Rd . We analyze both the estimation error and the approximation error of these methods. Our analysis reveals interesting conclusions of practical relevance, regarding the success of the different approaches under various conditions. Our proof technique employs tools from VC theory to analyze the approximation error of hypothesis classes. This is in contrast to most previous uses of VC theory, which only deal with estimation error. 1

5 0.6443634 361 nips-2012-Volume Regularization for Binary Classification

Author: Koby Crammer, Tal Wagner

Abstract: We introduce a large-volume box classification for binary prediction, which maintains a subset of weight vectors, and specifically axis-aligned boxes. Our learning algorithm seeks for a box of large volume that contains “simple” weight vectors which most of are accurate on the training set. Two versions of the learning process are cast as convex optimization problems, and it is shown how to solve them efficiently. The formulation yields a natural PAC-Bayesian performance bound and it is shown to minimize a quantity directly aligned with it. The algorithm outperforms SVM and the recently proposed AROW algorithm on a majority of 30 NLP datasets and binarized USPS optical character recognition datasets. 1

6 0.61861879 266 nips-2012-Patient Risk Stratification for Hospital-Associated C. diff as a Time-Series Classification Task

7 0.61182046 181 nips-2012-Learning Multiple Tasks using Shared Hypotheses

8 0.58991253 16 nips-2012-A Polynomial-time Form of Robust Regression

9 0.58186859 305 nips-2012-Selective Labeling via Error Bound Minimization

10 0.58019048 221 nips-2012-Multi-Stage Multi-Task Feature Learning

11 0.5728091 145 nips-2012-Gradient Weights help Nonparametric Regressors

12 0.57023245 76 nips-2012-Communication-Efficient Algorithms for Statistical Optimization

13 0.56183594 130 nips-2012-Feature-aware Label Space Dimension Reduction for Multi-label Classification

14 0.56042212 200 nips-2012-Local Supervised Learning through Space Partitioning

15 0.55799174 330 nips-2012-Supervised Learning with Similarity Functions

16 0.55000883 30 nips-2012-Accuracy at the Top

17 0.54716074 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs

18 0.54380822 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses

19 0.52913922 291 nips-2012-Reducing statistical time-series problems to binary classification

20 0.52789247 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.018), (21, 0.029), (38, 0.177), (42, 0.028), (45, 0.241), (54, 0.021), (55, 0.034), (74, 0.036), (76, 0.176), (80, 0.09), (92, 0.046)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.93144423 45 nips-2012-Approximating Equilibria in Sequential Auctions with Incomplete Information and Multi-Unit Demand

Author: Amy Greenwald, Jiacui Li, Eric Sodomka

Abstract: In many large economic markets, goods are sold through sequential auctions. Examples include eBay, online ad auctions, wireless spectrum auctions, and the Dutch flower auctions. In this paper, we combine methods from game theory and decision theory to search for approximate equilibria in sequential auction domains, in which bidders do not know their opponents’ values for goods, bidders only partially observe the actions of their opponents’, and bidders demand multiple goods. We restrict attention to two-phased strategies: first predict (i.e., learn); second, optimize. We use best-reply dynamics [4] for prediction (i.e., to predict other bidders’ strategies), and then assuming fixed other-bidder strategies, we estimate and solve the ensuing Markov decision processes (MDP) [18] for optimization. We exploit auction properties to represent the MDP in a more compact state space, and we use Monte Carlo simulation to make estimating the MDP tractable. We show how equilibria found using our search procedure compare to known equilibria for simpler auction domains, and we approximate an equilibrium for a more complex auction domain where analytical solutions are unknown. 1

same-paper 2 0.84849733 271 nips-2012-Pointwise Tracking the Optimal Regression Function

Author: Yair Wiener, Ran El-Yaniv

Abstract: This paper examines the possibility of a ‘reject option’ in the context of least squares regression. It is shown that using rejection it is theoretically possible to learn ‘selective’ regressors that can ǫ-pointwise track the best regressor in hindsight from the same hypothesis class, while rejecting only a bounded portion of the domain. Moreover, the rejected volume vanishes with the training set size, under certain conditions. We then develop efficient and exact implementation of these selective regressors for the case of linear regression. Empirical evaluation over a suite of real-world datasets corroborates the theoretical analysis and indicates that our selective regressors can provide substantial advantage by reducing estimation error.

3 0.84847939 220 nips-2012-Monte Carlo Methods for Maximum Margin Supervised Topic Models

Author: Qixia Jiang, Jun Zhu, Maosong Sun, Eric P. Xing

Abstract: An effective strategy to exploit the supervising side information for discovering predictive topic representations is to impose discriminative constraints induced by such information on the posterior distributions under a topic model. This strategy has been adopted by a number of supervised topic models, such as MedLDA, which employs max-margin posterior constraints. However, unlike the likelihoodbased supervised topic models, of which posterior inference can be carried out using the Bayes’ rule, the max-margin posterior constraints have made Monte Carlo methods infeasible or at least not directly applicable, thereby limited the choice of inference algorithms to be based on variational approximation with strict mean field assumptions. In this paper, we develop two efficient Monte Carlo methods under much weaker assumptions for max-margin supervised topic models based on an importance sampler and a collapsed Gibbs sampler, respectively, in a convex dual formulation. We report thorough experimental results that compare our approach favorably against existing alternatives in both accuracy and efficiency.

4 0.81936222 292 nips-2012-Regularized Off-Policy TD-Learning

Author: Bo Liu, Sridhar Mahadevan, Ji Liu

Abstract: We present a novel l1 regularized off-policy convergent TD-learning method (termed RO-TD), which is able to learn sparse representations of value functions with low computational complexity. The algorithmic framework underlying ROTD integrates two key ideas: off-policy convergent gradient TD methods, such as TDC, and a convex-concave saddle-point formulation of non-smooth convex optimization, which enables first-order solvers and feature selection using online convex regularization. A detailed theoretical and experimental analysis of RO-TD is presented. A variety of experiments are presented to illustrate the off-policy convergence, sparse feature selection capability and low computational cost of the RO-TD algorithm. 1

5 0.80951041 330 nips-2012-Supervised Learning with Similarity Functions

Author: Purushottam Kar, Prateek Jain

Abstract: We address the problem of general supervised learning when data can only be accessed through an (indefinite) similarity function between data points. Existing work on learning with indefinite kernels has concentrated solely on binary/multiclass classification problems. We propose a model that is generic enough to handle any supervised learning task and also subsumes the model previously proposed for classification. We give a “goodness” criterion for similarity functions w.r.t. a given supervised learning task and then adapt a well-known landmarking technique to provide efficient algorithms for supervised learning using “good” similarity functions. We demonstrate the effectiveness of our model on three important supervised learning problems: a) real-valued regression, b) ordinal regression and c) ranking where we show that our method guarantees bounded generalization error. Furthermore, for the case of real-valued regression, we give a natural goodness definition that, when used in conjunction with a recent result in sparse vector recovery, guarantees a sparse predictor with bounded generalization error. Finally, we report results of our learning algorithms on regression and ordinal regression tasks using non-PSD similarity functions and demonstrate the effectiveness of our algorithms, especially that of the sparse landmark selection algorithm that achieves significantly higher accuracies than the baseline methods while offering reduced computational costs. 1

6 0.74830484 179 nips-2012-Learning Manifolds with K-Means and K-Flats

7 0.74802649 319 nips-2012-Sparse Prediction with the $k$-Support Norm

8 0.74767888 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance

9 0.74736285 361 nips-2012-Volume Regularization for Binary Classification

10 0.74591833 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation

11 0.74567753 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

12 0.74550086 34 nips-2012-Active Learning of Multi-Index Function Models

13 0.74430656 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models

14 0.7442345 212 nips-2012-Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL

15 0.74400359 187 nips-2012-Learning curves for multi-task Gaussian process regression

16 0.74295986 127 nips-2012-Fast Bayesian Inference for Non-Conjugate Gaussian Process Regression

17 0.74293703 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

18 0.74276865 76 nips-2012-Communication-Efficient Algorithms for Statistical Optimization

19 0.74272656 16 nips-2012-A Polynomial-time Form of Robust Regression

20 0.74213767 364 nips-2012-Weighted Likelihood Policy Search with Model Selection