nips nips2010 nips2010-27 knowledge-graph by maker-knowledge-mining

27 nips-2010-Agnostic Active Learning Without Constraints


Source: pdf

Author: Alina Beygelzimer, John Langford, Tong Zhang, Daniel J. Hsu

Abstract: We present and analyze an agnostic active learning algorithm that works without keeping a version space. This is unlike all previous approaches, where a restricted set of candidate hypotheses is maintained throughout learning and only hypotheses from this set are ever returned. By avoiding this version space approach, our algorithm sheds the computational burden and brittleness associated with maintaining version spaces, yet still allows for substantial improvements over supervised learning for classification.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We present and analyze an agnostic active learning algorithm that works without keeping a version space. [sent-9, score-0.594]

2 This is unlike all previous approaches where a restricted set of candidate hypotheses is maintained throughout learning, and only hypotheses from this set are ever returned. [sent-10, score-0.309]

3 By avoiding this version space approach, our algorithm sheds the computational burden and brittleness associated with maintaining version spaces, yet still allows for substantial improvements over supervised learning for classification. [sent-11, score-0.231]

4 1 Introduction In active learning, a learner is given access to unlabeled data and is allowed to adaptively choose which ones to label. [sent-12, score-0.594]

5 Therefore, the hope is that the active learner only needs to query the labels of a small number of the unlabeled data, and otherwise perform as well as a fully supervised learner. [sent-14, score-0.797]

6 In this work, we are interested in agnostic active learning algorithms for binary classification that are provably consistent, i.e., that converge to an optimal hypothesis in a given hypothesis class. [sent-15, score-0.526]

8 One technique that has proved theoretically profitable is to maintain a candidate set of hypotheses (sometimes called a version space), and to query the label of a point only if there is disagreement within this set about how to label the point. [sent-18, score-0.694]

9 The criterion for membership in this candidate set needs to be carefully defined so that an optimal hypothesis is always included, but otherwise this set can be quickly whittled down as more labels are queried. [sent-19, score-0.22]

10 The first is computational intractability: maintaining a version space and guaranteeing that only hypotheses from this set are returned is difficult for linear predictors and appears intractable for interesting nonlinear predictors such as neural nets and decision trees [1]. [sent-22, score-0.252]

11 Another drawback of the approach is its brittleness: a single mishap (due to, say, modeling failures or computational approximations) might cause the learner to exclude the best hypothesis from the version space forever; this is an ungraceful failure mode that is not easy to correct. [sent-23, score-0.357]

12 As this oracle matches our abstraction of many supervised learning algorithms, we believe active learning algorithms built in this way are immediately and widely applicable. [sent-26, score-0.513]

13 Our approach instantiates the importance weighted active learning framework of [5] using a rejection threshold similar to the algorithm of [4], which only accesses hypotheses via a supervised learning oracle. [sent-27, score-1.214]

14 Moreover, our algorithm creates an importance weighted sample that allows for unbiased risk estimation, even for hypotheses from a class different from the one employed by the active learner. [sent-29, score-0.844]

15 We prove that our algorithm is always consistent and has an improved label complexity over passive learning in cases previously studied in the literature. [sent-33, score-0.47]

16 The algorithm from [4] extends the selective sampling method of [1] to the agnostic setting using generalization bounds in a manner similar to that first suggested in [3]. [sent-37, score-0.289]

17 It accesses hypotheses only through a special ERM oracle that can enforce an arbitrary number of example-based constraints; these constraints define a version space, and the algorithm only ever returns hypotheses from this space, which can be undesirable as we previously argued. [sent-38, score-0.444]

18 Our algorithm differs from these in that (i) it never restricts its attention to a version space when selecting a hypothesis to return, and (ii) it only requires an ERM oracle that enforces at most one example-based constraint, and this constraint is only used for selective sampling. [sent-42, score-0.272]

19 Our label complexity bounds are comparable to those proved in [5] (though somewhat worse than those in [3, 4, 6, 7]). [sent-43, score-0.269]

20 The use of importance weights to correct for sampling bias is a standard technique for many machine learning problems (e.g., [9, 10, 11]), including active learning [12, 13, 5]. [sent-44, score-0.24]

22 Our algorithm is based on the importance weighted active learning (IWAL) framework introduced by [5]. [sent-47, score-0.681]

23 In that work, a rejection threshold procedure called loss-weighting is rigorously analyzed and shown to yield improved label complexity bounds in certain cases. [sent-48, score-0.588]

24 On the other hand, the loss-weighting rejection threshold requires optimizing over a restricted version space, which is computationally undesirable. [sent-50, score-0.329]

25 Moreover, the label complexity bound given in [5] only applies to hypotheses selected from this version space, and not when selected from the entire hypothesis class (as the general IWAL framework suggests). [sent-51, score-0.558]

26 We avoid these deficiencies using a new rejection threshold procedure and a more subtle martingale analysis. [sent-52, score-0.328]

27 Many of the previously mentioned algorithms are analyzed in the agnostic learning model, where no assumption is made about the noise distribution (see also [14]). [sent-53, score-0.214]

28 In this setting, the label complexity of active learning algorithms cannot generally improve over supervised learners by more than a constant factor [15, 5]. [sent-54, score-0.607]

29 However, under a parameterization of the noise distribution related to Tsybakov’s low-noise condition [16], active learning algorithms have been shown to have improved label complexity bounds over what is achievable in the purely agnostic setting [17, 8, 18, 6, 7]. [sent-55, score-0.853]

30 An active learner receives a sequence (X1 , Y1 ), (X2 , Y2 ), . [sent-60, score-0.498]

31 This keeps the focus on the relevant aspects of active learning that differ from passive learning. [sent-75, score-0.552]

32 The goal of the active learner is to return a hypothesis h ∈ H with error err(h) not much more than err(h∗ ), using as few label queries as possible. [sent-78, score-0.849]

33 2 Importance Weighted Active Learning In the importance weighted active learning (IWAL) framework of [5], an active learner looks at the unlabeled data X1 , X2 , . [sent-80, score-1.25]

34 Then a coin with bias Pi is flipped, and the label Yi is queried if and only if the coin comes up heads. [sent-85, score-0.56]

35 The query probability Pi can depend on all previous unlabeled examples X1:i−1 , any previously queried labels, any past coin flips, and the current unlabeled point Xi . [sent-86, score-0.601]

36 Formally, an IWAL algorithm specifies a rejection threshold function p : (X × Y × {0, 1})∗ × X → [0, 1] for determining these query probabilities. [sent-87, score-0.402]

37 That is, Qi indicates if the label Yi is queried (the outcome of the coin toss). [sent-90, score-0.456]

38 Although the notation does not suggest this explicitly, the query probability Pi = p(Z1:i−1, Xi) is allowed to depend on a label Yj (j < i) if and only if that label has been queried (Qj = 1). [sent-91, score-0.443]
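
To make the sampling protocol concrete, here is a minimal sketch of the generic IWAL loop described above. The names (`run_iwal_sampling`, `label_oracle`) are illustrative, not from the paper; any rejection threshold function p mapping the history and the current point into (0, 1] can be plugged in.

```python
import random

def run_iwal_sampling(xs, label_oracle, p, seed=0):
    """Generic IWAL sampling loop: for each unlabeled x_i, compute the query
    probability P_i = p(history, x_i), flip a coin with bias P_i, and query
    the label only on heads, storing the importance weight 1/P_i."""
    rng = random.Random(seed)
    history = []   # Z_{1:i-1}: (X, Y-or-None, Q) triples visible to p
    sample = []    # importance weighted sample: (x, y, 1/P_i) triples
    for x in xs:
        p_i = p(history, x)                    # rejection threshold, in (0, 1]
        q_i = 1 if rng.random() < p_i else 0   # coin flip with bias P_i
        y = label_oracle(x) if q_i else None   # Y_i is revealed only if queried
        if q_i:
            sample.append((x, y, 1.0 / p_i))
        history.append((x, y, q_i))            # unqueried labels stay hidden
    return sample
```

With the constant threshold p ≡ 0.5, every collected triple carries importance weight 2, and roughly half of the labels are queried.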

39 3 Importance Weighted Estimators We first review some standard facts about the importance weighting technique. [sent-93, score-0.229]

40 For a function f : X × Y → R, define the importance weighted estimator of E[f(X, Y)] from Z1:n ∈ (X × Y × {0, 1})^n to be f̂(Z1:n) := (1/n) · Σ_{i=1}^n (Qi/Pi) · f(Xi, Yi). [sent-94, score-0.331]

41 Note that this quantity depends on a label Yi only if it has been queried (i.e., only if Qi = 1). [sent-95, score-0.396]

42 Our rejection threshold will be based on a specialization of this estimator, specifically the importance weighted empirical error of a hypothesis h: err(h, Z1:n) := (1/n) · Σ_{i=1}^n (Qi/Pi) · 1[h(Xi) ≠ Yi]. [sent-98, score-0.764]

43 In the notation of Algorithm 1, this is equivalent to err(h, Sn) := (1/n) · Σ_{(Xi, Yi, 1/Pi) ∈ Sn} (1/Pi) · 1[h(Xi) ≠ Yi] (1), where Sn ⊆ X × Y × R is the importance weighted sample collected by the algorithm. [sent-99, score-0.295]
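
The estimator in Eq. (1) is straightforward to compute from the collected sample. The sketch below assumes `sample` holds (Xi, Yi, 1/Pi) triples and `n` counts all rounds, queried or not; the names are illustrative.

```python
def iw_empirical_error(h, sample, n):
    """Importance weighted empirical error err(h, S_n) from Eq. (1):
    (1/n) * sum over queried triples of (1/P_i) * 1[h(X_i) != Y_i].
    `n` is the total number of rounds seen, not the number of queries."""
    return sum(w * (h(x) != y) for (x, y, w) in sample) / n

# A tiny example: three queried points out of n = 6 rounds.
sample = [(0.2, 0, 2.0), (0.7, 1, 4.0), (0.9, 1, 2.0)]
h_good = lambda x: 1 if x >= 0.5 else 0   # agrees with every queried label
h_bad = lambda x: 0                       # errs on the two positive points
```

Here `h_good` has importance weighted error 0, while `h_bad` pays the full weight 4.0 + 2.0 on its two mistakes, for an error of 6.0 / 6 = 1.0.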

44 So, for example, the importance weighted empirical error of a hypothesis h is an unbiased estimator of its true error err(h). [sent-101, score-0.603]

45 This holds for any choice of the rejection threshold that guarantees Pi > 0. [sent-102, score-0.314]

46 3 A Deviation Bound for Importance Weighted Estimators As mentioned before, the rejection threshold used by our algorithm is based on importance weighted error estimates err(h, Z1:n ). [sent-103, score-0.661]

47 To get a handle on this, we need a deviation bound for importance weighted estimators. [sent-105, score-0.399]

48 The importance weighted samples (Xi, Yi, 1/Pi) (or equivalently, the Zi = (Xi, Yi, Qi)) are not i.i.d. [sent-107, score-0.295]

49 This is because the query probability Pi (and thus the importance weight 1/Pi ) generally depends on Z1:i−1 and Xi . [sent-110, score-0.281]
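
The triples are not i.i.d., but unbiasedness survives, because E[Qi/Pi | history] = 1 whatever the history. A small simulation, with an arbitrary history-dependent query probability chosen purely for illustration, makes this concrete:

```python
import random

def iw_estimate(n, rng):
    """One importance weighted estimate of E[Y] for Y ~ Bernoulli(0.3),
    where the query probability P_i depends on the history (here, on the
    parity of the number of labels queried so far)."""
    total, queried = 0.0, 0
    for _ in range(n):
        y = 1.0 if rng.random() < 0.3 else 0.0   # E[Y] = 0.3
        p_i = 0.8 if queried % 2 == 0 else 0.4   # history-dependent, in (0, 1]
        if rng.random() < p_i:                   # coin flip: query the label?
            queried += 1
            total += y / p_i                     # Q_i / P_i importance weight
    return total / n

rng = random.Random(1)
avg = sum(iw_estimate(400, rng) for _ in range(400)) / 400
# avg concentrates near the true mean 0.3 despite the dependent sampling
```

Averaging many such estimates recovers 0.3 closely, even though each P_i was chosen adaptively.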

50 Consider any rejection threshold function p : (X × Y × {0, 1})∗ × X → (0, 1] for which Pn = p(Z1:n−1 , Xn ) is bounded below by some positive quantity (which may depend on n). [sent-115, score-0.361]

51 Equivalently, the query probabilities Pn should have inverses 1/Pn bounded above by some deterministic quantity rmax (which, again, may depend on n). [sent-116, score-0.246]

52 The a priori upper bound rmax on 1/Pn can be pessimistic, as the dependence on rmax in the final deviation bound is very mild: it enters only as log log rmax. [sent-117, score-0.483]

53 4 Algorithm First, we state a deviation bound for the importance weighted error of hypotheses in a finite hypothesis class H that holds for all n ≥ 1. [sent-130, score-0.737]

54 Suppose the algorithm is run using a rejection threshold p : (X × Y × {0, 1})∗ × X → [0, 1] that satisfies p(z1:n, x) ≥ 1/n^n for all (z1:n, x) ∈ (X × Y × {0, 1})^n × X and all n ≥ 1. [sent-139, score-0.286]

55 The following absolute constants are used in the description of the rejection threshold. Algorithm 1 Notes: see Eq. [sent-144, score-0.211]

56 (1) for the definition of err (importance weighted error), and Section 4 for the definitions of C0 , c1 , and c2 . [sent-145, score-0.63]

57 Figure 1: Algorithm for importance weighted active learning with an error minimization oracle. [sent-160, score-0.711]

58 The rejection threshold (Step 2) is based on the deviation bound from Lemma 1. [sent-163, score-0.39]

59 First, the importance weighted error minimizing hypothesis hk and the “alternative” hypothesis h′k are found. [sent-164, score-0.694]

60 Note that both optimizations are over the entire hypothesis class H (with h′k only being required to disagree with hk on xk); this is a key aspect where our algorithm differs from previous approaches. [sent-165, score-0.355]

61 The difference in importance weighted errors Gk of the two hypotheses is then computed. [sent-166, score-0.445]

62 In order to apply Lemma 1 with our rejection threshold, we need to establish the (very crude) bound Pk ≥ 1/k^k for all k. [sent-173, score-0.274]
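
Putting the pieces together, here is a schematic, self-contained sketch of the Algorithm 1 loop for a finite hypothesis class. It is not the paper's exact rejection threshold (the precise rule involves the constants C0, c1, c2 described above); it keeps only the key structure: an unrestricted ERM step, an ERM step forced to disagree on xk, and a query probability that shrinks once the importance weighted error gap Gk exceeds the deviation slack. All names and the specific shrinking rule are illustrative.

```python
import math
import random

def iwal_sketch(stream, hypotheses, c0=0.1, seed=0):
    """Schematic Algorithm 1 loop (illustrative, not the paper's exact rule).

    stream: list of (x, y) pairs; labels are used only when queried.
    hypotheses: finite list of callables x -> {0, 1} standing in for H.
    """
    rng = random.Random(seed)
    sample = []   # importance weighted sample S: (x, y, 1/p) triples
    queries = 0

    def iw_err(h, rounds):
        # importance weighted empirical error, normalized by rounds seen
        if rounds == 0:
            return 0.0
        return sum(w * (h(x) != y) for x, y, w in sample) / rounds

    for k, (x, y) in enumerate(stream, start=1):
        h_k = min(hypotheses, key=lambda h: iw_err(h, k - 1))  # ERM over all of H
        rivals = [h for h in hypotheses if h(x) != h_k(x)]     # must disagree on x_k
        if not rivals:
            p = 0.0   # every hypothesis agrees on x_k: the label cannot matter
        else:
            h_alt = min(rivals, key=lambda h: iw_err(h, k - 1))
            g_k = iw_err(h_alt, k - 1) - iw_err(h_k, k - 1)    # error gap G_k
            slack = math.sqrt(c0 * math.log(max(2, k)) / max(1, k - 1))
            # query surely while the gap is within the deviation slack;
            # otherwise shrink p with the gap (clipped at a fixed floor,
            # standing in for the paper's crude P_k >= 1/k^k bound)
            p = 1.0 if g_k <= slack else max(1e-9, (slack / g_k) ** 2)
        if p > 0.0 and rng.random() < p:
            sample.append((x, y, 1.0 / p))
            queries += 1

    h_final = min(hypotheses, key=lambda h: iw_err(h, len(stream)))
    return h_final, queries

# Toy run: separable 1-D data, threshold hypotheses.
rng = random.Random(42)
stream = [(x, int(x >= 0.5)) for x in (rng.random() for _ in range(300))]
thresholds = [lambda x, t=t: int(x >= t)
              for t in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)]
h_final, queries = iwal_sketch(stream, thresholds)
```

On this separable toy stream the sketch queries only a fraction of the labels while still returning a near-optimal threshold, which is the qualitative behavior the analysis predicts.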

63 The rejection threshold of Algorithm 1 satisfies p(z1:n−1, x) ≥ 1/n^n for all n ≥ 1 and all (z1:n−1, x) ∈ (X × Y × {0, 1})^(n−1) × X. [sent-175, score-0.286]

64 Analysis (Correctness). We first prove a consistency guarantee for Algorithm 1 that bounds the generalization error of the importance weighted empirical error minimizer. [sent-178, score-0.477]

65 The proof actually establishes a lower bound on the query probabilities: Pi ≥ 1/2 for Xi such that hn(Xi) ≠ h∗(Xi). [sent-179, score-0.224]

66 This offers an intuitive characterization of the weighting landscape induced by the importance weights 1/Pi . [sent-180, score-0.229]

67 err(hn) ≤ err(h∗) + O(√(C0 log n / (n − 1)) + C0 log n / (n − 1)). Therefore, the final hypothesis returned by Algorithm 1 after seeing n unlabeled data points has roughly the same error bound as a hypothesis returned by a standard passive learner with n labeled data points. [sent-185, score-0.912]

68 The following lemma bounds the probability of querying the label Yn ; this is subsequently used to establish the final bound on the expected number of labels queried. [sent-189, score-0.4]

69 This is mediated through the disagreement coefficient, a quantity first used by [14] for analyzing the label complexity of the A2 algorithm of [3]. [sent-191, score-0.368]

70 The disagreement coefficient θ := θ(h∗, H, D) is defined as θ(h∗, H, D) := sup{ Pr(X ∈ DIS(h∗, r)) / r : r > 0 }, where DIS(h∗, r) := {x ∈ X : ∃h′ ∈ H such that Pr(h∗(X) ≠ h′(X)) ≤ r and h∗(x) ≠ h′(x)} (the disagreement region around h∗ at radius r). [sent-192, score-0.204]
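
For intuition, the disagreement coefficient has a simple closed form for simple classes. Take threshold classifiers h_t(x) = 1[x ≥ t] with X uniform on [0, 1] and h∗ = h_{0.5}: hypotheses within disagreement-distance r of h∗ are the thresholds with |t − 0.5| ≤ r, so DIS(h∗, r) is an interval of mass min(2r, 1) around 0.5 and θ = 2. A few lines (names illustrative) verify the supremum numerically:

```python
# Pr(X in DIS(h*, r)) for threshold classifiers on uniform [0, 1] with the
# optimal threshold at 0.5: the disagreement region at radius r is the
# interval (0.5 - r, 0.5 + r), clipped to [0, 1].
def dis_mass(r):
    return min(2.0 * r, 1.0)

radii = [0.01 * i for i in range(1, 51)]       # r ranging over (0, 0.5]
theta = max(dis_mass(r) / r for r in radii)    # sup_r Pr(X in DIS(h*, r)) / r
```

The ratio is exactly 2 for every r ≤ 1/2, so the supremum is θ = 2, a bounded disagreement coefficient of the kind the label complexity bounds below exploit.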

71 With probability at least 1 − δ, the expected number of labels queried by Algorithm 1 after n iterations is at most 1 + 2θ · err(h∗) · (n − 1) + O(θ · √(C0 n log n) + θ · C0 · log³ n). [sent-200, score-0.315]

72 The linear term err(h∗ ) · n is unavoidable in the worst case, as evident from label complexity lower bounds [15, 5]. [sent-202, score-0.269]

73 If err(h∗) = 0 (i.e., the data is separable) and θ is bounded (as is the case for many problems studied in the literature [14]), then the bound represents a polynomial label complexity improvement over supervised learning, similar to that achieved by the version space algorithm from [5]. [sent-205, score-0.434]

74 3 Analysis under Low Noise Conditions Some recent work on active learning has focused on improved label complexity under certain noise conditions [17, 8, 18, 6, 7]. [sent-207, score-0.616]

75 Essentially, this condition requires that low error hypotheses not be too far from the optimal hypothesis h∗ under the disagreement metric Pr(h∗(X) ≠ h(X)). [sent-210, score-0.412]

76 With probability at least 1 − δ, the expected number of labels queried by Algorithm 1 after n iterations is at most θ · κ · cα · (C0 log n)^(α/2) · n^(1−α/2). [sent-216, score-0.315]

77 Note that the bound is sublinear in n for all 0 < α ≤ 1, which implies label complexity improvements whenever θ is bounded (an improved analogue of Theorem 2 under these conditions can be established using similar techniques). [sent-217, score-0.355]

78 6 Experiments Although agnostic learning is typically intractable in the worst case, empirical risk minimization can serve as a useful abstraction for many practical supervised learning algorithms in non-worst case scenarios. [sent-219, score-0.318]

79 We used J48 (with default parameters) to select the hypothesis hk in each round k; to produce the “alternative” hypothesis h′k, we just modify the decision tree hk by changing the label of the node used for predicting on xk. [sent-223, score-0.672]

80 We set C0 = 8 and c1 = c2 = 1; these can be regarded as tuning parameters, with C0 controlling the aggressiveness of the rejection threshold. [sent-225, score-0.211]

81 We did not perform parameter tuning with active learning although the importance weighting approach developed here could potentially be used for that. [sent-226, score-0.59]

82 This required modifying how h′k is selected: we force h′k(xk) ≠ hk(xk) by changing the label of the prediction node for xk to the next best label. [sent-234, score-0.306]

83 2 Results We examined the test error as a function of (i) the number of unlabeled data seen, and (ii) the number of labels queried. [sent-237, score-0.214]

84 We compared the performance of the active learner described above to a passive learner (one that queries every label, so (i) and (ii) are the same) using J48 with default parameters. [sent-238, score-0.902]

85 In all three cases, the test errors as a function of the number of unlabeled data were roughly the same for both the active and passive learners. [sent-239, score-0.669]

86 We note that this is a basic property not satisfied by many active learning algorithms (this issue is discussed further in [22]). [sent-241, score-0.361]

87 In terms of test error as a function of the number of labels queried (Figure 2), the active learner had minimal improvement over the passive learner on the binary MNIST task, but a substantial improvement over the passive learner on the KDDCUP99 task (even at small numbers of label queries). [sent-242, score-1.724]

88 For the multi-class MNIST task, the active learner had a moderate improvement over the passive learner. [sent-243, score-0.715]

89 Note that KDDCUP99 is far less noisy (more separable) than the MNIST 3s vs 5s task, so the results are in line with the label complexity behavior suggested by Theorem 3, which states that the label complexity improvement may scale with the error of the optimal hypothesis. [sent-244, score-0.475]

90 01 0 1000 2000 3000 number of labels queried 0 4000 0 MNIST 3s vs 5s 0. [sent-255, score-0.277]

91 08 1000 2000 3000 4000 number of labels queried 0. [sent-259, score-0.277]

92 14 0 0 100 200 300 400 500 number of labels queried 0 600 KDDCUP99 (close-up) 1 2 3 number of labels queried 4 4 x 10 MNIST multi-class (close-up) Figure 2: Test errors as a function of the number of labels queried. [sent-266, score-0.64]

93 The results from the MNIST tasks suggest that the active learner may require an initial random sampling phase during which it is equivalent to the passive learner, and the advantage manifests itself after this phase. [sent-267, score-0.689]

94 7 Conclusion This paper provides a new active learning algorithm based on error minimization oracles, a departure from the version space approach adopted by previous works. [sent-269, score-0.484]

95 The algorithm we introduce here motivates computationally tractable and effective methods for active learning with many classifier training algorithms. [sent-270, score-0.41]

96 Furthermore, although these properties might only hold in an approximate or heuristic sense, the resulting active learning algorithm will be “safe” in the sense that it will eventually converge to the same solution as a passive supervised learning algorithm. [sent-272, score-0.65]

97 Recent theoretical work on active learning has focused on improving rates of convergence. [sent-274, score-0.386]

98 Rademacher complexities and bounding the excess risk in active learning. [sent-314, score-0.388]

99 Covariate shift adaptation by importance weighted cross u validation. [sent-344, score-0.295]

100 A bound on the label complexity of agnostic active learning. [sent-356, score-0.762]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('err', 0.525), ('active', 0.337), ('queried', 0.214), ('rejection', 0.211), ('passive', 0.191), ('importance', 0.19), ('agnostic', 0.165), ('learner', 0.161), ('qi', 0.153), ('label', 0.138), ('hypothesis', 0.128), ('hypotheses', 0.127), ('pi', 0.116), ('weighted', 0.105), ('mnist', 0.104), ('coin', 0.104), ('disagreement', 0.102), ('pk', 0.099), ('unlabeled', 0.096), ('yi', 0.095), ('iwal', 0.095), ('gk', 0.093), ('sk', 0.091), ('query', 0.091), ('hk', 0.088), ('xk', 0.08), ('rmax', 0.08), ('ei', 0.077), ('threshold', 0.075), ('bounds', 0.072), ('hn', 0.07), ('erm', 0.065), ('labels', 0.063), ('bound', 0.063), ('xi', 0.062), ('complexity', 0.059), ('wi', 0.057), ('error', 0.055), ('pr', 0.052), ('oracle', 0.049), ('supervised', 0.049), ('accesses', 0.047), ('brittleness', 0.047), ('beygelzimer', 0.047), ('quantity', 0.044), ('version', 0.043), ('sn', 0.042), ('heads', 0.042), ('martingale', 0.042), ('toss', 0.042), ('deviation', 0.041), ('weighting', 0.039), ('rutgers', 0.038), ('castro', 0.038), ('log', 0.038), ('dasgupta', 0.038), ('estimator', 0.036), ('misspeci', 0.036), ('returned', 0.034), ('unbiased', 0.034), ('optimizations', 0.034), ('lemma', 0.034), ('improved', 0.033), ('dis', 0.031), ('balcan', 0.031), ('sublinear', 0.031), ('bounded', 0.031), ('queries', 0.03), ('querying', 0.03), ('abstraction', 0.03), ('twentieth', 0.03), ('annual', 0.029), ('candidate', 0.029), ('yahoo', 0.028), ('holds', 0.028), ('estimators', 0.028), ('tsybakov', 0.027), ('qj', 0.027), ('hsu', 0.027), ('pn', 0.027), ('selective', 0.027), ('technique', 0.026), ('risk', 0.026), ('ever', 0.026), ('improvement', 0.026), ('rates', 0.025), ('noise', 0.025), ('algorithm', 0.025), ('drawback', 0.025), ('excess', 0.025), ('learning', 0.024), ('motivates', 0.024), ('predictors', 0.024), ('algorithmic', 0.023), ('yk', 0.023), ('variance', 0.023), ('errors', 0.023), ('labeling', 0.023), ('default', 0.022), ('roughly', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000006 27 nips-2010-Agnostic Active Learning Without Constraints

Author: Alina Beygelzimer, John Langford, Tong Zhang, Daniel J. Hsu

Abstract: We present and analyze an agnostic active learning algorithm that works without keeping a version space. This is unlike all previous approaches, where a restricted set of candidate hypotheses is maintained throughout learning and only hypotheses from this set are ever returned. By avoiding this version space approach, our algorithm sheds the computational burden and brittleness associated with maintaining version spaces, yet still allows for substantial improvements over supervised learning for classification.

2 0.21915999 22 nips-2010-Active Estimation of F-Measures

Author: Christoph Sawade, Niels Landwehr, Tobias Scheffer

Abstract: We address the problem of estimating the Fα -measure of a given model as accurately as possible on a fixed labeling budget. This problem occurs whenever an estimate cannot be obtained from held-out training data; for instance, when data that have been used to train the model are held back for reasons of privacy or do not reflect the test distribution. In this case, new test instances have to be drawn and labeled at a cost. An active estimation procedure selects instances according to an instrumental sampling distribution. An analysis of the sources of estimation error leads to an optimal sampling distribution that minimizes estimator variance. We explore conditions under which active estimates of Fα -measures are more accurate than estimates based on instances sampled from the test distribution. 1

3 0.21764404 173 nips-2010-Multi-View Active Learning in the Non-Realizable Case

Author: Wei Wang, Zhi-hua Zhou

Abstract: The sample complexity of active learning under the realizability assumption has been well-studied. The realizability assumption, however, rarely holds in practice. In this paper, we theoretically characterize the sample complexity of active learning in the non-realizable case under multi-view setting. We prove that, with unbounded Tsybakov noise, the sample complexity of multi-view active learning can be O(log 1 ), contrasting to single-view setting where the polynomial improveǫ ment is the best possible achievement. We also prove that in general multi-view setting the sample complexity of active learning with unbounded Tsybakov noise is O( 1 ), where the order of 1/ǫ is independent of the parameter in Tsybakov noise, ǫ contrasting to previous polynomial bounds where the order of 1/ǫ is related to the parameter in Tsybakov noise. 1

4 0.21360213 243 nips-2010-Smoothness, Low Noise and Fast Rates

Author: Nathan Srebro, Karthik Sridharan, Ambuj Tewari

Abstract: √ ˜ We establish an excess risk bound of O HR2 + HL∗ Rn for ERM with an H-smooth loss n function and a hypothesis class with Rademacher complexity Rn , where L∗ is the best risk achievable by the hypothesis class. For typical hypothesis classes where Rn = R/n, this translates to ˜ ˜ a learning rate of O (RH/n) in the separable (L∗ = 0) case and O RH/n + L∗ RH/n more generally. We also provide similar guarantees for online and stochastic convex optimization of a smooth non-negative objective. 1

5 0.1717011 25 nips-2010-Active Learning by Querying Informative and Representative Examples

Author: Sheng-jun Huang, Rong Jin, Zhi-hua Zhou

Abstract: Most active learning approaches select either informative or representative unlabeled instances to query their labels. Although several active learning algorithms have been proposed to combine the two criteria for query selection, they are usually ad hoc in finding unlabeled instances that are both informative and representative. We address this challenge by a principled approach, termed Q UIRE, based on the min-max view of active learning. The proposed approach provides a systematic way for measuring and combining the informativeness and representativeness of an instance. Extensive experimental results show that the proposed Q UIRE approach outperforms several state-of -the-art active learning approaches. 1

6 0.14349648 142 nips-2010-Learning Bounds for Importance Weighting

7 0.14223388 23 nips-2010-Active Instance Sampling via Matrix Partition

8 0.12359539 112 nips-2010-Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning

9 0.12234057 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average

10 0.11753461 236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information

11 0.10748369 191 nips-2010-On the Theory of Learnining with Privileged Information

12 0.1038917 180 nips-2010-Near-Optimal Bayesian Active Learning with Noisy Observations

13 0.10236255 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks

14 0.10171003 270 nips-2010-Tight Sample Complexity of Large-Margin Learning

15 0.09390381 232 nips-2010-Sample Complexity of Testing the Manifold Hypothesis

16 0.091378696 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning

17 0.080380134 45 nips-2010-CUR from a Sparse Optimization Viewpoint

18 0.077061579 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability

19 0.074781969 24 nips-2010-Active Learning Applied to Patient-Adaptive Heartbeat Classification

20 0.074053742 177 nips-2010-Multitask Learning without Label Correspondences


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.225), (1, 0.022), (2, 0.175), (3, -0.029), (4, 0.01), (5, 0.152), (6, -0.183), (7, -0.118), (8, 0.018), (9, -0.243), (10, 0.085), (11, -0.085), (12, -0.162), (13, -0.098), (14, -0.103), (15, -0.016), (16, -0.121), (17, -0.053), (18, -0.051), (19, 0.048), (20, 0.019), (21, 0.101), (22, -0.01), (23, 0.014), (24, 0.014), (25, -0.027), (26, -0.09), (27, 0.024), (28, 0.105), (29, -0.059), (30, 0.009), (31, -0.018), (32, -0.011), (33, 0.005), (34, 0.008), (35, -0.093), (36, -0.036), (37, -0.078), (38, 0.087), (39, 0.11), (40, 0.001), (41, -0.101), (42, 0.089), (43, 0.022), (44, 0.069), (45, 0.028), (46, 0.011), (47, 0.09), (48, 0.002), (49, 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96003336 27 nips-2010-Agnostic Active Learning Without Constraints

Author: Alina Beygelzimer, John Langford, Tong Zhang, Daniel J. Hsu

Abstract: We present and analyze an agnostic active learning algorithm that works without keeping a version space. This is unlike all previous approaches, where a restricted set of candidate hypotheses is maintained throughout learning and only hypotheses from this set are ever returned. By avoiding this version space approach, our algorithm sheds the computational burden and brittleness associated with maintaining version spaces, yet still allows for substantial improvements over supervised learning for classification.

2 0.84123003 173 nips-2010-Multi-View Active Learning in the Non-Realizable Case

Author: Wei Wang, Zhi-hua Zhou

Abstract: The sample complexity of active learning under the realizability assumption has been well-studied. The realizability assumption, however, rarely holds in practice. In this paper, we theoretically characterize the sample complexity of active learning in the non-realizable case under multi-view setting. We prove that, with unbounded Tsybakov noise, the sample complexity of multi-view active learning can be O(log 1 ), contrasting to single-view setting where the polynomial improveǫ ment is the best possible achievement. We also prove that in general multi-view setting the sample complexity of active learning with unbounded Tsybakov noise is O( 1 ), where the order of 1/ǫ is independent of the parameter in Tsybakov noise, ǫ contrasting to previous polynomial bounds where the order of 1/ǫ is related to the parameter in Tsybakov noise. 1

3 0.81550282 22 nips-2010-Active Estimation of F-Measures

Author: Christoph Sawade, Niels Landwehr, Tobias Scheffer

Abstract: We address the problem of estimating the Fα -measure of a given model as accurately as possible on a fixed labeling budget. This problem occurs whenever an estimate cannot be obtained from held-out training data; for instance, when data that have been used to train the model are held back for reasons of privacy or do not reflect the test distribution. In this case, new test instances have to be drawn and labeled at a cost. An active estimation procedure selects instances according to an instrumental sampling distribution. An analysis of the sources of estimation error leads to an optimal sampling distribution that minimizes estimator variance. We explore conditions under which active estimates of Fα -measures are more accurate than estimates based on instances sampled from the test distribution. 1

4 0.70850623 25 nips-2010-Active Learning by Querying Informative and Representative Examples

Author: Sheng-jun Huang, Rong Jin, Zhi-hua Zhou

Abstract: Most active learning approaches select either informative or representative unlabeled instances to query their labels. Although several active learning algorithms have been proposed to combine the two criteria for query selection, they are usually ad hoc in finding unlabeled instances that are both informative and representative. We address this challenge by a principled approach, termed Q UIRE, based on the min-max view of active learning. The proposed approach provides a systematic way for measuring and combining the informativeness and representativeness of an instance. Extensive experimental results show that the proposed Q UIRE approach outperforms several state-of -the-art active learning approaches. 1

5 0.70540959 142 nips-2010-Learning Bounds for Importance Weighting

Author: Corinna Cortes, Yishay Mansour, Mehryar Mohri

Abstract: This paper presents an analysis of importance weighting for learning from finite samples and gives a series of theoretical and algorithmic results. We point out simple cases where importance weighting can fail, which suggests the need for an analysis of the properties of this technique. We then give both upper and lower bounds for generalization with bounded importance weights and, more significantly, give learning guarantees for the more common case of unbounded importance weights under the weak assumption that the second moment is bounded, a condition related to the R´ nyi divergence of the training and test distributions. e These results are based on a series of novel and general bounds we derive for unbounded loss functions, which are of independent interest. We use these bounds to guide the definition of an alternative reweighting algorithm and report the results of experiments demonstrating its benefits. Finally, we analyze the properties of normalized importance weights which are also commonly used.

6 0.67980677 112 nips-2010-Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning

7 0.62787163 191 nips-2010-On the Theory of Learning with Privileged Information

8 0.59638268 23 nips-2010-Active Instance Sampling via Matrix Partition

9 0.53621942 243 nips-2010-Smoothness, Low Noise and Fast Rates

10 0.5215627 180 nips-2010-Near-Optimal Bayesian Active Learning with Noisy Observations

11 0.5215106 236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information

12 0.50742513 270 nips-2010-Tight Sample Complexity of Large-Margin Learning

13 0.46334305 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average

14 0.43912855 202 nips-2010-Parallelized Stochastic Gradient Descent

15 0.43683448 232 nips-2010-Sample Complexity of Testing the Manifold Hypothesis

16 0.43500927 205 nips-2010-Permutation Complexity Bound on Out-Sample Error

17 0.43116403 47 nips-2010-Co-regularization Based Semi-supervised Domain Adaptation

18 0.42595899 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods

19 0.42286873 289 nips-2010-b-Bit Minwise Hashing for Estimating Three-Way Similarities

20 0.37261117 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.048), (24, 0.018), (27, 0.044), (30, 0.088), (35, 0.022), (45, 0.217), (50, 0.041), (52, 0.028), (60, 0.057), (77, 0.07), (78, 0.048), (82, 0.191), (90, 0.044)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.87955445 64 nips-2010-Distributionally Robust Markov Decision Processes

Author: Huan Xu, Shie Mannor

Abstract: We consider Markov decision processes where the values of the parameters are uncertain. This uncertainty is described by a sequence of nested sets (that is, each set contains the previous one), each of which corresponds to a probabilistic guarantee for a different confidence level so that a set of admissible probability distributions of the unknown parameters is specified. This formulation models the case where the decision maker is aware of and wants to exploit some (yet imprecise) a-priori information of the distribution of parameters, and arises naturally in practice where methods to estimate the confidence region of parameters abound. We propose a decision criterion based on distributional robustness: the optimal policy maximizes the expected total reward under the most adversarial probability distribution over realizations of the uncertain parameters that is admissible (i.e., it agrees with the a-priori information). We show that finding the optimal distributionally robust policy can be reduced to a standard robust MDP where the parameters belong to a single uncertainty set, hence it can be computed in polynomial time under mild technical conditions.
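The decision criterion described above (the agent maximizes reward under the most adversarial admissible parameter realization) can be sketched by replacing the general admissible set with a finite list of candidate transition kernels and running a robust value iteration. Every concrete choice below (state/action counts, rewards, kernels, discount) is a toy instance of ours, not from the paper:

```python
import numpy as np

# Toy robust MDP: 2 states, 2 actions, uncertainty set given as a finite list
# of candidate transition kernels, each of shape [state, action, next_state].
n_states, n_actions, gamma = 2, 2, 0.9
reward = np.array([[1.0, 0.0], [0.0, 1.0]])   # reward[s, a]

candidates = [
    np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]]),
    np.array([[[0.7, 0.3], [0.4, 0.6]],
              [[0.6, 0.4], [0.5, 0.5]]]),
]

V = np.zeros(n_states)
for _ in range(500):                          # robust value iteration
    Q = np.empty((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            # Adversary: pick the admissible kernel minimizing the backup.
            Q[s, a] = reward[s, a] + gamma * min(P[s, a] @ V for P in candidates)
    V = Q.max(axis=1)                         # agent: maximize over actions
```

With a finite candidate set the inner minimization is a plain `min`; the paper's point is that the richer nested-set (distributionally robust) formulation still reduces to a single-uncertainty-set robust MDP of this min-max form.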

same-paper 2 0.85973632 27 nips-2010-Agnostic Active Learning Without Constraints

Author: Alina Beygelzimer, John Langford, Zhang Tong, Daniel J. Hsu

Abstract: We present and analyze an agnostic active learning algorithm that works without keeping a version space. This is unlike all previous approaches where a restricted set of candidate hypotheses is maintained throughout learning, and only hypotheses from this set are ever returned. By avoiding this version space approach, our algorithm sheds the computational burden and brittleness associated with maintaining version spaces, yet still allows for substantial improvements over supervised learning for classification.

3 0.80375469 222 nips-2010-Random Walk Approach to Regret Minimization

Author: Hariharan Narayanan, Alexander Rakhlin

Abstract: We propose a computationally efficient random walk on a convex body which rapidly mixes to a time-varying Gibbs distribution. In the setting of online convex optimization and repeated games, the algorithm yields low regret and presents a novel efficient method for implementing mixture forecasting strategies.

4 0.7931053 243 nips-2010-Smoothness, Low Noise and Fast Rates

Author: Nathan Srebro, Karthik Sridharan, Ambuj Tewari

Abstract: We establish an excess risk bound of Õ(H·Rn² + √(H·L*)·Rn) for ERM with an H-smooth loss function and a hypothesis class with Rademacher complexity Rn, where L* is the best risk achievable by the hypothesis class. For typical hypothesis classes where Rn = √(R/n), this translates to a learning rate of Õ(RH/n) in the separable (L* = 0) case and Õ(RH/n + √(L*·RH/n)) more generally. We also provide similar guarantees for online and stochastic convex optimization of a smooth non-negative objective.
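For readability, the bounds in the abstract above can be typeset as follows (our reconstruction of the scraped notation, with R_n the Rademacher complexity and L* the best achievable risk):

```latex
% Excess risk bound for ERM with an H-smooth loss:
\tilde{O}\!\left( H R_n^2 + \sqrt{H L^*}\, R_n \right),
\qquad R_n = \sqrt{R/n}
% which, for typical hypothesis classes, gives the rates
\;\Longrightarrow\;
\tilde{O}\!\left(\tfrac{RH}{n}\right) \ \text{when } L^* = 0,
\qquad
\tilde{O}\!\left(\tfrac{RH}{n} + \sqrt{\tfrac{L^* R H}{n}}\right) \ \text{in general.}
```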

5 0.79226702 63 nips-2010-Distributed Dual Averaging In Networks

Author: Alekh Agarwal, Martin J. Wainwright, John C. Duchi

Abstract: The goal of decentralized optimization over a network is to optimize a global objective formed by a sum of local (possibly nonsmooth) convex functions using only local computation and communication. We develop and analyze distributed algorithms based on dual averaging of subgradients, and provide sharp bounds on their convergence rates as a function of the network size and topology. Our analysis clearly separates the convergence of the optimization algorithm itself from the effects of communication constraints arising from the network structure. We show that the number of iterations required by our algorithm scales inversely in the spectral gap of the network. The sharpness of this prediction is confirmed both by theoretical lower bounds and simulations for various networks. 1
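The update the abstract describes — each node mixes its neighbors' dual (running-subgradient) variables through a doubly stochastic matrix, adds its own local subgradient, then takes a proximal step — can be sketched on a toy consensus problem. The ring topology, quadratic local objectives, and 1/√t step size below are illustrative choices of ours, not the paper's experiments:

```python
import numpy as np

# n agents on a ring; agent i holds f_i(x) = 0.5 * (x - b_i)^2, so the
# network-wide objective (the average of the f_i) is minimized at mean(b).
n, T = 8, 2000
b = np.arange(n, dtype=float)                # local targets b_i

P = np.zeros((n, n))                         # doubly stochastic mixing matrix
for i in range(n):
    P[i, i] = 0.5
    P[i, (i - 1) % n] = P[i, (i + 1) % n] = 0.25

z = np.zeros(n)                              # dual variables (one scalar per agent)
x = np.zeros(n)                              # primal iterates
for t in range(1, T + 1):
    g = x - b                                # local gradient of f_i at x_i
    z = P @ z + g                            # mix neighbors' duals, add own gradient
    x = -z / np.sqrt(t)                      # proximal step with psi(x) = x^2/2
```

After T steps the iterates agree approximately and sit near mean(b); how fast the disagreement dies out is governed by the spectral gap of P, which is the dependence the paper makes precise.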

6 0.78670204 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability

7 0.78498751 282 nips-2010-Variable margin losses for classifier design

8 0.78498149 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models

9 0.78439146 265 nips-2010-The LASSO risk: asymptotic results and real world examples

10 0.78433752 148 nips-2010-Learning Networks of Stochastic Differential Equations

11 0.78384054 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery

12 0.78201073 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average

13 0.78185314 117 nips-2010-Identifying graph-structured activation patterns in networks

14 0.78100157 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning

15 0.78094494 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods

16 0.78059983 290 nips-2010-t-logistic regression

17 0.7801773 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression

18 0.77945048 155 nips-2010-Learning the context of a category

19 0.77904987 31 nips-2010-An analysis on negative curvature induced by singularity in multi-layer neural-network learning

20 0.77814549 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework