
41 nips-2005-Coarse sample complexity bounds for active learning


Author: Sanjoy Dasgupta

Abstract: We characterize the sample complexity of active learning problems in terms of a parameter which takes into account the distribution over the input space, the specific target hypothesis, and the desired accuracy.


Summary: the most important sentences generated by the tf-idf model

sentIndex sentText sentNum sentScore

1 Coarse sample complexity bounds for active learning Sanjoy Dasgupta UC San Diego dasgupta@cs. [sent-1, score-0.49]

2 edu Abstract We characterize the sample complexity of active learning problems in terms of a parameter which takes into account the distribution over the input space, the speciﬁc target hypothesis, and the desired accuracy. [sent-3, score-0.692]

3 1 Introduction The goal of active learning is to learn a classiﬁer in a setting where data comes unlabeled, and any labels must be explicitly requested and paid for. [sent-4, score-0.483]

4 So far the most encouraging theoretical results in this ﬁeld are [7, 6], which show that if the hypothesis class is that of homogeneous (i. [sent-6, score-0.311]

5 through the origin) linear separators, and the data is distributed uniformly over the unit sphere in Rd , and the labels correspond perfectly to one of the hypotheses (i. [sent-8, score-0.596]

6 the separable case) then at most O(d log d/ǫ) labels are needed to learn a classiﬁer with error less than ǫ. [sent-10, score-0.292]

7 This is exponentially smaller than the usual Ω(d/ǫ) sample complexity of learning linear classiﬁers in a supervised setting. [sent-11, score-0.291]

8 In fact, in this example the label complexity of active learning depends heavily on the speciﬁc target hypothesis, and ranges from O(log 1/ǫ) to Ω(1/ǫ). [sent-14, score-0.68]

9 In this paper, we consider arbitrary hypothesis classes H of VC dimension d < ∞, and learning problems which are separable. [sent-15, score-0.258]

10 We characterize the sample complexity of active learning in terms of a parameter which takes into account: (1) the distribution P over the input space X ; (2) the speciﬁc target hypothesis h∗ ∈ H; and (3) the desired accuracy ǫ. [sent-16, score-0.986]

11 We show that this quantity fairly tightly describes the sample complexity of active learning: any active learning scheme requires Ω(1/ρ) labels and there is a generic active learner which always uses at most Õ(d/ρ) labels (footnote 1). [sent-18, score-1.273]

12 This ρ is always at least ε; if it is ε we just get the usual sample complexity of supervised learning. Footnote 1: The Õ(·) notation hides factors polylogarithmic in d, 1/ε, 1/δ, and 1/τ. [sent-19, score-0.31]

13 But sometimes ρ is a constant, and in such instances active learning gives an exponential improvement in the number of labels needed. [sent-21, score-0.515]

14 We look at various hypothesis classes and derive splitting indices for target hypotheses at different levels of accuracy. [sent-22, score-0.887]

15 For homogeneous linear separators and the uniform input distribution, we easily ﬁnd ρ to be a constant – perhaps the most direct proof yet of the efﬁcacy of active learning in this case. [sent-23, score-0.741]

16 1 Motivating examples Linear separators in R1 Our ﬁrst example is taken from [3, 4]. [sent-26, score-0.38]

17 But suppose we instead draw m unlabeled samples from P. [sent-28, score-0.397]

18 If we lay these points down on the line, their hidden labels are a sequence of 0’s followed by a sequence of 1’s, and the goal is to discover the point w at which the transition occurs. [sent-29, score-0.209]

19 Thus, in this case active learning gives an exponential improvement in the number of labels needed. [sent-31, score-0.515]
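
One standard way to locate the transition point w with roughly log2(m) label queries is binary search over the sorted pool. The following is a minimal sketch under stated assumptions, not the paper's own code: `find_transition` and the `query_label` oracle are hypothetical names introduced here, and the hidden labels are assumed to be a run of 0's followed by a run of 1's.

```python
def find_transition(sorted_pool, query_label):
    """Binary-search a sorted pool for the first point whose label is 1.

    `query_label(x)` is a hypothetical oracle returning 0 or 1; labels are
    assumed to be a run of 0s followed by a run of 1s (the separable case).
    Returns (index of the first 1-labeled point, number of label queries).
    """
    lo, hi = 0, len(sorted_pool)  # the transition index lies in [lo, hi]
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if query_label(sorted_pool[mid]) == 1:
            hi = mid              # transition is at mid or to its left
        else:
            lo = mid + 1          # transition is strictly to the right
    return lo, queries
```

With a pool of m points the loop runs at most about log2(m) + 1 times, whereas labeling every point would cost m queries.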

20 Can we always achieve a label complexity proportional to log 1/ǫ rather than 1/ǫ? [sent-32, score-0.218]

21 A natural next step is to consider linear separators in two dimensions. [sent-33, score-0.345]

22 Linear separators in R2 Let H be the hypothesis class of linear separators in R2 , and suppose the input distribution P is some density supported on the perimeter of the unit circle. [sent-34, score-1.055]

23 It turns out that the positive results of the one-dimensional case do not generalize: there are some target hypotheses in H for which Ω(1/ǫ) labels are needed to ﬁnd a classiﬁer with error rate less than ǫ, no matter what active learning scheme is used. [sent-35, score-1.132]

24 To see this, consider the following possible target hypotheses (Figure 1, left): h0 , for which all points are positive; and hi (1 ≤ i ≤ 1/ǫ), for which all points are positive except for a small slice Bi of probability mass ǫ. [sent-36, score-0.857]

25 For instance, suppose nature chooses a target hypothesis at random from among the hi , 1 ≤ i ≤ 1/ǫ. [sent-38, score-0.541]

26 Then, to identify this target with probability at least 1/2, it is necessary to query points in at least (about) half the Bi ’s. [sent-39, score-0.486]

27 Thus for these particular target hypotheses, active learning offers no improvement in sample complexity. [sent-40, score-0.582]

28 What about other target hypotheses in H, for instance those in which the positive and negative regions are most evenly balanced? [sent-41, score-0.645]

29 Consider the following active learning scheme: (Figure 1 labels omitted: h0, h1, h2, h3, B1, B2, x3, P, P′, origin.) Figure 1: Left: The data lie on the circumference of a circle. [sent-42, score-0.411]

30 From this pool, choose query points at random until at least one positive and one negative point have been found. [sent-48, score-0.306]

31 Apply binary search to ﬁnd the two boundaries between positive and negative on the perimeter of the circle. [sent-51, score-0.241]

32 For any h ∈ H, deﬁne i(h) = min{positive mass of h, negative mass of h}. [sent-52, score-0.223]

33 It is not hard to see that when the target hypothesis is h, step (2) asks for O(1/i(h)) labels (with probability at least 9/10, say) and step (3) asks for O(log 1/ǫ) labels. [sent-53, score-0.746]

34 Thus even within this simple hypothesis class, the label complexity of active learning can run anywhere from O(log 1/ǫ) to Ω(1/ǫ), depending on the speciﬁc target hypothesis. [sent-54, score-0.891]

35 Linear separators in R3 In our two previous examples, the amount of unlabeled data needed was O(1/ǫ), exactly the usual sample complexity of supervised learning. [sent-55, score-0.937]

36 We next turn to a case in which it is helpful to have signiﬁcantly more unlabeled data than this. [sent-56, score-0.261]

37 Let H consist of homogeneous linear separators in R3 . [sent-59, score-0.412]

38 The “bad” linear separators in H cut off just a small portion of P but nonetheless divide P′ perfectly in half. [sent-62, score-0.504]

39 This O(log 1/ǫ) label complexity is made possible by the presence of P′ and is only achievable if the amount of unlabeled data is Ω(1/τ ), which could potentially be enormous. [sent-65, score-0.475]

40 With less unlabeled data, the usual Ω(1/ǫ) label complexity applies. [sent-66, score-0.49]

41 2 Basic deﬁnitions The sample complexity of supervised learning is commonly expressed as a function of the error rate ǫ and the underlying distribution P. [sent-69, score-0.277]

42 For active learning, the previous three examples demonstrate that it is also important to take into account the target hypothesis and the amount of unlabeled data. [sent-70, score-0.99]

43 Let H be the hypothesis class, a set of functions from X to {0, 1} whose VC dimension is d < ∞. [sent-73, score-0.211]

44 We will be dealing with a separable learning scenario, in which all labels correspond perfectly to some concept h∗ ∈ H, and the goal is to ﬁnd h ∈ H such that d(h∗ , h) ≤ ǫ. [sent-78, score-0.311]

45 To do this, it is sufﬁcient to whittle down the version space to the point where it has diameter at most ǫ, and to then return any of the remaining hypotheses. [sent-79, score-0.274]

46 Likewise, if the diameter of the current version space is more than ǫ then any hypothesis chosen from it will have error more than ǫ/2 with respect to the worst-case target. [sent-80, score-0.425]

47 Thus, in a non-Bayesian setting, active learning is about reducing the diameter of the version space. [sent-81, score-0.511]

48 We can think of x as a cut through hypothesis space; see Figure 2(a). [sent-84, score-0.274]

49 For us, each such edge will represent a pair of hypotheses which need to be distinguished from one another: that is, they are relatively far apart, so there is no way to achieve our target accuracy if both of them remain in the version space. [sent-91, score-0.665]

50 We would hope that for any ﬁnite set of edges Q, there are queries that will remove a substantial fraction of them. [sent-92, score-0.304]

51 To this end, a point x ∈ X is said to ρ-split Q if its label is guaranteed to reduce the number of edges by a fraction ρ > 0, that is, if: $\max\{|Q \cap (H_x^+ \times H_x^+)|,\ |Q \cap (H_x^- \times H_x^-)|\} \le (1 - \rho)|Q|$. [sent-93, score-0.215]
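
The ρ-split condition can be checked directly for a finite edge set. The sketch below is illustrative only: it assumes hypotheses are Python callables returning 0 or 1, that H_x^+ (H_x^-) denotes the hypotheses labeling x as 1 (as 0), and the names `rho_splits` and `threshold` are introduced here rather than taken from the paper.

```python
def rho_splits(x, edges, rho):
    """Return True if point x rho-splits the finite edge set `edges`.

    Each edge is a pair (h, g) of hypotheses, modeled as callables that
    return 0 or 1. An edge survives the label of x only if both endpoints
    agree with that label, so the worst case is the larger of the two
    "both positive" and "both negative" counts.
    """
    both_pos = sum(1 for h, g in edges if h(x) == 1 and g(x) == 1)
    both_neg = sum(1 for h, g in edges if h(x) == 0 and g(x) == 0)
    return max(both_pos, both_neg) <= (1 - rho) * len(edges)


def threshold(w):
    """Hypothetical helper: the threshold hypothesis h_w(x) = 1(x >= w)."""
    return lambda x: 1 if x >= w else 0
```

For example, with the four threshold edges (0.1, 0.3), (0.2, 0.6), (0.4, 0.9), (0.5, 0.8), the point x = 0.45 cuts the two middle edges, leaves one edge on each side, and therefore 1/2-splits the set.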

52 If our target accuracy is ǫ, we only really care about edges of length more than ǫ. [sent-95, score-0.413]

53 Finally, we say that a subset of hypotheses S ⊂ H is (ρ, ǫ, τ )-splittable if for all ﬁnite edge-sets Q ⊂ S × S, P{x : x ρ-splits Qǫ } ≥ τ. [sent-97, score-0.342]

54 Paraphrasing, at least a τ fraction of the distribution P is useful for splitting S. [sent-98, score-0.224]

55 2 This τ gives a sense of how many unlabeled samples are needed. [sent-99, score-0.261]

56 If τ is miniscule, then there are good points to query, but these will emerge only in an enormous pool of unlabeled data. [sent-100, score-0.384]

57 It will soon transpire that the parameters ρ, τ play roughly the following roles: # labels needed ∝ 1/ρ, # of unlabeled points needed ∝ 1/τ. A first step towards understanding them is to establish a trivial lower bound on ρ. [sent-101, score-0.58]

58 Since the edges have length at least ε, this x has at least an ε chance of cutting any of them, whereby E[Z] ≥ ε|Q_ε|. [sent-107, score-0.357]

59 We will now see that the splitting index roughly characterizes the sample complexity of active learning. [sent-110, score-0.568]

60 3 Lower bound We start by showing that if some region of the hypothesis space has a low splitting index, then it must contain hypotheses which are not conducive to active learning. [sent-112, score-0.992]

61 Theorem 2 Fix a hypothesis space H and distribution P. [sent-113, score-0.276]

62 Then any active learner which achieves an accuracy of ǫ on all target hypotheses in S, with conﬁdence > 3/4 (over the random sampling of data), either needs ≥ 1/τ unlabeled samples or ≥ 1/ρ labels. [sent-115, score-1.137]

63 We’ll show that in order to distinguish between hypotheses in V, either 1/τ unlabeled samples or 1/ρ queries are needed. [sent-118, score-0.719]

64 With probability at least (1 − τ)^{1/τ} ≥ 1/4, none of these points ρ-splits Q_ε; put differently, each of these potential queries has a bad outcome (+ or −) in which at most ρ|Q_ε| edges are eliminated. [sent-120, score-0.48]

65 In this case there must be a target hypothesis in V for which at least 1/ρ labels are required. [sent-121, score-0.6]

66 , x_{tm}; query the x_{ti} which maximally splits Q_t; let Q_{t+1} be the remaining edges; until Q_{t+1} = ∅; return remaining hypotheses in S. Figure 3: A generic active learner. [sent-134, score-0.826]
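
The loop in Figure 3 can be sketched as a greedy procedure over a finite set of hypotheses. This is a simplified illustration under assumptions, not the paper's algorithm as stated: it searches one fixed pool instead of drawing fresh unlabeled points each round, it takes edges to be all pairs farther apart than `eps`, and the names `generic_active_learner`, `query_label`, and `distance` are hypothetical.

```python
import itertools


def generic_active_learner(hypotheses, pool, query_label, distance, eps):
    """Greedy edge-splitting learner over a finite hypothesis set.

    hypotheses: callables mapping a point to 0 or 1 (assumed to contain
        a hypothesis consistent with every label the oracle returns).
    pool: unlabeled points that may be queried.
    query_label: hypothetical label oracle, x -> 0 or 1.
    distance: function (h, g) -> disagreement mass between h and g.
    Returns (remaining hypotheses, number of label queries).
    """
    version_space = list(hypotheses)
    queries = 0
    while True:
        # Edges are pairs that must still be told apart at accuracy eps.
        edges = [(h, g)
                 for h, g in itertools.combinations(version_space, 2)
                 if distance(h, g) > eps]
        if not edges:
            return version_space, queries

        def worst_case_remaining(x):
            both_pos = sum(1 for h, g in edges if h(x) == 1 and g(x) == 1)
            both_neg = sum(1 for h, g in edges if h(x) == 0 and g(x) == 0)
            return max(both_pos, both_neg)

        # Query the pool point whose worse outcome leaves the fewest edges.
        x = min(pool, key=worst_case_remaining)
        if worst_case_remaining(x) == len(edges):
            # No pool point is guaranteed to remove an edge; stop early.
            return version_space, queries
        y = query_label(x)
        queries += 1
        version_space = [h for h in version_space if h(x) == y]
```

Each query removes at least one hypothesis, so the loop terminates; the greedy choice mirrors "query the point which maximally splits Q_t", but no claim is made that this sketch attains the bounds of Theorem 4.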

67 Corollary 3 Suppose that in some neighborhood B(h0 , ∆), there are hypotheses h1 , . [sent-135, score-0.342]

68 , h_N such that: (1) d(h_0, h_i) > ε for all i; and (2) the “disagree sets” {x : h_0(x) ≠ h_i(x)} are disjoint for different i. [sent-138, score-0.236]

69 Any active learning scheme which achieves an accuracy of ǫ on all of B(h0 , ∆) must use at least N labels for some of the target hypotheses, no matter how much unlabeled data is available. [sent-140, score-1.063]

70 Each query only cuts off one spoke, so N queries are needed. [sent-145, score-0.247]

71 4 Upper bound We now show a loosely matching upper bound on sample complexity, via an algorithm (Figure 3) which repeatedly halves the diameter of the remaining version space. [sent-147, score-0.234]

72 For some ǫ0 less than half the target error rate ǫ, it starts with an ǫ0 -cover of H: a set of hypotheses S0 ⊂ H such that any h ∈ H is within distance ǫ0 of S0 . [sent-148, score-0.579]

73 The ǫ0 -cover serves as a surrogate for the hypothesis class – for instance, the ﬁnal hypothesis is chosen from it. [sent-150, score-0.422]

74 Theorem 4 Let the target hypothesis be some h∗ ∈ H. [sent-152, score-0.38]

75 Pick any target accuracy ǫ > 0 and conﬁdence level δ > 0. [sent-153, score-0.22]

76 Then there is an appropriate choice of ε_0 and m for which, with probability at least 1 − δ, the algorithm will draw Õ((1/ε) + (d/ρτ)) unlabeled points, make Õ(d/ρ) queries, and return a hypothesis with error at most ε. [sent-155, score-0.666]

77 This theorem makes it possible to derive label complexity bounds which are ﬁne-tuned to the speciﬁc target hypothesis. [sent-156, score-0.351]

78 1 Simple boundaries on the line Returning to our ﬁrst example, let X = R and H = {hw : w ∈ R}, where each hw is a threshold function hw (x) = 1(x ≥ w). [sent-159, score-0.512]

79 The distance measure P induces on H is d(h_w, h_{w′}) = P{x : h_w(x) ≠ h_{w′}(x)} = P{x : w ≤ x < w′} = P[w, w′) (assuming w′ ≥ w). [sent-161, score-0.284]
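
When P is only available through samples, the disagreement distance can be estimated empirically. The sketch below assumes hypotheses are callables returning 0 or 1 and that `sample` is a finite collection of points drawn from P; the function name is introduced here for illustration.

```python
def empirical_distance(h, g, sample):
    """Estimate d(h, g) = P{x : h(x) != g(x)} as the fraction of sample
    points on which the two hypotheses disagree."""
    points = list(sample)
    disagreements = sum(1 for x in points if h(x) != g(x))
    return disagreements / len(points)
```

For two thresholds w ≤ w′ this counts exactly the sample points falling in [w, w′), matching the expression P[w, w′) above.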

80 Pick any accuracy ε > 0 and consider any finite set of edges Q = {(h_{w_i}, h_{w_i′}) : i = 1, . [sent-162, score-0.254]

81 It is easy to see that any x ∈ [w_{n/2}, w) must eliminate at least half the edges in Q. [sent-167, score-0.24]

82 2 Intervals on the line The next case we consider is almost identical to our earlier example of 2-d linear separators (and the results carry over to that example, within constant factors). [sent-171, score-0.379]

83 The hypotheses correspond to intervals on the real line: X = R and H = {ha,b : a, b ∈ R}, where ha,b (x) = 1(a ≤ x ≤ b). [sent-172, score-0.414]

84 Even in this very simple class, some hypotheses are much easier to active-learn than others. [sent-175, score-0.342]

85 Divide the real line into 1/ǫ disjoint intervals, each with probability mass ǫ, and let {hi : i = 1, . [sent-177, score-0.214]

86 , 1/ǫ} denote the hypotheses taking value 1 on the corresponding intervals. [sent-180, score-0.342]

87 Then these hi satisfy the conditions of Corollary 3; their star-shaped conﬁguration forces a ρ-value of ǫ, and active learning doesn’t help at all in choosing amongst them. [sent-182, score-0.422]

88 The bad hypotheses are the ones whose intervals have small probability mass. [sent-184, score-0.515]

89 Any concept in B(ha,b , 4ǫ) (more precisely, its interval) must lie within the outer box and must contain the inner box (this inner box might be empty). [sent-189, score-0.255]

90 Any edge (h_{a′,b′}, h_{a′′,b′′}) ∈ Q has length > ε, so [a′, b′] Δ [a′′, b′′] (either a single interval or a union of two intervals) has total length > ε and lies between the inner and outer boxes. [sent-190, score-0.223]

91 This space has mass at most 16ǫ and at least 4ǫ, of which at least ǫ is occupied by [a′ , b′ ]∆[a′′ , b′′ ]. [sent-192, score-0.257]

92 The expected number of edges split by our x is at least |Q|/16, and therefore the probability that more than |Q|/32 edges are split is at least 1/32. [sent-195, score-0.514]

93 To summarize, for any hypothesis ha,b , let i(ha,b ) = P[a, b] denote the probability mass of its interval. [sent-197, score-0.341]

94 In short, once the version space is whittled down to B(h, i(h)/4), efﬁcient active learning is possible. [sent-199, score-0.413]

95 3 Linear separators under the uniform distribution The most encouraging positive result for active learning to date has been for learning homogeneous (through the origin) linear separators with data drawn uniformly from the surface of the unit sphere in Rd . [sent-202, score-1.286]

96 4 Related work and open problems There has been a lot of work on a related model in which the points to be queried are synthetically constructed, rather than chosen from unlabeled data [1]. [sent-204, score-0.364]

97 One other technique which seems useful for active learning is to look at the unlabeled data and then place bets on certain target hypotheses, for instance the ones with large margin. [sent-207, score-0.846]

98 This insight – nicely formulated in [2, 10] – is not speciﬁc to active learning and is orthogonal to the search issues considered in this paper. [sent-208, score-0.362]

99 This permits a naive active learning strategy, also suggested in [3]: just pick a random point whose label you are not yet sure of. [sent-210, score-0.572]

100 A PAC-style model for learning from labeled and unlabeled data. [sent-222, score-0.308]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('separators', 0.345), ('hypotheses', 0.342), ('active', 0.282), ('unlabeled', 0.261), ('hx', 0.218), ('hypothesis', 0.211), ('hw', 0.203), ('target', 0.169), ('labels', 0.154), ('edges', 0.142), ('diameter', 0.13), ('pick', 0.125), ('splitting', 0.125), ('ha', 0.121), ('queries', 0.116), ('complexity', 0.109), ('bad', 0.101), ('query', 0.098), ('hi', 0.093), ('mass', 0.093), ('label', 0.073), ('asks', 0.073), ('intervals', 0.072), ('draw', 0.068), ('suppose', 0.068), ('pool', 0.068), ('homogeneous', 0.067), ('least', 0.066), ('dasgupta', 0.064), ('perfectly', 0.063), ('cut', 0.063), ('hwi', 0.061), ('return', 0.06), ('points', 0.055), ('needed', 0.055), ('ll', 0.054), ('vc', 0.054), ('perimeter', 0.053), ('bi', 0.052), ('sample', 0.052), ('qt', 0.052), ('version', 0.052), ('accuracy', 0.051), ('edge', 0.051), ('length', 0.051), ('disjoint', 0.05), ('positive', 0.05), ('split', 0.049), ('queried', 0.048), ('eighteenth', 0.048), ('separable', 0.047), ('usual', 0.047), ('learning', 0.047), ('instance', 0.047), ('hope', 0.046), ('origin', 0.046), ('permits', 0.045), ('ez', 0.045), ('kalai', 0.045), ('induces', 0.045), ('er', 0.043), ('expanded', 0.04), ('look', 0.04), ('notion', 0.04), ('classi', 0.039), ('doesn', 0.039), ('box', 0.038), ('amenable', 0.037), ('sphere', 0.037), ('let', 0.037), ('negative', 0.037), ('corollary', 0.037), ('distance', 0.036), ('lie', 0.036), ('log', 0.036), ('freund', 0.036), ('supervised', 0.036), ('examples', 0.035), ('inner', 0.035), ('outer', 0.035), ('boundaries', 0.035), ('hn', 0.035), ('line', 0.034), ('likewise', 0.034), ('binary', 0.033), ('st', 0.033), ('search', 0.033), ('scheme', 0.033), ('distribution', 0.033), ('divide', 0.033), ('cuts', 0.033), ('encouraging', 0.033), ('ers', 0.032), ('space', 0.032), ('half', 0.032), ('improvement', 0.032), ('wi', 0.032), ('amount', 0.032), ('chance', 0.032), ('learner', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999893 41 nips-2005-Coarse sample complexity bounds for active learning

Author: Sanjoy Dasgupta

2 0.17903027 160 nips-2005-Query by Committee Made Real

Author: Ran Gilad-bachrach, Amir Navot, Naftali Tishby

Abstract: Training a learning algorithm is a costly task. A major goal of active learning is to reduce this cost. In this paper we introduce a new algorithm, KQBC, which is capable of actively learning large scale problems by using selective sampling. The algorithm overcomes the costly sampling step of the well known Query By Committee (QBC) algorithm by projecting onto a low dimensional space. KQBC also enables the use of kernels, providing a simple way of extending QBC to the non-linear scenario. Sampling the low dimension space is done using the hit and run random walk. We demonstrate the success of this novel algorithm by applying it to both artificial and real world problems.

3 0.17065531 74 nips-2005-Faster Rates in Regression via Active Learning

Author: Rebecca Willett, Robert Nowak, Rui M. Castro

Abstract: This paper presents a rigorous statistical analysis characterizing regimes in which active learning significantly outperforms classical passive learning. Active learning algorithms are able to make queries or select sample locations in an online fashion, depending on the results of the previous queries. In some regimes, this extra flexibility leads to significantly faster rates of error decay than those possible in classical passive learning settings. The nature of these regimes is explored by studying fundamental performance limits of active and passive learning in two illustrative nonparametric function classes. In addition to examining the theoretical potential of active learning, this paper describes a practical algorithm capable of exploiting the extra flexibility of the active setting and provably improving upon the classical passive techniques. Our active learning theory and methods show promise in a number of applications, including field estimation using wireless sensor networks and fault line detection.

4 0.16874868 54 nips-2005-Data-Driven Online to Batch Conversions

Author: Ofer Dekel, Yoram Singer

Abstract: Online learning algorithms are typically fast, memory efficient, and simple to implement. However, many common learning problems fit more naturally in the batch learning setting. The power of online learning algorithms can be exploited in batch settings by using online-to-batch conversion techniques which build a new batch algorithm from an existing online algorithm. We first give a unified overview of three existing online-to-batch conversion techniques which do not use training data in the conversion process. We then build upon these data-independent conversions to derive and analyze data-driven conversions. Our conversions find hypotheses with a small risk by explicitly minimizing data-dependent generalization bounds. We experimentally demonstrate the usefulness of our approach and in particular show that the data-driven conversions consistently outperform the data-independent conversions.

5 0.14639543 19 nips-2005-Active Learning for Misspecified Models

Author: Masashi Sugiyama

Abstract: Active learning is the problem in supervised learning to design the locations of training input points so that the generalization error is minimized. Existing active learning methods often assume that the model used for learning is correctly speciﬁed, i.e., the learning target function can be expressed by the model at hand. In many practical situations, however, this assumption may not be fulﬁlled. In this paper, we ﬁrst show that the existing active learning method can be theoretically justiﬁed under slightly weaker condition: the model does not have to be correctly speciﬁed, but slightly misspeciﬁed models are also allowed. However, it turns out that the weakened condition is still restrictive in practice. To cope with this problem, we propose an alternative active learning method which can be theoretically justiﬁed for a wider class of misspeciﬁed models. Thus, the proposed method has a broader range of applications than the existing method. Numerical studies show that the proposed active learning method is robust against the misspeciﬁcation of models and is thus reliable. 1 Introduction and Problem Formulation Let us discuss the regression problem of learning a real-valued function Ê from training examples ´Ü Ý µ ´Ü µ · ¯ Ý Ò ´Üµ deﬁned on ½ where ¯ Ò ½ are i.i.d. noise with mean zero and unknown variance ¾. We use the following linear regression model for learning. ´Ü µ ´µ Ô ½ « ³ ´Ü µ where ³ Ü Ô ½ are ﬁxed linearly independent functions and are parameters to be learned. ´ µ « ´«½ «¾ « Ô µ We evaluate the goodness of the learned function Ü by the expected squared test error over test input points and noise (i.e., the generalization error). When the test input points are drawn independently from a distribution with density ÔØ Ü , the generalization error is expressed as ´ µ ¯ ´Üµ ´Üµ ¾ Ô ´Üµ Ü Ø where ¯ denotes the expectation over the noise ¯ Ò Ô ´Üµ is known1. ½. 
In the following, we suppose that Ø In a standard setting of regression, the training input points are provided from the environment, i.e., Ü Ò ½ independently follow the distribution with density ÔØ Ü . On the other hand, in some cases, the training input points can be designed by users. In such cases, it is expected that the accuracy of the learning result can be improved if the training input points are chosen appropriately, e.g., by densely locating training input points in the regions of high uncertainty. ´ µ Active learning—also referred to as experimental design—is the problem of optimizing the location of training input points so that the generalization error is minimized. In active learning research, it is often assumed that the regression model is correctly speciﬁed [2, 1, 3], i.e., the learning target function Ü can be expressed by the model. In practice, however, this assumption is often violated. ´ µ In this paper, we ﬁrst show that the existing active learning method can still be theoretically justiﬁed when the model is approximately correct in a strong sense. Then we propose an alternative active learning method which can also be theoretically justiﬁed for approximately correct models, but the condition on the approximate correctness of the models is weaker than that for the existing method. Thus, the proposed method has a wider range of applications. In the following, we suppose that the training input points Ü Ò ½ are independently drawn from a user-deﬁned distribution with density ÔÜ Ü , and discuss the problem of ﬁnding the optimal density function. ´µ 2 Existing Active Learning Method The generalization error deﬁned by Eq.(1) can be decomposed as ·Î is the (squared) bias term and Î is the variance term given by where ¯ ´Üµ ´Üµ ¾ Ô ´Üµ Ü Ø Î and ¯ ´Üµ ¯ ´Üµ ¾ Ô ´Üµ Ü Ø A standard way to learn the parameters in the regression model (1) is the ordinary leastsquares learning, i.e., parameter vector « is determined as follows. 
« ÇÄË It is known that «ÇÄË is given by Ö« Ò Ñ « ÇÄË where Ä ÇÄË ´ µ ½ Ò ´Ü µ Ý ½ Ä ÇÄË ³ ´Ü µ ¾ Ý and Ý ´Ý½ Ý¾ Ý Ò µ Let ÇÄË , ÇÄË and ÎÇÄË be , and Î for the learned function obtained by the ordinary least-squares learning, respectively. Then the following proposition holds. 1 In some application domains such as web page analysis or bioinformatics, a large number of unlabeled samples—input points without output values independently drawn from the distribution with density ÔØ ´Üµ—are easily gathered. In such cases, a reasonably good estimate of ÔØ ´Üµ may be obtained by some standard density estimation method. Therefore, the assumption that ÔØ ´Üµ is known may not be so restrictive. Proposition 1 ([2, 1, 3]) Suppose that the model is correctly speciﬁed, i.e., the learning target function Ü is expressed as ´µ Ô ´Ü µ Then ½ «£ ³ ´Üµ and ÎÇÄË are expressed as ÇÄË ¼ ÇÄË and Î ¾ ÇÄË Â ÇÄË where ØÖ´ÍÄ Â ÇÄË ÇÄË Ä ÇÄË µ ³ ´Üµ³ ´ÜµÔ ´Üµ Ü Í and Ø Therefore, for the correctly speciﬁed model (1), the generalization error as ÇÄË ¾ ÇÄË is expressed Â ÇÄË Based on this expression, the existing active learning method determines the location of training input points Ü Ò ½ (or the training input density ÔÜ Ü ) so that ÂÇÄË is minimized [2, 1, 3]. ´ µ 3 Analysis of Existing Method under Misspeciﬁcation of Models In this section, we investigate the validity of the existing active learning method for misspeciﬁed models. 
´ µ Suppose the model does not exactly include the learning target function Ü , but it approximately includes it, i.e., for a scalar Æ such that Æ is small, Ü is expressed as ´ µ ´Ü µ ´Üµ · ÆÖ´Üµ where ´Üµ is the orthogonal projection of ´Üµ onto the span of residual Ö´Üµ is orthogonal to ³ ´Üµ ½ : Ô Ô ´Üµ ½ «£ ³ ´Üµ Ö´Üµ³ ´ÜµÔ ´Üµ Ü and In this case, the bias term Ø ¼ for ³ ´Üµ ½¾ Ô and the ½ Ô is expressed as ¾ ´ ´Üµ ´Üµµ¾ Ô ´Üµ Ü is constant which does not depend on the training input density Ô ´Üµ, we subtract ¯ ´Üµ ´Üµ Ô ´Üµ Ü · where Ø Ø Since in the following discussion. Ü Then we have the following lemma2 . Lemma 2 For the approximately correct model (3), we have ÇÄË ÇÄË Î ÇÄË where 2 Þ Æ ¾ ÍÄ ¾Â Ö ÇÄË Þ Ä Þ Ç ´Ò ½ µ ´Ö´Ü½µ Ö´Ü¾µ Ö ÇÄË Ö Ô Ö ´Ü Proofs of lemmas are provided in an extended version [6]. Ò µµ Ç ´Æ ¾ µ Note that the asymptotic order in Eq.(1) is in probability since ÎÇÄË is a random variable that includes Ü Ò ½ . The above lemma implies that ½ Ó ´Ò ¾ µ Therefore, the existing active learning method of minimizing Â is still justiﬁed if Æ ½ ¾ µ. However, when Æ Ó ´Ò ½ µ, the existing method may not work well because ¾ Ó ´Ò the bias term is not smaller than the variance term Î , so it can not be ÇÄË ¾ · Ó ´Ò ½µ Â ÇÄË if Æ Ô Ô ÇÄË Ô Ô ÇÄË ÇÄË neglected. 4 New Active Learning Method In this section, we propose a new active learning method based on the weighted leastsquares learning. 4.1 Weighted Least-Squares Learning When the model is correctly speciﬁed, «ÇÄË is an unbiased estimator of «£ . However, for misspeciﬁed models, «ÇÄË is generally biased even asymptotically if Æ ÇÔ . ´½µ The bias of «ÇÄË is actually caused by the covariate shift [5]—the training input density ÔÜ Ü is different from the test input density ÔØ Ü . For correctly speciﬁed models, inﬂuence of the covariate shift can be ignored, as the existing active learning method does. However, for misspeciﬁed models, we should explicitly cope with the covariate shift. 
´µ ´ µ Under the covariate shift, it is known that the following weighted least-squares learning is [5]. asymptotically unbiased even if Æ ÇÔ ´½µ Ô ´Ü µ Ô ´Ü µ ½ Ò Ö« Ò Ñ « Ï ÄË ¾ ´Ü µ Ý Ø Ü Asymptotic unbiasedness of «Ï ÄË would be intuitively understood by the following identity, which is similar in spirit to importance sampling: ´Üµ ´Üµ ¾ Ô ´Ü µ Ü ´Üµ ´Üµ Ø ´µ ¾ Ô ´Üµ Ô ´Ü µ Ü Ô ´Üµ Ø Ü Ü In the following, we assume that ÔÜ Ü is strictly positive for all Ü. Let matrix with the -th diagonal element be the diagonal Ô ´Ü µ Ô ´Ü µ Ø Ü Then it can be conﬁrmed that «Ï ÄË is given by « Ä Ï ÄË Ï ÄË Ý where Ä ´ Ï ÄË µ ½ 4.2 Active Learning Based on Weighted Least-Squares Learning Let Ï ÄË , Ï ÄË and ÎÏ ÄË be , and Î for the learned function obtained by the above weighted least-squares learning, respectively. Then we have the following lemma. Lemma 3 For the approximately correct model (3), we have Ï ÄË Î Æ ¾ ÍÄ ¾Â Ï ÄË where Ï ÄË Ï ÄË Â Ï ÄË Þ Ä Þ Ç ´Ò ½ µ Ö Ï ÄË Ö Ô Ô ØÖ´ÍÄ Ï ÄË Ä Ï ÄË Ç ´Æ ¾ Ò ½ µ µ This lemma implies that ¾ Â · Ó ´Ò ½µ ´½µ if Æ ÓÔ Based on this expression, we propose determining the training input density ÔÜ ÂÏ ÄË is minimized. Ï ÄË Ï ÄË Ô ´Üµ so that ´½µ The use of the proposed criterion ÂÏ ÄË can be theoretically justiﬁed when Æ ÓÔ , ½ while the existing criterion ÂÇÄË requires Æ ÓÔ Ò ¾ . Therefore, the proposed method has a wider range of applications. The effect of this extension is experimentally investigated in the next section. ´ 5 µ Numerical Examples We evaluate the usefulness of the proposed active learning method through experiments. Toy Data Set: setting. We ﬁrst illustrate how the proposed method works under a controlled ½ ´µ ´µ ½ · · ½¼¼ ´µ Let and the learning target function Ü be Ü Ü Ü¾ ÆÜ¿. Let Ò ½¼¼ be i.i.d. Gaussian noise with mean zero and standard deviation and ¯ . Let ÔØ Ü ½ be the Gaussian density with mean and standard deviation , which is assumed to be known here. Let Ô and the basis functions be ³ Ü Ü ½ for . 
Let us consider the following three cases. Æ , where each case corresponds to “correctly speciﬁed”, “approximately correct”, and “misspeciﬁed” (see Figure 1). We choose the training input density ÔÜ Ü from the Gaussian density with mean and standard , where deviation ¼¾ ¿ ´µ ¼ ¼ ¼¼ ¼ ¼ ¼ ½¼ ´µ ¼ ¼¿ ½¾¿ ¼¾ ¾ We compare the accuracy of the following three methods: (A) Proposed active learning criterion + WLS learning : The training input density is determined so that ÂÏ ÄË is minimized. Following the determined input density, training input points Ü ½¼¼ are created and corresponding output values Ý ½¼¼ ½ ½ are observed. Then WLS learning is used for estimating the parameters. (B) Existing active learning criterion + OLS learning [2, 1, 3]: The training input density is determined so that ÂÇÄË is minimized. OLS learning is used for estimating the parameters. (C) Passive learning + OLS learning: The test input density ÔØ Ü is used as the training input density. OLS learning is used for estimating the parameters. ´ µ First, we evaluate the accuracy of ÂÏ ÄË and ÂÇÄË as approximations of Ï ÄË and ÇÄË . The means and standard deviations of Ï ÄË , ÂÏ ÄË , ÇÄË , and ÂÇÄË over runs are (“correctly depicted as functions of in Figure 2. These graphs show that when Æ speciﬁed”), both ÂÏ ÄË and ÂÇÄË give accurate estimates of Ï ÄË and ÇÄË . When Æ (“approximately correct”), ÂÏ ÄË again works well, while ÂÇÄË tends to be negatively biased for large . This result is surprising since as illustrated in Figure 1, the learning target functions with Æ and Æ are visually quite similar. Therefore, it intuitively seems that the result of Æ is not much different from that of Æ . However, the simulation result shows that this slight difference makes ÂÇÄË unreliable. (“misspeciﬁed”), ÂÏ ÄË is still reasonably accurate, while ÂÇÄË is heavily When Æ biased. 
These results show that, as an approximation of the generalization error, $J_{WLS}$ is more robust against the misspecification of models than $J_{OLS}$, which is in good agreement with the theoretical analyses given in Section 3 and Section 4.

Figure 1: Learning target function and input density functions. (Top panel: learning target function $f(x)$ for $\delta = 0$, $\delta = 0.04$, and $\delta = 0.5$. Bottom panel: input density functions $p_t(x)$ and $p_x(x)$.)

Table 1: The means and standard deviations of the generalization error for the toy data set. The best method and comparable ones by the t-test at the [...] significance level are described with boldface. The value of method (B) for $\delta = 0.5$ is extremely large but it is not a typo. All values in the table are multiplied by $10^3$. [Table entries for methods (A), (B), and (C) under “correctly specified”, “approximately correct”, and “misspecified” are not recoverable.]

Figure 2: The means and error bars of $G_{WLS}$, $J_{WLS}$, $G_{OLS}$, and $J_{OLS}$ over 100 runs as functions of $c$. (Columns: “correctly specified”, “approximately correct”, “misspecified”.)

In Table 1, the mean and standard deviation of the generalization error obtained by each method are described. When $\delta = 0$, the existing method (B) works better than the proposed method (A). Actually, in this case, training input densities that approximately minimize $G_{WLS}$ and $G_{OLS}$ were found by $J_{WLS}$ and $J_{OLS}$.
Therefore, the difference of the errors is caused by the difference of WLS and OLS: WLS generally has larger variance than OLS. Since bias is zero for both WLS and OLS if $\delta = 0$, OLS would be more accurate than WLS. Although the proposed method (A) is outperformed by the existing method (B), it still works better than the passive learning scheme (C). When $\delta = 0.04$ and $\delta = 0.5$, the proposed method (A) gives significantly smaller errors than other methods.

Overall, we found that for all three cases, the proposed method (A) works reasonably well and outperforms the passive learning scheme (C). On the other hand, the existing method (B) works excellently in the correctly specified case, although it tends to perform poorly once the correctness of the model is violated. Therefore, the proposed method (A) is found to be robust against the misspecification of models and thus it is reliable.

Table 2: The means and standard deviations of the test error for DELVE data sets. All values in the table are multiplied by $10^3$. [Table entries for methods (A), (B), and (C) on Bank-8fm, Bank-8fh, Bank-8nm, Bank-8nh, Kin-8fm, Kin-8fh, Kin-8nm, and Kin-8nh are not recoverable.]

Figure 3: Mean relative performance of (A) and (B) compared with (C). For each run, the test errors of (A) and (B) are normalized by the test error of (C), and then the values are averaged over 100 runs. Note that the error bars were reasonably small so they were omitted.

Realistic Data Set: Here we use eight practical data sets provided by DELVE [4]: Bank-8fm, Bank-8fh, Bank-8nm, Bank-8nh, Kin-8fm, Kin-8fh, Kin-8nm, and Kin-8nh. Each data set includes 8192 samples, consisting of 8-dimensional input and 1-dimensional output values. For convenience, every attribute is normalized into $[0, 1]$.
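Two bookkeeping steps mentioned above, scaling every attribute into $[0, 1]$ and the per-run normalization behind Figure 3, are simple enough to sketch. The function names and the toy numbers are hypothetical, and min-max scaling is an assumption: the text only says that attributes are normalized into $[0, 1]$ without naming the method.

```python
import numpy as np

def minmax_normalize(data):
    """Scale each column (attribute) of `data` linearly into [0, 1]."""
    data = np.asarray(data, dtype=float)
    lo, hi = data.min(axis=0), data.max(axis=0)
    # Constant columns have zero range; map them to 0 instead of dividing by 0.
    span = np.where(hi > lo, hi - lo, 1.0)
    return (data - lo) / span

def mean_relative_error(err_method, err_passive):
    """Normalize each run's error by the passive error of the same run, then average."""
    err_method = np.asarray(err_method, dtype=float)
    err_passive = np.asarray(err_passive, dtype=float)
    return float(np.mean(err_method / err_passive))

# Made-up per-run errors for illustration; values below 1 favor the active method.
relative = mean_relative_error([1.0, 4.0], [2.0, 4.0])
```

Normalizing within each run before averaging is not the same as dividing the two mean errors: for the made-up numbers above the first gives 0.75, while the ratio of means is 5/6.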
Suppose we are given all 8192 input points (i.e., unlabeled samples). Note that output values are unknown. From the pool of unlabeled samples, we choose $n = 1000$ input points $\{x_i\}_{i=1}^{1000}$ for training and observe the corresponding output values $\{y_i\}_{i=1}^{1000}$. The task is to predict the output values of all unlabeled samples.

In this experiment, the test input density $p_t(x)$ is unknown, so we estimate it using the independent Gaussian density

$$p_t(x) = \frac{1}{(2\pi\hat{\gamma}_{MLE}^2)^{8/2}} \exp\left( -\frac{\|x - \hat{\mu}_{MLE}\|^2}{2\hat{\gamma}_{MLE}^2} \right),$$

where $\hat{\mu}_{MLE}$ and $\hat{\gamma}_{MLE}$ are the maximum likelihood estimates of the mean and standard deviation obtained from all unlabeled samples. Let $p$ = [...] and the basis functions be $\varphi_i(x) = \exp\left( -\|x - t_i\|^2 / 2 \right)$ for $i = 1, 2, \ldots, p$, where $\{t_i\}_{i=1}^{p}$ are template points randomly chosen from the pool of unlabeled samples.

We select the training input density $p_x(x)$ from the independent Gaussian density with mean $\hat{\mu}_{MLE}$ and standard deviation $c\hat{\gamma}_{MLE}$, where $c$ = [...].

In this simulation, we cannot create the training input points in an arbitrary location because we only have 8192 samples. Therefore, we first create temporary input points following the determined training input density, and then choose the input points from the pool of unlabeled samples that are closest to the temporary input points. For each data set, we repeat this simulation 100 times, changing the template points $\{t_i\}_{i=1}^{p}$ in each run.

The means and standard deviations of the test error over 100 runs are described in Table 2. The proposed method (A) outperforms the existing method (B) for five data sets, while it is outperformed by (B) for the other three data sets. We conjecture that the model used for learning is almost correct in these three data sets. This result implies that the proposed method (A) is slightly better than the existing method (B).

Figure 3 depicts the relative performance of the proposed method (A) and the existing method (B) compared with the passive learning scheme (C).
This shows that (A) outperforms (C) for all eight data sets, while (B) is comparable to or outperformed by (C) for five data sets. Therefore, the proposed method (A) is overall shown to work better than the other schemes.

6 Conclusions

We argued that active learning is essentially a covariate shift situation: the training input density is different from the test input density. When the model used for learning is correctly specified, the covariate shift does not matter. However, for misspecified models, we have to explicitly cope with the covariate shift. In this paper, we proposed a new active learning method based on weighted least-squares learning.

The numerical study showed that the existing method works better than the proposed method if the model is correctly specified. However, the existing method tends to perform poorly once the correctness of the model is violated. On the other hand, the proposed method overall worked reasonably well and consistently outperformed the passive learning scheme. Therefore, the proposed method would be robust against the misspecification of models and thus it is reliable.

The proposed method can be theoretically justified if the model is approximately correct in a weak sense. However, it is no longer valid for totally misspecified models. A natural future direction would therefore be to devise an active learning method that has a theoretical guarantee for totally misspecified models. It is also important to notice that when the model is totally misspecified, even learning with optimal training input points would not be successful. In such cases, it is of course important to carry out model selection. In active learning research, including the present paper, however, the locations of training input points are designed for a single model at hand. That is, the model should have been chosen before performing active learning.
Devising a method for simultaneously optimizing models and the location of training input points would be a more important and promising future direction.

Acknowledgments: The author would like to thank MEXT (Grant-in-Aid for Young Scientists 17700142) for partial financial support.

References

[1] D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129–145, 1996.
[2] V. V. Fedorov. Theory of Optimal Experiments. Academic Press, New York, 1972.
[3] K. Fukumizu. Statistical active learning in multilayer perceptrons. IEEE Transactions on Neural Networks, 11(1):17–26, 2000.
[4] C. E. Rasmussen, R. M. Neal, G. E. Hinton, D. van Camp, M. Revow, Z. Ghahramani, R. Kustra, and R. Tibshirani. The DELVE manual, 1996.
[5] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.
[6] M. Sugiyama. Active learning for misspecified models. Technical report, Department of Computer Science, Tokyo Institute of Technology, 2005.
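The pool-based selection step of the Realistic Data Set experiment (draw temporary points from the chosen training density, then take the closest unlabeled samples) can be sketched as follows. This is a sketch under assumptions, not the author's procedure: the function name is hypothetical, the random pool stands in for real data, and the greedy matching without replacement is one possible reading, since the text does not say how ties or repeated nearest neighbors are handled.

```python
import numpy as np

def select_from_pool(pool, temporary_points):
    """Greedily match each temporary point to its nearest unused pool point.

    Returns the indices of the selected pool points, one per temporary point.
    """
    pool = np.asarray(pool, dtype=float)
    temporary_points = np.asarray(temporary_points, dtype=float)
    if len(temporary_points) > len(pool):
        raise ValueError("cannot select more points than the pool contains")
    available = np.ones(len(pool), dtype=bool)
    chosen = []
    for t in temporary_points:
        dist = np.sum((pool - t) ** 2, axis=1)   # squared Euclidean distances
        dist[~available] = np.inf                # skip pool points already taken
        idx = int(np.argmin(dist))
        available[idx] = False
        chosen.append(idx)
    return np.array(chosen)

# Stand-in pool (random points in [0, 1]^8) and placeholder settings.
rng = np.random.default_rng(0)
pool = rng.uniform(0.0, 1.0, size=(500, 8))
mu = pool.mean(axis=0)                           # estimate of the mean
gamma = np.sqrt(np.mean((pool - mu) ** 2))       # single shared standard deviation
c = 1.2                                          # placeholder width factor

temporary = rng.normal(mu, c * gamma, size=(50, 8))
train_idx = select_from_pool(pool, temporary)
train_inputs = pool[train_idx]                   # inputs whose outputs would be queried
```

Matching without replacement guarantees distinct training points, at the cost of the selected points following the intended density only approximately when the pool is sparse in the tails.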

6 0.13680424 95 nips-2005-Improved risk tail bounds for on-line algorithms

7 0.13363706 191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget

8 0.12988704 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables

9 0.12567779 182 nips-2005-Statistical Convergence of Kernel CCA

10 0.10782955 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification

11 0.094913125 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning

12 0.093673781 76 nips-2005-From Batch to Transductive Online Learning

13 0.085097916 50 nips-2005-Convex Neural Networks

14 0.084936425 18 nips-2005-Active Learning For Identifying Function Threshold Boundaries

15 0.074941039 85 nips-2005-Generalization to Unseen Cases

16 0.074796565 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning

17 0.07218872 149 nips-2005-Optimal cue selection strategy

18 0.071552999 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification

19 0.070966691 117 nips-2005-Learning from Data of Variable Quality

20 0.070313811 33 nips-2005-Bayesian Sets

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.237), (1, 0.075), (2, -0.025), (3, -0.081), (4, 0.081), (5, 0.207), (6, 0.032), (7, 0.088), (8, -0.137), (9, 0.235), (10, -0.084), (11, 0.241), (12, 0.185), (13, -0.087), (14, -0.028), (15, -0.223), (16, -0.056), (17, -0.023), (18, 0.013), (19, -0.051), (20, 0.085), (21, 0.015), (22, 0.029), (23, 0.074), (24, -0.028), (25, -0.066), (26, -0.007), (27, -0.117), (28, -0.023), (29, -0.167), (30, 0.004), (31, 0.07), (32, -0.021), (33, -0.047), (34, -0.02), (35, -0.038), (36, -0.056), (37, -0.023), (38, -0.026), (39, -0.048), (40, -0.093), (41, -0.054), (42, -0.021), (43, 0.112), (44, -0.012), (45, -0.036), (46, -0.05), (47, -0.044), (48, -0.036), (49, -0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96869636 41 nips-2005-Coarse sample complexity bounds for active learning

Author: Sanjoy Dasgupta

2 0.73457742 160 nips-2005-Query by Committee Made Real

Author: Ran Gilad-Bachrach, Amir Navot, Naftali Tishby

3 0.71820164 74 nips-2005-Faster Rates in Regression via Active Learning

Author: Rebecca Willett, Robert Nowak, Rui M. Castro

4 0.7168898 191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget

Author: Ofer Dekel, Shai Shalev-Shwartz, Yoram Singer

Abstract: The Perceptron algorithm, despite its simplicity, often performs well on online classification tasks. The Perceptron becomes especially effective when it is used in conjunction with kernels. However, a common difficulty encountered when implementing kernel-based online algorithms is the amount of memory required to store the online hypothesis, which may grow unboundedly. In this paper we present and analyze the Forgetron algorithm for kernel-based online learning on a fixed memory budget. To our knowledge, this is the first online learning algorithm which, on one hand, maintains a strict limit on the number of examples it stores while, on the other hand, entertains a relative mistake bound. In addition to the formal results, we also present experiments with real datasets which underscore the merits of our approach.

5 0.66651672 54 nips-2005-Data-Driven Online to Batch Conversions

Author: Ofer Dekel, Yoram Singer

6 0.64767927 19 nips-2005-Active Learning for Misspecified Models

7 0.56624347 76 nips-2005-From Batch to Transductive Online Learning

8 0.48717061 18 nips-2005-Active Learning For Identifying Function Threshold Boundaries

9 0.4181565 95 nips-2005-Improved risk tail bounds for on-line algorithms

10 0.39125568 112 nips-2005-Learning Minimum Volume Sets

11 0.38841712 117 nips-2005-Learning from Data of Variable Quality

12 0.37856498 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables

13 0.37465996 182 nips-2005-Statistical Convergence of Kernel CCA

14 0.37054116 33 nips-2005-Bayesian Sets

15 0.33245671 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction

16 0.32844979 85 nips-2005-Generalization to Unseen Cases

17 0.32788083 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification

18 0.3235147 51 nips-2005-Correcting sample selection bias in maximum entropy density estimation

19 0.30521652 151 nips-2005-Pattern Recognition from One Example by Chopping

20 0.30473137 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.101), (10, 0.061), (27, 0.022), (31, 0.051), (34, 0.096), (39, 0.021), (41, 0.013), (55, 0.036), (65, 0.017), (69, 0.07), (73, 0.037), (76, 0.2), (77, 0.017), (88, 0.122), (91, 0.057)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.85702419 41 nips-2005-Coarse sample complexity bounds for active learning

Author: Sanjoy Dasgupta

2 0.72535199 74 nips-2005-Faster Rates in Regression via Active Learning

Author: Rebecca Willett, Robert Nowak, Rui M. Castro

3 0.72042638 177 nips-2005-Size Regularized Cut for Data Clustering

Author: Yixin Chen, Ya Zhang, Xiang Ji

Abstract: We present a novel spectral clustering method that enables users to incorporate prior knowledge of the size of clusters into the clustering process. The cost function, which is named size regularized cut (SRcut), is defined as the sum of the inter-cluster similarity and a regularization term measuring the relative size of two clusters. Finding a partition of the data set to minimize SRcut is proved to be NP-complete. An approximation algorithm is proposed to solve a relaxed version of the optimization problem as an eigenvalue problem. Evaluations over different data sets demonstrate that the method is not sensitive to outliers and performs better than normalized cut.

4 0.71291345 112 nips-2005-Learning Minimum Volume Sets

Author: Clayton Scott, Robert Nowak

Abstract: Given a probability measure P and a reference measure µ, one is often interested in the minimum µ-measure set with P-measure at least α. Minimum volume sets of this type summarize the regions of greatest probability mass of P, and are useful for detecting anomalies and constructing confidence regions. This paper addresses the problem of estimating minimum volume sets based on independent samples distributed according to P. Other than these samples, no other information is available regarding P, but the reference measure µ is assumed to be known. We introduce rules for estimating minimum volume sets that parallel the empirical risk minimization and structural risk minimization principles in classification. As in classification, we show that the performances of our estimators are controlled by the rate of uniform convergence of empirical to true probabilities over the class from which the estimator is drawn. Thus we obtain finite sample size performance bounds in terms of VC dimension and related quantities. We also demonstrate strong universal consistency and an oracle inequality. Estimators based on histograms and dyadic partitions illustrate the proposed rules.

5 0.71055758 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction

Author: Gilles Blanchard, Masashi Sugiyama, Motoaki Kawanabe, Vladimir Spokoiny, Klaus-Robert Müller

Abstract: We propose a new linear method for dimension reduction to identify non-Gaussian components in high dimensional data. Our method, NGCA (non-Gaussian component analysis), uses a very general semi-parametric framework. In contrast to existing projection methods we define what is uninteresting (Gaussian): by projecting out uninterestingness, we can estimate the relevant non-Gaussian subspace. We show that the estimation error of finding the non-Gaussian components tends to zero at a parametric rate. Once NGCA components are identified and extracted, various tasks can be applied in the data analysis process, like data visualization, clustering, denoising or classification. A numerical study demonstrates the usefulness of our method.

6 0.710352 151 nips-2005-Pattern Recognition from One Example by Chopping

7 0.70984411 50 nips-2005-Convex Neural Networks

8 0.70974737 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity

9 0.7094152 66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization

10 0.7069295 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification

11 0.70488632 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models

12 0.70369977 144 nips-2005-Off-policy Learning with Options and Recognizers

13 0.70369238 30 nips-2005-Assessing Approximations for Gaussian Process Classification

14 0.70223731 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations

15 0.70160711 160 nips-2005-Query by Committee Made Real

16 0.69751972 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error

17 0.69712514 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations

18 0.69594461 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation

19 0.69551265 23 nips-2005-An Application of Markov Random Fields to Range Sensing

20 0.694718 184 nips-2005-Structured Prediction via the Extragradient Method