nips nips2005 nips2005-41 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sanjoy Dasgupta
Abstract: We characterize the sample complexity of active learning problems in terms of a parameter which takes into account the distribution over the input space, the specific target hypothesis, and the desired accuracy.
Reference: text
sentIndex sentText sentNum sentScore
1 Coarse sample complexity bounds for active learning Sanjoy Dasgupta UC San Diego dasgupta@cs. [sent-1, score-0.49]
2 edu Abstract We characterize the sample complexity of active learning problems in terms of a parameter which takes into account the distribution over the input space, the specific target hypothesis, and the desired accuracy. [sent-3, score-0.692]
3 1 Introduction The goal of active learning is to learn a classifier in a setting where data comes unlabeled, and any labels must be explicitly requested and paid for. [sent-4, score-0.483]
4 So far the most encouraging theoretical results in this field are [7, 6], which show that if the hypothesis class is that of homogeneous (i. [sent-6, score-0.311]
5 through the origin) linear separators, and the data is distributed uniformly over the unit sphere in Rd , and the labels correspond perfectly to one of the hypotheses (i. [sent-8, score-0.596]
6 the separable case) then at most O(d log d/ǫ) labels are needed to learn a classifier with error less than ǫ. [sent-10, score-0.292]
7 This is exponentially smaller than the usual Ω(d/ǫ) sample complexity of learning linear classifiers in a supervised setting. [sent-11, score-0.291]
8 In fact, in this example the label complexity of active learning depends heavily on the specific target hypothesis, and ranges from O(log 1/ǫ) to Ω(1/ǫ). [sent-14, score-0.68]
9 In this paper, we consider arbitrary hypothesis classes H of VC dimension d < ∞, and learning problems which are separable. [sent-15, score-0.258]
10 We characterize the sample complexity of active learning in terms of a parameter which takes into account: (1) the distribution P over the input space X ; (2) the specific target hypothesis h∗ ∈ H; and (3) the desired accuracy ǫ. [sent-16, score-0.986]
11 We show that this quantity fairly tightly describes the sample complexity of active learning: any active learning scheme requires Ω(1/ρ) labels and there is a generic active ˜ learner which always uses at most O(d/ρ) labels1 . [sent-18, score-1.273]
12 This ρ is always at least ǫ; if it is ǫ we just get the usual sample complexity of supervised 1 ˜ The O(·) notation hides factors polylogarithmic in d, 1/ǫ, 1/δ, and 1/τ . [sent-19, score-0.31]
13 But sometimes ρ is a constant, and in such instances active learning gives an exponential improvement in the number of labels needed. [sent-21, score-0.515]
14 We look at various hypothesis classes and derive splitting indices for target hypotheses at different levels of accuracy. [sent-22, score-0.887]
15 For homogeneous linear separators and the uniform input distribution, we easily find ρ to be a constant – perhaps the most direct proof yet of the efficacy of active learning in this case. [sent-23, score-0.741]
16 1 Motivating examples Linear separators in R1 Our first example is taken from [3, 4]. [sent-26, score-0.38]
17 But suppose we instead draw m unlabeled samples from P. [sent-28, score-0.397]
18 If we lay these points down on the line, their hidden labels are a sequence of 0’s followed by a sequence of 1’s, and the goal is to discover the point w at which the transition occurs. [sent-29, score-0.209]
19 Thus, in this case active learning gives an exponential improvement in the number of labels needed. [sent-31, score-0.515]
20 Can we always achieve a label complexity proportional to log 1/ǫ rather than 1/ǫ? [sent-32, score-0.218]
21 A natural next step is to consider linear separators in two dimensions. [sent-33, score-0.345]
22 Linear separators in R2 Let H be the hypothesis class of linear separators in R2 , and suppose the input distribution P is some density supported on the perimeter of the unit circle. [sent-34, score-1.055]
23 It turns out that the positive results of the one-dimensional case do not generalize: there are some target hypotheses in H for which Ω(1/ǫ) labels are needed to find a classifier with error rate less than ǫ, no matter what active learning scheme is used. [sent-35, score-1.132]
24 To see this, consider the following possible target hypotheses (Figure 1, left): h0 , for which all points are positive; and hi (1 ≤ i ≤ 1/ǫ), for which all points are positive except for a small slice Bi of probability mass ǫ. [sent-36, score-0.857]
25 For instance, suppose nature chooses a target hypothesis at random from among the hi , 1 ≤ i ≤ 1/ǫ. [sent-38, score-0.541]
26 Then, to identify this target with probability at least 1/2, it is necessary to query points in at least (about) half the Bi ’s. [sent-39, score-0.486]
27 Thus for these particular target hypotheses, active learning offers no improvement in sample complexity. [sent-40, score-0.582]
28 What about other target hypotheses in H, for instance those in which the positive and negative regions are most evenly balanced? [sent-41, score-0.645]
29 Consider the following active learning scheme: h3 B2 x3 h2 P B1 P′ h1 origin h0 Figure 1: Left: The data lie on the circumference of a circle. [sent-42, score-0.411]
30 From this pool, choose query points at random until at least one positive and one negative point have been found. [sent-48, score-0.306]
31 Apply binary search to find the two boundaries between positive and negative on the perimeter of the circle. [sent-51, score-0.241]
32 For any h ∈ H, define i(h) = min{positive mass of h, negative mass of h}. [sent-52, score-0.223]
33 It is not hard to see that when the target hypothesis is h, step (2) asks for O(1/i(h)) labels (with probability at least 9/10, say) and step (3) asks for O(log 1/ǫ) labels. [sent-53, score-0.746]
34 Thus even within this simple hypothesis class, the label complexity of active learning can run anywhere from O(log 1/ǫ) to Ω(1/ǫ), depending on the specific target hypothesis. [sent-54, score-0.891]
35 Linear separators in R3 In our two previous examples, the amount of unlabeled data needed was O(1/ǫ), exactly the usual sample complexity of supervised learning. [sent-55, score-0.937]
36 We next turn to a case in which it is helpful to have significantly more unlabeled data than this. [sent-56, score-0.261]
37 Let H consist of homogeneous linear separators in R3 . [sent-59, score-0.412]
38 The “bad” linear separators 1 2 in H cut off just a small portion of P but nonetheless divide P′ perfectly in half. [sent-62, score-0.504]
39 This O(log 1/ǫ) label complexity is made possible by the presence of P′ and is only achievable if the amount of unlabeled data is Ω(1/τ ), which could potentially be enormous. [sent-65, score-0.475]
40 With less unlabeled data, the usual Ω(1/ǫ) label complexity applies. [sent-66, score-0.49]
41 2 Basic definitions The sample complexity of supervised learning is commonly expressed as a function of the error rate ǫ and the underlying distribution P. [sent-69, score-0.277]
42 For active learning, the previous three examples demonstrate that it is also important to take into account the target hypothesis and the amount of unlabeled data. [sent-70, score-0.99]
43 Let H be the hypothesis class, a set of functions from X to {0, 1} whose VC dimension is d < ∞. [sent-73, score-0.211]
44 We will be dealing with a separable learning scenario, in which all labels correspond perfectly to some concept h∗ ∈ H, and the goal is to find h ∈ H such that d(h∗ , h) ≤ ǫ. [sent-78, score-0.311]
45 To do this, it is sufficient to whittle down the version space to the point where it has diameter at most ǫ, and to then return any of the remaining hypotheses. [sent-79, score-0.274]
46 Likewise, if the diameter of the current version space is more than ǫ then any hypothesis chosen from it will have error more than ǫ/2 with respect to the worst-case target. [sent-80, score-0.425]
47 Thus, in a non-Bayesian setting, active learning is about reducing the diameter of the version space. [sent-81, score-0.511]
48 We can think of x as a cut through hypothesis space; see Figure 2(a). [sent-84, score-0.274]
49 For us, each such edge will represent a pair of hypotheses which need to be distinguished from one another: that is, they are relatively far apart, so there is no way to achieve our target accuracy if both of them remain in the version space. [sent-91, score-0.665]
50 We would hope that for any finite set of edges Q, there are queries that will remove a substantial fraction of them. [sent-92, score-0.304]
51 To this end, a point x ∈ X is said to ρ-split Q if its label is guaranteed to reduce the number of edges by a fraction ρ > 0, that is, if: + + − − max{|Q ∩ (Hx × Hx )|, |Q ∩ (Hx × Hx )|} ≤ (1 − ρ)|Q|. [sent-93, score-0.215]
52 If our target accuracy is ǫ, we only really care about edges of length more than ǫ. [sent-95, score-0.413]
53 Finally, we say that a subset of hypotheses S ⊂ H is (ρ, ǫ, τ )-splittable if for all finite edge-sets Q ⊂ S × S, P{x : x ρ-splits Qǫ } ≥ τ. [sent-97, score-0.342]
54 Paraphrasing, at least a τ fraction of the distribution P is useful for splitting S. [sent-98, score-0.224]
55 2 This τ gives a sense of how many unlabeled samples are needed. [sent-99, score-0.261]
56 If τ is miniscule, then there are good points to query, but these will emerge only in an enormous pool of unlabeled data. [sent-100, score-0.384]
57 It will soon transpire that the parameters ρ, τ play roughly the following roles: # labels needed ∝ 1/ρ, # of unlabeled points needed ∝ 1/τ A first step towards understanding them is to establish a trivial lower bound on ρ. [sent-101, score-0.58]
58 Since the edges have length at least ǫ, this x has at least an ǫ chance of cutting any of them, whereby EZ ≥ ǫ|Qǫ |. [sent-107, score-0.357]
59 We will now see that the splitting index roughly characterizes the sample complexity of active learning. [sent-110, score-0.568]
60 3 Lower bound We start by showing that if some region of the hypothesis space has a low splitting index, then it must contain hypotheses which are not conducive to active learning. [sent-112, score-0.992]
61 Theorem 2 Fix a hypothesis space H and distribution P. [sent-113, score-0.276]
62 Then any active learner which achieves an accuracy of ǫ on all target hypotheses in S, with confidence > 3/4 (over the random sampling of data), either needs ≥ 1/τ unlabeled samples or ≥ 1/ρ labels. [sent-115, score-1.137]
63 We’ll show that in order to distinguish between hypotheses in V, either 1/τ unlabeled samples or 1/ρ queries are needed. [sent-118, score-0.719]
64 With probability at least (1 − τ )1/τ ≥ 1/4, none of these points ρ-splits Qǫ ; put differently, each of these potential queries has a bad outcome (+ or −) in which at most ρ|Qǫ | edges are eliminated. [sent-120, score-0.48]
65 In this case there must be a target hypothesis in V for which at least 1/ρ labels are required. [sent-121, score-0.6]
66 , xtm Query the xti which maximally splits Qt Let Qt+1 be the remaining edges until Qt+1 = ∅ return remaining hypotheses in S Figure 3: A generic active learner. [sent-134, score-0.826]
67 Corollary 3 Suppose that in some neighborhood B(h0 , ∆), there are hypotheses h1 , . [sent-135, score-0.342]
68 , hN such that: (1) d(h0 , hi ) > ǫ for all i; and (2) the “disagree sets” {x : h0 (x) = hi (x)} are disjoint for different i. [sent-138, score-0.236]
69 Any active learning scheme which achieves an accuracy of ǫ on all of B(h0 , ∆) must use at least N labels for some of the target hypotheses, no matter how much unlabeled data is available. [sent-140, score-1.063]
70 Each query only cuts off one spoke, so N queries are needed. [sent-145, score-0.247]
71 4 Upper bound We now show a loosely matching upper bound on sample complexity, via an algorithm (Figure 3) which repeatedly halves the diameter of the remaining version space. [sent-147, score-0.234]
72 For some ǫ0 less than half the target error rate ǫ, it starts with an ǫ0 -cover of H: a set of hypotheses S0 ⊂ H such that any h ∈ H is within distance ǫ0 of S0 . [sent-148, score-0.579]
73 The ǫ0 -cover serves as a surrogate for the hypothesis class – for instance, the final hypothesis is chosen from it. [sent-150, score-0.422]
74 Theorem 4 Let the target hypothesis be some h∗ ∈ H. [sent-152, score-0.38]
75 Pick any target accuracy ǫ > 0 and confidence level δ > 0. [sent-153, score-0.22]
76 Then there is an appropriate choice of ǫ0 and m for which, with probability at least 1 − δ, ˜ ˜ the algorithm will draw O((1/ǫ) + (d/ρτ )) unlabeled points, make O(d/ρ) queries, and return a hypothesis with error at most ǫ. [sent-155, score-0.666]
77 This theorem makes it possible to derive label complexity bounds which are fine-tuned to the specific target hypothesis. [sent-156, score-0.351]
78 1 Simple boundaries on the line Returning to our first example, let X = R and H = {hw : w ∈ R}, where each hw is a threshold function hw (x) = 1(x ≥ w). [sent-159, score-0.512]
79 The distance measure P induces on H is d(hw , hw′ ) = P{x : hw (x) = hw′ (x)} = P{x : w ≤ x < w′ } = P[w, w′ ) (assuming w′ ≥ w). [sent-161, score-0.284]
80 Pick any accuracy ǫ > 0 and consider any finite set of edges Q = ′ {(hwi , hwi ) : i = 1, . [sent-162, score-0.254]
81 It is easy to see that any x ∈ [wn/2 , w) must eliminate at least half the edges in Q. [sent-167, score-0.24]
82 2 Intervals on the line The next case we consider is almost identical to our earlier example of 2-d linear separators (and the results carry over to that example, within constant factors). [sent-171, score-0.379]
83 The hypotheses correspond to intervals on the real line: X = R and H = {ha,b : a, b ∈ R}, where ha,b (x) = 1(a ≤ x ≤ b). [sent-172, score-0.414]
84 Even in this very simple class, some hypotheses are much easier to active-learn than others. [sent-175, score-0.342]
85 Divide the real line into 1/ǫ disjoint intervals, each with probability mass ǫ, and let {hi : i = 1, . [sent-177, score-0.214]
86 , 1/ǫ} denote the hypotheses taking value 1 on the corresponding intervals. [sent-180, score-0.342]
87 Then these hi satisfy the conditions of Corollary 3; their star-shaped configuration forces a ρ-value of ǫ, and active learning doesn’t help at all in choosing amongst them. [sent-182, score-0.422]
88 The bad hypotheses are the ones whose intervals have small probability mass. [sent-184, score-0.515]
89 Any concept in B(ha,b , 4ǫ) (more precisely, its interval) must lie within the outer box and must contain the inner box (this inner box might be empty). [sent-189, score-0.255]
90 r a 4ǫ b 4ǫ 4ǫ 4ǫ Any edge (ha′ ,b′ , ha′′ ,b′′ ) ∈ Q has length > ǫ, so [a′ , b′ ]∆[a′′ , b′′ ] (either a single interval or a union of two intervals) has total length > ǫ and lies between the inner and outer boxes. [sent-190, score-0.223]
91 This space has mass at most 16ǫ and at least 4ǫ, of which at least ǫ is occupied by [a′ , b′ ]∆[a′′ , b′′ ]. [sent-192, score-0.257]
92 The expected number of edges split by our x is at least |Q|/16, and therefore the probability that more than |Q|/32 edges are split is at least 1/32. [sent-195, score-0.514]
93 To summarize, for any hypothesis ha,b , let i(ha,b ) = P[a, b] denote the probability mass of its interval. [sent-197, score-0.341]
94 In short, once the version space is whittled down to B(h, i(h)/4), efficient active learning is possible. [sent-199, score-0.413]
95 3 Linear separators under the uniform distribution The most encouraging positive result for active learning to date has been for learning homogeneous (through the origin) linear separators with data drawn uniformly from the surface of the unit sphere in Rd . [sent-202, score-1.286]
96 4 Related work and open problems There has been a lot of work on a related model in which the points to be queried are synthetically constructed, rather than chosen from unlabeled data [1]. [sent-204, score-0.364]
97 One other technique which seems useful for active learning is to look at the unlabeled data and then place bets on certain target hypotheses, for instance the ones with large margin. [sent-207, score-0.846]
98 This insight – nicely formulated in [2, 10] – is not specific to active learning and is orthogonal to the search issues considered in this paper. [sent-208, score-0.362]
99 This permits a naive active learning strategy, also suggested in [3]: just pick a random point whose label you are not yet sure of. [sent-210, score-0.572]
100 A PAC-style model for learning from labeled and unlabeled data. [sent-222, score-0.308]
wordName wordTfidf (topN-words)
[('separators', 0.345), ('hypotheses', 0.342), ('active', 0.282), ('unlabeled', 0.261), ('hx', 0.218), ('hypothesis', 0.211), ('hw', 0.203), ('target', 0.169), ('labels', 0.154), ('edges', 0.142), ('diameter', 0.13), ('pick', 0.125), ('splitting', 0.125), ('ha', 0.121), ('queries', 0.116), ('complexity', 0.109), ('bad', 0.101), ('query', 0.098), ('hi', 0.093), ('mass', 0.093), ('label', 0.073), ('asks', 0.073), ('intervals', 0.072), ('draw', 0.068), ('suppose', 0.068), ('pool', 0.068), ('homogeneous', 0.067), ('least', 0.066), ('dasgupta', 0.064), ('perfectly', 0.063), ('cut', 0.063), ('hwi', 0.061), ('return', 0.06), ('points', 0.055), ('needed', 0.055), ('ll', 0.054), ('vc', 0.054), ('perimeter', 0.053), ('bi', 0.052), ('sample', 0.052), ('qt', 0.052), ('version', 0.052), ('accuracy', 0.051), ('edge', 0.051), ('length', 0.051), ('disjoint', 0.05), ('positive', 0.05), ('split', 0.049), ('queried', 0.048), ('eighteenth', 0.048), ('separable', 0.047), ('usual', 0.047), ('learning', 0.047), ('instance', 0.047), ('hope', 0.046), ('origin', 0.046), ('permits', 0.045), ('ez', 0.045), ('kalai', 0.045), ('induces', 0.045), ('er', 0.043), ('expanded', 0.04), ('look', 0.04), ('notion', 0.04), ('classi', 0.039), ('doesn', 0.039), ('box', 0.038), ('amenable', 0.037), ('sphere', 0.037), ('let', 0.037), ('negative', 0.037), ('corollary', 0.037), ('distance', 0.036), ('lie', 0.036), ('log', 0.036), ('freund', 0.036), ('supervised', 0.036), ('examples', 0.035), ('inner', 0.035), ('outer', 0.035), ('boundaries', 0.035), ('hn', 0.035), ('line', 0.034), ('likewise', 0.034), ('binary', 0.033), ('st', 0.033), ('search', 0.033), ('scheme', 0.033), ('distribution', 0.033), ('divide', 0.033), ('cuts', 0.033), ('encouraging', 0.033), ('ers', 0.032), ('space', 0.032), ('half', 0.032), ('improvement', 0.032), ('wi', 0.032), ('amount', 0.032), ('chance', 0.032), ('learner', 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999893 41 nips-2005-Coarse sample complexity bounds for active learning
Author: Sanjoy Dasgupta
Abstract: We characterize the sample complexity of active learning problems in terms of a parameter which takes into account the distribution over the input space, the specific target hypothesis, and the desired accuracy.
2 0.17903027 160 nips-2005-Query by Committee Made Real
Author: Ran Gilad-bachrach, Amir Navot, Naftali Tishby
Abstract: Training a learning algorithm is a costly task. A major goal of active learning is to reduce this cost. In this paper we introduce a new algorithm, KQBC, which is capable of actively learning large scale problems by using selective sampling. The algorithm overcomes the costly sampling step of the well known Query By Committee (QBC) algorithm by projecting onto a low dimensional space. KQBC also enables the use of kernels, providing a simple way of extending QBC to the non-linear scenario. Sampling the low dimension space is done using the hit and run random walk. We demonstrate the success of this novel algorithm by applying it to both artificial and a real world problems.
3 0.17065531 74 nips-2005-Faster Rates in Regression via Active Learning
Author: Rebecca Willett, Robert Nowak, Rui M. Castro
Abstract: This paper presents a rigorous statistical analysis characterizing regimes in which active learning significantly outperforms classical passive learning. Active learning algorithms are able to make queries or select sample locations in an online fashion, depending on the results of the previous queries. In some regimes, this extra flexibility leads to significantly faster rates of error decay than those possible in classical passive learning settings. The nature of these regimes is explored by studying fundamental performance limits of active and passive learning in two illustrative nonparametric function classes. In addition to examining the theoretical potential of active learning, this paper describes a practical algorithm capable of exploiting the extra flexibility of the active setting and provably improving upon the classical passive techniques. Our active learning theory and methods show promise in a number of applications, including field estimation using wireless sensor networks and fault line detection. 1
4 0.16874868 54 nips-2005-Data-Driven Online to Batch Conversions
Author: Ofer Dekel, Yoram Singer
Abstract: Online learning algorithms are typically fast, memory efficient, and simple to implement. However, many common learning problems fit more naturally in the batch learning setting. The power of online learning algorithms can be exploited in batch settings by using online-to-batch conversions techniques which build a new batch algorithm from an existing online algorithm. We first give a unified overview of three existing online-to-batch conversion techniques which do not use training data in the conversion process. We then build upon these data-independent conversions to derive and analyze data-driven conversions. Our conversions find hypotheses with a small risk by explicitly minimizing datadependent generalization bounds. We experimentally demonstrate the usefulness of our approach and in particular show that the data-driven conversions consistently outperform the data-independent conversions.
5 0.14639543 19 nips-2005-Active Learning for Misspecified Models
Author: Masashi Sugiyama
Abstract: Active learning is the problem in supervised learning to design the locations of training input points so that the generalization error is minimized. Existing active learning methods often assume that the model used for learning is correctly specified, i.e., the learning target function can be expressed by the model at hand. In many practical situations, however, this assumption may not be fulfilled. In this paper, we first show that the existing active learning method can be theoretically justified under slightly weaker condition: the model does not have to be correctly specified, but slightly misspecified models are also allowed. However, it turns out that the weakened condition is still restrictive in practice. To cope with this problem, we propose an alternative active learning method which can be theoretically justified for a wider class of misspecified models. Thus, the proposed method has a broader range of applications than the existing method. Numerical studies show that the proposed active learning method is robust against the misspecification of models and is thus reliable. 1 Introduction and Problem Formulation Let us discuss the regression problem of learning a real-valued function Ê from training examples ´Ü Ý µ ´Ü µ · ¯ Ý Ò ´Üµ defined on ½ where ¯ Ò ½ are i.i.d. noise with mean zero and unknown variance ¾. We use the following linear regression model for learning. ´Ü µ ´µ Ô ½ « ³ ´Ü µ where ³ Ü Ô ½ are fixed linearly independent functions and are parameters to be learned. ´ µ « ´«½ «¾ « Ô µ We evaluate the goodness of the learned function Ü by the expected squared test error over test input points and noise (i.e., the generalization error). When the test input points are drawn independently from a distribution with density ÔØ Ü , the generalization error is expressed as ´ µ ¯ ´Üµ ´Üµ ¾ Ô ´Üµ Ü Ø where ¯ denotes the expectation over the noise ¯ Ò Ô ´Üµ is known1. ½. In the following, we suppose that Ø In a standard setting of regression, the training input points are provided from the environment, i.e., Ü Ò ½ independently follow the distribution with density ÔØ Ü . On the other hand, in some cases, the training input points can be designed by users. In such cases, it is expected that the accuracy of the learning result can be improved if the training input points are chosen appropriately, e.g., by densely locating training input points in the regions of high uncertainty. ´ µ Active learning—also referred to as experimental design—is the problem of optimizing the location of training input points so that the generalization error is minimized. In active learning research, it is often assumed that the regression model is correctly specified [2, 1, 3], i.e., the learning target function Ü can be expressed by the model. In practice, however, this assumption is often violated. ´ µ In this paper, we first show that the existing active learning method can still be theoretically justified when the model is approximately correct in a strong sense. Then we propose an alternative active learning method which can also be theoretically justified for approximately correct models, but the condition on the approximate correctness of the models is weaker than that for the existing method. Thus, the proposed method has a wider range of applications. In the following, we suppose that the training input points Ü Ò ½ are independently drawn from a user-defined distribution with density ÔÜ Ü , and discuss the problem of finding the optimal density function. ´µ 2 Existing Active Learning Method The generalization error defined by Eq.(1) can be decomposed as ·Î is the (squared) bias term and Î is the variance term given by where ¯ ´Üµ ´Üµ ¾ Ô ´Üµ Ü Ø Î and ¯ ´Üµ ¯ ´Üµ ¾ Ô ´Üµ Ü Ø A standard way to learn the parameters in the regression model (1) is the ordinary leastsquares learning, i.e., parameter vector « is determined as follows. « ÇÄË It is known that «ÇÄË is given by Ö« Ò Ñ « ÇÄË where Ä ÇÄË ´ µ ½ Ò ´Ü µ Ý ½ Ä ÇÄË ³ ´Ü µ ¾ Ý and Ý ´Ý½ ݾ Ý Ò µ Let ÇÄË , ÇÄË and ÎÇÄË be , and Î for the learned function obtained by the ordinary least-squares learning, respectively. Then the following proposition holds. 1 In some application domains such as web page analysis or bioinformatics, a large number of unlabeled samples—input points without output values independently drawn from the distribution with density ÔØ ´Üµ—are easily gathered. In such cases, a reasonably good estimate of ÔØ ´Üµ may be obtained by some standard density estimation method. Therefore, the assumption that ÔØ ´Üµ is known may not be so restrictive. Proposition 1 ([2, 1, 3]) Suppose that the model is correctly specified, i.e., the learning target function Ü is expressed as ´µ Ô ´Ü µ Then ½ «£ ³ ´Üµ and ÎÇÄË are expressed as ÇÄË ¼ ÇÄË and Î ¾ ÇÄË Â ÇÄË where ØÖ´ÍÄ Â ÇÄË ÇÄË Ä ÇÄË µ ³ ´Üµ³ ´ÜµÔ ´Üµ Ü Í and Ø Therefore, for the correctly specified model (1), the generalization error as ÇÄË ¾ ÇÄË is expressed  ÇÄË Based on this expression, the existing active learning method determines the location of training input points Ü Ò ½ (or the training input density ÔÜ Ü ) so that ÂÇÄË is minimized [2, 1, 3]. ´ µ 3 Analysis of Existing Method under Misspecification of Models In this section, we investigate the validity of the existing active learning method for misspecified models. ´ µ Suppose the model does not exactly include the learning target function Ü , but it approximately includes it, i.e., for a scalar Æ such that Æ is small, Ü is expressed as ´ µ ´Ü µ ´Üµ · Æִܵ where ´Üµ is the orthogonal projection of ´Üµ onto the span of residual ִܵ is orthogonal to ³ ´Üµ ½ : Ô Ô ´Üµ ½ «£ ³ ´Üµ ִܵ³ ´ÜµÔ ´Üµ Ü and In this case, the bias term Ø ¼ for ³ ´Üµ ½¾ Ô and the ½ Ô is expressed as ¾ ´ ´Üµ ´Üµµ¾ Ô ´Üµ Ü is constant which does not depend on the training input density Ô ´Üµ, we subtract ¯ ´Üµ ´Üµ Ô ´Üµ Ü · where Ø Ø Since in the following discussion. Ü Then we have the following lemma2 . Lemma 2 For the approximately correct model (3), we have ÇÄË ÇÄË Î ÇÄË where 2 Þ Æ ¾ ÍÄ ¾Â Ö ÇÄË Þ Ä Þ Ç ´Ò ½ µ ´Ö´Ü½µ ִܾµ Ö ÇÄË Ö Ô Ö ´Ü Proofs of lemmas are provided in an extended version [6]. Ò µµ Ç ´Æ ¾ µ Note that the asymptotic order in Eq.(1) is in probability since ÎÇÄË is a random variable that includes Ü Ò ½ . The above lemma implies that ½ Ó ´Ò ¾ µ Therefore, the existing active learning method of minimizing  is still justified if Æ ½ ¾ µ. However, when Æ Ó ´Ò ½ µ, the existing method may not work well because ¾ Ó ´Ò the bias term is not smaller than the variance term Î , so it can not be ÇÄË ¾ · Ó ´Ò ½µ  ÇÄË if Æ Ô Ô ÇÄË Ô Ô ÇÄË ÇÄË neglected. 4 New Active Learning Method In this section, we propose a new active learning method based on the weighted leastsquares learning. 4.1 Weighted Least-Squares Learning When the model is correctly specified, «ÇÄË is an unbiased estimator of «£ . However, for misspecified models, «ÇÄË is generally biased even asymptotically if Æ ÇÔ . ´½µ The bias of «ÇÄË is actually caused by the covariate shift [5]—the training input density ÔÜ Ü is different from the test input density ÔØ Ü . For correctly specified models, influence of the covariate shift can be ignored, as the existing active learning method does. However, for misspecified models, we should explicitly cope with the covariate shift. ´µ ´ µ Under the covariate shift, it is known that the following weighted least-squares learning is [5]. asymptotically unbiased even if Æ ÇÔ ´½µ Ô ´Ü µ Ô ´Ü µ ½ Ò Ö« Ò Ñ « Ï ÄË ¾ ´Ü µ Ý Ø Ü Asymptotic unbiasedness of «Ï ÄË would be intuitively understood by the following identity, which is similar in spirit to importance sampling: ´Üµ ´Üµ ¾ Ô ´Ü µ Ü ´Üµ ´Üµ Ø ´µ ¾ Ô ´Üµ Ô ´Ü µ Ü Ô ´Üµ Ø Ü Ü In the following, we assume that ÔÜ Ü is strictly positive for all Ü. Let matrix with the -th diagonal element be the diagonal Ô ´Ü µ Ô ´Ü µ Ø Ü Then it can be confirmed that «Ï ÄË is given by « Ä Ï ÄË Ï ÄË Ý where Ä ´ Ï ÄË µ ½ 4.2 Active Learning Based on Weighted Least-Squares Learning Let Ï ÄË , Ï ÄË and ÎÏ ÄË be , and Î for the learned function obtained by the above weighted least-squares learning, respectively. Then we have the following lemma. Lemma 3 For the approximately correct model (3), we have Ï ÄË Î Æ ¾ ÍÄ ¾Â Ï ÄË where Ï ÄË Ï ÄË Â Ï ÄË Þ Ä Þ Ç ´Ò ½ µ Ö Ï ÄË Ö Ô Ô ØÖ´ÍÄ Ï ÄË Ä Ï ÄË Ç ´Æ ¾ Ò ½ µ µ This lemma implies that ¾  · Ó ´Ò ½µ ´½µ if Æ ÓÔ Based on this expression, we propose determining the training input density ÔÜ ÂÏ ÄË is minimized. Ï ÄË Ï ÄË Ô ´Üµ so that ´½µ The use of the proposed criterion ÂÏ ÄË can be theoretically justified when Æ ÓÔ , ½ while the existing criterion ÂÇÄË requires Æ ÓÔ Ò ¾ . Therefore, the proposed method has a wider range of applications. The effect of this extension is experimentally investigated in the next section. ´ 5 µ Numerical Examples We evaluate the usefulness of the proposed active learning method through experiments. Toy Data Set: setting. We first illustrate how the proposed method works under a controlled ½ ´µ ´µ ½ · · ½¼¼ ´µ Let and the learning target function Ü be Ü Ü Ü¾ ÆÜ¿. Let Ò ½¼¼ be i.i.d. Gaussian noise with mean zero and standard deviation and ¯ . Let ÔØ Ü ½ be the Gaussian density with mean and standard deviation , which is assumed to be known here. Let Ô and the basis functions be ³ Ü Ü ½ for . Let us consider the following three cases. Æ , where each case corresponds to “correctly specified”, “approximately correct”, and “misspecified” (see Figure 1). We choose the training input density ÔÜ Ü from the Gaussian density with mean and standard , where deviation ¼¾ ¿ ´µ ¼ ¼ ¼¼ ¼ ¼ ¼ ½¼ ´µ ¼ ¼¿ ½¾¿ ¼¾ ¾ We compare the accuracy of the following three methods: (A) Proposed active learning criterion + WLS learning : The training input density is determined so that ÂÏ ÄË is minimized. Following the determined input density, training input points Ü ½¼¼ are created and corresponding output values Ý ½¼¼ ½ ½ are observed. Then WLS learning is used for estimating the parameters. (B) Existing active learning criterion + OLS learning [2, 1, 3]: The training input density is determined so that ÂÇÄË is minimized. OLS learning is used for estimating the parameters. (C) Passive learning + OLS learning: The test input density ÔØ Ü is used as the training input density. OLS learning is used for estimating the parameters. ´ µ First, we evaluate the accuracy of ÂÏ ÄË and ÂÇÄË as approximations of Ï ÄË and ÇÄË . The means and standard deviations of Ï ÄË , ÂÏ ÄË , ÇÄË , and ÂÇÄË over runs are (“correctly depicted as functions of in Figure 2. These graphs show that when Æ specified”), both ÂÏ ÄË and ÂÇÄË give accurate estimates of Ï ÄË and ÇÄË . When Æ (“approximately correct”), ÂÏ ÄË again works well, while ÂÇÄË tends to be negatively biased for large . This result is surprising since as illustrated in Figure 1, the learning target functions with Æ and Æ are visually quite similar. Therefore, it intuitively seems that the result of Æ is not much different from that of Æ . However, the simulation result shows that this slight difference makes ÂÇÄË unreliable. (“misspecified”), ÂÏ ÄË is still reasonably accurate, while ÂÇÄË is heavily When Æ biased. ½¼¼ ¼ ¼¼ ¼ ¼ ¼¼ ¼¼ ¼ These results show that as an approximation of the generalization error, ÂÏ ÄË is more robust against the misspecification of models than ÂÇÄË , which is in good agreement with the theoretical analyses given in Section 3 and Section 4. Learning target function f(x) 8 δ=0 δ=0.04 δ=0.5 6 Table 1: The means and standard deviations of the generalization error for Toy data set. The best method and comparable ones by the t-test at the are described with boldface. significance level The value of method (B) for Æ is extremely large but it is not a typo. 4 ± 2 0 −1.5 −1 −0.5 0 0.5 1 1.5 2 Input density functions 1.5 ¼ pt(x) Æ ¼ ½ ¦¼ ¼ px(x) 1 0.5 0 −1.5 −1 −0.5 0 0.5 1 1.5 2 Figure 1: Learning target function and input density functions. ¼ Æ (A) (B) (C) ¼¼ Æ −3 −3 −3 G−WLS 12 4 3 G−WLS 5 4 ¼ x 10 6 5 ½¼¿. “misspecified” x 10 G−WLS ¼ ¦¼ ¼ ¿¼¿ ¦ ½ ¦½ ½ ¿ ¾ ¦ ½ ¾¿ ¾ ¾¦¼ ¿ “approximately correct” x 10 6 Æ All values in the table are multiplied by Æ “correctly specified” ¦¼ ¼ ¾ ¼¦¼ ½¿ ¼¼ Æ ¾ ¼¾ ¦ ¼ ¼ 3 10 8 6 0.8 1.2 1.6 2 0.07 2.4 J−WLS 0.06 0.8 1.2 1.6 2 0.07 2.4 0.8 1.2 1.6 2 0.07 J−WLS 0.06 0.05 0.05 0.05 0.04 0.04 0.04 0.03 0.03 2.4 J−WLS 0.06 0.8 −3 x 10 1.2 1.6 2 2.4 G−OLS 5 0.03 0.8 −3 x 10 1.2 1.6 2 3 1.2 1.6 2 1.6 2.4 2 G−OLS 0.4 4 3 0.8 0.5 G−OLS 5 4 2.4 0.3 0.2 0.1 2 2 0.8 1.2 1.6 2 0.06 2.4 J−OLS 0.8 1.2 1.6 2 0.06 2.4 0.8 1.2 0.06 J−OLS 0.05 0.05 0.05 0.04 0.04 0.04 0.03 0.03 0.02 0.02 2.4 J−OLS 0.8 1.2 1.6 c 2 2.4 0.03 0.02 0.8 Figure 2: The means and error bars of functions of . 1.2 1.6 c Ï ÄË , 2 Â Ï ÄË 2.4 , 0.8 ÇÄË 1.2 1.6 c , and ÂÇÄË over 2 2.4 ½¼¼ runs as In Table 1, the mean and standard deviation of the generalization error obtained by each method is described. When Æ , the existing method (B) works better than the proposed method (A). Actually, in this case, training input densities that approximately minimize Ï ÄË and ÇÄË were found by ÂÏ ÄË and ÂÇÄË . Therefore, the difference of the errors is caused by the difference of WLS and OLS: WLS generally has larger variance than OLS. Since bias is zero for both WLS and OLS if Æ , OLS would be more accurate than WLS. Although the proposed method (A) is outperformed by the existing method (B), it still works better than the passive learning scheme (C). When Æ and Æ the proposed method (A) gives significantly smaller errors than other methods. ¼ ¼ ¼¼ ¼ Overall, we found that for all three cases, the proposed method (A) works reasonably well and outperforms the passive learning scheme (C). On the other hand, the existing method (B) works excellently in the correctly specified case, although it tends to perform poorly once the correctness of the model is violated. Therefore, the proposed method (A) is found to be robust against the misspecification of models and thus it is reliable. Table 2: The means and standard deviations of the test error for DELVE data sets. All values in the table are multiplied by ¿. Bank-8fm Bank-8fh Bank-8nm Bank-8nh (A) ¼ ¿½ ¦ ¼ ¼ ¾ ½¼ ¦ ¼ ¼ ¾ ¦ ½ ¾¼ ¿ ¦ ½ ½½ (B) ¦ ¦ ¦ ¦ (C) ¦ ¦ ¦ ¦ ½¼ ¼ ¼¼ ¼¿ ¼¼ ¾ ¾½ ¼ ¼ ¾ ¾¼ ¼ ¼ Kin-8fm Kin-8fh ½ ¦¼ ¼ ½ ¦¼ ¼ ½ ¼¦¼ ¼ (A) (B) (C) ¾ ½ ¼ ¿ ½ ½¿ ¾ ¿ ½¿ ¿ ½¿ Kin-8nm ¼¦¼ ½ ¿ ¦ ¼ ½¿ ¾ ¦¼ ¾ Kin-8nh ¿ ¦¼ ¼ ¿ ¼¦ ¼ ¼ ¿ ¦¼ ½ ¼ ¾¦¼ ¼ ¼ ¦¼ ¼ ¼ ½¦¼ ¼ (A)/(C) (B)/(C) (C)/(C) 1.2 1.1 1 0.9 Bank−8fm Bank−8fh Bank−8nm Bank−8nh Kin−8fm Kin−8fh Kin−8nm Kin−8nh Figure 3: Mean relative performance of (A) and (B) compared with (C). For each run, the test errors of (A) and (B) are normalized by the test error of (C), and then the values are averaged over runs. Note that the error bars were reasonably small so they were omitted. ½¼¼ Realistic Data Set: Here we use eight practical data sets provided by DELVE [4]: Bank8fm, Bank-8fh, Bank-8nm, Bank-8nh, Kin-8fm, Kin-8fh, Kin-8nm, and Kin-8nh. Each data set includes samples, consisting of -dimensional input and -dimensional output values. For convenience, every attribute is normalized into . ½¾ ¼ ½℄ ½¾ ½ Suppose we are given all input points (i.e., unlabeled samples). Note that output values are unknown. From the pool of unlabeled samples, we choose Ò input points Ü ½¼¼¼ for training and observe the corresponding output values Ý ½¼¼¼. The ½ ½ task is to predict the output values of all unlabeled samples. ½¼¼¼ In this experiment, the test input density independent Gaussian density. Ô ´Üµ and Ø ´¾ ¾ ÅÄ Ô ´Üµ is unknown. Ø µ ÜÔ Ü ¾ ÅÄ So we estimate it using the ¾ ´¾¾ µ¡ ÅÄ where Å Ä are the maximum likelihood estimates of the mean and standard ÅÄ and the basis functions be deviation obtained from all unlabeled samples. Let Ô where Ø ³ ´Üµ ¼ ½ ÜÔ Ü Ø ¾ ¡ ¾ ¼ for ½¾ ¼ are template points randomly chosen from the pool of unlabeled samples. ´µ We select the training input density ÔÜ Ü from the independent Gaussian density with mean Å Ä and standard deviation Å Ä , where ¼ ¼ ¼ ¾ In this simulation, we can not create the training input points in an arbitrary location because we only have samples. Therefore, we first create temporary input points following the determined training input density, and then choose the input points from the pool of unlabeled samples that are closest to the temporary input points. For each data set, we repeat this simulation times, by changing the template points Ø ¼ ½ in each run. ½¾ ½¼¼ ½¼¼ The means and standard deviations of the test error over runs are described in Table 2. The proposed method (A) outperforms the existing method (B) for five data sets, while it is outperformed by (B) for the other three data sets. We conjecture that the model used for learning is almost correct in these three data sets. This result implies that the proposed method (A) is slightly better than the existing method (B). Figure 3 depicts the relative performance of the proposed method (A) and the existing method (B) compared with the passive learning scheme (C). This shows that (A) outperforms (C) for all eight data sets, while (B) is comparable or is outperformed by (C) for five data sets. Therefore, the proposed method (A) is overall shown to work better than other schemes. 6 Conclusions We argued that active learning is essentially the situation under the covariate shift—the training input density is different from the test input density. When the model used for learning is correctly specified, the covariate shift does not matter. However, for misspecified models, we have to explicitly cope with the covariate shift. In this paper, we proposed a new active learning method based on the weighted least-squares learning. The numerical study showed that the existing method works better than the proposed method if model is correctly specified. However, the existing method tends to perform poorly once the correctness of the model is violated. On the other hand, the proposed method overall worked reasonably well and it consistently outperformed the passive learning scheme. Therefore, the proposed method would be robust against the misspecification of models and thus it is reliable. The proposed method can be theoretically justified if the model is approximately correct in a weak sense. However, it is no longer valid for totally misspecified models. A natural future direction would be therefore to devise an active learning method which has theoretical guarantee with totally misspecified models. It is also important to notice that when the model is totally misspecified, even learning with optimal training input points would not be successful anyway. In such cases, it is of course important to carry out model selection. In active learning research—including the present paper, however, the location of training input points are designed for a single model at hand. That is, the model should have been chosen before performing active learning. Devising a method for simultaneously optimizing models and the location of training input points would be a more important and promising future direction. Acknowledgments: The author would like to thank MEXT (Grant-in-Aid for Young Scientists 17700142) for partial financial support. References [1] D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129–145, 1996. [2] V. V. Fedorov. Theory of Optimal Experiments. Academic Press, New York, 1972. [3] K. Fukumizu. Statistical active learning in multilayer perceptrons. IEEE Transactions on Neural Networks, 11(1):17–26, 2000. [4] C. E. Rasmussen, R. M. Neal, G. E. Hinton, D. van Camp, M. Revow, Z. Ghahramani, R. Kustra, and R. Tibshirani. The DELVE manual, 1996. [5] H. Shimodaira. Improving predictive inference under covariate shift by weighting the loglikelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000. [6] M. Sugiyama. Active learning for misspecified models. Technical report, Department of Computer Science, Tokyo Institute of Technology, 2005.
6 0.13680424 95 nips-2005-Improved risk tail bounds for on-line algorithms
7 0.13363706 191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget
8 0.12988704 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
9 0.12567779 182 nips-2005-Statistical Convergence of Kernel CCA
10 0.10782955 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
11 0.094913125 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
12 0.093673781 76 nips-2005-From Batch to Transductive Online Learning
13 0.085097916 50 nips-2005-Convex Neural Networks
14 0.084936425 18 nips-2005-Active Learning For Identifying Function Threshold Boundaries
15 0.074941039 85 nips-2005-Generalization to Unseen Cases
16 0.074796565 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning
17 0.07218872 149 nips-2005-Optimal cue selection strategy
18 0.071552999 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
19 0.070966691 117 nips-2005-Learning from Data of Variable Quality
20 0.070313811 33 nips-2005-Bayesian Sets
topicId topicWeight
[(0, 0.237), (1, 0.075), (2, -0.025), (3, -0.081), (4, 0.081), (5, 0.207), (6, 0.032), (7, 0.088), (8, -0.137), (9, 0.235), (10, -0.084), (11, 0.241), (12, 0.185), (13, -0.087), (14, -0.028), (15, -0.223), (16, -0.056), (17, -0.023), (18, 0.013), (19, -0.051), (20, 0.085), (21, 0.015), (22, 0.029), (23, 0.074), (24, -0.028), (25, -0.066), (26, -0.007), (27, -0.117), (28, -0.023), (29, -0.167), (30, 0.004), (31, 0.07), (32, -0.021), (33, -0.047), (34, -0.02), (35, -0.038), (36, -0.056), (37, -0.023), (38, -0.026), (39, -0.048), (40, -0.093), (41, -0.054), (42, -0.021), (43, 0.112), (44, -0.012), (45, -0.036), (46, -0.05), (47, -0.044), (48, -0.036), (49, -0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.96869636 41 nips-2005-Coarse sample complexity bounds for active learning
Author: Sanjoy Dasgupta
Abstract: We characterize the sample complexity of active learning problems in terms of a parameter which takes into account the distribution over the input space, the specific target hypothesis, and the desired accuracy.
2 0.73457742 160 nips-2005-Query by Committee Made Real
Author: Ran Gilad-bachrach, Amir Navot, Naftali Tishby
Abstract: Training a learning algorithm is a costly task. A major goal of active learning is to reduce this cost. In this paper we introduce a new algorithm, KQBC, which is capable of actively learning large scale problems by using selective sampling. The algorithm overcomes the costly sampling step of the well known Query By Committee (QBC) algorithm by projecting onto a low dimensional space. KQBC also enables the use of kernels, providing a simple way of extending QBC to the non-linear scenario. Sampling the low dimension space is done using the hit and run random walk. We demonstrate the success of this novel algorithm by applying it to both artificial and a real world problems.
3 0.71820164 74 nips-2005-Faster Rates in Regression via Active Learning
Author: Rebecca Willett, Robert Nowak, Rui M. Castro
Abstract: This paper presents a rigorous statistical analysis characterizing regimes in which active learning significantly outperforms classical passive learning. Active learning algorithms are able to make queries or select sample locations in an online fashion, depending on the results of the previous queries. In some regimes, this extra flexibility leads to significantly faster rates of error decay than those possible in classical passive learning settings. The nature of these regimes is explored by studying fundamental performance limits of active and passive learning in two illustrative nonparametric function classes. In addition to examining the theoretical potential of active learning, this paper describes a practical algorithm capable of exploiting the extra flexibility of the active setting and provably improving upon the classical passive techniques. Our active learning theory and methods show promise in a number of applications, including field estimation using wireless sensor networks and fault line detection. 1
4 0.7168898 191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget
Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer
Abstract: The Perceptron algorithm, despite its simplicity, often performs well on online classification tasks. The Perceptron becomes especially effective when it is used in conjunction with kernels. However, a common difficulty encountered when implementing kernel-based online algorithms is the amount of memory required to store the online hypothesis, which may grow unboundedly. In this paper we present and analyze the Forgetron algorithm for kernel-based online learning on a fixed memory budget. To our knowledge, this is the first online learning algorithm which, on one hand, maintains a strict limit on the number of examples it stores while, on the other hand, entertains a relative mistake bound. In addition to the formal results, we also present experiments with real datasets which underscore the merits of our approach.
5 0.66651672 54 nips-2005-Data-Driven Online to Batch Conversions
Author: Ofer Dekel, Yoram Singer
Abstract: Online learning algorithms are typically fast, memory efficient, and simple to implement. However, many common learning problems fit more naturally in the batch learning setting. The power of online learning algorithms can be exploited in batch settings by using online-to-batch conversions techniques which build a new batch algorithm from an existing online algorithm. We first give a unified overview of three existing online-to-batch conversion techniques which do not use training data in the conversion process. We then build upon these data-independent conversions to derive and analyze data-driven conversions. Our conversions find hypotheses with a small risk by explicitly minimizing datadependent generalization bounds. We experimentally demonstrate the usefulness of our approach and in particular show that the data-driven conversions consistently outperform the data-independent conversions.
6 0.64767927 19 nips-2005-Active Learning for Misspecified Models
7 0.56624347 76 nips-2005-From Batch to Transductive Online Learning
8 0.48717061 18 nips-2005-Active Learning For Identifying Function Threshold Boundaries
9 0.4181565 95 nips-2005-Improved risk tail bounds for on-line algorithms
10 0.39125568 112 nips-2005-Learning Minimum Volume Sets
11 0.38841712 117 nips-2005-Learning from Data of Variable Quality
12 0.37856498 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
13 0.37465996 182 nips-2005-Statistical Convergence of Kernel CCA
14 0.37054116 33 nips-2005-Bayesian Sets
15 0.33245671 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction
16 0.32844979 85 nips-2005-Generalization to Unseen Cases
17 0.32788083 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
18 0.3235147 51 nips-2005-Correcting sample selection bias in maximum entropy density estimation
19 0.30521652 151 nips-2005-Pattern Recognition from One Example by Chopping
20 0.30473137 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
topicId topicWeight
[(3, 0.101), (10, 0.061), (27, 0.022), (31, 0.051), (34, 0.096), (39, 0.021), (41, 0.013), (55, 0.036), (65, 0.017), (69, 0.07), (73, 0.037), (76, 0.2), (77, 0.017), (88, 0.122), (91, 0.057)]
simIndex simValue paperId paperTitle
same-paper 1 0.85702419 41 nips-2005-Coarse sample complexity bounds for active learning
Author: Sanjoy Dasgupta
Abstract: We characterize the sample complexity of active learning problems in terms of a parameter which takes into account the distribution over the input space, the specific target hypothesis, and the desired accuracy.
2 0.72535199 74 nips-2005-Faster Rates in Regression via Active Learning
Author: Rebecca Willett, Robert Nowak, Rui M. Castro
Abstract: This paper presents a rigorous statistical analysis characterizing regimes in which active learning significantly outperforms classical passive learning. Active learning algorithms are able to make queries or select sample locations in an online fashion, depending on the results of the previous queries. In some regimes, this extra flexibility leads to significantly faster rates of error decay than those possible in classical passive learning settings. The nature of these regimes is explored by studying fundamental performance limits of active and passive learning in two illustrative nonparametric function classes. In addition to examining the theoretical potential of active learning, this paper describes a practical algorithm capable of exploiting the extra flexibility of the active setting and provably improving upon the classical passive techniques. Our active learning theory and methods show promise in a number of applications, including field estimation using wireless sensor networks and fault line detection. 1
3 0.72042638 177 nips-2005-Size Regularized Cut for Data Clustering
Author: Yixin Chen, Ya Zhang, Xiang Ji
Abstract: We present a novel spectral clustering method that enables users to incorporate prior knowledge of the size of clusters into the clustering process. The cost function, which is named size regularized cut (SRcut), is defined as the sum of the inter-cluster similarity and a regularization term measuring the relative size of two clusters. Finding a partition of the data set to minimize SRcut is proved to be NP-complete. An approximation algorithm is proposed to solve a relaxed version of the optimization problem as an eigenvalue problem. Evaluations over different data sets demonstrate that the method is not sensitive to outliers and performs better than normalized cut. 1
4 0.71291345 112 nips-2005-Learning Minimum Volume Sets
Author: Clayton Scott, Robert Nowak
Abstract: Given a probability measure P and a reference measure µ, one is often interested in the minimum µ-measure set with P -measure at least α. Minimum volume sets of this type summarize the regions of greatest probability mass of P , and are useful for detecting anomalies and constructing confidence regions. This paper addresses the problem of estimating minimum volume sets based on independent samples distributed according to P . Other than these samples, no other information is available regarding P , but the reference measure µ is assumed to be known. We introduce rules for estimating minimum volume sets that parallel the empirical risk minimization and structural risk minimization principles in classification. As in classification, we show that the performances of our estimators are controlled by the rate of uniform convergence of empirical to true probabilities over the class from which the estimator is drawn. Thus we obtain finite sample size performance bounds in terms of VC dimension and related quantities. We also demonstrate strong universal consistency and an oracle inequality. Estimators based on histograms and dyadic partitions illustrate the proposed rules. 1
5 0.71055758 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction
Author: Gilles Blanchard, Masashi Sugiyama, Motoaki Kawanabe, Vladimir Spokoiny, Klaus-Robert Müller
Abstract: We propose a new linear method for dimension reduction to identify nonGaussian components in high dimensional data. Our method, NGCA (non-Gaussian component analysis), uses a very general semi-parametric framework. In contrast to existing projection methods we define what is uninteresting (Gaussian): by projecting out uninterestingness, we can estimate the relevant non-Gaussian subspace. We show that the estimation error of finding the non-Gaussian components tends to zero at a parametric rate. Once NGCA components are identified and extracted, various tasks can be applied in the data analysis process, like data visualization, clustering, denoising or classification. A numerical study demonstrates the usefulness of our method. 1
6 0.710352 151 nips-2005-Pattern Recognition from One Example by Chopping
7 0.70984411 50 nips-2005-Convex Neural Networks
8 0.70974737 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
9 0.7094152 66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization
10 0.7069295 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
11 0.70488632 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
12 0.70369977 144 nips-2005-Off-policy Learning with Options and Recognizers
13 0.70369238 30 nips-2005-Assessing Approximations for Gaussian Process Classification
14 0.70223731 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations
15 0.70160711 160 nips-2005-Query by Committee Made Real
16 0.69751972 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error
17 0.69712514 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
18 0.69594461 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation
19 0.69551265 23 nips-2005-An Application of Markov Random Fields to Range Sensing
20 0.694718 184 nips-2005-Structured Prediction via the Extragradient Method