nips nips2004 nips2004-23 knowledge-graph by maker-knowledge-mining

23 nips-2004-Analysis of a greedy active learning strategy


Source: pdf

Author: Sanjoy Dasgupta

Abstract: We abstract out the core search problem of active learning schemes, to better understand the extent to which adaptive labeling can improve sample complexity. We give various upper and lower bounds on the number of labels which need to be queried, and we prove that a popular greedy active learning rule is approximately as good as any other strategy for minimizing this number of labels. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Analysis of a greedy active learning strategy Sanjoy Dasgupta∗ University of California, San Diego dasgupta@cs. [sent-1, score-0.515]

2 Abstract: We abstract out the core search problem of active learning schemes, to better understand the extent to which adaptive labeling can improve sample complexity. [sent-3, score-0.306]

3 We give various upper and lower bounds on the number of labels which need to be queried, and we prove that a popular greedy active learning rule is approximately as good as any other strategy for minimizing this number of labels. [sent-4, score-0.833]

4 1 Introduction An increasingly common phenomenon in classification tasks is that unlabeled data is abundant, whereas labels are considerably harder to come by. [sent-5, score-0.368]

5 This distinction between labeled and unlabeled data is not captured in standard models like the PAC framework, and has motivated the field of active learning, in which the learner is able to ask for the labels of specific points, but is charged for each label. [sent-7, score-0.698]

6 These query points are typically chosen from an unlabeled data set, a practice called pool-based learning [10]. [sent-8, score-0.686]

7 Suppose the hypothesis class has VC dimension d and we want a classifier whose error rate, on distribution P over the joint (input, label) space, is less than some ε > 0. [sent-12, score-0.309]

8 Can we get away with substantially fewer than m labels if we are given unlabeled points from P and are able to adaptively choose which points to label? [sent-14, score-0.59]

9 VC theory tells us that if the underlying distribution P can be classified perfectly by some hypothesis in H (called the realizable case), then it is enough to draw m = O(1/ε) random labeled examples from P, and to return any classifier consistent with them. [sent-18, score-0.457]

10 If we lay these points down on the line, their hidden labels are a sequence of 0’s followed by a sequence of 1’s, and the goal is to discover the point w at which the transition occurs. [sent-20, score-0.312]
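
To make the binary-search picture concrete, here is a minimal sketch of the 1-D threshold case, assuming the pool is a list of numbers and `query_label` is a hypothetical label oracle (both names are illustrative, not from the paper):

```python
def find_threshold(points, query_label):
    """Locate the 0-to-1 transition among 1-D points with O(log m) queries.

    Assumes the realizable threshold case described above: once the points
    are sorted, the hidden labels are a run of 0's followed by a run of 1's.
    query_label is a hypothetical oracle returning a point's hidden label.
    """
    points = sorted(points)
    lo, hi = 0, len(points)              # the transition index lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if query_label(points[mid]) == 0:
            lo = mid + 1                 # still in the 0-region
        else:
            hi = mid                     # already in the 1-region
    return lo                            # points[:lo] are labeled 0, points[lo:] are 1
```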

11 Thus active learning gives us an exponential improvement in the number of labels needed: by adaptively querying log m labels, we can automatically infer the rest of them. [sent-22, score-0.78]

12 For more complicated hypothesis classes H, is some sort of generalized binary search possible? [sent-25, score-0.334]

13 For supervised learning, in the realizable case, the usual bounds specify a sample complexity of (very roughly) m ≈ d/ε labeled points if the target error rate is ε. [sent-27, score-0.422]

14 So let’s pick this many unlabeled points, and then try to find a hypothesis consistent with all the hidden labels by adaptively querying just a few of them. [sent-28, score-0.932]

15 This finite set is the effective hypothesis class H̃. [sent-30, score-0.276]

16 The most we can possibly learn about the target hypothesis, even if all labels are revealed, is to narrow it down to one of these regions. [sent-32, score-0.351]

17 If binary search were possible, just O(d log m) labels would be needed. [sent-34, score-0.378]

18 We show that for d ≥ 2, the situation is very different: Pick any collection of m (unlabeled) points on the unit sphere in Rd , for d ≥ 2, and assume their hidden labels correspond perfectly to some linear separator. [sent-37, score-0.373]

19 Then there are target hypotheses in H which cannot be identified without querying all the labels. [sent-38, score-0.517]

20 (What if the active learner is not required to identify exactly the right hypothesis, but something close to it? [sent-39, score-0.293]

21 Therefore, even in the most benign situations, we cannot expect that every target hypothesis will be identifiable using o(m) labels. [sent-41, score-0.353]

22 To put it differently, in the worst case over target hypotheses, active learning gives no improvement in sample complexity. [sent-42, score-0.354]

23 But hopefully, on average (with respect to some distribution over target hypotheses), the number of labels needed is small. [sent-43, score-0.44]

24 For instance, when d = 2 in the bad case above, a target hypothesis chosen uniformly at random from H can be identified by querying just O(log m) labels in expectation. [sent-44, score-0.875]

25 An average-case model We will count the expected number of labels queried when the target hypothesis is chosen from some distribution π over H. [sent-46, score-0.687]

26 This can be interpreted as a Bayesian setting, but it is more accurate to think of π merely as a device for averaging query counts, which has no bearing on the final generalization bound. [sent-47, score-0.467]

27 Most existing active learning schemes work with H rather than H̃; but H̃ reflects the underlying combinatorial structure of the problem, and it can’t hurt to deal with it directly. [sent-49, score-0.253]

28 What is the expected number of labels needed to identify a target hypothesis chosen from π? [sent-52, score-0.726]

29 Thus the benefit of active learning is really a function of the specific hypothesis class and the particular pool of unlabeled data. [sent-55, score-0.653]

30 Depending on these, the expected number of labels needed lies in the following range (within constants): ideal case, d log m (perfect binary search); worst case, m (all labels, or randomly chosen queries). Notice the exponential gap between the top and bottom of this range. [sent-56, score-0.766]

31 Is there some simple querying strategy which always achieves close to the minimum (expected) number of labels, whatever this minimum number might be? [sent-57, score-0.413]

32 Our main result is that this property holds for a variant of a popular greedy scheme: always ask for the label which most evenly divides the current effective version space weighted by π. [sent-58, score-0.455]
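
As a rough sketch of this rule (an illustration under assumptions, not the paper's pseudocode): store the surviving labelings of the effective class as rows of a 0/1 matrix and π as a weight vector, and query the point whose positive/negative π-mass split is most balanced.

```python
import numpy as np

def greedy_query(version_space, pi):
    """Return the index of the point whose label most evenly splits the
    version space weighted by pi (the greedy rule described above).

    version_space: (k, m) 0/1 array, one row per surviving labeling.
    pi:            length-k array of positive weights on those labelings.
    """
    V = np.asarray(version_space, dtype=float)
    w = np.asarray(pi, dtype=float)
    w = w / w.sum()
    pos_mass = (V * w[:, None]).sum(axis=0)   # pi-mass of labelings that say 1, per point
    neg_mass = 1.0 - pos_mass                 # pi-mass of labelings that say 0
    # the most even split maximizes the smaller side of the cut
    return int(np.argmax(np.minimum(pos_mass, neg_mass)))
```

After the queried label is revealed, the rows of `version_space` that disagree with it are dropped and the rule is applied again to the shrunken version space.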

33 This doesn’t necessarily minimize the number of queries, just as a greedy decision tree algorithm need not produce trees of minimum size. [sent-59, score-0.471]

34 However: When π is uniform over H̃, the expected number of labels needed by this greedy strategy is at most O(ln |H̃|) times that of any other strategy. [sent-60, score-0.666]

35 Variants of this greedy scheme underlie many active learning heuristics, and are often described as optimal in the literature. [sent-62, score-0.527]

36 The performance guarantee is significant: recall log |H̃| = O(d log m), the minimum number of queries possible. [sent-64, score-0.414]

37 We want to select a hypothesis (a function X → Y) from some class H of VC dimension d < ∞, which will accurately predict labels of points in X . [sent-66, score-0.621]

38 Standard bounds give us a function m(ε, d) such that if we want a hypothesis of error ≤ ε (on P, modulo some fixed confidence level), and if m ≥ m(ε, d), then we need only pick a hypothesis h ∈ H consistent with these labeled points [9]. [sent-72, score-0.767]

39 The possible labelings of these points form a subset of {0, 1}m, the effective hypothesis class H̃ = {(h(x1), . . . , h(xm)) : h ∈ H}. [sent-77, score-0.393]
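
When a finite sample of candidate hypotheses is available, this projection can be formed directly by collapsing hypotheses that agree on the pool; a minimal sketch with illustrative names (the paper treats H̃ abstractly):

```python
def effective_class(pool, hypotheses):
    """Project a set of hypotheses onto the unlabeled pool.

    pool:        the unlabeled points x1, ..., xm
    hypotheses:  an iterable of callables h(x) -> 0 or 1
    Returns the set of distinct label vectors (h(x1), ..., h(xm)), i.e. the
    effective hypothesis class viewed as a subset of {0,1}^m.
    """
    return {tuple(h(x) for x in pool) for h in hypotheses}
```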

40 We want to pick the unique h ∈ H̃ which is consistent with all the hidden labels, by querying just a few of them. [sent-82, score-0.32]

41 Any deterministic search strategy can be represented as a binary tree whose internal nodes are queries (“what is xi’s label? [sent-83, score-0.666]

42 We can also accommodate randomization – for instance, to allow a random choice of query point – by letting internal nodes of the tree be random coin flips. [sent-85, score-0.624]

43 Figure 1: To identify target hypotheses like L3, we need to see all the labels. [sent-87, score-0.445]

44 3 Some bad news Claim 1 Let H be the hypothesis class of linear separators in R2 . [sent-88, score-0.429]

45 For any set of m distinct data points on the perimeter of the unit circle, there are always some target hypotheses in H which cannot be identified without querying all m labels. [sent-89, score-0.708]

46 No matter what this density is, there are always target hypotheses in H which force us to ask for Ω(1/ε) labels: no improvement over the sample complexity of supervised learning. [sent-96, score-0.472]

47 In this example, the bad target hypotheses have a large imbalance in probability mass between their positive and negative regions. [sent-97, score-0.447]

48 By adding an extra dimension and an extra point, exactly the same example can be modified to make the bad hypotheses balanced. [sent-98, score-0.268]

49 Some hypotheses must lie at depth m in any query tree; but what about the rest? [sent-100, score-0.747]

50 Then H̃ = {hij : 1 ≤ i ≠ j ≤ m} ∪ {h0, h1}, where hij labels xi · · · xj−1 positive (if j < i it wraps around) and the remaining points negative, and h0, h1 are everywhere negative/positive. [sent-105, score-0.424]

51 It is possible to construct a query tree in which each hij lies at depth ≤ 2(m/|j − i| + log |j − i|). [sent-106, score-0.851]

52 Thus, if the target hypothesis is chosen uniformly from H̃, the expected number of labels queried is at most (1/(m(m−1)+2)) · [2m + Σi≠j 2(m/|j − i| + log |j − i|)] = O(log m). [sent-107, score-0.733]
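
As a sanity check on the reconstructed expression (an illustrative computation, not from the paper), one can evaluate the average of these per-hypothesis depth bounds numerically and watch it grow only logarithmically in m:

```python
import math

def circle_bound(m):
    """Average of the per-hypothesis depth bounds for the circle example:
    h0 and h1 contribute at most m each, and hij contributes at most
    2(m/|j - i| + log|j - i|), all under the uniform distribution."""
    total = 2 * m
    for i in range(1, m + 1):
        for j in range(1, m + 1):
            if i != j:
                d = abs(j - i)
                total += 2 * (m / d + math.log(d))
    return total / (m * (m - 1) + 2)

# Doubling m increases the value by only a small constant: logarithmic growth.
```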

53 4 Main result Let π be any distribution over H; we will analyze search strategies according to the number of labels they require, averaged over target hypotheses drawn from π. [sent-111, score-0.625]

54 In terms of query trees, this is the average depth of a leaf chosen according to π. [sent-112, score-0.654]

55 Q(T, π) = Σh∈H π(h) · (# labels needed for h). Is there always a tree of average depth o(m)? [sent-115, score-0.638]
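
A small sketch of this quantity on an explicit tree encoding (the encoding is an assumption for illustration): internal nodes hold a query and two subtrees, leaves hold a single hypothesis, and Q(T, π) is the π-weighted leaf depth.

```python
def average_depth(tree, pi):
    """Q(T, pi): expected number of queries when the target is drawn from pi.

    tree: either a leaf (a hypothesis identifier that is a key of pi) or a
          dict {'query': j, 0: subtree_if_label_0, 1: subtree_if_label_1}.
    pi:   dict mapping hypothesis identifiers to probabilities.
    """
    def walk(node, depth):
        if not isinstance(node, dict):           # leaf: one hypothesis remains
            return pi[node] * depth
        return walk(node[0], depth + 1) + walk(node[1], depth + 1)
    return walk(tree, 0)

# Example: a three-leaf tree where h1 is identified after one query.
# average_depth({'query': 0, 0: 'h1', 1: {'query': 1, 0: 'h2', 1: 'h3'}},
#               {'h1': 0.5, 'h2': 0.25, 'h3': 0.25})  ->  1.5
```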

56 There is an input space X of size m and a hypothesis class H of VC dimension d, defined on domain X, with the following property: if π is chosen to be uniform over H̃ = H, then any query tree T has Q(T, π) ≥ m/8. [sent-118, score-0.993]

57 Let the input space be X = {x1, . . . , xm}, and let H consist of all hypotheses h : X → {0, 1} which are positive on exactly d inputs. [sent-123, score-0.326]

58 In order to identify a particular element h ∈ H, any querying method must discover exactly the d points xi on which h is nonzero. [sent-124, score-0.338]

59 By construction, the order in which queries are asked is irrelevant – it might as well be x1 , x2 , . [sent-125, score-0.32]

60 In our average-case model, we have seen one example in which intelligent querying results in an exponential improvement in the number of labels required, and one in which it is no help at all. [sent-130, score-0.465]

61 For each unlabeled xi, let Si+ be the hypotheses which label xi positive and Si− the ones which label it negative. [sent-134, score-0.499]

62 We show this is almost as good at minimizing queries as any other strategy. [sent-136, score-0.29]

63 Suppose that the optimal query tree requires Q∗ labels in expectation, for target hypotheses chosen according to π. [sent-138, score-1.252]

64 Then the expected number of labels needed by the greedy strategy is at most 4Q∗ ln 1/(minh π(h)). [sent-139, score-0.649]

65 1 Analysis of the greedy active learner Lower bounds on the greedy scheme The greedy approach is not optimal because it doesn’t take into account the way in which a query reshapes the search space – specifically, the effect of a query on the quality of other queries. [sent-143, score-2.111]

66 However, the version space must first be whittled down to one of these subregions, and this process, though ultimately optimal, might initially be slower at shrinking the hypothesis space than more shortsighted alternatives. [sent-145, score-0.315]

67 Claim 4 For any n ≥ 16 which is a power of two, there is a concept class Hn of size n such that: under uniform π, the optimal tree has average height at most qn = Θ(log n), but the greedy active learning strategy produces a tree of average height Ω(qn · log n / log log n). [sent-147, score-1.295]

68 For non-uniform π, the greedy scheme can deviate more substantially from optimality. [sent-148, score-0.327]

69 Claim 5 For any n ≥ 2, there is a hypothesis class H with 2n + 1 elements and a distribution π over H, such that: (a) π ranges in value from 1/2 to 1/2^(n+1); (b) the optimal tree has average depth less than 3; (c) the greedy tree has average depth at least n/2. [sent-149, score-1.186]

70 The lower bounds on the quality of a greedy learner are sobering, but things cannot get too much worse than this. [sent-153, score-0.455]

71 Here’s the basic argument for uniform π: we show that if the optimal tree T ∗ requires Q∗ queries in expectation, then some query must (again in expectation) “cut off” a chunk of H of π-mass Ω(1/Q∗ ). [sent-154, score-1.041]

72 Therefore, the root query of the greedy tree TG is at least this good (cf. [sent-155, score-0.877]

73 Things get trickier when we try to show that the rest of TG is also good, because although T ∗ uses just Q∗ queries on average, it may need many more queries for certain hypotheses. [sent-157, score-0.61]

74 Subtrees of TG could correspond to version spaces for which more than Q∗ queries are needed, and the roots of these subtrees might not cut down the version space much. [sent-158, score-0.488]

75 The key concept we have to define is the quality of a query, and it turns out that we need this to be monotonically decreasing, that is, it should only go down as active learning proceeds and the version space shrinks. [sent-165, score-0.305]

76 Suppose we are down to some version space S ⊆ H, and a possible next query is xj . [sent-167, score-0.526]

77 If S+ is the subset of S which labels xj positive, and S− are the ones that label it negative, then on average the probability mass (measured by π) eliminated by xj is (π(S+)/π(S)) · π(S−) + (π(S−)/π(S)) · π(S+) = 2π(S+)π(S−)/π(S). [sent-168, score-0.455]
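
The same calculation as a tiny helper (a restatement of the definition reconstructed above, not library code):

```python
def shrinkage(pos_mass, neg_mass):
    """Expected pi-mass eliminated by querying x_j on version space S.

    pos_mass = pi(S+): mass of hypotheses in S labeling x_j positive.
    neg_mass = pi(S-): mass of hypotheses in S labeling x_j negative.
    With probability pi(S+)/pi(S) the answer is positive and pi(S-) is
    eliminated, and symmetrically, giving 2*pi(S+)*pi(S-)/pi(S) on average.
    """
    total = pos_mass + neg_mass          # pi(S)
    return 2.0 * pos_mass * neg_mass / total
```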

78 Lemma 6 If xj shrinks (H, π) by ∆, then it shrinks (S, π) by at most ∆ for any S ⊆ H. [sent-171, score-0.326]

79 We would expect that if the optimal tree is short, there must be at least one query which shrinks (H, π) considerably. [sent-172, score-0.796]

80 More concretely, the definition of shrinkage seems to suggest that if all queries provide shrinkage at most ∆, and the current version space has mass π(S), then at least about π(S)/∆ more queries are needed. [sent-173, score-0.805]

81 Roughly speaking, when there are lots of hypotheses with significant mass left in S, the first effect dominates; thereafter the second takes over. [sent-175, score-0.273]

82 Lemma 7 Suppose every query shrinks (H, π) by at most ∆ > 0. [sent-178, score-0.581]

83 Pick any S ⊆ H, and any query tree T whose leaves include S. [sent-179, score-0.688]

84 Then there must exist a query which shrinks (S, πS ) by at least (1 − CP(πS ))/Q(T, πS ). [sent-182, score-0.581]

85 So if the current version space S ⊆ H is such that πS has small collision probability, some query must split off a sizeable chunk of S. [sent-183, score-0.569]

86 In this case, the mass of some particular hypothesis h0 ∈ S exceeds that of all the others combined, and S could shrink by just an insignificant amount during the subsequent greedy query, or even during the next few iterations of greedy queries. [sent-186, score-0.806]

87 It turns out, however, that within roughly the number of iterations that the optimal tree needs for target h0 , the greedy procedure will either reject h0 or identify it as the target. [sent-187, score-0.624]

88 Lemma 9 Let T∗ denote any particular query tree for π, and let T be the greedily-constructed query tree. [sent-190, score-1.062]

89 Algorithmically, the main problem is that the query selection rule is not immediately tractable, so approximations are necessary. [sent-196, score-0.47]

90 For linear separators, H consists of convex sets, and if π is chosen to be proportional to volume, query selection involves estimating volumes of convex regions, which is tractable but (using present techniques) inconvenient. [sent-197, score-0.476]

91 But if the algorithm places more weight on certain hypotheses (for instance, those with large margin), then its final error bound is excellent if it guessed right, and worse-than-usual if it guessed wrong. [sent-202, score-0.334]

92 A Bayesian assumption has an immediate benefit for active learning: if at any stage the remaining version space (weighted by prior π) is largely in agreement on the unlabeled data, it is legitimate to stop and output one of these remaining hypotheses [7]. [sent-209, score-0.59]

93 When the hypothesis class consists of probabilistic classifiers, the Bayesian assumption has also been used in another way: to approximate the greedy selection rule using the MAP estimate instead of an expensive summation over the posterior (eg. [sent-211, score-0.529]

94 In terms of theoretical results, another work which considers the tradeoff between labels and generalization error is [7], in which a greedy scheme, realized using sampling, is analyzed in a Bayesian setting. [sent-213, score-0.517]

95 The authors show that it is possible to achieve an exponential improvement in the number of labels needed to learn linear separators, when both data and target hypothesis are chosen uniformly from the unit sphere. [sent-214, score-0.755]

96 What about fixing the number of queries and asking for the best (average) error rate possible? [sent-218, score-0.29]

97 In other words, the query tree has a fixed depth, and each leaf is annotated with its remaining version space S ⊆ H. [sent-219, score-0.722]

98 What is a good querying strategy for producing low-diameter leaves? [sent-221, score-0.254]

99 Existing active learning schemes ignore the rich algebraic structure of H, an arrangement of hyperplanes [4]. [sent-223, score-0.253]

100 A theoretical analysis of query selection for collaborative filtering. [sent-260, score-0.438]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('query', 0.438), ('queries', 0.29), ('greedy', 0.253), ('hypothesis', 0.237), ('labels', 0.235), ('hypotheses', 0.21), ('active', 0.199), ('querying', 0.191), ('tree', 0.186), ('shrinks', 0.143), ('unlabeled', 0.133), ('target', 0.116), ('realizable', 0.109), ('depth', 0.099), ('pick', 0.096), ('cp', 0.096), ('separators', 0.095), ('hij', 0.082), ('vc', 0.081), ('xm', 0.079), ('points', 0.077), ('suppose', 0.073), ('claim', 0.064), ('leaves', 0.064), ('search', 0.064), ('mass', 0.063), ('strategy', 0.063), ('queried', 0.061), ('tg', 0.061), ('needed', 0.06), ('bad', 0.058), ('shrinkage', 0.057), ('dasgupta', 0.057), ('bounds', 0.055), ('uniform', 0.055), ('perimeter', 0.055), ('sadly', 0.055), ('lemma', 0.055), ('learner', 0.054), ('schemes', 0.054), ('leaf', 0.05), ('tells', 0.048), ('version', 0.048), ('guessed', 0.048), ('hw', 0.048), ('minh', 0.048), ('sauer', 0.048), ('label', 0.048), ('scheme', 0.046), ('log', 0.046), ('ask', 0.045), ('doesn', 0.045), ('pool', 0.045), ('chunk', 0.043), ('labeling', 0.043), ('instance', 0.041), ('collision', 0.04), ('subtrees', 0.04), ('labelings', 0.04), ('adaptively', 0.04), ('identify', 0.04), ('xj', 0.04), ('class', 0.039), ('md', 0.039), ('improvement', 0.039), ('hamming', 0.038), ('chosen', 0.038), ('ln', 0.038), ('consist', 0.037), ('things', 0.036), ('whatever', 0.036), ('si', 0.035), ('binary', 0.033), ('want', 0.033), ('supervised', 0.033), ('cut', 0.032), ('looked', 0.032), ('main', 0.032), ('minimum', 0.032), ('labeled', 0.032), ('bayesian', 0.031), ('tong', 0.031), ('perfectly', 0.031), ('height', 0.03), ('toy', 0.03), ('might', 0.03), ('xi', 0.03), ('unit', 0.03), ('rest', 0.03), ('average', 0.029), ('concept', 0.029), ('identi', 0.029), ('generalization', 0.029), ('optimal', 0.029), ('classi', 0.029), ('quality', 0.029), ('always', 0.029), ('substantially', 0.028), ('places', 0.028), ('lower', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000012 23 nips-2004-Analysis of a greedy active learning strategy

Author: Sanjoy Dasgupta

Abstract: We abstract out the core search problem of active learning schemes, to better understand the extent to which adaptive labeling can improve sample complexity. We give various upper and lower bounds on the number of labels which need to be queried, and we prove that a popular greedy active learning rule is approximately as good as any other strategy for minimizing this number of labels. 1

2 0.16182096 15 nips-2004-Active Learning for Anomaly and Rare-Category Detection

Author: Dan Pelleg, Andrew W. Moore

Abstract: We introduce a novel active-learning scenario in which a user wants to work with a learning algorithm to identify useful anomalies. These are distinguished from the traditional statistical definition of anomalies as outliers or merely ill-modeled points. Our distinction is that the usefulness of anomalies is categorized subjectively by the user. We make two additional assumptions. First, there exist extremely few useful anomalies to be hunted down within a massive dataset. Second, both useful and useless anomalies may sometimes exist within tiny classes of similar anomalies. The challenge is thus to identify “rare category” records in an unlabeled noisy set with help (in the form of class labels) from a human expert who has a small budget of datapoints that they are prepared to categorize. We propose a technique to meet this challenge, which assumes a mixture model fit to the data, but otherwise makes no assumptions on the particular form of the mixture components. This property promises wide applicability in real-life scenarios and for various statistical models. We give an overview of several alternative methods, highlighting their strengths and weaknesses, and conclude with a detailed empirical analysis. We show that our method can quickly zoom in on an anomaly set containing a few tens of points in a dataset of hundreds of thousands. 1

3 0.13795042 164 nips-2004-Semi-supervised Learning by Entropy Minimization

Author: Yves Grandvalet, Yoshua Bengio

Abstract: We consider the semi-supervised learning problem, where a decision rule is to be learned from labeled and unlabeled data. In this framework, we motivate minimum entropy regularization, which enables to incorporate unlabeled data in the standard supervised learning. Our approach includes other approaches to the semi-supervised problem as particular or limiting cases. A series of experiments illustrates that the proposed solution benefits from unlabeled data. The method challenges mixture models when the data are sampled from the distribution class spanned by the generative model. The performances are definitely in favor of minimum entropy regularization when generative models are misspecified, and the weighting of unlabeled data provides robustness to the violation of the “cluster assumption”. Finally, we also illustrate that the method can also be far superior to manifold learning in high dimension spaces. 1

4 0.11844846 136 nips-2004-On Semi-Supervised Classification

Author: Balaji Krishnapuram, David Williams, Ya Xue, Lawrence Carin, Mário Figueiredo, Alexander J. Hartemink

Abstract: A graph-based prior is proposed for parametric semi-supervised classification. The prior utilizes both labelled and unlabelled data; it also integrates features from multiple views of a given sample (e.g., multiple sensors), thus implementing a Bayesian form of co-training. An EM algorithm for training the classifier automatically adjusts the tradeoff between the contributions of: (a) the labelled data; (b) the unlabelled data; and (c) the co-training information. Active label query selection is performed using a mutual information based criterion that explicitly uses the unlabelled data and the co-training information. Encouraging results are presented on public benchmarks and on measured data from single and multiple sensors. 1

5 0.11841071 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms

Author: Ting Liu, Andrew W. Moore, Ke Yang, Alexander G. Gray

Abstract: This paper concerns approximate nearest neighbor searching algorithms, which have become increasingly important, especially in high dimensional perception areas such as computer vision, with dozens of publications in recent years. Much of this enthusiasm is due to a successful new approximate nearest neighbor approach called Locality Sensitive Hashing (LSH). In this paper we ask the question: can earlier spatial data structure approaches to exact nearest neighbor, such as metric trees, be altered to provide approximate answers to proximity queries and if so, how? We introduce a new kind of metric tree that allows overlap: certain datapoints may appear in both the children of a parent. We also introduce new approximate k-NN search algorithms on this structure. We show why these structures should be able to exploit the same randomprojection-based approximations that LSH enjoys, but with a simpler algorithm and perhaps with greater efficiency. We then provide a detailed empirical evaluation on five large, high dimensional datasets which show up to 31-fold accelerations over LSH. This result holds true throughout the spectrum of approximation levels. 1

6 0.11659783 85 nips-2004-Instance-Based Relevance Feedback for Image Retrieval

7 0.11569349 54 nips-2004-Distributed Information Regularization on Graphs

8 0.10968185 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

9 0.10872893 40 nips-2004-Common-Frame Model for Object Recognition

10 0.0996328 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

11 0.087515742 99 nips-2004-Learning Hyper-Features for Visual Identification

12 0.083808713 137 nips-2004-On the Adaptive Properties of Decision Trees

13 0.083179168 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

14 0.080711469 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms

15 0.079722829 19 nips-2004-An Application of Boosting to Graph Classification

16 0.078674093 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices

17 0.077112138 92 nips-2004-Kernel Methods for Implicit Surface Modeling

18 0.074862331 115 nips-2004-Maximum Margin Clustering

19 0.074848458 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification

20 0.074605398 141 nips-2004-Optimal sub-graphical models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.237), (1, 0.075), (2, 0.037), (3, 0.052), (4, 0.113), (5, 0.101), (6, 0.038), (7, 0.11), (8, 0.025), (9, -0.009), (10, -0.027), (11, 0.115), (12, 0.023), (13, -0.048), (14, -0.11), (15, -0.008), (16, 0.119), (17, -0.061), (18, 0.019), (19, -0.131), (20, 0.126), (21, 0.247), (22, 0.091), (23, 0.14), (24, 0.025), (25, 0.05), (26, -0.014), (27, 0.031), (28, 0.097), (29, 0.017), (30, -0.097), (31, -0.026), (32, -0.059), (33, 0.017), (34, 0.008), (35, -0.016), (36, 0.153), (37, 0.028), (38, -0.03), (39, 0.052), (40, -0.09), (41, 0.13), (42, 0.064), (43, 0.05), (44, -0.077), (45, -0.099), (46, 0.015), (47, -0.008), (48, -0.1), (49, 0.086)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96785933 23 nips-2004-Analysis of a greedy active learning strategy

Author: Sanjoy Dasgupta

Abstract: We abstract out the core search problem of active learning schemes, to better understand the extent to which adaptive labeling can improve sample complexity. We give various upper and lower bounds on the number of labels which need to be queried, and we prove that a popular greedy active learning rule is approximately as good as any other strategy for minimizing this number of labels. 1

2 0.7480697 15 nips-2004-Active Learning for Anomaly and Rare-Category Detection

Author: Dan Pelleg, Andrew W. Moore

Abstract: We introduce a novel active-learning scenario in which a user wants to work with a learning algorithm to identify useful anomalies. These are distinguished from the traditional statistical definition of anomalies as outliers or merely ill-modeled points. Our distinction is that the usefulness of anomalies is categorized subjectively by the user. We make two additional assumptions. First, there exist extremely few useful anomalies to be hunted down within a massive dataset. Second, both useful and useless anomalies may sometimes exist within tiny classes of similar anomalies. The challenge is thus to identify “rare category” records in an unlabeled noisy set with help (in the form of class labels) from a human expert who has a small budget of datapoints that they are prepared to categorize. We propose a technique to meet this challenge, which assumes a mixture model fit to the data, but otherwise makes no assumptions on the particular form of the mixture components. This property promises wide applicability in real-life scenarios and for various statistical models. We give an overview of several alternative methods, highlighting their strengths and weaknesses, and conclude with a detailed empirical analysis. We show that our method can quickly zoom in on an anomaly set containing a few tens of points in a dataset of hundreds of thousands. 1

3 0.59483325 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms

Author: Omid Madani, David M. Pennock, Gary W. Flake

Abstract: In the context of binary classification, we define disagreement as a measure of how often two independently-trained models differ in their classification of unlabeled data. We explore the use of disagreement for error estimation and model selection. We call the procedure co-validation, since the two models effectively (in)validate one another by comparing results on unlabeled data, which we assume is relatively cheap and plentiful compared to labeled data. We show that per-instance disagreement is an unbiased estimate of the variance of error for that instance. We also show that disagreement provides a lower bound on the prediction (generalization) error, and a tight upper bound on the “variance of prediction error”, or the variance of the average error across instances, where variance is measured across training sets. We present experimental results on several data sets exploring co-validation for error estimation and model selection. The procedure is especially effective in active learning settings, where training sets are not drawn at random and cross validation overestimates error. 1

4 0.58126462 164 nips-2004-Semi-supervised Learning by Entropy Minimization

Author: Yves Grandvalet, Yoshua Bengio

Abstract: We consider the semi-supervised learning problem, where a decision rule is to be learned from labeled and unlabeled data. In this framework, we motivate minimum entropy regularization, which enables to incorporate unlabeled data in the standard supervised learning. Our approach includes other approaches to the semi-supervised problem as particular or limiting cases. A series of experiments illustrates that the proposed solution benefits from unlabeled data. The method challenges mixture models when the data are sampled from the distribution class spanned by the generative model. The performances are definitely in favor of minimum entropy regularization when generative models are misspecified, and the weighting of unlabeled data provides robustness to the violation of the “cluster assumption”. Finally, we also illustrate that the method can also be far superior to manifold learning in high dimension spaces. 1

5 0.54015839 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms

Author: Ting Liu, Andrew W. Moore, Ke Yang, Alexander G. Gray

Abstract: This paper concerns approximate nearest neighbor searching algorithms, which have become increasingly important, especially in high dimensional perception areas such as computer vision, with dozens of publications in recent years. Much of this enthusiasm is due to a successful new approximate nearest neighbor approach called Locality Sensitive Hashing (LSH). In this paper we ask the question: can earlier spatial data structure approaches to exact nearest neighbor, such as metric trees, be altered to provide approximate answers to proximity queries and if so, how? We introduce a new kind of metric tree that allows overlap: certain datapoints may appear in both the children of a parent. We also introduce new approximate k-NN search algorithms on this structure. We show why these structures should be able to exploit the same randomprojection-based approximations that LSH enjoys, but with a simpler algorithm and perhaps with greater efficiency. We then provide a detailed empirical evaluation on five large, high dimensional datasets which show up to 31-fold accelerations over LSH. This result holds true throughout the spectrum of approximation levels. 1

6 0.53438258 54 nips-2004-Distributed Information Regularization on Graphs

7 0.52288371 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

8 0.5183205 136 nips-2004-On Semi-Supervised Classification

9 0.46316996 85 nips-2004-Instance-Based Relevance Feedback for Image Retrieval

10 0.43032131 111 nips-2004-Maximal Margin Labeling for Multi-Topic Text Categorization

11 0.42559826 40 nips-2004-Common-Frame Model for Object Recognition

12 0.40924904 166 nips-2004-Semi-supervised Learning via Gaussian Processes

13 0.3910431 75 nips-2004-Heuristics for Ordering Cue Search in Decision Making

14 0.36985958 141 nips-2004-Optimal sub-graphical models

15 0.36379504 126 nips-2004-Nearly Tight Bounds for the Continuum-Armed Bandit Problem

16 0.35620806 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics

17 0.35588098 137 nips-2004-On the Adaptive Properties of Decision Trees

18 0.34940904 44 nips-2004-Conditional Random Fields for Object Recognition

19 0.34840277 109 nips-2004-Mass Meta-analysis in Talairach Space

20 0.34730789 99 nips-2004-Learning Hyper-Features for Visual Identification


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.105), (15, 0.132), (26, 0.053), (31, 0.043), (32, 0.268), (33, 0.195), (35, 0.022), (39, 0.022), (50, 0.058), (77, 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.86133897 23 nips-2004-Analysis of a greedy active learning strategy

Author: Sanjoy Dasgupta

Abstract: We abstract out the core search problem of active learning schemes, to better understand the extent to which adaptive labeling can improve sample complexity. We give various upper and lower bounds on the number of labels which need to be queried, and we prove that a popular greedy active learning rule is approximately as good as any other strategy for minimizing this number of labels. 1

2 0.85617852 7 nips-2004-A Large Deviation Bound for the Area Under the ROC Curve

Author: Shivani Agarwal, Thore Graepel, Ralf Herbrich, Dan Roth

Abstract: The area under the ROC curve (AUC) has been advocated as an evaluation criterion for the bipartite ranking problem. We study large deviation properties of the AUC; in particular, we derive a distribution-free large deviation bound for the AUC which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on an independent test sequence. A comparison of our result with a corresponding large deviation result for the classification error rate suggests that the test sample size required to obtain an ε-accurate estimate of the expected accuracy of a ranking function with δ-confidence is larger than that required to obtain an ε-accurate estimate of the expected error rate of a classification function with the same confidence. A simple application of the union bound allows the large deviation bound to be extended to learned ranking functions chosen from finite function classes. 1

3 0.82014012 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification

Author: Peter L. Bartlett, Michael Collins, Ben Taskar, David A. McAllester

Abstract: We consider the problem of structured classification, where the task is to predict a label y from an input x, and y has meaningful internal structure. Our framework includes supervised training of Markov random fields and weighted context-free grammars as special cases. We describe an algorithm that solves the large-margin optimization problem defined in [12], using an exponential-family (Gibbs distribution) representation of structured objects. The algorithm is efficient—even in cases where the number of labels y is exponential in size—provided that certain expectations under Gibbs distributions can be calculated efficiently. The method for structured labels relies on a more general result, specifically the application of exponentiated gradient updates [7, 8] to quadratic programs. 1

4 0.73615038 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer

Abstract: Prediction suffix trees (PST) provide a popular and effective tool for tasks such as compression, classification, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efficient algorithm that neither relies on a-priori assumptions on the shape or maximal depth of the target PST nor does it require any parameters. To our knowledge, this is the first provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any fixed PST determined in hindsight. 1

5 0.71749717 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

Author: Nicolò Cesa-bianchi, Claudio Gentile, Luca Zaniboni

Abstract: We provide a worst-case analysis of selective sampling algorithms for learning linear threshold functions. The algorithms considered in this paper are Perceptron-like algorithms, i.e., algorithms which can be efficiently run in any reproducing kernel Hilbert space. Our algorithms exploit a simple margin-based randomized rule to decide whether to query the current label. We obtain selective sampling algorithms achieving on average the same bounds as those proven for their deterministic counterparts, but using much fewer labels. We complement our theoretical findings with an empirical comparison on two text categorization tasks. The outcome of these experiments is largely predicted by our theoretical results: Our selective sampling algorithms tend to perform as good as the algorithms receiving the true label after each classification, while observing in practice substantially fewer labels. 1

6 0.71493828 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering

7 0.70936078 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images

8 0.70410144 15 nips-2004-Active Learning for Anomaly and Rare-Category Detection

9 0.70230895 131 nips-2004-Non-Local Manifold Tangent Learning

10 0.70117581 154 nips-2004-Resolving Perceptual Aliasing In The Presence Of Noisy Sensors

11 0.70112342 69 nips-2004-Fast Rates to Bayes for Kernel Machines

12 0.69905585 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

13 0.69836879 40 nips-2004-Common-Frame Model for Object Recognition

14 0.6979993 102 nips-2004-Learning first-order Markov models for control

15 0.69561118 163 nips-2004-Semi-parametric Exponential Family PCA

16 0.6948573 178 nips-2004-Support Vector Classification with Input Data Uncertainty

17 0.69468725 127 nips-2004-Neighbourhood Components Analysis

18 0.69329643 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss

19 0.69322526 70 nips-2004-Following Curved Regularized Optimization Solution Paths

20 0.69315499 207 nips-2004-ℓ₀-norm Minimization for Basis Selection