nips nips2001 nips2001-143 knowledge-graph by maker-knowledge-mining

143 nips-2001-PAC Generalization Bounds for Co-training

Source: pdf

Author: Sanjoy Dasgupta, Michael L. Littman, David A. McAllester

Abstract: The rule-based bootstrapping introduced by Yarowsky, and its cotraining variant by Blum and Mitchell, have met with considerable empirical success. Earlier work on the theory of co-training has been only loosely related to empirically useful co-training algorithms. Here we give a new PAC-style bound on generalization error which justiﬁes both the use of conﬁdences — partial rules and partial labeling of the unlabeled data — and the use of an agreement-based objective function as suggested by Collins and Singer. Our bounds apply to the multiclass case, i.e., where instances are to be assigned one of labels for . £ ¡ ¤¢

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 com Abstract The rule-based bootstrapping introduced by Yarowsky, and its cotraining variant by Blum and Mitchell, have met with considerable empirical success. [sent-8, score-0.236]

2 Here we give a new PAC-style bound on generalization error which justiﬁes both the use of conﬁdences — partial rules and partial labeling of the unlabeled data — and the use of an agreement-based objective function as suggested by Collins and Singer. [sent-10, score-1.371]

3 , where instances are to be assigned one of labels for . [sent-13, score-0.099]

4 £ ¡ ¤¢ 1 Introduction In this paper, we study bootstrapping algorithms for learning from unlabeled data. [sent-14, score-0.419]

5 The general idea in bootstrapping is to use some initial labeled data to build a (possibly partial) predictive labeling procedure; then use the labeling procedure to label more data; then use the newly labeled data to build a new predictive procedure and so on. [sent-15, score-0.545]

6 Here we give PAC style bounds on generalization error which can be used to formally justify certain boostrapping algorithms. [sent-17, score-0.319]

7 One well-known form of bootstrapping is the EM algorithm (Dempster, Laird and Rubin, 1977). [sent-18, score-0.196]

8 This algorithm iteratively updates model parameters by using the current model to infer (a probability distribution on) labels for the unlabeled data and then adjusting the model parameters to ﬁt the (distribution on) ﬁlled-in labels. [sent-19, score-0.359]

9 No PAC-style guarantee has yet been given for the generalization accuracy of the maximum likelihood model. [sent-23, score-0.089]

10 An alternative to EM is rule-based bootstrapping of the form used by Yarowsky (1995), in which one assigns labels to some fraction of a corpus of unlabeled data and then infers new labeling rules using these assigned labels as training data. [sent-24, score-0.962]

11 New labels lead to new rules which in turn lead to new labels, and so on. [sent-25, score-0.348]

12 Unlike EM, rule-based bootstrapping typically does not attempt to ﬁll in, or assign a distribution over, labels unless there is compelling evidence for a particular label. [sent-26, score-0.295]

13 One intuitive motivation for this is that by avoiding training on low-conﬁdence ﬁlled-in labels one might avoid the self-justifying local optima encountered by EM. [sent-27, score-0.099]

14 If we generate a parse tree using such a grammar then the nonterminal symbol labeling a phrase separates the phrase from its context — the phrase and the context are statistically independent given the nonterminal symbol. [sent-31, score-0.775]

15 More intuitively, in natural language the distribution of contexts into which a given phrase can be inserted is determined to some extent by the “type” of the phrase. [sent-32, score-0.196]

16 The type includes the syntactic category but might also include semantic subclassiﬁcations, for instance, whether a noun phrase refers to a person, organization, or location. [sent-33, score-0.196]

17 If we think of each particular occurrence of a phrase as a triple , where is the phrase itself, is the “type” of the phrase, and is the context, then we expect that is conditionally independent of given . [sent-34, score-0.379]

18 The conditional independence can be made to hold precisely if we generate such triples using a stochastic context free grammar where is the syntactic category of the phrase. [sent-35, score-0.213]

19 £ ¡ ¡ ¥ § ¥ £ ¡ ©¨¦¤¢ £ ¡ ©¡ § § ¡ § Blum and Mitchell introduce co-training as a general term for rule-based bootstrapping in which each rule must be based entirely on or entirely on . [sent-36, score-0.383]

20 The principal assumption made by Blum and Mitchell is that is conditionally independent of given . [sent-39, score-0.105]

21 However, it does not provide a bound on generalization error as a function of empirically measurable quantities. [sent-42, score-0.281]

22 Furthermore, there is no apparent relationship between this PAC-learnability theorem and the iterative co-training algorithm they suggest. [sent-43, score-0.081]

23 £ ¡ § £ £ ¡ £ " " Collins and Singer (1999) suggest a reﬁnement of the co-training algorithm in which one explicitly optimizes an objective function that measures the degree of agreement between the predictions based on and those based on . [sent-45, score-0.142]

24 They describe methods for “boosting” this objective function but do not provide any formal justiﬁcation for the objective function itself. [sent-46, score-0.148]

25 Here we give a PAC-style performance guarantee in terms of this agreement rate. [sent-47, score-0.142]

26 ¡ £ ¡ In this paper, we use partial classiﬁcation rules, which either output a class label or output a special symbol indicating no opinion. [sent-49, score-0.239]

27 The error of a partial rule is the probability that the rule is incorrect given that it has an opinion. [sent-50, score-0.7]

28 We work in the co-training setting where we have a pair of partial rules and where (sometimes) predicts from and (sometimes) predicts from . [sent-51, score-0.504]

29 Each of the rules and can be “composite rules”, such as decision lists, where each composite rule contains a large set of smaller rules within it. [sent-52, score-0.797]

30 We give a bound on the generalization error of each of the rules and in terms of the empirical agreement rate between the two rules. [sent-53, score-0.749]

31 This bound formally justiﬁes both the use # $ £ ¡ § £ © $ £ £ $ £ ©¡ § h1 X1 h2 X2 Y and $ Figure 1: The co-training scenario with rules . [sent-54, score-0.422]

32 £ of agreement in the objective function and the use of partial rules. [sent-55, score-0.346]

33 The bound shows the potential power of unlabeled data — low generalization error can be achieved for complex rules with a sufﬁcient quantity of unlabeled data. [sent-56, score-0.976]

34 The use of partial rules is analogous to the use of conﬁdence ratings — a partial rule is just a rule with two levels of conﬁdence. [sent-57, score-1.031]

35 So the bound can also be viewed as justifying the partial labeling aspect of rule-based bootstrapping, at least in the case of co-training where an independence assumption holds. [sent-58, score-0.541]

36 The generalization bound leads naturally to algorithms for optimizing the bound. [sent-59, score-0.196]

37 A simple greedy procedure for doing this is quite similar to the co-training algorithm suggested by Collins and Singer. [sent-60, score-0.098]

38 For the co-training bounds proved here we assume data is drawn from some distribution over triples with , and , and where and are conditionally independent given , that is, and . [sent-68, score-0.224]

39 In the co-training framework we are given an unlabeled sample of pairs drawn i. [sent-69, score-0.267]

40 We will mainly be interested in making inferences from the some labeled samples unlabeled data. [sent-73, score-0.284]

41 We will be interested in pairs of partial rules and which largely agree on the unlabeled data. [sent-75, score-0.735]

42 And yet ﬁnding such a pair requires no labeled data at all. [sent-81, score-0.112]

43 This simple observation is a major motivation for the proof, but things are complicated considerably by partial rules and by approximate agreement. [sent-82, score-0.453]

44 This is equivalent to saying that should be a permutation of the possible labels . [sent-85, score-0.221]

45 Here we give a condition using only unlabeled data which guarantees, up to high conﬁdence, that is a permutation; this is the best we can hope for using unlabeled § £ © )# £ R9 £ & § W ©2(¢X2© 9 £ W § W " ¥ ¢ ¢ ¢ 6 ¤£¤¥ e % % data alone. [sent-86, score-0.486]

46 We also bound the error rates using only unlabeled data. [sent-87, score-0.449]

47 In the case of , if is a permutation then is either the identity function or the function reversing the two possible values. [sent-88, score-0.122]

48 We use the unlabeled data to select and so that is a permutation and has low error rate. [sent-89, score-0.43]

49 We can then use a smaller amount of labeled data to determine which permutation we have found. [sent-90, score-0.183]

50 Some measure of the complexity of rules and is needed; rather than VC dimension, we adopt a clean notion of for the bit length. [sent-93, score-0.249]

51 We assume that rules are speciﬁed in some rule language and write number of bits used to specify the rule . [sent-94, score-0.661]

52 We assume that the rule language is preﬁx-free (no proper preﬁx of the bit string specifying a rule is itself a legal rule speciﬁcation). [sent-95, score-0.599]

53 For given partial rules and preﬁx free code satisﬁes the Kraft inequality and we now deﬁne the following functions of the sample . [sent-97, score-0.554]

54 The ﬁrst, as we will see, is a bound on the sampling error for empirical probabilities conditioned upon . [sent-98, score-0.266]

55 32BA% @" 968# 76&53210 ) x y9 R9 £ # £ ¥ Note that if the sample size is sufﬁciently large (relative to and ) then is near zero. [sent-108, score-0.096]

56 Also note that if and have near perfect agreement when they both are not then is near one. [sent-109, score-0.172]

57 £ )# $ % R9 E£ ¥ x 9 ya § XQTcE£ W & x R9 % The theorem also justiﬁes the use of partial rules. [sent-113, score-0.285]

58 Of course it is possible to convert a partial rule to a total rule by forcing a random choice when the rule would otherwise return . [sent-114, score-0.916]

59 Converting a partial rule to a total rule in this way and then applying the above theorem to the total rule gives a weaker result. [sent-115, score-1.002]

60 In this case the empirical error rate of the corresponding total rule — the rule that guesses when has no opinion — will be statistically indistinguishable from from 1/2. [sent-117, score-0.711]

61 However, in this case theorem 1 can still establish that the false positive and false negative rate of the partial rule is near zero. [sent-118, score-0.595]

62 £ 9 # t3 ©I6 & & £ ¤ R9 E£ )# % § £ 3 The Analysis We start with a general lemma about conditional probability estimation. [sent-119, score-0.124]

63 By the previous lemma is a permutation, so has the same information content as . [sent-125, score-0.087]

64 41 1 R9 E£ £ ¥ x y9 & ya(§ X` x 9 W % Lemma 4 Pick any rules W 1 #9 The summation is a convex combination; therefore there must exist some such that . [sent-129, score-0.249]

65 Q ¥ ¥ £ h s x S )# R9 E£ R9 s ¥ h x £ © ¥ 9 $¦& ¢ ¢ ¢ 6 ¤¤£¥ 7 x 9 ca § XQ& c9 W x ¥ £ qg Q § & x 29 v q w(q £ % £ s ¥ ¦$ ¥ £ x 9 £ c© h 1 % x 9 ca XW x % W 6 ¨x ¨ £ g i6 £ . [sent-133, score-0.344]

66 By the union bound and the Kraft inequality, we have that with this must hold simultaneously for all and , and all . [sent-134, score-0.19]

67 9 £)# & % 9 & % w" ¦©¨ §¦ ¦ ¤¥ £S ww )# & § $ % I)e& § % ww ¡% ¢ i# w w ww b¦ ¦ ¤¥ ¢ S www )# & ww ©¨$¦ 9 ¤)# & & £ i 3 g ¦ 0¢ I"£ ¦ (¢ §©£ u6 )¦¤ '¦¤ , % & % qg § . [sent-137, score-0.903]

68 and for which equation (2) as well as Lemma 3 Pick any rules hold for all . [sent-141, score-0.298]

69 Then is a permutation, and moreover, for any , for any given probability at least Therefore, with probability at least Proof. [sent-142, score-0.19]

70 (3) )# R9 £ % iu6 r i s £ 3 qg 6 V g £36 ¥ ¥ © £ e ¨ p i I2g )# R9 £ % ¨ £ ¤ ¢ ¥ ¢ ¢ §¦¤£¡ g i £ 3 x H9u6( XwEx W & Proof. [sent-144, score-0.344]

71 From our main theorem, we know that with probability at least , for all . [sent-145, score-0.095]

72 we have the followand ( s x g i $ £ ¨ 6 We can now state a bound on the total error rate as a corollary to theorem 1. [sent-148, score-0.566]

73 Also by the previous lemma, we have so the second term in the above sum must be negative, whereby 3 £ 6 § (¢XW S )# R9 £ ¥ 9 § & (¢XW 8 9 £ % % i £3g i $ & % With probability at least we have that is no larger than . [sent-153, score-0.138]

74 So by the union bound both of these conditions hold simultaneously with probability at least . [sent-154, score-0.285]

75 Since we have that the upper bound in (3) is maximized by setting equal to . [sent-155, score-0.141]

76 One could use a relative Chernoff bound to tighten the uncertainty in in the case where this probability is small. [sent-157, score-0.178]

77 One could also use the error rate bounds to construct bounds on . [sent-158, score-0.37]

78 Another approach is to use the error rate of a rule that combines and , e. [sent-160, score-0.343]

79 , the rule outputs if , otherwise outputs if , and otherwise guesses a random value. [sent-162, score-0.324]

80 This combined rule will have a lower error rate and it is possible to give bounds on the error rate of the combined rule. [sent-163, score-0.646]

81 It should be noted, however, that the algorithm described in section 4 can be used with any bound on total error rate. [sent-165, score-0.304]

82 Corollary 5, or some more reﬁned bound on total error, provides an objective function that can be pursued by a learning algorithm — the objective is to ﬁnd and so as to minimize the upper bound on the total error rate. [sent-167, score-0.671]

83 Following Yarowsky, we consider the greedy construction of decision list rules. [sent-169, score-0.291]

84 We assume that is to be a decision list over the features in , i. [sent-171, score-0.193]

85 A decision list can be viewed as a right-branching decision tree. [sent-174, score-0.264]

86 More speciﬁcally, if is the list ; ; ; then is if and otherwise equals the value of the list ; ; on . [sent-175, score-0.281]

87 This implies the Kraft inequality which is all that is needed in theorem 1 and corollary 5. [sent-179, score-0.248]

88 We also assume that is a decision list over the features and deﬁne similarly. [sent-180, score-0.193]

89 6 £ & & Following Yarowsky we suggest growing the decision lists in a greedy manner adding one feature value pair at a time. [sent-181, score-0.285]

90 A natural choice of greedy heuristic might be a bound on the total error rate. [sent-182, score-0.513]

91 However, in many cases the ﬁnal objective function is not an appropriate choice for the greedy heuristic in greedy algorithms. [sent-183, score-0.381]

92 A* search, for example, might be viewed as a greedy heuristic where the heuristic function estimates the number of steps needed to reach a high-value conﬁguration — a low value conﬁguration might be one step away from a high value conﬁguration. [sent-184, score-0.248]

93 The greedy heuristic used in greedy search should estimate the value of the ﬁnal conﬁguration. [sent-185, score-0.271]

94 Here we suggest using as a heuristic estimate of the ﬁnal total error rate — in the ﬁnal conﬁguration we should g £3q ¥ ¥ £ ¨ p i Qg qg ¨ % have that is reasonably large and the important term will be . [sent-186, score-0.653]

95 Q ¥ ¥ £ p i I2g to “seed rule” decision lists using domain-speciﬁc prior are both zero, or all features have been used in % 9 )# 9 $ f x S T)# 9 £ U g £36 ¥ ¥ £ $ $ and otherwise. [sent-189, score-0.136]

96 f £ ¨ y £ 3 qg ¥ ¥ © £ ¡g g £f3 q ¥ ¥ © £ e ¨ s h p i eI2g 3. [sent-196, score-0.344]

97 Prune the rules — iteratively remove the pair from the end of either or that greedily minimizes the bound on total error until no such removal reduces the bound. [sent-197, score-0.604]

98 One direction for future research is to try to relax this assumption somehow. [sent-200, score-0.083]

99 We could relax this assumption by allowing some small amount of mutual information between and given and giving bounds on error rates that involve this quantity of mutual information. [sent-203, score-0.371]

100 Another direction for future work, of course, is the empirical evaluation of co-training and bootstrapping methods suggested by our theory. [sent-204, score-0.236]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('qg', 0.344), ('rules', 0.249), ('unlabeled', 0.223), ('partial', 0.204), ('bootstrapping', 0.196), ('xw', 0.188), ('yarowsky', 0.188), ('rule', 0.187), ('phrase', 0.158), ('bound', 0.141), ('justi', 0.138), ('mitchell', 0.136), ('ww', 0.124), ('qq', 0.124), ('list', 0.122), ('permutation', 0.122), ('collins', 0.118), ('corollary', 0.11), ('blum', 0.11), ('ya', 0.109), ('bounds', 0.107), ('labels', 0.099), ('greedy', 0.098), ('labeling', 0.096), ('kraft', 0.094), ('con', 0.092), ('lemma', 0.087), ('tx', 0.087), ('error', 0.085), ('theorem', 0.081), ('total', 0.078), ('heuristic', 0.075), ('objective', 0.074), ('decision', 0.071), ('rate', 0.071), ('agreement', 0.068), ('em', 0.067), ('singer', 0.066), ('pre', 0.065), ('lists', 0.065), ('conditionally', 0.063), ('guesses', 0.063), ('qqg', 0.063), ('tce', 0.063), ('tcr', 0.063), ('tya', 0.063), ('www', 0.063), ('xqtce', 0.063), ('labeled', 0.061), ('agree', 0.059), ('least', 0.058), ('yx', 0.057), ('guration', 0.057), ('inequality', 0.057), ('generalization', 0.055), ('dasgupta', 0.054), ('triples', 0.054), ('labs', 0.053), ('near', 0.052), ('pair', 0.051), ('nonterminal', 0.05), ('pac', 0.05), ('xq', 0.05), ('hold', 0.049), ('dence', 0.048), ('mutual', 0.048), ('ne', 0.047), ('sample', 0.044), ('whereby', 0.043), ('de', 0.043), ('pick', 0.042), ('assumption', 0.042), ('es', 0.042), ('chernoff', 0.041), ('composite', 0.041), ('relax', 0.041), ('statements', 0.041), ('nal', 0.041), ('empirical', 0.04), ('give', 0.04), ('ss', 0.039), ('grammar', 0.039), ('rubin', 0.039), ('cation', 0.039), ('language', 0.038), ('syntactic', 0.038), ('otherwise', 0.037), ('probability', 0.037), ('rf', 0.037), ('dempster', 0.037), ('laird', 0.037), ('choice', 0.036), ('label', 0.035), ('predictor', 0.034), ('guarantee', 0.034), ('re', 0.033), ('conditioning', 0.033), ('context', 0.033), ('formally', 0.032), ('statement', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 143 nips-2001-PAC Generalization Bounds for Co-training

Author: Sanjoy Dasgupta, Michael L. Littman, David A. McAllester

2 0.16327584 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces

Author: T. Zhang

Abstract: We investigate the generalization performance of some learning problems in Hilbert functional Spaces. We introduce a notion of convergence of the estimated functional predictor to the best underlying predictor, and obtain an estimate on the rate of the convergence. This estimate allows us to derive generalization bounds on some learning formulations.

3 0.15756623 1 nips-2001-(Not) Bounding the True Error

Author: John Langford, Rich Caruana

Abstract: We present a new approach to bounding the true error rate of a continuous valued classiﬁer based upon PAC-Bayes bounds. The method ﬁrst constructs a distribution over classiﬁers by determining how sensitive each parameter in the model is to noise. The true error rate of the stochastic classiﬁer found with the sensitivity analysis can then be tightly bounded using a PAC-Bayes bound. In this paper we demonstrate the method on artiﬁcial neural networks with results of a order of magnitude improvement vs. the best deterministic neural net bounds. £ ¡ ¤¢

4 0.15242842 8 nips-2001-A General Greedy Approximation Algorithm with Applications

Author: T. Zhang

Abstract: Greedy approximation algorithms have been frequently used to obtain sparse solutions to learning problems. In this paper, we present a general greedy algorithm for solving a class of convex optimization problems. We derive a bound on the rate of approximation for this algorithm, and show that our algorithm includes a number of earlier studies as special cases.

5 0.13752745 144 nips-2001-Partially labeled classification with Markov random walks

Author: Martin Szummer, Tommi Jaakkola

Abstract: To classify a large number of unlabeled examples we combine a limited number of labeled examples with a Markov random walk representation over the unlabeled examples. The random walk representation exploits any low dimensional structure in the data in a robust, probabilistic manner. We develop and compare several estimation criteria/algorithms suited to this representation. This includes in particular multi-way classiﬁcation with an average margin criterion which permits a closed form solution. The time scale of the random walk regularizes the representation and can be set through a margin-based criterion favoring unambiguous classiﬁcation. We also extend this basic regularization by adapting time scales for individual examples. We demonstrate the approach on synthetic examples and on text classiﬁcation problems.

6 0.12125093 167 nips-2001-Semi-supervised MarginBoost

7 0.10140464 193 nips-2001-Unsupervised Learning of Human Motion Models

8 0.10078257 129 nips-2001-Multiplicative Updates for Classification by Mixture Models

9 0.095265247 119 nips-2001-Means, Correlations and Bounds

10 0.090986922 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms

11 0.087624259 25 nips-2001-Active Learning in the Drug Discovery Process

12 0.084010355 105 nips-2001-Kernel Machines and Boolean Functions

13 0.080367006 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines

14 0.079477899 114 nips-2001-Learning from Infinite Data in Finite Time

15 0.079086147 31 nips-2001-Algorithmic Luckiness

16 0.078690529 130 nips-2001-Natural Language Grammar Induction Using a Constituent-Context Model

17 0.078446746 56 nips-2001-Convolution Kernels for Natural Language

18 0.074704975 62 nips-2001-Duality, Geometry, and Support Vector Regression

19 0.073786847 147 nips-2001-Pranking with Ranking

20 0.071624659 58 nips-2001-Covariance Kernels from Bayesian Generative Models

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.237), (1, 0.066), (2, 0.032), (3, 0.135), (4, 0.067), (5, -0.137), (6, -0.009), (7, 0.04), (8, -0.148), (9, -0.086), (10, -0.038), (11, 0.015), (12, -0.02), (13, 0.004), (14, -0.088), (15, 0.026), (16, 0.027), (17, -0.116), (18, -0.039), (19, -0.025), (20, -0.149), (21, 0.108), (22, -0.106), (23, -0.101), (24, 0.088), (25, 0.007), (26, 0.077), (27, -0.01), (28, -0.078), (29, -0.067), (30, 0.001), (31, 0.153), (32, 0.131), (33, -0.002), (34, -0.092), (35, 0.0), (36, -0.055), (37, -0.05), (38, -0.03), (39, 0.048), (40, -0.088), (41, -0.134), (42, -0.002), (43, -0.031), (44, -0.007), (45, -0.035), (46, -0.071), (47, 0.124), (48, -0.177), (49, -0.094)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95862156 143 nips-2001-PAC Generalization Bounds for Co-training

Author: Sanjoy Dasgupta, Michael L. Littman, David A. McAllester

2 0.62649274 1 nips-2001-(Not) Bounding the True Error

Author: John Langford, Rich Caruana

3 0.58699691 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces

Author: T. Zhang

4 0.5391649 114 nips-2001-Learning from Infinite Data in Finite Time

Author: Pedro Domingos, Geoff Hulten

Abstract: We propose the following general method for scaling learning algorithms to arbitrarily large data sets. Consider the model Mii learned by the algorithm using ni examples in step i (ii = (nl , ... ,nm )) , and the model Moo that would be learned using infinite examples. Upper-bound the loss L(Mii' M oo ) between them as a function of ii, and then minimize the algorithm's time complexity f(ii) subject to the constraint that L(Moo , M ii ) be at most f with probability at most 8. We apply this method to the EM algorithm for mixtures of Gaussians. Preliminary experiments on a series of large data sets provide evidence of the potential of this approach. 1 An Approach to Large-Scale Learning Large data sets make it possible to reliably learn complex models. On the other hand , they require large computational resources to learn from. While in the past the factor limit ing the quality of learnable models was typically the quantity of data available, in many domains today data is super-abundant, and the bottleneck is t he t ime required to process it. Many algorithms for learning on large data sets have been proposed, but in order to achieve scalability they generally compromise the quality of the results to an unspecified degree. We believe this unsatisfactory state of affairs is avoidable, and in this paper we propose a general method for scaling learning algorithms to arbitrarily large databases without compromising the quality of the results. Our method makes it possible to learn in finite time a model that is essentially indistinguishable from the one that would be obtained using infinite data. Consider the simplest possible learning problem: estimating the mean of a random variable x. If we have a very large number of samples, most of them are probably superfluous. If we are willing to accept an error of at most f with probability at most 8, Hoeffding bounds [4] (for example) tell us that, irrespective of the distribution of x, only n = ~(R/f)2 1n (2/8) samples are needed, where R is x's range. We propose to extend this type of reasoning beyond learning single parameters, to learning complex models. The approach we propose consists of three steps: 1. Derive an upper bound on the relative loss between the finite-data and infinite-data models, as a function of the number of samples used in each step of the finite-data algorithm. 2. Derive an upper bound on the time complexity of the learning algorithm , as a function of the number of samples used in each step. 3. Minimize the time bound (via the number of samples used in each step) subject to target limits on the loss. In this paper we exemplify this approach using the EM algorithm for mixtures of Gaussians. In earlier papers we applied it (or an earlier version of it) to decision tree induction [2J and k-means clustering [3J. Despite its wide use, EM has long been criticized for its inefficiency (see discussion following Dempster et al. [1]), and has been considered unsuitable for large data sets [8J. Many approaches to speeding it up have been proposed (see Thiesson et al. [6J for a survey) . Our method can be seen as an extension of progressive sampling approaches like Meek et al. [5J: rather than minimize the total number of samples needed by the algorithm , we minimize the number needed by each step, leading to potentially much greater savings; and we obtain guarantees that do not depend on unverifiable extrapolations of learning curves. 2 A Loss Bound for EM In a mixture of Gaussians model, each D-dimensional data point Xj is assumed to have been independently generated by the following process: 1) randomly choose a mixture component k; 2) randomly generate a point from it according to a Gaussian distribution with mean f-Lk and covariance matrix ~k. In this paper we will restrict ourselves to the case where the number K of mixture components and the probability of selection P(f-Lk) and covariance matrix for each component are known. Given a training set S = {Xl, ... , XN }, the learning goal is then to find the maximumlikelihood estimates of the means f-Lk. The EM algorithm [IJ accomplishes this by, starting from some set of initial means , alternating until convergence between estimating the probability p(f-Lk IXj) that each point was generated by each Gaussian (the Estep), and computing the ML estimates of the means ilk = 2::;':1 WjkXj / 2::f=l Wjk (the M step), where Wjk = p(f-Lklxj) from the previous E step. In the basic EM algorithm, all N examples in the training set are used in each iteration. The goal in this paper is to speed up EM by using only ni < N examples in the ith iteration, while guaranteeing that the means produced by the algorithm do not differ significantly from those that would be obtained with arbitrarily large N. Let Mii = (ill , . . . , ilK) be the vector of mean estimates obtained by the finite-data EM algorithm (i.e., using ni examples in iteration i), and let Moo = (f-L1, ... ,f-LK) be the vector obtained using infinite examples at each iteration. In order to proceed, we need to quantify the difference between Mii and Moo . A natural choice is the sum of the squared errors between corresponding means, which is proportional to the negative log-likelihood of the finite-data means given the infinite-data ones: K L(Mii' Moo ) = L k=l K Ililk - f-Lkl12 = D LL lilkd - f-Lkdl 2 (1) k=ld=l where ilkd is the dth coordinate of il, and similarly for f-Lkd. After any given iteration of EM, lilkd - f-Lkdl has two components. One, which we call the sampling error, derives from the fact that ilkd is estimated from a finite sample, while J-Lkd is estimated from an infinite one. The other component, which we call the weighting error, derives from the fact that , due to sampling errors in previous iterations, the weights Wjk used to compute the two estimates may differ. Let J-Lkdi be the infinite-data estimate of the dth coordinate of the kth mean produced in iteration i, flkdi be the corresponding finite-data estimate, and flkdi be the estimate that would be obtained if there were no weighting errors in that iteration. Then the sampling error at iteration i is Iflkdi - J-Lkdi I, the weighting error is Iflkdi - flkdi I, and the total error is Iflkdi - J-Lkdi 1 :::; Iflkdi - flkdi 1 + Iflkdi - J-Lkdi IčˇŻ Given bounds on the total error of each coordinate of each mean after iteration i-I, we can derive a bound on the weighting error after iteration i as follows. Bounds on J-Lkd ,i-l for each d imply bounds on p(XjlJ-Lki ) for each example Xj, obtained by substituting the maximum and minimum allowed distances between Xjd and J-Lkd ,i-l into the expression of the Gaussian distribution. Let P}ki be the upper bound on P(XjlJ-Lki) , and Pjki be the lower bound. Then the weight of example Xj in mean J-Lki can be bounded from below by by W}ki W (+) jki = min{p}kiP(J-Lk)/ wjki = PjkiP(J-Lk)/ ~~= l P}k'iP(J-LU, and from above ~~=l Pjk'iP(J-LU, I}. Let w;t: = W}ki if Xj ::::: 0 and (- - 1 > (- + . W jki) - W jki' f Xj _ 0 an d W jki) - W jki 0 th erWlse. ot h ' erWlse, an d 1et W- jki Then Iflkdi - flkdi , IJ-Lkdi 1 < max - ~7~1 Wjk i Xj I

5 0.53540123 144 nips-2001-Partially labeled classification with Markov random walks

Author: Martin Szummer, Tommi Jaakkola

6 0.50465453 8 nips-2001-A General Greedy Approximation Algorithm with Applications

7 0.50064021 119 nips-2001-Means, Correlations and Bounds

8 0.46840924 193 nips-2001-Unsupervised Learning of Human Motion Models

9 0.43769294 31 nips-2001-Algorithmic Luckiness

10 0.42084587 64 nips-2001-EM-DD: An Improved Multiple-Instance Learning Technique

11 0.42055383 25 nips-2001-Active Learning in the Drug Discovery Process

12 0.41606826 167 nips-2001-Semi-supervised MarginBoost

13 0.41043779 130 nips-2001-Natural Language Grammar Induction Using a Constituent-Context Model

14 0.38561019 66 nips-2001-Efficiency versus Convergence of Boolean Kernels for On-Line Learning Algorithms

15 0.37267029 129 nips-2001-Multiplicative Updates for Classification by Mixture Models

16 0.36919269 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimentions

17 0.35251975 36 nips-2001-Approximate Dynamic Programming via Linear Programming

18 0.35100758 5 nips-2001-A Bayesian Model Predicts Human Parse Preference and Reading Times in Sentence Processing

19 0.33982092 62 nips-2001-Duality, Geometry, and Support Vector Regression

20 0.33022767 105 nips-2001-Kernel Machines and Boolean Functions

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(14, 0.047), (15, 0.203), (17, 0.042), (19, 0.046), (27, 0.138), (30, 0.062), (36, 0.032), (59, 0.024), (70, 0.014), (72, 0.082), (79, 0.041), (83, 0.022), (84, 0.015), (91, 0.156)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.90747416 104 nips-2001-Kernel Logistic Regression and the Import Vector Machine

Author: Ji Zhu, Trevor Hastie

Abstract: The support vector machine (SVM) is known for its good performance in binary classiﬁcation, but its extension to multi-class classiﬁcation is still an on-going research issue. In this paper, we propose a new approach for classiﬁcation, called the import vector machine (IVM), which is built on kernel logistic regression (KLR). We show that the IVM not only performs as well as the SVM in binary classiﬁcation, but also can naturally be generalized to the multi-class case. Furthermore, the IVM provides an estimate of the underlying probability. Similar to the “support points” of the SVM, the IVM model uses only a fraction of the training data to index kernel basis functions, typically a much smaller fraction than the SVM. This gives the IVM a computational advantage over the SVM, especially when the size of the training data set is large.

same-paper 2 0.88439405 143 nips-2001-PAC Generalization Bounds for Co-training

Author: Sanjoy Dasgupta, Michael L. Littman, David A. McAllester

3 0.86415976 30 nips-2001-Agglomerative Multivariate Information Bottleneck

Author: Noam Slonim, Nir Friedman, Naftali Tishby

Abstract: The information bottleneck method is an unsupervised model independent data organization technique. Given a joint distribution peA, B), this method constructs a new variable T that extracts partitions, or clusters, over the values of A that are informative about B. In a recent paper, we introduced a general principled framework for multivariate extensions of the information bottleneck method that allows us to consider multiple systems of data partitions that are inter-related. In this paper, we present a new family of simple agglomerative algorithms to construct such systems of inter-related clusters. We analyze the behavior of these algorithms and apply them to several real-life datasets.

4 0.76850617 13 nips-2001-A Natural Policy Gradient

Author: Sham M. Kakade

Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1

5 0.75746238 8 nips-2001-A General Greedy Approximation Algorithm with Applications

Author: T. Zhang

6 0.75733721 100 nips-2001-Iterative Double Clustering for Unsupervised and Semi-Supervised Learning

7 0.75284547 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

8 0.74944758 132 nips-2001-Novel iteration schemes for the Cluster Variation Method

9 0.74823332 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data

10 0.74791527 135 nips-2001-On Spectral Clustering: Analysis and an algorithm

11 0.74665761 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding

12 0.74653268 36 nips-2001-Approximate Dynamic Programming via Linear Programming

13 0.74261725 95 nips-2001-Infinite Mixtures of Gaussian Process Experts

14 0.74251616 121 nips-2001-Model-Free Least-Squares Policy Iteration

15 0.74241954 58 nips-2001-Covariance Kernels from Bayesian Generative Models

16 0.74160171 193 nips-2001-Unsupervised Learning of Human Motion Models

17 0.74102724 89 nips-2001-Grouping with Bias

18 0.74012768 56 nips-2001-Convolution Kernels for Natural Language

19 0.73782492 127 nips-2001-Multi Dimensional ICA to Separate Correlated Sources

20 0.73671585 155 nips-2001-Quantizing Density Estimators