nips nips2001 nips2001-147 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Koby Crammer, Yoram Singer
Abstract: We discuss the problem of ranking instances. In our framework each instance is associated with a rank or a rating, which is an integer from 1 to k. Our goal is to find a rank-prediction rule that assigns each instance a rank which is as close as possible to the instance's true rank. We describe a simple and efficient online algorithm, analyze its performance in the mistake bound model, and prove its correctness. We describe two sets of experiments, with synthetic data and with the EachMovie dataset for collaborative filtering. In the experiments we performed, our algorithm outperforms online algorithms for regression and classification applied to ranking. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We discuss the problem of ranking instances. [sent-4, score-0.432]
2 In our framework each instance is associated with a rank or a rating, which is an integer from 1 to k. [sent-5, score-0.256]
3 Our goal is to find a rank-prediction rule that assigns each instance a rank which is as close as possible to the instance's true rank. [sent-6, score-0.36]
4 We describe a simple and efficient online algorithm, analyze its performance in the mistake bound model, and prove its correctness. [sent-7, score-0.407]
5 We describe two sets of experiments, with synthetic data and with the EachMovie dataset for collaborative filtering. [sent-8, score-0.174]
6 In the experiments we performed, our algorithm outperforms online algorithms for regression and classification applied to ranking. [sent-9, score-0.261]
7 1 Introduction The ranking problem we discuss in this paper shares common properties with both classification and regression problems. [sent-10, score-0.51]
8 We refer to the labels as ranks and without loss of generality assume that the ranks constitute the set {1, 2, ..., k}. [sent-13, score-0.189]
9 Settings in which it is natural to rank or rate instances rather than classify are common in tasks such as information retrieval and collaborative filtering. [sent-17, score-0.327]
10 In collaborative filtering the goal is to predict a user's rating on new items such as books or movies given the user's past ratings of similar items. [sent-19, score-0.437]
11 While the different ratings carry meaningful semantics, from a learning-theoretic point of view we model the ratings as a totally ordered set (whose size is 5 in the example above). [sent-22, score-0.365]
12 The interest in ordering or ranking of objects is by no means new and is still the source of ongoing research in many fields such as mathematical economics, social science, and computer science. [sent-23, score-0.458]
13 One of the main results of [1] underscores a complexity gap between classification learning and ranking learning. [sent-26, score-0.479]
14 To sidestep the inherent intractability problems of ranking learning several approaches have been suggested. [sent-27, score-0.432]
15 One possible approach is to cast a ranking problem as a regression problem. [sent-28, score-0.463]
16 The first case imposes a metric on the set of ranking rules which might not be realistic, while the second approach is time consuming since it requires increasing the sample size from n to O(n^2). [sent-31, score-0.432]
17 However, our work then deviates and operates directly on rankings by associating each ranking with a distinct sub-interval of the reals and adapting the support of each sub-interval while learning. [sent-35, score-0.432]
18 In the next section we describe a simple and efficient online algorithm that manipulates concurrently the direction onto which we project the instances and the division into sub-intervals. [sent-36, score-0.245]
19 In Sec. 3 we prove the correctness of the algorithm and analyze its performance in the mistake bound model. [sent-38, score-0.302]
20 In Sec. 4 we describe experiments that compare the algorithm to online algorithms for classification and regression applied to ranking, which demonstrate the merits of our approach. [sent-40, score-0.693]
21 2 The PRank Algorithm This paper focuses on online algorithms for ranking instances. [sent-41, score-0.571]
22 Each instance x^t is in R^n and its corresponding rank y^t is an element of a finite set Y with a total order relation. [sent-49, score-0.692]
23 We also say that x^t and x^s are not comparable if neither y^t > y^s nor y^t < y^s. [sent-56, score-0.622]
24 A ranking rule H is a mapping from instances to ranks, H : R^n -> Y. [sent-59, score-0.588]
25 The family of ranking rules we discuss in this paper employs a vector w ∈ R^n and a set of k thresholds b_1 ≤ ... ≤ b_k. [sent-60, score-0.579]
26 We denote by b = (b_1, ..., b_{k-1}) the vector of thresholds excluding b_k, which is fixed to ∞. [sent-67, score-0.179]
27 Given a new instance x the ranking rule first computes the inner-product between w and x . [sent-68, score-0.585]
28 The predicted rank is then defined to be the index of the first (smallest) threshold b_r for which w·x < b_r. [sent-69, score-0.593]
29 This type of ranking rule divides the space into parallel equally-ranked regions: all the instances that satisfy b_{r-1} < w·x < b_r are assigned the same rank r. [sent-71, score-1.339]
30 Formally, given a ranking rule defined by w and b, the predicted rank of an instance x is H(x) = min_{r ∈ {1,...,k}} {r : w·x - b_r < 0}. [sent-72, score-0.918]
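Stated as code, the prediction step is just an inner product followed by a search for the first threshold that exceeds it. The following is a minimal sketch in Python/NumPy; the function name, the 1-based rank convention, and storing the fixed b_k = +inf as the last entry of b are assumptions made for this illustration, not the paper's pseudocode.

```python
import numpy as np

def prank_predict(w, b, x):
    """Return the rank in {1, ..., k} predicted by the rule (w, b).

    w : weight vector, shape (n,)
    b : thresholds b_1 <= ... <= b_{k-1} followed by b_k = +inf, shape (k,)
    x : instance, shape (n,)
    """
    score = np.dot(w, x)
    # index of the first (smallest) threshold b_r with w.x - b_r < 0;
    # since b[-1] = +inf the search always succeeds, so the rank is at most k.
    return int(np.argmax(score - b < 0)) + 1
```

For example, with b = np.array([0.0, 2.0, np.inf]) (so k = 3) and w·x = 1.3, the rule returns rank 2.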
31 The analysis that we use in this paper is based on the mistake bound model for online learning. [sent-78, score-0.316]
32 On round t the learning algorithm gets an instance x^t. [sent-80, score-0.17]
33 Given x^t, the algorithm outputs a rank ŷ^t = min_r {r : w·x^t - b_r < 0}. [sent-81, score-0.575]
34 It then receives the correct rank y^t and updates its ranking rule by modifying w and b. [sent-82, score-1.046]
35 We say that our algorithm made a ranking mistake if ŷ^t ≠ y^t. [sent-83, score-0.611]
36 We wish to make the predicted rank as close as possible to the true rank. [sent-127, score-0.269]
37 Formally, the goal of the learning algorithm is to minimize the ranking-loss which is defined to be the number of thresholds between the true rank and the predicted rank. [sent-128, score-0.411]
38 Since the ranks are the integers {1, ..., k}, the ranking-loss after T rounds is equal to the accumulated difference between the predicted and true rank-values, Σ_{t=1}^T |ŷ^t - y^t|. [sent-132, score-0.367]
39 The algorithm we describe updates its ranking rule only on rounds on which it made ranking mistakes. [sent-133, score-1.141]
40 We now describe the update rule of the algorithm which is motivated by the perceptron algorithm for classification and hence we call it the PRank algorithm (for Perceptron Ranking). [sent-135, score-0.408]
41 For simplicity, we omit the index of the round when referring to an input instance-rank pair (x, y) and the ranking rule w and b. [sent-136, score-0.677]
42 If b_1 ≤ ... ≤ b_{k-1} ≤ b_k, then the predicted rank is correct if w·x - b_r > 0 for r < y and w·x - b_r < 0 for r ≥ y. [sent-140, score-0.46]
43 We represent the above inequalities by expanding the rank y into k - 1 virtual variables y_1, ..., y_{k-1}. [sent-149, score-0.207]
44 We set y_r = +1 for the case w·x > b_r and y_r = -1 for the case w·x < b_r. [sent-153, score-0.324]
45 Put another way, a rank value y induces the vector (y_1, ..., y_{k-1}) = (+1, ..., +1, -1, ..., -1), whose first y - 1 entries are +1. [sent-155, score-0.207]
46 Thus, the prediction of a ranking rule is correct if y_r(w·x - b_r) > 0 for all r. [sent-165, score-0.92]
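In code, the expansion of a rank into virtual variables and the resulting correctness test take only a few lines. This is a sketch under the same assumptions as the prediction sketch above (1-based ranks, thresholds stored with a trailing +inf that carries no virtual variable):

```python
import numpy as np

def expand_rank(y, k):
    """Expand a rank y in {1, ..., k} into (y_1, ..., y_{k-1}),
    where y_r = +1 if y > r and y_r = -1 otherwise."""
    return np.array([+1 if y > r else -1 for r in range(1, k)])

def is_correctly_ranked(w, b, x, y):
    """True iff y_r * (w.x - b_r) > 0 for every r = 1, ..., k-1."""
    yr = expand_rank(y, len(b))
    return bool(np.all(yr * (np.dot(w, x) - b[:-1]) > 0))
```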
47 If the algorithm makes a mistake by ranking x as ŷ instead of y, then there is at least one threshold, indexed r, for which the value of w·x is on the wrong side of b_r, i.e., y_r(w·x - b_r) ≤ 0. [sent-166, score-0.99]
48 To correct the mistake, we need to "move" the values of w·x and b_r toward each other. [sent-169, score-0.353]
49 An illustration of the update rule is given in Fig 1. [sent-176, score-0.173]
50 The correct rank of the instance is y = 4, and thus the value of w·x should fall above b_3. [sent-182, score-0.285]
51 However, w·x fell below b_1 and the predicted rank is ŷ = 1. [sent-185, score-0.269]
52 To mend the mistake the algorithm decreases b_1, b_2 and b_3 by a unit value and replaces them with b_1 - 1, b_2 - 1 and b_3 - 1. [sent-187, score-0.205]
53 Note that after the update, the predicted rank of x is ŷ = 3, which is closer to the true rank y = 4. [sent-195, score-0.476]
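Putting the prediction and the correction together, one online round can be sketched as follows. Variable names and the in-place convention are assumptions made for this illustration; the logic follows the update just described, touching only the thresholds whose virtual variable is violated and moving w by the sum of those corrections.

```python
import numpy as np

def prank_update(w, b, x, y):
    """Run one PRank round in place and return the rank predicted before the update.

    w : weight vector, shape (n,)
    b : thresholds, shape (k,), with b[-1] = +inf (never updated)
    x : instance, shape (n,)
    y : correct rank in {1, ..., k}
    """
    k = len(b)
    score = np.dot(w, x)
    y_hat = int(np.argmax(score - b < 0)) + 1          # predicted rank
    if y_hat != y:                                     # update only on ranking mistakes
        yr = np.array([+1 if y > r else -1 for r in range(1, k)])
        # tau_r = y_r on thresholds lying on the wrong side of w.x, 0 elsewhere
        tau = np.where(yr * (score - b[:-1]) <= 0, yr, 0)
        w += tau.sum() * x                             # move w.x toward the correct interval
        b[:-1] -= tau                                  # move the offending thresholds by one unit
    return y_hat
```

On the example above (y = 4 and w·x below b_1) this gives tau = (+1, +1, +1, 0), so b_1, b_2 and b_3 each drop by one unit and w moves by 3x, matching the behaviour just described.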
54 To conclude this section we would like to note that PRank can be straightforwardly combined with Mercer kernels [8] and voting techniques [4] often used for improving the performance of margin classifiers in batch and online settings. [sent-197, score-0.173]
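The remark about Mercer kernels can be made concrete as follows: since w is changed only by adding multiples of the observed instances, it can be kept implicitly as a weighted sum of those instances, and every inner product w·x becomes a sum of kernel evaluations. The sketch below is a hypothetical illustration of this standard construction, not code from the paper.

```python
import numpy as np

def kernel_prank_predict(alphas, stored_x, b, x, kernel):
    """Predict with w represented implicitly as sum_i alphas[i] * stored_x[i].

    alphas   : coefficients accumulated by past updates (one per stored instance)
    stored_x : instances on which updates occurred
    kernel   : Mercer kernel, e.g. lambda u, v: (1.0 + np.dot(u, v)) ** 3
    """
    score = sum(a * kernel(xi, x) for a, xi in zip(alphas, stored_x))
    return int(np.argmax(score - b < 0)) + 1
```

In this representation the update w <- w + (Σ_r τ_r) x simply appends x to the stored instances with coefficient Σ_r τ_r, while the threshold update is unchanged.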
55 3 Analysis Before we prove the mistake bound of the algorithm we first show that it maintains a consistent hypothesis in the sense that it preserves the correct order of the thresholds. [sent-198, score-0.371]
56 Specifically, we show by induction that for any ranking rule (w^1, b^1), (w^2, b^2), ... that can be derived by the algorithm along its run, the thresholds remain correctly ordered. [sent-199, score-0.58]
57 Lemma 1 (Order Preservation) Let w^t and b^t be the current ranking rule, where b^t_1 ≤ ... ≤ b^t_{k-1}. [sent-216, score-0.432]
58 Denote by w^{t+1} and b^{t+1} the resulting ranking rule after the update of PRank; then b^{t+1}_1 ≤ ... ≤ b^{t+1}_{k-1}. [sent-220, score-0.577]
59 Proof: In order to show that PRank maintains the order of the thresholds we use the definition of the algorithm for y^t_r, namely we define y^t_r = +1 for r < y^t and y^t_r = -1 for r ≥ y^t. [sent-223, score-0.422]
60 b^{t+1}_r = b^t_r - y^t_r [(w^t·x^t - b^t_r) y^t_r ≤ 0], (1) which we obtain by substituting the values of b^{t+1}. [sent-226, score-0.25]
61 Recall that y^t_r = +1 if y^t > r and y^t_r = -1 otherwise, and therefore y^t_{r+1} ≤ y^t_r. [sent-229, score-0.186]
62 Given a hyperplane w and a set of k - 1 thresholds b, we denote by v ∈ R^{n+k-1} the vector which is a concatenation of w and b, that is v = (w, b). [sent-245, score-0.255]
63 For brevity we refer to the vector v as a ranking rule. [sent-246, score-0.432]
64 Let (x^1, y^1), ..., (x^T, y^T) be an input sequence for PRank where x^t ∈ R^n and y^t ∈ {1, ..., k}. [sent-254, score-0.436]
65 Assume that there is a ranking rule v* = (w*, b*) of unit norm with b*_1 ≤ ... ≤ b*_{k-1} that ranks the whole sequence correctly with margin γ > 0, and that ||x^t||² ≤ R² for all t. [sent-259, score-0.536]
66 Then, the rank loss of the algorithm, Σ_{t=1}^T |ŷ^t - y^t|, is at most (k - 1)(R² + 1)/γ². [sent-264, score-0.464]
67 Proof: Let us fix an example (x^t, y^t) which the algorithm received on round t. [sent-265, score-0.445]
68 By definition the algorithm ranked the example using the ranking rule v^t which is composed of w^t and the thresholds b^t. [sent-266, score-0.732]
69 Similarly, we denote by v^{t+1} the updated rule (w^{t+1}, b^{t+1}) after round t. [sent-267, score-0.21]
70 Let us denote by n^t = |ŷ^t - y^t| the difference between the true rank and the predicted rank. [sent-272, score-0.484]
71 It is straightforward to verify that n^t = Σ_r |τ^t_r|. Note that if there wasn't a ranking mistake on round t then τ^t_r = 0 for r = 1, ..., k - 1. [sent-273, score-0.901]
72 To prove the theorem we bound Σ_t n^t from above by bounding ||v^t||² from above and below. [sent-277, score-0.382]
73 v^{t+1}·v* = v^t·v* + Σ_{r=1}^{k-1} τ^t_r (w*·x^t - b*_r), (2) with τ^t_r as in the pseudocode in Fig. [sent-283, score-0.284]
74 Using the assumption that v* ranks the data correctly with a margin γ, we further bound the right term by considering two cases. [sent-286, score-0.184]
75 Unfolding the sum, we get that after T rounds the algorithm satisfies v*·v^{T+1} ≥ γ Σ_t n^t. [sent-299, score-0.212]
76 As before, assume that an example (x^t, y^t) was ranked using the ranking rule v^t and denote by v^{t+1} the ranking rule after the round. [sent-304, score-1.125]
77 Since τ^t_r ∈ {-1, 0, +1} we have that (Σ_r τ^t_r)² ≤ (n^t)² and Σ_r (τ^t_r)² = n^t, and we therefore get ||v^{t+1}||² ≤ ||v^t||² + 2 Σ_r τ^t_r (w^t·x^t - b^t_r) + (n^t)² R² + n^t. (4) [sent-307, score-0.257]
78 We further develop the second term using the update rule of the algorithm and get Σ_r τ^t_r (w^t·x^t - b^t_r) ≤ 0. [sent-309, score-0.189]
79 Thus, the ranking rule we obtain after T rounds of the algorithm satisfies the upper bound ||v^{T+1}||² ≤ R² Σ_t (n^t)² + Σ_t n^t. [sent-316, score-0.675]
80 Combining the lower bound ||v^{T+1}||² ≥ (Σ_t n^t)² γ² with the upper bound we have that (Σ_t n^t)² γ² ≤ ||v^{T+1}||² ≤ R² Σ_t (n^t)² + Σ_t n^t. [sent-317, score-0.389]
81 Dividing both sides by γ² Σ_t n^t we finally get Σ_t n^t ≤ (R² [Σ_t (n^t)²] / [Σ_t n^t] + 1) / γ². (6) [sent-318, score-0.257]
82 By definition, n^t is at most k - 1, which implies that Σ_t (n^t)² ≤ Σ_t n^t (k - 1) = (k - 1) Σ_t n^t. [sent-319, score-0.257]
83 Plugging this into Eq. (6) we get the desired bound, Σ_{t=1}^T |ŷ^t - y^t| = Σ_{t=1}^T n^t ≤ [(k - 1)R² + 1]/γ² ≤ [(k - 1)(R² + 1)]/γ². [sent-321, score-0.33]
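Collecting the scattered steps, the proof combines a lower bound on ||v^{T+1}||² coming from the margin of v* with an upper bound coming from the update rule. A compact restatement of the chain, using the notation defined above, is:

```latex
\gamma^2\Big(\sum_t n^t\Big)^2
  \;\le\; \|v^{T+1}\|^2
  \;\le\; R^2\sum_t (n^t)^2 + \sum_t n^t,
\qquad
n^t \le k-1
\;\Longrightarrow\;
\sum_{t=1}^{T}\big|\hat{y}^t - y^t\big|
  \;\le\; \frac{(k-1)R^2 + 1}{\gamma^2}
  \;\le\; \frac{(k-1)(R^2 + 1)}{\gamma^2}.
```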
84 Comparison of the time-averaged ranking-loss of PRank, WH, and MCP on the EachMovie dataset using viewers who rated at least 200 movies (middle) and at least 100 movies (right). [sent-326, score-0.539]
85 Each point was assigned a rank y from the set {1, . [sent-335, score-0.207]
86 , 5} according to the following ranking rule, y = max_r {r : 10((x_1 - 0. [sent-338, score-0.432]
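The extraction truncates the exact generating rule, but the general recipe (score each point with a fixed non-linear function of its coordinates, add noise, and assign the largest rank whose threshold the score exceeds) can be sketched as below. The score 10(x_1 - 0.5)(x_2 - 0.5), the noise level, and the threshold values are stand-ins introduced for illustration, not the paper's actual constants.

```python
import numpy as np

def make_synthetic(T, thresholds, noise=0.1, seed=0):
    """Generate T points in [0, 1]^2 with ranks in {1, ..., len(thresholds) + 1}.

    thresholds : increasing values b_1 < ... < b_{k-1}; hypothetical stand-ins here.
    """
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(T, 2))
    score = 10.0 * (X[:, 0] - 0.5) * (X[:, 1] - 0.5) + noise * rng.standard_normal(T)
    # rank = 1 + number of thresholds the (noisy) score exceeds
    y = 1 + np.sum(score[:, None] > np.asarray(thresholds)[None, :], axis=1)
    return X, y
```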
87 We converted the real-valued predictions of WH into ranks by rounding each prediction to its closest rank value. [sent-346, score-0.319]
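A minimal sketch of that conversion (the rank values are passed in explicitly, so the same helper also covers shifted rating scales such as the one used for EachMovie below):

```python
def round_to_rank(prediction, rank_values):
    """Map a real-valued prediction to the closest value among the possible ranks."""
    return min(rank_values, key=lambda r: abs(r - prediction))
```

For example, round_to_rank(3.7, range(1, 6)) returns 4.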
88 WH initially suffers the smallest instantaneous loss but after about 500 rounds PRank achieves the best performance and eventually the number of ranking mistakes that PRank suffers is significantly lower than both WH and MCP. [sent-359, score-0.636]
89 This dataset is used for collaborative filtering tasks and contains ratings of movies provided by 61,265 people. [sent-361, score-0.439]
90 Each person in the dataset viewed a subset of movies from a collection of 1623 titles. [sent-362, score-0.229]
91 Each viewer rated each movie that she saw using one of 6 possible ratings: 0, 0. [sent-363, score-0.178]
92 We chose subsets of people who viewed a significant number of movies, extracting for evaluation people who rated at least 100 movies. [sent-368, score-0.287]
93 We chose at random one person among these viewers and set the person's ratings to be the target rank. [sent-370, score-0.249]
94 We used the ratings of all the rest of the people who viewed enough movies as features. [sent-371, score-0.32]
95 Thus, the goal is to learn to predict the "taste" of a random user using the user's past ratings as feedback and the ratings of fellow viewers as features. [sent-372, score-0.444]
96 The prediction rule associates a weight with each fellow viewer and therefore can be seen as learning correlations between the tastes of different viewers. [sent-373, score-0.252]
97 We subtracted 0.5 from each rating and therefore the possible ratings are -0. [sent-375, score-0.188]
98 Since we picked viewers who rated at least 100 movies, we were able to perform at least 100 rounds of online predictions and updates. [sent-384, score-0.397]
99 In this experiment, we ran PRank over the training data as an online algorithm and used its last hypothesis to rank unseen test data. [sent-395, score-0.366]
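A rough sketch of this protocol, reusing the prank_update sketch from Sec. 2: every row of the feature matrix holds the fellow viewers' ratings of one movie, the target user's ranks are revealed one movie at a time during training, and the last hypothesis ranks the held-out movies. The matrix names, the zero initialisation, and the dense representation are assumptions made for illustration.

```python
import numpy as np

def run_prank_experiment(R_train, y_train, R_test, k):
    """Online PRank on (R_train, y_train); rank R_test with the last hypothesis.

    R_train : (T, n) array, row t = fellow viewers' ratings of training movie t
    y_train : length-T array of the target user's ranks in {1, ..., k}
    R_test  : (M, n) array of held-out movies
    """
    n = R_train.shape[1]
    w = np.zeros(n)
    b = np.concatenate([np.zeros(k - 1), [np.inf]])
    ranking_loss = 0
    for x, y in zip(R_train, y_train):
        y_hat = prank_update(w, b, x, y)       # one online round (sketch above)
        ranking_loss += abs(y_hat - int(y))    # accumulated |predicted - true|
    test_ranks = [int(np.argmax(np.dot(w, x) - b < 0)) + 1 for x in R_test]
    return ranking_loss, test_ranks
```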
100 Acknowledgments Thanks to Sanjoy Dasgupta and Rob Schapire for numerous discussions on ranking problems and algorithms. [sent-397, score-0.432]
wordName wordTfidf (topN-words)
[('ranking', 0.432), ('prank', 0.37), ('br', 0.324), ('nt', 0.257), ('xt', 0.25), ('rank', 0.207), ('yt', 0.186), ('movies', 0.152), ('yr', 0.152), ('ratings', 0.142), ('mistake', 0.135), ('wh', 0.123), ('online', 0.115), ('rule', 0.104), ('thresholds', 0.098), ('mcp', 0.097), ('rounds', 0.095), ('wt', 0.091), ('bk', 0.081), ('ranks', 0.081), ('viewer', 0.078), ('viewers', 0.078), ('round', 0.077), ('bt', 0.077), ('get', 0.073), ('collaborative', 0.068), ('eachmovie', 0.068), ('bound', 0.066), ('wand', 0.064), ('maintains', 0.064), ('predicted', 0.062), ('rated', 0.057), ('instances', 0.052), ('totally', 0.051), ('vt', 0.05), ('perceptron', 0.05), ('instance', 0.049), ('dataset', 0.048), ('classification', 0.047), ('iiv', 0.046), ('rating', 0.046), ('algorithm', 0.044), ('movie', 0.043), ('user', 0.043), ('update', 0.041), ('coni', 0.039), ('fellow', 0.039), ('ilvt', 0.039), ('iyt', 0.039), ('tnt', 0.039), ('fig', 0.039), ('herbrich', 0.039), ('multiclass', 0.039), ('margin', 0.037), ('describe', 0.034), ('pseudocode', 0.034), ('crammer', 0.034), ('prove', 0.033), ('yl', 0.032), ('instantaneous', 0.032), ('integers', 0.031), ('yoram', 0.031), ('plugging', 0.031), ('prediction', 0.031), ('regression', 0.031), ('ordered', 0.03), ('definition', 0.03), ('correct', 0.029), ('norm', 0.029), ('person', 0.029), ('filtering', 0.029), ('schapire', 0.029), ('side', 0.029), ('denote', 0.029), ('fed', 0.028), ('illustration', 0.028), ('xl', 0.027), ('ir', 0.027), ('bl', 0.027), ('experiment', 0.027), ('loss', 0.027), ('people', 0.026), ('replace', 0.026), ('inequality', 0.026), ('least', 0.026), ('bounding', 0.026), ('builds', 0.026), ('social', 0.026), ('losses', 0.026), ('claim', 0.025), ('suffers', 0.025), ('analyze', 0.024), ('synthetic', 0.024), ('algorithms', 0.024), ('ranked', 0.024), ('accumulated', 0.024), ('modifying', 0.024), ('middle', 0.023), ('employs', 0.022), ('batch', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 147 nips-2001-Pranking with Ranking
Author: Koby Crammer, Yoram Singer
Abstract: We discuss the problem of ranking instances. In our framework each instance is associated with a rank or a rating, which is an integer from 1 to k. Our goal is to find a rank-prediction rule that assigns each instance a rank which is as close as possible to the instance's true rank. We describe a simple and efficient online algorithm, analyze its performance in the mistake bound model, and prove its correctness. We describe two sets of experiments, with synthetic data and with the EachMovie dataset for collaborative filtering. In the experiments we performed, our algorithm outperforms online algorithms for regression and classification applied to ranking. 1
2 0.31625378 22 nips-2001-A kernel method for multi-labelled classification
Author: André Elisseeff, Jason Weston
Abstract: This article presents a Support Vector Machine (SVM) like learning system to handle multi-label problems. Such problems are usually decomposed into many two-class problems but the expressive power of such a system can be weak [5, 7]. We explore a new direct approach. It is based on a large margin ranking system that shares a lot of common properties with SVMs. We tested it on a Yeast gene functional classification problem with positive results.
3 0.15289129 105 nips-2001-Kernel Machines and Boolean Functions
Author: Adam Kowalczyk, Alex J. Smola, Robert C. Williamson
Abstract: We give results about the learnability and required complexity of logical formulae to solve classification problems. These results are obtained by linking propositional logic with kernel machines. In particular we show that decision trees and disjunctive normal forms (DNF) can be represented by the help of a special kernel, linking regularized risk to separation margin. Subsequently we derive a number of lower bounds on the required complexity of logic formulae using properties of algorithms for generation of linear estimators, such as perceptron and maximal perceptron learning.
4 0.13116993 156 nips-2001-Rao-Blackwellised Particle Filtering via Data Augmentation
Author: Christophe Andrieu, Nando D. Freitas, Arnaud Doucet
Abstract: In this paper, we extend the Rao-Blackwellised particle filtering method to more complex hybrid models consisting of Gaussian latent variables and discrete observations. This is accomplished by augmenting the models with artificial variables that enable us to apply Rao-Blackwellisation. Other improvements include the design of an optimal importance proposal distribution and being able to swap the sampling and selection steps to handle outliers. We focus on sequential binary classifiers that consist of linear combinations of basis functions, whose coefficients evolve according to a Gaussian smoothness prior. Our results show significant improvements. 1
5 0.12002449 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning
Author: Evan Greensmith, Peter L. Bartlett, Jonathan Baxter
Abstract: We consider the use of two additive control variate methods to reduce the variance of performance gradient estimates in reinforcement learning problems. The first approach we consider is the baseline method, in which a function of the current state is added to the discounted value estimate. We relate the performance of these methods, which use sample paths, to the variance of estimates based on iid data. We derive the baseline function that minimizes this variance, and we show that the variance for any baseline is the sum of the optimal variance and a weighted squared distance to the optimal baseline. We show that the widely used average discounted value baseline (where the reward is replaced by the difference between the reward and its expectation) is suboptimal. The second approach we consider is the actor-critic method, which uses an approximate value function. We give bounds on the expected squared error of its estimates. We show that minimizing distance to the true value function is suboptimal in general; we provide an example for which the true value function gives an estimate with positive variance, but the optimal value function gives an unbiased estimate with zero variance. Our bounds suggest algorithms to estimate the gradient of the performance of parameterized baseline or value functions. We present preliminary experiments that illustrate the performance improvements on a simple control problem.
6 0.094945483 139 nips-2001-Online Learning with Kernels
7 0.075631328 180 nips-2001-The Concave-Convex Procedure (CCCP)
8 0.074401125 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms
9 0.073786847 143 nips-2001-PAC Generalization Bounds for Co-training
10 0.067114986 171 nips-2001-Spectral Relaxation for K-means Clustering
11 0.06566707 56 nips-2001-Convolution Kernels for Natural Language
12 0.063054658 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data
13 0.061335042 190 nips-2001-Thin Junction Trees
14 0.058502182 52 nips-2001-Computing Time Lower Bounds for Recurrent Sigmoidal Neural Networks
15 0.05770272 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
16 0.055923555 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces
17 0.055304818 24 nips-2001-Active Information Retrieval
18 0.054543622 1 nips-2001-(Not) Bounding the True Error
19 0.052678622 8 nips-2001-A General Greedy Approximation Algorithm with Applications
20 0.052635241 66 nips-2001-Efficiency versus Convergence of Boolean Kernels for On-Line Learning Algorithms
topicId topicWeight
[(0, -0.163), (1, 0.065), (2, 0.015), (3, 0.073), (4, 0.071), (5, -0.111), (6, 0.01), (7, 0.097), (8, -0.126), (9, -0.039), (10, -0.054), (11, -0.017), (12, -0.021), (13, 0.15), (14, 0.09), (15, 0.073), (16, 0.106), (17, 0.322), (18, 0.029), (19, -0.049), (20, 0.028), (21, 0.161), (22, 0.165), (23, 0.062), (24, 0.046), (25, -0.339), (26, 0.133), (27, -0.112), (28, 0.018), (29, 0.043), (30, -0.06), (31, -0.037), (32, -0.101), (33, -0.065), (34, 0.122), (35, -0.075), (36, -0.155), (37, -0.097), (38, -0.178), (39, 0.161), (40, 0.072), (41, -0.032), (42, 0.067), (43, 0.083), (44, 0.061), (45, -0.059), (46, -0.077), (47, -0.065), (48, 0.019), (49, -0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.96430629 147 nips-2001-Pranking with Ranking
Author: Koby Crammer, Yoram Singer
Abstract: We discuss the problem of ranking instances. In our framework each instance is associated with a rank or a rating, which is an integer from 1 to k. Our goal is to find a rank-prediction rule that assigns each instance a rank which is as close as possible to the instance's true rank. We describe a simple and efficient online algorithm, analyze its performance in the mistake bound model, and prove its correctness. We describe two sets of experiments, with synthetic data and with the EachMovie dataset for collaborative filtering. In the experiments we performed, our algorithm outperforms online algorithms for regression and classification applied to ranking. 1
2 0.73829377 22 nips-2001-A kernel method for multi-labelled classification
Author: André Elisseeff, Jason Weston
Abstract: This article presents a Support Vector Machine (SVM) like learning system to handle multi-label problems. Such problems are usually decomposed into many two-class problems but the expressive power of such a system can be weak [5, 7]. We explore a new direct approach. It is based on a large margin ranking system that shares a lot of common properties with SVMs. We tested it on a Yeast gene functional classification problem with positive results.
3 0.43298027 156 nips-2001-Rao-Blackwellised Particle Filtering via Data Augmentation
Author: Christophe Andrieu, Nando D. Freitas, Arnaud Doucet
Abstract: In this paper, we extend the Rao-Blackwellised particle filtering method to more complex hybrid models consisting of Gaussian latent variables and discrete observations. This is accomplished by augmenting the models with artificial variables that enable us to apply Rao-Blackwellisation. Other improvements include the design of an optimal importance proposal distribution and being able to swap the sampling an selection steps to handle outliers. We focus on sequential binary classifiers that consist of linear combinations of basis functions , whose coefficients evolve according to a Gaussian smoothness prior. Our results show significant improvements. 1
4 0.34662393 105 nips-2001-Kernel Machines and Boolean Functions
Author: Adam Kowalczyk, Alex J. Smola, Robert C. Williamson
Abstract: We give results about the learnability and required complexity of logical formulae to solve classification problems. These results are obtained by linking propositional logic with kernel machines. In particular we show that decision trees and disjunctive normal forms (DNF) can be represented by the help of a special kernel, linking regularized risk to separation margin. Subsequently we derive a number of lower bounds on the required complexity of logic formulae using properties of algorithms for generation of linear estimators, such as perceptron and maximal perceptron learning.
5 0.33701184 180 nips-2001-The Concave-Convex Procedure (CCCP)
Author: Alan L. Yuille, Anand Rangarajan
Abstract: We introduce the Concave-Convex procedure (CCCP) which constructs discrete time iterative dynamical systems which are guaranteed to monotonically decrease global optimization/energy functions. It can be applied to (almost) any optimization problem and many existing algorithms can be interpreted in terms of CCCP. In particular, we prove relationships to some applications of Legendre transform techniques. We then illustrate CCCP by applications to Potts models, linear assignment, EM algorithms, and Generalized Iterative Scaling (GIS). CCCP can be used both as a new way to understand existing optimization algorithms and as a procedure for generating new algorithms. 1
6 0.31182995 139 nips-2001-Online Learning with Kernels
7 0.29667777 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning
8 0.25398129 64 nips-2001-EM-DD: An Improved Multiple-Instance Learning Technique
9 0.22083467 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms
10 0.21883476 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data
11 0.20444684 159 nips-2001-Reducing multiclass to binary by coupling probability estimates
12 0.20363842 26 nips-2001-Active Portfolio-Management based on Error Correction Neural Networks
13 0.20193166 190 nips-2001-Thin Junction Trees
14 0.19741049 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes
15 0.19052139 120 nips-2001-Minimax Probability Machine
16 0.18957391 56 nips-2001-Convolution Kernels for Natural Language
17 0.1867319 171 nips-2001-Spectral Relaxation for K-means Clustering
18 0.18177544 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces
19 0.18148912 114 nips-2001-Learning from Infinite Data in Finite Time
20 0.17868045 149 nips-2001-Probabilistic Abstraction Hierarchies
topicId topicWeight
[(12, 0.339), (14, 0.058), (17, 0.022), (19, 0.047), (27, 0.139), (30, 0.048), (38, 0.017), (59, 0.05), (72, 0.047), (79, 0.022), (83, 0.014), (91, 0.101)]
simIndex simValue paperId paperTitle
same-paper 1 0.80551165 147 nips-2001-Pranking with Ranking
Author: Koby Crammer, Yoram Singer
Abstract: We discuss the problem of ranking instances. In our framework each instance is associated with a rank or a rating, which is an integer from 1 to k. Our goal is to find a rank-prediction rule that assigns each instance a rank which is as close as possible to the instance's true rank. We describe a simple and efficient online algorithm, analyze its performance in the mistake bound model, and prove its correctness. We describe two sets of experiments, with synthetic data and with the EachMovie dataset for collaborative filtering. In the experiments we performed, our algorithm outperforms online algorithms for regression and classification applied to ranking. 1
2 0.70385855 114 nips-2001-Learning from Infinite Data in Finite Time
Author: Pedro Domingos, Geoff Hulten
Abstract: We propose the following general method for scaling learning algorithms to arbitrarily large data sets. Consider the model M_n̄ learned by the algorithm using n_i examples in step i (n̄ = (n_1, ..., n_m)), and the model M_∞ that would be learned using infinite examples. Upper-bound the loss L(M_n̄, M_∞) between them as a function of n̄, and then minimize the algorithm's time complexity f(n̄) subject to the constraint that L(M_∞, M_n̄) be at most ε with probability at most δ. We apply this method to the EM algorithm for mixtures of Gaussians. Preliminary experiments on a series of large data sets provide evidence of the potential of this approach.
3 0.5942027 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines
Author: Carlotta Domeniconi, Dimitrios Gunopulos
Abstract: The nearest neighbor technique is a simple and appealing method to address classification problems. It relies on the assumption of locally constant class conditional probabilities. This assumption becomes invalid in high dimensions with a finite number of examples due to the curse of dimensionality. We propose a technique that computes a locally flexible metric by means of Support Vector Machines (SVMs). The maximum margin boundary found by the SVM is used to determine the most discriminant direction over the query's neighborhood. Such direction provides a local weighting scheme for input features. We present experimental evidence of classification performance improvement over the SVM algorithm alone and over a variety of adaptive learning schemes, by using both simulated and real data sets. 1
4 0.50232303 137 nips-2001-On the Convergence of Leveraging
Author: Gunnar Rätsch, Sebastian Mika, Manfred K. Warmuth
Abstract: We give an unified convergence analysis of ensemble learning methods including e.g. AdaBoost, Logistic Regression and the Least-SquareBoost algorithm for regression. These methods have in common that they iteratively call a base learning algorithm which returns hypotheses that are then linearly combined. We show that these methods are related to the Gauss-Southwell method known from numerical optimization and state non-asymptotical convergence results for all these methods. Our analysis includes -norm regularized cost functions leading to a clean and general way to regularize ensemble learning.
5 0.50164491 8 nips-2001-A General Greedy Approximation Algorithm with Applications
Author: T. Zhang
Abstract: Greedy approximation algorithms have been frequently used to obtain sparse solutions to learning problems. In this paper, we present a general greedy algorithm for solving a class of convex optimization problems. We derive a bound on the rate of approximation for this algorithm, and show that our algorithm includes a number of earlier studies as special cases.
6 0.49946016 13 nips-2001-A Natural Policy Gradient
7 0.49945644 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family
8 0.49768043 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
9 0.49756327 103 nips-2001-Kernel Feature Spaces and Nonlinear Blind Souce Separation
10 0.49611393 127 nips-2001-Multi Dimensional ICA to Separate Correlated Sources
11 0.49603814 197 nips-2001-Why Neuronal Dynamics Should Control Synaptic Learning Rules
12 0.49515465 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines
13 0.49424517 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms
14 0.49215394 74 nips-2001-Face Recognition Using Kernel Methods
15 0.49152508 58 nips-2001-Covariance Kernels from Bayesian Generative Models
16 0.49112248 44 nips-2001-Blind Source Separation via Multinode Sparse Representation
17 0.49044555 190 nips-2001-Thin Junction Trees
18 0.48751044 84 nips-2001-Global Coordination of Local Linear Models
19 0.48683411 155 nips-2001-Quantizing Density Estimators
20 0.4863584 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes