nips nips2004 nips2004-130 knowledge-graph by maker-knowledge-mining

130 nips-2004-Newscast EM


Source: pdf

Author: Wojtek Kowalczyk, Nikos A. Vlassis

Abstract: We propose a gossip-based distributed algorithm for Gaussian mixture learning, Newscast EM. The algorithm operates on network topologies where each node observes a local quantity and can communicate with other nodes in an arbitrary point-to-point fashion. The main difference between Newscast EM and the standard EM algorithm is that the M-step in our case is implemented in a decentralized manner: (random) pairs of nodes repeatedly exchange their local parameter estimates and combine them by (weighted) averaging. We provide theoretical evidence and demonstrate experimentally that, under this protocol, nodes converge exponentially fast to the correct estimates in each M-step of the EM algorithm. 1

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 nl Abstract We propose a gossip-based distributed algorithm for Gaussian mixture learning, Newscast EM. [sent-5, score-0.162]

2 The algorithm operates on network topologies where each node observes a local quantity and can communicate with other nodes in an arbitrary point-to-point fashion. [sent-6, score-0.667]

3 The main difference between Newscast EM and the standard EM algorithm is that the M-step in our case is implemented in a decentralized manner: (random) pairs of nodes repeatedly exchange their local parameter estimates and combine them by (weighted) averaging. [sent-7, score-0.598]

4 We provide theoretical evidence and demonstrate experimentally that, under this protocol, nodes converge exponentially fast to the correct estimates in each M-step of the EM algorithm. [sent-8, score-0.36]

5 1 Introduction Advances in network technology, like peer-to-peer networks on the Internet or sensor networks, have highlighted the need for efficient ways to deal with large amounts of data that are distributed over a set of nodes. [sent-9, score-0.135]

6 Ideally, we would like to have a fully decentralized algorithm that computes and disseminates aggregates of the data, with minimal processing and communication requirements and good fault-tolerant behavior. [sent-12, score-0.14]

7 Roughly, in a gossip-based protocol each node repeatedly contacts some other node at random and the two nodes exchange information. [sent-14, score-1.051]

8 In this paper we propose a gossip-based, fully decentralized implementation of the Expectation-Maximization (EM) algorithm for Gaussian mixture learning [6]. [sent-17, score-0.189]

9 Our algorithm, which we call ‘Newscast EM’, assumes a set of data {xi } that are drawn independently from a common Gaussian mixture and are distributed over the nodes of a network (one data point per node). [sent-18, score-0.369]

10 Newscast EM utilizes a gossip-based protocol in its M-step to learn a global Gaussian mixture model p(x) from the data. [sent-19, score-0.297]

11 Each node starts with a local estimate of the model parameters. [sent-21, score-0.33]

12 Then, in every cycle, each node contacts some other node that is chosen at random from a list of known nodes, and the two nodes replace their local model estimates by their (weighted) averages. [sent-22, score-0.954]

13 As we show below, under such a protocol the (erroneous) local models of the individual nodes converge exponentially fast to the (correct) global model in each M-step of the algorithm. [sent-23, score-0.479]

14 A disadvantage of that approach is that only one node is carrying out computations at any time step, whereas in Newscast EM all nodes are running the same protocol in parallel. [sent-26, score-0.611]

15 , xn } of independent and identically distributed samples from p(x), the learning task is to estimate the parameter vector θ = {πs , ms , Cs }k of the k components that maximizes the logs=1 n likelihood L = i=1 log p(xi ; θ). [sent-33, score-0.241]

16 , n, where each qi (s) corresponds to a data point xi and defines an arbitrary discrete distribution over s. [sent-41, score-0.215]

17 This lower bound is given by: n F= k i=1 s=1 qi (s) log p(xi , s; θ) − log qi (s) . [sent-42, score-0.404]

18 (2) In the E-step of the EM algorithm, the responsibility qi (s) for each point xi is set to the Bayes posterior p(s|xi ) given the parameters found in the previous step. [sent-43, score-0.249]

19 In the M-step we solve for the unknown parameters of the mixture by maximizing F for fixed qi (s). [sent-44, score-0.236]

20 This yields the following updates: πs = n i=1 qi (s) n , ms = n i=1 qi (s)xi nπs , Cs = n i=1 qi (s)xi xi nπs −ms ms . [sent-45, score-0.743]

21 (3) Note that the main operation of the M-step is averaging: πs is the average of qi (s), ms is the average of products qi (s)xi (divided by πs ), and the covariance matrix Cs is the average of matrices qi (s)xi xi (divided by πs and decreased by ms ms ). [sent-46, score-0.899]

22 3 Newscast computing and averaging The proposed distributed EM algorithm for Gaussian mixture learning relies on the use of the Newscast protocol for distributed computing [3]. [sent-48, score-0.5]

23 The protocol is very robust, scalable, and simple to implement—its Java implementation is only a few kBytes of code and can run on small network-enabled computing devices such as mobile phones, PDAs, or sensors. [sent-50, score-0.189]

24 , vn are stored in the nodes of a network, one value per node. [sent-55, score-0.182]

25 Moreover suppose that each node n 1 knows the addresses of all other nodes. [sent-56, score-0.271]

26 To compute µ = n i=1 vi , each node i initially sets µi = vi as its local estimate of µ, and then runs the following protocol for a number of cycles: Uniform Newscast (for node i) 1. [sent-57, score-0.763]

27 Contact a node j = f (i) that is chosen uniformly at random from 1, . [sent-58, score-0.24]

28 Nodes i and j update their estimates as follows: µi = µj = (µi + µj )/2. [sent-63, score-0.13]

29 For the purpose of analysis we will assume that in each cycle every node initiates a single contact (but in practice the algorithm can be fully asynchronous). [sent-64, score-0.424]

30 Note that the mean of the local estimates {µi } is always the correct mean µ, while for their variance holds: Lemma 1. [sent-65, score-0.28]

31 In each cycle of uniform Newscast the variance of the local estimates drops on 1 the average by factor λ, with λ ≤ 2√e . [sent-66, score-0.426]

32 1 Let Φt = i=1 (µi − µ)2 be the unnormalized variance of the local estimates µi at cycle t. [sent-68, score-0.319]

33 Suppose, without loss of generality, that within cycle t nodes initiate contacts in the order 1, 2, . [sent-69, score-0.351]

34 2 e (8) Thus after t cycles of uniform Newscast, the original variance φ0 of the local estimates is √ reduced on the average to φt ≤ φ0 /(2 e)t . [sent-78, score-0.524]

35 The fact that the variance drops at an exponential rate means that the nodes learn the correct average very fast. [sent-79, score-0.346]

36 Indeed, using Chebyshev’s inequality Pt [|µi − µ| ≥ ε] ≤ φt /(nε2 ) we see that for any ε > 0, the probability that some node makes an estimation error larger than ε is dropping exponentially fast with the number of cycles t. [sent-80, score-0.483]

37 In particular, we can derive a bound on the number of cycles that are needed in order to guarantee with high probability that all nodes know the correct answer with some specific accuracy: Theorem 1. [sent-81, score-0.471]

38 581(log n + 2 log σ + 2 log 1 + log 1 ) cycles ε δ of uniform Newscast holds maxi |µi − µ| ≤ ε, for any ε > 0 and data variance σ 2 . [sent-83, score-0.385]

39 For example, for unit-variance data and a network with n = 104 nodes we need 49 cycles to guarantee with probability 95% that each node is within 10−10 from the correct answer. [sent-90, score-0.729]

40 Note that in uniform Newscast, each node in the network is assumed to know the addresses of all other nodes, and therefore can choose in each cycle one node uniformly at random to exchange data with. [sent-91, score-0.783]

41 In practice, however, each node can only have a limited cache of addresses of other nodes. [sent-92, score-0.372]

42 In this case, the averaging algorithm is modified as follows: Non-uniform Newscast (for node i) 1. [sent-93, score-0.333]

43 Contact a node j = f (i) that is appropriately chosen from i’s local cache. [sent-94, score-0.324]

44 In our experiments we adopted the protocol of [10] which roughly operates as follows. [sent-100, score-0.209]

45 Each entry k in node’s i cache contains an ‘age’ attribute that indicates the number of cycles that have been elapsed since node k created that entry. [sent-101, score-0.545]

46 In step 1 above, node i contacts the node j with the largest age from i’s cache, and increases by one the age of all other entries in i’s cache. [sent-102, score-0.684]

47 Then node i exchanges estimates with node j as in step 2. [sent-103, score-0.647]

48 In step 3, both nodes i and j select a random subset of their cache entries and mutually exchange them, filling empty slots and discarding self-pointers and duplicates. [sent-104, score-0.389]

49 Finally node i creates an entry with i’s address in it and age zero, which is added in j’s cache. [sent-105, score-0.288]

50 The resulting protocol is particularly effective and, as we show in the experiments below, in some cases it even outperforms the uniform Newscast. [sent-106, score-0.25]

51 4 The Newscast EM algorithm Newscast EM (NEM) is a gossip-based distributed implementation of the EM algorithm for Gaussian mixture learning, that applies to the following setting. [sent-108, score-0.201]

52 We are given a set of data {xi } that are distributed over the nodes of a network (one data point per node). [sent-109, score-0.301]

53 The data are assumed independent samples from a common k-component Gaussian mixture p(x) with (unknown) parameters θ = {πs , ms , Cs }k . [sent-110, score-0.164]

54 The task is to learn the parameters of s=1 the mixture with maximum likelihood in a decentralized manner: that is, all learning steps should be performed locally at the nodes, and they should involve as little communication as possible. [sent-111, score-0.219]

55 The NEM algorithm is a direct application of the averaging protocol of Section 3 for estimating the parameters θ of p(x) using the EM updates (3). [sent-112, score-0.282]

56 The E-step of NEM is identical to the E-step of the standard EM algorithm, and it can be performed by all nodes in parallel. [sent-113, score-0.198]

57 The novel characteristic of NEM is the M-step which is implemented as a sequence of gossip-based cycles: At the beginning of each M-step, each node i starts with a local estimate θi of the ‘correct’ parameter vector θ (correct according to EM and for the current EM iteration). [sent-114, score-0.349]

58 Then, in every cycle, each node contacts some other node at random, and the two nodes replace their local estimates θi by their (weighted) averages. [sent-115, score-0.954]

59 At the end of the M-step each node has converged (within machine precision) to the correct parameter θ. [sent-116, score-0.318]

60 ˜ To simplify notation, we will denote by θi = {πsi , msi , Csi } the local estimates of node i ˜ for the parameters of component s at any point of the algorithm. [sent-117, score-0.666]

61 The complete algorithm, which runs identically defined such that Csi = C si si and in parallel for each node, is as follows: Newscast EM (for node i) 1. [sent-119, score-0.568]

62 Set qi (s) to some random positive value and then normalize all qi (s) to sum to 1 over all s. [sent-121, score-0.336]

63 Initialize i’s local estimates for each component s as follows: πsi = qi (s), ˜ msi = xi , Csi = xi xi . [sent-124, score-0.735]

64 Nodes i and j update their local estimates for each component s as follows: πsi + πsj , 2 πsi msi + πsj msj msi = msj = , πsi + πsj ˜ ˜ πsi Csi + πsj Csj ˜ ˜ . [sent-128, score-0.732]

65 Compute new responsibilities qi (s) = p(s|xi ) for each component s using the ˜ M-step estimates πsi , msi , and Csi = Csi − msi msi . [sent-133, score-1.015]

66 Go to step 2, unless a stopping criterion is satisfied that involves the parameter estimates themselves or the energy F. [sent-136, score-0.172]

67 Similarly, a stopping criterion involving the parameter estimates can be implemented locally if each node caches its estimates from the previous EM-iteration. [sent-139, score-0.659]

68 Given that all nodes agree on the number τ of Newscast cycles in the M-step, and assuming that τ is large enough to guarantee convergence to the correct parameter estimates, the complete NEM algorithm can be performed identically and in parallel by all nodes in the network. [sent-141, score-0.73]

69 It is easy to see that at any cycle of an M-step, and for any component s, the weighted 2 Note that F is a sum of local terms, and thus it can also be computed using the same protocol. [sent-142, score-0.192]

70 averages over all nodes of the local estimates are always the EM-correct estimates, i. [sent-143, score-0.394]

71 , n i=1 πsi msi n i=1 πsi = ms (12) ˜ and similarly for the Csi . [sent-145, score-0.318]

72 Moreover, note that the weighted averages of the msi in (10) ˜si in (11), with weights given by (9), can be written as unweighted averages of and the C ˜ the corresponding products πsi msi and πsi Csi . [sent-146, score-0.526]

73 In other words, each local estimate can be written as the ratio of two local estimates that converge to the correct values at the same exponential rate (as shown in the previous section). [sent-147, score-0.311]

74 In every M-step of Newscast EM, each node converges exponentially fast to the correct parameter estimates for each component of the mixture. [sent-149, score-0.459]

75 Similarly, the number of cycles τ for each M-step can be chosen according to Theorem 1. [sent-150, score-0.204]

76 However, note that in every M-step each node has to wait τ cycles before its local estimates have converged, and only then can it use these estimates in a new next E-step. [sent-151, score-0.74]

77 We describe here a modification of NEM that allows a node to run a local E-step before its M-step has converged. [sent-152, score-0.308]

78 This ‘partial’ NEM algorithm is based on the following ‘self-correction’ idea: instead of waiting until the M-step converges, after a small number of cycles each node runs a local E-step, adjusts its responsibilities, and propagates appropriate corrections through the network. [sent-153, score-0.611]

79 Such a scheme additionally requires that each node caches its responsibilities from the previous E-step, denoted by qi (s). [sent-154, score-0.538]

80 ˜ C ˜ (13) (14) (15) After these corrections, the Newscast averaging protocol is executed for a number of cycles (smaller than the number τ of the ‘full’ NEM). [sent-156, score-0.467]

81 These corrections may increase the variance of the local estimates, but in most cases the corresponding increase of variance is relatively small. [sent-157, score-0.236]

82 1 we demonstrate the the performance of uniform and non-uniform Newscast in typical averaging tasks involving zero-mean unit-variance data. [sent-161, score-0.171]

83 1(left) we plot the variance reduction rate λ (mean and one standard deviation for 50 runs) as a function of the number of cycles, for averaging problems involving n = 105 data. [sent-163, score-0.211]

84 Moreover note that in non-uniform Newscast the variance drops faster than in uniform Newscast. [sent-166, score-0.144]

85 This is due to the fact that the dynamic cache exchange scheme of [10] results in in-degree network distributions that are very peaked around the cache size. [sent-167, score-0.333]

86 In practice this means that on the average each node is 3 This would require, for instance, that individual nodes have estimates of the total variance over the network, which is not obvious how it can be done. [sent-168, score-0.613]

87 5 5 10 15 20 25 Number of cycles 30 35 40 39 3 10 4 5 10 10 6 10 Number of nodes Figure 1: (Left) Variance reduction rate of uniform and non-uniform Newscast, in averaging tasks involving n = 105 nodes. [sent-182, score-0.601]

88 (Right) Number of cycles to achieve convergence within ε = 10−10 for unit-variance datasets of various sizes. [sent-183, score-0.223]

89 equally often contacted to by other nodes in each cycle of the protocol. [sent-184, score-0.262]

90 We also observed that the variance reduction rate is on the average unaffected by the network size, while larger networks result in smaller deviations. [sent-185, score-0.165]

91 1(right) we plot the number of cycles that are required to achieve model accuracy at all nodes within ε = 10−10 as a function of the network size. [sent-188, score-0.43]

92 We also performed some experiments with the ‘partial’ NEM, where it turned out that in most cases we could obtain the same model accuracy with a much smaller number of cycles (5–10 times than the ‘full’ NEM), but in some cases the algorithm did not converge. [sent-191, score-0.239]

93 6 Summary and extensions We presented Newscast EM, a distributed gossip-based implementation of the EM algorithm for learning Gaussian mixture models. [sent-192, score-0.162]

94 Newscast EM applies on networks where each one of a (large) number of nodes observes a local quantity, and can communicate with other nodes in a point-to-point fashion. [sent-193, score-0.527]

95 Here we have assumed that each node in the network observes one data point. [sent-198, score-0.336]

96 We can easily generalize this to situations where each node observes (and stores) a collection of points, like in [8]. [sent-199, score-0.292]

97 Another interesting extension is to replace the averaging step 2 of uniform and non-uniform Newscast with weighted averaging (for some choice of weights), and study the variance reduction rate in this case. [sent-201, score-0.372]

98 Another interesting problem is when the E-step cannot be performed locally at a node but it requires distributing some information over the network. [sent-202, score-0.287]

99 This could be the case, for instance, when each node observes only a few elements of a vector-valued quantity while, for instance, all nodes together observe the complete sample. [sent-203, score-0.493]

100 Finally, it would be interesting to investigate the applicability of the Newscast protocol in problems involving distributed inference/learning in more general graphical models [12]. [sent-205, score-0.3]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('newscast', 0.665), ('node', 0.24), ('msi', 0.222), ('csi', 0.205), ('cycles', 0.204), ('protocol', 0.189), ('nodes', 0.182), ('nem', 0.171), ('qi', 0.168), ('em', 0.153), ('si', 0.136), ('estimates', 0.114), ('cache', 0.101), ('ms', 0.096), ('contacts', 0.089), ('exchange', 0.087), ('caches', 0.085), ('decentralized', 0.085), ('cycle', 0.08), ('distributed', 0.075), ('averaging', 0.074), ('local', 0.068), ('mixture', 0.068), ('contact', 0.068), ('cs', 0.064), ('uniform', 0.061), ('variance', 0.057), ('corrections', 0.054), ('sj', 0.053), ('observes', 0.052), ('vlassis', 0.051), ('age', 0.048), ('xi', 0.047), ('protocols', 0.045), ('amsterdam', 0.045), ('responsibilities', 0.045), ('network', 0.044), ('correct', 0.041), ('involving', 0.036), ('csj', 0.034), ('exchanges', 0.034), ('kempe', 0.034), ('kowalczyk', 0.034), ('msj', 0.034), ('responsibility', 0.034), ('universiteit', 0.034), ('vrije', 0.034), ('wojtek', 0.034), ('locally', 0.031), ('addresses', 0.031), ('identically', 0.03), ('averages', 0.03), ('verbeek', 0.03), ('dutch', 0.03), ('management', 0.029), ('gaussian', 0.029), ('bound', 0.026), ('drops', 0.026), ('runs', 0.026), ('netherlands', 0.025), ('repeatedly', 0.024), ('internet', 0.024), ('reduction', 0.024), ('exponentially', 0.023), ('runtime', 0.023), ('communicate', 0.023), ('utilizes', 0.023), ('starts', 0.022), ('component', 0.022), ('weighted', 0.022), ('replace', 0.021), ('log', 0.021), ('average', 0.02), ('operates', 0.02), ('stopping', 0.02), ('rate', 0.02), ('applies', 0.02), ('van', 0.02), ('parameter', 0.019), ('step', 0.019), ('algorithm', 0.019), ('communication', 0.019), ('quantity', 0.019), ('membership', 0.019), ('convergence', 0.019), ('converged', 0.018), ('instance', 0.018), ('batch', 0.018), ('guarantee', 0.018), ('global', 0.017), ('implements', 0.017), ('lemma', 0.017), ('foundations', 0.017), ('nes', 0.017), ('fully', 0.017), ('performed', 0.016), ('sensor', 0.016), ('appropriately', 0.016), ('inequality', 0.016), ('update', 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 130 nips-2004-Newscast EM

Author: Wojtek Kowalczyk, Nikos A. Vlassis

Abstract: We propose a gossip-based distributed algorithm for Gaussian mixture learning, Newscast EM. The algorithm operates on network topologies where each node observes a local quantity and can communicate with other nodes in an arbitrary point-to-point fashion. The main difference between Newscast EM and the standard EM algorithm is that the M-step in our case is implemented in a decentralized manner: (random) pairs of nodes repeatedly exchange their local parameter estimates and combine them by (weighted) averaging. We provide theoretical evidence and demonstrate experimentally that, under this protocol, nodes converge exponentially fast to the correct estimates in each M-step of the EM algorithm. 1

2 0.079090685 82 nips-2004-Incremental Algorithms for Hierarchical Classification

Author: Nicolò Cesa-bianchi, Claudio Gentile, Andrea Tironi, Luca Zaniboni

Abstract: We study the problem of hierarchical classification when labels corresponding to partial and/or multiple paths in the underlying taxonomy are allowed. We introduce a new hierarchical loss function, the H-loss, implementing the simple intuition that additional mistakes in the subtree of a mistaken class should not be charged for. Based on a probabilistic data model introduced in earlier work, we derive the Bayes-optimal classifier for the H-loss. We then empirically compare two incremental approximations of the Bayes-optimal classifier with a flat SVM classifier and with classifiers obtained by using hierarchical versions of the Perceptron and SVM algorithms. The experiments show that our simplest incremental approximation of the Bayes-optimal classifier performs, after just one training epoch, nearly as well as the hierarchical SVM classifier (which performs best). For the same incremental algorithm we also derive an H-loss bound showing, when data are generated by our probabilistic data model, exponentially fast convergence to the H-loss of the hierarchical classifier based on the true model parameters. 1 Introduction and basic definitions We study the problem of classifying data in a given taxonomy of labels, where the taxonomy is specified as a tree forest. We assume that every data instance is labelled with a (possibly empty) set of class labels called multilabel, with the only requirement that multilabels including some node i in the taxonony must also include all ancestors of i. Thus, each multilabel corresponds to the union of one or more paths in the forest, where each path must start from a root but it can terminate on an internal node (rather than a leaf). Learning algorithms for hierarchical classification have been investigated in, e.g., [8, 9, 10, 11, 12, 14, 15, 17, 20]. However, the scenario where labelling includes multiple and partial paths has received very little attention. The analysis in [5], which is mainly theoretical, shows in the multiple and partial path case a 0/1-loss bound for a hierarchical learning algorithm based on regularized least-squares estimates. In this work we extend [5] in several ways. First, we introduce a new hierarchical loss function, the H-loss, which is better suited than the 0/1-loss to analyze hierarchical classification tasks, and we derive the corresponding Bayes-optimal classifier under the parametric data model introduced in [5]. Second, considering various loss functions, including the H-loss, we empirically compare the performance of the following three incremental kernel-based ∗ This work was supported in part by the PASCAL Network of Excellence under EC grant no. 506778. This publication only reflects the authors’ views. algorithms: 1) a hierarchical version of the classical Perceptron algorithm [16]; 2) an approximation to the Bayes-optimal classifier; 3) a simplified variant of this approximation. Finally, we show that, assuming data are indeed generated according to the parametric model mentioned before, the H-loss of the algorithm in 3) converges to the H-loss of the classifier based on the true model parameters. Our incremental algorithms are based on training linear-threshold classifiers in each node of the taxonomy. A similar approach has been studied in [8], though their model does not consider multiple-path classifications as we do. Incremental algorithms are the main focus of this research, since we strongly believe that they are a key tool for coping with tasks where large quantities of data items are generated and the classification system needs to be frequently adjusted to keep up with new items. However, we found it useful to provide a reference point for our empirical results. Thus we have also included in our experiments the results achieved by nonincremental algorithms. In particular, we have chosen a flat and a hierarchical version of SVM [21, 7, 19], which are known to perform well on the textual datasets considered here. We assume data elements are encoded as real vectors x ∈ Rd which we call instances. A multilabel for an instance x is any subset of the set {1, . . . , N } of all labels/classes, including the empty set. We denote the multilabel associated with x by a vector y = (y1 , . . . , yN ) ∈ {0, 1}N , where i belongs to the multilabel of x if and only if yi = 1. A taxonomy G is a forest whose trees are defined over the set of labels. A multilabel y ∈ {0, 1}N is said to respect a taxonomy G if and only if y is the union of one or more paths in G, where each path starts from a root but need not terminate on a leaf. See Figure 1. We assume the data-generating mechanism produces examples (x, y) such that y respects some fixed underlying taxonomy G with N nodes. The set of roots in G is denoted by root(G). We use par(i) to denote the unique parent of node i, anc(i) to denote the set of ancestors of i, and sub(i) to denote the set of nodes in the subtree rooted at i (including i). Finally, given a predicate φ over a set Ω, we will use {φ} to denote both the subset of Ω where φ is true and the indicator function of this subset. 2 The H-loss Though several hierarchical losses have been proposed in the literature (e.g., in [11, 20]), no one has emerged as a standard yet. Since hierarchical losses are defined over multilabels, we start by considering two very simple functions measuring the discrepancy between multilabels y = (y1 , ..., yN ) and y = (y1 , ..., yN ): the 0/1-loss 0/1 (y, y) = {∃i : yi = yi } and the symmetric difference loss ∆ (y, y) = {y1 = y1 } + . . . + {yN = yN }. There are several ways of making these losses depend on a given taxonomy G. In this work, we follow the intuition “if a mistake is made at node i, then further mistakes made in the subtree rooted at i are unimportant”. That is, we do not require the algorithm be able to make fine-grained distinctions on tasks when it is unable to make coarse-grained ones. For example, if an algorithm failed to label a document with the class SPORTS, then the algorithm should not be charged more loss because it also failed to label the same document with the subclass SOCCER and the sub-subclass CHAMPIONS LEAGUE. A function implementing this intuition is defined by N H (y, y) = i=1 ci {yi = yi ∧ yj = yj , j ∈ anc(i)}, where c1 , . . . , cN > 0 are fixed cost coefficients. This loss, which we call H-loss, can also be described as follows: all paths in G from a root down to a leaf are examined and, whenever we encounter a node i such that yi = yi , we add ci to the loss, whereas all the loss contributions in the subtree rooted at i are discarded. Note that if c1 = . . . = cN = 1 then 0/1 ≤ H ≤ ∆ . Choices of ci depending on the structure of G are proposed in Section 4. Given a multilabel y ∈ {0, 1}N define its G-truncation as the multilabel y = (y1 , ..., yN ) ∈ {0, 1}N where, for each i = 1, . . . , N , yi = 1 iff yi = 1 and yj = 1 for all j ∈ anc(i). Note that the G-truncation of any multilabel always respects G. A graphical (a) (b) (c) (d) Figure 1: A one-tree forest (repeated four times). Each node corresponds to a class in the taxonomy G, hence in this case N = 12. Gray nodes are included in the multilabel under consideration, white nodes are not. (a) A generic multilabel which does not respect G; (b) its G-truncation. (c) A second multilabel that respects G. (d) Superposition of multilabel (b) on multilabel (c): Only the checked nodes contribute to the H-loss between (b) and (c). representation of the notions introduced so far is given in Figure 1. In the next lemma we show that whenever y respects G, then H (y, y) cannot be smaller than H (y , y). In other words, when the multilabel y to be predicted respects a taxonomy G then there is no loss of generality in restricting to predictions which respect G. Lemma 1 Let G be a taxonomy, y, y ∈ {0, 1}N be two multilabels such that y respects G, and y be the G-truncation of y. Then H (y , y) ≤ H (y, y) . Proof. For each i = 1, . . . , N we show that yi = yi and yj = yj for all j ∈ anc(i) implies yi = yi and yj = yj for all j ∈ anc(i). Pick some i and suppose yi = yi and yj = yj for all j ∈ anc(i). Now suppose yj = 0 (and thus yj = 0) for some j ∈ anc(i). Then yi = 0 since y respects G. But this implies yi = 1, contradicting the fact that the G-truncation y respects G. Therefore, it must be the case that yj = yj = 1 for all j ∈ anc(i). Hence the G-truncation of y left each node j ∈ anc(i) unchanged, implying yj = yj for all j ∈ anc(i). But, since the G-truncation of y does not change the value of a node i whose ancestors j are such that yj = 1, this also implies yi = yi . Therefore yi = yi and the proof is concluded. 3 A probabilistic data model Our learning algorithms are based on the following statistical model for the data, originally introduced in [5]. The model defines a probability distribution fG over the set of multilabels respecting a given taxonomy G by associating with each node i of G a Bernoulli random variable Yi and defining fG (y | x) = N i=1 P Yi = yi | Ypar(i) = ypar(i) , X = x . To guarantee that fG (y | x) = 0 whenever y ∈ {0, 1}N does not respect G, we set P Yi = 1 | Ypar(i) = 0, X = x = 0. Notice that this definition of fG makes the (rather simplistic) assumption that all Yk with the same parent node i (i.e., the children of i) are independent when conditioned on Yi and x. Through fG we specify an i.i.d. process {(X 1 , Y 1 ), (X 2 , Y 2 ), . . .}, where, for t = 1, 2, . . ., the multilabel Y t is distributed according to fG (· | X t ) and X t is distributed according to a fixed and unknown distribution D. Each example (xt , y t ) is thus a realization of the corresponding pair (X t , Y t ) of random variables. Our parametric model for fG is described as follows. First, we assume that the support of D is the surface of the d-dimensional unit sphere (i.e., instances x ∈ R d are such that ||x|| = 1). With each node i in the taxonomy, we associate a unit-norm weight vector ui ∈ Rd . Then, we define the conditional probabilities for a nonroot node i with parent j by P (Yi = 1 | Yj = 1, X = x) = (1 + ui x)/2. If i is a root node, the previous equation simplifies to P (Yi = 1 | X = x) = (1 + ui x)/2. 3.1 The Bayes-optimal classifier for the H-loss We now describe a classifier, called H - BAYES, that is the Bayes-optimal classifier for the H-loss. In other words, H - BAYES classifies any instance x with the multilabel y = argminy∈{0,1} E[ H (¯ , Y ) | x ]. Define pi (x) = P Yi = 1 | Ypar(i) = 1, X = x . y ¯ When no ambiguity arises, we write pi instead of pi (x). Now, fix any unit-length instance x and let y be a multilabel that respects G. For each node i in G, recursively define H i,x (y) = ci (pi (1 − yi ) + (1 − pi )yi ) + k∈child(i) H k,x (y) . The classifier H - BAYES operates as follows. It starts by putting all nodes of G in a set S; nodes are then removed from S one by one. A node i can be removed only if i is a leaf or if all nodes j in the subtree rooted at i have been already removed. When i is removed, its value yi is set to 1 if and only if pi 2 − k∈child(i) H k,x (y)/ci ≥ 1 . (1) (Note that if i is a leaf then (1) is equivalent to yi = {pi ≥ 1/2}.) If yi is set to zero, then all nodes in the subtree rooted at i are set to zero. Theorem 2 For any taxonomy G and all unit-length x ∈ Rd , the multilabel generated by H - BAYES is the Bayes-optimal classification of x for the H-loss. Proof sketch. Let y be the multilabel assigned by H - BAYES and y ∗ be any multilabel minimizing the expected H-loss. Introducing the short-hand Ex [·] = E[· | x], we can write Ex H (y, Y )= N i=1 ci (pi (1 − yi ) + (1 − pi )yi ) j∈anc(i) pj {yj = 1} . Note that we can recursively decompose the expected H-loss as Ex H (y, Y )= i∈root(G) where Ex Hi (y, Y ) = ci (pi (1 − yi ) + (1 − pi )yi ) Ex Hi (y, Y ), pj {yj = 1} + j∈anc(i) Ex Hk (y, Y ) . (2) k∈child(i) ∗ Pick a node i. If i is a leaf, then the sum in the RHS of (2) disappears and yi = {pi ≥ 1/2}, ∗ which is also the minimizer of H i,x (y) = ci (pi (1 − yi ) + (1 − pi )yi ), implying yi = yi . ∗ Now let i be an internal node and inductively assume yj = yj for all j ∈ sub(i). Notice ∗ that the factors j∈anc(i) pj {yj = 1} occur in both terms in the RHS of (2). Hence yi does not depend on these factors and we can equivalently minimize ci (pi (1 − yi ) + (1 − pi )yi ) + pi {yi = 1} k∈child(i) H k,x (y), (3) where we noted that, for each k ∈ child(i), Ex Hk (y, Y ) = j∈anc(i) pj {yj = 1} pi {yi = 1}H k,x (y) . ∗ Now observe that yi minimizing (3) is equivalent to the assignment produced by H - BAYES. ∗ ∗ To conclude the proof, note that whenever yi = 0, Lemma 1 requires that yj = 0 for all nodes j ∈ sub(i), which is exactly what H - BAYES does. 4 The algorithms We consider three incremental algorithms. Each one of these algorithms learns a hierarchical classifier by training a decision function gi : Rd → {0, 1} at each node i = 1, . . . , N . For a given set g1 , . . . , gN of decision functions, the hierarchical classifier generated by these algorithms classifies an instance x through a multilabel y = (y1 , ..., yN ) defined as follows: yi = gi (x) 0 if i ∈ root(G) or yj = 1 for all j ∈ anc(i) otherwise. (4) Note that y computed this way respects G. The classifiers (4) are trained incrementally. Let gi,t be the decision function at node i after training on the first t − 1 examples. When the next training example (xt , y t ) is available, the algorithms compute the multilabel y t using classifier (4) based on g1,t (xt ), . . . , gN,t (xt ). Then, the algorithms consider for an update only those decision functions sitting at nodes i satisfying either i ∈ root(G) or ypar(i),t = 1. We call such nodes eligible at time t. The decision functions of all other nodes are left unchanged. The first algorithm we consider is a simple hierarchical version of the Perceptron algorithm [16], which we call H - PERC. The decision functions at time t are defined by gi,t (xt ) = {wi,t xt ≥ 0}. In the update phase, the Perceptron rule wi,t+1 = wi,t + yi,t xt is applied to every node i eligible at time t and such that yi,t = yi,t . The second algorithm, called APPROX - H - BAYES, approximates the H - BAYES classifier of Section 3.1 by replacing the unknown quantities pi (xt ) with estimates (1+w i,t xt )/2. The weights w i,t are regularized least-squares estimates defined by (i) wi,t = (I + Si,t−1 Si,t−1 + xt xt )−1 Si,t−1 y t−1 . (5) The columns of the matrix Si,t−1 are all past instances xs that have been stored at node i; (i) the s-th component of vector y t−1 is the i-th component yi,s of the multilabel y s associated with instance xs . In the update phase, an instance xt is stored at node i, causing an update of wi,t , whenever i is eligible at time t and |w i,t xt | ≤ (5 ln t)/Ni,t , where Ni,t is the number of instances stored at node i up to time t − 1. The corresponding decision functions gi,t are of the form gi,t (xt ) = {w i,t xt ≥ τi,t }, where the threshold τi,t ≥ 0 at node i depends on the margin values w j,t xt achieved by nodes j ∈ sub(i) — recall (1). Note that gi,t is not a linear-threshold function, as xt appears in the definition of w i,t . The margin threshold (5 ln t)/Ni,t , controlling the update of node i at time t, reduces the space requirements of the classifier by keeping matrices Si,t suitably small. This threshold is motivated by the work [4] on selective sampling. The third algorithm, which we call H - RLS (Hierarchical Regularized Least Squares), is a simplified variant of APPROX - H - BAYES in which the thresholds τi,t are set to zero. That is, we have gi,t (xt ) = {w i,t xt ≥ 0} where the weights w i,t are defined as in (5) and updated as in the APPROX - H - BAYES algorithm. Details on how to run APPROX - H - BAYES 2 and H - RLS in dual variables and perform an update at node i in time O(Ni,t ) are found in [3] (where a mistake-driven version of H - RLS is analyzed). 5 Experimental results The empirical evaluation of the algorithms was carried out on two well-known datasets of free-text documents. The first dataset consists of the first (in chronological order) 100,000 newswire stories from the Reuters Corpus Volume 1, RCV1 [2]. The associated taxonomy of labels, which are the topics of the documents, has 101 nodes organized in a forest of 4 trees. The forest is shallow: the longest path has length 3 and the the distribution of nodes, sorted by increasing path length, is {0.04, 0.53, 0.42, 0.01}. For this dataset, we used the bag-of-words vectorization performed by Xerox Research Center Europe within the EC project KerMIT (see [4] for details on preprocessing). The 100,000 documents were divided into 5 equally sized groups of chronologically consecutive documents. We then used each adjacent pair of groups as training and test set in an experiment (here the fifth and first group are considered adjacent), and then averaged the test set performance over the 5 experiments. The second dataset is a specific subtree of the OHSUMED corpus of medical abstracts [1]: the subtree rooted in “Quality of Health Care” (MeSH code N05.715). After removing overlapping classes (OHSUMED is not quite a tree but a DAG), we ended up with 94 Table 1: Experimental results on two hierarchical text classification tasks under various loss functions. We report average test errors along with standard deviations (in parenthesis). In bold are the best performance figures among the incremental algorithms. RCV1 PERC H - PERC H - RLS AH - BAY SVM H - SVM OHSU. PERC H - PERC H - RLS AH - BAY SVM H - SVM 0/1-loss 0.702(±0.045) 0.655(±0.040) 0.456(±0.010) 0.550(±0.010) 0.482(±0.009) 0.440(±0.008) unif. H-loss 1.196(±0.127) 1.224(±0.114) 0.743(±0.026) 0.815(±0.028) 0.790(±0.023) 0.712(±0.021) norm. H-loss 0.100(±0.029) 0.099(±0.028) 0.057(±0.001) 0.090(±0.001) 0.057(±0.001) 0.055(±0.001) ∆-loss 1.695(±0.182) 1.861(±0.172) 1.086(±0.036) 1.465(±0.040) 1.173(±0.051) 1.050(±0.027) 0/1-loss 0.899(±0.024) 0.846(±0.024) 0.769(±0.004) 0.819(±0.004) 0.784(±0.003) 0.759(±0.002) unif. H-loss 1.938(±0.219) 1.560(±0.155) 1.200(±0.007) 1.197(±0.006) 1.206(±0.003) 1.170(±0.005) norm. H-loss 0.058(±0.005) 0.057(±0.005) 0.045(±0.000) 0.047(±0.000) 0.044(±0.000) 0.044(±0.000) ∆-loss 2.639(±0.226) 2.528(±0.251) 1.957(±0.011) 2.029(±0.009) 1.872(±0.005) 1.910(±0.007) classes and 55,503 documents. We made this choice based only on the structure of the subtree: the longest path has length 4, the distribution of nodes sorted by increasing path length is {0.26, 0.37, 0.22, 0.12, 0.03}, and there are a significant number of partial and multiple path multilabels. The vectorization of the subtree was carried out as follows: after tokenization, we removed all stopwords and also those words that did not occur at least 3 times in the corpus. Then, we vectorized the documents using the Bow library [13] with a log(1 + TF) log(IDF) encoding. We ran 5 experiments by randomly splitting the corpus in a training set of 40,000 documents and a test set of 15,503 documents. Test set performances are averages over these 5 experiments. In the training set we kept more documents than in the RCV1 splits since the OHSUMED corpus turned out to be a harder classification problem than RCV1. In both datasets instances have been normalized to unit length. We tested the hierarchical Perceptron algorithm (H - PERC), the hierarchical regularized leastsquares algorithm (H - RLS), and the approximated Bayes-optimal algorithm (APPROX - H BAYES ), all described in Section 4. The results are summarized in Table 1. APPROX - H BAYES ( AH - BAY in Table 1) was trained using cost coefficients c i chosen as follows: if i ∈ root(G) then ci = |root(G)|−1 . Otherwise, ci = cj /|child(j)|, where j is the parent of i. Note that this choice of coefficients amounts to splitting a unit cost equally among the roots and then splitting recursively each node’s cost equally among its children. Since, in this case, 0 ≤ H ≤ 1, we call the resulting loss normalized H-loss. We also tested a hierarchical version of SVM (denoted by H - SVM in Table 1) in which each node is an SVM classifier trained using a batch version of our hierarchical learning protocol. More precisely, each node i was trained only on those examples (xt , y t ) such that ypar(i),t = 1 (note that, as no conditions are imposed on yi,t , node i is actually trained on both positive and negative examples). The resulting set of linear-threshold functions was then evaluated on the test set using the hierachical classification scheme (4). We tried both the C and ν parametrizations [18] for SVM and found the setting C = 1 to work best for our data. 1 We finally tested the “flat” variants of Perceptron and SVM, denoted by PERC and SVM. In these variants, each node is trained and evaluated independently of the others, disregarding all taxonomical information. All SVM experiments were carried out using the libSVM implementation [6]. All the tested algorithms used a linear kernel. 1 It should be emphasized that this tuning of C was actually chosen in hindsight, with no crossvalidation. As far as loss functions are concerned, we considered the 0/1-loss, the H-loss with cost coefficients set to 1 (denoted by uniform H-loss), the normalized H-loss, and the symmetric difference loss (denoted by ∆-loss). Note that H - SVM performs best, but our incremental algorithms were trained for a single epoch on the training set. The good performance of SVM (the flat variant of H - SVM ) is surprising. However, with a single epoch of training H - RLS does not perform worse than SVM (except on OHSUMED under the normalized H-loss) and comes reasonably close to H - SVM. On the other hand, the performance of APPROX - H - BAYES is disappointing: on OHSUMED it is the best algorithm only for the uniform H-loss, though it was trained using the normalized H-loss; on RCV1 it never outperforms H - RLS, though it always does better than PERC and H - PERC. A possible explanation for this behavior is that APPROX - H - BAYES is very sensitive to errors in the estimates of pi (x) (recall Section 3.1). Indeed, the least-squares estimates (5), which we used to approximate H - BAYES, seem to work better in practice on simpler (and possibly more robust) algorithms, such as H - RLS. The lower values of normalized H-loss on OHSUMED (a harder corpus than RCV1) can be explained because a quarter of the 94 nodes in the OHSUMED taxonomy are roots, and thus each top-level mistake is only charged about 4/94. As a final remark, we observe that the normalized H-loss gave too small a range of values to afford fine comparisons among the best performing algorithms. 6 Regret bounds for the H-loss In this section we prove a theoretical bound on the H-loss of a slight variant of the algorithm H - RLS tested in Section 5. More precisely, we assume data are generated according to the probabilistic model introduced in Section 3 with unknown instance distribution D and unknown coefficients u1 , . . . , uN . We define the regret of a classifier assigning label y to instance X as E H (y, Y t ) − E H (y, Y ), where the expected value is with respect the random draw of (X, Y ) and y is the multilabel assigned by classifier (4) when the decision functions gi are zero-threshold functions of the form gi (x) = {ui x ≥ 0}. The theorem below shows that the regret of the classifier learned by a variant of H - RLS after t training examples, with t large enough, is exponentially small in t. In other words, H - RLS learns to classify as well as the algorithm that is given the true parameters u1 , . . . , uN of the underlying data-generating process. We have been able to prove the theorem only for the variant of H - RLS storing all instances at each node. That is, every eligible node at time t is updated, irrespective of whether |w i,t xt | ≤ (5 ln t)/Ni,t . Given the i.i.d. data-generating process (X 1 , Y 1 ), (X 2 , Y 2 ), . . ., for each node k we define the derived process X k1 , X k2 , . . . including all and only the instances X s of the original process that satisfy Ypar(k),s = 1. We call this derived process the process at node k. Note that, for each k, the process at node k is an i.i.d. process. However, its distribution might depend on k. The spectrum of the process at node k is the set of eigenvalues of the correlation matrix with entries E[Xk1 ,i Xk1 ,j ] for i, j = 1, . . . , d. We have the following theorem, whose proof is omitted due to space limitations. Theorem 3 Let G be a taxonomy with N nodes and let fG be a joint density for G parametrized by N unit-norm vectors u1 , . . . , uN ∈ Rd . Assume the instance distribution is such that there exist γ1 , . . . , γN > 0 satisfying P |ui X t | ≥ γi = 1 for i = 1, . . . , N . Then, for all t > max maxi=1,...,N E H (y t , Y t ) −E 16 µ i λ i γi , maxi=1,...,N 192d µi λ 2 i the regret H (y t , Y t ) of the modified H - RLS algorithm is at most N 2 2 µi t e−κ1 γi λi t µi + t2 e−κ2 λi t µi cj , i=1 j∈sub(i) where κ1 , κ2 are constants, µi = E j∈anc(i) (1 + uj X)/2 eigenvalue in the spectrum of the process at node i. and λi is the smallest 7 Conclusions and open problems In this work we have studied the problem of hierarchical classification of data instances in the presence of partial and multiple path labellings. We have introduced a new hierarchical loss function, the H-loss, derived the corresponding Bayes-optimal classifier, and empirically compared an incremental approximation to this classifier with some other incremental and nonincremental algorithms. Finally, we have derived a theoretical guarantee on the H-loss of a simplified variant of the approximated Bayes-optimal algorithm. Our investigation leaves several open issues. The current approximation to the Bayesoptimal classifier is not satisfying, and this could be due to a bad choice of the model, of the estimators, of the datasets, or of a combination of them. Also, the normalized H-loss is not fully satisfying, since the resulting values are often too small. From the theoretical viewpoint, we would like to analyze the regret of our algorithms with respect to the Bayesoptimal classifier, rather than with respect to a classifier that makes a suboptimal use of the true model parameters. References [1] The OHSUMED test collection. URL: medir.ohsu.edu/pub/ohsumed/. [2] Reuters corpus volume 1. URL: about.reuters.com/researchandstandards/corpus/. [3] N. Cesa-Bianchi, A. Conconi, and C. Gentile. A second-order Perceptron algorithm. In Proc. 15th COLT, pages 121–137. Springer, 2002. [4] N. Cesa-Bianchi, A. Conconi, and C. Gentile. Learning probabilistic linear-threshold classifiers via selective sampling. In Proc. 16th COLT, pages 373–386. Springer, 2003. [5] N. Cesa-Bianchi, A. Conconi, and C. Gentile. Regret bounds for hierarchical classification with linear-threshold functions. In Proc. 17th COLT. Springer, 2004. To appear. [6] C.-C. Chang and C.-J. Lin. Libsvm — a library for support vector machines. URL: www.csie.ntu.edu.tw/∼cjlin/libsvm/. [7] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2001. [8] O. Dekel, J. Keshet, and Y. Singer. Large margin hierarchical classification. In Proc. 21st ICML. Omnipress, 2004. [9] S.T. Dumais and H. Chen. Hierarchical classification of web content. In Proc. 23rd ACM Int. Conf. RDIR, pages 256–263. ACM Press, 2000. [10] M. Granitzer. Hierarchical Text Classification using Methods from Machine Learning. PhD thesis, Graz University of Technology, 2003. [11] T. Hofmann, L. Cai, and M. Ciaramita. Learning with taxonomies: Classifying documents and words. In NIPS Workshop on Syntax, Semantics, and Statistics, 2003. [12] D. Koller and M. Sahami. Hierarchically classifying documents using very few words. In Proc. 14th ICML, Morgan Kaufmann, 1997. [13] A. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. URL: www-2.cs.cmu.edu/∼mccallum/bow/. [14] A.K. McCallum, R. Rosenfeld, T.M. Mitchell, and A.Y. Ng. Improving text classification by shrinkage in a hierarchy of classes. In Proc. 15th ICML. Morgan Kaufmann, 1998. [15] D. Mladenic. Turning yahoo into an automatic web-page classifier. In Proceedings of the 13th European Conference on Artificial Intelligence, pages 473–474, 1998. [16] F. Rosenblatt. The Perceptron: A probabilistic model for information storage and organization in the brain. Psychol. Review, 65:386–408, 1958. [17] M.E. Ruiz and P. Srinivasan. Hierarchical text categorization using neural networks. Information Retrieval, 5(1):87–118, 2002. [18] B. Sch¨ lkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. o Neural Computation, 12:1207–1245, 2000. [19] B. Sch¨ lkopf and A. Smola. Learning with kernels. MIT Press, 2002. o [20] A. Sun and E.-P. Lim. Hierarchical text classification and evaluation. In Proc. 2001 Int. Conf. Data Mining, pages 521–528. IEEE Press, 2001. [21] V.N. Vapnik. Statistical Learning Theory. Wiley, 1998.

3 0.073631294 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill

Author: Tzu-kuo Huang, Chih-jen Lin, Ruby C. Weng

Abstract: The Bradley-Terry model for paired comparison has been popular in many areas. We propose a generalized version in which paired individual comparisons are extended to paired team comparisons. We introduce a simple algorithm with convergence proofs to solve the model and obtain individual skill. A useful application to multi-class probability estimates using error-correcting codes is demonstrated. 1

4 0.067897439 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

Author: Saharon Rosset, Ji Zhu, Hui Zou, Trevor J. Hastie

Abstract: We consider the situation in semi-supervised learning, where the “label sampling” mechanism stochastically depends on the true response (as well as potentially on the features). We suggest a method of moments for estimating this stochastic dependence using the unlabeled data. This is potentially useful for two distinct purposes: a. As an input to a supervised learning procedure which can be used to “de-bias” its results using labeled data only and b. As a potentially interesting learning task in itself. We present several examples to illustrate the practical usefulness of our method.

5 0.0673948 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification

Author: Shyam Visweswaran, Gregory F. Cooper

Abstract: Classification algorithms typically induce population-wide models that are trained to perform well on average on expected future instances. We introduce a Bayesian framework for learning instance-specific models from data that are optimized to predict well for a particular instance. Based on this framework, we present a lazy instance-specific algorithm called ISA that performs selective model averaging over a restricted class of Bayesian networks. On experimental evaluation, this algorithm shows superior performance over model selection. We intend to apply such instance-specific algorithms to improve the performance of patient-specific predictive models induced from medical data. 1 In t ro d u c t i o n Commonly used classification algorithms, such as neural networks, decision trees, Bayesian networks and support vector machines, typically induce a single model from a training set of instances, with the intent of applying it to all future instances. We call such a model a population-wide model because it is intended to be applied to an entire population of future instances. A population-wide model is optimized to predict well on average when applied to expected future instances. In contrast, an instance-specific model is one that is constructed specifically for a particular instance. The structure and parameters of an instance-specific model are specialized to the particular features of an instance, so that it is optimized to predict especially well for that instance. Usually, methods that induce population-wide models employ eager learning in which the model is induced from the training data before the test instance is encountered. In contrast, lazy learning defers most or all processing until a response to a test instance is required. Learners that induce instance-specific models are necessarily lazy in nature since they take advantage of the information in the test instance. An example of a lazy instance-specific method is the lazy Bayesian rule (LBR) learner, implemented by Zheng and Webb [1], which induces rules in a lazy fashion from examples in the neighborhood of the test instance. A rule generated by LBR consists of a conjunction of the attribute-value pairs present in the test instance as the antecedent and a local simple (naïve) Bayes classifier as the consequent. The structure of the local simple Bayes classifier consists of the attribute of interest as the parent of all other attributes that do not appear in the antecedent, and the parameters of the classifier are estimated from the subset of training instances that satisfy the antecedent. A greedy step-forward search selects the optimal LBR rule for a test instance to be classified. When evaluated on 29 UCI datasets, LBR had the lowest average error rate when compared to several eager learning methods [1]. Typically, both eager and lazy algorithms select a single model from some model space, ignoring the uncertainty in model selection. Bayesian model averaging is a coherent approach to dealing with the uncertainty in model selection, and it has been shown to improve the predictive performance of classifiers [2]. However, since the number of models in practically useful model spaces is enormous, exact model averaging over the entire model space is usually not feasible. In this paper, we describe a lazy instance-specific averaging (ISA) algorithm for classification that approximates Bayesian model averaging in an instance-sensitive manner. ISA extends LBR by adding Bayesian model averaging to an instance-specific model selection algorithm. While the ISA algorithm is currently able to directly handle only discrete variables and is computationally more intensive than comparable eager algorithms, the results in this paper show that it performs well. In medicine, such lazy instance-specific algorithms can be applied to patient-specific modeling for improving the accuracy of diagnosis, prognosis and risk assessment. The rest of this paper is structured as follows. Section 2 introduces a Bayesian framework for instance-specific learning. Section 3 describes the implementation of ISA. In Section 4, we evaluate ISA and compare its performance to that of LBR. Finally, in Section 5 we discuss the results of the comparison. 2 Deci si on Th eo ret i c F rame wo rk We use the following notation. Capital letters like X, Z, denote random variables and corresponding lower case letters, x, z, denote specific values assigned to them. Thus, X = x denotes that variable X is assigned the value x. Bold upper case letters, such as X, Z, represent sets of variables or random vectors and their realization is denoted by the corresponding bold lower case letters, x, z. Hence, X = x denotes that the variables in X have the states given by x. In addition, Z denotes the target variable being predicted, X denotes the set of attribute variables, M denotes a model, D denotes the training dataset, and denotes a generic test instance that is not in D. We now characterize population-wide and instance-specific model selection in decision theoretic terms. Given training data D and a separate generic test instance , the Bayes optimal prediction for Zt is obtained by combining the predictions of all models weighted by their posterior probabilities, as follows: P (Z t | X t , D ) = ∫ P( Z t | X t , M ) P ( M | D )dM . (1) M The optimal population-wide model for predicting Zt is as follows:   max∑ U P( Z t | X t , D), P (Z t | X t , M ) P ( X | D) , M  Xt  [ ] (2) where the function U gives the utility of approximating the Bayes optimal estimate P(Zt | Xt , D), with the estimate P(Zt | Xt , M) obtained from model M. The term P(X | D) is given by: P ( X | D) = ∫ P ( X | M ) P ( M | D)dM . (3) M The optimal instance-specific model for predicting Zt is as follows: { [ ]} max U P ( Z t | X t = x t , D), P (Z t | X t = x t , M ) , M (4) where xt are the values of the attributes of the test instance Xt for which we want to predict Zt. The Bayes optimal estimate P(Zt | Xt = xt, D), in Equation 4 is derived using Equation 1, for the special case in which Xt = xt . The difference between the population-wide and the instance-specific models can be noted by comparing Equations 2 and 4. Equation 2 for the population-wide model selects the model that on average will have the greatest utility. Equation 4 for the instance-specific model, however, selects the model that will have the greatest expected utility for the specific instance Xt = xt . For predicting Zt in a given instance Xt = xt, the model selected using Equation 2 can never have an expected utility greater than the model selected using Equation 4. This observation provides support for developing instance-specific models. Equations 2 and 4 represent theoretical ideals for population-wide and instancespecific model selection, respectively; we are not suggesting they are practical to compute. The current paper focuses on model averaging, rather than model selection. Ideal Bayesian model averaging is given by Equation 1. Model averaging has previously been applied using population-wide models. Studies have shown that approximate Bayesian model averaging using population-wide models can improve predictive performance over population-wide model selection [2]. The current paper concentrates on investigating the predictive performance of approximate Bayesian model averaging using instance-specific models. 3 In st an ce- S p eci fi c Algo ri t h m We present the implementation of the lazy instance-specific algorithm based on the above framework. ISA searches the space of a restricted class of Bayesian networks to select a subset of the models over which to derive a weighted (averaged) posterior of the target variable Zt . A key characteristic of the search is the use of a heuristic to select models that will have a significant influence on the weighted posterior. We introduce Bayesian networks briefly and then describe ISA in detail. 3.1 B ay e s i a n N e t w or k s A Bayesian network is a probabilistic model that combines a graphical representation (the Bayesian network structure) with quantitative information (the parameters of the Bayesian network) to represent the joint probability distribution over a set of random variables [3]. Specifically, a Bayesian network M representing the set of variables X consists of a pair (G, ΘG ). G is a directed acyclic graph that contains a node for every variable in X and an arc between every pair of nodes if the corresponding variables are directly probabilistically dependent. Conversely, the absence of an arc between a pair of nodes denotes probabilistic independence between the corresponding variables. ΘG represents the parameterization of the model. In a Bayesian network M, the immediate predecessors of a node X i in X are called the parents of X i and the successors, both immediate and remote, of Xi in X are called the descendants of X i . The immediate successors of X i are called the children of X i . For each node Xi there is a local probability distribution (that may be discrete or continuous) on that node given the state of its parents. The complete joint probability distribution over X, represented by the parameterization ΘG, can be factored into a product of local probability distributions defined on each node in the network. This factorization is determined by the independences captured by the structure of the Bayesian network and is formalized in the Bayesian network Markov condition: A node (representing a variable) is independent of its nondescendants given just its parents. According to this Markov condition, the joint probability distribution on model variables X = (X1 , X 2, …, X n ) can be factored as follows: n P ( X 1 , X 2 , ..., X n ) = ∏ P ( X i | parents( X i )) , (5) i =1 where parents(Xi ) denotes the set of nodes that are the parents of X i . If Xi has no parents, then the set parents(Xi ) is empty and P(Xi | parents(X i)) is just P(Xi ). 3.2 I S A M od e l s The LBR models of Zheng and Webb [1] can be represented as members of a restricted class of Bayesian networks (see Figure 1). We use the same class of Bayesian networks for the ISA models, to facilitate comparison between the two algorithms. In Figure 1, all nodes represent attributes that are discrete. Each node in X has either an outgoing arc into target node, Z, or receives an arc from Z. That is, each node is either a parent or a child of Z. Thus, X is partitioned into two sets: the first containing nodes (X 1 , …, X j in Figure 1) each of which is a parent of Z and every node in the second set, and the second containing nodes (X j+1 , …, X k in Figure 1) that have as parents the node Z and every node in the first set. The nodes in the first set are instantiated to the corresponding values in the test instance for which Zt is to be predicted. Thus, the first set of nodes represents the antecedent of the LBR rule and the second set of nodes represents the consequent. ... X1= x1 Xi = xi Z Xi+1 ... Xk Figure 1: An example of a Bayesian network LBR model with target node Z and k attribute nodes of which X1 , …, X j are instantiated to values x 1 , …, x j in xt . X 1, …, X j are present in the antecedent of the LBR rule and Z, X j+1 , …, X k (that form the local simple Bayes classifier) are present in the consequent. The indices need not be ordered as shown, but are presented in this example for convenience of exposition. 3.3 M od e l A ve r ag i n g For Bayesian networks, Equation 1 can be evaluated as follows: P ( Z t | x t , D ) = ∑ P ( Z t | x t , M ) P( M | D ) , (6) M with M being a Bayesian network comprised of structure G and parameters ΘG. The probability distribution of interest is a weighted average of the posterior distribution over all possible Bayesian networks where the weight is the probability of the Bayesian network given the data. Since exhaustive enumeration of all possible models is not feasible, even for this class of simple Bayesian networks, we approximate exact model averaging with selective model averaging. Let R be the set of models selected by the search procedure from all possible models in the model space, as described in the next section. Then, with selective model averaging, P(Zt | xt, D) is estimated as: ∑RP( Z t | x t , M ) P(M | D) P (Z t | x t , D) ≅ M ∈ . ∑RP(M | D) M∈ (7) Assuming uniform prior belief over all possible models, the model posterior P(M | D) in Equation 7 can be replaced by the marginal likelihood P(D | M), to obtain the following equation: P ( Z | x , D) ≅ t t ∑ P ( Z t | x t , M ) P( D | M ) . ∑RP( D | M ) M∈ M ∈R (8) The (unconditional) marginal likelihood P(D | M) in Equation 8, is a measure of the goodness of fit of the model to the data and is also known as the model score. While this score is suitable for assessing the model’s fit to the joint probability distribution, it is not necessarily appropriate for assessing the goodness of fit to a conditional probability distribution which is the focus in prediction and classification tasks, as is the case here. A more suitable score in this situation is a conditional model score that is computed from training data D of d instances as: d score( D, M ) = ∏ P ( z p | x1 ,..., x p ,z 1 ,...,z p −1 ,M ) . (9) p =1 This score is computed in a predictive and sequential fashion: for the pth training instance the probability of predicting the observed value zp for the target variable is computed based on the values of all the variables in the preceding p-1 training instances and the values xp of the attributes in the pth instance. One limitation of this score is that its value depends on the ordering of the data. Despite this limitation, it has been shown to be an effective scoring criterion for classification models [4]. The parameters of the Bayesian network M, used in the above computations, are defined as follows: P ( X i = k | parents ( X i ) = j ) ≡ θ ijk = N ijk + α ijk N ij + α ij , (10) where (i) Nijk is the number of instances in the training dataset D where variable Xi has value k and the parents of X i are in state j, (ii) N ij = ∑k N ijk , (iii) αijk is a parameter prior that can be interpreted as the belief equivalent of having previously observed αijk instances in which variable Xi has value k and the parents of X i are in state j, and (iv) α ij = ∑k α ijk . 3.4 M od e l Se a r c h We use a two-phase best-first heuristic search to sample the model space. The first phase ignores the evidence xt in the test instance while searching for models that have high scores as given by Equation 9. This is followed by the second phase that searches for models having the greatest impact on the prediction of Zt for the test instance, which we formalize below. The first phase searches for models that predict Z in the training data very well; these are the models that have high conditional model scores. The initial model is the simple Bayes network that includes all the attributes in X as children of Z. A succeeding model is derived from a current model by reversing the arc of a child node in the current model, adding new outgoing arcs from it to Z and the remaining children, and instantiating this node to the value in the test instance. This process is performed for each child in the current model. An incoming arc of a child node is considered for reversal only if the node’s value is not missing in the test instance. The newly derived models are added to a priority queue, Q. During each iteration of the search, the model with the highest score (given by Equation 9) is removed from Q and placed in a set R, following which new models are generated as described just above, scored and added to Q. The first phase terminates after a user-specified number of models have accumulated in R. The second phase searches for models that change the current model-averaged estimate of P(Zt | xt , D) the most. The idea here is to find viable competing models for making this posterior probability prediction. When no competitive models can be found, the prediction becomes stable. During each iteration of the search, the highest ranked model M* is removed from Q and added to R. The ranking is based on how much the model changes the current estimate of P(Zt | xt , D). More change is better. In particular, M* is the model in Q that maximizes the following function: f ( R, M *) = g ( R) − g ( R U {M *}) , (11) where for a set of models S, the function g(S) computes the approximate model averaged prediction for Zt, as follows: g (S ) = ∑ P(Z M ∈S t | x t , M ) score( D, M ) ∑ score( D, M ) ∈ . (12) M S The second phase terminates when no new model can be found that has a value (as given by Equation 11) that is greater than a user-specified minimum threshold T. The final distribution of Zt is then computed from the models in R using Equation 8. 4 Ev a lu a t i o n We evaluated ISA on the 29 UCI datasets that Zheng and Webb used for the evaluation of LBR. On the same datasets, we also evaluated a simple Bayes classifier (SB) and LBR. For SB and LBR, we used the Weka implementations (Weka v3.3.6, http://www.cs.waikato.ac.nz/ml/weka/) with default settings [5]. We implemented the ISA algorithm as a standalone application in Java. The following settings were used for ISA: a maximum of 100 phase-1 models, a threshold T of 0.001 in phase-2, and an upper limit of 500 models in R. For the parameter priors in Equation 10, all αijk were set to 1. All error rates were obtained by averaging the results from two stratified 10-fold cross-validation (20 trials total) similar to that used by Zheng and Webb. Since, both LBR and ISA can handle only discrete attributes, all numeric attributes were discretized in a pre-processing step using the entropy based discretization method described in [6]. For each pair of training and test folds, the discretization intervals were first estimated from the training fold and then applied to both folds. The error rates of two algorithms on a dataset were compared with a paired t-test carried out at the 5% significance level on the error rate statistics obtained from the 20 trials. The results are shown in Table 1. Compared to SB, ISA has significantly fewer errors on 9 datasets and significantly more errors on one dataset. Compared to LBR, ISA has significantly fewer errors on 7 datasets and significantly more errors on two datasets. On two datasets, chess and tic-tac-toe, ISA shows considerable improvement in performance over both SB and LBR. With respect to computation Table 1: Percent error rates of simple Bayes (SB), Lazy Bayesian Rule (LBR) and Instance-Specific Averaging (ISA). A - indicates that the ISA error rate is statistically significantly lower than the marked SB or LBR error rate. A + indicates that the ISA error rate is statistically significantly higher. Dataset Size Annealing Audiology Breast (W) Chess (KR-KP) Credit (A) Echocardiogram Glass Heart (C) Hepatitis Horse colic House votes 84 Hypothyroid Iris Labor LED 24 Liver disorders Lung cancer Lymphography Pima Postoperative Primary tumor Promoters Solar flare Sonar Soybean Splice junction Tic-Tac-Toe Wine Zoo 898 226 699 3169 690 131 214 303 155 368 435 3163 150 57 200 345 32 148 768 90 339 106 1389 208 683 3177 958 178 101 No. of classes 6 24 2 2 2 2 6 2 2 2 2 2 3 2 10 2 3 4 2 3 22 2 2 2 19 3 2 3 7 Num. Attrib. 6 0 9 0 6 6 9 13 6 7 0 7 4 8 0 6 0 0 8 1 0 0 0 60 0 0 0 13 0 Nom. Attrib. 32 69 0 36 9 1 0 0 13 15 16 18 0 8 24 0 56 18 0 7 17 57 10 0 35 60 9 0 16 Percent error rate SB LBR ISA 1.9 3.5 2.7 29.6 29.4 30.9 3.7 2.9 + 2.8 + 1.1 12.1 3.0 13.8 14.0 13.9 33.2 34.0 35.9 26.9 27.8 29.0 16.2 16.2 17.5 14.2 - 14.2 - 11.3 20.2 16.0 17.8 5.1 10.1 7.0 0.9 0.9 1.4 6.0 6.0 5.3 8.8 6.1 7.0 40.5 40.5 40.3 36.8 36.8 36.8 56.3 56.3 56.3 15.5 - 15.5 - 13.2 21.8 22.0 22.3 33.3 33.3 33.3 54.4 53.5 54.2 7.5 7.5 7.5 20.2 18.3 + 19.4 15.4 15.6 15.9 7.1 7.2 7.9 4.7 4.3 4.4 30.3 - 13.7 - 10.3 1.1 1.1 1.1 6.4 8.4 8.4 - times, ISA took 6 times longer to run than LBR on average for a single test instance on a desktop computer with a 2 GHz Pentium 4 processor and 3 GB of RAM. 5 C o n c lu si o n s a n d Fu t u re R e s ea rc h We have introduced a Bayesian framework for instance-specific model averaging and presented ISA as one example of a classification algorithm based on this framework. An instance-specific algorithm like LBR that does model selection has been shown by Zheng and Webb to perform classification better than several eager algorithms [1]. Our results show that ISA, which extends LBR by adding Bayesian model averaging, improves overall on LBR, which provides support that we can obtain additional prediction improvement by performing instance-specific model averaging rather than just instance-specific model selection. In future work, we plan to explore further the behavior of ISA with respect to the number of models being averaged and the effect of the number of models selected in each of the two phases of the search. We will also investigate methods to improve the computational efficiency of ISA. In addition, we plan to examine other heuristics for model search as well as more general model spaces such as unrestricted Bayesian networks. The instance-specific framework is not restricted to the Bayesian network models that we have used in this investigation. In the future, we plan to explore other models using this framework. Our ultimate interest is to apply these instancespecific algorithms to improve patient-specific predictions (for diagnosis, therapy selection, and prognosis) and thereby to improve patient care. A c k n ow l e d g me n t s This work was supported by the grant T15-LM/DE07059 from the National Library of Medicine (NLM) to the University of Pittsburgh’s Biomedical Informatics Training Program. We would like to thank the three anonymous reviewers for their helpful comments. References [1] Zheng, Z. and Webb, G.I. (2000). Lazy Learning of Bayesian Rules. Machine Learning, 41(1):53-84. [2] Hoeting, J.A., Madigan, D., Raftery, A.E. and Volinsky, C.T. (1999). Bayesian Model Averaging: A Tutorial. Statistical Science, 14:382-417. [3] Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, CA. [4] Kontkanen, P., Myllymaki, P., Silander, T., and Tirri, H. (1999). On Supervised Selection of Bayesian Networks. In Proceedings of the 15th International Conference on Uncertainty in Artificial Intelligence, pages 334-342, Stockholm, Sweden. Morgan Kaufmann. [5] Witten, I.H. and Frank, E. (2000). Data Mining: Practical Machine Learning Tools with Java Implementations. Morgan Kaufmann, San Francisco, CA. [6] Fayyad, U.M., and Irani, K.B. (1993). Multi-Interval Discretization of ContinuousValued Attributes for Classification Learning. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, pages 1022-1027, San Mateo, CA. Morgan Kaufmann.

6 0.059653521 161 nips-2004-Self-Tuning Spectral Clustering

7 0.055325408 158 nips-2004-Sampling Methods for Unsupervised Learning

8 0.053547621 98 nips-2004-Learning Gaussian Process Kernels via Hierarchical Bayes

9 0.052775003 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution

10 0.051067349 116 nips-2004-Message Errors in Belief Propagation

11 0.050988428 122 nips-2004-Modelling Uncertainty in the Game of Go

12 0.050967962 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms

13 0.050780065 203 nips-2004-Validity Estimates for Loopy Belief Propagation on Binary Real-world Networks

14 0.049301937 27 nips-2004-Bayesian Regularization and Nonnegative Deconvolution for Time Delay Estimation

15 0.049121533 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

16 0.048943877 37 nips-2004-Co-Training and Expansion: Towards Bridging Theory and Practice

17 0.048608001 17 nips-2004-Adaptive Manifold Learning

18 0.046646319 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment

19 0.045952201 11 nips-2004-A Second Order Cone programming Formulation for Classifying Missing Data

20 0.045778196 163 nips-2004-Semi-parametric Exponential Family PCA


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.139), (1, 0.009), (2, 0.013), (3, -0.017), (4, -0.005), (5, 0.002), (6, -0.004), (7, 0.115), (8, -0.054), (9, -0.085), (10, 0.095), (11, -0.009), (12, 0.025), (13, -0.035), (14, -0.035), (15, -0.034), (16, -0.027), (17, -0.011), (18, 0.105), (19, 0.041), (20, 0.062), (21, 0.064), (22, -0.064), (23, -0.011), (24, -0.044), (25, 0.072), (26, -0.098), (27, 0.101), (28, 0.08), (29, -0.004), (30, -0.001), (31, 0.018), (32, -0.049), (33, 0.028), (34, -0.021), (35, 0.048), (36, -0.08), (37, -0.052), (38, -0.055), (39, -0.108), (40, 0.007), (41, -0.044), (42, -0.237), (43, -0.006), (44, -0.093), (45, 0.076), (46, 0.025), (47, 0.097), (48, 0.13), (49, 0.137)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94620049 130 nips-2004-Newscast EM

Author: Wojtek Kowalczyk, Nikos A. Vlassis

Abstract: We propose a gossip-based distributed algorithm for Gaussian mixture learning, Newscast EM. The algorithm operates on network topologies where each node observes a local quantity and can communicate with other nodes in an arbitrary point-to-point fashion. The main difference between Newscast EM and the standard EM algorithm is that the M-step in our case is implemented in a decentralized manner: (random) pairs of nodes repeatedly exchange their local parameter estimates and combine them by (weighted) averaging. We provide theoretical evidence and demonstrate experimentally that, under this protocol, nodes converge exponentially fast to the correct estimates in each M-step of the EM algorithm. 1

2 0.5307253 122 nips-2004-Modelling Uncertainty in the Game of Go

Author: David H. Stern, Thore Graepel, David MacKay

Abstract: Go is an ancient oriental game whose complexity has defeated attempts to automate it. We suggest using probability in a Bayesian sense to model the uncertainty arising from the vast complexity of the game tree. We present a simple conditional Markov random field model for predicting the pointwise territory outcome of a game. The topology of the model reflects the spatial structure of the Go board. We describe a version of the Swendsen-Wang process for sampling from the model during learning and apply loopy belief propagation for rapid inference and prediction. The model is trained on several hundred records of professional games. Our experimental results indicate that the model successfully learns to predict territory despite its simplicity. 1

3 0.46452591 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification

Author: Shyam Visweswaran, Gregory F. Cooper

Abstract: Classification algorithms typically induce population-wide models that are trained to perform well on average on expected future instances. We introduce a Bayesian framework for learning instance-specific models from data that are optimized to predict well for a particular instance. Based on this framework, we present a lazy instance-specific algorithm called ISA that performs selective model averaging over a restricted class of Bayesian networks. On experimental evaluation, this algorithm shows superior performance over model selection. We intend to apply such instance-specific algorithms to improve the performance of patient-specific predictive models induced from medical data. 1 In t ro d u c t i o n Commonly used classification algorithms, such as neural networks, decision trees, Bayesian networks and support vector machines, typically induce a single model from a training set of instances, with the intent of applying it to all future instances. We call such a model a population-wide model because it is intended to be applied to an entire population of future instances. A population-wide model is optimized to predict well on average when applied to expected future instances. In contrast, an instance-specific model is one that is constructed specifically for a particular instance. The structure and parameters of an instance-specific model are specialized to the particular features of an instance, so that it is optimized to predict especially well for that instance. Usually, methods that induce population-wide models employ eager learning in which the model is induced from the training data before the test instance is encountered. In contrast, lazy learning defers most or all processing until a response to a test instance is required. Learners that induce instance-specific models are necessarily lazy in nature since they take advantage of the information in the test instance. An example of a lazy instance-specific method is the lazy Bayesian rule (LBR) learner, implemented by Zheng and Webb [1], which induces rules in a lazy fashion from examples in the neighborhood of the test instance. A rule generated by LBR consists of a conjunction of the attribute-value pairs present in the test instance as the antecedent and a local simple (naïve) Bayes classifier as the consequent. The structure of the local simple Bayes classifier consists of the attribute of interest as the parent of all other attributes that do not appear in the antecedent, and the parameters of the classifier are estimated from the subset of training instances that satisfy the antecedent. A greedy step-forward search selects the optimal LBR rule for a test instance to be classified. When evaluated on 29 UCI datasets, LBR had the lowest average error rate when compared to several eager learning methods [1]. Typically, both eager and lazy algorithms select a single model from some model space, ignoring the uncertainty in model selection. Bayesian model averaging is a coherent approach to dealing with the uncertainty in model selection, and it has been shown to improve the predictive performance of classifiers [2]. However, since the number of models in practically useful model spaces is enormous, exact model averaging over the entire model space is usually not feasible. In this paper, we describe a lazy instance-specific averaging (ISA) algorithm for classification that approximates Bayesian model averaging in an instance-sensitive manner. ISA extends LBR by adding Bayesian model averaging to an instance-specific model selection algorithm. While the ISA algorithm is currently able to directly handle only discrete variables and is computationally more intensive than comparable eager algorithms, the results in this paper show that it performs well. In medicine, such lazy instance-specific algorithms can be applied to patient-specific modeling for improving the accuracy of diagnosis, prognosis and risk assessment. The rest of this paper is structured as follows. Section 2 introduces a Bayesian framework for instance-specific learning. Section 3 describes the implementation of ISA. In Section 4, we evaluate ISA and compare its performance to that of LBR. Finally, in Section 5 we discuss the results of the comparison. 2 Deci si on Th eo ret i c F rame wo rk We use the following notation. Capital letters like X, Z, denote random variables and corresponding lower case letters, x, z, denote specific values assigned to them. Thus, X = x denotes that variable X is assigned the value x. Bold upper case letters, such as X, Z, represent sets of variables or random vectors and their realization is denoted by the corresponding bold lower case letters, x, z. Hence, X = x denotes that the variables in X have the states given by x. In addition, Z denotes the target variable being predicted, X denotes the set of attribute variables, M denotes a model, D denotes the training dataset, and denotes a generic test instance that is not in D. We now characterize population-wide and instance-specific model selection in decision theoretic terms. Given training data D and a separate generic test instance , the Bayes optimal prediction for Zt is obtained by combining the predictions of all models weighted by their posterior probabilities, as follows: P (Z t | X t , D ) = ∫ P( Z t | X t , M ) P ( M | D )dM . (1) M The optimal population-wide model for predicting Zt is as follows:   max∑ U P( Z t | X t , D), P (Z t | X t , M ) P ( X | D) , M  Xt  [ ] (2) where the function U gives the utility of approximating the Bayes optimal estimate P(Zt | Xt , D), with the estimate P(Zt | Xt , M) obtained from model M. The term P(X | D) is given by: P ( X | D) = ∫ P ( X | M ) P ( M | D)dM . (3) M The optimal instance-specific model for predicting Zt is as follows: { [ ]} max U P ( Z t | X t = x t , D), P (Z t | X t = x t , M ) , M (4) where xt are the values of the attributes of the test instance Xt for which we want to predict Zt. The Bayes optimal estimate P(Zt | Xt = xt, D), in Equation 4 is derived using Equation 1, for the special case in which Xt = xt . The difference between the population-wide and the instance-specific models can be noted by comparing Equations 2 and 4. Equation 2 for the population-wide model selects the model that on average will have the greatest utility. Equation 4 for the instance-specific model, however, selects the model that will have the greatest expected utility for the specific instance Xt = xt . For predicting Zt in a given instance Xt = xt, the model selected using Equation 2 can never have an expected utility greater than the model selected using Equation 4. This observation provides support for developing instance-specific models. Equations 2 and 4 represent theoretical ideals for population-wide and instancespecific model selection, respectively; we are not suggesting they are practical to compute. The current paper focuses on model averaging, rather than model selection. Ideal Bayesian model averaging is given by Equation 1. Model averaging has previously been applied using population-wide models. Studies have shown that approximate Bayesian model averaging using population-wide models can improve predictive performance over population-wide model selection [2]. The current paper concentrates on investigating the predictive performance of approximate Bayesian model averaging using instance-specific models. 3 In st an ce- S p eci fi c Algo ri t h m We present the implementation of the lazy instance-specific algorithm based on the above framework. ISA searches the space of a restricted class of Bayesian networks to select a subset of the models over which to derive a weighted (averaged) posterior of the target variable Zt . A key characteristic of the search is the use of a heuristic to select models that will have a significant influence on the weighted posterior. We introduce Bayesian networks briefly and then describe ISA in detail. 3.1 B ay e s i a n N e t w or k s A Bayesian network is a probabilistic model that combines a graphical representation (the Bayesian network structure) with quantitative information (the parameters of the Bayesian network) to represent the joint probability distribution over a set of random variables [3]. Specifically, a Bayesian network M representing the set of variables X consists of a pair (G, ΘG ). G is a directed acyclic graph that contains a node for every variable in X and an arc between every pair of nodes if the corresponding variables are directly probabilistically dependent. Conversely, the absence of an arc between a pair of nodes denotes probabilistic independence between the corresponding variables. ΘG represents the parameterization of the model. In a Bayesian network M, the immediate predecessors of a node X i in X are called the parents of X i and the successors, both immediate and remote, of Xi in X are called the descendants of X i . The immediate successors of X i are called the children of X i . For each node Xi there is a local probability distribution (that may be discrete or continuous) on that node given the state of its parents. The complete joint probability distribution over X, represented by the parameterization ΘG, can be factored into a product of local probability distributions defined on each node in the network. This factorization is determined by the independences captured by the structure of the Bayesian network and is formalized in the Bayesian network Markov condition: A node (representing a variable) is independent of its nondescendants given just its parents. According to this Markov condition, the joint probability distribution on model variables X = (X1 , X 2, …, X n ) can be factored as follows: n P ( X 1 , X 2 , ..., X n ) = ∏ P ( X i | parents( X i )) , (5) i =1 where parents(Xi ) denotes the set of nodes that are the parents of X i . If Xi has no parents, then the set parents(Xi ) is empty and P(Xi | parents(X i)) is just P(Xi ). 3.2 I S A M od e l s The LBR models of Zheng and Webb [1] can be represented as members of a restricted class of Bayesian networks (see Figure 1). We use the same class of Bayesian networks for the ISA models, to facilitate comparison between the two algorithms. In Figure 1, all nodes represent attributes that are discrete. Each node in X has either an outgoing arc into target node, Z, or receives an arc from Z. That is, each node is either a parent or a child of Z. Thus, X is partitioned into two sets: the first containing nodes (X 1 , …, X j in Figure 1) each of which is a parent of Z and every node in the second set, and the second containing nodes (X j+1 , …, X k in Figure 1) that have as parents the node Z and every node in the first set. The nodes in the first set are instantiated to the corresponding values in the test instance for which Zt is to be predicted. Thus, the first set of nodes represents the antecedent of the LBR rule and the second set of nodes represents the consequent. ... X1= x1 Xi = xi Z Xi+1 ... Xk Figure 1: An example of a Bayesian network LBR model with target node Z and k attribute nodes of which X1 , …, X j are instantiated to values x 1 , …, x j in xt . X 1, …, X j are present in the antecedent of the LBR rule and Z, X j+1 , …, X k (that form the local simple Bayes classifier) are present in the consequent. The indices need not be ordered as shown, but are presented in this example for convenience of exposition. 3.3 M od e l A ve r ag i n g For Bayesian networks, Equation 1 can be evaluated as follows: P ( Z t | x t , D ) = ∑ P ( Z t | x t , M ) P( M | D ) , (6) M with M being a Bayesian network comprised of structure G and parameters ΘG. The probability distribution of interest is a weighted average of the posterior distribution over all possible Bayesian networks where the weight is the probability of the Bayesian network given the data. Since exhaustive enumeration of all possible models is not feasible, even for this class of simple Bayesian networks, we approximate exact model averaging with selective model averaging. Let R be the set of models selected by the search procedure from all possible models in the model space, as described in the next section. Then, with selective model averaging, P(Zt | xt, D) is estimated as: ∑RP( Z t | x t , M ) P(M | D) P (Z t | x t , D) ≅ M ∈ . ∑RP(M | D) M∈ (7) Assuming uniform prior belief over all possible models, the model posterior P(M | D) in Equation 7 can be replaced by the marginal likelihood P(D | M), to obtain the following equation: P ( Z | x , D) ≅ t t ∑ P ( Z t | x t , M ) P( D | M ) . ∑RP( D | M ) M∈ M ∈R (8) The (unconditional) marginal likelihood P(D | M) in Equation 8, is a measure of the goodness of fit of the model to the data and is also known as the model score. While this score is suitable for assessing the model’s fit to the joint probability distribution, it is not necessarily appropriate for assessing the goodness of fit to a conditional probability distribution which is the focus in prediction and classification tasks, as is the case here. A more suitable score in this situation is a conditional model score that is computed from training data D of d instances as: d score( D, M ) = ∏ P ( z p | x1 ,..., x p ,z 1 ,...,z p −1 ,M ) . (9) p =1 This score is computed in a predictive and sequential fashion: for the pth training instance the probability of predicting the observed value zp for the target variable is computed based on the values of all the variables in the preceding p-1 training instances and the values xp of the attributes in the pth instance. One limitation of this score is that its value depends on the ordering of the data. Despite this limitation, it has been shown to be an effective scoring criterion for classification models [4]. The parameters of the Bayesian network M, used in the above computations, are defined as follows: P ( X i = k | parents ( X i ) = j ) ≡ θ ijk = N ijk + α ijk N ij + α ij , (10) where (i) Nijk is the number of instances in the training dataset D where variable Xi has value k and the parents of X i are in state j, (ii) N ij = ∑k N ijk , (iii) αijk is a parameter prior that can be interpreted as the belief equivalent of having previously observed αijk instances in which variable Xi has value k and the parents of X i are in state j, and (iv) α ij = ∑k α ijk . 3.4 M od e l Se a r c h We use a two-phase best-first heuristic search to sample the model space. The first phase ignores the evidence xt in the test instance while searching for models that have high scores as given by Equation 9. This is followed by the second phase that searches for models having the greatest impact on the prediction of Zt for the test instance, which we formalize below. The first phase searches for models that predict Z in the training data very well; these are the models that have high conditional model scores. The initial model is the simple Bayes network that includes all the attributes in X as children of Z. A succeeding model is derived from a current model by reversing the arc of a child node in the current model, adding new outgoing arcs from it to Z and the remaining children, and instantiating this node to the value in the test instance. This process is performed for each child in the current model. An incoming arc of a child node is considered for reversal only if the node’s value is not missing in the test instance. The newly derived models are added to a priority queue, Q. During each iteration of the search, the model with the highest score (given by Equation 9) is removed from Q and placed in a set R, following which new models are generated as described just above, scored and added to Q. The first phase terminates after a user-specified number of models have accumulated in R. The second phase searches for models that change the current model-averaged estimate of P(Zt | xt , D) the most. The idea here is to find viable competing models for making this posterior probability prediction. When no competitive models can be found, the prediction becomes stable. During each iteration of the search, the highest ranked model M* is removed from Q and added to R. The ranking is based on how much the model changes the current estimate of P(Zt | xt , D). More change is better. In particular, M* is the model in Q that maximizes the following function: f ( R, M *) = g ( R) − g ( R U {M *}) , (11) where for a set of models S, the function g(S) computes the approximate model averaged prediction for Zt, as follows: g (S ) = ∑ P(Z M ∈S t | x t , M ) score( D, M ) ∑ score( D, M ) ∈ . (12) M S The second phase terminates when no new model can be found that has a value (as given by Equation 11) that is greater than a user-specified minimum threshold T. The final distribution of Zt is then computed from the models in R using Equation 8. 4 Ev a lu a t i o n We evaluated ISA on the 29 UCI datasets that Zheng and Webb used for the evaluation of LBR. On the same datasets, we also evaluated a simple Bayes classifier (SB) and LBR. For SB and LBR, we used the Weka implementations (Weka v3.3.6, http://www.cs.waikato.ac.nz/ml/weka/) with default settings [5]. We implemented the ISA algorithm as a standalone application in Java. The following settings were used for ISA: a maximum of 100 phase-1 models, a threshold T of 0.001 in phase-2, and an upper limit of 500 models in R. For the parameter priors in Equation 10, all αijk were set to 1. All error rates were obtained by averaging the results from two stratified 10-fold cross-validation (20 trials total) similar to that used by Zheng and Webb. Since, both LBR and ISA can handle only discrete attributes, all numeric attributes were discretized in a pre-processing step using the entropy based discretization method described in [6]. For each pair of training and test folds, the discretization intervals were first estimated from the training fold and then applied to both folds. The error rates of two algorithms on a dataset were compared with a paired t-test carried out at the 5% significance level on the error rate statistics obtained from the 20 trials. The results are shown in Table 1. Compared to SB, ISA has significantly fewer errors on 9 datasets and significantly more errors on one dataset. Compared to LBR, ISA has significantly fewer errors on 7 datasets and significantly more errors on two datasets. On two datasets, chess and tic-tac-toe, ISA shows considerable improvement in performance over both SB and LBR. With respect to computation Table 1: Percent error rates of simple Bayes (SB), Lazy Bayesian Rule (LBR) and Instance-Specific Averaging (ISA). A - indicates that the ISA error rate is statistically significantly lower than the marked SB or LBR error rate. A + indicates that the ISA error rate is statistically significantly higher. Dataset Size Annealing Audiology Breast (W) Chess (KR-KP) Credit (A) Echocardiogram Glass Heart (C) Hepatitis Horse colic House votes 84 Hypothyroid Iris Labor LED 24 Liver disorders Lung cancer Lymphography Pima Postoperative Primary tumor Promoters Solar flare Sonar Soybean Splice junction Tic-Tac-Toe Wine Zoo 898 226 699 3169 690 131 214 303 155 368 435 3163 150 57 200 345 32 148 768 90 339 106 1389 208 683 3177 958 178 101 No. of classes 6 24 2 2 2 2 6 2 2 2 2 2 3 2 10 2 3 4 2 3 22 2 2 2 19 3 2 3 7 Num. Attrib. 6 0 9 0 6 6 9 13 6 7 0 7 4 8 0 6 0 0 8 1 0 0 0 60 0 0 0 13 0 Nom. Attrib. 32 69 0 36 9 1 0 0 13 15 16 18 0 8 24 0 56 18 0 7 17 57 10 0 35 60 9 0 16 Percent error rate SB LBR ISA 1.9 3.5 2.7 29.6 29.4 30.9 3.7 2.9 + 2.8 + 1.1 12.1 3.0 13.8 14.0 13.9 33.2 34.0 35.9 26.9 27.8 29.0 16.2 16.2 17.5 14.2 - 14.2 - 11.3 20.2 16.0 17.8 5.1 10.1 7.0 0.9 0.9 1.4 6.0 6.0 5.3 8.8 6.1 7.0 40.5 40.5 40.3 36.8 36.8 36.8 56.3 56.3 56.3 15.5 - 15.5 - 13.2 21.8 22.0 22.3 33.3 33.3 33.3 54.4 53.5 54.2 7.5 7.5 7.5 20.2 18.3 + 19.4 15.4 15.6 15.9 7.1 7.2 7.9 4.7 4.3 4.4 30.3 - 13.7 - 10.3 1.1 1.1 1.1 6.4 8.4 8.4 - times, ISA took 6 times longer to run than LBR on average for a single test instance on a desktop computer with a 2 GHz Pentium 4 processor and 3 GB of RAM. 5 C o n c lu si o n s a n d Fu t u re R e s ea rc h We have introduced a Bayesian framework for instance-specific model averaging and presented ISA as one example of a classification algorithm based on this framework. An instance-specific algorithm like LBR that does model selection has been shown by Zheng and Webb to perform classification better than several eager algorithms [1]. Our results show that ISA, which extends LBR by adding Bayesian model averaging, improves overall on LBR, which provides support that we can obtain additional prediction improvement by performing instance-specific model averaging rather than just instance-specific model selection. In future work, we plan to explore further the behavior of ISA with respect to the number of models being averaged and the effect of the number of models selected in each of the two phases of the search. We will also investigate methods to improve the computational efficiency of ISA. In addition, we plan to examine other heuristics for model search as well as more general model spaces such as unrestricted Bayesian networks. The instance-specific framework is not restricted to the Bayesian network models that we have used in this investigation. In the future, we plan to explore other models using this framework. Our ultimate interest is to apply these instancespecific algorithms to improve patient-specific predictions (for diagnosis, therapy selection, and prognosis) and thereby to improve patient care. A c k n ow l e d g me n t s This work was supported by the grant T15-LM/DE07059 from the National Library of Medicine (NLM) to the University of Pittsburgh’s Biomedical Informatics Training Program. We would like to thank the three anonymous reviewers for their helpful comments. References [1] Zheng, Z. and Webb, G.I. (2000). Lazy Learning of Bayesian Rules. Machine Learning, 41(1):53-84. [2] Hoeting, J.A., Madigan, D., Raftery, A.E. and Volinsky, C.T. (1999). Bayesian Model Averaging: A Tutorial. Statistical Science, 14:382-417. [3] Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, CA. [4] Kontkanen, P., Myllymaki, P., Silander, T., and Tirri, H. (1999). On Supervised Selection of Bayesian Networks. In Proceedings of the 15th International Conference on Uncertainty in Artificial Intelligence, pages 334-342, Stockholm, Sweden. Morgan Kaufmann. [5] Witten, I.H. and Frank, E. (2000). Data Mining: Practical Machine Learning Tools with Java Implementations. Morgan Kaufmann, San Francisco, CA. [6] Fayyad, U.M., and Irani, K.B. (1993). Multi-Interval Discretization of ContinuousValued Attributes for Classification Learning. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, pages 1022-1027, San Mateo, CA. Morgan Kaufmann.

4 0.46444032 57 nips-2004-Economic Properties of Social Networks

Author: Sham M. Kakade, Michael Kearns, Luis E. Ortiz, Robin Pemantle, Siddharth Suri

Abstract: We examine the marriage of recent probabilistic generative models for social networks with classical frameworks from mathematical economics. We are particularly interested in how the statistical structure of such networks influences global economic quantities such as price variation. Our findings are a mixture of formal analysis, simulation, and experiments on an international trade data set from the United Nations. 1

5 0.42647114 158 nips-2004-Sampling Methods for Unsupervised Learning

Author: Rob Fergus, Andrew Zisserman, Pietro Perona

Abstract: We present an algorithm to overcome the local maxima problem in estimating the parameters of mixture models. It combines existing approaches from both EM and a robust fitting algorithm, RANSAC, to give a data-driven stochastic learning scheme. Minimal subsets of data points, sufficient to constrain the parameters of the model, are drawn from proposal densities to discover new regions of high likelihood. The proposal densities are learnt using EM and bias the sampling toward promising solutions. The algorithm is computationally efficient, as well as effective at escaping from local maxima. We compare it with alternative methods, including EM and RANSAC, on both challenging synthetic data and the computer vision problem of alpha-matting. 1

6 0.41321817 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill

7 0.40100649 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms

8 0.38705376 141 nips-2004-Optimal sub-graphical models

9 0.38515988 37 nips-2004-Co-Training and Expansion: Towards Bridging Theory and Practice

10 0.37099239 82 nips-2004-Incremental Algorithms for Hierarchical Classification

11 0.35409206 95 nips-2004-Large-Scale Prediction of Disulphide Bond Connectivity

12 0.35337526 61 nips-2004-Efficient Out-of-Sample Extension of Dominant-Set Clusters

13 0.34250599 27 nips-2004-Bayesian Regularization and Nonnegative Deconvolution for Time Delay Estimation

14 0.34110522 199 nips-2004-Using Machine Learning to Break Visual Human Interaction Proofs (HIPs)

15 0.33192405 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

16 0.3216148 198 nips-2004-Unsupervised Variational Bayesian Learning of Nonlinear Models

17 0.3161644 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

18 0.31512803 190 nips-2004-The Rescorla-Wagner Algorithm and Maximum Likelihood Estimation of Causal Parameters

19 0.30904195 165 nips-2004-Semi-supervised Learning on Directed Graphs

20 0.30581239 207 nips-2004-ℓ₀-norm Minimization for Basis Selection


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.081), (15, 0.151), (17, 0.343), (26, 0.06), (31, 0.044), (33, 0.148), (35, 0.015), (39, 0.023), (50, 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.82088155 43 nips-2004-Conditional Models of Identity Uncertainty with Application to Noun Coreference

Author: Andrew McCallum, Ben Wellner

Abstract: Coreference analysis, also known as record linkage or identity uncertainty, is a difficult and important problem in natural language processing, databases, citation matching and many other tasks. This paper introduces several discriminative, conditional-probability models for coreference analysis, all examples of undirected graphical models. Unlike many historical approaches to coreference, the models presented here are relational—they do not assume that pairwise coreference decisions should be made independently from each other. Unlike other relational models of coreference that are generative, the conditional model here can incorporate a great variety of features of the input without having to be concerned about their dependencies—paralleling the advantages of conditional random fields over hidden Markov models. We present positive results on noun phrase coreference in two standard text data sets. 1

2 0.80632097 191 nips-2004-The Variational Ising Classifier (VIC) Algorithm for Coherently Contaminated Data

Author: Oliver Williams, Andrew Blake, Roberto Cipolla

Abstract: There has been substantial progress in the past decade in the development of object classifiers for images, for example of faces, humans and vehicles. Here we address the problem of contaminations (e.g. occlusion, shadows) in test images which have not explicitly been encountered in training data. The Variational Ising Classifier (VIC) algorithm models contamination as a mask (a field of binary variables) with a strong spatial coherence prior. Variational inference is used to marginalize over contamination and obtain robust classification. In this way the VIC approach can turn a kernel classifier for clean data into one that can tolerate contamination, without any specific training on contaminated positives. 1

same-paper 3 0.77545446 130 nips-2004-Newscast EM

Author: Wojtek Kowalczyk, Nikos A. Vlassis

Abstract: We propose a gossip-based distributed algorithm for Gaussian mixture learning, Newscast EM. The algorithm operates on network topologies where each node observes a local quantity and can communicate with other nodes in an arbitrary point-to-point fashion. The main difference between Newscast EM and the standard EM algorithm is that the M-step in our case is implemented in a decentralized manner: (random) pairs of nodes repeatedly exchange their local parameter estimates and combine them by (weighted) averaging. We provide theoretical evidence and demonstrate experimentally that, under this protocol, nodes converge exponentially fast to the correct estimates in each M-step of the EM algorithm. 1

4 0.69280237 68 nips-2004-Face Detection --- Efficient and Rank Deficient

Author: Wolf Kienzle, Matthias O. Franz, Bernhard Schölkopf, Gökhan H. Bakir

Abstract: This paper proposes a method for computing fast approximations to support vector decision functions in the field of object detection. In the present approach we are building on an existing algorithm where the set of support vectors is replaced by a smaller, so-called reduced set of synthesized input space points. In contrast to the existing method that finds the reduced set via unconstrained optimization, we impose a structural constraint on the synthetic points such that the resulting approximations can be evaluated via separable filters. For applications that require scanning large images, this decreases the computational complexity by a significant amount. Experimental results show that in face detection, rank deficient approximations are 4 to 6 times faster than unconstrained reduced set systems. 1

5 0.57862854 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling

Author: John Blitzer, Fernando Pereira, Kilian Q. Weinberger, Lawrence K. Saul

Abstract: Statistical language models estimate the probability of a word occurring in a given context. The most common language models rely on a discrete enumeration of predictive contexts (e.g., n-grams) and consequently fail to capture and exploit statistical regularities across these contexts. In this paper, we show how to learn hierarchical, distributed representations of word contexts that maximize the predictive value of a statistical language model. The representations are initialized by unsupervised algorithms for linear and nonlinear dimensionality reduction [14], then fed as input into a hierarchical mixture of experts, where each expert is a multinomial distribution over predicted words [12]. While the distributed representations in our model are inspired by the neural probabilistic language model of Bengio et al. [2, 3], our particular architecture enables us to work with significantly larger vocabularies and training corpora. For example, on a large-scale bigram modeling task involving a sixty thousand word vocabulary and a training corpus of three million sentences, we demonstrate consistent improvement over class-based bigram models [10, 13]. We also discuss extensions of our approach to longer multiword contexts. 1

6 0.5754981 192 nips-2004-The power of feature clustering: An application to object detection

7 0.57503855 182 nips-2004-Synergistic Face Detection and Pose Estimation with Energy-Based Models

8 0.5696317 55 nips-2004-Distributed Occlusion Reasoning for Tracking with Nonparametric Belief Propagation

9 0.56897163 47 nips-2004-Contextual Models for Object Detection Using Boosted Random Fields

10 0.56603992 178 nips-2004-Support Vector Classification with Input Data Uncertainty

11 0.56523031 154 nips-2004-Resolving Perceptual Aliasing In The Presence Of Noisy Sensors

12 0.56460768 14 nips-2004-A Topographic Support Vector Machine: Classification Using Local Label Configurations

13 0.56259137 131 nips-2004-Non-Local Manifold Tangent Learning

14 0.56246352 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection

15 0.56196934 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

16 0.56194955 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

17 0.56091589 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill

18 0.55942756 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

19 0.55930287 187 nips-2004-The Entire Regularization Path for the Support Vector Machine

20 0.55859274 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications