nips nips2005 nips2005-127 knowledge-graph by maker-knowledge-mining

127 nips-2005-Mixture Modeling by Affinity Propagation


Source: pdf

Author: Brendan J. Frey, Delbert Dueck

Abstract: Clustering is a fundamental problem in machine learning and has been approached in many ways. Two general and quite different approaches include iteratively fitting a mixture model (e.g., using EM) and linking together pairs of training cases that have high affinity (e.g., using spectral methods). Pair-wise clustering algorithms need not compute sufficient statistics and avoid poor solutions by directly placing similar examples in the same cluster. However, many applications require that each cluster of data be accurately described by a prototype or model, so affinity-based clustering – and its benefits – cannot be directly realized. We describe a technique called “affinity propagation”, which combines the advantages of both approaches. The method learns a mixture model of the data by recursively propagating affinity messages. We demonstrate affinity propagation on the problems of clustering image patches for image segmentation and learning mixtures of gene expression models from microarray data. We find that affinity propagation obtains better solutions than mixtures of Gaussians, the K-medoids algorithm, spectral clustering and hierarchical clustering, and is both able to find a pre-specified number of clusters and is able to automatically determine the number of clusters. Interestingly, affinity propagation can be viewed as belief propagation in a graphical model that accounts for pairwise training case likelihood functions and the identification of cluster centers. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Two general and quite different approaches include iteratively fitting a mixture model (e.g., using EM) and linking together pairs of training cases that have high affinity (e.g., using spectral methods). [sent-8, score-0.151]

2 Pair-wise clustering algorithms need not compute sufficient statistics and avoid poor solutions by directly placing similar examples in the same cluster. [sent-11, score-0.146]

3 However, many applications require that each cluster of data be accurately described by a prototype or model, so affinity-based clustering – and its benefits – cannot be directly realized. [sent-12, score-0.263]

4 We demonstrate affinity propagation on the problems of clustering image patches for image segmentation and learning mixtures of gene expression models from microarray data. [sent-15, score-0.84]

5 We find that affinity propagation obtains better solutions than mixtures of Gaussians, the K-medoids algorithm, spectral clustering and hierarchical clustering, and is both able to find a pre-specified number of clusters and is able to automatically determine the number of clusters. [sent-16, score-0.567]

6 Interestingly, affinity propagation can be viewed as belief propagation in a graphical model that accounts for pairwise training case likelihood functions and the identification of cluster centers. [sent-17, score-0.799]

7 1 Introduction Many machine learning tasks involve clustering data using a mixture model, so that the data in each cluster is accurately described by a probability model from a pre-defined, possibly parameterized, set of models [1]. [sent-18, score-0.317]

8 By marginalizing over hidden variables, we can still view many hierarchical learning problems as mixture modeling, but the class-conditional models become complicated and nonlinear. [sent-21, score-0.131]

9 Exact computation of the data likelihoods may not be feasible, nor may exact computation of the sufficient statistics needed to update parameterized models. [sent-23, score-0.122]

10 A different approach to clustering ignores the notion of a class-conditional model, and links together pairs of data points that have high affinity. [sent-25, score-0.168]

11 The affinity or similarity (a real number in [0, 1]) between two training cases gives a direct indication of whether they should be in the same cluster. [sent-26, score-0.108]

12 Hierarchical clustering and its Bayesian variants [2] are popular affinity-based clustering techniques, whereby a binary tree is constructed greedily from the leaves to the root by recursively linking together pairs of training cases with high affinity. [sent-27, score-0.469]

13 We describe a new method that, for the first time to our knowledge, combines the advantages of model-based clustering and affinity-based clustering. [sent-30, score-0.146]

14 Like affinity-based clustering, our algorithm directly examines pairs of nearby training cases to help ascertain whether or not they should be in the same cluster. [sent-32, score-0.155]

15 Our method, called “affinity propagation”, can be viewed as the sum-product algorithm or the max-product algorithm in a graphical model describing the mixture model. [sent-34, score-0.125]

16 2 A greedy algorithm: K-medoids The first step in obtaining the benefit of pair-wise training case comparisons is to replace the parameters of the mixture model with pointers into the training data. [sent-35, score-0.334]

17 A similar representation is used in K-medians clustering or K-medoids clustering, where the goal is to identify K training cases, or exemplars, as cluster centers. [sent-36, score-0.321]

18 While the algorithm makes greedy hard decisions for the cluster centers, it is a useful intermediate step in introducing affinity propagation. [sent-40, score-0.187]

19 Given training cases x1, …, xN, suppose the likelihood of training case xi, given that training case xk is its cluster center, is P(xi | xi in xk). [sent-44, score-0.804]

20 Given the training data, this likelihood depends only on i and k, so we denote it by Lik . [sent-47, score-0.118]

21 Lii is set to the Bayesian prior probability that xi is a cluster center. [sent-48, score-0.183]

22 Initially, K training cases are chosen as exemplars. [sent-49, score-0.108]

23 Denote the current set of cluster center indices by K and the index of the current cluster center for xi by si . [sent-52, score-0.341]

24 K-medoids iterates between assigning training cases to exemplars (E step), and choosing a training case as the new exemplar for each cluster (M step). [sent-53, score-0.694]

25 Assuming for simplicity that the mixing proportions are equal, and denoting the responsibility likelihood ratio by rik = P(xi | xi in xk) / P(xi | xi not in xk)¹, the updates are as follows. [sent-54, score-0.658]

26 E step: For i = 1, …, N, and for each k ∈ K: rik ← Lik / (Σ_{j: j≠k} Lij); then set si ← argmax_{k∈K} rik. Greedy M step: For each k ∈ K, replace k in K with argmax_{j: sj=k} Σ_{i: si=k} Lij. This algorithm nicely replaces parameter-to-training-case comparisons with pair-wise training case comparisons. [sent-57, score-0.348]
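
A minimal Python sketch of these E and M steps, assuming a dense N × N likelihood matrix L (with Lii holding the exemplar priors). The initialization and stopping details are our assumptions, not the paper's:

```python
import numpy as np

def greedy_kmedoids(L, exemplars, n_iter=50):
    """Greedy K-medoids on an N x N likelihood matrix L.

    L[i, k] is the likelihood of case i given exemplar k; L[i, i] is the
    prior that i is an exemplar. `exemplars` is the initial index set K.
    """
    N = L.shape[0]
    K = list(exemplars)
    for _ in range(n_iter):
        # E step: responsibilities r[i, k] = L[i, k] / sum_{j != k} L[i, j],
        # then assign each case to its most responsible exemplar.
        denom = L.sum(axis=1, keepdims=True) - L  # per (i, k): sum over j != k
        r = L / denom
        s = np.array([K[np.argmax(r[i, K])] for i in range(N)])
        # Greedy M step: within each cluster, pick the member j maximizing
        # the summed likelihood of the cluster's cases under exemplar j.
        new_K = []
        for k in K:
            members = np.where(s == k)[0]
            if len(members) == 0:
                new_K.append(k)
                continue
            scores = [L[members, j].sum() for j in members]
            new_K.append(members[int(np.argmax(scores))])
        if set(new_K) == set(K):
            break
        K = new_K
    return K, s
```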

27 However, in the greedy M step, specific training cases are chosen as exemplars. [sent-58, score-0.176]

28 ¹Note that using the traditional definition of responsibility, rik ← Lik / (Σj Lij), will give the same decisions as using the likelihood ratio. [sent-61, score-0.149]

29 3 Affinity propagation The responsibilities in the greedy K-medoids algorithm can be viewed as messages that are sent from training cases to potential exemplars, providing soft evidence of the preference for each training case to belong to each exemplar's cluster. [sent-62, score-0.76]

30 To avoid making hard decisions for the cluster centers, we introduce messages called “availabilities”. [sent-63, score-0.194]

31 Availabilities are sent from exemplars to training cases and provide soft evidence of the preference for each exemplar to be available as a center for each training case. [sent-64, score-0.648]

32 Responsibilities are computed using likelihoods and availabilities, and availabilities are computed using responsibilities, recursively. [sent-65, score-0.21]

33 We refer to both responsibilities and availabilities as affinities and we refer to the message-passing scheme as affinity propagation. [sent-66, score-0.237]

34 Here, we explain the update rules; in the next section, we show that affinity propagation can be derived as the sum-product algorithm in a graphical model describing the mixture model. [sent-67, score-0.408]

35 Denote the availability sent from candidate exemplar xk to training case xi by aki . [sent-68, score-0.76]

36 In this rule, the responsibility of a training case xi as its own cluster center, rii , is high if no other exemplars are highly available to xi and if xi has high probability under the Bayesian prior, Lii . [sent-73, score-0.71]

37 The availability of a training case xk as its own exemplar, akk , is high if at least one other training case places high responsibility on xk being an exemplar. [sent-75, score-0.742]

38 The availability of xk as an exemplar for xi, aki, is high if the self-responsibility rkk is high (1/rkk − 1 approaches −1), but is decreased if other training cases compete in using xk as an exemplar (the term 1/rkk − 1 is scaled down if rjk is large for some other training case xj). [sent-76, score-1.365]
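
The exact responsibility/availability formulas are only partially recoverable from this extraction, so the sketch below substitutes the log-domain message updates from Frey and Dueck's later, widely known formulation of affinity propagation. It implements the same recursive exchange described above (responsibilities computed from similarities and availabilities, availabilities from responsibilities), but the algebra differs from this paper's likelihood-ratio form:

```python
import numpy as np

def affinity_propagation(S, damping=0.5, n_iter=200):
    """Schematic affinity propagation on an N x N similarity matrix S.

    S[i, k] plays the role of log L[i, k]; S[k, k] is the exemplar
    preference (log prior). Returns the exemplar index chosen by each point.
    """
    N = S.shape[0]
    R = np.zeros((N, N))  # responsibilities: cases -> candidate exemplars
    A = np.zeros((N, N))  # availabilities: candidate exemplars -> cases
    for _ in range(n_iter):
        # Responsibility: how well-suited k is as exemplar for i,
        # relative to the best competing candidate.
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first = AS[np.arange(N), idx]
        AS[np.arange(N), idx] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[np.arange(N), idx] = S[np.arange(N), idx] - second
        R = damping * R + (1 - damping) * R_new
        # Availability: evidence that k is available as an exemplar,
        # accumulated from the positive responsibilities of other cases.
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())
        A_new = Rp.sum(axis=0, keepdims=True) - Rp
        dA = A_new.diagonal().copy()
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, dA)
        A = damping * A + (1 - damping) * A_new
    return np.argmax(A + R, axis=1)
```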

39 In our implementation, each candidate exemplar absorbs and emits affinities in parallel, and the centers are ordered according to the sum of their likelihoods. [sent-78, score-0.276]

40 Direct implementation of the above propagation rules gives an N²-time algorithm, but affinities need only be propagated between i and k if Lik > 0. [sent-81, score-0.333]
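
As a sketch of this sparsity observation, messages need only be scheduled along pairs with nonzero likelihood; the helper below (a hypothetical name, not from the paper) builds that edge list:

```python
import numpy as np

def nonzero_edges(L, threshold=0.0):
    """Return the (i, k) pairs along which messages must be passed.

    Pairs with L[i, k] <= threshold carry no evidence and can be skipped,
    reducing the per-iteration cost from O(N^2) to O(#edges).
    """
    i_idx, k_idx = np.nonzero(L > threshold)
    return list(zip(i_idx.tolist(), k_idx.tolist()))
```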

41 Affinity propagation accounts for a Bayesian prior pdf on the exemplars and is able to automatically search over the appropriate number of exemplars. [sent-83, score-0.521]

42 (Note that the number of exemplars is not pre-specified in the above updates.) [sent-84, score-0.198]

43 In applications where a particular number of clusters is desired, the update rule for the responsibilities (in particular, the self-responsibilities rkk, which determine the availabilities of the exemplars) can be modified, as described in the next section. [sent-85, score-0.341]

44 The affinity propagation update rules can be derived as an instance of the sum-product algorithm. (Figure 1: Affinity propagation can be viewed as belief propagation in this factor graph.) [sent-87, score-0.879]

45 Using si to denote the index of the exemplar for xi, the product of the likelihoods of the training cases and the priors on the exemplars is ∏_{i=1}^{N} L_{i,si}. [sent-89, score-0.738]

46 (If si = i, xi is an exemplar with a priori pdf Lii.) [sent-90, score-0.386]

47 The configuration s1, …, sN completely specifies the mixture model, but not all configurations of these variables are allowed: si = k (xi in cluster xk) implies sk = k (xk is an exemplar), and sk = k (xk is an exemplar) implies si = k for some i ≠ k (some other training case is in cluster xk). [sent-94, score-0.98]

48 These constraints are captured by functions fk(s1, …, sN), where fk is the constraint for candidate cluster xk: fk(s1, …, sN) = 0 if either sk ≠ k while si = k for some i ≠ k, or sk = k while si ≠ k for all i ≠ k; otherwise fk = 1. [sent-98, score-0.485]

49 Thus, the joint distribution of the mixture model and data factorizes as P = ∏_{k=1}^{N} fk(s1, …, sN) · ∏_{i=1}^{N} L_{i,si}. [sent-102, score-0.111]
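
A small sketch checking the constraint fk directly. The two invalid cases follow the implications stated in sentence 47; the reconstruction of fk above is itself an assumption where the extraction was garbled:

```python
def f_k(s, k):
    """Indicator constraint for candidate exemplar k.

    s[i] is the exemplar index chosen by case i. The configuration is
    invalid (returns 0) if k is chosen as an exemplar by some other case
    while not being its own exemplar, or if k declares itself an exemplar
    that no other case uses.
    """
    chosen_by_other = any(s[i] == k for i in range(len(s)) if i != k)
    if s[k] != k and chosen_by_other:
        return 0
    if s[k] == k and not chosen_by_other:
        return 0
    return 1
```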

50 It is straightforward to show that the updates for affinity propagation correspond to the message updates for the sum-product algorithm or loopy belief propagation (see [10] for a tutorial). [sent-110, score-0.578]

51 The responsibilities correspond to messages sent from the s’s to the f ’s, while the availabilities correspond to messages sent from the f ’s to the s’s. [sent-111, score-0.487]

52 Messages can be propagated through this function in linear time, by implementing it as a Markov chain that accumulates exemplar counts. [sent-116, score-0.259]

53 Max-product affinity propagation can be derived as an instance of the max-product algorithm, instead of the sum-product algorithm. [sent-118, score-0.257]

54 An advantage of max-product affinity propagation is that the algorithm is invariant to multiplicative constants in the log-likelihoods. [sent-120, score-0.282]

55 4 Image segmentation A sensible model-based approach to image segmentation is to imagine that each patch in the image originates from one of a small number of prototype texture patches. [sent-121, score-0.344]

56 Pair-wise affinity-based techniques, and in particular spectral clustering, have been employed with some success [4, 9], with the main disadvantage being the lack of an underlying model. (Figure 2: Segmentation of non-aligned gray-scale characters.) [sent-123, score-0.192]

57 Patches clustered by affinity propagation and K-medoids are colored according to classification (centers shown below solutions). [sent-124, score-0.327]

58 Affinity propagation achieves a near-best score compared to 1000 runs of K-medoids. [sent-125, score-0.3]

59 Having a model with class representatives enables efficient synthesis (generation) of patches, and classification of test patches – requiring only K comparisons (to class centers) rather than N comparisons (to training cases). [sent-127, score-0.242]

60 Each training case xi is a 24 × 24 image patch and x_i^m is the mth pixel in the patch. [sent-132, score-0.425]

61 The match between patch xi and patch xk is measured by Σ_m x_i^m · f^m(xk, T), where f(xk, T) is the patch obtained by applying a 2-D translation T plus cropping to patch xk. [sent-134, score-0.868]

62 This metric is used in the likelihood function: Lik ∝ Σ_T p(T) e^{β(Σ_m x_i^m · f^m(xk,T))/x̄_i} ≈ e^{β max_T(Σ_m x_i^m · f^m(xk,T))/x̄_i}, where x̄_i = (1/24²) Σ_m x_i^m is used to normalize the match by the amount of ink in xi. [sent-136, score-0.45]

63 β controls how strictly xi should match xk to have high likelihood. [sent-137, score-0.299]
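
A rough sketch of the max-over-translations score under this likelihood. The function name, the search window, and the zero-padded border handling are our assumptions:

```python
import numpy as np

def patch_log_likelihood(xi, xk, beta=1.0, max_shift=4):
    """Approximate log L_ik for 24 x 24 patches xi, xk.

    Scores the best 2-D translation of xk against xi (the max_T
    approximation), with the match normalized by the mean ink x_bar in xi.
    """
    x_bar = xi.mean()  # (1/24^2) * sum_m x_i^m
    best = -np.inf
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(np.roll(xk, dy, axis=0), dx, axis=1)
            # Cropping: zero out pixels that wrapped around the border.
            if dy > 0: shifted[:dy, :] = 0
            elif dy < 0: shifted[dy:, :] = 0
            if dx > 0: shifted[:, :dx] = 0
            elif dx < 0: shifted[:, dx:] = 0
            best = max(best, float((xi * shifted).sum()))
    return beta * best / x_bar
```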

64 Max-product affinity propagation is independent of the choice of β, and for sum-product affinity propagation we quite arbitrarily chose β = 1. [sent-138, score-0.514]

65 The exemplar priors Lkk were set to median_{i,k: i≠k} Lik. [sent-139, score-0.213]

66 We then took a much larger set of overlapping patches, classified them into the 4 categories, and then colored each pixel in the image according to the most frequent class for the pixel. [sent-143, score-0.15]

67 While affinity propagation is deterministic, the EM algorithm depends on initialization. [sent-146, score-0.257]

68 The score for affinity propagation is also shown, and achieves near-best performance (98th percentile). [sent-149, score-0.3]

69 The histograms show the percentile in score achieved by affinity propagation compared to 1000 runs of greedy EM, for different random training sets. [sent-156, score-0.471]

70 We divided the image in Fig. 3 into an 8 × 8 grid of non-overlapping 24 × 24 patches and clustered them into K = 6 classes using affinity propagation (both forms), greedy EM in our model, spectral clustering (using a normalized L-matrix based on a set of 29 × 29 overlapping patches), and mixtures of Gaussians². [sent-158, score-0.718]

71 For greedy EM, the affinity propagation algorithms, and mixtures of Gaussians, we then chose all possible 24 × 24 overlapping patches and calculated their likelihoods given each of the 6 cluster centers, classifying each patch according to its maximum likelihood. [sent-159, score-0.733]

72 Fig. 3 shows the segmentations for the various methods, where the central pixel of each patch is colored according to its class. [sent-161, score-0.147]

73 Again, affinity propagation achieves a solution that is near-best compared to one thousand runs of greedy EM. [sent-162, score-0.347]

74 5 Learning mixtures of gene models Currently, an important problem in genomics research is the discovery of genes and gene variants that are expressed as messenger RNAs (mRNAs) in normal tissues. [sent-163, score-0.311]

75 In a recent study [11], we used DNA-based techniques to identify 837,251 possible exons (“putative exons”) in the mouse genome. [sent-164, score-0.164]

76 For each putative exon, we used an Agilent microarray probe to measure the amount of corresponding mRNA that was present in each of 12 mouse tissues. [sent-165, score-0.247]

77 By grouping together feature vectors for nearby probes, we can detect genes and variations of genes. [sent-167, score-0.141]

78 Here, we compare affinity propagation with hierarchical clustering, which was previously used to find gene structures [12]. [sent-168, score-0.378]

79 For spectral clustering, we tried scale-parameter values of 0.5, 1 and 2, and for each of these tried clustering using 6, 8, 10, 12 and 14 eigenvectors. [sent-171, score-0.175]

80 The eigenvector features were clustered using EM in a mixture of Gaussians and out of 10 trials, the solution with highest likelihood was selected. [sent-173, score-0.152]

81 For mixtures of Gaussians applied directly to the image patches, we picked the model with highest likelihood in 10 trials. [sent-174, score-0.165]

82 Figure 4: (a) A normalized subset of 837,251 tissue expression profiles – mRNA level versus tissue – for putative exons from the mouse genome (most profiles are much noisier than these). [sent-175, score-0.387]

83 (b) The true exon detection rate (in known genes) versus the false discovery rate, for affinity propagation and hierarchical clustering. [sent-176, score-0.524]

84 The distribution over the distance between probes in the same gene |i − j| is assumed to be geometric with parameter λ. [sent-181, score-0.12]

85 p0 (xi ) is a background distribution that accounts for false putative exons and q is the probability of a false putative exon within a gene. [sent-182, score-0.54]

86 The Bayesian prior probability that xk is an exemplar is set to θ · p0 (xk ), where θ is a control knob used to vary the sensitivity of the system. [sent-184, score-0.46]

87 Lij ≈ λe^{−λ|i−j|} [q·p0(xi) + (1−q) max p(y, z, σ)]. Because of the term λe^{−λ|i−j|} and the additional assumption that genes on the same strand do not overlap, it is not necessary to propagate affinities between all 837,251² pairs of training cases. [sent-185, score-0.241]

88 We assume Lij = 0 for |i − j| > 100, in which case it is not necessary to propagate affinities between xi and xj . [sent-186, score-0.111]

89 The assumption that genes do not overlap implies that if si = k, then sj = k for j ∈ {min(i,k), …, max(i,k)}. [sent-187, score-0.178]
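
A sketch of the banded sparsity this implies: likelihoods are stored only for probe pairs within 100 positions of each other. Here `pair_likelihood` is a placeholder, not the paper's actual gene-model likelihood:

```python
import numpy as np
from scipy.sparse import lil_matrix

def banded_likelihoods(X, pair_likelihood, max_dist=100):
    """Build a sparse N x N likelihood matrix for ordered probes X.

    L[i, j] = 0 whenever |i - j| > max_dist, so only O(N * max_dist)
    entries are ever computed or propagated.
    """
    N = len(X)
    L = lil_matrix((N, N))
    for i in range(N):
        for j in range(max(0, i - max_dist), min(N, i + max_dist + 1)):
            L[i, j] = pair_likelihood(X[i], X[j], abs(i - j))
    return L.tocsr()
```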

90 After affinity propagation is used to automatically select the number of mixture components³... (³Based on the experimental procedure and a set of previously-annotated genes (RefSeq), we estimated λ = 0. [sent-192, score-0.469]

91 We used a mixture of Gaussians for p0(xi), which was learned from the entire training set.) [sent-196, score-0.154]

92 ...and identify the mixture centers and the probes that belong to them (genes), each probe xi is labeled as an exon or a non-exon depending on which of the two terms in the above likelihood function (q · p0(xi) or the large term to its right) is larger. [sent-197, score-0.479]

93 Fig. 4b shows the fraction of exons in known genes detected by affinity propagation versus the false detection rate. [sent-199, score-0.591]

94 The false detection rate was estimated by randomly permuting the order of the probes in the training set, and applying affinity propagation. [sent-201, score-0.23]
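
A sketch of that permutation control, assuming a `detect_exons` routine (hypothetical) that runs the full affinity propagation pipeline and returns binary exon calls:

```python
import numpy as np

def estimate_false_detection_rate(X, detect_exons, rng=None):
    """Estimate the false detection rate by destroying probe order.

    Randomly permuting the probes breaks the genomic adjacency that real
    genes rely on, so any exons 'detected' on the permuted data are
    counted as false detections.
    """
    rng = np.random.default_rng(rng)
    permuted = X[rng.permutation(len(X))]
    calls = detect_exons(permuted)  # placeholder for the full pipeline
    return float(np.mean(calls))
```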

95 Even for quite low false discovery rates, affinity propagation identifies over one third of the known exons. [sent-202, score-0.353]

96 Using a variety of metrics, including the above metric, we also used hierarchical clustering to detect exons. [sent-203, score-0.202]

97 The performance of hierarchical clustering using the metric with highest sensitivity is also shown. [sent-204, score-0.259]

98 Affinity propagation compares favorably, achieving a five-fold increase in true detection rate at a false detection rate of 0.… [sent-207, score-0.12]

99 [Runtime table: 3 s vs. Hierarch Clust 16 m + 28 m] Summary An advantage of affinity propagation is that the update rules are deterministic, quite simple, and can be derived as an instance of the sum-product algorithm in a factor graph. [sent-222, score-0.339]

100 To our knowledge, affinity propagation is the first algorithm to combine advantages of pair-wise clustering methods that make use of bottom-up evidence and model-based methods that seek to fit top-down global models to the data. [sent-224, score-0.403]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('af', 0.472), ('nity', 0.429), ('propagation', 0.257), ('exemplar', 0.213), ('xk', 0.212), ('exemplars', 0.198), ('lik', 0.175), ('clustering', 0.146), ('availabilities', 0.14), ('nities', 0.139), ('exons', 0.122), ('genes', 0.116), ('lij', 0.105), ('responsibilities', 0.097), ('patches', 0.097), ('cluster', 0.096), ('exon', 0.091), ('rik', 0.087), ('xi', 0.087), ('putative', 0.081), ('patch', 0.079), ('training', 0.079), ('responsibility', 0.076), ('mixture', 0.075), ('messages', 0.075), ('false', 0.072), ('likelihoods', 0.07), ('aki', 0.07), ('rjk', 0.07), ('probe', 0.069), ('greedy', 0.068), ('gene', 0.065), ('segmentation', 0.065), ('centers', 0.063), ('si', 0.062), ('image', 0.057), ('hierarchical', 0.056), ('probes', 0.055), ('microarray', 0.055), ('em', 0.054), ('update', 0.052), ('lii', 0.052), ('rkk', 0.052), ('sent', 0.05), ('availability', 0.049), ('propagated', 0.046), ('spectral', 0.046), ('mth', 0.046), ('percentile', 0.046), ('mrna', 0.046), ('sk', 0.043), ('sn', 0.042), ('mouse', 0.042), ('tissue', 0.042), ('xm', 0.041), ('mixtures', 0.041), ('likelihood', 0.039), ('clustered', 0.038), ('fk', 0.036), ('pixel', 0.036), ('sensitivity', 0.035), ('gaussians', 0.035), ('akk', 0.035), ('clust', 0.035), ('lisi', 0.035), ('lkk', 0.035), ('comparisons', 0.033), ('frey', 0.032), ('colored', 0.032), ('updates', 0.032), ('rules', 0.03), ('noisier', 0.03), ('cases', 0.029), ('proc', 0.029), ('tried', 0.029), ('genome', 0.028), ('picked', 0.028), ('bj', 0.027), ('color', 0.027), ('recursively', 0.026), ('viewed', 0.026), ('multiplicative', 0.025), ('nearby', 0.025), ('pro', 0.025), ('overlapping', 0.025), ('pdf', 0.024), ('propagate', 0.024), ('discovery', 0.024), ('detection', 0.024), ('graphical', 0.024), ('decisions', 0.023), ('metric', 0.022), ('pairs', 0.022), ('achieves', 0.022), ('bayesian', 0.021), ('automatically', 0.021), ('accounts', 0.021), ('score', 0.021), ('prototype', 0.021), ('linking', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 127 nips-2005-Mixture Modeling by Affinity Propagation

Author: Brendan J. Frey, Delbert Dueck

Abstract: Clustering is a fundamental problem in machine learning and has been approached in many ways. Two general and quite different approaches include iteratively fitting a mixture model (e.g., using EM) and linking together pairs of training cases that have high affinity (e.g., using spectral methods). Pair-wise clustering algorithms need not compute sufficient statistics and avoid poor solutions by directly placing similar examples in the same cluster. However, many applications require that each cluster of data be accurately described by a prototype or model, so affinity-based clustering – and its benefits – cannot be directly realized. We describe a technique called “affinity propagation”, which combines the advantages of both approaches. The method learns a mixture model of the data by recursively propagating affinity messages. We demonstrate affinity propagation on the problems of clustering image patches for image segmentation and learning mixtures of gene expression models from microarray data. We find that affinity propagation obtains better solutions than mixtures of Gaussians, the K-medoids algorithm, spectral clustering and hierarchical clustering, and is both able to find a pre-specified number of clusters and is able to automatically determine the number of clusters. Interestingly, affinity propagation can be viewed as belief propagation in a graphical model that accounts for pairwise training case likelihood functions and the identification of cluster centers. 1

2 0.14732553 178 nips-2005-Soft Clustering on Graphs

Author: Kai Yu, Shipeng Yu, Volker Tresp

Abstract: We propose a simple clustering framework on graphs encoding pairwise data similarities. Unlike usual similarity-based methods, the approach softly assigns data to clusters in a probabilistic way. More importantly, a hierarchical clustering is naturally derived in this framework to gradually merge lower-level clusters into higher-level ones. A random walk analysis indicates that the algorithm exposes clustering structures in various resolutions, i.e., a higher level statistically models a longer-term diffusion on graphs and thus discovers a more global clustering structure. Finally we provide very encouraging experimental results. 1

3 0.13521758 102 nips-2005-Kernelized Infomax Clustering

Author: David Barber, Felix V. Agakov

Abstract: We propose a simple information-theoretic approach to soft clustering based on maximizing the mutual information I(x, y) between the unknown cluster labels y and the training patterns x with respect to parameters of specifically constrained encoding distributions. The constraints are chosen such that patterns are likely to be clustered similarly if they lie close to specific unknown vectors in the feature space. The method may be conveniently applied to learning the optimal affinity matrix, which corresponds to learning parameters of the kernelized encoder. The procedure does not require computations of eigenvalues of the Gram matrices, which makes it potentially attractive for clustering large data sets. 1

4 0.12032324 13 nips-2005-A Probabilistic Approach for Optimizing Spectral Clustering

Author: Rong Jin, Feng Kang, Chris H. Ding

Abstract: Spectral clustering enjoys its success in both data clustering and semisupervised learning. But, most spectral clustering algorithms cannot handle multi-class clustering problems directly. Additional strategies are needed to extend spectral clustering algorithms to multi-class clustering problems. Furthermore, most spectral clustering algorithms employ hard cluster membership, which is likely to be trapped by the local optimum. In this paper, we present a new spectral clustering algorithm, named “Soft Cut”. It improves the normalized cut algorithm by introducing soft membership, and can be efficiently computed using a bound optimization algorithm. Our experiments with a variety of datasets have shown the promising performance of the proposed clustering algorithm. 1

5 0.11968394 46 nips-2005-Consensus Propagation

Author: Benjamin V. Roy, Ciamac C. Moallemi

Abstract: We propose consensus propagation, an asynchronous distributed protocol for averaging numbers across a network. We establish convergence, characterize the convergence rate for regular graphs, and demonstrate that the protocol exhibits better scaling properties than pairwise averaging, an alternative that has received much recent attention. Consensus propagation can be viewed as a special case of belief propagation, and our results contribute to the belief propagation literature. In particular, beyond singly-connected graphs, there are very few classes of relevant problems for which belief propagation is known to converge. 1

6 0.10926712 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning

7 0.081483111 177 nips-2005-Size Regularized Cut for Data Clustering

8 0.079037257 79 nips-2005-Fusion of Similarity Data in Clustering

9 0.077442065 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning

10 0.068843767 110 nips-2005-Learning Depth from Single Monocular Images

11 0.066734441 20 nips-2005-Affine Structure From Sound

12 0.065301761 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators

13 0.061858635 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models

14 0.059172656 65 nips-2005-Estimating the wrong Markov random field: Benefits in the computation-limited setting

15 0.056817178 31 nips-2005-Asymptotics of Gaussian Regularized Least Squares

16 0.055391483 159 nips-2005-Q-Clustering

17 0.053631961 103 nips-2005-Kernels for gene regulatory regions

18 0.053604051 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification

19 0.052866329 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis

20 0.047452886 136 nips-2005-Noise and the two-thirds power Law


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.186), (1, 0.079), (2, -0.051), (3, 0.03), (4, -0.185), (5, -0.083), (6, -0.036), (7, 0.023), (8, -0.091), (9, -0.008), (10, -0.074), (11, -0.039), (12, 0.032), (13, 0.109), (14, 0.117), (15, 0.041), (16, 0.052), (17, 0.083), (18, -0.004), (19, 0.004), (20, 0.011), (21, 0.058), (22, 0.041), (23, 0.041), (24, 0.017), (25, -0.008), (26, -0.025), (27, 0.093), (28, -0.099), (29, 0.117), (30, -0.036), (31, 0.04), (32, 0.002), (33, 0.054), (34, -0.025), (35, -0.139), (36, 0.07), (37, -0.033), (38, 0.021), (39, 0.057), (40, -0.157), (41, -0.196), (42, -0.073), (43, -0.038), (44, 0.143), (45, -0.11), (46, 0.119), (47, 0.02), (48, 0.022), (49, 0.106)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96147716 127 nips-2005-Mixture Modeling by Affinity Propagation

Author: Brendan J. Frey, Delbert Dueck

Abstract: Clustering is a fundamental problem in machine learning and has been approached in many ways. Two general and quite different approaches include iteratively fitting a mixture model (e.g., using EM) and linking together pairs of training cases that have high affinity (e.g., using spectral methods). Pair-wise clustering algorithms need not compute sufficient statistics and avoid poor solutions by directly placing similar examples in the same cluster. However, many applications require that each cluster of data be accurately described by a prototype or model, so affinity-based clustering – and its benefits – cannot be directly realized. We describe a technique called “affinity propagation”, which combines the advantages of both approaches. The method learns a mixture model of the data by recursively propagating affinity messages. We demonstrate affinity propagation on the problems of clustering image patches for image segmentation and learning mixtures of gene expression models from microarray data. We find that affinity propagation obtains better solutions than mixtures of Gaussians, the K-medoids algorithm, spectral clustering and hierarchical clustering, and is both able to find a pre-specified number of clusters and is able to automatically determine the number of clusters. Interestingly, affinity propagation can be viewed as belief propagation in a graphical model that accounts for pairwise training case likelihood functions and the identification of cluster centers. 1

2 0.59420657 102 nips-2005-Kernelized Infomax Clustering

Author: David Barber, Felix V. Agakov

Abstract: We propose a simple information-theoretic approach to soft clustering based on maximizing the mutual information I(x, y) between the unknown cluster labels y and the training patterns x with respect to parameters of specifically constrained encoding distributions. The constraints are chosen such that patterns are likely to be clustered similarly if they lie close to specific unknown vectors in the feature space. The method may be conveniently applied to learning the optimal affinity matrix, which corresponds to learning parameters of the kernelized encoder. The procedure does not require computations of eigenvalues of the Gram matrices, which makes it potentially attractive for clustering large data sets. 1

3 0.5028106 198 nips-2005-Using ``epitomes'' to model genetic diversity: Rational design of HIV vaccine cocktails

Author: Nebojsa Jojic, Vladimir Jojic, Christopher Meek, David Heckerman, Brendan J. Frey

Abstract: We introduce a new model of genetic diversity which summarizes a large input dataset into an epitome, a short sequence or a small set of short sequences of probability distributions capturing many overlapping subsequences from the dataset. The epitome as a representation has already been used in modeling real-valued signals, such as images and audio. The discrete sequence model we introduce in this paper targets applications in genetics, from multiple alignment to recombination and mutation inference. In our experiments, we concentrate on modeling the diversity of HIV where the epitome emerges as a natural model for producing relatively small vaccines covering a large number of immune system targets known as epitopes. Our experiments show that the epitome includes more epitopes than other vaccine designs of similar length, including cocktails of consensus strains, phylogenetic tree centers, and observed strains. We also discuss epitome designs that take into account uncertainty about Tcell cross reactivity and epitope presentation. In our experiments, we find that vaccine optimization is fairly robust to these uncertainties. 1

4 0.49828604 178 nips-2005-Soft Clustering on Graphs

Author: Kai Yu, Shipeng Yu, Volker Tresp

Abstract: We propose a simple clustering framework on graphs encoding pairwise data similarities. Unlike usual similarity-based methods, the approach softly assigns data to clusters in a probabilistic way. More importantly, a hierarchical clustering is naturally derived in this framework to gradually merge lower-level clusters into higher-level ones. A random walk analysis indicates that the algorithm exposes clustering structures in various resolutions, i.e., a higher level statistically models a longer-term diffusion on graphs and thus discovers a more global clustering structure. Finally we provide very encouraging experimental results. 1

5 0.4953281 13 nips-2005-A Probabilistic Approach for Optimizing Spectral Clustering

Author: Rong Jin, Feng Kang, Chris H. Ding

Abstract: Spectral clustering enjoys its success in both data clustering and semisupervised learning. But, most spectral clustering algorithms cannot handle multi-class clustering problems directly. Additional strategies are needed to extend spectral clustering algorithms to multi-class clustering problems. Furthermore, most spectral clustering algorithms employ hard cluster membership, which is likely to be trapped by the local optimum. In this paper, we present a new spectral clustering algorithm, named “Soft Cut”. It improves the normalized cut algorithm by introducing soft membership, and can be efficiently computed using a bound optimization algorithm. Our experiments with a variety of datasets have shown the promising performance of the proposed clustering algorithm. 1

6 0.45540741 121 nips-2005-Location-based activity recognition

7 0.45459697 46 nips-2005-Consensus Propagation

8 0.45396176 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning

9 0.44452134 79 nips-2005-Fusion of Similarity Data in Clustering

10 0.37859577 4 nips-2005-A Bayesian Spatial Scan Statistic

11 0.37850034 177 nips-2005-Size Regularized Cut for Data Clustering

12 0.37511253 84 nips-2005-Generalization in Clustering with Unobserved Features

13 0.35683647 159 nips-2005-Q-Clustering

14 0.32708123 20 nips-2005-Affine Structure From Sound

15 0.32094803 65 nips-2005-Estimating the wrong Markov random field: Benefits in the computation-limited setting

16 0.31475723 71 nips-2005-Fast Krylov Methods for N-Body Learning

17 0.30509162 171 nips-2005-Searching for Character Models

18 0.30087855 103 nips-2005-Kernels for gene regulatory regions

19 0.27624774 62 nips-2005-Efficient Estimation of OOMs

20 0.27417749 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.041), (10, 0.045), (11, 0.011), (15, 0.293), (27, 0.035), (31, 0.07), (34, 0.074), (39, 0.017), (41, 0.034), (55, 0.039), (69, 0.053), (73, 0.051), (88, 0.078), (91, 0.05)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81441659 127 nips-2005-Mixture Modeling by Affinity Propagation

Author: Brendan J. Frey, Delbert Dueck

Abstract: Clustering is a fundamental problem in machine learning and has been approached in many ways. Two general and quite different approaches include iteratively fitting a mixture model (e.g., using EM) and linking together pairs of training cases that have high affinity (e.g., using spectral methods). Pair-wise clustering algorithms need not compute sufficient statistics and avoid poor solutions by directly placing similar examples in the same cluster. However, many applications require that each cluster of data be accurately described by a prototype or model, so affinity-based clustering – and its benefits – cannot be directly realized. We describe a technique called “affinity propagation”, which combines the advantages of both approaches. The method learns a mixture model of the data by recursively propagating affinity messages. We demonstrate affinity propagation on the problems of clustering image patches for image segmentation and learning mixtures of gene expression models from microarray data. We find that affinity propagation obtains better solutions than mixtures of Gaussians, the K-medoids algorithm, spectral clustering and hierarchical clustering, and is both able to find a pre-specified number of clusters and is able to automatically determine the number of clusters. Interestingly, affinity propagation can be viewed as belief propagation in a graphical model that accounts for pairwise training case likelihood functions and the identification of cluster centers. 1

2 0.74764615 178 nips-2005-Soft Clustering on Graphs

Author: Kai Yu, Shipeng Yu, Volker Tresp

Abstract: We propose a simple clustering framework on graphs encoding pairwise data similarities. Unlike usual similarity-based methods, the approach softly assigns data to clusters in a probabilistic way. More importantly, a hierarchical clustering is naturally derived in this framework to gradually merge lower-level clusters into higher-level ones. A random walk analysis indicates that the algorithm exposes clustering structures in various resolutions, i.e., a higher level statistically models a longer-term diffusion on graphs and thus discovers a more global clustering structure. Finally we provide very encouraging experimental results. 1

3 0.47905776 144 nips-2005-Off-policy Learning with Options and Recognizers

Author: Doina Precup, Cosmin Paduraru, Anna Koop, Richard S. Sutton, Satinder P. Singh

Abstract: We introduce a new algorithm for off-policy temporal-difference learning with function approximation that has lower variance and requires less knowledge of the behavior policy than prior methods. We develop the notion of a recognizer, a filter on actions that distorts the behavior policy to produce a related target policy with low-variance importance-sampling corrections. We also consider target policies that are deviations from the state distribution of the behavior policy, such as potential temporally abstract options, which further reduces variance. This paper introduces recognizers and their potential advantages, then develops a full algorithm for linear function approximation and proves that its updates are in the same direction as on-policy TD updates, which implies asymptotic convergence. Even though our algorithm is based on importance sampling, we prove that it requires absolutely no knowledge of the behavior policy for the case of state-aggregation function approximators. Off-policy learning is learning about one way of behaving while actually behaving in another way. For example, Q-learning is an off- policy learning method because it learns about the optimal policy while taking actions in a more exploratory fashion, e.g., according to an ε-greedy policy. Off-policy learning is of interest because only one way of selecting actions can be used at any time, but we would like to learn about many different ways of behaving from the single resultant stream of experience. For example, the options framework for temporal abstraction involves considering a variety of different ways of selecting actions. For each such option one would like to learn a model of its possible outcomes suitable for planning and other uses. Such option models have been proposed as fundamental building blocks of grounded world knowledge (Sutton, Precup & Singh, 1999; Sutton, Rafols & Koop, 2005). Using off-policy learning, one would be able to learn predictive models for many options at the same time from a single stream of experience. Unfortunately, off-policy learning using temporal-difference methods has proven problematic when used in conjunction with function approximation. Function approximation is essential in order to handle the large state spaces that are inherent in many problem do- mains. Q-learning, for example, has been proven to converge to an optimal policy in the tabular case, but is unsound and may diverge in the case of linear function approximation (Baird, 1996). Precup, Sutton, and Dasgupta (2001) introduced and proved convergence for the first off-policy learning algorithm with linear function approximation. They addressed the problem of learning the expected value of a target policy based on experience generated using a different behavior policy. They used importance sampling techniques to reduce the off-policy case to the on-policy case, where existing convergence theorems apply (Tsitsiklis & Van Roy, 1997; Tadic, 2001). There are two important difficulties with that approach. First, the behavior policy needs to be stationary and known, because it is needed to compute the importance sampling corrections. Second, the importance sampling weights are often ill-conditioned. In the worst case, the variance could be infinite and convergence would not occur. The conditions required to prevent this were somewhat awkward and, even when they applied and asymptotic convergence was assured, the variance could still be high and convergence could be slow. 
In this paper we address both of these problems in the context of off-policy learning for options. We introduce the notion of a recognizer. Rather than specifying an explicit target policy (for instance, the policy of an option), about which we want to make predictions, a recognizer specifies a condition on the actions that are selected. For example, a recognizer for the temporally extended action of picking up a cup would not specify which hand is to be used, or what the motion should be at all different positions of the cup. The recognizer would recognize a whole variety of directions of motion and poses as part of picking the cup. The advantage of this strategy is not that one might prefer a multitude of different behaviors, but that the behavior may be based on a variety of different strategies, all of which are relevant, and we would like to learn from any of them. In general, a recognizer is a function that recognizes or accepts a space of different ways of behaving and thus, can learn from a wider range of data. Recognizers have two advantages over direct specification of a target policy: 1) they are a natural and easy way to specify a target policy for which importance sampling will be well conditioned, and 2) they do not require the behavior policy to be known. The latter is important because in many cases we may have little knowledge of the behavior policy, or a stationary behavior policy may not even exist. We show that for the case of state aggregation, even if the behavior policy is unknown, convergence to a good model is achieved. 1 Non-sequential example The benefits of using recognizers in off-policy learning can be most easily seen in a nonsequential context with a single continuous action. Suppose you are given a sequence of sample actions ai ∈ [0, 1], selected i.i.d. according to probability density b : [0, 1] → ℜ+ (the behavior density). For example, suppose the behavior density is of the oscillatory form shown as a red line in Figure 1. For each each action, ai , we observe a corresponding outcome, zi ∈ ℜ, a random variable whose distribution depends only on ai . Thus the behavior density induces an outcome density. The on-policy problem is to estimate the mean mb of the outcome density. This problem can be solved simply by averaging the sample outcomes: mb = (1/n) ∑n zi . The off-policy problem is to use this same data to learn what ˆ i=1 the mean would be if actions were selected in some way other than b, for example, if the actions were restricted to a designated range, such as between 0.7 and 0.9. There are two natural ways to pose this off-policy problem. The most straightforward way is to be equally interested in all actions within the designated region. One professes to be interested in actions selected according to a target density π : [0, 1] → ℜ+ , which in the example would be 5.0 between 0.7 and 0.9, and zero elsewhere, as in the dashed line in 12 Probability density functions 1.5 Target policy with recognizer 1 Target policy w/o recognizer without recognizer .5 Behavior policy 0 0 Action 0.7 Empirical variances (average of 200 sample variances) 0.9 1 0 10 with recognizer 100 200 300 400 500 Number of sample actions Figure 1: The left panel shows the behavior policy and the target policies for the formulations of the problem with and without recognizers. The right panel shows empirical estimates of the variances for the two formulations as a function of the number sample actions. 
The lowest line is for the formulation using empirically-estimated recognition probabilities. Figure 1 (left). The importance- sampling estimate of the mean outcome is 1 n π(ai ) mπ = ∑ ˆ zi . n i=1 b(ai ) (1) This approach is problematic if there are parts of the region of interest where the behavior density is zero or very nearly so, such as near 0.72 and 0.85 in the example. Here the importance sampling ratios are exceedingly large and the estimate is poorly conditioned (large variance). The upper curve in Figure 1 (right) shows the empirical variance of this estimate as a function of the number of samples. The spikes and uncertain decline of the empirical variance indicate that the distribution is very skewed and that the estimates are very poorly conditioned. The second way to pose the problem uses recognizers. One professes to be interested in actions to the extent that they are both selected by b and within the designated region. This leads to the target policy shown in blue in the left panel of Figure 1 (it is taller because it still must sum to 1). For this problem, the variance of (1) is much smaller, as shown in the lower two lines of Figure 1 (right). To make this way of posing the problem clear, we introduce the notion of a recognizer function c : A → ℜ+ . The action space in the example is A = [0, 1] and the recognizer is c(a) = 1 for a between 0.7 and 0.9 and is zero elsewhere. The target policy is defined in general by c(a)b(a) c(a)b(a) = . (2) π(a) = µ ∑x c(x)b(x) where µ = ∑x c(x)b(x) is a constant, equal to the probability of recognizing an action from the behavior policy. Given π, mπ from (1) can be rewritten in terms of the recognizer as ˆ n π(ai ) 1 n c(ai )b(ai ) 1 1 n c(ai ) 1 mπ = ∑ zi ˆ = ∑ zi = ∑ zi (3) n i=1 b(ai ) n i=1 µ b(ai ) n i=1 µ Note that the target density does not appear at all in the last expression and that the behavior distribution appears only in µ, which is independent of the sample action. If this constant is known, then this estimator can be computed with no knowledge of π or b. The constant µ can easily be estimated as the fraction of recognized actions in the sample. The lowest line in Figure 1 (right) shows the variance of the estimator using this fraction in place of the recognition probability. Its variance is low, no worse than that of the exact algorithm, and apparently slightly lower. Because this algorithm does not use the behavior density, it can be applied when the behavior density is unknown or does not even exist. For example, suppose actions were selected in some deterministic, systematic way that in the long run produced an empirical distribution like b. This would be problematic for the other algorithms but would require no modification of the recognition-fraction algorithm. 2 Recognizers improve conditioning of off-policy learning The main use of recognizers is in formulating a target density π about which we can successfully learn predictions, based on the current behavior being followed. Here we formalize this intuition. Theorem 1 Let A = {a1 , . . . ak } ⊆ A be a subset of all the possible actions. Consider a fixed behavior policy b and let πA be the class of policies that only choose actions from A, i.e., if π(a) > 0 then a ∈ A. Then the policy induced by b and the binary recognizer cA is the policy with minimum-variance one-step importance sampling corrections, among those in πA : π(ai ) 2 π as given by (2) = arg min Eb (4) π∈πA b(ai ) Proof: Denote π(ai ) = πi , b(ai ) = bi . 
Then the expected variance of the one-step importance sampling corrections is: Eb πi bi πi bi 2 2 − Eb = ∑ bi i πi bi 2 −1 = ∑ i π2 i − 1, bi where the summation (here and everywhere below) is such that the action ai ∈ A. We want to find πi that minimizes this expression, subject to the constraint that ∑i πi = 1. This is a constrained optimization problem. To solve it, we write down the corresponding Lagrangian: π2 L(πi , β) = ∑ i − 1 + β(∑ πi − 1) i i bi We take the partial derivatives wrt πi and β and set them to 0: βbi ∂L 2 = πi + β = 0 ⇒ πi = − ∂πi bi 2 (5) ∂L = πi − 1 = 0 ∂β ∑ i (6) By taking (5) and plugging into (6), we get the following expression for β: − β 2 bi = 1 ⇒ β = − 2∑ ∑i bi i By substituting β into (5) we obtain: πi = bi ∑i b i This is exactly the policy induced by the recognizer defined by c(ai ) = 1 iff ai ∈ A. We also note that it is advantageous, from the point of view of minimizing the variance of the updates, to have recognizers that accept a broad range of actions: Theorem 2 Consider two binary recognizers c1 and c2 , such that µ1 > µ2 . Then the importance sampling corrections for c1 have lower variance than the importance sampling corrections for c2 . Proof: From the previous theorem, we have the variance of a recognizer cA : Var = ∑ i π2 bi i −1 = ∑ bi ∑ j∈A b j i 2 1 1 1 −1 = −1 = −1 bi µ ∑ j∈A b j 3 Formal framework for sequential problems We turn now to the full case of learning about sequential decision processes with function approximation. We use the standard framework in which an agent interacts with a stochastic environment. At each time step t, the agent receives a state st and chooses an action at . We assume for the moment that actions are selected according to a fixed behavior policy, b : S × A → [0, 1] where b(s, a) is the probability of selecting action a in state s. The behavior policy is used to generate a sequence of experience (observations, actions and rewards). The goal is to learn, from this data, predictions about different ways of behaving. In this paper we focus on learning predictions about expected returns, but other predictions can be tackled as well (for instance, predictions of transition models for options (Sutton, Precup & Singh, 1999), or predictions specified by a TD-network (Sutton & Tanner, 2005; Sutton, Rafols & Koop, 2006)). We assume that the state space is large or continuous, and function approximation must be used to compute any values of interest. In particular, we assume a space of feature vectors Φ and a mapping φ : S → Φ. We denote by φs the feature vector associated with s. An option is defined as a triple o = I, π, β where I ⊆ S is the set of states in which the option can be initiated, π is the internal policy of the option and β : S → [0, 1] is a stochastic termination condition. In the option work (Sutton, Precup & Singh, 1999), each of these elements has to be explicitly specified and fixed in order for an option to be well defined. Here, we will instead define options implicitly, using the notion of a recognizer. A recognizer is defined as a function c : S × A → [0, 1], where c(s, a) indicates to what extent the recognizer allows action a in state s. An important special case, which we treat in this paper, is that of binary recognizers. In this case, c is an indicator function, specifying a subset of actions that are allowed, or recognized, given a particular state. Note that recognizers do not specify policies; instead, they merely give restrictions on the policies that are allowed or recognized. 
A recognizer c together with a behavior policy b generates a target policy π, where: b(s, a)c(s, a) b(s, a)c(s, a) π(s, a) = (7) = µ(s) ∑x b(s, x)c(s, x) The denominator of this fraction, µ(s) = ∑x b(s, x)c(s, x), is the recognition probability at s, i.e., the probability that an action will be accepted at s when behavior is generated according to b. The policy π is only defined at states for which µ(s) > 0. The numerator gives the probability that action a is produced by the behavior and recognized in s. Note that if the recognizer accepts all state-action pairs, i.e. c(s, a) = 1, ∀s, a, then π is the same as b. Since a recognizer and a behavior policy can specify together a target policy, we can use recognizers as a way to specify policies for options, using (7). An option can only be initiated at a state for which at least one action is recognized, so µ(s) > 0, ∀s ∈ I. Similarly, the termination condition of such an option, β, is defined as β(s) = 1 if µ(s) = 0. In other words, the option must terminate if no actions are recognized at a given state. At all other states, β can be defined between 0 and 1 as desired. We will focus on computing the reward model of an option o, which represents the expected total return. The expected values of different features at the end of the option can be estimated similarly. The quantity that we want to compute is Eo {R(s)} = E{r1 + r2 + . . . + rT |s0 = s, π, β} where s ∈ I, experience is generated according to the policy of the option, π, and T denotes the random variable representing the time step at which the option terminates according to β. We assume that linear function approximation is used to represent these values, i.e. Eo {R(s)} ≈ θT φs where θ is a vector of parameters. 4 Off-policy learning algorithm In this section we present an adaptation of the off-policy learning algorithm of Precup, Sutton & Dasgupta (2001) to the case of learning about options. Suppose that an option’s policy π was used to generate behavior. In this case, learning the reward model of the option is a special case of temporal-difference learning of value functions. The forward ¯ (n) view of this algorithm is as follows. Let Rt denote the truncated n-step return starting at ¯ (0) time step t and let yt denote the 0-step truncated return, Rt . By the definition of the n-step truncated return, we have: ¯ (n) ¯ (n−1) Rt = rt+1 + (1 − βt+1 )Rt+1 . This is similar to the case of value functions, but it accounts for the possibility of terminating the option at time step t + 1. The λ-return is defined in the usual way: ∞ ¯ (n) ¯ Rtλ = (1 − λ) ∑ λn−1 Rt . n=1 The parameters of the linear function approximator are updated on every time step proportionally to: ¯ ¯ ∆θt = Rtλ − yt ∇θ yt (1 − β1 ) · · · (1 − βt ). In our case, however, trajectories are generated according to the behavior policy b. The main idea of the algorithm is to use importance sampling corrections in order to account for the difference in the state distribution of the two policies. Let ρt = (n) Rt , π(st ,at ) b(st ,at ) be the importance sampling ratio at time step t. The truncated n-step return, satisfies: (n) (n−1) Rt = ρt [rt+1 + (1 − βt+1 )Rt+1 ]. The update to the parameter vector is proportional to: ∆θt = Rtλ − yt ∇θ yt ρ0 (1 − β1 ) · · · ρt−1 (1 − βt ). The following result shows that the expected updates of the on-policy and off-policy algorithms are the same. Theorem 3 For every time step t ≥ 0 and any initial state s, ¯ Eb [∆θt |s] = Eπ [∆θt |s]. 
(n) (n) ¯ Proof: First we will show by induction that Eb {Rt |s} = Eπ {Rt |s}, ∀n (which implies ¯ that Eb {Rtλ |s} = Eπ (Rtλ |s}). For n = 0, the statement is trivial. Assuming that it is true for n − 1, we have (n) Eb Rt |s = a ∑b(s, a)∑Pss ρ(s, a) a = s ∑∑ a Pss b(s, a) a s = a ∑π(s, a)∑Pss a (n−1) a rss + (1 − β(s ))Eb Rt+1 |s π(s, a) a ¯ (n−1) r + (1 − β(s ))Eπ Rt+1 |s b(s, a) ss a ¯ (n−1) rss + (1 − β(s ))Eπ Rt+1 |s ¯ (n) = Eπ Rt |s . s Now we are ready to prove the theorem’s main statement. Defining Ωt to be the set of all trajectory components up to state st , we have: Eb {∆θt |s} = ∑ ω∈Ωt Pb (ω|s)Eb (Rtλ − yt )∇θ yt |ω t−1 ∏ ρi (1 − βi+1 ) i=0 πi (1 − βi+1 ) i=0 bi t−1 = t−1 ∑ ∏ bi Psaiisi+1 ω∈Ωt Eb Rtλ |st − yt ∇θ yt ∏ i=0 t−1 = ∑ ∏ πi Psaiisi+1 ω∈Ωt = ∑ ω∈Ωt ¯ Eπ Rtλ |st − yt ∇θ yt (1 − β1 )...(1 − βt ) i=0 ¯ ¯ Pπ (ω|s)Eπ (Rtλ − yt )∇θ yt |ω (1 − β1 )...(1 − βt ) = Eπ ∆θt |s . Note that we are able to use st and ω interchangeably because of the Markov property. ¯ Since we have shown that Eb [∆θt |s] = Eπ [∆θt |s] for any state s, it follows that the expected updates will also be equal for any distribution of the initial state s. When learning the model of options with data generated from the behavior policy b, the starting state distribution with respect to which the learning is performed, I0 is determined by the stationary distribution of the behavior policy, as well as the initiation set of the option I. We note also that the importance sampling corrections only have to be performed for the trajectory since the initiation of the updates for the option. No corrections are required for the experience prior to this point. This should generate updates that have significantly lower variance than in the case of learning values of policies (Precup, Sutton & Dasgupta, 2001). Because of the termination condition of the option, β, ∆θ can quickly decay to zero. To avoid this problem, we can use a restart function g : S → [0, 1], such that g(st ) specifies the extent to which the updating episode is considered to start at time t. Adding restarts generates a new forward update: t ∆θt = (Rtλ − yt )∇θ yt ∑ gi ρi ...ρt−1 (1 − βi+1 )...(1 − βt ), (8) i=0 where Rtλ is the same as above. With an adaptation of the proof in Precup, Sutton & Dasgupta (2001), we can show that we get the same expected value of updates by applying this algorithm from the original starting distribution as we would by applying the algorithm without restarts from a starting distribution defined by I0 and g. We can turn this forward algorithm into an incremental, backward view algorithm in the following way: • Initialize k0 = g0 , e0 = k0 ∇θ y0 • At every time step t: δt = θt+1 = kt+1 = et+1 = ρt (rt+1 + (1 − βt+1 )yt+1 ) − yt θt + αδt et ρt kt (1 − βt+1 ) + gt+1 λρt (1 − βt+1 )et + kt+1 ∇θ yt+1 Using a similar technique to that of Precup, Sutton & Dasgupta (2001) and Sutton & Barto (1998), we can prove that the forward and backward algorithm are equivalent (omitted due to lack of space). This algorithm is guaranteed to converge if the variance of the updates is finite (Precup, Sutton & Dasgupta, 2001). In the case of options, the termination condition β can be used to ensure that this is the case. 5 Learning when the behavior policy is unknown In this section, we consider the case in which the behavior policy is unknown. This case is generally problematic for importance sampling algorithms, but the use of recognizers will allow us to define importance sampling corrections, as well as a convergent algorithm. 
5 Learning when the behavior policy is unknown

In this section, we consider the case in which the behavior policy is unknown. This case is generally problematic for importance sampling algorithms, but the use of recognizers will allow us to define importance sampling corrections, as well as a convergent algorithm.

Recall that when using a recognizer, the target policy of the option is defined as:

    π(s, a) = c(s, a) b(s, a) / µ(s),

and the importance sampling ratio becomes:

    ρ(s, a) = π(s, a) / b(s, a) = c(s, a) / µ(s).

Of course, µ(s) depends on b. If b is unknown, instead of µ(s) we use a maximum likelihood estimate µ̂ : S → [0, 1]. The structure used to compute µ̂ will have to be compatible with the feature space used to represent the reward model; we make this more precise below. Likewise, the recognizer c(s, a) will have to be defined in terms of the features used to represent the model. We then define the importance sampling corrections as:

    ρ̂(s, a) = c(s, a) / µ̂(s).

We consider the case in which the function approximator used to model the option is a state aggregator. In this case, we define recognizers that behave consistently within each partition, i.e., c(s, a) = c(p, a), ∀s ∈ p; an action is either recognized or not recognized in all states of a partition. The recognition probability estimate µ̂ has one entry for every partition p of the state space. Its value is:

    µ̂(p) = N(p, c = 1) / N(p),

where N(p) is the number of times partition p was visited, and N(p, c = 1) is the number of times the action taken in p was recognized. In the limit, w.p.1, µ̂(p) converges to ∑_s d^b(s | p) ∑_a c(p, a) b(s, a), where d^b(s | p) is the probability of visiting state s from partition p under the stationary distribution of b. At this limit, π̂(s, a) = ρ̂(s, a) b(s, a) will be a well-defined policy (i.e., ∑_a π̂(s, a) = 1). Using Theorem 3, off-policy updates using importance sampling corrections ρ̂ will have the same expected value as on-policy updates using π̂. Note, though, that the learning algorithm never uses π̂; the only quantities needed are the corrections ρ̂, which are learned incrementally from data. For the case of general linear function approximation, we conjecture that a similar idea can be used, with the recognition probability learned using logistic regression. The development of this part is left for future work.
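As a rough illustration, the counting estimate µ̂ and the resulting corrections ρ̂ might be maintained as follows for a state aggregator. The class name and interface are hypothetical; only the counting rule µ̂(p) = N(p, c = 1) / N(p) comes from the text above.

```python
from collections import defaultdict

class RecognitionProbEstimator:
    """Maximum likelihood estimate mu_hat(p) = N(p, c=1) / N(p) for a state
    aggregator, and the corrections rho_hat(s, a) = c(p, a) / mu_hat(p)."""

    def __init__(self):
        self.visits = defaultdict(int)       # N(p)
        self.recognized = defaultdict(int)   # N(p, c = 1)

    def update(self, p, c_value):
        """Record one visit to partition p; c_value is c(p, a) for the action
        actually taken there under the (unknown) behavior policy."""
        self.visits[p] += 1
        if c_value == 1:
            self.recognized[p] += 1

    def mu_hat(self, p):
        return self.recognized[p] / self.visits[p] if self.visits[p] else 0.0

    def rho_hat(self, p, c_value):
        """Importance sampling correction; assumes mu_hat(p) > 0, i.e. the
        option can actually be initiated in partition p."""
        return c_value / self.mu_hat(p)
```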
Acknowledgements

The authors gratefully acknowledge the ideas and encouragement they have received in this work from Eddie Rafols, Mark Ring, Lihong Li and other members of the rlai.net group. We thank Csaba Szepesvari and the reviewers of the paper for constructive comments. This research was supported in part by iCore, NSERC, Alberta Ingenuity, and CFI.

References

Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of ICML.

Precup, D., Sutton, R. S. and Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In Proceedings of ICML.

Sutton, R. S., Precup, D. and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, vol. 112, pp. 181–211.

Sutton, R. S. and Tanner, B. (2005). Temporal-difference networks. In Proceedings of NIPS-17.

Sutton, R. S., Rafols, E. and Koop, A. (2006). Temporal abstraction in temporal-difference networks. In Proceedings of NIPS-18.

Tadic, V. (2001). On the convergence of temporal-difference learning with linear function approximation. Machine Learning, vol. 42, pp. 241–267.

Tsitsiklis, J. N. and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42:674–690.

4 0.47899669 78 nips-2005-From Weighted Classification to Policy Search

Author: Doron Blatt, Alfred O. Hero

Abstract: This paper proposes an algorithm to convert a T-stage stochastic decision problem with a continuous state space to a sequence of supervised learning problems. The optimization problem associated with the trajectory tree and random trajectory methods of Kearns, Mansour, and Ng, 2000, is solved using the Gauss-Seidel method. The algorithm breaks a multistage reinforcement learning problem into a sequence of single-stage reinforcement learning subproblems, each of which is solved via an exact reduction to a weighted-classification problem that can be solved using off-the-shelf methods. Thus the algorithm converts a reinforcement learning problem into simpler supervised learning subproblems. It is shown that the method converges in a finite number of steps to a solution that cannot be further improved by componentwise optimization. The implication of the proposed algorithm is that a plethora of classification methods can be applied to find policies in the reinforcement learning problem.

5 0.47796994 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs

Author: Firas Hamze, Nando de Freitas

Abstract: This paper presents a new sampling algorithm for approximating functions of variables representable as undirected graphical models of arbitrary connectivity with pairwise potentials, as well as for estimating the notoriously difficult partition function of the graph. The algorithm fits into the framework of sequential Monte Carlo methods rather than the more widely used MCMC, and relies on constructing a sequence of intermediate distributions which get closer to the desired one. While the idea of using “tempered” proposals is known, we construct a novel sequence of target distributions where, rather than dropping a global temperature parameter, we sequentially couple individual pairs of variables that are, initially, sampled exactly from a spanning tree of the variables. We present experimental results on inference and estimation of the partition function for sparse and densely-connected graphs.

6 0.47631973 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach

7 0.47481477 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models

8 0.47473273 30 nips-2005-Assessing Approximations for Gaussian Process Classification

9 0.47249329 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation

10 0.47245163 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity

11 0.47176483 23 nips-2005-An Application of Markov Random Fields to Range Sensing

12 0.46997449 102 nips-2005-Kernelized Infomax Clustering

13 0.46954408 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations

14 0.46809649 79 nips-2005-Fusion of Similarity Data in Clustering

15 0.46660477 184 nips-2005-Structured Prediction via the Extragradient Method

16 0.46634313 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators

17 0.46552685 45 nips-2005-Conditional Visual Tracking in Kernel Space

18 0.46484706 200 nips-2005-Variable KD-Tree Algorithms for Spatial Pattern Search and Discovery

19 0.4647623 46 nips-2005-Consensus Propagation

20 0.46452069 124 nips-2005-Measuring Shared Information and Coordinated Activity in Neuronal Networks