nips2005-56: Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators
Source: pdf
Author: Boaz Nadler, Stephane Lafon, Ioannis Kevrekidis, Ronald R. Coifman
Abstract: This paper presents a diffusion based probabilistic interpretation of spectral clustering and dimensionality reduction algorithms that use the eigenvectors of the normalized graph Laplacian. Given the pairwise adjacency matrix of all points, we define a diffusion distance between any two data points and show that the low dimensional representation of the data by the first few eigenvectors of the corresponding Markov matrix is optimal under a certain mean squared error criterion. Furthermore, assuming that data points are random samples from a density p(x) = e^{-U(x)}, we identify these eigenvectors as discrete approximations of eigenfunctions of a Fokker-Planck operator in a potential 2U(x) with reflecting boundary conditions. Finally, applying known results regarding the eigenvalues and eigenfunctions of the continuous Fokker-Planck operator, we provide a mathematical justification for the success of spectral clustering and dimensional reduction algorithms based on these first few eigenvectors. This analysis elucidates, in terms of the characteristics of diffusion processes, many empirical findings regarding spectral clustering algorithms. Keywords: Algorithms and architectures, learning theory.
1 Introduction

Clustering and low dimensional representation of high dimensional data are important problems in many diverse fields. In recent years various spectral methods to perform these tasks, based on the eigenvectors of adjacency matrices of graphs on the data, have been developed; see for example [1]-[10] and references therein.
In the simplest version, known as the normalized graph Laplacian, given n data points {x_i}_{i=1}^n, where each x_i ∈ R^p, we define a pairwise similarity matrix between points, for example using a Gaussian kernel with width ε,

    L_{i,j} = k(x_i, x_j) = exp( -||x_i - x_j||^2 / 2ε )                        (1)

and a diagonal normalization matrix D_{i,i} = Σ_j L_{i,j}. Many works propose to use the first few eigenvectors of the normalized eigenvalue problem Lφ = λDφ, or equivalently of the matrix M = D^{-1}L, either as a low dimensional representation of data or as good coordinates for clustering purposes. While for actual datasets the choice of a kernel k(x_i, x_j) is crucial, it does not qualitatively change our asymptotic analysis [11]. The use of the first few eigenvectors of M as good coordinates is typically justified with heuristic arguments or as a relaxation of a discrete clustering problem [3].
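The construction just described is easy to prototype. The following is a minimal sketch in Python/NumPy (not taken from the paper); the toy dataset, the value of ε, and all variable names are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))      # n = 300 toy points in R^p with p = 2
eps = 0.3                          # kernel width; in practice problem dependent

# Pairwise Gaussian kernel, eq. (1): L_ij = exp(-||x_i - x_j||^2 / (2*eps))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
L = np.exp(-sq_dists / (2 * eps))

# Diagonal normalization D_ii = sum_j L_ij and Markov matrix M = D^{-1} L
d = L.sum(axis=1)
M = L / d[:, None]

# Solving the generalized problem L*psi = lambda*D*psi gives the right
# eigenvectors of M; sort them by decreasing eigenvalue (lambda_0 = 1).
lam, psi = eigh(L, np.diag(d))
order = np.argsort(lam)[::-1]
lam, psi = lam[order], psi[:, order]

# The first few nontrivial eigenvectors serve as low dimensional coordinates.
coords = psi[:, 1:4]
```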
A different theoretical analysis of the eigenvectors of the matrix M, based on the fact that M is a stochastic matrix representing a random walk on the graph, was described by Meila and Shi [12], who considered the case of piecewise constant eigenvectors for specific lumpable matrix structures. Additional notable works that considered the random walk aspects of spectral clustering are [8, 13], where the authors suggest clustering based on the average commute time between points, and [14], which considered the relaxation process of this random walk. The paper is organized as follows. First, in section 2 we define a distance function between any two points based on the random walk on the graph, which we naturally denote the diffusion distance. We then show that the low dimensional description of the data by the first few eigenvectors, denoted as the diffusion map, is optimal under a mean squared error criterion based on this distance. In section 3 we consider a statistical model, in which data points are i.i.d. random samples from a probability density p(x) in a smooth bounded domain Ω ⊂ R^p, and analyze the asymptotics of the eigenvectors as the number of data points tends to infinity. This analysis shows that the eigenvectors of the finite matrix M are discrete approximations of the eigenfunctions of a Fokker-Planck (FP) operator with reflecting boundary conditions. This observation, coupled with known results regarding the eigenvalues and eigenfunctions of the FP operator, provides new insights into the properties of these eigenvectors and into the performance of spectral clustering algorithms, as described in section 4.
2 Diffusion Distances and Diffusion Maps

The starting point of our analysis, as also noted in other works, is the observation that the matrix M is adjoint to a symmetric matrix M_s = D^{1/2} M D^{-1/2}. Moreover, since M_s is symmetric it is diagonalizable and has a set of n real eigenvalues {λ_j}_{j=0}^{n-1}, whose corresponding eigenvectors {v_j} form an orthonormal basis of R^n. We now utilize the fact that by construction M is a stochastic matrix with all row sums equal to one, and can thus be interpreted as defining a random walk on the graph. Under this view, M_{i,j} denotes the transition probability from the point x_i to the point x_j in one time step.
The parameter ε thus has two interpretations in this construction. The first is that ε is the (squared) radius of the neighborhood used to infer local geometric and density information for the construction of the adjacency matrix, while the second is that ε is the discrete time step at which the random walk jumps from point to point.
We denote by p(t, y|x) the probability distribution of a random walk landing at location y at time t, given a starting location x at time t = 0. For ε large enough, all points in the graph are connected, so that M has a unique eigenvalue equal to 1. The other eigenvalues form a non-increasing sequence of non-negative numbers: λ_0 = 1 > λ_1 ≥ λ_2 ≥ ... Then, regardless of the initial starting point x,

    lim_{t→∞} p(t, y|x) = φ_0(y)                                               (6)

where φ_0 is the left eigenvector of M with eigenvalue λ_0 = 1, explicitly given by

    φ_0(x_i) = D_{i,i} / Σ_j D_{j,j}                                           (7)

This eigenvector also has a dual interpretation. Note that for a general shift-invariant kernel K(x - y), and for the Gaussian kernel in particular, φ_0 is simply the well known Parzen window density estimator.
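Continuing the sketch above (same illustrative variable names), equations (6) and (7) can be checked numerically: φ_0 is a left eigenvector of M with eigenvalue 1, and the rows of M^t approach it for large t.

```python
# Stationary distribution, eq. (7): phi0_i = D_ii / sum_j D_jj
phi0 = d / d.sum()

# Left eigenvector with eigenvalue 1 (holds exactly since L is symmetric)
print(np.allclose(phi0 @ M, phi0))

# Rows of M^t converge to phi0, eq. (6); the rate is governed by lambda_1,
# so the residual below is small only once t is large enough.
t = 500
residual = np.abs(np.linalg.matrix_power(M, t)[0] - phi0).max()
print(residual)
```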
For any finite time t, we decompose the probability distribution in the eigenbasis {φ_j},

    p(t, y|x) = φ_0(y) + Σ_{j≥1} a_j(x) λ_j^t φ_j(y)                           (8)

where the coefficients a_j depend on the initial location x. Given the definition of the random walk on the graph, it is only natural to quantify the similarity between any two points according to the evolution of their probability distributions. Specifically, we consider the following distance measure at time t,

    D_t^2(x_0, x_1) = || p(t, y|x_0) - p(t, y|x_1) ||_w^2
                    = Σ_y ( p(t, y|x_0) - p(t, y|x_1) )^2 w(y)                 (9)

with the specific choice w(y) = 1/φ_0(y) for the weight function, which takes into account the (empirical) local density of the points. Since this distance depends on the random walk on the graph, we quite naturally denote it as the diffusion distance at time t.
We also denote the mapping between the original space and the first k eigenvectors as the diffusion map,

    Ψ_t(x) = ( λ_1^t ψ_1(x), λ_2^t ψ_2(x), ..., λ_k^t ψ_k(x) )                 (10)

The following theorem relates the diffusion distance and the diffusion map.

Theorem: The diffusion distance (9) is equal to the Euclidean distance in the diffusion map space with all (n - 1) eigenvectors. This theorem provides a justification for using Euclidean distance in the diffusion map space for spectral clustering purposes.
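A numerical sanity check of the theorem, continuing the sketch above. It assumes, as a convention of the sketch chosen to make the identity exact, that the right eigenvectors ψ_j of M are normalized to have unit norm with respect to the stationary distribution φ_0; the eigenvectors returned by eigh(L, D) satisfy ψ_j^T D ψ_j = 1, so they are rescaled by sqrt(Σ_i D_ii).

```python
t = 3
S = d.sum()
psi_n = psi * np.sqrt(S)                      # unit norm w.r.t. phi0 = d / S

# Diffusion map at time t, eq. (10), using all n - 1 nontrivial eigenvectors
diff_map = (lam[None, 1:] ** t) * psi_n[:, 1:]

# Left-hand side: squared diffusion distance (9) between two sample points
i0, i1 = 0, 1
Mt = np.linalg.matrix_power(M, t)
phi0 = d / S
lhs = np.sum((Mt[i0] - Mt[i1]) ** 2 / phi0)

# Right-hand side: squared Euclidean distance in the diffusion map space
rhs = np.sum((diff_map[i0] - diff_map[i1]) ** 2)
print(lhs, rhs)                               # both equal D_t^2(x_0, x_1) up to round-off
```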
Therefore, geometry in diffusion space is meaningful and can be interpreted in terms of the Markov chain. In particular, as shown in [18], quantizing this diffusion space is equivalent to lumping the random walk. This observation provides a theoretical justification for dimensional reduction with these eigenvectors. We note that the first few eigenvectors are also optimal under other criteria, for example for data sampled from a manifold as in [4], or for multiclass spectral clustering [15].

3 The Asymptotics of the Diffusion Map

The analysis of the previous section provides a mathematical explanation for the success of the diffusion maps for dimensionality reduction and spectral clustering. We now consider a statistical model in which the data points are i.i.d. random samples from a probability density p(x) confined to a compact connected subset Ω ⊂ R^p with smooth boundary ∂Ω. Following the statistical physics notation, we write the density in Boltzmann form, p(x) = e^{-U(x)}, where U(x) is the (dimensionless) potential or energy of the configuration x.
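As a concrete illustration (not taken from the paper): for an equal-weight mixture of two one-dimensional Gaussians with means ±m and common variance σ^2, the potential

    U(x) = -log p(x) = -log[ (1/2) N(x; -m, σ^2) + (1/2) N(x; m, σ^2) ]

is a double-well potential with minima near x = ±m whenever m is sufficiently large compared to σ; this is exactly the kind of multi-well potential analyzed in Case I of section 4 below.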
As shown in [11], in the limit n → ∞ the random walk on the discrete graph converges to a random walk on the continuous space Ω. Then, it is possible to define forward and backward operators T_f and T_b as follows,

    T_f[φ](x) = ∫_Ω M(x|y) φ(y) p(y) dy,     T_b[ψ](x) = ∫_Ω M(y|x) ψ(y) p(y) dy     (14)

where M(x|y) = exp(-||x - y||^2 / 2ε) / D(y) is the transition probability from y to x in time ε, and D(y) = ∫_Ω exp(-||x - y||^2 / 2ε) p(x) dx. The two operators T_f and T_b have probabilistic interpretations. If φ(x) is a probability distribution on the graph at time t = 0, then T_f[φ] is the probability distribution at time t = ε. Similarly, T_b[ψ](x) is the mean of the function ψ at time t = ε, for a random walk that started at location x at time t = 0. The operators T_f and T_b are thus the continuous analogues of the left and right multiplication by the finite matrix M.
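In the finite-sample sketch above, this correspondence is literally matrix-vector multiplication (the test function below is an arbitrary illustrative choice):

```python
# Discrete analogues of T_b and T_f: right and left multiplication by M.
psi_test = np.sin(X[:, 0])                 # arbitrary test function on the data
Tb_psi = M @ psi_test                      # expected value of psi one step after leaving x_i

delta = np.zeros(len(X))
delta[0] = 1.0                             # point mass at x_0
Tf_delta = delta @ M                       # distribution after one time step of length eps
```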
The Langevin equation (18) is a common model to describe stochastic dynamical systems in physics, chemistry and biology [19, 20]. Note that when data is uniformly sampled from Ω, U = 0, so the drift term vanishes and we recover the Laplace-Beltrami operator on Ω.
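For orientation, the backward generator of a diffusion process in a potential 2U(x), written here in its standard Smoluchowski form (stated as background consistent with the abstract, rather than quoted from the paper's equation (17)), is

    H_b[ψ] = Δψ - 2∇ψ·∇U(x)

whose drift term -2∇ψ·∇U vanishes when U is constant, leaving only the Laplacian, in agreement with the remark above.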
Since (17) is defined in the bounded domain Ω, the eigenvalues and eigenfunctions of H_b depend on the boundary conditions imposed on ∂Ω. As shown in [9], in the limit ε → 0, the random walk satisfies reflecting boundary conditions on ∂Ω, which translate into

    ∂ψ(x)/∂n |_{∂Ω} = 0                                                        (19)

where n is a unit normal vector at the point x ∈ ∂Ω.

Table 1: Random Walks and Diffusion Processes

    Case               Operator                        Stochastic Process
    ε > 0, n < ∞       finite n × n matrix M           R.W. discrete in space, discrete in time
    ε > 0, n → ∞       operators T_f, T_b              R.W. in continuous space, discrete in time
    ε → 0, n = ∞       infinitesimal generator H_f     diffusion process continuous in time & space
To conclude, the left and right eigenvectors of the finite matrix M can be viewed as discrete approximations to those of the operators T_f and T_b, which in turn can be viewed as approximations to those of H_f and H_b. Therefore, if there are enough data points for accurate statistical sampling, the structure and characteristics of the eigenvalues and eigenfunctions of H_b are similar to the corresponding eigenvalues and discrete eigenvectors of M.

4 Fokker-Planck eigenfunctions and spectral clustering

According to (16), if λε is an eigenvalue of the matrix M or of the integral operator T_b based on a kernel with parameter ε, then the corresponding eigenvalue of H_b is µ ≈ (λε - 1)/ε. Therefore the largest eigenvalues of M correspond to the smallest eigenvalues of H_b. These eigenvalues and their corresponding eigenfunctions have been extensively studied in the literature under various settings. In general, the eigenvalues and eigenfunctions depend both on the geometry of the domain Ω and on the profile of the potential U(x). Two limiting cases are instructive: in the first case Ω = R^p, so geometry plays no role, while in the second U(x) = const, so density plays no role. Yet we show that in both cases there can still be well defined clusters, with the unifying probabilistic concept being that the mean exit time from one cluster to another is much larger than the characteristic equilibration time inside each cluster.
Case I: Consider diffusion in a smooth potential U(x) in Ω = R^p, where U has a few local minima, and U(x) → ∞ as ||x|| → ∞ fast enough so that ∫ e^{-U(x)} dx = 1 < ∞. Each such local minimum thus defines a metastable state, with transitions between metastable states being relatively rare events, depending on the barrier heights separating them. As shown in [21, 22] (and in many other works), there is an intimate connection between the smallest eigenvalues of H_b and mean exit times out of these metastable states. Specifically, in the asymptotic limit of small noise D ≪ 1, exit times are exponentially distributed and the first non-trivial eigenvalue (after µ_0 = 0) is given by µ_1 = 1/τ̄, where τ̄ is the mean exit time to overcome the highest potential barrier on the way to the deepest potential well. For the case of two potential wells, for example, the corresponding eigenfunction is roughly constant in each well, with a sharp transition near the saddle point between the wells. In general, in the case of k local minima there are asymptotically only k eigenvalues very close to zero. Apart from µ_0 = 0, each of the other k - 1 eigenvalues corresponds to the mean exit time from one of the wells into the deepest one, with the corresponding eigenfunctions being almost constant in each well. Therefore, for a finite dataset, the presence of only k eigenvalues close to 1, followed by a spectral gap, is indicative of k well separated clusters. Indeed, in such a two-well example there are only two eigenvalues close or equal to 1, with a distinct spectral gap, and the first eigenfunction is almost piecewise constant in each well.

(Figure: the first and second eigenvectors for the case of three Gaussians.)
In stochastic dynamical systems a spectral gap corresponds to a separation of time scales between long transition times from one well or metastable state to another, as compared to short equilibration times inside each well. Therefore, clustering and identification of metastable states are very similar tasks, and not surprisingly algorithms similar to the normalized graph Laplacian have been independently developed in the literature [25].
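The qualitative picture described above, a few eigenvalues near 1 separated by a spectral gap and eigenvectors that are nearly piecewise constant on the clusters, is easy to reproduce on synthetic data. The sketch below repeats the earlier construction for a three-Gaussian mixture with well separated, equally sized components; the mixture parameters, ε, and the use of k-means on the diffusion coordinates are illustrative choices, not prescriptions from the paper.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = np.vstack([c + 0.4 * rng.normal(size=(100, 2)) for c in centers])

eps = 0.5
L = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1) / (2 * eps))
d = L.sum(axis=1)
lam, psi = eigh(L, np.diag(d))
lam, psi = lam[::-1], psi[:, ::-1]

print(lam[:5])        # expect three eigenvalues near 1, then a clear spectral gap

# The nontrivial eigenvectors are nearly constant on each cluster, so a simple
# k-means in the diffusion coordinates recovers the three groups.
_, labels = kmeans2(psi[:, 1:3], 3, minit='++', seed=2)
```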
In a three-Gaussian example in which one cluster is much larger than the other two, the first eigenvector distinguishes between the large cluster and the two smaller ones, while the second eigenvector captures the equilibration inside the large cluster instead of further distinguishing the two small clusters.

Case II: Consider a uniform density in a region Ω ⊂ R^3 composed of two large containers connected by a narrow circular tube, as in the top right frame in figure 1. As shown in [27], the second eigenvalue of the FP operator is extremely small, of the order of a/V, where a is the radius of the connecting tube and V is the volume of the containers, thus showing an interesting connection to the Cheeger constant on graphs. The corresponding eigenfunction is almost piecewise constant in each container, with a sharp transition in the connecting tube. Even though in this case the density is uniform, there still is a spectral gap with two well defined clusters (the two containers), defined entirely by the geometry of Ω. An example of such a case and the results of the diffusion map are shown in figure 1 (right).

In summary, the eigenfunctions and eigenvalues of the FP operator, and thus of the corresponding finite Markov matrix, depend on both geometry and density. The diffusion distance, and its close relation to mean exit times between different clusters, is the quantity that incorporates these two features. This provides novel insight into spectral clustering algorithms, as well as a theoretical justification for the algorithm in [13], which defines clusters according to mean travel times between points on the graph. A similar analysis could also be applied to semi-supervised learning based on spectral methods [28]. Finally, these eigenvectors may be used to design better search and data collection protocols [29].
References

Schölkopf, B., Smola, A. and Müller, K.-R., Nonlinear component analysis as a kernel eigenvalue problem, Neural Computation 10, 1998.
Belkin, M. and Niyogi, P., Laplacian eigenmaps and spectral techniques for embedding and clustering, NIPS Vol. 14.
Ng, A., Jordan, M. and Weiss, Y., On spectral clustering: analysis and an algorithm, NIPS Vol. 14.
Saerens, M., Fouss, F., Yen, L. and Dupont, P., The principal component analysis of a graph and its relationships to spectral clustering.
Coifman, R.R., Lafon, S., et al., Geometric diffusions as a tool for harmonic analysis and structure definition of data, parts I and II, Proc. Nat. Acad. Sci., 2005.
Nadler, B., Lafon, S., Coifman, R.R. and Kevrekidis, I.G., Diffusion maps, spectral clustering, and the reaction coordinates of dynamical systems, to appear in Appl. Comput. Harmon. Anal.
Meila, M. and Shi, J., A random walks view of spectral segmentation, AI and Statistics, 2001.
Bengio, Y., et al., Learning eigenfunctions links spectral embedding and kernel PCA, Neural Computation, 16:2197-2219 (2004).
von Luxburg, U., Bousquet, O. and Belkin, M., On the convergence of spectral clustering on random samples: the normalized case, NIPS, 2004.
Lafon, S. and Lee, A., Diffusion maps: A unified framework for dimension reduction, data partitioning and graph subsampling, submitted.
Matkowsky, B.J. and Schuss, Z., Eigenvalues of the Fokker-Planck operator and the approach to equilibrium for diffusions in potential fields, SIAM J. Appl. Math.
Eckhoff, M., Precise asymptotics of small eigenvalues of reversible diffusions in the metastable regime, Annals of Prob.
Hein, M., Audibert, J.-Y. and von Luxburg, U., From graphs to manifolds - weak and strong pointwise consistency of graph Laplacians, COLT 2005 (to appear).
2 0.38703603 199 nips-2005-Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions
Author: Sridhar Mahadevan, Mauro Maggioni
Abstract: We investigate the problem of automatically constructing efficient representations or basis functions for approximating value functions based on analyzing the structure and topology of the state space. In particular, two novel approaches to value function approximation are explored based on automatically constructing basis functions on state spaces that can be represented as graphs or manifolds: one approach uses the eigenfunctions of the Laplacian, in effect performing a global Fourier analysis on the graph; the second approach is based on diffusion wavelets, which generalize classical wavelets to graphs using multiscale dilations induced by powers of a diffusion operator or random walk on the graph. Together, these approaches form the foundation of a new generation of methods for solving large Markov decision processes, in which the underlying representation and policies are simultaneously learned.
3 0.24565256 135 nips-2005-Neuronal Fiber Delineation in Area of Edema from Diffusion Weighted MRI
Author: Ofer Pasternak, Nathan Intrator, Nir Sochen, Yaniv Assaf
Abstract: Diffusion Tensor Magnetic Resonance Imaging (DT-MRI) is a non invasive method for brain neuronal fibers delineation. Here we show a modification for DT-MRI that allows delineation of neuronal fibers which are infiltrated by edema. We use the Muliple Tensor Variational (MTV) framework which replaces the diffusion model of DT-MRI with a multiple component model and fits it to the signal attenuation with a variational regularization mechanism. In order to reduce free water contamination we estimate the free water compartment volume fraction in each voxel, remove it, and then calculate the anisotropy of the remaining compartment. The variational framework was applied on data collected with conventional clinical parameters, containing only six diffusion directions. By using the variational framework we were able to overcome the highly ill posed fitting. The results show that we were able to find fibers that were not found by DT-MRI.
4 0.21228357 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
Author: Tong Zhang, Rie Kubota Ando
Abstract: We consider a framework for semi-supervised learning using spectral decomposition based un-supervised kernel design. This approach subsumes a class of previously proposed semi-supervised learning methods on data graphs. We examine various theoretical properties of such methods. In particular, we derive a generalization performance bound, and obtain the optimal kernel design by minimizing the bound. Based on the theoretical analysis, we are able to demonstrate why spectral kernel design based methods can often improve the predictive performance. Experiments are used to illustrate the main consequences of our analysis.
5 0.17336653 13 nips-2005-A Probabilistic Approach for Optimizing Spectral Clustering
Author: Rong Jin, Feng Kang, Chris H. Ding
Abstract: Spectral clustering enjoys its success in both data clustering and semisupervised learning. But, most spectral clustering algorithms cannot handle multi-class clustering problems directly. Additional strategies are needed to extend spectral clustering algorithms to multi-class clustering problems. Furthermore, most spectral clustering algorithms employ hard cluster membership, which is likely to be trapped by the local optimum. In this paper, we present a new spectral clustering algorithm, named “Soft Cut”. It improves the normalized cut algorithm by introducing soft membership, and can be efficiently computed using a bound optimization algorithm. Our experiments with a variety of datasets have shown the promising performance of the proposed clustering algorithm. 1
6 0.14787248 147 nips-2005-On the Convergence of Eigenspaces in Kernel Principal Component Analysis
7 0.1375024 178 nips-2005-Soft Clustering on Graphs
8 0.13525568 102 nips-2005-Kernelized Infomax Clustering
9 0.12524198 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning
10 0.097040586 75 nips-2005-Fixing two weaknesses of the Spectral Method
11 0.090707771 79 nips-2005-Fusion of Similarity Data in Clustering
12 0.086754173 126 nips-2005-Metric Learning by Collapsing Classes
13 0.083422266 177 nips-2005-Size Regularized Cut for Data Clustering
14 0.081964999 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines
15 0.079492353 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error
16 0.075085789 138 nips-2005-Non-Local Manifold Parzen Windows
17 0.074794911 71 nips-2005-Fast Krylov Methods for N-Body Learning
18 0.073770754 159 nips-2005-Q-Clustering
19 0.068019092 74 nips-2005-Faster Rates in Regression via Active Learning
20 0.066220202 47 nips-2005-Consistency of one-class SVM and related algorithms
topicId topicWeight
[(0, 0.241), (1, 0.128), (2, -0.068), (3, -0.113), (4, -0.387), (5, -0.064), (6, -0.006), (7, -0.215), (8, 0.048), (9, -0.098), (10, 0.161), (11, 0.166), (12, 0.068), (13, -0.108), (14, -0.197), (15, -0.035), (16, -0.177), (17, -0.129), (18, 0.127), (19, 0.008), (20, -0.006), (21, -0.058), (22, 0.118), (23, -0.095), (24, -0.156), (25, 0.097), (26, 0.082), (27, 0.09), (28, 0.036), (29, 0.071), (30, -0.027), (31, -0.081), (32, 0.044), (33, 0.049), (34, -0.041), (35, -0.055), (36, 0.076), (37, 0.013), (38, -0.038), (39, 0.028), (40, 0.018), (41, -0.044), (42, 0.077), (43, 0.008), (44, 0.015), (45, -0.025), (46, -0.022), (47, 0.027), (48, -0.0), (49, 0.081)]
simIndex simValue paperId paperTitle
same-paper 1 0.94964969 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators
Author: Boaz Nadler, Stephane Lafon, Ioannis Kevrekidis, Ronald R. Coifman
Abstract: This paper presents a diffusion based probabilistic interpretation of spectral clustering and dimensionality reduction algorithms that use the eigenvectors of the normalized graph Laplacian. Given the pairwise adjacency matrix of all points, we define a diffusion distance between any two data points and show that the low dimensional representation of the data by the first few eigenvectors of the corresponding Markov matrix is optimal under a certain mean squared error criterion. Furthermore, assuming that data points are random samples from a density p(x) = e−U (x) we identify these eigenvectors as discrete approximations of eigenfunctions of a Fokker-Planck operator in a potential 2U (x) with reflecting boundary conditions. Finally, applying known results regarding the eigenvalues and eigenfunctions of the continuous Fokker-Planck operator, we provide a mathematical justification for the success of spectral clustering and dimensional reduction algorithms based on these first few eigenvectors. This analysis elucidates, in terms of the characteristics of diffusion processes, many empirical findings regarding spectral clustering algorithms. Keywords: Algorithms and architectures, learning theory. 1
2 0.79548913 135 nips-2005-Neuronal Fiber Delineation in Area of Edema from Diffusion Weighted MRI
Author: Ofer Pasternak, Nathan Intrator, Nir Sochen, Yaniv Assaf
Abstract: Diffusion Tensor Magnetic Resonance Imaging (DT-MRI) is a non invasive method for brain neuronal fibers delineation. Here we show a modification for DT-MRI that allows delineation of neuronal fibers which are infiltrated by edema. We use the Muliple Tensor Variational (MTV) framework which replaces the diffusion model of DT-MRI with a multiple component model and fits it to the signal attenuation with a variational regularization mechanism. In order to reduce free water contamination we estimate the free water compartment volume fraction in each voxel, remove it, and then calculate the anisotropy of the remaining compartment. The variational framework was applied on data collected with conventional clinical parameters, containing only six diffusion directions. By using the variational framework we were able to overcome the highly ill posed fitting. The results show that we were able to find fibers that were not found by DT-MRI.
3 0.7941429 199 nips-2005-Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions
Author: Sridhar Mahadevan, Mauro Maggioni
Abstract: We investigate the problem of automatically constructing efficient representations or basis functions for approximating value functions based on analyzing the structure and topology of the state space. In particular, two novel approaches to value function approximation are explored based on automatically constructing basis functions on state spaces that can be represented as graphs or manifolds: one approach uses the eigenfunctions of the Laplacian, in effect performing a global Fourier analysis on the graph; the second approach is based on diffusion wavelets, which generalize classical wavelets to graphs using multiscale dilations induced by powers of a diffusion operator or random walk on the graph. Together, these approaches form the foundation of a new generation of methods for solving large Markov decision processes, in which the underlying representation and policies are simultaneously learned.
4 0.48779941 71 nips-2005-Fast Krylov Methods for N-Body Learning
Author: Nando D. Freitas, Yang Wang, Maryam Mahdaviani, Dustin Lang
Abstract: This paper addresses the issue of numerical computation in machine learning domains based on similarity metrics, such as kernel methods, spectral techniques and Gaussian processes. It presents a general solution strategy based on Krylov subspace iteration and fast N-body learning methods. The experiments show significant gains in computation and storage on datasets arising in image segmentation, object detection and dimensionality reduction. The paper also presents theoretical bounds on the stability of these methods.
5 0.39985153 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
Author: Tong Zhang, Rie Kubota Ando
Abstract: We consider a framework for semi-supervised learning using spectral decomposition based un-supervised kernel design. This approach subsumes a class of previously proposed semi-supervised learning methods on data graphs. We examine various theoretical properties of such methods. In particular, we derive a generalization performance bound, and obtain the optimal kernel design by minimizing the bound. Based on the theoretical analysis, we are able to demonstrate why spectral kernel design based methods can often improve the predictive performance. Experiments are used to illustrate the main consequences of our analysis.
6 0.38799113 13 nips-2005-A Probabilistic Approach for Optimizing Spectral Clustering
7 0.38689113 178 nips-2005-Soft Clustering on Graphs
8 0.3849197 147 nips-2005-On the Convergence of Eigenspaces in Kernel Principal Component Analysis
9 0.38088197 102 nips-2005-Kernelized Infomax Clustering
10 0.35620728 189 nips-2005-Tensor Subspace Analysis
11 0.32594132 177 nips-2005-Size Regularized Cut for Data Clustering
12 0.31319723 75 nips-2005-Fixing two weaknesses of the Spectral Method
13 0.31238416 79 nips-2005-Fusion of Similarity Data in Clustering
14 0.30174744 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning
15 0.29018074 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning
16 0.28666979 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction
17 0.27578676 159 nips-2005-Q-Clustering
18 0.27123073 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error
19 0.26664245 182 nips-2005-Statistical Convergence of Kernel CCA
20 0.26626232 84 nips-2005-Generalization in Clustering with Unobserved Features
topicId topicWeight
[(3, 0.051), (10, 0.045), (11, 0.284), (18, 0.013), (27, 0.013), (31, 0.054), (34, 0.107), (39, 0.017), (41, 0.015), (50, 0.02), (55, 0.025), (65, 0.016), (69, 0.062), (73, 0.089), (88, 0.071), (91, 0.041)]
simIndex simValue paperId paperTitle
1 0.8192839 183 nips-2005-Stimulus Evoked Independent Factor Analysis of MEG Data with Large Background Activity
Author: Kenneth Hild, Kensuke Sekihara, Hagai T. Attias, Srikantan S. Nagarajan
Abstract: This paper presents a novel technique for analyzing electromagnetic imaging data obtained using the stimulus evoked experimental paradigm. The technique is based on a probabilistic graphical model, which describes the data in terms of underlying evoked and interference sources, and explicitly models the stimulus evoked paradigm. A variational Bayesian EM algorithm infers the model from data, suppresses interference sources, and reconstructs the activity of separated individual brain sources. The new algorithm outperforms existing techniques on two real datasets, as well as on simulated data. 1
same-paper 2 0.80993092 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators
Author: Boaz Nadler, Stephane Lafon, Ioannis Kevrekidis, Ronald R. Coifman
Abstract: This paper presents a diffusion based probabilistic interpretation of spectral clustering and dimensionality reduction algorithms that use the eigenvectors of the normalized graph Laplacian. Given the pairwise adjacency matrix of all points, we define a diffusion distance between any two data points and show that the low dimensional representation of the data by the first few eigenvectors of the corresponding Markov matrix is optimal under a certain mean squared error criterion. Furthermore, assuming that data points are random samples from a density p(x) = e−U (x) we identify these eigenvectors as discrete approximations of eigenfunctions of a Fokker-Planck operator in a potential 2U (x) with reflecting boundary conditions. Finally, applying known results regarding the eigenvalues and eigenfunctions of the continuous Fokker-Planck operator, we provide a mathematical justification for the success of spectral clustering and dimensional reduction algorithms based on these first few eigenvectors. This analysis elucidates, in terms of the characteristics of diffusion processes, many empirical findings regarding spectral clustering algorithms. Keywords: Algorithms and architectures, learning theory. 1
3 0.78751796 140 nips-2005-Nonparametric inference of prior probabilities from Bayes-optimal behavior
Author: Liam Paninski
Abstract: We discuss a method for obtaining a subject’s a priori beliefs from his/her behavior in a psychophysics context, under the assumption that the behavior is (nearly) optimal from a Bayesian perspective. The method is nonparametric in the sense that we do not assume that the prior belongs to any fixed class of distributions (e.g., Gaussian). Despite this increased generality, the method is relatively simple to implement, being based in the simplest case on a linear programming algorithm, and more generally on a straightforward maximum likelihood or maximum a posteriori formulation, which turns out to be a convex optimization problem (with no non-global local maxima) in many important cases. In addition, we develop methods for analyzing the uncertainty of these estimates. We demonstrate the accuracy of the method in a simple simulated coin-flipping setting; in particular, the method is able to precisely track the evolution of the subject’s posterior distribution as more and more data are observed. We close by briefly discussing an interesting connection to recent models of neural population coding.
4 0.5385083 144 nips-2005-Off-policy Learning with Options and Recognizers
Author: Doina Precup, Cosmin Paduraru, Anna Koop, Richard S. Sutton, Satinder P. Singh
Abstract: We introduce a new algorithm for off-policy temporal-difference learning with function approximation that has lower variance and requires less knowledge of the behavior policy than prior methods. We develop the notion of a recognizer, a filter on actions that distorts the behavior policy to produce a related target policy with low-variance importance-sampling corrections. We also consider target policies that are deviations from the state distribution of the behavior policy, such as potential temporally abstract options, which further reduces variance. This paper introduces recognizers and their potential advantages, then develops a full algorithm for linear function approximation and proves that its updates are in the same direction as on-policy TD updates, which implies asymptotic convergence. Even though our algorithm is based on importance sampling, we prove that it requires absolutely no knowledge of the behavior policy for the case of state-aggregation function approximators. Off-policy learning is learning about one way of behaving while actually behaving in another way. For example, Q-learning is an off- policy learning method because it learns about the optimal policy while taking actions in a more exploratory fashion, e.g., according to an ε-greedy policy. Off-policy learning is of interest because only one way of selecting actions can be used at any time, but we would like to learn about many different ways of behaving from the single resultant stream of experience. For example, the options framework for temporal abstraction involves considering a variety of different ways of selecting actions. For each such option one would like to learn a model of its possible outcomes suitable for planning and other uses. Such option models have been proposed as fundamental building blocks of grounded world knowledge (Sutton, Precup & Singh, 1999; Sutton, Rafols & Koop, 2005). Using off-policy learning, one would be able to learn predictive models for many options at the same time from a single stream of experience. Unfortunately, off-policy learning using temporal-difference methods has proven problematic when used in conjunction with function approximation. Function approximation is essential in order to handle the large state spaces that are inherent in many problem do- mains. Q-learning, for example, has been proven to converge to an optimal policy in the tabular case, but is unsound and may diverge in the case of linear function approximation (Baird, 1996). Precup, Sutton, and Dasgupta (2001) introduced and proved convergence for the first off-policy learning algorithm with linear function approximation. They addressed the problem of learning the expected value of a target policy based on experience generated using a different behavior policy. They used importance sampling techniques to reduce the off-policy case to the on-policy case, where existing convergence theorems apply (Tsitsiklis & Van Roy, 1997; Tadic, 2001). There are two important difficulties with that approach. First, the behavior policy needs to be stationary and known, because it is needed to compute the importance sampling corrections. Second, the importance sampling weights are often ill-conditioned. In the worst case, the variance could be infinite and convergence would not occur. The conditions required to prevent this were somewhat awkward and, even when they applied and asymptotic convergence was assured, the variance could still be high and convergence could be slow. 
In this paper we address both of these problems in the context of off-policy learning for options. We introduce the notion of a recognizer. Rather than specifying an explicit target policy (for instance, the policy of an option), about which we want to make predictions, a recognizer specifies a condition on the actions that are selected. For example, a recognizer for the temporally extended action of picking up a cup would not specify which hand is to be used, or what the motion should be at all different positions of the cup. The recognizer would recognize a whole variety of directions of motion and poses as part of picking the cup. The advantage of this strategy is not that one might prefer a multitude of different behaviors, but that the behavior may be based on a variety of different strategies, all of which are relevant, and we would like to learn from any of them. In general, a recognizer is a function that recognizes or accepts a space of different ways of behaving and thus, can learn from a wider range of data. Recognizers have two advantages over direct specification of a target policy: 1) they are a natural and easy way to specify a target policy for which importance sampling will be well conditioned, and 2) they do not require the behavior policy to be known. The latter is important because in many cases we may have little knowledge of the behavior policy, or a stationary behavior policy may not even exist. We show that for the case of state aggregation, even if the behavior policy is unknown, convergence to a good model is achieved. 1 Non-sequential example The benefits of using recognizers in off-policy learning can be most easily seen in a nonsequential context with a single continuous action. Suppose you are given a sequence of sample actions ai ∈ [0, 1], selected i.i.d. according to probability density b : [0, 1] → ℜ+ (the behavior density). For example, suppose the behavior density is of the oscillatory form shown as a red line in Figure 1. For each each action, ai , we observe a corresponding outcome, zi ∈ ℜ, a random variable whose distribution depends only on ai . Thus the behavior density induces an outcome density. The on-policy problem is to estimate the mean mb of the outcome density. This problem can be solved simply by averaging the sample outcomes: mb = (1/n) ∑n zi . The off-policy problem is to use this same data to learn what ˆ i=1 the mean would be if actions were selected in some way other than b, for example, if the actions were restricted to a designated range, such as between 0.7 and 0.9. There are two natural ways to pose this off-policy problem. The most straightforward way is to be equally interested in all actions within the designated region. One professes to be interested in actions selected according to a target density π : [0, 1] → ℜ+ , which in the example would be 5.0 between 0.7 and 0.9, and zero elsewhere, as in the dashed line in 12 Probability density functions 1.5 Target policy with recognizer 1 Target policy w/o recognizer without recognizer .5 Behavior policy 0 0 Action 0.7 Empirical variances (average of 200 sample variances) 0.9 1 0 10 with recognizer 100 200 300 400 500 Number of sample actions Figure 1: The left panel shows the behavior policy and the target policies for the formulations of the problem with and without recognizers. The right panel shows empirical estimates of the variances for the two formulations as a function of the number sample actions. 
The lowest line is for the formulation using empirically-estimated recognition probabilities. Figure 1 (left). The importance- sampling estimate of the mean outcome is 1 n π(ai ) mπ = ∑ ˆ zi . n i=1 b(ai ) (1) This approach is problematic if there are parts of the region of interest where the behavior density is zero or very nearly so, such as near 0.72 and 0.85 in the example. Here the importance sampling ratios are exceedingly large and the estimate is poorly conditioned (large variance). The upper curve in Figure 1 (right) shows the empirical variance of this estimate as a function of the number of samples. The spikes and uncertain decline of the empirical variance indicate that the distribution is very skewed and that the estimates are very poorly conditioned. The second way to pose the problem uses recognizers. One professes to be interested in actions to the extent that they are both selected by b and within the designated region. This leads to the target policy shown in blue in the left panel of Figure 1 (it is taller because it still must sum to 1). For this problem, the variance of (1) is much smaller, as shown in the lower two lines of Figure 1 (right). To make this way of posing the problem clear, we introduce the notion of a recognizer function c : A → ℜ+ . The action space in the example is A = [0, 1] and the recognizer is c(a) = 1 for a between 0.7 and 0.9 and is zero elsewhere. The target policy is defined in general by c(a)b(a) c(a)b(a) = . (2) π(a) = µ ∑x c(x)b(x) where µ = ∑x c(x)b(x) is a constant, equal to the probability of recognizing an action from the behavior policy. Given π, mπ from (1) can be rewritten in terms of the recognizer as ˆ n π(ai ) 1 n c(ai )b(ai ) 1 1 n c(ai ) 1 mπ = ∑ zi ˆ = ∑ zi = ∑ zi (3) n i=1 b(ai ) n i=1 µ b(ai ) n i=1 µ Note that the target density does not appear at all in the last expression and that the behavior distribution appears only in µ, which is independent of the sample action. If this constant is known, then this estimator can be computed with no knowledge of π or b. The constant µ can easily be estimated as the fraction of recognized actions in the sample. The lowest line in Figure 1 (right) shows the variance of the estimator using this fraction in place of the recognition probability. Its variance is low, no worse than that of the exact algorithm, and apparently slightly lower. Because this algorithm does not use the behavior density, it can be applied when the behavior density is unknown or does not even exist. For example, suppose actions were selected in some deterministic, systematic way that in the long run produced an empirical distribution like b. This would be problematic for the other algorithms but would require no modification of the recognition-fraction algorithm. 2 Recognizers improve conditioning of off-policy learning The main use of recognizers is in formulating a target density π about which we can successfully learn predictions, based on the current behavior being followed. Here we formalize this intuition. Theorem 1 Let A = {a1 , . . . ak } ⊆ A be a subset of all the possible actions. Consider a fixed behavior policy b and let πA be the class of policies that only choose actions from A, i.e., if π(a) > 0 then a ∈ A. Then the policy induced by b and the binary recognizer cA is the policy with minimum-variance one-step importance sampling corrections, among those in πA : π(ai ) 2 π as given by (2) = arg min Eb (4) π∈πA b(ai ) Proof: Denote π(ai ) = πi , b(ai ) = bi . 
Then the expected variance of the one-step importance sampling corrections is: Eb πi bi πi bi 2 2 − Eb = ∑ bi i πi bi 2 −1 = ∑ i π2 i − 1, bi where the summation (here and everywhere below) is such that the action ai ∈ A. We want to find πi that minimizes this expression, subject to the constraint that ∑i πi = 1. This is a constrained optimization problem. To solve it, we write down the corresponding Lagrangian: π2 L(πi , β) = ∑ i − 1 + β(∑ πi − 1) i i bi We take the partial derivatives wrt πi and β and set them to 0: βbi ∂L 2 = πi + β = 0 ⇒ πi = − ∂πi bi 2 (5) ∂L = πi − 1 = 0 ∂β ∑ i (6) By taking (5) and plugging into (6), we get the following expression for β: − β 2 bi = 1 ⇒ β = − 2∑ ∑i bi i By substituting β into (5) we obtain: πi = bi ∑i b i This is exactly the policy induced by the recognizer defined by c(ai ) = 1 iff ai ∈ A. We also note that it is advantageous, from the point of view of minimizing the variance of the updates, to have recognizers that accept a broad range of actions: Theorem 2 Consider two binary recognizers c1 and c2 , such that µ1 > µ2 . Then the importance sampling corrections for c1 have lower variance than the importance sampling corrections for c2 . Proof: From the previous theorem, we have the variance of a recognizer cA : Var = ∑ i π2 bi i −1 = ∑ bi ∑ j∈A b j i 2 1 1 1 −1 = −1 = −1 bi µ ∑ j∈A b j 3 Formal framework for sequential problems We turn now to the full case of learning about sequential decision processes with function approximation. We use the standard framework in which an agent interacts with a stochastic environment. At each time step t, the agent receives a state st and chooses an action at . We assume for the moment that actions are selected according to a fixed behavior policy, b : S × A → [0, 1] where b(s, a) is the probability of selecting action a in state s. The behavior policy is used to generate a sequence of experience (observations, actions and rewards). The goal is to learn, from this data, predictions about different ways of behaving. In this paper we focus on learning predictions about expected returns, but other predictions can be tackled as well (for instance, predictions of transition models for options (Sutton, Precup & Singh, 1999), or predictions specified by a TD-network (Sutton & Tanner, 2005; Sutton, Rafols & Koop, 2006)). We assume that the state space is large or continuous, and function approximation must be used to compute any values of interest. In particular, we assume a space of feature vectors Φ and a mapping φ : S → Φ. We denote by φs the feature vector associated with s. An option is defined as a triple o = I, π, β where I ⊆ S is the set of states in which the option can be initiated, π is the internal policy of the option and β : S → [0, 1] is a stochastic termination condition. In the option work (Sutton, Precup & Singh, 1999), each of these elements has to be explicitly specified and fixed in order for an option to be well defined. Here, we will instead define options implicitly, using the notion of a recognizer. A recognizer is defined as a function c : S × A → [0, 1], where c(s, a) indicates to what extent the recognizer allows action a in state s. An important special case, which we treat in this paper, is that of binary recognizers. In this case, c is an indicator function, specifying a subset of actions that are allowed, or recognized, given a particular state. Note that recognizers do not specify policies; instead, they merely give restrictions on the policies that are allowed or recognized. 
A recognizer c together with a behavior policy b generates a target policy π, where: b(s, a)c(s, a) b(s, a)c(s, a) π(s, a) = (7) = µ(s) ∑x b(s, x)c(s, x) The denominator of this fraction, µ(s) = ∑x b(s, x)c(s, x), is the recognition probability at s, i.e., the probability that an action will be accepted at s when behavior is generated according to b. The policy π is only defined at states for which µ(s) > 0. The numerator gives the probability that action a is produced by the behavior and recognized in s. Note that if the recognizer accepts all state-action pairs, i.e. c(s, a) = 1, ∀s, a, then π is the same as b. Since a recognizer and a behavior policy can specify together a target policy, we can use recognizers as a way to specify policies for options, using (7). An option can only be initiated at a state for which at least one action is recognized, so µ(s) > 0, ∀s ∈ I. Similarly, the termination condition of such an option, β, is defined as β(s) = 1 if µ(s) = 0. In other words, the option must terminate if no actions are recognized at a given state. At all other states, β can be defined between 0 and 1 as desired. We will focus on computing the reward model of an option o, which represents the expected total return. The expected values of different features at the end of the option can be estimated similarly. The quantity that we want to compute is Eo {R(s)} = E{r1 + r2 + . . . + rT |s0 = s, π, β} where s ∈ I, experience is generated according to the policy of the option, π, and T denotes the random variable representing the time step at which the option terminates according to β. We assume that linear function approximation is used to represent these values, i.e. Eo {R(s)} ≈ θT φs where θ is a vector of parameters. 4 Off-policy learning algorithm In this section we present an adaptation of the off-policy learning algorithm of Precup, Sutton & Dasgupta (2001) to the case of learning about options. Suppose that an option’s policy π was used to generate behavior. In this case, learning the reward model of the option is a special case of temporal-difference learning of value functions. The forward ¯ (n) view of this algorithm is as follows. Let Rt denote the truncated n-step return starting at ¯ (0) time step t and let yt denote the 0-step truncated return, Rt . By the definition of the n-step truncated return, we have: ¯ (n) ¯ (n−1) Rt = rt+1 + (1 − βt+1 )Rt+1 . This is similar to the case of value functions, but it accounts for the possibility of terminating the option at time step t + 1. The λ-return is defined in the usual way: ∞ ¯ (n) ¯ Rtλ = (1 − λ) ∑ λn−1 Rt . n=1 The parameters of the linear function approximator are updated on every time step proportionally to: ¯ ¯ ∆θt = Rtλ − yt ∇θ yt (1 − β1 ) · · · (1 − βt ). In our case, however, trajectories are generated according to the behavior policy b. The main idea of the algorithm is to use importance sampling corrections in order to account for the difference in the state distribution of the two policies. Let ρt = (n) Rt , π(st ,at ) b(st ,at ) be the importance sampling ratio at time step t. The truncated n-step return, satisfies: (n) (n−1) Rt = ρt [rt+1 + (1 − βt+1 )Rt+1 ]. The update to the parameter vector is proportional to: ∆θt = Rtλ − yt ∇θ yt ρ0 (1 − β1 ) · · · ρt−1 (1 − βt ). The following result shows that the expected updates of the on-policy and off-policy algorithms are the same. Theorem 3 For every time step t ≥ 0 and any initial state s, ¯ Eb [∆θt |s] = Eπ [∆θt |s]. 
Proof: First we will show by induction that $\mathbb{E}_b\{R_t^{(n)} \mid s\} = \mathbb{E}_\pi\{\bar{R}_t^{(n)} \mid s\}, \forall n$ (which implies that $\mathbb{E}_b\{R_t^{\lambda} \mid s\} = \mathbb{E}_\pi\{\bar{R}_t^{\lambda} \mid s\}$). For $n = 0$, the statement is trivial. Assuming that it is true for $n - 1$, we have
$$
\begin{aligned}
\mathbb{E}_b\!\left\{R_t^{(n)} \mid s\right\}
&= \sum_a b(s, a) \sum_{s'} P_{ss'}^{a}\, \rho(s, a) \left[ r_{ss'}^{a} + (1 - \beta(s'))\, \mathbb{E}_b\!\left\{R_{t+1}^{(n-1)} \mid s'\right\} \right] \\
&= \sum_a \sum_{s'} P_{ss'}^{a}\, b(s, a)\, \frac{\pi(s, a)}{b(s, a)} \left[ r_{ss'}^{a} + (1 - \beta(s'))\, \mathbb{E}_\pi\!\left\{\bar{R}_{t+1}^{(n-1)} \mid s'\right\} \right] \\
&= \sum_a \pi(s, a) \sum_{s'} P_{ss'}^{a} \left[ r_{ss'}^{a} + (1 - \beta(s'))\, \mathbb{E}_\pi\!\left\{\bar{R}_{t+1}^{(n-1)} \mid s'\right\} \right] \\
&= \mathbb{E}_\pi\!\left\{\bar{R}_t^{(n)} \mid s\right\}.
\end{aligned}
$$
Now we are ready to prove the theorem's main statement. Defining $\Omega_t$ to be the set of all trajectory components up to state $s_t$, we have:
$$
\begin{aligned}
\mathbb{E}_b\{\Delta\theta_t \mid s\}
&= \sum_{\omega \in \Omega_t} P_b(\omega \mid s)\, \mathbb{E}_b\!\left[(R_t^{\lambda} - y_t)\nabla_\theta y_t \mid \omega\right] \prod_{i=0}^{t-1} \rho_i (1 - \beta_{i+1}) \\
&= \sum_{\omega \in \Omega_t} \prod_{i=0}^{t-1} b_i\, P_{s_i s_{i+1}}^{a_i} \left( \mathbb{E}_b\!\left[R_t^{\lambda} \mid s_t\right] - y_t \right) \nabla_\theta y_t \prod_{i=0}^{t-1} \frac{\pi_i}{b_i}(1 - \beta_{i+1}) \\
&= \sum_{\omega \in \Omega_t} \prod_{i=0}^{t-1} \pi_i\, P_{s_i s_{i+1}}^{a_i} \left( \mathbb{E}_\pi\!\left[\bar{R}_t^{\lambda} \mid s_t\right] - y_t \right) \nabla_\theta y_t\, (1 - \beta_1)\cdots(1 - \beta_t) \\
&= \sum_{\omega \in \Omega_t} P_\pi(\omega \mid s)\, \mathbb{E}_\pi\!\left[(\bar{R}_t^{\lambda} - y_t)\nabla_\theta y_t \mid \omega\right] (1 - \beta_1)\cdots(1 - \beta_t) \\
&= \mathbb{E}_\pi\{\Delta\bar{\theta}_t \mid s\}.
\end{aligned}
$$
Note that we are able to use $s_t$ and $\omega$ interchangeably because of the Markov property.

Since we have shown that $\mathbb{E}_b[\Delta\theta_t \mid s] = \mathbb{E}_\pi[\Delta\bar{\theta}_t \mid s]$ for any state $s$, it follows that the expected updates will also be equal for any distribution of the initial state $s$. When learning the model of options with data generated from the behavior policy $b$, the starting state distribution with respect to which the learning is performed, $I_0$, is determined by the stationary distribution of the behavior policy, as well as the initiation set of the option $I$. We note also that the importance sampling corrections only have to be performed for the trajectory since the initiation of the option. No corrections are required for the experience prior to this point. This should generate updates that have significantly lower variance than in the case of learning values of policies (Precup, Sutton & Dasgupta, 2001).

Because of the termination condition of the option, $\beta$, $\Delta\theta$ can quickly decay to zero. To avoid this problem, we can use a restart function $g : S \to [0, 1]$, such that $g(s_t)$ specifies the extent to which the updating episode is considered to start at time $t$. Adding restarts generates a new forward update:
$$
\Delta\theta_t = (R_t^{\lambda} - y_t)\, \nabla_\theta y_t \sum_{i=0}^{t} g_i\, \rho_i \cdots \rho_{t-1} (1 - \beta_{i+1}) \cdots (1 - \beta_t), \qquad (8)
$$
where $R_t^{\lambda}$ is the same as above. With an adaptation of the proof in Precup, Sutton & Dasgupta (2001), we can show that we get the same expected value of updates by applying this algorithm from the original starting distribution as we would by applying the algorithm without restarts from a starting distribution defined by $I_0$ and $g$. We can turn this forward algorithm into an incremental, backward-view algorithm in the following way (a code sketch is given at the end of this section):
• Initialize $k_0 = g_0$, $e_0 = k_0 \nabla_\theta y_0$
• At every time step $t$:
$$
\begin{aligned}
\delta_t &= \rho_t \left( r_{t+1} + (1 - \beta_{t+1})\, y_{t+1} \right) - y_t \\
\theta_{t+1} &= \theta_t + \alpha\, \delta_t\, e_t \\
k_{t+1} &= \rho_t\, k_t (1 - \beta_{t+1}) + g_{t+1} \\
e_{t+1} &= \lambda\, \rho_t (1 - \beta_{t+1})\, e_t + k_{t+1} \nabla_\theta y_{t+1}
\end{aligned}
$$
Using a similar technique to that of Precup, Sutton & Dasgupta (2001) and Sutton & Barto (1998), we can prove that the forward and backward algorithms are equivalent (omitted due to lack of space). This algorithm is guaranteed to converge if the variance of the updates is finite (Precup, Sutton & Dasgupta, 2001). In the case of options, the termination condition $\beta$ can be used to ensure that this is the case.
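The backward-view recursion above can be written compactly for linear features, where $\nabla_\theta y_t = \phi_{s_t}$. The sketch below is an illustrative assumption, not the authors' code: the episode interface (tuples of features, ratios, rewards, termination and restart weights) is ours, and the importance ratios $\rho_t$ are presumed to be supplied externally (for instance, as the estimates discussed in the next section).

```python
import numpy as np

def option_reward_update(theta, episode, g0=1.0, alpha=0.1, lam=0.9):
    """One pass of the incremental backward-view algorithm for an option's
    reward model with linear features. `episode` is a list of tuples
    (phi, rho, r_next, beta_next, g_next, phi_next) where:
      phi       -- feature vector of s_t
      rho       -- importance sampling ratio pi(s_t, a_t) / b(s_t, a_t)
      r_next    -- reward r_{t+1}
      beta_next -- option termination probability at s_{t+1}
      g_next    -- restart weight g_{t+1}
      phi_next  -- feature vector of s_{t+1}"""
    k = g0                                   # k_0 = g_0
    e = k * episode[0][0]                    # e_0 = k_0 * phi_0
    for phi, rho, r_next, beta_next, g_next, phi_next in episode:
        y, y_next = theta @ phi, theta @ phi_next
        delta = rho * (r_next + (1.0 - beta_next) * y_next) - y
        theta = theta + alpha * delta * e
        k = rho * k * (1.0 - beta_next) + g_next
        e = lam * rho * (1.0 - beta_next) * e + k * phi_next
    return theta
```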
5 Learning when the behavior policy is unknown

In this section, we consider the case in which the behavior policy is unknown. This case is generally problematic for importance sampling algorithms, but the use of recognizers will allow us to define importance sampling corrections, as well as a convergent algorithm.

Recall that when using a recognizer, the target policy of the option is defined as:
$$
\pi(s, a) = \frac{c(s, a)\, b(s, a)}{\mu(s)},
$$
so the importance sampling correction becomes:
$$
\rho(s, a) = \frac{\pi(s, a)}{b(s, a)} = \frac{c(s, a)}{\mu(s)}.
$$
Of course, $\mu(s)$ depends on $b$. If $b$ is unknown, instead of $\mu(s)$ we will use a maximum likelihood estimate $\hat{\mu} : S \to [0, 1]$. The structure used to compute $\hat{\mu}$ will have to be compatible with the feature space used to represent the reward model. We will make this more precise below. Likewise, the recognizer $c(s, a)$ will have to be defined in terms of the features used to represent the model. We will then define the importance sampling corrections as:
$$
\hat{\rho}(s, a) = \frac{c(s, a)}{\hat{\mu}(s)}.
$$
We consider the case in which the function approximator used to model the option is actually a state aggregator. In this case, we will define recognizers which behave consistently in each partition, i.e., $c(s, a) = c(p, a), \forall s \in p$. This means that an action is either recognized or not recognized in all states of the partition. The recognition probability $\hat{\mu}$ will have one entry for every partition $p$ of the state space. Its value will be:
$$
\hat{\mu}(p) = \frac{N(p, c = 1)}{N(p)},
$$
where $N(p)$ is the number of times partition $p$ was visited, and $N(p, c = 1)$ is the number of times the action taken in $p$ was recognized. In the limit, w.p. 1, $\hat{\mu}$ converges to $\sum_s d^{b}(s \mid p) \sum_a c(p, a)\, b(s, a)$, where $d^{b}(s \mid p)$ is the probability of visiting state $s$ from partition $p$ under the stationary distribution of $b$. At this limit, $\hat{\pi}(s, a) = \hat{\rho}(s, a)\, b(s, a)$ will be a well-defined policy (i.e., $\sum_a \hat{\pi}(s, a) = 1$). Using Theorem 3, off-policy updates using importance sampling corrections $\hat{\rho}$ will have the same expected value as on-policy updates using $\hat{\pi}$. Note though that the learning algorithm never uses $\hat{\pi}$; the only quantities needed are $\hat{\rho}$, which are learned incrementally from data.

For the case of general linear function approximation, we conjecture that a similar idea can be used, where the recognition probability is learned using logistic regression. The development of this part is left for future work.

Acknowledgements

The authors gratefully acknowledge the ideas and encouragement they have received in this work from Eddie Rafols, Mark Ring, Lihong Li and other members of the rlai.net group. We thank Csaba Szepesvari and the reviewers of the paper for constructive comments. This research was supported in part by iCore, NSERC, Alberta Ingenuity, and CFI.

References

Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of ICML.

Precup, D., Sutton, R. S. and Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In Proceedings of ICML.

Sutton, R. S., Precup, D. and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, vol. 112, pp. 181–211.

Sutton, R. S. and Tanner, B. (2005). Temporal-difference networks. In Proceedings of NIPS-17.

Sutton, R. S., Rafols, E. and Koop, A. (2006). Temporal abstraction in temporal-difference networks. In Proceedings of NIPS-18.

Tadic, V. (2001). On the convergence of temporal-difference learning with linear function approximation. Machine Learning, vol. 42, pp. 241–267.

Tsitsiklis, J. N. and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42:674–690.
5 0.5341273 102 nips-2005-Kernelized Infomax Clustering
Author: David Barber, Felix V. Agakov
Abstract: We propose a simple information-theoretic approach to soft clustering based on maximizing the mutual information I(x, y) between the unknown cluster labels y and the training patterns x with respect to parameters of specifically constrained encoding distributions. The constraints are chosen such that patterns are likely to be clustered similarly if they lie close to specific unknown vectors in the feature space. The method may be conveniently applied to learning the optimal affinity matrix, which corresponds to learning parameters of the kernelized encoder. The procedure does not require computations of eigenvalues of the Gram matrices, which makes it potentially attractive for clustering large data sets. 1
6 0.52903873 23 nips-2005-An Application of Markov Random Fields to Range Sensing
7 0.52777791 198 nips-2005-Using ``epitomes'' to model genetic diversity: Rational design of HIV vaccine cocktails
8 0.52650779 177 nips-2005-Size Regularized Cut for Data Clustering
9 0.52462476 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
10 0.52047342 142 nips-2005-Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games
11 0.51961255 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
12 0.51925218 184 nips-2005-Structured Prediction via the Extragradient Method
13 0.51771832 47 nips-2005-Consistency of one-class SVM and related algorithms
14 0.51646662 98 nips-2005-Infinite latent feature models and the Indian buffet process
15 0.51544738 77 nips-2005-From Lasso regression to Feature vector machine
16 0.51496559 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning
17 0.51431662 178 nips-2005-Soft Clustering on Graphs
18 0.51416522 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
19 0.51315415 74 nips-2005-Faster Rates in Regression via Active Learning
20 0.51166481 13 nips-2005-A Probabilistic Approach for Optimizing Spectral Clustering