nips nips2004 nips2004-185 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Alan L. Yuille
Abstract: This paper analyses the Contrastive Divergence algorithm for learning statistical parameters. We relate the algorithm to the stochastic approximation literature. This enables us to specify conditions under which the algorithm is guaranteed to converge to the optimal solution (with probability 1). This includes necessary and sufficient conditions for the solution to be unbiased.
Reference: text
sentIndex sentText sentNum sentScore
1 We relate the algorithm to the stochastic approximation literature. [sent-4, score-0.355]
2 This enables us to specify conditions under which the algorithm is guaranteed to converge to the optimal solution (with probability 1). [sent-5, score-0.319]
3 This includes necessary and sufficient conditions for the solution to be unbiased. [sent-6, score-0.168]
4 Recently Hinton proposed a new algorithm called contrastive divergences (CD) [1]. [sent-9, score-0.346]
5 Computer simulations show that this algorithm tends to converge, and to converge rapidly, although not always to the correct solution [2]. [sent-10, score-0.161]
6 Theoretical analysis shows that CD can fail but does not give conditions which guarantee convergence [3,4]. [sent-11, score-0.378]
7 This paper relates CD to the stochastic approximation literature [5,6] and hence derives elementary conditions which ensure convergence (with probability 1). [sent-12, score-0.697]
8 We conjecture that far stronger results can be obtained by applying more advanced techniques such as those described by Younes [7]. [sent-13, score-0.06]
9 We also give necessary and sufficient conditions for the solution of CD to be unbiased. [sent-14, score-0.231]
10 Section (2) describes CD and shows that it is closely related to a class of stochastic approximation algorithms for which convergence results exist. [sent-15, score-0.444]
11 In section (3) we state and give a proof of a simple convergence theorem for stochastic approximation algorithms. [sent-16, score-0.687]
12 Section (4) applies the theorem to give sufficient conditions for convergence of CD. [sent-17, score-0.423]
13 2 Contrastive Divergence and its Relations The task of statistical inference is to estimate the model parameters ω∗ which minimize the Kullback-Leibler divergence D(P0(x)||P(x|ω)) between the empirical distribution function of the observed data P0(x) and the model P(x|ω). [sent-18, score-0.172]
14 For example, it is natural to try performing steepest descent on D(P0 (x)||P (x|ω)). [sent-21, score-0.367]
15 The steepest descent algorithm can be expressed as: ω_{t+1} − ω_t = γ_t { −Σ_x P0(x) ∂E(x;ω)/∂ω + Σ_x P(x|ω) ∂E(x;ω)/∂ω },  (1) where the {γ_t} are constants and the model is written as a Gibbs distribution P(x|ω) = exp(−E(x;ω))/Z(ω). [sent-22, score-0.434]
16 Unfortunately steepest descent is usually computationally intractable because of the need to compute the second term on the right hand side of equation (1). [sent-23, score-0.545]
17 This is extremely difficult because of the need to evaluate the normalization term Z(ω) of P (x|ω). [sent-24, score-0.071]
18 Moreover, steepest descent also risks getting stuck in a local minimum. [sent-25, score-0.367]
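To make equation (1) concrete, here is a minimal sketch (an illustration, not code from the paper) of exact steepest descent on D(P0(x)||P(x|ω)) for a small discrete Gibbs model with the linear energy E(x;ω) = ω·φ(x) discussed just below; the state space is small enough that Z(ω) and the model expectation can be enumerated, which is exactly what becomes intractable in realistic models. All names (phi, P0, and so on) are illustrative.

```python
import numpy as np

# Toy Gibbs model: E(x; w) = w . phi(x) on a small discrete state space, so that
# Z(w), P(x|w), and both expectations in equation (1) can be enumerated exactly.
rng = np.random.default_rng(0)
states = np.arange(8)
phi = np.stack([states / 7.0, (states / 7.0) ** 2], axis=1)   # features phi(x), shape (8, 2)

w_true = np.array([2.0, -1.0])
p_true = np.exp(-phi @ w_true)
p_true /= p_true.sum()
P0 = rng.multinomial(2000, p_true) / 2000.0                   # empirical distribution P0(x)

def model_dist(w):
    """P(x|w) = exp(-E(x; w)) / Z(w), by explicit enumeration of Z(w)."""
    u = np.exp(-phi @ w)
    return u / u.sum()

def kl_gradient(w):
    """dD/dw = sum_x P0(x) dE/dw - sum_x P(x|w) dE/dw, with dE(x; w)/dw = phi(x)."""
    return P0 @ phi - model_dist(w) @ phi

# Steepest descent, equation (1): w_{t+1} - w_t = gamma_t * (-sum_x P0 dE/dw + sum_x P dE/dw).
w = np.zeros(2)
for t in range(2000):
    w -= 0.5 * kl_gradient(w)

print("steepest-descent estimate:", w)        # close to w_true, up to sampling noise in P0
print("gradient norm at the solution:", np.linalg.norm(kl_gradient(w)))
```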
19 There is, however, an important exception if we can express E(x; ω) in the special form E(x; ω) = ω · φ(x), for some function φ(x). [sent-26, score-0.056]
20 In this case D(P0 (x)||P (x|ω)) is convex and so steepest descent is guaranteed to converge to the global minimum. [sent-27, score-0.488]
21 The CD algorithm is formally similar to steepest descent. [sent-29, score-0.282]
22 Instead it approximates the second term on the right hand side of the steepest descent equation (1) by a stochastic term. [sent-31, score-0.737]
23 This approximation is done by defining, for each ω, a Markov Chain Monte Carlo (MCMC) transition kernel Kω(x, y) whose invariant distribution is P(x|ω) (i.e. Σ_x P(x|ω) Kω(x, y) = P(y|ω)). [sent-32, score-0.336]
24 We now observe that CD is similar to a class of stochastic approximation algorithms which also use MCMC methods to stochastically approximate the second term on the right hand side of the steepest descent equation (1). [sent-36, score-0.839]
25 A typical algorithm of this type introduces a state vector S^t(x) which is initialized by setting S^{t=0}(x) = P0(x). [sent-38, score-0.098]
26 S^t(x) is obtained by sampling with the transition kernel K_{ω_t}(x, y) using S^{t−1}(x) as the initial state for the chain. [sent-40, score-0.264]
27 Then ω_{t+1} is computed by replacing the second term in equation (1) by the expectation with respect to S^t(x). [sent-41, score-0.146]
28 From this perspective, we can obtain CD by having a state vector S^t(x) (= Qω(x)) which gets re-initialized to P0(x) at each time step. [sent-42, score-0.068]
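A minimal sketch of the CD-m update just described (illustrative code, not from the paper): the chains are re-initialized at the data (i.e. at P0) on every parameter update, run for m steps of a Metropolis kernel whose invariant distribution is P(x|ω_t), and the resulting samples stand in for the intractable second term of equation (1). The toy model, the kernel, and the 1/t-type damping schedule are all choices made for the illustration.

```python
import numpy as np

# Same toy Gibbs model: E(x; w) = w . phi(x) on a small discrete state space.
rng = np.random.default_rng(1)
states = np.arange(8)
phi = np.stack([states / 7.0, (states / 7.0) ** 2], axis=1)
w_true = np.array([2.0, -1.0])
p_true = np.exp(-phi @ w_true)
p_true /= p_true.sum()
data = rng.choice(states, size=500, p=p_true)          # samples from P0(x)

def energy(x, w):
    return phi[x] @ w

def metropolis_step(x, w):
    """One step of a Metropolis kernel K_w(x, y); its invariant distribution is P(x|w)."""
    y = (x + rng.integers(-1, 2, size=x.shape)) % len(states)   # symmetric proposal
    accept = rng.random(x.shape) < np.exp(energy(x, w) - energy(y, w))
    return np.where(accept, y, x)

# CD-m: the state S^{t=0}(x) is re-initialized at the data (i.e. at P0) on every update,
# run for m steps of K_{w_t}, and the resulting samples Q_w replace the intractable
# second term of equation (1).
m = 3
w = np.zeros(2)
grad_data = phi[data].mean(axis=0)                     # sum_x P0(x) dE(x; w)/dw
for t in range(5000):
    x = data.copy()                                    # S^{t=0}(x) = P0(x)
    for _ in range(m):
        x = metropolis_step(x, w)                      # sample with the kernel K_{w_t}
    grad_model = phi[x].mean(axis=0)                   # stochastic estimate of the second term
    gamma_t = 5.0 / (10.0 + t)                         # a 1/t-type damping schedule
    w -= gamma_t * (grad_data - grad_model)

# Close to w_true; with finite m the fixed point can be slightly biased (see Results 3-4 below).
print("CD-m estimate:", w, "  true parameters:", w_true)
```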
29 This stochastic approximation algorithm, and its many variants, have been extensively studied and convergence results have been obtained (see [7]). [sent-43, score-0.444]
30 The convergence results are based on stochastic approximation theorems [6] whose history starts with the analysis of the Robbins-Monro algorithm [5]. [sent-44, score-0.511]
31 Precise conditions can be specified which guarantee convergence in probability. [sent-45, score-0.315]
32 In particular, Kushner [9] has proven convergence to global optima. [sent-46, score-0.178]
33 Within the NIPS community, Orr and Leen [10] have studied the ability of these algorithms to escape from local minima by basin hopping. [sent-47, score-0.054]
34 3 Stochastic Approximation Algorithms and Convergence The general stochastic approximation algorithm is of the form: ω_{t+1} = ω_t − γ_t S(ω_t, N_t),  (3) where N_t is a random variable sampled from a distribution Pn(N), γ_t is the damping coefficient, and S(·, ·) is the update function. [sent-48, score-0.383]
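As a reference point for equation (3), the following sketch (a hypothetical illustration) runs the classic Robbins-Monro special case in which S(ω, N) = ω − N and γ_t = 1/t, so the iterates converge to ω∗ = E[N].

```python
import numpy as np

# Equation (3): w_{t+1} = w_t - gamma_t * S(w_t, N_t).
# Here S(w, N) = w - N with N ~ Pn, so the expected update points from w towards
# E[N], and the iterates converge to w* = E[N] (the classic Robbins-Monro example).
rng = np.random.default_rng(0)

def S(w, N):
    return w - N

w = 0.0
for t in range(1, 100001):
    N_t = rng.normal(loc=3.0, scale=2.0)    # random variable N_t sampled from Pn(N)
    gamma_t = 1.0 / t                       # damping coefficient gamma_t = 1/t
    w = w - gamma_t * S(w, N_t)

print("stochastic approximation estimate:", w)   # close to E[N] = 3.0
```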
35 We now state a theorem which gives sufficient conditions to ensure that the stochastic approximation algorithm (3) converges to a (solution) state ω ∗ . [sent-51, score-0.764]
36 The theorem is chosen because of the simplicity of its proof and we point out that a large variety of alternative results are available, see [6,7,9] and the references they cite. [sent-52, score-0.112]
37 The first is a function L(ω) = (1/2)|ω − ω∗|² which is a measure of the distance of the current state ω from the solution state ω∗ (in the next section we will require ω∗ = arg min_ω D(P0(x)||P(x|ω))). [sent-54, score-0.174]
38 The second is the expected value Σ_N Pn(N) S(ω, N) of the update term in the stochastic approximation algorithm (3). [sent-55, score-0.549]
39 The third is the expected squared magnitude |S(ω, N)|² of the update term. [sent-56, score-0.216]
40 The theorem states that the algorithm will converge provided three conditions are satisfied. [sent-57, score-0.369]
41 The first condition requires that the expected update Σ_N Pn(N) S(ω, N) has a large component towards the solution ω∗ (i.e. condition (1) of the theorem below). [sent-59, score-0.258]
42 The second condition requires that the expected squared magnitude |S(ω, N)|² is bounded, so that the “noise” in the update is not too large. [sent-62, score-0.253]
43 The third condition requires that the damping coefficients γ_t decrease with time t, so that the algorithm eventually settles down into a fixed state. [sent-63, score-0.184]
44 This condition is satisfied by setting γ_t = 1/t, ∀t (which is the fastest fall-off rate consistent with the SAC theorem). [sent-64, score-0.066]
45 We now state the theorem and briefly sketch the proof which is based on martingale theory (for an introduction, see [11]). [sent-65, score-0.248]
46 Consider the stochastic approximation algorithm, equation (3), and let L(ω) = (1/2)|ω − ω ∗ |2 . [sent-67, score-0.369]
47 Then the algorithm will converge to ω∗ with probability 1 provided: (1) −∇L(ω) · Σ_N Pn(N) S(ω, N) ≥ K1 L(ω) for some constant K1, (2) ⟨|S(ω, N)|²⟩_t ≤ K2 (1 + L(ω)), where K2 is some constant and the expectation ⟨·⟩_t is taken with respect to Pn(N), and (3) the damping coefficients satisfy Σ_t γ_t = ∞ and Σ_t γ_t² < ∞. [sent-68, score-0.123]
48 The proof [12] is a consequence of the supermartingale convergence theorem [11]. [sent-71, score-0.262]
49 This theorem states that if X_t, Y_t, Z_t are positive random variables obeying Σ_{t=0}^∞ Y_t < ∞ with probability one and E[X_{t+1} | X_0, …, X_t] ≤ X_t + Y_t − Z_t, ∀t, then X_t converges with probability 1 and Σ_{t=0}^∞ Z_t < ∞. [sent-72, score-0.125]
50 Conditions 1 and 2 imply that Xt can only converge to 0. [sent-74, score-0.093]
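To see how the three conditions feed the supermartingale theorem, the following one-step expansion is a sketch of the argument (it assumes the expectation is taken conditional on ω_t and sets X_t = L(ω_t); the precise bookkeeping is in [11,12]):

```latex
% One-step expansion of L(omega) = (1/2)|omega - omega*|^2 under
% omega_{t+1} = omega_t - gamma_t S(omega_t, N_t), taking the expectation over N_t:
\mathbb{E}\big[\, L(\omega_{t+1}) \mid \omega_t \,\big]
  \;=\; L(\omega_t)
  \;-\; \gamma_t \,\nabla L(\omega_t) \cdot \sum_{N} P_n(N)\, S(\omega_t, N)
  \;+\; \frac{\gamma_t^{2}}{2}\, \big\langle |S(\omega_t, N)|^{2} \big\rangle_t .
```

Condition (1) turns the cross term into a guaranteed decrease of order γ_t K1 L(ω_t), condition (2) bounds the last term by (γ_t²/2) K2 (1 + L(ω_t)), and the 1/t-type decay of γ_t makes the accumulated noise summable; the supermartingale theorem then gives that X_t = L(ω_t) converges and that Σ_t γ_t L(ω_t) < ∞, which combined with Σ_t γ_t = ∞ forces the limit of L(ω_t) to be 0.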
51 4 CD and SAC The CD algorithm can be expressed as a stochastic approximation algorithm by setting: S(ω_t, N_t) = −Σ_x P0(x) ∂E(x;ω)/∂ω + Σ_x Qω(x) ∂E(x;ω)/∂ω,  (4) where the random variable N_t corresponds to the MCMC sampling used to obtain Qω(x). [sent-76, score-0.391]
52 We can now apply the SAC to give three conditions which guarantee convergence of the CD algorithm. [sent-77, score-0.378]
53 The third condition can be satisfied by setting γt = 1/t, ∀t. [sent-78, score-0.095]
54 We can satisfy the second condition by requiring that the gradient of E(x; ω) with respect to ω is bounded, see equation (4). [sent-79, score-0.21]
55 We conjecture that weaker conditions, such as requiring only that the gradient of E(x; ω) be bounded by a function linear in ω, can be obtained using the more sophisticated martingale analysis described in [7]. [sent-80, score-0.211]
56 It remains to understand the first condition and to determine whether the solution is unbiased. [sent-81, score-0.162]
57 We now re-express this expected CD update in two different ways, Results 1 and 2, which give alternative ways of understanding it. [sent-83, score-0.217]
58 We then proceed to Results 3 and 4 which give conditions for convergence and unbiasedness of CD. [sent-84, score-0.37]
59 We choose the transition kernel Kω (x, y) to satisfy detailed balance so that P (x|ω)Kω (x, y) = P (y|ω)Kω (y, x). [sent-86, score-0.309]
60 Detailed balance is obeyed by many MCMC algorithms and, in particular, is always satisfied by Metropolis-Hastings algorithms. [sent-87, score-0.046]
61 It implies that P(x|ω) is the invariant distribution of Kω(x, y) so that Σ_x P(x|ω) Kω(x, y) = P(y|ω) (all transition kernels satisfy Σ_y Kω(x, y) = 1, ∀x). [sent-88, score-0.304]
62 Detailed balance implies that the matrix Qω(x, y) = P(x|ω)^{1/2} Kω(x, y) P(y|ω)^{−1/2} is symmetric and hence has orthogonal eigenvectors and eigenvalues {e^μ(x), λ^μ}. [sent-89, score-0.277]
63 In addition, the left and right eigenvectors of Kω (which can be taken as v^μ_ω(x) = P(x|ω)^{1/2} e^μ(x) and u^μ_ω(x) = P(x|ω)^{−1/2} e^μ(x)) are mutually orthonormal so that Σ_x v^μ_ω(x) u^ν_ω(x) = δ_{μν}, where δ_{μν} is the Kronecker delta function. [sent-94, score-0.134]
64 This implies that we can express any function f(x) in equivalent expansions: f(x) = Σ_μ {Σ_y f(y) u^μ_ω(y)} v^μ_ω(x) and f(x) = Σ_μ {Σ_y f(y) v^μ_ω(y)} u^μ_ω(x).  (8) [sent-95, score-0.089]
65 Moreover, the first left and right eigenvectors can be calculated explicitly to give: v^1_ω(x) = P(x|ω), u^1_ω(x) ∝ 1, λ^1_ω = 1,  (9) which follows because P(x|ω) is the (unique) invariant distribution of the transition kernel Kω(x, y) and hence is the first left eigenvector. [sent-96, score-0.397]
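These spectral facts can be checked numerically on a small example. The sketch below (an illustration with hypothetical names; the Metropolis kernel and Gibbs model are choices made for the demo) verifies detailed balance, builds the symmetrized matrix Qω(x, y) = P(x|ω)^{1/2} Kω(x, y) P(y|ω)^{−1/2}, and confirms that v^1_ω(x) = P(x|ω), u^1_ω(x) ∝ 1 and λ^1 = 1, together with the biorthogonality Σ_x v^μ_ω(x) u^ν_ω(x) = δ_{μν}.

```python
import numpy as np

# Small Gibbs model P(x|w) plus an explicit Metropolis transition matrix K_w(x, y).
n = 8
states = np.arange(n)
phi = np.stack([states / (n - 1), (states / (n - 1)) ** 2], axis=1)
w = np.array([2.0, -1.0])
p = np.exp(-phi @ w)
p /= p.sum()                                          # P(x|w)

# Metropolis kernel with a symmetric nearest-neighbour proposal: K[x, y] = Pr(x -> y).
K = np.zeros((n, n))
for x in range(n):
    for y in ((x - 1) % n, (x + 1) % n):
        K[x, y] = 0.5 * min(1.0, p[y] / p[x])         # propose y w.p. 1/2, accept w.p. min(1, P(y)/P(x))
    K[x, x] = 1.0 - K[x].sum()                        # rejected proposals stay at x

# Detailed balance P(x|w) K(x,y) = P(y|w) K(y,x), rows sum to 1, and P(.|w) is invariant.
assert np.allclose(p[:, None] * K, (p[:, None] * K).T)
assert np.allclose(K.sum(axis=1), 1.0)
assert np.allclose(p @ K, p)                          # sum_x P(x|w) K(x, y) = P(y|w)

# Symmetrized matrix Q(x, y) = P(x|w)^{1/2} K(x, y) P(y|w)^{-1/2}: symmetric, real spectrum.
Q = np.sqrt(p)[:, None] * K / np.sqrt(p)[None, :]
lam, e = np.linalg.eigh(Q)
order = np.argsort(-lam)
lam, e = lam[order], e[:, order]                      # eigenvalues lam^mu, eigenvectors e^mu(x)

# Left/right eigenvectors of K: v^mu(x) = P(x|w)^{1/2} e^mu(x), u^mu(x) = P(x|w)^{-1/2} e^mu(x).
v = np.sqrt(p)[:, None] * e
u = e / np.sqrt(p)[:, None]
assert np.allclose(v.T @ u, np.eye(n))                # biorthogonality: sum_x v^mu(x) u^nu(x) = delta
print("lambda^1 =", lam[0])                                            # equals 1
print("v^1 proportional to P(x|w):", np.allclose(v[:, 0] / v[:, 0].sum(), p))
print("u^1 constant:", np.allclose(u[:, 0], u[0, 0]))
```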
66 We now have sufficient background to state and prove our first result. [sent-97, score-0.068]
67 The expected CD update replaces Σ_x P(x|ω) ∂E(x;ω)/∂ω by Σ_{y,x} P0(y) K^m_ω(y, x) ∂E(x;ω)/∂ω, see equation (5). [sent-101, score-0.229]
68 We use the eigenvector expansion of the transition kernel, equation (6), to express this as Σ_{y,x,μ} P0(y) {λ^μ_ω}^m u^μ_ω(y) v^μ_ω(x) ∂E(x;ω)/∂ω. [sent-102, score-0.263]
69 The result follows using the specific forms of the first eigenvectors, see equation (9). [sent-103, score-0.136]
70 We now give a second form for the expected update rule. [sent-105, score-0.217]
71 This is chosen so that Σ_x P(x|ω) g(x; ω) = 0, ∀ω, and the extrema of the Kullback-Leibler divergence occur when Σ_x P0(x) g(x; ω) = 0. [sent-107, score-0.253]
72 Result 2 Let g(x; ω) = ∂E(x;ω)/∂ω − Σ_x P(x|ω) ∂E(x;ω)/∂ω; then Σ_x P(x|ω) g(x; ω) = 0, the extrema of the Kullback-Leibler divergence occur when Σ_x P0(x) g(x; ω) = 0, and the expected update rule can be written as: ω_{t+1} = ω_t − γ_t { Σ_x P0(x) g(x; ω) − Σ_{y,x} P0(y) K^m_ω(y, x) g(x; ω) }.  (11) [sent-109, score-0.407]
73 To get the third claim we substitute the definition of g(x; ω) into the expected update equation (5). [sent-113, score-0.258]
74 The result follows using the standard property of transition kernels that Σ_y K^m_ω(x, y) = 1, ∀x. [sent-114, score-0.193]
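A quick numerical check of this rewriting (illustrative; the toy model and kernel are the same kind used in the earlier sketches): substituting g(x; ω) for ∂E(x;ω)/∂ω leaves the expected step unchanged, because the constant Σ_x P(x|ω) ∂E(x;ω)/∂ω cancels between the data term and the m-step term.

```python
import numpy as np

# Check on a toy model/kernel that rewriting the expected CD step in terms of g(x; w)
# leaves it unchanged: the constant sum_x P(x|w) dE(x;w)/dw cancels between the two sums.
n = 8
states = np.arange(n)
phi = np.stack([states / (n - 1), (states / (n - 1)) ** 2], axis=1)   # dE(x; w)/dw = phi(x)
w = np.array([0.5, 0.3])
p = np.exp(-phi @ w)
p /= p.sum()                                           # P(x|w)

K = np.zeros((n, n))                                   # Metropolis kernel, invariant dist. P(x|w)
for x in range(n):
    for y in ((x - 1) % n, (x + 1) % n):
        K[x, y] = 0.5 * min(1.0, p[y] / p[x])
    K[x, x] = 1.0 - K[x].sum()

rng = np.random.default_rng(3)
P0 = rng.dirichlet(np.ones(n))                         # an arbitrary empirical distribution P0(x)
m = 3
P0Km = P0 @ np.linalg.matrix_power(K, m)               # sum_y P0(y) K^m(y, x)

g = phi - p @ phi                                      # g(x; w) = dE/dw - sum_x P(x|w) dE/dw
step_dE = P0 @ phi - P0Km @ phi                        # expected CD step written with dE/dw (eq. (5))
step_g = P0 @ g - P0Km @ g                             # the same step written with g(x; w)
print("rewriting leaves the step unchanged:", np.allclose(step_dE, step_g))
print("sum_x P(x|w) g(x; w) = 0:", np.allclose(p @ g, 0.0))
```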
75 We now use Results 1 and 2 to understand the fixed points of the CD algorithm and determine whether it is biased. [sent-115, score-0.088]
76 Result 3 The fixed points ω∗ of the CD algorithm are true (unbiased) extrema of the KL divergence (i.e. points where Σ_x P0(x) g(x; ω∗) = 0) if and only if the bias term Σ_{y,x} P0(y) K^m_{ω∗}(y, x) g(x; ω∗) vanishes at those points. [sent-117, score-0.283]
77 A sufficient condition is that P0 (y) and g(x; ω) lie in orthogonal eigenspaces of Kω∗ (y, x). [sent-120, score-0.17]
78 The first part follows directly from equation (11) in Result 2. [sent-123, score-0.104]
79 Hence P0 (x) and g(x; ω ∗ ) lie in orthogonal eigenspaces of Kω∗ (y, x). [sent-128, score-0.104]
80 Result 3 shows that whether CD converges to an unbiased estimate usually depends on the specific form of the MCMC transition matrix Kω (y, x). [sent-129, score-0.263]
81 But there is an intuitive argument why the bias term Σ_{y,x} P0(y) K^m_{ω∗}(y, x) g(x; ω∗) may tend to be small at places where Σ_x P0(x) g(x; ω∗) = 0. [sent-130, score-0.145]
82 Alternatively, using Result 1, the bias term can be expressed as Σ_{μ≥2} {λ^μ_{ω∗}}^m {Σ_y P0(y) u^μ_{ω∗}(y)} {Σ_x v^μ_{ω∗}(x) ∂E(x;ω∗)/∂ω}; this will tend to be small provided the eigenvalue moduli |λ^μ_{ω∗}| are small for μ ≥ 2 [sent-134, score-0.211]
83 (i.e. the standard conditions for a well-defined Markov chain). [sent-136, score-0.13]
84 In general the bias term should decrease exponentially as |λ^2_{ω∗}|^m. [sent-137, score-0.112]
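The exponential decay of the bias term with the number of MCMC steps m can be seen directly on a small example (a numerical illustration, not a result from the paper): the sketch below computes Σ_{y,x} P0(y) K^m_ω(y, x) g(x; ω) for increasing m and compares its magnitude with |λ^2_ω|^m, where λ^2_ω is the second-largest eigenvalue modulus of the kernel.

```python
import numpy as np

# Decay of the bias term sum_{y,x} P0(y) K^m(y,x) g(x;w) with the number of MCMC steps m,
# compared against |lambda^2|^m (generic case: the mu >= 2 overlap coefficients are nonzero).
n = 8
states = np.arange(n)
phi = np.stack([states / (n - 1), (states / (n - 1)) ** 2], axis=1)
w = np.array([2.0, -1.0])
p = np.exp(-phi @ w)
p /= p.sum()                                          # P(x|w)

K = np.zeros((n, n))                                  # Metropolis kernel with invariant dist. P(x|w)
for x in range(n):
    for y in ((x - 1) % n, (x + 1) % n):
        K[x, y] = 0.5 * min(1.0, p[y] / p[x])
    K[x, x] = 1.0 - K[x].sum()

rng = np.random.default_rng(0)
P0 = rng.dirichlet(np.ones(n))                        # an arbitrary empirical distribution P0(x)
g = phi - p @ phi                                     # g(x; w); note sum_x P(x|w) g(x; w) = 0

# Second-largest eigenvalue modulus of K, read off the symmetrized matrix Q.
Q = np.sqrt(p)[:, None] * K / np.sqrt(p)[None, :]
lam = np.sort(np.abs(np.linalg.eigvalsh(Q)))[::-1]
lam2 = lam[1]

print(" m    ||bias term||    |lambda^2|^m")
Km = np.eye(n)
for m in range(1, 16):
    Km = Km @ K
    bias = (P0 @ Km) @ g                              # sum_{y,x} P0(y) K^m(y, x) g(x; w)
    print(f"{m:2d}    {np.linalg.norm(bias):.3e}       {lam2 ** m:.3e}")
```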
85 Clearly it is also desirable to define the transition kernels Kω(x, y) so that the right eigenvectors {u^μ_ω(y) : μ ≥ 2} are as orthogonal as possible to the observed data P0(y). [sent-138, score-0.316]
86 The practicality of CD depends on whether we can find an MCMC sampler such that the bias term Σ_{y,x} P0(y) K^m_ω(y, x) g(x; ω) is small for most ω. [sent-139, score-0.172]
87 If not, then the alternative stochastic algorithms may be preferable. [sent-140, score-0.192]
88 Finally we give convergence conditions for the CD algorithm. [sent-141, score-0.343]
89 Result 4 CD will converge with probability 1 to state ω∗ provided γ_t = 1/t, ∂E/∂ω is bounded, and (ω − ω∗) · { Σ_x P0(x) g(x; ω) − Σ_{y,x} P0(y) K^m_ω(y, x) g(x; ω) } ≥ K1 |ω − ω∗|²  (12) for some K1. [sent-142, score-0.197]
90 The boundedness of ∂E/∂ω is required to ensure that the “update noise” is bounded, in order to satisfy the second condition of the SAC theorem. [sent-145, score-0.203]
91 Results 3 and 4 can be combined to ensure that CD converges (with probability 1) to the correct (unbiased) solution. [sent-146, score-0.094]
92 This requires specifying that ω∗ in Result 4 also satisfies the conditions Σ_x P0(x) g(x; ω∗) = 0 and Σ_{y,x} P0(y) K^m_{ω∗}(y, x) g(x; ω∗) = 0. [sent-147, score-0.13]
93 5 Conclusion The goal of this paper was to relate the Contrastive Divergence (CD) algorithm to the stochastic approximation literature. [sent-148, score-0.355]
94 This enables us to give convergence conditions which ensure that CD will converge to the parameters ω ∗ that minimize the Kullback-Leibler divergence D(P0 (x)||P (x|ω)). [sent-149, score-0.685]
95 The analysis also gives necessary and sufficient conditions to determine whether the solution is unbiased. [sent-150, score-0.198]
96 The results in this paper are elementary and preliminary. [sent-152, score-0.042]
97 We conjecture that far more powerful results can be obtained by adapting the convergence theorems in the literature [6,7,9]. [sent-153, score-0.312]
98 In particular, Younes [7] gives convergence results when the gradient of the energy ∂E(x; ω)/∂ω is bounded by a term that is linear in ω (and hence unbounded). [sent-154, score-0.304]
99 He is also able to analyze the asymptotic behaviour of these algorithms. [sent-155, score-0.066]
100 Yingnian provided guidance to the stochastic approximation literature and Max gave useful comments on an early draft. [sent-160, score-0.389]
wordName wordTfidf (topN-words)
[('cd', 0.587), ('contrastive', 0.266), ('steepest', 0.252), ('stochastic', 0.192), ('sac', 0.178), ('divergence', 0.172), ('nt', 0.151), ('convergence', 0.15), ('mcmc', 0.136), ('transition', 0.132), ('conditions', 0.13), ('descent', 0.115), ('eigenvectors', 0.102), ('approximation', 0.102), ('converge', 0.093), ('update', 0.091), ('zt', 0.088), ('pn', 0.088), ('extrema', 0.081), ('theorem', 0.08), ('xt', 0.079), ('equation', 0.075), ('term', 0.071), ('martingale', 0.068), ('orr', 0.068), ('state', 0.068), ('condition', 0.066), ('kernel', 0.064), ('give', 0.063), ('expected', 0.063), ('conjecture', 0.06), ('chain', 0.06), ('younes', 0.059), ('damping', 0.059), ('kushner', 0.059), ('yingnian', 0.059), ('hinton', 0.058), ('yt', 0.057), ('unbiased', 0.056), ('express', 0.056), ('angeles', 0.054), ('eigenspaces', 0.054), ('basin', 0.054), ('satis', 0.052), ('bounded', 0.051), ('divergences', 0.05), ('orthogonal', 0.05), ('ensure', 0.049), ('yuille', 0.047), ('eigenvalues', 0.046), ('balance', 0.046), ('carlo', 0.046), ('monte', 0.046), ('welling', 0.045), ('converges', 0.045), ('pp', 0.043), ('suf', 0.042), ('markov', 0.042), ('fields', 0.042), ('elementary', 0.042), ('bias', 0.041), ('los', 0.039), ('solution', 0.038), ('invariant', 0.038), ('theorems', 0.037), ('expressed', 0.037), ('satisfy', 0.037), ('behaviour', 0.036), ('provided', 0.036), ('guarantee', 0.035), ('adapting', 0.033), ('tend', 0.033), ('magnitude', 0.033), ('implies', 0.033), ('gradient', 0.032), ('result', 0.032), ('proof', 0.032), ('literature', 0.032), ('right', 0.032), ('moreover', 0.032), ('relate', 0.031), ('rapidly', 0.031), ('asymptotic', 0.03), ('algorithm', 0.03), ('whether', 0.03), ('detailed', 0.03), ('practicality', 0.03), ('eigenspace', 0.03), ('grimmett', 0.03), ('robbins', 0.03), ('moduli', 0.03), ('slc', 0.03), ('third', 0.029), ('follows', 0.029), ('understand', 0.028), ('enables', 0.028), ('global', 0.028), ('guidance', 0.027), ('unbiasedness', 0.027), ('stimulate', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 185 nips-2004-The Convergence of Contrastive Divergences
Author: Alan L. Yuille
Abstract: This paper analyses the Contrastive Divergence algorithm for learning statistical parameters. We relate the algorithm to the stochastic approximation literature. This enables us to specify conditions under which the algorithm is guaranteed to converge to the optimal solution (with probability 1). This includes necessary and sufficient conditions for the solution to be unbiased.
2 0.10933526 48 nips-2004-Convergence and No-Regret in Multiagent Learning
Author: Michael Bowling
Abstract: Learning in a multiagent system is a challenging problem due to two key factors. First, if other agents are simultaneously learning then the environment is no longer stationary, thus undermining convergence guarantees. Second, learning is often susceptible to deception, where the other agents may be able to exploit a learner’s particular dynamics. In the worst case, this could result in poorer performance than if the agent was not learning at all. These challenges are identifiable in the two most common evaluation criteria for multiagent learning algorithms: convergence and regret. Algorithms focusing on convergence or regret in isolation are numerous. In this paper, we seek to address both criteria in a single algorithm by introducing GIGA-WoLF, a learning algorithm for normalform games. We prove the algorithm guarantees at most zero average regret, while demonstrating the algorithm converges in many situations of self-play. We prove convergence in a limited setting and give empirical results in a wider variety of situations. These results also suggest a third new learning criterion combining convergence and regret, which we call negative non-convergence regret (NNR). 1
3 0.10861409 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms
Author: Nicolò Cesa-bianchi, Claudio Gentile, Luca Zaniboni
Abstract: We provide a worst-case analysis of selective sampling algorithms for learning linear threshold functions. The algorithms considered in this paper are Perceptron-like algorithms, i.e., algorithms which can be efficiently run in any reproducing kernel Hilbert space. Our algorithms exploit a simple margin-based randomized rule to decide whether to query the current label. We obtain selective sampling algorithms achieving on average the same bounds as those proven for their deterministic counterparts, but using much fewer labels. We complement our theoretical findings with an empirical comparison on two text categorization tasks. The outcome of these experiments is largely predicted by our theoretical results: Our selective sampling algorithms tend to perform as good as the algorithms receiving the true label after each classification, while observing in practice substantially fewer labels. 1
4 0.108444 190 nips-2004-The Rescorla-Wagner Algorithm and Maximum Likelihood Estimation of Causal Parameters
Author: Alan L. Yuille
Abstract: This paper analyzes generalization of the classic Rescorla-Wagner (RW) learning algorithm and studies their relationship to Maximum Likelihood estimation of causal parameters. We prove that the parameters of two popular causal models, ∆P and P C, can be learnt by the same generalized linear Rescorla-Wagner (GLRW) algorithm provided genericity conditions apply. We characterize the fixed points of these GLRW algorithms and calculate the fluctuations about them, assuming that the input is a set of i.i.d. samples from a fixed (unknown) distribution. We describe how to determine convergence conditions and calculate convergence rates for the GLRW algorithms under these conditions.
5 0.10674021 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods
Author: Chakra Chennubhotla, Allan D. Jepson
Abstract: We show how to build hierarchical, reduced-rank representation for large stochastic matrices and use this representation to design an efficient algorithm for computing the largest eigenvalues, and the corresponding eigenvectors. In particular, the eigen problem is first solved at the coarsest level of the representation. The approximate eigen solution is then interpolated over successive levels of the hierarchy. A small number of power iterations are employed at each stage to correct the eigen solution. The typical speedups obtained by a Matlab implementation of our fast eigensolver over a standard sparse matrix eigensolver [13] are at least a factor of ten for large image sizes. The hierarchical representation has proven to be effective in a min-cut based segmentation algorithm that we proposed recently [8].
6 0.09961427 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection
7 0.095518909 18 nips-2004-Algebraic Set Kernels with Application to Inference Over Local Image Representations
8 0.089757532 103 nips-2004-Limits of Spectral Clustering
9 0.069258049 161 nips-2004-Self-Tuning Spectral Clustering
10 0.068278715 116 nips-2004-Message Errors in Belief Propagation
11 0.067447059 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
12 0.06614805 168 nips-2004-Semigroup Kernels on Finite Sets
13 0.066055693 170 nips-2004-Similarity and Discrimination in Classical Conditioning: A Latent Variable Account
14 0.064859614 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants
15 0.064558446 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
16 0.059307758 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification
17 0.057252608 69 nips-2004-Fast Rates to Bayes for Kernel Machines
18 0.057207711 175 nips-2004-Stable adaptive control with online learning
19 0.056683518 83 nips-2004-Incremental Learning for Visual Tracking
20 0.054061696 64 nips-2004-Experts in a Markov Decision Process
topicId topicWeight
[(0, -0.188), (1, 0.003), (2, 0.116), (3, 0.026), (4, 0.0), (5, -0.137), (6, -0.085), (7, -0.02), (8, 0.001), (9, -0.065), (10, -0.079), (11, -0.066), (12, -0.006), (13, -0.009), (14, 0.059), (15, -0.041), (16, 0.038), (17, 0.06), (18, -0.017), (19, -0.06), (20, -0.056), (21, 0.011), (22, 0.054), (23, -0.089), (24, -0.017), (25, 0.037), (26, -0.019), (27, -0.053), (28, -0.073), (29, 0.021), (30, 0.015), (31, 0.103), (32, -0.209), (33, 0.052), (34, -0.142), (35, -0.033), (36, 0.009), (37, -0.086), (38, -0.217), (39, 0.094), (40, 0.103), (41, 0.012), (42, -0.067), (43, -0.035), (44, -0.005), (45, -0.032), (46, -0.029), (47, 0.111), (48, -0.058), (49, 0.065)]
simIndex simValue paperId paperTitle
same-paper 1 0.96217769 185 nips-2004-The Convergence of Contrastive Divergences
Author: Alan L. Yuille
Abstract: This paper analyses the Contrastive Divergence algorithm for learning statistical parameters. We relate the algorithm to the stochastic approximation literature. This enables us to specify conditions under which the algorithm is guaranteed to converge to the optimal solution (with probability 1). This includes necessary and sufficient conditions for the solution to be unbiased.
2 0.8564837 190 nips-2004-The Rescorla-Wagner Algorithm and Maximum Likelihood Estimation of Causal Parameters
Author: Alan L. Yuille
Abstract: This paper analyzes generalization of the classic Rescorla-Wagner (RW) learning algorithm and studies their relationship to Maximum Likelihood estimation of causal parameters. We prove that the parameters of two popular causal models, ∆P and P C, can be learnt by the same generalized linear Rescorla-Wagner (GLRW) algorithm provided genericity conditions apply. We characterize the fixed points of these GLRW algorithms and calculate the fluctuations about them, assuming that the input is a set of i.i.d. samples from a fixed (unknown) distribution. We describe how to determine convergence conditions and calculate convergence rates for the GLRW algorithms under these conditions.
3 0.49253467 48 nips-2004-Convergence and No-Regret in Multiagent Learning
Author: Michael Bowling
Abstract: Learning in a multiagent system is a challenging problem due to two key factors. First, if other agents are simultaneously learning then the environment is no longer stationary, thus undermining convergence guarantees. Second, learning is often susceptible to deception, where the other agents may be able to exploit a learner’s particular dynamics. In the worst case, this could result in poorer performance than if the agent was not learning at all. These challenges are identifiable in the two most common evaluation criteria for multiagent learning algorithms: convergence and regret. Algorithms focusing on convergence or regret in isolation are numerous. In this paper, we seek to address both criteria in a single algorithm by introducing GIGA-WoLF, a learning algorithm for normalform games. We prove the algorithm guarantees at most zero average regret, while demonstrating the algorithm converges in many situations of self-play. We prove convergence in a limited setting and give empirical results in a wider variety of situations. These results also suggest a third new learning criterion combining convergence and regret, which we call negative non-convergence regret (NNR). 1
4 0.48399767 18 nips-2004-Algebraic Set Kernels with Application to Inference Over Local Image Representations
Author: Amnon Shashua, Tamir Hazan
Abstract: This paper presents a general family of algebraic positive definite similarity functions over spaces of matrices with varying column rank. The columns can represent local regions in an image (whereby images have varying number of local parts), images of an image sequence, motion trajectories in a multibody motion, and so forth. The family of set kernels we derive is based on a group invariant tensor product lifting with parameters that can be naturally tuned to provide a cook-book of sorts covering the possible ”wish lists” from similarity measures over sets of varying cardinality. We highlight the strengths of our approach by demonstrating the set kernels for visual recognition of pedestrians using local parts representations. 1
5 0.46850842 103 nips-2004-Limits of Spectral Clustering
Author: Ulrike V. Luxburg, Olivier Bousquet, Mikhail Belkin
Abstract: An important aspect of clustering algorithms is whether the partitions constructed on finite samples converge to a useful clustering of the whole data space as the sample size increases. This paper investigates this question for normalized and unnormalized versions of the popular spectral clustering algorithm. Surprisingly, the convergence of unnormalized spectral clustering is more difficult to handle than the normalized case. Even though recently some first results on the convergence of normalized spectral clustering have been obtained, for the unnormalized case we have to develop a completely new approach combining tools from numerical integration, spectral and perturbation theory, and probability. It turns out that while in the normalized case, spectral clustering usually converges to a nice partition of the data space, in the unnormalized case the same only holds under strong additional assumptions which are not always satisfied. We conclude that our analysis gives strong evidence for the superiority of normalized spectral clustering. It also provides a basis for future exploration of other Laplacian-based methods. 1
6 0.46071586 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms
7 0.46062392 96 nips-2004-Learning, Regularization and Ill-Posed Inverse Problems
8 0.44603765 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods
9 0.43276644 207 nips-2004-ℓ₀-norm Minimization for Basis Selection
10 0.42428878 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification
11 0.40136304 170 nips-2004-Similarity and Discrimination in Classical Conditioning: A Latent Variable Account
12 0.37608263 161 nips-2004-Self-Tuning Spectral Clustering
13 0.37423122 30 nips-2004-Binet-Cauchy Kernels
14 0.36459634 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection
15 0.36350644 1 nips-2004-A Cost-Shaping LP for Bellman Error Minimization with Performance Guarantees
16 0.34208119 94 nips-2004-Kernels for Multi--task Learning
17 0.33474621 168 nips-2004-Semigroup Kernels on Finite Sets
18 0.33271858 124 nips-2004-Multiple Alignment of Continuous Time Series
19 0.32524419 89 nips-2004-Joint MRI Bias Removal Using Entropy Minimization Across Images
20 0.3181442 27 nips-2004-Bayesian Regularization and Nonnegative Deconvolution for Time Delay Estimation
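The GLRW abstract in entry 2 above is, at bottom, a stochastic iteration driven by i.i.d. samples. The following minimal Python sketch shows a plain linear Rescorla-Wagner-style delta-rule update with a Robbins-Monro step size; the cue distribution, target weights, noise level, and 1/t schedule are illustrative assumptions, not the GLRW algorithm or data model of the cited paper.

```python
import numpy as np

# Illustrative sketch only: a linear Rescorla-Wagner-style update driven by
# i.i.d. samples. Data model, step-size schedule, and dimensions are hypothetical.
rng = np.random.default_rng(0)
d, T = 3, 20000
w_true = np.array([0.8, -0.3, 0.5])   # hypothetical target weights
w = np.zeros(d)                        # initial associative strengths

for t in range(1, T + 1):
    x = rng.binomial(1, 0.5, size=d).astype(float)        # cue vector, i.i.d. samples
    r = float(x @ w_true) + 0.1 * rng.standard_normal()   # noisy outcome
    gamma_t = 1.0 / t                                      # Robbins-Monro style step size
    w += gamma_t * (r - x @ w) * x                         # delta-rule / RW update

print("estimated weights:", np.round(w, 3))
print("target weights:   ", w_true)
```

With a decaying step size the iterate settles near the least-squares fixed point determined by the sample distribution, which is the kind of fixed-point-plus-fluctuation behavior the abstract refers to.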
topicId topicWeight
[(13, 0.058), (15, 0.084), (26, 0.607), (33, 0.128), (50, 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.94663191 185 nips-2004-The Convergence of Contrastive Divergences
Author: Alan L. Yuille
Abstract: This paper analyses the Contrastive Divergence algorithm for learning statistical parameters. We relate the algorithm to the stochastic approximation literature. This enables us to specify conditions under which the algorithm is guaranteed to converge to the optimal solution (with probability 1). This includes necessary and sufficient conditions for the solution to be unbiased.
2 0.92314655 34 nips-2004-Breaking SVM Complexity with Cross-Training
Author: Léon Bottou, Jason Weston, Gökhan H. Bakir
Abstract: We propose to selectively remove examples from the training set using probabilistic estimates related to editing algorithms (Devijver and Kittler, 1982). This heuristic procedure aims at creating a separable distribution of training examples with minimal impact on the position of the decision boundary. It breaks the linear dependency between the number of SVs and the number of training examples, and sharply reduces the complexity of SVMs during both the training and prediction stages. (An illustrative editing sketch appears after this list.)
3 0.91086417 97 nips-2004-Learning Efficient Auditory Codes Using Spikes Predicts Cochlear Filters
Author: Evan C. Smith, Michael S. Lewicki
Abstract: The representation of acoustic signals at the cochlear nerve must serve a wide range of auditory tasks that require exquisite sensitivity in both time and frequency. Lewicki (2002) demonstrated that many of the filtering properties of the cochlea could be explained in terms of efficient coding of natural sounds. This model, however, did not account for properties such as phase-locking or how sound could be encoded in terms of action potentials. Here, we extend this theoretical approach and propose an algorithm for learning efficient auditory codes using a spiking population code, based on a theoretical model for coding sound in terms of spikes. In this model, each spike encodes the precise time position and magnitude of a localized, time-varying kernel function. By adapting the kernel functions to the statistics of natural sounds, we show that, compared to conventional signal representations, the spike code achieves far greater coding efficiency. Furthermore, the inferred kernels show both striking similarities to measured cochlear filters and a similar bandwidth-versus-frequency dependence.
4 0.88266063 37 nips-2004-Co-Training and Expansion: Towards Bridging Theory and Practice
Author: Maria-florina Balcan, Avrim Blum, Ke Yang
Abstract: Co-training is a method for combining labeled and unlabeled data when examples can be thought of as containing two distinct sets of features. It has had a number of practical successes, yet previous theoretical analyses have needed very strong assumptions on the data that are unlikely to be satisfied in practice. In this paper, we propose a much weaker “expansion” assumption on the underlying data distribution, that we prove is sufficient for iterative cotraining to succeed given appropriately strong PAC-learning algorithms on each feature set, and that to some extent is necessary as well. This expansion assumption in fact motivates the iterative nature of the original co-training algorithm, unlike stronger assumptions (such as independence given the label) that allow a simpler one-shot co-training to succeed. We also heuristically analyze the effect on performance of noise in the data. Predicted behavior is qualitatively matched in synthetic experiments on expander graphs. 1
5 0.68821913 190 nips-2004-The Rescorla-Wagner Algorithm and Maximum Likelihood Estimation of Causal Parameters
Author: Alan L. Yuille
Abstract: This paper analyzes generalizations of the classic Rescorla-Wagner (RW) learning algorithm and studies their relationship to Maximum Likelihood estimation of causal parameters. We prove that the parameters of two popular causal models, ∆P and PC, can be learnt by the same generalized linear Rescorla-Wagner (GLRW) algorithm provided genericity conditions apply. We characterize the fixed points of these GLRW algorithms and calculate the fluctuations about them, assuming that the input is a set of i.i.d. samples from a fixed (unknown) distribution. We describe how to determine convergence conditions and calculate convergence rates for the GLRW algorithms under these conditions.
6 0.55742824 48 nips-2004-Convergence and No-Regret in Multiagent Learning
7 0.55431956 69 nips-2004-Fast Rates to Bayes for Kernel Machines
8 0.55057997 172 nips-2004-Sparse Coding of Natural Images Using an Overcomplete Set of Limited Capacity Units
9 0.53886253 68 nips-2004-Face Detection --- Efficient and Rank Deficient
10 0.53855044 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms
11 0.53692538 144 nips-2004-Parallel Support Vector Machines: The Cascade SVM
12 0.52880907 65 nips-2004-Exploration-Exploitation Tradeoffs for Experts Algorithms in Reactive Environments
13 0.51590049 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve
14 0.51400834 103 nips-2004-Limits of Spectral Clustering
15 0.51398736 72 nips-2004-Generalization Error and Algorithmic Convergence of Median Boosting
16 0.5078029 194 nips-2004-Theory of localized synfire chain: characteristic propagation speed of stable spike pattern
17 0.50525218 7 nips-2004-A Large Deviation Bound for the Area Under the ROC Curve
18 0.49128959 153 nips-2004-Reducing Spike Train Variability: A Computational Theory Of Spike-Timing Dependent Plasticity
19 0.48720333 28 nips-2004-Bayesian inference in spiking neurons
20 0.48531777 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill
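Entry 2 of the list above describes editing the training set with probabilistic estimates before fitting an SVM (see the note at the end of that abstract). The sketch below is only a generic stand-in for that idea: it fits a first probabilistic classifier, drops points whose own label is assigned low probability, and retrains on the remainder. The synthetic data, the 0.3 threshold, and the use of scikit-learn's SVC probability estimates are assumptions for illustration, not the cross-training procedure of the cited paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Illustrative sketch only: a generic probabilistic "editing" step before SVM
# training. The dataset, threshold, and probability estimator are hypothetical.
X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.05,
                           random_state=0)

# First pass: fit a probabilistic classifier on the full training set.
first = SVC(kernel="rbf", probability=True, random_state=0).fit(X, y)
p = first.predict_proba(X)[np.arange(len(y)), y]   # estimated P(own label | x)

# Edit: drop points whose own label looks improbable (likely noise/overlap).
keep = p > 0.3                                      # hypothetical threshold
X_edit, y_edit = X[keep], y[keep]

# Second pass: train the final SVM on the edited, more separable set.
final = SVC(kernel="rbf").fit(X_edit, y_edit)
print(f"kept {keep.sum()} of {len(y)} examples; "
      f"{len(final.support_)} support vectors vs {len(first.support_)} before editing")
```

The point of the exercise is the qualitative effect named in the abstract: an edited, more nearly separable training set tends to need fewer support vectors than the full noisy set.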