jmlr jmlr2005 jmlr2005-23 knowledge-graph by maker-knowledge-mining

23 jmlr-2005-Convergence Theorems for Generalized Alternating Minimization Procedures


Source: pdf

Author: Asela Gunawardana, William Byrne

Abstract: The EM algorithm is widely used to develop iterative parameter estimation procedures for statistical models. In cases where these procedures strictly follow the EM formulation, the convergence properties of the estimation procedures are well understood. In some instances there are practical reasons to develop procedures that do not strictly fall within the EM framework. We study EM variants in which the E-step is not performed exactly, either to obtain improved rates of convergence, or due to approximations needed to compute statistics under a model family over which E-steps cannot be realized. Since these variants are not EM procedures, the standard (G)EM convergence results do not apply to them. We present an information geometric framework for describing such algorithms and analyzing their convergence properties. We apply this framework to analyze the convergence properties of incremental EM and variational EM. For incremental EM, we discuss conditions under which these algorithms converge in likelihood. For variational EM, we show how the E-step approximation prevents convergence to local maxima in likelihood. Keywords: EM, variational EM, incremental EM, convergence, information geometry

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In cases where these procedures strictly follow the EM formulation, the convergence properties of the estimation procedures are well understood. [sent-11, score-0.115]

2 We apply this framework to analyze the convergence properties of incremental EM and variational EM. [sent-16, score-0.176]

3 For incremental EM, we discuss conditions under which these algorithms converge in likelihood. [sent-17, score-0.107]

4 For variational EM, we show how the E-step approximation prevents convergence to local maxima in likelihood. [sent-18, score-0.112]

5 Keywords: EM, variational EM, incremental EM, convergence, information geometry [sent-19, score-0.159]

6 We propose the generalized alternating minimization (GAM) framework with the goal of understanding the convergence properties of a broad class of such EM variants. [sent-36, score-0.117]

7 We will show that this alternating minimization procedure can be extended in a manner analogous to the manner in which generalized EM (GEM) extends the M step of EM. [sent-38, score-0.115]

8 This will show that GAM algorithms are a further generalization of GEM algorithms which are no longer guaranteed to increase likelihood at every iteration, but nevertheless retain convergence to stationary points in likelihood under fairly general conditions. [sent-40, score-0.14]

9 We use the GAM framework to analyze the convergence behavior of incremental EM (Neal and Hinton, 1998) and variational EM (Jordan et al. [sent-48, score-0.176]

10 We show that incremental EM converges to stationary points in likelihood under mild assumptions on the model family. [sent-50, score-0.164]

11 We do show how GAM convergence arguments can be used to guarantee the convergence of a broad class of variational EM estimation procedures. [sent-52, score-0.12]

12 However, unlike incremental EM, this does not guarantee convergence to stationary points in likelihood. [sent-53, score-0.159]

13 On the contrary, we show that fixed points under variational EM cannot be stationary points in likelihood, except in the degenerate case when the model family is forced to satisfy the constraints that define the variational approximation itself. [sent-54, score-0.225]

14 First, the divergence from the current model to a family of distributions of a certain form is minimized to give a generating distribution. [sent-56, score-0.126]

15 In Section 4 we apply our convergence results to incremental EM and show that although the algorithm is non-monotonic in likelihood, it does converge to EM fixed points under very general conditions. [sent-71, score-0.133]

16 Note that Neal and Hinton (1998) have already shown that incremental EM gives non-increasing divergence (non-decreasing free energy) and that local minima in divergence (local maxima in free energy) are local maxima in likelihood. [sent-72, score-0.253]

17 In Section 5 we apply a similar analysis to variational EM to show that convergence to EM fixed points occurs only in degenerate cases. [sent-74, score-0.113]

18 EM and Generalized Alternating Minimization. We adopt the view of the EM algorithm as an alternating minimization procedure under the information geometric framework as developed by Csiszár and Tusnády (Csiszár and Tusnády, 1984; Csiszár, 1990). [sent-77, score-0.108]

19 In particular we show how EM can be derived as the alternating minimization of the information divergence between the model family and a set of probability distributions constrained to be concentrated on the training data. [sent-81, score-0.179]

20 We then extend this alternating minimization framework to generalized alternating minimization (GAM) algorithms, which are EM variants that allow extensions of the E step, in addition to the M step extensions allowed by GEM algorithms. [sent-83, score-0.21]

21 Backward Step: Find the parameter θ(t+1) that minimizes the divergence to QX(t+1): θ(t+1) ∈ argmin_{θ∈Θ} D(QX(t+1) || PX;θ). In the language of Csiszár (1975), QX(t+1) is the I-projection of PX;θ(t) onto D. [sent-118, score-0.112]

22 Note that we use the notation θ(t+1) ∈ argmin_{θ∈Θ} D(QX(t+1) || PX;θ) instead of θ(t+1) = argmin_{θ∈Θ} D(QX(t+1) || PX;θ) because the minimizer in the backward step (3) may not be unique. [sent-122, score-0.215]
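The forward/backward decomposition above can be made concrete with a minimal sketch. The snippet below uses a unit-variance two-component Gaussian mixture chosen purely for illustration (the paper's running examples use Boltzmann machines, and all names here are hypothetical): the forward step is the I-projection onto D, i.e., computing the model posterior over the hidden component, and the backward step minimizes the divergence over θ.

```python
import numpy as np

def forward_step(x, pi, mu):
    """Forward step (I-projection onto D): for each observation, the desired
    distribution over the hidden component is the model posterior p(z|x; theta).
    Unit-variance components keep the formulas short."""
    log_p = np.stack([
        np.log(pi) - 0.5 * (x - mu[0]) ** 2,
        np.log(1.0 - pi) - 0.5 * (x - mu[1]) ** 2,
    ], axis=1)
    log_p -= log_p.max(axis=1, keepdims=True)   # numerical stabilization
    q = np.exp(log_p)
    return q / q.sum(axis=1, keepdims=True)

def backward_step(x, q):
    """Backward step: the theta minimizing D(Q || P_theta), i.e. the M-step."""
    pi = q[:, 0].mean()
    mu = (q * x[:, None]).sum(axis=0) / q.sum(axis=0)
    return pi, mu

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])
pi, mu = 0.5, np.array([-1.0, 1.0])
for _ in range(50):
    q = forward_step(x, pi, mu)
    pi, mu = backward_step(x, q)
```

Because each step minimizes the divergence over one argument with the other held fixed, the divergence, and hence the negative log likelihood, is non-increasing across iterations.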

23 We distinguish between the forward and backward steps of the alternating minimization procedure and the E and M steps of the EM procedure, as they are subtly different. [sent-123, score-0.35]

24 The E step corresponds to computing the (conditional) expected log likelihood (EM auxiliary function) under the result of the forward projection. [sent-124, score-0.196]

25 Thus, the E step corresponds to taking an expectation under the distribution found in the forward step. [sent-126, score-0.138]

26 (1977), because they generalize either the forward or the backward step. [sent-130, score-0.235]

27 We are interested in extending the alternating minimization formulation to such variants by relaxing the requirement that the forward and backward steps perform exact minimization over the families of distributions. [sent-135, score-0.367]

28 Generalizations of the backward step, as in (4), correspond to the well-known GEM algorithms. [sent-139, score-0.147]

29 We refer to algorithms that consist of alternating application of such generalized forward and backward steps as generalized alternating minimization (GAM) algorithms. [sent-141, score-0.414]

30 Thus, we examine generalizations that are composed of forward and backward steps that reduce the divergence, as shown schematically in Figure 2. [sent-145, score-0.253]

31 (1999), the variational EM algorithm is best described as an alternating minimization between a set of parameterized models and a set of variational approximations to the posterior. [sent-148, score-0.211]

32 This corresponds to extending the forward step to be a projection onto a subset of D which satisfies additional constraints (namely, belonging to a given parametric family), rather than a projection onto D itself. [sent-149, score-0.166]

33 Then, given observations ŝE of the evidence nodes, the forward step for EM estimation of the Boltzmann machine is as follows. [sent-156, score-0.138]

34 Note that a closed-form solution for the backward step is not generally available, but convergent algorithms can be obtained using gradient descent or iterative proportional fitting (Darroch and Ratcliff, 1972; Byrne, 1992). [sent-158, score-0.156]

35 While the forward step can in principle be carried out exactly, this computation quickly becomes intractable as the number of states increases. [sent-159, score-0.138]

36 The forward step can then be replaced by an approximate forward step, which is now a minimization over the variational parameter µ for fixed θ(t). Approximate Forward Step: Find an (approximate) desired distribution QX(t+1) ∈ argmin_{QX ∈ DMF} D(QX || PX;θ(t)). [sent-168, score-0.391]
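For intuition, a mean-field approximate forward step can be sketched on a two-unit toy model (the tiny model and all names are illustrative assumptions, not the paper's Boltzmann machine). Each coordinate update sets one factor of the fully factorized Q to its conditional optimum, so D(QX || PX;θ) cannot increase, which is exactly what a GAM forward step requires.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_field_forward(w, b1, b2, iters=50):
    """Coordinate descent on D(Q || P) over the fully factorized family
    Q(x1, x2) = Bern(mu1) Bern(mu2), for the toy model
    p(x1, x2) proportional to exp(w*x1*x2 + b1*x1 + b2*x2), x_i in {0, 1}."""
    mu1 = mu2 = 0.5
    for _ in range(iters):
        mu1 = sigmoid(b1 + w * mu2)  # optimal q(x1) with q(x2) held fixed
        mu2 = sigmoid(b2 + w * mu1)  # optimal q(x2) with q(x1) held fixed
    return mu1, mu2
```

With w = 0 the units are independent and the approximation is exact (mu_i = sigmoid(b_i)); with w ≠ 0 the factorized family cannot represent the coupling, which is the source of the approximation error discussed in Section 5.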

37 It can be seen that this variational EM variant is easily described in terms of minimizing the divergence between a constrained family of desired distributions and a model family. [sent-175, score-0.187]

38 The approximate forward step in this example is a generalization of the usual I-projection onto D , and the resulting algorithm is therefore a GAM procedure. [sent-176, score-0.152]

39 In particular, monotonic increase in likelihood and convergence to local maxima (technically, stationary points) in likelihood may no longer hold. [sent-181, score-0.149]

40 This may happen even when the divergence is non-increasing, and when stationary points of the likelihood are fixed points of the GAM procedure. [sent-182, score-0.157]

41 The divergence between a desired distribution and a model is given by D(QX || PX;θ) = q11 log(q11 / θ²) + (1 − q11) log((1 − q11) / (1 − θ)²), which can be shown to be convex in q11 for fixed θ and convex in θ for fixed q11 (though not jointly convex in q11 and θ). [sent-196, score-0.111]

42 The EM algorithm for estimating θ can be given by a forward step and a backward step as follows: Forward Step: As described above, the forward step is given by the I-projection of the model PX; θ(t) onto D . [sent-197, score-0.437]

43 The additional constraint (5) on the desired distribution changes the forward step, and as a result, the convergence behavior of the algorithm. [sent-210, score-0.161]

44 Note that a forward step that projects onto this constrained set of desired distributions will reduce the divergence between the desired distribution and the model, and will therefore be a GAM procedure. [sent-211, score-0.27]

45 Therefore, the forward step is given by q11 = 0. [sent-216, score-0.138]

46 The unconstrained forward step would have given q11 = 0. [sent-220, score-0.138]

47 Under the additional constraint (5), the forward step gives q11(1) = 0.4, and the next forward step again gives q11(2) = 0.4. [sent-222, score-0.138] [sent-225, score-0.138]
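The toy example can be replayed numerically. In the sketch below (illustrative code, not the paper's), the closed-form steps follow by differentiating the divergence given earlier: the backward step yields θ = q11, the unconstrained forward step yields q11 = θ²/(θ² + (1 − θ)²), and a constrained forward step clips the minimizer at a cap standing in for constraint (5); since D is convex in q11, the clipped value is still the constrained minimizer, so each step still reduces the divergence.

```python
import math

def divergence(q11, theta):
    """D(Q || P) for the toy model in the text."""
    def xlog(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return xlog(q11, theta ** 2) + xlog(1.0 - q11, (1.0 - theta) ** 2)

def forward(theta, cap=None):
    """Minimize over q11 for fixed theta; optionally enforce q11 <= cap."""
    q11 = theta ** 2 / (theta ** 2 + (1.0 - theta) ** 2)
    return min(q11, cap) if cap is not None else q11

def backward(q11):
    """Minimize over theta for fixed q11; solving dD/dtheta = 0 gives theta = q11."""
    return q11

# Alternating minimization: the divergence never increases.
theta, d_prev = 0.3, float("inf")
for _ in range(6):
    q11 = forward(theta, cap=0.4)
    assert divergence(q11, theta) <= d_prev + 1e-9
    theta = backward(q11)
    d_prev = divergence(q11, theta)
```

Starting from θ = 0.3, the iterates here drift toward the q11 = 0 fixed point; the assertion verifies that the divergence is non-increasing at every step.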

49 In fact, non-monotonic convergence behavior can also be seen in the case of incremental EM (Byrne and Gunawardana, 2000). [sent-238, score-0.11]

50 In the following, we will show that under smoothness conditions on the forward and backward steps, GAM procedures that strictly reduce divergence at every step, except possibly at stationary points in likelihood, will yield solutions that are stationary points in likelihood. [sent-239, score-0.454]

51 We will define the forward and the backward steps to be point-to-set maps, rather than functions, so that we may deal with extended E and M steps that do not yield unique iterates. [sent-242, score-0.271]

52 The continuity of the divergence in (QX , θ) follows from the continuity of the divergence D(QX ||PX; θ ) in QX and PX; θ and the continuity of pX; θ in θ. [sent-262, score-0.189]

53 The GAM convergence theorem and corollary provide conditions under which iterative estimation procedures converge to stationary points in likelihood. [sent-273, score-0.154]

54 From this we find the following relationship between the likelihood of the model estimates and the overall divergence: D(QX || PX;θ) = D(QX || PX|Y=ŷ;θ) − log pY(ŷ; θ). [sent-276, score-0.11]
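This identity can be verified in a few lines (notation as in the text, with ŷ the observed training data): since every QX ∈ D is concentrated on {x : y(x) = ŷ}, the model probability factors through the posterior.

```latex
\begin{align*}
D(Q_X \| P_{X;\theta})
  &= \sum_{x : y(x) = \hat{y}} q(x) \log \frac{q(x)}{p_X(x;\theta)} \\
  &= \sum_{x : y(x) = \hat{y}} q(x) \log \frac{q(x)}{p_{X|Y}(x \mid \hat{y};\theta)\, p_Y(\hat{y};\theta)} \\
  &= D(Q_X \| P_{X|Y=\hat{y};\theta}) - \log p_Y(\hat{y};\theta).
\end{align*}
```

In particular, when QX equals the posterior, the first term vanishes and the overall divergence reduces to the negative log likelihood.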

55 The solid arrows show forward and backward steps that reduce the divergence rather than minimizing it. [sent-282, score-0.336]

56 The broken arrows show the forward steps that would have been taken by the EM algorithm. [sent-283, score-0.148]

57 Note that the divergence between the desired distribution yielded by the forward step and the I-projection of the model decreases, while the negative log likelihood increases. [sent-288, score-0.269]

58 This allows us to construct a map FB through the composition of generalized forward and backward steps F and B. [sent-316, score-0.282]

59 As seen in the proof, it is insufficient for the forward and backward steps to satisfy the GAM and EQ conditions separately. [sent-317, score-0.264]

60 It is also necessary for the forward step to satisfy the equality condition with a unique minimizer. [sent-318, score-0.149]

61 For example, this condition is satisfied when D is defined by linear constraints as in Section 2 and the forward step is a simple projection, as in the case of EM. [sent-319, score-0.149]

62 It is important to show that FB strictly decreases the divergence for all points outside the solution set Γ, since any points where this does not hold are accumulation points of the algorithm. [sent-321, score-0.112]

63 Here, we use our GAM results to show that in most cases, the incremental updates do not sacrifice the convergence guarantees of EM, despite the non-monotonicity in likelihood. [sent-335, score-0.11]

64 To show that incremental EM can be a GAM procedure, we describe it as a nested series of n incremental forward steps and n exact backward steps. [sent-344, score-0.419]

65 We formally represent the jth incremental forward step F(j) : D × Θ → D × Θ as the singleton point-to-set map F(j)(QX, θ) = {(Q′X, θ) : Q′X = PX(j)|Y(j)=ŷ(j); θ ∏_{i≠j} QX(i)}. [sent-348, score-0.239]

66 The backward step is represented by a closed point-to-set map B : D × Θ → D × Θ satisfying conditions (GAM. [sent-350, score-0.205]

67 Using these composite maps, we can describe incremental EM as (QX(t+1), θ(t+1)) ∈ FB(QX(t), θ(t)), where FB = B ∘ F(n) ∘ ··· ∘ B ∘ F(1). [sent-357, score-0.104]

68 Proposition 6 As defined, incremental EM can be shown to converge to stationary points in likelihood through application of the GAM convergence theorem. [sent-358, score-0.204]

69 Since the jth incremental forward step minimizes the jth component divergence, we get D(QX(t+1,j) || PX;θ(t+1,j)) ≤ ∑_{i:i≠j} D(QX(i)(t+1,j−1) || PX;θ(t+1,j−1)) + D(QX(j)(t+1,j−1) || PX;θ(t+1,j−1)) = D(QX(t+1,j−1) || PX;θ(t+1,j−1)). [sent-361, score-0.233]

70 Thus, the GAM convergence theorem shows that incremental EM procedures converge to EM fixed points when the EM auxiliary function is uniquely maximized. [sent-365, score-0.209]
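The composite map FB = B ∘ F(n) ∘ ··· ∘ B ∘ F(1) can be sketched directly in code. The snippet below (hypothetical names; a unit-variance two-component Gaussian mixture used only as a stand-in model, in the spirit of Neal and Hinton, 1998) refreshes one observation's desired distribution at a time (the incremental forward step F(j)), and re-estimates the parameters after each such update (the backward step B).

```python
import numpy as np

def incremental_em(x, iters=20):
    """Incremental EM sketch for a unit-variance two-component mixture: each
    inner update is an incremental forward step F^(j) followed immediately by
    a backward step B computed from the full set of sufficient statistics."""
    n = len(x)
    mu = np.array([x.min(), x.max()])                       # crude init
    near_hi = (np.abs(x - mu[1]) < np.abs(x - mu[0])).astype(float)
    q = np.stack([1.0 - near_hi, near_hi], axis=1)          # responsibilities
    pi = q[:, 0].mean()
    for _ in range(iters):
        for j in range(n):
            log_p = np.array([
                np.log(pi) - 0.5 * (x[j] - mu[0]) ** 2,
                np.log(1.0 - pi) - 0.5 * (x[j] - mu[1]) ** 2,
            ])
            p = np.exp(log_p - log_p.max())
            q[j] = p / p.sum()                              # forward step F^(j)
            s = q.sum(axis=0)                               # backward step B:
            pi = s[0] / n                                   # re-estimate theta
            mu = (q * x[:, None]).sum(axis=0) / s           # from all statistics
    return pi, mu

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-3.0, 1.0, 100), rng.normal(3.0, 1.0, 100)])
pi_hat, mu_hat = incremental_em(data)
```

A production version maintains running sums so each incremental update is O(1) rather than O(n); the recomputation here keeps the forward/backward structure explicit.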

71 (2001) also show that the convergence behavior of incremental EM is different from that of EM in practice. [sent-369, score-0.11]

72 Variational EM as GAM. Variational approximations have been popular in cases where computing the exact forward step qX(x) = pX|Y(x|ŷ; θ) is intractable (Jordan et al. [sent-371, score-0.138]

73 Then, the variational forward step is defined to be FV(QX, θ) = {(Q′X, θ) : Q′X ∈ argmin_{QX ∈ DV} D(QX || PX;θ)}. [sent-376, score-0.224]

74 Notice that the divergence minimized at every iteration is no longer just D(PX|Y=ŷ;θ || PX;θ) (which is the negative log likelihood) as in the EM algorithm, and that therefore, the likelihood is not guaranteed to increase at every iteration. [sent-379, score-0.12]

75 We now examine if the conditions of the GAM convergence theorem of Section 3 still hold if the forward step of the EM procedure is replaced by FV . [sent-380, score-0.176]

76 If this condition holds, then the algorithm converges to minimizers of the divergence between the family of variational approximations and the model family. [sent-388, score-0.167]

77 For example, this happens when the variational E step is uniquely defined. [sent-389, score-0.109]

78 Therefore a θ∗ generated by a variational EM procedure is a stationary point in likelihood if and only if θ∗ is a parameter that locally minimizes the variational approximation error. [sent-393, score-0.215]

79 First, the variational error may have stationary points at stationary points in likelihood. [sent-395, score-0.164]

80 Thus, under this variational approximation, only Boltzmann machines where the hidden units do not depend on each other can give stationary points in likelihood. [sent-407, score-0.125]

81 To summarize, the GAM convergence theorem applies to variational EM in cases when the variational E step is uniquely defined. [sent-408, score-0.202]

82 When this is so, the resulting model is at a local minimum of the divergence between the model family and the family of variational approximations. [sent-409, score-0.18]

83 These degenerate cases occur when the model satisfies the simplifying conditions that define the family of variational approximations. [sent-411, score-0.111]

84 In this case, the variational EM algorithm is essentially performing standard EM over a restricted model family defined so as to be consistent with the variational approximations. [sent-412, score-0.156]

85 We have provided conditions under which these procedures can be shown to converge to stationary points in likelihood. [sent-415, score-0.117]

86 We have analyzed the convergence behavior of two well-known EM extensions, namely incremental EM and variational EM, as GAM procedures. [sent-418, score-0.176]

87 Our GAM convergence analysis shows that incremental EM procedures converge to stationary points in likelihood, even though incremental EM is in general neither a GEM procedure nor monotonic in likelihood. [sent-419, score-0.299]

88 We then present an information geometric argument which shows that variational EM can only converge to stationary points in likelihood in degenerate cases. [sent-422, score-0.18]

89 EM Satisfies the GAM Convergence Theorem. Recall that the forward and backward steps F, B : D × Θ → D × Θ of the EM algorithm are given by F(QX, θ) = {(Q′X, θ) : Q′X ∈ argmin_{QX ∈ D} D(QX || PX;θ)} and B(QX, θ) = {(QX, φ) : φ ∈ argmin_{ξ∈Θ} D(QX || PX;ξ)}. [sent-424, score-0.293]

90 By construction of D , it is guaranteed that QX generated by a forward step will lie in D . [sent-436, score-0.138]

91 Closedness of the forward and backward steps. The following proposition and corollary show that the minimization of a continuous function forms a closed point-to-set map. [sent-439, score-0.323]

92 This implies that projection under the divergence forms a closed point-to-set map, so that the (ungeneralized) forward and backward steps of the EM algorithm are in fact closed point-to-set maps. [sent-440, score-0.377]

93 This result together with Lemma 9 then show that the forward step F of the EM algorithm shown above is closed. [sent-459, score-0.138]

94 Similarly, it can be shown using Corollary 8 and Lemma 9 that the backward step B is closed. [sent-460, score-0.147]

95 Incremental EM: (EQ) and D. In this appendix, we show that incremental EM satisfies condition EQ of the GAM convergence theorem, and show a compact restriction of the desired family that can be used to analyze convergence of incremental EM. [sent-462, score-0.208]

96 When the divergence is unchanged, D(Q′X || PX;θ′) = D(QX || PX;θ), the GAM inequality (already shown) gives that the divergence is unchanged at every incremental step: D(RX(j) || PX;φ(j)) = D(RX(j−1) || PX;φ(j−1)) for j = 1, ..., n. [sent-465, score-0.159]

97 B), the divergence is unchanged at each incremental forward step F(j) and the backward step B: D(RX(j) || PX;φ(j−1)) = D(RX(j−1) || PX;φ(j−1)) (10) and D(RX(j) || PX;φ(j)) = D(RX(j) || PX;φ(j−1)) (11) for j = 1, ..., n. [sent-468, score-0.444]

98 Since equation (11) tells us that the divergence is unchanged at any backward step, both PX;φ(j) and PX;φ(j−1) must minimize D(RX(j) || ·). [sent-475, score-0.112]

99 B) on the backward map to get θ ∈ argmin_{ξ∈Θ} D(QX || PX;ξ). [sent-482, score-0.174]

100 Information geometry of the EM and em algorithms for neural networks. [sent-495, score-0.213]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('qx', 0.739), ('px', 0.405), ('gam', 0.311), ('em', 0.203), ('rx', 0.141), ('backward', 0.122), ('forward', 0.113), ('fb', 0.101), ('se', 0.101), ('incremental', 0.083), ('dinc', 0.08), ('variational', 0.066), ('divergence', 0.066), ('csisz', 0.063), ('alternating', 0.06), ('gem', 0.06), ('rocedures', 0.06), ('unawardana', 0.06), ('yrne', 0.06), ('onvergence', 0.05), ('procedures', 0.044), ('byrne', 0.04), ('stationary', 0.039), ('tusn', 0.038), ('py', 0.037), ('eq', 0.035), ('dv', 0.034), ('argmin', 0.034), ('eps', 0.034), ('likelihood', 0.032), ('gunawardana', 0.03), ('closed', 0.029), ('convergence', 0.027), ('step', 0.025), ('dmf', 0.025), ('gct', 0.025), ('family', 0.024), ('neal', 0.022), ('ps', 0.022), ('desired', 0.021), ('zangwill', 0.02), ('arg', 0.02), ('si', 0.02), ('maxima', 0.019), ('wu', 0.019), ('continuity', 0.019), ('minimization', 0.019), ('boltzmann', 0.019), ('dy', 0.019), ('hinton', 0.019), ('uniquely', 0.018), ('map', 0.018), ('incomplete', 0.018), ('steps', 0.018), ('compactness', 0.017), ('dempster', 0.017), ('sh', 0.017), ('arrows', 0.017), ('variants', 0.016), ('accumulation', 0.016), ('generating', 0.016), ('closedness', 0.015), ('psh', 0.015), ('pythagorean', 0.015), ('compact', 0.015), ('sequences', 0.015), ('jordan', 0.014), ('auxiliary', 0.014), ('onto', 0.014), ('qs', 0.014), ('converge', 0.013), ('thiesson', 0.013), ('th', 0.012), ('proposition', 0.012), ('minimizes', 0.012), ('log', 0.012), ('subsequence', 0.012), ('conditions', 0.011), ('composite', 0.011), ('fv', 0.011), ('condition', 0.011), ('generalized', 0.011), ('hidden', 0.01), ('unchanged', 0.01), ('distributions', 0.01), ('schematic', 0.01), ('geometric', 0.01), ('minimized', 0.01), ('degenerate', 0.01), ('geometry', 0.01), ('asela', 0.01), ('darroch', 0.01), ('liese', 0.01), ('qsh', 0.01), ('qsi', 0.01), ('salakhutdinov', 0.01), ('points', 0.01), ('maps', 0.01), ('corollary', 0.01), ('convergent', 0.009)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 23 jmlr-2005-Convergence Theorems for Generalized Alternating Minimization Procedures


2 0.16002508 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching

Author: Aapo Hyvärinen

Abstract: One often wants to estimate statistical models where the probability density function is known only up to a multiplicative normalization constant. Typically, one then has to resort to Markov Chain Monte Carlo methods, or approximations of the normalization constant. Here, we propose that such models can be estimated by minimizing the expected squared distance between the gradient of the log-density given by the model and the gradient of the log-density of the observed data. While the estimation of the gradient of log-density function is, in principle, a very difficult non-parametric problem, we prove a surprising result that gives a simple formula for this objective function. The density function of the observed data does not appear in this formula, which simplifies to a sample average of a sum of some derivatives of the log-density given by the model. The validity of the method is demonstrated on multivariate Gaussian and independent component analysis models, and by estimating an overcomplete filter set for natural image data. Keywords: statistical estimation, non-normalized densities, pseudo-likelihood, Markov chain Monte Carlo, contrastive divergence

3 0.097867787 41 jmlr-2005-Kernel Methods for Measuring Independence

Author: Arthur Gretton, Ralf Herbrich, Alexander Smola, Olivier Bousquet, Bernhard Schölkopf

Abstract: We introduce two new functionals, the constrained covariance and the kernel mutual information, to measure the degree of independence of random variables. These quantities are both based on the covariance between functions of the random variables in reproducing kernel Hilbert spaces (RKHSs). We prove that when the RKHSs are universal, both functionals are zero if and only if the random variables are pairwise independent. We also show that the kernel mutual information is an upper bound near independence on the Parzen window estimate of the mutual information. Analogous results apply for two correlation-based dependence functionals introduced earlier: we show the kernel canonical correlation and the kernel generalised variance to be independence measures for universal kernels, and prove the latter to be an upper bound on the mutual information near independence. The performance of the kernel dependence functionals in measuring independence is verified in the context of independent component analysis. Keywords: independence, covariance operator, mutual information, kernel, Parzen window estimate, independent component analysis c 2005 Arthur Gretton, Ralf Herbrich, Alexander Smola, Olivier Bousquet and Bernhard Schölkopf . G RETTON , H ERBRICH , S MOLA , B OUSQUET AND S CHÖLKOPF

4 0.074405946 3 jmlr-2005-A Classification Framework for Anomaly Detection

Author: Ingo Steinwart, Don Hush, Clint Scovel

Abstract: One way to describe anomalies is by saying that anomalies are not concentrated. This leads to the problem of finding level sets for the data generating density. We interpret this learning problem as a binary classification problem and compare the corresponding classification risk with the standard performance measure for the density level problem. In particular it turns out that the empirical classification risk can serve as an empirical performance measure for the anomaly detection problem. This allows us to compare different anomaly detection algorithms empirically, i.e. with the help of a test set. Furthermore, by the above interpretation we can give a strong justification for the well-known heuristic of artificially sampling “labeled” samples, provided that the sampling plan is well chosen. In particular this enables us to propose a support vector machine (SVM) for anomaly detection for which we can easily establish universal consistency. Finally, we report some experiments which compare our SVM to other commonly used methods including the standard one-class SVM. Keywords: unsupervised learning, anomaly detection, density levels, classification, SVMs

5 0.069966637 71 jmlr-2005-Variational Message Passing

Author: John Winn, Christopher M. Bishop

Abstract: Bayesian inference is now widely established as one of the principal foundations for machine learning. In practice, exact inference is rarely possible, and so a variety of approximation techniques have been developed, one of the most widely used being a deterministic framework called variational inference. In this paper we introduce Variational Message Passing (VMP), a general purpose algorithm for applying variational inference to Bayesian Networks. Like belief propagation, VMP proceeds by sending messages between nodes in the network and updating posterior beliefs using local operations at each node. Each such update increases a lower bound on the log evidence (unless already at a local maximum). In contrast to belief propagation, VMP can be applied to a very general class of conjugate-exponential models because it uses a factorised variational approximation. Furthermore, by introducing additional variational parameters, VMP can be applied to models containing non-conjugate distributions. The VMP framework also allows the lower bound to be evaluated, and this can be used both for model comparison and for detection of convergence. Variational message passing has been implemented in the form of a general purpose inference engine called VIBES (‘Variational Inference for BayEsian networkS’) which allows models to be specified graphically and then solved variationally without recourse to coding. Keywords: Bayesian networks, variational inference, message passing

6 0.062378656 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach

7 0.028286375 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions

8 0.027969662 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior

9 0.025251016 20 jmlr-2005-Clustering with Bregman Divergences

10 0.023502614 17 jmlr-2005-Change Point Problems in Linear Dynamical Systems

11 0.018512877 32 jmlr-2005-Expectation Consistent Approximate Inference

12 0.016782509 36 jmlr-2005-Gaussian Processes for Ordinal Regression

13 0.015704058 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data

14 0.013823749 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification

15 0.01319604 22 jmlr-2005-Concentration Bounds for Unigram Language Models

16 0.012285653 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection

17 0.011486244 64 jmlr-2005-Semigroup Kernels on Measures

18 0.011236903 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds

19 0.010643994 39 jmlr-2005-Information Bottleneck for Gaussian Variables

20 0.010424663 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.119), (1, 0.056), (2, -0.194), (3, -0.086), (4, -0.103), (5, 0.125), (6, -0.13), (7, -0.446), (8, -0.167), (9, 0.06), (10, -0.278), (11, -0.096), (12, 0.032), (13, 0.186), (14, -0.051), (15, 0.044), (16, -0.002), (17, -0.136), (18, 0.109), (19, -0.024), (20, 0.005), (21, -0.111), (22, 0.055), (23, 0.021), (24, 0.024), (25, 0.072), (26, 0.113), (27, -0.034), (28, 0.026), (29, 0.018), (30, -0.025), (31, -0.033), (32, 0.063), (33, 0.112), (34, -0.056), (35, -0.08), (36, -0.021), (37, 0.03), (38, -0.054), (39, 0.093), (40, -0.041), (41, -0.014), (42, 0.06), (43, -0.05), (44, 0.161), (45, 0.156), (46, -0.099), (47, -0.012), (48, -0.036), (49, -0.109)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99027473 23 jmlr-2005-Convergence Theorems for Generalized Alternating Minimization Procedures


2 0.55507261 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching

Author: Aapo Hyvärinen

Abstract: One often wants to estimate statistical models where the probability density function is known only up to a multiplicative normalization constant. Typically, one then has to resort to Markov Chain Monte Carlo methods, or approximations of the normalization constant. Here, we propose that such models can be estimated by minimizing the expected squared distance between the gradient of the log-density given by the model and the gradient of the log-density of the observed data. While the estimation of the gradient of log-density function is, in principle, a very difficult non-parametric problem, we prove a surprising result that gives a simple formula for this objective function. The density function of the observed data does not appear in this formula, which simplifies to a sample average of a sum of some derivatives of the log-density given by the model. The validity of the method is demonstrated on multivariate Gaussian and independent component analysis models, and by estimating an overcomplete filter set for natural image data. Keywords: statistical estimation, non-normalized densities, pseudo-likelihood, Markov chain Monte Carlo, contrastive divergence
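As a hedged sketch of the score matching idea (our own toy example, not from the paper): for a zero-mean Gaussian with unknown precision `lam`, treated as known only up to normalization, the sample objective has a closed-form minimizer and the normalizing constant never appears.

```python
import numpy as np

# Toy score matching example (ours). With model score psi(x) = -lam * x,
# the sample objective
#     J(lam) = mean(psi'(x) + 0.5 * psi(x)**2)
#            = mean(-lam + 0.5 * lam**2 * x**2)
# is minimized in closed form at lam = 1 / mean(x**2) -- no normalizing
# constant is ever computed.
rng = np.random.default_rng(1)
true_lam = 4.0                                  # true precision (variance 0.25)
x = rng.normal(0.0, 1.0 / np.sqrt(true_lam), 10_000)

lam_hat = 1.0 / np.mean(x ** 2)                 # score matching estimate
```

For richer model families the minimization is done numerically, but the objective keeps the same "sample average of derivatives of the log-density" form described in the abstract.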

3 0.31334168 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach

Author: Gal Elidan, Nir Friedman

Abstract: A central challenge in learning probabilistic graphical models is dealing with domains that involve hidden variables. The common approach for learning model parameters in such domains is the expectation maximization (EM) algorithm. This algorithm, however, can easily get trapped in suboptimal local maxima. Learning the model structure is even more challenging. The structural EM algorithm can adapt the structure in the presence of hidden variables, but usually performs poorly without prior knowledge about the cardinality and location of the hidden variables. In this work, we present a general approach for learning Bayesian networks with hidden variables that overcomes these problems. The approach builds on the information bottleneck framework of Tishby et al. (1999). We start by proving a formal correspondence between the information bottleneck objective and the standard parametric EM functional. We then use this correspondence to construct a learning algorithm that combines an information-theoretic smoothing term with a continuation procedure. Intuitively, the algorithm bypasses local maxima and achieves superior solutions by following a continuous path from a solution of an easy, smooth target function to a solution of the desired likelihood function. As we show, our algorithmic framework allows learning of the parameters as well as the structure of a network. In addition, it also allows us to introduce new hidden variables during model selection and learn their cardinality. We demonstrate the performance of our procedure on several challenging real-life data sets. Keywords: Bayesian networks, hidden variables, information bottleneck, continuation, variational methods

4 0.29724023 41 jmlr-2005-Kernel Methods for Measuring Independence

Author: Arthur Gretton, Ralf Herbrich, Alexander Smola, Olivier Bousquet, Bernhard Schölkopf

Abstract: We introduce two new functionals, the constrained covariance and the kernel mutual information, to measure the degree of independence of random variables. These quantities are both based on the covariance between functions of the random variables in reproducing kernel Hilbert spaces (RKHSs). We prove that when the RKHSs are universal, both functionals are zero if and only if the random variables are pairwise independent. We also show that the kernel mutual information is an upper bound near independence on the Parzen window estimate of the mutual information. Analogous results apply for two correlation-based dependence functionals introduced earlier: we show the kernel canonical correlation and the kernel generalised variance to be independence measures for universal kernels, and prove the latter to be an upper bound on the mutual information near independence. The performance of the kernel dependence functionals in measuring independence is verified in the context of independent component analysis. Keywords: independence, covariance operator, mutual information, kernel, Parzen window estimate, independent component analysis. © 2005 Arthur Gretton, Ralf Herbrich, Alexander Smola, Olivier Bousquet and Bernhard Schölkopf.

5 0.2761822 3 jmlr-2005-A Classification Framework for Anomaly Detection

Author: Ingo Steinwart, Don Hush, Clint Scovel

Abstract: One way to describe anomalies is by saying that anomalies are not concentrated. This leads to the problem of finding level sets for the data generating density. We interpret this learning problem as a binary classification problem and compare the corresponding classification risk with the standard performance measure for the density level problem. In particular it turns out that the empirical classification risk can serve as an empirical performance measure for the anomaly detection problem. This allows us to compare different anomaly detection algorithms empirically, i.e. with the help of a test set. Furthermore, by the above interpretation we can give a strong justification for the well-known heuristic of artificially sampling “labeled” samples, provided that the sampling plan is well chosen. In particular this enables us to propose a support vector machine (SVM) for anomaly detection for which we can easily establish universal consistency. Finally, we report some experiments which compare our SVM to other commonly used methods including the standard one-class SVM. Keywords: unsupervised learning, anomaly detection, density levels, classification, SVMs

6 0.24704109 71 jmlr-2005-Variational Message Passing

7 0.10279445 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior

8 0.10073937 20 jmlr-2005-Clustering with Bregman Divergences

9 0.088049434 6 jmlr-2005-A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs

10 0.087538064 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels

11 0.081109114 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions

12 0.079243749 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models

13 0.078158528 32 jmlr-2005-Expectation Consistent Approximate Inference

14 0.071978539 17 jmlr-2005-Change Point Problems in Linear Dynamical Systems

15 0.064666905 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification

16 0.062833123 53 jmlr-2005-Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application

17 0.060622957 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection

18 0.05972426 12 jmlr-2005-An MDP-Based Recommender System

19 0.058595676 36 jmlr-2005-Gaussian Processes for Ordinal Regression

20 0.05597759 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.028), (17, 0.017), (19, 0.042), (29, 0.432), (36, 0.018), (37, 0.023), (43, 0.041), (47, 0.012), (52, 0.039), (59, 0.015), (70, 0.024), (88, 0.127), (90, 0.027), (94, 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.77746195 10 jmlr-2005-Adaptive Online Prediction by Following the Perturbed Leader

Author: Marcus Hutter, Jan Poland

Abstract: When applying aggregating strategies to Prediction with Expert Advice (PEA), the learning rate must be adaptively tuned. The natural choice of √(complexity/current loss) renders the analysis of Weighted Majority (WM) derivatives quite complicated. In particular, for arbitrary weights there have been no results proven so far. The analysis of the alternative Follow the Perturbed Leader (FPL) algorithm from Kalai and Vempala (2003) based on Hannan’s algorithm is easier. We derive loss bounds for adaptive learning rate and both finite expert classes with uniform weights and countable expert classes with arbitrary weights. For the former setup, our loss bounds match the best known results so far, while for the latter our results are new. Keywords: prediction with expert advice, follow the perturbed leader, general weights, adaptive learning rate, adaptive adversary, hierarchy of experts, expected and high probability bounds, general alphabet and loss, online sequential prediction
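A minimal sketch of the Follow the Perturbed Leader rule described above (the loss model, `eta`, and all other names here are our own toy choices, not the paper's setup): each round, add fresh random perturbations to the experts' cumulative losses and follow the perturbed minimizer.

```python
import numpy as np

# Toy FPL simulation (ours). Expert 0 is made the best expert on average
# by halving its per-round loss; FPL should track it closely.
rng = np.random.default_rng(2)
n_experts, n_rounds, eta = 5, 500, 0.1
cum_loss = np.zeros(n_experts)
fpl_loss = 0.0

for t in range(n_rounds):
    noise = rng.exponential(1.0 / eta, n_experts)  # perturbations, scale 1/eta
    choice = np.argmin(cum_loss - noise)           # follow the perturbed leader
    losses = rng.uniform(0.0, 1.0, n_experts)
    losses[0] *= 0.5                               # expert 0: uniform on [0, 0.5]
    fpl_loss += losses[choice]
    cum_loss += losses

regret = fpl_loss - cum_loss.min()                 # regret vs. best expert
```

The point of the perturbation (versus deterministically following the leader) is that the randomization smooths the leader's decisions, which is what makes the regret sublinear against adaptive loss sequences.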

same-paper 2 0.77197343 23 jmlr-2005-Convergence Theorems for Generalized Alternating Minimization Procedures

Author: Asela Gunawardana, William Byrne

Abstract: The EM algorithm is widely used to develop iterative parameter estimation procedures for statistical models. In cases where these procedures strictly follow the EM formulation, the convergence properties of the estimation procedures are well understood. In some instances there are practical reasons to develop procedures that do not strictly fall within the EM framework. We study EM variants in which the E-step is not performed exactly, either to obtain improved rates of convergence, or due to approximations needed to compute statistics under a model family over which E-steps cannot be realized. Since these variants are not EM procedures, the standard (G)EM convergence results do not apply to them. We present an information geometric framework for describing such algorithms and analyzing their convergence properties. We apply this framework to analyze the convergence properties of incremental EM and variational EM. For incremental EM, we discuss conditions under which these algorithms converge in likelihood. For variational EM, we show how the E-step approximation prevents convergence to local maxima in likelihood. Keywords: EM, variational EM, incremental EM, convergence, information geometry

3 0.3148413 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach

Author: Gal Elidan, Nir Friedman

Abstract: A central challenge in learning probabilistic graphical models is dealing with domains that involve hidden variables. The common approach for learning model parameters in such domains is the expectation maximization (EM) algorithm. This algorithm, however, can easily get trapped in suboptimal local maxima. Learning the model structure is even more challenging. The structural EM algorithm can adapt the structure in the presence of hidden variables, but usually performs poorly without prior knowledge about the cardinality and location of the hidden variables. In this work, we present a general approach for learning Bayesian networks with hidden variables that overcomes these problems. The approach builds on the information bottleneck framework of Tishby et al. (1999). We start by proving a formal correspondence between the information bottleneck objective and the standard parametric EM functional. We then use this correspondence to construct a learning algorithm that combines an information-theoretic smoothing term with a continuation procedure. Intuitively, the algorithm bypasses local maxima and achieves superior solutions by following a continuous path from a solution of an easy, smooth target function to a solution of the desired likelihood function. As we show, our algorithmic framework allows learning of the parameters as well as the structure of a network. In addition, it also allows us to introduce new hidden variables during model selection and learn their cardinality. We demonstrate the performance of our procedure on several challenging real-life data sets. Keywords: Bayesian networks, hidden variables, information bottleneck, continuation, variational methods

4 0.31233376 20 jmlr-2005-Clustering with Bregman Divergences

Author: Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, Joydeep Ghosh

Abstract: A wide variety of distortion functions, such as squared Euclidean distance, Mahalanobis distance, Itakura-Saito distance and relative entropy, have been used for clustering. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroid-based parametric clustering approaches, such as classical kmeans, the Linde-Buzo-Gray (LBG) algorithm and information-theoretic clustering, which arise by special choices of the Bregman divergence. The algorithms maintain the simplicity and scalability of the classical kmeans algorithm, while generalizing the method to a large class of clustering loss functions. This is achieved by first posing the hard clustering problem in terms of minimizing the loss in Bregman information, a quantity motivated by rate distortion theory, and then deriving an iterative algorithm that monotonically decreases this loss. In addition, we show that there is a bijection between regular exponential families and a large class of Bregman divergences, that we call regular Bregman divergences. This result enables the development of an alternative interpretation of an efficient EM scheme for learning mixtures of exponential family distributions, and leads to a simple soft clustering algorithm for regular Bregman divergences. Finally, we discuss the connection between rate distortion theory and Bregman clustering and present an information theoretic analysis of Bregman clustering algorithms in terms of a trade-off between compression and loss in Bregman information. Keywords: clustering, Bregman divergences, Bregman information, exponential families, expectation maximization, information theory
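A toy sketch of the hard clustering algorithm described above (our own code, not the paper's implementation), using the generalized I-divergence d(x, y) = Σ (x log(x/y) − x + y), which is a Bregman divergence. The property the paper exploits is that for any Bregman divergence the optimal cluster representative is the plain arithmetic mean, so the update step is just a mean:

```python
import numpy as np

# Hard Bregman clustering on two shifted Poisson point clouds (ours).
rng = np.random.default_rng(3)
X = np.vstack([rng.poisson(3.0, (100, 4)),
               rng.poisson(20.0, (100, 4))]) + 1.0   # shift to keep X > 0

def idiv(x, y):
    # generalized I-divergence, summed over coordinates
    return np.sum(x * np.log(x / y) - x + y, axis=-1)

centroids = X[[0, 100]]                  # one seed point from each group
for _ in range(10):
    # assignment step: nearest centroid in Bregman divergence
    labels = np.argmin([idiv(X, c) for c in centroids], axis=0)
    # update step: the Bregman-optimal representative is the mean
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
```

Swapping `idiv` for squared Euclidean distance recovers classical k-means; the assignment step changes with the divergence while the mean update does not, which is the unification the abstract describes.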

5 0.30961078 60 jmlr-2005-On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning

Author: Petros Drineas, Michael W. Mahoney

Abstract: A problem for many kernel-based methods is that the amount of computation required to find the solution scales as O(n³), where n is the number of training examples. We develop and analyze an algorithm to compute an easily-interpretable low-rank approximation to an n × n Gram matrix G such that computations of interest may be performed more rapidly. The approximation is of the form G̃_k = CW_k⁺Cᵀ, where C is a matrix consisting of a small number c of columns of G and W_k is the best rank-k approximation to W, the matrix formed by the intersection between those c columns of G and the corresponding c rows of G. An important aspect of the algorithm is the probability distribution used to randomly sample the columns; we will use a judiciously-chosen and data-dependent nonuniform probability distribution. Let ‖·‖₂ and ‖·‖_F denote the spectral norm and the Frobenius norm, respectively, of a matrix, and let G_k be the best rank-k approximation to G. We prove that by choosing O(k/ε⁴) columns, ‖G − CW_k⁺Cᵀ‖_ξ ≤ ‖G − G_k‖_ξ + ε ∑_{i=1}^n G²_ii, both in expectation and with high probability, for both ξ = 2, F, and for all k: 0 ≤ k ≤ rank(W). This approximation can be computed using O(n) additional space and time, after making two passes over the data from external storage. The relationships between this algorithm, other related matrix decompositions, and the Nyström method from integral equation theory are discussed. Keywords: kernel methods, randomized algorithms, Gram matrix, Nyström method
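A hedged sketch of the C W_k⁺ Cᵀ construction on a toy Gram matrix (our own setup; we sample columns uniformly for brevity, whereas the paper's algorithm uses a nonuniform, data-dependent distribution):

```python
import numpy as np

# Toy Nystrom-style approximation (ours): sample c columns of G, take the
# c x c intersection block W, rank-k truncate it, and recombine.
rng = np.random.default_rng(4)
Z = rng.normal(size=(50, 3))
G = Z @ Z.T                            # exactly rank-3 Gram matrix, n = 50

c, k = 10, 3
idx = rng.choice(50, c, replace=False)
C = G[:, idx]                          # n x c sampled columns of G
W = G[np.ix_(idx, idx)]                # c x c intersection block

# pseudo-inverse of the best rank-k approximation W_k of W
u, s, vt = np.linalg.svd(W)
Wk_pinv = (vt[:k].T / s[:k]) @ u[:, :k].T
G_approx = C @ Wk_pinv @ C.T

# G has rank 3 and k = 3, so the reconstruction is (numerically) exact here
rel_err = np.linalg.norm(G - G_approx) / np.linalg.norm(G)
```

On a genuinely full-rank Gram matrix the error is instead controlled by the ‖G − G_k‖ + ε Σ G²_ii bound quoted in the abstract, with the sampling distribution doing the real work.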

6 0.30414474 11 jmlr-2005-Algorithmic Stability and Meta-Learning

7 0.30330655 48 jmlr-2005-Learning the Kernel Function via Regularization

8 0.30253458 16 jmlr-2005-Asymptotics in Empirical Risk Minimization

9 0.30231866 13 jmlr-2005-Analysis of Variance of Cross-Validation Estimators of the Generalization Error

10 0.3017107 37 jmlr-2005-Generalization Bounds and Complexities Based on Sparsity and Clustering for Convex Combinations of Functions from Random Classes

11 0.29810521 49 jmlr-2005-Learning the Kernel with Hyperkernels     (Kernel Machines Section)

12 0.29675612 71 jmlr-2005-Variational Message Passing

13 0.29609501 39 jmlr-2005-Information Bottleneck for Gaussian Variables

14 0.29495668 38 jmlr-2005-Generalization Bounds for the Area Under the ROC Curve

15 0.29442915 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors

16 0.29414526 3 jmlr-2005-A Classification Framework for Anomaly Detection

17 0.29401341 64 jmlr-2005-Semigroup Kernels on Measures

18 0.29211515 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels

19 0.2895844 44 jmlr-2005-Learning Module Networks

20 0.28828841 47 jmlr-2005-Learning from Examples as an Inverse Problem