nips nips2006 nips2006-19 knowledge-graph by maker-knowledge-mining

19 nips-2006-Accelerated Variational Dirichlet Process Mixtures

Source: pdf

Author: Kenichi Kurihara, Max Welling, Nikos A. Vlassis

Abstract: Dirichlet Process (DP) mixture models are promising candidates for clustering applications where the number of clusters is unknown a priori. Due to computational considerations these models are unfortunately unsuitable for large scale data-mining applications. We propose a class of deterministic accelerated DP mixture models that can routinely handle millions of data-cases. The speedup is achieved by incorporating kd-trees into a variational Bayesian algorithm for DP mixtures in the stick-breaking representation, similar to that of Blei and Jordan (2005). Our algorithm differs in the use of kd-trees and in the way we handle truncation: we only assume that the variational distributions are ﬁxed at their priors after a certain level. Experiments show that speedups relative to the standard variational algorithm can be signiﬁcant. 1

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 nl Abstract Dirichlet Process (DP) mixture models are promising candidates for clustering applications where the number of clusters is unknown a priori. [sent-10, score-0.188]

2 We propose a class of deterministic accelerated DP mixture models that can routinely handle millions of data-cases. [sent-12, score-0.217]

3 The speedup is achieved by incorporating kd-trees into a variational Bayesian algorithm for DP mixtures in the stick-breaking representation, similar to that of Blei and Jordan (2005). [sent-13, score-0.39]

4 Our algorithm differs in the use of kd-trees and in the way we handle truncation: we only assume that the variational distributions are ﬁxed at their priors after a certain level. [sent-14, score-0.253]

5 Experiments show that speedups relative to the standard variational algorithm can be signiﬁcant. [sent-15, score-0.295]

6 In this paper we propose computational speedups for Dirichlet Process (DP) mixture models [1, 2, 3, 4, 5, 6, 7], with the purpose of improving their applicability in modern day data-mining problems where millions of data-cases are no exception. [sent-18, score-0.226]

7 Our approach is related to, and complements, the variational mean-ﬁeld algorithm for DP mixture models of Blei and Jordan [7]. [sent-19, score-0.342]

8 In this approach, the intractable posterior of the DP mixture is approximated with a factorized variational ﬁnite (truncated) mixture model with T components, that is optimized to minimize the KL distance to the posterior. [sent-20, score-0.538]

9 However, a downside of their model is that the variational families are not nested over T , and locating an optimal truncation level T may be difﬁcult (see Section 3). [sent-21, score-0.549]

10 In this paper we propose an alternative variational mean-ﬁeld algorithm, called VDP (Variational DP), in which the variational families are nested over T . [sent-22, score-0.574]

11 In our model we allow for an unbounded number of components for the variational mixture, but we tie the variational distributions after level 1 http://aluminum. [sent-23, score-0.635]

12 Our algorithm proceeds in a greedy manner by starting with T = 1 and releasing components when this improves (signiﬁcantly) the KL bound. [sent-34, score-0.093]

13 Our approach essentially resolves the issue in [7] of searching for an optimal truncation level of the variational mixture (see Section 4). [sent-36, score-0.522]

14 A kd-tree structure recursively partitions the data space into a number of nodes, where each node contains a subset of the data-cases. [sent-38, score-0.056]

15 Following [9], for a given tree expansion we tie together the responsibility over mixture components of all data-cases contained in each outer node of the tree. [sent-39, score-0.454]

16 By caching certain sufﬁcient statistics in each node of the kd-tree we then achieve computational gains, while the variational approximation becomes a function of the depth of the tree at which one operates (see Section 6). [sent-40, score-0.337]

17 The resulting Fast-VDP algorithm provides an elegant way to trade off computational resources against accuracy. [sent-41, score-0.058]

18 We can always release new components from the pool and split kd-tree nodes as long as we have computational resources left. [sent-42, score-0.124]

19 As a result, Fast-VDP is the ﬁrst algorithm entertaining an unbounded number of clusters that is practical for modern day data-mining applications. [sent-45, score-0.091]

20 2 The Dirichlet Process Mixture in the Stick-Breaking Representation A DP mixture model in the stick-breaking representation can be viewed as possessing an inﬁnite number of components with random mixing weights [4]. [sent-46, score-0.173]

21 In particular, the generative model of a DP mixture assumes: • An inﬁnite collection of components H = {ηi }∞ that are independently drawn from a i=1 prior pη (ηi |λ) with hyperparameters λ. [sent-47, score-0.173]

22 • An inﬁnite collection of ‘stick lengths’ V = {vi }∞ , vi ∈ [0, 1], ∀i, that are independently i=1 drawn from a prior pv (vi |α) with hyperparameters α. [sent-48, score-0.224]

23 They deﬁne the mixing weights i−1 {πi }∞ of the mixture as πi (V ) = vi j=1 (1 − vj ), for i = 1, . [sent-49, score-0.245]

24 i=1 • An observation model px (x|η) that generates a datum x from component η. [sent-53, score-0.118]

25 Given a dataset X = {xn }N , each data-case xn is assumed to be generated by ﬁrst drawing a n=1 component label zn = k ∈ {1, . [sent-54, score-0.37]

26 , ∞} from the inﬁnite mixture with probability pz (zn = k|V ) ≡ πk (V ), and then drawing xn from the corresponding observation model px (xn |ηk ). [sent-57, score-0.299]

27 In clustering problems we are mainly interested in computing the posterior over data labels p(zn |X, θ), as well as the predictive density p(x|X, θ) = H,V p(x|H, V ) Z p(W |X, θ), which are both intractable since p(W |X, θ) cannot be computed analytically. [sent-59, score-0.09]

28 Blei and Jordan [7] deﬁne an explicit truncation level L ≡ T for the variational mixture in (1) by setting qvT (vT = 1) = 1 and assuming that data-cases assign zero responsibility to components with index higher than the truncation level T , i. [sent-61, score-0.846]

29 Consequently, in their model only components of the mixture up to level T need to be considered. [sent-64, score-0.222]

30 Since each distribution qzn (zn ) has nonzero support only for zn ≤ T , minimizing F (φ) results in a set of update equations for φ that involve only ﬁnite sums [7]. [sent-66, score-0.737]

31 However, explicitly truncating the variational mixture as above has the undesirable property that the variational family with truncation level T is not contained within the variational family with truncation level T + 1, i. [sent-67, score-1.228]

32 2 The result is that there may be an optimal ﬁnite truncation level T for q, that contradicts the intuition that the more components we allow in q the better the approximation should be (reaching its best when T → ∞). [sent-70, score-0.243]

33 Moreover, locating a near-optimal truncation level may be difﬁcult since F as a function of T may exhibit local minima (see Fig. [sent-71, score-0.207]

34 4 Variational Inference with an Inﬁnite Variational Model Here we propose a slightly different variational model for q that allows families over T to be nested. [sent-73, score-0.294]

35 In our setup, q is given by (1) where we let L go to inﬁnity but we tie the parameters of all models after a speciﬁc level T (with T ≪ L). [sent-74, score-0.09]

36 In particular, we impose the condition that for all components with index i > T the variational distributions for the stick-lengths qvi (vi ) and the variational distributions for the components qηi (ηi ) are equal to their corresponding priors, i. [sent-75, score-0.761]

37 , qvi (vi ; φv ) = pv (vi |α) and qηi (ηi ; φη ) = pη (ηi |λ). [sent-77, score-0.26]

38 In our model we deﬁne the free energy F as i i the limit F = limL→∞ FL , where FL is the free energy deﬁned by q in (1) and a truncated DP mixture at level L (justiﬁed by the almost sure convergence of an L-truncated Dirichlet process to an inﬁnite Dirichlet process when L → ∞ [6]). [sent-78, score-0.675]

39 Using the parameter tying assumption for i > T , the free energy reads T Eqvi log F = i=1 qη (ηi ; φη ) qvi (vi ; φv ) i i +Eqηi log i pv (vi |α) pη (ηi |λ) N + Eq log n=1 qzn (zn ) . [sent-79, score-1.106]

40 (2) pz (zn |V )px (xn |ηzn ) In our scheme T deﬁnes an implicit truncation level of the variational mixture, since there are no free parameters to optimize beyond level T . [sent-80, score-0.623]

41 As in [7], the free energy F is a function of T parameters {φv , φη }T and N distributions {qzn (zn )}N . [sent-81, score-0.246]

42 However, contrary to [7], data-cases may now n=1 i i i=1 assign nonzero responsibility to components beyond level T , and therefore each qzn (zn ) must now have inﬁnite support (which requires computing inﬁnite sums in the various quantities of interest). [sent-82, score-0.644]

43 An important implication of our setup is that the variational families are now nested with respect to T (since for i > T , qvi (vi ) and qηi (ηi ) can always revert to their priors), and as a result it is guaranteed that as we increase T there exist solutions that decrease F . [sent-83, score-0.513]

44 ¿From the last term of (2) we directly see that the qzn (zn ) that minimizes F is given by qzn (zn = i) = exp(Sn,i ) exp(Sn,j ) (3) ∞ j=1 where Sn,i = EqV [log pz (zn = i|V )] + Eqηi [log px (xn |ηi )]. [sent-85, score-0.958]

45 φv i (4) φη i Minimization of F over and can be carried out by direct differentiation of (2) for particular choices of models for qvi and qηi (see Section 5). [sent-86, score-0.171]

46 Using qzn from (3), the free energy (2) reads T F = Eqvi log i=1 qvi (vi ; φv ) qη (ηi ; φη ) i i + Eqηi log i pv (vi |α) pη (ηi |λ) ∞ N − ∞ log n=1 exp(Sn,i ). [sent-87, score-1.065]

47 (6) 1 − exp Epv [log(1 − v)] i=T +1 Using the variational q(W ) as an approximation to the true posterior p(W |X, θ), the required posterior over data labels can be approximated by p(zn |X, θ) ≈ qzn (zn ). [sent-93, score-0.765]

48 Although qzn (zn ) has inﬁnite support, in practice it sufﬁces to use the individual qzn (zn = i) for the ﬁnite part i ≤ T , and the cumulative qzn (zn > T ) for the inﬁnite part. [sent-94, score-1.23]

49 Finally, using the parameter tying assumption for ∞ i > T , and the identity i=1 πi (V ) = 1, the predictive density p(x|X, θ) can be approximated by T T p(x|X, θ) ≈ EqV [πi (V )]Eqηi [px (x|ηi )] + 1 − i=1 Epv [πi (V )] Epη [px (x|η)]. [sent-95, score-0.061]

50 (7) i=1 Note that all quantities of interest, such as the free energy (5) and the predictive distribution (7), can be computed analytically even though they involve inﬁnite sums. [sent-96, score-0.246]

51 The optimal parameters φ , φ can be found to be N φv i,1 = α1 + N qzn (zn = i) φv i,2 = α2 + n=1 = λ1 + qzn (zn = j) (14) n=1 j=i+1 N φη i,1 ∞ N qzn (zn = i)xn φη i,2 n=1 = λ2 + qzn (zn = i). [sent-101, score-1.64]

52 (15) n=1 The update equations are similar to those in [7] except that we have used Beta(α1 , α2 ) instead of ∞ Beta(1, α), and φv involves an inﬁnite sum j=i+1 qzn (zn = j) which can be computed using (3) i,2 and (6). [sent-102, score-0.434]

53 In [7] the corresponding sum is ﬁnite since qzn (zn ) is truncated at T . [sent-103, score-0.434]

54 Note that the VDP algorithm operates in a space where component labels are distinguishable, i. [sent-104, score-0.059]

55 Since the average a priori mixture weights of the components are ordered by their size, the optimal labelling of the a posteriori variational components is also ordered according to cluster size. [sent-107, score-0.51]

56 Hence, we have incorporated a reordering step of components according to approximate size after each optimization step in our ﬁnal algorithm (a feature that was not present in [7]). [sent-108, score-0.097]

57 6 Accelerating inference using a kd-tree In this section we show that we can achieve accelerated inference for large datasets when we store the data in a kd-tree [10] and cache data sufﬁcient statistics in each node of the kd-tree [8]. [sent-109, score-0.17]

58 A kd-tree is a binary tree in which the root node contains all points, and each child node contains a subset of the data points contained in its father node, where points are separated by a (typically axis-aligned) hyperplane. [sent-110, score-0.165]

59 Each point in the set is contained in exactly one node, and the set of outer nodes of a given expansion of the kd-tree form a partition of the data set. [sent-111, score-0.113]

60 Following [9], to achieve accelerated update equations we constrain all xn in outer node A to share the same qzn (zn ) ≡ qzA (zA ). [sent-113, score-0.659]

61 Similarly, if |nA | is the number of data in node A, the optimal parameters can be shown to be ∞ φv i,1 = α1 + φv i,2 |nA |qzA (zA = i) = α2 + |nA | A φη = λ 1 + i,1 |nA |qzA (zA = i) x A φη = λ 2 + i,2 A qzA (zA = j) (17) j=i+1 A |nA |qzA (zA = i). [sent-115, score-0.056]

62 (18) A Finally, using qzA (zA ) from (16) the free energy (5) reads T F = Eqvi log i=1 qη (ηi ; φη ) qvi (vi ; φv ) i i + Eqηi log i pv (vi |α) pη (ηi |λ) ∞ − |nA | log A exp(SA,i ). [sent-116, score-0.655]

63 ) Note that by reﬁning the tree (expanding outer nodes) the free energy F cannot increase. [sent-120, score-0.319]

64 This allows us to control the trade-off between computational resources and approximation: we can always choose to descend the tree until our computational resources run out, and the level of approximation will be directly tied to F (deeper levels will mean lower F ). [sent-121, score-0.158]

65 From that we can compute responsibilities qzn using (3). [sent-126, score-0.451]

66 Sample a number of ‘candidate’ components c according to size A |nA |qzA (zA = c), and split the component that leads to the maximal reduction of FT . [sent-133, score-0.081]

67 For each candidate c do: (a) Expand one-level deeper the outer nodes of the kd-tree that assign to c the highest responsibility qzA (zA = c) among all components. [sent-134, score-0.195]

68 (c) Update only SA,i , φv , φη and SA,j , φv , φη for the new components i and j, keeping all other i j i j parameters as well as the kd-tree expansion ﬁxed. [sent-137, score-0.087]

69 In the above algorithm, the number of sampled candidate components in step 2 can be tuned according to the desired cost/accuracy tradeoff. [sent-142, score-0.089]

70 8 5000 1000 2000 free energy ratio speedup factor Fast-VDP VDP BJ #data Figure 1: Relative runtimes and free energies of Fast-VDP, VDP, and BJ. [sent-146, score-0.513]

71 01 1 1 16 32 64 128 dimensionality 1 2 3 c-separation Figure 2: Speedup factors and free energy ratios between Fast-VDP and VDP. [sent-153, score-0.265]

72 Top and bottom ﬁgures show speedups and free energy ratios, respectively. [sent-154, score-0.309]

73 In order to speed up the partial updates in step 2c, we additionally set qzA (zA = k) = 0 for all k = i, j (so all responsibility is shared between the two new components). [sent-156, score-0.063]

74 In step 3 we reordered components every one cycle and expanded the kd-tree every three update cycles, controlling the expansion by the relative change of qzA (zA ) between a node and its children (alternatively one can measure the change of FT +1 ). [sent-157, score-0.167]

75 In all experiments we assumed a Gaussian observation model px (x|η) and a Gaussianinverse Wishart for pη (η|λ) and qηi (ηi ; φη ). [sent-160, score-0.1]

76 On the contrary, BJ optimizes the parameters for ﬁxed T (and potentially minimizes the resulting free energy over different values of T ), which requires a nontrivial initialization step for each T . [sent-163, score-0.275]

77 1 we show the speedup factors and free energy ratios3 among the three algorithms. [sent-168, score-0.36]

78 Fast-VDP 3 Free energy ratio is deﬁned as 1 + (FA − FB )/|FB |, where A and B are either Fast-VDP, VDP or BJ. [sent-169, score-0.151]

79 Fast-VDP VDP Figure 3: Clustering results of Fast-VDP and VDP, with a speedup of 21. [sent-170, score-0.114]

80 The clusters are ordered according to size (from top left to bottom right). [sent-171, score-0.074]

81 Moreover, Fast-VDP and VDP were always better than BJ in terms of free energy. [sent-173, score-0.124]

82 In a second synthetic set of experiments we compared the speedup of Fast-VDP over VDP. [sent-174, score-0.137]

83 2 we show the speedup factor (top) and the free energy ratio (bottom) between the two algorithms. [sent-178, score-0.389]

84 2-left illustrates that the speedup of Fast-VDP over VDP is at least linear in N , as expected from the update equations in Section 6. [sent-181, score-0.138]

85 The speedup factor was approximately 154 for one million data-cases, while the free energy ratio was almost constant over N . [sent-182, score-0.389]

86 Fast-VDP found 96 clusters in 3, 379 seconds with free energy F = 1. [sent-193, score-0.32]

87 759 × 107 , while VDP found 88 clusters in 72, 037 seconds with free energy 1. [sent-194, score-0.32]

88 The speedup was 21 and the free energy ratio was 1. [sent-196, score-0.389]

89 The mean images of the discovered components are illustrated in Fig. [sent-198, score-0.063]

90 In this problem the elapsed time of Fast-VDP and VDP were 335 seconds and 2,256 seconds, respectively, hence a speedup of 6. [sent-209, score-0.135]

91 Table 1 shows the three most frequent topics in each cluster. [sent-214, score-0.08]

92 9 Conclusions We described VDP, a variational mean-ﬁeld algorithm for Dirichlet Process mixtures, and its fast extension Fast-VDP that utilizes kd-trees to achieve speedups. [sent-217, score-0.232]

93 Our contribution is twofold: First, 4 A Gaussian mixture is c-separated if for each pair (i, j) of components we have ||mi − mj ||2 ≥ c D max(λmax , λmax ) , where λmax denotes the maximum eigenvalue of their covariance [11]. [sent-218, score-0.173]

94 2 cluster a (in descending order) the three most 1 frequent topics 2 3 81 102 59 Fast-VDP b c d 73 174 40 35 50 110 49 92 94 e A B 76 4 129 81 102 59 73 40 174 VDP C D 35 50 110 76 4 129 E F 49 92 94 73 174 40 Table 1: The three most frequent topics in each clusters. [sent-220, score-0.16]

95 cluster the most frequent topic words a, A b, B, F c, C d, E e, D 81 73 35 49 76 economic, policy, countries, bank, growth, ﬁrm, public, trade, market, . [sent-222, score-0.069]

96 Table 2: Words in the most frequent topic of each cluster. [sent-237, score-0.069]

97 we extended the framework of [7] to allow for nested variational families and an adaptive truncation level for the variational mixture. [sent-238, score-0.754]

98 Second, we showed how kd-trees can be employed in the framework offering signiﬁcant speedups, thus extending related results for ﬁnite mixture models [8, 9]. [sent-239, score-0.11]

99 , priors of the form pvi (vi |ai , bi ) = Beta(ai , bi )), as well as alternative DP mixture representations such as the Chinese restaurant process [3]. [sent-243, score-0.131]

100 Very fast EM-based mixture model clustering using multiresolution kd-trees. [sent-292, score-0.135]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('vdp', 0.479), ('qzn', 0.41), ('zn', 0.282), ('qza', 0.274), ('za', 0.238), ('variational', 0.232), ('qvi', 0.171), ('dp', 0.137), ('vi', 0.135), ('truncation', 0.131), ('eq', 0.126), ('free', 0.124), ('energy', 0.122), ('speedup', 0.114), ('mixture', 0.11), ('dirichlet', 0.104), ('px', 0.1), ('pv', 0.089), ('blei', 0.076), ('accelerated', 0.074), ('bj', 0.074), ('epv', 0.068), ('eqvi', 0.068), ('responsibility', 0.063), ('speedups', 0.063), ('components', 0.063), ('families', 0.062), ('node', 0.056), ('na', 0.055), ('clusters', 0.053), ('beta', 0.052), ('xn', 0.051), ('level', 0.049), ('nested', 0.048), ('frequent', 0.046), ('outer', 0.044), ('nite', 0.044), ('mixtures', 0.044), ('ft', 0.041), ('reads', 0.041), ('responsibilities', 0.041), ('tie', 0.041), ('tying', 0.041), ('resources', 0.04), ('pz', 0.038), ('exp', 0.036), ('log', 0.036), ('topics', 0.034), ('eqv', 0.034), ('reordering', 0.034), ('vlassis', 0.034), ('millions', 0.033), ('kurihara', 0.03), ('releasing', 0.03), ('tree', 0.029), ('initialization', 0.029), ('ratio', 0.029), ('newman', 0.027), ('locating', 0.027), ('nonparametric', 0.026), ('candidate', 0.026), ('fb', 0.025), ('clustering', 0.025), ('contained', 0.024), ('expansion', 0.024), ('update', 0.024), ('documents', 0.024), ('truncated', 0.024), ('topic', 0.023), ('tokyo', 0.023), ('deeper', 0.023), ('synthetic', 0.023), ('posterior', 0.023), ('fl', 0.022), ('factorized', 0.022), ('intractable', 0.021), ('sums', 0.021), ('kl', 0.021), ('ordered', 0.021), ('seconds', 0.021), ('labels', 0.021), ('nodes', 0.021), ('priors', 0.021), ('inference', 0.02), ('approximated', 0.02), ('operates', 0.02), ('contrary', 0.02), ('day', 0.02), ('dataset', 0.019), ('ratios', 0.019), ('expand', 0.019), ('david', 0.019), ('expanding', 0.019), ('family', 0.019), ('assign', 0.018), ('component', 0.018), ('bayesian', 0.018), ('irvine', 0.018), ('trade', 0.018), ('unbounded', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures

Author: Kenichi Kurihara, Max Welling, Nikos A. Vlassis

2 0.14642324 117 nips-2006-Learning on Graph with Laplacian Regularization

Author: Rie K. Ando, Tong Zhang

Abstract: We consider a general form of transductive learning on graphs with Laplacian regularization, and derive margin-based generalization bounds using appropriate geometric properties of the graph. We use this analysis to obtain a better understanding of the role of normalization of the graph Laplacian matrix as well as the effect of dimension reduction. The results suggest a limitation of the standard degree-based normalization. We propose a remedy from our analysis and demonstrate empirically that the remedy leads to improved classiﬁcation performance.

3 0.14316152 159 nips-2006-Parameter Expanded Variational Bayesian Methods

Author: Tommi S. Jaakkola, Yuan Qi

Abstract: Bayesian inference has become increasingly important in statistical machine learning. Exact Bayesian calculations are often not feasible in practice, however. A number of approximate Bayesian methods have been proposed to make such calculations practical, among them the variational Bayesian (VB) approach. The VB approach, while useful, can nevertheless suffer from slow convergence to the approximate solution. To address this problem, we propose Parameter-eXpanded Variational Bayesian (PX-VB) methods to speed up VB. The new algorithm is inspired by parameter-expanded expectation maximization (PX-EM) and parameterexpanded data augmentation (PX-DA). Similar to PX-EM and -DA, PX-VB expands a model with auxiliary variables to reduce the coupling between variables in the original model. We analyze the convergence rates of VB and PX-VB and demonstrate the superior convergence rates of PX-VB in variational probit regression and automatic relevance determination. 1

4 0.10192544 2 nips-2006-A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation

Author: Yee W. Teh, David Newman, Max Welling

Abstract: Latent Dirichlet allocation (LDA) is a Bayesian network that has recently gained much popularity in applications ranging from document modeling to computer vision. Due to the large scale nature of these applications, current inference procedures like variational Bayes and Gibbs sampling have been found lacking. In this paper we propose the collapsed variational Bayesian inference algorithm for LDA, and show that it is computationally efﬁcient, easy to implement and signiﬁcantly more accurate than standard variational Bayesian inference for LDA.

5 0.097888187 57 nips-2006-Conditional mean field

Author: Peter Carbonetto, Nando D. Freitas

Abstract: Despite all the attention paid to variational methods based on sum-product message passing (loopy belief propagation, tree-reweighted sum-product), these methods are still bound to inference on a small set of probabilistic models. Mean ﬁeld approximations have been applied to a broader set of problems, but the solutions are often poor. We propose a new class of conditionally-speciﬁed variational approximations based on mean ﬁeld theory. While not usable on their own, combined with sequential Monte Carlo they produce guaranteed improvements over conventional mean ﬁeld. Moreover, experiments on a well-studied problem— inferring the stable conﬁgurations of the Ising spin glass—show that the solutions can be signiﬁcantly better than those obtained using sum-product-based methods. 1

6 0.090707809 91 nips-2006-Hierarchical Dirichlet Processes with Random Effects

7 0.087152191 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors

8 0.075965956 158 nips-2006-PG-means: learning the number of clusters in data

9 0.066339687 181 nips-2006-Stability of $K$-Means Clustering

10 0.059729129 114 nips-2006-Learning Time-Intensity Profiles of Human Activity using Non-Parametric Bayesian Models

11 0.056590341 175 nips-2006-Simplifying Mixture Models through Function Approximation

12 0.052471995 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation

13 0.051438827 131 nips-2006-Mixture Regression for Covariate Shift

14 0.050406557 69 nips-2006-Distributed Inference in Dynamical Systems

15 0.048011735 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions

16 0.047072541 66 nips-2006-Detecting Humans via Their Pose

17 0.042511001 129 nips-2006-Map-Reduce for Machine Learning on Multicore

18 0.042013973 65 nips-2006-Denoising and Dimension Reduction in Feature Space

19 0.040871322 132 nips-2006-Modeling Dyadic Data with Binary Latent Factors

20 0.040584572 200 nips-2006-Unsupervised Regression with Applications to Nonlinear System Identification

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.154), (1, 0.048), (2, 0.015), (3, -0.02), (4, 0.05), (5, 0.048), (6, 0.183), (7, -0.0), (8, -0.05), (9, 0.045), (10, 0.038), (11, 0.07), (12, 0.011), (13, -0.013), (14, 0.006), (15, 0.14), (16, -0.012), (17, 0.003), (18, 0.056), (19, -0.088), (20, -0.107), (21, 0.009), (22, -0.161), (23, 0.045), (24, -0.13), (25, -0.065), (26, -0.002), (27, 0.038), (28, -0.053), (29, -0.048), (30, -0.137), (31, 0.065), (32, -0.012), (33, -0.172), (34, 0.028), (35, -0.131), (36, -0.17), (37, 0.159), (38, 0.071), (39, 0.027), (40, -0.044), (41, 0.021), (42, 0.017), (43, -0.037), (44, -0.175), (45, 0.022), (46, 0.039), (47, 0.053), (48, 0.052), (49, 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94654894 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures

Author: Kenichi Kurihara, Max Welling, Nikos A. Vlassis

2 0.61478531 159 nips-2006-Parameter Expanded Variational Bayesian Methods

Author: Tommi S. Jaakkola, Yuan Qi

3 0.53301263 2 nips-2006-A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation

Author: Yee W. Teh, David Newman, Max Welling

4 0.51300365 91 nips-2006-Hierarchical Dirichlet Processes with Random Effects

Author: Seyoung Kim, Padhraic Smyth

Abstract: Data sets involving multiple groups with shared characteristics frequently arise in practice. In this paper we extend hierarchical Dirichlet processes to model such data. Each group is assumed to be generated from a template mixture model with group level variability in both the mixing proportions and the component parameters. Variabilities in mixing proportions across groups are handled using hierarchical Dirichlet processes, also allowing for automatic determination of the number of components. In addition, each group is allowed to have its own component parameters coming from a prior described by a template mixture model. This group-level variability in the component parameters is handled using a random effects model. We present a Markov Chain Monte Carlo (MCMC) sampling algorithm to estimate model parameters and demonstrate the method by applying it to the problem of modeling spatial brain activation patterns across multiple images collected via functional magnetic resonance imaging (fMRI). 1

5 0.51123935 117 nips-2006-Learning on Graph with Laplacian Regularization

Author: Rie K. Ando, Tong Zhang

6 0.39637205 90 nips-2006-Hidden Markov Dirichlet Process: Modeling Genetic Recombination in Open Ancestral Space

7 0.35958204 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors

8 0.34927565 158 nips-2006-PG-means: learning the number of clusters in data

9 0.33750024 181 nips-2006-Stability of $K$-Means Clustering

10 0.3364177 23 nips-2006-Adaptor Grammars: A Framework for Specifying Compositional Nonparametric Bayesian Models

11 0.3198683 114 nips-2006-Learning Time-Intensity Profiles of Human Activity using Non-Parametric Bayesian Models

12 0.30912295 131 nips-2006-Mixture Regression for Covariate Shift

13 0.30398247 57 nips-2006-Conditional mean field

14 0.28110853 175 nips-2006-Simplifying Mixture Models through Function Approximation

15 0.26948249 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation

16 0.25598136 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds

17 0.25362024 147 nips-2006-Non-rigid point set registration: Coherent Point Drift

18 0.24756344 153 nips-2006-Online Clustering of Moving Hyperplanes

19 0.23315224 68 nips-2006-Dirichlet-Enhanced Spam Filtering based on Biased Samples

20 0.21891053 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.091), (3, 0.033), (7, 0.073), (9, 0.034), (20, 0.039), (22, 0.056), (33, 0.275), (44, 0.124), (57, 0.049), (65, 0.052), (69, 0.053), (71, 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78440219 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures

Author: Kenichi Kurihara, Max Welling, Nikos A. Vlassis

2 0.57558101 139 nips-2006-Multi-dynamic Bayesian Networks

Author: Karim Filali, Jeff A. Bilmes

Abstract: We present a generalization of dynamic Bayesian networks to concisely describe complex probability distributions such as in problems with multiple interacting variable-length streams of random variables. Our framework incorporates recent graphical model constructs to account for existence uncertainty, value-speciﬁc independence, aggregation relationships, and local and global constraints, while still retaining a Bayesian network interpretation and efﬁcient inference and learning techniques. We introduce one such general technique, which is an extension of Value Elimination, a backtracking search inference algorithm. Multi-dynamic Bayesian networks are motivated by our work on Statistical Machine Translation (MT). We present results on MT word alignment in support of our claim that MDBNs are a promising framework for the rapid prototyping of new MT systems. 1 INTRODUCTION The description of factorization properties of families of probabilities using graphs (i.e., graphical models, or GMs), has proven very useful in modeling a wide variety of statistical and machine learning domains such as expert systems, medical diagnosis, decision making, speech recognition, and natural language processing. There are many different types of graphical model, each with its own properties and beneﬁts, including Bayesian networks, undirected Markov random ﬁelds, and factor graphs. Moreover, for different types of scientiﬁc modeling, different types of graphs are more or less appropriate. For example, static Bayesian networks are quite useful when the size of set of random variables in the domain does not grow or shrink for all data instances and queries of interest. Hidden Markov models (HMMs), on the other hand, are such that the number of underlying random variables changes depending on the desired length (which can be a random variable), and HMMs are applicable even without knowing this length as they can be extended indeﬁnitely using online inference. HMMs have been generalized to dynamic Bayesian networks (DBNs) and temporal conditional random ﬁelds (CRFs), where an underlying set of variables gets repeated as needed to ﬁll any ﬁnite but unbounded length. Probabilistic relational models (PRMs) [5] allow for a more complex template that can be expanded in multiple dimensions simultaneously. An attribute common to all of the above cases is that the speciﬁcation of rules for expanding any particular instance of a model is ﬁnite. In other words, these forms of GM allow the speciﬁcation of models with an unlimited number of random variables (RVs) using a ﬁnite description. This is achieved using parameter tying, so while the number of RVs increases without bound, the number of parameters does not. In this paper, we introduce a new class of model we call multi-dynamic Bayesian networks. MDBNs are motivated by our research into the application of graphical models to the domain of statistical machine translation (MT) and they have two key attributes from the graphical modeling perspective. First, an MDBN generalizes a DBN in that there are multiple “streams” of variables that can get unrolled, but where each stream may be unrolled by a differing amount. In the most general case, connecting these different streams together would require the speciﬁcation of conditional probabil- ity tables with a varying and potentially unlimited number of parents. To avoid this problem and retain the template’s ﬁnite description length, we utilize a switching parent functionality (also called value-speciﬁc independence). Second, in order to capture the notion of fertility in MT-systems (deﬁned later in the text), we employ a form of existence uncertainty [7] (that we call switching existence), whereby the existence of a given random variable might depend on the value of other random variables in the network. Being fully propositional, MDBNs lie between DBNs and PRMs in terms of expressiveness. While PRMs are capable of describing any MDBN, there are, in general, advantages to restricting ourselves to a more speciﬁc class of model. For example, in the DBN case, it is possible to provide a bound on inference costs just by looking at attributes of the DBN template only (e.g., the left or right interfaces [12, 2]). Restricting the model can also make it simpler to use in practice. MDBNs are still relatively simple, while at the same time making possible the easy expression of MT systems, and opening doors to novel forms of probabilistic inference as we show below. In section 2, we introduce MDBNs, and describe their application to machine translation showing how it is possible to represent even complex MT systems. In section 3, we describe MDBN learning and decoding algorithms. In section 4, we present experimental results in the area of statistical machine translation, and future work is discussed in section 5. 2 MDBNs A standard DBN [4] template consists of a directed acyclic graph G = (V, E) = (V1 ∪ V2 , E1 ∪ → E2 ∪ E2 ) with node set V and edge set E. For t ∈ {1, 2}, the sets Vt are the nodes at slice t, Et → are the intra-slice edges between nodes in Vt , and Et are the inter-slice edges between nodes in V1 and V2 . To unroll a DBN to length T , the nodes V2 along with the edges adjacent to any node in V2 are cloned T − 1 times (where parameters of cloned variables are constrained to be the same as the template) and re-connected at the corresponding places. An MDBN with K streams consists of the union of K DBN templates along with a template structure specifying rules to connect the various streams together. An MDBN template is a directed graph (k) G = (V, E) = ( V (k) , E (k) ∪ E ) k (k) (k) th k (k) where (V , E ) is the k DBN, and the edges E are rules specifying how to connect stream k to the other streams. These rules are general in that they specify the set of edges for all values of Tk . There can be arbitrary nesting of the streams such as, for example, it is possible to specify a model that can grow along several dimensions simultaneously. An MDBN also utilizes “switching existence”, meaning some subset of the variables in V bestow existence onto other variables in the network. We call these variables existence bestowing (or ebnodes). The idea of bestowing existence is well deﬁned over a discrete space, and is not dissimilar to a variable length DBN. For example, we may have a joint distribution over lengths as follows: p(X1 , . . . , XN , N ) = p(X1 , . . . , Xn |N = n)p(N = n) where here N is an eb-node that determines the number of other random variables in the DGM. Our notion of eb-nodes allows us to model certain characteristics found within machine translation systems, such as “fertility” [3], where a given English word is cloned a random number of times in the generative process that explains a translation from French into English. This random cloning might happen simultaneously at all points along a given MDBN stream. This means that even for a given ﬁxed stream length Ti = ti , each stream could have a randomly varying number of random variables. Our graphical notation for eb-nodes consists of the eb-node as a square box containing variables whose existence is determined by the eb-node. We start by providing a simple example of an expanded MDBN for three well known MT systems, namely the IBM models 1 and 2 [3], and the “HMM” model [15].1 We adopt the convention in [3] that our goal is to translate from a string of French words F = f of length M = m into a string of English words E = e of length L = l — of course these can be any two languages. The basic generative (noisy channel) approach when translating from French to English is to represent the joint 1 We will refer to it as M-HMM to avoid confusion with regular HMMs. distribution P (f , e) = P (f |e)P (e). P (e) is a language model specifying the prior over the word string e. The key goal is to produce a ﬁnite-description length representation for P (f |e) where f and e are of arbitrary length. A hidden alignment string, a, speciﬁes how the English words align to the French word, leading to P (f |e) = a P (f , a|e). Figure 1(a) is a 2-stream MDBN expanded representation of the three models, in this case ℓ = 4 and m = 3. As shown, it appears that the fan-in to node fi will be ℓ and thus will grow without bound. However, a switching mechanism whereby P (fi |e, ai ) = P (fi |eai ) limits the number of parameters regardless of L. This means that the alignment variable ai indicates the English word eai that should be aligned to French word fi . The variable e0 is a null word that connects to French words not explained by any of e1 , . . . , eℓ . The graph expresses all three models — the difference is that, in Models 1 and 2, there are no edges between aj and aj+1 . In Model 1, p(aj = ℓ) is uniform on the set {1, . . . , L}; in Model 2, the distribution over aj is a function only of its position j, and on the English and French lengths ℓ and m respectively. In the M-HMM model, the ai variables form a ﬁrst order Markov chain. l e0 ℓ e1 e3 e2 e1 e4 e2 e3 φ1 φ2 φ3 m’ φ0 τ01 a1 f2 a2 f3 a3 m (a) Models 1,2 and M-HMM τ12 τ13 τ21 π02 π11 π12 π13 π21 f2 f3 f4 f5 f6 a1 u v τ11 f1 f1 τ02 a2 a3 a4 a5 a6 π01 w y x m (b) Expanded M3 graph Figure 1: Expanded 2-stream MDBN description of IBM Models 1 and 2, and the M-HMM model for MT; and the expanded MDBN description of IBM Model 3 with fertility assignment φ0 = 2, φ1 = 3, φ2 = 1, φ3 = 0. From the above, we see that it would be difﬁcult to express this model graphically using a standard DBN since L and M are unequal random variables. Indeed, there are two DBNs in operation, one consisting of the English string, and the other consisting of the French string and its alignment. Moreover, the fully connected structure of the graph in the ﬁgure can represent the appropriate family of model, but it also represents models whose parameter space grows without bound — the switching function allows the model template to stay ﬁnite regardless of L and M . With our MDBN descriptive abilities complete, it is now possible to describe the more complex IBM models 3, and 4[3] (an MDBN for Model3 is depicted in ﬁg. 1(b)). The top most random variable, ℓ, is a hidden switching existence variable corresponding to the length of the English string. The box abutting ℓ includes all the nodes whose existence depends on the value of ℓ. In the ﬁgure, ℓ = 3, thus resulting in three English words e1 , e2 , and e3 connected using a second-order Markov chain. To each English word ei corresponds a conditionally dependent fertility eb-node φi , which indicates how many times ei is used by words in the French string. Each φi in turn controls the existence of a set of variables under it. Given the fertilities (the ﬁgure depicts the case φ1 = 3, φ2 = 1, φ3 = 0), for each word ei , φi French word variables are granted existence and are denoted by τi1 , τi2 , . . . , τiφi , what is called the tablet [3] of ei . The values taken by the τ variables need to match the actual observed French sequence f1 , . . . , fm . This is represented as a shared constraint between all the f , π, and τ variables which have incoming edges into the observed variable v. v’s conditional probability table is such that it is one only when the associated constraint is satisﬁed2 . The variable 2 This type of encoding of constraints corresponds to the standard mechanism used by Pearl [14]. A naive implementation, however, would enumerate a number of conﬁgurations exponential in the number of constrained variables, while typically only a small fraction of the conﬁgurations would have positive probability. πi,k ∈ {1, . . . , m} is a switching dependency parent with respect to the constraint variable v and determines which fj participates in an equality constraint with τi,k . The bottom variable m is a switching existence node (observed to be 6 in the ﬁgure) with corresponding French word sequence and alignment variables. The French sequence participates in the v constraint described above, while the alignment variables aj ∈ {1, . . . , ℓ}, j ∈ 1, . . . , m constrain the fertilities to take their unique allowable values (for the given alignment). Alignments also restrict the domain of permutation variables, π, using the constraint variable x. Finally, the domain size of each aj has to lie in the interval [0, ℓ] and that is enforced by the variable u. The dashed edges connecting the alignment a variables represent an extension to implement an M3/M-HMM hybrid. ℓ The null submodel involving the deterministic node m′ (= i=1 φi ) and eb-node φ0 accounts for French words that are not explained by any of the English words e1 , . . . , eℓ . In this submodel, successive permutation variables are ordered and this constraint is implemented using the observed child w of π0i and π0(i+1) . Model 4 [3] is similar to Model 3 except that the former is based on a more elaborate distortion model that uses relative instead of absolute positions both within and between tablets. 3 Inference, Parameter Estimation and MPE Multi-dynamic Bayesian Networks are amenable to any type of inference that is applicable to regular Bayesian networks as long as switching existence relationships are respected and all the constraints (aggregation for example) are satisﬁed. Unfortunately DBN inference procedures that take advantage of the repeatable template and can preprocess it ofﬂine, are not easy to apply to MDBNs. A case in point is the Junction Tree algorithm [11]. Triangulation algorithms exist that create an ofﬂine triangulated version of the input graph and do not re-triangulate it for each different instance of the input data [12, 2]. In MDBNs, due to the ﬂexibility to unroll templates in several dimensions and to specify dependencies and constraints spanning the entire unrolled graph, it is not obvious how we can exploit any repetitive patterns in a Junction Tree-style ofﬂine triangulation of the graph template. In section 4, we discuss sampling inference methods we have used. Here we discuss our extension to a backtracking search algorithm with the same performance guarantees as the JT algorithm, but with the advantage of easily handling determinism, existence uncertainty, and constraints, both learned and explicitly stated. Value Elimination (VE) ([1]), is a backtracking Bayesian network inference technique that caches factors associated with portions of the search tree and uses them to avoid iterating again over the same subtrees. We follow the notation introduced in [1] and refer the reader to that paper for details about VE inference. We have extended the VE inference approach to handle explicitly encoded constraints, existence uncertainty, and to perform approximate local domain pruning (see section 4). We omit these details as well as others in the original paper and brieﬂy describe the main data structure required by VE and sketch the algorithm we refer to as FirstPass (ﬁg. 1) since it constitutes the ﬁrst step of the learning procedure, our main contribution in this section. A VE factor, F , is such that we can write the following marginal of the joint distribution P (X = x, Y = y, Z) = F.val × f (Z) X=x such that (X∪Y)∩Z = ∅, F.val is a constant, and f (Z) a function of Z only. Y is a set of variables previously instantiated in the current branch of search tree to the value vector y. The pair (Y, y) is referred to as a dependency set (F.Dset). X is referred to as a subsumed set (F.Sset). By caching the tuple (F.Dset, F.Sset, F.val), we avoid recomputing the marginal again whenever (1) F.Dset is active, meaning all nodes stored in F.Dset are assigned their cached values in the current branch of the search tree; and (2) none of the variables in F.Sset are assigned yet. FirstPass (alg. 1) visits nodes in the graph in Depth First fashion. In line 7, we get the values of all Newly Single-valued (NSV) CPTs i.e., CPTs that involve the current node, V , and in which all We use a general directed domain pruning constraint. Deterministic relationships then become a special case of our constraint whereby the domain of the child variable is constrained to a single value with probability one. Variable traversal order: A, B, C, and D. Factors are numbered by order of creation. *Fi denotes the activation of factor i. Tau values propagated recursively F7: Dset={} Sset={A,B,C,D} val=P(E=e) F7.tau = 1.0 = P(Evidence)/F7.val A F5: Dset={A=0} Sset={B,C,D} F2 D *F1 *F2 Factor values needed for c(A=0) and c(C=0,B=0) computation: F5.val=P(B=0|A=0)*F3.val+P(B=1|A=0)*F4.val F3.val=P(C=0|B=0)*F1.val+P(C=1|B=0)*F2.val F4.val=P(C=0|B=1)*F1.val+P(C=1|B=1)*F2.val F1.val=P(D=0|C=0)P(E=e|D=0)+P(D=1|C=0)P(E=e|D=1) F2.val=P(D=0|C=1)P(E=e|D=0)+P(D=1|C=1)P(E=e|D=1) First pass C *F3 *F4 Second pass D B F4 C F6.tau = F7.tau * P(A=1) 1 B F3: Dset={B=0} Sset={C,D} F1 F5.tau = F7.tau * P(A=0) F6 0 F3.tau = F5.tau * P(B=0|A=0) + F6.tau * P(B=0|A=1) = P(B=0) F4.tau = F5.tau * P(B=1|A=0) + F6.tau * P(B=1|A=1) = P(B=1) F1.tau = F3.tau * P(C=0|B=0) + F4.tau * P(C=0|B=1) = P(C=0) F2.tau = F3.tau * P(C=1|B=0) + F4.tau * P(C=1|B=1) = P(C=1) c(A=0)=(1/P(e))*(F7.tau*P(A=0)*F5.val)=(1/P(e))(P(A=0)*P(E=e|A=0))=P(A=0|E=e) c(C=0,B=0)=(1/P(e))*F3.tau*P(C=0|B=0)*F1.val =(1/P(e) * (P(A=0,B=0)+P(A=1,B=0)) * P(C=0|B=0) * F1.val =(1/P(e)) * P(B=0) * P(C=0|B=0) * F1.val =(1/P(e)) * P(B=0) * P(C=0|B=0) * F1.val =(1/P(e)) * P(C=0,B=0) * F1.val =P(C=0,B=0,E=e)/P(e)=P(C=0,B=0|E=e) Figure 2: Learning example using the Markov chain A → B → C → D → E, where E is observed. In the ﬁrst pass, factors (Dset, Sset and val) are learned in a bottom up fashion. Also, the normalization constant P (E = e) (probability of evidence) is obtained. In the second pass, tau values are updated in a top-down fashion and used to calculate expected counts c(F.head, pa(F.head)) corresponding to each F.head (the ﬁgure shows the derivations for (A=0) and (C=0,B=0), but all counts are updated in the same pass). other variables are already assigned (these variables and their values are accumulated into Dset). We also check for factors that are active, multiply their values in, and accumulate subsumed vars in Sset (to avoid branching on them). In line 10, we add V to the Sset. In line 11, we cache a new factor F with value F.val = sum. We store V into F.head, a pointer to the last variable to be inserted into F.Sset, and needed for parameter estimation described below. F.Dset consists of all the variables, except V , that appeared in any NSV CPT or the Dset of an activated factor at line 6. Regular Value Elimination is query-based, similar to variable elimination and recursive conditioning—what this means is that to answer a query of the type P (Q|E = e), where Q is query variable and E a set of evidence nodes, we force Q to be at the top of the search tree, run the backtracking algorithm and then read the answers to the queries P (Q = q|E = e), q ∈ Dom[Q], along each of the outgoing edges of Q. Parameter estimation would require running a number of queries on the order of the number of parameters to estimate. We extend VE into an algorithm that allows us to obtain Expectation Maximization sufﬁcient statistics in a single run of Value Elimination plus a second pass, which can never take longer than the ﬁrst one (and in practice is much faster). This two-pass procedure is analogous to the collect-distribute evidence procedure in the Junction Tree algorithm, but here we do this via a search tree. Let θX=x|pa(X)=y be a parameter associated with variable X with value x and parents Y = pa(X) when they have value y. Assuming a maximum likelihood learning scenario3 , to estimate θX=x|pa(X)=y , we need to compute f (X = x, pa(X) = y, E = e) = P (W, X = x, pa(X) = y, E = e) W\{X,pa(X)} which is a sum of joint probabilities of all conﬁgurations that are consistent with the assignment {X = x, pa(X) = y}. If we were to turn off factor caching, we would enumerate all such variable conﬁgurations and could compute the sum. When standard VE factors are used, however, this is no longer possible whenever X or any of its parents becomes subsumed. Fig. 2 illustrates an example of a VE tree and the factors that are learned in the case of a Markov chain with an evidence node at the end. We can readily estimate the parameters associated with variables A and B as they are not subsumed along any branch. C and D become subsumed, however, and we cannot obtain the correct counts along all the branches that would lead to C and D in the full enumeration case. To address this issue, we store a special value, F.tau, in each factor. F.tau holds the sum over all path probabilities from the ﬁrst level of the search tree to the level at which the factor F was 3 For Bayesian networks the likelihood function decomposes such that maximizing the expectation of the complete likelihood is equivalent to maximizing the “local likelihood” of each variable in the network. either created or activated. For example, F 6.tau in ﬁg. 2 is simply P (A = 1). Although we can compute F 3.tau directly, we can also compute it recursively using F 5.tau and F 6.tau as shown in the ﬁgure. This is because both F 5 and F 6 subsume F 3: in the context {F 5.Dset}, there exists a (unique) value dsub of F 5.head4 s.t. F 3 becomes activable. Likewise for F 6. We cannot compute F 1.tau directly, but we can, recursively, from F 3.tau and F 4.tau by taking advantage of a similar subsumption relationship. In general, we can show that the following recursive relationship holds: F pa .tau × N SVF pa .head=dsub × F.tau ← F pa ∈F pa Fact .val F.val Fact ∈Fact (1) where F pa is the set of factors that subsume F , Fact is the set of all factors (including F ) that become active in the context of {F pa .Dset, F pa .head = dsub } and N SVF pa .head=dsub is the product of all newly single valued CPTs under the same context. For top-level factors (not subsumed by any factor), F.tau = Pevidence /F.val, which is 1.0 when there is a unique top-level factor. Alg. 2 is a simple recursive computation of eq. 1 for each factor. We visit learned factors in the reverse order in which they were learned to ensure that, for any factor F ′ , F ′ .tau is incremented (line 13) by any F that might have activated F ′ (line 12). For example, in ﬁg. 2, F 4 uses F 1 and F 2, so F 4.tau needs to be updated before F 1.tau and F 2.tau. In line 11, we can increment the counts for any NSV CPT entries since F.tau will account for the possible ways of reaching the conﬁguration {F.Dset, F.head = d} in an equivalent full enumeration tree. Algorithm 1: FirstPass(level) 1 2 3 4 5 6 7 8 9 10 Input: Graph G Output: A list of learned factors and Pevidence Select var V to branch on if V ==NONE then return Sset={}, Dset={} for d ∈ Dom[V ] do V ←d prod = productOfAllNSVsAndActiveFactors(Dset, Sset) if prod != 0 then FirstPass(level+1) sum += prod Sset = Sset ∪ {V } cacheNewFactor(F.head ← V ,F.val ← sum, F.Sset ← Sset, F.Dset ← Dset); Algorithm 2: SecondPass() 1 2 3 4 5 6 7 8 9 10 11 12 13 Input: F : List of factors in the reverse order learned in the ﬁrst pass and Pevidence . Result: Updated counts foreach F ∈ F do if F.Dset = {} then F.tau ← Pevidence /F.val else F.tau ← 0.0 Assign vars in F.Dset to their values V ← F.head (last node to have been subsumed in this factor) foreach d ∈ Dom[V ] do prod = productOfAllNSVsAndActiveFactors() prod∗ = F.tau foreach newly single-valued CPT C do count(C.child,C.parents)+=prod/Pevidence F ′ =getListOfActiveFactors() for F ′ ∈ F ′ do F ′ .tau+ = prod/F ′ .val Most Probable Explanation We compute MPE using a very similar two-pass algorithm. In the ﬁrst pass, factors are used to store a maximum instead of a summation over variables in the Sset. We also keep track of the value of F.head at which the maximum is achieved. In the second pass, we recursively ﬁnd the optimal variable conﬁguration by following the trail of factors that are activated when we assign each F.head variable to its maximum value starting from the last learned factor. 4 Recall, F.head is the last variable to be added to a newly created factor in line 10 of alg. 1 4 MACHINE TRANSLATION WORD ALIGNMENT EXPERIMENTS A major motivation for pursuing the type of representation and inference described above is to make it possible to solve computationally-intensive real-world problems using large amounts of data, while retaining the full generality and expressiveness afforded by the MDBN modeling language. In the experiments below we compare running times of MDBNs to GIZA++ on IBM Models 1 through 4 and the M-HMM model. GIZA++ is a special-purpose optimized MT word alignment C++ tool that is widely used in current state-of-the-art phrase-based MT systems [10] and at the time of this writing is the only publicly available software that implements all of the IBM Models. We test on French-English 107 hand-aligned sentences5 from a corpus of the European parliament proceedings (Europarl [9]) and train on 10000 sentence pairs from the same corpus and of maximum number of words 40. The Alignment Error Rate (AER) [13] evaluation metric quantiﬁes how well the MPE assignment to the hidden alignment variables matches human-generated alignments. Several pruning and smoothing techniques are used by GIZA and MDBNs. GIZA prunes low lexical (P (f |e)) probability values and uses a default small value for unseen (or pruned) probability table entries. For models 3 and 4, for which there is no known polynomial time algorithm to perform the full E-step or compute MPE, GIZA generates a set of high probability alignments using an MHMM and hill-climbing and collects EM counts over these alignments using M3 or M4. For MDBN models we use the following pruning strategy: at each level of the search tree we prune values which, together, account for the lowest speciﬁed percentage of the total probability mass of the product of all newly active CPTs in line 6 of alg. 1. This is a more effective pruning than simply removing low-probability values of each CPD because it factors in the joint contribution of multiple active variables. Table 1 shows a comparison of timing numbers obtained GIZA++ and MDBNs. The runtime numbers shown are for the combined tasks of training and decoding; however, training time dominates given the difference in size between train and test sets. For models 1 and 2 neither GIZA nor MDBNs perform any pruning. For the M-HMM, we prune 60% of probability mass at each level and use a Dirichlet prior over the alignment variables such that long-range transitions are exponentially less likely than shorter ones.6 This model achieves similar times and AER to GIZA’s. Interestingly, without any pruning, the MDBN M-HMM takes 160 minutes to complete while only marginally improving upon the pruned model. Experimenting with several pruning thresholds, we found that AER would worsen much more slowly than runtime decreases. Models 3 and 4 have treewidth equal to the number of alignment variables (because of the global constraints tying them) and therefore require approximate inference. Using Model 3, and a drastic pruning threshold that only keeps the value with the top probability at each level, we were able to achieve an AER not much higher than GIZA’s. For M4, it achieves a best AER of 31.7% while we do not improve upon Model3, most likely because a too restrictive pruning. Nevertheless, a simple variation on Model3 in the MDBN framework achieves a lower AER than our regular M3 (with pruning still the same). The M3-HMM hybrid model combines the Markov alignment dependencies from the M-HMM model with the fertility model of M3. MCMC Inference Sampling is widely used for inference in high-treewidth models. Although MDBNs support Likelihood Weighing, it is very inefﬁcient when the probability of evidence is very small, as is the case in our MT models. Besides being slow, Markov chain Monte Carlo can be problematic when the joint distribution is not positive everywhere, in particular in the presence of determinism and hard constraints. Techniques such as blocking Gibbs sampling [8] try to address the problem. Often, however, one has to carefully choose a problem-dependent proposal distribution. We used MCMC to improve training of the M3-HMM model. We were able to achieve an AER of 32.8% (down from 39.1%) but using 400 minutes of uniprocessor time. 5 CONCLUSION The existing classes of graphical models are not ideally suited for representing SMT models because “natural” semantics for specifying the latter combine ﬂavors of different GM types on top of standard directed Bayesian network semantics: switching parents found in Bayesian Multinets [6], aggregation relationships such as in Probabilistic Relational Models [5], and existence uncertainty [7]. We 5 Available at http://www.cs.washington.edu/homes/karim French and English have similar word orders. On a different language pair, a different prior might be more appropriate. With a uniform prior, the MDBN M-HMM has 36.0% AER. 6 Model Init M1 M2 M-HMM M3 M4 M3-HMM GIZA++ M1 M-HMM 1m45s (47.7%) N/A 2m02s (41.3%) N/A 4m05s (35.0%) N/A 2m50 (45%) 5m20s (38.5%) 5m20s (34.8%) 7m45s (31.7%) N/A MDBN M1 3m20s (48.0%) 5m30s (41.0%) 4m15s (33.0%) 12m (43.6%) 25m (43.6%) 9m30 (41.0%) M-HMM N/A N/A N/A 9m (42.5%) 23m (42.6%) 9m15s (39.1%) MCMC 400m (32.8%) Table 1: MDBN VE-based learning versus GIZA++ timings and %AER using 5 EM iterations. The columns M1 and M-HMM correspond to the model that is used to initialize the model in the corresponding row. The last row is a hybrid Model3-HMM model that we implemented using MDBNs and is not expressible using GIZA. have introduced a generalization of dynamic Bayesian networks to easily and concisely build models consisting of varying-length parallel asynchronous and interacting data streams. We have shown that our framework is useful for expressing various statistical machine translation models. We have also introduced new parameter estimation and decoding algorithms using exact and approximate searchbased probability computation. While our timing results are not yet as fast as a hand-optimized C++ program on the equivalent model, we have shown that even in this general-purpose framework of MDBNs, our timing numbers are competitive and usable. Our framework can of course do much more than the IBM and HMM models. One of our goals is to use this framework to rapidly prototype novel MT systems and develop methods to statistically induce an interlingua. We also intend to use MDBNs in other domains such as multi-party social interaction analysis. References [1] F. Bacchus, S. Dalmao, and T. Pitassi. Value elimination: Bayesian inference via backtracking search. In UAI-03, pages 20–28, San Francisco, CA, 2003. Morgan Kaufmann. [2] J. Bilmes and C. Bartels. On triangulating dynamic graphical models. In Uncertainty in Artiﬁcial Intelligence: Proceedings of the 19th Conference, pages 47–56. Morgan Kaufmann, 2003. [3] P. F. Brown, J. Cocke, S. A. Della Piettra, V. J. Della Piettra, F. Jelinek, J. D. Lafferty, R. L. Mercer, and P. S. Roossin. A statistical approach to machine translation. Computational Linguistics, 16(2):79–85, June 1990. [4] T. Dean and K. Kanazawa. Probabilistic temporal reasoning. AAAI, pages 524–528, 1988. [5] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic relational models. In IJCAI, pages 1300–1309, 1999. [6] D. Geiger and D. Heckerman. Knowledge representation and inference in similarity networks and Bayesian multinets. Artif. Intell., 82(1-2):45–74, 1996. [7] L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic models of link structure. Journal of Machine Learning Research, 3(4-5):697–707, May 2003. [8] C. Jensen, A. Kong, and U. Kjaerulff. Blocking Gibbs sampling in very large probabilistic expert systems. In International Journal of Human Computer Studies. Special Issue on Real-World Applications of Uncertain Reasoning., 1995. [9] P. Koehn. Europarl: A multilingual corpus for evaluation of machine http://www.isi.edu/koehn/publications/europarl, 2002. translation. [10] P. Koehn, F. Och, and D. Marcu. Statistical phrase-based translation. In NAACL/HLT 2003, 2003. [11] S. Lauritzen. Graphical Models. Oxford Science Publications, 1996. [12] K. Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, U.C. Berkeley, Dept. of EECS, CS Division, 2002. [13] F. J. Och and H. Ney. Improved statistical alignment models. In ACL, pages 440–447, Oct 2000. [14] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 2nd printing edition, 1988. [15] S. Vogel, H. Ney, and C. Tillmann. HMM-based word alignment in statistical translation. In Proceedings of the 16th conference on Computational linguistics, pages 836–841, Morristown, NJ, USA, 1996.

3 0.56762755 121 nips-2006-Learning to be Bayesian without Supervision

Author: Martin Raphan, Eero P. Simoncelli

Abstract: unkown-abstract

4 0.5625273 159 nips-2006-Parameter Expanded Variational Bayesian Methods

Author: Tommi S. Jaakkola, Yuan Qi

5 0.55853951 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization

Author: Rey Ramírez, Jason Palmer, Scott Makeig, Bhaskar D. Rao, David P. Wipf

Abstract: The ill-posed nature of the MEG/EEG source localization problem requires the incorporation of prior assumptions when choosing an appropriate solution out of an inﬁnite set of candidates. Bayesian methods are useful in this capacity because they allow these assumptions to be explicitly quantiﬁed. Recently, a number of empirical Bayesian approaches have been proposed that attempt a form of model selection by using the data to guide the search for an appropriate prior. While seemingly quite different in many respects, we apply a unifying framework based on automatic relevance determination (ARD) that elucidates various attributes of these methods and suggests directions for improvement. We also derive theoretical properties of this methodology related to convergence, local minima, and localization bias and explore connections with established algorithms. 1

6 0.55807954 57 nips-2006-Conditional mean field

7 0.55567008 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds

8 0.55565619 175 nips-2006-Simplifying Mixture Models through Function Approximation

9 0.55548573 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming

10 0.55463904 193 nips-2006-Tighter PAC-Bayes Bounds

11 0.55410391 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment

12 0.55388469 65 nips-2006-Denoising and Dimension Reduction in Feature Space

13 0.55349851 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics

14 0.5534972 117 nips-2006-Learning on Graph with Laplacian Regularization

15 0.55240381 20 nips-2006-Active learning for misspecified generalized linear models

16 0.54946703 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure

17 0.54935074 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation

18 0.54895437 98 nips-2006-Inferring Network Structure from Co-Occurrences

19 0.54648191 55 nips-2006-Computation of Similarity Measures for Sequential Data using Generalized Suffix Trees

20 0.54517233 169 nips-2006-Relational Learning with Gaussian Processes