nips nips2006 nips2006-41 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Hugh A. Chipman, Edward I. George, Robert E. McCulloch
Abstract: We develop a Bayesian “sum-of-trees” model, named BART, where each tree is constrained by a prior to be a weak learner. Fitting and inference are accomplished via an iterative backfitting MCMC algorithm. This model is motivated by ensemble methods in general, and boosting algorithms in particular. Like boosting, each weak learner (i.e., each weak tree) contributes a small amount to the overall model. However, our procedure is defined by a statistical model: a prior and a likelihood, while boosting is defined by an algorithm. This model-based approach enables a full and accurate assessment of uncertainty in model predictions, while remaining highly competitive in terms of predictive accuracy.
Reference: text
sentIndex sentText sentNum sentScore
1 McCulloch Graduate School of Business University of Chicago Chicago, IL, 60637 Abstract We develop a Bayesian “sum-of-trees” model, named BART, where each tree is constrained by a prior to be a weak learner. [sent-4, score-0.304]
2 This model is motivated by ensemble methods in general, and boosting algorithms in particular. [sent-6, score-0.212]
3 However, our procedure is defined by a statistical model: a prior and a likelihood, while boosting is defined by an algorithm. [sent-10, score-0.226]
4 This model-based approach enables a full and accurate assessment of uncertainty in model predictions, while remaining highly competitive in terms of predictive accuracy. [sent-11, score-0.123]
5 + $g_m(x)$, where each $g_i$ denotes a binary regression tree. [sent-16, score-0.112]
6 It is vastly more flexible than a single tree model which does not easily incorporate additive effects. [sent-18, score-0.284]
7 Because multivariate components can easily account for high order interaction effects, a sum-of-trees model is also much more flexible than typical additive models that use low dimensional smoothers as components. [sent-19, score-0.094]
8 We specify a prior, and then obtain a sequence of draws from the posterior using Markov chain Monte Carlo (MCMC). [sent-21, score-0.187]
9 First, with m chosen large, it restrains the fit of each individual gi so that the overall fit is made up of many small contributions in the spirit of boosting (Freund & Schapire (1997), Friedman (2001)). [sent-23, score-0.191]
10 The prior specification is kept simple and a default choice is shown to have good out of sample predictive performance. [sent-26, score-0.255]
11 Inferential uncertainty is naturally quantified in the usual Bayesian way: variation in the MCMC draws of $f = \sum_i g_i$ (evaluated at a set of $x$ of interest) and $\sigma$ indicates our beliefs about plausible values given the data. [sent-27, score-0.22]
12 Note that the depth of each tree is not fixed so that we infer the level of interaction. [sent-28, score-0.188]
13 Thus, our procedure captures ensemble learning (in which many trees are combined) both in the fundamental sum-of-trees specification and in the model-averaging used to obtain the estimate. [sent-30, score-0.197]
14 1 A Sum-of-Trees Model To elaborate the form of a sum-of-trees model, we begin by establishing notation for a single tree model. [sent-33, score-0.188]
15 Let T denote a binary tree consisting of a set of interior node decision rules and a set of terminal nodes, and let M = {µ1 , µ2 , . [sent-34, score-0.435]
16 , µB } denote a set of parameter values associated with each of the B terminal nodes of T . [sent-37, score-0.211]
17 Prediction for a particular value of input vector x is accomplished as follows: If x is associated with terminal node b of T by the sequence of decision rules from top to bottom, it is then assigned the µb value associated with this terminal node. [sent-38, score-0.457]
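The prediction rule for a single tree can be sketched in a few lines of Python. This is only an illustration: the `Node` class, the "go left when x[var] <= cut" convention, and the example tree are assumptions made here, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    # Interior nodes hold a decision rule (var, cut); terminal nodes hold mu.
    # This layout is a hypothetical choice for illustration.
    var: Optional[int] = None
    cut: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    mu: Optional[float] = None


def g(x: Sequence[float], node: Node) -> float:
    """Follow the decision rules from top to bottom and return the
    mu value of the terminal node that x is associated with."""
    while node.mu is None:
        node = node.left if x[node.var] <= node.cut else node.right
    return node.mu


# A small example tree with B = 3 terminal nodes.
tree = Node(var=0, cut=0.5,
            left=Node(mu=-0.1),
            right=Node(var=1, cut=2.0,
                       left=Node(mu=0.05),
                       right=Node(mu=0.3)))
```

Calling `g([0.9, 3.0], tree)` walks right at the root and right again at the second rule, ending at the terminal node with value 0.3.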
18 (1) (2) Unlike the single tree model, when m > 1 the terminal node parameter µi given by g(x; Tj , Mj ) is merely part of the conditional mean of Y given x. [sent-41, score-0.435]
19 Such terminal node parameters will represent interaction effects when their assignment depends on more than one component of x (i. [sent-42, score-0.309]
20 Because (1) may be based on trees of varying sizes, the sum-of-trees model can incorporate both direct effects and interaction effects of varying orders. [sent-45, score-0.216]
21 In the special case where every terminal node assignment depends on just a single component of x, the sum-of-trees model reduces to a simple additive function. [sent-46, score-0.31]
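A minimal sketch of the additive special case, assuming each tree is a single-split "stump" (a simplification chosen here for brevity; the helper names are hypothetical). Because every stump below depends on only one component of x, the sum behaves as an additive function: changing x[1] shifts the fit by the same amount whatever the value of x[0].

```python
def stump(var, cut, mu_left, mu_right):
    """A one-split tree: returns mu_left if x[var] <= cut, else mu_right."""
    return lambda x: mu_left if x[var] <= cut else mu_right


def sum_of_trees(trees, x):
    """Evaluate the sum of the individual tree fits at x."""
    return sum(tree(x) for tree in trees)


# Three small trees, each splitting on a single component of x.
trees = [
    stump(0, 0.5, -0.2, 0.2),
    stump(1, 1.0, -0.1, 0.1),
    stump(0, 0.8, 0.0, 0.05),
]

# Effect of moving x[1] across its cut point, at two different x[0] values.
effect_a = sum_of_trees(trees, [0.6, 2.0]) - sum_of_trees(trees, [0.6, 0.0])
effect_b = sum_of_trees(trees, [0.9, 2.0]) - sum_of_trees(trees, [0.9, 0.0])
```

Interaction effects would appear as soon as one tree splits on both components, since the effect of x[1] would then depend on x[0].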
22 With a large number of trees, a sum-of-trees model gains increased representation flexibility, which, when coupled with our regularization prior, gives excellent out of sample predictive performance. [sent-47, score-0.086]
23 2 A Regularization Prior The complexity of the prior specification is vastly simplified by letting the Ti be i. [sent-54, score-0.121]
24 Given these independence assumptions we need only choose priors for a single tree T , a single µ, and σ. [sent-59, score-0.188]
25 Motivated by our desire to make each g(x; Ti , Mi ) a small contribution to the overall fit, we put prior weight on small trees and small µi,b . [sent-60, score-0.211]
26 For the tree prior, we use the same specification as in Chipman, George & McCulloch (1998). [sent-61, score-0.188]
27 In all examples we use the same prior corresponding to the choice α = . [sent-63, score-0.088]
28 With this choice, trees with 1, 2, 3, 4, and ≥ 5 terminal nodes receive prior probability of 0. [sent-65, score-0.422]
29 Note that even with this prior, trees with many terminal nodes can be grown if the data demands it. [sent-71, score-0.334]
30 At any non-terminal node, the prior on the associated decision rule puts equal probability on each available variable and then equal probability on each available rule given the variable. [sent-72, score-0.088]
31 For the prior on a µ, we start by simply shifting and rescaling Y so that we believe the prior probability that E(Y | x) ∈ (−. [sent-73, score-0.176]
32 For example if k = 2 there is a 95% (conditional) prior probability that the mean of Y is in (−. [sent-99, score-0.088]
33 k = 2 is our default choice and in practice we typically rescale the response y so that its observed values range from -5. [sent-102, score-0.142]
34 Note that this prior increases the shrinkage of µi,b (toward zero) as m increases. [sent-105, score-0.13]
35 For the prior on $\sigma$ we start from the usual inverted-chi-squared prior: $\sigma^2 \sim \nu\lambda/\chi^2_\nu$. [sent-106, score-0.088]
36 99, and set $\lambda$ so that the $q$th quantile of the prior on $\sigma$ is located at $\hat\sigma$, that is $P(\sigma < \hat\sigma) = q$. [sent-112, score-0.175]
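The calibration of λ can be worked out directly. With σ² = νλ/X and X chi-squared on ν degrees of freedom, the requirement P(σ < s) = q for a reference value s is equivalent to P(X > νλ/s²) = q, which gives λ = s² · F⁻¹(1 − q)/ν, where F is the chi-squared CDF. The sketch below implements this with the standard library only. The function names, the series-based CDF, and the example values `nu=3`, `q=0.75`, `sigma_hat=2.0` are illustrative assumptions, not settings taken from the text.

```python
import math


def chi2_cdf(x, nu):
    """Chi-squared CDF via the power series for the regularized
    lower incomplete gamma function P(nu/2, x/2)."""
    if x <= 0:
        return 0.0
    a, z = nu / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-15 * total:
        term *= z / (a + n)
        total += term
        n += 1
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))


def chi2_quantile(p, nu):
    """Invert chi2_cdf by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, nu) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, nu) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def calibrate_lambda(nu, q, sigma_hat):
    """Choose lambda so that P(sigma < sigma_hat) = q under
    sigma^2 ~ nu * lambda / chi2_nu."""
    return sigma_hat ** 2 * chi2_quantile(1.0 - q, nu) / nu


# Illustrative values only.
lam = calibrate_lambda(nu=3, q=0.75, sigma_hat=2.0)
```

A larger q places more prior mass below the reference value, which corresponds to a stronger prior belief that σ is small.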
37 For automatic use, we recommend the default setting (ν, q) = (3, 0. [sent-118, score-0.111]
38 Strong prior beliefs that σ is very small could lead to over-fitting. [sent-122, score-0.088]
39 3 A Backfitting MCMC Algorithm Given the observed data y, our Bayesian setup induces a posterior distribution p((T1 , M1 ), . [sent-123, score-0.087]
40 For notational convenience, let T(i) be the set of all trees in the sum except Ti , and similarly define M(i) . [sent-129, score-0.123]
41 The Gibbs sampler here entails m successive draws of (Ti , Mi ) conditionally on (T(i) , M(i) , σ): (T1 , M1 )|T(1) , M(1) , σ, y (T2 , M2 )|T(2) , M(2) , σ, y . [sent-130, score-0.1]
42 (Tm , Mm )|T(m) , M(m) , σ, y, (3) followed by a draw of σ from the full conditional: σ|T1 , . [sent-133, score-0.075]
43 (4) Hastie & Tibshirani (2000) considered a similar application of the Gibbs sampler for posterior sampling for additive and generalized additive models with σ fixed, and showed how it was a stochastic generalization of the backfitting algorithm for such models. [sent-140, score-0.245]
44 In contrast with the stagewise nature of most boosting algorithms (Freund & Schapire (1997), Friedman (2001), Meek, Thiesson & Heckerman (2002)), the backfitting MCMC algorithm repeatedly resamples the parameters of each learner in the ensemble. [sent-142, score-0.233]
45 The idea is that given (T(i) , M(i) ) and σ we may subtract the fit from (T(i) , M(i) ) from both sides of (1) leaving us with a single tree model with known error variance. [sent-143, score-0.188]
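The "subtract the fit" step amounts to computing partial residuals. A minimal sketch, assuming the current fits of all m trees at the n training points are stored as a list of lists (a hypothetical layout chosen for this example):

```python
def partial_residuals(y, fits, i):
    """Return y minus the fit of every tree except tree i.

    fits[j][n] is the current value of tree j at training point n.
    The result is the response for the single-tree problem that
    tree i has to explain on its own."""
    return [
        y_n - sum(fit[n] for j, fit in enumerate(fits) if j != i)
        for n, y_n in enumerate(y)
    ]


y = [1.0, 2.0]
fits = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
r1 = partial_residuals(y, fits, 1)  # residuals seen by the second tree
```

Each Gibbs step then treats these residuals as the data for a single tree model with the current value of σ.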
46 This draw may be made following the approach of Chipman et al. [sent-144, score-0.105]
47 These methods draw (Ti , Mi ) | T(i) , M(i) , σ, y as Ti | T(i) , M(i) , σ, y followed by Mi | Ti , T(i) , M(i) , σ, y. [sent-146, score-0.075]
48 The first draw is done by the Metropolis-Hastings algorithm after integrating out Mi and the second is a set of normal draws. [sent-147, score-0.075]
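The "set of normal draws" can be illustrated with the standard conjugate update for one terminal node. The sketch assumes a N(0, σ_µ²) prior on each terminal node parameter and normal errors with standard deviation σ; both the prior form and the function names are assumptions made for this example.

```python
import random


def leaf_posterior(residuals, sigma, sigma_mu):
    """Posterior mean and standard deviation of one terminal node
    parameter, given the partial residuals that fall in that node.
    Assumes mu ~ N(0, sigma_mu^2) and residual ~ N(mu, sigma^2)."""
    n = len(residuals)
    precision = n / sigma ** 2 + 1.0 / sigma_mu ** 2
    mean = (sum(residuals) / sigma ** 2) / precision
    return mean, precision ** -0.5


def draw_leaf_mu(residuals, sigma, sigma_mu, rng=random):
    """One normal draw of the terminal node parameter."""
    mean, sd = leaf_posterior(residuals, sigma, sigma_mu)
    return rng.gauss(mean, sd)


# Four residuals equal to 1 with sigma = sigma_mu = 1:
# the posterior mean is shrunk from 1 toward zero.
mean, sd = leaf_posterior([1.0, 1.0, 1.0, 1.0], sigma=1.0, sigma_mu=1.0)
```

A smaller `sigma_mu` pulls the posterior mean closer to zero, which is how the prior keeps each tree's contribution small.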
49 The draw of σ is easily accomplished by subtracting all the fit from both sides of (1) so that the errors are considered to be observed. [sent-148, score-0.171]
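Once the full fit is subtracted, the σ update is a standard conjugate draw. The sketch below assumes the inverted-chi-squared prior σ² ~ νλ/χ²_ν and normal errors, which together give σ² | rest ~ (νλ + Σe²)/χ²_{ν+n}. The function name and the example values are illustrative assumptions.

```python
import random


def draw_sigma(residuals, nu, lam, rng=random):
    """Draw sigma from its full conditional, given the residuals
    y - f(x) and the prior sigma^2 ~ nu * lam / chi2_nu."""
    n = len(residuals)
    sse = sum(e * e for e in residuals)
    # A chi-squared variate on (nu + n) degrees of freedom is a
    # gamma variate with shape (nu + n) / 2 and scale 2.
    chi2 = rng.gammavariate((nu + n) / 2.0, 2.0)
    return ((nu * lam + sse) / chi2) ** 0.5


# Illustrative residuals: fifty values of +1 or -1.
residuals = [1.0 if i % 2 == 0 else -1.0 for i in range(50)]
sigma_draw = draw_sigma(residuals, nu=3, lam=1.0, rng=random.Random(0))
```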
50 The Metropolis-Hastings draw of Ti | T(i) , M(i) , σ, y is complex and lies at the heart of our method. [sent-150, score-0.075]
51 (1998) proposes a new tree based on the current tree using one of four moves. [sent-152, score-0.376]
52 The moves and their associated proposal probabilities are: growing a terminal node (0. [sent-153, score-0.247]
53 Although the grow and prune moves change the implicit dimensionality of the proposed tree in terms of the number of terminal nodes, by integrating out Mi from the posterior, we avoid the complexities associated with reversible jumps between continuous spaces of varying dimensions (Green 1995). [sent-158, score-0.424]
54 At each iteration, each tree may increase or decrease the number of terminal nodes by one, or change one or two decision rules. [sent-160, score-0.399]
55 It is not uncommon for a tree to grow large and then subsequently collapse back down to a single node as the algorithm iterates. [sent-162, score-0.368]
56 The sum-of-trees model, with its abundance of unidentified parameters, allows for “fit” to be freely reallocated from one tree to another. [sent-163, score-0.188]
57 Compared to the single tree model MCMC approach of Chipman et al. [sent-165, score-0.218]
58 When only single tree models are considered, the MCMC algorithm tends to quickly gravitate toward a single large tree and then gets stuck in a local neighborhood of that tree. [sent-167, score-0.376]
59 In some ways backfitting MCMC is a stochastic alternative to boosting algorithms for fitting linear combinations of trees. [sent-170, score-0.138]
60 It is distinguished by the ability to sample from a posterior distribution. [sent-171, score-0.087]
61 At each iteration, we get a new draw f ∗ = g(x; T1 , M1 ) + g(x; T2 , M2 ) + . [sent-172, score-0.075]
62 + g(x; Tm , Mm ) (5) corresponding to the draw of Tj and Mj . [sent-175, score-0.075]
63 These draws are a (dependent) sample from the posterior distribution on the “true” f . [sent-176, score-0.187]
64 Rather than pick the “best” f ∗ from these draws, the set of multiple draws can be used to further enhance inference. [sent-177, score-0.142]
65 We estimate f by the posterior mean of f which is approximated by averaging the f ∗ over the draws. [sent-178, score-0.087]
66 For example, we can use the 5% and 95% quantiles of f ∗ (x) to obtain 90% posterior intervals for f (x). [sent-180, score-0.191]
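Given the stored draws of f*(x) at one input x, the posterior mean and the 90% interval are simple summaries. A minimal sketch using the standard library follows. The function name and the choice of the "inclusive" interpolation rule are assumptions of this example, and other quantile conventions would give slightly different endpoints.

```python
import statistics


def summarize_draws(draws):
    """Posterior mean plus the 5% and 95% quantiles of f*(x),
    computed from the MCMC draws at a single x."""
    mean = statistics.fmean(draws)
    # n=20 yields cut points at 5%, 10%, ..., 95%.
    cuts = statistics.quantiles(draws, n=20, method="inclusive")
    return mean, cuts[0], cuts[-1]


# Synthetic "draws" used only to exercise the function.
mean, lower, upper = summarize_draws(list(range(101)))
```

Because the draws are dependent, the summaries are usually computed after discarding burn-in iterations.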
67 All datasets correspond to regression problems with between 3 and 28 numeric predictors and 0 to 6 categorical predictors. [sent-202, score-0.201]
68 As competitors we considered linear regression with L1 regularization (the Lasso) (Efron, Hastie, Johnstone & Tibshirani 2004) and four black-box models: Friedman’s (2001) gradient boosting, random forests (Breiman 2001), and neural networks with one layer of hidden units. [sent-205, score-0.269]
69 We considered two versions of our Bayesian ensemble procedure BART. [sent-208, score-0.106]
70 In BART-cv, the prior hyperparameters (ν, q, k, m) were treated as operational parameters to be tuned via cross-validation. [sent-209, score-0.153]
71 For both BART-cv and BART-default, all specifications of the quantile q were made relative to the least squares linear regression estimate $\hat\sigma$, and the number of burn-in steps and MCMC iterations used were determined by inspection of a single long run. [sent-212, score-0.146]
72 For example, with 20 predictors we used 3, 8 and 21 as candidate values for the number of hidden units. [sent-222, score-0.103]
73 Finally, in order to enable performance comparisons across all datasets, after possible nonlinear transformation, the resultant response was scaled to have sample mean 0 and standard deviation 1 prior to any train/test splitting. [sent-229, score-0.119]
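The scaling step can be sketched as follows. The use of the n − 1 (sample) standard deviation is an assumption, since the text does not say which divisor was used, and the function name is hypothetical.

```python
import statistics


def standardize(y):
    """Scale the response to sample mean 0 and sample standard
    deviation 1 (n - 1 divisor, an assumption of this sketch)."""
    m = statistics.fmean(y)
    s = statistics.stdev(y)
    return [(v - m) / s for v in y]


z = standardize([1.0, 2.0, 3.0, 4.0])
```

Applying the transformation to the whole response before any train/test split, as described, puts RMSE values from different datasets on a common scale.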
74 Each learner has the following percentage of ratios larger than 2. [sent-249, score-0.095]
75 0, which are not plotted above: Neural net: 5%, BART-cv: 6%, BART-default and Boosting: 7%, Random forests 10% and Lasso 21%. [sent-250, score-0.115]
76 each of the 840 experiments, the learner with smallest RMSE was identified. [sent-251, score-0.095]
77 The relative ratio for each learner is the raw RMSE divided by the smallest RMSE. [sent-252, score-0.095]
78 Thus a relative RMSE of 1 means that the learner had the best performance in a particular experiment. [sent-253, score-0.095]
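The relative RMSE for one experiment is a one-line computation. In this sketch the learner names and RMSE values are made up for illustration.

```python
def relative_rmse(rmse_by_learner):
    """Divide each learner's raw RMSE by the smallest RMSE
    observed in the same experiment."""
    best = min(rmse_by_learner.values())
    return {name: rmse / best for name, rmse in rmse_by_learner.items()}


# Hypothetical raw RMSEs for a single train/test split.
ratios = relative_rmse({"learner_a": 2.0, "learner_b": 3.0, "learner_c": 2.5})
```

The best learner in the experiment always gets a ratio of exactly 1, and every other ratio shows how much worse that learner was on the same split.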
79 The strong performance of our “default” ensemble is especially noteworthy, since it requires no selection of operational parameters. [sent-261, score-0.139]
80 This results in a huge computational savings, since under cross-validation, the number of times a learner must be trained is equal to the number of settings times the number of folds. [sent-263, score-0.095]
81 Not only is the default version of BART faster, but it also provides valid statistical inference, a benefit not available to any of the other learners considered. [sent-270, score-0.142]
82 We applied BART to all 506 observations of the Boston Housing data using the default setting (ν, q, k, m) = (3, 0. [sent-272, score-0.111]
83 At each of the 506 predictor values x, we used 5% and 95% quantiles of the MCMC draws to obtain 90% posterior intervals for f (x). [sent-274, score-0.291]
84 An appealing feature of these posterior intervals is that they widen when there is less information about f (x). [sent-275, score-0.197]
85 (b) Partial dependence plot for the effect of crime on the response (log median property value), with 90% uncertainty bounds. [sent-295, score-0.227]
86 To see how the width of the 90% posterior intervals corresponded to Dx , we plotted them together in Figure 3(a). [sent-297, score-0.153]
87 Since BART provides posterior draws for f (x), calculation of a posterior distribution for the partial dependence function is straightforward. [sent-300, score-0.274]
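One way to read "straightforward" here: for every posterior draw f*, set the predictor of interest to a chosen value in each training row, evaluate f*, and average over the rows. This gives one draw of the partial dependence function per MCMC draw. The sketch below assumes each draw is available as a Python callable; that representation, the function name, and the toy draws are hypothetical.

```python
def partial_dependence_draws(f_draws, X, var, value):
    """For each posterior draw f*, average f*(x) over the data
    with component `var` of every row replaced by `value`."""
    out = []
    for f in f_draws:
        total = 0.0
        for x in X:
            x_mod = list(x)
            x_mod[var] = value
            total += f(x_mod)
        out.append(total / len(X))
    return out


# Two toy "posterior draws" of f and a two-row dataset.
f_draws = [lambda x: x[0] + 2 * x[1], lambda x: 2 * x[0] + x[1]]
X = [[0.0, 1.0], [0.0, 3.0]]
pd_draws = partial_dependence_draws(f_draws, X, var=0, value=1.0)
```

Quantiles of the returned list at each value of the predictor give uncertainty bounds for the partial dependence curve.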
88 For the Boston Housing data, Figure 3(b) shows the partial dependence plot for crime, with 90% posterior intervals. [sent-303, score-0.087]
89 The vast majority of data values occur for crime < 5, causing the intervals to widen as crime increases and the data become more sparse. [sent-304, score-0.302]
90 5 Discussion Our approach is a fully Bayesian approach to learning with ensembles of tree models. [sent-305, score-0.188]
91 Because of the nature of the underlying tree model, we are able to specify simple, effective priors and fully exploit the benefits of Bayesian methodology. [sent-306, score-0.188]
92 Our prior provides the regularization needed to obtain good predictive performance. [sent-307, score-0.174]
93 In particular, our default prior, which is minimally dependent on the data, performs well compared to other methods which rely on cross-validation to pick model parameters. [sent-308, score-0.153]
94 We obtain inference in the natural Bayesian way from the variation in the posterior draws. [sent-309, score-0.087]
95 In this case, gauging the inferential uncertainty is essential. [sent-311, score-0.116]
96 A common concern with Bayesian approaches is sensitivity to prior parameters. [sent-321, score-0.088]
97 (2006) found that results were robust to a reasonably wide range of prior parameters, including ν, q, σµ , as well as the number of trees, m. [sent-323, score-0.088]
98 (2006), BART: Bayesian additive regression trees, Technical report, University of Chicago. [sent-351, score-0.122]
99 (2001), ‘Greedy function approximation: A gradient boosting machine’, The Annals of Statistics 29, 1189–1232. [sent-366, score-0.138]
100 (2007), ‘Bayesian CART: Prior specification and posterior simulation’, Journal of Computational and Graphical Statistics. [sent-393, score-0.087]
wordName wordTfidf (topN-words)
[('qq', 0.537), ('chipman', 0.308), ('qqq', 0.198), ('mcculloch', 0.192), ('tree', 0.188), ('bart', 0.175), ('terminal', 0.174), ('mcmc', 0.167), ('rmse', 0.147), ('boosting', 0.138), ('trees', 0.123), ('forests', 0.115), ('default', 0.111), ('draws', 0.1), ('lasso', 0.098), ('cook', 0.096), ('crime', 0.096), ('learner', 0.095), ('prior', 0.088), ('quantile', 0.087), ('tm', 0.087), ('posterior', 0.087), ('ti', 0.084), ('back', 0.078), ('george', 0.077), ('draw', 0.075), ('bayesian', 0.074), ('ensemble', 0.074), ('node', 0.073), ('predictors', 0.07), ('tting', 0.069), ('uncertainty', 0.067), ('abreveya', 0.066), ('intervals', 0.066), ('operational', 0.065), ('mm', 0.065), ('additive', 0.063), ('regression', 0.059), ('mi', 0.057), ('loh', 0.057), ('predictive', 0.056), ('friedman', 0.055), ('gi', 0.053), ('housing', 0.052), ('tibshirani', 0.05), ('inferential', 0.049), ('boston', 0.048), ('acadia', 0.044), ('cart', 0.044), ('qqqq', 0.044), ('shih', 0.044), ('thiesson', 0.044), ('tjelmeland', 0.044), ('widen', 0.044), ('pick', 0.042), ('shrinkage', 0.042), ('categorical', 0.04), ('df', 0.04), ('hastie', 0.039), ('chaudhuri', 0.038), ('quantiles', 0.038), ('whiskers', 0.038), ('units', 0.038), ('nodes', 0.037), ('accomplished', 0.036), ('dx', 0.036), ('freund', 0.035), ('mixes', 0.035), ('breiman', 0.035), ('meek', 0.035), ('replicates', 0.035), ('schapire', 0.034), ('median', 0.033), ('hidden', 0.033), ('reversible', 0.033), ('business', 0.033), ('heckerman', 0.033), ('overestimate', 0.033), ('sigma', 0.033), ('vastly', 0.033), ('datasets', 0.032), ('considered', 0.032), ('canada', 0.032), ('chicago', 0.032), ('net', 0.032), ('interaction', 0.031), ('effects', 0.031), ('response', 0.031), ('aggressive', 0.031), ('efron', 0.031), ('johnstone', 0.031), ('learners', 0.031), ('et', 0.03), ('regularization', 0.03), ('grow', 0.029), ('hill', 0.029), ('subtracting', 0.028), ('comments', 0.028), ('named', 0.028), ('cv', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 41 nips-2006-Bayesian Ensemble Learning
Author: Hugh A. Chipman, Edward I. George, Robert E. McCulloch
Abstract: We develop a Bayesian “sum-of-trees” model, named BART, where each tree is constrained by a prior to be a weak learner. Fitting and inference are accomplished via an iterative backfitting MCMC algorithm. This model is motivated by ensemble methods in general, and boosting algorithms in particular. Like boosting, each weak learner (i.e., each weak tree) contributes a small amount to the overall model. However, our procedure is defined by a statistical model: a prior and a likelihood, while boosting is defined by an algorithm. This model-based approach enables a full and accurate assessment of uncertainty in model predictions, while remaining highly competitive in terms of predictive accuracy.
2 0.12909082 78 nips-2006-Fast Discriminative Visual Codebooks using Randomized Clustering Forests
Author: Frank Moosmann, Bill Triggs, Frederic Jurie
Abstract: Some of the most effective recent methods for content-based image classification work by extracting dense or sparse local image descriptors, quantizing them according to a coding rule such as k-means vector quantization, accumulating histograms of the resulting “visual word” codes over the image, and classifying these with a conventional classifier such as an SVM. Large numbers of descriptors and large codebooks are needed for good results and this becomes slow using k-means. We introduce Extremely Randomized Clustering Forests – ensembles of randomly created clustering trees – and show that these provide more accurate results, much faster training and testing and good resistance to background clutter in several state-of-the-art image classification tasks.
3 0.11484914 172 nips-2006-Scalable Discriminative Learning for Natural Language Parsing and Translation
Author: Joseph Turian, Benjamin Wellington, I. D. Melamed
Abstract: Parsing and translating natural languages can be viewed as problems of predicting tree structures. For machine learning approaches to these predictions, the diversity and high dimensionality of the structures involved mandate very large training sets. This paper presents a purely discriminative learning method that scales up well to problems of this size. Its accuracy was at least as good as other comparable methods on a standard parsing task. To our knowledge, it is the first purely discriminative learning algorithm for translation with tree-structured models. Unlike other popular methods, this method does not require a great deal of feature engineering a priori, because it performs feature selection over a compound feature space as it learns. Experiments demonstrate the method’s versatility, accuracy, and efficiency. Relevant software is freely available at http://nlp.cs.nyu.edu/parser and http://nlp.cs.nyu.edu/GenPar.
4 0.078619815 115 nips-2006-Learning annotated hierarchies from relational data
Author: Daniel M. Roy, Charles Kemp, Vikash K. Mansinghka, Joshua B. Tenenbaum
Abstract: The objects in many real-world domains can be organized into hierarchies, where each internal node picks out a category of objects. Given a collection of features and relations defined over a set of objects, an annotated hierarchy includes a specification of the categories that are most useful for describing each individual feature and relation. We define a generative model for annotated hierarchies and the features and relations that they describe, and develop a Markov chain Monte Carlo scheme for learning annotated hierarchies. We show that our model discovers interpretable structure in several real-world data sets.
5 0.0780598 69 nips-2006-Distributed Inference in Dynamical Systems
Author: Stanislav Funiak, Carlos Guestrin, Rahul Sukthankar, Mark A. Paskin
Abstract: We present a robust distributed algorithm for approximate probabilistic inference in dynamical systems, such as sensor networks and teams of mobile robots. Using assumed density filtering, the network nodes maintain a tractable representation of the belief state in a distributed fashion. At each time step, the nodes coordinate to condition this distribution on the observations made throughout the network, and to advance this estimate to the next time step. In addition, we identify a significant challenge for probabilistic inference in dynamical systems: message losses or network partitions can cause nodes to have inconsistent beliefs about the current state of the system. We address this problem by developing distributed algorithms that guarantee that nodes will reach an informative consistent distribution when communication is re-established. We present a suite of experimental results on real-world sensor data for two real sensor network deployments: one with 25 cameras and another with 54 temperature sensors.
6 0.077391364 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors
7 0.072404034 9 nips-2006-A Nonparametric Bayesian Method for Inferring Features From Similarity Judgments
8 0.064046957 190 nips-2006-The Neurodynamics of Belief Propagation on Binary Markov Random Fields
9 0.063768663 132 nips-2006-Modeling Dyadic Data with Binary Latent Factors
10 0.05827171 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions
11 0.056608357 139 nips-2006-Multi-dynamic Bayesian Networks
12 0.055941433 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
13 0.055300336 135 nips-2006-Modelling transcriptional regulation using Gaussian Processes
14 0.055186182 55 nips-2006-Computation of Similarity Measures for Sequential Data using Generalized Suffix Trees
15 0.053933941 50 nips-2006-Chained Boosting
16 0.053171631 161 nips-2006-Particle Filtering for Nonparametric Bayesian Matrix Factorization
17 0.053101104 1 nips-2006-A Bayesian Approach to Diffusion Models of Decision-Making and Response Time
18 0.053010602 159 nips-2006-Parameter Expanded Variational Bayesian Methods
19 0.052961752 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods
20 0.052695349 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization
topicId topicWeight
[(0, -0.175), (1, 0.01), (2, 0.03), (3, -0.117), (4, 0.062), (5, 0.052), (6, 0.052), (7, -0.019), (8, -0.025), (9, -0.068), (10, -0.013), (11, 0.069), (12, 0.064), (13, 0.066), (14, -0.053), (15, -0.032), (16, 0.014), (17, 0.128), (18, -0.06), (19, 0.053), (20, -0.138), (21, 0.012), (22, -0.094), (23, -0.029), (24, -0.101), (25, 0.07), (26, -0.022), (27, 0.001), (28, -0.141), (29, 0.022), (30, 0.031), (31, -0.042), (32, 0.089), (33, 0.126), (34, -0.015), (35, 0.15), (36, 0.053), (37, 0.091), (38, -0.078), (39, 0.034), (40, -0.001), (41, -0.035), (42, -0.139), (43, 0.068), (44, -0.004), (45, -0.066), (46, 0.032), (47, 0.086), (48, 0.102), (49, 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.93228108 41 nips-2006-Bayesian Ensemble Learning
Author: Hugh A. Chipman, Edward I. George, Robert E. McCulloch
Abstract: We develop a Bayesian “sum-of-trees” model, named BART, where each tree is constrained by a prior to be a weak learner. Fitting and inference are accomplished via an iterative backfitting MCMC algorithm. This model is motivated by ensemble methods in general, and boosting algorithms in particular. Like boosting, each weak learner (i.e., each weak tree) contributes a small amount to the overall model. However, our procedure is defined by a statistical model: a prior and a likelihood, while boosting is defined by an algorithm. This model-based approach enables a full and accurate assessment of uncertainty in model predictions, while remaining highly competitive in terms of predictive accuracy.
2 0.59931248 172 nips-2006-Scalable Discriminative Learning for Natural Language Parsing and Translation
Author: Joseph Turian, Benjamin Wellington, I. D. Melamed
Abstract: Parsing and translating natural languages can be viewed as problems of predicting tree structures. For machine learning approaches to these predictions, the diversity and high dimensionality of the structures involved mandate very large training sets. This paper presents a purely discriminative learning method that scales up well to problems of this size. Its accuracy was at least as good as other comparable methods on a standard parsing task. To our knowledge, it is the first purely discriminative learning algorithm for translation with tree-structured models. Unlike other popular methods, this method does not require a great deal of feature engineering a priori, because it performs feature selection over a compound feature space as it learns. Experiments demonstrate the method’s versatility, accuracy, and efficiency. Relevant software is freely available at http://nlp.cs.nyu.edu/parser and http://nlp.cs.nyu.edu/GenPar.
3 0.51559401 139 nips-2006-Multi-dynamic Bayesian Networks
Author: Karim Filali, Jeff A. Bilmes
Abstract: We present a generalization of dynamic Bayesian networks to concisely describe complex probability distributions such as in problems with multiple interacting variable-length streams of random variables. Our framework incorporates recent graphical model constructs to account for existence uncertainty, value-specific independence, aggregation relationships, and local and global constraints, while still retaining a Bayesian network interpretation and efficient inference and learning techniques. We introduce one such general technique, which is an extension of Value Elimination, a backtracking search inference algorithm. Multi-dynamic Bayesian networks are motivated by our work on Statistical Machine Translation (MT). We present results on MT word alignment in support of our claim that MDBNs are a promising framework for the rapid prototyping of new MT systems. 1 INTRODUCTION The description of factorization properties of families of probabilities using graphs (i.e., graphical models, or GMs), has proven very useful in modeling a wide variety of statistical and machine learning domains such as expert systems, medical diagnosis, decision making, speech recognition, and natural language processing. There are many different types of graphical model, each with its own properties and benefits, including Bayesian networks, undirected Markov random fields, and factor graphs. Moreover, for different types of scientific modeling, different types of graphs are more or less appropriate. For example, static Bayesian networks are quite useful when the size of set of random variables in the domain does not grow or shrink for all data instances and queries of interest. Hidden Markov models (HMMs), on the other hand, are such that the number of underlying random variables changes depending on the desired length (which can be a random variable), and HMMs are applicable even without knowing this length as they can be extended indefinitely using online inference. 
HMMs have been generalized to dynamic Bayesian networks (DBNs) and temporal conditional random fields (CRFs), where an underlying set of variables gets repeated as needed to fill any finite but unbounded length. Probabilistic relational models (PRMs) [5] allow for a more complex template that can be expanded in multiple dimensions simultaneously. An attribute common to all of the above cases is that the specification of rules for expanding any particular instance of a model is finite. In other words, these forms of GM allow the specification of models with an unlimited number of random variables (RVs) using a finite description. This is achieved using parameter tying, so while the number of RVs increases without bound, the number of parameters does not. In this paper, we introduce a new class of model we call multi-dynamic Bayesian networks. MDBNs are motivated by our research into the application of graphical models to the domain of statistical machine translation (MT) and they have two key attributes from the graphical modeling perspective. First, an MDBN generalizes a DBN in that there are multiple “streams” of variables that can get unrolled, but where each stream may be unrolled by a differing amount. In the most general case, connecting these different streams together would require the specification of conditional probabil- ity tables with a varying and potentially unlimited number of parents. To avoid this problem and retain the template’s finite description length, we utilize a switching parent functionality (also called value-specific independence). Second, in order to capture the notion of fertility in MT-systems (defined later in the text), we employ a form of existence uncertainty [7] (that we call switching existence), whereby the existence of a given random variable might depend on the value of other random variables in the network. Being fully propositional, MDBNs lie between DBNs and PRMs in terms of expressiveness. 
While PRMs are capable of describing any MDBN, there are, in general, advantages to restricting ourselves to a more specific class of model. For example, in the DBN case, it is possible to provide a bound on inference costs just by looking at attributes of the DBN template only (e.g., the left or right interfaces [12, 2]). Restricting the model can also make it simpler to use in practice. MDBNs are still relatively simple, while at the same time making possible the easy expression of MT systems, and opening doors to novel forms of probabilistic inference as we show below. In section 2, we introduce MDBNs, and describe their application to machine translation showing how it is possible to represent even complex MT systems. In section 3, we describe MDBN learning and decoding algorithms. In section 4, we present experimental results in the area of statistical machine translation, and future work is discussed in section 5. 2 MDBNs A standard DBN [4] template consists of a directed acyclic graph G = (V, E) = (V1 ∪ V2 , E1 ∪ → E2 ∪ E2 ) with node set V and edge set E. For t ∈ {1, 2}, the sets Vt are the nodes at slice t, Et → are the intra-slice edges between nodes in Vt , and Et are the inter-slice edges between nodes in V1 and V2 . To unroll a DBN to length T , the nodes V2 along with the edges adjacent to any node in V2 are cloned T − 1 times (where parameters of cloned variables are constrained to be the same as the template) and re-connected at the corresponding places. An MDBN with K streams consists of the union of K DBN templates along with a template structure specifying rules to connect the various streams together. An MDBN template is a directed graph (k) G = (V, E) = ( V (k) , E (k) ∪ E ) k (k) (k) th k (k) where (V , E ) is the k DBN, and the edges E are rules specifying how to connect stream k to the other streams. These rules are general in that they specify the set of edges for all values of Tk . 
Streams can be nested arbitrarily; for example, it is possible to specify a model that can grow along several dimensions simultaneously. An MDBN also utilizes “switching existence”, meaning some subset of the variables in V bestow existence onto other variables in the network. We call these variables existence bestowing nodes (or eb-nodes). The idea of bestowing existence is well defined over a discrete space, and is not dissimilar to a variable length DBN. For example, we may have a joint distribution over lengths as follows:

$p(X_1, \dots, X_N, N = n) = p(X_1, \dots, X_n \mid N = n)\, p(N = n)$

where N is an eb-node that determines the number of other random variables in the DGM. Our notion of eb-nodes allows us to model certain characteristics found within machine translation systems, such as “fertility” [3], where a given English word is cloned a random number of times in the generative process that explains a translation from French into English. This random cloning might happen simultaneously at all points along a given MDBN stream. This means that even for a given fixed stream length $T_i = t_i$, each stream could have a randomly varying number of random variables. Our graphical notation draws a square box next to each eb-node; the box contains the variables whose existence is determined by that eb-node.

We start by providing a simple example of an expanded MDBN for three well known MT systems, namely the IBM Models 1 and 2 [3], and the “HMM” model [15] (footnote 1). We adopt the convention in [3] that our goal is to translate from a string of French words F = f of length M = m into a string of English words E = e of length L = ℓ; of course, these can be any two languages. The basic generative (noisy channel) approach when translating from French to English is to represent the joint distribution $P(f, e) = P(f \mid e)\, P(e)$. P(e) is a language model specifying the prior over the word string e.

Footnote 1: We will refer to it as M-HMM to avoid confusion with regular HMMs.
The key goal is to produce a finite-description length representation for $P(f \mid e)$ where f and e are of arbitrary length. A hidden alignment string, a, specifies how the English words align to the French words, leading to $P(f \mid e) = \sum_{a} P(f, a \mid e)$. Figure 1(a) is a 2-stream MDBN expanded representation of the three models, in this case with ℓ = 4 and m = 3. As shown, it appears that the fan-in to node $f_i$ will be ℓ and thus will grow without bound. However, a switching mechanism whereby $P(f_i \mid e, a_i) = P(f_i \mid e_{a_i})$ limits the number of parameters regardless of L. This means that the alignment variable $a_i$ indicates the English word $e_{a_i}$ that should be aligned to French word $f_i$. The variable $e_0$ is a null word that connects to French words not explained by any of $e_1, \dots, e_\ell$. The graph expresses all three models; the difference is that, in Models 1 and 2, there are no edges between $a_j$ and $a_{j+1}$. In Model 1, $p(a_j = \ell)$ is uniform on the set {1, ..., L}; in Model 2, the distribution over $a_j$ is a function only of its position j and of the English and French lengths ℓ and m respectively. In the M-HMM model, the $a_i$ variables form a first order Markov chain.

[Figure 1: (a) Models 1, 2 and M-HMM; (b) expanded M3 graph. Expanded 2-stream MDBN description of IBM Models 1 and 2, and the M-HMM model for MT; and the expanded MDBN description of IBM Model 3 with fertility assignment φ0 = 2, φ1 = 3, φ2 = 1, φ3 = 0.]

From the above, we see that it would be difficult to express this model graphically using a standard DBN since L and M are unequal random variables. Indeed, there are two DBNs in operation, one consisting of the English string, and the other consisting of the French string and its alignment.
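The effect of the switching parent can be illustrated with a small Python sketch for the Model 1 case. It is a toy under stated assumptions: the translation table `t`, the sentences, and both function names are hypothetical; the alignment prior is taken as uniform over all positions of `e` (including a NULL token); and the sentence-length term is omitted. Because each French word depends only on the single English word selected by its alignment variable, the sum over all alignments factorizes into a product of per-word sums.

```python
from itertools import product


def likelihood_bruteforce(f, e, t):
    """Sum P(f, a | e) over every alignment a (exponential in len(f))."""
    n = len(e)
    total = 0.0
    for a in product(range(n), repeat=len(f)):
        p = 1.0
        for j, i in enumerate(a):
            # Switching parent: f[j] only looks at e[a_j].
            p *= (1.0 / n) * t.get((f[j], e[i]), 0.0)
        total += p
    return total


def likelihood_factored(f, e, t):
    """Same quantity, using the per-word factorization."""
    n = len(e)
    p = 1.0
    for fj in f:
        p *= sum(t.get((fj, ei), 0.0) for ei in e) / n
    return p


# Hypothetical lexical table t[(french, english)] = P(french | english).
t = {
    ("la", "NULL"): 0.1, ("la", "the"): 0.7, ("la", "house"): 0.05,
    ("maison", "NULL"): 0.05, ("maison", "the"): 0.1, ("maison", "house"): 0.8,
}
e = ["NULL", "the", "house"]
f = ["la", "maison"]
brute = likelihood_bruteforce(f, e, t)
fact = likelihood_factored(f, e, t)
```

The table `t` is indexed by word pairs only, so its size does not depend on the sentence lengths; this is the finite description that the switching mechanism preserves.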
Moreover, the fully connected structure of the graph in the figure can represent the appropriate family of model, but it also represents models whose parameter space grows without bound; the switching function allows the model template to stay finite regardless of L and M.

With our MDBN descriptive abilities complete, it is now possible to describe the more complex IBM Models 3 and 4 [3] (an MDBN for Model 3 is depicted in fig. 1(b)). The top most random variable, ℓ, is a hidden switching existence variable corresponding to the length of the English string. The box abutting ℓ includes all the nodes whose existence depends on the value of ℓ. In the figure, ℓ = 3, thus resulting in three English words $e_1$, $e_2$, and $e_3$ connected using a second-order Markov chain. To each English word $e_i$ corresponds a conditionally dependent fertility eb-node $\phi_i$, which indicates how many times $e_i$ is used by words in the French string. Each $\phi_i$ in turn controls the existence of a set of variables under it. Given the fertilities (the figure depicts the case φ1 = 3, φ2 = 1, φ3 = 0), for each word $e_i$, $\phi_i$ French word variables are granted existence and are denoted by $\tau_{i1}, \tau_{i2}, \dots, \tau_{i\phi_i}$, which together form what is called the tablet [3] of $e_i$. The values taken by the τ variables need to match the actual observed French sequence $f_1, \dots, f_m$. This is represented as a shared constraint between all the f, π, and τ variables which have incoming edges into the observed variable v. v’s conditional probability table is such that it is one only when the associated constraint is satisfied (footnote 2). The variable $\pi_{i,k} \in \{1, \dots, m\}$ is a switching dependency parent with respect to the constraint variable v and determines which $f_j$ participates in an equality constraint with $\tau_{i,k}$.

Footnote 2: This type of encoding of constraints corresponds to the standard mechanism used by Pearl [14]. A naive implementation, however, would enumerate a number of configurations exponential in the number of constrained variables, while typically only a small fraction of the configurations would have positive probability.

The bottom variable m is a switching existence node (observed to be 6 in the figure) with corresponding French word sequence and alignment variables. The French sequence participates in the v constraint described above, while the alignment variables $a_j \in \{1, \dots, \ell\}$, $j \in \{1, \dots, m\}$ constrain the fertilities to take their unique allowable values (for the given alignment). Alignments also restrict the domain of permutation variables, π, using the constraint variable x. Finally, the domain size of each $a_j$ has to lie in the interval [0, ℓ] and that is enforced by the variable u. The dashed edges connecting the alignment a variables represent an extension to implement an M3/M-HMM hybrid.

The null submodel involving the deterministic node $m' = \sum_{i=1}^{\ell} \phi_i$ and eb-node $\phi_0$ accounts for French words that are not explained by any of the English words $e_1, \dots, e_\ell$. In this submodel, successive permutation variables are ordered and this constraint is implemented using the observed child w of $\pi_{0i}$ and $\pi_{0(i+1)}$. Model 4 [3] is similar to Model 3 except that the former is based on a more elaborate distortion model that uses relative instead of absolute positions both within and between tablets.

3 Inference, Parameter Estimation and MPE

Multi-dynamic Bayesian Networks are amenable to any type of inference that is applicable to regular Bayesian networks as long as switching existence relationships are respected and all the constraints (aggregation for example) are satisfied. Unfortunately, DBN inference procedures that take advantage of the repeatable template and can preprocess it offline are not easy to apply to MDBNs. A case in point is the Junction Tree algorithm [11].
Triangulation algorithms exist that create an offline triangulated version of the input graph and do not re-triangulate it for each different instance of the input data [12, 2]. In MDBNs, due to the flexibility to unroll templates in several dimensions and to specify dependencies and constraints spanning the entire unrolled graph, it is not obvious how we can exploit any repetitive patterns in a Junction Tree-style offline triangulation of the graph template. In section 4, we discuss sampling inference methods we have used. Here we discuss our extension to a backtracking search algorithm with the same performance guarantees as the JT algorithm, but with the advantage of easily handling determinism, existence uncertainty, and constraints, both learned and explicitly stated.

Value Elimination (VE) [1] is a backtracking Bayesian network inference technique that caches factors associated with portions of the search tree and uses them to avoid iterating again over the same subtrees. We follow the notation introduced in [1] and refer the reader to that paper for details about VE inference. We have extended the VE inference approach to handle explicitly encoded constraints, existence uncertainty, and to perform approximate local domain pruning (see section 4). We omit these details as well as others in the original paper and briefly describe the main data structure required by VE and sketch the algorithm we refer to as FirstPass (alg. 1) since it constitutes the first step of the learning procedure, our main contribution in this section.

A VE factor, F, is such that we can write the following marginal of the joint distribution

$\sum_{X = x} P(X = x, Y = y, Z) = F.val \times f(Z)$

such that $(X \cup Y) \cap Z = \emptyset$, F.val is a constant, and f(Z) is a function of Z only. Y is a set of variables previously instantiated in the current branch of the search tree to the value vector y. The pair (Y, y) is referred to as a dependency set (F.Dset). X is referred to as a subsumed set (F.Sset).
By caching the tuple (F.Dset, F.Sset, F.val), we avoid recomputing the marginal again whenever (1) F.Dset is active, meaning all nodes stored in F.Dset are assigned their cached values in the current branch of the search tree; and (2) none of the variables in F.Sset are assigned yet.

FirstPass (alg. 1) visits nodes in the graph in depth-first fashion. In line 7, we get the values of all Newly Single-valued (NSV) CPTs, i.e., CPTs that involve the current node, V, and in which all other variables are already assigned (these variables and their values are accumulated into Dset). We also check for factors that are active, multiply their values in, and accumulate subsumed vars in Sset (to avoid branching on them). In line 10, we add V to the Sset. In line 11, we cache a new factor F with value F.val = sum. We store V into F.head, a pointer to the last variable to be inserted into F.Sset, and needed for parameter estimation described below. F.Dset consists of all the variables, except V, that appeared in any NSV CPT or the Dset of an activated factor at line 6.

We use a general directed domain pruning constraint. Deterministic relationships then become a special case of our constraint whereby the domain of the child variable is constrained to a single value with probability one.

Figure 2: Learning example using the Markov chain A → B → C → D → E, where E is observed. In the first pass, factors (Dset, Sset and val) are learned in a bottom up fashion. Also, the normalization constant P(E = e) (probability of evidence) is obtained. In the second pass, tau values are updated in a top-down fashion and used to calculate expected counts c(F.head, pa(F.head)) corresponding to each F.head (the figure shows the derivations for (A=0) and (C=0,B=0), but all counts are updated in the same pass).

Figure 2 annotations. Variable traversal order: A, B, C, and D. Factors are numbered by order of creation. *Fi denotes the activation of factor i. Tau values are propagated recursively.

  Factors shown in the search tree:
    F7: Dset={} Sset={A,B,C,D} val=P(E=e)
    F5: Dset={A=0} Sset={B,C,D}
    F3: Dset={B=0} Sset={C,D}

  First pass, factor values needed for the c(A=0) and c(C=0,B=0) computation:
    F5.val = P(B=0|A=0)*F3.val + P(B=1|A=0)*F4.val
    F3.val = P(C=0|B=0)*F1.val + P(C=1|B=0)*F2.val
    F4.val = P(C=0|B=1)*F1.val + P(C=1|B=1)*F2.val
    F1.val = P(D=0|C=0)P(E=e|D=0) + P(D=1|C=0)P(E=e|D=1)
    F2.val = P(D=0|C=1)P(E=e|D=0) + P(D=1|C=1)P(E=e|D=1)

  Second pass, tau values:
    F7.tau = 1.0 = P(Evidence)/F7.val
    F5.tau = F7.tau * P(A=0)
    F6.tau = F7.tau * P(A=1)
    F3.tau = F5.tau * P(B=0|A=0) + F6.tau * P(B=0|A=1) = P(B=0)
    F4.tau = F5.tau * P(B=1|A=0) + F6.tau * P(B=1|A=1) = P(B=1)
    F1.tau = F3.tau * P(C=0|B=0) + F4.tau * P(C=0|B=1) = P(C=0)
    F2.tau = F3.tau * P(C=1|B=0) + F4.tau * P(C=1|B=1) = P(C=1)

  Expected counts:
    c(A=0) = (1/P(e)) * (F7.tau * P(A=0) * F5.val)
           = (1/P(e)) * (P(A=0) * P(E=e|A=0))
           = P(A=0|E=e)
    c(C=0,B=0) = (1/P(e)) * F3.tau * P(C=0|B=0) * F1.val
               = (1/P(e)) * (P(A=0,B=0) + P(A=1,B=0)) * P(C=0|B=0) * F1.val
               = (1/P(e)) * P(B=0) * P(C=0|B=0) * F1.val
               = (1/P(e)) * P(C=0,B=0) * F1.val
               = P(C=0,B=0,E=e)/P(e) = P(C=0,B=0|E=e)

Regular Value Elimination is query-based, similar to variable elimination and recursive conditioning. This means that to answer a query of the type P(Q | E = e), where Q is a query variable and E a set of evidence nodes, we force Q to be at the top of the search tree, run the backtracking algorithm and then read the answers to the queries P(Q = q | E = e), q ∈ Dom[Q], along each of the outgoing edges of Q. Parameter estimation would require running a number of queries on the order of the number of parameters to estimate. We extend VE into an algorithm that allows us to obtain Expectation Maximization sufficient statistics in a single run of Value Elimination plus a second pass, which can never take longer than the first one (and in practice is much faster). This two-pass procedure is analogous to the collect-distribute evidence procedure in the Junction Tree algorithm, but here we do this via a search tree.
Let $\theta_{X=x \mid pa(X)=y}$ be a parameter associated with variable X with value x and parents Y = pa(X) when they have value y. Assuming a maximum likelihood learning scenario (footnote 3), to estimate $\theta_{X=x \mid pa(X)=y}$, we need to compute

$f(X = x, pa(X) = y, E = e) = \sum_{W \setminus \{X, pa(X)\}} P(W, X = x, pa(X) = y, E = e)$

which is a sum of joint probabilities of all configurations that are consistent with the assignment {X = x, pa(X) = y}. If we were to turn off factor caching, we would enumerate all such variable configurations and could compute the sum. When standard VE factors are used, however, this is no longer possible whenever X or any of its parents becomes subsumed. Fig. 2 illustrates an example of a VE tree and the factors that are learned in the case of a Markov chain with an evidence node at the end. We can readily estimate the parameters associated with variables A and B as they are not subsumed along any branch. C and D become subsumed, however, and we cannot obtain the correct counts along all the branches that would lead to C and D in the full enumeration case.

To address this issue, we store a special value, F.tau, in each factor. F.tau holds the sum over all path probabilities from the first level of the search tree to the level at which the factor F was either created or activated. For example, F6.tau in fig. 2 is simply P(A = 1). Although we can compute F3.tau directly, we can also compute it recursively using F5.tau and F6.tau as shown in the figure. This is because both F5 and F6 subsume F3: in the context {F5.Dset}, there exists a (unique) value $d_{sub}$ of F5.head (footnote 4) s.t. F3 becomes activable. Likewise for F6. We cannot compute F1.tau directly, but we can, recursively, from F3.tau and F4.tau by taking advantage of a similar subsumption relationship.

Footnote 3: For Bayesian networks the likelihood function decomposes such that maximizing the expectation of the complete likelihood is equivalent to maximizing the “local likelihood” of each variable in the network.
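The chain example of fig. 2 can be checked numerically with a short Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the CPT values are made up, all variables are binary, the factor values and tau values are written out by hand following the figure's formulas (with F6.val filled in by symmetry with F5.val), and the results are compared against brute-force enumeration of the joint.

```python
from itertools import product

# Hypothetical CPTs for the binary chain A -> B -> C -> D -> E.
pA = [0.6, 0.4]                  # pA[a]    = P(A=a)
pB = [[0.7, 0.3], [0.2, 0.8]]    # pB[a][b] = P(B=b | A=a)
pC = [[0.9, 0.1], [0.4, 0.6]]    # pC[b][c] = P(C=c | B=b)
pD = [[0.5, 0.5], [0.1, 0.9]]    # pD[c][d] = P(D=d | C=c)
pE = [[0.8, 0.2], [0.3, 0.7]]    # pE[d][e] = P(E=e | D=d)
e = 1                            # observed value of E

# First pass: factor values, bottom-up.
F1 = sum(pD[0][d] * pE[d][e] for d in (0, 1))   # Dset {C=0}
F2 = sum(pD[1][d] * pE[d][e] for d in (0, 1))   # Dset {C=1}
F3 = pC[0][0] * F1 + pC[0][1] * F2              # Dset {B=0}
F4 = pC[1][0] * F1 + pC[1][1] * F2              # Dset {B=1}
F5 = pB[0][0] * F3 + pB[0][1] * F4              # Dset {A=0}
F6 = pB[1][0] * F3 + pB[1][1] * F4              # Dset {A=1}
F7 = pA[0] * F5 + pA[1] * F6                    # P(E=e)

# Second pass: tau values, top-down.
tau7 = 1.0
tau5, tau6 = tau7 * pA[0], tau7 * pA[1]
tau3 = tau5 * pB[0][0] + tau6 * pB[1][0]        # equals P(B=0)

# Expected counts from tau values and cached factor values.
count_a0 = tau7 * pA[0] * F5 / F7
count_c0_b0 = tau3 * pC[0][0] * F1 / F7


def posterior(pred):
    """Brute-force P(pred | E=e) by enumerating A, B, C, D."""
    num = 0.0
    for a, b, c, d in product((0, 1), repeat=4):
        p = pA[a] * pB[a][b] * pC[b][c] * pD[c][d] * pE[d][e]
        if pred(a, b, c, d):
            num += p
    return num / F7


brute_a0 = posterior(lambda a, b, c, d: a == 0)
brute_c0_b0 = posterior(lambda a, b, c, d: c == 0 and b == 0)
```

The two-pass counts agree with enumeration even though C is subsumed under the factors rooted at B, which is the situation the tau values are meant to handle.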
In general, we can show that the following recursive relationship holds:

$$F.tau \leftarrow \sum_{F^{pa} \in \mathcal{F}^{pa}} F^{pa}.tau \times NSV_{F^{pa}.head = d_{sub}} \times \frac{\prod_{F^{act} \in \mathcal{F}^{act}} F^{act}.val}{F.val} \qquad (1)$$

where $\mathcal{F}^{pa}$ is the set of factors that subsume F, $\mathcal{F}^{act}$ is the set of all factors (including F) that become active in the context of $\{F^{pa}.Dset, F^{pa}.head = d_{sub}\}$ and $NSV_{F^{pa}.head = d_{sub}}$ is the product of all newly single valued CPTs under the same context. For top-level factors (not subsumed by any factor), F.tau = Pevidence / F.val, which is 1.0 when there is a unique top-level factor.

Alg. 2 is a simple recursive computation of eq. 1 for each factor. We visit learned factors in the reverse order in which they were learned to ensure that, for any factor F′, F′.tau is incremented (line 13) by any F that might have activated F′ (line 12). For example, in fig. 2, F4 uses F1 and F2, so F4.tau needs to be updated before F1.tau and F2.tau. In line 11, we can increment the counts for any NSV CPT entries since F.tau will account for the possible ways of reaching the configuration {F.Dset, F.head = d} in an equivalent full enumeration tree.

Algorithm 1: FirstPass(level)
  Input: Graph G
  Output: A list of learned factors and Pevidence
   1  Select var V to branch on
   2  if V == NONE then return
   3  Sset = {}, Dset = {}
   4  for d ∈ Dom[V] do
   5      V ← d
   6      prod = productOfAllNSVsAndActiveFactors(Dset, Sset)
   7      if prod != 0 then FirstPass(level+1)
   8      sum += prod
   9  Sset = Sset ∪ {V}
  10  cacheNewFactor(F.head ← V, F.val ← sum, F.Sset ← Sset, F.Dset ← Dset)

Algorithm 2: SecondPass()
  Input: F: list of factors in the reverse order learned in the first pass, and Pevidence
  Result: Updated counts
   1  foreach F ∈ F do
   2      if F.Dset = {} then
   3          F.tau ← Pevidence / F.val
   4      else
   5          F.tau ← 0.0
   6      Assign vars in F.Dset to their values
   7      V ← F.head (last node to have been subsumed in this factor)
   8      foreach d ∈ Dom[V] do
   9          prod = productOfAllNSVsAndActiveFactors()
  10          prod *= F.tau
  11          foreach newly single-valued CPT C do count(C.child, C.parents) += prod / Pevidence
  12          F′ = getListOfActiveFactors()
  13          for F′ ∈ F′ do F′.tau += prod / F′.val

Most Probable Explanation. We compute MPE using a very similar two-pass algorithm. In the first pass, factors are used to store a maximum instead of a summation over variables in the Sset. We also keep track of the value of F.head at which the maximum is achieved. In the second pass, we recursively find the optimal variable configuration by following the trail of factors that are activated when we assign each F.head variable to its maximum value starting from the last learned factor.

Footnote 4: Recall, F.head is the last variable to be added to a newly created factor in line 10 of alg. 1.

4 MACHINE TRANSLATION WORD ALIGNMENT EXPERIMENTS

A major motivation for pursuing the type of representation and inference described above is to make it possible to solve computationally-intensive real-world problems using large amounts of data, while retaining the full generality and expressiveness afforded by the MDBN modeling language. In the experiments below we compare running times of MDBNs to GIZA++ on IBM Models 1 through 4 and the M-HMM model. GIZA++ is a special-purpose optimized MT word alignment C++ tool that is widely used in current state-of-the-art phrase-based MT systems [10] and at the time of this writing is the only publicly available software that implements all of the IBM Models. We test on 107 hand-aligned French-English sentences (footnote 5) from a corpus of the European parliament proceedings (Europarl [9]) and train on 10000 sentence pairs from the same corpus, with a maximum sentence length of 40 words.
The Alignment Error Rate (AER) [13] evaluation metric quantifies how well the MPE assignment to the hidden alignment variables matches human-generated alignments. Several pruning and smoothing techniques are used by GIZA and MDBNs. GIZA prunes low lexical (P(f|e)) probability values and uses a default small value for unseen (or pruned) probability table entries. For Models 3 and 4, for which there is no known polynomial time algorithm to perform the full E-step or compute MPE, GIZA generates a set of high probability alignments using an M-HMM and hill-climbing and collects EM counts over these alignments using M3 or M4. For MDBN models we use the following pruning strategy: at each level of the search tree we prune values which, together, account for the lowest specified percentage of the total probability mass of the product of all newly active CPTs in line 6 of alg. 1. This is a more effective pruning than simply removing low-probability values of each CPD because it factors in the joint contribution of multiple active variables.

Table 1 shows a comparison of timing numbers obtained with GIZA++ and MDBNs. The runtime numbers shown are for the combined tasks of training and decoding; however, training time dominates given the difference in size between train and test sets. For Models 1 and 2 neither GIZA nor MDBNs perform any pruning. For the M-HMM, we prune 60% of probability mass at each level and use a Dirichlet prior over the alignment variables such that long-range transitions are exponentially less likely than shorter ones (footnote 6). This model achieves similar times and AER to GIZA’s. Interestingly, without any pruning, the MDBN M-HMM takes 160 minutes to complete while only marginally improving upon the pruned model. Experimenting with several pruning thresholds, we found that AER would worsen much more slowly than runtime decreases.
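The mass-based pruning rule described above can be sketched in Python. This is a minimal sketch under stated assumptions: `scores` is a hypothetical mapping from each candidate value of the branching variable to the product of its newly active CPT entries, `frac` is the fraction of total mass to prune, and the guard that always keeps the top-scoring value is an added design choice, not something stated in the text.

```python
def prune_low_mass(scores, frac):
    """Drop the lowest-scoring values whose combined mass is at most
    frac of the total; always keep the single best value."""
    if not 0.0 <= frac <= 1.0:
        raise ValueError("frac must lie in [0, 1]")
    budget = frac * sum(scores.values())
    ordered = sorted(scores, key=scores.get)   # ascending by score
    kept = set(ordered)
    dropped_mass = 0.0
    for value in ordered[:-1]:                 # never drop the top value
        if dropped_mass + scores[value] > budget:
            break
        dropped_mass += scores[value]
        kept.discard(value)
    return kept


# Hypothetical scores for four candidate values at one search level.
scores = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
kept_25 = prune_low_mass(scores, 0.25)
kept_60 = prune_low_mass(scores, 0.60)
```

Because the budget is applied to the joint product rather than to each table separately, a value that looks unlikely under one CPT can survive if the other active CPTs support it, which is the advantage the text attributes to this rule.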
Models 3 and 4 have treewidth equal to the number of alignment variables (because of the global constraints tying them) and therefore require approximate inference. Using Model 3, and a drastic pruning threshold that only keeps the value with the top probability at each level, we were able to achieve an AER not much higher than GIZA’s. For M4, GIZA achieves a best AER of 31.7%, while we do not improve upon Model 3, most likely because of overly restrictive pruning. Nevertheless, a simple variation on Model 3 in the MDBN framework achieves a lower AER than our regular M3 (with pruning still the same). The M3-HMM hybrid model combines the Markov alignment dependencies from the M-HMM model with the fertility model of M3.

MCMC Inference. Sampling is widely used for inference in high-treewidth models. Although MDBNs support Likelihood Weighting, it is very inefficient when the probability of evidence is very small, as is the case in our MT models. Besides being slow, Markov chain Monte Carlo can be problematic when the joint distribution is not positive everywhere, in particular in the presence of determinism and hard constraints. Techniques such as blocking Gibbs sampling [8] try to address the problem. Often, however, one has to carefully choose a problem-dependent proposal distribution. We used MCMC to improve training of the M3-HMM model. We were able to achieve an AER of 32.8% (down from 39.1%) but using 400 minutes of uniprocessor time.

Footnote 5: Available at http://www.cs.washington.edu/homes/karim

Footnote 6: French and English have similar word orders. On a different language pair, a different prior might be more appropriate. With a uniform prior, the MDBN M-HMM has 36.0% AER.

Table 1: MDBN VE-based learning versus GIZA++ timings and %AER using 5 EM iterations. The columns M1 and M-HMM correspond to the model that is used to initialize the model in the corresponding row. The last row is a hybrid Model3-HMM model that we implemented using MDBNs and is not expressible using GIZA.

  Model     GIZA++           GIZA++           MDBN             MDBN             MDBN
            Init M1          Init M-HMM       Init M1          Init M-HMM       MCMC
  M1        1m45s (47.7%)    N/A              3m20s (48.0%)    N/A              -
  M2        2m02s (41.3%)    N/A              5m30s (41.0%)    N/A              -
  M-HMM     4m05s (35.0%)    N/A              4m15s (33.0%)    N/A              -
  M3        2m50 (45%)       5m20s (38.5%)    12m (43.6%)      9m (42.5%)       -
  M4        5m20s (34.8%)    7m45s (31.7%)    25m (43.6%)      23m (42.6%)      -
  M3-HMM    N/A              N/A              9m30 (41.0%)     9m15s (39.1%)    400m (32.8%)

5 CONCLUSION

The existing classes of graphical models are not ideally suited for representing SMT models because “natural” semantics for specifying the latter combine flavors of different GM types on top of standard directed Bayesian network semantics: switching parents found in Bayesian Multinets [6], aggregation relationships such as in Probabilistic Relational Models [5], and existence uncertainty [7]. We have introduced a generalization of dynamic Bayesian networks to easily and concisely build models consisting of varying-length parallel asynchronous and interacting data streams. We have shown that our framework is useful for expressing various statistical machine translation models. We have also introduced new parameter estimation and decoding algorithms using exact and approximate search-based probability computation. While our timing results are not yet as fast as a hand-optimized C++ program on the equivalent model, we have shown that even in this general-purpose framework of MDBNs, our timing numbers are competitive and usable. Our framework can of course do much more than the IBM and HMM models. One of our goals is to use this framework to rapidly prototype novel MT systems and develop methods to statistically induce an interlingua. We also intend to use MDBNs in other domains such as multi-party social interaction analysis.

References

[1] F. Bacchus, S. Dalmao, and T. Pitassi. Value elimination: Bayesian inference via backtracking search. In UAI-03, pages 20–28, San Francisco, CA, 2003. Morgan Kaufmann.
[2] J. Bilmes and C. Bartels. On triangulating dynamic graphical models. In Uncertainty in Artificial Intelligence: Proceedings of the 19th Conference, pages 47–56. Morgan Kaufmann, 2003.
[3] P. F. Brown, J. Cocke, S. A. Della Pietra, V. J. Della Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer, and P. S. Roossin. A statistical approach to machine translation. Computational Linguistics, 16(2):79–85, June 1990.
[4] T. Dean and K. Kanazawa. Probabilistic temporal reasoning. AAAI, pages 524–528, 1988.
[5] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic relational models. In IJCAI, pages 1300–1309, 1999.
[6] D. Geiger and D. Heckerman. Knowledge representation and inference in similarity networks and Bayesian multinets. Artif. Intell., 82(1-2):45–74, 1996.
[7] L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic models of link structure. Journal of Machine Learning Research, 3(4-5):697–707, May 2003.
[8] C. Jensen, A. Kong, and U. Kjaerulff. Blocking Gibbs sampling in very large probabilistic expert systems. In International Journal of Human Computer Studies. Special Issue on Real-World Applications of Uncertain Reasoning, 1995.
[9] P. Koehn. Europarl: A multilingual corpus for evaluation of machine translation. http://www.isi.edu/koehn/publications/europarl, 2002.
[10] P. Koehn, F. Och, and D. Marcu. Statistical phrase-based translation. In NAACL/HLT 2003, 2003.
[11] S. Lauritzen. Graphical Models. Oxford Science Publications, 1996.
[12] K. Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, U.C. Berkeley, Dept. of EECS, CS Division, 2002.
[13] F. J. Och and H. Ney. Improved statistical alignment models. In ACL, pages 440–447, Oct 2000.
[14] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 2nd printing edition, 1988.
[15] S. Vogel, H. Ney, and C. Tillmann. HMM-based word alignment in statistical translation. In Proceedings of the 16th conference on Computational linguistics, pages 836–841, Morristown, NJ, USA, 1996.
4 0.50974888 78 nips-2006-Fast Discriminative Visual Codebooks using Randomized Clustering Forests
Author: Frank Moosmann, Bill Triggs, Frederic Jurie
Abstract: Some of the most effective recent methods for content-based image classification work by extracting dense or sparse local image descriptors, quantizing them according to a coding rule such as k-means vector quantization, accumulating histograms of the resulting “visual word” codes over the image, and classifying these with a conventional classifier such as an SVM. Large numbers of descriptors and large codebooks are needed for good results and this becomes slow using k-means. We introduce Extremely Randomized Clustering Forests – ensembles of randomly created clustering trees – and show that these provide more accurate results, much faster training and testing and good resistance to background clutter in several state-of-the-art image classification tasks.
5 0.50955623 23 nips-2006-Adaptor Grammars: A Framework for Specifying Compositional Nonparametric Bayesian Models
Author: Mark Johnson, Thomas L. Griffiths, Sharon Goldwater
Abstract: This paper introduces adaptor grammars, a class of probabilistic models of language that generalize probabilistic context-free grammars (PCFGs). Adaptor grammars augment the probabilistic rules of PCFGs with “adaptors” that can induce dependencies among successive uses. With a particular choice of adaptor, based on the Pitman-Yor process, nonparametric Bayesian models of language using Dirichlet processes and hierarchical Dirichlet processes can be written as simple grammars. We present a general-purpose inference algorithm for adaptor grammars, making it easy to define and use such models, and illustrate how several existing nonparametric Bayesian models can be expressed within this framework.
6 0.48686615 115 nips-2006-Learning annotated hierarchies from relational data
7 0.46619457 142 nips-2006-Mutagenetic tree Fisher kernel improves prediction of HIV drug resistance from viral genotype
8 0.4263362 192 nips-2006-Theory and Dynamics of Perceptual Bistability
9 0.41248459 90 nips-2006-Hidden Markov Dirichlet Process: Modeling Genetic Recombination in Open Ancestral Space
10 0.40287581 1 nips-2006-A Bayesian Approach to Diffusion Models of Decision-Making and Response Time
11 0.39305329 69 nips-2006-Distributed Inference in Dynamical Systems
12 0.39291233 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions
13 0.36639687 114 nips-2006-Learning Time-Intensity Profiles of Human Activity using Non-Parametric Bayesian Models
14 0.35018399 55 nips-2006-Computation of Similarity Measures for Sequential Data using Generalized Suffix Trees
15 0.34097791 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation
16 0.3289994 9 nips-2006-A Nonparametric Bayesian Method for Inferring Features From Similarity Judgments
17 0.32786301 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure
18 0.32047707 155 nips-2006-Optimal Single-Class Classification Strategies
19 0.31831685 180 nips-2006-Speakers optimize information density through syntactic reduction
20 0.31168839 43 nips-2006-Bayesian Model Scoring in Markov Random Fields
topicId topicWeight
[(1, 0.089), (3, 0.021), (7, 0.071), (9, 0.027), (20, 0.057), (22, 0.087), (44, 0.061), (57, 0.087), (64, 0.013), (65, 0.051), (66, 0.289), (69, 0.031), (71, 0.023), (90, 0.01)]
simIndex simValue paperId paperTitle
same-paper 1 0.78145057 41 nips-2006-Bayesian Ensemble Learning
Author: Hugh A. Chipman, Edward I. George, Robert E. Mcculloch
Abstract: We develop a Bayesian “sum-of-trees” model, named BART, where each tree is constrained by a prior to be a weak learner. Fitting and inference are accomplished via an iterative backfitting MCMC algorithm. This model is motivated by ensemble methods in general, and boosting algorithms in particular. Like boosting, each weak learner (i.e., each weak tree) contributes a small amount to the overall model. However, our procedure is defined by a statistical model: a prior and a likelihood, while boosting is defined by an algorithm. This model-based approach enables a full and accurate assessment of uncertainty in model predictions, while remaining highly competitive in terms of predictive accuracy.
2 0.68428117 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions
Author: Christopher J. Burges, Robert Ragno, Quoc V. Le
Abstract: The quality measures used in information retrieval are particularly difficult to optimize directly, since they depend on the model scores only through the sorted order of the documents returned for a given query. Thus, the derivatives of the cost with respect to the model parameters are either zero, or are undefined. In this paper, we propose a class of simple, flexible algorithms, called LambdaRank, which avoids these difficulties by working with implicit cost functions. We describe LambdaRank using neural network models, although the idea applies to any differentiable function class. We give necessary and sufficient conditions for the resulting implicit cost function to be convex, and we show that the general method has a simple mechanical interpretation. We demonstrate significantly improved accuracy, over a state-of-the-art ranking algorithm, on several datasets. We also show that LambdaRank provides a method for significantly speeding up the training phase of that ranking algorithm. Although this paper is directed towards ranking, the proposed method can be extended to any non-smooth and multivariate cost functions.
3 0.54687583 161 nips-2006-Particle Filtering for Nonparametric Bayesian Matrix Factorization
Author: Frank Wood, Thomas L. Griffiths
Abstract: Many unsupervised learning problems can be expressed as a form of matrix factorization, reconstructing an observed data matrix as the product of two matrices of latent variables. A standard challenge in solving these problems is determining the dimensionality of the latent matrices. Nonparametric Bayesian matrix factorization is one way of dealing with this challenge, yielding a posterior distribution over possible factorizations of unbounded dimensionality. A drawback to this approach is that posterior estimation is typically done using Gibbs sampling, which can be slow for large problems and when conjugate priors cannot be used. As an alternative, we present a particle filter for posterior estimation in nonparametric Bayesian matrix factorization models. We illustrate this approach with two matrix factorization models and show favorable performance relative to Gibbs sampling.
4 0.53734499 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
Author: Rey Ramírez, Jason Palmer, Scott Makeig, Bhaskar D. Rao, David P. Wipf
Abstract: The ill-posed nature of the MEG/EEG source localization problem requires the incorporation of prior assumptions when choosing an appropriate solution out of an infinite set of candidates. Bayesian methods are useful in this capacity because they allow these assumptions to be explicitly quantified. Recently, a number of empirical Bayesian approaches have been proposed that attempt a form of model selection by using the data to guide the search for an appropriate prior. While these approaches seem quite different in many respects, we apply a unifying framework based on automatic relevance determination (ARD) that elucidates various attributes of these methods and suggests directions for improvement. We also derive theoretical properties of this methodology related to convergence, local minima, and localization bias and explore connections with established algorithms.
5 0.53695786 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
Author: Dong S. Cheng, Vittorio Murino, Mário Figueiredo
Abstract: This paper proposes a new approach to model-based clustering under prior knowledge. The proposed formulation can be interpreted from two different angles: as penalized logistic regression, where the class labels are only indirectly observed (via the probability density of each class); as finite mixture learning under a grouping prior. To estimate the parameters of the proposed model, we derive a (generalized) EM algorithm with a closed-form E-step, in contrast with other recent approaches to semi-supervised probabilistic clustering which require Gibbs sampling or suboptimal shortcuts. We show that our approach is ideally suited for image segmentation: it avoids the combinatorial nature of Markov random field priors, and opens the door to more sophisticated spatial priors (e.g., wavelet-based) in a simple and computationally efficient way. Finally, we extend our formulation to work in unsupervised, semi-supervised, or discriminative modes.
6 0.53603953 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation
7 0.5355233 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy
8 0.53502929 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments
9 0.53068572 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
10 0.53037339 79 nips-2006-Fast Iterative Kernel PCA
11 0.52839833 175 nips-2006-Simplifying Mixture Models through Function Approximation
12 0.527587 97 nips-2006-Inducing Metric Violations in Human Similarity Judgements
13 0.52698225 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis
14 0.52622801 65 nips-2006-Denoising and Dimension Reduction in Feature Space
15 0.52605039 20 nips-2006-Active learning for misspecified generalized linear models
16 0.52564096 115 nips-2006-Learning annotated hierarchies from relational data
17 0.52536494 8 nips-2006-A Nonparametric Approach to Bottom-Up Visual Saliency
18 0.52504045 43 nips-2006-Bayesian Model Scoring in Markov Random Fields
19 0.52315259 61 nips-2006-Convex Repeated Games and Fenchel Duality
20 0.52291125 72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model