nips nips2006 nips2006-116 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields results for classification and regression generally, and for density estimation within the exponential family. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Learning from Multiple Sources Koby Crammer, Michael Kearns, Jennifer Wortman Department of Computer and Information Science University of Pennsylvania Philadelphia, PA 19104 Abstract We consider the problem of learning accurate models from multiple sources of “nearby” data. [sent-1, score-0.167]
2 Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. [sent-2, score-0.227]
3 This theory is applicable in a broad decision-theoretic learning framework, and yields results for classification and regression generally, and for density estimation within the exponential family. [sent-3, score-0.281]
4 A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. [sent-4, score-0.3]
5 In our model there are K unknown data sources, with source i generating a distinct sample Si of ni observations. [sent-11, score-0.487]
6 We assume we are given only the samples Si , and a disparity1 matrix D whose entry D(i, j) bounds the difference between source i and source j. [sent-12, score-0.354]
7 Our framework includes settings in which the sources produce data for classification, regression, and density estimation (and more generally any additive-loss learning problem obeying certain conditions). [sent-14, score-0.233]
8 Our main result is a general theorem establishing a bound on the expected loss incurred by using all data sources within a given disparity of the target source. [sent-15, score-0.735]
9 Our bound clearly expresses a trade-off between three quantities: the sample size used (which increases as we include data from more distant models), a weighted average of the disparities of the sources whose data is used, and a model complexity term. [sent-17, score-0.399]
10 It can be applied to any learning setting in which the underlying loss function obeys an approximate triangle inequality, and in which the class of hypothesis models under consideration obeys uniform convergence of empirical estimates of loss to expectations. [sent-18, score-1.033]
11 1 We avoid using the term distance since our results include settings in which the underlying loss measures may not be formal distances. [sent-19, score-0.307]
12 For classification problems, the standard triangle inequality holds. [sent-20, score-0.347]
For regression we prove a 2-approximation to the triangle inequality, and for density estimation for members of the exponential family, we apply Bregman divergence techniques to provide approximate triangle inequalities. [sent-21, score-0.741]
Uniform convergence bounds for the settings we consider may be obtained via standard data-independent model complexity measures such as VC dimension and pseudo-dimension, or via more recent data-dependent approaches such as Rademacher complexity. [sent-23, score-0.411]
15 In the current work, each source may be entirely unrelated to all others except as constrained by the bounds on disparities, requiring us to develop new techniques. [sent-25, score-0.248]
16 In Section 2 we introduce a decision-theoretic framework for probabilistic learning that includes classification, regression, density estimation and many other settings as special cases, and then give our multiple source generalization of this model. [sent-28, score-0.273]
17 In Section 3 we provide our main result, which is a general bound on the expected loss incurred by using all data within a given disparity of a target source. [sent-29, score-0.533]
2 Learning models
Before detailing our multiple-source learning model, we first introduce a standard decision-theoretic learning framework in which our goal is to find a model minimizing a generalized notion of empirical loss [5]. [sent-32, score-0.281]
19 Each setting we consider has an associated loss function L(h, z). [sent-41, score-0.233]
In regression we might consider the squared loss function L(h, x, y ) = (y − h(x))². [sent-43, score-0.305]
21 In density estimation we might consider the log loss L(h, x) = log(1/h(x)). [sent-44, score-0.338]
22 In each case, we are interested in the expected loss of a model g2 on target g1 , e(g1 , g2 ) = Ez∼Pg1 [L(g2 , z)]. [sent-45, score-0.361]
In our multiple source model, we are presented with K distinct samples or piles of data S1 , . . . , SK . [sent-47, score-0.502]
Each pile Si contains ni observations that are generated from a fixed and unknown model fi , and D satisfies e(fi , fj ), e(fj , fi ) ≤ D(i, j).2 [sent-51, score-1.167]
Our goal is to decide which piles Sj to use in order to learn the best approximation (in terms of expected loss) to each fi . [sent-52, score-0.741]
26 While we are interested in accomplishing this goal for each fi , it suffices and is convenient to examine the problem from the perspective of a fixed fi . [sent-53, score-0.554]
Thus without loss of generality let us suppose that we are given piles S1 , . . . , SK . [sent-54, score-0.62]
In all cases we will analyze, for any k ≤ K, the hypothesis ĥk minimizing the empirical loss êk (h) on the first k piles S1 , . . . , Sk . [sent-70, score-1.018]
2 While it may seem restrictive to assume that D is given, notice that D(i, j) can often be estimated from data, for example in a classification setting in which common instances labeled by both fi and fj are available. [sent-75, score-0.352]
ĥk = argmin_{h∈H} êk (h) = argmin_{h∈H} (1/n1:k ) ∑_{j=1}^{k} ∑_{i=1}^{nj} L(h, z_i^j ), where n1:k = n1 + · · · + nk . [sent-76, score-0.47]
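A minimal sketch of this pooled empirical minimization in Python, assuming a finite hypothesis class; the names piles, hypotheses, and loss are illustrative and not from the paper:

    # Sketch: empirical loss minimization over the first k piles.
    # `piles` is a list of lists of observations ordered by disparity from
    # the target source, `hypotheses` is a finite class H, and `loss` maps
    # (h, z) to a nonnegative real. All names are illustrative.
    def erm_on_first_k(piles, k, hypotheses, loss):
        pooled = [z for pile in piles[:k] for z in pile]  # n_{1:k} observations
        def empirical_loss(h):
            return sum(loss(h, z) for z in pooled) / len(pooled)
        return min(hypotheses, key=empirical_loss)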
We also denote the expected error of function h with respect to the first k piles of data as ek (h) = E[êk (h)] = ∑_{i=1}^{k} (ni /n1:k ) e(fi , h). [sent-77, score-0.8]
3 General theory
In this section we provide the first of our main results: a general bound on the expected loss of the model minimizing the empirical loss on the nearest k piles. [sent-78, score-0.778]
33 Optimization of this bound leads to a recommended number of piles to incorporate when learning f = f1 . [sent-79, score-0.51]
34 The key ingredients needed to apply this bound are an approximate triangle inequality and a uniform convergence bound, which we define below. [sent-80, score-0.651]
35 Definition 1 For α ≥ 1, we say that the α-triangle inequality holds for a class of models F and expected loss function e if for all g1 , g2 , g3 ∈ F we have e(g1 , g2 ) ≤ α(e(g1 , g3 ) + e(g3 , g2 )). [sent-82, score-0.534]
36 The choice α = 1 yields the standard triangle inequality. [sent-84, score-0.182]
37 We note that the restriction to models in the class F may in some cases be quite weak — for instance, when F is all possible classifiers or real-valued functions with bounded range — or stronger, as in densities from the exponential family. [sent-85, score-0.242]
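For a finite set of models, Definition 1 can be checked directly; the following brute-force sketch assumes the expected loss e is given as a callable (illustrative names throughout):

    # Brute-force check of the alpha-triangle inequality of Definition 1
    # over a finite list of models, for an expected-loss callable `e`.
    def satisfies_alpha_triangle(e, models, alpha):
        return all(e(g1, g2) <= alpha * (e(g1, g3) + e(g3, g2))
                   for g1 in models for g2 in models for g3 in models)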
Our results will require only that the unknown source models f1 , . . . , fK lie in the class F. [sent-86, score-0.185]
Definition 2 A uniform convergence bound for a hypothesis space H and loss function L is a bound that states that for any 0 < δ < 1, with probability at least 1 − δ, for any h ∈ H, |ê(h) − e(h)| ≤ β(n, δ), where ê(h) = (1/n) ∑_{i=1}^{n} L(h, zi ) for n observations z1 , . . . , zn . [sent-91, score-0.699]
This definition simply asserts that for every model in H, its empirical loss on a sample of size n and the expectation of this loss will be “close.” [sent-102, score-0.494]
In general the function β will incorporate standard measures of the complexity of H, and will be a decreasing function of the sample size n, as in the classical O(√(d/n)) bounds of VC theory. [sent-103, score-0.25]
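For concreteness, one standard data-independent instantiation of β is sketched below; the exact constants vary across statements of the VC bound, so this is an assumption rather than the paper's choice:

    import math

    # A VC-style uniform convergence bound beta(n, delta) for a class of
    # VC dimension d; constants differ across textbook versions.
    def beta_vc(n, delta, d):
        return math.sqrt((d * math.log(2 * math.e * n / d)
                          + math.log(2 / delta)) / (2 * n))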
42 Our bounds will be derived from the rich literature on uniform convergence. [sent-104, score-0.232]
43 However, generalizing the standard uniform convergence results to this setting is straightforward. [sent-106, score-0.197]
44 Theorem 1 Let e be the expected loss function for loss L, and let F be a class of models for which the α-triangle inequality holds with respect to e. [sent-108, score-0.787]
45 Let H ⊆ F be a class of hypothesis models for which there is a uniform convergence bound β for L. [sent-109, score-0.404]
Let f = f1 , . . . , fK ∈ F, {ǫi }_{i=1}^{K}, {ni }_{i=1}^{K}, and ĥk be as defined above. [sent-113, score-0.232]
Then with probability at least 1 − δ, for any k ∈ {1, . . . , K},
e(f, ĥk ) ≤ (α + α²) ∑_{i=1}^{k} (ni /n1:k ) ǫi + 2αβ(n1:k , δ/2K) + α² min_{h∈H} e(f, h).
Before providing the proof, let us examine the bound of Theorem 1, which expresses a natural and intuitive trade-off. [sent-117, score-0.834]
48 The first term in the bound is a weighted sum of the disparities of the k ≤ K models whose data is used with respect to the target model f = f1 . [sent-118, score-0.275]
49 The second term is determined by the uniform convergence bound. [sent-120, score-0.17]
50 We expect this term to decrease with added piles due to the increased sample size. [sent-121, score-0.367]
51 The final term is what is typically called the approximation error — the residual loss that we incur simply by limiting our hypothesis model to fall in the restricted class H. [sent-122, score-0.299]
52 All three terms are influenced by the strength of the approximate triangle inequality that we have, as quantified by α. [sent-123, score-0.347]
The bounds given in Theorem 1 can be loose, but provide an upper bound necessary for optimization and suggest a natural choice for the number of piles k∗ to use to estimate the target f :
k∗ = argmin_k { (α + α²) ∑_{i=1}^{k} (ni /n1:k ) ǫi + 2αβ(n1:k , δ/2K) }. [sent-124, score-1.063]
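A sketch of this optimization, assuming the piles are already sorted by disparity and reusing any uniform convergence bound beta (for instance the VC-style sketch above, with d fixed); all names are illustrative:

    # Choose k* by minimizing the variable part of the Theorem 1 bound:
    # (alpha + alpha^2) * sum_i (n_i / n_{1:k}) eps_i
    #     + 2 * alpha * beta(n_{1:k}, delta / (2K)).
    # e.g. beta = lambda n, delta: beta_vc(n, delta, d) for the VC sketch.
    def choose_k(eps, ns, alpha, delta, beta):
        K = len(ns)
        best_k, best_val = 1, float("inf")
        n_prefix, weighted_eps = 0, 0.0
        for k in range(1, K + 1):
            n_prefix += ns[k - 1]
            weighted_eps += ns[k - 1] * eps[k - 1]
            val = ((alpha + alpha**2) * weighted_eps / n_prefix
                   + 2 * alpha * beta(n_prefix, delta / (2 * K)))
            if val < best_val:
                best_k, best_val = k, val
        return best_k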
54 Theorem 1 and this optimization make the implicit assumption that the best subset of piles to use will be a prefix of the piles — that is, that we should not “skip” a nearby pile in favor of more distant ones. [sent-125, score-1.006]
This assumption will generally be true for typical data-independent uniform convergence bounds such as those based on VC dimension, and true on average for data-dependent bounds, where we expect uniform convergence bounds to improve with increased sample size. [sent-126, score-0.52]
For each i ∈ {1, . . . , k}, (ni /n1:k ) e(f, h) ≤ (ni /n1:k ) (α e(f, fi ) + α e(fi , h)). [sent-134, score-0.944]
Summing over all i ∈ {1, . . . , k}, we find
e(f, h) ≤ ∑_{i=1}^{k} (ni /n1:k ) (α e(f, fi ) + α e(fi , h)) = α ∑_{i=1}^{k} (ni /n1:k ) e(f, fi ) + α ∑_{i=1}^{k} (ni /n1:k ) e(fi , h) ≤ α ∑_{i=1}^{k} (ni /n1:k ) ǫi + α ek (h).
In the first line above we have used the α-triangle inequality to deliberately introduce a weighted summation involving the fi . [sent-137, score-2.349]
58 Notice that the first summation is a weighted average of the expected loss of each fi , while the second summation is the expected loss of h on the data. [sent-139, score-0.926]
59 In this problem there are K = 100 classifiers, each defined by 2 parameters represented by a point fi in the unit square, such that the expected disagreement rate between two such classifiers equals the L1 distance between their parameters. [sent-163, score-0.342]
60 ) We chose the 100 parameter vectors fi uniformly at random from the unit square (the circles in the left panel). [sent-165, score-0.254]
To generate varying pile sizes, we let ni decrease with the distance of fi from a chosen “central” point at (0.75, 0.75) (marked “MAX DATA” in the left panel); the resulting pile sizes for each model are shown in the bar plot in the right panel, where the origin (0, 0) is in the near corner, (1, 1) in the far corner, and the pile sizes clearly peak near (0.75, 0.75). [sent-166, score-0.787] [sent-168, score-0.282]
63 Given these fi , ni and the pairwise distances, the undirected graph on the left includes an edge between fi and fj if and only if the data from fj is used to learn fi and/or the converse when Theorem 2 is used to optimize the distance of the data used. [sent-171, score-1.281]
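A sketch of this simulation setup; the text states only that ni decreases with distance from the central point, so the specific decay rule below is an illustrative assumption:

    import random

    # K = 100 classifiers with parameters in the unit square, disparity
    # given by L1 distance, and pile sizes decaying with distance from
    # the "MAX DATA" point; the decay rule is an assumption.
    random.seed(0)
    K = 100
    fs = [(random.random(), random.random()) for _ in range(K)]
    center = (0.75, 0.75)

    def l1(p, q):
        return abs(p[0] - q[0]) + abs(p[1] - q[1])

    ns = [max(1, int(200 * (2.0 - l1(f, center)))) for f in fs]
    D = [[l1(fi, fj) for fj in fs] for fi in fs]  # disparity matrix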
The loss function L(h, x, y ) is defined as 0 if y = h(x) and 1 otherwise, and the corresponding expected loss is e(g1 , g2 ) = E x,y ∼Pg1 [L(g2 , x, y )] = Prx∼P [g1 (x) ≠ g2 (x)]. [sent-182, score-0.5]
65 For 0/1 loss it is well-known and easy to see that the (standard) 1-triangle inequality holds, and classical VC theory [6] provides us with uniform convergence. [sent-183, score-0.489]
Let f = f1 , . . . , fK ∈ F, {ǫi }_{i=1}^{K}, {ni }_{i=1}^{K}, and ĥk be as defined above in the multi-source learning model. [sent-190, score-0.232]
Then with probability at least 1 − δ, for any k ∈ {1, . . . , K},
e(f, ĥk ) ≤ 2 ∑_{i=1}^{k} (ni /n1:k ) ǫi + min_{h∈H} e(f, h) + 2 √( (d log(2en1:k /d) + log(16K/δ)) / (8n1:k ) ).
In Figure 1 we provide a visual demonstration of the behavior of Theorem 1 applied to a simple classification problem. [sent-194, score-0.718]
Our loss function is L(h, x, y ) = (y − h(x))², and the expected loss is thus e(g1 , g2 ) = E x,y ∼Pg1 [L(g2 , x, y )] = Ex∼P [(g1 (x) − g2 (x))²]. [sent-201, score-0.5]
69 For regression it is known that the standard 1-triangle inequality does not hold. [sent-202, score-0.225]
Lemma 1 Given any three functions g1 , g2 , g3 : X → R, a fixed and unknown distribution P on the inputs X , and the expected loss e(g1 , g2 ) = Ex∼P [(g1 (x) − g2 (x))²], we have e(g1 , g2 ) ≤ 2 (e(g1 , g3 ) + e(g3 , g2 )). [sent-205, score-0.394]
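A quick Monte Carlo sanity check of Lemma 1, using three arbitrary functions on [0, 1] under a uniform input distribution (illustrative choices throughout):

    import random

    # Empirically verify e(g1, g2) <= 2 * (e(g1, g3) + e(g3, g2)) for
    # squared loss with illustrative functions on [0, 1].
    random.seed(1)
    xs = [random.random() for _ in range(100000)]
    g1, g2, g3 = (lambda x: x), (lambda x: 1.0 - x), (lambda x: 0.3)

    def e(a, b):
        return sum((a(x) - b(x)) ** 2 for x in xs) / len(xs)

    assert e(g1, g2) <= 2 * (e(g1, g3) + e(g3, g2))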
71 The other required ingredient is a uniform convergence bound for regression with squared loss. [sent-206, score-0.367]
72 There is a rich literature on such bounds and their corresponding complexity measures for the model class H, including the fat-shattering generalization of VC dimension [7], ǫ-nets and entropy [6] and the combinatorial and pseudo-dimension approaches beautifully surveyed in [5]. [sent-207, score-0.32]
73 While a detailed exposition of the pseudo-dimension dim(H) of a class H of real-valued functions exceeds both our space limitations and scope, it suffices to say that it generalizes the VC dimension for binary functions and plays a similar role in uniform convergence bounds. [sent-209, score-0.312]
Ignoring constant and logarithmic factors, uniform convergence bounds can be derived in which the complexity penalty is √(dim(H)/n). [sent-215, score-0.362]
Let f = f1 , . . . , fK ∈ F, {ǫi }_{i=1}^{K}, {ni }_{i=1}^{K}, and ĥk be as defined in the multi-source learning model. [sent-225, score-0.232]
Then with probability at least 1 − δ, for any k ∈ {1, . . . , K},
e(f, ĥk ) ≤ 6 ∑_{i=1}^{k} (ni /n1:k ) ǫi + 4 min_{h∈H} e(f, h) + 128B² ( √( (d/n1:k ) ln(16e²n1:k /d) ) + √( ln(16K/δ)/n1:k ) ).
Density estimation
We turn to the more complex application to density estimation. [sent-231, score-0.533]
77 The loss function for an observation x is the log loss L(P, x) = log (1/P (x)). [sent-233, score-0.486]
78 The expected loss is then e(P1 , P2 ) = Ex∼P1 [L(P2 , x)] = Ex∼P1 [log(1/P2 (x))]. [sent-234, score-0.294]
79 As we are not aware of an α-triangle inequality that holds simultaneously for all density functions, we provide general mathematical tools to derive specialized α-triangle inequalities for specific classes of distributions. [sent-235, score-0.256]
80 We focus on the exponential family of distributions, which is quite general and has nice properties which allow us to derive the necessary machinery to apply Theorem 1. [sent-236, score-0.192]
We proceed by deriving an α-triangle inequality for Kullback-Leibler divergence in exponential families that implies an α-triangle inequality for our expected loss function.3 [sent-238, score-0.495] [sent-239, score-0.459]
3 A version of this paper with the appendix included can be found on the authors’ websites.
83 This inequality and a uniform convergence bound based on pseudo-dimension yield a general method for deriving error bounds in the multiple source setting which we illustrate using the example of multinomial distributions. [sent-240, score-0.797]
84 We define the exponential family of distributions in terms of the following components. [sent-246, score-0.196]
Using this fact and the linearity of expectation, we can derive the Kullback-Leibler (KL) divergence between two distributions of the same family (which use the same functions F and Ψ) and obtain KL(PF (x | µ1 ) ‖ PF (x | µ2 )) = F (µ1 ) − [F (µ2 ) + ∇F (µ2 ) · (µ1 − µ2 )]. (2) [sent-252, score-0.23]
Eq. (2) states that the KL divergence between two members of the exponential family is equal to the Bregman divergence between the two corresponding expectation parameters. [sent-256, score-0.429]
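A small numerical check of this identity for categorical (multinomial) distributions, taking F to be the negative entropy of the expectation parameters; this is one standard parameterization, so treat it as a sketch:

    import math

    # Verify that KL(P1 || P2) equals the Bregman divergence of
    # F(mu) = sum_j mu_j log mu_j between expectation parameters,
    # an instance of Eq. (2) for categorical distributions.
    def F(mu):
        return sum(m * math.log(m) for m in mu)

    def bregman(mu1, mu2):
        grad = [math.log(m) + 1 for m in mu2]
        return F(mu1) - F(mu2) - sum(g * (a - b)
                                     for g, a, b in zip(grad, mu1, mu2))

    def kl(p, q):
        return sum(a * math.log(a / b) for a, b in zip(p, q))

    p, q = [0.2, 0.5, 0.3], [0.4, 0.4, 0.2]
    assert abs(kl(p, q) - bregman(p, q)) < 1e-9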
87 We will use the above relation between the KL divergence for exponential families and Bregman divergences to derive a triangle inequality as required by our theory. [sent-258, score-0.648]
88 The following lemma shows that if we can provide a triangle inequality for the KL function, we can do so for expected log loss. [sent-259, score-0.521]
89 The next lemma gives an approximate triangle inequality for the KL divergence. [sent-265, score-0.396]
The proof (again see Appendix B) uses Taylor’s Theorem to derive upper and lower bounds on the Bregman divergence and then uses Eq. (2). [sent-267, score-0.285]
91 Lemma 3 Let P1 , P2 , and P3 be distributions from an exponential family with parameters µ and function F . [sent-269, score-0.196]
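As a quick illustration of why an approximate version is needed at all, KL can violate the ordinary 1-triangle inequality by a wide margin (the distributions below are illustrative):

    import math

    # KL is not a metric: here KL(p1 || p2) exceeds
    # KL(p1 || p3) + KL(p3 || p2), so some alpha > 1 is unavoidable.
    def kl(p, q):
        return sum(a * math.log(a / b) for a, b in zip(p, q))

    p1, p2, p3 = [0.98, 0.02], [0.02, 0.98], [0.5, 0.5]
    assert kl(p1, p2) > kl(p1, p3) + kl(p3, p2)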
92 The following theorem, which states bounds for multinomial distributions in the multi-source setting, is provided to illustrate the type of results that can be obtained using the machinery described in this section. [sent-272, score-0.272]
93 Let d be the pseudodimension of H under log loss, and let e be the expected log loss. [sent-275, score-0.209]
Let f = f1 , . . . , fK ∈ F, {ǫi }_{i=1}^{K},4 {ni }_{i=1}^{K}, and ĥk be as defined above in the multi-source learning model. [sent-279, score-0.232]
Then with probability at least 1 − δ, for any k ∈ {1, . . . , K},
e(f, ĥk ) ≤ (α + α²) ∑_{i=1}^{k} (ni /n1:k ) ǫi + α² min_{h∈H} e(f, h) + 128 log²(…) ( √( (d/n1:k ) ln(16e²n1:k /d) ) + √( ln(16K/δ)/n1:k ) ). [sent-284, score-1.148]
4 Here we can actually make the weaker assumption that the ǫi bound the KL divergences rather than the expected log loss, which avoids our needing upper bounds on the entropy of each source distribution.
5 Data-dependent bounds
Given the interest in data-dependent convergence methods (such as maximum margin, PAC-Bayes, and others) in recent years, it is natural to ask how our multi-source theory can exploit these modern bounds. [sent-285, score-0.34]
If H is a class of functions mapping from a set X to R, we define the empirical Rademacher complexity of H on a fixed set of observations x1 , . . . , xn . [sent-287, score-0.216]
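A Monte Carlo sketch for estimating the empirical Rademacher complexity of a finite class on fixed points; conventions differ by constant factors and in the use of absolute values, so this is one common form rather than the paper's exact definition:

    import random

    # Estimate the empirical Rademacher complexity of a finite class H
    # on fixed observations xs by sampling Rademacher signs sigma.
    def empirical_rademacher(H, xs, trials=1000, seed=0):
        rng = random.Random(seed)
        n = len(xs)
        total = 0.0
        for _ in range(trials):
            sigma = [rng.choice((-1, 1)) for _ in range(n)]
            total += max(abs(sum(s * h(x) for s, x in zip(sigma, xs))) / n
                         for h in H)
        return total / trials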
98 We can apply Rademacher-based convergence bounds to obtain a data-dependent multi-source bound for classification. [sent-301, score-0.32]
Theorem 5 Let F be the set of all functions from an input set X into {−1, 1} and let R̂n1:k be the empirical Rademacher complexity of H ⊆ F on the first k piles of data. [sent-303, score-0.532]
Let f = f1 , . . . , fK ∈ F, {ǫi }_{i=1}^{K}, {ni }_{i=1}^{K}, and ĥk be as defined in the multi-source learning model. [sent-308, score-0.338]
wordName wordTfidf (topN-words)
[('piles', 0.367), ('ni', 0.345), ('fi', 0.254), ('hk', 0.232), ('loss', 0.206), ('triangle', 0.182), ('inequality', 0.165), ('kl', 0.165), ('vc', 0.161), ('rademacher', 0.15), ('bounds', 0.142), ('pile', 0.141), ('pf', 0.138), ('bregman', 0.131), ('ek', 0.12), ('theorem', 0.107), ('source', 0.106), ('exponential', 0.098), ('bound', 0.098), ('divergence', 0.096), ('sources', 0.095), ('fk', 0.095), ('appendix', 0.092), ('ex', 0.09), ('uniform', 0.09), ('expected', 0.088), ('rd', 0.083), ('convergence', 0.08), ('classi', 0.073), ('dim', 0.072), ('fj', 0.071), ('nearby', 0.071), ('target', 0.067), ('disparities', 0.067), ('observations', 0.066), ('pre', 0.063), ('divergences', 0.063), ('density', 0.061), ('hypothesis', 0.061), ('family', 0.061), ('multinomial', 0.06), ('distant', 0.06), ('regression', 0.06), ('measures', 0.058), ('ln', 0.056), ('complexity', 0.05), ('expectation', 0.05), ('users', 0.05), ('lemma', 0.049), ('sk', 0.048), ('let', 0.047), ('proof', 0.047), ('examine', 0.046), ('user', 0.046), ('ers', 0.045), ('recommended', 0.045), ('families', 0.044), ('argmin', 0.044), ('models', 0.043), ('kearns', 0.043), ('settings', 0.043), ('labelings', 0.042), ('summation', 0.042), ('squared', 0.039), ('sj', 0.039), ('dimension', 0.038), ('distributions', 0.037), ('disparity', 0.037), ('obeys', 0.037), ('incurred', 0.037), ('log', 0.037), ('min', 0.037), ('functions', 0.036), ('ingredients', 0.036), ('unknown', 0.036), ('ask', 0.034), ('crammer', 0.034), ('estimation', 0.034), ('densities', 0.033), ('aggregate', 0.033), ('machinery', 0.033), ('empirical', 0.032), ('class', 0.032), ('learn', 0.032), ('nk', 0.03), ('demonstration', 0.03), ('zn', 0.03), ('corner', 0.03), ('inequalities', 0.03), ('sites', 0.03), ('expresses', 0.029), ('multiple', 0.029), ('theory', 0.028), ('members', 0.028), ('panel', 0.028), ('examined', 0.028), ('ces', 0.028), ('inputs', 0.028), ('setting', 0.027), ('taylor', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 116 nips-2006-Learning from Multiple Sources
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields results for classification and regression generally, and for density estimation within the exponential family. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. 1
2 0.16161604 11 nips-2006-A PAC-Bayes Risk Bound for General Loss Functions
Author: Pascal Germain, Alexandre Lacasse, François Laviolette, Mario Marchand
Abstract: We provide a PAC-Bayesian bound for the expected loss of convex combinations of classifiers under a wide class of loss functions (which includes the exponential loss and the logistic loss). Our numerical experiments with Adaboost indicate that the proposed upper bound, computed on the training set, behaves very similarly as the true loss estimated on the testing set. 1
3 0.15146883 181 nips-2006-Stability of $K$-Means Clustering
Author: Alexander Rakhlin, Andrea Caponnetto
Abstract: We phrase K-means clustering as an empirical risk minimization procedure over a class HK and explicitly calculate the covering number for this class. Next, we show that stability of K-means clustering is characterized by the geometry of HK with respect to the underlying distribution. We prove that in the case of a unique global minimizer, the clustering solution is stable with respect to complete changes of the data, while for the case of multiple minimizers, the change of Ω(n1/2 ) samples defines the transition between stability and instability. While for a finite number of minimizers this result follows from multinomial distribution estimates, the case of infinite minimizers requires more refined tools. We conclude by proving that stability of the functions in HK implies stability of the actual centers of the clusters. Since stability is often used for selecting the number of clusters in practice, we hope that our analysis serves as a starting point for finding theoretically grounded recipes for the choice of K. 1
4 0.13485275 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
Author: Manfred K. Warmuth, Dima Kuzmin
Abstract: We design an on-line algorithm for Principal Component Analysis. In each trial the current instance is projected onto a probabilistically chosen low dimensional subspace. The total expected quadratic approximation error equals the total quadratic approximation error of the best subspace chosen in hindsight plus some additional term that grows linearly in dimension of the subspace but logarithmically in the dimension of the instances. 1
5 0.13378328 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space
Author: Wenye Li, Kin-hong Lee, Kwong-sak Leung
Abstract: Kernel-based regularized learning seeks a model in a hypothesis space by minimizing the empirical error and the model’s complexity. Based on the representer theorem, the solution consists of a linear combination of translates of a kernel. This paper investigates a generalized form of representer theorem for kernel-based learning. After mapping predefined features and translates of a kernel simultaneously onto a hypothesis space by a specific way of constructing kernels, we proposed a new algorithm by utilizing a generalized regularizer which leaves part of the space unregularized. Using a squared-loss function in calculating the empirical error, a simple convex solution is obtained which combines predefined features with translates of the kernel. Empirical evaluations have confirmed the effectiveness of the algorithm for supervised learning tasks.
6 0.12986231 7 nips-2006-A Local Learning Approach for Clustering
7 0.12591363 173 nips-2006-Shifting, One-Inclusion Mistake Bounds and Tight Multiclass Expected Risk Bounds
8 0.12417588 109 nips-2006-Learnability and the doubling dimension
9 0.12414689 33 nips-2006-Analysis of Representations for Domain Adaptation
10 0.12385213 193 nips-2006-Tighter PAC-Bayes Bounds
11 0.1190565 117 nips-2006-Learning on Graph with Laplacian Regularization
12 0.092801839 30 nips-2006-An Oracle Inequality for Clipped Regularized Risk Minimizers
13 0.087034084 21 nips-2006-AdaBoost is Consistent
14 0.086228691 67 nips-2006-Differential Entropic Clustering of Multivariate Gaussians
15 0.082331128 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors
16 0.082135946 186 nips-2006-Support Vector Machines on a Budget
17 0.082116619 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization
18 0.081830502 50 nips-2006-Chained Boosting
19 0.08046826 150 nips-2006-On Transductive Regression
20 0.078746401 65 nips-2006-Denoising and Dimension Reduction in Feature Space
topicId topicWeight
[(0, -0.271), (1, 0.088), (2, -0.113), (3, 0.028), (4, -0.041), (5, 0.22), (6, -0.003), (7, 0.11), (8, -0.028), (9, 0.096), (10, 0.116), (11, -0.005), (12, -0.103), (13, 0.028), (14, -0.015), (15, -0.102), (16, 0.089), (17, -0.009), (18, 0.058), (19, -0.107), (20, -0.04), (21, 0.187), (22, 0.041), (23, 0.06), (24, -0.142), (25, -0.079), (26, 0.114), (27, -0.084), (28, -0.02), (29, 0.058), (30, 0.045), (31, -0.029), (32, 0.025), (33, -0.012), (34, 0.109), (35, -0.061), (36, -0.079), (37, -0.145), (38, -0.122), (39, 0.019), (40, 0.101), (41, -0.067), (42, 0.035), (43, -0.054), (44, 0.074), (45, -0.043), (46, -0.057), (47, -0.03), (48, -0.059), (49, 0.08)]
simIndex simValue paperId paperTitle
same-paper 1 0.9565196 116 nips-2006-Learning from Multiple Sources
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields results for classification and regression generally, and for density estimation within the exponential family. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. 1
2 0.74732006 109 nips-2006-Learnability and the doubling dimension
Author: Yi Li, Philip M. Long
Abstract: Given a set of classifiers and a probability distribution over their domain, one can define a metric by taking the distance between a pair of classifiers to be the probability that they classify a random item differently. We prove bounds on the sample complexity of PAC learning in terms of the doubling dimension of this metric. These bounds imply known bounds on the sample complexity of learning halfspaces with respect to the uniform distribution that are optimal up to a constant factor. We prove a bound that holds for any algorithm that outputs a classifier with zero error whenever this is possible; this bound is in terms of the maximum of the doubling dimension and the VC-dimension of , and strengthens the best known bound in terms of the VC-dimension alone. We show that there is no bound on the doubling dimension in terms of the VC-dimension of (in contrast with the metric dimension).
3 0.67164373 181 nips-2006-Stability of $K$-Means Clustering
Author: Alexander Rakhlin, Andrea Caponnetto
Abstract: We phrase K-means clustering as an empirical risk minimization procedure over a class HK and explicitly calculate the covering number for this class. Next, we show that stability of K-means clustering is characterized by the geometry of HK with respect to the underlying distribution. We prove that in the case of a unique global minimizer, the clustering solution is stable with respect to complete changes of the data, while for the case of multiple minimizers, the change of Ω(n1/2 ) samples defines the transition between stability and instability. While for a finite number of minimizers this result follows from multinomial distribution estimates, the case of infinite minimizers requires more refined tools. We conclude by proving that stability of the functions in HK implies stability of the actual centers of the clusters. Since stability is often used for selecting the number of clusters in practice, we hope that our analysis serves as a starting point for finding theoretically grounded recipes for the choice of K. 1
4 0.65204674 30 nips-2006-An Oracle Inequality for Clipped Regularized Risk Minimizers
Author: Ingo Steinwart, Don Hush, Clint Scovel
Abstract: We establish a general oracle inequality for clipped approximate minimizers of regularized empirical risks and apply this inequality to support vector machine (SVM) type algorithms. We then show that for SVMs using Gaussian RBF kernels for classification this oracle inequality leads to learning rates that are faster than the ones established in [9]. Finally, we use our oracle inequality to show that a simple parameter selection approach based on a validation set can yield the same fast learning rates without knowing the noise exponents which were required to be known a-priori in [9]. 1
5 0.55559373 173 nips-2006-Shifting, One-Inclusion Mistake Bounds and Tight Multiclass Expected Risk Bounds
Author: Benjamin I. Rubinstein, Peter L. Bartlett, J. H. Rubinstein
Abstract: Under the prediction model of learning, a prediction strategy is presented with an i.i.d. sample of n − 1 points in X and corresponding labels from a concept f ∈ F, and aims to minimize the worst-case probability of erring on an nth point. By exploiting the structure of F, Haussler et al. achieved a VC(F)/n bound for the natural one-inclusion prediction strategy, improving on bounds implied by PAC-type results by a O(log n) factor. The key data structure in their result is the natural subgraph of the hypercube—the one-inclusion graph; the key step is a d = VC(F) bound on one-inclusion graph density. The first main result of this paper is a density bound of n · (n−1 choose ≤d−1) / (n choose ≤d) < d, which positively resolves a conjecture of Kuzmin & Warmuth relating to their unlabeled Peeling compression scheme and also leads to an improved mistake bound for the randomized (deterministic) one-inclusion strategy for all d (for d ≈ Θ(n)). The proof uses a new form of VC-invariant shifting and a group-theoretic symmetrization. Our second main result is a k-class analogue of the d/n mistake bound, replacing the VC-dimension by the Pollard pseudo-dimension and the one-inclusion strategy by its natural hypergraph generalization. This bound on expected risk improves on known PAC-based results by a factor of O(log n) and is shown to be optimal up to a O(log k) factor. The combinatorial technique of shifting takes a central role in understanding the one-inclusion (hyper)graph and is a running theme throughout. 1
6 0.53644669 33 nips-2006-Analysis of Representations for Domain Adaptation
7 0.53301561 193 nips-2006-Tighter PAC-Bayes Bounds
8 0.51546681 194 nips-2006-Towards a general independent subspace analysis
9 0.5145368 11 nips-2006-A PAC-Bayes Risk Bound for General Loss Functions
10 0.51010615 21 nips-2006-AdaBoost is Consistent
11 0.48900339 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space
12 0.48173437 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions
13 0.46437752 117 nips-2006-Learning on Graph with Laplacian Regularization
14 0.42268363 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics
15 0.4203721 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models
16 0.41561377 155 nips-2006-Optimal Single-Class Classification Strategies
17 0.39946228 156 nips-2006-Ordinal Regression by Extended Binary Classification
18 0.39543477 157 nips-2006-PAC-Bayes Bounds for the Risk of the Majority Vote and the Variance of the Gibbs Classifier
19 0.39508936 150 nips-2006-On Transductive Regression
20 0.38089886 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
topicId topicWeight
[(1, 0.08), (3, 0.02), (7, 0.068), (9, 0.042), (22, 0.074), (44, 0.528), (57, 0.05), (65, 0.031), (69, 0.014), (71, 0.016)]
simIndex simValue paperId paperTitle
1 0.99239373 69 nips-2006-Distributed Inference in Dynamical Systems
Author: Stanislav Funiak, Carlos Guestrin, Rahul Sukthankar, Mark A. Paskin
Abstract: We present a robust distributed algorithm for approximate probabilistic inference in dynamical systems, such as sensor networks and teams of mobile robots. Using assumed density filtering, the network nodes maintain a tractable representation of the belief state in a distributed fashion. At each time step, the nodes coordinate to condition this distribution on the observations made throughout the network, and to advance this estimate to the next time step. In addition, we identify a significant challenge for probabilistic inference in dynamical systems: message losses or network partitions can cause nodes to have inconsistent beliefs about the current state of the system. We address this problem by developing distributed algorithms that guarantee that nodes will reach an informative consistent distribution when communication is re-established. We present a suite of experimental results on real-world sensor data for two real sensor network deployments: one with 25 cameras and another with 54 temperature sensors. 1
2 0.96628362 96 nips-2006-In-Network PCA and Anomaly Detection
Author: Ling Huang, Xuanlong Nguyen, Minos Garofalakis, Michael I. Jordan, Anthony Joseph, Nina Taft
Abstract: We consider the problem of network anomaly detection in large distributed systems. In this setting, Principal Component Analysis (PCA) has been proposed as a method for discovering anomalies by continuously tracking the projection of the data onto a residual subspace. This method was shown to work well empirically in highly aggregated networks, that is, those with a limited number of large nodes and at coarse time scales. This approach, however, has scalability limitations. To overcome these limitations, we develop a PCA-based anomaly detector in which adaptive local data filters send to a coordinator just enough data to enable accurate global detection. Our method is based on a stochastic matrix perturbation analysis that characterizes the tradeoff between the accuracy of anomaly detection and the amount of data communicated over the network.
same-paper 3 0.96579081 116 nips-2006-Learning from Multiple Sources
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields results for classification and regression generally, and for density estimation within the exponential family. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. 1
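One elementary instance of an approximate triangle inequality of the kind mentioned here, worked out for squared loss; this is a standard derivation included for orientation, not an excerpt from the paper.

```latex
% Pointwise, (a + b)^2 <= 2a^2 + 2b^2, so
% (h_1 - h_3)^2 = ((h_1 - h_2) + (h_2 - h_3))^2
%              <= 2 (h_1 - h_2)^2 + 2 (h_2 - h_3)^2.
% Taking expectations yields a 2-approximate triangle inequality:
\[
\mathbb{E}\big[(h_1 - h_3)^2\big]
\;\le\; 2\,\mathbb{E}\big[(h_1 - h_2)^2\big]
      + 2\,\mathbb{E}\big[(h_2 - h_3)^2\big].
\]
```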
4 0.93298793 57 nips-2006-Conditional mean field
Author: Peter Carbonetto, Nando de Freitas
Abstract: Despite all the attention paid to variational methods based on sum-product message passing (loopy belief propagation, tree-reweighted sum-product), these methods are still bound to inference on a small set of probabilistic models. Mean field approximations have been applied to a broader set of problems, but the solutions are often poor. We propose a new class of conditionally-specified variational approximations based on mean field theory. While not usable on their own, combined with sequential Monte Carlo they produce guaranteed improvements over conventional mean field. Moreover, experiments on a well-studied problem, inferring the stable configurations of the Ising spin glass, show that the solutions can be significantly better than those obtained using sum-product-based methods. 1
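For contrast with the conditional construction described above, here is a minimal sketch of the plain mean-field baseline for the Ising model, using the standard fixed-point update; parameters and names are illustrative, and this is the conventional method the paper improves on, not the paper's algorithm.

```python
import numpy as np

def mean_field_ising(J, h, beta=1.0, iters=200):
    """Naive mean-field fixed point for an Ising model with
    p(s) proportional to exp(beta * (0.5 * s'Js + h's)), s_i in {-1, +1},
    J symmetric with zero diagonal. Returns magnetizations m, where
    m_i approximates E[s_i] under the fitted factorized distribution."""
    m = np.tanh(beta * h)                    # independent-spin initialization
    for _ in range(iters):
        m = np.tanh(beta * (J @ m + h))      # standard fixed-point update
    return m
```

When this fixed point is poor, as the abstract notes, the conditional specification combined with sequential Monte Carlo is the proposed remedy.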
5 0.87830603 139 nips-2006-Multi-dynamic Bayesian Networks
Author: Karim Filali, Jeff A. Bilmes
Abstract: We present a generalization of dynamic Bayesian networks to concisely describe complex probability distributions such as in problems with multiple interacting variable-length streams of random variables. Our framework incorporates recent graphical model constructs to account for existence uncertainty, value-specific independence, aggregation relationships, and local and global constraints, while still retaining a Bayesian network interpretation and efficient inference and learning techniques. We introduce one such general technique, which is an extension of Value Elimination, a backtracking search inference algorithm. Multi-dynamic Bayesian networks are motivated by our work on Statistical Machine Translation (MT). We present results on MT word alignment in support of our claim that MDBNs are a promising framework for the rapid prototyping of new MT systems. 1
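Since the abstract's experiments concern MT word alignment, a compact sketch of EM for the classical IBM Model 1 lexical translation table may help orient the reader. This is the textbook formulation of Model 1, not the paper's MDBN/Value-Elimination learner, and all identifiers are illustrative.

```python
from collections import defaultdict

def ibm1_em(pairs, iterations=5):
    """EM for the IBM Model 1 lexical translation table t(f | e).
    `pairs` is a list of (french_tokens, english_tokens) sentence pairs;
    a NULL token is prepended to each English sentence, as in Model 1."""
    t = defaultdict(lambda: 1.0)             # any uniform positive init works
    for _ in range(iterations):
        count = defaultdict(float)           # expected counts c(f, e)
        total = defaultdict(float)           # expected counts c(e)
        for f_sent, e_sent in pairs:
            e_sent = ["NULL"] + list(e_sent)
            for f in f_sent:
                z = sum(t[(f, e)] for e in e_sent)   # alignment normalizer
                for e in e_sent:
                    c = t[(f, e)] / z                # posterior of aligning f to e
                    count[(f, e)] += c
                    total[e] += c
        for f, e in list(count):             # M-step: renormalize per English word
            t[(f, e)] = count[(f, e)] / total[e]
    return t
```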
6 0.73065078 159 nips-2006-Parameter Expanded Variational Bayesian Methods
7 0.72615361 157 nips-2006-PAC-Bayes Bounds for the Risk of the Majority Vote and the Variance of the Gibbs Classifier
8 0.68730962 11 nips-2006-A PAC-Bayes Risk Bound for General Loss Functions
9 0.68140477 98 nips-2006-Inferring Network Structure from Co-Occurrences
10 0.68111396 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure
11 0.67696679 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures
12 0.66993159 85 nips-2006-Geometric entropy minimization (GEM) for anomaly detection and localization
13 0.66182321 125 nips-2006-Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning
14 0.65673608 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics
15 0.65561241 121 nips-2006-Learning to be Bayesian without Supervision
16 0.6525147 109 nips-2006-Learnability and the doubling dimension
17 0.6425181 192 nips-2006-Theory and Dynamics of Perceptual Bistability
18 0.64143723 193 nips-2006-Tighter PAC-Bayes Bounds
19 0.63575363 175 nips-2006-Simplifying Mixture Models through Function Approximation
20 0.63539994 21 nips-2006-AdaBoost is Consistent