nips nips2002 nips2002-157 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Harald Steck, Tommi S. Jaakkola
Abstract: A common objective in learning a model from data is to recover its network structure, while the model parameters are of minor interest. For example, we may wish to recover regulatory networks from high-throughput data sources. In this paper we examine how Bayesian regularization using a product of independent Dirichlet priors over the model parameters affects the learned model structure in a domain with discrete variables. We show that a small scale parameter - often interpreted as "equivalent sample size" or "prior strength" - leads to a strong regularization of the model structure (sparse graph) given a sufficiently large data set. In particular, the empty graph is obtained in the limit of a vanishing scale parameter. This is diametrically opposite to what one may expect in this limit, namely the complete graph from an (unregularized) maximum likelihood estimate. Since the prior affects the parameters as expected, the scale parameter balances a trade-off between regularizing the parameters vs. the model structure.
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper we examine how Bayesian regularization using a product of independent Dirichlet priors over the model parameters affects the learned model structure in a domain with discrete variables. [sent-8, score-0.41]
2 We show that a small scale parameter - often interpreted as "equivalent sample size" or "prior strength" - leads to a strong regularization of the model structure (sparse graph) given a sufficiently large data set. [sent-9, score-0.588]
3 In particular, the empty graph is obtained in the limit of a vanishing scale parameter. [sent-10, score-0.606]
4 This is diametrically opposite to what one may expect in this limit, namely the complete graph from an (unregularized) maximum likelihood estimate. [sent-11, score-0.315]
5 Since the prior affects the parameters as expected, the scale parameter balances a trade-off between regularizing the parameters vs. the model structure. [sent-12, score-0.469]
6 In the Bayesian approach, regularization is achieved by specifying a prior distribution over the parameters and subsequently averaging over the posterior distribution. [sent-16, score-0.425]
7 This regularization provides not only smoother estimates of the parameters compared to maximum likelihood but also guides the selection of model structures. [sent-17, score-0.27]
8 It was pointed out in [6] that a very large scale parameter of the Dirichlet prior can degrade predictive accuracy due to severe regularization of the parameter estimates. [sent-18, score-0.742]
9 We complement this discussion here and show that a very small scale parameter can lead to poor over-regularized structures when a product of (conjugate) Dirichlet priors is used over multinomial conditional distributions (Section 3). [sent-19, score-0.434]
10 Section 4 demonstrates the effect of the scale parameter and how it can be calibrated. [sent-20, score-0.239]
11 2 Regularization of Parameters We briefly review Bayesian regularization of parameters. [sent-22, score-0.189]
12 We follow the assumptions outlined in [6] : multinomial sample, complete data, parameter modularity, parameter independence, and Dirichlet prior. [sent-23, score-0.29]
13 Note that the Dirichlet prior over the parameters is often used for two reasons: (1) the conjugate prior permits analytical calculations, and (2) the Dirichlet prior is intimately tied to the desirable likelihood-equivalence property of network structures [6]. [sent-24, score-0.804]
14 The Dirichlet prior over the parameters θ_{X_i|Π_i} is given by (1), where θ_{x_i|π_i} pertains to variable X_i in state x_i given that its parents Π_i are in joint state π_i. [sent-25, score-0.31]
15 Due to lack of prior knowledge, p is often chosen to be uniform, p(x_i, π_i) = 1/(|X_i|·|Π_i|), where |X_i| and |Π_i| denote the number of (joint) states [1]. [sent-33, score-0.177]
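The equation referenced as (1) is not reproduced in this extraction. A standard form consistent with the surrounding text and with the likelihood-equivalent Dirichlet parameterization of [6] would be the following (a reconstruction under that assumption, not a verbatim copy from the paper):

```latex
p\bigl(\theta_{X_i \mid \pi_i}\bigr) \;\propto\; \prod_{x_i} \theta_{x_i \mid \pi_i}^{\,\alpha_{x_i,\pi_i}-1},
\qquad
\alpha_{x_i,\pi_i} \;=\; \alpha \, p(x_i,\pi_i),
\qquad
\sum_{x_i,\,\pi_i} p(x_i,\pi_i) \;=\; 1 .
```

With the uniform choice p(x_i, π_i) = 1/(|X_i|·|Π_i|), every hyper-parameter equals α/(|X_i|·|Π_i|), so a single scale α controls the strength of the entire prior.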
16 The scale parameter α of the Dirichlet prior is positive and independent of i. [sent-34, score-0.491]
17 The estimated parameters become "smoother" or "less extreme" when the prior distribution p is close to uniform. [sent-39, score-0.177]
18 An increasing scale parameter α leads to a stronger regularization, while in the limit α → 0, the (unregularized) maximum likelihood estimate is obtained, as expected. [sent-40, score-0.487]
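As a concrete illustration of the parameter regularization just described, the sketch below computes the posterior-mean ("smoothed") estimate of a conditional probability table under a Dirichlet prior with a uniform p(x_i, π_i). It is a minimal illustration written for this summary, not code from the paper; the function and variable names are ours.

```python
import numpy as np

def dirichlet_smoothed_cpt(counts, alpha, p_prior=None):
    """Posterior-mean ("smoothed") estimate of a conditional probability table
    theta[x | parent config] under a Dirichlet prior with scale alpha.

    counts  : array (n_parent_configs, n_states) of joint counts N[pi, x]
    alpha   : scale ("equivalent sample size") of the Dirichlet prior
    p_prior : prior distribution p(x, pi) of the same shape; uniform if None
    """
    counts = np.asarray(counts, dtype=float)
    if p_prior is None:
        p_prior = np.full_like(counts, 1.0 / counts.size)   # uniform p(x, pi)
    pseudo = alpha * p_prior                  # alpha_{x, pi} = alpha * p(x, pi)
    num = counts + pseudo                     # N_{x, pi} + alpha_{x, pi}
    return num / num.sum(axis=1, keepdims=True)   # divided by N_{pi} + alpha_{pi}

# As alpha -> 0 the estimate approaches the (unregularized) MLE; a large alpha
# pulls every row towards the (uniform) prior distribution.
N = np.array([[8, 2], [1, 0]])                # one variable, two parent configurations
for a in (1e-6, 1.0, 10.0, 1000.0):
    print(a, dirichlet_smoothed_cpt(N, a).round(3))
```

With α = 1e-6 the output is essentially the maximum-likelihood table (including the hard 1/0 split in the sparsely observed parent configuration), while α = 1000 pulls every row toward uniform.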
19 3 Regularization of Structure In the remainder of this paper, we outline effects due to Bayesian regularization of the Bayesian network structure when using a product of Dirichlet priors. [sent-41, score-0.446]
20 In the Bayesian approach to structure learning, the posterior probability of the network structure m is given by p(m|D) = p(D|m)p(m)/p(D), where p(D) is the (unknown) probability of the given data D, and p(m) denotes the prior distribution over the network structures; we assume p(m) > 0 for all m. [sent-43, score-0.75]
21 Following the assumptions outlined in [6], including the Dirichlet prior over the parameters θ, the marginal likelihood p(D|m) = E_{p(θ|m)}[p(D|m, θ)] can be calculated analytically. [sent-44, score-0.431]
22 Assuming the (i.i.d.) data arrived in a sequential manner, it can be written as p(D|m) = ∏_{k=1}^{N} ∏_{i=1}^{n} ( N^{(k-1)}_{x_i^k, π_i^k} + α_{x_i^k, π_i^k} ) / ( N^{(k-1)}_{π_i^k} + α_{π_i^k} ) (4), where N^{(k-1)} denotes the counts implied by the data D^{(k-1)} seen before step k along the sequence (k = 1, ..., N). [sent-48, score-0.193]
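The sequential product in Eq. (4) telescopes into the familiar closed form of the Dirichlet marginal likelihood in terms of gamma functions. The sketch below (our own illustration, assuming the uniform p(x_i, π_i) of [1]; not the authors' code) scores a discrete network structure this way.

```python
import numpy as np
from scipy.special import gammaln
from itertools import product

def log_marginal_likelihood(data, parents, cards, alpha):
    """Closed-form Dirichlet marginal likelihood log p(D | m) of a discrete
    Bayesian network m -- equivalent to the sequential product in Eq. (4).

    data    : (N, n) integer array, data[k, i] is the state of X_i in case k
    parents : dict i -> list of parent indices under structure m
    cards   : list of cardinalities |X_i|
    alpha   : scale parameter of the Dirichlet prior (uniform p(x_i, pi_i) assumed)
    """
    data = np.asarray(data)
    total = 0.0
    for i, pa in parents.items():
        q_i = int(np.prod([cards[j] for j in pa])) if pa else 1
        a_xp = alpha / (cards[i] * q_i)          # alpha_{x_i, pi_i} for uniform p
        a_p = a_xp * cards[i]                    # alpha_{pi_i} = sum_x alpha_{x, pi}
        for pi_cfg in product(*[range(cards[j]) for j in pa]):
            mask = np.ones(len(data), dtype=bool)
            for j, v in zip(pa, pi_cfg):
                mask &= data[:, j] == v
            n_x = np.bincount(data[mask, i], minlength=cards[i])
            total += gammaln(a_p) - gammaln(a_p + n_x.sum())
            total += np.sum(gammaln(a_xp + n_x) - gammaln(a_xp))
    return total

# Example: two binary variables, comparing m1: X0 -> X1 with the empty graph m0.
D = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
for a in (1e-3, 1.0, 100.0):
    log_bf = log_marginal_likelihood(D, {0: [], 1: [0]}, [2, 2], a) \
           - log_marginal_likelihood(D, {0: [], 1: []}, [2, 2], a)
    print(a, log_bf)
```

On this small example, where all four configurations occur in the data, the printed log Bayes factor for m_1 versus m_0 should become increasingly negative as α → 0, anticipating the limit studied next.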
23 In Eq. 4, we also decomposed the joint probability into a product of conditional probabilities according to the Bayesian network structure m. [sent-54, score-0.299]
24 3.1 Limit of Vanishing Scale-Parameter This section is concerned with the limit of a vanishing scale parameter of the Dirichlet prior, α → 0. [sent-60, score-0.48]
25 In this limit Bayesian regularization depends crucially on the number of zero-cell-counts in the contingency table implied by the data, or in other words, on the number of different configurations (data points) contained in the data. [sent-61, score-0.757]
26 When all cell counts are positive, EP is identical to the well-known number of parameters (P), d_EP^(m) = d_P^(m) = Σ_i (|X_i| − 1)·|Π_i|, which plays an important role in regularizing the learned network structure. [sent-63, score-0.343]
27 Let us now consider the behavior of the marginal likelihood (cf. Eq. 4). [sent-65, score-0.255]
28 We find Proposition 1: Under the assumptions concerning the prior distribution outlined in Section 2, the marginal likelihood of a Bayesian network structure m vanishes in the limit α → 0 if the data D contain two or more different configurations. [sent-68, score-0.913]
29 The leading polynomial order is given by p(D|m) ∼ α^(d_EP^(m)) as α → 0, (6) which depends both on the network structure and the data. [sent-70, score-0.337]
30 This holds independently of a particular choice of strictly positive prior distributions p(x_i, π_i). [sent-72, score-0.33]
31 If the prior over the network structures is strictly positive, this limiting behavior also holds for the posterior probability p(m|D). [sent-73, score-0.684]
32 First, let us consider the behavior of the Dirichlet distribution in the limit α → 0. [sent-75, score-0.223]
33 The hyper-parameters α_{x_i,π_i} vanish when α → 0, and thus the Dirichlet prior converges to a discrete distribution over the parameter simplex in the sense that the probability mass concentrates at a particular, randomly chosen corner of the simplex containing θ. [sent-76, score-0.336]
34 This well-known fact also shows that the limit α → 0 actually corresponds to a very strong prior belief [9, 12]. [sent-80, score-0.388]
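This corner-concentration effect is easy to see by sampling; the snippet below (illustrative only) draws from a symmetric Dirichlet whose total mass α is spread over three states.

```python
import numpy as np

rng = np.random.default_rng(0)
for a in (100.0, 1.0, 0.01):
    # Each row is one sampled parameter vector theta on the 3-state simplex.
    print(a, rng.dirichlet([a / 3] * 3, size=3).round(3))
```

For α = 100 the draws cluster near the uniform distribution, whereas for α = 0.01 almost all probability mass sits on a single, randomly chosen state in each draw - a very strong (if degenerate) prior commitment rather than "no prior information".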
35 This is in contrast to many traditional interpretations where the limit α → 0 is considered as "no prior information", often motivated by Eq. [sent-81, score-0.344]
36 As pointed out in [9, 12], the interpretation of the scale parameter α as "equivalent sample size" or as the "strength" of prior belief may be misleading, particularly in the case where α_{x_i,π_i} < 1 for some configurations x_i, π_i. [sent-83, score-0.649]
37 Note that the noninformative prior in the sense of entropy is achieved by setting α_{x_i,π_i} = 1 for each x_i, π_i and for all i = 1, ..., n. [sent-85, score-0.23]
38 In order to explain the behavior of the marginal likelihood in leading order of the scale parameter α, the properties of the Dirichlet distribution are not sufficient by themselves. [sent-91, score-0.537]
39 Additionally, it is essential that the probability distribution described by a Bayesian network decomposes into a product of conditional probabilities, and that there is a Dirichlet prior pertaining to each variable for each parent configuration. [sent-92, score-0.421]
40 All these Dirichlet priors are independent of each other under the standard assumption of parameter independence. [sent-93, score-0.197]
41 In this case, the leading polynomial order of the ratio (for given k and i) is linear in α, assuming p(x_i, π_i) > 0; otherwise the ratio (for given k and i) converges to a finite positive value in the limit α → 0. [sent-95, score-0.496]
42 Consequently, the dependence of the marginal likelihood in leading polynomial order on α is completely determined by the number of different configurations in the data. [sent-97, score-0.55]
43 This is because the first term counts the number of all the different joint configurations of X_i, Π_i in the data, while the second term ensures that EP counts only those configurations where (x_i^k, π_i^k) is "new" while π_i^k is "old". [sent-101, score-0.566]
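The counting rule just described can be written down directly. The sketch below encodes our reading of it (per family: distinct (x_i, π_i) configurations seen in the data minus distinct parent configurations π_i seen in the data); the function name and the `parents` dictionary are our own notation, and it also returns the usual parameter count d_P for comparison.

```python
import numpy as np

def effective_dims(data, parents, cards):
    """Effective number of parameters (EP) of structure m given data D,
    together with the usual count d_P = sum_i (|X_i| - 1) * |Pi_i|."""
    data = np.asarray(data)
    d_EP, d_P = 0, 0
    for i, pa in parents.items():
        fam = {tuple(row[[i] + list(pa)]) for row in data}    # distinct (x_i, pi_i)
        par = {tuple(row[list(pa)]) for row in data}          # distinct pi_i
        d_EP += len(fam) - len(par)
        d_P += (cards[i] - 1) * int(np.prod([cards[j] for j in pa]))
    return d_EP, d_P

# When every cell of the contingency table is filled, EP equals P;
# with sparse data EP is smaller.
D_full = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
D_sparse = np.array([[0, 0], [1, 1]])
m1 = {0: [], 1: [0]}
print(effective_dims(D_full, m1, [2, 2]))    # (3, 3)
print(effective_dims(D_sparse, m1, [2, 2]))  # (1, 3)
```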
44 Note that the behavior of the marginal likelihood in Proposition 1 is not entirely determined by the network structure in the limit α → 0, as it still depends on the data. [sent-102, score-0.679]
45 Given data D, three Dirichlet priors are relevant regarding graph m_1, X_0 → X_1, but only two Dirichlet priors pertain to the empty graph, m_0. [sent-105, score-0.42]
46 Second, let us now assume that all possible configurations occur in data D. [sent-107, score-0.189]
47 Concerning graph m_1, however, the marginal likelihood now also involves the vanishing terms due to the two priors pertaining to θ_{X_1|X_0=0} and θ_{X_1|X_0=1}, and it hence becomes p(D|m_1) ∼ α^3. [sent-109, score-0.654]
48 Let the edge be present in graph m+ and absent in m-. [sent-113, score-0.233]
49 The fact that the marginal likelihood decomposes into terms pertaining to each of the variables (cf. [sent-114, score-0.305]
50 α^(d_EDF) in the limit α → 0, and hence Proposition 2: Let m+ and m− be the two network structures as defined above. [sent-118, score-0.44]
51 The result holds independently of a particular choice of strictly positive prior distributions p(x_i, π_i). [sent-122, score-0.33]
52 If the prior over the network structures is strictly positive, this limiting behavior also holds for the posterior ratio. [sent-123, score-0.684]
53 A positive value of the log Bayes factor indicates that the presence of the edge A ← B is favored, given the parents Π_A; conversely, a negative relative score suggests that the absence of this edge is preferred. [sent-124, score-0.52]
54 The divergence of this relative Bayesian score implies that there exists a (small) positive threshold value α_0 > 0 such that, for any α < α_0, the same graph(s) are favored as in the limit. [sent-125, score-0.23]
55 Since Proposition 2 applies to every edge in the network, it follows immediately that the empty (complete) graph is assigned the highest relative Bayesian score when EDF are positive (negative). [sent-126, score-0.463]
56 Regularization of network structure in the case of positive EDF is therefore extreme, permitting only the empty graph. [sent-127, score-0.433]
57 This is precisely the opposite of what one may have expected in this limit, namely the complete graph corresponding to the unregularized maximum likelihood estimate (MLE). [sent-128, score-0.378]
58 In contrast, when EDF are negative, the complete graph is favored. [sent-129, score-0.197]
59 It is thus surprising that a small data set, where one might expect an increased restriction on model complexity, actually gives rise to the complete graph, while a large data set yields the - most regularized - empty graph in the limit α → 0. [sent-132, score-0.502]
60 This is because the marginal contingency tables implied by the data with respect to a sparse (dense) graph may contain a small (large) number of zero-cell-counts. [sent-134, score-0.492]
61 The relative Bayesian score can hence become rather unstable in this case, as completely different graph structures are optimal in the limit α → 0, namely graphs where each variable has either the maximal number of parents or none. [sent-135, score-0.638]
62 Thus, these hyper-parameters can also vanish in the limit of a large number of configurations (x, π) even though the scale parameter α remains fixed. [sent-141, score-0.654]
63 With a finite data set and a large number of joint configurations, only the typical limit in Proposition 2 is possible. [sent-143, score-0.303]
64 The surprising behavior implied by Proposition 2 therefore does not carry over to Dirichlet processes. [sent-145, score-0.212]
65 Its value is determined by the priors p(x_i, π_i), as well as by the counts implied by the data. [sent-148, score-0.29]
66 The value of the Bayes factor can therefore easily be set by adjusting the prior weights p(x_i, π_i). [sent-149, score-0.227]
67 3.2 Large Scale-Parameter In the other limiting case, where α → ∞, the Bayes factor approaches a finite value, which in general depends on the given data and on the prior distributions p(x_i, π_i). [sent-151, score-0.362]
68 When the popular choice of a uniform prior p(x_i, π_i) is used [1], then log [ p(D|m+) / p(D|m−) ] → 0 as α → ∞, (9) which is independent of the data. [sent-161, score-0.177]
69 Hence, neither the presence nor the absence of the edge between A and B is favored in this limit. [sent-162, score-0.186]
70 Given a uniform prior over the network structures, p(m) = const, the posterior distribution p(m|D) over the graphs thus becomes increasingly spread out as α grows, permitting more variable network structures. [sent-163, score-0.696]
71 The behavior of the Bayes factor between the two limits α → 0 and α → ∞ is exemplified for positive EDF in Figure 1: there are two qualitatively different behaviors, depending on the degree of statistical dependence between A and B. [sent-164, score-0.309]
72 A sufficiently weak dependence results in a monotonically increasing Bayes factor which favors the absence of the edge A ← B at any finite value of α. [sent-165, score-0.421]
73 In contrast, given a sufficiently strong dependence between A and B, the log Bayes factor takes on positive values for all (finite) α exceeding a certain value α+ of the scale parameter. [sent-166, score-0.424]
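The two regimes described here are easy to reproduce numerically. The sketch below (a self-contained illustration written for this summary, not the authors' experiment; Figure 1 itself is not reproduced in this extraction) sweeps α and prints the log Bayes factor for the single edge A ← B under weakly and strongly dependent synthetic data. Only the family score of B differs between the two structures, so the other terms cancel.

```python
import numpy as np
from scipy.special import gammaln

def family_logscore(x, pa_cfg, r_x, r_pa, alpha):
    """Dirichlet family score log p(x-column | parent-column) with a uniform
    prior p(x, pa) = 1/(r_x * r_pa); cf. Eq. (4)."""
    a_xp = alpha / (r_x * r_pa)      # alpha_{x, pa}
    a_p = a_xp * r_x                 # alpha_{pa}
    s = 0.0
    for c in range(r_pa):
        n_x = np.bincount(x[pa_cfg == c], minlength=r_x)
        s += gammaln(a_p) - gammaln(a_p + n_x.sum())
        s += np.sum(gammaln(a_xp + n_x) - gammaln(a_xp))
    return s

rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=500)
B_strong = np.where(rng.random(500) < 0.90, A, 1 - A)   # strong dependence on A
B_weak = np.where(rng.random(500) < 0.55, A, 1 - A)     # weak dependence on A

for name, B in [("strong", B_strong), ("weak", B_weak)]:
    for alpha in (0.01, 1.0, 10.0, 100.0, 1e4):
        log_bf = family_logscore(B, A, 2, 2, alpha) \
               - family_logscore(B, np.zeros_like(A), 2, 1, alpha)
        print(name, alpha, round(log_bf, 2))
```

Exact numbers depend on the sampled data, but the qualitative pattern matches the text: for very small α the log Bayes factor is negative in both cases (all four configurations occur, so EDF is positive), it turns positive above some α+ only under strong dependence, and it decays toward 0 as α becomes very large.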
74 Consequently, given a domain with a range of degrees of statistical dependences, the number of edges in the learned graph increases monotonically with a growing scale parameter α when each variable has at most one parent (i.e., for forests). [sent-168, score-0.52]
75 This is because increasingly weaker statistical dependencies between variables are recovered as α grows; the restriction to forests excludes possible "interactions" among (several) parents of a variable. [sent-171, score-0.181]
76 As suggested by our experiments, this increase in the number of edges can also be expected to hold for general Bayesian network structures (although not necessarily in a monotonic way). [sent-172, score-0.31]
77 This reveals that regularization of network structure tends to diminish with a growing scale parameter. [sent-173, score-0.625]
78 Note that this is in the opposite direction to the regularization of parameters (cf. [sent-174, score-0.226]
79 Hence, the scale parameter α of the Dirichlet prior determines the trade-off between regularizing the parameters vs. the model structure. [sent-176, score-0.469]
80 If a uniform prior over the network structures is chosen, p(m) = const, the above discussion also holds for the posterior ratio (instead of the Bayes factor). [sent-178, score-0.585]
81 The behavior is more complicated, however, when a non-uniform prior is assumed. [sent-179, score-0.233]
82 For instance, when a prior is chosen that penalizes the presence of edges, the posterior favours the absence of an edge not only when the scale parameter is sufficiently small, but also when it is sufficiently large. [sent-180, score-0.794]
83 This is because regularization of model structure diminishes, while regularization of parameters increases with a growing scale parameter α of the Dirichlet prior, as discussed in the previous sections. [sent-184, score-0.739]
84 When the entire model is taken into account, one can use a sensitivity analysis in order to determine the dependence of the learned model on the scale parameter α, given the prior p(x_i, π_i) (cf. [sent-185, score-0.54]
85 The influence of the scale parameter α on the predictive accuracy of the model can be assessed by cross-validation or, in a Bayesian approach, prequential validation [11, 3]. [sent-188, score-0.329]
86 Another possibility is to treat the scale parameter α as an additional parameter of the model to be learned from data. [sent-189, score-0.381]
87 Hence, prior belief regarding the parameters θ can then enter only through the (normalized) distributions p(x_i, π_i). [sent-190, score-0.221]
88 Note that this is sufficient to determine the (average) prior parameter estimate (cf. [sent-192, score-0.277]
89 Assuming an (improper) uniform prior distribution over α, its posterior distribution is p(α|D) ∝ p(D|α), given data D. [sent-197, score-0.236]
90 Alternatively, assuming that the posterior over α is strongly peaked, the likelihood may also be approximated by summing over the k most likely graphs m only (k = 1 in the most extreme case; empirical Bayes). [sent-199, score-0.249]
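One way to make the last two suggestions concrete is a simple grid search: for each candidate α, score a set of candidate structures and keep the best, i.e. the k = 1 empirical-Bayes approximation of p(D|α) mentioned above. The sketch below is our own toy illustration for two binary variables with a brute-force candidate set, not the exhaustive search over 29,281 graphs used in the paper's experiment.

```python
import numpy as np
from scipy.special import gammaln

def fam_score(x, pa_cfg, r_x, r_pa, alpha):
    """Dirichlet family score with uniform p(x, pa); cf. Eq. (4)."""
    a_xp, a_p = alpha / (r_x * r_pa), alpha / r_pa
    s = 0.0
    for c in range(r_pa):
        n = np.bincount(x[pa_cfg == c], minlength=r_x)
        s += gammaln(a_p) - gammaln(a_p + n.sum()) + np.sum(gammaln(a_xp + n) - gammaln(a_xp))
    return s

def best_structure_score(A, B, alpha):
    """log p(D | m, alpha) maximized over the three structures on {A, B}:
    empty, A -> B, B -> A (the last two are likelihood-equivalent here)."""
    none = np.zeros_like(A)
    empty = fam_score(A, none, 2, 1, alpha) + fam_score(B, none, 2, 1, alpha)
    a_to_b = fam_score(A, none, 2, 1, alpha) + fam_score(B, A, 2, 2, alpha)
    b_to_a = fam_score(B, none, 2, 1, alpha) + fam_score(A, B, 2, 2, alpha)
    return max(empty, a_to_b, b_to_a)

rng = np.random.default_rng(1)
A = rng.integers(0, 2, 300)
B = np.where(rng.random(300) < 0.8, A, 1 - A)

# Empirical-Bayes choice of alpha: grid search over the k = 1 approximation of p(D | alpha).
grid = [0.1, 1, 3, 10, 30, 100, 300]
scores = {a: best_structure_score(A, B, a) for a in grid}
print(max(scores, key=scores.get), scores)
```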
91 In the following, the effect of various values assigned to the scale parameter α is exemplified concerning the data set gathered from Wisconsin high-school students by Sewell and Shah [10]. [sent-201, score-0.343]
92 In this small domain, exhaustive search in the space of Bayesian network structures is feasible (29,281 graphs). [sent-203, score-0.273]
93 Figure 2 shows that the number of edges in the graph with the highest posterior probability grows with an increasing value of the scale parameter, as expected (cf. [sent-205, score-0.443]
94 This graph can easily be interpreted in a causal manner, as outlined in [5]. [sent-213, score-0.217]
95 We note that this graph was also obtained in [5] due, however, to additional constraints concerning network structure, as a rather small prior strength of α = 5 was used. [sent-214, score-0.572]
96 Fig. 3 also shows the highest-scoring unconstrained graph due to α = 5, which does not permit a causal interpretation, cf. [sent-216, score-0.215]
97 This illustrates that the "right" choice of the scale parameter α of the Dirichlet prior, when accounting for both model structure and parameters, can have a crucial impact on the learned network structure and the resulting insight into the ("true") dependencies among the variables in the domain. [sent-218, score-0.657]
98 Since we did not impose any constraints on the network structure, unlike in [5], Markov equivalence leaves the orientation of the edge between the variables IQ and CP unspecified. [sent-220, score-0.283]
99 Figure caption fragment: likelihood of α when treated as an additional model parameter (α_D = 69). [sent-233]
100 A note on the scale parameter of the Dirichlet process. [sent-316, score-0.239]
wordName wordTfidf (topN-words)
[('dirichlet', 0.388), ('edf', 0.238), ('dlm', 0.207), ('configurations', 0.189), ('regularization', 0.189), ('prior', 0.177), ('network', 0.175), ('limit', 0.167), ('graph', 0.162), ('xi', 0.153), ('iii', 0.152), ('bayesian', 0.144), ('scale', 0.139), ('implied', 0.12), ('marginal', 0.118), ('proposition', 0.107), ('dla', 0.106), ('parameter', 0.1), ('bayes', 0.099), ('ep', 0.099), ('structures', 0.098), ('priors', 0.097), ('finite', 0.094), ('contingency', 0.092), ('parents', 0.091), ('xf', 0.084), ('dependence', 0.082), ('structure', 0.082), ('likelihood', 0.081), ('dedf', 0.079), ('harald', 0.079), ('iiiil', 0.079), ('sufficiently', 0.078), ('positive', 0.075), ('vanishing', 0.074), ('graphs', 0.073), ('counts', 0.073), ('edge', 0.071), ('favored', 0.069), ('pertaining', 0.069), ('mid', 0.069), ('na', 0.065), ('empty', 0.064), ('unregularized', 0.063), ('posterior', 0.059), ('vanish', 0.059), ('tommi', 0.059), ('concerning', 0.058), ('behavior', 0.056), ('outlined', 0.055), ('dlmd', 0.053), ('dlmo', 0.053), ('encouragement', 0.053), ('forests', 0.053), ('ixil', 0.053), ('lbf', 0.053), ('lxo', 0.053), ('noninformative', 0.053), ('prequential', 0.053), ('sewell', 0.053), ('steck', 0.053), ('unconstraint', 0.053), ('regularizing', 0.053), ('factor', 0.05), ('score', 0.047), ('ad', 0.047), ('absence', 0.046), ('grows', 0.046), ('exemplified', 0.046), ('favours', 0.046), ('immediately', 0.044), ('belief', 0.044), ('leading', 0.043), ('const', 0.042), ('parental', 0.042), ('xo', 0.042), ('strictly', 0.042), ('joint', 0.042), ('learned', 0.042), ('limiting', 0.041), ('artificial', 0.041), ('growing', 0.04), ('ratio', 0.04), ('ao', 0.039), ('iq', 0.039), ('regularized', 0.038), ('assignment', 0.037), ('opposite', 0.037), ('predictive', 0.037), ('edges', 0.037), ('polynomial', 0.037), ('xv', 0.037), ('acknowledges', 0.037), ('permitting', 0.037), ('variables', 0.037), ('holds', 0.036), ('surprising', 0.036), ('extreme', 0.036), ('complete', 0.035), ('dk', 0.035)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization
Author: Harald Steck, Tommi S. Jaakkola
Abstract: A common objective in learning a model from data is to recover its network structure, while the model parameters are of minor interest. For example, we may wish to recover regulatory networks from high-throughput data sources. In this paper we examine how Bayesian regularization using a product of independent Dirichlet priors over the model parameters affects the learned model structure in a domain with discrete variables. We show that a small scale parameter - often interpreted as
2 0.1378981 114 nips-2002-Information Regularization with Partially Labeled Data
Author: Martin Szummer, Tommi S. Jaakkola
Abstract: Classification with partially labeled data requires using a large number of unlabeled examples (or an estimated marginal P (x)), to further constrain the conditional P (y|x) beyond a few available labeled examples. We formulate a regularization approach to linking the marginal and the conditional in a general way. The regularization penalty measures the information that is implied about the labels over covering regions. No parametric assumptions are required and the approach remains tractable even for continuous marginal densities P (x). We develop algorithms for solving the regularization problem for finite covers, establish a limiting differential equation, and exemplify the behavior of the new regularization approach in simple cases.
3 0.13320792 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
Author: Ron Meir, Tong Zhang
Abstract: We consider Bayesian mixture approaches, where a predictor is constructed by forming a weighted average of hypotheses from some space of functions. While such procedures are known to lead to optimal predictors in several cases, where sufficiently accurate prior information is available, it has not been clear how they perform when some of the prior assumptions are violated. In this paper we establish data-dependent bounds for such procedures, extending previous randomized approaches such as the Gibbs algorithm to a fully Bayesian setting. The finite-sample guarantees established in this work enable the utilization of Bayesian mixture approaches in agnostic settings, where the usual assumptions of the Bayesian paradigm fail to hold. Moreover, the bounds derived can be directly applied to non-Bayesian mixture approaches such as Bagging and Boosting. 1 Introduction and Motivation The standard approach to Computational Learning Theory is usually formulated within the so-called frequentist approach to Statistics. Within this paradigm one is interested in constructing an estimator, based on a finite sample, which possesses a small loss (generalization error). While many algorithms have been constructed and analyzed within this context, it is not clear how these approaches relate to standard optimality criteria within the frequentist framework. Two classic optimality criteria within the latter approach are the minimax and admissibility criteria, which characterize optimality of estimators in a rigorous and precise fashion [9]. Except in some special cases [12], it is not known whether any of the approaches used within the Learning community lead to optimality in either of the above senses of the word. On the other hand, it is known that under certain regularity conditions, Bayesian estimators lead to either minimax or admissible estimators, and thus to well-defined optimality in the classical (frequentist) sense. In fact, it can be shown that Bayes estimators are essentially the only estimators which can achieve optimality in the above senses [9]. This optimality feature provides strong motivation for the study of Bayesian approaches in a frequentist setting. While Bayesian approaches have been widely studied, there have not been generally applicable bounds in the frequentist framework. Recently, several approaches have attempted to address this problem. In this paper we establish finite sample datadependent bounds for Bayesian mixture methods, which together with the above optimality properties suggest that these approaches should become more widely used. Consider the problem of supervised learning where we attempt to construct an estimator based on a finite sample of pairs of examples S = {(x1 , y1 ), . . . , (xn , yn )}, each drawn independently according to an unknown distribution µ(x, y). Let A be a learning algorithm which, based on the sample S, constructs a hypothesis (estimator) h from some set of hypotheses H. Denoting by (y, h(x)) the instantaneous loss of the hypothesis h, we wish to assess the true loss L(h) = Eµ (y, h(x)) where the expectation is taken with respect to µ. In particular, the objective is to provide data-dependent bounds of the following form. For any h ∈ H and δ ∈ (0, 1), with probability at least 1 − δ, L(h) ≤ Λ(h, S) + ∆(h, S, δ), (1) where Λ(h, S) is some empirical assessment of the true loss, and ∆(h, S, δ) is a complexity term. 
For example, in the classic Vapnik-Chervonenkis framework, Λ(h, S) n is the empirical error (1/n) i=1 (yi , h(xi )) and ∆(h, S, δ) depends on the VCdimension of H but is independent of both the hypothesis h and the sample S. By algorithm and data-dependent bounds we mean bounds where the complexity term depends on both the hypothesis (chosen by the algorithm A) and the sample S. 2 A Decision Theoretic Bayesian Framework Consider a decision theoretic setting where we define the sample dependent loss of an algorithm A by R(µ, A, S) = Eµ (y, A(x, S)). Let θµ be the optimal predictor for y, namely the function minimizing Eµ { (y, φ(x))} over φ. It is clear that the best algorithm A (Bayes algorithm) is the one that always return θµ , assuming µ is known. We are interested in the expected loss of an algorithm averaged over samples S: R(µ, A) = ES R(µ, A, S) = R(µ, A, S)dµ(S), where the expectation is taken with respect to the sample S drawn i.i.d. from the probability measure µ. If we consider a family of measures µ, which possesses some underlying prior distribution π(µ), then we can construct the averaged risk function with respect to the prior as, r(π, A) = Eπ R(µ, A) = where dπ(µ|S) = dµ(S)dπ(µ) dµ(S)dπ(µ) dµ(S)dπ(µ) R(µ, A, S)dπ(µ|S), is the posterior distribution on the µ family, which µ induces a posterior distribution on the sample space as πS = Eπ(µ|S) µ. An algorithm minimizing the Bayes risk r(π, A) is referred to as a Bayes algorithm. In fact, for a given prior, and a given sample S, the optimal algorithm should return the Bayes optimal predictor with respect to the posterior measure πS . For many important practical problems, the optimal Bayes predictor is a linear functional of the underlying probability measure. For example, if the loss function is quadratic, namely (y, A(x)) = (y −A(x))2 , then the optimal Bayes predictor θµ (x) is the conditional mean of y, namely Eµ [y|x]. For binary classification problems, we can let the predictor be the conditional probability θµ (x) = µ(y = 1|x) (the optimal classification decision rule then corresponds to a test of whether θµ (x) > 0.5), which is also a linear functional of µ. Clearly if the Bayes predictor is a linear functional of the probability measure, then the optimal Bayes algorithm with respect to the prior π is given by AB (x, S) = θµ (x)dπ(µ|S) = µ θ (x)dµ(S)dπ(µ) µ µ µ dµ(S)dπ(µ) . (2) In this case, an optimal Bayesian algorithm can be regarded as the predictor constructed by averaging over all predictors with respect to a data-dependent posterior π(µ|S). We refer to such methods as Bayesian mixture methods. While the Bayes estimator AB (x, S) is optimal with respect to the Bayes risk r(π, A), it can be shown, that under appropriate conditions (and an appropriate prior) it is also a minimax and admissible estimator [9]. In general, θµ is unknown. Rather we may have some prior information about possible models for θµ . In view of (2) we consider a hypothesis space H, and an algorithm based on a mixture of hypotheses h ∈ H. This should be contrasted with classical approaches where an algorithm selects a single hypothesis h form H. For simplicity, we consider a countable hypothesis space H = {h1 , h2 , . . .}; the general case will be deferred to the full paper. Let q = {qj }∞ be a probability j=1 vector, namely qj ≥ 0 and j qj = 1, and construct the composite predictor by fq (x) = j qj hj (x). Observe that in general fq (x) may be a great deal more complex that any single hypothesis hj . 
For example, if hj (x) are non-polynomial ridge functions, the composite predictor f corresponds to a two-layer neural network with universal approximation power. We denote by Q the probability distribution defined by q, namely j qj hj = Eh∼Q h. A main feature of this work is the establishment of data-dependent bounds on L(Eh∼Q h), the loss of the Bayes mixture algorithm. There has been a flurry of recent activity concerning data-dependent bounds (a non-exhaustive list includes [2, 3, 5, 11, 13]). In a related vein, McAllester [7] provided a data-dependent bound for the so-called Gibbs algorithm, which selects a hypothesis at random from H based on the posterior distribution π(h|S). Essentially, this result provides a bound on the average error Eh∼Q L(h) rather than a bound on the error of the averaged hypothesis. Later, Langford et al. [6] extended this result to a mixture of classifiers using a margin-based loss function. A more general result can also be obtained using the covering number approach described in [14]. Finally, Herbrich and Graepel [4] showed that under certain conditions the bounds for the Gibbs classifier can be extended to a Bayesian mixture classifier. However, their bound contained an explicit dependence on the dimension (see Thm. 3 in [4]). Although the approach pioneered by McAllester came to be known as PAC-Bayes, this term is somewhat misleading since an optimal Bayesian method (in the decision theoretic framework outline above) does not average over loss functions but rather over hypotheses. In this regard, the learning behavior of a true Bayesian method is not addressed in the PAC-Bayes analysis. In this paper, we would like to narrow the discrepancy by analyzing Bayesian mixture methods, where we consider a predictor that is the average of a family of predictors with respect to a data-dependent posterior distribution. Bayesian mixtures can often be regarded as a good approximation to a true optimal Bayesian method. In fact, we have shown above that they are equivalent for many important practical problems. Therefore the main contribution of the present work is the extension of the above mentioned results in PAC-Bayes analysis to a rather unified setting for Bayesian mixture methods, where different regularization criteria may be incorporated, and their effect on the performance easily assessed. Furthermore, it is also essential that the bounds obtained are dimension-independent, since otherwise they yield useless results when applied to kernel-based methods, which often map the input space into a space of very high dimensionality. Similar results can also be obtained using the covering number analysis in [14]. However the approach presented in the current paper, which relies on the direct computation of the Rademacher complexity, is more direct and gives better bounds. The analysis is also easier to generalize than the corresponding covering number approach. Moreover, our analysis applies directly to other non-Bayesian mixture approaches such as Bagging and Boosting. Before moving to the derivation of our bounds, we formalize our approach. Consider a countable hypothesis space H = {hj }∞ , and a probability distribution {qj } over j=1 ∞ H. Introduce the vector notation k=1 qk hk (x) = q h(x). A learning algorithm within the Bayesian mixture framework uses the sample S to select a distribution Q over H and then constructs a mixture hypothesis fq (x) = q h(x). 
In order to constrain the class of mixtures used in constructing the mixture q h we impose constraints on the mixture vector q. Let g(q) be a non-negative convex function of q and define for any positive A, ΩA = {q ∈ S : g(q) ≤ A} ; FA = fq : fq (x) = q h(x) : q ∈ ΩA , (3) where S denotes the probability simplex. In subsequent sections we will consider different choices for g(q), which essentially acts as a regularization term. Finally, for any mixture q h we define the loss by L(q h) = Eµ (y, (q h)(x)) and the n ˆ empirical loss incurred on the sample by L(q h) = (1/n) i=1 (yi , (q h)(xi )). 3 A Mixture Algorithm with an Entropic Constraint In this section we consider an entropic constraint, which penalizes weights deviating significantly from some prior probability distribution ν = {νj }∞ , which may j=1 incorporate our prior information about he problem. The weights q themselves are chosen by the algorithm based on the data. In particular, in this section we set g(q) to be the Kullback-Leibler divergence of q from ν, g(q) = D(q ν) ; qj log(qj /νj ). D(q ν) = j Let F be a class of real-valued functions, and denote by σi independent Bernoulli random variables assuming the values ±1 with equal probability. We define the data-dependent Rademacher complexity of F as 1 ˆ Rn (F) = Eσ sup n f ∈F n σi f (xi ) |S . i=1 ˆ The expectation of Rn (F) with respect to S will be denoted by Rn (F). We note ˆ n (F) is concentrated around its mean value Rn (F) (e.g., Thm. 8 in [1]). We that R quote a slightly adapted result from [5]. Theorem 1 (Adapted from Theorem 1 in [5]) Let {x1 , x2 , . . . , xn } ∈ X be a sequence of points generated independently at random according to a probability distribution P , and let F be a class of measurable functions from X to R. Furthermore, let φ be a non-negative Lipschitz function with Lipschitz constant κ, such that φ◦f is uniformly bounded by a constant M . Then for all f ∈ F with probability at least 1 − δ Eφ(f (x)) − 1 n n φ(f (xi )) ≤ 4κRn (F) + M i=1 log(1/δ) . 2n An immediate consequence of Theorem 1 is the following. Lemma 3.1 Let the loss function be bounded by M , and assume that it is Lipschitz with constant κ. Then for all q ∈ ΩA with probability at least 1 − δ log(1/δ) . 2n ˆ L(q h) ≤ L(q h) + 4κRn (FA ) + M Next, we bound the empirical Rademacher average of FA using g(q) = D(q ν). Lemma 3.2 The empirical Rademacher complexity of FA is upper bounded as follows: 2A n ˆ Rn (FA ) ≤ 1 n sup j n hj (xi )2 . i=1 Proof: We first recall a few facts from the theory of convex duality [10]. Let p(u) be a convex function over a domain U , and set its dual s(z) = supu∈U u z − p(u) . It is known that s(z) is also convex. Setting u = q and p(q) = j qj log(qj /νj ) we find that s(v) = log j νj ezj . From the definition of s(z) it follows that for any q ∈ S, q z≤ qj log(qj /νj ) + log ν j ez j . j j Since z is arbitrary, we set z = (λ/n) i σi h(xi ) and conclude that for q ∈ ΩA and any λ > 0 n 1 λ 1 sup A + log νj exp σi q h(xi ) ≤ σi hj (xi ) . n i=1 λ n i q∈ΩA j Taking the expectation with respect to σ, and using 2 Eσ {exp ( i σi ai )} ≤ exp i ai /2 , we have that λ 1 ˆ νj exp A + Eσ log σi hj (xi ) Rn (FA ) ≤ λ n j ≤ ≤ = i 1 λ A + sup log Eσ exp 1 λ A + sup log exp j j A λ + 2 sup λ 2n j λ2 n2 λ n i σi hj (xi ) the Chernoff bound (Jensen) i hj (xi )2 2 (Chernoff) hj (xi )2 . i Minimizing the r.h.s. with respect to λ, we obtain the desired result. Combining Lemmas 3.1 and 3.2 yields our basic bound, where κ and M are defined in Lemma 3.1. Theorem 2 Let S = {(x1 , y1 ), . 
. . , (xn , yn )} be a sample of i.i.d. points each drawn according to a distribution µ(x, y). Let H be a countable hypothesis class, and set FA to be the class defined in (3) with g(q) = D(q ν). Set ∆H = (1/n)Eµ supj 1−δ n i=1 hj (xi )2 1/2 . Then for any q ∈ ΩA with probability at least ˆ L(q h) ≤ L(q h) + 4κ∆H 2A +M n log(1/δ) . 2n Note that if hj are uniformly bounded, hj ≤ c, then ∆H ≤ c. Theorem 2 holds for a fixed value of A. Using the so-called multiple testing Lemma (e.g. [11]) we obtain: Corollary 3.1 Let the assumptions of Theorem 2 hold, and let {Ai , pi } be a set of positive numbers such that i pi = 1. Then for all Ai and q ∈ ΩAi with probability at least 1 − δ, ˆ L(q h) ≤ L(q h) + 4κ∆H 2Ai +M n log(1/pi δ) . 2n Note that the only distinction with Theorem 2 is the extra factor of log pi which is the price paid for the uniformity of the bound. Finally, we present a data-dependent bound of the form (1). Theorem 3 Let the assumptions of Theorem 2 hold. Then for all q ∈ S with probability at least 1 − δ, ˆ L(q h) ≤ L(q h) + max(κ∆H , M ) × 130D(q ν) + log(1/δ) . n (4) Proof sketch Pick Ai = 2i and pi = 1/i(i + 1), i = 1, 2, . . . (note that i pi = 1). For each q, let i(q) be the smallest index for which Ai(q) ≥ D(q ν) implying that log(1/pi(q) ) ≤ 2 log log2 (4D(q ν)). A few lines of algebra, to be presented in the full paper, yield the desired result. The results of Theorem 3 can be compared to those derived by McAllester [8] for the randomized Gibbs procedure. In the latter case, the first term on the r.h.s. is ˆ Eh∼Q L(h), namely the average empirical error of the base classifiers h. In our case ˆ the corresponding term is L(Eh∼Q h), namely the empirical error of the average hypothesis. Since Eh∼Q h is potentially much more complex than any single h ∈ H, we expect that the empirical term in (4) is much smaller than the corresponding term in [8]. Moreover, the complexity term we obtain is in fact tighter than the corresponding term in [8] by a logarithmic factor in n (although the logarithmic factor in [8] could probably be eliminated). We thus expect that Bayesian mixture approach advocated here leads to better performance guarantees. Finally, we comment that Theorem 3 can be used to obtain so-called oracle inequalities. In particular, let q∗ be the optimal distribution minimizing L(q h), which can only be computed if the underlying distribution µ(x, y) is known. Consider an ˆ algorithm which, based only on the data, selects a distribution q by minimizing the r.h.s. of (4), with the implicit constants appropriately specified. Then, using standard approaches (e.g. [2]) we can obtain a bound on L(ˆ h) − L(q∗ h). For q lack of space, we defer the derivation of the precise bound to the full paper. 4 General Data-Dependent Bounds for Bayesian Mixtures The Kullback-Leibler divergence is but one way to incorporate prior information. In this section we extend the results to general convex regularization functions g(q). Some possible choices for g(q) besides the Kullback-Leibler divergence are the standard Lp norms q p . In order to proceed along the lines of Section 3, we let s(z) be the convex function associated with g(q), namely s(z) = supq∈ΩA q z − g(q) . Repeating n 1 the arguments of Section 3 we have for any λ > 0 that n i=1 σi q h(xi ) ≤ 1 λ i σi h(xi ) , which implies that λ A+s n 1 ˆ Rn (FA ) ≤ inf λ≥0 λ λ n A + Eσ s σi h(xi ) . (5) i n Assume that s(z) is second order differentiable, and that for any h = i=1 σi h(xi ) 1 2 (s(h + ∆h) + s(h − ∆h)) − s(h) ≤ u(∆h). 
Then, assuming that s(0) = 0, it is easy to show by induction that n n Eσ s (λ/n) i=1 σi h(xi ) ≤ u((λ/n)h(xi )). (6) i=1 In the remainder of the section we focus on the the case of regularization based on the Lp norm. Consider p and q such that 1/q + 1/p = 1, p ∈ (1, ∞), and let p = max(p, 2) and q = min(q, 2). Note that if p ≤ 2 then q ≥ 2, q = p = 2 and if p > 2 1 then q < 2, q = q, p = p. Consider p-norm regularization g(q) = p q p , in which p 1 case s(z) = q z q . The Rademacher averaging result for p-norm regularization q is known in the Geometric theory of Banach spaces (type structure of the Banach space), and it also follows from Khinchtine’s inequality. We show that it can be easily obtained in our framework. In this case, it is easy to see that s(z) = Substituting in (5) we have 1 ˆ Rn (FA ) ≤ inf λ≥0 λ A+ λ n q−1 q where Cq = ((q − 1)/q ) 1/q q 1 q z q q n h(xi ) q q = i=1 implies u(h(x)) ≤ Cq A1/p n1/p 1 n q−1 q h(x) q q . 1/q n h(xi ) q q i=1 . Combining this result with the methods described in Section 3, we establish a bound for regularization based on the Lp norm. Assume that h(xi ) q is finite for all i, and set ∆H,q = E (1/n) n i=1 h(xi ) q q 1/q . Theorem 4 Let the conditions of Theorem 3 hold and set g(q) = (1, ∞). Then for all q ∈ S, with probability at least 1 − δ, ˆ L(q h) ≤ L(q h) + max(κ∆H,q , M ) × O q p + n1/p log log( q p 1 p q p p , p ∈ + 3) + log(1/δ) n where O(·) hides a universal constant that depends only on p. 5 Discussion We have introduced and analyzed a class of regularized Bayesian mixture approaches, which construct complex composite estimators by combining hypotheses from some underlying hypothesis class using data-dependent weights. Such weighted averaging approaches have been used extensively within the Bayesian framework, as well as in more recent approaches such as Bagging and Boosting. While Bayesian methods are known, under favorable conditions, to lead to optimal estimators in a frequentist setting, their performance in agnostic settings, where no reliable assumptions can be made concerning the data generating mechanism, has not been well understood. Our data-dependent bounds allow the utilization of Bayesian mixture models in general settings, while at the same time taking advantage of the benefits of the Bayesian approach in terms of incorporation of prior knowledge. The bounds established, being independent of the cardinality of the underlying hypothesis space, can be directly applied to kernel based methods. Acknowledgments We thank Shimon Benjo for helpful discussions. The research of R.M. is partially supported by the fund for promotion of research at the Technion and by the Ollendorff foundation of the Electrical Engineering department at the Technion. References [1] P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. In Proceedings of the Fourteenth Annual Conference on Computational Learning Theory, pages 224–240, 2001. [2] P.L. Bartlett, S. Boucheron, and G. Lugosi. Model selection and error estimation. Machine Learning, 48:85–113, 2002. [3] O. Bousquet and A. Chapelle. Stability and generalization. J. Machine Learning Research, 2:499–526, 2002. [4] R. Herbrich and T. Graepel. A pac-bayesian margin bound for linear classifiers; why svms work. In Advances in Neural Information Processing Systems 13, pages 224–230, Cambridge, MA, 2001. MIT Press. [5] V. Koltchinksii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. 
Statis., 30(1), 2002. [6] J. Langford, M. Seeger, and N. Megiddo. An improved predictive accuracy bound for averaging classifiers. In Proceeding of the Eighteenth International Conference on Machine Learning, pages 290–297, 2001. [7] D. A. McAllester. Some pac-bayesian theorems. In Proceedings of the eleventh Annual conference on Computational learning theory, pages 230–234, New York, 1998. ACM Press. [8] D. A. McAllester. PAC-bayesian model averaging. In Proceedings of the twelfth Annual conference on Computational learning theory, New York, 1999. ACM Press. [9] C. P. Robert. The Bayesian Choice: A Decision Theoretic Motivation. Springer Verlag, New York, 1994. [10] R.T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, N.J., 1970. [11] J. Shawe-Taylor, P. Bartlett, R.C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE trans. Inf. Theory, 44:1926– 1940, 1998. [12] Y. Yang. Minimax nonparametric classification - part I: rates of convergence. IEEE Trans. Inf. Theory, 45(7):2271–2284, 1999. [13] T. Zhang. Generalization performance of some learning problems in hilbert functional space. In Advances in Neural Information Processing Systems 15, Cambridge, MA, 2001. MIT Press. [14] T. Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527–550, 2002.
4 0.11060568 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
Author: Christopher M. Bishop, David Spiegelhalter, John Winn
Abstract: In recent years variational methods have become a popular tool for approximate inference and learning in a wide variety of probabilistic models. For each new application, however, it is currently necessary first to derive the variational update equations, and then to implement them in application-specific code. Each of these steps is both time consuming and error prone. In this paper we describe a general purpose inference engine called VIBES (‘Variational Inference for Bayesian Networks’) which allows a wide variety of probabilistic models to be implemented and solved variationally without recourse to coding. New models are specified either through a simple script or via a graphical interface analogous to a drawing package. VIBES then automatically generates and solves the variational equations. We illustrate the power and flexibility of VIBES using examples from Bayesian mixture modelling. 1
5 0.10819784 194 nips-2002-The Decision List Machine
Author: Marina Sokolova, Mario Marchand, Nathalie Japkowicz, John S. Shawe-taylor
Abstract: We introduce a new learning algorithm for decision lists to allow features that are constructed from the data and to allow a tradeoff between accuracy and complexity. We bound its generalization error in terms of the number of errors and the size of the classifier it finds on the training data. We also compare its performance on some natural data sets with the set covering machine and the support vector machine. 1
6 0.10008829 124 nips-2002-Learning Graphical Models with Mercer Kernels
7 0.097139537 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
8 0.08753711 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction
9 0.080314659 41 nips-2002-Bayesian Monte Carlo
10 0.077359565 110 nips-2002-Incremental Gaussian Processes
11 0.07669577 73 nips-2002-Dynamic Bayesian Networks with Deterministic Latent Tables
12 0.07433109 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
13 0.070550844 189 nips-2002-Stable Fixed Points of Loopy Belief Propagation Are Local Minima of the Bethe Free Energy
14 0.068457223 121 nips-2002-Knowledge-Based Support Vector Machine Classifiers
15 0.067449041 135 nips-2002-Learning with Multiple Labels
16 0.0664967 75 nips-2002-Dynamical Causal Learning
17 0.065662764 46 nips-2002-Boosting Density Estimation
18 0.065264001 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine
19 0.063106388 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
20 0.062517822 7 nips-2002-A Hierarchical Bayesian Markovian Model for Motifs in Biopolymer Sequences
topicId topicWeight
[(0, -0.218), (1, -0.056), (2, -0.036), (3, 0.034), (4, -0.037), (5, 0.109), (6, -0.112), (7, 0.078), (8, -0.054), (9, -0.039), (10, 0.071), (11, -0.083), (12, 0.138), (13, 0.009), (14, -0.092), (15, -0.107), (16, 0.031), (17, 0.048), (18, 0.067), (19, -0.036), (20, -0.058), (21, 0.038), (22, -0.047), (23, -0.111), (24, 0.013), (25, -0.049), (26, -0.155), (27, 0.148), (28, -0.027), (29, 0.055), (30, -0.056), (31, -0.175), (32, -0.009), (33, -0.086), (34, -0.022), (35, -0.075), (36, 0.085), (37, -0.033), (38, -0.03), (39, 0.114), (40, 0.02), (41, 0.009), (42, 0.051), (43, 0.012), (44, -0.08), (45, -0.003), (46, 0.006), (47, 0.238), (48, -0.093), (49, -0.196)]
simIndex simValue paperId paperTitle
same-paper 1 0.97145385 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization
Author: Harald Steck, Tommi S. Jaakkola
Abstract: A common objective in learning a model from data is to recover its network structure, while the model parameters are of minor interest. For example, we may wish to recover regulatory networks from high-throughput data sources. In this paper we examine how Bayesian regularization using a product of independent Dirichlet priors over the model parameters affects the learned model structure in a domain with discrete variables. We show that a small scale parameter - often interpreted as
2 0.660698 114 nips-2002-Information Regularization with Partially Labeled Data
Author: Martin Szummer, Tommi S. Jaakkola
Abstract: Classification with partially labeled data requires using a large number of unlabeled examples (or an estimated marginal P (x)), to further constrain the conditional P (y|x) beyond a few available labeled examples. We formulate a regularization approach to linking the marginal and the conditional in a general way. The regularization penalty measures the information that is implied about the labels over covering regions. No parametric assumptions are required and the approach remains tractable even for continuous marginal densities P (x). We develop algorithms for solving the regularization problem for finite covers, establish a limiting differential equation, and exemplify the behavior of the new regularization approach in simple cases.
3 0.64122355 7 nips-2002-A Hierarchical Bayesian Markovian Model for Motifs in Biopolymer Sequences
Author: Eric P. Xing, Michael I. Jordan, Richard M. Karp, Stuart Russell
Abstract: We propose a dynamic Bayesian model for motifs in biopolymer sequences which captures rich biological prior knowledge and positional dependencies in motif structure in a principled way. Our model posits that the position-specific multinomial parameters for monomer distribution are distributed as a latent Dirichlet-mixture random variable, and the position-specific Dirichlet component is determined by a hidden Markov process. Model parameters can be fit on training motifs using a variational EM algorithm within an empirical Bayesian framework. Variational inference is also used for detecting hidden motifs. Our model improves over previous models that ignore biological priors and positional dependence. It has much higher sensitivity to motifs during detection and a notable ability to distinguish genuine motifs from false recurring patterns.
4 0.55631548 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
Author: Christopher M. Bishop, David Spiegelhalter, John Winn
Abstract: In recent years variational methods have become a popular tool for approximate inference and learning in a wide variety of probabilistic models. For each new application, however, it is currently necessary first to derive the variational update equations, and then to implement them in application-specific code. Each of these steps is both time consuming and error prone. In this paper we describe a general purpose inference engine called VIBES (‘Variational Inference for Bayesian Networks’) which allows a wide variety of probabilistic models to be implemented and solved variationally without recourse to coding. New models are specified either through a simple script or via a graphical interface analogous to a drawing package. VIBES then automatically generates and solves the variational equations. We illustrate the power and flexibility of VIBES using examples from Bayesian mixture modelling. 1
5 0.53739679 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
Author: Ron Meir, Tong Zhang
Abstract: We consider Bayesian mixture approaches, where a predictor is constructed by forming a weighted average of hypotheses from some space of functions. While such procedures are known to lead to optimal predictors in several cases, where sufficiently accurate prior information is available, it has not been clear how they perform when some of the prior assumptions are violated. In this paper we establish data-dependent bounds for such procedures, extending previous randomized approaches such as the Gibbs algorithm to a fully Bayesian setting. The finite-sample guarantees established in this work enable the utilization of Bayesian mixture approaches in agnostic settings, where the usual assumptions of the Bayesian paradigm fail to hold. Moreover, the bounds derived can be directly applied to non-Bayesian mixture approaches such as Bagging and Boosting. 1 Introduction and Motivation The standard approach to Computational Learning Theory is usually formulated within the so-called frequentist approach to Statistics. Within this paradigm one is interested in constructing an estimator, based on a finite sample, which possesses a small loss (generalization error). While many algorithms have been constructed and analyzed within this context, it is not clear how these approaches relate to standard optimality criteria within the frequentist framework. Two classic optimality criteria within the latter approach are the minimax and admissibility criteria, which characterize optimality of estimators in a rigorous and precise fashion [9]. Except in some special cases [12], it is not known whether any of the approaches used within the Learning community lead to optimality in either of the above senses of the word. On the other hand, it is known that under certain regularity conditions, Bayesian estimators lead to either minimax or admissible estimators, and thus to well-defined optimality in the classical (frequentist) sense. In fact, it can be shown that Bayes estimators are essentially the only estimators which can achieve optimality in the above senses [9]. This optimality feature provides strong motivation for the study of Bayesian approaches in a frequentist setting. While Bayesian approaches have been widely studied, there have not been generally applicable bounds in the frequentist framework. Recently, several approaches have attempted to address this problem. In this paper we establish finite sample datadependent bounds for Bayesian mixture methods, which together with the above optimality properties suggest that these approaches should become more widely used. Consider the problem of supervised learning where we attempt to construct an estimator based on a finite sample of pairs of examples S = {(x1 , y1 ), . . . , (xn , yn )}, each drawn independently according to an unknown distribution µ(x, y). Let A be a learning algorithm which, based on the sample S, constructs a hypothesis (estimator) h from some set of hypotheses H. Denoting by (y, h(x)) the instantaneous loss of the hypothesis h, we wish to assess the true loss L(h) = Eµ (y, h(x)) where the expectation is taken with respect to µ. In particular, the objective is to provide data-dependent bounds of the following form. For any h ∈ H and δ ∈ (0, 1), with probability at least 1 − δ, L(h) ≤ Λ(h, S) + ∆(h, S, δ), (1) where Λ(h, S) is some empirical assessment of the true loss, and ∆(h, S, δ) is a complexity term. 
For example, in the classic Vapnik-Chervonenkis framework, Λ(h, S) n is the empirical error (1/n) i=1 (yi , h(xi )) and ∆(h, S, δ) depends on the VCdimension of H but is independent of both the hypothesis h and the sample S. By algorithm and data-dependent bounds we mean bounds where the complexity term depends on both the hypothesis (chosen by the algorithm A) and the sample S. 2 A Decision Theoretic Bayesian Framework Consider a decision theoretic setting where we define the sample dependent loss of an algorithm A by R(µ, A, S) = Eµ (y, A(x, S)). Let θµ be the optimal predictor for y, namely the function minimizing Eµ { (y, φ(x))} over φ. It is clear that the best algorithm A (Bayes algorithm) is the one that always return θµ , assuming µ is known. We are interested in the expected loss of an algorithm averaged over samples S: R(µ, A) = ES R(µ, A, S) = R(µ, A, S)dµ(S), where the expectation is taken with respect to the sample S drawn i.i.d. from the probability measure µ. If we consider a family of measures µ, which possesses some underlying prior distribution π(µ), then we can construct the averaged risk function with respect to the prior as, r(π, A) = Eπ R(µ, A) = where dπ(µ|S) = dµ(S)dπ(µ) dµ(S)dπ(µ) dµ(S)dπ(µ) R(µ, A, S)dπ(µ|S), is the posterior distribution on the µ family, which µ induces a posterior distribution on the sample space as πS = Eπ(µ|S) µ. An algorithm minimizing the Bayes risk r(π, A) is referred to as a Bayes algorithm. In fact, for a given prior, and a given sample S, the optimal algorithm should return the Bayes optimal predictor with respect to the posterior measure πS . For many important practical problems, the optimal Bayes predictor is a linear functional of the underlying probability measure. For example, if the loss function is quadratic, namely (y, A(x)) = (y −A(x))2 , then the optimal Bayes predictor θµ (x) is the conditional mean of y, namely Eµ [y|x]. For binary classification problems, we can let the predictor be the conditional probability θµ (x) = µ(y = 1|x) (the optimal classification decision rule then corresponds to a test of whether θµ (x) > 0.5), which is also a linear functional of µ. Clearly if the Bayes predictor is a linear functional of the probability measure, then the optimal Bayes algorithm with respect to the prior π is given by AB (x, S) = θµ (x)dπ(µ|S) = µ θ (x)dµ(S)dπ(µ) µ µ µ dµ(S)dπ(µ) . (2) In this case, an optimal Bayesian algorithm can be regarded as the predictor constructed by averaging over all predictors with respect to a data-dependent posterior π(µ|S). We refer to such methods as Bayesian mixture methods. While the Bayes estimator AB (x, S) is optimal with respect to the Bayes risk r(π, A), it can be shown, that under appropriate conditions (and an appropriate prior) it is also a minimax and admissible estimator [9]. In general, θµ is unknown. Rather we may have some prior information about possible models for θµ . In view of (2) we consider a hypothesis space H, and an algorithm based on a mixture of hypotheses h ∈ H. This should be contrasted with classical approaches where an algorithm selects a single hypothesis h form H. For simplicity, we consider a countable hypothesis space H = {h1 , h2 , . . .}; the general case will be deferred to the full paper. Let q = {qj }∞ be a probability j=1 vector, namely qj ≥ 0 and j qj = 1, and construct the composite predictor by fq (x) = j qj hj (x). Observe that in general fq (x) may be a great deal more complex that any single hypothesis hj . 
For example, if the h_j(x) are non-polynomial ridge functions, the composite predictor f_q corresponds to a two-layer neural network with universal approximation power. We denote by Q the probability distribution defined by q, namely Σ_j q_j h_j = E_{h∼Q} h. A main feature of this work is the establishment of data-dependent bounds on L(E_{h∼Q} h), the loss of the Bayes mixture algorithm. There has been a flurry of recent activity concerning data-dependent bounds (a non-exhaustive list includes [2, 3, 5, 11, 13]). In a related vein, McAllester [7] provided a data-dependent bound for the so-called Gibbs algorithm, which selects a hypothesis at random from H based on the posterior distribution π(h|S). Essentially, this result provides a bound on the average error E_{h∼Q} L(h) rather than a bound on the error of the averaged hypothesis. Later, Langford et al. [6] extended this result to a mixture of classifiers using a margin-based loss function. A more general result can also be obtained using the covering number approach described in [14]. Finally, Herbrich and Graepel [4] showed that under certain conditions the bounds for the Gibbs classifier can be extended to a Bayesian mixture classifier. However, their bound contained an explicit dependence on the dimension (see Thm. 3 in [4]). Although the approach pioneered by McAllester came to be known as PAC-Bayes, this term is somewhat misleading, since an optimal Bayesian method (in the decision theoretic framework outlined above) does not average over loss functions but rather over hypotheses. In this regard, the learning behavior of a true Bayesian method is not addressed in the PAC-Bayes analysis. In this paper, we would like to narrow the discrepancy by analyzing Bayesian mixture methods, where we consider a predictor that is the average of a family of predictors with respect to a data-dependent posterior distribution. Bayesian mixtures can often be regarded as a good approximation to a true optimal Bayesian method. In fact, we have shown above that they are equivalent for many important practical problems. Therefore the main contribution of the present work is the extension of the above-mentioned results in PAC-Bayes analysis to a rather unified setting for Bayesian mixture methods, where different regularization criteria may be incorporated and their effect on the performance easily assessed. Furthermore, it is also essential that the bounds obtained are dimension-independent, since otherwise they yield useless results when applied to kernel-based methods, which often map the input space into a space of very high dimensionality. Similar results can also be obtained using the covering number analysis in [14]. However, the approach presented in the current paper, which relies on the direct computation of the Rademacher complexity, is more direct and gives better bounds. The analysis is also easier to generalize than the corresponding covering number approach. Moreover, our analysis applies directly to other non-Bayesian mixture approaches such as Bagging and Boosting. Before moving to the derivation of our bounds, we formalize our approach. Consider a countable hypothesis space H = {h_j}_{j=1}^∞ and a probability distribution {q_j} over H. Introduce the vector notation Σ_{k=1}^∞ q_k h_k(x) = qᵀh(x). A learning algorithm within the Bayesian mixture framework uses the sample S to select a distribution Q over H and then constructs a mixture hypothesis f_q(x) = qᵀh(x).
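As a minimal sketch of this construction (the decision-stump hypotheses, the clipped hinge surrogate loss, and the randomly drawn weights are assumptions of the example, not taken from the paper), the following builds f_q(x) = qᵀh(x) for a finite hypothesis class, evaluates its empirical loss L̂(qᵀh), and computes the Kullback-Leibler divergence D(q‖ν) that the entropic constraint introduced next will bound.

```python
import numpy as np

# A sketch of the mixture construction f_q(x) = q^T h(x) for a finite
# hypothesis class; the hypotheses, data, and weights are illustrative only.
rng = np.random.default_rng(0)

def make_stump(threshold):
    """A simple base hypothesis h_j: a decision stump on a scalar input."""
    return lambda x: np.where(x > threshold, 1.0, -1.0)

hypotheses = [make_stump(t) for t in np.linspace(-1.0, 1.0, 9)]

# Sample S = {(x_i, y_i)} and a weight vector q on the simplex.
x = rng.uniform(-1.0, 1.0, size=50)
y = np.sign(x + 0.1 * rng.normal(size=50))
q = rng.dirichlet(np.ones(len(hypotheses)))            # mixture weights q_j
nu = np.full(len(hypotheses), 1.0 / len(hypotheses))   # prior weights nu_j

# Composite predictor f_q(x) = sum_j q_j h_j(x).
H = np.stack([h(x) for h in hypotheses])               # |H| x n matrix of h_j(x_i)
f_q = q @ H

# Empirical loss L_hat(q^T h) with a Lipschitz, bounded surrogate loss.
loss = np.clip(1.0 - y * f_q, 0.0, 2.0)                # clipped hinge: M = 2, kappa = 1
empirical_loss = loss.mean()

# KL divergence D(q || nu): the quantity the entropic constraint bounds.
kl = float(np.sum(q * np.log(q / nu)))
print(empirical_loss, kl)
```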
In order to constrain the class of mixtures used in constructing the mixture qᵀh we impose constraints on the mixture vector q. Let g(q) be a non-negative convex function of q and define, for any positive A, Ω_A = {q ∈ S : g(q) ≤ A}; F_A = {f_q : f_q(x) = qᵀh(x), q ∈ Ω_A}, (3) where S denotes the probability simplex. In subsequent sections we will consider different choices for g(q), which essentially acts as a regularization term. Finally, for any mixture qᵀh we define the loss by L(qᵀh) = E_µ ℓ(y, (qᵀh)(x)) and the empirical loss incurred on the sample by L̂(qᵀh) = (1/n) Σ_{i=1}^n ℓ(y_i, (qᵀh)(x_i)). 3 A Mixture Algorithm with an Entropic Constraint In this section we consider an entropic constraint, which penalizes weights deviating significantly from some prior probability distribution ν = {ν_j}_{j=1}^∞, which may incorporate our prior information about the problem. The weights q themselves are chosen by the algorithm based on the data. In particular, in this section we set g(q) to be the Kullback-Leibler divergence of q from ν, g(q) = D(q‖ν); D(q‖ν) = Σ_j q_j log(q_j/ν_j). Let F be a class of real-valued functions, and denote by σ_i independent Bernoulli random variables assuming the values ±1 with equal probability. We define the data-dependent Rademacher complexity of F as R̂_n(F) = E_σ [ sup_{f∈F} (1/n) Σ_{i=1}^n σ_i f(x_i) | S ]. The expectation of R̂_n(F) with respect to S will be denoted by R_n(F). We note that R̂_n(F) is concentrated around its mean value R_n(F) (e.g., Thm. 8 in [1]). We quote a slightly adapted result from [5]. Theorem 1 (Adapted from Theorem 1 in [5]) Let {x_1, x_2, . . . , x_n} ∈ X be a sequence of points generated independently at random according to a probability distribution P, and let F be a class of measurable functions from X to R. Furthermore, let φ be a non-negative Lipschitz function with Lipschitz constant κ, such that φ◦f is uniformly bounded by a constant M. Then for all f ∈ F, with probability at least 1 − δ, E φ(f(x)) − (1/n) Σ_{i=1}^n φ(f(x_i)) ≤ 4κ R_n(F) + M √(log(1/δ)/(2n)). An immediate consequence of Theorem 1 is the following. Lemma 3.1 Let the loss function ℓ be bounded by M, and assume that it is Lipschitz with constant κ. Then for all q ∈ Ω_A, with probability at least 1 − δ, L(qᵀh) ≤ L̂(qᵀh) + 4κ R_n(F_A) + M √(log(1/δ)/(2n)). Next, we bound the empirical Rademacher average of F_A using g(q) = D(q‖ν). Lemma 3.2 The empirical Rademacher complexity of F_A is upper bounded as follows: R̂_n(F_A) ≤ √(2A/n) ( sup_j (1/n) Σ_{i=1}^n h_j(x_i)² )^{1/2}. Proof: We first recall a few facts from the theory of convex duality [10]. Let p(u) be a convex function over a domain U, and set its dual s(z) = sup_{u∈U} { uᵀz − p(u) }. It is known that s(z) is also convex. Setting u = q and p(q) = Σ_j q_j log(q_j/ν_j) we find that s(z) = log Σ_j ν_j e^{z_j}. From the definition of s(z) it follows that for any q ∈ S, qᵀz ≤ Σ_j q_j log(q_j/ν_j) + log Σ_j ν_j e^{z_j}. Since z is arbitrary, we set z = (λ/n) Σ_i σ_i h(x_i) and conclude that for q ∈ Ω_A and any λ > 0, sup_{q∈Ω_A} (1/n) Σ_{i=1}^n σ_i qᵀh(x_i) ≤ (1/λ) [ A + log Σ_j ν_j exp( (λ/n) Σ_i σ_i h_j(x_i) ) ]. Taking the expectation with respect to σ, and using the Chernoff bound E_σ{ exp(Σ_i σ_i a_i) } ≤ exp( Σ_i a_i²/2 ), we have that R̂_n(F_A) ≤ (1/λ) [ A + E_σ log Σ_j ν_j exp( (λ/n) Σ_i σ_i h_j(x_i) ) ] ≤ (1/λ) [ A + sup_j log E_σ exp( (λ/n) Σ_i σ_i h_j(x_i) ) ] (Jensen) ≤ (1/λ) [ A + sup_j log exp( (λ²/(2n²)) Σ_i h_j(x_i)² ) ] (Chernoff) = A/λ + (λ/(2n²)) sup_j Σ_i h_j(x_i)². Minimizing the r.h.s. with respect to λ, we obtain the desired result.
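Before stating the resulting bound, a numerical sanity check of Lemma 3.2 may be useful. The sketch below uses synthetic hypothesis values and a hypothetical KL budget A (none of it from the paper): it evaluates the closed-form bound of Lemma 3.2 and compares it with a Monte Carlo estimate of the dual expression used in the proof, both of which upper-bound R̂_n(F_A).

```python
import numpy as np

# Compare the closed-form bound of Lemma 3.2 with a Monte Carlo evaluation of
# the dual expression (1/lam)[A + E_sigma log sum_j nu_j exp((lam/n) sum_i sigma_i h_j(x_i))]
# that appears in its proof. Hypothesis values and the budget A are illustrative.
rng = np.random.default_rng(1)

n, J = 200, 25
H = rng.uniform(-1.0, 1.0, size=(J, n))        # h_j(x_i) for a toy hypothesis class
nu = np.full(J, 1.0 / J)                       # prior weights nu_j
A = 1.0                                        # hypothetical KL budget: D(q || nu) <= A

# Closed-form bound of Lemma 3.2.
sup_term = np.max(np.mean(H ** 2, axis=1))     # sup_j (1/n) sum_i h_j(x_i)^2
lemma_bound = np.sqrt(2.0 * A / n) * np.sqrt(sup_term)

# The lambda that minimizes A/lam + lam * sup_j sum_i h_j(x_i)^2 / (2 n^2).
lam = n * np.sqrt(2.0 * A / (n * sup_term))

# Monte Carlo estimate of the dual expression at that lambda.
sigmas = rng.choice([-1.0, 1.0], size=(2000, n))
z = (lam / n) * sigmas @ H.T                   # (lam/n) sum_i sigma_i h_j(x_i)
log_terms = np.log(np.exp(z) @ nu)             # log sum_j nu_j exp(z_j)
dual_estimate = (A + log_terms.mean()) / lam

print(lemma_bound, dual_estimate)              # dual_estimate <= lemma_bound up to MC error
```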
Combining Lemmas 3.1 and 3.2 yields our basic bound, where κ and M are defined in Lemma 3.1. Theorem 2 Let S = {(x_1, y_1), . . . , (x_n, y_n)} be a sample of i.i.d. points, each drawn according to a distribution µ(x, y). Let H be a countable hypothesis class, and set F_A to be the class defined in (3) with g(q) = D(q‖ν). Set ∆_H = E_µ [ sup_j ( (1/n) Σ_{i=1}^n h_j(x_i)² )^{1/2} ]. Then for any q ∈ Ω_A, with probability at least 1 − δ, L(qᵀh) ≤ L̂(qᵀh) + 4κ ∆_H √(2A/n) + M √(log(1/δ)/(2n)). Note that if the h_j are uniformly bounded, |h_j| ≤ c, then ∆_H ≤ c. Theorem 2 holds for a fixed value of A. Using the so-called multiple testing Lemma (e.g. [11]) we obtain: Corollary 3.1 Let the assumptions of Theorem 2 hold, and let {A_i, p_i} be a set of positive numbers such that Σ_i p_i = 1. Then for all A_i and q ∈ Ω_{A_i}, with probability at least 1 − δ, L(qᵀh) ≤ L̂(qᵀh) + 4κ ∆_H √(2A_i/n) + M √(log(1/(p_i δ))/(2n)). Note that the only distinction with Theorem 2 is the extra log(1/p_i) factor, which is the price paid for the uniformity of the bound. Finally, we present a data-dependent bound of the form (1). Theorem 3 Let the assumptions of Theorem 2 hold. Then for all q ∈ S, with probability at least 1 − δ, L(qᵀh) ≤ L̂(qᵀh) + max(κ∆_H, M) √( (130 D(q‖ν) + log(1/δ)) / n ). (4) Proof sketch Pick A_i = 2^i and p_i = 1/(i(i + 1)), i = 1, 2, . . . (note that Σ_i p_i = 1). For each q, let i(q) be the smallest index for which A_{i(q)} ≥ D(q‖ν), implying that log(1/p_{i(q)}) ≤ 2 log log₂(4D(q‖ν)). A few lines of algebra, to be presented in the full paper, yield the desired result. The results of Theorem 3 can be compared to those derived by McAllester [8] for the randomized Gibbs procedure. In the latter case, the first term on the r.h.s. is E_{h∼Q} L̂(h), namely the average empirical error of the base classifiers h. In our case the corresponding term is L̂(E_{h∼Q} h), namely the empirical error of the average hypothesis. Since E_{h∼Q} h is potentially much more complex than any single h ∈ H, we expect that the empirical term in (4) is much smaller than the corresponding term in [8]. Moreover, the complexity term we obtain is in fact tighter than the corresponding term in [8] by a logarithmic factor in n (although the logarithmic factor in [8] could probably be eliminated). We thus expect that the Bayesian mixture approach advocated here leads to better performance guarantees. Finally, we comment that Theorem 3 can be used to obtain so-called oracle inequalities. In particular, let q* be the optimal distribution minimizing L(qᵀh), which can only be computed if the underlying distribution µ(x, y) is known. Consider an algorithm which, based only on the data, selects a distribution q̂ by minimizing the r.h.s. of (4), with the implicit constants appropriately specified. Then, using standard approaches (e.g. [2]), we can obtain a bound on L(q̂ᵀh) − L(q*ᵀh). For lack of space, we defer the derivation of the precise bound to the full paper. 4 General Data-Dependent Bounds for Bayesian Mixtures The Kullback-Leibler divergence is but one way to incorporate prior information. In this section we extend the results to general convex regularization functions g(q). Some possible choices for g(q) besides the Kullback-Leibler divergence are the standard L_p norms ‖q‖_p. In order to proceed along the lines of Section 3, we let s(z) be the convex function associated with g(q), namely s(z) = sup_{q∈Ω_A} { qᵀz − g(q) }. Repeating the arguments of Section 3, we have for any λ > 0 that (1/n) Σ_{i=1}^n σ_i qᵀh(x_i) ≤ (1/λ) [ A + s( (λ/n) Σ_i σ_i h(x_i) ) ], which implies that R̂_n(F_A) ≤ inf_{λ≥0} (1/λ) [ A + E_σ s( (λ/n) Σ_i σ_i h(x_i) ) ]. (5) Assume that s(z) is second-order differentiable, and that for any h̄ = Σ_{i=1}^n σ_i h(x_i), (1/2)( s(h̄ + ∆h) + s(h̄ − ∆h) ) − s(h̄) ≤ u(∆h).
Then, assuming that s(0) = 0, it is easy to show by induction that E_σ s( (λ/n) Σ_{i=1}^n σ_i h(x_i) ) ≤ Σ_{i=1}^n u( (λ/n) h(x_i) ). (6) In the remainder of the section we focus on the case of regularization based on the L_p norm. Consider p and q such that 1/q + 1/p = 1, p ∈ (1, ∞), and let p′ = max(p, 2) and q′ = min(q, 2). Note that if p ≤ 2 then q ≥ 2 and q′ = p′ = 2, while if p > 2 then q < 2, q′ = q, p′ = p. Consider p-norm regularization g(q) = (1/p)‖q‖_p^p, in which case s(z) = (1/q)‖z‖_q^q. The Rademacher averaging result for p-norm regularization is known in the geometric theory of Banach spaces (type structure of the Banach space), and it also follows from Khintchine's inequality. We show that it can be easily obtained in our framework. In this case, it is easy to see that s(z) = (1/q)‖z‖_q^q implies u(h(x)) ≤ ((q′ − 1)/q′) ‖h(x)‖_{q′}^{q′}. Substituting in (5) we have R̂_n(F_A) ≤ inf_{λ≥0} (1/λ) [ A + (λ/n)^{q′} ((q′ − 1)/q′) Σ_{i=1}^n ‖h(x_i)‖_{q′}^{q′} ] = C_{q′} (A^{1/p′} / n^{1/p′}) ( (1/n) Σ_{i=1}^n ‖h(x_i)‖_{q′}^{q′} )^{1/q′}, where C_{q′} = ((q′ − 1)/q′)^{1/q′}. Combining this result with the methods described in Section 3, we establish a bound for regularization based on the L_p norm. Assume that ‖h(x_i)‖_{q′} is finite for all i, and set ∆_{H,q} = E [ ( (1/n) Σ_{i=1}^n ‖h(x_i)‖_{q′}^{q′} )^{1/q′} ]. Theorem 4 Let the conditions of Theorem 3 hold and set g(q) = (1/p)‖q‖_p^p, p ∈ (1, ∞). Then for all q ∈ S, with probability at least 1 − δ, L(qᵀh) ≤ L̂(qᵀh) + max(κ∆_{H,q}, M) × O( ‖q‖_p / n^{1/p′} + √( (log log(‖q‖_p + 3) + log(1/δ)) / n ) ), where O(·) hides a universal constant that depends only on p. 5 Discussion We have introduced and analyzed a class of regularized Bayesian mixture approaches, which construct complex composite estimators by combining hypotheses from some underlying hypothesis class using data-dependent weights. Such weighted averaging approaches have been used extensively within the Bayesian framework, as well as in more recent approaches such as Bagging and Boosting. While Bayesian methods are known, under favorable conditions, to lead to optimal estimators in a frequentist setting, their performance in agnostic settings, where no reliable assumptions can be made concerning the data-generating mechanism, has not been well understood. Our data-dependent bounds allow the utilization of Bayesian mixture models in general settings, while at the same time taking advantage of the benefits of the Bayesian approach in terms of incorporation of prior knowledge. The bounds established, being independent of the cardinality of the underlying hypothesis space, can be directly applied to kernel-based methods. Acknowledgments We thank Shimon Benjo for helpful discussions. The research of R.M. is partially supported by the fund for promotion of research at the Technion and by the Ollendorff foundation of the Electrical Engineering department at the Technion. References [1] P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. In Proceedings of the Fourteenth Annual Conference on Computational Learning Theory, pages 224–240, 2001. [2] P. L. Bartlett, S. Boucheron, and G. Lugosi. Model selection and error estimation. Machine Learning, 48:85–113, 2002. [3] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002. [4] R. Herbrich and T. Graepel. A PAC-Bayesian margin bound for linear classifiers: Why SVMs work. In Advances in Neural Information Processing Systems 13, pages 224–230, Cambridge, MA, 2001. MIT Press. [5] V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. Statist., 30(1), 2002.
[6] J. Langford, M. Seeger, and N. Megiddo. An improved predictive accuracy bound for averaging classifiers. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 290–297, 2001. [7] D. A. McAllester. Some PAC-Bayesian theorems. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 230–234, New York, 1998. ACM Press. [8] D. A. McAllester. PAC-Bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, New York, 1999. ACM Press. [9] C. P. Robert. The Bayesian Choice: A Decision-Theoretic Motivation. Springer-Verlag, New York, 1994. [10] R. T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, 1970. [11] J. Shawe-Taylor, P. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Trans. Inf. Theory, 44:1926–1940, 1998. [12] Y. Yang. Minimax nonparametric classification - Part I: Rates of convergence. IEEE Trans. Inf. Theory, 45(7):2271–2284, 1999. [13] T. Zhang. Generalization performance of some learning problems in Hilbert functional space. In Advances in Neural Information Processing Systems 15, Cambridge, MA, 2001. MIT Press. [14] T. Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527–550, 2002.
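To connect the bound of Theorem 3 back to practice, the sketch below evaluates the right-hand side of (4) for a given mixture q, prior ν, and sample. The decision-stump hypotheses, the clipped hinge surrogate with its constants κ and M, and the empirical proxy for ∆_H are assumptions of the example rather than prescriptions from the paper.

```python
import numpy as np

# A sketch (illustrative, not from the paper) of evaluating the data-dependent
# bound (4) for a mixture q over a finite hypothesis class: empirical loss of
# the averaged hypothesis plus the KL-based complexity term.
rng = np.random.default_rng(2)

def theorem3_bound(H, y, q, nu, kappa, M, delta):
    """Right-hand side of (4): L_hat(q^T h) + max(kappa * Delta_H, M) *
    sqrt((130 * D(q||nu) + log(1/delta)) / n), using a clipped hinge surrogate."""
    n = H.shape[1]
    f_q = q @ H                                        # mixture predictions q^T h(x_i)
    emp_loss = np.clip(1.0 - y * f_q, 0.0, M).mean()
    delta_H = np.sqrt(np.max(np.mean(H ** 2, axis=1)))  # empirical proxy for Delta_H
    kl = np.sum(q * np.log(q / nu))
    complexity = max(kappa * delta_H, M) * np.sqrt((130.0 * kl + np.log(1.0 / delta)) / n)
    return emp_loss + complexity

# Toy data and hypotheses: decision stumps on a scalar input.
x = rng.uniform(-1.0, 1.0, size=400)
y = np.sign(x + 0.1 * rng.normal(size=400))
thresholds = np.linspace(-1.0, 1.0, 15)
H = np.stack([np.where(x > t, 1.0, -1.0) for t in thresholds])

nu = np.full(len(thresholds), 1.0 / len(thresholds))
q = rng.dirichlet(np.ones(len(thresholds)))

print(theorem3_bound(H, y, q, nu, kappa=1.0, M=2.0, delta=0.05))
```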
6 0.49487188 124 nips-2002-Learning Graphical Models with Mercer Kernels
7 0.47423777 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
9 0.44613808 194 nips-2002-The Decision List Machine
10 0.39461815 150 nips-2002-Multiple Cause Vector Quantization
11 0.38429782 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction
12 0.37013137 117 nips-2002-Intrinsic Dimension Estimation Using Packing Numbers
13 0.36845985 41 nips-2002-Bayesian Monte Carlo
14 0.36636972 75 nips-2002-Dynamical Causal Learning
15 0.35744673 121 nips-2002-Knowledge-Based Support Vector Machine Classifiers
16 0.35662588 84 nips-2002-Fast Exact Inference with a Factored Model for Natural Language Parsing
17 0.33809879 4 nips-2002-A Differential Semantics for Jointree Algorithms
18 0.33538052 162 nips-2002-Parametric Mixture Models for Multi-Labeled Text
19 0.33432844 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
20 0.33090869 107 nips-2002-Identity Uncertainty and Citation Matching
topicId topicWeight
[(11, 0.015), (23, 0.013), (42, 0.047), (54, 0.119), (55, 0.025), (57, 0.011), (67, 0.013), (68, 0.038), (74, 0.18), (87, 0.325), (92, 0.045), (98, 0.096)]
simIndex simValue paperId paperTitle
1 0.90684378 198 nips-2002-Theory-Based Causal Inference
Author: Joshua B. Tenenbaum, Thomas L. Griffiths
Abstract: People routinely make sophisticated causal inferences unconsciously, effortlessly, and from very little data – often from just one or a few observations. We argue that these inferences can be explained as Bayesian computations over a hypothesis space of causal graphical models, shaped by strong top-down prior knowledge in the form of intuitive theories. We present two case studies of our approach, including quantitative models of human causal judgments and brief comparisons with traditional bottom-up models of inference.
same-paper 2 0.8283574 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization
Author: Harald Steck, Tommi S. Jaakkola
Abstract: A common objective in learning a model from data is to recover its network structure, while the model parameters are of minor interest. For example, we may wish to recover regulatory networks from high-throughput data sources. In this paper we examine how Bayesian regularization using a product of independent Dirichlet priors over the model parameters affects the learned model structure in a domain with discrete variables. We show that a small scale parameter - often interpreted as
3 0.82362998 98 nips-2002-Going Metric: Denoising Pairwise Data
Author: Volker Roth, Julian Laub, Klaus-Robert Müller, Joachim M. Buhmann
Abstract: Pairwise data in empirical sciences typically violate metricity, either due to noise or due to fallible estimates, and therefore are hard to analyze by conventional machine learning technology. In this paper we therefore study ways to work around this problem. First, we present an alternative embedding to multi-dimensional scaling (MDS) that allows us to apply a variety of classical machine learning and signal processing algorithms. The class of pairwise grouping algorithms which share the shift-invariance property is statistically invariant under this embedding procedure, leading to identical assignments of objects to clusters. Based on this new vectorial representation, denoising methods are applied in a second step. Both steps provide a theoretically well controlled setup to translate from pairwise data to the respective denoised metric representation. We demonstrate the practical usefulness of our theoretical reasoning by discovering structure in protein sequence data bases, visibly improving performance upon existing automatic methods.
4 0.62842965 75 nips-2002-Dynamical Causal Learning
Author: David Danks, Thomas L. Griffiths, Joshua B. Tenenbaum
Abstract: Current psychological theories of human causal learning and judgment focus primarily on long-run predictions: two by estimating parameters of a causal Bayes net (though for different parameterizations), and a third through structural learning. This paper focuses on people's short-run behavior by examining dynamical versions of these three theories, and comparing their predictions to a real-world dataset.
5 0.6107676 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture
Author: David R. Martin, Charless C. Fowlkes, Jitendra Malik
Abstract: The goal of this work is to accurately detect and localize boundaries in natural scenes using local image measurements. We formulate features that respond to characteristic changes in brightness and texture associated with natural boundaries. In order to combine the information from these features in an optimal way, a classifier is trained using human labeled images as ground truth. We present precision-recall curves showing that the resulting detector outperforms existing approaches.
7 0.60782093 40 nips-2002-Bayesian Models of Inductive Generalization
8 0.59711373 162 nips-2002-Parametric Mixture Models for Multi-Labeled Text
9 0.59364164 152 nips-2002-Nash Propagation for Loopy Graphical Games
10 0.5859977 53 nips-2002-Clustering with the Fisher Score
11 0.5828346 163 nips-2002-Prediction and Semantic Association
12 0.5796349 124 nips-2002-Learning Graphical Models with Mercer Kernels
13 0.57916439 90 nips-2002-Feature Selection in Mixture-Based Clustering
14 0.57893133 175 nips-2002-Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games
15 0.57805026 78 nips-2002-Efficient Learning Equilibrium
16 0.57483172 66 nips-2002-Developing Topography and Ocular Dominance Using Two aVLSI Vision Sensors and a Neurotrophic Model of Plasticity
17 0.57327884 39 nips-2002-Bayesian Image Super-Resolution
18 0.57219964 89 nips-2002-Feature Selection by Maximum Marginal Diversity
19 0.57178605 57 nips-2002-Concurrent Object Recognition and Segmentation by Graph Partitioning
20 0.57061148 114 nips-2002-Information Regularization with Partially Labeled Data