nips nips2007 nips2007-147 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Zafer Barutcuoglu, Phil Long, Rocco Servedio
Abstract: This paper studies boosting algorithms that make a single pass over a set of base classifiers. We first analyze a one-pass algorithm in the setting of boosting with diverse base classifiers. Our guarantee is the same as the best proved for any boosting algorithm, but our one-pass algorithm is much faster than previous approaches. We next exhibit a random source of examples for which a “picky” variant of AdaBoost that skips poor base classifiers can outperform the standard AdaBoost algorithm, which uses every base classifier, by an exponential factor. Experiments with Reuters and synthetic data show that one-pass boosting can substantially improve on the accuracy of Naive Bayes, and that picky boosting can sometimes lead to a further improvement in accuracy.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: This paper studies boosting algorithms that make a single pass over a set of base classifiers. [sent-8, score-0.609]
2 We first analyze a one-pass algorithm in the setting of boosting with diverse base classifiers. [sent-9, score-0.678]
3 Our guarantee is the same as the best proved for any boosting algorithm, but our one-pass algorithm is much faster than previous approaches. [sent-10, score-0.241]
4 We next exhibit a random source of examples for which a “picky” variant of AdaBoost that skips poor base classifiers can outperform the standard AdaBoost algorithm, which uses every base classifier, by an exponential factor. [sent-11, score-0.736]
5 Experiments with Reuters and synthetic data show that one-pass boosting can substantially improve on the accuracy of Naive Bayes, and that picky boosting can sometimes lead to a further improvement in accuracy. [sent-12, score-0.674]
6 The aggregate classifier typically makes its class predictions using a weighted vote over the predictions made by the base classifiers, which are usually chosen one at a time in rounds. [sent-14, score-0.387]
7 When boosting is applied in practice, the base classifier at each round is usually optimized: typically, each example is assigned a weight that depends on how well it is handled by the previously chosen base classifiers, and the new base classifier is chosen to minimize the weighted training error. [sent-15, score-1.296]
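For contrast with the one-pass protocols studied in this paper, the usual optimizing round described above can be sketched as follows; this is our own illustrative Python (the function name and the ±1 prediction-matrix layout are assumptions, not code from the paper):

```python
import numpy as np

def best_base_classifier(base_preds, y, w):
    """Standard (non-one-pass) weak learner: re-evaluate *every* base classifier
    against the current example weights w and return the index and weighted
    error of the one with the smallest weighted training error.  Doing this in
    every round costs time proportional to (#base classifiers) x (#rounds),
    which is the cost that one-pass boosting avoids.

    base_preds: array of shape (n_classifiers, n_examples) with +/-1 predictions.
    y:          array of +/-1 labels.
    w:          nonnegative example weights (the current distribution).
    """
    p = w / w.sum()
    weighted_errors = ((base_preds != y) * p).sum(axis=1)  # one error per base classifier
    j = int(np.argmin(weighted_errors))
    return j, weighted_errors[j]
```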
8 But sometimes this is not feasible; there may be a huge number of base classifiers with insufficient apparent structure among them to avoid simply trying all of them out to find out which is best. [sent-16, score-0.353]
9 For example, there may be a base classifier for each word or k-mer. [sent-17, score-0.332]
10 Recall that if AdaBoost is run with a sequence of base classifiers b1 , . [sent-21, score-0.348]
11 , 1/2 − γn , then the training error of AdaBoost’s final output hypothesis is at most exp(−2 Σ_{t=1}^{n} γt²). [sent-27, score-0.132]
12 , bn of the base classifiers without looking at the data, (b) committing to use base classifier bt in round t, and (c) setting the weight with which bt votes as a function of its weighted training error using AdaBoost. [sent-31, score-1.141]
13 ) The resulting algorithm uses essentially the same computational resources as Naive Bayes [2, 7], but benefits from taking some account of the dependence among base classifiers. [sent-33, score-0.332]
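A minimal sketch of the one-pass protocol (a)–(c), again in Python and with our own naming and ±1 conventions; it is meant to illustrate the control flow, not to reproduce the authors' implementation:

```python
import numpy as np

def one_pass_adaboost(base_preds, y):
    """One-pass AdaBoost: visit the base classifiers once, in the given fixed
    order, and assign each a voting weight from its weighted error under the
    current distribution (never revisiting earlier classifiers).

    base_preds: (n_classifiers, n_examples) array of +/-1 predictions, already
                listed in the arbitrary order b_1, ..., b_n.
    y:          (n_examples,) array of +/-1 labels.
    Returns the voting weights alpha_t, one per base classifier.
    """
    m = len(y)
    w = np.full(m, 1.0 / m)                   # distribution D_t over the sample
    alphas = np.zeros(len(base_preds))
    for t, preds in enumerate(base_preds):    # single pass: each b_t seen once
        eps = float(np.clip(w[preds != y].sum(), 1e-12, 1 - 1e-12))
        alphas[t] = 0.5 * np.log((1 - eps) / eps)
        w *= np.exp(-alphas[t] * y * preds)   # usual AdaBoost reweighting
        w /= w.sum()
    return alphas

def predict(base_preds, alphas):
    """Weighted-vote prediction H(x) = sgn(sum_t alpha_t b_t(x))."""
    return np.sign(alphas @ base_preds)
```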
14 Thus motivated, in this paper we study the performance of different boosting algorithms in a one-pass setting. [sent-34, score-0.225]
15 We begin by providing theoretical support for one-pass boosting using the “diverse base classifiers” framework previously studied in [1, 6]. [sent-36, score-0.557]
16 For an unknown subset G of k of the base classifiers, the events that the classifiers in G are correct on a random item are mutually independent. [sent-38, score-0.359]
17 This formalizes the notion that these k base classifiers are not redundant. [sent-39, score-0.356]
18 Each of these k classifiers is assumed to have error 1/2 − γ under the initial distribution, and no assumption is made about the other n − k base classifiers. [sent-40, score-0.397]
19 In [1] it is shown that if Boost-by-Majority is applied with a weak learner that does optimization (i. [sent-41, score-0.124]
20 always uses the “best” of the n candidate base classifiers at each of Θ(k) stages of boosting), the error rate of the combined hypothesis with respect to the underlying distribution is (roughly) at most exp(−Ω(γ²k)). [sent-43, score-0.563]
21 In Section 2 we show that a one-pass variant of Boost-by-Majority achieves a similar bound with a single pass through the n base classifiers, reducing the computation time required by an Ω(k) factor. [sent-44, score-0.408]
22 We next show in Section 3 that when running AdaBoost using one pass, it can sometimes be advantageous to abstain from using base classifiers that are too weak. [sent-45, score-0.388]
23 Intuitively, this is because using many weak base classifiers early on can cause the boosting algorithm to reweight the data in a way that obscures the value of a strong base classifier that comes later. [sent-46, score-0.975]
24 (Note that the quadratic dependence on γt in the exponent of exp(−2 Σ_{t=1}^{n} γt²) means that one good base classifier is more valuable than many poor ones. [sent-47, score-0.35]
25 ) In a bit more detail, suppose that base classifiers are considered in the order b1 , . [sent-48, score-0.446]
26 , bn−1 has a “small” advantage over random guessing under the initial distribution D and bn has a “large” advantage under D. [sent-54, score-0.491]
27 to change from the initial distribution D in such a way that when bn is finally considered, its advantage under Dn is markedly smaller than its advantage under D, causing AdaBoost to assign bn a small voting weight. [sent-61, score-0.788]
28 , bn−1 (since their advantages are too small) and thus be able to reap the full benefit of using bn under the initial distribution D (since when bn is finally considered the distribution in use is still D, as no earlier base classifiers have been used). [sent-65, score-1.012]
29 These show that one-pass boosting can lead to substantial improvement in accuracy over Naive Bayes while using a similar amount of computation, and that picky one-pass boosting can sometimes further improve accuracy. [sent-67, score-0.645]
30 2 Faster learning with diverse base classifiers. We consider the framework of boosting in the presence of diverse base classifiers studied in [1]. [sent-68, score-1.089]
31 We say that a set G of classifiers is diverse and γ-good with respect to D if (i) each classifier in G has advantage at least γ (i. [sent-70, score-0.198]
32 Let rt (x, y) be the number of previously chosen base classifiers h1 , . [sent-82, score-0.362]
33 until you encounter a hypothesis bj with advantage at least α with respect to Dt (and if you run out of base classifiers before this happens, then go to step 3). [sent-91, score-0.557]
34 return bj to the boosting algorithm) and set it+1 to j + 1 (i. [sent-94, score-0.267]
35 the index of the next base classifier in the list). [sent-96, score-0.332]
36 If Zt < ε/T , then set ht to be the constant-1 hypothesis (i. [sent-98, score-0.225]
37 return this constant hypothesis to the boosting algorithm) and set it+1 = it . [sent-100, score-0.326]
38 the algorithm ran out of base classifiers before selecting T of them), abort. [sent-104, score-0.332]
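A rough sketch of the base-classifier selection step described above; representing the filtered distribution Dt by example weights, and the names alpha, eps and T, are our assumptions for illustration, and the surrounding Boost-by-Majority bookkeeping is omitted:

```python
import numpy as np

def popbm_select(base_preds, y, w_t, Z_t, i_t, alpha, eps, T):
    """Base-classifier selection step of picky one-pass Boost-by-Majority (sketch).

    base_preds: (n_classifiers, n_examples) +/-1 predictions in the fixed random order.
    y:   +/-1 labels; w_t: weights representing the filtered distribution D_t.
    Z_t: total weight of D_t relative to the original distribution.
    i_t: index of the first base classifier not yet examined.
    Returns (weak hypothesis predictions, next index), or None to signal "abort".
    """
    if Z_t < eps / T:
        # D_t carries very little weight, so the quality of this round's weak
        # hypothesis hardly matters: return the constant-1 hypothesis and do
        # not consume any base classifier.
        return np.ones(base_preds.shape[1]), i_t
    p = w_t / w_t.sum()
    for j in range(i_t, len(base_preds)):
        advantage = p[base_preds[j] == y].sum() - 0.5   # advantage under D_t
        if advantage >= alpha:
            return base_preds[j], j + 1   # commit to b_j; earlier ones are never revisited
    return None                           # ran out of base classifiers: abort
```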
39 ii is that if Zt is small, then Lemma 4 will show that it doesn’t much matter how good this weak hypothesis is, so we simply use a constant hypothesis. [sent-111, score-0.171]
40 The following lemma from [1] shows that if the filtered distribution is not too different from the original distribution, then there is a good weak hypothesis relative to the original distribution. [sent-117, score-0.256]
41 It formalizes two ideas: (a) if the weak learners perform well, then so will the strong learner; and (b) the performance of the weak learner is not important in rounds for which Z t is small. [sent-138, score-0.273]
42 Theorem 5 Suppose the set H of base classifiers used by POPBM contains a subset G of k elements that is diverse and γ-good with respect to the initial distribution D, where γ is a constant (say 1/4). [sent-151, score-0.485]
43 Proof of Claim 6: In order for POPBM to abort, it must be the case that as the k base classifiers in G are encountered in sequence as the algorithm proceeds through h 1 , . [sent-160, score-0.373]
44 , bn is encountered during the boosting process and skipped, and for each ℓ ∈ {1, . [sent-173, score-0.569]
45 If Sℓ = k/8 or if the boosting process has terminated by the ℓth member of G, this is obvious. [sent-186, score-0.287]
46 Let t be the round of boosting in which the ℓth member of G is encountered. [sent-188, score-0.324]
47 Since the base classifiers were ordered randomly, any order over the remaining hypotheses is equally likely, and so also is any order over the remaining hypotheses from G. [sent-199, score-0.392]
48 Thus, the probability that the next member of G to be encountered has advantage at least α is at least 1/4, so the probability that it is skipped is at most 3/4. [sent-200, score-0.259]
49 This means that the probability that at least k/64 good elements were not skipped is at least 1 − e−O(k) , which completes the proof. [sent-206, score-0.149]
50 3 For one-pass boosting, PickyAdaBoost can outperform AdaBoost. AdaBoost is the most popular boosting algorithm. [sent-207, score-0.244]
51 It is most often applied in conjunction with a weak learner that performs optimization, but it can be used with any weak learner. [sent-208, score-0.21]
52 In this section, we compare AdaBoost and its picky variant on an artificial source especially designed to illustrate why the picky variant may be needed. [sent-210, score-0.425]
53 If we run AdaBoost for T stages with weak hypotheses h1 , . [sent-213, score-0.161]
54 , hT , it constructs a final hypothesis H(x) = sgn(f(x)) where f(x) = Σ_{t=1}^{T} αt ht(x) (3), with αt = (1/2) ln((1 − εt)/εt). [sent-216, score-0.336]
55 We write γt to denote 1/2 − εt , the advantage of the t-th weak hypothesis under distribution Dt . [sent-218, score-0.246]
56 Freund and Schapire [5] proved that if AdaBoost is run with an initial distribution D over a set of labeled examples, then the error rate of the final combined classifier H is at most exp(−2 Σ_{t=1}^{T} γt²) under D: Pr(x,y)∼D [H(x) ≠ y] ≤ exp(−2 Σ_{t=1}^{T} γt²). [sent-219, score-0.173]
57 base classifier ht being considered has advantage γ′ under Dt , where |γ′| < γ̂. [sent-231, score-0.527]
58 If this is the case then PickyAdaBoost abstains in that round and does not include ht into the combined hypothesis it is constructing. [sent-232, score-0.145]
59 (Note that consequently the distribution for the next round of boosting will also be Dt . [sent-233, score-0.322]
60 On the other hand, if the current base classifier has advantage γ′ where |γ′| ≥ γ̂, then PickyAdaBoost proceeds to use the weak hypothesis just like AdaBoost, i. [sent-234, score-0.575]
61 it adds αt ht to the function f described in (3) and adjusts Dt to obtain the next distribution Dt+1. [sent-236, score-0.14]
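A sketch of a single PickyAdaBoost round as just described; the threshold is written gamma_hat (the parameter of PickyAdaBoost(γ̂)), and the helper itself is ours, not the authors' code:

```python
import numpy as np

def picky_adaboost_round(preds, y, w, gamma_hat):
    """One round of PickyAdaBoost (sketch).

    preds: +/-1 predictions of the base classifier h_t under consideration.
    y:     +/-1 labels; w: current weights representing the distribution D_t.
    Returns (alpha_t, new weights); alpha_t = 0.0 means the round abstained,
    in which case the distribution is left unchanged.
    """
    p = w / w.sum()
    advantage = p[preds == y].sum() - 0.5                 # advantage of h_t under D_t
    if abs(advantage) < gamma_hat:                        # too weak: skip h_t, keep D_t
        return 0.0, w
    eps = float(np.clip(0.5 - advantage, 1e-12, 1 - 1e-12))  # weighted error of h_t
    alpha = 0.5 * np.log((1 - eps) / eps)                 # same voting-weight rule as AdaBoost
    new_w = w * np.exp(-alpha * y * preds)
    return alpha, new_w / new_w.sum()
```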
62 Whether a given base classifier is used, or its negation is used, the effect that it has on the output of AdaBoost is the same (briefly, because ln((1 − ε)/ε) = − ln(ε/(1 − ε))). [sent-238, score-0.518]
63 3.1 The construction. We consider a sequence of n + 1 base classifiers b1 , . [sent-241, score-0.332]
64 For simplicity we suppose that the domain X is {−1, 1}^{n+1} and that the value of the i-th base classifier on an instance x ∈ {−1, 1}^{n+1} is simply bi (x) = xi . [sent-245, score-0.369]
65 , xn is chosen independently to equal y with probability 1/2 + γ, and the bit xn+1 is chosen to equal y if there exists an i, 1 ≤ i ≤ n, for which xi = y; if xi = −y for all 1 ≤ i ≤ n then xn+1 is set to −y. [sent-251, score-0.169]
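The source D is straightforward to sample; the following helper is our own illustration of the construction (not from the paper):

```python
import numpy as np

def sample_from_D(m, n, gamma, seed=0):
    """Draw m labeled examples (x, y) from the source D of Section 3.1.

    y is a uniformly random +/-1 label; each of x_1..x_n independently equals y
    with probability 1/2 + gamma; x_{n+1} equals y unless every x_i = -y.
    Returns X of shape (m, n + 1) and y of shape (m,).
    """
    rng = np.random.default_rng(seed)
    y = rng.choice([-1, 1], size=m)
    agree = rng.random((m, n)) < 0.5 + gamma          # which of x_1..x_n agree with y
    X = np.where(agree, y[:, None], -y[:, None])
    x_last = np.where(agree.any(axis=1), y, -y)       # the bit x_{n+1}
    return np.column_stack([X, x_last]), y
```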
66 In the case where γ < γ̂ ≤ 1/2 − (1/2 − γ)^n , it is easy to analyze the error rate of PickyAdaBoost(γ̂) after one pass through the base classifiers in the order b1 , . [sent-260, score-0.487]
67 , bn has advantage exactly γ under D and bn+1 has advantage 1/2 − (1/2 − γ)^n under D, PickyAdaBoost(γ̂) will abstain in rounds 1, . [sent-267, score-0.504]
68 We thus have: Lemma 7 For γ < γ̂ ≤ 1/2 − (1/2 − γ)^n , PickyAdaBoost(γ̂) constructs a final hypothesis which has error rate precisely (1/2 − γ)^n under D. [sent-275, score-0.185]
69 Now let us analyze the error rate of AdaBoost after one pass through the base classifiers in the order b1 , . [sent-277, score-0.487]
70 We write Dt to denote the distribution that AdaBoost uses at the t-th stage of boosting (so D = D1 ). [sent-281, score-0.245]
71 The following claim is an easy consequence of the fact that given the value of y, the values of the base classifiers b1 , . [sent-283, score-0.416]
72 , bn are all mutually independent: Claim 8 For each 1 ≤ t ≤ n we have that γt = γ. [sent-286, score-0.347]
73 , bn are all equal to (1/2) ln((1/2 + γ)/(1/2 − γ)) = (1/2) ln((1 + 2γ)/(1 − 2γ)). [sent-293, score-0.506]
74 The next claim can be straightforwardly proved by induction on t: Claim 9 Let Dr denote the distribution constructed by AdaBoost after processing the base classifiers b1 , . [sent-294, score-0.452]
75 A draw of (x, y) from Dr is distributed as follows: • The bit y is uniform random from {−1, +1}; • Each bit x1 , . [sent-298, score-0.176]
76 , xr−1 independently equals y with probability 1/2 , and each bit xr , . [sent-301, score-0.175]
77 , xn independently equals y with probability 1/2 + γ; • The bit xn+1 is set as described in Section 3. [sent-304, score-0.195]
78 Thus an explicit expression for the final hypothesis of AdaBoost after one pass over the n + 1 classifiers b1 , . [sent-310, score-0.234]
79 , bn+1 is H(x) = sgn(f(x)), where f(x) = (1/2) ln((1 + 2γ)/(1 − 2γ)) · (x1 + · · · + xn ) + (1/2) ln(2^n − 1) · xn+1 . [sent-313, score-0.141]
80 Then the AdaBoost final hypothesis error rate is Pr[B(n, 1/2 − γ) > A], which equals Σ_{i=⌊A⌋+1}^{n} C(n, i) (1/2 − γ)^i (1/2 + γ)^{n−i} . (5) [sent-328, score-0.193]
81 i (5) In terms of Lemma 11, Lemma 7 states that the PickyAdaBoost(γ) final hypothesis has error 1 Pr[B(n, 2 − γ) ≥ n]. [sent-329, score-0.132]
82 We thus have that if A < n − 1 then AdaBoost’s final hypothesis has greater error than PickyAdaBoost. [sent-330, score-0.149]
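As a numerical illustration (our own, not from the paper), the two error rates can be computed exactly if one assumes that A is the threshold implied by the explicit form of f(x) above, i.e. A = n/2 + ln(2^n − 1) / (2 ln((1 + 2γ)/(1 − 2γ))); with this assumption:

```python
from math import comb, log, floor

def one_pass_error_rates(n, gamma):
    """Exact error of one-pass AdaBoost vs. PickyAdaBoost on the Section 3.1 source,
    assuming H errs exactly when more than A of the bits x_1..x_n disagree with y,
    with A determined by the coefficients of f(x) above."""
    c = 0.5 * log((1 + 2 * gamma) / (1 - 2 * gamma))   # voting weight of each b_1..b_n
    d = 0.5 * log(2 ** n - 1)                          # voting weight of b_{n+1}
    A = n / 2 + d / (2 * c)
    p = 0.5 - gamma                                    # probability a bit disagrees with y
    tail = sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(floor(A) + 1, n + 1))    # Pr[B(n, 1/2 - gamma) > A]
    return tail, p ** n                                # (AdaBoost error, PickyAdaBoost error)

ada, picky = one_pass_error_rates(n=30, gamma=0.3)
print(ada, picky, ada / picky)   # the ratio is already astronomically large at n = 30
```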
83 First we observe that even in some simple cases the AdaBoost error rate (5) can be larger than the PickyAdaBoost error rate by a fairly large additive constant. [sent-332, score-0.164]
84 Viewing n as an asymptotic parameter and γ as a fixed constant, we have that (5) is at least ( Σ_{i=0}^{⌊αn⌋} C(n, i) ) · (1/2 − γ)^n (7), where α = 1/2 − ln 2 / ( 2 ln((1 + 2γ)/(1 − 2γ)) ) − o(1). [sent-344, score-0.186]
85 Using the bound Σ_{i=0}^{⌊αn⌋} C(n, i) = 2^{n·(H(α)±o(1))} , which holds for 0 < α < 1/2 , we see that any setting of γ such that α is bounded away from zero by a constant gives an exponential gap between the error rate of PickyAdaBoost (which is (1/2 − γ)^n) and the lower bound on AdaBoost’s error provided by (7). [sent-345, score-0.129]
86 3.3 Base classifiers in an arbitrary ordering. The above results show that PickyAdaBoost can outperform AdaBoost if the base classifiers are considered in the particular order b1 , . [sent-353, score-0.373]
87 A more involved analysis (omitted because of space constraints) establishes a similar difference when the base classifiers are chosen in a random order: Proposition 13 Suppose that 0. [sent-357, score-0.332]
88 5 and 0 < c < 1 are fixed constants independent of n that satisfy Z(γ) < c, where Z(γ) is defined to be ln( 4/(1 − 2γ)² ) / ln( (1 + 2γ)/(1 − 2γ)³ ). [sent-359, score-0.186]
89 Suppose the base classifiers are listed in an order such that bn+1 occurs at position c · n. [sent-360, score-0.332]
90 Then the error rate of AdaBoost is at least 2^{n(1−c)} − 1 = 2^{Ω(n)} times greater than the error of PickyAdaBoost(γ̂). [sent-361, score-0.174]
91 For the case of randomly ordered base classifiers, we may view c as a real value that is uniformly distributed in [0, 1], and for any fixed constant 0. [sent-362, score-0.332]
92 5 there is a constant probability (at least 1 − Z(γ)) that AdaBoost has an error rate 2^{Ω(n)} times larger than that of PickyAdaBoost(γ̂). [sent-364, score-0.125]
93 We compared the boosting algorithms with multinomial Naive Bayes [7]. [sent-375, score-0.257]
94 We used boosting with confidence-rated base classifiers [8], with a base classifier for each stem of length at most 5; analogously to the multinomial Naive Bayes, the confidence of a base classifier was taken to be the number of times its stem appeared in the text. [sent-376, score-1.313]
95 2] suggested, when the confidence of base classifiers cannot be bounded a priori, to choose each voting weight αt in order to maximize the reduction in potential. [sent-378, score-0.352]
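A sketch of how we read the experimental setup: confidence-rated base classifiers whose confidence is the stem count, trained in one pass, with each αt chosen by a one-dimensional search that minimizes the exponential potential Σ_i wi exp(−αt yi ht(xi)). The feature layout and the bounded ternary search are our own simplifications, not the authors' code:

```python
import numpy as np

def best_alpha(margins, w, lo=-5.0, hi=5.0, iters=60):
    """Ternary search for the alpha minimizing the exponential potential
    Z(alpha) = sum_i w_i * exp(-alpha * margins_i); Z is convex in alpha, so the
    search converges to the minimizer within the bracket [lo, hi]."""
    Z = lambda a: float(np.sum(w * np.exp(-a * margins)))
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if Z(m1) < Z(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

def one_pass_confidence_boosting(counts, y):
    """One-pass confidence-rated boosting over stem-count features (sketch).

    counts: (n_docs, n_stems) array; counts[i, t] = occurrences of stem t in document i,
            so the t-th base classifier is the confidence-rated h_t(x) = count of stem t.
    y:      (n_docs,) array of +/-1 labels.
    Returns one voting weight per stem; H(x) = sgn(sum_t alphas[t] * count_t(x)).
    """
    n_docs, n_stems = counts.shape
    w = np.full(n_docs, 1.0 / n_docs)
    alphas = np.zeros(n_stems)
    for t in range(n_stems):                 # single pass over the base classifiers
        margins = y * counts[:, t]           # y_i * h_t(x_i)
        alphas[t] = best_alpha(margins, w)   # maximize the reduction in potential
        w *= np.exp(-alphas[t] * margins)
        w /= w.sum()
    return alphas
```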
96 The one-pass boosting algorithms usually improve on the accuracy of Naive Bayes, while retaining similar simplicity and computational efficiency. [sent-382, score-0.24]
97 The boosting algorithms predictably perform better than Naive Bayes, because Naive Bayes assigns too much weight to the correlated features. [sent-401, score-0.225]
98 The picky boosting algorithm further ameliorates the effect of this correlation. [sent-402, score-0.399]
99 Where a result is omitted, the corresponding picky algorithm did not pick any base classifiers. [sent-533, score-0.506]
100 A comparison of event models for naive bayes text classification. [sent-571, score-0.173]
wordName wordTfidf (topN-words)
[('pickyadaboost', 0.435), ('adaboost', 0.398), ('base', 0.332), ('bn', 0.32), ('ers', 0.258), ('boosting', 0.225), ('classi', 0.196), ('picky', 0.174), ('zt', 0.146), ('ht', 0.14), ('popbm', 0.139), ('dt', 0.119), ('er', 0.102), ('diverse', 0.1), ('ln', 0.093), ('naive', 0.09), ('weak', 0.086), ('hypothesis', 0.085), ('claim', 0.084), ('bit', 0.077), ('pr', 0.07), ('opab', 0.07), ('reuters', 0.069), ('bayes', 0.066), ('lemma', 0.065), ('round', 0.06), ('skipped', 0.055), ('advantage', 0.055), ('nal', 0.054), ('pass', 0.052), ('nb', 0.052), ('sgn', 0.049), ('xn', 0.048), ('error', 0.047), ('rounds', 0.039), ('member', 0.039), ('learner', 0.038), ('suppose', 0.037), ('abstain', 0.035), ('rate', 0.035), ('wt', 0.034), ('freund', 0.034), ('multinomial', 0.032), ('stem', 0.03), ('hypotheses', 0.03), ('rt', 0.03), ('source', 0.029), ('stages', 0.029), ('independently', 0.029), ('synthetic', 0.029), ('least', 0.028), ('xr', 0.028), ('rocco', 0.028), ('mutually', 0.027), ('schapire', 0.027), ('equals', 0.026), ('bj', 0.026), ('bt', 0.025), ('formalizes', 0.024), ('encountered', 0.024), ('variant', 0.024), ('guessing', 0.023), ('terminated', 0.023), ('completes', 0.023), ('dr', 0.023), ('omitted', 0.022), ('draw', 0.022), ('ordering', 0.022), ('exp', 0.021), ('sometimes', 0.021), ('vote', 0.021), ('analyze', 0.021), ('permutations', 0.02), ('fix', 0.02), ('voting', 0.02), ('distribution', 0.02), ('ltered', 0.019), ('aggregate', 0.019), ('outperform', 0.019), ('dence', 0.018), ('analyses', 0.018), ('constructs', 0.018), ('agrees', 0.018), ('initial', 0.018), ('text', 0.017), ('dn', 0.017), ('proceeds', 0.017), ('greater', 0.017), ('consequently', 0.017), ('run', 0.016), ('return', 0.016), ('nally', 0.016), ('proved', 0.016), ('probability', 0.015), ('usually', 0.015), ('colt', 0.015), ('respect', 0.015), ('shrinking', 0.015), ('wheat', 0.015), ('plong', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 147 nips-2007-One-Pass Boosting
Author: Zafer Barutcuoglu, Phil Long, Rocco Servedio
Abstract: This paper studies boosting algorithms that make a single pass over a set of base classifiers. We first analyze a one-pass algorithm in the setting of boosting with diverse base classifiers. Our guarantee is the same as the best proved for any boosting algorithm, but our one-pass algorithm is much faster than previous approaches. We next exhibit a random source of examples for which a “picky” variant of AdaBoost that skips poor base classifiers can outperform the standard AdaBoost algorithm, which uses every base classifier, by an exponential factor. Experiments with Reuters and synthetic data show that one-pass boosting can substantially improve on the accuracy of Naive Bayes, and that picky boosting can sometimes lead to a further improvement in accuracy.
2 0.24659067 166 nips-2007-Regularized Boost for Semi-Supervised Learning
Author: Ke Chen, Shihai Wang
Abstract: Semi-supervised inductive learning concerns how to learn a decision rule from a data set containing both labeled and unlabeled data. Several boosting algorithms have been extended to semi-supervised learning with various strategies. To our knowledge, however, none of them takes local smoothness constraints among data into account during ensemble learning. In this paper, we introduce a local smoothness regularizer to semi-supervised boosting algorithms based on the universal optimization framework of margin cost functionals. Our regularizer is applicable to existing semi-supervised boosting algorithms to improve their generalization and speed up their training. Comparative results on synthetic, benchmark and real world tasks demonstrate the effectiveness of our local smoothness regularizer. We discuss relevant issues and relate our regularizer to previous work. 1
3 0.18971844 39 nips-2007-Boosting the Area under the ROC Curve
Author: Phil Long, Rocco Servedio
Abstract: We show that any weak ranker that can achieve an area under the ROC curve slightly better than 1/2 (which can be achieved by random guessing) can be efficiently boosted to achieve an area under the ROC curve arbitrarily close to 1. We further show that this boosting can be performed even in the presence of independent misclassification noise, given access to a noise-tolerant weak ranker.
4 0.18033756 38 nips-2007-Boosting Algorithms for Maximizing the Soft Margin
Author: Gunnar Rätsch, Manfred K. Warmuth, Karen A. Glocer
Abstract: We present a novel boosting algorithm, called SoftBoost, designed for sets of binary labeled examples that are not necessarily separable by convex combinations of base hypotheses. Our algorithm achieves robustness by capping the distributions on the examples. Our update of the distribution is motivated by minimizing a relative entropy subject to the capping constraints and constraints on the edges of the obtained base hypotheses. The capping constraints imply a soft margin in the dual optimization problem. Our algorithm produces a convex combination of hypotheses whose soft margin is within δ of its maximum. We employ relative entropy projection methods to prove an O(ln(N)/δ²) iteration bound for our algorithm, where N is the number of examples. We compare our algorithm with other approaches including LPBoost, BrownBoost, and SmoothBoost. We show that there exist cases where the number of iterations required by LPBoost grows linearly in N instead of the logarithmic growth for SoftBoost. In simulation studies we show that our algorithm converges about as fast as LPBoost, faster than BrownBoost, and much faster than SmoothBoost. In a benchmark comparison we illustrate the competitiveness of our approach.
5 0.17675285 90 nips-2007-FilterBoost: Regression and Classification on Large Datasets
Author: Joseph K. Bradley, Robert E. Schapire
Abstract: We study boosting in the filtering setting, where the booster draws examples from an oracle instead of using a fixed training set and so may train efficiently on very large datasets. Our algorithm, which is based on a logistic regression technique proposed by Collins, Schapire, & Singer, requires fewer assumptions to achieve bounds equivalent to or better than previous work. Moreover, we give the first proof that the algorithm of Collins et al. is a strong PAC learner, albeit within the filtering setting. Our proofs demonstrate the algorithm’s strong theoretical properties for both classification and conditional probability estimation, and we validate these results through extensive experiments. Empirically, our algorithm proves more robust to noise and overfitting than batch boosters in conditional probability estimation and proves competitive in classification. 1
6 0.17429097 149 nips-2007-Optimal ROC Curve for a Combination of Classifiers
7 0.10673878 137 nips-2007-Multiple-Instance Pruning For Learning Efficient Cascade Detectors
8 0.096840389 86 nips-2007-Exponential Family Predictive Representations of State
9 0.088329576 110 nips-2007-Learning Bounds for Domain Adaptation
10 0.083676361 32 nips-2007-Bayesian Co-Training
11 0.083071418 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis
12 0.081277646 175 nips-2007-Semi-Supervised Multitask Learning
13 0.075973287 126 nips-2007-McRank: Learning to Rank Using Multiple Classification and Gradient Boosting
14 0.07518103 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators
15 0.059262723 21 nips-2007-Adaptive Online Gradient Descent
16 0.057042181 6 nips-2007-A General Boosting Method and its Application to Learning Ranking Functions for Web Search
17 0.053388562 184 nips-2007-Stability Bounds for Non-i.i.d. Processes
18 0.053214975 195 nips-2007-The Generalized FITC Approximation
19 0.051542565 69 nips-2007-Discriminative Batch Mode Active Learning
20 0.049922388 207 nips-2007-Transfer Learning using Kolmogorov Complexity: Basic Theory and Empirical Evaluations
topicId topicWeight
[(0, -0.172), (1, -0.011), (2, -0.094), (3, 0.143), (4, 0.129), (5, 0.153), (6, 0.184), (7, -0.144), (8, 0.129), (9, -0.133), (10, 0.052), (11, 0.056), (12, -0.164), (13, -0.102), (14, 0.047), (15, -0.043), (16, 0.193), (17, 0.091), (18, -0.172), (19, 0.004), (20, -0.05), (21, 0.223), (22, -0.252), (23, 0.148), (24, -0.072), (25, 0.033), (26, -0.014), (27, 0.019), (28, -0.046), (29, -0.003), (30, 0.015), (31, 0.014), (32, -0.046), (33, 0.033), (34, -0.024), (35, -0.011), (36, -0.011), (37, -0.018), (38, -0.041), (39, -0.009), (40, 0.065), (41, -0.075), (42, 0.014), (43, 0.022), (44, -0.057), (45, 0.057), (46, 0.048), (47, 0.031), (48, -0.021), (49, 0.005)]
simIndex simValue paperId paperTitle
same-paper 1 0.97143918 147 nips-2007-One-Pass Boosting
Author: Zafer Barutcuoglu, Phil Long, Rocco Servedio
Abstract: This paper studies boosting algorithms that make a single pass over a set of base classifiers. We first analyze a one-pass algorithm in the setting of boosting with diverse base classifiers. Our guarantee is the same as the best proved for any boosting algorithm, but our one-pass algorithm is much faster than previous approaches. We next exhibit a random source of examples for which a “picky” variant of AdaBoost that skips poor base classifiers can outperform the standard AdaBoost algorithm, which uses every base classifier, by an exponential factor. Experiments with Reuters and synthetic data show that one-pass boosting can substantially improve on the accuracy of Naive Bayes, and that picky boosting can sometimes lead to a further improvement in accuracy.
2 0.85685056 90 nips-2007-FilterBoost: Regression and Classification on Large Datasets
Author: Joseph K. Bradley, Robert E. Schapire
Abstract: We study boosting in the filtering setting, where the booster draws examples from an oracle instead of using a fixed training set and so may train efficiently on very large datasets. Our algorithm, which is based on a logistic regression technique proposed by Collins, Schapire, & Singer, requires fewer assumptions to achieve bounds equivalent to or better than previous work. Moreover, we give the first proof that the algorithm of Collins et al. is a strong PAC learner, albeit within the filtering setting. Our proofs demonstrate the algorithm’s strong theoretical properties for both classification and conditional probability estimation, and we validate these results through extensive experiments. Empirically, our algorithm proves more robust to noise and overfitting than batch boosters in conditional probability estimation and proves competitive in classification. 1
3 0.81528366 39 nips-2007-Boosting the Area under the ROC Curve
Author: Phil Long, Rocco Servedio
Abstract: We show that any weak ranker that can achieve an area under the ROC curve slightly better than 1/2 (which can be achieved by random guessing) can be efficiently boosted to achieve an area under the ROC curve arbitrarily close to 1. We further show that this boosting can be performed even in the presence of independent misclassification noise, given access to a noise-tolerant weak ranker.
4 0.76463979 38 nips-2007-Boosting Algorithms for Maximizing the Soft Margin
Author: Gunnar Rätsch, Manfred K. Warmuth, Karen A. Glocer
Abstract: We present a novel boosting algorithm, called SoftBoost, designed for sets of binary labeled examples that are not necessarily separable by convex combinations of base hypotheses. Our algorithm achieves robustness by capping the distributions on the examples. Our update of the distribution is motivated by minimizing a relative entropy subject to the capping constraints and constraints on the edges of the obtained base hypotheses. The capping constraints imply a soft margin in the dual optimization problem. Our algorithm produces a convex combination of hypotheses whose soft margin is within δ of its maximum. We employ relative entropy projection methods to prove an O(ln(N)/δ²) iteration bound for our algorithm, where N is the number of examples. We compare our algorithm with other approaches including LPBoost, BrownBoost, and SmoothBoost. We show that there exist cases where the number of iterations required by LPBoost grows linearly in N instead of the logarithmic growth for SoftBoost. In simulation studies we show that our algorithm converges about as fast as LPBoost, faster than BrownBoost, and much faster than SmoothBoost. In a benchmark comparison we illustrate the competitiveness of our approach.
5 0.68581879 149 nips-2007-Optimal ROC Curve for a Combination of Classifiers
Author: Marco Barreno, Alvaro Cardenas, J. D. Tygar
Abstract: We present a new analysis for the combination of binary classifiers. Our analysis makes use of the Neyman-Pearson lemma as a theoretical basis to analyze combinations of classifiers. We give a method for finding the optimal decision rule for a combination of classifiers and prove that it has the optimal ROC curve. We show how our method generalizes and improves previous work on combining classifiers and generating ROC curves. 1
6 0.64760697 166 nips-2007-Regularized Boost for Semi-Supervised Learning
7 0.53606451 137 nips-2007-Multiple-Instance Pruning For Learning Efficient Cascade Detectors
8 0.39026749 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)
9 0.33881512 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data
10 0.33858126 126 nips-2007-McRank: Learning to Rank Using Multiple Classification and Gradient Boosting
11 0.33272123 175 nips-2007-Semi-Supervised Multitask Learning
12 0.30906796 32 nips-2007-Bayesian Co-Training
13 0.29423025 86 nips-2007-Exponential Family Predictive Representations of State
14 0.28677213 88 nips-2007-Fast and Scalable Training of Semi-Supervised CRFs with Application to Activity Recognition
15 0.28374109 184 nips-2007-Stability Bounds for Non-i.i.d. Processes
16 0.27439582 110 nips-2007-Learning Bounds for Domain Adaptation
17 0.26924667 15 nips-2007-A general agnostic active learning algorithm
18 0.26486927 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators
19 0.25873691 27 nips-2007-Anytime Induction of Cost-sensitive Trees
20 0.25153264 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning
topicId topicWeight
[(4, 0.026), (5, 0.022), (13, 0.035), (16, 0.014), (18, 0.025), (21, 0.091), (31, 0.014), (34, 0.052), (35, 0.031), (46, 0.016), (47, 0.058), (49, 0.019), (57, 0.255), (83, 0.159), (87, 0.019), (90, 0.069)]
simIndex simValue paperId paperTitle
same-paper 1 0.78447181 147 nips-2007-One-Pass Boosting
Author: Zafer Barutcuoglu, Phil Long, Rocco Servedio
Abstract: This paper studies boosting algorithms that make a single pass over a set of base classifiers. We first analyze a one-pass algorithm in the setting of boosting with diverse base classifiers. Our guarantee is the same as the best proved for any boosting algorithm, but our one-pass algorithm is much faster than previous approaches. We next exhibit a random source of examples for which a “picky” variant of AdaBoost that skips poor base classifiers can outperform the standard AdaBoost algorithm, which uses every base classifier, by an exponential factor. Experiments with Reuters and synthetic data show that one-pass boosting can substantially improve on the accuracy of Naive Bayes, and that picky boosting can sometimes lead to a further improvement in accuracy.
2 0.6200434 92 nips-2007-Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations
Author: Amir Globerson, Tommi S. Jaakkola
Abstract: We present a novel message passing algorithm for approximating the MAP problem in graphical models. The algorithm is similar in structure to max-product but unlike max-product it always converges, and can be proven to find the exact MAP solution in various settings. The algorithm is derived via block coordinate descent in a dual of the LP relaxation of MAP, but does not require any tunable parameters such as step size or tree weights. We also describe a generalization of the method to cluster based potentials. The new method is tested on synthetic and real-world problems, and compares favorably with previous approaches. Graphical models are an effective approach for modeling complex objects via local interactions. In such models, a distribution over a set of variables is assumed to factor according to cliques of a graph with potentials assigned to each clique. Finding the assignment with highest probability in these models is key to using them in practice, and is often referred to as the MAP (maximum aposteriori) assignment problem. In the general case the problem is NP hard, with complexity exponential in the tree-width of the underlying graph. Linear programming (LP) relaxations have proven very useful in approximating the MAP problem, and often yield satisfactory empirical results. These approaches relax the constraint that the solution is integral, and generally yield non-integral solutions. However, when the LP solution is integral, it is guaranteed to be the exact MAP. For some classes of problems the LP relaxation is provably correct. These include the minimum cut problem and maximum weight matching in bi-partite graphs [8]. Although LP relaxations can be solved using standard LP solvers, this may be computationally intensive for large problems [13]. The key problem with generic LP solvers is that they do not use the graph structure explicitly and thus may be sub-optimal in terms of computational efficiency. The max-product method [7] is a message passing algorithm that is often used to approximate the MAP problem. In contrast to generic LP solvers, it makes direct use of the graph structure in constructing and passing messages, and is also very simple to implement. The relation between max-product and the LP relaxation has remained largely elusive, although there are some notable exceptions: For tree-structured graphs, max-product and LP both yield the exact MAP. A recent result [1] showed that for maximum weight matching on bi-partite graphs max-product and LP also yield the exact MAP [1]. Finally, Tree-Reweighted max-product (TRMP) algorithms [5, 10] were shown to converge to the LP solution for binary xi variables, as shown in [6]. In this work, we propose the Max Product Linear Programming algorithm (MPLP) - a very simple variation on max-product that is guaranteed to converge, and has several advantageous properties. MPLP is derived from the dual of the LP relaxation, and is equivalent to block coordinate descent in the dual. Although this results in monotone improvement of the dual objective, global convergence is not always guaranteed since coordinate descent may get stuck in suboptimal points. This can be remedied using various approaches, but in practice we have found MPLP to converge to the LP 1 solution in a majority of the cases we studied. To derive MPLP we use a special form of the dual LP, which involves the introduction of redundant primal variables and constraints. 
We show how the dual variables corresponding to these constraints turn out to be the messages in the algorithm. We evaluate the method on Potts models and protein design problems, and show that it compares favorably with max-product (which often does not converge for these problems) and TRMP. 1 The Max-Product and MPLP Algorithms The max-product algorithm [7] is one of the most often used methods for solving MAP problems. Although it is neither guaranteed to converge to the correct solution, or in fact converge at all, it provides satisfactory results in some cases. Here we present two algorithms: EMPLP (edge based MPLP) and NMPLP (node based MPLP), which are structurally very similar to max-product, but have several key advantages: • After each iteration, the messages yield an upper bound on the MAP value, and the sequence of bounds is monotone decreasing and convergent. The messages also have a limit point that is a fixed point of the update rule. • No additional parameters (e.g., tree weights as in [6]) are required. • If the fixed point beliefs have a unique maximizer then they correspond to the exact MAP. • For binary variables, MPLP can be used to obtain the solution to an LP relaxation of the MAP problem. Thus, when this LP relaxation is exact and variables are binary, MPLP will find the MAP solution. Moreover, for any variable whose beliefs are not tied, the MAP assignment can be found (i.e., the solution is partially decodable). Pseudo code for the algorithms (and for max-product) is given in Fig. 1. As we show in the next sections, MPLP is essentially a block coordinate descent algorithm in the dual of a MAP LP relaxation. Every update of the MPLP messages corresponds to exact minimization of a set of dual variables. For EMPLP minimization is over the set of variables corresponding to an edge, and for NMPLP it is over the set of variables corresponding to all the edges a given node appears in (i.e., a star). The properties of MPLP result from its relation to the LP dual. In what follows we describe the derivation of the MPLP algorithms and prove their properties. 2 The MAP Problem and its LP Relaxation We consider functions over n variables x = {x1 , . . . , xn } defined as follows. Given a graph G = (V, E) with n vertices, and potentials θij (xi , xj ) for all edges ij ∈ E, define the function1 f (x; θ) = θij (xi , xj ) . (1) ij∈E The MAP problem is defined as finding an assignment xM that maximizes the function f (x; θ). Below we describe the standard LP relaxation for this problem. Denote by {µij (xi , xj )}ij∈E distributions over variables corresponding to edges ij ∈ E and {µi (xi )}i∈V distributions corresponding to nodes i ∈ V . We will use µ to denote a given set of distributions over all edges and nodes. The set ML (G) is defined as the set of µ where pairwise and singleton distributions are consistent x ˆ xi µij (ˆi , xj ) = µj (xj ) , ˆ xj µij (xi , xj ) = µi (xi ) ∀ij ∈ E, xi , xj ˆ ML (G) = µ ≥ 0 ∀i ∈ V xi µi (xi ) = 1 Now consider the following linear program: MAPLPR : µL∗ = arg max µ∈ML (G) µ·θ. (2) where µ·θ is shorthand for µ·θ = ij∈E xi ,xj θij (xi , xj )µij (xi , xj ). It is easy to show (see e.g., [10]) that the optimum of MAPLPR yields an upper bound on the MAP value, i.e. µL∗ ·θ ≥ f (xM ). Furthermore, when the optimal µi (xi ) have only integral values, the assignment that maximizes µi (xi ) yields the correct MAP assignment. In what follows we show how the MPLP algorithms can be derived from the dual of MAPLPR. 
1 P We note that some authors also add a term i∈V θi (xi ) to f (x; θ). However, these terms can be included in the pairwise functions θij (xi , xj ), so we ignore them for simplicity. 2 3 The LP Relaxation Dual Since MAPLPR is an LP, it has an equivalent convex dual. In App. A we derive a special dual of MAPLPR using a different representation of ML (G) with redundant variables. The advantage of this dual is that it allows the derivation of simple message passing algorithms. The dual is described in the following proposition. Proposition 1 The following optimization problem is a convex dual of MAPLPR DMAPLPR : min max xi i s.t. max βki (xk , xi ) (3) k∈N (i) xk βji (xj , xi ) + βij (xi , xj ) = θij (xi , xj ) , where the dual variables are βij (xi , xj ) for all ij, ji ∈ E and values of xi and xj . The dual has an intuitive interpretation in terms of re-parameterizations. Consider the star shaped graph Gi consisting of node i and all its neighbors N (i). Assume the potential on edge ki (for k ∈ N (i)) is βki (xk , xi ). The value of the MAP assignment for this model is max max βki (xk , xi ). This is exactly the term in the objective of DMAPLPR. Thus the dual xi k∈N (i) xk corresponds to individually decoding star graphs around all nodes i ∈ V where the potentials on the graph edges should sum to the original potential. It is easy to see that this will always result in an upper bound on the MAP value. The somewhat surprising result of the duality is that there exists a β assignment such that star decoding yields the optimal value of MAPLPR. 4 Block Coordinate Descent in the Dual To obtain a convergent algorithm we use a simple block coordinate descent strategy. At every iteration, fix all variables except a subset, and optimize over this subset. It turns out that this can be done in closed form for the cases we consider. We begin by deriving the EMPLP algorithm. Consider fixing all the β variables except those corresponding to some edge ij ∈ E (i.e., βij and βji ), and minimizing DMAPLPR over the non-fixed variables. Only two terms in the DMAPLPR objective depend on βij and βji . We can write those as f (βij , βji ) = max λ−j (xi ) + max βji (xj , xi ) + max λ−i (xj ) + max βij (xi , xj ) i j xi where we defined λ−j (xi ) = i xj k∈N (i)\j xi xi (4) λki (xi ) and λki (xi ) = maxxk βki (xk , xi ) as in App. A. Note that the function f (βij , βji ) depends on the other β values only through λ−i (xj ) and λ−j (xi ). j i This implies that the optimization can be done solely in terms of λij (xj ) and there is no need to store the β values explicitly. The optimal βij , βji are obtained by minimizing f (βij , βji ) subject to the re-parameterization constraint βji (xj , xi ) + βij (xi , xj ) = θij (xi , xj ). The following proposition characterizes the minimum of f (βij , βji ). In fact, as mentioned above, we do not need to characterize the optimal βij (xi , xj ) itself, but only the new λ values. Proposition 2 Maximizing the function f (βij , βji ) yields the following λji (xi ) (and the equivalent expression for λij (xj )) 1 −j 1 λji (xi ) = − λi (xi ) + max λ−i (xj ) + θij (xi , xj ) j 2 2 xj The proposition is proved in App. B. The λ updates above result in the EMPLP algorithm, described in Fig. 1. Note that since the β optimization affects both λji (xi ) and λij (xj ), both these messages need to be updated simultaneously. We proceed to derive the NMPLP algorithm. 
For a given node i ∈ V , we consider all its neighbors j ∈ N (i), and wish to optimize over the variables βji (xj , xi ) for ji, ij ∈ E (i.e., all the edges in a star centered on i), while the other variables are fixed. One way of doing so is to use the EMPLP algorithm for the edges in the star, and iterate it until convergence. We now show that the result of 3 Inputs: A graph G = (V, E), potential functions θij (xi , xj ) for each edge ij ∈ E. Initialization: Initialize messages to any value. Algorithm: • Iterate until a stopping criterion is satisfied: – Max-product: Iterate over messages and update (cji shifts the max to zero) h i mji (xi )← max m−i (xj ) + θij (xi , xj ) − cji j xj – EMPLP: For each ij ∈ E, update λji (xi ) and λij (xj ) simultaneously (the update for λij (xj ) is the same with i and j exchanged) h i 1 1 λji (xi )← − λ−j (xi ) + max λ−i (xj ) + θij (xi , xj ) j i 2 2 xj – NMPLP: Iterate over nodes i ∈ V and update all γij (xj ) where j ∈ N (i) 2 3 X 2 γij (xj )← max 4θij (xi , xj ) − γji (xi ) + γki (xi )5 xi |N (i)| + 1 k∈N(i) • Calculate node “beliefs”: Set biP i ) to be the sum of incoming messages into node i ∈ V (x (e.g., for NMPLP set bi (xi ) = k∈N(i) γki (xi )). Output: Return assignment x defined as xi = arg maxxi b(ˆi ). x ˆ Figure 1: The max-product, EMPLP and NMPLP algorithms. Max-product, EMPLP and NMPLP use mesP sages mij , λij and γij respectively. We use the notation m−i (xj ) = k∈N(j)\i mkj (xj ). j this optimization can be found in closed form. The assumption about β being fixed outside the star implies that λ−i (xj ) is fixed. Define: γji (xi ) = maxxj θij (xi , xj ) + λ−i (xj ) . Simple algebra j j yields the following relation between λ−j (xi ) and γki (xi ) for k ∈ N (i) i 2 λ−j (xi ) = −γji (xi ) + γki (xi ) (5) i |N (i)| + 1 k∈N (i) Plugging this into the definition of γji (xi ) we obtain the NMPLP update in Fig. 1. The messages for both algorithms can be initialized to any value since it can be shown that after one iteration they will correspond to valid β values. 5 Convergence Properties The MPLP algorithm decreases the dual objective (i.e., an upper bound on the MAP value) at every iteration, and thus its dual objective values form a convergent sequence. Using arguments similar to [5] it can be shown that MPLP has a limit point that is a fixed point of its updates. This in itself does not guarantee convergence to the dual optimum since coordinate descent algorithms may get stuck at a point that is not a global optimum. There are ways of overcoming this difficulty, for example by smoothing the objective [4] or using techniques as in [2] (see p. 636). We leave such extensions for further work. In this section we provide several results about the properties of the MPLP fixed points and their relation to the corresponding LP. First, we claim that if all beliefs have unique maxima then the exact MAP assignment is obtained. Proposition 3 If the fixed point of MPLP has bi (xi ) such that for all i the function bi (xi ) has a unique maximizer x∗ , then x∗ is the solution to the MAP problem and the LP relaxation is exact. i Since the dual objective is always greater than or equal to the MAP value, it suffices to show that there exists a dual feasible point whose objective value is f (x∗ ). Denote by β ∗ , λ∗ the value of the corresponding dual parameters at the fixed point of MPLP. 
Then the dual objective satisfies λ∗ (xi ) = ki max i xi k∈N (i) ∗ max βki (xk , x∗ ) = i i k∈N (i) xk ∗ βki (x∗ , x∗ ) = f (x∗ ) k i i 4 k∈N (i) To see why the second equality holds, note that bi (x∗ ) = maxxi ,xj λ−j (xi ) + βji (xj , xi ) and i i bj (x∗ ) = maxxi ,xj λ−i (xj ) + βij (xi , xj ). By the equalization property in Eq. 9 the arguments of j j the two max operations are equal. From the unique maximum assumption it follows that x∗ , x∗ are i j the unique maximizers of the above. It follows that βji , βij are also maximized by x∗ , x∗ . i j In the general case, the MPLP fixed point may not correspond to a primal optimum because of the local optima problem with coordinate descent. However, when the variables are binary, fixed points do correspond to primal solutions, as the following proposition states. Proposition 4 When xi are binary, the MPLP fixed point can be used to obtain the primal optimum. The claim can be shown by constructing a primal optimal solution µ∗ . For tied bi , set µ∗ (xi ) to 0.5 i and for untied bi , set µ∗ (x∗ ) to 1. If bi , bj are not tied we set µ∗ (x∗ , x∗ ) = 1. If bi is not tied but bj i i ij i j is, we set µ∗ (x∗ , xj ) = 0.5. If bi , bj are tied then βji , βij can be shown to be maximized at either ij i x∗ , x∗ = (0, 0), (1, 1) or x∗ , x∗ = (0, 1), (1, 0). We then set µ∗ to be 0.5 at one of these assignment i j i j ij ∗ pairs. The resulting µ∗ is clearly primal feasible. Setting δi = b∗ we obtain that the dual variables i (δ ∗ , λ∗ , β ∗ ) and primal µ∗ satisfy complementary slackness for the LP in Eq. 7 and therefore µ∗ is primal optimal. The binary optimality result implies partial decodability, since [6] shows that the LP is partially decodable for binary variables. 6 Beyond pairwise potentials: Generalized MPLP In the previous sections we considered maximizing functions which factor according to the edges of the graph. A more general setting considers clusters c1 , . . . , ck ⊂ {1, . . . , n} (the set of clusters is denoted by C), and a function f (x; θ) = c θc (xc ) defined via potentials over clusters θc (xc ). The MAP problem in this case also has an LP relaxation (see e.g. [11]). To define the LP we introduce the following definitions: S = {c ∩ c : c, c ∈ C, c ∩ c = ∅} is the set of intersection between clusters ˆ ˆ ˆ and S(c) = {s ∈ S : s ⊆ c} is the set of overlap sets for cluster c.We now consider marginals over the variables in c ∈ C and s ∈ S and require that cluster marginals agree on their overlap. Denote this set by ML (C). The LP relaxation is then to maximize µ · θ subject to µ ∈ ML (C). As in Sec. 4, we can derive message passing updates that result in monotone decrease of the dual LP of the above relaxation. The derivation is similar and we omit the details. The key observation is that one needs to introduce |S(c)| copies of each marginal µc (xc ) (instead of the two copies in the pairwise case). Next, as in the EMPLP derivation we assume all β are fixed except those corresponding to some cluster c. The resulting messages are λc→s (xs ) from a cluster c to all of its intersection sets s ∈ S(c). The update on these messages turns out to be: 1 1 λ−c (xs ) + max λ−c (xs ) + θc (xc ) λc→s (xs ) = − 1 − ˆ s s ˆ |S(c)| |S(c)| xc\s s∈S(c)\s ˆ where for a given c ∈ C all λc→s should be updated simultaneously for s ∈ S(c), and λ−c (xs ) is s defined as the sum of messages into s that are not from c. We refer to this algorithm as Generalized EMPLP (GEMPLP). 
It is possible to derive an algorithm similar to NMPLP that updates several clusters simultaneously, but its structure is more involved and we do not address it here. 7 Related Work Weiss et al. [11] recently studied the fixed points of a class of max-product like algorithms. Their analysis focused on properties of fixed points rather than convergence guarantees. Specifically, they showed that if the counting numbers used in a generalized max-product algorithm satisfy certain properties, then its fixed points will be the exact MAP if the beliefs have unique maxima, and for binary variables the solution can be partially decodable. Both these properties are obtained for the MPLP fixed points, and in fact we can show that MPLP satisfies the conditions in [11], so that we obtain these properties as corollaries of [11]. We stress however, that [11] does not address convergence of algorithms, but rather properties of their fixed points, if they converge. MPLP is similar in some aspects to Kolmogorov’s TRW-S algorithm [5]. TRW-S is also a monotone coordinate descent method in a dual of the LP relaxation and its fixed points also have similar 5 guarantees to those of MPLP [6]. Furthermore, convergence to a local optimum may occur, as it does for MPLP. One advantage of MPLP lies in the simplicity of its updates and the fact that it is parameter free. The other is its simple generalization to potentials over clusters of nodes (Sec. 6). Recently, several new dual LP algorithms have been introduced, which are more closely related to our formalism. Werner [12] presented a class of algorithms which also improve the dual LP at every iteration. The simplest of those is the max-sum-diffusion algorithm, which is similar to our EMPLP algorithm, although the updates are different from ours. Independently, Johnson et al. [4] presented a class of algorithms that improve duals of the MAP-LP using coordinate descent. They decompose the model into tractable parts by replicating variables and enforce replication constraints within the Lagrangian dual. Our basic formulation in Eq. 3 could be derived from their perspective. However, the updates in the algorithm and the analysis differ. Johnson et al. also presented a method for overcoming the local optimum problem, by smoothing the objective so that it is strictly convex. Such an approach could also be used within our algorithms. Vontobel and Koetter [9] recently introduced a coordinate descent algorithm for decoding LDPC codes. Their method is specifically tailored for this case, and uses updates that are similar to our edge based updates. Finally, the concept of dual coordinate descent may be used in approximating marginals as well. In [3] we use such an approach to optimize a variational bound on the partition function. The derivation uses some of the ideas used in the MPLP dual, but importantly does not find the minimum for each coordinate. Instead, a gradient like step is taken at every iteration to decrease the dual objective. 8 Experiments We compared NMPLP to three other message passing algorithms:2 Tree-Reweighted max-product (TRMP) [10],3 standard max-product (MP), and GEMPLP. For MP and TRMP we used the standard approach of damping messages using a factor of α = 0.5. We ran all algorithms for a maximum of 2000 iterations, and used the hit-time measure to compare their speed of convergence. This measure is defined as follows: At every iteration the beliefs can be used to obtain an assignment x with value f (x). 
We define the hit-time as the first iteration at which the maximum value of f (x) is achieved.4 We first experimented with a 10 × 10 grid graph, with 5 values per state. The function f (x) was 5 a Potts model: f (x) = The values for θij and θi (xi ) ij∈E θij I(xi = xj ) + i∈V θi (xi ). were randomly drawn from [−cI , cI ] and [−cF , cF ] respectively, and we used values of cI and cF in the range range [0.1, 2.35] (with intervals of 0.25), resulting in 100 different models. The clusters for GEMPLP were the faces of the graph [14]. To see if NMPLP converges to the LP solution we also used an LP solver to solve the LP relaxation. We found that the the normalized difference between NMPLP and LP objective was at most 10−3 (median 10−7 ), suggesting that NMPLP typically converged to the LP solution. Fig. 2 (top row) shows the results for the three algorithms. It can be seen that while all non-cluster based algorithms obtain similar f (x) values, NMPLP has better hit-time (in the median) than TRMP and MP, and MP does not converge in many cases (see caption). GEMPLP converges more slowly than NMPLP, but obtains much better f (x) values. In fact, in 99% of the cases the normalized difference between the GEMPLP objective and the f (x) value was less than 10−5 , suggesting that the exact MAP solution was found. We next applied the algorithms to the real world problems of protein design. In [13], Yanover et al. show how these problems can be formalized in terms of finding a MAP in an appropriately constructed graphical model.6 We used all algorithms except GNMPLP (since there is no natural choice for clusters in this case) to approximate the MAP solution on the 97 models used in [13]. In these models the number of states per variable is 2 − 158, and there are up to 180 variables per model. Fig. 2 (bottom) shows results for all the design problems. In this case only 11% of the MP runs converged, and NMPLP was better than TRMP in terms of hit-time and comparable in f (x) value. The performance of MP was good on the runs where it converged. 2 As expected, NMPLP was faster than EMPLP so only NMPLP results are given. The edge weights for TRMP corresponded to a uniform distribution over all spanning trees. 4 This is clearly a post-hoc measure since it can only be obtained after the algorithm has exceeded its maximum number of iterations. However, it is a reasonable algorithm-independent measure of convergence. 5 The potential θi (xi ) may be folded into the pairwise potential to yield a model as in Eq. 1. 6 Data available from http://jmlr.csail.mit.edu/papers/volume7/yanover06a/Rosetta Design Dataset.tgz 3 6 (a) (b) (c) 100 (d) 0.6 2000 0.04 0.4 0.02 −50 0 −0.02 −0.04 ∆(Value) 0 1000 ∆(Hit Time) ∆(Value) ∆(Hit Time) 50 0 MP TRMP GMPLP 0 −0.2 −1000 −0.4 −0.06 −100 0.2 MP TRMP GMPLP MP TRMP MP TRMP Figure 2: Evaluation of message passing algorithms on Potts models and protein design problems. (a,c): Convergence time results for the Potts models (a) and protein design problems (c). The box-plots (horiz. red line indicates median) show the difference between the hit-time for the other algorithms and NMPLP. (b,d): Value of integer solutions for the Potts models (b) and protein design problems (d). The box-plots show the normalized difference between the value of f (x) for NMPLP and the other algorithms. All figures are such that better MPLP performance yields positive Y axis values. Max-product converged on 58% of the cases for the Potts models, and on 11% of the protein problems. 
9 Conclusion

We have presented a convergent algorithm for MAP approximation that is based on block coordinate descent in the dual of the MAP-LP relaxation. The algorithm can also be extended to cluster-based functions, which empirically results in improved MAP estimates. This is in line with the observations in [14] that generalized belief propagation algorithms can yield significant performance improvements. However, generalized max-product algorithms [14] are not guaranteed to converge, whereas GMPLP is. Furthermore, the GMPLP algorithm does not require a region graph and only involves intersections between pairs of clusters. In conclusion, MPLP has the advantage of resolving the convergence problems of max-product while retaining its simplicity and offering the theoretical guarantees of LP relaxations. We thus believe it should be useful in a wide array of applications.

A Derivation of the dual

Before deriving the dual, we first express the constraint set M_L(G) in a slightly different way. The definition of M_L(G) in Sec. 2 uses a single distribution μ_{ij}(x_i, x_j) for every ij ∈ E. In what follows, we use two copies of this pairwise distribution for every edge, which we denote \bar{μ}_{ij}(x_i, x_j) and \bar{μ}_{ji}(x_j, x_i), and we add the constraint that both copies equal the original μ_{ij}(x_i, x_j). For this extended set of pairwise marginals, we consider the following set of constraints, which is clearly equivalent to M_L(G). In the rightmost column we give the dual variable that will correspond to each constraint (we omit the non-negativity constraints):

\bar{μ}_{ij}(x_i, x_j) = μ_{ij}(x_i, x_j)                      ∀ij ∈ E, x_i, x_j    [β_{ij}(x_i, x_j)]
\bar{μ}_{ji}(x_j, x_i) = μ_{ij}(x_i, x_j)                      ∀ij ∈ E, x_i, x_j    [β_{ji}(x_j, x_i)]
\sum_{\hat{x}_i} \bar{μ}_{ij}(\hat{x}_i, x_j) = μ_j(x_j)       ∀ij ∈ E, x_j         [λ_{ij}(x_j)]
\sum_{\hat{x}_j} \bar{μ}_{ji}(\hat{x}_j, x_i) = μ_i(x_i)       ∀ji ∈ E, x_i         [λ_{ji}(x_i)]
\sum_{x_i} μ_i(x_i) = 1                                        ∀i ∈ V               [δ_i]            (6)

We denote the set of (μ, \bar{μ}) satisfying these constraints by \bar{M}_L(G). We can now state an LP that is equivalent to MAPLPR, only with an extended set of variables and constraints: maximize μ · θ subject to (μ, \bar{μ}) ∈ \bar{M}_L(G) (note that the objective uses only the original μ copy). An LP duality transformation of the extended problem yields the following LP:

min \sum_i δ_i
s.t.  λ_{ij}(x_j) − β_{ij}(x_i, x_j) ≥ 0                       ∀ij, ji ∈ E, x_i, x_j
      β_{ij}(x_i, x_j) + β_{ji}(x_j, x_i) = θ_{ij}(x_i, x_j)    ∀ij ∈ E, x_i, x_j
      δ_i − \sum_{k ∈ N(i)} λ_{ki}(x_i) ≥ 0                    ∀i ∈ V, x_i          (7)

We next simplify this LP by eliminating some of its constraints and variables. Since each variable δ_i appears only in its own constraints and the objective minimizes δ_i, it follows that δ_i = \max_{x_i} \sum_{k ∈ N(i)} λ_{ki}(x_i), and the constraints with δ_i can be discarded. Similarly, since λ_{ij}(x_j) appears in a single constraint, we have that for all ij ∈ E, ji ∈ E, x_i, x_j, λ_{ij}(x_j) = \max_{x_i} β_{ij}(x_i, x_j), and the constraints with λ_{ij}(x_j), λ_{ji}(x_i) can also be discarded. Using the eliminated δ_i and λ_{ji}(x_i) variables, we obtain that the LP in Eq. 7 is equivalent to that in Eq. 3. Note that the objective in Eq. 3 is convex, since it is a sum of point-wise maxima of convex functions.
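To make the elimination step concrete, here is a minimal sketch that recovers λ and δ from a feasible β and evaluates the dual objective of Eq. 7; the dictionary/array representation and the function name are assumptions for this example, not part of the paper.

```python
import numpy as np

def dual_value_from_beta(beta, neighbors):
    """Evaluate sum_i delta_i of Eq. 7 after eliminating lambda and delta.

    beta[(i, j)]  -- array of shape (k_i, k_j), entry beta_ij[x_i, x_j],
                     assumed to satisfy beta_ij + beta_ji.T = theta_ij
    neighbors[i]  -- list of the neighbors of node i
    """
    # lambda_ij(x_j) = max_{x_i} beta_ij(x_i, x_j)
    lam = {(i, j): b.max(axis=0) for (i, j), b in beta.items()}
    total = 0.0
    for i, nbrs in neighbors.items():
        # delta_i = max_{x_i} sum_{k in N(i)} lambda_{ki}(x_i)
        incoming = sum(lam[(n, i)] for n in nbrs)
        total += float(np.max(incoming))
    return total
```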
B Proof of Proposition 2

We wish to minimize f in Eq. 4 subject to the constraint that β_{ij} + β_{ji} = θ_{ij}. Rewrite f as

f(β_{ij}, β_{ji}) = \max_{x_i, x_j} [λ_i^{−j}(x_i) + β_{ji}(x_j, x_i)] + \max_{x_i, x_j} [λ_j^{−i}(x_j) + β_{ij}(x_i, x_j)].    (8)

Because of the constraint on β, the two maximized arguments sum to λ_i^{−j}(x_i) + λ_j^{−i}(x_j) + θ_{ij}(x_i, x_j). Thus the minimum is at least

(1/2) \max_{x_i, x_j} [λ_i^{−j}(x_i) + λ_j^{−i}(x_j) + θ_{ij}(x_i, x_j)].

One assignment to β that achieves this minimum is obtained by requiring the equalization condition

λ_j^{−i}(x_j) + β_{ij}(x_i, x_j) = λ_i^{−j}(x_i) + β_{ji}(x_j, x_i) = (1/2) [θ_{ij}(x_i, x_j) + λ_i^{−j}(x_i) + λ_j^{−i}(x_j)],    (9)

which implies β_{ij}(x_i, x_j) = (1/2) [θ_{ij}(x_i, x_j) + λ_i^{−j}(x_i) − λ_j^{−i}(x_j)] and a similar expression for β_{ji}. (Other solutions are possible but may not yield some of the properties of MPLP.) The resulting λ_{ij}(x_j) = \max_{x_i} β_{ij}(x_i, x_j) are then the ones in Prop. 2.

Acknowledgments

The authors acknowledge support from the Defense Advanced Research Projects Agency (Transfer Learning program). Amir Globerson was also supported by the Rothschild Yad-Hanadiv fellowship.

References

[1] M. Bayati, D. Shah, and M. Sharma. Maximum weight matching via max-product belief propagation. IEEE Trans. on Information Theory (to appear), 2007.
[2] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995.
[3] A. Globerson and T. Jaakkola. Convergent propagation algorithms via oriented trees. In UAI, 2007.
[4] J. K. Johnson, D. M. Malioutov, and A. S. Willsky. Lagrangian relaxation for MAP estimation in graphical models. In Allerton Conf. on Communication, Control and Computing, 2007.
[5] V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1568–1583, 2006.
[6] V. Kolmogorov and M. Wainwright. On the optimality of tree-reweighted max-product message passing. In 21st Conference on Uncertainty in Artificial Intelligence (UAI), 2005.
[7] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
[8] B. Taskar, S. Lacoste-Julien, and M. Jordan. Structured prediction, dual extragradient and Bregman projections. Journal of Machine Learning Research, 7:1627–1653, 2006.
[9] P. O. Vontobel and R. Koetter. Towards low-complexity linear-programming decoding. In Proc. 4th Int. Symposium on Turbo Codes and Related Topics, 2006.
[10] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. MAP estimation via agreement on trees: message-passing and linear programming. IEEE Trans. on Information Theory, 51(11):1120–1146, 2005.
[11] Y. Weiss, C. Yanover, and T. Meltzer. MAP estimation, linear programming and belief propagation with convex free energies. In UAI, 2007.
[12] T. Werner. A linear programming approach to max-sum, a review. IEEE Trans. on PAMI, 2007.
[13] C. Yanover, T. Meltzer, and Y. Weiss. Linear programming relaxations and belief propagation – an empirical study. Journal of Machine Learning Research, 7:1887–1907, 2006.
[14] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. on Information Theory, 51(7):2282–2312, 2005.
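As a closing illustration of the assignment in Eq. 9 above, here is a minimal sketch of the resulting closed-form update for a single edge ij; the array layout and the function name are assumptions made for this example, not definitions from the paper.

```python
import numpy as np

def edge_update(theta_ij, lam_i_minus_j, lam_j_minus_i):
    """Closed-form coordinate step for one edge ij, following Eq. 9.

    theta_ij       -- array of shape (k_i, k_j), entry theta_ij[x_i, x_j]
    lam_i_minus_j  -- length-k_i vector lambda_i^{-j}(x_i)
    lam_j_minus_i  -- length-k_j vector lambda_j^{-i}(x_j)
    Returns the new messages lambda_ij(x_j) and lambda_ji(x_i).
    """
    # beta_ij(x_i, x_j) = 0.5 * (theta_ij + lambda_i^{-j}(x_i) - lambda_j^{-i}(x_j))
    beta_ij = 0.5 * (theta_ij + lam_i_minus_j[:, None] - lam_j_minus_i[None, :])
    # symmetric expression for beta_ji, indexed as [x_j, x_i]
    beta_ji = 0.5 * (theta_ij.T + lam_j_minus_i[:, None] - lam_i_minus_j[None, :])
    lam_ij = beta_ij.max(axis=0)   # lambda_ij(x_j) = max_{x_i} beta_ij(x_i, x_j)
    lam_ji = beta_ji.max(axis=0)   # lambda_ji(x_i) = max_{x_j} beta_ji(x_j, x_i)
    return lam_ij, lam_ji
```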
3 0.61813831 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
Author: Kai Yu, Wei Chu
Abstract: This paper aims to model relational data on edges of networks. We describe appropriate Gaussian Processes (GPs) for directed, undirected, and bipartite networks. The inter-dependencies of edges can be effectively modeled by adapting the GP hyper-parameters. The framework suggests an intimate connection between link prediction and transfer learning, which were traditionally two separate research topics. We develop an efficient learning algorithm that can handle a large number of observations. The experimental results on several real-world data sets verify superior learning capacity.
4 0.61739689 186 nips-2007-Statistical Analysis of Semi-Supervised Regression
Author: Larry Wasserman, John D. Lafferty
Abstract: Semi-supervised methods use unlabeled data in addition to labeled data to construct predictors. While existing semi-supervised methods have shown some promising empirical performance, their development has been largely based on heuristics. In this paper we study semi-supervised learning from the viewpoint of minimax theory. Our first result shows that some common methods based on regularization using graph Laplacians do not lead to faster minimax rates of convergence. Thus, the estimators that use the unlabeled data do not have smaller risk than the estimators that use only labeled data. We then develop several new approaches that provably lead to improved performance. The statistical tools of minimax analysis are thus used to offer some new perspective on the problem of semi-supervised learning.
5 0.616952 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations
Author: Charles L. Isbell, Michael P. Holmes, Alexander G. Gray
Abstract: Machine learning contains many computational bottlenecks in the form of nested summations over datasets. Kernel estimators and other methods are burdened by these expensive computations. Exact evaluation is typically O(n^2) or higher, which severely limits application to large datasets. We present a multi-stage stratified Monte Carlo method for approximating such summations with probabilistic relative error control. The essential idea is fast approximation by sampling in trees. This method differs from many previous scalability techniques (such as standard multi-tree methods) in that its error is stochastic, but we derive conditions for error control and demonstrate that they work. Further, we give a theoretical sample complexity for the method that is independent of dataset size, and show that this appears to hold in experiments, where speedups reach as high as 10^14, many orders of magnitude beyond the previous state of the art.
6 0.61610776 6 nips-2007-A General Boosting Method and its Application to Learning Ranking Functions for Web Search
7 0.61476684 175 nips-2007-Semi-Supervised Multitask Learning
8 0.61462051 156 nips-2007-Predictive Matrix-Variate t Models
9 0.61313754 128 nips-2007-Message Passing for Max-weight Independent Set
10 0.6118809 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes
11 0.61158741 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)
12 0.61114991 196 nips-2007-The Infinite Gamma-Poisson Feature Model
13 0.61097687 187 nips-2007-Structured Learning with Approximate Inference
14 0.60898566 178 nips-2007-Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains
15 0.60785669 63 nips-2007-Convex Relaxations of Latent Variable Training
16 0.60724974 46 nips-2007-Cluster Stability for Finite Samples
17 0.60626751 69 nips-2007-Discriminative Batch Mode Active Learning
18 0.60519081 16 nips-2007-A learning framework for nearest neighbor search
19 0.6051892 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine
20 0.60512602 66 nips-2007-Density Estimation under Independent Similarly Distributed Sampling Assumptions