nips nips2001 nips2001-45 knowledge-graph by maker-knowledge-mining

45 nips-2001-Boosting and Maximum Likelihood for Exponential Models


Source: pdf

Author: Guy Lebanon, John D. Lafferty

Abstract: We derive an equivalence between AdaBoost and the dual of a convex optimization problem, showing that the only difference between minimizing the exponential loss used by AdaBoost and maximum likelihood for exponential models is that the latter requires the model to be normalized to form a conditional probability distribution over labels. In addition to establishing a simple and easily understood connection between the two methods, this framework enables us to derive new regularization procedures for boosting that directly correspond to penalized maximum likelihood. Experiments on UCI datasets support our theoretical analysis and give additional insight into the relationship between boosting and logistic regression.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In addition to establishing a simple and easily understood connection between the two methods, this framework enables us to derive new regularization procedures for boosting that directly correspond to penalized maximum likelihood. [sent-6, score-0.611]

2 Experiments on UCI datasets support our theoretical analysis and give additional insight into the relationship between boosting and logistic regression. [sent-7, score-0.551]

3 1 Introduction Several recent papers in statistics and machine learning have been devoted to the relationship between boosting and more standard statistical procedures such as logistic regression. [sent-8, score-0.54]

4 In spite of this activity, an easy-to-understand and clean connection between these different techniques has not emerged. [sent-9, score-0.038]

5 Kivinen and Warmuth [8] note that boosting is a form of “entropy projection,” and Lafferty [9] suggests the use of Bregman distances to approximate the exponential loss. [sent-11, score-0.565]

6 Mason et al. [10] consider boosting algorithms as functional gradient descent, and Duffy and Helmbold [5] study various loss functions with respect to the PAC boosting property. [sent-13, score-0.805]

7 More recently, Collins, Schapire and Singer [2] show how different Bregman distances precisely account for boosting and logistic regression, and use this framework to give the first convergence proof of AdaBoost. [sent-14, score-0.494]

8 However, in this work the two methods are viewed as minimizing different loss functions. [sent-15, score-0.119]

9 Moreover, the optimization problems are formulated in terms of a reference distribution consisting of the zero vector, rather than the empirical distribution of the data, making the interpretation of this use of Bregman distances problematic from a statistical point of view. [sent-16, score-0.08]

10 In this paper we present a very basic connection between boosting and maximum likelihood for exponential models through a simple convex optimization problem. [sent-17, score-0.829]

11 In this setting, it is seen that the only difference between AdaBoost and maximum likelihood for exponential models, in particular logistic regression, is that the latter requires the model to be normalized to form a probability distribution. [sent-18, score-0.463]

12 The two methods minimize the same extended Kullback-Leibler divergence objective function subject to the same feature constraints. [sent-19, score-0.063]

13 Using information geometry, we show that projecting the exponential loss model onto the simplex of conditional probability distributions gives precisely the maximum likelihood exponential model with the specified sufficient statistics. [sent-20, score-0.707]

14 In many cases of practical interest, the resulting models will be identical; in particular, as the number of features increases to fit the training data the two methods will give the same classifiers. [sent-21, score-0.09]

15 We note that throughout the paper we view boosting as a procedure for minimizing the exponential loss, using either parallel or sequential update algorithms as in [2], rather than as a forward stepwise procedure as presented in [7] or [6]. [sent-22, score-0.564]

16 Given the recent interest in these techniques, it is striking that this connection has gone unobserved until now. [sent-23, score-0.038]

17 However in general, there may be many ways of writing the constraints for a convex optimization problem, and many different settings of the Lagrange multipliers (or Kuhn-Tucker vectors) that represent identical solutions. [sent-24, score-0.178]

18 The key to the connection we present here lies in the use of a particular non-standard presentation of the constraints. [sent-25, score-0.068]

19 When viewed in this way, there is no need for special-purpose Bregman distances to give a unified account of boosting and maximum likelihood, as we only make use of the standard Kullback-Leibler divergence. [sent-26, score-0.466]

20 In particular, we derive a regularization procedure for AdaBoost that directly corresponds to penalized maximum likelihood using a Gaussian prior. [sent-28, score-0.296]

21 Experiments on UCI data support our theoretical analysis, demonstrate the effectiveness of the new regularization method, and give further insight into the relationship between boosting and maximum likelihood exponential models. [sent-29, score-0.838]

22 We denote by the set of nonnegative measures on , and by the set of conditional probability distributions, for each . [sent-33, score-0.054]

23 For , we will overload the notation and ; the latter will be suggestive of a conditional probability distribution, but in general it need not be normalized. [sent-34, score-0.054]

24 These will correspond to the weak learners in boosting, and to the sufficient statistics in an exponential model. [sent-36, score-0.21]

25 Suppose that we have data with empirical distribution and marginal ; thus, We assume, without loss of generality, that for all . [sent-37, score-0.119]

26 We make this assumption mainly to correspond with the conventions used to present boosting algorithms; it is not essential to what follows. [sent-44, score-0.343]

27 Given a set of features, we define the conditional exponential model determined by its parameter vector. [sent-45, score-0.178]

28 The maximum likelihood estimation problem is to determine the parameters that maximize the conditional log-likelihood or, equivalently, minimize the log loss. [sent-46, score-0.591]
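
The model and the two objectives discussed in sentences 27-31 can be written down concretely. The sketch below is illustrative only (the feature tensor F, its shape, and the function names are my assumptions, not the paper's notation): a conditional exponential model with scores F @ lam, its normalized log loss, and an AdaBoost.M2-style exponential loss summed over the incorrect labels.

```python
import numpy as np

# Illustrative sketch (not the authors' code): a conditional exponential model
# q_lambda(y|x) ~ exp(sum_j lambda_j f_j(x, y)) and the two objectives above.
# F[i, y, j] = f_j(x_i, y) is an assumed feature layout.

def log_loss(lam, F, y):
    """Average negative conditional log-likelihood of the normalized model."""
    s = F @ lam                                   # (n, k) label scores
    log_z = np.log(np.exp(s).sum(axis=1))         # per-example log normalizer
    return float(np.mean(log_z - s[np.arange(len(y)), y]))

def exp_loss(lam, F, y):
    """Exponential loss of the unnormalized model, summed over wrong labels."""
    s = F @ lam
    margins = s - s[np.arange(len(y)), y][:, None]     # zero at the true label
    total = np.exp(margins).sum(axis=1) - 1.0          # drop the true-label term
    return float(np.mean(total))

rng = np.random.default_rng(0)
F = rng.normal(size=(5, 3, 4))                    # 5 examples, 3 labels, 4 features
y = rng.integers(0, 3, size=5)
lam = np.zeros(4)
print(log_loss(lam, F, y), exp_loss(lam, F, y))   # log(3) and 2.0 at lambda = 0
```

In this parameterization the two losses are linked per example by exp_loss_i = exp(log_loss_i) - 1, and only the log loss involves the per-example normalizer explicitly, which is the normalization difference the abstract refers to.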

29 The objective function to be minimized in the multi-label boosting algorithm AdaBoost.M2 [2] is the exponential loss. [sent-47, score-0.343]

30 As has often been noted, the log loss and the exponential loss are qualitatively different. [sent-48, score-0.713]

31 The exponential loss grows exponentially with increasing negative “margin,” while the log loss grows linearly. [sent-49, score-0.416]
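
A tiny numerical check of this claim (mine, not from the paper), comparing the per-example exponential loss exp(-m) with the logistic log loss log(1 + exp(-m)) as the margin m becomes negative:

```python
import math

for m in (2.0, 0.0, -2.0, -10.0):
    # exp(-m) explodes for negative margins; log(1 + exp(-m)) grows roughly like -m.
    print(f"m={m:6.1f}  exp_loss={math.exp(-m):12.2f}  log_loss={math.log1p(math.exp(-m)):8.4f}")
```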

32 3 Correspondence Between AdaBoost and Maximum Likelihood. Since we are working with unnormalized models, we make use of the extended conditional Kullback-Leibler divergence, given below. [sent-50, score-0.185]

33 D(p, q) = Σ_x p̃(x) Σ_y [ p(y|x) log( p(y|x) / q(y|x) ) − p(y|x) + q(y|x) ] [sent-52, score-0.24]

34 These quantities are defined on the features and may take on the value +∞. [sent-54, score-0.09]

35 Note that if and then this becomes the more familiar KL divergence for probabilities. [sent-55, score-0.063]
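
A small sketch of the extended divergence as I read the (garbled) definition above; the formula in sentence 33 is my reconstruction of the standard extended KL divergence, and the array layout here is an assumption. The assertion illustrates sentence 35: for row-normalized arguments the extra terms cancel and the ordinary conditional KL divergence is recovered.

```python
import numpy as np

def extended_kl(p, q, px):
    """Extended conditional KL divergence between nonnegative measures.

    p, q: (n_x, n_y) arrays with positive entries; rows need not sum to one.
    px:   empirical marginal over x, shape (n_x,).
    """
    return float(np.sum(px[:, None] * (p * np.log(p / q) - p + q)))

px = np.array([0.5, 0.5])
p = np.array([[0.7, 0.3], [0.4, 0.6]])            # normalized rows
q = np.array([[0.5, 0.5], [0.5, 0.5]])
ordinary_kl = float(np.sum(px[:, None] * p * np.log(p / q)))
assert abs(extended_kl(p, q, px) - ordinary_kl) < 1e-12
```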

36 Consider now the following two convex optimization problems. [sent-59, score-0.092]

37 As we'll show, the dual of the first problem corresponds to AdaBoost, and the dual of the second corresponds to maximum likelihood for exponential models. [sent-61, score-0.668]

38 This presentation of the constraints is the key to making the correspondence between AdaBoost and maximum likelihood. [sent-62, score-0.163]

39 Note that the moment-matching constraint, which is the usual presentation of the constraints for maximum likelihood (as dual to maximum entropy), doesn't make sense for unnormalized models, since the two sides of the equation may not be "on the same scale." [sent-63, score-0.565]

40 We now derive the dual problems formally; the following section gives a precise statement of the duality result. [sent-65, score-0.312]

41 For the primal problem, the connecting equation determines the optimal measure and the dual function is given by (2); the dual problem is to determine the optimal Lagrange multipliers, and the normalized case simply adds additional Lagrange multipliers. [sent-67, score-0.378]

42 3.1 Special cases. It is now straightforward to derive various boosting and logistic regression problems as special cases of the above optimization problems. [sent-70, score-0.567]

43 Then the dual problem is equivalent to the optimization problem of AdaBoost.M2. Case 2: Binary AdaBoost. [sent-76, score-0.192]

44 Then the dual problem is precisely the optimization problem of binary AdaBoost. [sent-78, score-0.192]
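
To make the binary case concrete, here is a sequential-update sketch (my own construction in the spirit of the algorithms cited as [2], not a transcription of them): each round picks one weak-learner coordinate, applies the closed-form step that minimizes the exponential loss, and reweights the examples.

```python
import numpy as np

def binary_adaboost(H, y, rounds=20, eps=1e-12):
    """Sequential coordinate descent on sum_i exp(-y_i * sum_j a_j * h_j(x_i)).

    H: (n, m) matrix of weak-learner outputs in {-1, +1}
    y: (n,) labels in {-1, +1}
    Returns the coefficient vector a (the dual variables / Lagrange multipliers).
    """
    n, m = H.shape
    a = np.zeros(m)
    w = np.ones(n)                               # w_i = exp(-y_i * F(x_i)), F = 0 initially
    for _ in range(rounds):
        # weighted error of each weak learner under the current weights
        errs = np.array([w[H[:, j] != y].sum() for j in range(m)]) / w.sum()
        j = int(np.argmin(errs))
        err = float(np.clip(errs[j], eps, 1 - eps))
        step = 0.5 * np.log((1 - err) / err)     # closed-form exponential-loss minimizer
        a[j] += step
        w *= np.exp(-step * y * H[:, j])         # multiplicative example reweighting
    return a
```

Sentence 15 above notes that the paper views boosting through parallel or sequential updates as in [2]; the loop here is only the sequential flavor.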

45 We proceed as for AdaBoost.M2 but add the additional normalization constraints: if these constraints are satisfied, then the other constraints simplify, and the connecting equation acquires a normalizing term, which corresponds to setting the Lagrange multiplier to the appropriate value. [sent-81, score-0.171]

46 In this case, after a simple calculation, the dual problem is seen to correspond to maximum likelihood for a conditional exponential model with the specified sufficient statistics. [sent-82, score-0.566]
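
The primal/dual link can be seen in the optimality condition: at the maximum likelihood solution the gradient of the conditional log-likelihood vanishes, so the model's expected sufficient statistics match the empirical ones, which is exactly the moment-constraint form of the primal problem. A minimal gradient-ascent sketch under the same illustrative feature layout as above (step size and iteration count are arbitrary choices of mine):

```python
import numpy as np

def fit_conditional_exponential(F, y, lr=0.5, steps=500):
    """Maximum likelihood for q_lambda(y|x) ~ exp(F[i, y, :] @ lambda)."""
    n, k, d = F.shape
    lam = np.zeros(d)
    for _ in range(steps):
        s = F @ lam                                    # (n, k) scores
        s -= s.max(axis=1, keepdims=True)              # stabilized softmax
        p = np.exp(s)
        p /= p.sum(axis=1, keepdims=True)
        empirical = F[np.arange(n), y].mean(axis=0)    # observed feature expectations
        model = np.einsum('nk,nkd->d', p, F) / n       # model feature expectations
        lam += lr * (empirical - model)                # gradient of the average log-likelihood
    return lam
```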

47 The boosting model is unnormalized while the maximum likelihood model is normalized. [sent-87, score-0.068]

48 We now define the boosting solution, boost, and the maximum likelihood solution, ml, as the minimizers of the exponential loss and the conditional log loss, respectively. [sent-88, score-1.965]

49 The following theorem corresponds to Proposition 4 of [3] for the usual KL divergence; in [4] the duality theorem is proved for a general class of Bregman distances, including the extended KL divergence as a special case. [sent-90, score-0.183]

50 Moreover, ml can be computed directly in terms of boost. [sent-92, score-1.434]

51 Theorem: boost and ml exist, are unique, and satisfy the stated duality relations. [sent-93, score-1.088]

52 Figure 1: Geometric view of the duality theorem.

53 Minimizing the exponential loss finds the member of the unnormalized exponential family that intersects with the feasible set of measures satisfying the moment constraints (left). [sent-97, score-0.425]

54 When we impose the additional constraint that each conditional distribution must be normalized, we introduce a Lagrange multiplier for each training example, giving a higher-dimensional family. [sent-98, score-0.088]

55 By the duality theorem, projecting the exponential loss solution onto the intersection of the feasible set with the simplex gives the maximum likelihood solution. [sent-99, score-0.632]

56 The unnormalized exponential family intersects the feasible set of measures satisfying the constraints (1) at a single point. [sent-102, score-0.374]

57 The algorithms presented in [2] determine this point, which is the exponential loss solution (see Figure 1, left). [sent-103, score-0.297]

58 By this equality we obtain a direct relation between ml and boost. [sent-107, score-0.722]

59 4 Regularization. Minimizing the exponential loss or the log loss on real data often fails to produce finite parameters. [sent-109, score-0.416]

60 Of course, even when (3) does not hold, models trained by maximum likelihood or the exponential loss can overfit the training data. [sent-111, score-0.509]

61 A standard regularization technique in the case of maximum likelihood employs parameter priors in a Bayesian framework. [sent-112, score-0.26]

62 In terms of convex duality, parameter priors for the dual problem correspond to “potentials” on the constraint values in the primal problem. [sent-114, score-0.276]

63 The case of a Gaussian prior on the parameters, for example, corresponds to a quadratic potential on the constraint values in the primal problem. [sent-115, score-0.064]

64 The regularized primal problem minimizes the divergence plus a convex penalty on the constraint values, subject to the relaxed constraints, where the penalty is a convex function whose minimum is at zero. [sent-117, score-0.27]

65 To derive the dual problem, the Lagrangian is calculated, and the resulting dual function (5) involves the convex conjugate of the penalty. [sent-118, score-0.554]

66 For a quadratic penalty the dual function becomes (4); a sequential update rule for (5) incurs the small additional cost of solving a nonlinear equation by Newton-Raphson at every iteration. [sent-119, score-0.156]
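
The one-dimensional Newton-Raphson step referred to here can be illustrated as follows; this is my construction for the binary case with a quadratic (Gaussian-prior) penalty a_j**2 / (2 * sigma2), not equations (4)-(5) themselves. With the penalty, the coordinate update no longer has a closed form, so each update solves a scalar equation:

```python
import numpy as np

def regularized_step(w, yh, a_j, sigma2, iters=20):
    """Coordinate update delta for one weak learner under a quadratic penalty.

    w:  current example weights exp(-y_i * F(x_i)), shape (n,)
    yh: y_i * h_j(x_i) in {-1, +1}, shape (n,)
    Minimizes sum_i w_i * exp(-delta * yh_i) + (a_j + delta)**2 / (2 * sigma2)
    by Newton-Raphson on its derivative; as sigma2 grows large this recovers
    the usual closed-form AdaBoost step.
    """
    delta = 0.0
    for _ in range(iters):
        e = w * np.exp(-delta * yh)
        grad = -(e * yh).sum() + (a_j + delta) / sigma2    # first derivative
        hess = (e * yh * yh).sum() + 1.0 / sigma2          # second derivative, > 0
        delta -= grad / hess
    return float(delta)
```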

67 See [1] for a discussion of this technique in the context of exponential models in statistical language modeling. [sent-120, score-0.178]

68 5 Experiments. We performed experiments on some of the UC Irvine datasets in order to investigate the relationship between boosting and maximum likelihood empirically. [sent-121, score-0.622]

69 The weak learner was the decision stump FindAttrTest as described in [6], and the training set consisted of a randomly chosen 90% of the data. [sent-122, score-0.101]
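
FindAttrTest itself is described in [6]; the sketch below is only a generic weighted decision stump in the same spirit (threshold one attribute, output +/-1), with the names and the exhaustive threshold search being my choices rather than the cited procedure.

```python
import numpy as np

def best_stump(X, y, w):
    """Return (attribute, threshold, sign) for the stump h(x) = sign * sgn(x[attr] - threshold)
    with the lowest weighted error; X is (n, d), y in {-1, +1}, w are example weights."""
    best_err, best = np.inf, (0, 0.0, 1)
    for attr in range(X.shape[1]):
        for threshold in np.unique(X[:, attr]):
            pred = np.where(X[:, attr] > threshold, 1, -1)
            for sign in (1, -1):
                err = w[(sign * pred) != y].sum()
                if err < best_err:
                    best_err, best = err, (attr, float(threshold), sign)
    return best
```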

70 The first model was trained for 10 features generated by FindAttrTest, excluding features satisfying condition (3). [sent-125, score-0.112]

71 The second model was trained using the exponential loss with quadratic regularization. [sent-127, score-0.297]

72 The performance was measured using the conditional log-likelihood of the (normalized) models over the training and test sets, denoted train and test, as well as the test set error rate. [sent-128, score-0.592]
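
A sketch of the two measurements as I read them (names and array shapes assumed): the model's unnormalized scores are normalized per example before computing the average conditional log-likelihood and the error rate.

```python
import numpy as np

def evaluate(scores, y):
    """scores: (n, k) unnormalized label scores; y: (n,) true label indices."""
    s = scores - scores.max(axis=1, keepdims=True)
    log_p = s - np.log(np.exp(s).sum(axis=1, keepdims=True))   # normalized log-probabilities
    log_likelihood = float(log_p[np.arange(len(y)), y].mean())
    error_rate = float((log_p.argmax(axis=1) != y).mean())
    return log_likelihood, error_rate
```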

73 For the weak learner FindAttrTest, only the Iris dataset produced features that satisfy (3). [sent-130, score-0.123]

74 On average, 4 out of the 10 features were removed. [sent-131, score-0.056]

75 As the flexibility of the weak learner is increased, (3) is expected to hold more often. [sent-132, score-0.067]

76 On this dataset regularization improves both the test set log-likelihood and error rate considerably. [sent-133, score-0.217]

77 In datasets where the unregularized model shows significant overfitting, regularization improves both the log-likelihood measure and the error rate. [sent-134, score-0.166]

78 In cases of little overfitting (according to the log-likelihood measure), regularization only improves the test set log-likelihood at the expense of the training set log-likelihood, but without affecting the test set error. [sent-135, score-0.346]

79 Next we performed a set of experiments to test how much boost differs from ml, where the boosting model is normalized (after training) to form a conditional probability distribution. [sent-136, score-1.214]
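
Normalizing the boosting model after training, as described here, and measuring its distance to the maximum likelihood model could look like the following sketch (the variable names lam_boost, lam_ml and the feature layout are mine):

```python
import numpy as np

def normalize_scores(scores):
    """Turn unnormalized exponential-model scores (n, k) into conditional probabilities."""
    s = scores - scores.max(axis=1, keepdims=True)
    p = np.exp(s)
    return p / p.sum(axis=1, keepdims=True)

def conditional_kl(p, q, px):
    """Ordinary conditional KL between row-normalized models, weighted by the marginal px."""
    return float(np.sum(px[:, None] * p * np.log(p / q)))

# e.g. divergence = conditional_kl(normalize_scores(F @ lam_ml),
#                                  normalize_scores(F @ lam_boost),
#                                  np.full(len(F), 1.0 / len(F)))
```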

80 For different experiments, FindAttrTest generated a different number of features (10–100), and the training set was selected randomly. [sent-137, score-0.09]

81 The top row in Figure 2 shows for the Sonar dataset the relationship between train ml and train boost as well as between train ml and train ml boost . [sent-138, score-2.388]

82 As the number of features increases, the training data is fit more closely. [sent-139, score-0.09]

83 Table 1 reports train and test results, for both regularized and unregularized boosting, on the datasets Promoters, Iris, Sonar, Glass, Ionosphere, Hepatitis, Breast, and Pima. [sent-179, score-0.219]

84 Table 1: Comparison of unregularized to regularized boosting. [sent-187, score-0.114]

85 For both the regularized and unregularized cases, the first column shows training log-likelihood, the second column shows test log-likelihood, and the third column shows test error rate. [sent-188, score-0.338]

86 Regularization reduces error rate in some cases while it consistently improves the test set log-likelihood measure on all datasets. [sent-189, score-0.135]

87 As the training data is fit more closely, the boosting and maximum likelihood models become more similar, as measured by the KL divergence. [sent-191, score-1.001]

88 The bottom row in Figure 2 shows the relationship between the test set log-likelihoods of ml and boost, together with the test set error rates of ml and boost. [sent-193, score-2.106]

89 While the plots in the bottom row of Figure 2 indicate that the training log-likelihood of ml exceeds that of boost, as expected, on the test data the trend is reversed, so that the test log-likelihood of boost exceeds that of ml. [sent-196, score-2.012]

90 The duality result suggests a possible explanation for the higher performance of boosting with respect to the test log-likelihood. [sent-198, score-0.558]

91 The boosting model is less constrained due to the lack of normalization constraints, and therefore has a smaller extended divergence to the uniform model. [sent-199, score-0.374]

92 This may be interpreted as a higher extended entropy, or less concentrated conditional model. [sent-200, score-0.054]

93 However, as the training fit of ml becomes exact, the two models come to agree (up to identifiability). [sent-201, score-0.356]

94 It is easy to show that for any exponential model, as the difference between the training log-likelihoods of ml and boost gets smaller, the divergence between the two models also gets smaller. [sent-202, score-2.165]

95 As the number of features is increased so that the training data is fit more closely, the model matches the empirical distribution and the normalizing term becomes a constant. [sent-204, score-0.122]

96 In this case, normalizing the boosting model boost does not violate the constraints, and results in the maximum likelihood model. [sent-205, score-0.919]

97 Acknowledgments: We thank Michael Collins, Michael Jordan, Andrew Ng, Fernando Pereira, Rob Schapire, and Yair Weiss for helpful comments on an early version of this paper. [sent-206, score-0.24]


99 Figure 2: Comparison of AdaBoost and maximum likelihood for the Sonar dataset. [sent-249, score-0.178]

100 The top row of Figure 2 compares train ml to train boost (left) and train ml to the divergence between ml and boost (right); the bottom row shows the relationship between test ml and test boost (left) and the corresponding error rates (right). The experimental results for other UCI datasets were very similar.


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('boost', 0.366), ('ml', 0.356), ('boosting', 0.343), ('vw', 0.24), ('gr', 0.207), ('exponential', 0.178), ('dual', 0.156), ('adaboost', 0.15), ('psfrag', 0.129), ('replacements', 0.129), ('train', 0.124), ('duality', 0.12), ('loss', 0.119), ('hp', 0.109), ('hea', 0.108), ('hfa', 0.108), ('logistic', 0.107), ('bregman', 0.105), ('likelihood', 0.099), ('test', 0.095), ('dca', 0.086), ('findattrtest', 0.086), ('zgi', 0.086), ('regularization', 0.082), ('yq', 0.08), ('maximum', 0.079), ('gc', 0.075), ('dg', 0.075), ('reg', 0.075), ('uv', 0.075), ('zw', 0.068), ('unnormalized', 0.068), ('dua', 0.065), ('evu', 0.065), ('hba', 0.065), ('unregularized', 0.065), ('della', 0.064), ('pietra', 0.064), ('primal', 0.064), ('divergence', 0.063), ('relationship', 0.057), ('features', 0.056), ('convex', 0.056), ('conditional', 0.054), ('constraints', 0.054), ('sonar', 0.051), ('fed', 0.051), ('lagrange', 0.05), ('regularized', 0.049), ('regression', 0.045), ('uf', 0.045), ('lafferty', 0.045), ('datasets', 0.044), ('distances', 0.044), ('hepatitis', 0.043), ('hfaxvw', 0.043), ('hfedba', 0.043), ('hfeyfaxvw', 0.043), ('hsa', 0.043), ('ieyfa', 0.043), ('kwu', 0.043), ('lebanon', 0.043), ('stepwise', 0.043), ('tba', 0.043), ('tbaxvw', 0.043), ('ug', 0.043), ('zg', 0.043), ('zyx', 0.043), ('uci', 0.043), ('gv', 0.043), ('collins', 0.041), ('carnegie', 0.041), ('kl', 0.04), ('improves', 0.04), ('mellon', 0.039), ('connection', 0.038), ('intersects', 0.037), ('promoters', 0.037), ('xg', 0.037), ('feasible', 0.037), ('optimization', 0.036), ('derive', 0.036), ('learner', 0.035), ('row', 0.035), ('training', 0.034), ('iris', 0.034), ('def', 0.034), ('duffy', 0.034), ('procedures', 0.033), ('schapire', 0.032), ('multipliers', 0.032), ('weak', 0.032), ('gh', 0.032), ('normalizing', 0.032), ('normalization', 0.031), ('presentation', 0.03), ('glass', 0.03), ('mason', 0.03), ('kivinen', 0.03), ('entropy', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999857 45 nips-2001-Boosting and Maximum Likelihood for Exponential Models

Author: Guy Lebanon, John D. Lafferty

Abstract: We derive an equivalence between AdaBoost and the dual of a convex optimization problem, showing that the only difference between minimizing the exponential loss used by AdaBoost and maximum likelihood for exponential models is that the latter requires the model to be normalized to form a conditional probability distribution over labels. In addition to establishing a simple and easily understood connection between the two methods, this framework enables us to derive new regularization procedures for boosting that directly correspond to penalized maximum likelihood. Experiments on UCI datasets support our theoretical analysis and give additional insight into the relationship between boosting and logistic regression.

2 0.2044199 8 nips-2001-A General Greedy Approximation Algorithm with Applications

Author: T. Zhang

Abstract: Greedy approximation algorithms have been frequently used to obtain sparse solutions to learning problems. In this paper, we present a general greedy algorithm for solving a class of convex optimization problems. We derive a bound on the rate of approximation for this algorithm, and show that our algorithm includes a number of earlier studies as special cases.

3 0.20257488 137 nips-2001-On the Convergence of Leveraging

Author: Gunnar Rätsch, Sebastian Mika, Manfred K. Warmuth

Abstract: We give an unified convergence analysis of ensemble learning methods including e.g. AdaBoost, Logistic Regression and the Least-SquareBoost algorithm for regression. These methods have in common that they iteratively call a base learning algorithm which returns hypotheses that are then linearly combined. We show that these methods are related to the Gauss-Southwell method known from numerical optimization and state non-asymptotical convergence results for all these methods. Our analysis includes -norm regularized cost functions leading to a clean and general way to regularize ensemble learning.

4 0.17533915 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family

Author: Michael Collins, S. Dasgupta, Robert E. Schapire

Abstract: Principal component analysis (PCA) is a commonly applied technique for dimensionality reduction. PCA implicitly minimizes a squared loss function, which may be inappropriate for data that is not real-valued, such as binary-valued data. This paper draws on ideas from the Exponential family, Generalized linear models, and Bregman distances, to give a generalization of PCA to loss functions that we argue are better suited to other data types. We describe algorithms for minimizing the loss functions, and give examples on simulated data.

5 0.16726798 167 nips-2001-Semi-supervised MarginBoost

Author: Florence D'alché-buc, Yves Grandvalet, Christophe Ambroise

Abstract: In many discrimination problems a large amount of data is available but only a few of them are labeled. This provides a strong motivation to improve or develop methods for semi-supervised learning. In this paper, boosting is generalized to this task within the optimization framework of MarginBoost . We extend the margin definition to unlabeled data and develop the gradient descent algorithm that corresponds to the resulting margin cost function. This meta-learning scheme can be applied to any base classifier able to benefit from unlabeled data. We propose here to apply it to mixture models trained with an Expectation-Maximization algorithm. Promising results are presented on benchmarks with different rates of labeled data. 1

6 0.15155961 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade

7 0.11165699 139 nips-2001-Online Learning with Kernels

8 0.10517617 62 nips-2001-Duality, Geometry, and Support Vector Regression

9 0.09882316 129 nips-2001-Multiplicative Updates for Classification by Mixture Models

10 0.085191064 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes

11 0.074899204 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines

12 0.071845621 94 nips-2001-Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM

13 0.070871882 22 nips-2001-A kernel method for multi-labelled classification

14 0.067431562 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces

15 0.065972544 190 nips-2001-Thin Junction Trees

16 0.062384132 133 nips-2001-On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes

17 0.060984213 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior

18 0.056959108 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms

19 0.052115217 105 nips-2001-Kernel Machines and Boolean Functions

20 0.048083715 178 nips-2001-TAP Gibbs Free Energy, Belief Propagation and Sparsity


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.197), (1, 0.084), (2, -0.006), (3, 0.191), (4, 0.064), (5, -0.129), (6, 0.02), (7, -0.017), (8, 0.038), (9, -0.122), (10, 0.049), (11, 0.101), (12, 0.012), (13, -0.005), (14, 0.049), (15, -0.176), (16, -0.073), (17, -0.04), (18, -0.061), (19, 0.285), (20, 0.154), (21, -0.038), (22, -0.115), (23, 0.082), (24, 0.006), (25, 0.052), (26, 0.032), (27, -0.031), (28, -0.016), (29, 0.037), (30, 0.02), (31, -0.099), (32, -0.023), (33, -0.125), (34, 0.039), (35, 0.054), (36, 0.06), (37, -0.032), (38, 0.034), (39, -0.023), (40, -0.035), (41, -0.154), (42, -0.017), (43, 0.013), (44, 0.025), (45, -0.069), (46, 0.092), (47, 0.044), (48, 0.13), (49, -0.082)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95748144 45 nips-2001-Boosting and Maximum Likelihood for Exponential Models

Author: Guy Lebanon, John D. Lafferty

Abstract: We derive an equivalence between AdaBoost and the dual of a convex optimization problem, showing that the only difference between minimizing the exponential loss used by AdaBoost and maximum likelihood for exponential models is that the latter requires the model to be normalized to form a conditional probability distribution over labels. In addition to establishing a simple and easily understood connection between the two methods, this framework enables us to derive new regularization procedures for boosting that directly correspond to penalized maximum likelihood. Experiments on UCI datasets support our theoretical analysis and give additional insight into the relationship between boosting and logistic regression.

2 0.71741343 137 nips-2001-On the Convergence of Leveraging

Author: Gunnar Rätsch, Sebastian Mika, Manfred K. Warmuth

Abstract: We give an unified convergence analysis of ensemble learning methods including e.g. AdaBoost, Logistic Regression and the Least-SquareBoost algorithm for regression. These methods have in common that they iteratively call a base learning algorithm which returns hypotheses that are then linearly combined. We show that these methods are related to the Gauss-Southwell method known from numerical optimization and state non-asymptotical convergence results for all these methods. Our analysis includes -norm regularized cost functions leading to a clean and general way to regularize ensemble learning.

3 0.59683615 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family

Author: Michael Collins, S. Dasgupta, Robert E. Schapire

Abstract: Principal component analysis (PCA) is a commonly applied technique for dimensionality reduction. PCA implicitly minimizes a squared loss function, which may be inappropriate for data that is not real-valued, such as binary-valued data. This paper draws on ideas from the Exponential family, Generalized linear models, and Bregman distances, to give a generalization of PCA to loss functions that we argue are better suited to other data types. We describe algorithms for minimizing the loss functions, and give examples on simulated data.

4 0.59643453 167 nips-2001-Semi-supervised MarginBoost

Author: Florence D'alché-buc, Yves Grandvalet, Christophe Ambroise

Abstract: In many discrimination problems a large amount of data is available but only a few of them are labeled. This provides a strong motivation to improve or develop methods for semi-supervised learning. In this paper, boosting is generalized to this task within the optimization framework of MarginBoost . We extend the margin definition to unlabeled data and develop the gradient descent algorithm that corresponds to the resulting margin cost function. This meta-learning scheme can be applied to any base classifier able to benefit from unlabeled data. We propose here to apply it to mixture models trained with an Expectation-Maximization algorithm. Promising results are presented on benchmarks with different rates of labeled data. 1

5 0.56347275 8 nips-2001-A General Greedy Approximation Algorithm with Applications

Author: T. Zhang

Abstract: Greedy approximation algorithms have been frequently used to obtain sparse solutions to learning problems. In this paper, we present a general greedy algorithm for solving a class of convex optimization problems. We derive a bound on the rate of approximation for this algorithm, and show that our algorithm includes a number of earlier studies as special cases.

6 0.40876397 62 nips-2001-Duality, Geometry, and Support Vector Regression

7 0.39597791 180 nips-2001-The Concave-Convex Procedure (CCCP)

8 0.38899919 129 nips-2001-Multiplicative Updates for Classification by Mixture Models

9 0.3860929 155 nips-2001-Quantizing Density Estimators

10 0.37392014 133 nips-2001-On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes

11 0.37182635 70 nips-2001-Estimating Car Insurance Premia: a Case Study in High-Dimensional Data Inference

12 0.35117918 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines

13 0.33565107 139 nips-2001-Online Learning with Kernels

14 0.32998344 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade

15 0.31186807 190 nips-2001-Thin Junction Trees

16 0.28786594 83 nips-2001-Geometrical Singularities in the Neuromanifold of Multilayer Perceptrons

17 0.2779004 149 nips-2001-Probabilistic Abstraction Hierarchies

18 0.27211109 11 nips-2001-A Maximum-Likelihood Approach to Modeling Multisensory Enhancement

19 0.26546767 120 nips-2001-Minimax Probability Machine

20 0.26057917 41 nips-2001-Bayesian Predictive Profiles With Applications to Retail Transaction Data


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(14, 0.046), (17, 0.017), (19, 0.015), (27, 0.127), (30, 0.049), (36, 0.013), (38, 0.021), (59, 0.022), (70, 0.019), (72, 0.063), (79, 0.039), (83, 0.019), (88, 0.377), (91, 0.093)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.82146358 45 nips-2001-Boosting and Maximum Likelihood for Exponential Models

Author: Guy Lebanon, John D. Lafferty

Abstract: We derive an equivalence between AdaBoost and the dual of a convex optimization problem, showing that the only difference between minimizing the exponential loss used by AdaBoost and maximum likelihood for exponential models is that the latter requires the model to be normalized to form a conditional probability distribution over labels. In addition to establishing a simple and easily understood connection between the two methods, this framework enables us to derive new regularization procedures for boosting that directly correspond to penalized maximum likelihood. Experiments on UCI datasets support our theoretical analysis and give additional insight into the relationship between boosting and logistic regression.

2 0.72274601 75 nips-2001-Fast, Large-Scale Transformation-Invariant Clustering

Author: Brendan J. Frey, Nebojsa Jojic

Abstract: In previous work on “transformed mixtures of Gaussians” and “transformed hidden Markov models”, we showed how the EM algorithm in a discrete latent variable model can be used to jointly normalize data (e.g., center images, pitch-normalize spectrograms) and learn a mixture model of the normalized data. The only input to the algorithm is the data, a list of possible transformations, and the number of clusters to find. The main criticism of this work was that the exhaustive computation of the posterior probabilities over transformations would make scaling up to large feature vectors and large sets of transformations intractable. Here, we describe how a tremendous speed-up is acheived through the use of a variational technique for decoupling transformations, and a fast Fourier transform method for computing posterior probabilities. For N ×N images, learning C clusters under N rotations, N scales, N x-translations and N y-translations takes only (C + 2 log N )N 2 scalar operations per iteration. In contrast, the original algorithm takes CN 6 operations to account for these transformations. We give results on learning a 4-component mixture model from a video sequence with frames of size 320 ×240. The model accounts for 360 rotations and 76,800 translations. Each iteration of EM takes only 10 seconds per frame in MATLAB, which is over 5 million times faster than the original algorithm. 1

3 0.62890875 190 nips-2001-Thin Junction Trees

Author: Francis R. Bach, Michael I. Jordan

Abstract: We present an algorithm that induces a class of models with thin junction trees—models that are characterized by an upper bound on the size of the maximal cliques of their triangulated graph. By ensuring that the junction tree is thin, inference in our models remains tractable throughout the learning process. This allows both an efficient implementation of an iterative scaling parameter estimation algorithm and also ensures that inference can be performed efficiently with the final model. We illustrate the approach with applications in handwritten digit recognition and DNA splice site detection.

4 0.56130129 98 nips-2001-Information Geometrical Framework for Analyzing Belief Propagation Decoder

Author: Shiro Ikeda, Toshiyuki Tanaka, Shun-ichi Amari

Abstract: The mystery of belief propagation (BP) decoder, especially of the turbo decoding, is studied from information geometrical viewpoint. The loopy belief network (BN) of turbo codes makes it difficult to obtain the true “belief” by BP, and the characteristics of the algorithm and its equilibrium are not clearly understood. Our study gives an intuitive understanding of the mechanism, and a new framework for the analysis. Based on the framework, we reveal basic properties of the turbo decoding.

5 0.46363586 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade

Author: Paul Viola, Michael Jones

Abstract: This paper develops a new approach for extremely fast detection in domains where the distribution of positive and negative examples is highly skewed (e.g. face detection or database retrieval). In such domains a cascade of simple classifiers each trained to achieve high detection rates and modest false positive rates can yield a final detector with many desirable features: including high detection rates, very low false positive rates, and fast performance. Achieving extremely high detection rates, rather than low error, is not a task typically addressed by machine learning algorithms. We propose a new variant of AdaBoost as a mechanism for training the simple classifiers used in the cascade. Experimental results in the domain of face detection show the training algorithm yields significant improvements in performance over conventional AdaBoost. The final face detection system can process 15 frames per second, achieves over 90% detection, and a false positive rate of 1 in a 1,000,000.

6 0.46064726 137 nips-2001-On the Convergence of Leveraging

7 0.45564166 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems

8 0.44465172 8 nips-2001-A General Greedy Approximation Algorithm with Applications

9 0.44299632 153 nips-2001-Product Analysis: Learning to Model Observations as Products of Hidden Variables

10 0.44250098 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family

11 0.4376449 13 nips-2001-A Natural Policy Gradient

12 0.43644282 155 nips-2001-Quantizing Density Estimators

13 0.43513268 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior

14 0.43499848 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines

15 0.43467981 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding

16 0.43439624 197 nips-2001-Why Neuronal Dynamics Should Control Synaptic Learning Rules

17 0.43188846 58 nips-2001-Covariance Kernels from Bayesian Generative Models

18 0.43174893 127 nips-2001-Multi Dimensional ICA to Separate Correlated Sources

19 0.43168551 27 nips-2001-Activity Driven Adaptive Stochastic Resonance

20 0.43099815 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning