
20 nips-2006-Active learning for misspecified generalized linear models


Source: pdf

Author: Francis R. Bach

Abstract: Active learning refers to algorithmic frameworks aimed at selecting training data points in order to reduce the number of required training data points and/or improve the generalization performance of a learning method. In this paper, we present an asymptotic analysis of active learning for generalized linear models. Our analysis holds under the common practical situation of model misspecification, and is based on realistic assumptions regarding the nature of the sampling distributions, which are usually neither independent nor identical. We derive unbiased estimators of generalization performance, as well as estimators of expected reduction in generalization error after adding a new training data point, that allow us to optimize its sampling distribution through a convex optimization problem. Our analysis naturally leads to an algorithm for sequential active learning which is applicable for all tasks supported by generalized linear models (e.g., binary classification, multi-class classification, regression) and can be applied in non-linear settings through the use of Mercer kernels.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Active learning refers to algorithmic frameworks aimed at selecting training data points in order to reduce the number of required training data points and/or improve the generalization performance of a learning method. [sent-4, score-0.263]

2 In this paper, we present an asymptotic analysis of active learning for generalized linear models. [sent-5, score-0.56]

3 Our analysis holds under the common practical situation of model misspecification, and is based on realistic assumptions regarding the nature of the sampling distributions, which are usually neither independent nor identical. [sent-6, score-0.193]

4 We derive unbiased estimators of generalization performance, as well as estimators of expected reduction in generalization error after adding a new training data point, that allow us to optimize its sampling distribution through a convex optimization problem. [sent-7, score-0.664]

5 Our analysis naturally leads to an algorithm for sequential active learning which is applicable for all tasks supported by generalized linear models (e.g., binary classification, multi-class classification, regression). [sent-8, score-0.547]

6 Introduction The goal of active learning is to select training data points so that the number of points required to reach a given performance is smaller than the number required when sampling points randomly. [sent-11, score-0.66]

7 Active learning has emerged as a dynamic field of research in machine learning and statistics [1], from early works in optimal experimental design [2, 3] to recent theoretical results [4] and applications in text retrieval [5], image retrieval [6], and bioinformatics [7]. [sent-12, score-0.138]

8 The objectives of this paper are (1) to provide a theoretical analysis of active learning with realistic assumptions and (2) to derive a principled algorithm for active learning with guaranteed consistency. [sent-14, score-0.829]

9 Our analysis is based on asymptotic arguments, and follows previous asymptotic analyses of active learning [11, 12, 9, 13]; however, as shown in Section 4, we do not rely on correct model specification, and we assume that the data are not identically distributed and may not be independent. [sent-16, score-0.647]

10 As shown in Section 5, our theoretical results naturally lead to convex optimization problems for selecting training data points in a sequential design. [sent-17, score-0.213]

11 In Section 6, we present simulations on synthetic data, illustrating our algorithms and comparing them favorably to usual active learning schemes. [sent-18, score-0.394]

12 • k-class classification: the multinomial distribution leads to softmax regression, with Y = {y ∈ {0, 1}^k : Σ_{i=1}^k yi = 1}, T(y) = y and ψ(η) = log(Σ_{i=1}^k e^{η_i}). [sent-23, score-0.138]

13 The maximum likelihood population estimator θ0 is defined as the minimizer of the expectation under p0 of the negative log-likelihood ℓ(y, x, θ) = −tr(θ^T x T(y)^T) + ψ(θ^T x). [sent-30, score-0.234]
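
As a concrete illustration of the model just defined, here is a minimal numpy sketch of ψ and ℓ for the k-class softmax example above; the function and variable names are ours, not the paper's.

```python
import numpy as np

def psi(eta):
    """Log-partition function of the multinomial GLM: psi(eta) = log(sum_i e^{eta_i}),
    computed stably via the log-sum-exp trick."""
    m = eta.max()
    return m + np.log(np.exp(eta - m).sum())

def nll(theta, x, y):
    """Negative log-likelihood ell(y, x, theta) = -tr(theta^T x T(y)^T) + psi(theta^T x)
    for softmax regression, where T(y) = y is a one-hot vector.
    theta: (d, k) parameters; x: (d,) input; y: (k,) one-hot label."""
    eta = theta.T @ x               # natural parameter theta^T x, shape (k,)
    return -eta @ y + psi(eta)      # tr(theta^T x y^T) reduces to y . (theta^T x)
```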

14 Given observations (xi, yi), i = 1, . . . , n, we use the penalized maximum likelihood estimator, which minimizes Σ_{i=1}^n ℓ(yi, xi, θ) + (λ/2) tr θ^T θ. [sent-36, score-0.233]

15 In practice, the model is often misspecified, and it is therefore important to account for potential misspecification when deriving asymptotic expansions. [sent-41, score-0.174]

16 Kernels The theoretical results of this paper mainly focus on generalized linear models; however, they can be readily generalized to non-linear settings by using Mercer kernels [15], for example leading to kernel logistic regression or kernel ridge regression. [sent-42, score-0.248]
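
The kernel extension mentioned here follows the standard representer-theorem recipe: the parameter is expanded over the training points and the penalty tr θ^T θ becomes α^T K α. A minimal sketch of binary kernel logistic regression under that recipe (our own illustration, not code from the paper):

```python
import numpy as np
from scipy.optimize import minimize

def fit_kernel_logistic(K, y, lam):
    """Kernel logistic regression: f(x) = sum_i alpha_i k(x_i, x); we minimize
    sum_i log(1 + exp(-y_i f(x_i))) + (lam/2) alpha^T K alpha.
    K: (n, n) Gram matrix of a Mercer kernel; y: labels in {-1, +1}."""
    n = K.shape[0]

    def objective(alpha):
        f = K @ alpha                                   # in-sample predictions
        return np.logaddexp(0.0, -y * f).sum() + 0.5 * lam * alpha @ K @ alpha

    return minimize(objective, np.zeros(n), method="L-BFGS-B").x
```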

17 Active learning set-up We consider the following “pool-based” active learning scenario: we have a large set of i.i.d. samples from the input distribution p0(x). [sent-47, score-0.394]

18 The goal of active learning is to select the points to label, i.e., to choose which points in the pool to query for their labels. [sent-54, score-0.431]

19 This active learning set-up is well studied and appears naturally in many applications where the input distribution p0(x) is only known through i.i.d. samples. [sent-64, score-0.391]

20 More precisely, we assume that the points xi are selected sequentially, and we let qi(xi | x1, . [sent-71, score-0.294]

21 . . , xi−1) denote the sampling distribution of xi given the previously observed points. [sent-74, score-0.287]

22 In situations where the data are not sampled from the testing distribution, it has proved advantageous to consider likelihood weighting techniques [13, 19], and we thus consider weights wi = wi(xi | x1, . [sent-75, score-0.94]

23 We let θ̂n denote the weighted penalized ML estimator, defined as the minimizer with respect to θ of Σ_{i=1}^n wi ℓ(yi, xi, θ) + (λ/2) tr θ^T θ. (2) [sent-79, score-0.52]
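
A sketch of the weighted penalized ML estimator of Eq. (2), specialized to softmax regression; the optimizer and the names are our choices. Setting all wi = 1 recovers the plain penalized estimator of sentence 14.

```python
import numpy as np
from scipy.optimize import minimize

def fit_weighted_glm(X, Y, w, lam):
    """Minimize sum_i w_i * ell(y_i, x_i, theta) + (lam/2) tr(theta^T theta).
    X: (n, d) inputs; Y: (n, k) one-hot labels; w: (n,) nonnegative weights."""
    n, d = X.shape
    k = Y.shape[1]

    def objective(theta_flat):
        theta = theta_flat.reshape(d, k)
        eta = X @ theta                                   # (n, k) natural parameters
        m = eta.max(axis=1, keepdims=True)                # stable log-sum-exp
        lse = (m + np.log(np.exp(eta - m).sum(axis=1, keepdims=True))).ravel()
        losses = lse - (eta * Y).sum(axis=1)              # per-point neg. log-likelihoods
        return w @ losses + 0.5 * lam * (theta ** 2).sum()

    res = minimize(objective, np.zeros(d * k), method="L-BFGS-B")
    return res.x.reshape(d, k)
```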

24 In this paper, we work with two different assumptions regarding the sequential sampling distributions: (1) the variables xi are independent, i. [sent-80, score-0.363]

25 e., qi(xi | x1, . . . , xi−1) = qi(xi); (2) the variable xi depends on x1, . . . [sent-85, score-0.228]

26 , xi−1 only through the current empirical ML estimator θ̂i, i. [sent-88, score-0.136]

27 e., qi(xi | x1, . . . , xi−1) = q(xi | θ̂i). The first assumption is not realistic, but readily leads to asymptotic expansions. [sent-94, score-0.205]

28 The second assumption is more realistic, as most heuristic schemes for sequential active learning satisfy it. [sent-95, score-0.532]

29 It turns out that, under a certain assumption, the asymptotic expansions of the expected generalization performance are identical for both sets of assumptions. [sent-96, score-0.264]

30 Asymptotic expansions In this section, we derive the asymptotic expansions that will lead to active learning algorithms in Section 5. [sent-97, score-0.595]

31 Throughout this section, we assume that p0(x) has a compact support K and a twice differentiable density with respect to the Lebesgue measure, and that all sampling distributions have a compact support included in that of p0(x) and have twice differentiable densities. [sent-98, score-0.258]

32 We first make the assumption that the variables xi are independent, i. [sent-99, score-0.169]

33 e., we have sampling distributions qi(xi) and weights wi(xi), both measurable, and such that wi(xi) > 0 for all xi ∈ K. [sent-101, score-1.157]

34 Bias and variance of the ML estimator The following proposition is a simple extension, to non-identically distributed observations, of classical results on maximum likelihood for misspecified generalized linear models [21, 13]. [sent-105, score-0.501]

35 We let ED and varD denote the expectation and variance with respect to the data D = {(xi, yi), i = 1, . . . , n}. [sent-106, score-0.139]

36 Proposition 1 We let θn denote the minimizer of Σ_{i=1}^n E_{qi(xi) p0(yi|xi)}[ wi(xi) ℓ(yi, xi, θ) ]. [sent-110, score-0.52]

37 That is, using a sampling distribution different from p0(x) may introduce a bias in estimating θ0 that does not vanish asymptotically. [sent-114, score-0.196]

38 Thus, active learning requires ensuring that (a) our estimators have low bias and variance in estimating θn, and (b) θn actually converges to θ0. [sent-115, score-0.489]

39 Second, when wn(x) = p0(x)/qn(x), θn is also equal to θ0, and we refer to this weighting scheme as the unbiased reweighting scheme, which was used by [19] in the context of active learning. [sent-119, score-1.137]

40 We refer to the weights wn = p0 (xn )/qn (xn ) as the importance weights. [sent-120, score-0.412]

41 Note, however, that restricting ourselves to such unbiased estimators, as done in [19], might not be optimal, because they may lead to higher variance [13], in particular due to the potentially high variance of the importance weights (see simulations in Section 6). [sent-121, score-0.417]

42 We now provide an unbiased estimator of the expected generalization error of θ̂n, which generalizes the Akaike information criterion [22] (for a proof, see [23]): Proposition 2 In addition to the assumptions of Proposition 1, we assume that E_{qn(x)}[(p0(x)/qn(x))^2] is bounded. [sent-124, score-0.559]

43 Let G = (1/n) Σ_{i=1}^n wi^u ℓ(yi, xi, θ̂n) + (1/n^2) Σ_{i=1}^n wi^u wi gi^T (Ĵn)^{−1} gi, (4) where wi^u = p0(xi)/qi(xi). [sent-125, score-2.083]
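
Computing G from Eq. (4) only requires the per-point losses, the gradients gi, and the matrix Ĵn; a minimal numpy sketch with input names of our choosing (the same computation serves Eq. (8) below when wi = wi^u):

```python
import numpy as np

def generalization_estimate(w_u, w, losses, grads, J_hat):
    """G = (1/n) sum_i w_u[i] * loss_i + (1/n^2) sum_i w_u[i] * w[i] * g_i^T J_hat^{-1} g_i.
    w_u: importance weights p0(x_i)/q_i(x_i); w: training weights;
    losses: (n,) losses ell(y_i, x_i, theta_hat); grads: (n, d) gradients g_i."""
    n = len(losses)
    first = (w_u * losses).mean()
    solved = np.linalg.solve(J_hat, grads.T).T            # J_hat^{-1} g_i for all i
    quad = (grads * solved).sum(axis=1)                   # g_i^T J_hat^{-1} g_i
    return first + (w_u * w * quad).sum() / n**2
```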

44 G is an asymptotically unbiased estimator of ED Lu(θ̂n), i.e., its expectation converges to ED Lu(θ̂n). [sent-126, score-0.388]

45 Thus, in order to ensure that our active learning methods are consistent, we have to ensure that this first term converges to its minimum value. [sent-130, score-0.364]

46 One simple way to achieve this is to always optimize our weights so that the estimate G is smaller than the estimate for the unbiased reweighting scheme (see Section 5). [sent-131, score-0.488]

47 In Section 5, we show that if the distributions and weights are properly parameterized, this leads to a convex optimization problem. [sent-135, score-0.237]

48 Proposition 3 We assume that E_{qn(x)}[wn(x)^2] and E_{qn(x)}[(p0(x)/qn(x))^2] are bounded. [sent-136, score-0.299]

49 We let θ̂n denote the weighted ML estimator obtained from the first n points, and θ̂n+1 the one-step estimator obtained from the first n+1 points, i.e., θ̂n+1 is obtained by one Newton step from θ̂n [24]. [sent-137, score-0.272]
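
A minimal sketch of the one-step Newton update from θ̂n to θ̂n+1; grad_fn and hess_fn are placeholder callbacks returning the gradient and Hessian of the updated weighted objective, not functions from the paper.

```python
import numpy as np

def one_step_newton(theta, grad_fn, hess_fn, x_new, y_new, w_new):
    """One Newton step toward the estimator that includes the weighted
    new point (x_new, y_new): theta <- theta - H^{-1} g."""
    g = grad_fn(theta, x_new, y_new, w_new)   # gradient of objective with new point
    H = hess_fn(theta, x_new, y_new, w_new)   # Hessian of objective with new point
    return theta - np.linalg.solve(H, g)
```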

50 In this paper, we use the negative log-likelihood as a measure of performance, which allows simple asymptotic expansions; the focus of the paper is on the differences between testing and training sampling distributions. [sent-142, score-0.28]

51 The remaining terms in Eq. (7) are dedicated to weighting schemes for the first n points other than the unbiased reweighting scheme. [sent-146, score-0.56]

52 For the unbiased reweighting scheme where wi = wi^u, for i = 1, . [sent-147, score-1.124]

53 Dependent observations In this section, we show that, under a certain form of weak dependence between the data points xi, i = 1, . [sent-152, score-0.21]

54 For simplicity and brevity, we restrict ourselves to the unbiased reweighting scheme, i.e., wi = wi^u. [sent-156, score-0.351]

55 Many sequential active learning schemes select a training data point with a distribution or criterion that depends on the estimate so far (see Section 6 for details). [sent-166, score-0.61]

56 We thus assume that the sampling distribution qn is of the form q(xn | θ̂n), where q(x|θ) is a fixed set of smooth parameterized densities. [sent-167, score-0.462]

57 Proposition 4 (for a proof, see [23]) Let G = (1/n) Σ_{i=1}^n wi ℓ(yi, xi, θ̂n) + (1/n^2) Σ_{i=1}^n wi^2 gi^T (Ĵn)^{−1} gi, (8) where wi = wi^u = p0(xi)/q(xi | θ̂i). [sent-168, score-2.083]

58 G is an asymptotically unbiased estimator of ED Lu(θ̂n), i.e., its expectation converges to ED Lu(θ̂n). [sent-169, score-0.388]

59 In the algorithms presented in Section 5, the distribution qn is obtained as the solution of a convex optimization problem, and thus the previous theorem does not readily apply. [sent-174, score-0.46]

60 (5), which enables us to optimize the sampling density of the (n + 1)-th point, and an estimate G in Eq. [sent-177, score-0.153]

61 Those criteria assume that the variance of the importance weights wn = p0 (xn )/qn (xn ) is controlled. [sent-181, score-0.494]

62 The sampling density qn+1 will be obtained by minimizing H(wn+1 , qn+1 |α, β) for a certain parameterization of qn+1 and wn+1 . [sent-184, score-0.153]

63 Once a new sample has been selected and its label observed, Proposition 4 is used in a way similar to [13], in order to search for the best mixture between the current weights (wi) and the importance weights (wi^u), i. [sent-187, score-0.179]

64 e., we look at weights of the form wi^γ (wi^u)^{1−γ} and perform a grid search on γ to find the γ such that G in Eq. [sent-189, score-0.457]
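
The γ grid search of sentences 63–64 can be sketched as follows, assuming a hypothetical callback G_fn that re-evaluates the estimate G of Proposition 4 for a given weight vector:

```python
import numpy as np

def best_gamma_weights(w, w_u, G_fn, grid=np.linspace(0.0, 1.0, 21)):
    """Grid search over gamma for mixed weights w^gamma * w_u^(1 - gamma),
    keeping the mixture with the smallest estimated generalization error G."""
    best = None
    for gamma in grid:
        w_mix = w**gamma * w_u**(1.0 - gamma)
        G = G_fn(w_mix)                        # refit and re-estimate G with these weights
        if best is None or G < best[0]:
            best = (G, w_mix, gamma)
    return best[1], best[2]
```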

65 Thus, as opposed to most previous active learning heuristics, our estimator will always converge (in probability) to the ML estimator. [sent-192, score-0.534]

66 In Section 6, we show empirically that usual heuristic schemes do not share this property. [sent-193, score-0.156]

67 Convex optimization problem We assume that we have a fixed set of candidate distributions sk(x) of the form sk(x) = p0(x) rk(x). [sent-194, score-0.236]

68 Note that the multiplicative form of our candidate distributions allows efficient sampling from a pool of samples of p0. [sent-195, score-0.192]

69 We look at distributions qn+1(x) with mixture density of the form s(x|η) = Σ_k ηk sk(x) = p0(x) r(x), where the weights η are non-negative and sum to one. [sent-196, score-0.259]

70 We consider two weighting schemes: (a) one with all weights equal to one (unit weighting scheme) which leads to H0 (η|α, β), and (b) the unbiased reweighting scheme, where wn+1 (x) = p0 (x)/qn+1 (x), which leads to H1 (η|α, β). [sent-199, score-0.671]

71 We have H0(η|α, β) = (1/n^3) Σ_k ηk ( Σ_{i=1}^n (αi + βi) wi^u sk(xi) ) (9) and H1(η|α, β) = (1/n^3) ( Σ_{i=1}^n αi wi^u + Σ_{i=1}^n βi wi^u / (Σ_k ηk sk(xi)) ) (10). [sent-200, score-0.426]

72 The function H0(η) is linear in η, while the function H1(η) is the sum of a constant and positive inverse functions, and is thus convex [14]. [sent-201, score-0.509]
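
Since H1(η) is convex over the probability simplex, any constrained convex solver applies; the paper uses a log-barrier method, while the sketch below uses scipy's SLSQP as a stand-in. S, alpha, beta, and w_u follow the notation of Eqs. (9)–(10); a variance cap as in sentence 76 would simply be one more convex constraint.

```python
import numpy as np
from scipy.optimize import minimize

def optimize_eta(S, alpha, beta, w_u):
    """Minimize H1(eta | alpha, beta) of Eq. (10) over the simplex.
    S: (n, K) matrix with S[i, k] = s_k(x_i); alpha, beta, w_u: (n,) vectors."""
    n, K = S.shape

    def H1(eta):
        mix = S @ eta                                     # sum_k eta_k s_k(x_i)
        return ((alpha * w_u).sum() + (beta * w_u / mix).sum()) / n**3

    constraints = [{"type": "eq", "fun": lambda eta: eta.sum() - 1.0}]
    res = minimize(H1, np.full(K, 1.0 / K), method="SLSQP",
                   bounds=[(0.0, 1.0)] * K, constraints=constraints)
    return res.x
```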

73 Unless natural candidate distributions sk(x) can be defined for the active learning problem, we use the set of distributions obtained as follows: we perform K-means clustering with a large number p of clusters. [sent-202, score-0.568]

74 We let w̃i denote the number of data points assigned to the centroid yi. [sent-208, score-0.485]

75 We normalize by Zk = Σ_{i=1}^p w̃i e^{−αk ||yi − μk||^2} / Σ_{i=1}^p w̃i. [sent-209, score-0.795]
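
A sketch of the K-means construction of candidate multipliers rk(x); the Gaussian-bump form and the empirical normalization are our reading of sentences 73–75, so treat this as an approximation of the paper's exact recipe rather than a faithful implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def candidate_multipliers(X_pool, K, alpha):
    """Build r_k(x_i) proportional to exp(-alpha_k ||x_i - mu_k||^2) from K-means
    centroids of the unlabeled pool, normalized so each r_k has empirical mean 1
    under p0; then s_k(x) = p0(x) r_k(x) can be sampled by reweighting the pool."""
    km = KMeans(n_clusters=K, n_init=10).fit(X_pool)
    mu = km.cluster_centers_                                    # (K, d) centroids
    d2 = ((X_pool[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (m, K) squared distances
    R = np.exp(-alpha * d2)                                     # unnormalized r_k(x_i)
    return R / R.mean(axis=0)                                   # empirical normalization
```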

76 The variance of wn+1^u can be estimated as var wn+1^u = Σ_{i=1}^m w̃i / r(xi) − 1 = Σ_{i=1}^m w̃i / (Σ_k ηk rk(xi)) − 1 = V(η), which is convex in η. [sent-212, score-0.544]

77 Thus, constraining the variance of the new weights leads to a convex optimization problem, with convex objective and convex constraints, which can be solved efficiently by the log-barrier method [14], with cubic complexity in the number of candidate distributions. [sent-213, score-0.449]

78 Regularization parameter In the active learning set-up, the number of samples used for learning varies widely. [sent-220, score-0.425]

79 We compare our algorithms to the following three active learning frameworks. [sent-224, score-0.364]

80 In the maximum variance reduction framework [25, 9] (referred to as “varred”), the next point is selected so that the variance of the resulting estimator has the lowest determinant, which is equivalent to finding x such that tr( ℓ′′(x, θ̂n) (Ĵn)^{−1} ) is minimal. [sent-226, score-0.353]

81 Sampling densities In Figure 1, we look at the limit of the selected sampling densities, i. [sent-229, score-0.218]

82 e., we assume that a large number of points has been sampled, and we look at the criterion H in Eq. [sent-231, score-0.138]

83 We show the density obtained from the unbiased reweighting scheme (middle of Figure 1). [sent-233, score-0.457]

84 Figure 1: Proposal distributions: (Left) density p0(x) with the two different classes (red and blue), (Middle) best density with unbiased reweighting, (Right) function γ(x) such that H(qn+1(x), 1) = ∫ γ(x) qn+1(x) dx (see text for details). [sent-239, score-0.282]

86 Figure 2: Error rate vs. number of samples, averaged over 10 replications sampled from the same distribution as in Figure 1: (Left) random sampling and active learning “full”, with standard deviations, (Middle) comparison of the two schemes “unbiased” and “no weight”, (Right) comparison with other methods. [sent-250]

87 In this framework, minimizing the cost without any constraint leads to a Dirac at the maximum of γ(x), while minimizing with a constraint on the variance of the corresponding importance weights will select points with high values of γ(x). [sent-252, score-0.264]

88 From Figure 1, we see that (a) the unit weighting scheme tends to be more selective (i. [sent-254, score-0.177]

89 e., finer grained) than the unbiased scheme, and (b) that the modes of the optimal densities are close to the maximum-uncertainty hyperplane, but some parts of this hyperplane in fact lead to negative cost gains (e. [sent-256, score-0.381]

90 Comparison with other algorithms In Figures 1 and 2, we compare the performance of our active learning algorithms. [sent-259, score-0.364]

91 In the left of Figure 2, we see that our active learning framework does perform better on average but also leads to smaller variance. [sent-260, score-0.409]

92 In the middle of Figure 2, we compare the two schemes “no weight” and “unbiased”, showing the superiority of the unit weighting scheme and the significance of our asymptotic results in Propositions 2 and 3, which extend the unbiased framework of [13]. [sent-261, score-0.631]

93 Conclusion We have presented a theoretical asymptotic analysis of active learning for generalized linear models, under realistic sampling assumptions. [sent-263, score-0.745]

94 From this analysis, we obtain convex criteria which can be optimized to provide algorithms for online optimization of the sampling distributions. [sent-264, score-0.237]

95 First, our framework is not limited to generalized linear models, but can be readily extended to any convex differentiable M-estimator [24]. [sent-266, score-0.216]

96 Second, it seems advantageous to combine our active learning analysis with semi-supervised learning frameworks, in particular ones based on data-dependent regularization [26]. [sent-267, score-0.423]


98 Toward optimal active learning through sampling estimation of error reduction. [sent-309, score-0.482]

99 Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions. [sent-331, score-0.394]

100 Support vector machine active learning with applications to text classification. [sent-394, score-0.364]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('wi', 0.351), ('qn', 0.344), ('active', 0.334), ('wn', 0.299), ('gi', 0.255), ('jn', 0.251), ('unbiased', 0.212), ('xi', 0.169), ('misspeci', 0.154), ('reweighting', 0.139), ('estimator', 0.136), ('asymptotic', 0.127), ('proposition', 0.119), ('sampling', 0.118), ('yi', 0.093), ('hi', 0.092), ('schemes', 0.086), ('convex', 0.083), ('weighting', 0.082), ('eqn', 0.077), ('sk', 0.075), ('xn', 0.072), ('scheme', 0.071), ('generalized', 0.069), ('eqi', 0.067), ('maxunc', 0.067), ('minpred', 0.067), ('varred', 0.067), ('weights', 0.066), ('rk', 0.064), ('qi', 0.059), ('ml', 0.058), ('criterion', 0.057), ('lu', 0.054), ('expansions', 0.052), ('generalization', 0.051), ('rd', 0.049), ('referred', 0.048), ('importance', 0.047), ('variance', 0.046), ('leads', 0.045), ('estimators', 0.045), ('glim', 0.044), ('vard', 0.044), ('candidate', 0.043), ('distributions', 0.043), ('sequential', 0.042), ('tr', 0.042), ('points', 0.041), ('realistic', 0.041), ('ep', 0.04), ('heuristic', 0.04), ('look', 0.04), ('asymptotically', 0.04), ('mines', 0.039), ('non', 0.038), ('hyperplane', 0.038), ('criteria', 0.036), ('propositions', 0.035), ('density', 0.035), ('training', 0.035), ('densities', 0.035), ('population', 0.034), ('maximum', 0.034), ('assumptions', 0.034), ('converge', 0.034), ('readily', 0.033), ('des', 0.033), ('replications', 0.033), ('zk', 0.033), ('sampled', 0.031), ('differentiable', 0.031), ('samples', 0.031), ('ecole', 0.031), ('paris', 0.031), ('likelihood', 0.03), ('usual', 0.03), ('learning', 0.03), ('advantageous', 0.029), ('identically', 0.029), ('middle', 0.029), ('naturally', 0.027), ('provably', 0.027), ('tong', 0.027), ('select', 0.026), ('bad', 0.026), ('retrieval', 0.026), ('regression', 0.026), ('theoretical', 0.026), ('selected', 0.025), ('ti', 0.025), ('logistic', 0.025), ('eq', 0.025), ('mercer', 0.025), ('uncertainty', 0.024), ('reduction', 0.024), ('newton', 0.024), ('ed', 0.024), ('unit', 0.024), ('icml', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999958 20 nips-2006-Active learning for misspecified generalized linear models

Author: Francis R. Bach

Abstract: Active learning refers to algorithmic frameworks aimed at selecting training data points in order to reduce the number of required training data points and/or improve the generalization performance of a learning method. In this paper, we present an asymptotic analysis of active learning for generalized linear models. Our analysis holds under the common practical situation of model misspecification, and is based on realistic assumptions regarding the nature of the sampling distributions, which are usually neither independent nor identical. We derive unbiased estimators of generalization performance, as well as estimators of expected reduction in generalization error after adding a new training data point, that allow us to optimize its sampling distribution through a convex optimization problem. Our analysis naturally leads to an algorithm for sequential active learning which is applicable for all tasks supported by generalized linear models (e.g., binary classification, multi-class classification, regression) and can be applied in non-linear settings through the use of Mercer kernels.

2 0.22574271 175 nips-2006-Simplifying Mixture Models through Function Approximation

Author: Kai Zhang, James T. Kwok

Abstract: Finite mixture model is a powerful tool in many statistical learning problems. In this paper, we propose a general, structure-preserving approach to reduce its model complexity, which can bring significant computational benefits in many applications. The basic idea is to group the original mixture components into compact clusters, and then minimize an upper bound on the approximation error between the original and simplified models. By adopting the L2 norm as the distance measure between mixture models, we can derive closed-form solutions that are more robust and reliable than using the KL-based distance measure. Moreover, the complexity of our algorithm is only linear in the sample size and dimensionality. Experiments on density estimation and clustering-based image segmentation demonstrate its outstanding performance in terms of both speed and accuracy.

3 0.19780596 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment

Author: Andrea Vedaldi, Stefano Soatto

Abstract: Image Congealing (IC) is a non-parametric method for the joint alignment of a collection of images affected by systematic and unwanted deformations. The method attempts to undo the deformations by minimizing a measure of complexity of the image ensemble, such as the averaged per-pixel entropy. This enables alignment without an explicit model of the aligned dataset as required by other methods (e.g. transformed component analysis). While IC is simple and general, it may introduce degenerate solutions when the transformations allow minimizing the complexity of the data by collapsing them to a constant. Such solutions need to be explicitly removed by regularization. In this paper we propose an alternative formulation which solves this regularization issue on a more principled ground. We make the simple observation that alignment should simplify the data while preserving the useful information carried by them. Therefore we trade off fidelity and complexity of the aligned ensemble rather than minimizing the complexity alone. This eliminates the need for an explicit regularization of the transformations, and has a number of other useful properties such as noise suppression. We show the modeling and computational benefits of the approach on some of the problems on which IC has been demonstrated.

4 0.18815026 92 nips-2006-High-Dimensional Graphical Model Selection Using $\ell 1$-Regularized Logistic Regression

Author: Martin J. Wainwright, John D. Lafferty, Pradeep K. Ravikumar

Abstract: We focus on the problem of estimating the graph structure associated with a discrete Markov random field. We describe a method based on ℓ1-regularized logistic regression, in which the neighborhood of any given node is estimated by performing logistic regression subject to an ℓ1-constraint. Our framework applies to the high-dimensional setting, in which both the number of nodes p and maximum neighborhood sizes d are allowed to grow as a function of the number of observations n. Our main result is to establish sufficient conditions on the triple (n, p, d) for the method to succeed in consistently estimating the neighborhood of every node in the graph simultaneously. Under certain mutual incoherence conditions analogous to those imposed in previous work on linear regression, we prove that consistent neighborhood selection can be obtained as long as the number of observations n grows more quickly than 6d^6 log d + 2d^5 log p, thereby establishing that logarithmic growth in the number of samples n relative to graph size p is sufficient to achieve neighborhood consistency. Keywords: Graphical models; Markov random fields; structure learning; ℓ1-regularization; model selection; convex risk minimization; high-dimensional asymptotics; concentration.

5 0.16125923 127 nips-2006-MLLE: Modified Locally Linear Embedding Using Multiple Weights

Author: Zhenyue Zhang, Jing Wang

Abstract: The locally linear embedding (LLE) is improved by introducing multiple linearly independent local weight vectors for each neighborhood. We characterize the reconstruction weights and show the existence of the linearly independent weight vectors at each neighborhood. The modified locally linear embedding (MLLE) proposed in this paper is much more stable. It can retrieve the ideal embedding if MLLE is applied on data points sampled from an isometric manifold. MLLE is also compared with the local tangent space alignment (LTSA). Numerical examples are given that show the improvement and efficiency of MLLE.

6 0.13328622 68 nips-2006-Dirichlet-Enhanced Spam Filtering based on Biased Samples

7 0.11652468 50 nips-2006-Chained Boosting

8 0.10463791 62 nips-2006-Correcting Sample Selection Bias by Unlabeled Data

9 0.098517917 186 nips-2006-Support Vector Machines on a Budget

10 0.097449116 57 nips-2006-Conditional mean field

11 0.09239725 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension

12 0.092233792 15 nips-2006-A Switched Gaussian Process for Estimating Disparity and Segmentation in Binocular Stereo

13 0.090978153 156 nips-2006-Ordinal Regression by Extended Binary Classification

14 0.089732192 187 nips-2006-Temporal Coding using the Response Properties of Spiking Neurons

15 0.089104213 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation

16 0.087605715 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization

17 0.085127577 75 nips-2006-Efficient sparse coding algorithms

18 0.0807124 69 nips-2006-Distributed Inference in Dynamical Systems

19 0.07668189 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space

20 0.076471746 121 nips-2006-Learning to be Bayesian without Supervision


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.254), (1, 0.038), (2, -0.009), (3, 0.03), (4, 0.019), (5, 0.14), (6, -0.022), (7, -0.015), (8, -0.019), (9, 0.19), (10, -0.062), (11, -0.11), (12, 0.395), (13, 0.194), (14, -0.047), (15, 0.001), (16, 0.109), (17, 0.093), (18, -0.061), (19, -0.015), (20, 0.055), (21, 0.01), (22, -0.049), (23, -0.019), (24, 0.072), (25, -0.019), (26, -0.009), (27, -0.026), (28, 0.035), (29, -0.031), (30, 0.125), (31, -0.008), (32, -0.019), (33, -0.113), (34, -0.134), (35, 0.037), (36, 0.002), (37, -0.046), (38, -0.01), (39, -0.067), (40, 0.001), (41, -0.013), (42, 0.078), (43, 0.095), (44, 0.005), (45, 0.02), (46, -0.041), (47, 0.015), (48, -0.0), (49, -0.116)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95275778 20 nips-2006-Active learning for misspecified generalized linear models

Author: Francis R. Bach

Abstract: Active learning refers to algorithmic frameworks aimed at selecting training data points in order to reduce the number of required training data points and/or improve the generalization performance of a learning method. In this paper, we present an asymptotic analysis of active learning for generalized linear models. Our analysis holds under the common practical situation of model misspecification, and is based on realistic assumptions regarding the nature of the sampling distributions, which are usually neither independent nor identical. We derive unbiased estimators of generalization performance, as well as estimators of expected reduction in generalization error after adding a new training data point, that allow us to optimize its sampling distribution through a convex optimization problem. Our analysis naturally leads to an algorithm for sequential active learning which is applicable for all tasks supported by generalized linear models (e.g., binary classification, multi-class classification, regression) and can be applied in non-linear settings through the use of Mercer kernels.

2 0.71795171 175 nips-2006-Simplifying Mixture Models through Function Approximation

Author: Kai Zhang, James T. Kwok

Abstract: Finite mixture model is a powerful tool in many statistical learning problems. In this paper, we propose a general, structure-preserving approach to reduce its model complexity, which can bring significant computational benefits in many applications. The basic idea is to group the original mixture components into compact clusters, and then minimize an upper bound on the approximation error between the original and simplified models. By adopting the L2 norm as the distance measure between mixture models, we can derive closed-form solutions that are more robust and reliable than using the KL-based distance measure. Moreover, the complexity of our algorithm is only linear in the sample size and dimensionality. Experiments on density estimation and clustering-based image segmentation demonstrate its outstanding performance in terms of both speed and accuracy.

3 0.71529245 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment

Author: Andrea Vedaldi, Stefano Soatto

Abstract: Image Congealing (IC) is a non-parametric method for the joint alignment of a collection of images affected by systematic and unwanted deformations. The method attempts to undo the deformations by minimizing a measure of complexity of the image ensemble, such as the averaged per-pixel entropy. This enables alignment without an explicit model of the aligned dataset as required by other methods (e.g. transformed component analysis). While IC is simple and general, it may introduce degenerate solutions when the transformations allow minimizing the complexity of the data by collapsing them to a constant. Such solutions need to be explicitly removed by regularization. In this paper we propose an alternative formulation which solves this regularization issue on a more principled ground. We make the simple observation that alignment should simplify the data while preserving the useful information carried by them. Therefore we trade off fidelity and complexity of the aligned ensemble rather than minimizing the complexity alone. This eliminates the need for an explicit regularization of the transformations, and has a number of other useful properties such as noise suppression. We show the modeling and computational benefits of the approach on some of the problems on which IC has been demonstrated.

4 0.71209288 127 nips-2006-MLLE: Modified Locally Linear Embedding Using Multiple Weights

Author: Zhenyue Zhang, Jing Wang

Abstract: The locally linear embedding (LLE) is improved by introducing multiple linearly independent local weight vectors for each neighborhood. We characterize the reconstruction weights and show the existence of the linearly independent weight vectors at each neighborhood. The modified locally linear embedding (MLLE) proposed in this paper is much more stable. It can retrieve the ideal embedding if MLLE is applied on data points sampled from an isometric manifold. MLLE is also compared with the local tangent space alignment (LTSA). Numerical examples are given that show the improvement and efficiency of MLLE.

5 0.53870189 68 nips-2006-Dirichlet-Enhanced Spam Filtering based on Biased Samples

Author: Steffen Bickel, Tobias Scheffer

Abstract: We study a setting that is motivated by the problem of filtering spam messages for many users. Each user receives messages according to an individual, unknown distribution, reflected only in the unlabeled inbox. The spam filter for a user is required to perform well with respect to this distribution. Labeled messages from publicly available sources can be utilized, but they are governed by a distinct distribution, not adequately representing most inboxes. We devise a method that minimizes a loss function with respect to a user’s personal distribution based on the available biased sample. A nonparametric hierarchical Bayesian model furthermore generalizes across users by learning a common prior which is imposed on new email accounts. Empirically, we observe that bias-corrected learning outperforms naive reliance on the assumption of independent and identically distributed data; Dirichlet-enhanced generalization across users outperforms a single (“one size fits all”) filter as well as independent filters for all users.

6 0.45932311 50 nips-2006-Chained Boosting

7 0.45915645 92 nips-2006-High-Dimensional Graphical Model Selection Using $\ell 1$-Regularized Logistic Regression

8 0.45735663 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions

9 0.4346492 186 nips-2006-Support Vector Machines on a Budget

10 0.42921656 121 nips-2006-Learning to be Bayesian without Supervision

11 0.39245176 129 nips-2006-Map-Reduce for Machine Learning on Multicore

12 0.37181997 57 nips-2006-Conditional mean field

13 0.36816239 180 nips-2006-Speakers optimize information density through syntactic reduction

14 0.34641993 156 nips-2006-Ordinal Regression by Extended Binary Classification

15 0.34297591 128 nips-2006-Manifold Denoising

16 0.33984396 62 nips-2006-Correcting Sample Selection Bias by Unlabeled Data

17 0.33709788 150 nips-2006-On Transductive Regression

18 0.33680141 131 nips-2006-Mixture Regression for Covariate Shift

19 0.33469731 118 nips-2006-Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields

20 0.33239391 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.144), (3, 0.038), (7, 0.064), (9, 0.04), (12, 0.013), (20, 0.012), (22, 0.144), (44, 0.096), (49, 0.136), (57, 0.101), (65, 0.084), (69, 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90309262 20 nips-2006-Active learning for misspecified generalized linear models

Author: Francis R. Bach

Abstract: Active learning refers to algorithmic frameworks aimed at selecting training data points in order to reduce the number of required training data points and/or improve the generalization performance of a learning method. In this paper, we present an asymptotic analysis of active learning for generalized linear models. Our analysis holds under the common practical situation of model misspecification, and is based on realistic assumptions regarding the nature of the sampling distributions, which are usually neither independent nor identical. We derive unbiased estimators of generalization performance, as well as estimators of expected reduction in generalization error after adding a new training data point, that allow us to optimize its sampling distribution through a convex optimization problem. Our analysis naturally leads to an algorithm for sequential active learning which is applicable for all tasks supported by generalized linear models (e.g., binary classification, multi-class classification, regression) and can be applied in non-linear settings through the use of Mercer kernels.

2 0.87372887 61 nips-2006-Convex Repeated Games and Fenchel Duality

Author: Shai Shalev-shwartz, Yoram Singer

Abstract: We describe an algorithmic framework for an abstract game which we term a convex repeated game. We show that various online learning and boosting algorithms can be all derived as special cases of our algorithmic framework. This unified view explains the properties of existing algorithms and also enables us to derive several new interesting algorithms. Our algorithmic framework stems from a connection that we build between the notions of regret in game theory and weak duality in convex optimization. 1 Introduction and Problem Setting Several problems arising in machine learning can be modeled as a convex repeated game. Convex repeated games are closely related to online convex programming (see [19, 9] and the discussion in the last section). A convex repeated game is a two players game that is performed in a sequence of consecutive rounds. On round t of the repeated game, the first player chooses a vector wt from a convex set S. Next, the second player responds with a convex function gt : S → R. Finally, the first player suffers an instantaneous loss gt (wt ). We study the game from the viewpoint of the first player. The goal of the first player is to minimize its cumulative loss, t gt (wt ). To motivate this rather abstract setting let us first cast the more familiar setting of online learning as a convex repeated game. Online learning is performed in a sequence of consecutive rounds. On round t, the learner first receives a question, cast as a vector xt , and is required to provide an answer for this question. For example, xt can be an encoding of an email message and the question is whether the email is spam or not. The prediction of the learner is performed based on an hypothesis, ht : X → Y, where X is the set of questions and Y is the set of possible answers. In the aforementioned example, Y would be {+1, −1} where +1 stands for a spam email and −1 stands for a benign one. After predicting an answer, the learner receives the correct answer for the question, denoted yt , and suffers loss according to a loss function (ht , (xt , yt )). In most cases, the hypotheses used for prediction come from a parameterized set of hypotheses, H = {hw : w ∈ S}. For example, the set of linear classifiers, which is used for answering yes/no questions, is defined as H = {hw (x) = sign( w, x ) : w ∈ Rn }. Thus, rather than saying that on round t the learner chooses a hypothesis, we can say that the learner chooses a vector wt and its hypothesis is hwt . Next, we note that once the environment chooses a question-answer pair (xt , yt ), the loss function becomes a function over the hypotheses space or equivalently over the set of parameter vectors S. We can therefore redefine the online learning process as follows. On round t, the learner chooses a vector wt ∈ S, which defines a hypothesis hwt to be used for prediction. Then, the environment chooses a questionanswer pair (xt , yt ), which induces the following loss function over the set of parameter vectors, gt (w) = (hw , (xt , yt )). Finally, the learner suffers the loss gt (wt ) = (hwt , (xt , yt )). We have therefore described the process of online learning as a convex repeated game. In this paper we assess the performance of the first player using the notion of regret. Given a number of rounds T and a fixed vector u ∈ S, we define the regret of the first player as the excess loss for not consistently playing the vector u, 1 T T gt (wt ) − t=1 1 T T gt (u) . 
t=1 Our main result is an algorithmic framework for the first player which guarantees low regret with respect to any vector u ∈ S. Specifically, we derive regret bounds that take the following form ∀u ∈ S, 1 T T gt (wt ) − t=1 1 T T gt (u) ≤ t=1 f (u) + L √ , T (1) where f : S → R and L ∈ R+ . Informally, the function f measures the “complexity” of vectors in S and the scalar L is related to some generalized Lipschitz property of the functions g1 , . . . , gT . We defer the exact requirements we impose on f and L to later sections. Our algorithmic framework emerges from a representation of the regret bound given in Eq. (1) using an optimization problem. Specifically, we rewrite Eq. (1) as follows 1 T T gt (wt ) ≤ inf t=1 u∈S 1 T T gt (u) + t=1 f (u) + L √ . T (2) That is, the average loss of the first player should be bounded above by the minimum value of an optimization problem in which we jointly minimize the average loss of u and the “complexity” of u as measured by the function f . Note that the optimization problem on the right-hand side of Eq. (2) can only be solved in hindsight after observing the entire sequence of loss functions. Nevertheless, writing the regret bound as in Eq. (2) implies that the average loss of the first player forms a lower bound for a minimization problem. The notion of duality, commonly used in convex optimization theory, plays an important role in obtaining lower bounds for the minimal value of a minimization problem (see for example [14]). By generalizing the notion of Fenchel duality, we are able to derive a dual optimization problem, which can be optimized incrementally, as the game progresses. In order to derive explicit quantitative regret bounds we make an immediate use of the fact that dual objective lower bounds the primal objective. We therefore reduce the process of playing convex repeated games to the task of incrementally increasing the dual objective function. The amount by which the dual increases serves as a new and natural notion of progress. By doing so we are able to tie the primal objective value, the average loss of the first player, and the increase in the dual. The rest of this paper is organized as follows. In Sec. 2 we establish our notation and point to a few mathematical tools that we use throughout the paper. Our main tool for deriving algorithms for playing convex repeated games is a generalization of Fenchel duality, described in Sec. 3. Our algorithmic framework is given in Sec. 4 and analyzed in Sec. 5. The generality of our framework allows us to utilize it in different problems arising in machine learning. Specifically, in Sec. 6 we underscore the applicability of our framework for online learning and in Sec. 7 we outline and analyze boosting algorithms based on our framework. We conclude with a discussion and point to related work in Sec. 8. Due to the lack of space, some of the details are omitted from the paper and can be found in [16]. 2 Mathematical Background We denote scalars with lower case letters (e.g. x and w), and vectors with bold face letters (e.g. x and w). The inner product between vectors x and w is denoted by x, w . Sets are designated by upper case letters (e.g. S). The set of non-negative real numbers is denoted by R+ . For any k ≥ 1, the set of integers {1, . . . , k} is denoted by [k]. A norm of a vector x is denoted by x . The dual norm is defined as λ = sup{ x, λ : x ≤ 1}. 
For example, the Euclidean norm, x 2 = ( x, x )1/2 is dual to itself and the 1 norm, x 1 = i |xi |, is dual to the ∞ norm, x ∞ = maxi |xi |. We next recall a few definitions from convex analysis. The reader familiar with convex analysis may proceed to Lemma 1 while for a more thorough introduction see for example [1]. A set S is convex if for any two vectors w1 , w2 in S, all the line between w1 and w2 is also within S. That is, for any α ∈ [0, 1] we have that αw1 + (1 − α)w2 ∈ S. A set S is open if every point in S has a neighborhood lying in S. A set S is closed if its complement is an open set. A function f : S → R is closed and convex if for any scalar α ∈ R, the level set {w : f (w) ≤ α} is closed and convex. The Fenchel conjugate of a function f : S → R is defined as f (θ) = supw∈S w, θ − f (w) . If f is closed and convex then the Fenchel conjugate of f is f itself. The Fenchel-Young inequality states that for any w and θ we have that f (w) + f (θ) ≥ w, θ . A vector λ is a sub-gradient of a function f at w if for all w ∈ S we have that f (w ) − f (w) ≥ w − w, λ . The differential set of f at w, denoted ∂f (w), is the set of all sub-gradients of f at w. If f is differentiable at w then ∂f (w) consists of a single vector which amounts to the gradient of f at w and is denoted by f (w). Sub-gradients play an important role in the definition of Fenchel conjugate. In particular, the following lemma states that if λ ∈ ∂f (w) then Fenchel-Young inequality holds with equality. Lemma 1 Let f be a closed and convex function and let ∂f (w ) be its differential set at w . Then, for all λ ∈ ∂f (w ) we have, f (w ) + f (λ ) = λ , w . A continuous function f is σ-strongly convex over a convex set S with respect to a norm · if S is contained in the domain of f and for all v, u ∈ S and α ∈ [0, 1] we have 1 (3) f (α v + (1 − α) u) ≤ α f (v) + (1 − α) f (u) − σ α (1 − α) v − u 2 . 2 Strongly convex functions play an important role in our analysis primarily due to the following lemma. Lemma 2 Let · be a norm over Rn and let · be its dual norm. Let f be a σ-strongly convex function on S and let f be its Fenchel conjugate. Then, f is differentiable with f (θ) = arg maxx∈S θ, x − f (x). Furthermore, for any θ, λ ∈ Rn we have 1 f (θ + λ) − f (θ) ≤ f (θ), λ + λ 2 . 2σ Two notable examples of strongly convex functions which we use are as follows. 1 Example 1 The function f (w) = 2 w norm. Its conjugate function is f (θ) = 2 2 1 2 is 1-strongly convex over S = Rn with respect to the θ 2. 2 2 n 1 Example 2 The function f (w) = i=1 wi log(wi / n ) is 1-strongly convex over the probabilistic n simplex, S = {w ∈ R+ : w 1 = 1}, with respect to the 1 norm. Its conjugate function is n 1 f (θ) = log( n i=1 exp(θi )). 3 Generalized Fenchel Duality In this section we derive our main analysis tool. We start by considering the following optimization problem, T inf c f (w) + t=1 gt (w) , w∈S where c is a non-negative scalar. An equivalent problem is inf w0 ,w1 ,...,wT c f (w0 ) + T t=1 gt (wt ) s.t. w0 ∈ S and ∀t ∈ [T ], wt = w0 . Introducing T vectors λ1 , . . . , λT , each λt ∈ Rn is a vector of Lagrange multipliers for the equality constraint wt = w0 , we obtain the following Lagrangian T T L(w0 , w1 , . . . , wT , λ1 , . . . , λT ) = c f (w0 ) + t=1 gt (wt ) + t=1 λt , w0 − wt . The dual problem is the task of maximizing the following dual objective value, D(λ1 , . . . , λT ) = inf L(w0 , w1 , . . . , wT , λ1 , . . . 
, λT ) w0 ∈S,w1 ,...,wT = − c sup w0 ∈S = −c f −1 c w0 , − 1 c T t=1 T t=1 λt − λt − f (w0 ) − T t=1 gt (λt ) , T t=1 sup ( wt , λt − gt (wt )) wt where, following the exposition of Sec. 2, f , g1 , . . . , gT are the Fenchel conjugate functions of f, g1 , . . . , gT . Therefore, the generalized Fenchel dual problem is sup − cf λ1 ,...,λT −1 c T t=1 λt − T t=1 gt (λt ) . (4) Note that when T = 1 and c = 1, the above duality is the so called Fenchel duality. 4 A Template Learning Algorithm for Convex Repeated Games In this section we describe a template learning algorithm for playing convex repeated games. As mentioned before, we study convex repeated games from the viewpoint of the first player which we shortly denote as P1. Recall that we would like our learning algorithm to achieve a regret bound of the form given in Eq. (2). We start by rewriting Eq. (2) as follows T m gt (wt ) − c L ≤ inf u∈S t=1 c f (u) + gt (u) , (5) t=1 √ where c = T . Thus, up to the sublinear term c L, the cumulative loss of P1 lower bounds the optimum of the minimization problem on the right-hand side of Eq. (5). In the previous section we derived the generalized Fenchel dual of the right-hand side of Eq. (5). Our construction is based on the weak duality theorem stating that any value of the dual problem is smaller than the optimum value of the primal problem. The algorithmic framework we propose is therefore derived by incrementally ascending the dual objective function. Intuitively, by ascending the dual objective we move closer to the optimal primal value and therefore our performance becomes similar to the performance of the best fixed weight vector which minimizes the right-hand side of Eq. (5). Initially, we use the elementary dual solution λ1 = 0 for all t. We assume that inf w f (w) = 0 and t for all t inf w gt (w) = 0 which imply that D(λ1 , . . . , λ1 ) = 0. We assume in addition that f is 1 T σ-strongly convex. Therefore, based on Lemma 2, the function f is differentiable. At trial t, P1 uses for prediction the vector wt = f −1 c T i=1 λt i . (6) After predicting wt , P1 receives the function gt and suffers the loss gt (wt ). Then, P1 updates the dual variables as follows. Denote by ∂t the differential set of gt at wt , that is, ∂t = {λ : ∀w ∈ S, gt (w) − gt (wt ) ≥ λ, w − wt } . (7) The new dual variables (λt+1 , . . . , λt+1 ) are set to be any set of vectors which satisfy the following 1 T two conditions: (i). ∃λ ∈ ∂t s.t. D(λt+1 , . . . , λt+1 ) ≥ D(λt , . . . , λt , λ , λt , . . . , λt ) 1 1 t−1 t+1 T T (ii). ∀i > t, λt+1 = 0 i . (8) In the next section we show that condition (i) ensures that the increase of the dual at trial t is proportional to the loss gt (wt ). The second condition ensures that we can actually calculate the dual at trial t without any knowledge on the yet to be seen loss functions gt+1 , . . . , gT . We conclude this section with two update rules that trivially satisfy the above two conditions. The first update scheme simply finds λ ∈ ∂t and set λt+1 = i λ λt i if i = t if i = t . (9) The second update defines (λt+1 , . . . , λt+1 ) = argmax D(λ1 , . . . , λT ) 1 T λ1 ,...,λT s.t. ∀i = t, λi = λt . i (10) 5 Analysis In this section we analyze the performance of the template algorithm given in the previous section. Our proof technique is based on monitoring the value of the dual objective function. The main result is the following lemma which gives upper and lower bounds for the final value of the dual objective function. 
Lemma 3 Let f be a σ-strongly convex function with respect to a norm · over a set S and assume that minw∈S f (w) = 0. Let g1 , . . . , gT be a sequence of convex and closed functions such that inf w gt (w) = 0 for all t ∈ [T ]. Suppose that a dual-incrementing algorithm which satisfies the conditions of Eq. (8) is run with f as a complexity function on the sequence g1 , . . . , gT . Let w1 , . . . , wT be the sequence of primal vectors that the algorithm generates and λT +1 , . . . , λT +1 1 T be its final sequence of dual variables. Then, there exists a sequence of sub-gradients λ1 , . . . , λT , where λt ∈ ∂t for all t, such that T 1 gt (wt ) − 2σc t=1 T T λt 2 ≤ D(λT +1 , . . . , λT +1 ) 1 T t=1 ≤ inf c f (w) + w∈S gt (w) . t=1 Proof The second inequality follows directly from the weak duality theorem. Turning to the left most inequality, denote ∆t = D(λt+1 , . . . , λt+1 ) − D(λt , . . . , λt ) and note that 1 1 T T T D(λ1 +1 , . . . , λT +1 ) can be rewritten as T T t=1 D(λT +1 , . . . , λT +1 ) = 1 T T t=1 ∆t − D(λ1 , . . . , λ1 ) = 1 T ∆t , (11) where the last equality follows from the fact that f (0) = g1 (0) = . . . = gT (0) = 0. The definition of the update implies that ∆t ≥ D(λt , . . . , λt , λt , 0, . . . , 0) − D(λt , . . . , λt , 0, 0, . . . , 0) for 1 t−1 1 t−1 t−1 some subgradient λt ∈ ∂t . Denoting θ t = − 1 j=1 λj , we now rewrite the lower bound on ∆t as, c ∆t ≥ −c (f (θ t − λt /c) − f (θ t )) − gt (λt ) . Using Lemma 2 and the definition of wt we get that 1 (12) ∆t ≥ wt , λt − gt (λt ) − 2 σ c λt 2 . Since λt ∈ ∂t and since we assume that gt is closed and convex, we can apply Lemma 1 to get that wt , λt − gt (λt ) = gt (wt ). Plugging this equality into Eq. (12) and summing over t we obtain that T T T 1 2 . t=1 ∆t ≥ t=1 gt (wt ) − 2 σ c t=1 λt Combining the above inequality with Eq. (11) concludes our proof. The following regret bound follows as a direct corollary of Lemma 3. T 1 Theorem 1 Under the same conditions of Lemma 3. Denote L = T t=1 λt w ∈ S we have, T T c f (w) 1 1 + 2L c . t=1 gt (wt ) − T t=1 gt (w) ≤ T T σ √ In particular, if c = T , we obtain the bound, 1 T 6 T t=1 gt (wt ) − 1 T T t=1 gt (w) ≤ f (w)+L/(2 σ) √ T 2 . Then, for all . Application to Online learning In Sec. 1 we cast the task of online learning as a convex repeated game. We now demonstrate the applicability of our algorithmic framework for the problem of instance ranking. We analyze this setting since several prediction problems, including binary classification, multiclass prediction, multilabel prediction, and label ranking, can be cast as special cases of the instance ranking problem. Recall that on each online round, the learner receives a question-answer pair. In instance ranking, the question is encoded by a matrix Xt of dimension kt × n and the answer is a vector yt ∈ Rkt . The semantic of yt is as follows. For any pair (i, j), if yt,i > yt,j then we say that yt ranks the i’th row of Xt ahead of the j’th row of Xt . We also interpret yt,i − yt,j as the confidence in which the i’th row should be ranked ahead of the j’th row. For example, each row of Xt encompasses a representation of a movie while yt,i is the movie’s rating, expressed as the number of stars this movie has received by a movie reviewer. The predictions of the learner are determined ˆ based on a weight vector wt ∈ Rn and are defined to be yt = Xt wt . Finally, let us define two loss functions for ranking, both generalize the hinge-loss used in binary classification problems. Denote by Et the set {(i, j) : yt,i > yt,j }. 
For all (i, j) ∈ Et we define a pair-based hinge-loss i,j (w; (Xt , yt )) = [(yt,i − yt,j ) − w, xt,i − xt,j ]+ , where [a]+ = max{a, 0} and xt,i , xt,j are respectively the i’th and j’th rows of Xt . Note that i,j is zero if w ranks xt,i higher than xt,j with a sufficient confidence. Ideally, we would like i,j (wt ; (Xt , yt )) to be zero for all (i, j) ∈ Et . If this is not the case, we are being penalized according to some combination of the pair-based losses i,j . For example, we can set (w; (Xt , yt )) to be the average over the pair losses, 1 avg (w; (Xt , yt )) = |Et | (i,j)∈Et i,j (w; (Xt , yt )) . This loss was suggested by several authors (see for example [18]). Another popular approach (see for example [5]) penalizes according to the maximal loss over the individual pairs, max (w; (Xt , yt )) = max(i,j)∈Et i,j (w; (Xt , yt )) . We can apply our algorithmic framework given in Sec. 4 for ranking, using for gt (w) either avg (w; (Xt , yt )) or max (w; (Xt , yt )). The following theorem provides us with a sufficient condition under which the regret bound from Thm. 1 holds for ranking as well. Theorem 2 Let f be a σ-strongly convex function over S with respect to a norm · . Denote by Lt the maximum over (i, j) ∈ Et of xt,i − xt,j 2 . Then, for both gt (w) = avg (w; (Xt , yt )) and ∗ gt (w) = max (w; (Xt , yt )), the following regret bound holds ∀u ∈ S, 7 1 T T t=1 gt (wt ) − 1 T T t=1 gt (u) ≤ 1 f (u)+ T PT t=1 Lt /(2 σ) √ T . The Boosting Game In this section we describe the applicability of our algorithmic framework to the analysis of boosting algorithms. A boosting algorithm uses a weak learning algorithm that generates weak-hypotheses whose performances are just slightly better than random guessing to build a strong-hypothesis which can attain an arbitrarily low error. The AdaBoost algorithm, proposed by Freund and Schapire [6], receives as input a training set of examples {(x1 , y1 ), . . . , (xm , ym )} where for all i ∈ [m], xi is taken from an instance domain X , and yi is a binary label, yi ∈ {+1, −1}. The boosting process proceeds in a sequence of consecutive trials. At trial t, the booster first defines a distribution, denoted wt , over the set of examples. Then, the booster passes the training set along with the distribution wt to the weak learner. The weak learner is assumed to return a hypothesis ht : X → {+1, −1} whose average error is slightly smaller than 1 . That is, there exists a constant γ > 0 such that, 2 def m 1−yi ht (xi ) = ≤ 1 −γ . (13) i=1 wt,i 2 2 The goal of the boosting algorithm is to invoke the weak learner several times with different distributions, and to combine the hypotheses returned by the weak learner into a final, so called strong, hypothesis whose error is small. The final hypothesis combines linearly the T hypotheses returned by the weak learner with coefficients α1 , . . . , αT , and is defined to be the sign of hf (x) where T hf (x) = t=1 αt ht (x) . The coefficients α1 , . . . , αT are determined by the booster. In Ad1 1 aBoost, the initial distribution is set to be the uniform distribution, w1 = ( m , . . . , m ). At iter1 ation t, the value of αt is set to be 2 log((1 − t )/ t ). The distribution is updated by the rule wt+1,i = wt,i exp(−αt yi ht (xi ))/Zt , where Zt is a normalization factor. Freund and Schapire [6] have shown that under the assumption given in Eq. (13), the error of the final strong hypothesis is at most exp(−2 γ 2 T ). 
Several authors [15, 13, 8, 4] have proposed to view boosting as a coordinate-wise greedy optimization process. To do so, note first that $h_f$ errs on an example $(x, y)$ iff $y\, h_f(x) \le 0$. Therefore, the exp-loss function, defined as $\exp(-y\, h_f(x))$, is a smooth upper bound of the zero-one error, which equals 1 if $y\, h_f(x) \le 0$ and 0 otherwise. Thus, we can restate the goal of boosting as minimizing the average exp-loss of $h_f$ over the training set with respect to the variables $\alpha_1, \ldots, \alpha_T$. To simplify our derivation in the sequel, we prefer to say that boosting maximizes the negation of the loss, that is,

$$\max_{\alpha_1, \ldots, \alpha_T}\; -\frac{1}{m}\sum_{i=1}^m \exp\Big(-y_i \sum_{t=1}^T \alpha_t h_t(x_i)\Big) \;. \qquad (14)$$

In this view, boosting is an optimization procedure which iteratively maximizes Eq. (14) with respect to the variables $\alpha_1, \ldots, \alpha_T$. This view of boosting enables the hypotheses returned by the weak learner to be general functions into the reals, $h_t : \mathcal{X} \to \mathbb{R}$ (see for instance [15]).

In this paper we view boosting as a convex repeated game between a booster and a weak learner. To motivate our construction, we would like to note that boosting algorithms define weights in two different domains: the vectors $w_t \in \mathbb{R}^m$ which assign weights to examples, and the weights $\{\alpha_t : t \in [T]\}$ over weak hypotheses. In the terminology used throughout this paper, the weights $w_t \in \mathbb{R}^m$ are primal vectors while (as we show in the sequel) each weight $\alpha_t$ of the hypothesis $h_t$ is related to a dual vector $\lambda_t$. In particular, we show that Eq. (14) is exactly the Fenchel dual of a primal problem for a convex repeated game; thus the algorithmic framework described thus far for playing games naturally fits the problem of iteratively solving Eq. (14).

To derive the primal problem whose Fenchel dual is the problem given in Eq. (14), let us first denote by $v_t$ the vector in $\mathbb{R}^m$ whose $i$th element is $v_{t,i} = y_i h_t(x_i)$. For all $t$, we set $g_t$ to be the function $g_t(w) = [\langle w, v_t\rangle]_+$. Intuitively, $g_t$ penalizes vectors $w$ which assign large weights to examples which are predicted accurately, that is, $y_i h_t(x_i) > 0$. In particular, if $h_t(x_i) \in \{+1, -1\}$ and $w_t$ is a distribution over the $m$ examples (as is the case in AdaBoost), $g_t(w_t)$ reduces to $1 - 2\epsilon_t$ (see Eq. (13)). In this case, minimizing $g_t$ is equivalent to maximizing the error of the individual hypothesis $h_t$ over the examples. Consider the problem of minimizing $c\, f(w) + \sum_{t=1}^T g_t(w)$ where $f(w)$ is the relative entropy given in Example 2 and $c = 1/(2\gamma)$ (see Eq. (13)). To derive its Fenchel dual, we note that $g_t^\star(\lambda_t) = 0$ if there exists $\beta_t \in [0, 1]$ such that $\lambda_t = \beta_t v_t$, and otherwise $g_t^\star(\lambda_t) = \infty$ (see [16]). In addition, let us define $\alpha_t = 2\gamma\beta_t$. Since our goal is to maximize the dual, we can restrict $\lambda_t$ to take the form $\lambda_t = \beta_t v_t = \frac{\alpha_t}{2\gamma} v_t$, and get that

$$D(\lambda_1, \ldots, \lambda_T) \;=\; -c\, f^\star\Big(-\frac{1}{c}\sum_{t=1}^T \beta_t v_t\Big) \;=\; -\frac{1}{2\gamma}\,\log\Big(\frac{1}{m}\sum_{i=1}^m e^{-\sum_{t=1}^T \alpha_t y_i h_t(x_i)}\Big) \;. \qquad (15)$$

Minimizing the exp-loss of the strong hypothesis is therefore the dual problem of the following primal minimization problem: find a distribution over the examples whose relative entropy to the uniform distribution is as small as possible, while the correlation of the distribution with each $v_t$ is as small as possible. Since the correlation of $w$ with $v_t$ is inversely proportional to the error of $h_t$ with respect to $w$, we obtain that in the primal problem we are trying to maximize the error of each individual hypothesis, while in the dual problem we minimize the global error of the strong hypothesis.
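The conjugacy underlying Eq. (15) is easy to check numerically. The snippet below is a sketch, under the assumption that the complexity function of Example 2 is the relative entropy to the uniform distribution over the simplex: it verifies that $f^\star(\theta) = \log(\frac{1}{m}\sum_i e^{\theta_i})$ by comparing the closed form against $\langle w^\star, \theta\rangle - f(w^\star)$ at the maximizing softmax point $w^\star = \nabla f^\star(\theta)$, which is the same map the booster uses to construct its distribution in the next section.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 8
theta = rng.normal(size=m)

def rel_entropy(w):               # f(w) = sum_i w_i log(m * w_i) on the simplex
    return np.sum(w * np.log(m * w))

# The maximizer of <w, theta> - f(w) over the simplex is the softmax point,
# which is also grad f*(theta).
w_star = np.exp(theta) / np.exp(theta).sum()
conj_closed_form = np.log(np.mean(np.exp(theta)))   # f*(theta) = log((1/m) sum_i e^{theta_i})
conj_from_definition = w_star @ theta - rel_entropy(w_star)
assert np.isclose(conj_closed_form, conj_from_definition)
```

Substituting $\theta = -\sum_t \alpha_t v_t$ into this closed form and multiplying by $-c = -1/(2\gamma)$ gives exactly the right-hand side of Eq. (15).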
The intuition of finding distributions which in retrospect result in large error rates of individual hypotheses was also alluded to in [15, 8]. We can now apply our algorithmic framework from Sec. 4 to boosting. We describe the game with the parameters $\alpha_t$, where $\alpha_t \in [0, 2\gamma]$, and underscore that in our case $\lambda_t = \frac{\alpha_t}{2\gamma} v_t$. At the beginning of the game the booster sets all dual variables to zero, $\forall t\; \alpha_t = 0$. At trial $t$ of the boosting game, the booster first constructs a primal weight vector $w_t \in \mathbb{R}^m$, which assigns importance weights to the examples in the training set. The primal vector $w_t$ is constructed as in Eq. (6), that is, $w_t = \nabla f^\star(\theta_t)$, where $\theta_t = -\sum_{i<t} \alpha_i v_i$. Then, the weak learner responds by presenting the loss function $g_t(w) = [\langle w, v_t\rangle]_+$. Finally, the booster updates the dual variables so as to increase the dual objective function. It is possible to show that if the range of $h_t$ is $\{+1, -1\}$ then the update given in Eq. (10) is equivalent to the update $\alpha_t = \min\{2\gamma,\; \frac{1}{2}\log((1 - \epsilon_t)/\epsilon_t)\}$. We have thus obtained a variant of AdaBoost in which the weights $\alpha_t$ are capped above by $2\gamma$. A disadvantage of this variant is that we need to know the parameter $\gamma$. We would like to note in passing that this limitation can be lifted by a different definition of the functions $g_t$. We omit the details due to the lack of space.

To analyze our game of boosting, we note that the conditions given in Lemma 3 hold, and therefore the left-hand side inequality given in Lemma 3 tells us that $\sum_{t=1}^T g_t(w_t) - \frac{1}{2c}\sum_{t=1}^T \|\lambda_t\|_\infty^2 \le D(\lambda_1^{T+1}, \ldots, \lambda_T^{T+1})$. The definition of $g_t$ and the weak learnability assumption given in Eq. (13) imply that $\langle w_t, v_t\rangle \ge 2\gamma$ for all $t$. Thus, $g_t(w_t) = \langle w_t, v_t\rangle \ge 2\gamma$, which also implies that $\lambda_t = v_t$. Recall that $v_{t,i} = y_i h_t(x_i)$. Assuming that the range of $h_t$ is $[-1, +1]$ we get that $\|\lambda_t\|_\infty \le 1$. Combining all the above with the left-hand side inequality given in Lemma 3 we get that $2T\gamma - \frac{T}{2c} \le D(\lambda_1^{T+1}, \ldots, \lambda_T^{T+1})$. Using the definition of $D$ (see Eq. (15)), the value $c = 1/(2\gamma)$, and rearranging terms, we recover the original bound for AdaBoost:

$$\frac{1}{m}\sum_{i=1}^m e^{-y_i \sum_{t=1}^T \alpha_t h_t(x_i)} \;\le\; e^{-2\gamma^2 T} \;.$$

8 Related Work and Discussion

We presented a new framework for designing and analyzing algorithms for playing convex repeated games. Our framework was used for the analysis of known algorithms for both online learning and boosting settings. The framework also paves the way to new algorithms. In a previous paper [17], we suggested the use of duality for the design of online algorithms in the context of mistake bound analysis. The contribution of this paper over [17] is threefold, as we now briefly discuss.

First, we generalize the applicability of the framework beyond the specific setting of online learning with the hinge-loss to the general setting of convex repeated games. The setting of convex repeated games was formally termed "online convex programming" by Zinkevich [19] and was first presented by Gordon in [9]. There is a voluminous amount of work on unifying approaches for deriving online learning algorithms. We refer the reader to [11, 12, 3] for work closely related to the content of this paper. By generalizing our previously studied algorithmic framework [17] beyond online learning, we can automatically apply well-known online learning algorithms, such as the EG and p-norm algorithms [12, 11], to the setting of online convex programming.
We would like to note that the algorithms presented in [19] can be derived as special cases of our algorithmic framework by setting $f(w) = \frac{1}{2}\|w\|^2$. In parallel with and independently of this work, Gordon [10] described another algorithmic framework for online convex programming that is closely related to the potential-based algorithms described by Cesa-Bianchi and Lugosi [3]. Gordon also considered the problem of defining appropriate potential functions. Our work generalizes some of the theorems in [10] while providing a somewhat simpler analysis.

Second, the usage of generalized Fenchel duality rather than the Lagrange duality given in [17] enables us to analyze boosting algorithms based on the framework. Many authors have derived unifying frameworks for boosting algorithms [13, 8, 4]. Nonetheless, our general framework and the connection between game playing and Fenchel duality underscore an interesting perspective on both online learning and boosting. We believe that this viewpoint has the potential of yielding new algorithms in both domains.

Last, despite the generality of the framework introduced in this paper, the resulting analysis is more distilled than the earlier analysis given in [17] for two reasons. (i) The usage of Lagrange duality in [17] is somewhat restricted, while the notion of generalized Fenchel duality is more appropriate for the general and broader problems we consider in this paper. (ii) The strong convexity property we employ both simplifies the analysis and enables more intuitive conditions in our theorems.

There are various possible extensions of this work that we did not pursue here due to the lack of space. For instance, our framework can naturally be used for the analysis of other settings such as repeated games (see [7, 19]). The applicability of our framework to online learning can also be extended to other prediction problems such as regression and sequence prediction. Last, we conjecture that our primal-dual view of boosting will lead to new methods for regularizing boosting algorithms, thus improving their generalization capabilities.

References

[1] J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2006.
[2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[3] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[4] M. Collins, R. E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 2002.
[5] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive aggressive algorithms. JMLR, 7, Mar 2006.
[6] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT, 1995.
[7] Y. Freund and R. E. Schapire. Game theory, on-line prediction and boosting. In COLT, 1996.
[8] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2), 2000.
[9] G. Gordon. Regret bounds for prediction problems. In COLT, 1999.
[10] G. Gordon. No-regret algorithms for online convex programs. In NIPS, 2006.
[11] A. J. Grove, N. Littlestone, and D. Schuurmans. General convergence results for linear discriminant updates. Machine Learning, 43(3), 2001.
[12] J. Kivinen and M. Warmuth. Relative loss bounds for multidimensional regression problems. Machine Learning, 45(3), 2001.
[13] L. Mason, J. Baxter, P. Bartlett, and M. Frean. Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers. MIT Press, 1999.
[14] Y. Nesterov. Primal-dual subgradient methods for convex problems. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 2005.
[15] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):1–40, 1999.
[16] S. Shalev-Shwartz and Y. Singer. Convex repeated games and Fenchel duality. Technical report, The Hebrew University, 2006.
[17] S. Shalev-Shwartz and Y. Singer. Online learning meets optimization in the dual. In COLT, 2006.
[18] J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In ESANN, April 1999.
[19] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.

3 0.85787123 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy

Author: Samuel S. Gross, Olga Russakovsky, Chuong B. Do, Serafim Batzoglou

Abstract: We consider the problem of training a conditional random field (CRF) to maximize per-label predictive accuracy on a training set, an approach motivated by the principle of empirical risk minimization. We give a gradient-based procedure for minimizing an arbitrarily accurate approximation of the empirical risk under a Hamming loss function. In experiments with both simulated and real data, our optimization procedure gives significantly better testing performance than several current approaches for CRF training, especially in situations of high label noise. 1

4 0.85583109 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment

Author: Andrea Vedaldi, Stefano Soatto

Abstract: Image Congealing (IC) is a non-parametric method for the joint alignment of a collection of images affected by systematic and unwanted deformations. The method attempts to undo the deformations by minimizing a measure of complexity of the image ensemble, such as the averaged per-pixel entropy. This enables alignment without an explicit model of the aligned dataset as required by other methods (e.g. transformed component analysis). While IC is simple and general, it may introduce degenerate solutions when the transformations allow minimizing the complexity of the data by collapsing them to a constant. Such solutions need to be explicitly removed by regularization. In this paper we propose an alternative formulation which solves this regularization issue on a more principled ground. We make the simple observation that alignment should simplify the data while preserving the useful information carried by them. Therefore we trade off fidelity and complexity of the aligned ensemble rather than minimizing the complexity alone. This eliminates the need for an explicit regularization of the transformations, and has a number of other useful properties such as noise suppression. We show the modeling and computational benefits of the approach to the some of the problems on which IC has been demonstrated. 1

5 0.85160279 65 nips-2006-Denoising and Dimension Reduction in Feature Space

Author: Mikio L. Braun, Klaus-Robert Müller, Joachim M. Buhmann

Abstract: We show that the relevant information about a classification problem in feature space is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem. Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data to perform classification. Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and to (3) de-noise in feature space in order to yield better classification results. 1

6 0.84994048 175 nips-2006-Simplifying Mixture Models through Function Approximation

7 0.84966671 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning

8 0.84914505 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections

9 0.84724289 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments

10 0.84140795 194 nips-2006-Towards a general independent subspace analysis

11 0.83877182 79 nips-2006-Fast Iterative Kernel PCA

12 0.83741319 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization

13 0.83611923 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition

14 0.83606756 203 nips-2006-Implicit Online Learning with Kernels

15 0.83553642 131 nips-2006-Mixture Regression for Covariate Shift

16 0.83260137 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation

17 0.83161438 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming

18 0.82862282 21 nips-2006-AdaBoost is Consistent

19 0.82538807 127 nips-2006-MLLE: Modified Locally Linear Embedding Using Multiple Weights

20 0.82512999 150 nips-2006-On Transductive Regression