nips nips2006 nips2006-186 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ofer Dekel, Yoram Singer
Abstract: The standard Support Vector Machine formulation does not provide its user with the ability to explicitly control the number of support vectors used to define the generated classifier. We present a modified version of SVM that allows the user to set a budget parameter B and focuses on minimizing the loss attained by the B worst-classified examples while ignoring the remaining examples. This idea can be used to derive sparse versions of both L1-SVM and L2-SVM. Technically, we obtain these new SVM variants by replacing the 1-norm in the standard SVM formulation with various interpolation-norms. We also adapt the SMO optimization algorithm to our setting and report on some preliminary experimental results. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract The standard Support Vector Machine formulation does not provide its user with the ability to explicitly control the number of support vectors used to define the generated classifier. [sent-4, score-0.291]
2 We present a modified version of SVM that allows the user to set a budget parameter B and focuses on minimizing the loss attained by the B worst-classified examples while ignoring the remaining examples. [sent-5, score-0.579]
3 Technically, we obtain these new SVM variants by replacing the 1-norm in the standard SVM formulation with various interpolation-norms. [sent-7, score-0.152]
4 Given a training set $\{(x_i, y_i)\}_{i=1}^m$ and a positive semi-definite kernel K, the SVM solution is a hypothesis of the form $h(x) = \mathrm{sign}\big(\sum_{i \in S} \alpha_i y_i K(x_i, x) + b\big)$, where S is a subset of $\{1, \ldots, m\}$. [sent-10, score-0.432]
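To make the role of the support concrete, here is a minimal sketch of how such a hypothesis could be evaluated; the Gaussian kernel and all names below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def rbf_kernel(x1, x2, gamma=1.0):
    # An example positive semi-definite kernel (Gaussian/RBF); any PSD kernel works.
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def svm_predict(x, support_x, support_y, alpha, b, kernel=rbf_kernel):
    # h(x) = sign( sum_{i in S} alpha_i * y_i * K(x_i, x) + b )
    score = sum(a * y * kernel(xi, x)
                for a, y, xi in zip(alpha, support_y, support_x))
    return np.sign(score + b)
```

Evaluating h(x) costs one kernel evaluation per support vector, which is precisely why the size of S matters.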
5 The set S defines the support of the classifier, namely, the set of examples that actively participate in the classifier’s definition. [sent-14, score-0.134]
6 The examples in this set are called support vectors, and we say that the SVM solution is sparse if the fraction of support vectors (|S|/m) is reasonably small. [sent-15, score-0.374]
7 However, in some applications, the size of the support is equally important. [sent-17, score-0.18]
8 The size of the support also naturally determines the amount of memory required to store the classifier. [sent-23, score-0.201]
9 If a classifier is intended to run in a device with a limited memory, such as a mobile telephone, there may be a physical limit on the amount of memory available to store support vectors. [sent-24, score-0.201]
10 Most modern SVM learning algorithms are active-set methods, namely, on every step of the training process, only a small set of active training examples is taken into account. [sent-26, score-0.22]
11 Specifically, we would like the ability to specify a budget parameter, B, which directly controls the number of support vectors used to define the SVM solution. [sent-30, score-0.598]
12 In this paper, we address this issue and present budget-SVM, a minor modification to the standard L1-SVM formulation that allows the user to set a budget parameter. [sent-31, score-0.553]
13 The budget-SVM optimization problem focuses only on the B worst-classified examples in the training set, ignoring all other examples. [sent-32, score-0.138]
14 We derive the budget-L2-SVM formulation by following the same technique used to derive budget-L1-SVM. [sent-35, score-0.13]
15 We begin by generalizing the L1-SVM formulation by replacing the 1-norm with an arbitrary norm. [sent-37, score-0.19]
16 Next, we turn to the K-method of norm interpolation to obtain the 1 − ∞ interpolation-norm and the 2 − ∞ interpolation-norm, and use these norms in the Any-Norm-SVM framework. [sent-39, score-0.578]
17 These norms have the property that they depend only on the absolutely-largest elements of the vector. [sent-40, score-0.202]
18 For each of these norms, we present a simple modification of the SMO algorithm [5], which efficiently solves the respective optimization problem. [sent-42, score-0.149]
19 This technique takes a two-step approach: begin by training a standard SVM classifier, perhaps obtaining a dense solution, and then approximate that solution with a sparser one. [sent-44, score-0.183]
20 We overcome this problem by taking the approach of [10] and reformulating the SVM optimization problem itself in a way that promotes sparsity. [sent-48, score-0.146]
21 Another technique used to obtain a sparse kernel-machine takes advantage of the inherent sparsity of linear programming solutions, and formalizes the kernel-machine learning problem as a linear program [11]. [sent-49, score-0.21]
22 This approach, often called LP-SVM or Sparse-SVM, has been shown to generally construct sparse solutions, but still lacks the ability to introduce an explicit budget parameter. [sent-50, score-0.521]
23 Yet another approach involves randomly selecting a subset of the training set to serve as support vectors [12]. [sent-51, score-0.18]
24 The problem of learning a kernel-machine on a budget also appears in the online-learning mistake-bound framework, and it is there that the term “learning on a budget” was coined [13]. [sent-52, score-0.43]
25 Two recent papers [14, 15] propose online kernel-methods on a budget with an accompanying theoretical mistake-bound. [sent-53, score-0.43]
26 We discuss the K-method of norm interpolation in Sec. 3 [sent-57, score-0.376]
27 and put various interpolation norms to use within the Any-Norm-SVM framework in Sec. 4. [sent-58, score-0.351]
28 2 Any-Norm SVM Let $\{(x_i, y_i)\}_{i=1}^m$ be a training set, where every $x_i$ belongs to an instance space $\mathcal{X}$ and every $y_i \in \{-1, +1\}$. [sent-65, score-0.445]
29 The L1 Support Vector Machine is defined as the solution to the following convex optimization problem: $\min_{f \in \mathcal{H},\, b \in \mathbb{R},\, \xi \ge 0}\ \frac{1}{2}\langle f, f \rangle_{\mathcal{H}} + C\,\|\xi\|_1$ s.t. [sent-67, score-0.186]
30 $\forall\, 1 \le i \le m:\ y_i(f(x_i) + b) \ge 1 - \xi_i$, (1) where ξ is a vector of m slack variables, and C is a positive constant that controls the tradeoff between the complexity of the learned classifier and how well it fits the training data. [sent-69, score-0.171]
31 L2-SVM is a variant of the optimization problem defined above, defined as follows: $\min_{f \in \mathcal{H},\, b \in \mathbb{R},\, \xi \ge 0}\ \frac{1}{2}\langle f, f \rangle_{\mathcal{H}} + C\,\|\xi\|_2^2$ s.t. the constraints of Eq. (1). [sent-72, score-0.131]
32 This formulation differs from the L1 formulation in that the 1-norm is replaced by the squared 2-norm, defined by $\|\xi\|_2^2 = \sum_{i=1}^m \xi_i^2$. [sent-75, score-0.134]
33 Formally, let $\|\cdot\|$ be an arbitrary norm defined on $\mathbb{R}^m$. [sent-77, score-0.266]
34 Recall that a norm is a real-valued operator such that for every $v \in \mathbb{R}^m$ and $\lambda \in \mathbb{R}$ it holds that $\|\lambda v\| = |\lambda|\,\|v\|$ (positive homogeneity), $\|v\| \ge 0$ with $\|v\| = 0$ if and only if $v = 0$ (positive definiteness), and that it satisfies the triangle inequality. [sent-78, score-0.43]
35 Combining the positive homogeneity property of $\|\cdot\|$ with the fact that it satisfies the triangle inequality ensures that the objective function of Eq. (2) is convex. [sent-84, score-0.194]
36 An important class of norms used extensively in our derivation is the family of p-norms, defined for every $p \ge 1$ by $\|v\|_p = \big(\sum_{j=1}^m |v_j|^p\big)^{1/p}$. [sent-86, score-0.244]
37 Every norm on $\mathbb{R}^m$ has a dual norm which is also defined on $\mathbb{R}^m$. [sent-89, score-0.634]
38 The dual norm of $\|\cdot\|$ is denoted by $\|\cdot\|^\star$ and given by $\|u\|^\star = \max_{\|v\| \le 1}\ u \cdot v$. [sent-90, score-0.444]
39 For example, Hölder's inequality [17] states that the dual of $\|\cdot\|_p$ is the norm $\|\cdot\|_q$, where $q = p/(p-1)$. [sent-92, score-0.451]
40 The dual of the 1-norm is the ∞-norm and vice versa. [sent-93, score-0.18]
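As a quick numerical sanity check of this duality (a hedged sketch; the test vectors below are arbitrary values of our choosing):

```python
import numpy as np

u = np.array([3.0, -1.0, 2.0])
v = np.array([0.5, 4.0, -2.0])

# Hölder's inequality: |u . v| <= ||u||_p * ||v||_q with q = p / (p - 1).
for p in (1.5, 2.0, 3.0):
    q = p / (p - 1.0)
    assert abs(u @ v) <= np.linalg.norm(u, p) * np.linalg.norm(v, q) + 1e-12

# The 1-norm and the inf-norm are dual to each other.
assert abs(u @ v) <= np.linalg.norm(u, 1) * np.linalg.norm(v, np.inf) + 1e-12
```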
41 Using the definition of the dual norm, we now state the dual optimization problem of Eq. (2): [sent-94, score-0.452]
42 $\max_{\alpha \ge 0}\ \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j K(x_i, x_j)$ s.t. $\sum_{i=1}^m \alpha_i y_i = 0$ and $\|\alpha\|^\star \le C$. (4) [sent-95, score-0.162]
43 The proof that Eq. (2) and Eq. (4) are indeed dual optimization problems relies on basic techniques in convex analysis [18], and is omitted due to the lack of space. [sent-102, score-0.392]
44 We also note that the solution of Eq. (2) takes the form $f(\cdot) = \sum_{i=1}^m \alpha_i y_i K(x_i, \cdot)$, and that strong duality holds regardless of the norm used. [sent-104, score-0.45]
45 Strong duality enables us to put aside the primal problem in Eq. (2) and to focus on solving the dual problem in Eq. (4). [sent-106, score-0.18]
46 Both optimization problems have convex objective functions and linear constraints. [sent-111, score-0.177]
47 We can therefore solve Eq. (4) for any kernel K and any norm using techniques similar to those used to solve the standard L1-SVM problem. [sent-116, score-0.28]
48 We now use Peetre’s K-method of norm interpolation [19] to obtain norms that promote the sparsity of the generated classifier. [sent-118, score-0.635]
49 The J-functional is defined by $\|v\|_{J(q_1,q_2,s)} = \max\{\|v\|_{q_1},\ s\,\|v\|_{q_2}\}$. (6) It is obviously a norm: the properties of a norm all follow immediately from the fact that $\|\cdot\|_{q_1}$ and $\|\cdot\|_{q_2}$ possess these properties. [sent-123, score-0.227]
50 $\|\cdot\|_{K(p_1,p_2,t)}$ is also a norm, and moreover, $\|\cdot\|_{K(p_1,p_2,t)}$ and $\|\cdot\|_{J(q_1,q_2,s)}$ are dual to each other when $t = 1/s$. [sent-124, score-0.18]
51 We use the K-method to interpolate between the 1-norm and the ∞-norm, and to interpolate between the 2-norm and the ∞-norm. [sent-126, score-0.15]
52 To gain some intuition on the behavior of these interpolation norms, first note that for any $p \ge 1$ and any $v \in \mathbb{R}^m$ it holds that $\max_i |v_i|^p \le \sum_{i=1}^m |v_i|^p \le m\,\max_i |v_i|^p$, and therefore $\|v\|_\infty \le \|v\|_p \le m^{1/p}\,\|v\|_\infty$. [sent-127, score-0.181]
53 Specifically, the 1 − ∞ interpolation norm with parameter t (with t chosen to be an integer in [1, m]) is precisely equivalent to taking the sum of the absolute values of the t absolutely-greatest elements of the vector. [sent-131, score-0.415]
54 Let π be a permutation of $\{1, \ldots, m\}$ that orders the coordinates so that $|v_{\pi(1)}| \ge \cdots \ge |v_{\pi(m)}|$. For every integer $B \in \{1, \ldots, m\}$ it holds that $\|v\|_{K(1,\infty,B)} = \sum_{i=1}^{B} |v_{\pi(i)}|$, and for any $1 \le p < \infty$, if $t = B^{1/p}$ then $\big(\sum_{i=1}^{B} |v_{\pi(i)}|^p\big)^{1/p} \le \|v\|_{K(p,\infty,t)} \le \big(\sum_{i=1}^{B} |v_{\pi(i)}|^p\big)^{1/p} + B^{1/p}\,|v_{\pi(B)}|$. [sent-143, score-0.13]
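The closed form above makes the 1-∞ interpolation norm trivial to compute; the following sketch (the function name is ours) evaluates it for an integer parameter B:

```python
import numpy as np

def k_one_inf_norm(v, B):
    # ||v||_{K(1, inf, B)} = sum of the B absolutely-largest entries of v.
    return np.sort(np.abs(v))[::-1][:B].sum()

v = np.array([0.1, -5.0, 2.0, 0.3, -1.5])
print(k_one_inf_norm(v, B=2))  # |-5.0| + |2.0| = 7.0
```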
55 Let $w + z = v$. Then $\big(\sum_{i=1}^{B} |v_{\pi(i)}|^p\big)^{1/p} = \big(\sum_{i=1}^{B} |w_{\pi(i)} + z_{\pi(i)}|^p\big)^{1/p} \le \big(\sum_{i=1}^{B} |w_{\pi(i)}|^p\big)^{1/p} + \big(\sum_{i=1}^{B} |z_{\pi(i)}|^p\big)^{1/p} \le \big(\sum_{i=1}^{B} |w_{\pi(i)}|^p\big)^{1/p} + \big(B \max_i |z_i|^p\big)^{1/p} \le \big(\sum_{i=1}^{m} |w_i|^p\big)^{1/p} + t\,\|z\|_\infty$, where the first inequality is the triangle inequality for the p-norm. [sent-146, score-0.206]
56 Since the above holds for any w and z such that $w + z = v$, it also holds for the pair which minimizes $\big(\sum_{i=1}^m |w_i|^p\big)^{1/p} + t\,\|z\|_\infty$, and which defines $\|v\|_{K(p,\infty,t)}$. [sent-147, score-0.13]
57 From the analysis above, we know that this norm takes into account only the B largest values in ξ. [sent-163, score-0.294]
58 If there are fewer than B examples for which $y_i(f(x_i) + b) < 1$, then the KKT conditions of optimality immediately imply that the number of support vectors is less than B. [sent-166, score-0.259]
59 This holds true for every instance of the Any-Norm-SVM framework, and is proven for L1-SVM in [3]. [sent-167, score-0.156]
60 Therefore, we focus on the more interesting case, where $y_i(f(x_i) + b) < 1$ for at least B examples. [sent-168, score-0.125]
61 Define $\mu_i = y_i(f(x_i) + b)$ and let π be a permutation of $\{1, \ldots, m\}$ that sorts the $\mu_i$ in non-decreasing order. [sent-177, score-0.16]
62 An immediate consequence of Thm. 2 is that the number of support vectors is upper bounded by B in the case that $\mu_{\pi(B)} \neq \mu_{\pi(B+1)}$. [sent-212, score-0.134]
63 From Sec. 3, we know that the dual of the 1-∞ interpolation-norm is the function $\max\{\|u\|_\infty,\ (1/B)\,\|u\|_1\}$. [sent-214, score-0.214]
64 Plugging this norm into Eq. (4) gives us the dual optimization problem of budget-L1-SVM. [sent-216, score-0.272]
65 SMO is an iterative process, which on every iteration selects a pair of dual variables, αk and αl , and optimizes the dual problem with respect to them, leaving all other variables fixed. [sent-220, score-0.435]
66 Assume that we start with a vector α which is a feasible point of the optimization problem in Eq. (4). [sent-222, score-0.156]
67 When restricted to the two active variables, $\alpha_k$ and $\alpha_l$, the constraint $\sum_{i=1}^m \alpha_i y_i = 0$ simplifies to $\alpha_k^{\mathrm{new}} y_k + \alpha_l^{\mathrm{new}} y_l = \alpha_k^{\mathrm{old}} y_k + \alpha_l^{\mathrm{old}} y_l$. [sent-224, score-1.123]
68 Put another way, we can slightly overload our notation and define the linear functions $\alpha_k(\lambda) = \alpha_k + \lambda y_k$ and $\alpha_l(\lambda) = \alpha_l - \lambda y_l$, (8) and find the single variable λ which maximizes our constrained optimization problem. [sent-225, score-0.125]
69 Since the constraints in Eq. (4) define a convex and bounded feasible set, their intersection with the linear equalities in Eq. (8) confines λ to a closed interval. [sent-227, score-0.221]
70 The objective function, as a function of the single variable λ, takes the form $O(\lambda) = P\lambda^2 + Q\lambda + c$, where c is a constant, $P = K(x_k, x_l) - \frac{1}{2}K(x_k, x_k) - \frac{1}{2}K(x_l, x_l)$, $Q = (y_k - f(x_k)) - (y_l - f(x_l))$, and f is the current function in the RKHS ($f \equiv \sum_{i=1}^m \alpha_i y_i K(x_i, \cdot)$). [sent-229, score-0.895]
71 The unconstrained optimum of this quadratic is attained at $\lambda = -\frac{Q}{2P}$. (9) If this unconstrained optimum falls inside the feasible interval, then it is equivalent to the constrained optimum. [sent-235, score-0.209]
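A minimal sketch of this single-pair step under our own naming conventions (K is the precomputed kernel matrix and f_vals[i] holds the current value of f(x_i); this is an illustration, not the authors' implementation):

```python
def smo_step_unconstrained(K, f_vals, y, k, l):
    # O(lambda) = P*lambda^2 + Q*lambda + const, with P and Q as defined above.
    P = K[k][l] - 0.5 * K[k][k] - 0.5 * K[l][l]  # P <= 0 for a PSD kernel
    Q = (y[k] - f_vals[k]) - (y[l] - f_vals[l])
    return -Q / (2.0 * P)  # maximizer of the concave quadratic; assumes P != 0

def clip_to_interval(lam, lo, hi):
    # A concave quadratic restricted to [lo, hi] is maximized either at the
    # unconstrained optimum or at the nearer end-point of the interval.
    return min(max(lam, lo), hi)
```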
72 The constraints in (I) translate to: $y_k = -1 \Rightarrow \lambda \le \alpha_k$; $y_k = +1 \Rightarrow \lambda \ge -\alpha_k$; $y_l = -1 \Rightarrow \lambda \ge -\alpha_l$; $y_l = +1 \Rightarrow \lambda \le \alpha_l$. (10) [sent-239, score-0.957]
73 The constraints in (II) translate to: $y_k = -1 \Rightarrow \lambda \ge \alpha_k - C$; $y_k = +1 \Rightarrow \lambda \le C - \alpha_k$; $y_l = -1 \Rightarrow \lambda \le C - \alpha_l$; $y_l = +1 \Rightarrow \lambda \ge \alpha_l - C$. (11) [sent-240, score-0.431]
74 Constraint (III) translates to: $y_k = -1 \wedge y_l = +1 \Rightarrow \lambda \ge \frac{1}{2}\big(\sum_{i=1}^m \alpha_i - BC\big)$; $y_k = +1 \wedge y_l = -1 \Rightarrow \lambda \le \frac{1}{2}\big(BC - \sum_{i=1}^m \alpha_i\big)$. (12) [sent-241, score-1.388]
75 Finding the end-points of the interval that confines λ amounts to finding the smallest upper bound and the greatest lower bound in Eqs. (10)-(12). [sent-242, score-0.155]
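Intersecting the bounds of Eqs. (10)-(12) can be done directly, as in the following sketch (again an assumption-laden illustration rather than the authors' code):

```python
import numpy as np

def budget_l1_interval(alpha, y, k, l, C, B):
    lo, hi = -np.inf, np.inf
    # (I) alpha >= 0 and (II) alpha <= C, applied to alpha_k + lam * y_k:
    if y[k] == +1:
        lo, hi = max(lo, -alpha[k]), min(hi, C - alpha[k])
    else:
        lo, hi = max(lo, alpha[k] - C), min(hi, alpha[k])
    # ... and applied to alpha_l - lam * y_l:
    if y[l] == +1:
        lo, hi = max(lo, alpha[l] - C), min(hi, alpha[l])
    else:
        lo, hi = max(lo, -alpha[l]), min(hi, C - alpha[l])
    # (III) sum_i alpha_i <= B*C, which only binds when y_k != y_l:
    s = alpha.sum()
    if y[k] == -1 and y[l] == +1:
        lo = max(lo, 0.5 * (s - B * C))
    elif y[k] == +1 and y[l] == -1:
        hi = min(hi, 0.5 * (B * C - s))
    return lo, hi
```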
76 L2-SVM on a budget Next, we use the 2-∞ interpolation-norm with parameter $t = \sqrt{B}$ in the Any-Norm-SVM framework, and obtain the budget-L2-SVM problem. [sent-245, score-0.43]
77 The support size of the budget-L2-SVM solution is strongly correlated with the parameter B although the exact relation between the two is not as clear as before. [sent-248, score-0.183]
78 Again we begin with the dual formulation defined in Eq. (4). [sent-249, score-0.288]
79 The intersection of this constraint with the other constraints defines a convex and bounded feasible set, and its intersection with the linear equalities in Eq. (8) again confines λ to an interval. [sent-251, score-0.406]
80 Constraint (I) is similar to the constraint we had in the budget-L1-SVM case, and is given in terms of λ by replacing B with $\sqrt{B}$ in Eq. (12). [sent-258, score-0.136]
81 Constraint (II) is new, and can be written in terms of λ as $\lambda^2 + \lambda\beta + \gamma \le 0$, where $\beta = \alpha_k y_k - \alpha_l y_l$ and $\gamma = \frac{1}{2}\big(\sum_{i=1}^m \alpha_i^2 - C^2\big)$. (13) [sent-260, score-0.431]
82 In addition, we still have the constraint $\alpha \ge 0$, which is common to every instance of the Any-Norm-SVM framework. [sent-262, score-0.135]
83 Overall, the end-points of the interval we are searching for are found by taking the smallest upper bound and the greatest lower bound in the equations above. [sent-265, score-0.155]
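Constraint (13) confines λ to the segment between the two roots of $\lambda^2 + \beta\lambda + \gamma = 0$; a sketch of that computation, under the same naming assumptions as before:

```python
import numpy as np

def quadratic_constraint_interval(alpha, y, k, l, C):
    # Constraint (II) of budget-L2-SVM: lam^2 + beta*lam + gamma <= 0.
    beta = alpha[k] * y[k] - alpha[l] * y[l]
    gamma = 0.5 * (np.sum(alpha ** 2) - C ** 2)
    disc = beta ** 2 - 4.0 * gamma
    if disc < 0.0:
        # Cannot happen when the current alpha is feasible (lam = 0 satisfies it).
        return 0.0, 0.0
    r = np.sqrt(disc)
    return (-beta - r) / 2.0, (-beta + r) / 2.0
```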
84 In our setting, a more natural way to speed up the learning process is to run the iterative SMO optimization algorithm for a fixed number of iterations and then to keep only the B largest weights, setting the m − B remaining weights to zero. [sent-272, score-0.129]
85 This pruning heuristic enforces the budget constraint in a brute-force way, and can be equally applied to any kernel-machine. [sent-273, score-0.785]
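A sketch of the pruning heuristic itself; since the dual weights are non-negative, sorting α directly selects the B largest:

```python
import numpy as np

def prune_to_budget(alpha, B):
    # Keep the B largest weights and zero out the remaining m - B weights.
    pruned = np.zeros_like(alpha)
    keep = np.argsort(alpha)[-B:]
    pruned[keep] = alpha[keep]
    return pruned
```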
86 However, the natural question is how much the pruning heuristic will affect the classification accuracy of the kernel-machine it is applied to. [sent-274, score-0.216]
87 If our technique indeed lives up to its theoretical promise, we expect the pruning heuristic to have little impact on classification accuracy. [sent-275, score-0.279]
88 On the other hand, if we train an L1-SVM and it so happens that the number of large weights exceeds B, then applying the pruning heuristic should have a dramatic negative effect on classification accuracy. [sent-276, score-0.253]
89 For each binary problem, we trained both the L1 and the L2 budget SVMs with B = 20, 40, . . . [sent-282, score-0.43]
90 This heuristic choice of C attempts to preserve the relative weight of the regularization term with respect to the norm term in Eq. (2). [sent-288, score-0.335]
91 The average test error for every choice of B (the budget parameter in the optimization problem) and s (the number of non-zero weights kept) is summarized in Fig. 1. [sent-295, score-0.601]
92 Note that the test errors attained by L1-SVM (without a budget parameter) and by L2-SVM are represented by the top-right corners of the respective plots. [sent-298, score-0.58]
93 However, the accuracy attained by L1-SVM and L2-SVM can be equally attained using significantly fewer support vectors. [sent-300, score-0.366]
94 6 Discussion Using the Any-Norm-SVM framework with interesting norms enabled us to introduce a budget parameter to the SVM formulation. [sent-301, score-0.632]
95 These bounds give some insight into how such an interpolation would behave. [sent-306, score-0.149]
96 Another possible norm that can be used in our framework is the Mahalanobis norm ($\|v\| = (v^\top M v)^{1/2}$, where M is a positive definite matrix), which would define a loss function that takes into account pair-wise relationships between examples. [sent-307, score-0.487]
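For completeness, a two-line sketch of the Mahalanobis norm; the matrix below is an arbitrary positive definite example of our choosing:

```python
import numpy as np

def mahalanobis_norm(v, M):
    # ||v|| = (v' M v)^(1/2), valid whenever M is positive definite.
    return float(np.sqrt(v @ M @ v))

M = np.array([[2.0, 0.5], [0.5, 1.0]])  # positive definite example
print(mahalanobis_norm(np.array([1.0, -1.0]), M))
```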
97 We are currently exploring extensions to our SMO variant that would quickly converge to the sparse solution without the help of the pruning heuristic. [sent-310, score-0.253]
98 Combining support vector and mathematical programming methods for classification. [sent-389, score-0.134]
99 In Advances in kernel methods: support vector learning, pages 307–326. [sent-390, score-0.187]
100 Tracking the best hyperplane with a simple budget perceptron. [sent-414, score-0.43]
wordName wordTfidf (topN-words)
[('budget', 0.43), ('norm', 0.227), ('yl', 0.227), ('svm', 0.223), ('yk', 0.204), ('norms', 0.202), ('smo', 0.189), ('dual', 0.18), ('interpolation', 0.149), ('support', 0.134), ('classi', 0.127), ('yi', 0.125), ('pruning', 0.108), ('heuristic', 0.108), ('rm', 0.104), ('peetre', 0.102), ('bc', 0.1), ('er', 0.098), ('xl', 0.096), ('vi', 0.093), ('constraint', 0.093), ('attained', 0.093), ('optimization', 0.092), ('interpolate', 0.075), ('xk', 0.074), ('vj', 0.07), ('formulation', 0.067), ('holds', 0.065), ('xi', 0.065), ('feasible', 0.064), ('technique', 0.063), ('burges', 0.062), ('triangle', 0.06), ('equalities', 0.059), ('unconstrained', 0.058), ('maxi', 0.058), ('respective', 0.057), ('sparsity', 0.057), ('sparse', 0.057), ('nes', 0.056), ('translate', 0.056), ('user', 0.056), ('falls', 0.054), ('interpolating', 0.054), ('promotes', 0.054), ('dekel', 0.054), ('kernel', 0.053), ('intersection', 0.053), ('homogeneity', 0.05), ('kkt', 0.05), ('proven', 0.049), ('solution', 0.049), ('rkhs', 0.047), ('training', 0.046), ('equally', 0.046), ('interval', 0.046), ('greatest', 0.045), ('convex', 0.045), ('inequality', 0.044), ('replacing', 0.043), ('active', 0.043), ('roughly', 0.043), ('variants', 0.042), ('every', 0.042), ('begin', 0.041), ('objective', 0.04), ('lkopf', 0.039), ('arbitrary', 0.039), ('constraints', 0.039), ('integer', 0.039), ('variant', 0.039), ('omitted', 0.039), ('acquired', 0.038), ('restricting', 0.038), ('proves', 0.038), ('weights', 0.037), ('concrete', 0.037), ('concludes', 0.037), ('max', 0.037), ('sch', 0.037), ('wi', 0.037), ('plugging', 0.036), ('valued', 0.036), ('mnist', 0.036), ('primal', 0.036), ('lack', 0.036), ('permutation', 0.035), ('concave', 0.035), ('ability', 0.034), ('ii', 0.034), ('know', 0.034), ('sign', 0.034), ('memory', 0.034), ('constrained', 0.033), ('optimizes', 0.033), ('store', 0.033), ('takes', 0.033), ('es', 0.032), ('annual', 0.032), ('bound', 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 186 nips-2006-Support Vector Machines on a Budget
Author: Ofer Dekel, Yoram Singer
Abstract: The standard Support Vector Machine formulation does not provide its user with the ability to explicitly control the number of support vectors used to define the generated classifier. We present a modified version of SVM that allows the user to set a budget parameter B and focuses on minimizing the loss attained by the B worst-classified examples while ignoring the remaining examples. This idea can be used to derive sparse versions of both L1-SVM and L2-SVM. Technically, we obtain these new SVM variants by replacing the 1-norm in the standard SVM formulation with various interpolation-norms. We also adapt the SMO optimization algorithm to our setting and report on some preliminary experimental results. 1
2 0.11885414 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space
Author: Wenye Li, Kin-hong Lee, Kwong-sak Leung
Abstract: Kernel-based regularized learning seeks a model in a hypothesis space by minimizing the empirical error and the model’s complexity. Based on the representer theorem, the solution consists of a linear combination of translates of a kernel. This paper investigates a generalized form of representer theorem for kernel-based learning. After mapping predefined features and translates of a kernel simultaneously onto a hypothesis space by a specific way of constructing kernels, we proposed a new algorithm by utilizing a generalized regularizer which leaves part of the space unregularized. Using a squared-loss function in calculating the empirical error, a simple convex solution is obtained which combines predefined features with translates of the kernel. Empirical evaluations have confirmed the effectiveness of the algorithm for supervised learning tasks.
3 0.11710759 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections
Author: Yonatan Amit, Shai Shalev-shwartz, Yoram Singer
Abstract: We describe and analyze an algorithmic framework for online classification where each online trial consists of multiple prediction tasks that are tied together. We tackle the problem of updating the online hypothesis by defining a projection problem in which each prediction task corresponds to a single linear constraint. These constraints are tied together through a single slack parameter. We then introduce a general method for approximately solving the problem by projecting simultaneously and independently on each constraint which corresponds to a prediction sub-problem, and then averaging the individual solutions. We show that this approach constitutes a feasible, albeit not necessarily optimal, solution for the original projection problem. We derive concrete simultaneous projection schemes and analyze them in the mistake bound model. We demonstrate the power of the proposed algorithm in experiments with online multiclass text categorization. Our experiments indicate that a combination of class-dependent features with the simultaneous projection method outperforms previously studied algorithms. 1
4 0.11672898 61 nips-2006-Convex Repeated Games and Fenchel Duality
Author: Shai Shalev-shwartz, Yoram Singer
Abstract: We describe an algorithmic framework for an abstract game which we term a convex repeated game. We show that various online learning and boosting algorithms can be all derived as special cases of our algorithmic framework. This unified view explains the properties of existing algorithms and also enables us to derive several new interesting algorithms. Our algorithmic framework stems from a connection that we build between the notions of regret in game theory and weak duality in convex optimization. 1
5 0.11614759 48 nips-2006-Branch and Bound for Semi-Supervised Support Vector Machines
Author: Olivier Chapelle, Vikas Sindhwani, S. S. Keerthi
Abstract: Semi-supervised SVMs (S3 VM) attempt to learn low-density separators by maximizing the margin over labeled and unlabeled examples. The associated optimization problem is non-convex. To examine the full potential of S3 VMs modulo local minima problems in current implementations, we apply branch and bound techniques for obtaining exact, globally optimal solutions. Empirical evidence suggests that the globally optimal solution can return excellent generalization performance in situations where other implementations fail completely. While our current implementation is only applicable to small datasets, we discuss variants that can potentially lead to practically useful algorithms. 1
6 0.11519749 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models
7 0.1148407 89 nips-2006-Handling Advertisements of Unknown Quality in Search Advertising
8 0.11341407 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
9 0.10115532 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization
10 0.099854082 130 nips-2006-Max-margin classification of incomplete data
11 0.098517917 20 nips-2006-Active learning for misspecified generalized linear models
12 0.095114157 104 nips-2006-Large-Scale Sparsified Manifold Regularization
13 0.095005579 75 nips-2006-Efficient sparse coding algorithms
14 0.092297308 193 nips-2006-Tighter PAC-Bayes Bounds
15 0.090489894 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition
16 0.087979756 65 nips-2006-Denoising and Dimension Reduction in Feature Space
17 0.087640151 50 nips-2006-Chained Boosting
18 0.086557686 156 nips-2006-Ordinal Regression by Extended Binary Classification
19 0.083047561 150 nips-2006-On Transductive Regression
20 0.082248725 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods
topicId topicWeight
[(0, -0.274), (1, 0.081), (2, -0.052), (3, 0.013), (4, -0.058), (5, 0.148), (6, -0.149), (7, 0.002), (8, -0.005), (9, 0.058), (10, -0.142), (11, -0.131), (12, -0.008), (13, -0.041), (14, 0.029), (15, 0.032), (16, 0.038), (17, -0.033), (18, -0.007), (19, 0.009), (20, 0.104), (21, 0.025), (22, 0.016), (23, 0.032), (24, 0.003), (25, -0.017), (26, -0.088), (27, 0.08), (28, 0.043), (29, -0.028), (30, 0.106), (31, -0.028), (32, 0.095), (33, -0.163), (34, -0.049), (35, 0.082), (36, 0.022), (37, 0.049), (38, -0.115), (39, -0.109), (40, -0.084), (41, 0.119), (42, -0.026), (43, -0.007), (44, 0.005), (45, 0.048), (46, -0.02), (47, 0.026), (48, -0.084), (49, 0.087)]
simIndex simValue paperId paperTitle
same-paper 1 0.96082443 186 nips-2006-Support Vector Machines on a Budget
Author: Ofer Dekel, Yoram Singer
Abstract: The standard Support Vector Machine formulation does not provide its user with the ability to explicitly control the number of support vectors used to define the generated classifier. We present a modified version of SVM that allows the user to set a budget parameter B and focuses on minimizing the loss attained by the B worst-classified examples while ignoring the remaining examples. This idea can be used to derive sparse versions of both L1-SVM and L2-SVM. Technically, we obtain these new SVM variants by replacing the 1-norm in the standard SVM formulation with various interpolation-norms. We also adapt the SMO optimization algorithm to our setting and report on some preliminary experimental results. 1
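As a concrete illustration of the budget idea, the following sketch evaluates the objective described above for a linear model: a regularizer plus the hinge loss summed over only the B worst-classified examples. The subgradient step is an illustrative stand-in for the paper's SMO-based solver, and the names (budget_objective, lam, eta) are assumptions.

```python
import numpy as np

def budget_objective(w, X, y, B, lam=1.0):
    # Hinge loss per example; only the B largest losses enter the objective.
    losses = np.maximum(0.0, 1.0 - y * (X @ w))
    worst = np.sort(losses)[-B:]
    return 0.5 * lam * (w @ w) + worst.sum()

def budget_subgradient_step(w, X, y, B, lam=1.0, eta=0.01):
    losses = np.maximum(0.0, 1.0 - y * (X @ w))
    active = np.argsort(losses)[-B:]  # the B worst-classified examples
    grad = lam * w
    for i in active:
        if losses[i] > 0.0:           # zero loss contributes no subgradient
            grad -= y[i] * X[i]
    return w - eta * grad
```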
2 0.7400806 156 nips-2006-Ordinal Regression by Extended Binary Classification
Author: Ling Li, Hsuan-tien Lin
Abstract: We present a reduction framework from ordinal regression to binary classification based on extended examples. The framework consists of three steps: extracting extended examples from the original examples, learning a binary classifier on the extended examples with any binary classification algorithm, and constructing a ranking rule from the binary classifier. A weighted 0/1 loss of the binary classifier then bounds the mislabeling cost of the ranking rule. Our framework allows us not only to design good ordinal regression algorithms based on well-tuned binary classification approaches, but also to derive new generalization bounds for ordinal regression from known bounds for binary classification. In addition, our framework unifies many existing ordinal regression algorithms, such as perceptron ranking and support vector ordinal regression. When compared empirically on benchmark data sets, some of our newly designed algorithms enjoy advantages in terms of both training speed and generalization performance over existing algorithms, which demonstrates the usefulness of our framework.
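The three-step reduction can be sketched in a few lines, assuming the common convention in which an example (x, y) with rank y in {1, ..., K} yields K-1 extended examples ((x, k), sign(y > k)), and the ranking rule counts positive predictions. Logistic regression stands in for "any binary classification algorithm"; the one-hot threshold encoding is one simple choice, not necessarily the paper's.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def extend(X, y, K):
    # Step 1: each (x, y) yields K-1 extended examples ((x, k), sign(y > k)),
    # with the threshold k appended as a one-hot code.
    Xe, ye = [], []
    for xi, yi in zip(X, y):
        for k in range(1, K):
            Xe.append(np.concatenate([xi, np.eye(K - 1)[k - 1]]))
            ye.append(1 if yi > k else -1)
    return np.array(Xe), np.array(ye)

def fit_ordinal(X, y, K):
    # Step 2: any binary learner works; logistic regression is a stand-in.
    Xe, ye = extend(X, y, K)
    return LogisticRegression(max_iter=1000).fit(Xe, ye)

def predict_rank(clf, x, K):
    # Step 3: the rank is one plus the number of thresholds predicted positive.
    queries = np.array([np.concatenate([x, np.eye(K - 1)[k - 1]])
                        for k in range(1, K)])
    return 1 + int(np.sum(clf.predict(queries) == 1))
```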
3 0.67909861 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models
Author: S. S. Keerthi, Vikas Sindhwani, Olivier Chapelle
Abstract: We consider the task of tuning hyperparameters in SVM models based on minimizing a smooth performance validation function, e.g., smoothed k-fold cross-validation error, using non-linear optimization techniques. The key computation in this approach is the gradient of the validation function with respect to the hyperparameters. We show that for large-scale problems involving a wide choice of kernel-based models and validation functions, this computation can be done very efficiently, often within just a fraction of the training time. Empirical results show that a near-optimal set of hyperparameters can be identified by our approach with very few training rounds and gradient computations.
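The outer loop this abstract describes can be sketched as gradient descent on the log-hyperparameters of kernel ridge regression, with a smooth validation error as the objective. The paper's contribution is an efficient analytic gradient; the central finite difference below is only a stand-in for it, and the model, step size, and parameterization are assumptions.

```python
import numpy as np

def krr_val_loss(log_gamma, log_lam, Xtr, ytr, Xva, yva):
    # Smooth validation objective: mean squared validation error of kernel
    # ridge regression with an RBF kernel, as a function of log-hyperparameters.
    gamma, lam = np.exp(log_gamma), np.exp(log_lam)
    K = np.exp(-gamma * np.sum((Xtr[:, None] - Xtr[None]) ** 2, axis=-1))
    alpha = np.linalg.solve(K + lam * np.eye(len(Xtr)), ytr)
    Kva = np.exp(-gamma * np.sum((Xva[:, None] - Xtr[None]) ** 2, axis=-1))
    return np.mean((Kva @ alpha - yva) ** 2)

def tune(Xtr, ytr, Xva, yva, steps=50, eta=0.1, eps=1e-4):
    theta = np.zeros(2)  # [log gamma, log lambda]
    for _ in range(steps):
        grad = np.zeros(2)
        for j in range(2):          # central finite differences stand in for
            e = eps * np.eye(2)[j]  # the paper's efficient analytic gradient
            grad[j] = (krr_val_loss(*(theta + e), Xtr, ytr, Xva, yva)
                       - krr_val_loss(*(theta - e), Xtr, ytr, Xva, yva)) / (2 * eps)
        theta -= eta * grad
    return np.exp(theta)  # (gamma, lambda)
```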
4 0.64739847 129 nips-2006-Map-Reduce for Machine Learning on Multicore
Author: Cheng-tao Chu, Sang K. Kim, Yi-an Lin, Yuanyuan Yu, Gary Bradski, Kunle Olukotun, Andrew Y. Ng
Abstract: We are at the beginning of the multicore era. Computers will have increasingly many cores (processors), but there is still no good programming framework for these architectures, and thus no simple and unified way for machine learning to take advantage of the potential speedup. In this paper, we develop a broadly applicable parallel programming method, one that is easily applied to many different learning algorithms. Our work is in distinct contrast to the tradition in machine learning of designing (often ingenious) ways to speed up a single algorithm at a time. Specifically, we show that algorithms that fit the Statistical Query model [15] can be written in a certain “summation form,” which allows them to be easily parallelized on multicore computers. We adapt Google’s map-reduce [7] paradigm to demonstrate this parallel speedup technique on a variety of learning algorithms including locally weighted linear regression (LWLR), k-means, logistic regression (LR), naive Bayes (NB), SVM, ICA, PCA, Gaussian discriminant analysis (GDA), EM, and backpropagation (NN). Our experimental results show an essentially linear speedup with an increasing number of processors.
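The "summation form" is easy to illustrate with ordinary least squares, whose sufficient statistics decompose over data chunks: each mapper computes its chunk's contribution to X^T X and X^T y, and the reducer adds them before one small solve. This is a sketch of the pattern, not the paper's implementation; the serial map below could be swapped for multiprocessing.Pool.map to spread chunks across cores.

```python
import numpy as np
from functools import reduce

def mapper(chunk):
    # Per-chunk sufficient statistics for ordinary least squares.
    Xc, yc = chunk
    return Xc.T @ Xc, Xc.T @ yc

def reducer(a, b):
    # Statistics written in summation form simply add across chunks.
    return a[0] + b[0], a[1] + b[1]

def fit_ols_mapreduce(X, y, n_chunks=4):
    chunks = list(zip(np.array_split(X, n_chunks), np.array_split(y, n_chunks)))
    XtX, Xty = reduce(reducer, map(mapper, chunks))
    return np.linalg.solve(XtX, Xty)  # assumes X has full column rank
```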
5 0.62004888 140 nips-2006-Multiple Instance Learning for Computer Aided Diagnosis
Author: Murat Dundar, Balaji Krishnapuram, R. B. Rao, Glenn M. Fung
Abstract: Many computer aided diagnosis (CAD) problems are best modelled as multiple-instance learning (MIL) problems with unbalanced data: i.e., the training data typically consists of a few positive bags and a very large number of negative instances. Existing MIL algorithms are much too computationally expensive for these datasets. We describe CH, a framework for learning a Convex Hull representation of multiple instances that is significantly faster than existing MIL algorithms. Our CH framework applies to any standard hyperplane-based learning algorithm, and for some algorithms is guaranteed to find the globally optimal solution. Experimental studies on two different CAD applications further demonstrate that the proposed algorithm significantly improves diagnostic accuracy when compared to both MIL and traditional classifiers. Although the proposed method is not designed for standard MIL problems (which have both positive and negative bags and relatively balanced datasets), comparisons against other MIL methods on benchmark problems also indicate that it is competitive with the state of the art.
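A heavily simplified sketch of the convex-hull idea: each positive bag is summarized by a single point inside the convex hull of its instances, and fitting alternates between training a hyperplane classifier on those representatives plus the negatives and moving each representative toward its bag's best-scoring region. The softmax weighting, the alternation scheme, and the use of logistic regression are assumptions for illustration, not the authors' optimization.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_ch_mil(pos_bags, X_neg, iters=10):
    # Start each bag's representative at its centroid (inside the hull).
    reps = np.array([bag.mean(axis=0) for bag in pos_bags])
    for _ in range(iters):
        X = np.vstack([reps, X_neg])
        y = np.concatenate([np.ones(len(reps)), -np.ones(len(X_neg))])
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        for i, bag in enumerate(pos_bags):
            s = clf.decision_function(bag)
            lam = np.exp(s - s.max())
            lam /= lam.sum()          # softmax weights: a convex combination
            reps[i] = lam @ bag       # stays inside the bag's convex hull
    return clf, reps
```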
6 0.60040617 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space
7 0.59811825 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition
8 0.57023001 130 nips-2006-Max-margin classification of incomplete data
9 0.56192905 104 nips-2006-Large-Scale Sparsified Manifold Regularization
10 0.54963684 75 nips-2006-Efficient sparse coding algorithms
11 0.54683161 107 nips-2006-Large Margin Multi-channel Analog-to-Digital Conversion with Applications to Neural Prosthesis
12 0.54315311 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods
13 0.54186565 20 nips-2006-Active learning for misspecified generalized linear models
14 0.5192461 150 nips-2006-On Transductive Regression
15 0.51749921 101 nips-2006-Isotonic Conditional Random Fields and Local Sentiment Flow
16 0.51354331 179 nips-2006-Sparse Representation for Signal Classification
17 0.50280929 89 nips-2006-Handling Advertisements of Unknown Quality in Search Advertising
18 0.5013845 193 nips-2006-Tighter PAC-Bayes Bounds
19 0.49135563 48 nips-2006-Branch and Bound for Semi-Supervised Support Vector Machines
20 0.48977223 50 nips-2006-Chained Boosting
topicId topicWeight
[(1, 0.634), (3, 0.011), (7, 0.048), (9, 0.038), (22, 0.041), (44, 0.045), (57, 0.059), (65, 0.045), (69, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.99590302 186 nips-2006-Support Vector Machines on a Budget
Author: Ofer Dekel, Yoram Singer
Abstract: The standard Support Vector Machine formulation does not provide its user with the ability to explicitly control the number of support vectors used to define the generated classifier. We present a modified version of SVM that allows the user to set a budget parameter B and focuses on minimizing the loss attained by the B worst-classified examples while ignoring the remaining examples. This idea can be used to derive sparse versions of both L1-SVM and L2-SVM. Technically, we obtain these new SVM variants by replacing the 1-norm in the standard SVM formulation with various interpolation-norms. We also adapt the SMO optimization algorithm to our setting and report on some preliminary experimental results. 1
2 0.9899438 46 nips-2006-Blind source separation for over-determined delayed mixtures
Author: Lars Omlor, Martin Giese
Abstract: Blind source separation, i.e., the extraction of unknown sources from a set of given signals, is relevant for many applications. A special case of this problem is dimension reduction, where the goal is to approximate a given set of signals by superpositions of a minimal number of sources. Since in this case the signals outnumber the sources, the problem is over-determined. Most popular approaches for addressing this problem are based on purely linear mixing models. However, many applications, like the modeling of acoustic signals, EMG signals, or movement trajectories, require temporal shift-invariance of the extracted components. This case has only rarely been treated in the computational literature, and for dimension reduction in particular almost no algorithms have been proposed. We present a new algorithm for the solution of this problem, which is based on a time-frequency transformation (Wigner-Ville distribution) of the generative model. We show that this algorithm outperforms classical source separation algorithms for linear mixtures, as well as a related method for mixtures with delays. In addition, applying the new algorithm to trajectories of human gaits, we demonstrate that it is suitable for the extraction of spatio-temporal components that are easier to interpret than components extracted with other classical algorithms.
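The anechoic (delayed) mixing model behind this abstract, x_i(t) = sum_j a_ij s_j(t - tau_ij), can be sketched directly: delays are applied in the frequency domain, where a shift by tau becomes multiplication by exp(-2*pi*i*f*tau), which is exactly the structure that time-frequency methods exploit. The sampling rate and function names below are illustrative; this generates delayed mixtures, it is not the separation algorithm itself.

```python
import numpy as np

def delay(signal, tau, fs=100.0):
    # Shift a signal by tau seconds via its spectrum: a delay in time is a
    # multiplication by exp(-2*pi*i*f*tau) in frequency.
    f = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return np.fft.irfft(np.fft.rfft(signal) * np.exp(-2j * np.pi * f * tau),
                        n=len(signal))

def mix_delayed(sources, A, taus, fs=100.0):
    # Anechoic model: x_i(t) = sum_j A[i, j] * s_j(t - taus[i, j]).
    n_mix, n_src = A.shape
    return np.array([sum(A[i, j] * delay(sources[j], taus[i, j], fs)
                         for j in range(n_src)) for i in range(n_mix)])
```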
3 0.98036867 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods
Author: Matthias Seeger
Abstract: We propose a highly efficient framework for kernel multi-class models with a large and structured set of classes. Kernel parameters are learned automatically by maximizing the cross-validation log likelihood, and predictive probabilities are estimated. We demonstrate our approach on large-scale text classification tasks with hierarchical class structure, achieving state-of-the-art results in an order of magnitude less time than previous work.
4 0.9751246 142 nips-2006-Mutagenetic tree Fisher kernel improves prediction of HIV drug resistance from viral genotype
Author: Tobias Sing, Niko Beerenwinkel
Abstract: Starting with the work of Jaakkola and Haussler, a variety of approaches have been proposed for coupling domain-specific generative models with statistical learning methods. The link is established by a kernel function which provides a similarity measure based inherently on the underlying model. In computational biology, the full promise of this framework has rarely been exploited, as most kernels are derived from very generic models, such as sequence profiles or hidden Markov models. Here, we introduce the MTreeMix kernel, which is based on a generative model tailored to the underlying biological mechanism. Specifically, the kernel quantifies the similarity of evolutionary escape from antiviral drug pressure between two viral sequence samples. We compare this novel kernel to a standard, evolution-agnostic amino acid encoding in the prediction of HIV drug resistance from genotype, using support vector regression. The results show significant improvements in predictive performance across 17 anti-HIV drugs. Thus, in our study, the generative-discriminative paradigm is key to bridging the gap between population genetic modeling and clinical decision making.
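The generic Fisher kernel construction that this abstract instantiates for mutagenetic tree mixtures can be sketched as follows: map each sample to the score U_x = grad_theta log p(x | theta) of a fitted generative model, and compare scores through the empirical Fisher information. The model-specific score function is assumed given; the regularized empirical Fisher matrix is a common approximation, not necessarily the authors' exact choice.

```python
import numpy as np

def fisher_kernel_matrix(samples, score_fn, reg=1e-6):
    # Rows of U are score vectors grad_theta log p(x | theta) at the fitted
    # model; K = U F^{-1} U^T with F the (regularized) empirical Fisher matrix.
    U = np.array([score_fn(x) for x in samples])
    F = (U.T @ U) / len(samples) + reg * np.eye(U.shape[1])
    return U @ np.linalg.solve(F, U.T)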
5 0.95974338 73 nips-2006-Efficient Methods for Privacy Preserving Face Detection
Author: Shai Avidan, Moshe Butman
Abstract: Bob offers a face-detection web service where clients can submit their images for analysis. Alice would very much like to use the service, but is reluctant to reveal the content of her images to Bob. Bob, for his part, is reluctant to release his face detector, as he spent a lot of time, energy and money constructing it. Secure Multi-Party computations use cryptographic tools to solve this problem without leaking any information. Unfortunately, these methods are slow to compute, so we introduce two machine learning techniques that allow the parties to solve the problem while leaking a controlled amount of information. The first is an information-bottleneck variant of AdaBoost that lets Bob find a subset of features that suffice for classifying an image patch but not for actually reconstructing it. The second is active learning, which allows Alice to construct an online classifier based on a small number of calls to Bob’s face detector. She can then use her online classifier as a fast rejector before applying a cryptographically secure classifier to the remaining image patches.
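The two-stage cascade suggested at the end of the abstract can be sketched generically: a cheap local scorer (standing in for Alice's actively learned online classifier) rejects most image patches, and only the survivors are sent to the expensive secure classifier, here an arbitrary black-box predicate. The threshold and function names are hypothetical.

```python
import numpy as np

def cascade_detect(patches, cheap_score, secure_classify, reject_below=-1.0):
    # Stage 1: the cheap local scorer rejects most patches outright.
    scores = np.array([cheap_score(p) for p in patches])
    survivors = np.where(scores >= reject_below)[0]
    # Stage 2: only survivors pay for the expensive secure classification.
    return [int(i) for i in survivors if secure_classify(patches[i])]
```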
6 0.83359635 179 nips-2006-Sparse Representation for Signal Classification
7 0.83211309 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections
8 0.82528281 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space
9 0.82290703 108 nips-2006-Large Scale Hidden Semi-Markov SVMs
10 0.7926982 138 nips-2006-Multi-Task Feature Learning
11 0.78086221 82 nips-2006-Gaussian and Wishart Hyperkernels
12 0.77737582 177 nips-2006-Sparse Kernel Orthonormalized PLS for feature extraction in large data sets
13 0.77401698 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation
14 0.77109027 140 nips-2006-Multiple Instance Learning for Computer Aided Diagnosis
15 0.76165473 111 nips-2006-Learning Motion Style Synthesis from Perceptual Observations
16 0.76006263 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
17 0.75828177 75 nips-2006-Efficient sparse coding algorithms
18 0.75684965 65 nips-2006-Denoising and Dimension Reduction in Feature Space
19 0.75062215 99 nips-2006-Information Bottleneck Optimization and Independent Component Extraction with Spiking Neurons
20 0.74900937 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors