jmlr jmlr2007 jmlr2007-90 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ryan M. Rifkin, Ross A. Lippert
Abstract: Regularization is an approach to function learning that balances fit and smoothness. In practice, we search for a function f with a finite representation f = ∑i ci φi (·). In most treatments, the ci are the primary objects of study. We consider value regularization, constructing optimization problems in which the predicted values at the training points are the primary variables, and therefore the central objects of study. Although this is a simple change, it has profound consequences. From convex conjugacy and the theory of Fenchel duality, we derive separate optimality conditions for the regularization and loss portions of the learning problem; this technique yields clean and short derivations of standard algorithms. This framework is ideally suited to studying many other phenomena at the intersection of learning theory and optimization. We obtain a value-based variant of the representer theorem, which underscores the transductive nature of regularization in reproducing kernel Hilbert spaces. We unify and extend previous results on learning kernel functions, with very simple proofs. We analyze the use of unregularized bias terms in optimization problems, and low-rank approximations to kernel matrices, obtaining new results in these areas. In summary, the combination of value regularization and Fenchel duality are valuable tools for studying the optimization problems in machine learning. Keywords: kernel machines, duality, optimization, convex analysis, kernel learning
Reference: text
sentIndex sentText sentNum sentScore
1 From convex conjugacy and the theory of Fenchel duality, we derive separate optimality conditions for the regularization and loss portions of the learning problem; this technique yields clean and short derivations of standard algorithms. [sent-12, score-0.42]
2 We obtain a value-based variant of the representer theorem, which underscores the transductive nature of regularization in reproducing kernel Hilbert spaces. [sent-14, score-0.483]
3 In summary, the combination of value regularization and Fenchel duality are valuable tools for studying the optimization problems in machine learning. [sent-17, score-0.317]
4 A common framework for solving this problem is Tikhonov regularization (Tikhonov and Arsenin, 1977): inf f ∈F { ∑ni=1 v( f (Xi ),Yi ) + (λ/2) Ω( f ) }. [sent-23, score-0.305]
5 In this situation, we rewrite (1) as inf f ∈H { ∑ni=1 v( f (Xi ),Yi ) + (λ/2) || f ||2k } (2), indicating that the regularization term is now the squared norm of the function in the RKHS. [sent-33, score-0.305]
6 Defining K as the n × n kernel matrix with Ki j = k(Xi , X j ), the regularization penalty || f ||2k becomes ct Kc, the output at training point i is given by f (Xi ) = ∑nj=1 c j k(Xi , X j ) = (Kc)i , and we can rewrite (2) as inf c∈Rn { ∑ni=1 v((Kc)i ,Yi ) + (λ/2) ct Kc }. [sent-40, score-0.278]
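A minimal numerical sketch of this finite-dimensional problem, assuming a Gaussian kernel and the square loss v( f (Xi ),Yi ) = (1/2)( f (Xi ) − Yi )2 ; for that loss (and invertible K) the minimizer of ∑i v((Kc)i ,Yi ) + (λ/2) ct Kc satisfies (K + λI)c = Y. The data, kernel width, and λ below are arbitrary illustrative choices, not values from the paper.
```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    """K[i, j] = k(X[i], Z[j]) for the Gaussian kernel."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                      # training inputs
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)   # training targets
lam = 0.1

K = gaussian_kernel(X, X)                         # n x n kernel matrix
# Square loss: minimizing sum_i 0.5*((Kc)_i - Y_i)^2 + (lam/2) c^T K c gives the
# stationarity condition K(Kc - Y) + lam*K*c = 0, i.e. (K + lam*I) c = Y for invertible K.
c = np.linalg.solve(K + lam * np.eye(len(Y)), Y)
f_train = K @ c                                   # predicted values at the training points
```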
7 The benefits of value regularization are greatly amplified by the second central idea of this work: instead of Lagrangian duality, we use Fenchel duality (Borwein and Lewis, 2000), a form of duality that is well-matched to the problems of learning theory. [sent-71, score-0.427]
8 Convexity of f and g (and some topological qualifications) ensure that the optimal objective values of (5) and (6) are equivalent, and that any optimal y and z satisfy: f (y) − yt z + f ∗ (z) = 0 and g(y) + yt z + g∗ (−z) = 0. [sent-80, score-0.42]
9 For convex optimization problems, all local optima are globally optimal, and we can formulate a complete set of optimality conditions which the primal and dual solutions will simultaneously satisfy. [sent-84, score-0.384]
10 4 Fenchel duality makes it clear that the two functions f and g, or, in our case, the loss term and the regularization, contribute individual local optimality conditions, and the total optimality conditions for the overall problem are simply the union of the individual optimality conditions. [sent-87, score-0.462]
11 As an example, for the common case of the RKHS regularizer (1/2) yt K −1 y, we will find that the primal-dual relationship is given by y = λ−1 Kz. [sent-90, score-0.292]
12 Value regularization and Fenchel duality reinforce each other’s strengths. [sent-92, score-0.292]
13 Because the kernel affects only the regularization term and not the loss term, applying Fenchel duality to value regularization yields extremely clean formulations. [sent-93, score-0.57]
14 The major contribution of this paper is the combination of value regularization and Fenchel duality in a framework that yields new insights into many of the optimization problems that arise in learning theory. [sent-94, score-0.317]
15 In Section 4, we specialize the convex analysis results to Tikhonov value regularizations consisting of the sum of a regularization and a loss term. [sent-97, score-0.505]
16 Section 6 is concerned with value regularization in the context of L1 regularization, a regularization with the explicit goal of obtaining sparsity in the finite representation. [sent-99, score-0.314]
17 The common wisdom about the representer theorem is that it guarantees that the solution to an optimization problem in an (infinite-dimensional) function space has a finite representation as an expansion of kernel functions around the training points. [sent-108, score-0.289]
18 The value-based formulation is ideal here, because the kernel appears only in the regularization term and not in the loss term. [sent-119, score-0.278]
19 (2004) considers a case where the kernel function is a linear combination of a finite set of kernels; we will see that a minor modification to their formulation yields a representer theorem and an agreement of inductive and transductive algorithms. [sent-123, score-0.409]
20 We show that unregularized bias terms arise from infimal convolutions and that the optimality conditions implied by bias terms are independent of both the particular loss function and the particular regularization. [sent-131, score-0.334]
21 For example, including an unregularized constant term b in any Tikhonov optimization problem yields a constraint on the dual variables ∑ zi = 0; this constraint does not require a particular loss function, or even that we work in an RKHS. [sent-132, score-0.31]
22 In summary, Fenchel duality is a powerful way to look at the regularization problems that arise in learning theory. [sent-149, score-0.292]
23 Definition 3 (subgradients and subdifferentials) If f : Rn → (−∞, ∞] is convex and y ∈ dom f , then φ ∈ Rn is a subgradient of f at y if it satisfies φt z ≤ f (y + z) − f (y) for all z ∈ Rn . [sent-221, score-0.352]
24 It is possible for the subdifferential of a convex function to be ∅ when y ∈ dom f (for example, f (y) = −√y has ∂ f (0) = ∅ even though f (0) = 0). [sent-226, score-0.328]
25 Being an arbitrary intersection (over y ∈ dom f ) of closed and convex sets, epi f ∗ is closed and convex. [sent-244, score-0.324]
26 Additionally, for z ∈ dom f ∗ , by (7), we have Hz, f ∗ (z) (y) = zt y − f ∗ (z) ≤ f (y) for all y, and hence epi f ⊂ ∩ z∈dom f ∗ epi Hz, f ∗ (z) , holding with equality when f is closed and convex (Theorem 4. [sent-246, score-0.649]
27 f (y) + f ∗ (z) ≥ yt z. Combining the previous results, if f is closed and convex then z ∈ ∂ f (y) ⇔ y ∈ ∂ f ∗ (z), and if z ∈ int(dom f ∗ ) the supremum in (7) is attained. [sent-257, score-0.376]
28 Given convex f , g : Rn → (−∞, ∞]: f (y) + f ∗ (z) − yt z ≥ 0 (8) and g(y) + g∗ (−z) − yt (−z) ≥ 0 (9). [sent-259, score-0.303]
29 Summing the above inequalities and minimizing, f (y) + g(y) + f ∗ (z) + g∗ (−z) ≥ 0 and inf y { f (y) + g(y)} + inf z { f ∗ (z) + g∗ (−z)} ≥ 0. [sent-260, score-0.296]
30 The topological sufficiency condition, 0 ∈ int(dom f − dom g), is stronger than dom f ∩ dom g ≠ ∅ (which is necessary for f + g to be proper) and weaker than dom f ∩ int(dom g) ≠ ∅ or int(dom f ) ∩ dom g ≠ ∅ (which is, in practice, easier to check). [sent-269, score-1.175]
31 If 0 ∈ int(dom f ∗ + dom g∗ ), then (11) is an equality and the infimum inf y { f (y) + g(y)} is attained. [sent-271, score-0.296]
32 This shows that inf z { f ∗ (z) + g∗ (−z)} + inf y { f ∗∗ (y) + g∗∗ (y)} ≥ 0 is an equality; applying Theorem 5 proves the result. [sent-273, score-0.296]
33 If 0 ∈ int(dom f − dom g) or 0 ∈ int(dom f ∗ + dom g∗ ), then inf y,z { f (y) + g(y) + f ∗ (z) + g∗ (−z)} = 0, and all minimizers y, z satisfy the complementarity equations: f (y) − yt z + f ∗ (z) = 0 and g(y) + yt z + g∗ (−z) = 0. [sent-276, score-1.08]
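A tiny numeric illustration of these complementarity equations for a pair with closed-form conjugates, f (y) = (1/2)||y||2 and g(y) = (1/2)||y − Y ||2 ; this example is not from the paper, only a sanity check of the conditions.
```python
import numpy as np

rng = np.random.default_rng(1)
Yv = rng.normal(size=5)

f  = lambda y: 0.5 * y @ y                  # f(y)  = 0.5||y||^2
fs = lambda z: 0.5 * z @ z                  # f*(z) = 0.5||z||^2
g  = lambda y: 0.5 * (y - Yv) @ (y - Yv)    # g(y)  = 0.5||y - Y||^2
gs = lambda w: 0.5 * w @ w + w @ Yv         # g*(w) = 0.5||w||^2 + w^T Y

# Minimizer of f + g and the corresponding dual variable.
y = Yv / 2.0
z = Yv / 2.0

print(f(y) - y @ z + fs(z))                 # ~0  (first complementarity equation)
print(g(y) + y @ z + gs(-z))                # ~0  (second complementarity equation)
print(f(y) + g(y) + fs(z) + gs(-z))         # ~0  (sum of primal and dual objectives)
```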
34 Additionally, if 0 ∈ int(dom f − dom g) and 0 ∈ int(dom f ∗ + dom g∗ ) then a minimizer (y, z) exists. [sent-277, score-0.496]
35 If h is ccp then h′(y) = infu h(y, u) is convex and dom h′ ≠ ∅; however, this does not guarantee that h′ is exact, closed, or that h′ > −∞ (i. [sent-290, score-0.464]
36 Proof If h′(y) − yt z + h′∗ (z) = 0 and h′(y) = h(y, u), then h(y, u) − yt z + h∗ (z, 0) = 0 and thus (z, 0) ∈ ∂h(y, u). [sent-299, score-0.42]
37 Conversely, if h(y, u) − yt z + h∗ (z, 0) = 0, then since h(y, u) ≥ h′(y), we have 0 ≥ h′(y) − yt z + h′∗ (z). [sent-300, score-0.42]
38 Proof For fixed y ∈ dom h′, define gy (y′, u) to be 0 when y′ = y and ∞ otherwise; it is straightforward to see that g∗y (z, w) = yt z when w = 0 and ∞ else, so dom g∗y = Rn × {0}m . [sent-305, score-0.68]
39 Lastly, define W = {z ∈ Rn : (z, z ) ∈ dom h∗ } = dom f ∗ − dom g∗ . [sent-321, score-0.705]
40 Definition 16 (indicator functions, support functions, and polarity) For any non-empty set C ⊂ Rn , the indicator function δC , the support function σC , and the polar of C, C◦ , are given by δC (y) = 0 if y ∈ C and ∞ if y ∉ C, σC (y) = sup z∈C zt y, and C◦ = {z ∈ Rn : ∀y ∈ C, yt z ≤ 1}. [sent-327, score-0.33]
41 Since epi σC = ∩ y∈C epi Hy,0 and arbitrary intersections of closed and convex sets are closed and convex, σC is ccp. [sent-334, score-0.409]
42 If C is closed, convex and non-empty, then δC is ccp, and Theorem 6 applies: z ∈ ∂δC (y) ⇔ y ∈ ∂σC (z) ⇔ δC (y) − yt z + σC (z) = 0 ⇔ y ∈ C and ∀y′ ∈ C, zt (y′ − y) ≤ 0. [sent-337, score-0.381]
43 • (Fenchel-Young Theorem) f (y) − yt z + f ∗ (z) ≥ 0, and z ∈ ∂ f (y) ⇔ y ∈ ∂ f ∗ (z) ⇔ f (y) − yt z + f ∗ (z) = 0. [sent-360, score-0.42]
44 • (Fenchel Duality) The minimizer of f (y) + g(y) satisfies f (y) − yt z + f ∗ (z) = 0 and g(y) + yt z + g∗ (−z) = 0 for some z which also minimizes f ∗ (z) + g∗ (−z). [sent-361, score-0.446]
45 A regularization is any closed convex function R : Rn → R≥0 , such that R(0) = 0. [sent-375, score-0.323]
46 Clearly, regularizations and loss functions are closed under addition. [sent-382, score-0.289]
47 We can also show that the conjugates of regularizations and loss functions are, respectively, regularizations and loss functions. [sent-383, score-0.462]
48 Proof A direct consequence of the definition: V ∗ (z) = sup y { yt z − ∑i vi (yi ) } = ∑i sup yi {yi zi − vi (yi )} = ∑i v∗i (zi ). [sent-394, score-0.435]
49 Theorem 22 (regression Fenchel duality) Let R : Rn → R≥0 be a regularization and V (y) = ∑n vi (yi ) for loss functions vi : R → R. [sent-396, score-0.289]
50 If 0 ∈ int(dom R − dom V ) or 0 ∈ int(dom R∗ + dom V ∗ ), then inf y,z {R(y) +V (y) + R∗ (z) +V ∗ (−z)} = 0, with all minimizers y, z satisfying the complementarity equations: R(y) − yt z + R∗ (z) = 0 (13) and vi (yi ) + yi zi + v∗i (−zi ) = 0 (14). [sent-397, score-1.052]
51 Additionally, if 0 ∈ int(dom R − dom V ) and 0 ∈ int(dom R∗ + dom V ∗ ) then a minimizer exists. [sent-398, score-0.496]
52 From the figure, it is plain that the only supporting hyperplanes are those with slopes in the interval [−1, 0], thus dom f ∗ = [−1, 0], and that f ∗ (y) = y when y ∈ [−1, 0]. [sent-414, score-0.265]
53 (Table fragment: primal regularizations R(y), R1 R2 (y), R(At y), σC (y), σC (At y), QA (y) = (1/2) yt Ay, ||y|| p , ||At y||1 , (1/2)||y||2p , paired with their dual regularizations.) [sent-426, score-0.299]
54 We start with a mathematical program in terms of the y i , using a regularization QK −1 (with K symmetric, positive definite); in this framework, we are able to state general optimality conditions and derive a very clean form of the representer theorem. [sent-430, score-0.382]
55 Equation 13 specializes to (λ/2) yt K −1 y − yt z + (λ−1 /2) zt Kz = 0, that is, (1/2)(y − λ−1 Kz)t (λK −1 y − z) = 0. [sent-433, score-0.288]
56 This demonstrates that whenever we are performing regularization in an RKHS, the optimal y values at the training points can be obtained by multiplying the optimal dual variables z by λ −1 K, for any convex loss function. [sent-435, score-0.393]
57 The optimality equation v(yi ) + yi zi + v∗ (−zi ) = 0 reduces to y + z = Y , so (13) and (14) become yi + zi = Yi and λ−1 Kz = y. [sent-445, score-0.377]
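Combining the two conditions gives a single linear solve: substituting y = λ−1 Kz into y + z = Y yields (I + λ−1 K)z = Y. The sketch below is an illustrative RLS computation under that reading, on synthetic data with an arbitrary Gaussian kernel.
```python
import numpy as np

rng = np.random.default_rng(2)
n, lam = 40, 0.5
X = rng.normal(size=(n, 2))
Y = np.sign(X[:, 0] + 0.3 * rng.normal(size=n))

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)

# RLS optimality: y_i + z_i = Y_i (loss) and y = lam^{-1} K z (regularization).
z = np.linalg.solve(np.eye(n) + K / lam, Y)
y = K @ z / lam

print(np.allclose(y + z, Y))            # loss optimality condition holds
print(np.allclose(y, K @ z / lam))      # regularization optimality condition holds
```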
58 Regularizing in an RKHS, we have the primal and dual problems: inf y∈Rn { (λ/2) yt K −1 y + ∑i (1 − yiYi )+ } and inf z∈Rn { (λ−1 /2) zt Kz + ∑i [ δ[0,1] (zi /Yi ) − zi Yi ] }. [sent-451, score-0.597]
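Since Yi ∈ {−1, +1}, the constraint δ[0,1] (zi /Yi ) says that ai = zi Yi lies in [0, 1], so the dual above is a box-constrained quadratic in a. The sketch below minimizes it by plain projected gradient descent, an illustrative solver rather than an algorithm from the paper, and then recovers the values via y = λ−1 Kz.
```python
import numpy as np

rng = np.random.default_rng(3)
n, lam = 60, 1.0
X = rng.normal(size=(n, 2))
Y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)

Q = (Y[:, None] * Y[None, :]) * K           # Q_ij = Y_i Y_j K_ij
a = np.zeros(n)                             # a_i = z_i Y_i, constrained to [0, 1]
step = lam / np.linalg.norm(Q, 2)           # conservative step size (1 / Lipschitz constant)

for _ in range(2000):
    grad = Q @ a / lam - 1.0                # gradient of (1/(2*lam)) a^T Q a - 1^T a
    a = np.clip(a - step * grad, 0.0, 1.0)  # project back onto the box [0, 1]^n

z = a * Y
y = K @ z / lam                             # optimal predicted values at the training points
```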
59 L1 Regularizations. Tikhonov regularization in a reproducing kernel Hilbert space is the most common form of regularization in practice, and has a number of mathematical properties which make it especially nice. [sent-462, score-0.381]
60 In particular, the representer theorem makes RKHS regularization essentially the only case where we can start with a problem of optimizing an infinite-dimensional function in a function space, and obtain a finite-dimensional representation. [sent-463, score-0.329]
61 As usual, we can combine primal and dual regularizers with primal and dual loss functions. [sent-475, score-0.458]
62 y∈Rn The dual to the “hard loss” v(yi ) = δ[−ε,ε] (Yi − yi ) is given by v∗ (−zi ) = sup{−yi zi − δ[Yi −ε,Yi +ε] (yi )} yi = max{−(Yi − ε)zi , −(Yi + ε)zi } = −Yi zi + ε|zi |. [sent-497, score-0.375]
63 Therefore, the dual problem is inf λ−1 QK (z) −Y t z + ε||z||1 = λ · infn QK (λ−1 z) −Y t (λ−1 z) + ε||λ−1 z||1 . [sent-498, score-0.291]
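The dual above is a quadratic plus a linear term plus an ε-weighted ℓ1 term, so a simple proximal-gradient (soft-thresholding) loop can minimize it. The sketch below is only an illustrative solver under that reading; the data and parameters are arbitrary.
```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(4)
n, lam, eps = 50, 0.5, 0.1
X = rng.uniform(-3, 3, size=(n, 1))
Y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=n)

sq = (X[:, None, 0] - X[None, :, 0]) ** 2
K = np.exp(-sq / 2.0)

# Minimize (1/(2*lam)) z^T K z - Y^T z + eps*||z||_1 by proximal gradient (ISTA).
z = np.zeros(n)
step = lam / np.linalg.norm(K, 2)
for _ in range(3000):
    z = soft_threshold(z - step * (K @ z / lam - Y), step * eps)

y = K @ z / lam      # predicted values; many z_i end up exactly zero (sparsity)
```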
64 R∗ (z1 , z2 ) = QK (z1 , z2 ), thus dom R∗ = Rn+m and R is therefore closed and exact. [sent-522, score-0.308]
65 inf y∈Rn+m { QK −1 (y) +V (y1 , . . . , yn ) } = inf y′∈Rn { QKNN −1 (y′) +V (y′) }, where y′i = yi for 1 ≤ i ≤ n and the values at the remaining points are given by yi = ∑nj=1 k(Xi , X j )c j with c = KNN −1 y′. [sent-541, score-0.388]
66 inf y∈Rn+m { QK −1 (y) +V (y1 , . . . , yn ) } = inf y′∈Rn inf y′′∈Rm { QK −1 (y′, y′′) +V (y′) } = inf y′∈Rn { QKNN −1 (y′) +V (y′) }, by Corollary 24. [sent-545, score-0.592]
67 Our variant of the representer theorem shows that any point y i which appears in the regularization but not the loss term can be determined from an expansion of the kernel function about the training points. [sent-546, score-0.475]
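Concretely: once optimal values y′ at the n training points are available, the values at m further points never enter the loss and can be filled in from the expansion with c = KNN −1 y′. The sketch below is an illustration of that step; the RLS solve is just one convenient way to obtain some optimal y′.
```python
import numpy as np

rng = np.random.default_rng(5)
n, m, lam = 30, 10, 0.2
X_train = rng.normal(size=(n, 2))
X_new = rng.normal(size=(m, 2))
Y = np.sin(X_train[:, 0])

def gauss(A, B):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / 2.0)

K_NN = gauss(X_train, X_train)
K_MN = gauss(X_new, X_train)

# Obtain optimal training values y' (here via RLS, i.e. the square loss).
z = np.linalg.solve(np.eye(n) + K_NN / lam, Y)
y_train = K_NN @ z / lam

# Value-based representer step: expansion coefficients from the training values alone,
# then values at the unlabelled points via the kernel expansion.
c = np.linalg.solve(K_NN, y_train)
y_new = K_MN @ c
```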
68 1 Semi-supervised and Transductive Learning In this section we discuss in detail the somewhat subtle relationships between semi-supervised, transductive and inductive learning, quadratic regularizers, and the representer theorem. [sent-569, score-0.304]
69 As originally formulated, manifold regularization is defined only for points on a graph, and is therefore a transductive algorithm. [sent-581, score-0.282]
70 By the standard representer theorem, solving the above optimization is equivalent to finding a function with minimal sum of its RKHS norm and the “loss function” λ2 yt Ly +V (yN ). [sent-585, score-0.369]
71 Learning the Kernel. Standard RKHS regularization involves a fixed kernel function k and kernel matrix K. [sent-610, score-0.291]
72 We consider a variant of Tikhonov regularization where the regularization term itself contains parameters u ∈ R p , which are optimized; we learn the regularization in tandem with the regression. [sent-613, score-0.471]
73 Given the independence of the loss from u, we can move the u-infimum inside inf y inf u { R(y, u) + ∑i vi (yi ) }, obtaining a new regularization R′(y) = infu R(y, u). [sent-616, score-0.58]
74 If R is exact then the optimality condition for the regularization becomes R(y, u) − yt z + R∗ (z, 0) = 0, or, equivalently, (z, 0) ∈ ∂R(y, u). [sent-619, score-0.458]
75 We now specialize to the RKHS case, and allow the entire kernel to play the role of the auxiliary variables: R(y, K) = λQK −1 (y) + F(K) for some closed convex F whose domain K is contained in the n × n symmetric positive semidefinite matrices. [sent-620, score-0.299]
76 By Lemmas 11 and 13, R′∗ (z) = F ∗ ((1/2)λ−1 zzt ) and R′ is closed and exact if ∀z ∈ Rn , zzt ∈ int(dom F ∗ ). [sent-623, score-0.297]
77 [(λ/2) yt K −1 y − yt z + (λ−1 /2) zt Kz] + [F(K) − tr((λ−1 /2) zzt K) + F ∗ ((λ−1 /2) zzt )]. Each of the two bracketed terms is non-negative, by Theorem 6; therefore they must both vanish. [sent-627, score-0.288]
78 Furthermore, if the minimizer in the inf is unique, we can use KN to determine KM given the new x points; we will have a representer theorem, and the inductive and transductive cases will be identical. [sent-635, score-0.478]
79 Under this assumption, F is ccp, and the optimality condition from the regularization term becomes y = λ−1 Kz and zt Kz = σK ⊕ (zzt ), K ∈ K ⊕ . [sent-669, score-0.326]
80 Since K does not appear in the loss optimality conditions (which depend only on z and y), we see that we can construct an optimal y, K, and z for the entire kernel learning problem where K is a convex combination of at most n + 1 “atoms” of K that solve the problem. [sent-675, score-0.305]
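A small check of the support-function condition when the feasible set is the convex hull of finitely many base kernels (an illustrative setup, not the paper's experiments): since zt Kz is linear in K, the support function over the hull equals the maximum over the atoms, so an optimal K can put all its weight on maximizing atoms.
```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 20, 4
z = rng.normal(size=n)

# Base kernels: random PSD "atoms"; the feasible set is their convex hull.
atoms = []
for _ in range(p):
    A = rng.normal(size=(n, n))
    atoms.append(A @ A.T)

vals = np.array([z @ Kj @ z for Kj in atoms])
support = vals.max()                          # support function of zz^T over the hull

# Any convex combination of the atoms gives z^T K z <= support,
# with equality exactly when all weight sits on maximizing atoms.
w = rng.dirichlet(np.ones(p))
K_mix = sum(wi * Kj for wi, Kj in zip(w, atoms))
print(z @ K_mix @ z <= support + 1e-9)                     # True
print(np.isclose(z @ atoms[vals.argmax()] @ z, support))   # True: attained at an atom
```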
81 Without loss of generality, we cast this theorem in terms of an infimal convolution in the primal regularization, though by biconjugation, this lemma applies symmetrically to the primal and dual. [sent-696, score-0.324]
82 Our primal problem is: inf y,b { R(y − b1n ) + ∑ni=1 (1 − yiYi )+ } = inf y { (R □ δ1n R )(y) + ∑ni=1 (1 − yiYi )+ }. [sent-726, score-0.382]
83 We have shown that for any regularization and loss function, allowing a free unregularized constant b in the y values induces a constraint ∑i zi = 0 in the dual problem. [sent-729, score-0.442]
84 If we change the primal regularizer from QK −1 to QK −1 □ δ1n R , we change the dual regularizer from QK to QK + δ(1n R)⊥ , or, equivalently, we add the constraint ∑i zi = 0 to the dual optimization problem. [sent-732, score-0.516]
85 There is no need to take the entire dual again—the loss function is unchanged, and in the Fenchel formulation, the regularization and loss make separate contributions. [sent-733, score-0.354]
86 Recalling that the loss optimality condition for RLS is y + z = Y , we can eliminate either y or z from the optimality condition, obtaining either a pure primal or pure dual formulation for the so-called least squares SVM (Suykens et al. [sent-737, score-0.411]
87 (Block linear systems in (y, b) and (z, b) assembled from λ−1 K, I + λ−1 K, and the all-ones vector 1; in particular, [I + λ−1 K, 1; 1t , 0][z; b] = [Y ; 0].) Note that introducing biases via infimal convolutions with RKHS regularization preserves the representer property: let [KNN , KNM ; KMN , KMM ] ∈ R(n+m)×(n+m) be symmetric positive definite, where KNN ∈ Rn×n . [sent-739, score-0.335]
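For the square loss, combining the bias conditions (y = λ−1 Kz + b·1 and ∑i zi = 0) with the loss condition y + z = Y gives one small saddle-point system in (z, b). The sketch below assembles and solves that system on synthetic data; it is a rendering of those conditions, not code from the paper.
```python
import numpy as np

rng = np.random.default_rng(7)
n, lam = 40, 0.3
X = rng.normal(size=(n, 2))
Y = X[:, 0] - 2.0 + 0.1 * rng.normal(size=n)     # data with a constant offset

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)
ones = np.ones(n)

# Conditions: (I + K/lam) z + b*1 = Y  and  1^T z = 0.
A = np.block([[np.eye(n) + K / lam, ones[:, None]],
              [ones[None, :],       np.zeros((1, 1))]])
sol = np.linalg.solve(A, np.concatenate([Y, [0.0]]))
z, b = sol[:n], sol[n]

y = K @ z / lam + b * ones          # optimal values at the training points
print(abs(z.sum()) < 1e-8)          # the bias-induced dual constraint sum_i z_i = 0
```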
88 By Corollary 29, the right-hand problem (an infimum over y ∈ Rn+1 and β ∈ R) has the regularization optimality condition [λ−1 K, e0 ; e0t , 0][z; β] = [y; 0]. Clearly, at optimality, z0 = 0 and y0 will assume a value with minimal loss: v0 (y0 ) = infy v(y) ≡ v̄. [sent-753, score-0.457]
89 2 2 QK (v + n) = In comparison (QK δ(KV0 )⊥ )(v + n) = inf QK (v + n − w) + δ(KV0 )⊥ (w) w = QK (v) + inf w∈(KV0 )⊥ {QK (n − w)} = QK (v). [sent-779, score-0.296]
90 Lemma 31 implies inf z { λ−1 QK (z) +V ∗ (−z) } = inf z { λ−1 (QK □ δ(KV0 )⊥ )(z) +V ∗ (−z) } = inf z∈Rn , z′∈(KV0 )⊥ { λ−1 QK (z − z′) +V ∗ (−z) } = inf z̃∈Rn { λ−1 QK (z̃) + inf z′∈(KV0 )⊥ V ∗ (−z̃ − z′) }. [sent-780, score-0.74]
91 The modified primal problem, the Fenchel dual of (23), is inf y { λQK −1 (y) + δKV0 (y) +V (y) } = inf y∈KV0 { λQK −1 (y) +V (y) } (24), which is identical to the original optimization problem, but with a restricted domain, and hence a higher value, in general. [sent-785, score-0.496]
92 In standard Tikhonov regularization in an RKHS, the complementarity equation from the regularization, y = λ−1 Kz, and the representer theorem tell us that we can obtain an optimizing function using c = λ−1 z. [sent-792, score-0.371]
93 By eliminating y for r in the modified primal problem (24), we see that f p is the solution of the subset of regressors method, where we construct a function using only those coefficients associated with points in N: inf y∈KV0 { λQK −1 (y) +V (y) } = inf r∈Rn { λ−1 QKNN (r) +V (λ−1 [KNN ; KMN ] r) }. [sent-797, score-0.355]
94 We first note that by (25), y and [0; α] are “K −1 orthogonal”: yt K −1 [0; α] = 0. As a consequence, the quadratic form QK −1 “distributes” over y and [0; α]: QK −1 (λ−1 Kz) = QK −1 (y + [0; α]) = QK −1 (y) + yt K −1 [0; α] + QK −1 ([0; α]) = QK −1 (y) + QK −1 ([0; α]). [sent-804, score-0.42]
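For the square loss, the restricted problem has the familiar subset-of-regressors closed form: with coefficients only on the points in N, minimizing (1/2)||K·N c − Y ||2 + (λ/2) ct KNN c gives (KN· K·N + λKNN ) c = KN· Y. The sketch below uses that closed form; the subset choice and data are arbitrary illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(8)
n, k, lam = 200, 20, 0.1
X = rng.uniform(-3, 3, size=(n, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

def gauss(A, B):
    sq = (A[:, None, 0] - B[None, :, 0]) ** 2
    return np.exp(-sq / 2.0)

idx_N = rng.choice(n, size=k, replace=False)   # the retained subset N
K_nN = gauss(X, X[idx_N])                      # n x k block (all points vs. subset)
K_NN = gauss(X[idx_N], X[idx_N])               # k x k block

# Square-loss subset of regressors: minimize 0.5*||K_nN c - Y||^2 + (lam/2) c^T K_NN c.
c = np.linalg.solve(K_nN.T @ K_nN + lam * K_NN, K_nN.T @ Y)

X_test = np.linspace(-3, 3, 50)[:, None]
y_test = gauss(X_test, X[idx_N]) @ c           # predictions use only the subset's columns
```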
95 Fenchel duality allows us to analyze the optimality conditions for regularization and loss terms separately, and combine these optimality conditions to get complete optimality conditions for machine learning schemes. [sent-820, score-0.619]
96 Value regularization makes it easy to reason about optimization problems over training and testing points simultaneously, and isolates the RKHS kernel to the regularization term. [sent-821, score-0.406]
97 We have used the framework to gain new insights into several topics in machine learning, including the representer theorem, learning a kernel matrix, biased regularizations, and low-rank approximations to kernel matrices. [sent-822, score-0.298]
98 In Sections 7 and 8, we showed that both supervised learning in an RKHS and learning the kernel matrix had a representer theorem that caused the inductive and natural transductive algorithms to make the same prediction. [sent-824, score-0.409]
99 It seems unlikely that value regularization in an RKHS will lead directly to improved algorithms (for example, a better SVM solver), because the inverse kernel matrix K −1 appearing in the primal is a computationally inconvenient object. [sent-830, score-0.31]
100 However, we might imagine designing regularizers for which value-based optimization is computationally effective, such as a transductive algorithm with a regularizer that directly imposes some non-RKHS smoothness over the input space. [sent-831, score-0.286]
wordName wordTfidf (topN-words)
[('qk', 0.406), ('dom', 0.235), ('fenchel', 0.23), ('yt', 0.21), ('rkhs', 0.19), ('kz', 0.183), ('rn', 0.178), ('regularizations', 0.162), ('regularization', 0.157), ('egularization', 0.152), ('enchel', 0.152), ('ifkin', 0.152), ('ippert', 0.152), ('uality', 0.152), ('inf', 0.148), ('ccp', 0.136), ('duality', 0.135), ('representer', 0.134), ('knn', 0.129), ('transductive', 0.125), ('kmn', 0.12), ('mal', 0.115), ('zzt', 0.112), ('yy', 0.109), ('int', 0.108), ('nystr', 0.095), ('convex', 0.093), ('optimality', 0.091), ('dual', 0.089), ('primal', 0.086), ('epi', 0.085), ('regularizer', 0.082), ('yi', 0.08), ('unregularized', 0.079), ('zt', 0.078), ('tikhonov', 0.077), ('closed', 0.073), ('kernel', 0.067), ('regressors', 0.067), ('knm', 0.064), ('zi', 0.063), ('infy', 0.061), ('kc', 0.06), ('unlabelled', 0.059), ('argyriou', 0.056), ('om', 0.054), ('loss', 0.054), ('regularizers', 0.054), ('infn', 0.054), ('lanckriet', 0.048), ('yiyi', 0.048), ('zy', 0.047), ('inductive', 0.045), ('convolutions', 0.044), ('borwein', 0.042), ('complementarity', 0.042), ('sup', 0.042), ('nystrom', 0.041), ('hinge', 0.039), ('vi', 0.039), ('specialize', 0.039), ('theorem', 0.038), ('rifkin', 0.036), ('lemma', 0.035), ('labelled', 0.034), ('hz', 0.034), ('supy', 0.034), ('infu', 0.034), ('kmm', 0.034), ('bias', 0.033), ('willams', 0.032), ('lewis', 0.032), ('girosi', 0.031), ('corollary', 0.031), ('conjugates', 0.03), ('supporting', 0.03), ('biased', 0.03), ('rm', 0.029), ('conjugate', 0.028), ('kv', 0.028), ('tomaso', 0.027), ('qa', 0.027), ('ci', 0.027), ('primary', 0.027), ('auxiliary', 0.027), ('svm', 0.026), ('nn', 0.026), ('semide', 0.026), ('cients', 0.026), ('mum', 0.026), ('minimizer', 0.026), ('expansion', 0.025), ('predicted', 0.025), ('optimization', 0.025), ('coef', 0.025), ('derivations', 0.025), ('convolution', 0.025), ('subgradient', 0.024), ('altun', 0.024), ('arn', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 90 jmlr-2007-Value Regularization and Fenchel Duality
Author: Ryan M. Rifkin, Ross A. Lippert
Abstract: Regularization is an approach to function learning that balances fit and smoothness. In practice, we search for a function f with a finite representation f = ∑i ci φi (·). In most treatments, the ci are the primary objects of study. We consider value regularization, constructing optimization problems in which the predicted values at the training points are the primary variables, and therefore the central objects of study. Although this is a simple change, it has profound consequences. From convex conjugacy and the theory of Fenchel duality, we derive separate optimality conditions for the regularization and loss portions of the learning problem; this technique yields clean and short derivations of standard algorithms. This framework is ideally suited to studying many other phenomena at the intersection of learning theory and optimization. We obtain a value-based variant of the representer theorem, which underscores the transductive nature of regularization in reproducing kernel Hilbert spaces. We unify and extend previous results on learning kernel functions, with very simple proofs. We analyze the use of unregularized bias terms in optimization problems, and low-rank approximations to kernel matrices, obtaining new results in these areas. In summary, the combination of value regularization and Fenchel duality are valuable tools for studying the optimization problems in machine learning. Keywords: kernel machines, duality, optimization, convex analysis, kernel learning
2 0.13254808 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
Author: Yiming Ying, Ding-Xuan Zhou
Abstract: Gaussian kernels with flexible variances provide a rich family of Mercer kernels for learning algorithms. We show that the union of the unit balls of reproducing kernel Hilbert spaces generated by Gaussian kernels with flexible variances is a uniform Glivenko-Cantelli (uGC) class. This result confirms a conjecture concerning learnability of Gaussian kernels and verifies the uniform convergence of many learning algorithms involving Gaussians with changing variances. Rademacher averages and empirical covering numbers are used to estimate sample errors of multi-kernel regularization schemes associated with general loss functions. It is then shown that the regularization error associated with the least square loss and the Gaussian kernels can be greatly improved when flexible variances are allowed. Finally, for regularization schemes generated by Gaussian kernels with flexible variances we present explicit learning rates for regression with least square loss and classification with hinge loss. Keywords: Gaussian kernel, flexible variances, learning theory, Glivenko-Cantelli class, regularization scheme, empirical covering number
Author: Miroslav Dudík, Steven J. Phillips, Robert E. Schapire
Abstract: We present a unified and complete account of maximum entropy density estimation subject to constraints represented by convex potential functions or, alternatively, by convex regularization. We provide fully general performance guarantees and an algorithm with a complete convergence proof. As special cases, we easily derive performance guarantees for many known regularization types, including ℓ1, ℓ2, ℓ2², and ℓ1 + ℓ2² style regularization. We propose an algorithm solving a large and general subclass of generalized maximum entropy problems, including all discussed in the paper, and prove its convergence. Our approach generalizes and unifies techniques based on information geometry and Bregman divergences as well as those based more directly on compactness. Our work is motivated by a novel application of maximum entropy to species distribution modeling, an important problem in conservation biology and ecology. In a set of experiments on real-world data, we demonstrate the utility of maximum entropy in this setting. We explore effects of different feature types, sample sizes, and regularization levels on the performance of maxent, and discuss interpretability of the resulting models. Keywords: maximum entropy, density estimation, regularization, iterative scaling, species distribution modeling
4 0.090508074 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression
Author: Kwangmoo Koh, Seung-Jean Kim, Stephen Boyd
Abstract: Logistic regression with ℓ1 regularization has been proposed as a promising method for feature selection in classification problems. In this paper we describe an efficient interior-point method for solving large-scale ℓ1-regularized logistic regression problems. Small problems with up to a thousand or so features and examples can be solved in seconds on a PC; medium sized problems, with tens of thousands of features and examples, can be solved in tens of seconds (assuming some sparsity in the data). A variation on the basic method, that uses a preconditioned conjugate gradient method to compute the search step, can solve very large problems, with a million features and examples (e.g., the 20 Newsgroups data set), in a few minutes, on a PC. Using warm-start techniques, a good approximation of the entire regularization path can be computed much more efficiently than by solving a family of problems independently. Keywords: logistic regression, feature selection, ℓ1 regularization, regularization path, interior-point methods.
5 0.085612774 61 jmlr-2007-On the Consistency of Multiclass Classification Methods (Special Topic on the Conference on Learning Theory 2005)
Author: Ambuj Tewari, Peter L. Bartlett
Abstract: Binary classification is a well studied special case of the classification problem. Statistical properties of binary classifiers, such as consistency, have been investigated in a variety of settings. Binary classification methods can be generalized in many ways to handle multiple classes. It turns out that one can lose consistency in generalizing a binary classification method to deal with multiple classes. We study a rich family of multiclass methods and provide a necessary and sufficient condition for their consistency. We illustrate our approach by applying it to some multiclass methods proposed in the literature. Keywords: multiclass classification, consistency, Bayes risk
6 0.083812289 9 jmlr-2007-AdaBoost is Consistent
7 0.07939513 78 jmlr-2007-Statistical Consistency of Kernel Canonical Correlation Analysis
8 0.07799679 8 jmlr-2007-A Unified Continuous Optimization Framework for Center-Based Clustering Methods
9 0.077949516 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss
10 0.076906353 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
11 0.070843987 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
12 0.067707263 88 jmlr-2007-Unlabeled Compression Schemes for Maximum Classes
13 0.065239355 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis
14 0.062402304 18 jmlr-2007-Characterizing the Function Space for Bayesian Kernel Models
15 0.054225191 75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results
16 0.048237197 28 jmlr-2007-Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data
17 0.04663831 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression
18 0.043278608 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
19 0.043245897 71 jmlr-2007-Refinable Kernels
20 0.043047115 43 jmlr-2007-Integrating Naïve Bayes and FOIL
topicId topicWeight
[(0, 0.26), (1, -0.224), (2, 0.139), (3, 0.046), (4, 0.001), (5, 0.053), (6, -0.192), (7, 0.093), (8, -0.042), (9, -0.05), (10, 0.234), (11, 0.101), (12, -0.138), (13, 0.074), (14, -0.068), (15, 0.201), (16, -0.068), (17, -0.063), (18, -0.188), (19, 0.061), (20, -0.097), (21, -0.08), (22, 0.079), (23, 0.025), (24, -0.018), (25, -0.083), (26, -0.069), (27, -0.107), (28, -0.019), (29, 0.002), (30, -0.061), (31, 0.021), (32, 0.05), (33, 0.059), (34, -0.044), (35, 0.062), (36, 0.132), (37, 0.056), (38, -0.004), (39, -0.05), (40, 0.081), (41, 0.096), (42, -0.067), (43, 0.11), (44, -0.026), (45, -0.054), (46, -0.148), (47, -0.089), (48, 0.002), (49, -0.036)]
simIndex simValue paperId paperTitle
same-paper 1 0.95771962 90 jmlr-2007-Value Regularization and Fenchel Duality
Author: Ryan M. Rifkin, Ross A. Lippert
Abstract: Regularization is an approach to function learning that balances fit and smoothness. In practice, we search for a function f with a finite representation f = ∑i ci φi (·). In most treatments, the ci are the primary objects of study. We consider value regularization, constructing optimization problems in which the predicted values at the training points are the primary variables, and therefore the central objects of study. Although this is a simple change, it has profound consequences. From convex conjugacy and the theory of Fenchel duality, we derive separate optimality conditions for the regularization and loss portions of the learning problem; this technique yields clean and short derivations of standard algorithms. This framework is ideally suited to studying many other phenomena at the intersection of learning theory and optimization. We obtain a value-based variant of the representer theorem, which underscores the transductive nature of regularization in reproducing kernel Hilbert spaces. We unify and extend previous results on learning kernel functions, with very simple proofs. We analyze the use of unregularized bias terms in optimization problems, and low-rank approximations to kernel matrices, obtaining new results in these areas. In summary, the combination of value regularization and Fenchel duality are valuable tools for studying the optimization problems in machine learning. Keywords: kernel machines, duality, optimization, convex analysis, kernel learning
Author: Miroslav Dudík, Steven J. Phillips, Robert E. Schapire
Abstract: We present a unified and complete account of maximum entropy density estimation subject to constraints represented by convex potential functions or, alternatively, by convex regularization. We provide fully general performance guarantees and an algorithm with a complete convergence proof. As special cases, we easily derive performance guarantees for many known regularization types, including ℓ1, ℓ2, ℓ2², and ℓ1 + ℓ2² style regularization. We propose an algorithm solving a large and general subclass of generalized maximum entropy problems, including all discussed in the paper, and prove its convergence. Our approach generalizes and unifies techniques based on information geometry and Bregman divergences as well as those based more directly on compactness. Our work is motivated by a novel application of maximum entropy to species distribution modeling, an important problem in conservation biology and ecology. In a set of experiments on real-world data, we demonstrate the utility of maximum entropy in this setting. We explore effects of different feature types, sample sizes, and regularization levels on the performance of maxent, and discuss interpretability of the resulting models. Keywords: maximum entropy, density estimation, regularization, iterative scaling, species distribution modeling
3 0.46379796 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
Author: Yiming Ying, Ding-Xuan Zhou
Abstract: Gaussian kernels with flexible variances provide a rich family of Mercer kernels for learning algorithms. We show that the union of the unit balls of reproducing kernel Hilbert spaces generated by Gaussian kernels with flexible variances is a uniform Glivenko-Cantelli (uGC) class. This result confirms a conjecture concerning learnability of Gaussian kernels and verifies the uniform convergence of many learning algorithms involving Gaussians with changing variances. Rademacher averages and empirical covering numbers are used to estimate sample errors of multi-kernel regularization schemes associated with general loss functions. It is then shown that the regularization error associated with the least square loss and the Gaussian kernels can be greatly improved when flexible variances are allowed. Finally, for regularization schemes generated by Gaussian kernels with flexible variances we present explicit learning rates for regression with least square loss and classification with hinge loss. Keywords: Gaussian kernel, flexible variances, learning theory, Glivenko-Cantelli class, regularization scheme, empirical covering number
4 0.44448942 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression
Author: Kwangmoo Koh, Seung-Jean Kim, Stephen Boyd
Abstract: Logistic regression with ℓ1 regularization has been proposed as a promising method for feature selection in classification problems. In this paper we describe an efficient interior-point method for solving large-scale ℓ1-regularized logistic regression problems. Small problems with up to a thousand or so features and examples can be solved in seconds on a PC; medium sized problems, with tens of thousands of features and examples, can be solved in tens of seconds (assuming some sparsity in the data). A variation on the basic method, that uses a preconditioned conjugate gradient method to compute the search step, can solve very large problems, with a million features and examples (e.g., the 20 Newsgroups data set), in a few minutes, on a PC. Using warm-start techniques, a good approximation of the entire regularization path can be computed much more efficiently than by solving a family of problems independently. Keywords: logistic regression, feature selection, ℓ1 regularization, regularization path, interior-point methods.
Author: Ambuj Tewari, Peter L. Bartlett
Abstract: Binary classification is a well studied special case of the classification problem. Statistical properties of binary classifiers, such as consistency, have been investigated in a variety of settings. Binary classification methods can be generalized in many ways to handle multiple classes. It turns out that one can lose consistency in generalizing a binary classification method to deal with multiple classes. We study a rich family of multiclass methods and provide a necessary and sufficient condition for their consistency. We illustrate our approach by applying it to some multiclass methods proposed in the literature. Keywords: multiclass classification, consistency, Bayes risk
6 0.3876631 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
7 0.37441599 8 jmlr-2007-A Unified Continuous Optimization Framework for Center-Based Clustering Methods
8 0.36424533 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis
9 0.3347491 75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results
10 0.33075055 88 jmlr-2007-Unlabeled Compression Schemes for Maximum Classes
11 0.32628551 78 jmlr-2007-Statistical Consistency of Kernel Canonical Correlation Analysis
12 0.3217147 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss
13 0.2794759 9 jmlr-2007-AdaBoost is Consistent
14 0.25511065 18 jmlr-2007-Characterizing the Function Space for Bayesian Kernel Models
15 0.24115555 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
16 0.23481274 28 jmlr-2007-Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data
18 0.21587346 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression
19 0.20009822 43 jmlr-2007-Integrating Naïve Bayes and FOIL
20 0.19171567 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
topicId topicWeight
[(4, 0.014), (8, 0.015), (10, 0.022), (12, 0.023), (15, 0.012), (28, 0.059), (40, 0.564), (48, 0.022), (60, 0.051), (85, 0.034), (98, 0.076)]
simIndex simValue paperId paperTitle
1 0.97318387 42 jmlr-2007-Infinitely Imbalanced Logistic Regression
Author: Art B. Owen
Abstract: In binary classification problems it is common for the two classes to be imbalanced: one case is very rare compared to the other. In this paper we consider the infinitely imbalanced case where one class has a finite sample size and the other class’s sample size grows without bound. For logistic regression, the infinitely imbalanced case often has a useful solution. Under mild conditions, the intercept diverges as expected, but the rest of the coefficient vector approaches a non trivial and useful limit. That limit can be expressed in terms of exponential tilting and is the minimum of a convex objective function. The limiting form of logistic regression suggests a computational shortcut for fraud detection problems. Keywords: classification, drug discovery, fraud detection, rare events, unbalanced data
same-paper 2 0.91043794 90 jmlr-2007-Value Regularization and Fenchel Duality
Author: Ryan M. Rifkin, Ross A. Lippert
Abstract: Regularization is an approach to function learning that balances fit and smoothness. In practice, we search for a function f with a finite representation f = ∑i ci φi (·). In most treatments, the ci are the primary objects of study. We consider value regularization, constructing optimization problems in which the predicted values at the training points are the primary variables, and therefore the central objects of study. Although this is a simple change, it has profound consequences. From convex conjugacy and the theory of Fenchel duality, we derive separate optimality conditions for the regularization and loss portions of the learning problem; this technique yields clean and short derivations of standard algorithms. This framework is ideally suited to studying many other phenomena at the intersection of learning theory and optimization. We obtain a value-based variant of the representer theorem, which underscores the transductive nature of regularization in reproducing kernel Hilbert spaces. We unify and extend previous results on learning kernel functions, with very simple proofs. We analyze the use of unregularized bias terms in optimization problems, and low-rank approximations to kernel matrices, obtaining new results in these areas. In summary, the combination of value regularization and Fenchel duality are valuable tools for studying the optimization problems in machine learning. Keywords: kernel machines, duality, optimization, convex analysis, kernel learning
Author: Miroslav Dudík, Steven J. Phillips, Robert E. Schapire
Abstract: We present a unified and complete account of maximum entropy density estimation subject to constraints represented by convex potential functions or, alternatively, by convex regularization. We provide fully general performance guarantees and an algorithm with a complete convergence proof. As special cases, we easily derive performance guarantees for many known regularization types, including ℓ1, ℓ2, ℓ2², and ℓ1 + ℓ2² style regularization. We propose an algorithm solving a large and general subclass of generalized maximum entropy problems, including all discussed in the paper, and prove its convergence. Our approach generalizes and unifies techniques based on information geometry and Bregman divergences as well as those based more directly on compactness. Our work is motivated by a novel application of maximum entropy to species distribution modeling, an important problem in conservation biology and ecology. In a set of experiments on real-world data, we demonstrate the utility of maximum entropy in this setting. We explore effects of different feature types, sample sizes, and regularization levels on the performance of maxent, and discuss interpretability of the resulting models. Keywords: maximum entropy, density estimation, regularization, iterative scaling, species distribution modeling
4 0.50262964 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression
Author: Kwangmoo Koh, Seung-Jean Kim, Stephen Boyd
Abstract: Logistic regression with ℓ1 regularization has been proposed as a promising method for feature selection in classification problems. In this paper we describe an efficient interior-point method for solving large-scale ℓ1-regularized logistic regression problems. Small problems with up to a thousand or so features and examples can be solved in seconds on a PC; medium sized problems, with tens of thousands of features and examples, can be solved in tens of seconds (assuming some sparsity in the data). A variation on the basic method, that uses a preconditioned conjugate gradient method to compute the search step, can solve very large problems, with a million features and examples (e.g., the 20 Newsgroups data set), in a few minutes, on a PC. Using warm-start techniques, a good approximation of the entire regularization path can be computed much more efficiently than by solving a family of problems independently. Keywords: logistic regression, feature selection, ℓ1 regularization, regularization path, interior-point methods.
5 0.45748222 77 jmlr-2007-Stagewise Lasso
Author: Peng Zhao, Bin Yu
Abstract: Many statistical machine learning algorithms minimize either an empirical loss function as in AdaBoost, or a penalized empirical loss as in Lasso or SVM. A single regularization tuning parameter controls the trade-off between fidelity to the data and generalizability, or equivalently between bias and variance. When this tuning parameter changes, a regularization “path” of solutions to the minimization problem is generated, and the whole path is needed to select a tuning parameter to optimize the prediction or interpretation performance. Algorithms such as homotopy-Lasso or LARS-Lasso and Forward Stagewise Fitting (FSF) (aka e-Boosting) are of great interest because of their resulted sparse models for interpretation in addition to prediction. In this paper, we propose the BLasso algorithm that ties the FSF (e-Boosting) algorithm with the Lasso method that minimizes the L1 penalized L2 loss. BLasso is derived as a coordinate descent method with a fixed stepsize applied to the general Lasso loss function (L1 penalized convex loss). It consists of both a forward step and a backward step. The forward step is similar to e-Boosting or FSF, but the backward step is new and revises the FSF (or e-Boosting) path to approximate the Lasso path. In the cases of a finite number of base learners and a bounded Hessian of the loss function, the BLasso path is shown to converge to the Lasso path when the stepsize goes to zero. For cases with a larger number of base learners than the sample size and when the true model is sparse, our simulations indicate that the BLasso model estimates are sparser than those from FSF with comparable or slightly better prediction performance, and that the the discrete stepsize of BLasso and FSF has an additional regularization effect in terms of prediction and sparsity. Moreover, we introduce the Generalized BLasso algorithm to minimize a general convex loss penalized by a general convex function. Since the (Generalized) BLasso relies only on differences
6 0.45649391 71 jmlr-2007-Refinable Kernels
7 0.43190211 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data
8 0.42180195 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
10 0.41536748 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression
11 0.407332 3 jmlr-2007-A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation
12 0.40473861 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss
13 0.40133873 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification
14 0.40083057 14 jmlr-2007-Behavioral Shaping for Geometric Concepts
15 0.39693564 39 jmlr-2007-Handling Missing Values when Applying Classification Models
16 0.38775912 88 jmlr-2007-Unlabeled Compression Schemes for Maximum Classes
17 0.38503233 13 jmlr-2007-Bayesian Quadratic Discriminant Analysis
18 0.38196152 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction inl1Using Cauchy Random Projections
19 0.3791045 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
20 0.37722838 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds