nips nips2008 nips2008-163 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Richard Nock, Frank Nielsen
Abstract: Bartlett et al (2006) recently proved that a ground condition for convex surrogates, classification calibration, ties up the minimization of the surrogates and classification risks, and left as an important problem the algorithmic questions about the minimization of these surrogates. In this paper, we propose an algorithm which provably minimizes any classification calibrated surrogate strictly convex and differentiable — a set whose losses span the exponential, logistic and squared losses —, with boosting-type guaranteed convergence rates under a weak learning assumption. A particular subclass of these surrogates, that we call balanced convex surrogates, has a key rationale that ties it to maximum likelihood estimation, zerosum games and the set of losses that satisfy some of the most common requirements for losses in supervised learning. We report experiments on more than 50 readily available domains of 11 flavors of the algorithm, that shed light on new surrogates, and the potential of data dependent strategies to tune surrogates. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Bartlett et al (2006) recently proved that a ground condition for convex surrogates, classification calibration, ties up the minimization of the surrogates and classification risks, and left as an important problem the algorithmic questions about the minimization of these surrogates. [sent-6, score-0.453]
2 In this paper, we propose an algorithm which provably minimizes any classification calibrated surrogate strictly convex and differentiable — a set whose losses span the exponential, logistic and squared losses —, with boosting-type guaranteed convergence rates under a weak learning assumption. [sent-7, score-0.722]
3 A particular subclass of these surrogates, that we call balanced convex surrogates, has a key rationale that ties it to maximum likelihood estimation, zerosum games and the set of losses that satisfy some of the most common requirements for losses in supervised learning. [sent-8, score-0.611]
4 Surrogates play fundamental roles in some of the most successful supervised learning algorithms, including AdaBoost, additive logistic regression, decision tree induction, Support Vector Machines [13, 7, 10]. [sent-11, score-0.088]
5 In this paper, we address and solve this problem for all strictly convex differentiable CCS, a set referred to as strictly convex surrogates (SCS). [sent-14, score-0.571]
6 We propose a minimization algorithm, ULS, which outputs linear separators, with two key properties: it provably achieves the optimum of the surrogate, and meets Boosting-type convergence under a weak learning assumption. [sent-15, score-0.134]
7 There is more, as we show that SCS strictly contains another set of surrogates of important rationale, balanced convex surrogates (BCS). [sent-16, score-0.635]
8 This set, which contains the logistic and squared losses but not the exponential loss, coincides with the set of losses satisfying three common requirements about losses in learning. [sent-17, score-0.603]
9 In fact, BCS spans a large subset of the expected losses for zero-sum games of [9], by which ULS may also be viewed as an efficient learner for decision making (in simple environments, though). [sent-18, score-0.224]
10 Section 2 gives preliminary definitions; section 3 presents surrogate losses and risks; sections 4 and 5 present ULS and its properties; section 6 discusses experiments with ULS; section 7 concludes. [sent-19, score-0.394]
11 Following a convention of [6], we compute to which extent the outputs of H and the labels in S disagree, ε(S, H), by summing a loss which quantifies pointwise disagreements: ε(S, H) = Σ_i ℓ(c_i, H(o_i)) (2). [sent-37, score-0.083]
12 The fundamental loss is the 0/1 loss, ℓ0/1(c, H) (to ease readability, the second argument is written H instead of H(o)). [sent-39, score-0.083]
13 (3) The following notations are introduced in (3): for a clear distinction of the output of H, we put in index to ℓ and ε an indication of the loss' domain of parameters: R, meaning it is actually some O ⊆ R, or [0, 1]. [sent-43, score-0.064]
14 The exponent to ℓ gives the indication of the loss name. [sent-44, score-0.083]
15 Finally, 1_π is the indicator variable that takes value 1 iff predicate π is true, and 0 otherwise; σ : R → {−1, +1} is +1 iff x ≥ 0 and −1 otherwise; τ : [0, 1] → {0, 1} is 1 iff x ≥ 1/2, and 0 otherwise. [sent-45, score-0.204]
16 Both losses ℓ_R and ℓ_[0,1] are defined simultaneously via popular transforms on H, such as the logit. [sent-46, score-0.238]
17 In supervised learning, the objective is to carry out the minimization of the expectation of the 0/1 loss in generalization, the so-called true risk. [sent-50, score-0.17]
18 Very often however, this task can be relaxed to the minimization of the empirical risk of H, which is (2) with the 0/1 loss [6]: ε0/1(S, H) (4). [sent-51, score-0.184]
19 The main classifiers we investigate are linear separators (LS). [sent-54, score-0.062]
20 In this case, H(o) = Σ_t α_t h_t(o) for features h_t with im(h_t) ⊆ R and leveraging coefficients α_t ∈ R. [sent-55, score-0.15]
21 3 Losses and surrogates. A serious alternative to directly minimizing (4) is to rather focus on the minimization of a surrogate risk [3]. [sent-56, score-0.527]
22 This is a function ε(S, H) as in (2) whose surrogate loss ℓ(c, H(o)) satisfies ℓ0/1(c, H(o)) ≤ ℓ(c, H(o)). [sent-57, score-0.188]
23 Four are particularly important in supervised learning, defined via the following surrogate losses: [sent-58, score-0.128]
24 ℓ_R^exp(y*, H) = exp(−y*H) (5), ℓ_R^log(y*, H) = log(1 + exp(−y*H)) (6), ℓ_R^sqr(y*, H) = (1 − y*H)² (7), ℓ_R^hinge(y*, H) = max{0, 1 − y*H} (8). [sent-59, score-0.058]
25 (5) is the exponential loss, (6) is the logistic loss, (7) is the squared loss, and (8) is the hinge loss. [sent-63, score-0.177]
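As a concrete illustration (a minimal numpy sketch, not code from the paper; the array convention feats[i, t] = h_t(o_i) is ours), the four surrogates can be evaluated on the margins y*H(o) of a linear separator and summed as in (2):

```python
import numpy as np

# Margins y* H(o) for a linear separator H(o) = sum_t alpha_t h_t(o),
# with labels y* in {-1, +1}; feats[i, t] = h_t(o_i).
def margins(feats, alpha, y):
    return y * (feats @ alpha)

# The four surrogate losses (5)-(8), evaluated pointwise on a margin z = y*H.
def exp_loss(z):      return np.exp(-z)                 # (5)
def logistic_loss(z): return np.log1p(np.exp(-z))       # (6)
def squared_loss(z):  return (1.0 - z) ** 2             # (7)
def hinge_loss(z):    return np.maximum(0.0, 1.0 - z)   # (8)

# Each surrogate risk is the sum (2) of the pointwise losses.
def surrogate_risk(loss, feats, alpha, y):
    return loss(margins(feats, alpha, y)).sum()
```

Of these, only the hinge loss (8) fails the strict convexity and differentiability required below, which is why it falls outside SCL.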
26 Definition 1 A Strictly Convex Loss (SCL) is a strictly convex function ψ : X → R+ differentiable on int(X) with X symmetric interval with respect to zero, s. [sent-64, score-0.195]
27 Table 1: permissible functions, their corresponding BCLs and the matching [0, 1] predictions (table flattened by extraction; recoverable entries include φ_B(x) = x(1 − x), im(H) ∈ {R, [−1, 1]}, the [0, 1] prediction exp(H)/(1 + exp(H)), and the BCLs log(1 + exp(−y*H)) and (1 − y*H)²). [sent-71, score-0.139]
28 Any surrogate risk built from a SCL is called a Strictly Convex Surrogate (SCS). [sent-74, score-0.142]
29 From Theorem 4 in [3], it follows that SCL contains all classification calibrated losses (CCL) that are strictly convex and differentiable, such as (5), (6), (7). [sent-75, score-0.386]
30 A function φ : [0, 1] → R+ is called permissible iff it is differentiable on (0, 1), strictly concave, and symmetric about x = 1/2. [sent-82, score-0.344]
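For illustration, a small numeric check of these properties (assuming the symmetry is indeed about 1/2); φ_B comes from Table 1, while the entropy-like and Matsushita-type choices below are standard examples of permissible functions whose exact notation in the garbled table we cannot confirm:

```python
import numpy as np

# Candidate permissible functions on [0, 1] (differentiable on (0,1),
# strictly concave, symmetric about 1/2):
phi_B = lambda x: x * (1.0 - x)                                   # from Table 1
phi_Q = lambda x: -x * np.log(x) - (1.0 - x) * np.log(1.0 - x)    # bit entropy (assumed)
phi_M = lambda x: np.sqrt(x * (1.0 - x))                          # Matsushita-type (assumed)

def looks_permissible(phi, n=1001, eps=1e-4):
    x = np.linspace(eps, 1.0 - eps, n)
    symmetric = np.allclose(phi(x), phi(1.0 - x))
    # strict concavity <=> second finite difference strictly negative
    d2 = phi(x[:-2]) - 2.0 * phi(x[1:-1]) + phi(x[2:])
    return symmetric and bool(np.all(d2 < 0.0))

print([looks_permissible(f) for f in (phi_B, phi_Q, phi_M)])  # [True, True, True]
```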
31 the linear hinge loss [8], a generalization of (8) for which x = y ∗ H − 1. [sent-100, score-0.119]
32 Thus, while hinge loss is not a BCL, it is the limit behavior of any BCL (see Figure 1). [sent-101, score-0.119]
33 Table 1 (left column) gives some examples of permissible φ. [sent-102, score-0.139]
34 Table 1 also gives the expressions of Fφ along with the im(H) = O ⊆ R allowed by the BCL, for the corresponding permissible function. [sent-104, score-0.139]
35 It is interesting to note the constraint on im(H) for the squared loss to be a BCL, which makes it monotonic in the interval, but requires rescaling the outputs of classifiers like linear separators to remain in [−1, 1]. [sent-105, score-0.172]
36 4 ULS: the efficient minimization of any SCS. For any strictly convex function ψ : X → R differentiable on int(X), the Bregman Loss Function (BLF) D_ψ with generator ψ is [5]: D_ψ(x||x′) = ψ(x) − ψ(x′) − (x − x′)∇ψ(x′). [sent-106, score-0.259]
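A hedged sketch of the BLF as a reusable helper; the generator ψ(x) = x log x − x is only an assumed example (it yields a generalized-KL-type divergence), not a choice made by the paper:

```python
import numpy as np

# Bregman Loss Function D_psi(x || x') = psi(x) - psi(x') - (x - x') * psi'(x').
# Nonnegative, and zero iff x == x', for any strictly convex differentiable psi.
def bregman(psi, dpsi, x, xp):
    return psi(x) - psi(xp) - (x - xp) * dpsi(xp)

psi  = lambda x: x * np.log(x) - x   # assumed example generator on (0, inf)
dpsi = np.log

print(bregman(psi, dpsi, 2.0, 2.0))  # 0.0
print(bregman(psi, dpsi, 1.0, 3.0))  # > 0 (generalized KL-type term)
```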
37 Algorithm 1 (ULS, pseudocode flattened by extraction): for j = 1, 2, . . . , J do [WU] (weight update) w_j ← (M α_j) ⊛ w_0; [LC] let T_j ⊆ {1, 2, . . . , T} and pick δ_j such that Σ_{i=1}^m m_it ((M δ_j) ⊛ w_j)_i = 0 for every t ∈ T_j, then α_{j+1} ← α_j + δ_j; here ⊛ stands for the weight-update operation whose symbol was lost in extraction. [sent-113, score-0.308]
38 Output: H(x) = Σ_{t=1}^T α_{J+1,t} h_t(x) ∈ LS. Lemma 1 For any SCL ψ, ψ(y*H) = D_ψ⋆(0||∇ψ⋆⁻¹(y*H)) − ψ⋆(0), with ψ⋆ the convex conjugate of ψ. [sent-117, score-0.399]
39 The second equality is important because it ties real predictions (right) with [0, 1] predictions (left). [sent-119, score-0.081]
40 It also separates SCL and BCL, as for any ψ in SCL, it can be shown that there exists a function ϕ such that D_ϕ(y||ϕ̂⁻¹(H)) = ψ(y*H) iff ψ ∈ BCL. [sent-120, score-0.068]
41 We show that there exists an algorithm, ULS, which fits a linear separator H to the minimization of any SCS. [sent-122, score-0.064]
42 With the notation introduced in (13), the first equality in Lemma 1 becomes: ψ(y*H) = D_ψ̃(0||∇ψ̃⁻¹(−y*H)) − ψ̃(0) (14). [sent-126, score-0.044]
43 We let W = dom(∇ψ̃) = −im(∇ψ), where this latter equality comes from ∇ψ̃(x) = −∇ψ⁻¹(−x). [sent-128, score-0.065]
44 Because any BLF is strictly convex in its first argument, we can compute its Legendre conjugate. [sent-130, score-0.146]
45 We consider the features (h_t, t = 1, 2, . . . , T) known in advance, the problem thus reducing to the computation of the leveraging coefficients. [sent-139, score-0.059]
46 Given the leveraging coefficient vector α ∈ R^T, we get: −y_i*H(o_i) = (Mα)_i (19). [sent-142, score-0.059]
47 We can specialize this setting to classical greedy induction frameworks for LS: in classical boosting, at step j, we would fit a single α_t [6]; in totally corrective boosting, we would rather fit {α_t, 1 ≤ t ≤ j} [14]. [sent-143, score-0.102]
48 Intermediate schemes may be used as well for T_j, provided they ensure that, at each step j of the algorithm and for any feature h_t, it may be chosen at some j′ > j. [sent-144, score-0.091]
49 In Algorithm 1, notations are vector-based: the Legendre duals are computed component-wise; furthermore, T_j may be chosen according to any of the schemes outlined above. [sent-146, score-0.074]
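To make the structure concrete, here is a sketch of ULS-style iterations specialized to the exponential SCL, with single-coordinate updates and features valued in {−1, +1}; in this special case the abstract weight update reduces to a familiar AdaBoost-like multiplicative form and the [LC] condition on T_j = {t} has a closed-form solution. The general Legendre-dual maps of ψ̃ were lost in extraction, so this is an assumed specialization, not the paper's pseudocode:

```python
import numpy as np

def uls_exp_sketch(feats, y, n_rounds=50):
    """ULS-style iterations for the exponential SCL only, with T_j = {t}
    (classical boosting) and feature values in {-1, +1}.
    feats[i, t] = h_t(o_i); y in {-1, +1}. Assumed, AdaBoost-like reading."""
    m, T = feats.shape
    M = -y[:, None] * feats                    # m_it = -y_i* h_t(o_i), cf. (19)
    alpha = np.zeros(T)
    w0 = np.ones(m)                            # unnormalized initial weights
    for _ in range(n_rounds):
        w = w0 * np.exp(M @ alpha)             # [WU] for the exponential SCL
        edges = -(M * w[:, None]).sum(axis=0)  # weighted edge of each feature
        t = int(np.argmax(np.abs(edges)))      # greedy choice (a design choice here)
        w_plus = w[M[:, t] == -1].sum()        # weight where h_t agrees with y
        w_minus = w[M[:, t] == +1].sum()       # weight where h_t disagrees
        if w_plus == 0 or w_minus == 0:        # degenerate feature: nothing left to correct
            break
        # [LC] on T_j = {t}: sum_i m_it * (updated weight)_i = 0 has a closed form
        alpha[t] += 0.5 * np.log(w_plus / w_minus)
    return alpha
```

The break guards the degenerate case where the chosen feature makes no weighted mistake (or only mistakes), echoing the remark after Lemma 2 that such features make running ULS unnecessary.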
50 Proof sketch: In step [WU] in ULS, (17) brings w_{j+1} = (M α_{j+1}) ⊛ w_0 = (M δ_j) ⊛ w_j. [sent-149, score-0.335]
51 D_ψ̃(0||w_{j+1}) − D_ψ̃(0||w_j) = −D_ψ̃(w_{j+1}||w_j) (20). Let A_ψ̃(w_{j+1}, w_j) = −D_ψ̃(w_{j+1}||w_j), which is just, from (20) and (14), the difference between two successive SCLs in Algorithm 1. [sent-151, score-0.308]
52 If equality holds only when the weights belong to Ker M, this would make A_ψ̃(w_{j+1}, w_j) an auxiliary function for ULS, which is enough to prove the convergence of ULS towards the optimum [6]. [sent-154, score-0.334]
53 Thus, suppose that wj+1 = wj (ULS has converged). [sent-155, score-0.331]
54 The case of totally corrective boosting is simpler, as after the last iteration we would have wJ+1 ∈ KerM . [sent-165, score-0.188]
55 The optimum is defined by the LS with features in M that realizes the smallest εψ . [sent-171, score-0.053]
56 Notice that in practice, it may be a tedious task to satisfy exactly (20), in particular for totally corrective boosting [14]. [sent-172, score-0.188]
57 ULS has the flavor of boosting algorithms, repeatedly modifying a set of weights w over the examples. [sent-173, score-0.086]
58 In fact, this similarity is more than syntactical, as ULS satisfies two popular algorithmic boosting properties, the first being that step [LC] in ULS is equivalent to saying that this LS has zero edge on w_{j+1} [14]. [sent-174, score-0.086]
59 Lemma 2 Suppose that there does not exist some ht with all mit of the same sign, ∀i = 1, 2, . [sent-176, score-0.091]
60 We have Z = m ψ̃(0) + Σ_{i=1}^m ψ(−(M(δ_j + α_j))_i) (21) from (14), a function convex in all leveraging coefficients. [sent-183, score-0.117]
61 […], with ϕ(x) = d²ψ̃(x)/dx² a function strictly positive in int(W) since ψ̃ is strictly convex. [sent-191, score-0.484]
62 The condition for the Lemma to work is absolutely not restrictive, as if such an h_t were to exist, we would not need to run ULS: indeed, we would have either ε0/1(S, h_t) = 0, or ε0/1(S, −h_t) = 0. [sent-199, score-0.091]
63 In this case, ψ̃ = φ, W = [0, 1] (scaling issues raised for the logit in Section 2 make it desirable to close W), and w_0 = (1/2)·1. [sent-206, score-0.105]
64 In all these cases, where W ⊆ R+, w_j is always a distribution up to a normalization factor, and this would also be the case for any strictly monotonic SCS ψ. [sent-207, score-0.423]
65 Consider example (o_i, y_i), and its weight update, w_{j,i} ← ((M α_j) ⊛ w_0)_i = (−y_i*H(o_i)) ⊛ w_{0,i}, given the current classifier H. [sent-210, score-0.088]
66 We see that the new weight of the example gets larger [sent-212, score-0.068]
67 iff the example is given the wrong class by H, which is the second boosting property met by ULS. [sent-215, score-0.173]
Figure 2 (caption partly recovered from interleaved fragments): a typical φ (red; strictly increasing, symmetric), showing how x ⊛ p is computed from x and p.
68 ULS turns out to meet a third boosting property, and the most important as it contributes to root the algorithm in the seminal boosting theory of the early nineties: we have guarantees on its convergence rate under a generalization of the well-known “Weak Learning Assumption” (WLA) [13]. [sent-216, score-0.172]
69 Condition (22) is indeed a generalization of the usual WLA for boosting algorithms, which we obtain by taking |T_j| = 1 and h_t ∈ {−1, +1} [12]. [sent-221, score-0.177]
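The displayed condition (22) itself was lost in extraction; the usual WLA it generalizes can be checked as follows (a minimal sketch; the threshold name gamma is ours):

```python
import numpy as np

def edge(w, y, h_vals):
    """Weighted edge of a weak hypothesis h (values in {-1, +1}) under weights w."""
    return float(np.dot(w, y * h_vals) / w.sum())

def satisfies_usual_wla(w, y, h_vals, gamma):
    """Usual WLA (|T_j| = 1, h_t in {-1, +1}): the weighted edge is bounded
    away from zero by gamma > 0. The generalized condition (22) is not shown here."""
    return abs(edge(w, y, h_vals)) >= gamma
```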
70 Few algorithms are known that formally boost WLA in the sense that requiring only WLA implies guaranteed rates for the minimization of εψ . [sent-222, score-0.064]
71 But (14), together with the definition of w_j in [WU] (see ULS), yields D_ψ̃(0||w_{J+1,i}) = ψ̃(0) + ψ(y_i*H(o_i)), ∀i = 1, 2, . . . , m, [sent-234, score-0.308]
72 which ties up the SCS to (24); the guaranteed decrease in the rhs of (24) by (23) means that it only remains to check when the rhs becomes negative to conclude that ULS has terminated. [sent-237, score-0.077]
73 5 ULS, BCL, maximum likelihood and zero-sum games. Through the second equality in Lemma 1, BCL matches the set of losses that satisfy the main requirements about losses used in machine learning. [sent-240, score-0.077]
74 Suppose im(H) ⊆ [0, 1], and consider the following requirements about some loss ℓ_[0,1](y, H): (R1) The loss is lower-bounded. [sent-242, score-0.201]
75 (R3) The loss is symmetric in the following sense: ℓ_[0,1](y, H) = ℓ_[0,1](1 − y, 1 − H). [sent-248, score-0.083]
76 For R2, we can write ε_[0,1](S, x) = p ℓ_[0,1](1, x) + (1 − p) ℓ_[0,1](0, x) = L(p, x), which is just the expected loss of zero-sum games used in [9] (eq. [sent-250, score-0.117]
77 The fact that the minimum is achieved at x = p makes the loss a proper scoring rule. [sent-252, score-0.083]
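A quick numeric illustration of R2 and the proper-scoring-rule property, using the log loss and the Brier score as assumed instances of ℓ_[0,1] (the grid search is only for illustration):

```python
import numpy as np

def expected_loss(loss01, p, x):
    # L(p, x) = p * l(1, x) + (1 - p) * l(0, x), the zero-sum game reading of R2
    return p * loss01(1, x) + (1.0 - p) * loss01(0, x)

log_loss   = lambda y, x: -np.log(x) if y == 1 else -np.log(1.0 - x)
brier_loss = lambda y, x: (y - x) ** 2

xs = np.linspace(0.01, 0.99, 9801)
for p in (0.2, 0.5, 0.8):
    for loss in (log_loss, brier_loss):
        x_star = xs[np.argmin([expected_loss(loss, p, x) for x in xs])]
        assert abs(x_star - p) < 1e-2   # minimum reached at x = p: proper scoring rule
```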
78 Finally, we say that a loss ℓ_[0,1] is properly defined iff dom(ℓ_[0,1]) = [0, 1]² and it is twice differentiable on (0, 1)². [sent-255, score-0.221]
79 This is only a technical convenience: even the 0/1 loss coincides on {0, 1} with properly defined losses. [sent-256, score-0.122]
80 Lemma 3 A loss ℓ_[0,1](·, ·) is properly defined and meets requirements R1, R2, R3 iff ℓ_[0,1](y, H) = z + D_φ(y||H) for some permissible φ. [sent-262, score-0.307]
81 The second equality in Lemma 1 makes a tight connection between the predictions of H in [0, 1] and R. [sent-264, score-0.044]
82 Pr_φ[c = c+ |H; o] = φ̂⁻¹(H(o)) (25). With this definition, illustrated in Table 1, Lemma 3 and the second equality in Lemma 1 show that BCL matches the set of losses of Lemma 3. [sent-266, score-0.208]
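For the logistic BCL this link is the entry exp(H)/(1 + exp(H)) recovered from Table 1; a minimal sketch (the function name is ours, and other BCLs would use the inverse link attached to their own φ):

```python
import numpy as np

# Matches a real-valued prediction H(o) with a [0, 1] probability estimate,
# here via the logistic link exp(H) / (1 + exp(H)) from Table 1.
def prob_positive_logistic(H):
    return np.exp(H) / (1.0 + np.exp(H))

H = np.array([-2.0, 0.0, 2.0])
print(prob_positive_logistic(H))         # ~[0.12, 0.5, 0.88]
print(prob_positive_logistic(H) >= 0.5)  # thresholding at 1/2 recovers sign(H)
```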
83 This definition also reveals the true nature of the minimization of any BCS with real-valued hypotheses like linear separators (in ULS). [sent-267, score-0.153]
84 From Lemma 3 and [2], there exists a bijection between BCL and a subclass of the exponential families whose members' pdfs may be written as Pr_φ[y|θ] = exp(−D_φ(y||φ̂⁻¹(θ)) + φ(y) − ν(y)), where θ ∈ R is the natural parameter and ν(·) depends on y only. [sent-268, score-0.056]
85 Lemma 4 Minimizing any BCS with classifier H yields the maximum likelihood estimation, for each observation, of the natural parameter θ = H(o) of an exponential family defined by signature φ. [sent-272, score-0.052]
86 To summarize, minimizing any loss that meets R1, R2 and R3 (i.e. [sent-277, score-0.127]
87 any BCL) amounts to the same ultimate goal; since ULS works for any of the corresponding surrogate risks, the crux of the choice of the BCL relies on data-dependent considerations. [sent-279, score-0.105]
88 Finally, we can go further in the parallel with game theory developed above for R2: using notations in [9], the loss function of the decision maker can be written L(X, q) = Dφ (1||q(X)). [sent-280, score-0.175]
89 R3 makes it easy to recover losses like the log loss or the Brier score [9] respectively from φ_Q and φ_B (Table 1). [sent-281, score-0.247]
90 In this sense, ULS is also a sound learner for decision making in the zero-sum game of [9]. [sent-282, score-0.049]
91 True risks are estimated via stratified 10-fold cross validation; ULS is run for r (fixed) features h_t, each of which is a Boolean rule: If Monomial then Class = ±1 else Class = 1, with at most l (fixed) literals, induced following the greedy minimization of the BCS at hand. [sent-285, score-0.21]
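A sketch of such a Boolean rule feature (our own encoding; the else-branch convention is garbled in the extraction, so we assume the rule outputs the opposite sign when the monomial does not fire):

```python
import numpy as np

def monomial_rule(literals, sign=+1):
    """Boolean rule feature: if the monomial (conjunction of literals) fires,
    predict `sign`, else the opposite sign (an assumed reading of the rule).
    Each literal is (attribute_index, wanted_value) on binary attributes."""
    def h(x):
        fires = all(x[j] == v for j, v in literals)
        return sign if fires else -sign
    return h

# Example with at most l = 2 literals: "attribute 0 is 1 AND attribute 3 is 0".
h = monomial_rule([(0, 1), (3, 0)], sign=+1)
print(h(np.array([1, 1, 0, 0])))   # +1 (monomial fires)
print(h(np.array([0, 1, 0, 0])))   # -1
```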
92 Histograms are ordered from left to right in increasing average true risk over all domains (shown below histograms). [sent-288, score-0.098]
93 ”) pick the BCS which minimizes the empirical risk in the bag; two others (noted “E. [sent-300, score-0.06]
94 There are two different bags corresponding to four permissible functions each: the first (index “1”) contains the φ in Table 1, the second (index “2”) replaces φB by φυ . [sent-302, score-0.139]
95 We wanted to evaluate φ_B because it forces us to renormalize the leveraging coefficients in H each time it is selected, to ensure that the output of H lies in [−1, 1]. [sent-303, score-0.059]
96 ) fail at improving the results; this is a strong advocacy for a particular treatment of this surrogate tuning problem. [sent-334, score-0.105]
97 Conclusion In this paper, we have shown the existence of a supervised learning algorithm which minimizes any strictly convex, differentiable classification calibrated surrogate [3], inducing linear separators. [sent-337, score-0.32]
98 Since the surrogate is now an input of the algorithm, along with the learning sample, this opens the interesting problem of tuning the surrogate to the data at hand to further reduce the true risk. [sent-338, score-0.21]
99 The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. [sent-372, score-0.116]
100 On the boosting ability of top-down decision tree learning algorithms. [sent-407, score-0.112]
wordName wordTfidf (topN-words)
[('uls', 0.603), ('wj', 0.308), ('bcl', 0.263), ('surrogates', 0.23), ('bcs', 0.216), ('scl', 0.17), ('losses', 0.164), ('tj', 0.149), ('permissible', 0.139), ('wla', 0.139), ('im', 0.127), ('scs', 0.122), ('legendre', 0.108), ('surrogate', 0.105), ('oi', 0.094), ('ht', 0.091), ('strictly', 0.088), ('boosting', 0.086), ('loss', 0.083), ('adaboost', 0.077), ('logit', 0.074), ('lemma', 0.072), ('ls', 0.07), ('iff', 0.068), ('lc', 0.066), ('minimization', 0.064), ('separators', 0.062), ('corrective', 0.062), ('leveraging', 0.059), ('convex', 0.058), ('int', 0.058), ('risks', 0.055), ('calibrated', 0.055), ('avors', 0.054), ('differentiable', 0.049), ('kerm', 0.046), ('equality', 0.044), ('dom', 0.044), ('meets', 0.044), ('notations', 0.043), ('domains', 0.042), ('totally', 0.04), ('pr', 0.04), ('logistic', 0.039), ('bregman', 0.039), ('risk', 0.037), ('ties', 0.037), ('subclass', 0.037), ('hinge', 0.036), ('requirements', 0.035), ('games', 0.034), ('signature', 0.033), ('bcls', 0.031), ('blf', 0.031), ('blfs', 0.031), ('ccs', 0.031), ('euv', 0.031), ('infim', 0.031), ('limx', 0.031), ('matsushita', 0.031), ('nock', 0.031), ('underlined', 0.031), ('rationale', 0.03), ('balanced', 0.029), ('classi', 0.029), ('coef', 0.028), ('monotonous', 0.027), ('realizes', 0.027), ('brings', 0.027), ('optimum', 0.026), ('decision', 0.026), ('syntactical', 0.025), ('cedex', 0.025), ('nielsen', 0.025), ('pick', 0.023), ('suppose', 0.023), ('eventual', 0.023), ('supervised', 0.023), ('game', 0.023), ('convexity', 0.023), ('wu', 0.023), ('exp', 0.022), ('minx', 0.022), ('az', 0.022), ('cients', 0.021), ('properly', 0.021), ('index', 0.021), ('avor', 0.021), ('singleton', 0.021), ('banerjee', 0.021), ('comes', 0.021), ('bartlett', 0.02), ('rhs', 0.02), ('dual', 0.019), ('exponential', 0.019), ('ordered', 0.019), ('met', 0.019), ('ci', 0.018), ('table', 0.018), ('coincides', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 163 nips-2008-On the Efficient Minimization of Classification Calibrated Surrogates
Author: Richard Nock, Frank Nielsen
Abstract: Bartlett et al (2006) recently proved that a ground condition for convex surrogates, classification calibration, ties up the minimization of the surrogates and classification risks, and left as an important problem the algorithmic questions about the minimization of these surrogates. In this paper, we propose an algorithm which provably minimizes any classification calibrated surrogate strictly convex and differentiable — a set whose losses span the exponential, logistic and squared losses —, with boosting-type guaranteed convergence rates under a weak learning assumption. A particular subclass of these surrogates, that we call balanced convex surrogates, has a key rationale that ties it to maximum likelihood estimation, zerosum games and the set of losses that satisfy some of the most common requirements for losses in supervised learning. We report experiments on more than 50 readily available domains of 11 flavors of the algorithm, that shed light on new surrogates, and the potential of data dependent strategies to tune surrogates. 1
2 0.10003942 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost
Author: Hamed Masnadi-shirazi, Nuno Vasconcelos
Abstract: The machine learning problem of classifier design is studied from the perspective of probability elicitation, in statistics. This shows that the standard approach of proceeding from the specification of a loss, to the minimization of conditional risk is overly restrictive. It is shown that a better alternative is to start from the specification of a functional form for the minimum conditional risk, and derive the loss function. This has various consequences of practical interest, such as showing that 1) the widely adopted practice of relying on convex loss functions is unnecessary, and 2) many new losses can be derived for classification problems. These points are illustrated by the derivation of a new loss which is not convex, but does not compromise the computational tractability of classifier design, and is robust to the contamination of data with outliers. A new boosting algorithm, SavageBoost, is derived for the minimization of this loss. Experimental results show that it is indeed less sensitive to outliers than conventional methods, such as Ada, Real, or LogitBoost, and converges in fewer iterations. 1
3 0.098937079 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization
Author: Tong Zhang
Abstract: We study learning formulations with non-convex regularizaton that are natural for sparse linear models. There are two approaches to this problem: • Heuristic methods such as gradient descent that only find a local minimum. A drawback of this approach is the lack of theoretical guarantee showing that the local minimum gives a good solution. • Convex relaxation such as L1 -regularization that solves the problem under some conditions. However it often leads to sub-optimal sparsity in reality. This paper tries to remedy the above gap between theory and practice. In particular, we investigate a multi-stage convex relaxation scheme for solving problems with non-convex regularization. Theoretically, we analyze the behavior of a resulting two-stage relaxation scheme for the capped-L1 regularization. Our performance bound shows that the procedure is superior to the standard L1 convex relaxation for learning sparse targets. Experiments confirm the effectiveness of this method on some simulation and real data. 1
4 0.083121181 228 nips-2008-Support Vector Machines with a Reject Option
Author: Yves Grandvalet, Alain Rakotomamonjy, Joseph Keshet, Stéphane Canu
Abstract: We consider the problem of binary classification where the classifier may abstain instead of classifying each observation. The Bayes decision rule for this setup, known as Chow’s rule, is defined by two thresholds on posterior probabilities. From simple desiderata, namely the consistency and the sparsity of the classifier, we derive the double hinge loss function that focuses on estimating conditional probabilities only in the vicinity of the threshold points of the optimal decision rule. We show that, for suitable kernel machines, our approach is universally consistent. We cast the problem of minimizing the double hinge loss as a quadratic program akin to the standard SVM optimization problem and propose an active set method to solve it efficiently. We finally provide preliminary experimental results illustrating the interest of our constructive approach to devising loss functions. 1
5 0.077464342 250 nips-2008-Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning
Author: Ali Rahimi, Benjamin Recht
Abstract: Randomized neural networks are immortalized in this AI Koan: In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6. “What are you doing?” asked Minsky. “I am training a randomly wired neural net to play tic-tac-toe,” Sussman replied. “Why is the net wired randomly?” asked Minsky. Sussman replied, “I do not want it to have any preconceptions of how to play.” Minsky then shut his eyes. “Why do you close your eyes?” Sussman asked his teacher. “So that the room will be empty,” replied Minsky. At that moment, Sussman was enlightened. We analyze shallow random networks with the help of concentration of measure inequalities. Specifically, we consider architectures that compute a weighted sum of their inputs after passing them through a bank of arbitrary randomized nonlinearities. We identify conditions under which these networks exhibit good classification performance, and bound their test error in terms of the size of the dataset and the number of random nonlinearities. 1
6 0.072368175 62 nips-2008-Differentiable Sparse Coding
7 0.068550766 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization
8 0.064034812 15 nips-2008-Adaptive Martingale Boosting
9 0.062723041 239 nips-2008-Tighter Bounds for Structured Estimation
10 0.060888566 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features
11 0.055551417 237 nips-2008-The Recurrent Temporal Restricted Boltzmann Machine
12 0.054039497 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
13 0.052944012 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization
14 0.052753519 73 nips-2008-Estimating Robust Query Models with Convex Optimization
15 0.052215535 104 nips-2008-Improved Moves for Truncated Convex Models
16 0.047053002 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models
17 0.044818383 194 nips-2008-Regularized Learning with Networks of Features
18 0.044098515 85 nips-2008-Fast Rates for Regularized Objectives
19 0.043780696 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning
20 0.043742586 65 nips-2008-Domain Adaptation with Multiple Sources
topicId topicWeight
[(0, -0.13), (1, -0.032), (2, -0.093), (3, 0.012), (4, -0.045), (5, 0.023), (6, -0.017), (7, -0.062), (8, -0.006), (9, 0.046), (10, -0.012), (11, 0.075), (12, 0.02), (13, -0.008), (14, 0.0), (15, 0.02), (16, -0.021), (17, -0.073), (18, 0.013), (19, 0.065), (20, -0.035), (21, 0.012), (22, -0.007), (23, 0.001), (24, -0.007), (25, 0.053), (26, -0.022), (27, -0.042), (28, -0.094), (29, -0.033), (30, 0.103), (31, -0.037), (32, 0.037), (33, -0.003), (34, 0.027), (35, -0.021), (36, -0.075), (37, 0.017), (38, 0.081), (39, 0.133), (40, 0.062), (41, -0.074), (42, -0.047), (43, -0.179), (44, 0.089), (45, 0.159), (46, 0.011), (47, -0.021), (48, -0.045), (49, -0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.92640233 163 nips-2008-On the Efficient Minimization of Classification Calibrated Surrogates
Author: Richard Nock, Frank Nielsen
Abstract: Bartlett et al (2006) recently proved that a ground condition for convex surrogates, classification calibration, ties up the minimization of the surrogates and classification risks, and left as an important problem the algorithmic questions about the minimization of these surrogates. In this paper, we propose an algorithm which provably minimizes any classification calibrated surrogate strictly convex and differentiable — a set whose losses span the exponential, logistic and squared losses —, with boosting-type guaranteed convergence rates under a weak learning assumption. A particular subclass of these surrogates, that we call balanced convex surrogates, has a key rationale that ties it to maximum likelihood estimation, zerosum games and the set of losses that satisfy some of the most common requirements for losses in supervised learning. We report experiments on more than 50 readily available domains of 11 flavors of the algorithm, that shed light on new surrogates, and the potential of data dependent strategies to tune surrogates. 1
2 0.78837395 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost
Author: Hamed Masnadi-shirazi, Nuno Vasconcelos
Abstract: The machine learning problem of classifier design is studied from the perspective of probability elicitation, in statistics. This shows that the standard approach of proceeding from the specification of a loss, to the minimization of conditional risk is overly restrictive. It is shown that a better alternative is to start from the specification of a functional form for the minimum conditional risk, and derive the loss function. This has various consequences of practical interest, such as showing that 1) the widely adopted practice of relying on convex loss functions is unnecessary, and 2) many new losses can be derived for classification problems. These points are illustrated by the derivation of a new loss which is not convex, but does not compromise the computational tractability of classifier design, and is robust to the contamination of data with outliers. A new boosting algorithm, SavageBoost, is derived for the minimization of this loss. Experimental results show that it is indeed less sensitive to outliers than conventional methods, such as Ada, Real, or LogitBoost, and converges in fewer iterations. 1
3 0.5981524 228 nips-2008-Support Vector Machines with a Reject Option
Author: Yves Grandvalet, Alain Rakotomamonjy, Joseph Keshet, Stéphane Canu
Abstract: We consider the problem of binary classification where the classifier may abstain instead of classifying each observation. The Bayes decision rule for this setup, known as Chow’s rule, is defined by two thresholds on posterior probabilities. From simple desiderata, namely the consistency and the sparsity of the classifier, we derive the double hinge loss function that focuses on estimating conditional probabilities only in the vicinity of the threshold points of the optimal decision rule. We show that, for suitable kernel machines, our approach is universally consistent. We cast the problem of minimizing the double hinge loss as a quadratic program akin to the standard SVM optimization problem and propose an active set method to solve it efficiently. We finally provide preliminary experimental results illustrating the interest of our constructive approach to devising loss functions. 1
4 0.5249607 15 nips-2008-Adaptive Martingale Boosting
Author: Phil Long, Rocco Servedio
Abstract: In recent work Long and Servedio [LS05] presented a “martingale boosting” algorithm that works by constructing a branching program over weak classifiers and has a simple analysis based on elementary properties of random walks. [LS05] showed that this martingale booster can tolerate random classification noise when it is run with a noise-tolerant weak learner; however, a drawback of the algorithm is that it is not adaptive, i.e. it cannot effectively take advantage of variation in the quality of the weak classifiers it receives. We present an adaptive variant of the martingale boosting algorithm. This adaptiveness is achieved by modifying the original algorithm so that the random walks that arise in its analysis have different step size depending on the quality of the weak learner at each stage. The new algorithm inherits the desirable properties of the original [LS05] algorithm, such as random classification noise tolerance, and has other advantages besides adaptiveness: it requires polynomially fewer calls to the weak learner than the original algorithm, and it can be used with confidencerated weak hypotheses that output real values rather than Boolean predictions.
5 0.52240568 250 nips-2008-Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning
Author: Ali Rahimi, Benjamin Recht
Abstract: Randomized neural networks are immortalized in this AI Koan: In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6. “What are you doing?” asked Minsky. “I am training a randomly wired neural net to play tic-tac-toe,” Sussman replied. “Why is the net wired randomly?” asked Minsky. Sussman replied, “I do not want it to have any preconceptions of how to play.” Minsky then shut his eyes. “Why do you close your eyes?” Sussman asked his teacher. “So that the room will be empty,” replied Minsky. At that moment, Sussman was enlightened. We analyze shallow random networks with the help of concentration of measure inequalities. Specifically, we consider architectures that compute a weighted sum of their inputs after passing them through a bank of arbitrary randomized nonlinearities. We identify conditions under which these networks exhibit good classification performance, and bound their test error in terms of the size of the dataset and the number of random nonlinearities. 1
6 0.51737005 185 nips-2008-Privacy-preserving logistic regression
7 0.5010094 239 nips-2008-Tighter Bounds for Structured Estimation
8 0.44893259 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization
9 0.44887885 85 nips-2008-Fast Rates for Regularized Objectives
10 0.4406251 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning
11 0.42419243 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features
12 0.39533943 217 nips-2008-Sparsity of SVMs that use the epsilon-insensitive loss
13 0.39218238 5 nips-2008-A Transductive Bound for the Voted Classifier with an Application to Semi-supervised Learning
14 0.39011246 47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation
15 0.38031414 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models
16 0.36184278 173 nips-2008-Optimization on a Budget: A Reinforcement Learning Approach
17 0.34784812 36 nips-2008-Beyond Novelty Detection: Incongruent Events, when General and Specific Classifiers Disagree
18 0.34290573 104 nips-2008-Improved Moves for Truncated Convex Models
19 0.3415536 183 nips-2008-Predicting the Geometry of Metal Binding Sites from Protein Sequence
20 0.33481997 186 nips-2008-Probabilistic detection of short events, with application to critical care monitoring
topicId topicWeight
[(6, 0.081), (7, 0.06), (12, 0.038), (15, 0.016), (28, 0.139), (32, 0.014), (50, 0.324), (57, 0.043), (59, 0.013), (63, 0.03), (71, 0.027), (77, 0.028), (78, 0.02), (81, 0.011), (83, 0.056)]
simIndex simValue paperId paperTitle
same-paper 1 0.73630285 163 nips-2008-On the Efficient Minimization of Classification Calibrated Surrogates
Author: Richard Nock, Frank Nielsen
Abstract: Bartlett et al (2006) recently proved that a ground condition for convex surrogates, classification calibration, ties up the minimization of the surrogates and classification risks, and left as an important problem the algorithmic questions about the minimization of these surrogates. In this paper, we propose an algorithm which provably minimizes any classification calibrated surrogate strictly convex and differentiable — a set whose losses span the exponential, logistic and squared losses —, with boosting-type guaranteed convergence rates under a weak learning assumption. A particular subclass of these surrogates, that we call balanced convex surrogates, has a key rationale that ties it to maximum likelihood estimation, zerosum games and the set of losses that satisfy some of the most common requirements for losses in supervised learning. We report experiments on more than 50 readily available domains of 11 flavors of the algorithm, that shed light on new surrogates, and the potential of data dependent strategies to tune surrogates. 1
2 0.64977556 168 nips-2008-Online Metric Learning and Fast Similarity Search
Author: Prateek Jain, Brian Kulis, Inderjit S. Dhillon, Kristen Grauman
Abstract: Metric learning algorithms can provide useful distance functions for a variety of domains, and recent work has shown good accuracy for problems where the learner can access all distance constraints at once. However, in many real applications, constraints are only available incrementally, thus necessitating methods that can perform online updates to the learned metric. Existing online algorithms offer bounds on worst-case performance, but typically do not perform well in practice as compared to their offline counterparts. We present a new online metric learning algorithm that updates a learned Mahalanobis metric based on LogDet regularization and gradient descent. We prove theoretical worst-case performance bounds, and empirically compare the proposed method against existing online metric learning algorithms. To further boost the practicality of our approach, we develop an online locality-sensitive hashing scheme which leads to efficient updates to data structures used for fast approximate similarity search. We demonstrate our algorithm on multiple datasets and show that it outperforms relevant baselines.
3 0.50667977 202 nips-2008-Robust Regression and Lasso
Author: Huan Xu, Constantine Caramanis, Shie Mannor
Abstract: We consider robust least-squares regression with feature-wise disturbance. We show that this formulation leads to tractable convex optimization problems, and we exhibit a particular uncertainty set for which the robust problem is equivalent to 1 regularized regression (Lasso). This provides an interpretation of Lasso from a robust optimization perspective. We generalize this robust formulation to consider more general uncertainty sets, which all lead to tractable convex optimization problems. Therefore, we provide a new methodology for designing regression algorithms, which generalize known formulations. The advantage is that robustness to disturbance is a physical property that can be exploited: in addition to obtaining new formulations, we use it directly to show sparsity properties of Lasso, as well as to prove a general consistency result for robust regression problems, including Lasso, from a unified robustness perspective. 1
4 0.50535768 62 nips-2008-Differentiable Sparse Coding
Author: J. A. Bagnell, David M. Bradley
Abstract: Prior work has shown that features which appear to be biologically plausible as well as empirically useful can be found by sparse coding with a prior such as a laplacian (L1 ) that promotes sparsity. We show how smoother priors can preserve the benefits of these sparse priors while adding stability to the Maximum A-Posteriori (MAP) estimate that makes it more useful for prediction problems. Additionally, we show how to calculate the derivative of the MAP estimate efficiently with implicit differentiation. One prior that can be differentiated this way is KL-regularization. We demonstrate its effectiveness on a wide variety of applications, and find that online optimization of the parameters of the KL-regularized model can significantly improve prediction performance. 1
5 0.50445551 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
Author: Francis R. Bach
Abstract: For supervised and unsupervised learning, positive definite kernels allow to use large and potentially infinite dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the ℓ1 -norm or the block ℓ1 -norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to non linear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.
6 0.50334078 75 nips-2008-Estimating vector fields using sparse basis field expansions
7 0.50163889 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
8 0.50163537 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models
9 0.50158149 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
10 0.50138295 196 nips-2008-Relative Margin Machines
11 0.50120395 226 nips-2008-Supervised Dictionary Learning
12 0.50095439 194 nips-2008-Regularized Learning with Networks of Features
13 0.49936214 143 nips-2008-Multi-label Multiple Kernel Learning
14 0.4977974 91 nips-2008-Generative and Discriminative Learning with Unknown Labeling Bias
15 0.49745002 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost
16 0.49698013 85 nips-2008-Fast Rates for Regularized Objectives
17 0.49554744 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features
18 0.49552563 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE
19 0.49524206 239 nips-2008-Tighter Bounds for Structured Estimation
20 0.49510232 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization