nips nips2008 nips2008-163 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Richard Nock, Frank Nielsen
Abstract: Bartlett et al (2006) recently proved that a ground condition for convex surrogates, classification calibration, ties up the minimization of the surrogates and classification risks, and left as an important problem the algorithmic questions about the minimization of these surrogates. In this paper, we propose an algorithm which provably minimizes any classification calibrated surrogate strictly convex and differentiable — a set whose losses span the exponential, logistic and squared losses —, with boosting-type guaranteed convergence rates under a weak learning assumption. A particular subclass of these surrogates, that we call balanced convex surrogates, has a key rationale that ties it to maximum likelihood estimation, zerosum games and the set of losses that satisfy some of the most common requirements for losses in supervised learning. We report experiments on more than 50 readily available domains of 11 flavors of the algorithm, that shed light on new surrogates, and the potential of data dependent strategies to tune surrogates. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Bartlett et al (2006) recently proved that a ground condition for convex surrogates, classification calibration, ties up the minimization of the surrogates and classification risks, and left as an important problem the algorithmic questions about the minimization of these surrogates. [sent-6, score-0.453]
2 In this paper, we propose an algorithm which provably minimizes any classification calibrated surrogate strictly convex and differentiable — a set whose losses span the exponential, logistic and squared losses —, with boosting-type guaranteed convergence rates under a weak learning assumption. [sent-7, score-0.722]
3 A particular subclass of these surrogates, that we call balanced convex surrogates, has a key rationale that ties it to maximum likelihood estimation, zerosum games and the set of losses that satisfy some of the most common requirements for losses in supervised learning. [sent-8, score-0.611]
4 Surrogates play fundamental roles in some of the most successful supervised learning algorithms, including AdaBoost, additive logistic regression, decision tree induction, Support Vector Machines [13, 7, 10]. [sent-11, score-0.088]
5 In this paper, we address and solve this problem for all strictly convex differentiable CCS, a set referred to as strictly convex surrogates (SCS). [sent-14, score-0.571]
6 We propose a minimization algorithm, ULS, which outputs linear separators, with two key properties: it provably achieves the optimum of the surrogate, and meets Boosting-type convergence under a weak learning assumption. [sent-15, score-0.134]
7 There is more, as we show that SCS strictly contains another set of surrogates of important rationale, balanced convex surrogates (BCS). [sent-16, score-0.635]
8 This set, which contains the logistic and squared losses but not the exponential loss, coincides with the set of losses satisfying three common requirements about losses in learning. [sent-17, score-0.603]
9 In fact, BCS spans a large subset of the expected losses for zero-sum games of [9], by which ULS may also be viewed as an efficient learner for decision making (in simple environments, though). [sent-18, score-0.224]
10 Section 2 gives preliminary definitions; section 3 presents surrogate losses and risks; sections 4 and 5 present ULS and its properties; section 6 discusses experiments with ULS; section 7 concludes. [sent-19, score-0.394]
11 Following a convention of [6], we compute to which extent the outputs of H and the labels in S disagree, ε(S, H), by summing a loss which quantifies pointwise disagreements: ε(S, H) = Σ_i ℓ(c_i, H(o_i)) (2). [sent-37, score-0.083]
12 The fundamental loss is the 0/1 loss, ℓ0/1(c, H) (to ease readability, the second argument is written H instead of H(o)). [sent-39, score-0.083]
13 (3) The following notations are introduced in (3): for a clear distinction of the output of H, we put in index to ℓ and ε an indication of the loss' domain of parameters: R, meaning it is actually some O ⊆ R, or [0, 1]. [sent-43, score-0.064]
14 The exponent to ℓ gives the indication of the loss name. [sent-44, score-0.083]
15 Finally, 1_π is the indicator variable that takes value 1 iff predicate π is true, and 0 otherwise; σ : R → {−1, +1} is +1 iff x ≥ 0 and −1 otherwise; τ : [0, 1] → {0, 1} is 1 iff x ≥ 1/2, and 0 otherwise. [sent-45, score-0.204]
16 Both losses ℓ_R and ℓ_[0,1] are defined simultaneously via popular transforms on H, such as the logit. [sent-46, score-0.238]
17 In supervised learning, the objective is to carry out the minimization of the expectation of the 0/1 loss in generalization, the so-called true risk. [sent-50, score-0.17]
18 Very often however, this task can be relaxed to the minimization of the empirical risk of H, which is (2) with the 0/1 loss [6]: ε0/1(S, H) (4). [sent-51, score-0.184]
19 The main classifiers we investigate are linear separators (LS). [sent-54, score-0.062]
20 In this case, H(o) = Σ_t α_t h_t(o) for features h_t with im(h_t) ⊆ R and leveraging coefficients α_t ∈ R. [sent-55, score-0.15]
21 3 Losses and surrogates. A serious alternative to directly minimizing (4) is to rather focus on the minimization of a surrogate risk [3]. [sent-56, score-0.527]
22 This is a function ε(S, H) as in (2) whose surrogate loss ℓ(c, H(o)) satisfies ℓ0/1(c, H(o)) ≤ ℓ(c, H(o)). [sent-57, score-0.188]
23 Four are particularly important in supervised learning, defined via the following surrogate losses: [sent-58, score-0.128]
24 ℓ_R^exp(y*, H) = exp(−y*H) (5), ℓ_R^log(y*, H) = log(1 + exp(−y*H)) (6), ℓ_R^sqr(y*, H) = (1 − y*H)² (7), ℓ_R^hinge(y*, H) = max{0, 1 − y*H} (8). [sent-59, score-0.058]
25 (5) is the exponential loss, (6) is the logistic loss, (7) is the squared loss, and (8) is the hinge loss. [sent-63, score-0.177]
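As a concrete illustration (a minimal numpy sketch, not code from the paper; the array convention feats[i, t] = h_t(o_i) is ours), the four surrogates can be evaluated on the margins y*H(o) of a linear separator and summed as in (2):

```python
import numpy as np

# Margins y* H(o) for a linear separator H(o) = sum_t alpha_t h_t(o),
# with labels y* in {-1, +1}; feats[i, t] = h_t(o_i).
def margins(feats, alpha, y):
    return y * (feats @ alpha)

# The four surrogate losses (5)-(8), evaluated pointwise on a margin z = y*H.
def exp_loss(z):      return np.exp(-z)                 # (5)
def logistic_loss(z): return np.log1p(np.exp(-z))       # (6)
def squared_loss(z):  return (1.0 - z) ** 2             # (7)
def hinge_loss(z):    return np.maximum(0.0, 1.0 - z)   # (8)

# Each surrogate risk is the sum (2) of the pointwise losses.
def surrogate_risk(loss, feats, alpha, y):
    return loss(margins(feats, alpha, y)).sum()
```

Of these, only the hinge loss (8) fails the strict convexity and differentiability required below, which is why it falls outside SCL.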
26 Definition 1 A Strictly Convex Loss (SCL) is a strictly convex function ψ : X → R+ differentiable on int(X) with X symmetric interval with respect to zero, s. [sent-64, score-0.195]
27 Table 1: permissible functions, their corresponding BCLs and the matching [0, 1] predictions (table flattened by extraction; recoverable entries include φ_B(x) = x(1 − x), im(H) ∈ {R, [−1, 1]}, the [0, 1] prediction exp(H)/(1 + exp(H)), and the BCLs log(1 + exp(−y*H)) and (1 − y*H)²). [sent-71, score-0.139]
28 Any surrogate risk built from a SCL is called a Strictly Convex Surrogate (SCS). [sent-74, score-0.142]
29 From Theorem 4 in [3], it follows that SCL contains all classification calibrated losses (CCL) that are strictly convex and differentiable, such as (5), (6), (7). [sent-75, score-0.386]
30 A function φ : [0, 1] → R+ is called permissible iff it is differentiable on (0, 1), strictly concave, and symmetric about x = 1/2. [sent-82, score-0.344]
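For illustration, a small numeric check of these properties (assuming the symmetry is indeed about 1/2); φ_B comes from Table 1, while the entropy-like and Matsushita-type choices below are standard examples of permissible functions whose exact notation in the garbled table we cannot confirm:

```python
import numpy as np

# Candidate permissible functions on [0, 1] (differentiable on (0,1),
# strictly concave, symmetric about 1/2):
phi_B = lambda x: x * (1.0 - x)                                   # from Table 1
phi_Q = lambda x: -x * np.log(x) - (1.0 - x) * np.log(1.0 - x)    # bit entropy (assumed)
phi_M = lambda x: np.sqrt(x * (1.0 - x))                          # Matsushita-type (assumed)

def looks_permissible(phi, n=1001, eps=1e-4):
    x = np.linspace(eps, 1.0 - eps, n)
    symmetric = np.allclose(phi(x), phi(1.0 - x))
    # strict concavity <=> second finite difference strictly negative
    d2 = phi(x[:-2]) - 2.0 * phi(x[1:-1]) + phi(x[2:])
    return symmetric and bool(np.all(d2 < 0.0))

print([looks_permissible(f) for f in (phi_B, phi_Q, phi_M)])  # [True, True, True]
```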
31 the linear hinge loss [8], a generalization of (8) for which x = y ∗ H − 1. [sent-100, score-0.119]
32 Thus, while hinge loss is not a BCL, it is the limit behavior of any BCL (see Figure 1). [sent-101, score-0.119]
33 Table 1 (left column) gives some examples of permissible φ. [sent-102, score-0.139]
34 Table 1 also gives the expressions of Fφ along with the im(H) = O ⊆ R allowed by the BCL, for the corresponding permissible function. [sent-104, score-0.139]
35 It is interesting to note the constraint on im(H) for the squared loss to be a BCL, which makes it monotonic in the interval, but requires rescaling the outputs of classifiers like linear separators to remain in [−1, 1]. [sent-105, score-0.172]
36 4 ULS: the efficient minimization of any SCS. For any strictly convex function ψ : X → R differentiable on int(X), the Bregman Loss Function (BLF) D_ψ with generator ψ is [5]: D_ψ(x||x′) = ψ(x) − ψ(x′) − (x − x′)∇ψ(x′). [sent-106, score-0.259]
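A hedged sketch of the BLF as a reusable helper; the generator ψ(x) = x log x − x is only an assumed example (it yields a generalized-KL-type divergence), not a choice made by the paper:

```python
import numpy as np

# Bregman Loss Function D_psi(x || x') = psi(x) - psi(x') - (x - x') * psi'(x').
# Nonnegative, and zero iff x == x', for any strictly convex differentiable psi.
def bregman(psi, dpsi, x, xp):
    return psi(x) - psi(xp) - (x - xp) * dpsi(xp)

psi  = lambda x: x * np.log(x) - x   # assumed example generator on (0, inf)
dpsi = np.log

print(bregman(psi, dpsi, 2.0, 2.0))  # 0.0
print(bregman(psi, dpsi, 1.0, 3.0))  # > 0 (generalized KL-type term)
```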
37 Algorithm 1 (ULS, pseudocode flattened by extraction): for j = 1, 2, . . . , J do [WU] (weight update) w_j ← (M α_j) ⊛ w_0; [LC] let T_j ⊆ {1, 2, . . . , T} and pick δ_j such that Σ_{i=1}^m m_it ((M δ_j) ⊛ w_j)_i = 0 for every t ∈ T_j, then α_{j+1} ← α_j + δ_j; here ⊛ stands for the weight-update operation whose symbol was lost in extraction. [sent-113, score-0.308]
38 Output: H(x) = Σ_{t=1}^T α_{J+1,t} h_t(x) ∈ LS. Lemma 1 For any SCL ψ, ψ(y*H) = D_ψ⋆(0||∇ψ⋆⁻¹(y*H)) − ψ⋆(0), with ψ⋆ the convex conjugate of ψ. [sent-117, score-0.399]
39 The second equality is important because it ties real predictions (right) with [0, 1] predictions (left). [sent-119, score-0.081]
40 It also separates SCL and BCL, as for any ψ in SCL, it can be shown that there exists a function ϕ such that D_ϕ(y||ϕ̂⁻¹(H)) = ψ(y*H) iff ψ ∈ BCL. [sent-120, score-0.068]
41 We show that there exists an algorithm, ULS, which fits a linear separator H to the minimization of any SCS. [sent-122, score-0.064]
42 With the notation introduced in (13), the first equality in Lemma 1 becomes: ψ(y*H) = D_ψ̃(0||∇ψ̃⁻¹(−y*H)) − ψ̃(0) (14). [sent-126, score-0.044]
43 We let W = dom(∇ψ̃) = −im(∇ψ), where this latter equality comes from ∇ψ̃(x) = −∇ψ⁻¹(−x). [sent-128, score-0.065]
44 Because any BLF is strictly convex in its first argument, we can compute its Legendre conjugate. [sent-130, score-0.146]
45 We consider the features (h_t, t = 1, 2, . . . , T) known in advance, the problem thus reducing to the computation of the leveraging coefficients. [sent-139, score-0.059]
46 Given the leveraging coefficient vector α ∈ R^T, we get: −y_i*H(o_i) = (Mα)_i (19). [sent-142, score-0.059]
47 We can specialize this setting to classical greedy induction frameworks for LS: in classical boosting, at step j, we would fit a single α_t [6]; in totally corrective boosting, we would rather fit {α_t, 1 ≤ t ≤ j} [14]. [sent-143, score-0.102]
48 Intermediate schemes may be used as well for T_j, provided they ensure that, at each step j of the algorithm and for any feature h_t, it may be chosen at some j′ > j. [sent-144, score-0.091]
49 In Algorithm 1, notations are vector-based: the Legendre duals are computed component-wise; furthermore, T_j may be chosen according to any of the schemes outlined above. [sent-146, score-0.074]
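To make the structure concrete, here is a sketch of ULS-style iterations specialized to the exponential SCL, with single-coordinate updates and features valued in {−1, +1}; in this special case the abstract weight update reduces to a familiar AdaBoost-like multiplicative form and the [LC] condition on T_j = {t} has a closed-form solution. The general Legendre-dual maps of ψ̃ were lost in extraction, so this is an assumed specialization, not the paper's pseudocode:

```python
import numpy as np

def uls_exp_sketch(feats, y, n_rounds=50):
    """ULS-style iterations for the exponential SCL only, with T_j = {t}
    (classical boosting) and feature values in {-1, +1}.
    feats[i, t] = h_t(o_i); y in {-1, +1}. Assumed, AdaBoost-like reading."""
    m, T = feats.shape
    M = -y[:, None] * feats                    # m_it = -y_i* h_t(o_i), cf. (19)
    alpha = np.zeros(T)
    w0 = np.ones(m)                            # unnormalized initial weights
    for _ in range(n_rounds):
        w = w0 * np.exp(M @ alpha)             # [WU] for the exponential SCL
        edges = -(M * w[:, None]).sum(axis=0)  # weighted edge of each feature
        t = int(np.argmax(np.abs(edges)))      # greedy choice (a design choice here)
        w_plus = w[M[:, t] == -1].sum()        # weight where h_t agrees with y
        w_minus = w[M[:, t] == +1].sum()       # weight where h_t disagrees
        if w_plus == 0 or w_minus == 0:        # degenerate feature: nothing left to correct
            break
        # [LC] on T_j = {t}: sum_i m_it * (updated weight)_i = 0 has a closed form
        alpha[t] += 0.5 * np.log(w_plus / w_minus)
    return alpha
```

The break guards the degenerate case where the chosen feature makes no weighted mistake (or only mistakes), echoing the remark after Lemma 2 that such features make running ULS unnecessary.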
50 Proof sketch: In step [WU] in ULS, (17) brings w_{j+1} = (M α_{j+1}) ⊛ w_0 = (M δ_j) ⊛ w_j. [sent-149, score-0.335]
51 D_ψ̃(0||w_{j+1}) − D_ψ̃(0||w_j) = −D_ψ̃(w_{j+1}||w_j) (20). Let A_ψ̃(w_{j+1}, w_j) = −D_ψ̃(w_{j+1}||w_j), which is just, from (20) and (14), the difference between two successive SCLs in Algorithm 1. [sent-151, score-0.308]
52 If equality holds only when the weights belong to Ker M, this would make A_ψ̃(w_{j+1}, w_j) an auxiliary function for ULS, which is enough to prove the convergence of ULS towards the optimum [6]. [sent-154, score-0.334]
53 Thus, suppose that wj+1 = wj (ULS has converged). [sent-155, score-0.331]
54 The case of totally corrective boosting is simpler, as after the last iteration we would have wJ+1 ∈ KerM . [sent-165, score-0.188]
55 The optimum is defined by the LS with features in M that realizes the smallest εψ . [sent-171, score-0.053]
56 Notice that in practice, it may be a tedious task to satisfy exactly (20), in particular for totally corrective boosting [14]. [sent-172, score-0.188]
57 ULS has the flavor of boosting algorithms, repeatedly modifying a set of weights w over the examples. [sent-173, score-0.086]
58 In fact, this similarity is more than syntactical, as ULS satisfies two popular algorithmic boosting properties, the first being that step [LC] in ULS is equivalent to saying that this LS has zero edge on w_{j+1} [14]. [sent-174, score-0.086]
59 Lemma 2 Suppose that there does not exist some ht with all mit of the same sign, ∀i = 1, 2, . [sent-176, score-0.091]
60 We have Z = m ψ̃(0) + Σ_{i=1}^m ψ(−(M(δ_j + α_j))_i) (21) from (14), a function convex in all leveraging coefficients. [sent-183, score-0.117]
61 […], with ϕ(x) = d²ψ̃(x)/dx² a function strictly positive in int(W) since ψ̃ is strictly convex. [sent-191, score-0.484]
62 The condition for the Lemma to work is absolutely not restrictive, as if such an h_t were to exist, we would not need to run ULS: indeed, we would have either ε0/1(S, h_t) = 0, or ε0/1(S, −h_t) = 0. [sent-199, score-0.091]
63 In this case, ψ̃ = φ, W = [0, 1] (scaling issues raised for the logit in Section 2 make it desirable to close W), and w_0 = (1/2)·1. [sent-206, score-0.105]
64 In all these cases, where W ⊆ R+, w_j is always a distribution up to a normalization factor, and this would also be the case for any strictly monotonic SCS ψ. [sent-207, score-0.423]
65 Consider example (o_i, y_i), and its weight update, w_{j,i} ← ((M α_j) ⊛ w_0)_i = (−y_i*H(o_i)) ⊛ w_{0,i}, given the current classifier H. [sent-210, score-0.088]
66 We see that the new weight of the example gets larger [sent-212, score-0.068]
67 iff the example is given the wrong class by H, which is the second boosting property met by ULS. [sent-215, score-0.173]
Figure 2 (caption partly recovered from interleaved fragments): a typical φ (red; strictly increasing, symmetric), showing how x ⊛ p is computed from x and p.
68 ULS turns out to meet a third boosting property, and the most important as it contributes to root the algorithm in the seminal boosting theory of the early nineties: we have guarantees on its convergence rate under a generalization of the well-known “Weak Learning Assumption” (WLA) [13]. [sent-216, score-0.172]
69 Condition (22) is indeed a generalization of the usual WLA for boosting algorithms, which we obtain by taking |T_j| = 1 and h_t ∈ {−1, +1} [12]. [sent-221, score-0.177]
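The displayed condition (22) itself was lost in extraction; the usual WLA it generalizes can be checked as follows (a minimal sketch; the threshold name gamma is ours):

```python
import numpy as np

def edge(w, y, h_vals):
    """Weighted edge of a weak hypothesis h (values in {-1, +1}) under weights w."""
    return float(np.dot(w, y * h_vals) / w.sum())

def satisfies_usual_wla(w, y, h_vals, gamma):
    """Usual WLA (|T_j| = 1, h_t in {-1, +1}): the weighted edge is bounded
    away from zero by gamma > 0. The generalized condition (22) is not shown here."""
    return abs(edge(w, y, h_vals)) >= gamma
```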
70 Few algorithms are known that formally boost WLA in the sense that requiring only WLA implies guaranteed rates for the minimization of εψ . [sent-222, score-0.064]
71 But (14), together with the definition of w_j in [WU] (see ULS), yields D_ψ̃(0||w_{J+1,i}) = ψ̃(0) + ψ(y_i*H(o_i)), ∀i = 1, 2, . . . , m, [sent-234, score-0.308]
72 which ties up the SCS to (24); the guaranteed decrease in the rhs of (24) by (23) means that it only remains to check when the rhs becomes negative to conclude that ULS has terminated. [sent-237, score-0.077]
73 5 ULS, BCL, maximum likelihood and zero-sum games. Through the second equality in Lemma 1, BCL matches the set of losses that satisfy the main requirements about losses used in machine learning. [sent-240, score-0.077]
74 Suppose im(H) ⊆ [0, 1], and consider the following requirements about some loss ℓ_[0,1](y, H): (R1) The loss is lower-bounded. [sent-242, score-0.201]
75 (R3) The loss is symmetric in the following sense: ℓ_[0,1](y, H) = ℓ_[0,1](1 − y, 1 − H). [sent-248, score-0.083]
76 For R2, we can write ε_[0,1](S, x) = p ℓ_[0,1](1, x) + (1 − p) ℓ_[0,1](0, x) = L(p, x), which is just the expected loss of zero-sum games used in [9] (eq. [sent-250, score-0.117]
77 The fact that the minimum is achieved at x = p makes the loss a proper scoring rule. [sent-252, score-0.083]
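A quick numeric illustration of R2 and the proper-scoring-rule property, using the log loss and the Brier score as assumed instances of ℓ_[0,1] (the grid search is only for illustration):

```python
import numpy as np

def expected_loss(loss01, p, x):
    # L(p, x) = p * l(1, x) + (1 - p) * l(0, x), the zero-sum game reading of R2
    return p * loss01(1, x) + (1.0 - p) * loss01(0, x)

log_loss   = lambda y, x: -np.log(x) if y == 1 else -np.log(1.0 - x)
brier_loss = lambda y, x: (y - x) ** 2

xs = np.linspace(0.01, 0.99, 9801)
for p in (0.2, 0.5, 0.8):
    for loss in (log_loss, brier_loss):
        x_star = xs[np.argmin([expected_loss(loss, p, x) for x in xs])]
        assert abs(x_star - p) < 1e-2   # minimum reached at x = p: proper scoring rule
```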
78 Finally, we say that a loss ℓ_[0,1] is properly defined iff dom(ℓ_[0,1]) = [0, 1]² and it is twice differentiable on (0, 1)². [sent-255, score-0.221]
79 This is only a technical convenience: even the 0/1 loss coincides on {0, 1} with properly defined losses. [sent-256, score-0.122]
80 Lemma 3 A loss ℓ_[0,1](·, ·) is properly defined and meets requirements R1, R2, R3 iff ℓ_[0,1](y, H) = z + D_φ(y||H) for some permissible φ. [sent-262, score-0.307]
81 The second equality in Lemma 1 makes a tight connection between the predictions of H in [0, 1] and R. [sent-264, score-0.044]
82 Pr_φ[c = c+ |H; o] = φ̂⁻¹(H(o)) (25). With this definition, illustrated in Table 1, Lemma 3 and the second equality in Lemma 1 show that BCL matches the set of losses of Lemma 3. [sent-266, score-0.208]
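For the logistic BCL this link is the entry exp(H)/(1 + exp(H)) recovered from Table 1; a minimal sketch (the function name is ours, and other BCLs would use the inverse link attached to their own φ):

```python
import numpy as np

# Matches a real-valued prediction H(o) with a [0, 1] probability estimate,
# here via the logistic link exp(H) / (1 + exp(H)) from Table 1.
def prob_positive_logistic(H):
    return np.exp(H) / (1.0 + np.exp(H))

H = np.array([-2.0, 0.0, 2.0])
print(prob_positive_logistic(H))         # ~[0.12, 0.5, 0.88]
print(prob_positive_logistic(H) >= 0.5)  # thresholding at 1/2 recovers sign(H)
```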
83 This definition also reveals the true nature of the minimization of any BCS with real-valued hypotheses like linear separators (in ULS). [sent-267, score-0.153]
84 From Lemma 3 and [2], there exists a bijection between BCL and a subclass of the exponential families whose members' pdfs may be written as Pr_φ[y|θ] = exp(−D_φ(y||φ̂⁻¹(θ)) + φ(y) − ν(y)), where θ ∈ R is the natural parameter and ν(·) depends on y only. [sent-268, score-0.056]
85 Lemma 4 Minimizing any BCS with classifier H yields the maximum likelihood estimation, for each observation, of the natural parameter θ = H(o) of an exponential family defined by signature φ. [sent-272, score-0.052]
86 To summarize, minimizing any loss that meets R1, R2 and R3 (i.e. [sent-277, score-0.127]
87 any BCL) amounts to the same ultimate goal; since ULS works for any of the corresponding surrogate risks, the crux of the choice of the BCL relies on data-dependent considerations. [sent-279, score-0.105]
88 Finally, we can go further in the parallel with game theory developed above for R2: using notations in [9], the loss function of the decision maker can be written L(X, q) = Dφ (1||q(X)). [sent-280, score-0.175]
89 R3 makes it easy to recover losses like the log loss or the Brier score [9] respectively from φ_Q and φ_B (Table 1). [sent-281, score-0.247]
90 In this sense, ULS is also a sound learner for decision making in the zero-sum game of [9]. [sent-282, score-0.049]
91 True risks are estimated via stratified 10-fold cross validation; ULS is run for r (fixed) features h_t, each of which is a Boolean rule: If Monomial then Class = ±1 else Class = 1, with at most l (fixed) literals, induced following the greedy minimization of the BCS at hand. [sent-285, score-0.21]
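A sketch of such a Boolean rule feature (our own encoding; the else-branch convention is garbled in the extraction, so we assume the rule outputs the opposite sign when the monomial does not fire):

```python
import numpy as np

def monomial_rule(literals, sign=+1):
    """Boolean rule feature: if the monomial (conjunction of literals) fires,
    predict `sign`, else the opposite sign (an assumed reading of the rule).
    Each literal is (attribute_index, wanted_value) on binary attributes."""
    def h(x):
        fires = all(x[j] == v for j, v in literals)
        return sign if fires else -sign
    return h

# Example with at most l = 2 literals: "attribute 0 is 1 AND attribute 3 is 0".
h = monomial_rule([(0, 1), (3, 0)], sign=+1)
print(h(np.array([1, 1, 0, 0])))   # +1 (monomial fires)
print(h(np.array([0, 1, 0, 0])))   # -1
```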
92 Histograms are ordered from left to right in increasing average true risk over all domains (shown below histograms). [sent-288, score-0.098]
93 ”) pick the BCS which minimizes the empirical risk in the bag; two others (noted “E. [sent-300, score-0.06]
94 There are two different bags corresponding to four permissible functions each: the first (index “1”) contains the φ in Table 1, the second (index “2”) replaces φB by φυ . [sent-302, score-0.139]
95 We wanted to evaluate φ_B because it forces us to renormalize the leveraging coefficients in H each time it is selected, to ensure that the output of H lies in [−1, 1]. [sent-303, score-0.059]
96 ) fail at improving the results; this is a strong advocacy for a particular treatment of this surrogate tuning problem. [sent-334, score-0.105]
97 Conclusion In this paper, we have shown the existence of a supervised learning algorithm which minimizes any strictly convex, differentiable classification calibrated surrogate [3], inducing linear separators. [sent-337, score-0.32]
98 Since the surrogate is now an input of the algorithm, along with the learning sample, this opens the interesting problem of tuning the surrogate to the data at hand to further reduce the true risk. [sent-338, score-0.21]
99 The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. [sent-372, score-0.116]
100 On the boosting ability of top-down decision tree learning algorithms. [sent-407, score-0.112]
wordName wordTfidf (topN-words)
[('uls', 0.603), ('wj', 0.308), ('bcl', 0.263), ('surrogates', 0.23), ('bcs', 0.216), ('scl', 0.17), ('losses', 0.164), ('tj', 0.149), ('permissible', 0.139), ('wla', 0.139), ('im', 0.127), ('scs', 0.122), ('legendre', 0.108), ('surrogate', 0.105), ('oi', 0.094), ('ht', 0.091), ('strictly', 0.088), ('boosting', 0.086), ('loss', 0.083), ('adaboost', 0.077), ('logit', 0.074), ('lemma', 0.072), ('ls', 0.07), ('iff', 0.068), ('lc', 0.066), ('minimization', 0.064), ('separators', 0.062), ('corrective', 0.062), ('leveraging', 0.059), ('convex', 0.058), ('int', 0.058), ('risks', 0.055), ('calibrated', 0.055), ('avors', 0.054), ('differentiable', 0.049), ('kerm', 0.046), ('equality', 0.044), ('dom', 0.044), ('meets', 0.044), ('notations', 0.043), ('domains', 0.042), ('totally', 0.04), ('pr', 0.04), ('logistic', 0.039), ('bregman', 0.039), ('risk', 0.037), ('ties', 0.037), ('subclass', 0.037), ('hinge', 0.036), ('requirements', 0.035), ('games', 0.034), ('signature', 0.033), ('bcls', 0.031), ('blf', 0.031), ('blfs', 0.031), ('ccs', 0.031), ('euv', 0.031), ('infim', 0.031), ('limx', 0.031), ('matsushita', 0.031), ('nock', 0.031), ('underlined', 0.031), ('rationale', 0.03), ('balanced', 0.029), ('classi', 0.029), ('coef', 0.028), ('monotonous', 0.027), ('realizes', 0.027), ('brings', 0.027), ('optimum', 0.026), ('decision', 0.026), ('syntactical', 0.025), ('cedex', 0.025), ('nielsen', 0.025), ('pick', 0.023), ('suppose', 0.023), ('eventual', 0.023), ('supervised', 0.023), ('game', 0.023), ('convexity', 0.023), ('wu', 0.023), ('exp', 0.022), ('minx', 0.022), ('az', 0.022), ('cients', 0.021), ('properly', 0.021), ('index', 0.021), ('avor', 0.021), ('singleton', 0.021), ('banerjee', 0.021), ('comes', 0.021), ('bartlett', 0.02), ('rhs', 0.02), ('dual', 0.019), ('exponential', 0.019), ('ordered', 0.019), ('met', 0.019), ('ci', 0.018), ('table', 0.018), ('coincides', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 163 nips-2008-On the Efficient Minimization of Classification Calibrated Surrogates
Author: Richard Nock, Frank Nielsen
Abstract: Bartlett et al (2006) recently proved that a ground condition for convex surrogates, classification calibration, ties up the minimization of the surrogates and classification risks, and left as an important problem the algorithmic questions about the minimization of these surrogates. In this paper, we propose an algorithm which provably minimizes any classification calibrated surrogate strictly convex and differentiable — a set whose losses span the exponential, logistic and squared losses —, with boosting-type guaranteed convergence rates under a weak learning assumption. A particular subclass of these surrogates, that we call balanced convex surrogates, has a key rationale that ties it to maximum likelihood estimation, zerosum games and the set of losses that satisfy some of the most common requirements for losses in supervised learning. We report experiments on more than 50 readily available domains of 11 flavors of the algorithm, that shed light on new surrogates, and the potential of data dependent strategies to tune surrogates. 1
2 0.10003942 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost
Author: Hamed Masnadi-shirazi, Nuno Vasconcelos
Abstract: The machine learning problem of classifier design is studied from the perspective of probability elicitation, in statistics. This shows that the standard approach of proceeding from the specification of a loss, to the minimization of conditional risk is overly restrictive. It is shown that a better alternative is to start from the specification of a functional form for the minimum conditional risk, and derive the loss function. This has various consequences of practical interest, such as showing that 1) the widely adopted practice of relying on convex loss functions is unnecessary, and 2) many new losses can be derived for classification problems. These points are illustrated by the derivation of a new loss which is not convex, but does not compromise the computational tractability of classifier design, and is robust to the contamination of data with outliers. A new boosting algorithm, SavageBoost, is derived for the minimization of this loss. Experimental results show that it is indeed less sensitive to outliers than conventional methods, such as Ada, Real, or LogitBoost, and converges in fewer iterations. 1
3 0.098937079 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization
Author: Tong Zhang
Abstract: We study learning formulations with non-convex regularizaton that are natural for sparse linear models. There are two approaches to this problem: • Heuristic methods such as gradient descent that only find a local minimum. A drawback of this approach is the lack of theoretical guarantee showing that the local minimum gives a good solution. • Convex relaxation such as L1 -regularization that solves the problem under some conditions. However it often leads to sub-optimal sparsity in reality. This paper tries to remedy the above gap between theory and practice. In particular, we investigate a multi-stage convex relaxation scheme for solving problems with non-convex regularization. Theoretically, we analyze the behavior of a resulting two-stage relaxation scheme for the capped-L1 regularization. Our performance bound shows that the procedure is superior to the standard L1 convex relaxation for learning sparse targets. Experiments confirm the effectiveness of this method on some simulation and real data. 1
4 0.083121181 228 nips-2008-Support Vector Machines with a Reject Option
Author: Yves Grandvalet, Alain Rakotomamonjy, Joseph Keshet, Stéphane Canu
Abstract: We consider the problem of binary classification where the classifier may abstain instead of classifying each observation. The Bayes decision rule for this setup, known as Chow’s rule, is defined by two thresholds on posterior probabilities. From simple desiderata, namely the consistency and the sparsity of the classifier, we derive the double hinge loss function that focuses on estimating conditional probabilities only in the vicinity of the threshold points of the optimal decision rule. We show that, for suitable kernel machines, our approach is universally consistent. We cast the problem of minimizing the double hinge loss as a quadratic program akin to the standard SVM optimization problem and propose an active set method to solve it efficiently. We finally provide preliminary experimental results illustrating the interest of our constructive approach to devising loss functions. 1
5 0.077464342 250 nips-2008-Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning
Author: Ali Rahimi, Benjamin Recht
Abstract: Randomized neural networks are immortalized in this AI Koan: In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6. “What are you doing?” asked Minsky. “I am training a randomly wired neural net to play tic-tac-toe,” Sussman replied. “Why is the net wired randomly?” asked Minsky. Sussman replied, “I do not want it to have any preconceptions of how to play.” Minsky then shut his eyes. “Why do you close your eyes?” Sussman asked his teacher. “So that the room will be empty,” replied Minsky. At that moment, Sussman was enlightened. We analyze shallow random networks with the help of concentration of measure inequalities. Specifically, we consider architectures that compute a weighted sum of their inputs after passing them through a bank of arbitrary randomized nonlinearities. We identify conditions under which these networks exhibit good classification performance, and bound their test error in terms of the size of the dataset and the number of random nonlinearities. 1
6 0.072368175 62 nips-2008-Differentiable Sparse Coding
7 0.068550766 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization
8 0.064034812 15 nips-2008-Adaptive Martingale Boosting
9 0.062723041 239 nips-2008-Tighter Bounds for Structured Estimation
10 0.060888566 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features
11 0.055551417 237 nips-2008-The Recurrent Temporal Restricted Boltzmann Machine
12 0.054039497 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
13 0.052944012 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization
14 0.052753519 73 nips-2008-Estimating Robust Query Models with Convex Optimization
15 0.052215535 104 nips-2008-Improved Moves for Truncated Convex Models
16 0.047053002 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models
17 0.044818383 194 nips-2008-Regularized Learning with Networks of Features
18 0.044098515 85 nips-2008-Fast Rates for Regularized Objectives
19 0.043780696 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning
20 0.043742586 65 nips-2008-Domain Adaptation with Multiple Sources
topicId topicWeight
[(0, -0.13), (1, -0.032), (2, -0.093), (3, 0.012), (4, -0.045), (5, 0.023), (6, -0.017), (7, -0.062), (8, -0.006), (9, 0.046), (10, -0.012), (11, 0.075), (12, 0.02), (13, -0.008), (14, 0.0), (15, 0.02), (16, -0.021), (17, -0.073), (18, 0.013), (19, 0.065), (20, -0.035), (21, 0.012), (22, -0.007), (23, 0.001), (24, -0.007), (25, 0.053), (26, -0.022), (27, -0.042), (28, -0.094), (29, -0.033), (30, 0.103), (31, -0.037), (32, 0.037), (33, -0.003), (34, 0.027), (35, -0.021), (36, -0.075), (37, 0.017), (38, 0.081), (39, 0.133), (40, 0.062), (41, -0.074), (42, -0.047), (43, -0.179), (44, 0.089), (45, 0.159), (46, 0.011), (47, -0.021), (48, -0.045), (49, -0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.92640233 163 nips-2008-On the Efficient Minimization of Classification Calibrated Surrogates
Author: Richard Nock, Frank Nielsen
Abstract: Bartlett et al (2006) recently proved that a ground condition for convex surrogates, classification calibration, ties up the minimization of the surrogates and classification risks, and left as an important problem the algorithmic questions about the minimization of these surrogates. In this paper, we propose an algorithm which provably minimizes any classification calibrated surrogate strictly convex and differentiable — a set whose losses span the exponential, logistic and squared losses —, with boosting-type guaranteed convergence rates under a weak learning assumption. A particular subclass of these surrogates, that we call balanced convex surrogates, has a key rationale that ties it to maximum likelihood estimation, zerosum games and the set of losses that satisfy some of the most common requirements for losses in supervised learning. We report experiments on more than 50 readily available domains of 11 flavors of the algorithm, that shed light on new surrogates, and the potential of data dependent strategies to tune surrogates. 1
2 0.78837395 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost
Author: Hamed Masnadi-shirazi, Nuno Vasconcelos
Abstract: The machine learning problem of classifier design is studied from the perspective of probability elicitation, in statistics. This shows that the standard approach of proceeding from the specification of a loss, to the minimization of conditional risk is overly restrictive. It is shown that a better alternative is to start from the specification of a functional form for the minimum conditional risk, and derive the loss function. This has various consequences of practical interest, such as showing that 1) the widely adopted practice of relying on convex loss functions is unnecessary, and 2) many new losses can be derived for classification problems. These points are illustrated by the derivation of a new loss which is not convex, but does not compromise the computational tractability of classifier design, and is robust to the contamination of data with outliers. A new boosting algorithm, SavageBoost, is derived for the minimization of this loss. Experimental results show that it is indeed less sensitive to outliers than conventional methods, such as Ada, Real, or LogitBoost, and converges in fewer iterations. 1
3 0.5981524 228 nips-2008-Support Vector Machines with a Reject Option
Author: Yves Grandvalet, Alain Rakotomamonjy, Joseph Keshet, Stéphane Canu
Abstract: We consider the problem of binary classification where the classifier may abstain instead of classifying each observation. The Bayes decision rule for this setup, known as Chow’s rule, is defined by two thresholds on posterior probabilities. From simple desiderata, namely the consistency and the sparsity of the classifier, we derive the double hinge loss function that focuses on estimating conditional probabilities only in the vicinity of the threshold points of the optimal decision rule. We show that, for suitable kernel machines, our approach is universally consistent. We cast the problem of minimizing the double hinge loss as a quadratic program akin to the standard SVM optimization problem and propose an active set method to solve it efficiently. We finally provide preliminary experimental results illustrating the interest of our constructive approach to devising loss functions. 1
4 0.5249607 15 nips-2008-Adaptive Martingale Boosting
Author: Phil Long, Rocco Servedio
Abstract: In recent work Long and Servedio [LS05] presented a “martingale boosting” algorithm that works by constructing a branching program over weak classifiers and has a simple analysis based on elementary properties of random walks. [LS05] showed that this martingale booster can tolerate random classification noise when it is run with a noise-tolerant weak learner; however, a drawback of the algorithm is that it is not adaptive, i.e. it cannot effectively take advantage of variation in the quality of the weak classifiers it receives. We present an adaptive variant of the martingale boosting algorithm. This adaptiveness is achieved by modifying the original algorithm so that the random walks that arise in its analysis have different step size depending on the quality of the weak learner at each stage. The new algorithm inherits the desirable properties of the original [LS05] algorithm, such as random classification noise tolerance, and has other advantages besides adaptiveness: it requires polynomially fewer calls to the weak learner than the original algorithm, and it can be used with confidencerated weak hypotheses that output real values rather than Boolean predictions.
5 0.52240568 250 nips-2008-Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning
Author: Ali Rahimi, Benjamin Recht
Abstract: Randomized neural networks are immortalized in this AI Koan: In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6. “What are you doing?” asked Minsky. “I am training a randomly wired neural net to play tic-tac-toe,” Sussman replied. “Why is the net wired randomly?” asked Minsky. Sussman replied, “I do not want it to have any preconceptions of how to play.” Minsky then shut his eyes. “Why do you close your eyes?” Sussman asked his teacher. “So that the room will be empty,” replied Minsky. At that moment, Sussman was enlightened. We analyze shallow random networks with the help of concentration of measure inequalities. Specifically, we consider architectures that compute a weighted sum of their inputs after passing them through a bank of arbitrary randomized nonlinearities. We identify conditions under which these networks exhibit good classification performance, and bound their test error in terms of the size of the dataset and the number of random nonlinearities. 1
6 0.51737005 185 nips-2008-Privacy-preserving logistic regression
7 0.5010094 239 nips-2008-Tighter Bounds for Structured Estimation
8 0.44893259 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization
9 0.44887885 85 nips-2008-Fast Rates for Regularized Objectives
10 0.4406251 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning
11 0.42419243 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features
12 0.39533943 217 nips-2008-Sparsity of SVMs that use the epsilon-insensitive loss
13 0.39218238 5 nips-2008-A Transductive Bound for the Voted Classifier with an Application to Semi-supervised Learning
14 0.39011246 47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation
15 0.38031414 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models
16 0.36184278 173 nips-2008-Optimization on a Budget: A Reinforcement Learning Approach
17 0.34784812 36 nips-2008-Beyond Novelty Detection: Incongruent Events, when General and Specific Classifiers Disagree
18 0.34290573 104 nips-2008-Improved Moves for Truncated Convex Models
19 0.3415536 183 nips-2008-Predicting the Geometry of Metal Binding Sites from Protein Sequence
20 0.33481997 186 nips-2008-Probabilistic detection of short events, with application to critical care monitoring
topicId topicWeight
[(6, 0.081), (7, 0.06), (12, 0.038), (15, 0.016), (28, 0.139), (32, 0.014), (50, 0.324), (57, 0.043), (59, 0.013), (63, 0.03), (71, 0.027), (77, 0.028), (78, 0.02), (81, 0.011), (83, 0.056)]
simIndex simValue paperId paperTitle
same-paper 1 0.73630285 163 nips-2008-On the Efficient Minimization of Classification Calibrated Surrogates
Author: Richard Nock, Frank Nielsen
Abstract: Bartlett et al (2006) recently proved that a ground condition for convex surrogates, classification calibration, ties up the minimization of the surrogates and classification risks, and left as an important problem the algorithmic questions about the minimization of these surrogates. In this paper, we propose an algorithm which provably minimizes any classification calibrated surrogate strictly convex and differentiable — a set whose losses span the exponential, logistic and squared losses —, with boosting-type guaranteed convergence rates under a weak learning assumption. A particular subclass of these surrogates, that we call balanced convex surrogates, has a key rationale that ties it to maximum likelihood estimation, zerosum games and the set of losses that satisfy some of the most common requirements for losses in supervised learning. We report experiments on more than 50 readily available domains of 11 flavors of the algorithm, that shed light on new surrogates, and the potential of data dependent strategies to tune surrogates. 1
2 0.64977556 168 nips-2008-Online Metric Learning and Fast Similarity Search
Author: Prateek Jain, Brian Kulis, Inderjit S. Dhillon, Kristen Grauman
Abstract: Metric learning algorithms can provide useful distance functions for a variety of domains, and recent work has shown good accuracy for problems where the learner can access all distance constraints at once. However, in many real applications, constraints are only available incrementally, thus necessitating methods that can perform online updates to the learned metric. Existing online algorithms offer bounds on worst-case performance, but typically do not perform well in practice as compared to their offline counterparts. We present a new online metric learning algorithm that updates a learned Mahalanobis metric based on LogDet regularization and gradient descent. We prove theoretical worst-case performance bounds, and empirically compare the proposed method against existing online metric learning algorithms. To further boost the practicality of our approach, we develop an online locality-sensitive hashing scheme which leads to efficient updates to data structures used for fast approximate similarity search. We demonstrate our algorithm on multiple datasets and show that it outperforms relevant baselines.
3 0.50667977 202 nips-2008-Robust Regression and Lasso
Author: Huan Xu, Constantine Caramanis, Shie Mannor
Abstract: We consider robust least-squares regression with feature-wise disturbance. We show that this formulation leads to tractable convex optimization problems, and we exhibit a particular uncertainty set for which the robust problem is equivalent to 1 regularized regression (Lasso). This provides an interpretation of Lasso from a robust optimization perspective. We generalize this robust formulation to consider more general uncertainty sets, which all lead to tractable convex optimization problems. Therefore, we provide a new methodology for designing regression algorithms, which generalize known formulations. The advantage is that robustness to disturbance is a physical property that can be exploited: in addition to obtaining new formulations, we use it directly to show sparsity properties of Lasso, as well as to prove a general consistency result for robust regression problems, including Lasso, from a unified robustness perspective. 1
4 0.50535768 62 nips-2008-Differentiable Sparse Coding
Author: J. A. Bagnell, David M. Bradley
Abstract: Prior work has shown that features which appear to be biologically plausible as well as empirically useful can be found by sparse coding with a prior such as a laplacian (L1 ) that promotes sparsity. We show how smoother priors can preserve the benefits of these sparse priors while adding stability to the Maximum A-Posteriori (MAP) estimate that makes it more useful for prediction problems. Additionally, we show how to calculate the derivative of the MAP estimate efficiently with implicit differentiation. One prior that can be differentiated this way is KL-regularization. We demonstrate its effectiveness on a wide variety of applications, and find that online optimization of the parameters of the KL-regularized model can significantly improve prediction performance. 1
5 0.50445551 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
Author: Francis R. Bach
Abstract: For supervised and unsupervised learning, positive definite kernels allow to use large and potentially infinite dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the ℓ1 -norm or the block ℓ1 -norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to non linear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.
6 0.50334078 75 nips-2008-Estimating vector fields using sparse basis field expansions
7 0.50163889 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
8 0.50163537 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models
9 0.50158149 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
10 0.50138295 196 nips-2008-Relative Margin Machines
11 0.50120395 226 nips-2008-Supervised Dictionary Learning
12 0.50095439 194 nips-2008-Regularized Learning with Networks of Features
13 0.49936214 143 nips-2008-Multi-label Multiple Kernel Learning
14 0.4977974 91 nips-2008-Generative and Discriminative Learning with Unknown Labeling Bias
15 0.49745002 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost
16 0.49698013 85 nips-2008-Fast Rates for Regularized Objectives
17 0.49554744 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features
18 0.49552563 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE
19 0.49524206 239 nips-2008-Tighter Bounds for Structured Estimation
20 0.49510232 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization