nips nips2012 nips2012-67 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Harish G. Ramaswamy, Shivani Agarwal
Abstract: We study consistency properties of surrogate loss functions for general multiclass classification problems, defined by a general loss matrix. We extend the notion of classification calibration, which has been studied for binary and multiclass 0-1 classification problems (and for certain other specific learning problems), to the general multiclass setting, and derive necessary and sufficient conditions for a surrogate loss to be classification calibrated with respect to a loss matrix in this setting. We then introduce the notion of classification calibration dimension of a multiclass loss matrix, which measures the smallest ‘size’ of a prediction space for which it is possible to design a convex surrogate that is classification calibrated with respect to the loss matrix. We derive both upper and lower bounds on this quantity, and use these results to analyze various loss matrices. In particular, as one application, we provide a different route from the recent result of Duchi et al. (2010) for analyzing the difficulty of designing ‘low-dimensional’ convex surrogates that are consistent with respect to pairwise subset ranking losses. We anticipate the classification calibration dimension may prove to be a useful tool in the study and design of surrogate losses for general multiclass learning problems. 1
Reference: text
sentIndex sentText sentNum sentScore
1 in Abstract We study consistency properties of surrogate loss functions for general multiclass classification problems, defined by a general loss matrix. [sent-5, score-1.139]
2 We then introduce the notion of classification calibration dimension of a multiclass loss matrix, which measures the smallest ‘size’ of a prediction space for which it is possible to design a convex surrogate that is classification calibrated with respect to the loss matrix. [sent-7, score-1.923]
3 We derive both upper and lower bounds on this quantity, and use these results to analyze various loss matrices. [sent-8, score-0.249]
4 (2010) for analyzing the difficulty of designing ‘low-dimensional’ convex surrogates that are consistent with respect to pairwise subset ranking losses. [sent-10, score-0.44]
5 We anticipate the classification calibration dimension may prove to be a useful tool in the study and design of surrogate losses for general multiclass learning problems. [sent-11, score-1.184]
6 Such finite-output problems can all be viewed as instances of a general multiclass learning problem, whose structure is defined by a loss function, or equivalently, by a loss matrix. [sent-13, score-0.666]
7 While the studies above have contributed to the understanding of learning problems corresponding to certain forms of loss matrices, a framework for analyzing consistency properties for a general multiclass learning problem, defined by a general loss matrix, has remained elusive. [sent-14, score-0.856]
8 In this paper, we analyze consistency of surrogate losses for general multiclass learning problems, building on the results of [3, 5–7] and others. [sent-15, score-0.886]
9 We start in Section 2 with some background and examples that will be used as running examples to illustrate concepts throughout the paper, and formalize the notion of classification calibration with respect to a general loss matrix. [sent-16, score-0.61]
10 In Section 3, we derive both necessary and sufficient conditions for classification calibration with respect to general multiclass losses; these are both of independent interest and useful in our later results. [sent-17, score-0.635]
11 Section 4 introduces the notion of classification calibration dimension of a loss matrix, a fundamental quantity that measures the smallest ‘size’ of a prediction space for which it is possible to design a convex surrogate that is classification calibrated with respect to the loss matrix. [sent-18, score-1.723]
12 We derive both upper and lower bounds on this quantity, and use these results to analyze various loss matrices. [sent-19, score-0.249]
13 [10] for analyzing the difficulty of designing ‘low-dimensional’ convex surrogates that are consistent with respect to certain pairwise subset ranking losses. [sent-21, score-0.482]
14 a finite set T = {1, . . . , k} of target labels in which predictions are to be made, and a loss function ℓ : Y × T → [0, ∞), where ℓ(y, t) denotes the loss incurred on predicting t ∈ T when the label is y ∈ Y. [sent-37, score-0.636]
15 We will find it convenient to represent the loss function ℓ as a loss matrix L ∈ R_+^{n×k} (here R_+ = [0, ∞)), and for each y ∈ [n], t ∈ [k], will denote by ℓ_{yt} the (y, t)-th element of L, ℓ_{yt} = (L)_{yt} = ℓ(y, t), and by ℓ_t the t-th column of L, ℓ_t = (ℓ_{1t}, . . . , ℓ_{nt})^⊤. [sent-41, score-0.442]
16 Here Y = T = [n], and the loss incurred is 1 if the predicted label t is different from the actual class label y, and 0 otherwise: ℓ_{0-1}(y, t) = 1(t ≠ y), where 1(·) is 1 if the argument is true and 0 otherwise. [sent-46, score-0.324]
17 The loss matrix L_{0-1} for n = 3 is shown in Figure 1(a). [sent-47, score-0.221]
18 The loss matrix L_ord for n = 3 is shown in Figure 1(b). [sent-52, score-0.221]
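The loss matrices of Examples 1 and 2 are small enough to write down directly; below is a minimal NumPy sketch for n = 3. The ordinal regression loss ℓ_ord(y, t) = |y − t| is assumed here (its defining sentence is not reproduced in this extract), but it is consistent with the trigger probability sets shown in Figure 2(b).

```python
import numpy as np

n = 3
labels = np.arange(1, n + 1)        # class labels y = 1..n, predictions t = 1..n

# 0-1 loss matrix (Figure 1(a)): (L)_{yt} = 1(t != y)
L_01 = (labels[:, None] != labels[None, :]).astype(float)

# Ordinal regression loss matrix (Figure 1(b)), assuming ell_ord(y, t) = |y - t|
L_ord = np.abs(labels[:, None] - labels[None, :]).astype(float)

print(L_01)   # [[0,1,1],[1,0,1],[1,1,0]]
print(L_ord)  # [[0,1,2],[1,0,1],[2,1,0]]
```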
19 Here Y = T = [2^r] for some r ∈ N, and the loss incurred on predicting t when the actual class label is y is the number of bit-positions in which the r-bit binary representations of t − 1 and y − 1 differ: ℓ_Ham(y, t) = Σ_{i=1}^r 1((t − 1)_i ≠ (y − 1)_i), where for any z ∈ {0, . . . [sent-54, score-0.293]
20 The loss matrix L_Ham for r = 2 is shown in Figure 1(c). [sent-58, score-0.221]
21 This loss is used in sequence labeling tasks [16]. [sent-59, score-0.221]
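A small sketch constructing L_Ham of Example 3 for r = 2 (so n = k = 4): the (y, t) entry counts the bit positions in which t − 1 and y − 1 differ, which is the popcount of their XOR. This only restates the definition above in code.

```python
import numpy as np

r = 2
n = 2 ** r

def hamming_loss(y, t):
    # number of bit positions in which the binary representations of t-1 and y-1 differ
    return bin((y - 1) ^ (t - 1)).count("1")

L_ham = np.array([[hamming_loss(y, t) for t in range(1, n + 1)]
                  for y in range(1, n + 1)], dtype=float)
print(L_ham)
# [[0. 1. 1. 2.]
#  [1. 0. 2. 1.]
#  [1. 2. 0. 1.]
#  [2. 1. 1. 0.]]
```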
22 One possible loss function in this setting assigns a loss of 1 to incorrect predictions in [n], 0 to correct predictions, and 1/2 for abstaining: ℓ_(? [sent-62, score-0.469]
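A sketch of the resulting 'abstain' loss matrix for n = 3 (so k = n + 1 = 4, with the extra column for abstaining). The 1/2 penalty for abstaining is the reading consistent with the trigger probability sets of Figure 2(c); treat the exact abstain cost as an assumption.

```python
import numpy as np

n = 3
k = n + 1                              # predictions 1..n plus an abstain option (column k)

L_abstain = np.ones((n, k))            # loss 1 for an incorrect label
np.fill_diagonal(L_abstain[:, :n], 0)  # loss 0 for the correct label
L_abstain[:, n] = 0.5                  # assumed loss 1/2 for abstaining
print(L_abstain)
# [[0.  1.  1.  0.5]
#  [1.  0.  1.  0.5]
#  [1.  1.  0.  0.5]]
```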
23 In particular, the goal is to learn a function with ℓ-risk close to the optimal ℓ-risk, defined as er^{ℓ,∗}_D = inf_{h:X→[k]} er^ℓ_D[h] = inf_{h:X→[k]} E_X[p(X)^⊤ ℓ_{h(X)}] = E_X[min_{t∈[k]} p(X)^⊤ ℓ_t]. (3) [sent-70, score-0.18]
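A toy illustration of equation (3): with a finite instance space the optimal ℓ-risk is just an average of per-instance minima. The two-point instance space and the conditional probabilities below are made up purely for illustration.

```python
import numpy as np

L_01 = 1.0 - np.eye(3)                     # 0-1 loss, n = k = 3

P_x = np.array([0.5, 0.5])                 # marginal over a two-point instance space (made up)
p_given_x = np.array([[0.7, 0.2, 0.1],     # p(x) = (P(Y = y | X = x))_y, rows sum to 1
                      [0.4, 0.35, 0.25]])

# er_D^{l,*} = E_X[ min_t p(X)^T ell_t ]
bayes_risk = np.sum(P_x * (p_given_x @ L_01).min(axis=1))
print(bayes_risk)                          # 0.5*0.3 + 0.5*0.6 = 0.45
```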
24 The learned function f : X → T̂ is then used to make predictions in [k] via some transformation pred : T̂ → [k]: the prediction on a new instance x ∈ X is given by pred(f(x)), and the ℓ-risk incurred is er^ℓ_D[pred ◦ f]. [sent-72, score-0.28]
25 As an example, several algorithms for multiclass classification with respect to 0-1 loss learn a function of the form f : X → R^n and predict according to pred(f(x)) = argmax_{t∈[n]} f_t(x). [sent-73, score-0.495]
26 Below we will find it useful to represent the surrogate loss function ψ via n real-valued functions ψ_y : T̂ → R_+ defined as ψ_y(t̂) = ψ(y, t̂) for y ∈ [n], or equivalently, as a vector-valued function ψ : T̂ → R_+^n defined as ψ(t̂) = (ψ_1(t̂), . . . , ψ_n(t̂))^⊤. [sent-74, score-0.586]
27 We will also define the sets R_ψ = {ψ(t̂) : t̂ ∈ T̂} and S_ψ = conv(R_ψ), (4) where for any A ⊆ R^n, conv(A) denotes the convex hull of A. [sent-78, score-0.134]
28 to converge (in probability) to the optimal ψ-risk, defined as er^{ψ,∗}_D = inf_{f:X→T̂} er^ψ_D[f] = inf_{f:X→T̂} E_X[p(X)^⊤ ψ(f(X))] = E_X[inf_{z∈R_ψ} p(X)^⊤ z] = E_X[inf_{z∈S_ψ} p(X)^⊤ z]. (5) [sent-84, score-0.36]
29 This raises the natural question of whether, for a given loss ℓ, there are surrogate losses ψ for which consistency with respect to the ψ-risk also guarantees consistency with respect to the ℓ-risk, i. [sent-85, score-1.116]
30 This question has been studied in detail for the 0-1 loss, and for losses of the form ℓ(y, t) = a_y 1(t ≠ y), which can be analyzed similarly to the 0-1 loss [6, 7]. [sent-89, score-0.41]
31 In this paper, we consider this question for general multiclass losses ℓ : [n] × [k] → R_+, including rectangular losses with k ≠ n. [sent-90, score-0.602]
32 The notion of classification calibration will be central to our study; as Theorem 3 below shows, classification calibration of a surrogate loss ψ w. [sent-94, score-1.232]
33 A surrogate loss function ψ : [n] × T̂ → R_+ is said to be classification calibrated with respect to a loss function ℓ : [n] × [k] → R_+ over P ⊆ Δ_n if there exists a function pred : T̂ → [k] such that ∀p ∈ P : inf_{t̂ ∈ T̂ : pred(t̂) ∉ argmin_t p^⊤ ℓ_t} p^⊤ ψ(t̂) > inf_{t̂ ∈ T̂} p^⊤ ψ(t̂). [sent-106, score-1.478]
34 Lemma 2. [sent-107, score-0.09]
35 Then ψ is classification calibrated with respect to ℓ over P ⊆ Δ_n iff there exists a function pred′ : S_ψ → [k] such that ∀p ∈ P : inf_{z ∈ S_ψ : pred′(z) ∉ argmin_t p^⊤ ℓ_t} p^⊤ z > inf_{z ∈ S_ψ} p^⊤ z. [sent-109, score-0.569]
36 Then ψ is classification calibrated with respect to ℓ over Δ_n iff ∃ a function pred : T̂ → [k] such that for all distributions D on X × [n] and all sequences of random (vector) functions f_m : X → T̂ (depending on (X_1, Y_1), . . . [sent-112, score-0.644]
37 Let ψ : [n] × T̂ → R_+. The set of positive normals at z ∈ S_ψ is defined as N_{S_ψ}(z) = {p ∈ Δ_n : p^⊤(z − z′) ≤ 0 ∀ z′ ∈ S_ψ}. [sent-118, score-0.133]
38 For each t ∈ [k], the set of trigger probabilities of t with respect to ℓ is defined as Q^ℓ_t = {p ∈ Δ_n : p^⊤(ℓ_t − ℓ_{t′}) ≤ 0 ∀ t′ ∈ [k]} = {p ∈ Δ_n : t ∈ argmin_{t′∈[k]} p^⊤ ℓ_{t′}}. [sent-121, score-0.13]
39 Examples of trigger probability sets for various losses are shown in Figure 2. [sent-122, score-0.291]
40 2 Here Δ_n denotes the probability simplex in R^n, Δ_n = {p ∈ R^n : p_i ≥ 0 ∀ i ∈ [n], Σ_{i=1}^n p_i = 1}. 3 Here →^P denotes convergence in probability. [sent-123, score-0.102]
41 4 The set of positive normals is non-empty only at points z in the boundary of S_ψ. [sent-124, score-0.154]
42 Q_1 = {p ∈ Δ_3 : p_1 ≥ 1/2}, Q_2 = {p ∈ Δ_3 : p_2 ≥ 1/2}, Q_3 = {p ∈ Δ_3 : p_3 ≥ 1/2}, Q_4 = {p ∈ Δ_3 : max(p_1, p_2, p_3) ≤ 1/2}. Figure 2: Trigger probability sets for (a) 0-1 loss ℓ_{0-1}; (b) ordinal regression loss ℓ_ord; and (c) ‘abstain’ loss ℓ_(? [sent-130, score-0.8]
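Trigger probability sets are easy to probe numerically, since p ∈ Q^ℓ_t exactly when column t minimizes p^⊤ ℓ_t. The sketch below uses the abstain-loss matrix assumed earlier and recovers the regions described for Figure 2(c).

```python
import numpy as np

def optimal_predictions(p, L):
    """0-based indices t minimizing p^T ell_t; p lies in Q_t^ell for each such t."""
    risks = p @ L
    return np.flatnonzero(np.isclose(risks, risks.min()))

L_abstain = np.array([[0.0, 1.0, 1.0, 0.5],
                      [1.0, 0.0, 1.0, 0.5],
                      [1.0, 1.0, 0.0, 0.5]])

for p in [np.array([0.60, 0.20, 0.20]),    # p1 >= 1/2       -> predict label 1
          np.array([0.40, 0.30, 0.30]),    # max(p) <= 1/2   -> abstain (prediction 4)
          np.array([0.50, 0.25, 0.25])]:   # boundary point  -> both 1 and abstain optimal
    print(p, "->", optimal_predictions(p, L_abstain) + 1)
```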
43 3 Necessary and Sufficient Conditions for Classification Calibration We start by giving a necessary condition for classification calibration of a surrogate loss ψ with respect to any multiclass loss ℓ over Δ_n, which requires the positive normals of all points z ∈ S_ψ to be ‘well-behaved’ w. [sent-134, score-1.589]
44 ℓ and generalizes the ‘admissibility’ condition used for 0-1 loss in [7]. [sent-137, score-0.221]
45 Let ψ : [n] × T̂ → R_+ be classification calibrated with respect to ℓ : [n] × [k] → R_+ over Δ_n. [sent-140, score-0.369]
46 We note that, as in [7], it is possible to give a necessary and sufficient condition for classification calibration in terms of a similar property holding for positive normals associated with projections of S_ψ in lower dimensions. [sent-142, score-0.463]
47 Instead, below we give a different sufficient condition that will be helpful in showing classification calibration of certain surrogates. [sent-143, score-0.374]
48 In particular, we show that for a surrogate loss ψ to be classification calibrated with respect to ℓ over Δ_n, it is sufficient for the above property of positive normals to hold only at a finite number of points in R_ψ, as long as their positive normal sets jointly cover Δ_n: Theorem 7. [sent-144, score-1.131]
49 Then ψ is classification calibrated with respect to ℓ over Δ_n. [sent-150, score-0.369]
50 The conditions in the above results both involve the sets of positive normals N_{S_ψ}(z) at various points z ∈ S_ψ. [sent-152, score-0.207]
51 Thus in order to use the above results to show that a surrogate ψ is (or is not) classification calibrated with respect to a loss ℓ, one needs to be able to compute or characterize the sets N_{S_ψ}(z). [sent-153, score-0.977]
52 Here we give a method for computing these sets for certain surrogate losses ψ and points z ∈ S_ψ. [sent-154, score-0.639]
53 Recall that the subdifferential of a convex function φ : R^d → R̄_+ at a point u_0 ∈ R^d is defined as ∂φ(u_0) = {w ∈ R^d : φ(u) − φ(u_0) ≥ w^⊤(u − u_0) ∀ u ∈ R^d} and is a convex set in R^d (e. [sent-185, score-0.195]
54 We give an example illustrating the use of Theorem 7 and Lemma 8 to show classification calibration of a certain surrogate loss with respect to the ordinal regression loss ℓ_ord defined in Example 2: Example 5 (Classification calibrated surrogate for ordinal regression loss). [sent-188, score-2.12]
55 Consider the ordinal regression loss ℓ_ord defined in Example 2 for n = 3. [sent-189, score-0.336]
56 Observe that T̂ here is a convex set and ψ : T̂ → R_+^3 is a convex function. [sent-194, score-0.17]
57 (Figure 3: the surrogate ψ.) This gives N_{S_ψ}(z_1) = {p ∈ Δ_3 : p = (q_1 + q_2, q_3, q_4) for some q ∈ Δ_4, q_1 − q_2 − q_3 − q_4 = 0} = {p ∈ Δ_3 : p = (q_1 + q_2, q_3, q_4) for some q ∈ Δ_4, q_1 = 1/2} = {p ∈ Δ_3 : p_1 ≥ 1/2} = Q^ord_1. [sent-198, score-0.365]
58 Thus, by Theorem 7, we get that ψ is classification calibrated with respect to ℓ_ord over Δ_3. [sent-200, score-0.369]
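A numerical sanity check in the spirit of Example 5, with the one-dimensional convex surrogate ψ_y(t̂) = |t̂ − y| and pred(t̂) = nearest label; this particular surrogate is an assumption (the extract does not reproduce the surrogate of Figure 3), but it illustrates the kind of calibration statement being made. The minimizer of p^⊤ ψ(t̂) = Σ_y p_y |t̂ − y| is a weighted median of the labels, and the check confirms that rounding it always attains the optimal ℓ_ord-risk.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
labels = np.arange(1, n + 1)
L_ord = np.abs(labels[:, None] - labels[None, :]).astype(float)   # ordinal regression loss

def surrogate_minimizer(p):
    # argmin over t_hat of sum_y p_y * |t_hat - y| is a weighted median of the labels
    return labels[np.searchsorted(np.cumsum(p), 0.5)]

def pred(t_hat):
    return int(np.clip(round(t_hat), 1, n))                       # map surrogate output to a label

for _ in range(10000):
    p = rng.dirichlet(np.ones(n))
    risks = p @ L_ord
    assert np.isclose(risks[pred(surrogate_minimizer(p)) - 1], risks.min())
print("surrogate-optimal predictions always attain the optimal ell_ord-risk")
```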
59 4 Classification Calibration Dimension We now turn to the study of a fundamental quantity associated with the property of classification calibration with respect to a general multiclass loss ℓ. [sent-202, score-0.826]
60 Specifically, in the above example, we saw that to develop a classification calibrated surrogate loss w. [sent-203, score-0.905]
61 the ordinal regression loss for n = 3, it was sufficient to consider a surrogate target space T̂ = R, with dimension d = 1; in addition, this yielded a convex surrogate ψ : R → R_+^3 which can be used in developing computationally efficient algorithms. [sent-206, score-1.296]
62 In fact the same surrogate target space with d = 1 can be used to develop a similar convex, classification calibrated surrogate loss w. [sent-207, score-1.316]
63 However not all losses ℓ have such ‘low-dimensional’ surrogates. [sent-211, score-0.189]
64 This raises the natural question of what is the smallest dimension d that supports a convex classification calibrated surrogate for a given multiclass loss ℓ, and leads us to the following definition: Definition 9 (Classification calibration dimension). [sent-212, score-1.645]
65 Define the classification calibration dimension (CC dimension) of ℓ as CCdim(ℓ) = min{d ∈ N : ∃ a convex set T̂ ⊆ R^d and a convex surrogate ψ : T̂ → R_+^n that is classification calibrated w.r.t. ℓ over Δ_n}. [sent-214, score-1.26]
66 In the following, we will be interested in developing an understanding of the CC dimension for general losses ℓ, and in particular in deriving upper and lower bounds on this. [sent-219, score-0.316]
67 4.1 Upper Bounds on the Classification Calibration Dimension We start with a simple result that establishes that the CC dimension of any multiclass loss ℓ is finite, and in fact is strictly smaller than the number of class labels n. [sent-221, score-0.588]
68 Then ψ is classification calibrated with respect to ℓ over Δ_n. [sent-225, score-0.369]
69 It may appear surprising that the convex surrogate ψ in the above lemma is classification calibrated with respect to all multiclass losses ℓ on n classes. [sent-227, score-1.29]
70 Minimizing the above surrogate effectively corresponds to such class probability estimation. [sent-229, score-0.365]
71 Indeed, the above lemma can be shown to hold for any surrogate that is a strictly proper composite multiclass loss [21]. [sent-230, score-0.868]
72 Next we give a different upper bound on the CC dimension that depends on the loss ℓ, and for certain losses, can be significantly tighter than the general bound above. [sent-232, score-0.444]
73 Then CCdim(ℓ) ≤ rank(L), the rank of the loss matrix L. [sent-235, score-0.246]
74 We will construct a convex classification calibrated surrogate loss ψ for ℓ with surrogate target space T̂ ⊆ R^d. [sent-238, score-1.401]
75 , ℓ_k}); from the definitions of positive normals and trigger probabilities, it then follows that N_{S_ψ}(z_t) = N_{S_ψ}(ℓ_t) = Q^ℓ_t for all t ∈ [k]. [sent-266, score-0.213]
76 Thus by Theorem 7, ψ is classification calibrated w. [sent-267, score-0.319]
77 Consider the Hamming loss ℓ_Ham defined in Example 3, for n = 2^r. [sent-272, score-0.221]
78 Then the loss matrix L_Ham satisfies L_Ham = (r/2) e e^⊤ − (1/2) Σ_{i=1}^r σ_i σ_i^⊤, where e is the n × 1 all ones vector. [sent-274, score-0.221]
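The decomposition can be checked numerically, and it makes Theorem 11 concrete: rank(L_Ham) = r + 1, which for r = 3 is well below the bound of n − 1 = 7 implied by Lemma 10. The sign-vector convention (σ_i)_y = +1 if bit i of y − 1 is set and −1 otherwise is an assumption chosen to satisfy the stated identity.

```python
import numpy as np

r = 3
n = 2 ** r
vals = np.arange(n)                                   # y - 1 for y = 1..n

# Assumed convention: (sigma_i)_y = +1 if bit i of y-1 is set, else -1
sigma = np.array([2 * ((vals >> i) & 1) - 1 for i in range(r)], dtype=float)

L_ham = np.array([[bin(a ^ b).count("1") for b in vals] for a in vals], dtype=float)

e = np.ones(n)
L_rebuilt = (r / 2) * np.outer(e, e) - 0.5 * sum(np.outer(s, s) for s in sigma)
assert np.allclose(L_ham, L_rebuilt)

print("rank(L_Ham) =", np.linalg.matrix_rank(L_ham))  # r + 1 = 4, versus n - 1 = 7 from Lemma 10
```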
79 We note that the upper bound of Theorem 11 need not always be tight: for example, for the ordinal regression loss, for which we already know CCdim(ℓ_ord) = 1, the theorem actually gives an upper bound of n, which is even weaker than that implied by Lemma 10. [sent-277, score-0.26]
80 4.2 Lower Bound on the Classification Calibration Dimension In this section we give a lower bound on the CC dimension of a loss function ℓ and illustrate it by using it to calculate the CC dimension of the 0-1 loss. [sent-279, score-0.446]
81 In Section 5 we will explore consequences of the lower bound for classification calibrated surrogates for certain types of ranking losses. [sent-280, score-0.565]
82 The feasible subspace dimension of a convex set C at p ∈ C, denoted by µ_C(p), is defined as the dimension of the subspace F_C(p) ∩ (−F_C(p)), where F_C(p) is the cone of feasible directions of C at p. [sent-282, score-0.395]
83 The following gives a lower bound on the CC dimension of a loss ℓ in terms of the feasible subspace dimension of the trigger probability sets Q^ℓ_t at certain points p ∈ Q^ℓ_t: Theorem 13. [sent-283, score-0.667]
84 The proof requires extensions of the definition of positive normals and the necessary condition of Theorem 6 to sequences of points in S_ψ and is quite technical. [sent-288, score-0.177]
85 In the appendix, we provide a proof in the special case when p ∈ relint(Δ_n) is such that inf_{z∈S_ψ} p^⊤ z is achieved in S_ψ, which does not require these extensions. [sent-289, score-0.09]
86 Both the proof of the lower bound and its applications make use of the following lemma, which gives a method to calculate the feasible subspace dimension for certain convex sets C and points p ∈ C: Lemma 14. [sent-291, score-0.352]
87 Let p ∈ C be such that A^1 p = b^1 and A^2 p < b^2. [sent-293, score-0.139]
88 Then µ_C(p) = nullity([A^1; A^3]), the dimension of the null space of the matrix obtained by stacking A^1 and A^3. The above lower bound allows us to calculate precisely the CC dimension of the 0-1 loss: Example 7 (CC dimension of 0-1 loss). [sent-295, score-0.331]
89 Q^{0-1}_1 = {q ∈ Δ_n : q_1 ≥ q_t ∀ t ∈ {2, . . . , n}} = {q ∈ R^n : [−e_{n−1} I_{n−1}] q ≤ 0, −q ≤ 0, e_n^⊤ q = 1}, where e_{n−1}, e_n denote the (n − 1) × 1 and n × 1 all ones vectors, respectively, and I_{n−1} denotes the (n − 1) × (n − 1) identity matrix. [sent-305, score-0.096]
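A sketch of the Lemma 14 computation behind Example 7: at the uniform distribution p = (1/n, ..., 1/n) all n − 1 inequality constraints −q_1 + q_t ≤ 0 are active, so (under the reconstruction of Lemma 14 above) the feasible subspace dimension is the nullity of A^1 = [−e_{n−1} I_{n−1}] stacked with A^3 = e_n^⊤.

```python
import numpy as np

n = 5
# Active inequality constraints of Q_1^{0-1} at the uniform p: -q_1 + q_t <= 0 for t = 2..n
A1 = np.hstack([-np.ones((n - 1, 1)), np.eye(n - 1)])
# Simplex equality constraint: e_n^T q = 1
A3 = np.ones((1, n))

stacked = np.vstack([A1, A3])
mu = stacked.shape[1] - np.linalg.matrix_rank(stacked)
print("feasible subspace dimension mu =", mu)   # 0 for the 0-1 loss at the uniform distribution
```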
90 Duchi et al. [10] showed recently that for certain pairwise subset ranking losses ℓ, finding a predictor that minimizes the ℓ-risk is an NP-hard problem. [sent-329, score-0.389]
91 Here we provide an alternative route to analyzing the difficulty of obtaining consistent surrogates for such pairwise subset ranking problems using the classification calibration dimension. [sent-333, score-0.624]
92 Specifically, we will show that even for a simple setting of such problems, the classification calibration dimension of the underlying loss ℓ is greater than r, and therefore no convex surrogate operating on T̂ ⊆ R^r can be classification calibrated w. [sent-334, score-1.462]
93 Consider now the following simple pairwise loss ℓ_pair : Y × T → R_+: ℓ_pair(g_(i,j), σ) = 1(σ(i) > σ(j)). [sent-342, score-0.26]
94 Therefore, by Lemma 14, we get µ_{Q^pair_{σ_1}}(p) = nullity of the matrix with rows (ℓ^pair_{σ_1} − ℓ^pair_{σ_t})^⊤, . . . [sent-363, score-0.106]
95 It is not hard to see that the set {ℓ^pair_σ : σ ∈ T} spans an (r(r−1)/2 + 1)-dimensional space, and hence the nullity of the above matrix is at most r(r−1) − r(r−1)/2 − 1. [sent-368, score-0.106]
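The span and nullity claims can be verified numerically for small r. The sketch below enumerates ℓ^pair_σ over all permutations σ of r documents (components indexed by ordered pairs g_(i,j), i ≠ j) and forms the Lemma 14-style matrix of rows (ℓ^pair_{σ_1} − ℓ^pair_{σ_t})^⊤ together with the all-ones row; taking σ_1 to be the identity and treating every constraint as active (as happens at the uniform p, where every σ incurs expected loss 1/2) are choices made for illustration.

```python
import numpy as np
from itertools import permutations

r = 4
pairs = [(i, j) for i in range(r) for j in range(r) if i != j]     # n = r(r-1) ordered pairs
perms = list(permutations(range(r)))                               # candidate rankings sigma

def ell_pair(sigma):
    # component for label g_(i,j) is 1(sigma(i) > sigma(j)), as in the definition above
    return np.array([1.0 if sigma[i] > sigma[j] else 0.0 for (i, j) in pairs])

vectors = np.array([ell_pair(s) for s in perms])
print("dim span =", np.linalg.matrix_rank(vectors),
      "  r(r-1)/2 + 1 =", r * (r - 1) // 2 + 1)

sigma1 = perms[0]                                                  # identity ranking (assumed choice)
rows = [ell_pair(sigma1) - ell_pair(s) for s in perms[1:]]
stacked = np.vstack(rows + [np.ones(len(pairs))])
nullity = stacked.shape[1] - np.linalg.matrix_rank(stacked)
print("nullity =", nullity,
      "  bound r(r-1) - r(r-1)/2 - 1 =", r * (r - 1) - r * (r - 1) // 2 - 1)
```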
96 An interesting direction would be to develop a generic procedure for designing consistent convex surrogates operating on a ‘minimal’ surrogate target space according to the classification calibration dimension of the loss matrix. [sent-372, score-1.296]
97 On the Bayes-risk consistency of regularized boosting methods. [sent-378, score-0.108]
98 Statistical behavior and consistency of classification methods based on convex risk minimization. [sent-384, score-0.193]
99 Learning scoring functions with order-preserving losses and standardized supervision. [sent-412, score-0.189]
100 How to compare different loss functions and their risks. [sent-421, score-0.221]
wordName wordTfidf (topN-words)
[('surrogate', 0.365), ('calibrated', 0.319), ('calibration', 0.307), ('ccdim', 0.254), ('multiclass', 0.224), ('loss', 0.221), ('pred', 0.212), ('losses', 0.189), ('ns', 0.184), ('cc', 0.137), ('normals', 0.133), ('ord', 0.131), ('classi', 0.131), ('qord', 0.127), ('relint', 0.127), ('consistency', 0.108), ('lham', 0.106), ('nullity', 0.106), ('conv', 0.103), ('rn', 0.1), ('dimension', 0.099), ('ranking', 0.096), ('inf', 0.09), ('ordinal', 0.088), ('convex', 0.085), ('argmint', 0.085), ('qpair', 0.085), ('ex', 0.083), ('cation', 0.083), ('surrogates', 0.081), ('trigger', 0.08), ('abstain', 0.069), ('en', 0.069), ('rd', 0.068), ('operating', 0.066), ('ham', 0.064), ('wsy', 0.064), ('lemma', 0.058), ('pair', 0.057), ('er', 0.057), ('py', 0.052), ('respect', 0.05), ('simplex', 0.048), ('rr', 0.046), ('target', 0.046), ('fc', 0.044), ('fm', 0.043), ('byj', 0.042), ('harish', 0.042), ('listwise', 0.042), ('duchi', 0.042), ('certain', 0.042), ('incurred', 0.041), ('analyzing', 0.04), ('null', 0.04), ('pairwise', 0.039), ('route', 0.038), ('ingo', 0.037), ('lord', 0.037), ('tong', 0.036), ('theorem', 0.035), ('nicolas', 0.035), ('notion', 0.032), ('conditions', 0.031), ('ambuj', 0.031), ('label', 0.031), ('hamming', 0.03), ('sy', 0.03), ('india', 0.03), ('tewari', 0.028), ('upper', 0.028), ('feasible', 0.028), ('subspace', 0.028), ('nition', 0.028), ('regression', 0.027), ('bound', 0.027), ('denotes', 0.027), ('predictions', 0.027), ('preference', 0.026), ('ut', 0.026), ('designing', 0.026), ('rank', 0.025), ('helpful', 0.025), ('subdifferential', 0.025), ('raises', 0.025), ('giving', 0.024), ('quantity', 0.024), ('ym', 0.023), ('documents', 0.023), ('subset', 0.023), ('yt', 0.023), ('necessary', 0.023), ('tj', 0.022), ('labels', 0.022), ('establishes', 0.022), ('nitions', 0.022), ('sets', 0.022), ('points', 0.021), ('iff', 0.02), ('zj', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses
Author: Harish G. Ramaswamy, Shivani Agarwal
Abstract: We study consistency properties of surrogate loss functions for general multiclass classification problems, defined by a general loss matrix. We extend the notion of classification calibration, which has been studied for binary and multiclass 0-1 classification problems (and for certain other specific learning problems), to the general multiclass setting, and derive necessary and sufficient conditions for a surrogate loss to be classification calibrated with respect to a loss matrix in this setting. We then introduce the notion of classification calibration dimension of a multiclass loss matrix, which measures the smallest ‘size’ of a prediction space for which it is possible to design a convex surrogate that is classification calibrated with respect to the loss matrix. We derive both upper and lower bounds on this quantity, and use these results to analyze various loss matrices. In particular, as one application, we provide a different route from the recent result of Duchi et al. (2010) for analyzing the difficulty of designing ‘low-dimensional’ convex surrogates that are consistent with respect to pairwise subset ranking losses. We anticipate the classification calibration dimension may prove to be a useful tool in the study and design of surrogate losses for general multiclass learning problems. 1
2 0.20642208 227 nips-2012-Multiclass Learning with Simplex Coding
Author: Youssef Mroueh, Tomaso Poggio, Lorenzo Rosasco, Jean-jeacques Slotine
Abstract: In this paper we discuss a novel framework for multiclass learning, defined by a suitable coding/decoding strategy, namely the simplex coding, that allows us to generalize to multiple classes a relaxation approach commonly used in binary classification. In this framework, we develop a relaxation error analysis that avoids constraints on the considered hypotheses class. Moreover, using this setting we derive the first provably consistent regularized method with training/tuning complexity that is independent to the number of classes. We introduce tools from convex analysis that can be used beyond the scope of this paper. 1
3 0.17827564 200 nips-2012-Local Supervised Learning through Space Partitioning
Author: Joseph Wang, Venkatesh Saligrama
Abstract: We develop a novel approach for supervised learning based on adaptively partitioning the feature space into different regions and learning local region-specific classifiers. We formulate an empirical risk minimization problem that incorporates both partitioning and classification in to a single global objective. We show that space partitioning can be equivalently reformulated as a supervised learning problem and consequently any discriminative learning method can be utilized in conjunction with our approach. Nevertheless, we consider locally linear schemes by learning linear partitions and linear region classifiers. Locally linear schemes can not only approximate complex decision boundaries and ensure low training error but also provide tight control on over-fitting and generalization error. We train locally linear classifiers by using LDA, logistic regression and perceptrons, and so our scheme is scalable to large data sizes and high-dimensions. We present experimental results demonstrating improved performance over state of the art classification techniques on benchmark datasets. We also show improved robustness to label noise.
4 0.17022654 323 nips-2012-Statistical Consistency of Ranking Methods in A Rank-Differentiable Probability Space
Author: Yanyan Lan, Jiafeng Guo, Xueqi Cheng, Tie-yan Liu
Abstract: This paper is concerned with the statistical consistency of ranking methods. Recently, it was proven that many commonly used pairwise ranking methods are inconsistent with the weighted pairwise disagreement loss (WPDL), which can be viewed as the true loss of ranking, even in a low-noise setting. This result is interesting but also surprising, given that the pairwise ranking methods have been shown very effective in practice. In this paper, we argue that the aforementioned result might not be conclusive, depending on what kind of assumptions are used. We give a new assumption that the labels of objects to rank lie in a rank-differentiable probability space (RDPS), and prove that the pairwise ranking methods become consistent with WPDL under this assumption. What is especially inspiring is that RDPS is actually not stronger than but similar to the low-noise setting. Our studies provide theoretical justifications of some empirical findings on pairwise ranking methods that are unexplained before, which bridge the gap between theory and applications.
5 0.1360103 280 nips-2012-Proper losses for learning from partial labels
Author: Jesús Cid-sueiro
Abstract: This paper discusses the problem of calibrating posterior class probabilities from partially labelled data. Each instance is assumed to be labelled as belonging to one of several candidate categories, at most one of them being true. We generalize the concept of proper loss to this scenario, we establish a necessary and sufficient condition for a loss function to be proper, and we show a direct procedure to construct a proper loss for partial labels from a conventional proper loss. The problem can be characterized by the mixing probability matrix relating the true class of the data and the observed labels. The full knowledge of this matrix is not required, and losses can be constructed that are proper for a wide set of mixing probability matrices. 1
6 0.12820809 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models
7 0.12820198 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs
8 0.12236422 16 nips-2012-A Polynomial-time Form of Robust Regression
9 0.10445837 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback
10 0.097113319 64 nips-2012-Calibrated Elastic Regularization in Matrix Completion
11 0.094690017 330 nips-2012-Supervised Learning with Similarity Functions
12 0.087379634 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications
13 0.083511762 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking
14 0.081976786 148 nips-2012-Hamming Distance Metric Learning
15 0.078021981 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification
16 0.07516674 160 nips-2012-Imitation Learning by Coaching
17 0.071070731 199 nips-2012-Link Prediction in Graphs with Autoregressive Features
18 0.070606008 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)
19 0.070453994 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme
20 0.068736874 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach
topicId topicWeight
[(0, 0.179), (1, 0.016), (2, 0.083), (3, -0.042), (4, 0.131), (5, -0.035), (6, -0.007), (7, 0.218), (8, -0.066), (9, 0.057), (10, 0.047), (11, 0.1), (12, 0.02), (13, 0.046), (14, 0.057), (15, -0.04), (16, 0.002), (17, -0.019), (18, -0.014), (19, 0.09), (20, -0.072), (21, 0.057), (22, -0.124), (23, 0.197), (24, -0.042), (25, 0.003), (26, -0.069), (27, 0.027), (28, -0.095), (29, -0.095), (30, 0.099), (31, -0.007), (32, 0.082), (33, 0.012), (34, 0.129), (35, 0.056), (36, 0.028), (37, -0.04), (38, -0.046), (39, -0.061), (40, 0.01), (41, -0.043), (42, -0.009), (43, -0.044), (44, -0.027), (45, -0.043), (46, -0.005), (47, 0.018), (48, 0.019), (49, 0.006)]
simIndex simValue paperId paperTitle
same-paper 1 0.96447963 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses
Author: Harish G. Ramaswamy, Shivani Agarwal
Abstract: We study consistency properties of surrogate loss functions for general multiclass classification problems, defined by a general loss matrix. We extend the notion of classification calibration, which has been studied for binary and multiclass 0-1 classification problems (and for certain other specific learning problems), to the general multiclass setting, and derive necessary and sufficient conditions for a surrogate loss to be classification calibrated with respect to a loss matrix in this setting. We then introduce the notion of classification calibration dimension of a multiclass loss matrix, which measures the smallest ‘size’ of a prediction space for which it is possible to design a convex surrogate that is classification calibrated with respect to the loss matrix. We derive both upper and lower bounds on this quantity, and use these results to analyze various loss matrices. In particular, as one application, we provide a different route from the recent result of Duchi et al. (2010) for analyzing the difficulty of designing ‘low-dimensional’ convex surrogates that are consistent with respect to pairwise subset ranking losses. We anticipate the classification calibration dimension may prove to be a useful tool in the study and design of surrogate losses for general multiclass learning problems. 1
2 0.80683076 227 nips-2012-Multiclass Learning with Simplex Coding
Author: Youssef Mroueh, Tomaso Poggio, Lorenzo Rosasco, Jean-jeacques Slotine
Abstract: In this paper we discuss a novel framework for multiclass learning, defined by a suitable coding/decoding strategy, namely the simplex coding, that allows us to generalize to multiple classes a relaxation approach commonly used in binary classification. In this framework, we develop a relaxation error analysis that avoids constraints on the considered hypotheses class. Moreover, using this setting we derive the first provably consistent regularized method with training/tuning complexity that is independent to the number of classes. We introduce tools from convex analysis that can be used beyond the scope of this paper. 1
3 0.78801107 280 nips-2012-Proper losses for learning from partial labels
Author: Jesús Cid-sueiro
Abstract: This paper discusses the problem of calibrating posterior class probabilities from partially labelled data. Each instance is assumed to be labelled as belonging to one of several candidate categories, at most one of them being true. We generalize the concept of proper loss to this scenario, we establish a necessary and sufficient condition for a loss function to be proper, and we show a direct procedure to construct a proper loss for partial labels from a conventional proper loss. The problem can be characterized by the mixing probability matrix relating the true class of the data and the observed labels. The full knowledge of this matrix is not required, and losses can be constructed that are proper for a wide set of mixing probability matrices. 1
4 0.71351385 323 nips-2012-Statistical Consistency of Ranking Methods in A Rank-Differentiable Probability Space
Author: Yanyan Lan, Jiafeng Guo, Xueqi Cheng, Tie-yan Liu
Abstract: This paper is concerned with the statistical consistency of ranking methods. Recently, it was proven that many commonly used pairwise ranking methods are inconsistent with the weighted pairwise disagreement loss (WPDL), which can be viewed as the true loss of ranking, even in a low-noise setting. This result is interesting but also surprising, given that the pairwise ranking methods have been shown very effective in practice. In this paper, we argue that the aforementioned result might not be conclusive, depending on what kind of assumptions are used. We give a new assumption that the labels of objects to rank lie in a rank-differentiable probability space (RDPS), and prove that the pairwise ranking methods become consistent with WPDL under this assumption. What is especially inspiring is that RDPS is actually not stronger than but similar to the low-noise setting. Our studies provide theoretical justifications of some empirical findings on pairwise ranking methods that are unexplained before, which bridge the gap between theory and applications.
5 0.71231169 217 nips-2012-Mixability in Statistical Learning
Author: Tim V. Erven, Peter Grünwald, Mark D. Reid, Robert C. Williamson
Abstract: Statistical learning and sequential prediction are two different but related formalisms to study the quality of predictions. Mapping out their relations and transferring ideas is an active area of investigation. We provide another piece of the puzzle by showing that an important concept in sequential prediction, the mixability of a loss, has a natural counterpart in the statistical setting, which we call stochastic mixability. Just as ordinary mixability characterizes fast rates for the worst-case regret in sequential prediction, stochastic mixability characterizes fast rates in statistical learning. We show that, in the special case of log-loss, stochastic mixability reduces to a well-known (but usually unnamed) martingale condition, which is used in existing convergence theorems for minimum description length and Bayesian inference. In the case of 0/1-loss, it reduces to the margin condition of Mammen and Tsybakov, and in the case that the model under consideration contains all possible predictors, it is equivalent to ordinary mixability. 1
6 0.70733392 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs
7 0.63613361 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification
8 0.63226992 16 nips-2012-A Polynomial-time Form of Robust Regression
9 0.62686336 361 nips-2012-Volume Regularization for Binary Classification
10 0.60141855 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications
11 0.59597892 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound
12 0.58671916 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models
13 0.57926726 130 nips-2012-Feature-aware Label Space Dimension Reduction for Multi-label Classification
14 0.5677864 271 nips-2012-Pointwise Tracking the Optimal Regression Function
15 0.56617171 169 nips-2012-Label Ranking with Partial Abstention based on Thresholded Probabilistic Models
16 0.56018192 200 nips-2012-Local Supervised Learning through Space Partitioning
17 0.51063395 148 nips-2012-Hamming Distance Metric Learning
18 0.50303 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking
19 0.49958915 330 nips-2012-Supervised Learning with Similarity Functions
20 0.49258277 141 nips-2012-GenDeR: A Generic Diversified Ranking Algorithm
topicId topicWeight
[(0, 0.016), (38, 0.109), (42, 0.028), (54, 0.011), (74, 0.033), (76, 0.124), (80, 0.539), (92, 0.019)]
simIndex simValue paperId paperTitle
1 0.97182614 314 nips-2012-Slice Normalized Dynamic Markov Logic Networks
Author: Tivadar Papai, Henry Kautz, Daniel Stefankovic
Abstract: Markov logic is a widely used tool in statistical relational learning, which uses a weighted first-order logic knowledge base to specify a Markov random field (MRF) or a conditional random field (CRF). In many applications, a Markov logic network (MLN) is trained in one domain, but used in a different one. This paper focuses on dynamic Markov logic networks, where the size of the discretized time-domain typically varies between training and testing. It has been previously pointed out that the marginal probabilities of truth assignments to ground atoms can change if one extends or reduces the domains of predicates in an MLN. We show that in addition to this problem, the standard way of unrolling a Markov logic theory into a MRF may result in time-inhomogeneity of the underlying Markov chain. Furthermore, even if these representational problems are not significant for a given domain, we show that the more practical problem of generating samples in a sequential conditional random field for the next slice relying on the samples from the previous slice has high computational cost in the general case, due to the need to estimate a normalization factor for each sample. We propose a new discriminative model, slice normalized dynamic Markov logic networks (SN-DMLN), that suffers from none of these issues. It supports efficient online inference, and can directly model influences between variables within a time slice that do not have a causal direction, in contrast with fully directed models (e.g., DBNs). Experimental results show an improvement in accuracy over previous approaches to online inference in dynamic Markov logic networks. 1
2 0.9709425 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking
Author: James Scott, Jonathan W. Pillow
Abstract: Characterizing the information carried by neural populations in the brain requires accurate statistical models of neural spike responses. The negative-binomial distribution provides a convenient model for over-dispersed spike counts, that is, responses with greater-than-Poisson variability. Here we describe a powerful data-augmentation framework for fully Bayesian inference in neural models with negative-binomial spiking. Our approach relies on a recently described latentvariable representation of the negative-binomial distribution, which equates it to a Polya-gamma mixture of normals. This framework provides a tractable, conditionally Gaussian representation of the posterior that can be used to design efficient EM and Gibbs sampling based algorithms for inference in regression and dynamic factor models. We apply the model to neural data from primate retina and show that it substantially outperforms Poisson regression on held-out data, and reveals latent structure underlying spike count correlations in simultaneously recorded spike trains. 1
3 0.97057217 204 nips-2012-MAP Inference in Chains using Column Generation
Author: David Belanger, Alexandre Passos, Sebastian Riedel, Andrew McCallum
Abstract: Linear chains and trees are basic building blocks in many applications of graphical models, and they admit simple exact maximum a-posteriori (MAP) inference algorithms based on message passing. However, in many cases this computation is prohibitively expensive, due to quadratic dependence on variables’ domain sizes. The standard algorithms are inefficient because they compute scores for hypotheses for which there is strong negative local evidence. For this reason there has been significant previous interest in beam search and its variants; however, these methods provide only approximate results. This paper presents new exact inference algorithms based on the combination of column generation and pre-computed bounds on terms of the model’s scoring function. While we do not improve worst-case performance, our method substantially speeds real-world, typical-case inference in chains and trees. Experiments show our method to be twice as fast as exact Viterbi for Wall Street Journal part-of-speech tagging and over thirteen times faster for a joint part-of-speed and named-entity-recognition task. Our algorithm is also extendable to new techniques for approximate inference, to faster 0/1 loss oracles, and new opportunities for connections between inference and learning. We encourage further exploration of high-level reasoning about the optimization problem implicit in dynamic programs. 1
4 0.96984547 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing
Author: Ashish Kapoor, Raajay Viswanathan, Prateek Jain
Abstract: In this paper, we present a Bayesian framework for multilabel classiďŹ cation using compressed sensing. The key idea in compressed sensing for multilabel classiďŹ cation is to ďŹ rst project the label vector to a lower dimensional space using a random transformation and then learn regression functions over these projections. Our approach considers both of these components in a single probabilistic model, thereby jointly optimizing over compression as well as learning tasks. We then derive an efďŹ cient variational inference scheme that provides joint posterior distribution over all the unobserved labels. The two key beneďŹ ts of the model are that a) it can naturally handle datasets that have missing labels and b) it can also measure uncertainty in prediction. The uncertainty estimate provided by the model allows for active learning paradigms where an oracle provides information about labels that promise to be maximally informative for the prediction task. Our experiments show signiďŹ cant boost over prior methods in terms of prediction performance over benchmark datasets, both in the fully labeled and the missing labels case. Finally, we also highlight various useful active learning scenarios that are enabled by the probabilistic model. 1
5 0.94383341 100 nips-2012-Discriminative Learning of Sum-Product Networks
Author: Robert Gens, Pedro Domingos
Abstract: Sum-product networks are a new deep architecture that can perform fast, exact inference on high-treewidth models. Only generative methods for training SPNs have been proposed to date. In this paper, we present the first discriminative training algorithms for SPNs, combining the high accuracy of the former with the representational power and tractability of the latter. We show that the class of tractable discriminative SPNs is broader than the class of tractable generative ones, and propose an efficient backpropagation-style algorithm for computing the gradient of the conditional log likelihood. Standard gradient descent suffers from the diffusion problem, but networks with many layers can be learned reliably using “hard” gradient descent, where marginal inference is replaced by MPE inference (i.e., inferring the most probable state of the non-evidence variables). The resulting updates have a simple and intuitive form. We test discriminative SPNs on standard image classification tasks. We obtain the best results to date on the CIFAR-10 dataset, using fewer features than prior methods with an SPN architecture that learns local image structure discriminatively. We also report the highest published test accuracy on STL-10 even though we only use the labeled portion of the dataset. 1
same-paper 6 0.93800271 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses
7 0.8380723 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems
8 0.8176353 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions
9 0.78806502 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders
10 0.77986103 251 nips-2012-On Lifting the Gibbs Sampling Algorithm
11 0.77193463 171 nips-2012-Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning
12 0.77105021 200 nips-2012-Local Supervised Learning through Space Partitioning
13 0.77098167 293 nips-2012-Relax and Randomize : From Value to Algorithms
14 0.7486257 207 nips-2012-Mandatory Leaf Node Prediction in Hierarchical Multilabel Classification
15 0.73899209 197 nips-2012-Learning with Recursive Perceptual Representations
16 0.73897338 279 nips-2012-Projection Retrieval for Classification
17 0.73529232 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization
18 0.72897279 315 nips-2012-Slice sampling normalized kernel-weighted completely random measure mixture models
19 0.7204752 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback
20 0.7158637 330 nips-2012-Supervised Learning with Similarity Functions