nips nips2012 nips2012-67 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Harish G. Ramaswamy, Shivani Agarwal
Abstract: We study consistency properties of surrogate loss functions for general multiclass classification problems, defined by a general loss matrix. We extend the notion of classification calibration, which has been studied for binary and multiclass 0-1 classification problems (and for certain other specific learning problems), to the general multiclass setting, and derive necessary and sufficient conditions for a surrogate loss to be classification calibrated with respect to a loss matrix in this setting. We then introduce the notion of classification calibration dimension of a multiclass loss matrix, which measures the smallest ‘size’ of a prediction space for which it is possible to design a convex surrogate that is classification calibrated with respect to the loss matrix. We derive both upper and lower bounds on this quantity, and use these results to analyze various loss matrices. In particular, as one application, we provide a different route from the recent result of Duchi et al. (2010) for analyzing the difficulty of designing ‘low-dimensional’ convex surrogates that are consistent with respect to pairwise subset ranking losses. We anticipate the classification calibration dimension may prove to be a useful tool in the study and design of surrogate losses for general multiclass learning problems. 1
Reference: text
sentIndex sentText sentNum sentScore
1 in Abstract We study consistency properties of surrogate loss functions for general multiclass classification problems, defined by a general loss matrix. [sent-5, score-1.139]
2 We then introduce the notion of classification calibration dimension of a multiclass loss matrix, which measures the smallest ‘size’ of a prediction space for which it is possible to design a convex surrogate that is classification calibrated with respect to the loss matrix. [sent-7, score-1.923]
3 We derive both upper and lower bounds on this quantity, and use these results to analyze various loss matrices. [sent-8, score-0.249]
4 (2010) for analyzing the difficulty of designing ‘low-dimensional’ convex surrogates that are consistent with respect to pairwise subset ranking losses. [sent-10, score-0.44]
5 We anticipate the classification calibration dimension may prove to be a useful tool in the study and design of surrogate losses for general multiclass learning problems. [sent-11, score-1.184]
6 Such finite-output problems can all be viewed as instances of a general multiclass learning problem, whose structure is defined by a loss function, or equivalently, by a loss matrix. [sent-13, score-0.666]
7 While the studies above have contributed to the understanding of learning problems corresponding to certain forms of loss matrices, a framework for analyzing consistency properties for a general multiclass learning problem, defined by a general loss matrix, has remained elusive. [sent-14, score-0.856]
8 In this paper, we analyze consistency of surrogate losses for general multiclass learning problems, building on the results of [3, 5–7] and others. [sent-15, score-0.886]
9 We start in Section 2 with some background and examples that will be used as running examples to illustrate concepts throughout the paper, and formalize the notion of classification calibration with respect to a general loss matrix. [sent-16, score-0.61]
10 In Section 3, we derive both necessary and sufficient conditions for classification calibration with respect to general multiclass losses; these are both of independent interest and useful in our later results. [sent-17, score-0.635]
11 Section 4 introduces the notion of classification calibration dimension of a loss matrix, a fundamental quantity that measures the smallest ‘size’ of a prediction space for which it is possible to design a convex surrogate that is classification calibrated with respect to the loss matrix. [sent-18, score-1.723]
12 We derive both upper and lower bounds on this quantity, and use these results to analyze various loss matrices. [sent-19, score-0.249]
13 [10] for analyzing the difficulty of designing ‘low-dimensional’ convex surrogates that are consistent with respect to certain pairwise subset ranking losses. [sent-21, score-0.482]
14 a finite set T = {1, . . . , k} of target labels in which predictions are to be made, and a loss function ℓ : Y × T → [0, ∞), where ℓ(y, t) denotes the loss incurred on predicting t ∈ T when the label is y ∈ Y. [sent-37, score-0.636]
15 We will find it convenient to represent the loss function ℓ as a loss matrix L ∈ R_+^{n×k} (here R_+ = [0, ∞)), and for each y ∈ [n], t ∈ [k], will denote by ℓ_{yt} the (y, t)-th element of L, ℓ_{yt} = (L)_{yt} = ℓ(y, t), and by ℓ_t the t-th column of L, ℓ_t = (ℓ_{1t}, . . . , ℓ_{nt})^⊤. [sent-41, score-0.442]
16 Here Y = T = [n], and the loss incurred is 1 if the predicted label t is different from the actual class label y, and 0 otherwise: ℓ_{0-1}(y, t) = 1(t ≠ y), where 1(·) is 1 if the argument is true and 0 otherwise. [sent-46, score-0.324]
17 The loss matrix L_{0-1} for n = 3 is shown in Figure 1(a). [sent-47, score-0.221]
18 The loss matrix L_ord for n = 3 is shown in Figure 1(b). [sent-52, score-0.221]
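The loss matrices of Examples 1 and 2 are small enough to write down directly; below is a minimal NumPy sketch for n = 3. The ordinal regression loss ℓ_ord(y, t) = |y − t| is assumed here (its defining sentence is not reproduced in this extract), but it is consistent with the trigger probability sets shown in Figure 2(b).

```python
import numpy as np

n = 3
labels = np.arange(1, n + 1)        # class labels y = 1..n, predictions t = 1..n

# 0-1 loss matrix (Figure 1(a)): (L)_{yt} = 1(t != y)
L_01 = (labels[:, None] != labels[None, :]).astype(float)

# Ordinal regression loss matrix (Figure 1(b)), assuming ell_ord(y, t) = |y - t|
L_ord = np.abs(labels[:, None] - labels[None, :]).astype(float)

print(L_01)   # [[0,1,1],[1,0,1],[1,1,0]]
print(L_ord)  # [[0,1,2],[1,0,1],[2,1,0]]
```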
19 Here Y = T = [2^r] for some r ∈ N, and the loss incurred on predicting t when the actual class label is y is the number of bit-positions in which the r-bit binary representations of t − 1 and y − 1 differ: ℓ_Ham(y, t) = Σ_{i=1}^r 1((t − 1)_i ≠ (y − 1)_i), where for any z ∈ {0, . . . [sent-54, score-0.293]
20 The loss matrix L_Ham for r = 2 is shown in Figure 1(c). [sent-58, score-0.221]
21 This loss is used in sequence labeling tasks [16]. [sent-59, score-0.221]
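A small sketch constructing L_Ham of Example 3 for r = 2 (so n = k = 4): the (y, t) entry counts the bit positions in which t − 1 and y − 1 differ, which is the popcount of their XOR. This only restates the definition above in code.

```python
import numpy as np

r = 2
n = 2 ** r

def hamming_loss(y, t):
    # number of bit positions in which the binary representations of t-1 and y-1 differ
    return bin((y - 1) ^ (t - 1)).count("1")

L_ham = np.array([[hamming_loss(y, t) for t in range(1, n + 1)]
                  for y in range(1, n + 1)], dtype=float)
print(L_ham)
# [[0. 1. 1. 2.]
#  [1. 0. 2. 1.]
#  [1. 2. 0. 1.]
#  [2. 1. 1. 0.]]
```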
22 One possible loss function in this setting assigns a loss of 1 to incorrect predictions in [n], 0 to correct predictions, and 1/2 for abstaining: ℓ_(? [sent-62, score-0.469]
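A sketch of the resulting 'abstain' loss matrix for n = 3 (so k = n + 1 = 4, with the extra column for abstaining). The 1/2 penalty for abstaining is the reading consistent with the trigger probability sets of Figure 2(c); treat the exact abstain cost as an assumption.

```python
import numpy as np

n = 3
k = n + 1                              # predictions 1..n plus an abstain option (column k)

L_abstain = np.ones((n, k))            # loss 1 for an incorrect label
np.fill_diagonal(L_abstain[:, :n], 0)  # loss 0 for the correct label
L_abstain[:, n] = 0.5                  # assumed loss 1/2 for abstaining
print(L_abstain)
# [[0.  1.  1.  0.5]
#  [1.  0.  1.  0.5]
#  [1.  1.  0.  0.5]]
```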
23 In particular, the goal is to learn a function with ℓ-risk close to the optimal ℓ-risk, defined as er^{ℓ,∗}_D = inf_{h:X→[k]} er^ℓ_D[h] = inf_{h:X→[k]} E_X[p(X)^⊤ ℓ_{h(X)}] = E_X[min_{t∈[k]} p(X)^⊤ ℓ_t]. (3) [sent-70, score-0.18]
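A toy illustration of equation (3): with a finite instance space the optimal ℓ-risk is just an average of per-instance minima. The two-point instance space and the conditional probabilities below are made up purely for illustration.

```python
import numpy as np

L_01 = 1.0 - np.eye(3)                     # 0-1 loss, n = k = 3

P_x = np.array([0.5, 0.5])                 # marginal over a two-point instance space (made up)
p_given_x = np.array([[0.7, 0.2, 0.1],     # p(x) = (P(Y = y | X = x))_y, rows sum to 1
                      [0.4, 0.35, 0.25]])

# er_D^{l,*} = E_X[ min_t p(X)^T ell_t ]
bayes_risk = np.sum(P_x * (p_given_x @ L_01).min(axis=1))
print(bayes_risk)                          # 0.5*0.3 + 0.5*0.6 = 0.45
```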
24 The learned function f : X → T̂ is then used to make predictions in [k] via some transformation pred : T̂ → [k]: the prediction on a new instance x ∈ X is given by pred(f(x)), and the ℓ-risk incurred is er^ℓ_D[pred ◦ f]. [sent-72, score-0.28]
25 As an example, several algorithms for multiclass classification with respect to 0-1 loss learn a function of the form f : X → R^n and predict according to pred(f(x)) = argmax_{t∈[n]} f_t(x). [sent-73, score-0.495]
26 Below we will find it useful to represent the surrogate loss function ψ via n real-valued functions ψ_y : T̂ → R_+ defined as ψ_y(t̂) = ψ(y, t̂) for y ∈ [n], or equivalently, as a vector-valued function ψ : T̂ → R_+^n defined as ψ(t̂) = (ψ_1(t̂), . . . , ψ_n(t̂))^⊤. [sent-74, score-0.586]
27 We will also define the sets R_ψ = {ψ(t̂) : t̂ ∈ T̂} and S_ψ = conv(R_ψ), (4) where for any A ⊆ R^n, conv(A) denotes the convex hull of A. [sent-78, score-0.134]
28 to converge (in probability) to the optimal ψ-risk, defined as er^{ψ,∗}_D = inf_{f:X→T̂} er^ψ_D[f] = inf_{f:X→T̂} E_X[p(X)^⊤ ψ(f(X))] = E_X[inf_{z∈R_ψ} p(X)^⊤ z] = E_X[inf_{z∈S_ψ} p(X)^⊤ z]. (5) [sent-84, score-0.36]
29 This raises the natural question of whether, for a given loss ℓ, there are surrogate losses ψ for which consistency with respect to the ψ-risk also guarantees consistency with respect to the ℓ-risk, i. [sent-85, score-1.116]
30 This question has been studied in detail for the 0-1 loss, and for losses of the form ℓ(y, t) = a_y 1(t ≠ y), which can be analyzed similarly to the 0-1 loss [6, 7]. [sent-89, score-0.41]
31 In this paper, we consider this question for general multiclass losses ℓ : [n] × [k] → R_+, including rectangular losses with k ≠ n. [sent-90, score-0.602]
32 The notion of classification calibration will be central to our study; as Theorem 3 below shows, classification calibration of a surrogate loss ψ w. [sent-94, score-1.232]
33 A surrogate loss function ψ : [n] × T̂ → R_+ is said to be classification calibrated with respect to a loss function ℓ : [n] × [k] → R_+ over P ⊆ Δ_n if there exists a function pred : T̂ → [k] such that ∀p ∈ P : inf_{t̂ ∈ T̂ : pred(t̂) ∉ argmin_t p^⊤ ℓ_t} p^⊤ ψ(t̂) > inf_{t̂ ∈ T̂} p^⊤ ψ(t̂). [sent-106, score-1.478]
34 Lemma 2. [sent-107, score-0.09]
35 Then ψ is classification calibrated with respect to ℓ over P ⊆ Δ_n iff there exists a function pred′ : S_ψ → [k] such that ∀p ∈ P : inf_{z ∈ S_ψ : pred′(z) ∉ argmin_t p^⊤ ℓ_t} p^⊤ z > inf_{z ∈ S_ψ} p^⊤ z. [sent-109, score-0.569]
36 Then ψ is classification calibrated with respect to ℓ over Δ_n iff ∃ a function pred : T̂ → [k] such that for all distributions D on X × [n] and all sequences of random (vector) functions f_m : X → T̂ (depending on (X_1, Y_1), . . . [sent-112, score-0.644]
37 Let ψ : [n] × T̂ → R_+. The set of positive normals at z ∈ S_ψ is defined as N_{S_ψ}(z) = {p ∈ Δ_n : p^⊤(z − z′) ≤ 0 ∀ z′ ∈ S_ψ}. [sent-118, score-0.133]
38 For each t ∈ [k], the set of trigger probabilities of t with respect to ℓ is defined as Q^ℓ_t = {p ∈ Δ_n : p^⊤(ℓ_t − ℓ_{t′}) ≤ 0 ∀ t′ ∈ [k]} = {p ∈ Δ_n : t ∈ argmin_{t′∈[k]} p^⊤ ℓ_{t′}}. [sent-121, score-0.13]
39 Examples of trigger probability sets for various losses are shown in Figure 2. [sent-122, score-0.291]
40 2 Here Δ_n denotes the probability simplex in R^n, Δ_n = {p ∈ R^n : p_i ≥ 0 ∀ i ∈ [n], Σ_{i=1}^n p_i = 1}. 3 Here →^P denotes convergence in probability. [sent-123, score-0.102]
41 4 The set of positive normals is non-empty only at points z in the boundary of S_ψ. [sent-124, score-0.154]
42 Q_1 = {p ∈ Δ_3 : p_1 ≥ 1/2}, Q_2 = {p ∈ Δ_3 : p_2 ≥ 1/2}, Q_3 = {p ∈ Δ_3 : p_3 ≥ 1/2}, Q_4 = {p ∈ Δ_3 : max(p_1, p_2, p_3) ≤ 1/2}. Figure 2: Trigger probability sets for (a) 0-1 loss ℓ_{0-1}; (b) ordinal regression loss ℓ_ord; and (c) ‘abstain’ loss ℓ_(? [sent-130, score-0.8]
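Trigger probability sets are easy to probe numerically, since p ∈ Q^ℓ_t exactly when column t minimizes p^⊤ ℓ_t. The sketch below uses the abstain-loss matrix assumed earlier and recovers the regions described for Figure 2(c).

```python
import numpy as np

def optimal_predictions(p, L):
    """0-based indices t minimizing p^T ell_t; p lies in Q_t^ell for each such t."""
    risks = p @ L
    return np.flatnonzero(np.isclose(risks, risks.min()))

L_abstain = np.array([[0.0, 1.0, 1.0, 0.5],
                      [1.0, 0.0, 1.0, 0.5],
                      [1.0, 1.0, 0.0, 0.5]])

for p in [np.array([0.60, 0.20, 0.20]),    # p1 >= 1/2       -> predict label 1
          np.array([0.40, 0.30, 0.30]),    # max(p) <= 1/2   -> abstain (prediction 4)
          np.array([0.50, 0.25, 0.25])]:   # boundary point  -> both 1 and abstain optimal
    print(p, "->", optimal_predictions(p, L_abstain) + 1)
```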
43 3 Necessary and Sufficient Conditions for Classification Calibration We start by giving a necessary condition for classification calibration of a surrogate loss ψ with respect to any multiclass loss ℓ over Δ_n, which requires the positive normals of all points z ∈ S_ψ to be ‘well-behaved’ w. [sent-134, score-1.589]
44 ℓ and generalizes the ‘admissibility’ condition used for 0-1 loss in [7]. [sent-137, score-0.221]
45 Let ψ : [n] × T̂ → R_+ be classification calibrated with respect to ℓ : [n] × [k] → R_+ over Δ_n. [sent-140, score-0.369]
46 We note that, as in [7], it is possible to give a necessary and sufficient condition for classification calibration in terms of a similar property holding for positive normals associated with projections of S_ψ in lower dimensions. [sent-142, score-0.463]
47 Instead, below we give a different sufficient condition that will be helpful in showing classification calibration of certain surrogates. [sent-143, score-0.374]
48 In particular, we show that for a surrogate loss ψ to be classification calibrated with respect to ℓ over Δ_n, it is sufficient for the above property of positive normals to hold only at a finite number of points in R_ψ, as long as their positive normal sets jointly cover Δ_n: Theorem 7. [sent-144, score-1.131]
49 Then ψ is classification calibrated with respect to ℓ over Δ_n. [sent-150, score-0.369]
50 The conditions in the above results both involve the sets of positive normals N_{S_ψ}(z) at various points z ∈ S_ψ. [sent-152, score-0.207]
51 Thus in order to use the above results to show that a surrogate ψ is (or is not) classification calibrated with respect to a loss ℓ, one needs to be able to compute or characterize the sets N_{S_ψ}(z). [sent-153, score-0.977]
52 Here we give a method for computing these sets for certain surrogate losses ψ and points z ∈ S_ψ. [sent-154, score-0.639]
53 Recall that the subdifferential of a convex function φ : R^d → R̄_+ at a point u_0 ∈ R^d is defined as ∂φ(u_0) = {w ∈ R^d : φ(u) − φ(u_0) ≥ w^⊤(u − u_0) ∀ u ∈ R^d} and is a convex set in R^d (e. [sent-185, score-0.195]
54 We give an example illustrating the use of Theorem 7 and Lemma 8 to show classification calibration of a certain surrogate loss with respect to the ordinal regression loss ℓ_ord defined in Example 2: Example 5 (Classification calibrated surrogate for ordinal regression loss). [sent-188, score-2.12]
55 Consider the ordinal regression loss ℓ_ord defined in Example 2 for n = 3. [sent-189, score-0.336]
56 Observe that T̂ here is a convex set and ψ : T̂ → R_+^3 is a convex function. [sent-194, score-0.17]
57 (Figure 3: the surrogate ψ.) This gives N_{S_ψ}(z_1) = {p ∈ Δ_3 : p = (q_1 + q_2, q_3, q_4) for some q ∈ Δ_4, q_1 − q_2 − q_3 − q_4 = 0} = {p ∈ Δ_3 : p = (q_1 + q_2, q_3, q_4) for some q ∈ Δ_4, q_1 = 1/2} = {p ∈ Δ_3 : p_1 ≥ 1/2} = Q^ord_1. [sent-198, score-0.365]
58 Thus, by Theorem 7, we get that ψ is classification calibrated with respect to ℓ_ord over Δ_3. [sent-200, score-0.369]
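A numerical sanity check in the spirit of Example 5, with the one-dimensional convex surrogate ψ_y(t̂) = |t̂ − y| and pred(t̂) = nearest label; this particular surrogate is an assumption (the extract does not reproduce the surrogate of Figure 3), but it illustrates the kind of calibration statement being made. The minimizer of p^⊤ ψ(t̂) = Σ_y p_y |t̂ − y| is a weighted median of the labels, and the check confirms that rounding it always attains the optimal ℓ_ord-risk.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
labels = np.arange(1, n + 1)
L_ord = np.abs(labels[:, None] - labels[None, :]).astype(float)   # ordinal regression loss

def surrogate_minimizer(p):
    # argmin over t_hat of sum_y p_y * |t_hat - y| is a weighted median of the labels
    return labels[np.searchsorted(np.cumsum(p), 0.5)]

def pred(t_hat):
    return int(np.clip(round(t_hat), 1, n))                       # map surrogate output to a label

for _ in range(10000):
    p = rng.dirichlet(np.ones(n))
    risks = p @ L_ord
    assert np.isclose(risks[pred(surrogate_minimizer(p)) - 1], risks.min())
print("surrogate-optimal predictions always attain the optimal ell_ord-risk")
```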
59 4 Classification Calibration Dimension We now turn to the study of a fundamental quantity associated with the property of classification calibration with respect to a general multiclass loss ℓ. [sent-202, score-0.826]
60 Specifically, in the above example, we saw that to develop a classification calibrated surrogate loss w. [sent-203, score-0.905]
61 the ordinal regression loss for n = 3, it was sufficient to consider a surrogate target space T̂ = R, with dimension d = 1; in addition, this yielded a convex surrogate ψ : R → R_+^3 which can be used in developing computationally efficient algorithms. [sent-206, score-1.296]
62 In fact the same surrogate target space with d = 1 can be used to develop a similar convex, classification calibrated surrogate loss w. [sent-207, score-1.316]
63 However not all losses ℓ have such ‘low-dimensional’ surrogates. [sent-211, score-0.189]
64 This raises the natural question of what is the smallest dimension d that supports a convex classification calibrated surrogate for a given multiclass loss ℓ, and leads us to the following definition: Definition 9 (Classification calibration dimension). [sent-212, score-1.645]
65 Define the classification calibration dimension (CC dimension) of ℓ as CCdim(ℓ) = min{d ∈ N : ∃ a convex set T̂ ⊆ R^d and a convex surrogate ψ : T̂ → R_+^n that is classification calibrated w.r.t. ℓ over Δ_n}. [sent-214, score-1.26]
66 In the following, we will be interested in developing an understanding of the CC dimension for general losses ℓ, and in particular in deriving upper and lower bounds on this. [sent-219, score-0.316]
67 4.1 Upper Bounds on the Classification Calibration Dimension We start with a simple result that establishes that the CC dimension of any multiclass loss ℓ is finite, and in fact is strictly smaller than the number of class labels n. [sent-221, score-0.588]
68 Then ψ is classification calibrated with respect to ℓ over Δ_n. [sent-225, score-0.369]
69 It may appear surprising that the convex surrogate ψ in the above lemma is classification calibrated with respect to all multiclass losses ℓ on n classes. [sent-227, score-1.29]
70 Minimizing the above surrogate effectively corresponds to such class probability estimation. [sent-229, score-0.365]
71 Indeed, the above lemma can be shown to hold for any surrogate that is a strictly proper composite multiclass loss [21]. [sent-230, score-0.868]
72 Next we give a different upper bound on the CC dimension that depends on the loss ℓ, and for certain losses, can be significantly tighter than the general bound above. [sent-232, score-0.444]
73 Then CCdim(ℓ) ≤ rank(L), the rank of the loss matrix L. [sent-235, score-0.246]
74 We will construct a convex classification calibrated surrogate loss ψ for ℓ with surrogate target space T̂ ⊆ R^d. [sent-238, score-1.401]
75 , ℓ_k}); from the definitions of positive normals and trigger probabilities, it then follows that N_{S_ψ}(z_t) = N_{S_ψ}(ℓ_t) = Q^ℓ_t for all t ∈ [k]. [sent-266, score-0.213]
76 Thus by Theorem 7, ψ is classification calibrated w. [sent-267, score-0.319]
77 Consider the Hamming loss ℓ_Ham defined in Example 3, for n = 2^r. [sent-272, score-0.221]
78 Then the loss matrix L_Ham satisfies L_Ham = (r/2) e e^⊤ − (1/2) Σ_{i=1}^r σ_i σ_i^⊤, where e is the n × 1 all ones vector. [sent-274, score-0.221]
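The decomposition can be checked numerically, and it makes Theorem 11 concrete: rank(L_Ham) = r + 1, which for r = 3 is well below the bound of n − 1 = 7 implied by Lemma 10. The sign-vector convention (σ_i)_y = +1 if bit i of y − 1 is set and −1 otherwise is an assumption chosen to satisfy the stated identity.

```python
import numpy as np

r = 3
n = 2 ** r
vals = np.arange(n)                                   # y - 1 for y = 1..n

# Assumed convention: (sigma_i)_y = +1 if bit i of y-1 is set, else -1
sigma = np.array([2 * ((vals >> i) & 1) - 1 for i in range(r)], dtype=float)

L_ham = np.array([[bin(a ^ b).count("1") for b in vals] for a in vals], dtype=float)

e = np.ones(n)
L_rebuilt = (r / 2) * np.outer(e, e) - 0.5 * sum(np.outer(s, s) for s in sigma)
assert np.allclose(L_ham, L_rebuilt)

print("rank(L_Ham) =", np.linalg.matrix_rank(L_ham))  # r + 1 = 4, versus n - 1 = 7 from Lemma 10
```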
79 We note that the upper bound of Theorem 11 need not always be tight: for example, for the ordinal regression loss, for which we already know CCdim(ℓ_ord) = 1, the theorem actually gives an upper bound of n, which is even weaker than that implied by Lemma 10. [sent-277, score-0.26]
80 4.2 Lower Bound on the Classification Calibration Dimension In this section we give a lower bound on the CC dimension of a loss function ℓ and illustrate it by using it to calculate the CC dimension of the 0-1 loss. [sent-279, score-0.446]
81 In Section 5 we will explore consequences of the lower bound for classification calibrated surrogates for certain types of ranking losses. [sent-280, score-0.565]
82 The feasible subspace dimension of a convex set C at p ∈ C, denoted by µ_C(p), is defined as the dimension of the subspace F_C(p) ∩ (−F_C(p)), where F_C(p) is the cone of feasible directions of C at p. [sent-282, score-0.395]
83 The following gives a lower bound on the CC dimension of a loss ℓ in terms of the feasible subspace dimension of the trigger probability sets Q^ℓ_t at certain points p ∈ Q^ℓ_t: Theorem 13. [sent-283, score-0.667]
84 The proof requires extensions of the definition of positive normals and the necessary condition of Theorem 6 to sequences of points in S_ψ and is quite technical. [sent-288, score-0.177]
85 In the appendix, we provide a proof in the special case when p ∈ relint(Δ_n) is such that inf_{z∈S_ψ} p^⊤ z is achieved in S_ψ, which does not require these extensions. [sent-289, score-0.09]
86 Both the proof of the lower bound and its applications make use of the following lemma, which gives a method to calculate the feasible subspace dimension for certain convex sets C and points p ∈ C: Lemma 14. [sent-291, score-0.352]
87 Let p ∈ C be such that A^1 p = b^1 and A^2 p < b^2. [sent-293, score-0.139]
88 Then µ_C(p) = nullity([A^1; A^3]), the dimension of the null space of the matrix obtained by stacking A^1 and A^3. The above lower bound allows us to calculate precisely the CC dimension of the 0-1 loss: Example 7 (CC dimension of 0-1 loss). [sent-295, score-0.331]
89 Q^{0-1}_1 = {q ∈ Δ_n : q_1 ≥ q_t ∀ t ∈ {2, . . . , n}} = {q ∈ R^n : [−e_{n−1} I_{n−1}] q ≤ 0, −q ≤ 0, e_n^⊤ q = 1}, where e_{n−1}, e_n denote the (n − 1) × 1 and n × 1 all ones vectors, respectively, and I_{n−1} denotes the (n − 1) × (n − 1) identity matrix. [sent-305, score-0.096]
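A sketch of the Lemma 14 computation behind Example 7: at the uniform distribution p = (1/n, ..., 1/n) all n − 1 inequality constraints −q_1 + q_t ≤ 0 are active, so (under the reconstruction of Lemma 14 above) the feasible subspace dimension is the nullity of A^1 = [−e_{n−1} I_{n−1}] stacked with A^3 = e_n^⊤.

```python
import numpy as np

n = 5
# Active inequality constraints of Q_1^{0-1} at the uniform p: -q_1 + q_t <= 0 for t = 2..n
A1 = np.hstack([-np.ones((n - 1, 1)), np.eye(n - 1)])
# Simplex equality constraint: e_n^T q = 1
A3 = np.ones((1, n))

stacked = np.vstack([A1, A3])
mu = stacked.shape[1] - np.linalg.matrix_rank(stacked)
print("feasible subspace dimension mu =", mu)   # 0 for the 0-1 loss at the uniform distribution
```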
90 Duchi et al. [10] showed recently that for certain pairwise subset ranking losses ℓ, finding a predictor that minimizes the ℓ-risk is an NP-hard problem. [sent-329, score-0.389]
91 Here we provide an alternative route to analyzing the difficulty of obtaining consistent surrogates for such pairwise subset ranking problems using the classification calibration dimension. [sent-333, score-0.624]
92 Specifically, we will show that even for a simple setting of such problems, the classification calibration dimension of the underlying loss ℓ is greater than r, and therefore no convex surrogate operating on T̂ ⊆ R^r can be classification calibrated w. [sent-334, score-1.462]
93 Consider now the following simple pairwise loss ℓ_pair : Y × T → R_+: ℓ_pair(g_(i,j), σ) = 1(σ(i) > σ(j)). [sent-342, score-0.26]
94 Therefore, by Lemma 14, we get µ_{Q^pair_{σ_1}}(p) = nullity of the matrix with rows (ℓ^pair_{σ_1} − ℓ^pair_{σ_t})^⊤, . . . [sent-363, score-0.106]
95 It is not hard to see that the set {ℓ^pair_σ : σ ∈ T} spans an (r(r−1)/2 + 1)-dimensional space, and hence the nullity of the above matrix is at most r(r−1) − r(r−1)/2 − 1. [sent-368, score-0.106]
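The span and nullity claims can be verified numerically for small r. The sketch below enumerates ℓ^pair_σ over all permutations σ of r documents (components indexed by ordered pairs g_(i,j), i ≠ j) and forms the Lemma 14-style matrix of rows (ℓ^pair_{σ_1} − ℓ^pair_{σ_t})^⊤ together with the all-ones row; taking σ_1 to be the identity and treating every constraint as active (as happens at the uniform p, where every σ incurs expected loss 1/2) are choices made for illustration.

```python
import numpy as np
from itertools import permutations

r = 4
pairs = [(i, j) for i in range(r) for j in range(r) if i != j]     # n = r(r-1) ordered pairs
perms = list(permutations(range(r)))                               # candidate rankings sigma

def ell_pair(sigma):
    # component for label g_(i,j) is 1(sigma(i) > sigma(j)), as in the definition above
    return np.array([1.0 if sigma[i] > sigma[j] else 0.0 for (i, j) in pairs])

vectors = np.array([ell_pair(s) for s in perms])
print("dim span =", np.linalg.matrix_rank(vectors),
      "  r(r-1)/2 + 1 =", r * (r - 1) // 2 + 1)

sigma1 = perms[0]                                                  # identity ranking (assumed choice)
rows = [ell_pair(sigma1) - ell_pair(s) for s in perms[1:]]
stacked = np.vstack(rows + [np.ones(len(pairs))])
nullity = stacked.shape[1] - np.linalg.matrix_rank(stacked)
print("nullity =", nullity,
      "  bound r(r-1) - r(r-1)/2 - 1 =", r * (r - 1) - r * (r - 1) // 2 - 1)
```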
96 An interesting direction would be to develop a generic procedure for designing consistent convex surrogates operating on a ‘minimal’ surrogate target space according to the classification calibration dimension of the loss matrix. [sent-372, score-1.296]
97 On the Bayes-risk consistency of regularized boosting methods. [sent-378, score-0.108]
98 Statistical behavior and consistency of classification methods based on convex risk minimization. [sent-384, score-0.193]
99 Learning scoring functions with order-preserving losses and standardized supervision. [sent-412, score-0.189]
100 How to compare different loss functions and their risks. [sent-421, score-0.221]
wordName wordTfidf (topN-words)
[('surrogate', 0.365), ('calibrated', 0.319), ('calibration', 0.307), ('ccdim', 0.254), ('multiclass', 0.224), ('loss', 0.221), ('pred', 0.212), ('losses', 0.189), ('ns', 0.184), ('cc', 0.137), ('normals', 0.133), ('ord', 0.131), ('classi', 0.131), ('qord', 0.127), ('relint', 0.127), ('consistency', 0.108), ('lham', 0.106), ('nullity', 0.106), ('conv', 0.103), ('rn', 0.1), ('dimension', 0.099), ('ranking', 0.096), ('inf', 0.09), ('ordinal', 0.088), ('convex', 0.085), ('argmint', 0.085), ('qpair', 0.085), ('ex', 0.083), ('cation', 0.083), ('surrogates', 0.081), ('trigger', 0.08), ('abstain', 0.069), ('en', 0.069), ('rd', 0.068), ('operating', 0.066), ('ham', 0.064), ('wsy', 0.064), ('lemma', 0.058), ('pair', 0.057), ('er', 0.057), ('py', 0.052), ('respect', 0.05), ('simplex', 0.048), ('rr', 0.046), ('target', 0.046), ('fc', 0.044), ('fm', 0.043), ('byj', 0.042), ('harish', 0.042), ('listwise', 0.042), ('duchi', 0.042), ('certain', 0.042), ('incurred', 0.041), ('analyzing', 0.04), ('null', 0.04), ('pairwise', 0.039), ('route', 0.038), ('ingo', 0.037), ('lord', 0.037), ('tong', 0.036), ('theorem', 0.035), ('nicolas', 0.035), ('notion', 0.032), ('conditions', 0.031), ('ambuj', 0.031), ('label', 0.031), ('hamming', 0.03), ('sy', 0.03), ('india', 0.03), ('tewari', 0.028), ('upper', 0.028), ('feasible', 0.028), ('subspace', 0.028), ('nition', 0.028), ('regression', 0.027), ('bound', 0.027), ('denotes', 0.027), ('predictions', 0.027), ('preference', 0.026), ('ut', 0.026), ('designing', 0.026), ('rank', 0.025), ('helpful', 0.025), ('subdifferential', 0.025), ('raises', 0.025), ('giving', 0.024), ('quantity', 0.024), ('ym', 0.023), ('documents', 0.023), ('subset', 0.023), ('yt', 0.023), ('necessary', 0.023), ('tj', 0.022), ('labels', 0.022), ('establishes', 0.022), ('nitions', 0.022), ('sets', 0.022), ('points', 0.021), ('iff', 0.02), ('zj', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses
Author: Harish G. Ramaswamy, Shivani Agarwal
Abstract: We study consistency properties of surrogate loss functions for general multiclass classification problems, defined by a general loss matrix. We extend the notion of classification calibration, which has been studied for binary and multiclass 0-1 classification problems (and for certain other specific learning problems), to the general multiclass setting, and derive necessary and sufficient conditions for a surrogate loss to be classification calibrated with respect to a loss matrix in this setting. We then introduce the notion of classification calibration dimension of a multiclass loss matrix, which measures the smallest ‘size’ of a prediction space for which it is possible to design a convex surrogate that is classification calibrated with respect to the loss matrix. We derive both upper and lower bounds on this quantity, and use these results to analyze various loss matrices. In particular, as one application, we provide a different route from the recent result of Duchi et al. (2010) for analyzing the difficulty of designing ‘low-dimensional’ convex surrogates that are consistent with respect to pairwise subset ranking losses. We anticipate the classification calibration dimension may prove to be a useful tool in the study and design of surrogate losses for general multiclass learning problems. 1
2 0.20642208 227 nips-2012-Multiclass Learning with Simplex Coding
Author: Youssef Mroueh, Tomaso Poggio, Lorenzo Rosasco, Jean-jeacques Slotine
Abstract: In this paper we discuss a novel framework for multiclass learning, defined by a suitable coding/decoding strategy, namely the simplex coding, that allows us to generalize to multiple classes a relaxation approach commonly used in binary classification. In this framework, we develop a relaxation error analysis that avoids constraints on the considered hypotheses class. Moreover, using this setting we derive the first provably consistent regularized method with training/tuning complexity that is independent to the number of classes. We introduce tools from convex analysis that can be used beyond the scope of this paper. 1
3 0.17827564 200 nips-2012-Local Supervised Learning through Space Partitioning
Author: Joseph Wang, Venkatesh Saligrama
Abstract: We develop a novel approach for supervised learning based on adaptively partitioning the feature space into different regions and learning local region-specific classifiers. We formulate an empirical risk minimization problem that incorporates both partitioning and classification in to a single global objective. We show that space partitioning can be equivalently reformulated as a supervised learning problem and consequently any discriminative learning method can be utilized in conjunction with our approach. Nevertheless, we consider locally linear schemes by learning linear partitions and linear region classifiers. Locally linear schemes can not only approximate complex decision boundaries and ensure low training error but also provide tight control on over-fitting and generalization error. We train locally linear classifiers by using LDA, logistic regression and perceptrons, and so our scheme is scalable to large data sizes and high-dimensions. We present experimental results demonstrating improved performance over state of the art classification techniques on benchmark datasets. We also show improved robustness to label noise.
4 0.17022654 323 nips-2012-Statistical Consistency of Ranking Methods in A Rank-Differentiable Probability Space
Author: Yanyan Lan, Jiafeng Guo, Xueqi Cheng, Tie-yan Liu
Abstract: This paper is concerned with the statistical consistency of ranking methods. Recently, it was proven that many commonly used pairwise ranking methods are inconsistent with the weighted pairwise disagreement loss (WPDL), which can be viewed as the true loss of ranking, even in a low-noise setting. This result is interesting but also surprising, given that the pairwise ranking methods have been shown very effective in practice. In this paper, we argue that the aforementioned result might not be conclusive, depending on what kind of assumptions are used. We give a new assumption that the labels of objects to rank lie in a rank-differentiable probability space (RDPS), and prove that the pairwise ranking methods become consistent with WPDL under this assumption. What is especially inspiring is that RDPS is actually not stronger than but similar to the low-noise setting. Our studies provide theoretical justifications of some empirical findings on pairwise ranking methods that are unexplained before, which bridge the gap between theory and applications.
5 0.1360103 280 nips-2012-Proper losses for learning from partial labels
Author: Jesús Cid-sueiro
Abstract: This paper discusses the problem of calibrating posterior class probabilities from partially labelled data. Each instance is assumed to be labelled as belonging to one of several candidate categories, at most one of them being true. We generalize the concept of proper loss to this scenario, we establish a necessary and sufficient condition for a loss function to be proper, and we show a direct procedure to construct a proper loss for partial labels from a conventional proper loss. The problem can be characterized by the mixing probability matrix relating the true class of the data and the observed labels. The full knowledge of this matrix is not required, and losses can be constructed that are proper for a wide set of mixing probability matrices. 1
6 0.12820809 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models
7 0.12820198 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs
8 0.12236422 16 nips-2012-A Polynomial-time Form of Robust Regression
9 0.10445837 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback
10 0.097113319 64 nips-2012-Calibrated Elastic Regularization in Matrix Completion
11 0.094690017 330 nips-2012-Supervised Learning with Similarity Functions
12 0.087379634 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications
13 0.083511762 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking
14 0.081976786 148 nips-2012-Hamming Distance Metric Learning
15 0.078021981 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification
16 0.07516674 160 nips-2012-Imitation Learning by Coaching
17 0.071070731 199 nips-2012-Link Prediction in Graphs with Autoregressive Features
18 0.070606008 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)
19 0.070453994 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme
20 0.068736874 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach
topicId topicWeight
[(0, 0.179), (1, 0.016), (2, 0.083), (3, -0.042), (4, 0.131), (5, -0.035), (6, -0.007), (7, 0.218), (8, -0.066), (9, 0.057), (10, 0.047), (11, 0.1), (12, 0.02), (13, 0.046), (14, 0.057), (15, -0.04), (16, 0.002), (17, -0.019), (18, -0.014), (19, 0.09), (20, -0.072), (21, 0.057), (22, -0.124), (23, 0.197), (24, -0.042), (25, 0.003), (26, -0.069), (27, 0.027), (28, -0.095), (29, -0.095), (30, 0.099), (31, -0.007), (32, 0.082), (33, 0.012), (34, 0.129), (35, 0.056), (36, 0.028), (37, -0.04), (38, -0.046), (39, -0.061), (40, 0.01), (41, -0.043), (42, -0.009), (43, -0.044), (44, -0.027), (45, -0.043), (46, -0.005), (47, 0.018), (48, 0.019), (49, 0.006)]
simIndex simValue paperId paperTitle
same-paper 1 0.96447963 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses
Author: Harish G. Ramaswamy, Shivani Agarwal
Abstract: We study consistency properties of surrogate loss functions for general multiclass classification problems, defined by a general loss matrix. We extend the notion of classification calibration, which has been studied for binary and multiclass 0-1 classification problems (and for certain other specific learning problems), to the general multiclass setting, and derive necessary and sufficient conditions for a surrogate loss to be classification calibrated with respect to a loss matrix in this setting. We then introduce the notion of classification calibration dimension of a multiclass loss matrix, which measures the smallest ‘size’ of a prediction space for which it is possible to design a convex surrogate that is classification calibrated with respect to the loss matrix. We derive both upper and lower bounds on this quantity, and use these results to analyze various loss matrices. In particular, as one application, we provide a different route from the recent result of Duchi et al. (2010) for analyzing the difficulty of designing ‘low-dimensional’ convex surrogates that are consistent with respect to pairwise subset ranking losses. We anticipate the classification calibration dimension may prove to be a useful tool in the study and design of surrogate losses for general multiclass learning problems. 1
2 0.80683076 227 nips-2012-Multiclass Learning with Simplex Coding
Author: Youssef Mroueh, Tomaso Poggio, Lorenzo Rosasco, Jean-jeacques Slotine
Abstract: In this paper we discuss a novel framework for multiclass learning, defined by a suitable coding/decoding strategy, namely the simplex coding, that allows us to generalize to multiple classes a relaxation approach commonly used in binary classification. In this framework, we develop a relaxation error analysis that avoids constraints on the considered hypotheses class. Moreover, using this setting we derive the first provably consistent regularized method with training/tuning complexity that is independent to the number of classes. We introduce tools from convex analysis that can be used beyond the scope of this paper. 1
3 0.78801107 280 nips-2012-Proper losses for learning from partial labels
Author: Jesús Cid-sueiro
Abstract: This paper discusses the problem of calibrating posterior class probabilities from partially labelled data. Each instance is assumed to be labelled as belonging to one of several candidate categories, at most one of them being true. We generalize the concept of proper loss to this scenario, we establish a necessary and sufficient condition for a loss function to be proper, and we show a direct procedure to construct a proper loss for partial labels from a conventional proper loss. The problem can be characterized by the mixing probability matrix relating the true class of the data and the observed labels. The full knowledge of this matrix is not required, and losses can be constructed that are proper for a wide set of mixing probability matrices. 1
4 0.71351385 323 nips-2012-Statistical Consistency of Ranking Methods in A Rank-Differentiable Probability Space
Author: Yanyan Lan, Jiafeng Guo, Xueqi Cheng, Tie-yan Liu
Abstract: This paper is concerned with the statistical consistency of ranking methods. Recently, it was proven that many commonly used pairwise ranking methods are inconsistent with the weighted pairwise disagreement loss (WPDL), which can be viewed as the true loss of ranking, even in a low-noise setting. This result is interesting but also surprising, given that the pairwise ranking methods have been shown very effective in practice. In this paper, we argue that the aforementioned result might not be conclusive, depending on what kind of assumptions are used. We give a new assumption that the labels of objects to rank lie in a rank-differentiable probability space (RDPS), and prove that the pairwise ranking methods become consistent with WPDL under this assumption. What is especially inspiring is that RDPS is actually not stronger than but similar to the low-noise setting. Our studies provide theoretical justifications of some empirical findings on pairwise ranking methods that are unexplained before, which bridge the gap between theory and applications.
5 0.71231169 217 nips-2012-Mixability in Statistical Learning
Author: Tim V. Erven, Peter Grünwald, Mark D. Reid, Robert C. Williamson
Abstract: Statistical learning and sequential prediction are two different but related formalisms to study the quality of predictions. Mapping out their relations and transferring ideas is an active area of investigation. We provide another piece of the puzzle by showing that an important concept in sequential prediction, the mixability of a loss, has a natural counterpart in the statistical setting, which we call stochastic mixability. Just as ordinary mixability characterizes fast rates for the worst-case regret in sequential prediction, stochastic mixability characterizes fast rates in statistical learning. We show that, in the special case of log-loss, stochastic mixability reduces to a well-known (but usually unnamed) martingale condition, which is used in existing convergence theorems for minimum description length and Bayesian inference. In the case of 0/1-loss, it reduces to the margin condition of Mammen and Tsybakov, and in the case that the model under consideration contains all possible predictors, it is equivalent to ordinary mixability. 1
6 0.70733392 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs
7 0.63613361 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification
8 0.63226992 16 nips-2012-A Polynomial-time Form of Robust Regression
9 0.62686336 361 nips-2012-Volume Regularization for Binary Classification
10 0.60141855 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications
11 0.59597892 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound
12 0.58671916 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models
13 0.57926726 130 nips-2012-Feature-aware Label Space Dimension Reduction for Multi-label Classification
14 0.5677864 271 nips-2012-Pointwise Tracking the Optimal Regression Function
15 0.56617171 169 nips-2012-Label Ranking with Partial Abstention based on Thresholded Probabilistic Models
16 0.56018192 200 nips-2012-Local Supervised Learning through Space Partitioning
17 0.51063395 148 nips-2012-Hamming Distance Metric Learning
18 0.50303 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking
19 0.49958915 330 nips-2012-Supervised Learning with Similarity Functions
20 0.49258277 141 nips-2012-GenDeR: A Generic Diversified Ranking Algorithm
topicId topicWeight
[(0, 0.016), (38, 0.109), (42, 0.028), (54, 0.011), (74, 0.033), (76, 0.124), (80, 0.539), (92, 0.019)]
simIndex simValue paperId paperTitle
1 0.97182614 314 nips-2012-Slice Normalized Dynamic Markov Logic Networks
Author: Tivadar Papai, Henry Kautz, Daniel Stefankovic
Abstract: Markov logic is a widely used tool in statistical relational learning, which uses a weighted first-order logic knowledge base to specify a Markov random field (MRF) or a conditional random field (CRF). In many applications, a Markov logic network (MLN) is trained in one domain, but used in a different one. This paper focuses on dynamic Markov logic networks, where the size of the discretized time-domain typically varies between training and testing. It has been previously pointed out that the marginal probabilities of truth assignments to ground atoms can change if one extends or reduces the domains of predicates in an MLN. We show that in addition to this problem, the standard way of unrolling a Markov logic theory into a MRF may result in time-inhomogeneity of the underlying Markov chain. Furthermore, even if these representational problems are not significant for a given domain, we show that the more practical problem of generating samples in a sequential conditional random field for the next slice relying on the samples from the previous slice has high computational cost in the general case, due to the need to estimate a normalization factor for each sample. We propose a new discriminative model, slice normalized dynamic Markov logic networks (SN-DMLN), that suffers from none of these issues. It supports efficient online inference, and can directly model influences between variables within a time slice that do not have a causal direction, in contrast with fully directed models (e.g., DBNs). Experimental results show an improvement in accuracy over previous approaches to online inference in dynamic Markov logic networks. 1
2 0.9709425 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking
Author: James Scott, Jonathan W. Pillow
Abstract: Characterizing the information carried by neural populations in the brain requires accurate statistical models of neural spike responses. The negative-binomial distribution provides a convenient model for over-dispersed spike counts, that is, responses with greater-than-Poisson variability. Here we describe a powerful data-augmentation framework for fully Bayesian inference in neural models with negative-binomial spiking. Our approach relies on a recently described latentvariable representation of the negative-binomial distribution, which equates it to a Polya-gamma mixture of normals. This framework provides a tractable, conditionally Gaussian representation of the posterior that can be used to design efficient EM and Gibbs sampling based algorithms for inference in regression and dynamic factor models. We apply the model to neural data from primate retina and show that it substantially outperforms Poisson regression on held-out data, and reveals latent structure underlying spike count correlations in simultaneously recorded spike trains. 1
3 0.97057217 204 nips-2012-MAP Inference in Chains using Column Generation
Author: David Belanger, Alexandre Passos, Sebastian Riedel, Andrew McCallum
Abstract: Linear chains and trees are basic building blocks in many applications of graphical models, and they admit simple exact maximum a-posteriori (MAP) inference algorithms based on message passing. However, in many cases this computation is prohibitively expensive, due to quadratic dependence on variables’ domain sizes. The standard algorithms are inefficient because they compute scores for hypotheses for which there is strong negative local evidence. For this reason there has been significant previous interest in beam search and its variants; however, these methods provide only approximate results. This paper presents new exact inference algorithms based on the combination of column generation and pre-computed bounds on terms of the model’s scoring function. While we do not improve worst-case performance, our method substantially speeds real-world, typical-case inference in chains and trees. Experiments show our method to be twice as fast as exact Viterbi for Wall Street Journal part-of-speech tagging and over thirteen times faster for a joint part-of-speed and named-entity-recognition task. Our algorithm is also extendable to new techniques for approximate inference, to faster 0/1 loss oracles, and new opportunities for connections between inference and learning. We encourage further exploration of high-level reasoning about the optimization problem implicit in dynamic programs. 1
4 0.96984547 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing
Author: Ashish Kapoor, Raajay Viswanathan, Prateek Jain
Abstract: In this paper, we present a Bayesian framework for multilabel classiďŹ cation using compressed sensing. The key idea in compressed sensing for multilabel classiďŹ cation is to ďŹ rst project the label vector to a lower dimensional space using a random transformation and then learn regression functions over these projections. Our approach considers both of these components in a single probabilistic model, thereby jointly optimizing over compression as well as learning tasks. We then derive an efďŹ cient variational inference scheme that provides joint posterior distribution over all the unobserved labels. The two key beneďŹ ts of the model are that a) it can naturally handle datasets that have missing labels and b) it can also measure uncertainty in prediction. The uncertainty estimate provided by the model allows for active learning paradigms where an oracle provides information about labels that promise to be maximally informative for the prediction task. Our experiments show signiďŹ cant boost over prior methods in terms of prediction performance over benchmark datasets, both in the fully labeled and the missing labels case. Finally, we also highlight various useful active learning scenarios that are enabled by the probabilistic model. 1
5 0.94383341 100 nips-2012-Discriminative Learning of Sum-Product Networks
Author: Robert Gens, Pedro Domingos
Abstract: Sum-product networks are a new deep architecture that can perform fast, exact inference on high-treewidth models. Only generative methods for training SPNs have been proposed to date. In this paper, we present the first discriminative training algorithms for SPNs, combining the high accuracy of the former with the representational power and tractability of the latter. We show that the class of tractable discriminative SPNs is broader than the class of tractable generative ones, and propose an efficient backpropagation-style algorithm for computing the gradient of the conditional log likelihood. Standard gradient descent suffers from the diffusion problem, but networks with many layers can be learned reliably using “hard” gradient descent, where marginal inference is replaced by MPE inference (i.e., inferring the most probable state of the non-evidence variables). The resulting updates have a simple and intuitive form. We test discriminative SPNs on standard image classification tasks. We obtain the best results to date on the CIFAR-10 dataset, using fewer features than prior methods with an SPN architecture that learns local image structure discriminatively. We also report the highest published test accuracy on STL-10 even though we only use the labeled portion of the dataset. 1
same-paper 6 0.93800271 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses
7 0.8380723 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems
8 0.8176353 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions
9 0.78806502 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders
10 0.77986103 251 nips-2012-On Lifting the Gibbs Sampling Algorithm
11 0.77193463 171 nips-2012-Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning
12 0.77105021 200 nips-2012-Local Supervised Learning through Space Partitioning
13 0.77098167 293 nips-2012-Relax and Randomize : From Value to Algorithms
14 0.7486257 207 nips-2012-Mandatory Leaf Node Prediction in Hierarchical Multilabel Classification
15 0.73899209 197 nips-2012-Learning with Recursive Perceptual Representations
16 0.73897338 279 nips-2012-Projection Retrieval for Classification
17 0.73529232 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization
18 0.72897279 315 nips-2012-Slice sampling normalized kernel-weighted completely random measure mixture models
19 0.7204752 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback
20 0.7158637 330 nips-2012-Supervised Learning with Similarity Functions